Blog coding and discussion of coding about JavaScript, PHP, CGI, general web building etc.

Monday, January 25, 2016

Regex to replace gibberish

Regex to replace gibberish


I have to clean some input from OCR which recognizes handwriting as gibberish. Any suggestions for a regex to clean out the random characters? Example:


  Federal prosecutors on Monday charged a Miami man with the largest   case of credit and debit card data theft ever in the United States,   accusing the one-time government informant of swiping 130 million   accounts on top of 40 million he stole previously.    , ':, Ie  ':... 11'1  . '(.. ~!' ': f I I  . " .' I ~  I' ,11 l  I I I ~ \ :' ,! .~ , .. r, 1 , ~ I . I' , .' I ,.  , i  I ; J . I.' ,.\ ) ..  . : I  'I', I  .' '  r,"    Gonzalez is a former informant for the U.S. Secret Service who helped   the agency hunt hackers, authorities say. The agency later found out that   he had also been working with criminals and feeding them information   on ongoing investigations, even warning off at least one individual,   according to authorities.    eh....l  ~.\O ::t  e;~~~  s: ~ ~. 0  qs c::; ~ g  o t/J (Ii .,  ::3 (1l Il:l  ~ cil~ 0 2:  t:lHj~(1l  . ~ ~a  0~ ~ S'  N ("b t/J :s  Ot/JIl:l"-<:!  v'g::!t:O  -....c......  VI (:ll <' 0  := - ~  < (1l ::3  (1l ~ '  t/J VJ ~  Pl  .....  ....  (II  

Answer by Russell for Regex to replace gibberish


Well a group of symbols would match a bit of gibberish. Perhaps checking against a dictionary for words?

There seems to be a lot of line breaks where gibberish is, so that may be an indicator too.

Answer by Devin Ceartas for Regex to replace gibberish


Interesting problem.

If this is representative, I suppose you could build a library of common words and delete any line which didn't match any of them.

Or perhaps you could match character and punctuation characters and see if there is a reliable ratio cut-off, or simply a frequency of occurrence of some characters which flags it as gibberish.

Regardless, I think there will have to be some programming logic, not simply a single regular expression.

Answer by Rohin for Regex to replace gibberish


I guess that a regex would not help here. Regex would basically match a deterministic input i.e. a regex will have a predefined set of patterns that it will match. And gibberish would in most cases be random. One way would be to invert the problem i.e. match the relevant text instead of matching the gibberish.

Answer by maykeye for Regex to replace gibberish


One of the simpleset solutions(not involving regexpes):

#pseudopython

number_of_punct = sum([1 if c.ispunct() else 0 for c in line])

if number_of_punct >len(line)/2: line_is_garbage()

well. Or rude regexpish s/[!,'"@#~$%^& ]{5,}//g

Answer by DmitryK for Regex to replace gibberish


Regex won't help here. I'd say if you have control over the recognition part then focus on better quality there: http://www.neurogy.com/ocrpreproc.html

You can also ask user to help you and specify the type of text you work with. e.g. if it is a page from a book then you would expect the majority of lines to be the same length and mainly consisting of letters, spaces and punctuation.

Answer by Noon Silk for Regex to replace gibberish


I'd claim a regex like "any punctuation followed by anything except a space is spam'.

So in .NET it's possibly something like

.Replace("\\p{1,}[a-zA-Z0-9]{1,}", "");  

Then you'd consider "any word with two or more punctuations consecutively:

.Replace(" \\p{2,} ", "");  

Seems like a good start anyway.

Answer by Nick Dandoulakis for Regex to replace gibberish


A simple heuristic, similar to anonymous answer:

listA = [0,1,2..9, a,b,c..z, A,B,C,..Z , ...] // alphanumerical symbols  listB = [!@$%^&...] // other symbols    Na = number_of_alphanumeric_symbols( line )  Nb = number_of_other_symbols( line )    if Na/Nb <= garbage_ratio then    // garbage  

Answer by Breton for Regex to replace gibberish


No idea how well it would work, but I have considered this problem in the past, idly. I've on occasions played with a little programmatic device called a markov chain

Now the wikipedia article probably won't make much sense until you see some of the other things a markov chain is good for. One example of a markov chain in action is this Greeking generator. Another example is the MegaHAL chatbot.

Greeking is gibberish that looks like words. Markov chains provide a way of randomly generating a sequence of letters, but weighting the random choices to emulate the frequency patterns of an examined corpus. So for instance, Given the letter "T", the letter h is more likely to show up next than any other letter. So you examine a corpus (say some newspapers, or blog postings) to produce a kind of fingerprint of the language you're targeting.

Now that you have that frequency table/fingerprint, you can examine your sample text, and rate each letter according to the likelyhood of it appearing. Then, you can flag the letters under a particular threshold likelyhood for removal. In other words, a surprise filter. Filter out surprises.

There's some leeway for how you generate your freqency tables. You're not limited to one letter following another. You can build a frequency table that predicts which letter will likely follow each digraph (group of two letters), or each trigraph, or quadgraph. You can work the other side, predicting likely and unlikely trigraphs to appear in certain positions, given some previous text.

It's kind of like a fuzzy regex. Rather than MATCH or NO MATCH, the whole text is scored on a sliding scale according to how similar it is to your reference text.

Answer by JoshB for Regex to replace gibberish


I did a combo of eliminating lines that don't contain at least two 3 letter words, or one 6 letter word.

([a-z|A-Z]{3,}\s){2,}|([a-z|A-Z]{6,})

http://www.regexpal.com/

Answer by Jim P for Regex to replace gibberish


I like @Breton's answer - I'd suggest using his Corpus approach also with a library of known 'bad scans', which might be easier to identify because 'junk' has more internal consistency than 'good text' if it comes from bad OCR scans (the number of distinct glyphs is lower for example).

Answer by Jay for Regex to replace gibberish


Another good technique is to use a spell checker/dictionary and look up the 'words' after you've eliminated the non readable stuff with regex.

Answer by Sinan nr for Regex to replace gibberish


Here is a Perl implementation of the garbage_ratio heuristic:

#!/usr/bin/perl    use strict;  use warnings;    while ( defined( my $chunk = read_chunk(\*DATA) ) ) {      next unless length $chunk;        my @tokens = split ' ', $chunk;      # what is a word?      my @words  = grep {          /^[A-Za-z]{2,}[.,]?$/              or /^[0-9]+$/              or /^a|I$/              or /^(?:[A-Z][.])+$/      } @tokens;        # completely arbitrary threshold      my $score = @words / @tokens;      print $chunk, "\n" if $score > 0.5;  }    sub read_chunk {      my ($fh) = @_;      my ($chunk, $line);        while ( my $line = <$fh> ) {          if( $line =~ /\S/ ) {              $chunk .= $line;              last;          }      }        while (1) {          $line = <$fh>;          last unless (defined $line) and ($line =~ /\S/);          $chunk .= $line;      }        return $chunk;  }      __DATA__  

Paste the text above after __DATA__ above (not repeating the text here to save space). Of course, the use of the __DATA__ section is for the purpose of posting a self-contained script. In real life, you would have code to open the file etc.

Output:

  Federal prosecutors on Monday charged a Miami man with the largest  case of credit and debit card data theft ever in the United States,  accusing the one-time government informant of swiping 130 million  accounts on top of 40 million he stole previously.    Gonzalez is a former informant for the U.S. Secret Service who helped  the agency hunt hackers, authorities say. The agency later found out that  he had also been working with criminals and feeding them information  on ongoing investigations, even warning off at least one individual,  according to authorities.


Fatal error: Call to a member function getElementsByTagName() on a non-object in D:\XAMPP INSTALLASTION\xampp\htdocs\endunpratama9i\www-stackoverflow-info-proses.php on line 72

0 comments:

Post a Comment

Popular Posts

Powered by Blogger.