Regex to replace gibberish
I have to clean some input from OCR which recognizes handwriting as gibberish. Any suggestions for a regex to clean out the random characters? Example:
Federal prosecutors on Monday charged a Miami man with the largest case of credit and debit card data theft ever in the United States, accusing the one-time government informant of swiping 130 million accounts on top of 40 million he stole previously. , ':, Ie ':... 11'1 . '(.. ~!' ': f I I . " .' I ~ I' ,11 l I I I ~ \ :' ,! .~ , .. r, 1 , ~ I . I' , .' I ,. , i I ; J . I.' ,.\ ) .. . : I 'I', I .' ' r," Gonzalez is a former informant for the U.S. Secret Service who helped the agency hunt hackers, authorities say. The agency later found out that he had also been working with criminals and feeding them information on ongoing investigations, even warning off at least one individual, according to authorities. eh....l ~.\O ::t e;~~~ s: ~ ~. 0 qs c::; ~ g o t/J (Ii ., ::3 (1l Il:l ~ cil~ 0 2: t:lHj~(1l . ~ ~a 0~ ~ S' N ("b t/J :s Ot/JIl:l"-<:! v'g::!t:O -....c...... VI (:ll <' 0 := - ~ < (1l ::3 (1l ~ ' t/J VJ ~ Pl ..... .... (II
Answer by Russell for Regex to replace gibberish
Well, a pattern that matches runs of symbols would catch some of the gibberish. Perhaps check the words against a dictionary?
There seem to be a lot of line breaks where the gibberish is, so that may be an indicator too.
Answer by Devin Ceartas for Regex to replace gibberish
Interesting problem.
If this is representative, I suppose you could build a library of common words and delete any line which didn't match any of them.
Or perhaps you could count letters and punctuation characters and see if there is a reliable ratio cut-off, or simply a frequency of occurrence of some characters, that flags a line as gibberish.
Regardless, I think there will have to be some programming logic, not simply a single regular expression.
Answer by Rohin for Regex to replace gibberish
I guess that a regex would not help here. A regex matches deterministic input, i.e. it has a predefined set of patterns that it will match, and gibberish is in most cases random. One way would be to invert the problem, i.e. match the relevant text instead of matching the gibberish.
Answer by maykeye for Regex to replace gibberish
One of the simplest solutions (not involving regexes):
# Python sketch; line_is_garbage() is a placeholder for whatever you do with a bad line
import string
number_of_punct = sum(1 for c in line if c in string.punctuation)
if number_of_punct > len(line) / 2: line_is_garbage()
Or, as a crude regex: s/[!,'"@#~$%^& ]{5,}//g
Answer by DmitryK for Regex to replace gibberish
Regex won't help here. I'd say if you have control over the recognition part then focus on better quality there: http://www.neurogy.com/ocrpreproc.html
You can also ask the user to help by specifying the type of text you work with, e.g. if it is a page from a book then you would expect the majority of lines to be the same length and to consist mainly of letters, spaces and punctuation.
Answer by Noon Silk for Regex to replace gibberish
I'd claim a regex like "any punctuation followed by anything except a space is spam".
So in .NET it's possibly something like this (where text is the input string):
Regex.Replace(text, "\\p{P}+[a-zA-Z0-9]+", "");
Then you'd consider "any word with two or more punctuations consecutively":
Regex.Replace(text, " \\p{P}{2,} ", "");
Seems like a good start anyway.
Answer by Nick Dandoulakis for Regex to replace gibberish
A simple heuristic, similar to anonymous answer:
listA = [0,1,2..9, a,b,c..z, A,B,C,..Z , ...] // alphanumerical symbols
listB = [!@$%^&...] // other symbols
Na = number_of_alphanumeric_symbols( line )
Nb = number_of_other_symbols( line )
if Na/Nb <= garbage_ratio then
    // garbage
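The ratio heuristic above could be sketched in Python roughly as follows. The function name is_garbage and the default threshold of 2.0 are assumptions made for illustration, not values from the answer; the comparison is written as a multiplication so that a line with no "other" symbols does not cause a division by zero.

```python
def is_garbage(line, garbage_ratio=2.0):
    """Return True when alphanumerics are at most garbage_ratio times the other symbols.

    garbage_ratio is an arbitrary illustrative threshold; tune it on real data.
    """
    # Na: letters and digits
    na = sum(1 for c in line if c.isalnum())
    # Nb: everything that is neither alphanumeric nor whitespace
    nb = sum(1 for c in line if not c.isalnum() and not c.isspace())
    # Equivalent to Na/Nb <= garbage_ratio, but safe when Nb is zero
    return nb > 0 and na <= garbage_ratio * nb
```

A line from the clean part of the sample has only a handful of punctuation marks against dozens of letters, while the gibberish lines are mostly punctuation, so a fairly loose threshold may already separate them.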
Answer by Breton for Regex to replace gibberish
No idea how well it would work, but I have considered this problem in the past, idly. I've occasionally played with a little programmatic device called a Markov chain.
Now the Wikipedia article probably won't make much sense until you see some of the other things a Markov chain is good for. One example of a Markov chain in action is this Greeking generator. Another example is the MegaHAL chatbot.
Greeking is gibberish that looks like words. Markov chains provide a way of randomly generating a sequence of letters, but weighting the random choices to emulate the frequency patterns of an examined corpus. So for instance, given the letter "t", the letter "h" is more likely to show up next than any other letter. So you examine a corpus (say some newspapers, or blog postings) to produce a kind of fingerprint of the language you're targeting.
Now that you have that frequency table/fingerprint, you can examine your sample text, and rate each letter according to the likelihood of it appearing. Then, you can flag the letters under a particular threshold likelihood for removal. In other words, a surprise filter. Filter out surprises.
There's some leeway for how you generate your frequency tables. You're not limited to one letter following another. You can build a frequency table that predicts which letter will likely follow each digraph (group of two letters), or each trigraph, or quadgraph. You can work the other side, predicting likely and unlikely trigraphs to appear in certain positions, given some previous text.
It's kind of like a fuzzy regex. Rather than MATCH or NO MATCH, the whole text is scored on a sliding scale according to how similar it is to your reference text.
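As a rough illustration of the surprise filter, here is a minimal Python sketch of the simplest variant: a table of which character follows which, built from a reference corpus, and a score that averages how unlikely each transition in a sample is. The names train_bigrams and surprise, the add-one smoothing, and the alphabet_size of 256 are all assumptions chosen for the example, not part of the answer.

```python
import math
from collections import Counter

def train_bigrams(corpus):
    """Build the frequency 'fingerprint': counts of character pairs and of leading characters."""
    text = corpus.lower()
    pairs = Counter(zip(text, text[1:]))   # how often b follows a
    firsts = Counter(text[:-1])            # how often a starts a pair
    return pairs, firsts

def surprise(text, pairs, firsts, alphabet_size=256):
    """Average negative log-probability per character transition; higher means more surprising."""
    text = text.lower()
    transitions = list(zip(text, text[1:]))
    if not transitions:
        return 0.0
    total = 0.0
    for a, b in transitions:
        # Add-one smoothing (an assumption) so unseen pairs get a small nonzero probability
        p = (pairs[(a, b)] + 1) / (firsts[a] + alphabet_size)
        total -= math.log(p)
    return total / len(transitions)
```

You would train on a corpus of known-good text, score each line or chunk of the OCR output, and drop anything above a threshold picked by inspecting real samples. Longer contexts (two or three preceding letters) follow the same pattern with longer keys.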
Answer by JoshB for Regex to replace gibberish
I did a combo of eliminating lines that don't contain at least two words of 3 or more letters, or one word of 6 or more letters.
([a-zA-Z]{3,}\s){2,}|([a-zA-Z]{6,})
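A sketch of how that filter might be applied line by line in Python. The names WORDY and keep_wordy_lines are made up for the example, and the character class is written as [A-Za-z], since a | inside a character class would match a literal pipe character.

```python
import re

# Two consecutive words of 3+ letters (each followed by whitespace), or one word of 6+ letters
WORDY = re.compile(r"(?:[A-Za-z]{3,}\s){2,}|[A-Za-z]{6,}")

def keep_wordy_lines(text):
    """Return only the lines that contain something resembling real words."""
    return [line for line in text.splitlines() if WORDY.search(line)]
```

Note that this only decides whether to keep a whole line; gibberish that shares a line with real text survives.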
Answer by Jim P for Regex to replace gibberish
I like @Breton's answer - I'd suggest using his Corpus approach also with a library of known 'bad scans', which might be easier to identify because 'junk' has more internal consistency than 'good text' if it comes from bad OCR scans (the number of distinct glyphs is lower for example).
Answer by Jay for Regex to replace gibberish
Another good technique is to use a spell checker/dictionary and look up the 'words' after you've eliminated the unreadable stuff with regex.
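A minimal sketch of the dictionary lookup in Python, assuming the dictionary is simply a set of lowercase words that you load yourself (for instance from a word list file; the source of the list is up to you). The name known_word_fraction is hypothetical.

```python
import re

def known_word_fraction(text, dictionary):
    """Fraction of alphabetic tokens in text that appear in dictionary (a set of lowercase words)."""
    tokens = re.findall(r"[A-Za-z]+", text)
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token.lower() in dictionary)
    return known / len(tokens)
```

A chunk whose fraction falls below some cut-off would then be treated as gibberish; the cut-off has to be chosen by trial on real scans, since names and rare words will never all be in the dictionary.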
Answer by Sinan nr for Regex to replace gibberish
Here is a Perl implementation of the garbage_ratio heuristic:
#!/usr/bin/perl
use strict;
use warnings;

while ( defined( my $chunk = read_chunk(\*DATA) ) ) {
    next unless length $chunk;
    my @tokens = split ' ', $chunk;
    # what is a word?
    my @words = grep {
        /^[A-Za-z]{2,}[.,]?$/
            or /^[0-9]+$/
            or /^(?:a|I)$/
            or /^(?:[A-Z][.])+$/
    } @tokens;
    # completely arbitrary threshold
    my $score = @words / @tokens;
    print $chunk, "\n" if $score > 0.5;
}

sub read_chunk {
    my ($fh) = @_;
    my ($chunk, $line);
    # skip blank lines until the first non-blank line
    while ( my $line = <$fh> ) {
        if ( $line =~ /\S/ ) {
            $chunk .= $line;
            last;
        }
    }
    # accumulate lines until a blank line or end of input
    while (1) {
        $line = <$fh>;
        last unless (defined $line) and ($line =~ /\S/);
        $chunk .= $line;
    }
    return $chunk;
}

__DATA__
Paste the text above after __DATA__ (not repeating the text here to save space). Of course, the __DATA__ section is used only so that the posted script is self-contained. In real life, you would have code to open the file etc.
Output:
Federal prosecutors on Monday charged a Miami man with the largest case of credit and debit card data theft ever in the United States, accusing the one-time government informant of swiping 130 million accounts on top of 40 million he stole previously. Gonzalez is a former informant for the U.S. Secret Service who helped the agency hunt hackers, authorities say. The agency later found out that he had also been working with criminals and feeding them information on ongoing investigations, even warning off at least one individual, according to authorities.