Blog coding and discussion of coding about JavaScript, PHP, CGI, general web building etc.

Thursday, February 18, 2016

Python regex catastrophic backtracking

Python regex catastrophic backtracking


I am searching an XML file generated from Ms word for some phrases. The thing is that any phrase can be interrupted with some XML tags, that can come between words, or even inside words, as you can see in the example:

To increase knowledge of and acquired skills for implementing social policies with a view to strengthening the capacity of developing countries at the national and community level.

So my approach to handle this problem was to simply reduce all XML tags into clusters of # characters of the same length, so that when I can find any phrase, the regex would ignore all the XML tags between each two characters.

What I need basically is the span of this phrase within the actual xml document, so I will use this span into later processing with the xml document, I cannot use clones.

This approach works remarkablly, but some phrases cause catastropic backtracking, such as the following example, so I need someone to point out where does the backtracking come from, or suggest a better solution to the problem.

================================

Here is an example:

I have this text where there are some clusters of # characters within it (which I want to keep), and the spaces are also unpredictable, such as the following:

Relationship to the #################strategic framework ################## for the period 2014-2015####################: Programme 7, Economic and Social Affairs, subprogramme 3, expected

accomplishment (c)#######

In order to match the following phrase:

Relationship to the strategic framework for the period 2014-2015: programme 7, Economic and Social Affairs, subprogramme 3, expected accomplishment (c)

I came up with this regex to accommodate the unpredictable # and space characters:

u'R#*e#*l#*a#*t#*i#*o#*n#*s#*h#*i#*p#*\\s*#*t#*o#*\\s*#*t#*h#*e#*\\s*#*s#*t#*r#*a#*t#*e#*g#*i#*c#*\\s*#*f#*r#*a#*m#*e#*w#*o#*r#*k#*\\s*#*f#*o#*r#*\\s*#*t#*h#*e#*\\s*#*p#*e#*r#*i#*o#*d#*\\s*#*2#*0#*1#*4#*\\-#*2#*0#*1#*5#*:#*\\s*#*p#*r#*o#*g#*r#*a#*m#*m#*e#*\\s*#*7#*\\,#*\\s*#*E#*c#*o#*n#*o#*m#*i#*c#*\\s*#*a#*n#*d#*\\s*#*S#*o#*c#*i#*a#*l#*\\s*#*A#*f#*f#*a#*i#*r#*s#*\\,#*\\s*#*s#*u#*b#*p#*r#*o#*g#*r#*a#*m#*m#*e#*\\s*#*3#*\\,#*\\s*#*e#*x#*p#*e#*c#*t#*e#*d#*\\s*#*a#*c#*c#*o#*m#*p#*l#*i#*s#*h#*m#*e#*n#*t#*\\s*#*\\(#*c#*\\)'

And it works fine in all the other phrases that I want to match, but this one has a problem leading to some catastrophic backtracking, can anyone spot it?

The original text is separated with xml tags, so to make it simpler for the regex, I replaced the tags with these # clusters, here is the original text:

Relationship to the strategic framework for the period 2014-2015: Programme 7, Economic and Social Affairs, subprogramme 3, expected accomplishment (c)

Answer by neumino for Python regex catastrophic backtracking


Another simpler solution is to remove the pound keys with

your_string.replace('#', '')  

And test your regex (without all the #*) against the string returned by replace.

Answer by Saullo Castro for Python regex catastrophic backtracking


Without using regex you can obtain what you want doing:

text.replace('#','').replace('  ',' ')  

Answer by mishik for Python regex catastrophic backtracking


Since the situation is that complex - don't use regex, just go through your line symbol by symbol:

etalone = "String to find"  etalone_length = len(etalone)  counter = 0  for symbol in your_line:      if symbol == etalone[counter]:          counter += 1          if counter == etalone_length:              print("String matches")              break      elif symbol != " " and sybmol != "#":          # Bad char found          print("Does not match!")  else:  # exited 'for' before full etalone matched      print("Does not match!")  

I just figured out, that above will not, actually, work if the very first symbol we match is not the one we're looking for. How about this instead:

  1. Clone your string
  2. Remove "#" from clone
  3. Match against pattern
  4. If pattern matches - get the location of matched result
  5. By that location - find which exact occurrence of the first symbol was matched. Like if full line is a#b##ca#d#f and the line we're looking for is adf then we would start matching from the second a symbol.
  6. Find nth occurrence of symbol a in the original line. Set counter =
  7. Use above algorithm (storing as span start and counter before break as span end)

Answer by FMc for Python regex catastrophic backtracking


If I understand the problem correctly, here's a way to tackle the problem without resorting to pathological regular expressions or character-by-character parsing:

def do_it(search, text_orig, verbose = False):      # A copy of the text without any "#" markers.      text_clean = text_orig.replace('#', '')        # Start position of search text in the cleaned text.      try:               i = text_clean.index(search)      except ValueError: return [None, None]        # Collect the widths of the runs of markers and non-markers.      rgx    = re.compile(r'#+|[^#]+')      widths = [len(m.group()) for m in rgx.finditer(text_orig)]        # From that data, we can compute the span.      return compute_span(i, len(search), widths, text_orig[0] == '#')  

And here's a fairly simple way to compute the spans from the width data. My first attempt was incorrect, as noted by eyquem. The second attempt was correct but complex. This third approach seems both simple and correct.

def compute_span(span_start, search_width, widths, is_marker):      span_end       = span_start + search_width - 1      to_consume     = span_start + search_width      start_is_fixed = False        for w in widths:          if is_marker:              # Shift start and end rightward.              span_start += (0 if start_is_fixed else w)              span_end   += w          else:              # Reduce amount of non-marker text we need to consume.              # As that amount gets smaller, we'll first fix the              # location of the span_start, and then stop.              to_consume -= w              if to_consume < search_width:                  start_is_fixed = True                  if to_consume <= 0: break          # Toggle the flag.          is_marker = not is_marker        return [span_start, span_end]  

And a bunch of tests to keep the critics at bay:

def main():      tests = [          #                0123456789012345678901234567890123456789          ( [None, None], '' ),          ( [ 0,  5],     'foobar' ),          ( [ 0,  5],     'foobar###' ),          ( [ 3,  8],     '###foobar' ),          ( [ 2,  7],     '##foobar###' ),          ( [25, 34],     'BLAH ##BLAH fo####o##ba##foo###b#ar' ),          ( [12, 26],     'BLAH ##BLAH fo####o##ba###r## BL##AH' ),          ( [None, None], 'jkh##jh#f' ),          ( [ 1, 12],     '#f#oo##ba###r##' ),          ( [ 4, 15],     'a##xf#oo##ba###r##' ),          ( [ 4, 15],     'ax##f#oo##ba###r##' ),          ( [ 7, 18],     'ab###xyf#oo##ba###r##' ),          ( [ 7, 18],     'abx###yf#oo##ba###r##' ),          ( [ 7, 18],     'abxy###f#oo##ba###r##' ),          ( [ 8, 19],     'iji#hkh#f#oo##ba###r##' ),          ( [ 8, 19],     'mn##pps#f#oo##ba###r##' ),          ( [12, 23],     'mn##pab###xyf#oo##ba###r##' ),          ( [12, 23],     'lmn#pab###xyf#oo##ba###r##' ),          ( [ 0, 12],     'fo##o##ba###r## aaaaaBLfoob##arAH' ),          ( [ 0, 12],     'fo#o##ba####r## aaaaaBLfoob##ar#AH' ),          ( [ 0, 12],     'f##oo##ba###r## aaaaaBLfoob##ar' ),          ( [ 0, 12],     'f#oo##ba####r## aaaaBL#foob##arAH' ),          ( [ 0, 12],     'f#oo##ba####r## aaaaBL#foob##ar#AH' ),          ( [ 0, 12],     'foo##ba#####r## aaaaBL#foob##ar' ),          ( [ 1, 12],     '#f#oo##ba###r## aaaBL##foob##arAH' ),          ( [ 1, 12],     '#foo##ba####r## aaaBL##foob##ar#AH' ),          ( [ 2, 12],     '#af#oo##ba##r## aaaBL##foob##ar' ),          ( [ 3, 13],     '##afoo##ba###r## aaaaaBLfoob##arAH' ),          ( [ 5, 17],     'BLAHHfo##o##ba###r aaBLfoob##ar#AH' ),          ( [ 5, 17],     'BLAH#fo##o##ba###r aaBLfoob##ar' ),          ( [ 5, 17],     'BLA#Hfo##o##ba###r###BLfoob##ar' ),          ( [ 5, 17],     'BLA#Hfo##o##ba###r#BL##foob##ar' ),      ]      for exp, t in tests:          span = do_it('foobar', t, verbose = True)          if exp != span:              print '\n0123456789012345678901234567890123456789'              print t              print n              print dict(got = span, exp = exp)    main()  

Answer by sleeplessnerd for Python regex catastrophic backtracking


Depth first search with XML Parser ?

Maybe remember the position in the xml document where the text node was found, for later reverse lookup. You actual goal is still unclear.

Answer by Christian Semrau for Python regex catastrophic backtracking


The backtracking catastrophe may be caused because your regex contains multiple instances of the pattern #*\\s*#*: Each of these will match any block of repeated #, but it can match the same text in multiple ways. When you have several of these patterns in your regex, the numbers of possibilities multiply.

Are your searching in a larger body of text? If so, does the text contain phrases which coincide with the beginning of your search text? If so, the regex engine matches the beginning of the pattern, and on finding a mismatch, starts backtracking.

Note that the text framework ################## for is not matched by the regex f#*r#*a#*m#*e#*w#*o#*r#*k#*\\s*#*f#*o#*r because of an unmatched space characters.

Possible solutions with regexes:

1 Use possessive quantifiers instead of the standard greedy quantifiers. Unfortunately, according to this page, Python does not support possessive quantifiers.

2 Replace the pattern #*\\s*#* with (#|\\s)*, which will reduce the number of ways your regex can match a text. Note that this changed regex can match more than your original text (specifically, the suggested pattern will match the text ## ## ## which the original pattern does not match).

Answer by eyquem for Python regex catastrophic backtracking


In a previous answer, I used re and difflib modules, and your principle of replacing every tag with a character.
But I realized that your problem can be solved using only re and without having to do substitution with an arbitrary character.

Import

import re  

Data

I used tuples to be able to display data in a more readable form during execution

Note that I slightly modified the data to avoid some problems:
put only one blank between framework and for the period,
major P at Programm 7 in the two strings, etc

Norte also that I added a sequence of characters ### in phrase and xmltext (in front of the date 2014-2015) to show that my code still works in this case. Other answers don't manage this eventuality.

Phrase

tu_phrase = ('Relationship to the ',               'strategic framework ',               'for the period ###2014-2015',               ': Programme 7, Economic and Social Affairs, ',               'subprogramme 3, expected accomplishment (c)')  phrase = ''.join(tu_phrase)  

XML text

tu_xmltext = ('EEEEEEE',                '',                'AAAAAAA',                '',                'Relationship to the ',                '',                '',                ''                'strategic framework ',                '',                '',                '',                'for the period ###2014-2015',                '',                '',                '',                ': Programme 7, Economic and Social Affairs, ',                'subprogramme 3, expected accomplishment (c)',                '',                '321354641331')  xmltext = ''.join(tu_xmltext)  

The working functions

The function olding_the_new(stuvw , pat_for_sub) returns a list of triples (pmod,w,pori)
expressing correspondance of positions of common sequences
in stuvw and re.sub(pat_for_sub, stuvw).
These sequences are the ones in stuvw that aren't catched by group(1) of pat_for_sub :
- (pmod,w) describes a sequence in re.sub(pat_for_sub, stuvw)
- pmod is its position in re.sub(pat_for_sub, stuvw)
- w is its width [it's the same in re.sub(pat_for_sub, stuvw) and stuvw]
- pori is the position of this sequence in original stuvw

def olding_the_new(stuvw,pat_for_sub):      triples = []      pmod = 0 # pmod = position in modified stuvw,               # that is to say in re.sub(pat_for_sub,'',stuvw)      for mat in re.finditer('{0}|([\s\S]+?)(?={0}|\Z)'.format(pat_for_sub),                             stuvw):          if mat.group(1):              triples.append((pmod,mat.end()-mat.start(),mat.start()))              pmod += mat.end()-mat.start()      return triples      def finding(LITTLE,BIG,pat_for_sub,              olding_the_new=olding_the_new):      triples = olding_the_new(BIG,'(?:%s)+' % pat_for_sub)      modBIG = re.sub(pat_for_sub,'',BIG)      modLITTLE = re.escape(LITTLE)      for mat in re.finditer(modLITTLE,modBIG):          st,nd = mat.span() # in modBIG          sori = -1 # start original, id est in BIG          for tr in triples:              if st < tr[0]+tr[1] and sori<0:                  sori = tr[2] + st - tr[0]               if nd<=tr[0]+tr[1]:                  yield(sori, tr[2] + nd - tr[0])                  break  

Execution

if __name__ == '__main__':      print ('---------- phrase ----------\n%s\n'             '\n------- phrase written in a readable form --------\n'             '%s\n\n\n'             '---------- xmltext ----------\n%s\n'             '\n------- xmltext written in a readable form --------\n'             '%s\n\n\n'             %             (phrase  , '\n'.join(tu_phrase),              xmltext , '\n'.join(tu_xmltext))    )        print ('*********************************************************\n'             '********** Searching for phrase in xmltext **************\n'             '*********************************************************')        spans = finding(phrase,xmltext,']*>')      if spans:          for s,e in spans:              print ("\nspan in string 'xmltext' :  (%d , %d)\n\n"                     'xmltext[%d:%d] :\n%s'                     % (s,e,s,e,xmltext[s:e]))      else:          print ("-::: The first string isn't in second string :::-")  

RESULT

*********************************************************  ********** Searching for phrase in xmltext **************  *********************************************************    span in string 'xmltext' :  (34 , 448)    xmltext[34:448] :  Relationship to the strategic framework for the period ###2014-2015: Programme 7, Economic and Social Affairs, subprogramme 3, expected accomplishment (c)  

Nota Bene

My code can't detect a phrase in an XML document when the sequences of whitespaces between two words are not exactly the same in phrase and in the XML text.
I tried to obtain this possibility, but it's too much complicated.
In your example, in the XML sequence you show, there's a blank between strategic framework and the following tags, and another blank between these tags and the following for the period. In this condition, my code couldn't work (I doubt that other answers can do better on this point), then I used an xmltext without a blank in front of for the period.

On the other side, my code doesn't use a replacement character, then any character may be in the XML document and the phrase, without having any problem with a character that would be in them while used as the replacement character.

My code directly gives span in the original XML document, not in an intermediary text modified with a replacement character.

It gives all the occurences of phrase in the XML document, not only the first.

................................

With following data:

print ('\n*********************************************************\n'         "********* Searching for 'foobar' in samples *************\n"         '*********************************************************')    for xample in ('fo##o##ba###r## aaaaaBLfoob##arAH',                 '#fo##o##ba###r## aaaaaBLfoob##arAH',                 'BLAHHfo##o##ba###r   BLfoob##arAH',                 'BLAH#fo##o##ba###rBLUHYfoob##arAH',                 'BLA# fo##o##ba###rBLyyyfoob##ar',                 'BLA# fo##o##ba###rBLy##foob##ar',                 'kjhfqshqsk'):      spans = list(finding('foobar',xample,'#'))      if spans:          print ('\n%s\n%s'                 %                 (xample,                  '\n'.join('%s  %s'                            % (sp,xample[sp[0]:sp[1]])                            for sp in spans))                 )      else:          print ("\n%s\n-::: Not found :::-" % xample)  

the results are:

*********************************************************  ********* Searching for 'foobar' in samples *************  *********************************************************    fo##o##ba###r## aaaaaBLfoob##arAH  (0, 13)  fo##o##ba###r  (23, 31)  foob##ar    #fo##o##ba###r## aaaaaBLfoob##arAH  (1, 14)  fo##o##ba###r  (24, 32)  foob##ar    BLAHHfo##o##ba###r   BLfoob##arAH  (5, 18)  fo##o##ba###r  (23, 31)  foob##ar    BLAH#fo##o##ba###rBLUHYfoob##arAH  (5, 18)  fo##o##ba###r  (23, 31)  foob##ar    BLA# fo##o##ba###rBLyyyfoob##ar  (5, 18)  fo##o##ba###r  (23, 31)  foob##ar    BLA# fo##o##ba###rBLy##foob##ar  (5, 18)  fo##o##ba###r  (23, 31)  foob##ar    kjhfqshqsk  -::: Not found :::-  

..........................................

With the following code, I examined your question:

import urllib    sock = urllib.urlopen('http://stackoverflow.com/'                        'questions/17381982/'                        'python-regex-catastrophic-backtracking-where')  r =sock.read()  sock.close()    i = r.find('unpredictable, such as the following')  j = r.find('in order to match the following phrase')  k = r.find('I came up with this regex ')    print 'i == %d   j== %d' % (i,j)  print repr(r[i:j])      print  print 'j == %d   k== %d' % (j,k)  print repr(r[j:k])  

The result is:

i == 10408   j== 10714  'unpredictable, such as the following:

\n\n
\n Relationship to the #################strategic framework ################## for the period 2014-2015####################: Programme 7, Economic and Social Affairs, subprogramme 3, expected\n \n

accomplishment (c)#######

\n
\n\n

so ' j == 10714 k== 10955 'in order to match the following phrase:

\n\n
\n

Relationship to the strategic framework for the period 2014-2015:\n programme 7, Economic and Social Affairs, subprogramme 3, expected\n accomplishment (c)

\n
\n\n

'

Note the additional \n in front of programme 7 , additional \n

in front of accomplishment, the difference Programme 7 and programme 7 , and the presence of two blanks between framework and for the period in the string framework ################## for the period
This may explain your difficulties with your example.

Answer by eyquem for Python regex catastrophic backtracking


The following code shows that FMc 's code doesn't work.

The line
from name_of_file import olding_the_new,finding refers to my code in my personal answer in this thread to this question.
* Give the name name_of_file to the file containing the script of my code (lying in my other answer in this thread) and it will run.
* Or if you are annoyed to copy-paste my code, just comment this line of import, and the following code will run because I put a try-except instruction that will react correctly to the absence of olding_the_new and finding

I used two ways to verify the results of FMc 's code:
-1/ comparing the span returned by his code with index of 'f' and index of 'r', as we search for phrase 'foobar' and I managed that there are no f and r other than the ones in foobar
-2/ comparing with the first span returned by my code, hence the need of the above import from name_of_file

Nota Bene

If disp = None is changed to disp == True, the execution displays intermediary results that help to understand the algorithm.

.

import re  from name_of_file import olding_the_new,finding    def main():      # Two versions of the text: the original,      # and one without any of the "#" markers.      for text_orig  in ('BLAH ##BLAH fo####o##ba###r## BL##AH',                         'jkh##jh#f',                         '#f#oo##ba###r##',                         'a##xf#oo##ba###r##',                         'ax##f#oo##ba###r##',                         'ab###xyf#oo##ba###r##',                         'abx###yf#oo##ba###r##',                         'abxy###f#oo##ba###r##',                         'iji#hkh#f#oo##ba###r##',                         'mn##pps#f#oo##ba###r##',                         'mn##pab###xyf#oo##ba###r##',                         'lmn#pab###xyf#oo##ba###r##',                         'fo##o##ba###r## aaaaaBLfoob##arAH',                         'fo#o##ba####r## aaaaaBLfoob##ar#AH',                         'f##oo##ba###r## aaaaaBLfoob##ar',                         'f#oo##ba####r## aaaaBL#foob##arAH',                         'f#oo##ba####r## aaaaBL#foob##ar#AH',                         'foo##ba#####r## aaaaBL#foob##ar',                         '#f#oo##ba###r## aaaBL##foob##arAH',                         '#foo##ba####r## aaaBL##foob##ar#AH',                         '#af#oo##ba##r## aaaBL##foob##ar',                         '##afoo##ba###r## aaaaaBLfoob##arAH',                         'BLAHHfo##o##ba###r aaBLfoob##ar#AH',                         'BLAH#fo##o##ba###r aaBLfoob##ar',                         'BLA#Hfo##o##ba###r###BLfoob##ar',                         'BLA#Hfo##o##ba###r#BL##foob##ar',                         ):            text_clean = text_orig.replace('#', '')          # Collect data on the positions and widths          # of the markers in the original text.          rgx     = re.compile(r'#+')          markers = [(m.start(), len(m.group()))                     for m in rgx.finditer(text_orig)]            # Find the location of the search phrase in the cleaned text.          # At that point you'll have all the data you need to compute          # the span of the phrase in the original text.          search = 'foobar'          try:              i = text_clean.index(search)              print ('text_clean == %s\n'                     "text_clean.index('%s')==%d   len('%s') == %d\n"                     'text_orig  == %s\n'                     'markers  == %s'                     % (text_clean,                        search,i,search,len(search),                        text_orig,                        markers))              S,E = compute_span(i, len(search), markers)              print "span = (%d,%d)  %s %s     %s"\                    % (S,E,                       text_orig.index('f')==S,                       text_orig.index('r')+1==E,                       list(finding(search,text_orig,'#+')))          except ValueError:              print ('text_clean == %s\n'                     "text_clean.index('%s')   ***Not found***\n"                     'text_orig  == %s\n'                     'markers  == %s'                     % (text_clean,                        search,                        text_orig,                        markers))          print '--------------------------------'  

.

def compute_span(start, width, markers):      # start and width are in expurgated text      # markers are in original text      disp = None # if disp==True => displaying of intermediary results      span_start = start      if disp:          print ('\nAt beginning in compute_span():\n'                 '  span_start==start==%d   width==%d'                 % (start,width))      for s, w in markers: # s and w are in original text          if disp:              print ('\ns,w==%d,%d'                     '   s+w-1(%d)

.

main()  

Result

>>>   text_clean == BLAH BLAH foobar BLAH  text_clean.index('foobar')==10   len('foobar') == 6  text_orig  == BLAH ##BLAH fo####o##ba###r## BL##AH  markers  == [(5, 2), (14, 4), (19, 2), (23, 3), (27, 2), (32, 2)]  span = (12,26)  True False     [(12, 27)]  --------------------------------  text_clean == jkhjhf  text_clean.index('foobar')   ***Not found***  text_orig  == jkh##jh#f  markers  == [(3, 2), (7, 1)]  --------------------------------  text_clean == foobar  text_clean.index('foobar')==0   len('foobar') == 6  text_orig  == #f#oo##ba###r##  markers  == [(0, 1), (2, 1), (5, 2), (9, 3), (13, 2)]  span = (0,11)  False False     [(1, 13)]  --------------------------------  text_clean == axfoobar  text_clean.index('foobar')==2   len('foobar') == 6  text_orig  == a##xf#oo##ba###r##  markers  == [(1, 2), (5, 1), (8, 2), (12, 3), (16, 2)]  span = (2,16)  False True     [(4, 16)]  --------------------------------  text_clean == axfoobar  text_clean.index('foobar')==2   len('foobar') == 6  text_orig  == ax##f#oo##ba###r##  markers  == [(2, 2), (5, 1), (8, 2), (12, 3), (16, 2)]  span = (2,15)  False False     [(4, 16)]  --------------------------------  text_clean == abxyfoobar  text_clean.index('foobar')==4   len('foobar') == 6  text_orig  == ab###xyf#oo##ba###r##  markers  == [(2, 3), (8, 1), (11, 2), (15, 3), (19, 2)]  span = (4,19)  False True     [(7, 19)]  --------------------------------  text_clean == abxyfoobar  text_clean.index('foobar')==4   len('foobar') == 6  text_orig  == abx###yf#oo##ba###r##  markers  == [(3, 3), (8, 1), (11, 2), (15, 3), (19, 2)]  span = (4,18)  False False     [(7, 19)]  --------------------------------  text_clean == abxyfoobar  text_clean.index('foobar')==4   len('foobar') == 6  text_orig  == abxy###f#oo##ba###r##  markers  == [(4, 3), (8, 1), (11, 2), (15, 3), (19, 2)]  span = (4,19)  False True     [(7, 19)]  --------------------------------  text_clean == ijihkhfoobar  text_clean.index('foobar')==6   len('foobar') == 6  text_orig  == iji#hkh#f#oo##ba###r##  markers  == [(3, 1), (7, 1), (9, 1), (12, 2), (16, 3), (20, 2)]  span = (7,18)  False False     [(8, 20)]  --------------------------------  text_clean == mnppsfoobar  text_clean.index('foobar')==5   len('foobar') == 6  text_orig  == mn##pps#f#oo##ba###r##  markers  == [(2, 2), (7, 1), (9, 1), (12, 2), (16, 3), (20, 2)]  span = (7,18)  False False     [(8, 20)]  --------------------------------  text_clean == mnpabxyfoobar  text_clean.index('foobar')==7   len('foobar') == 6  text_orig  == mn##pab###xyf#oo##ba###r##  markers  == [(2, 2), (7, 3), (13, 1), (16, 2), (20, 3), (24, 2)]  span = (9,24)  False True     [(12, 24)]  --------------------------------  text_clean == lmnpabxyfoobar  text_clean.index('foobar')==8   len('foobar') == 6  text_orig  == lmn#pab###xyf#oo##ba###r##  markers  == [(3, 1), (7, 3), (13, 1), (16, 2), (20, 3), (24, 2)]  span = (9,24)  False True     [(12, 24)]  --------------------------------  text_clean == foobar aaaaaBLfoobarAH  text_clean.index('foobar')==0   len('foobar') == 6  text_orig  == fo##o##ba###r## aaaaaBLfoob##arAH  markers  == [(2, 2), (5, 2), (9, 3), (13, 2), (27, 2)]  span = (0,9)  True False     [(0, 13), (23, 31)]  --------------------------------  text_clean == foobar aaaaaBLfoobarAH  text_clean.index('foobar')==0   len('foobar') == 6  text_orig  == fo#o##ba####r## aaaaaBLfoob##ar#AH  markers  == [(2, 1), (4, 2), (8, 4), (13, 2), (27, 2), (31, 1)]  span = (0,7)  True False     [(0, 13), (23, 31)]  --------------------------------  text_clean == foobar aaaaaBLfoobar  text_clean.index('foobar')==0   len('foobar') == 6  text_orig  == f##oo##ba###r## aaaaaBLfoob##ar  markers  == [(1, 2), (5, 2), (9, 3), (13, 2), (27, 2)]  span = (0,11)  True False     [(0, 13), (23, 31)]  --------------------------------  text_clean == foobar aaaaBLfoobarAH  text_clean.index('foobar')==0   len('foobar') == 6  text_orig  == f#oo##ba####r## aaaaBL#foob##arAH  markers  == [(1, 1), (4, 2), (8, 4), (13, 2), (22, 1), (27, 2)]  span = (0,8)  True False     [(0, 13), (23, 31)]  --------------------------------  text_clean == foobar aaaaBLfoobarAH  text_clean.index('foobar')==0   len('foobar') == 6  text_orig  == f#oo##ba####r## aaaaBL#foob##ar#AH  markers  == [(1, 1), (4, 2), (8, 4), (13, 2), (22, 1), (27, 2), (31, 1)]  span = (0,8)  True False     [(0, 13), (23, 31)]  --------------------------------  text_clean == foobar aaaaBLfoobar  text_clean.index('foobar')==0   len('foobar') == 6  text_orig  == foo##ba#####r## aaaaBL#foob##ar  markers  == [(3, 2), (7, 5), (13, 2), (22, 1), (27, 2)]  span = (0,7)  True False     [(0, 13), (23, 31)]  --------------------------------  text_clean == foobar aaaBLfoobarAH  text_clean.index('foobar')==0   len('foobar') == 6  text_orig  == #f#oo##ba###r## aaaBL##foob##arAH  markers  == [(0, 1), (2, 1), (5, 2), (9, 3), (13, 2), (21, 2), (27, 2)]  span = (0,11)  False False     [(1, 13), (23, 31)]  --------------------------------  text_clean == foobar aaaBLfoobarAH  text_clean.index('foobar')==0   len('foobar') == 6  text_orig  == #foo##ba####r## aaaBL##foob##ar#AH  markers  == [(0, 1), (4, 2), (8, 4), (13, 2), (21, 2), (27, 2), (31, 1)]  span = (0,12)  False False     [(1, 13), (23, 31)]  --------------------------------  text_clean == afoobar aaaBLfoobar  text_clean.index('foobar')==1   len('foobar') == 6  text_orig  == #af#oo##ba##r## aaaBL##foob##ar  markers  == [(0, 1), (3, 1), (6, 2), (10, 2), (13, 2), (21, 2), (27, 2)]  span = (2,10)  True False     [(2, 13), (23, 31)]  --------------------------------  text_clean == afoobar aaaaaBLfoobarAH  text_clean.index('foobar')==1   len('foobar') == 6  text_orig  == ##afoo##ba###r## aaaaaBLfoob##arAH  markers  == [(0, 2), (6, 2), (10, 3), (14, 2), (28, 2)]  span = (1,14)  False True     [(3, 14), (24, 32)]  --------------------------------  text_clean == BLAHHfoobar aaBLfoobarAH  text_clean.index('foobar')==5   len('foobar') == 6  text_orig  == BLAHHfo##o##ba###r aaBLfoob##ar#AH  markers  == [(7, 2), (10, 2), (14, 3), (27, 2), (31, 1)]  span = (5,14)  True False     [(5, 18), (23, 31)]  --------------------------------  text_clean == BLAHfoobar aaBLfoobar  text_clean.index('foobar')==4   len('foobar') == 6  text_orig  == BLAH#fo##o##ba###r aaBLfoob##ar  markers  == [(4, 1), (7, 2), (10, 2), (14, 3), (27, 2)]  span = (4,16)  False False     [(5, 18), (23, 31)]  --------------------------------  text_clean == BLAHfoobarBLfoobar  text_clean.index('foobar')==4   len('foobar') == 6  text_orig  == BLA#Hfo##o##ba###r###BLfoob##ar  markers  == [(3, 1), (7, 2), (10, 2), (14, 3), (18, 3), (27, 2)]  span = (5,14)  True False     [(5, 18), (23, 31)]  --------------------------------  text_clean == BLAHfoobarBLfoobar  text_clean.index('foobar')==4   len('foobar') == 6  text_orig  == BLA#Hfo##o##ba###r#BL##foob##ar  markers  == [(3, 1), (7, 2), (10, 2), (14, 3), (18, 1), (21, 2), (27, 2)]  span = (5,14)  True False     [(5, 18), (23, 31)]  --------------------------------  >>>   

.

---------------------------------------------

The code of FMc is very subtle, I had a long hard time to understand its principle and then to be able to correct it.
I will let anybody the task to understand the algorithm. I only say the corrections required to make the code of FMc to work correctly:

.

First correction:

if s + w - 1 < start:  # must be changed to    if s + w - 1 <= start or (s==start):  

EDIT

In my initial present answer,
I had written ... or (s<=start).
That was an error of me, in fact I had the intention to write
.. or (s==start)

NOTA BENE about this EDIT:

This error had no consequence in the code corrected with the two corrections I describe here to correct the initial code of FMc (the very first one, because at present he has changed it two times).
Indeed, if you correct the code with the two corrections, you'll obtain correct results with all the 25 examples taken for text_orig, as well with ... or (s <= start) as with ... or (s==start).
So I thought that the case in which s < start is True could never happen when the first condition s+w-1 <= start is False, presumely based on the fact that w is always greater than 0 and some other reason due to the configuration of the markers and non-marker sequences.....
So I tried to find the demonstration of this impression... and I failed.
Moreover, I reached a state of mind in which I even no more understand the algorithm of FMc (the very first one before any edit he did) !!
Despite this, I let this answer as it is, and I post, at the end of this answer, the explanations trying to explain why these corrections are needed.
But I warn: the very first algorithm of FMc is very outlandish and difficult to comprehend because it does comparison of indices belonging to two different strings, one which is text_orig with markers #### and another one cleaned of all these markers.... and now I am no more convinced that may have a sense....

.

Second correction::

start += w  width = width - (s - start)  # must be changed to     width -= (s-start) # this line MUST BE before the following one  start = s + w # because start += (s-start) + w  

-------------------

I am stunned that 2 people upvoted the answer of FMc though it gives a wrong code. It means that they upvoted an answer without having tested the given code.

----------------------------------------

.

EDIT

Why must the condition if s + w - 1 < start: must be changed to this one:
if s + w - 1 <= start or (s==start): ?

Because it may happen that s + w - 1 < start should be False and s equals start together.
In this case, the execution goes to the else section and executes (in corrected code):

width -= (s - start)  start = s + w  

Consequently, width doesn't change while it evidently should change when we see the sequence concerned.

This case may occur at the moment the first marker is examined, as with the following sequences:

'#f#oo##ba###r##' : s,w==0,1 , 0==s==start==0    'ax##f#oo##ba###r##' : s,w==2,2 , 2==s==start==2      'abxy###f#oo##ba###r##' : s,w==4,3 , 4==s==start==4    '#f#oo##ba###r## aaaBL##foob##arAH' : s,w==0,1 , 0==s==start==0    'BLAH#fo##o##ba###r aaBLfoob##ar' : s,w==4,1 4==s==start==4  

With the following ones, it occurs for the examination of the second marker:

'iji#hkh#f#oo##ba###r##' : s,w==7,1 , 7==s==start==7    'mn##pps#f#oo##ba###r##' : s,w==7,1 , 7==s==start==7    

It can be understood better by executing my code with disp = True setted.

When s + w - 1 <= start is verified, the fact that s may equal start isn't troublesome because the execution doesn't go to the else section, it goes to the first section in which there's only the addition of w to s and to start.
But when s + w - 1 <= start is False while s equals start, the execution goes to the else section where the execution of instruction width -= (s-start) doesn't change anything to the value of width and that's troublesome.
So the condition or (s==start) must be added to prevent this else destination, and it is needed to put it after an or to prevent this destination even when s+w-1 <= start is False, which can happen as some examples show it.

.

Concerning the fact that the instruction s+w-1 < start must be changed to s+w-1 <= start (with =),
it's because of the case w==1 corresponding to 1 character # only ,
as for the cases
mn##pps#f#oo##ba###r## (second marker)
and BLAH#fo##o##ba###r (first marker).


Fatal error: Call to a member function getElementsByTagName() on a non-object in D:\XAMPP INSTALLASTION\xampp\htdocs\endunpratama9i\www-stackoverflow-info-proses.php on line 72

0 comments:

Post a Comment

Popular Posts

Powered by Blogger.