On Rewriting difflib

I hade some [missadventures] with python standard library with [difflib].

computing file differences is closesly related to the [Longuest Common Subsequence] problem, which is supposed to be a classical problem in computer science. Unfortunately, or fortunately, I do not have a computer science background even if I know a thing or two.

Doing the diff of two sequences of size $n$, ie compting the LCS, has a computing time in $\mathcal{O}(n^2)$, and one can achieven a space complexity in $\mathcal O (n)$ if only interested in the length of the LCS and not one of the LCS themselves.

It shall be noted that if one knows properties of the alphabet used in the sequence, or even information on the lcs of theses sequences, more advanced algorithms can be used. This is the case for example when somparing DNA, whe the size of your alphabet (ACGT) is 4 (or six soon maybe), (Nature paper behind paywall, so no links.), were the four russian trick can be used.

From a discussion with Nick Coghlan at Pycon not so long ago, I understood that if I wanted to have somethign changed in the stdlib of python I had to have arguments, and as difflib in one of the things I would like to be fix because I think the abstaction and API is bad. I like to take a look at what I can do.

Some utlities

As I will compare many implementation, I'll just use a decorator that register my function in a global dict for later quick comparison (@greg). Also we will make this decorator check at compile time that the LCS function is symetric (give same result is you swap the arguments), and check a few basic test cases to be sure the implementation is not wrong.

we will use the following cases to check the implementation, we avoid long test cases were stdlib might get it wrong, we will comme back to that later.


In [1]:
cd ..


/Users/bussonniermatthias/difflib2

In [2]:
import difflib
import difflib2.utils as utils
import ipythonblocks
import pandas as pd
import string
import seaborn
import matplotlib.pyplot as plt

from difflib2.utils import greg,gdict, gendepth2
from array import array

%matplotlib inline

In [3]:
cases= utils.test_cases

In [4]:
from difflib2.utils import opt_common_char
from difflib2.utils import rem_common_extrem 
from difflib2.utils import hl

from difflib2.report import report
%run difflib2/cplx_data.ipy

Basic implementation

Nothing fancy, no optimisation, compute the lcs matrix, the last number will be the length of the LCS, it also can allow you to get back all the LCS by backtracking, it's our reference. Goal here is not really to explain the algorithm here.


In [5]:
from difflib2.reference import _lcsmatrix,lcsmatrix
hl('difflib2/reference.py')


this implementation passes basic tests
Out[5]:
from .utils import greg
def _lcsmatrix(s1, s2):
    """ compute the LCS matricx of 2 sequecnce
    """
    m = len(s1)
    n = len(s2) 
    mtx = [[0 for x in range(n)] for y in range(m)]
    for i,c1 in enumerate(s1):
        for j,c2 in enumerate(s2): 
            if c1 == c2 : 
                if i == 0 or j == 0:
                    mtx[i][j] = 1
                else:
                    mtx[i][j] = mtx[i-1][j-1]+1
            else :
                if j == 0: 
                    mtx[i][j] = mtx[i-1][j]
                elif i == 0:
                    mtx[i][j] = mtx[i][j-1]
                else :
                    mtx[i][j] = max(mtx[i][j-1],mtx[i-1][j])
    return mtx

@greg # check implementation is correct
def lcsmatrix(*args):
    """ return the last elemen t of the LCS matrix"""
    return _lcsmatrix(*args)[-1][-1]

In [6]:
%run difflib2/test_case.ipy

Wrapper around stdlib to behave with the same API.


In [7]:
@greg
def stdlib(s1,s2):
    return sum(list(map(lambda x:x.size,difflib.SequenceMatcher(None, s1,s2,autojunk=False).get_matching_blocks())))


this implementation passes basic tests

In [8]:
test_case( cases[12],10,10, ldict={'lcsmatrix': lcsmatrix , 'stdlib':stdlib})


Ok so stdlib is fast, but both implementation get the same result.

First optimisation, we use only $2n$ in memory, will it speed up the time of computation ? Drawback, we can only get the lengh of the LCS not the LCSs themselves.


In [9]:
from difflib2.low_m import lcs_low_m


this implementation passes basic tests

Same using python standard lib and array module that get efficient integer array. Is it worth it ?


In [10]:
from difflib2.low_ma import lcs_low_ma


this implementation passes basic tests

In [11]:
test_case( cases[12],15,15, 
          ldict={'lcsmatrix' : lcsmatrix ,
           'stdlib'    : stdlib,
           'lcs_low_m' : lcs_low_m,
           'lcs_low_ma': lcs_low_ma})


Won't get into much details here, but basically you do not need to compute the all matrix if your chains are quite alike. You can stay "close" to the diagonal near the end. On highly similar sequence this should help a lot.

Implementation is handwaved, there might be off by one errors. But I suppose this would change complexity for some case.

There is though some overhead computation time, so the tradeoff, is not obvious.


In [12]:
from difflib2.utils import lorem

def withb(func,s1,s2,*args,**kwargs):
    bg = ipythonblocks.BlockGrid(len(s2),len(s1), block_size=10)
    r = func(s1,s2,*args,bg=bg,**kwargs)
    print(r)
    return bg

In [13]:
from lcs_cutmodule import lcs_cut2

In [14]:
test_case( cases[12],15,15, 
          ldict={'lcsmatrix' : lcsmatrix ,
           'stdlib'    : stdlib,
           'lcs_cut2'  : lcs_cut2})


For a longer benchmark comparaison, jump to the end of the blog post where you have many more implementation and test-cases.

Code complexity


In [15]:
seed = string.ascii_lowercase + string.digits + string.punctuation
cplx_case = (
# seed1,seed2,upperbound,[name]
(seed,seed,3.2,'ascci, long'),
(lorem,lorem,3.2,'lorem'),
('aaab','aaabb',2.7),
('aaab','aaacc',2.7),
("dcbet","badct",2.8),
("xzbet","badct",3.0),
("xada","bada",3.0),
(lorem, str(reversed(lorem)),3.2,'lorem (reversed)')
)

def opt(func):
    return  opt_common_char(rem_common_extrem(func))

fast_lcs = opt(lcs_cut2)
fast_lcs.__name__ = 'fast_lcs'
fast_lcs = greg(fast_lcs)


this implementation passes basic tests

In [16]:
import time
for case,guide in zip([cplx_case[i] for i in [2,5]],
                      [{'stdlib':3,'fast_lcs':2},
                       {'stdlib':3,'fast_lcs':1}]):
    
    if len(case) == 4:
        name = case[3]
    else :
        name = '(%s,%s)' %(case[0],case[1])
    print('\n====',name,'====')
    t= time.time()
    df,dr = cplx_data(case, {'lcsmatrix':lcsmatrix,'stdlib':stdlib,'fast_lcs':fast_lcs},
                      r=1, nt=25, npoints=15)        
    tp = time.time()
    print('\n',int(tp-t),'s')
    report(df,dr,name=name, guides=guide)


==== (aaab,aaabb) ====

testing  stdlib          ...............
testing  fast_lcs        ...............
testing  lcsmatrix       ...............
 10 s

==== (xzbet,badct) ====

testing  stdlib          ...............
testing  fast_lcs        ...............
testing  lcsmatrix       ...............
 26 s

Here we can see in first graph a case where the standard library function speed is roughly the same that my implementatation, but tend to have a complexity in $O(n^3)$ (dotted line), while fast_lcs behave in $O(n^2)$ (also dotted lin). On the graph on the right, you can though, see that the standard lib get the ratio around 20% too low, by still beeing wrong.

On the second graph, you can see a more extreem case, where the standard lib behave again in $O(n^3)$, this time it get the right answer, but with carefull optimisation, the détermination of the LCS can be done in $O(n)$ For a list of ~1000 element, this make the execution time rop form 1s to around 1ms.

Here the huge increase of speed is due to the fact that the alphabets in the 2 sequences are mostly disjoint, and there is no need to include the non-common letters in the comparaison.

What about getting the subsequence ?

The carefull reader will have seen that the actual standard lib implementation does not anly allow to get the length of the subsequence, but also allow to get the MatchingGroupsin each sequences in order to reconstitute the diff, or do other operations.

Getting the diff as well as the same time as (one of) the longuest common subsequence is possible without changing the time complexity of the algorythm. Though it will require more operation, wich will most likely change the constant in front.


In [17]:
from difflib2.get_lcs_cut2 import get_lcs_cut2

In [18]:
!cat lcs_cutmodule.py | grep -v '#' > tmp.py
!cat difflib2/get_lcs_cut2.py | grep -v '#' > tmp2.py
!colordiff -U0 tmp.py tmp2.py


--- tmp.py	2014-05-18 17:47:13.000000000 +0200
+++ tmp2.py	2014-05-18 17:47:13.000000000 +0200
@@ -5 +5 @@
-def lcs_cut2(s1, s2, lcs_low_bound=0, bg=None, debug=False):
+def get_lcs_cut2(s1, s2, lcs_low_bound=0, bg=None, debug=False):
@@ -31,0 +32,2 @@
+    rngc_lcs = [None for x in range(n+1)]
+    rngp_lcs = [None for x in range(n+1)]
@@ -44,0 +47 @@
+        rngc_lcs, rngp_lcs = rngp_lcs, rngc_lcs
@@ -59,0 +63 @@
+                rngc_lcs[j] = ((i,j),rngp_lcs[j-1])
@@ -89 +93,6 @@
-                newval = rngcjm if rngcjm > b else b
+                if rngcjm > b :
+                    newval = rngcjm
+                    rngc_lcs[j] = rngc_lcs[j-1]
+                else :
+                    newval = b
+                    rngc_lcs[j] = rngp_lcs[j]
@@ -91,0 +101 @@
+    return rngc[-2],rngc_lcs[-2]

ok, so we got a relatively short patch (hehe, we are working on difflib), that will not only return the lenght of the lcs, but a (nested) list of index of the lcs that we can flatten with the following.


In [19]:
def fl(tup):
    if not tup:
        return []
    t = fl(tup[1])
    t.extend((tup[0],))
    return t
fl( ('a',('b',('c',None))))


Out[19]:
['c', 'b', 'a']

I don't do it by default because it can hit recursion depth limit, and need to be unwinded by hand. And I'm lazy.

Lets rerun above benchmark after applying the same optimisation thant for fast_lcs.


In [20]:
# enable same optimisation
get_lcs_cut2 = opt(get_lcs_cut2)
get_lcs_cut2.__name__ = 'get_lcs_cut2'
greg(get_lcs_cut2)


this implementation passes basic tests
Out[20]:
<function difflib2.utils.opt_common_char.<locals>.closure>

In [21]:
s1 = "abadecxzyxzz" + "z"*80 + 'uqx'
s2 = "babcdezzyxyz" + "z"*80 + 'qux'

%timeit -r3 -n20 get_lcs_cut2(s1,s2)
#%timeit -r3 -n20 after_ast(s1,s2)
#%timeit -r3 -n20 lcs_cut2(s1,s2)
%timeit -r3 -n20 fast_lcs(s1,s2)


20 loops, best of 3: 2.73 ms per loop
20 loops, best of 3: 2.29 ms per loop

Ok, I am a little disapointed, I can't get a really high difference right now, but I have seen case with a factor of two difference, so sure you can implement both function, but you kinda-have a DRY problem for only a few line changes.

Maybe this is the right time to meet with our friend the AST, the ast module and the nice green tree snakes introduction.

We will grab the source code from the implementation with the inspect module then use the ast module to remove the nodes that reference rngc_lcs the newly introduces variable.

We don't forget to rename the function name in visit_FunctionDef or we overwrite the initial function.

Warning : the AST module have API that varies under different minor version of Python. this will only work on 3.4


In [22]:
import inspect
import ast
tree = ast.parse(inspect.getsource( get_lcs_cut2._wrapped._wrapped ))

class RewriteLCSName(ast.NodeTransformer):
    def visit_FunctionDef(self, node):
        node.name = 'after_ast'
        return node

class RewriteLCS(ast.NodeTransformer):
    def visit_Assign(self, node):

        if isinstance(node.targets[0], ast.Name) and node.targets[0].id == 'rngc_lcs ':
            return None
        if isinstance(node.targets[0], ast.Subscript) and node.targets[0].value.id == 'rngc_lcs':
            return None
        return node
        


# This does not work if done as only one class
# Do smbd have any idea why ? 
RewriteLCSName().visit(tree)
RewriteLCS().visit(tree)

# todo, rewrite the return statement not to return the tuple of 
# length/lcs, but only one of the two. 

ast.fix_missing_locations(tree)


Out[22]:
<_ast.Module at 0x108208cc0>

We execute the new tree in the current namespace, need to import islice though.


In [23]:
from itertools import islice
exec(compile(tree, filename="<ast>", mode="exec"))

so now we have an after_ast function which should not have the overhead of building the subsequences, and should get the same speed than fast_lcs once the other optimisation added.


In [24]:
after_ast = opt(after_ast)
after_ast.__name__ = 'after_ast'

check that the function passes basic tests, and then benchmark it.


In [25]:
greg(after_ast)
s1 = "abadecxzyz" *150
s2 = "babcdezzyx" *150

%timeit -r1 -n1 get_lcs_cut2(s1,s2)
%timeit -r1 -n1 after_ast(s1,s2)
%timeit -r1 -n1 fast_lcs(s1,s2)


this implementation passes basic tests
1 loops, best of 1: 2.56 s per loop
1 loops, best of 1: 1.96 s per loop
1 loops, best of 1: 1.95 s per loop

So good, we know have a single implementation of an optimized LCS algorithm, which using AST, give us faster and correct way of computing the LCS and its length.

Python is not really good at doing this kind of stuff as it is not homoiconic, and trick like that are probably much more robust and easy in julia or hy (lisp-y python).

So anyway, this was a fun short blog post. I try to keep all this code in a package called difflib2 on github. There are still some rough edge-cases, and any help is welcome. The rest are raw benchmark of different test-cases and comments.


In [26]:
#s1,s2 = 'hello world','bonjour, monde'
#s1,s2 = 'abc','abc'
#iA,iB = list(zip(*get_lcs_cut2(s1,s2)[1]))

#[s1[x] for x in iA],[s2[x] for x in iB]
#iA,iB

In [27]:
# this generate all relevant string for a certain lenght to check
# correct implementation of different algorithm
from copy import copy
def genall(length,size):
    if length <=1:
        for i in range(size):
            yield [i]
    else:
        for sub in genall(length-1,size-1):
            ll = list(sub)
            for i in range(1,size):
                l = copy(ll)
                l.append(i)
                yield l

In [28]:
import imp
import lcs_cutmodule 
imp.reload(lcs_cutmodule)
lcs_cut2 = lcs_cutmodule.lcs_cut2

greg(lcs_cut2)
import string

def full_compare(funa, funb, maxa=3,maxb=3):
    """
    compare implementation of `funa` and `funb` for **all**
    pairs of string of respective max length `maxa` and `maxb`.
    
    try to be smart and consider that comparing `aba` to `xyz` is the
    same as comparing `aca` to `xyz` so do not do it. 
    
    test also that `funa`is symetric
    """
    for i in range(maxa+1):
        print('\n',i,end='')
        for j in range(maxb+1):
            print('\n   ',j,end='')
            for s1 in gendepth2(i):
                print('.',end='')
                ls1 = [string.ascii_letters[k] for k in s1]
                for s2 in gendepth2(j,alphab=i+j):
                    ls2 = [string.ascii_letters[k] for k in s2]
                    ss1 = funb(ls1,ls2)
                    ss2 = funa(ls1,ls2)
                    ssbis = funa(ls2,ls1)
                    if(ss2!=ssbis):
                        raise utils.SymetryError('function not symetric for',ls1,ls2)
                    

                    if ss1 != ss2 :
                        raise ValueError(ls1,ls2,ss1,ss2)
    return 
                            
full_compare(lcs_cut2, lcsmatrix, maxa=4,maxb=4)


this implementation passes basic tests

 0
    0.
    1.
    2.
    3.
    4.
 1
    0.
    1.
    2.
    3.
    4.
 2
    0..
    1..
    2..
    3..
    4..
 3
    0.....
    1.....
    2.....
    3.....
    4.....
 4
    0..............
    1..............
    2..............
    3..............
    4..............

In [29]:
import imp
from lcs_cutmodule import lcs_cut2
lcs_cut2 = greg(lcs_cut2)


this implementation passes basic tests

In [30]:
# %%prun
import imp
import string
n=3000
seed = string.ascii_lowercase + string.digits + string.punctuation

#s1 = ((n//len(seed)+1)*seed)[0:n]
s1 = lorem[:3000]
s2 = [c for c in s1]

# we can runa line profiler on a function to see its hot spot
# as long as it is in a file
try:
    %load_ext line_profiler
    obj = %lprun -r -f lcs_cut2 lcs_cut2(s1,s2)
except ImportError:
    pass

Implementation that use the minimum memory possible ($n+1$), but I suspect the cost from doing loads of modulo operation will kill the advantage of lowering use in memory


In [31]:
@greg
def lcs_len3(Seq1 , Seq2):
    """ Compute the LCS len 2 sequences
    
    Do not calculate the matrix and try to be as efficient as possible 
    in storing only the minimal ammount of elelment in memory, mainly the previous
    matrix row + 1 element.
    """
    LL1 = len(Seq1)+1
    LL2 = len(Seq2)+1

    ## we will do the big loop over the longest sequence (L1)
    ## and store the previous row of the matrix (L2+1)
    if LL2 > LL1 : 
        Seq2, Seq1 = Seq1, Seq2
        LL2, LL1 = LL1, LL2

    previousrow = [0]*(LL2)
    cindex = 0

    for Seq1ii in Seq1:
        for jj in range(1,LL2):
            cindex = (cindex+1) % LL2

            if Seq1ii == Seq2[jj-1]:
                if jj == 1:
                    previousrow[cindex] = 1
                else :
                    previousrow[cindex]+=1
            if Seq1ii != Seq2[jj-1] :
                up = previousrow[(cindex+1) % LL2]
                
                if jj != 1 :
                    left = previousrow[(cindex-1) % LL2]
                    if left > up :
                        previousrow[cindex] = left
                        continue    
                previousrow[cindex] = up


    return previousrow[cindex]


this implementation passes basic tests

Here we test the standard library, and see that it fails at giving the right LCS even for super short sequences


In [32]:
import traceback

try:
    full_compare(stdlib, lcsmatrix)
except utils.SymetryError as e:
    print()
    print(traceback.format_exc())
    pass


 0
    0.
    1.
    2.
    3.
 1
    0.
    1.
    2.
    3.
 2
    0..
    1..
    2..
    3..
 3
    0.....
    1.....
    2.....
    3...
Traceback (most recent call last):
  File "<ipython-input-32-7bb5ccfb4dd4>", line 4, in <module>
    full_compare(stdlib, lcsmatrix)
  File "<ipython-input-28-a09c5b8c16f4>", line 32, in full_compare
    raise utils.SymetryError('function not symetric for',ls1,ls2)
difflib2.utils.SymetryError: ('function not symetric for', ['a', 'b', 'a'], ['b', 'c', 'a'])


In [33]:
stdlib('aba','bca'),stdlib('bca','aba')


Out[33]:
(1, 2)

Of course the lcs is ba and has lenght of 2.

Common optimisation.

One can remove the common begining and end of string to speed up the calculation of the LCS. The characters that are not from the common alphabet can also be removed.


In [34]:
print(inspect.getsource(opt))
print(inspect.getsource(opt_common_char))
print(inspect.getsource(rem_common_extrem))


def opt(func):
    return  opt_common_char(rem_common_extrem(func))

def opt_common_char(func):
    def closure(s1,s2,*args,**kwargs):
        s = set(s1).intersection(set(s2))
        s11 = [c for c in s1 if c in s]
        s22 = [c for c in s2 if c in s]
#        print('will pass ',s11, s22, 'instead of ', s1, s2)
        return func(s11,s22,*args, **kwargs)
    closure._wrapped = func
    return closure

def rem_common_extrem(func):
    def closure(s1,s2,*args,**kwargs):
        delta = 0 
        ss1 = list(s1)
        ss2 = list(s2)
        rem = []
        for k in range(2):
            ss1 = list(reversed(ss1))
            ss2 = list(reversed(ss2))
            for c1,c2 in zip(reversed(ss1),reversed(ss2)):
                if c1==c2:
                    delta +=1
                    rem.append(ss1.pop())
                    ss2.pop()
                else:
                    break
        return func(ss1,ss2,*args, **kwargs)+delta
    closure._wrapped = func
    return closure

Stdandard lib is fast !

Let's run a few test cases, one with a sentence we rotate around:


In [35]:
s1 = "a super long string I try to match, stdlib will be winner by a long shot"
s2 = s1[1:]+s1[1]
case3 = (s1,s2,'permutation')
test_case(case3,10,10)


First things we see, the standard lib is fast, almost an order of margnitude faster than our implementation with the most optimisation (lcs_cut), the reference implementation lcsmatrix is almost the slower, we see that doing memory optimisation lcs_low_m,lcs_low_ma seem to increase the speed in this case by roughly 20%, doing maximum memory optimisation at the expense of more CPU cycle get us back to initial speed. Using array module do not seem to be worth it in this case.

Second graph, we see that all implementation agree on the size of the LCS.

lcsmatrix implementation, lcs_len_3 and lcs_low_m, will always get the same similar arangement. lcs_low_m and lcs_low_ma are almost always close one to the other, so I'll drop the 3 first.


In [36]:
del gdict['lcsmatrix']
del gdict['lcs_len3']
del gdict['lcs_low_m']

Rotate each word.

Instead of rotatin the all sentence, let's rotating each words, we get roughtly the same conclusion. Stdndard lib is not much faster, our implementation seem to be slight more closer than stdlib thant before.


In [37]:
s2 = ' '.join(map(lambda x: (x[1::]+x[0] if len(x)>1 else x),case3[0].split(' ')))
case3b = (s1,s2,'word permutation')
print(s2)

test_case(case3b,3,3)


a upers ongl trings I ryt ot atch,m tdlibs illw eb innerw yb a ongl hots

Here the standard lib, is the faster, not by much


In [38]:
test_case(("xada"*100,"bada"*100,'numpy random'),3,3)


Here the standard libe is super slow, but get the right answer.

When stdlib get it wrong

Here we invert each words, which should prevent long matching subsequences.


In [39]:
s2 = ' '.join(map(lambda x: (x[::-1]),case3[0].split(' ')))
case3c = (s1,s2,'word reversing')
test_case(case3c,8,8)


Stdlib get back in the game, but get it wrong, ok, not to much here, only by 1 char.

A few more test cases.


In [40]:
n=100
s1 = "xzbet"*n
s2 = "badct"*n
case = (s1,s2,20)
utils.test_cases.append(case)

s1 = "xada"*n
s2 = "baza"*n
case2 = (s1,s2,20)
utils.test_cases.append(case2)

wcs = ( string.ascii_letters, list(reversed(string.ascii_letters)), 1, 'worst case scenario')
bcs = ( string.ascii_letters, string.ascii_letters,len(string.ascii_letters), 'best case scenario' )

utils.test_cases.append(wcs)
utils.test_cases.append(bcs)

lets test all our agregated test cases agains all known function:


In [41]:
utils.test_cases


Out[41]:
[('+fgh+jklmnop', '-fghijklmnop', 10),
 ('+fgh+jklmnopfgh+jklmnop', '-fghijklmnopfghijklmnop', 20),
 ('ddb', 'ded', 2),
 ('abcdef', 'abcdef', 6),
 ('abcde', 'abcde', 5),
 ([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 10),
 ('ab', 'ab', 2),
 ('abx', 'ab', 2),
 ('axb', 'ab', 2),
 ('abx', 'aby', 2),
 ('axb', 'aby', 2),
 ('lmnopfghjklm', 'lmnopfghjklmn', 12),
 ('abcdEfghijklmnopqrSSStuvwxyz1234567890',
  'abcdefghijKLLmnopqrstuvwxyz1234567890',
  32),
 ('xxxxxxx and something that matches',
  ' and something that matchesyyyyyyy',
  27),
 ('xzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbetxzbet',
  'badctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadctbadct',
  20),
 ('xadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxadaxada',
  'bazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabazabaza',
  20),
 ('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ',
  ['Z',
   'Y',
   'X',
   'W',
   'V',
   'U',
   'T',
   'S',
   'R',
   'Q',
   'P',
   'O',
   'N',
   'M',
   'L',
   'K',
   'J',
   'I',
   'H',
   'G',
   'F',
   'E',
   'D',
   'C',
   'B',
   'A',
   'z',
   'y',
   'x',
   'w',
   'v',
   'u',
   't',
   's',
   'r',
   'q',
   'p',
   'o',
   'n',
   'm',
   'l',
   'k',
   'j',
   'i',
   'h',
   'g',
   'f',
   'e',
   'd',
   'c',
   'b',
   'a'],
  1,
  'worst case scenario'),
 ('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ',
  'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ',
  52,
  'best case scenario')]

In [42]:
gdict.keys(),len(utils.test_cases)


Out[42]:
(dict_keys(['lcs_cut2', 'lcs_low_ma', 'after_ast', 'fast_lcs', 'get_lcs_cut2', 'stdlib']),
 18)

In [43]:
print('-'*len(utils.test_cases))
for case in utils.test_cases:
    print('.',end='')
    test_case(case,1,1)


------------------
..................

let's see how much stdlib is underestimating the lenght in average as a function of length of the sequence.


In [44]:
import random
import numpy as np

df = pd.DataFrame()

for l in map(int,np.logspace(1.1,2.2,20)):
    a=[]
    print(l,'.',end='')
    for x in range(100):

        s1 = np.random.randint(0,50,l)
        s2 = np.random.randint(0,50,l)
        d = stdlib(s1,s2)-lcs_low_ma(s1,s2)
        if d > 0 :
            print(s1,s2, lcs_low_m(s1,s2),lcs_low_ma(s1,s2),stdlib(s1,s2))
            break
        a.append(d/l*100)
    df[str(l)] = a

df.boxplot(sym='ro')
plt.ylabel('underestimation LCS in %')

#plt.hist(a,bins=30)
#plt.plot((0,20),(0,20))
#plt.xlabel('stdlib lcs')
#plt.ylabel('real lcs')
pass


12 .14 .16 .18 .21 .24 .28 .32 .36 .41 .47 .54 .62 .71 .81 .92 .106 .121 .138 .158 .

In [47]:
#flcs = rem_common_extrem(opt_common_char(lcs_cut2))
#flcs = rem_common_extrem(lcs_cut2)
#flcs.__name__ = 'flcs'
#flcs.__name__
#greg(flcs)

fast_std = rem_common_extrem(opt_common_char(stdlib))
fast_std.__name__ = 'fast_std'
fast_std.__name__
try:
    greg(fast_std)
except ValueError:
    print('yes we know std is not right')
del gdict['fast_std']


yes we know std is not right

fast_lcs always make ~ or much better than lcs_cut2 , removing the second from benchmark


In [48]:
del gdict['lcs_cut2']

In [49]:
gdict


Out[49]:
{'lcs_low_ma': <function difflib2.low_ma.lcs_low_ma>,
 'after_ast': <function difflib2.utils.opt_common_char.<locals>.closure>,
 'fast_lcs': <function difflib2.utils.opt_common_char.<locals>.closure>,
 'get_lcs_cut2': <function difflib2.utils.opt_common_char.<locals>.closure>,
 'stdlib': <function __main__.stdlib>}

Lots of complexity graph to see the behavior of implementations in different cases.


In [50]:
import time
for case in cplx_case[0:]:
    
    if len(case) == 4:
        name = case[3]
    else :
        name = '(%s,%s)' %(case[0],case[1])
    print('\n====',name,'====')
    t= time.time()
    df,dr = cplx_data(case, gdict, r=1, nt=25, npoints=15)        
    tp = time.time()
    print('\n',int(tp-t),'s')
    report(df,dr,name=name)


==== ascci, long ====

testing  lcs_low_ma      ...............
testing  after_ast       ...............
testing  fast_lcs        ...............
testing  get_lcs_cut2    ...............
testing  stdlib          ...............
 8 s

==== lorem ====

testing  lcs_low_ma      ...............
testing  after_ast       ...............
testing  fast_lcs        ...............
testing  get_lcs_cut2    ...............
testing  stdlib          ...............
 8 s

==== (aaab,aaabb) ====

testing  lcs_low_ma      ...............
testing  after_ast       ...............
testing  fast_lcs        ...............
testing  get_lcs_cut2    ...............
testing  stdlib          ...............
 15 s

==== (aaab,aaacc) ====

testing  lcs_low_ma      ...............
testing  after_ast       ...............
testing  fast_lcs        ...............
testing  get_lcs_cut2    ...............
testing  stdlib          ...............
 13 s

==== (dcbet,badct) ====

testing  lcs_low_ma      ...............
testing  after_ast       ...............
testing  fast_lcs        ...............
testing  get_lcs_cut2    ...............
testing  stdlib          ...............
 15 s

==== (xzbet,badct) ====

testing  lcs_low_ma      ...............
testing  after_ast       ...............
testing  fast_lcs        ...............
testing  get_lcs_cut2    ...............
testing  stdlib          ...............
 28 s

==== (xada,bada) ====

testing  lcs_low_ma      ...............
testing  after_ast       ...............
testing  fast_lcs        ...............
testing  get_lcs_cut2    ...............
testing  stdlib          ...............
 58 s

==== lorem (reversed) ====

testing  lcs_low_ma      ...............
testing  after_ast       ...............
testing  fast_lcs        ...............
testing  get_lcs_cut2    ...............
testing  stdlib          ...............
 33 s

In [52]:
for mix in [10,20,30,40]:
    for case in cplx_case[0:]:
        if len(case) == 4:
            name = case[3]
        else :
            name = '(%s,%s)' %(case[0],case[1])
        print('\n====',name,'====')
        t= time.time()
        df,dr = cplx_data(case, {k:v for k,v in gdict.items() if k is not 'lcs_low_ma' }, r=1, nt=2, npoints=10, mix=mix)        
        name = name + "({})".format(mix)
        tp = time.time()
        print('\n',int(tp-t),'s')
        report(df,dr,name=name)


==== ascci, long ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 12 s

==== lorem ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 32 s

==== (aaab,aaabb) ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 9 s

==== (aaab,aaacc) ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 5 s

==== (dcbet,badct) ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 7 s

==== (xzbet,badct) ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 11 s

==== (xada,bada) ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 41 s

==== lorem (reversed) ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 15 s

==== ascci, long ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 11 s

==== lorem ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 25 s

==== (aaab,aaabb) ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 9 s

==== (aaab,aaacc) ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 6 s

==== (dcbet,badct) ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 7 s

==== (xzbet,badct) ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 14 s

==== (xada,bada) ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 40 s

==== lorem (reversed) ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 16 s

==== ascci, long ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 10 s

==== lorem ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 23 s

==== (aaab,aaabb) ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 10 s

==== (aaab,aaacc) ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 7 s

==== (dcbet,badct) ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 8 s

==== (xzbet,badct) ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 16 s

==== (xada,bada) ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 46 s

==== lorem (reversed) ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 16 s

==== ascci, long ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 10 s

==== lorem ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 21 s

==== (aaab,aaabb) ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 10 s

==== (aaab,aaacc) ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 7 s

==== (dcbet,badct) ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 8 s

==== (xzbet,badct) ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 16 s

==== (xada,bada) ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 39 s

==== lorem (reversed) ====

testing  get_lcs_cut2    ..........
testing  stdlib          ..........
testing  after_ast       ..........
testing  fast_lcs        ..........
 16 s
/usr/local/lib/python3.4/site-packages/matplotlib/pyplot.py:412: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_num_figures`).
  max_open_warning, RuntimeWarning)

In [ ]:
allr1 = []
allr2 = []
cr1 = []
cr2 = []
for i in range(20):
    print('.',end='')
    s1 = np.random.randint(0,2000,100)
    s2 = np.random.randint(0,2000,100)
    print(',',end='')
    r2 = %timeit -o -n1 -r1 -q stdlib(s1,s2)
    r1 = %timeit -o -n1 -r1 -q flcs(s1,s2)

    cr1.append(flcs(s1,s2))
    cr2.append(stdlib(s1,s2))
    allr1.append(mean(r1.all_runs)*100)
    allr2.append(mean(r2.all_runs)*100)
    
ax = seaborn.jointplot( allr1, allr2,dropna=False);
ax.fig.axes[0].set_xlim(xmin=0)
ax.fig.axes[0].set_ylim(ymin=0)
ax.fig.axes[0].plot([0,1],[0,1])

ax = seaborn.jointplot( cr1, cr2,dropna=False, kind='kde');
#ax.fig.axes[0].set_xlim(0)
#ax.fig.axes[0].set_ylim(0)
#ax.fig.axes[0].plot([0,100],[0,100])

CSS


In [ ]:
def backtrack(C, X, Y,  ii=None, jj=None):
    if ii==None and jj==None :
        ii = len(X)-1
        jj = len(Y)-1

    if ii == -1 or jj == -1:
        return []
    elif X[ii] == Y[jj]:
        p = backtrack(C ,X, Y, ii-1, jj-1)
        p.append((ii,jj))
        return p
    else:
        if C[ii][jj-1] > C[ii-1][jj]:
            return backtrack(C, X, Y, ii, jj-1)
        else:
            return backtrack(C, X, Y, ii-1, jj)

In [ ]:
s1 = "hallow world"
s2 = "Hexllo worlrld"

s1 = "this is a string removed"
s2 = "This is a new string"


C = lcsmatrix(s1,s2)

In [ ]:
ig = backtrack(C, s1, s2)
i1, i2 = zip(*ig)
print ''.join([s1[i] for i in i1])
print ''.join([s2[i] for i in i2])

Sequence1 : --- Sequence2 :

list of tuple that contain : (Seq1, seq2), (None|int, None r int


In [ ]:
def opts(s1,s2,ig):
    ig = iter(ig)
    c1 = 0
    c2 = 0
    op = []
    for ig1,ig2 in ig:
        while c1 < ig1:
            op.append((c1,None))
            c1+=1
        while c2 < ig2:
            op.append((None,c2))
            c2+=1
        op.append((c1,c2))
        c1+=1
        c2+=1
    while c1 < len(s1):
        op.append((c1,None))
        c1+=1
    while c2 < len(s2):
        op.append((None,c2))
        c2+=1
    return op

def idt(seq,ind,default='-'):
    if ind is None:
        return default
    return seq[ind]

l1, l2 = zip(*opts(s1,s2, ig))

In [ ]:
print ''.join([idt(s1,i) for i in l1])
print ''.join(['-' if i is None else ' ' for i in l2])
print ''.join([idt(s2,i) for i in l2])
print ''.join(['+' if i is None else ' ' for i in l1])

print '='*20

print ''.join([idt(s1,i, default='') for i in l1])
print ''.join(['-' if i is None else ' ' if j is not None else '' for i,j in zip(l2,l1)])

print ''.join([idt(s2,i, default='') for i in l2])
print ''.join(['+' if i is None else ' ' if j is not None else '' for i,j in zip(l1,l2)])

In [ ]:
def lcs(s1, s2):
    """
    return the longuest common subsequence of the 2 sequences.
    
    if both are string return the lcs as a string
    if both are unicode return the lcs as unicode
    
    Otherwise, return the lcs as a list. 
    """
    arestring = (type(s1) is str) and (type(s2) is str)
    areunicode = (type(s1) is unicode) and (type(s2) is unicode)
    
    C = lcsmatrix(s1,s2)
    ig = backtrack(C, s1, s2)
    i1, i2 = zip(*ig)
    
    lcs = [s1[i] for i in i1]
    
    if areunicode :
       return  u''.join(lcs)
    if arestring:
        return ''.join(lcs)
    return lcs