I hade some [missadventures] with python standard library with [difflib].
computing file differences is closesly related to the [Longuest Common Subsequence] problem, which is supposed to be a classical problem in computer science. Unfortunately, or fortunately, I do not have a computer science background even if I know a thing or two.
Doing the diff of two sequences of size $n$, ie compting the LCS, has a computing time in $\mathcal{O}(n^2)$, and one can achieven a space complexity in $\mathcal O (n)$ if only interested in the length of the LCS and not one of the LCS themselves.
It shall be noted that if one knows properties of the alphabet used in the sequence, or even information on the lcs of theses sequences, more advanced algorithms can be used. This is the case for example when somparing DNA, whe the size of your alphabet (ACGT) is 4 (or six soon maybe), (Nature paper behind paywall, so no links.), were the four russian trick can be used.
From a discussion with Nick Coghlan at Pycon not so long ago, I understood that if I wanted to have somethign changed in the stdlib of python I had to have arguments, and as difflib in one of the things I would like to be fix because I think the abstaction and API is bad. I like to take a look at what I can do.
As I will compare many implementation, I'll just use a decorator that register my function in a global dict for later quick comparison (@greg
). Also we will make this decorator check at compile time that the LCS function is symetric (give same result is you swap the arguments), and check a few basic test cases to be sure the implementation is not wrong.
we will use the following cases to check the implementation, we avoid long test cases were stdlib might get it wrong, we will comme back to that later.
In [1]:
cd ..
In [2]:
import difflib
import difflib2.utils as utils
import ipythonblocks
import pandas as pd
import string
import seaborn
import matplotlib.pyplot as plt
from difflib2.utils import greg,gdict, gendepth2
from array import array
%matplotlib inline
In [3]:
cases= utils.test_cases
In [4]:
from difflib2.utils import opt_common_char
from difflib2.utils import rem_common_extrem
from difflib2.utils import hl
from difflib2.report import report
%run difflib2/cplx_data.ipy
Nothing fancy, no optimisation, compute the lcs matrix, the last number will be the length of the LCS, it also can allow you to get back all the LCS by backtracking, it's our reference. Goal here is not really to explain the algorithm here.
In [5]:
from difflib2.reference import _lcsmatrix,lcsmatrix
hl('difflib2/reference.py')
Out[5]:
In [6]:
%run difflib2/test_case.ipy
Wrapper around stdlib to behave with the same API.
In [7]:
@greg
def stdlib(s1,s2):
return sum(list(map(lambda x:x.size,difflib.SequenceMatcher(None, s1,s2,autojunk=False).get_matching_blocks())))
In [8]:
test_case( cases[12],10,10, ldict={'lcsmatrix': lcsmatrix , 'stdlib':stdlib})
Ok so stdlib is fast, but both implementation get the same result.
First optimisation, we use only $2n$ in memory, will it speed up the time of computation ? Drawback, we can only get the lengh of the LCS not the LCSs themselves.
In [9]:
from difflib2.low_m import lcs_low_m
Same using python standard lib and array module that get efficient integer array. Is it worth it ?
In [10]:
from difflib2.low_ma import lcs_low_ma
In [11]:
test_case( cases[12],15,15,
ldict={'lcsmatrix' : lcsmatrix ,
'stdlib' : stdlib,
'lcs_low_m' : lcs_low_m,
'lcs_low_ma': lcs_low_ma})
Won't get into much details here, but basically you do not need to compute the all matrix if your chains are quite alike. You can stay "close" to the diagonal near the end. On highly similar sequence this should help a lot.
Implementation is handwaved, there might be off by one errors. But I suppose this would change complexity for some case.
There is though some overhead computation time, so the tradeoff, is not obvious.
In [12]:
from difflib2.utils import lorem
def withb(func,s1,s2,*args,**kwargs):
bg = ipythonblocks.BlockGrid(len(s2),len(s1), block_size=10)
r = func(s1,s2,*args,bg=bg,**kwargs)
print(r)
return bg
In [13]:
from lcs_cutmodule import lcs_cut2
In [14]:
test_case( cases[12],15,15,
ldict={'lcsmatrix' : lcsmatrix ,
'stdlib' : stdlib,
'lcs_cut2' : lcs_cut2})
For a longer benchmark comparaison, jump to the end of the blog post where you have many more implementation and test-cases.
In [15]:
seed = string.ascii_lowercase + string.digits + string.punctuation
cplx_case = (
# seed1,seed2,upperbound,[name]
(seed,seed,3.2,'ascci, long'),
(lorem,lorem,3.2,'lorem'),
('aaab','aaabb',2.7),
('aaab','aaacc',2.7),
("dcbet","badct",2.8),
("xzbet","badct",3.0),
("xada","bada",3.0),
(lorem, str(reversed(lorem)),3.2,'lorem (reversed)')
)
def opt(func):
return opt_common_char(rem_common_extrem(func))
fast_lcs = opt(lcs_cut2)
fast_lcs.__name__ = 'fast_lcs'
fast_lcs = greg(fast_lcs)
In [16]:
import time
for case,guide in zip([cplx_case[i] for i in [2,5]],
[{'stdlib':3,'fast_lcs':2},
{'stdlib':3,'fast_lcs':1}]):
if len(case) == 4:
name = case[3]
else :
name = '(%s,%s)' %(case[0],case[1])
print('\n====',name,'====')
t= time.time()
df,dr = cplx_data(case, {'lcsmatrix':lcsmatrix,'stdlib':stdlib,'fast_lcs':fast_lcs},
r=1, nt=25, npoints=15)
tp = time.time()
print('\n',int(tp-t),'s')
report(df,dr,name=name, guides=guide)
Here we can see in first graph a case where the standard library function speed is roughly the same that my implementatation, but tend to have a complexity in $O(n^3)$ (dotted line), while fast_lcs
behave in $O(n^2)$ (also dotted lin). On the graph on the right, you can though, see that the standard lib get the ratio around 20% too low, by still beeing wrong.
On the second graph, you can see a more extreem case, where the standard lib behave again in $O(n^3)$, this time it get the right answer, but with carefull optimisation, the détermination of the LCS can be done in $O(n)$ For a list of ~1000 element, this make the execution time rop form 1s to around 1ms.
Here the huge increase of speed is due to the fact that the alphabets in the 2 sequences are mostly disjoint, and there is no need to include the non-common letters in the comparaison.
The carefull reader will have seen that the actual standard lib implementation does not anly allow to get the length of the subsequence, but also allow to get the MatchingGroups
in each sequences in order to reconstitute the diff, or do other operations.
Getting the diff as well as the same time as (one of) the longuest common subsequence is possible without changing the time complexity of the algorythm. Though it will require more operation, wich will most likely change the constant in front.
In [17]:
from difflib2.get_lcs_cut2 import get_lcs_cut2
In [18]:
!cat lcs_cutmodule.py | grep -v '#' > tmp.py
!cat difflib2/get_lcs_cut2.py | grep -v '#' > tmp2.py
!colordiff -U0 tmp.py tmp2.py
ok, so we got a relatively short patch (hehe, we are working on difflib), that will not only return the lenght of the lcs, but a (nested) list of index of the lcs that we can flatten with the following.
In [19]:
def fl(tup):
if not tup:
return []
t = fl(tup[1])
t.extend((tup[0],))
return t
fl( ('a',('b',('c',None))))
Out[19]:
I don't do it by default because it can hit recursion depth limit, and need to be unwinded by hand. And I'm lazy.
Lets rerun above benchmark after applying the same optimisation thant for fast_lcs
.
In [20]:
# enable same optimisation
get_lcs_cut2 = opt(get_lcs_cut2)
get_lcs_cut2.__name__ = 'get_lcs_cut2'
greg(get_lcs_cut2)
Out[20]:
In [21]:
s1 = "abadecxzyxzz" + "z"*80 + 'uqx'
s2 = "babcdezzyxyz" + "z"*80 + 'qux'
%timeit -r3 -n20 get_lcs_cut2(s1,s2)
#%timeit -r3 -n20 after_ast(s1,s2)
#%timeit -r3 -n20 lcs_cut2(s1,s2)
%timeit -r3 -n20 fast_lcs(s1,s2)
Ok, I am a little disapointed, I can't get a really high difference right now, but I have seen case with a factor of two difference, so sure you can implement both function, but you kinda-have a DRY problem for only a few line changes.
Maybe this is the right time to meet with our friend the AST, the ast module and the nice green tree snakes introduction.
We will grab the source code from the implementation with the inspect module then use the ast
module to remove the nodes that reference rngc_lcs
the newly introduces variable.
We don't forget to rename the function name in visit_FunctionDef
or we overwrite the initial function.
Warning : the AST module have API that varies under different minor version of Python. this will only work on 3.4
In [22]:
import inspect
import ast
tree = ast.parse(inspect.getsource( get_lcs_cut2._wrapped._wrapped ))
class RewriteLCSName(ast.NodeTransformer):
def visit_FunctionDef(self, node):
node.name = 'after_ast'
return node
class RewriteLCS(ast.NodeTransformer):
def visit_Assign(self, node):
if isinstance(node.targets[0], ast.Name) and node.targets[0].id == 'rngc_lcs ':
return None
if isinstance(node.targets[0], ast.Subscript) and node.targets[0].value.id == 'rngc_lcs':
return None
return node
# This does not work if done as only one class
# Do smbd have any idea why ?
RewriteLCSName().visit(tree)
RewriteLCS().visit(tree)
# todo, rewrite the return statement not to return the tuple of
# length/lcs, but only one of the two.
ast.fix_missing_locations(tree)
Out[22]:
We execute the new tree in the current namespace, need to import islice though.
In [23]:
from itertools import islice
exec(compile(tree, filename="<ast>", mode="exec"))
so now we have an after_ast
function which should not have the overhead of building the subsequences, and should get the same speed than fast_lcs
once the other optimisation added.
In [24]:
after_ast = opt(after_ast)
after_ast.__name__ = 'after_ast'
check that the function passes basic tests, and then benchmark it.
In [25]:
greg(after_ast)
s1 = "abadecxzyz" *150
s2 = "babcdezzyx" *150
%timeit -r1 -n1 get_lcs_cut2(s1,s2)
%timeit -r1 -n1 after_ast(s1,s2)
%timeit -r1 -n1 fast_lcs(s1,s2)
So good, we know have a single implementation of an optimized LCS algorithm, which using AST, give us faster and correct way of computing the LCS and its length.
Python is not really good at doing this kind of stuff as it is not homoiconic, and trick like that are probably much more robust and easy in julia or hy (lisp-y python).
So anyway, this was a fun short blog post. I try to keep all this code in a package called difflib2 on github. There are still some rough edge-cases, and any help is welcome. The rest are raw benchmark of different test-cases and comments.
In [26]:
#s1,s2 = 'hello world','bonjour, monde'
#s1,s2 = 'abc','abc'
#iA,iB = list(zip(*get_lcs_cut2(s1,s2)[1]))
#[s1[x] for x in iA],[s2[x] for x in iB]
#iA,iB
In [27]:
# this generate all relevant string for a certain lenght to check
# correct implementation of different algorithm
from copy import copy
def genall(length,size):
if length <=1:
for i in range(size):
yield [i]
else:
for sub in genall(length-1,size-1):
ll = list(sub)
for i in range(1,size):
l = copy(ll)
l.append(i)
yield l
In [28]:
import imp
import lcs_cutmodule
imp.reload(lcs_cutmodule)
lcs_cut2 = lcs_cutmodule.lcs_cut2
greg(lcs_cut2)
import string
def full_compare(funa, funb, maxa=3,maxb=3):
"""
compare implementation of `funa` and `funb` for **all**
pairs of string of respective max length `maxa` and `maxb`.
try to be smart and consider that comparing `aba` to `xyz` is the
same as comparing `aca` to `xyz` so do not do it.
test also that `funa`is symetric
"""
for i in range(maxa+1):
print('\n',i,end='')
for j in range(maxb+1):
print('\n ',j,end='')
for s1 in gendepth2(i):
print('.',end='')
ls1 = [string.ascii_letters[k] for k in s1]
for s2 in gendepth2(j,alphab=i+j):
ls2 = [string.ascii_letters[k] for k in s2]
ss1 = funb(ls1,ls2)
ss2 = funa(ls1,ls2)
ssbis = funa(ls2,ls1)
if(ss2!=ssbis):
raise utils.SymetryError('function not symetric for',ls1,ls2)
if ss1 != ss2 :
raise ValueError(ls1,ls2,ss1,ss2)
return
full_compare(lcs_cut2, lcsmatrix, maxa=4,maxb=4)
In [29]:
import imp
from lcs_cutmodule import lcs_cut2
lcs_cut2 = greg(lcs_cut2)
In [30]:
# %%prun
import imp
import string
n=3000
seed = string.ascii_lowercase + string.digits + string.punctuation
#s1 = ((n//len(seed)+1)*seed)[0:n]
s1 = lorem[:3000]
s2 = [c for c in s1]
# we can runa line profiler on a function to see its hot spot
# as long as it is in a file
try:
%load_ext line_profiler
obj = %lprun -r -f lcs_cut2 lcs_cut2(s1,s2)
except ImportError:
pass
Implementation that use the minimum memory possible ($n+1$), but I suspect the cost from doing loads of modulo operation will kill the advantage of lowering use in memory
In [31]:
@greg
def lcs_len3(Seq1 , Seq2):
""" Compute the LCS len 2 sequences
Do not calculate the matrix and try to be as efficient as possible
in storing only the minimal ammount of elelment in memory, mainly the previous
matrix row + 1 element.
"""
LL1 = len(Seq1)+1
LL2 = len(Seq2)+1
## we will do the big loop over the longest sequence (L1)
## and store the previous row of the matrix (L2+1)
if LL2 > LL1 :
Seq2, Seq1 = Seq1, Seq2
LL2, LL1 = LL1, LL2
previousrow = [0]*(LL2)
cindex = 0
for Seq1ii in Seq1:
for jj in range(1,LL2):
cindex = (cindex+1) % LL2
if Seq1ii == Seq2[jj-1]:
if jj == 1:
previousrow[cindex] = 1
else :
previousrow[cindex]+=1
if Seq1ii != Seq2[jj-1] :
up = previousrow[(cindex+1) % LL2]
if jj != 1 :
left = previousrow[(cindex-1) % LL2]
if left > up :
previousrow[cindex] = left
continue
previousrow[cindex] = up
return previousrow[cindex]
Here we test the standard library, and see that it fails at giving the right LCS even for super short sequences
In [32]:
import traceback
try:
full_compare(stdlib, lcsmatrix)
except utils.SymetryError as e:
print()
print(traceback.format_exc())
pass
In [33]:
stdlib('aba','bca'),stdlib('bca','aba')
Out[33]:
Of course the lcs is ba
and has lenght of 2
.
One can remove the common begining and end of string to speed up the calculation of the LCS. The characters that are not from the common alphabet can also be removed.
In [34]:
print(inspect.getsource(opt))
print(inspect.getsource(opt_common_char))
print(inspect.getsource(rem_common_extrem))
Let's run a few test cases, one with a sentence we rotate around:
In [35]:
s1 = "a super long string I try to match, stdlib will be winner by a long shot"
s2 = s1[1:]+s1[1]
case3 = (s1,s2,'permutation')
test_case(case3,10,10)
First things we see, the standard lib is fast, almost an order of margnitude faster than our implementation with the most optimisation (lcs_cut
), the reference implementation lcsmatrix
is almost the slower, we see that doing memory optimisation lcs_low_m
,lcs_low_ma
seem to increase the speed in this case by roughly 20%, doing maximum memory optimisation at the expense of more CPU cycle get us back to initial speed. Using array
module do not seem to be worth it in this case.
Second graph, we see that all implementation agree on the size of the LCS.
lcsmatrix
implementation, lcs_len_3
and lcs_low_m
, will always get the same similar arangement. lcs_low_m
and lcs_low_ma
are almost always close one to the other, so I'll drop the 3 first.
In [36]:
del gdict['lcsmatrix']
del gdict['lcs_len3']
del gdict['lcs_low_m']
Instead of rotatin the all sentence, let's rotating each words, we get roughtly the same conclusion. Stdndard lib is not much faster, our implementation seem to be slight more closer than stdlib thant before.
In [37]:
s2 = ' '.join(map(lambda x: (x[1::]+x[0] if len(x)>1 else x),case3[0].split(' ')))
case3b = (s1,s2,'word permutation')
print(s2)
test_case(case3b,3,3)
Here the standard lib, is the faster, not by much
In [38]:
test_case(("xada"*100,"bada"*100,'numpy random'),3,3)
Here the standard libe is super slow, but get the right answer.
Here we invert each words, which should prevent long matching subsequences.
In [39]:
s2 = ' '.join(map(lambda x: (x[::-1]),case3[0].split(' ')))
case3c = (s1,s2,'word reversing')
test_case(case3c,8,8)
Stdlib get back in the game, but get it wrong, ok, not to much here, only by 1 char.
A few more test cases.
In [40]:
n=100
s1 = "xzbet"*n
s2 = "badct"*n
case = (s1,s2,20)
utils.test_cases.append(case)
s1 = "xada"*n
s2 = "baza"*n
case2 = (s1,s2,20)
utils.test_cases.append(case2)
wcs = ( string.ascii_letters, list(reversed(string.ascii_letters)), 1, 'worst case scenario')
bcs = ( string.ascii_letters, string.ascii_letters,len(string.ascii_letters), 'best case scenario' )
utils.test_cases.append(wcs)
utils.test_cases.append(bcs)
lets test all our agregated test cases agains all known function:
In [41]:
utils.test_cases
Out[41]:
In [42]:
gdict.keys(),len(utils.test_cases)
Out[42]:
In [43]:
print('-'*len(utils.test_cases))
for case in utils.test_cases:
print('.',end='')
test_case(case,1,1)
let's see how much stdlib is underestimating the lenght in average as a function of length of the sequence.
In [44]:
import random
import numpy as np
df = pd.DataFrame()
for l in map(int,np.logspace(1.1,2.2,20)):
a=[]
print(l,'.',end='')
for x in range(100):
s1 = np.random.randint(0,50,l)
s2 = np.random.randint(0,50,l)
d = stdlib(s1,s2)-lcs_low_ma(s1,s2)
if d > 0 :
print(s1,s2, lcs_low_m(s1,s2),lcs_low_ma(s1,s2),stdlib(s1,s2))
break
a.append(d/l*100)
df[str(l)] = a
df.boxplot(sym='ro')
plt.ylabel('underestimation LCS in %')
#plt.hist(a,bins=30)
#plt.plot((0,20),(0,20))
#plt.xlabel('stdlib lcs')
#plt.ylabel('real lcs')
pass
In [47]:
#flcs = rem_common_extrem(opt_common_char(lcs_cut2))
#flcs = rem_common_extrem(lcs_cut2)
#flcs.__name__ = 'flcs'
#flcs.__name__
#greg(flcs)
fast_std = rem_common_extrem(opt_common_char(stdlib))
fast_std.__name__ = 'fast_std'
fast_std.__name__
try:
greg(fast_std)
except ValueError:
print('yes we know std is not right')
del gdict['fast_std']
fast_lcs always make ~ or much better than lcs_cut2 , removing the second from benchmark
In [48]:
del gdict['lcs_cut2']
In [49]:
gdict
Out[49]:
Lots of complexity graph to see the behavior of implementations in different cases.
In [50]:
import time
for case in cplx_case[0:]:
if len(case) == 4:
name = case[3]
else :
name = '(%s,%s)' %(case[0],case[1])
print('\n====',name,'====')
t= time.time()
df,dr = cplx_data(case, gdict, r=1, nt=25, npoints=15)
tp = time.time()
print('\n',int(tp-t),'s')
report(df,dr,name=name)
In [52]:
for mix in [10,20,30,40]:
for case in cplx_case[0:]:
if len(case) == 4:
name = case[3]
else :
name = '(%s,%s)' %(case[0],case[1])
print('\n====',name,'====')
t= time.time()
df,dr = cplx_data(case, {k:v for k,v in gdict.items() if k is not 'lcs_low_ma' }, r=1, nt=2, npoints=10, mix=mix)
name = name + "({})".format(mix)
tp = time.time()
print('\n',int(tp-t),'s')
report(df,dr,name=name)
In [ ]:
allr1 = []
allr2 = []
cr1 = []
cr2 = []
for i in range(20):
print('.',end='')
s1 = np.random.randint(0,2000,100)
s2 = np.random.randint(0,2000,100)
print(',',end='')
r2 = %timeit -o -n1 -r1 -q stdlib(s1,s2)
r1 = %timeit -o -n1 -r1 -q flcs(s1,s2)
cr1.append(flcs(s1,s2))
cr2.append(stdlib(s1,s2))
allr1.append(mean(r1.all_runs)*100)
allr2.append(mean(r2.all_runs)*100)
ax = seaborn.jointplot( allr1, allr2,dropna=False);
ax.fig.axes[0].set_xlim(xmin=0)
ax.fig.axes[0].set_ylim(ymin=0)
ax.fig.axes[0].plot([0,1],[0,1])
ax = seaborn.jointplot( cr1, cr2,dropna=False, kind='kde');
#ax.fig.axes[0].set_xlim(0)
#ax.fig.axes[0].set_ylim(0)
#ax.fig.axes[0].plot([0,100],[0,100])
CSS
In [ ]:
def backtrack(C, X, Y, ii=None, jj=None):
if ii==None and jj==None :
ii = len(X)-1
jj = len(Y)-1
if ii == -1 or jj == -1:
return []
elif X[ii] == Y[jj]:
p = backtrack(C ,X, Y, ii-1, jj-1)
p.append((ii,jj))
return p
else:
if C[ii][jj-1] > C[ii-1][jj]:
return backtrack(C, X, Y, ii, jj-1)
else:
return backtrack(C, X, Y, ii-1, jj)
In [ ]:
s1 = "hallow world"
s2 = "Hexllo worlrld"
s1 = "this is a string removed"
s2 = "This is a new string"
C = lcsmatrix(s1,s2)
In [ ]:
ig = backtrack(C, s1, s2)
i1, i2 = zip(*ig)
print ''.join([s1[i] for i in i1])
print ''.join([s2[i] for i in i2])
Sequence1 : --- Sequence2 :
list of tuple that contain : (Seq1, seq2), (None|int, None r int
In [ ]:
def opts(s1,s2,ig):
ig = iter(ig)
c1 = 0
c2 = 0
op = []
for ig1,ig2 in ig:
while c1 < ig1:
op.append((c1,None))
c1+=1
while c2 < ig2:
op.append((None,c2))
c2+=1
op.append((c1,c2))
c1+=1
c2+=1
while c1 < len(s1):
op.append((c1,None))
c1+=1
while c2 < len(s2):
op.append((None,c2))
c2+=1
return op
def idt(seq,ind,default='-'):
if ind is None:
return default
return seq[ind]
l1, l2 = zip(*opts(s1,s2, ig))
In [ ]:
print ''.join([idt(s1,i) for i in l1])
print ''.join(['-' if i is None else ' ' for i in l2])
print ''.join([idt(s2,i) for i in l2])
print ''.join(['+' if i is None else ' ' for i in l1])
print '='*20
print ''.join([idt(s1,i, default='') for i in l1])
print ''.join(['-' if i is None else ' ' if j is not None else '' for i,j in zip(l2,l1)])
print ''.join([idt(s2,i, default='') for i in l2])
print ''.join(['+' if i is None else ' ' if j is not None else '' for i,j in zip(l1,l2)])
In [ ]:
def lcs(s1, s2):
"""
return the longuest common subsequence of the 2 sequences.
if both are string return the lcs as a string
if both are unicode return the lcs as unicode
Otherwise, return the lcs as a list.
"""
arestring = (type(s1) is str) and (type(s2) is str)
areunicode = (type(s1) is unicode) and (type(s2) is unicode)
C = lcsmatrix(s1,s2)
ig = backtrack(C, s1, s2)
i1, i2 = zip(*ig)
lcs = [s1[i] for i in i1]
if areunicode :
return u''.join(lcs)
if arestring:
return ''.join(lcs)
return lcs