This short Jupyter notebook aims at defining and explaining the Lempel-Ziv complexity.
I will give examples, and benchmark different implementations.
The Lempel-Ziv complexity is defined as the number of different substrings (or phrases) encountered as the stream is viewed from beginning to end.
As an example:
>>> s = '1001111011000010'
>>> lempel_ziv_complexity(s) # 1 / 0 / 01 / 1110 / 1100 / 0010
6
Marking the different substrings, this sequence $s$ has complexity $\mathrm{Lempel}$-$\mathrm{Ziv}(s) = 6$ because $s = 1001111011000010 = 1 / 0 / 01 / 1110 / 1100 / 0010$.
Other examples:
>>> lempel_ziv_complexity('1010101010101010') # 1 / 0 / 10
3
>>> lempel_ziv_complexity('1001111011000010000010') # 1 / 0 / 01 / 1110 / 1100 / 0010 / 000 / 010
7
>>> lempel_ziv_complexity('100111101100001000001010') # 1 / 0 / 01 / 1110 / 1100 / 0010 / 000 / 010 / 10
8
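To make the definition more concrete, here is a small hypothetical helper (not part of the implementations below) that returns the phrases themselves instead of only counting them. It is a direct, unoptimized transcription of the definition: the current phrase is extended as long as it already occurs in the text read before its last symbol, and it is closed on the first symbol that makes it new. On the four examples above it gives respectively 6, 3, 7 and 8 phrases, even though the exact phrase boundaries can differ slightly from the hand decompositions written in the comments.

def lempel_ziv_phrases(binary_sequence):
    """List of Lempel-Ziv phrases of a binary sequence (illustration only, not optimized)."""
    phrases = []
    start = 0
    n = len(binary_sequence)
    while start < n:
        end = start + 1
        # extend the phrase while it already occurs in the text seen before its last symbol
        while end < n and binary_sequence[start:end] in binary_sequence[:end - 1]:
            end += 1
        phrases.append(binary_sequence[start:end])
        start = end
    return phrases

>>> lempel_ziv_phrases('1001111011000010')
['1', '0', '01', '1110', '1100', '0010']
>>> len(lempel_ziv_phrases('1001111011000010'))
6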
In [1]:
def lempel_ziv_complexity(binary_sequence):
    """Lempel-Ziv complexity for a binary sequence, in simple Python code."""
    u, v, w = 0, 1, 1   # u: start of the candidate copy in the prefix, v: current match length, w: start of the current word
    v_max = 1           # longest match found so far for the current word
    length = len(binary_sequence)
    complexity = 1
    while True:
        if binary_sequence[u + v - 1] == binary_sequence[w + v - 1]:
            v += 1
            if w + v >= length:
                complexity += 1
                break
        else:
            if v > v_max:
                v_max = v
            u += 1
            if u == w:   # every possible copy position has been tried: the current word is a new phrase
                complexity += 1
                w += v_max
                if w > length:
                    break
                else:
                    u = 0
                    v = 1
                    v_max = 1
            else:
                v = 1
    return complexity
In [2]:
s = '1001111011000010'
lempel_ziv_complexity(s) # 1 / 0 / 01 / 1110 / 1100 / 0010
Out[2]: 6
In [3]:
%timeit lempel_ziv_complexity(s)
In [4]:
lempel_ziv_complexity('1010101010101010') # 1 / 0 / 10
Out[4]: 3
In [5]:
lempel_ziv_complexity('1001111011000010000010') # 1 / 0 / 01 / 1110 / 1100 / 0010 / 000 / 010
Out[5]: 7
In [6]:
lempel_ziv_complexity('100111101100001000001010') # 1 / 0 / 01 / 1110 / 1100 / 0010 / 000 / 010 / 10
Out[6]: 8
In [7]:
%timeit lempel_ziv_complexity('100111101100001000001010')
We can start to see that the running time of this function grows quickly as the sequences get longer and their complexity increases.
As this blog post explains, we can easily use Cython from a notebook cell.
See the Cython documentation for more information.
In [7]:
%load_ext cython
In [8]:
%%cython
from __future__ import division
import cython

ctypedef unsigned int DTYPE_t

@cython.boundscheck(False) # turn off bounds-checking for entire function, quicker but less safe
def lempel_ziv_complexity_cython(str binary_sequence not None):
    """Lempel-Ziv complexity for a binary sequence, in simple Cython code (C extension)."""
    cdef DTYPE_t u = 0
    cdef DTYPE_t v = 1
    cdef DTYPE_t w = 1
    cdef DTYPE_t v_max = 1
    cdef DTYPE_t length = len(binary_sequence)
    cdef DTYPE_t complexity = 1
    # that was the only needed part, typing statically all the variables
    while True:
        if binary_sequence[u + v - 1] == binary_sequence[w + v - 1]:
            v += 1
            if w + v >= length:
                complexity += 1
                break
        else:
            if v > v_max:
                v_max = v
            u += 1
            if u == w:
                complexity += 1
                w += v_max
                if w > length:
                    break
                else:
                    u = 0
                    v = 1
                    v_max = 1
            else:
                v = 1
    return complexity
Let's try it!
In [9]:
s = '1001111011000010'
lempel_ziv_complexity_cython(s) # 1 / 0 / 01 / 1110 / 1100 / 0010
Out[9]: 6
In [10]:
%timeit lempel_ziv_complexity_cython(s)
In [12]:
lempel_ziv_complexity_cython('1010101010101010') # 1 / 0 / 10
Out[12]: 3
In [13]:
lempel_ziv_complexity_cython('1001111011000010000010') # 1 / 0 / 01 / 1110 / 1100 / 0010 / 000 / 010
Out[13]: 7
In [14]:
lempel_ziv_complexity_cython('100111101100001000001010') # 1 / 0 / 01 / 1110 / 1100 / 0010 / 000 / 010 / 10
Out[14]: 8
In [15]:
%timeit lempel_ziv_complexity_cython('100111101100001000001010')
$\implies$ Yay! It seems faster indeed!
In [87]:
from numba import jit
In [94]:
import numpy as np

@jit("int32(boolean[:])")
def lempel_ziv_complexity_numba_x(binary_sequence):
    """Lempel-Ziv complexity for a binary sequence, in Python code using numba.jit() for automatic speedup (hopefully)."""
    u, v, w = 0, 1, 1
    v_max = 1
    length = len(binary_sequence)
    complexity = 1
    while True:
        if binary_sequence[u + v - 1] == binary_sequence[w + v - 1]:
            v += 1
            if w + v >= length:
                complexity += 1
                break
        else:
            if v > v_max:
                v_max = v
            u += 1
            if u == w:
                complexity += 1
                w += v_max
                if w > length:
                    break
                else:
                    u = 0
                    v = 1
                    v_max = 1
            else:
                v = 1
    return complexity

def str_to_numpy(s):
    """str to np.array of bool"""
    return np.array([int(i) for i in s], dtype=bool)

def lempel_ziv_complexity_numba(s):
    return lempel_ziv_complexity_numba_x(str_to_numpy(s))
Let's try it!
In [95]:
str_to_numpy(s)
Out[95]:
In [96]:
s = '1001111011000010'
lempel_ziv_complexity_numba(s) # 1 / 0 / 01 / 1110 / 1100 / 0010
Out[96]: 6
In [97]:
%timeit lempel_ziv_complexity_numba(s)
In [98]:
lempel_ziv_complexity_numba('1010101010101010') # 1 / 0 / 10
Out[98]: 3
In [99]:
lempel_ziv_complexity_numba('1001111011000010000010') # 1 / 0 / 01 / 1110 / 1100 / 0010 / 000 / 010
Out[99]: 7
In [100]:
lempel_ziv_complexity_numba('100111101100001000001010') # 1 / 0 / 01 / 1110 / 1100 / 0010 / 000 / 010 / 10
Out[100]: 8
In [101]:
%timeit lempel_ziv_complexity_numba('100111101100001000001010')
$\implies$ Well... It doesn't seem much faster than the naive Python code. We specified the signature when calling @numba.jit, and used what should be the more appropriate data structure (a string is probably smaller, a numpy array is probably faster to index). But even these tricks didn't help much.
I tested, and without specifying the signature, the fastest approach is to use strings, compared to lists or numpy arrays. Note that a @jit-powered function is compiled at runtime when it is first called, so the argument types of that first call determine the signature of the compiled function.
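To illustrate that last point, here is a hypothetical sketch (not in the original notebook): we can also wrap the plain-Python implementation with jit() without giving any signature, and let Numba infer the types lazily from the first call.

# Hypothetical lazy-compilation variant: no explicit signature is given.
lempel_ziv_complexity_numba_lazy = jit(nopython=True)(lempel_ziv_complexity)

# The first call (here with a numpy array of booleans) triggers the compilation for that argument type;
# later calls with the same argument type reuse the compiled specialization.
lempel_ziv_complexity_numba_lazy(str_to_numpy('1001111011000010'))  # 6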
In [57]:
from numpy.random import binomial
def bernoulli(p, size=1):
    """One or more samples from a Bernoulli of probability p."""
    return binomial(1, p, size)
In [25]:
bernoulli(0.5, 20)
Out[25]:
That's probably not optimal, but we can generate a string with:
In [26]:
''.join(str(i) for i in bernoulli(0.5, 20))
Out[26]:
In [58]:
def random_binary_sequence(n, p=0.5):
    """Random binary sequence of size n, each bit being 1 with probability p."""
    return ''.join(str(i) for i in bernoulli(p, n))
In [60]:
random_binary_sequence(50)
random_binary_sequence(50, p=0.1)
random_binary_sequence(50, p=0.25)
random_binary_sequence(50, p=0.5)
random_binary_sequence(50, p=0.75)
random_binary_sequence(50, p=0.9)
Out[60]:
And so, this function can be used to check that the three implementations (naive Python, Cython-powered, Numba-powered) always give the same result.
In [29]:
def tests_3_functions(n, p=0.5, debug=True):
    s = random_binary_sequence(n, p=p)
    c1 = lempel_ziv_complexity(s)
    if debug:
        print("Sequence s = {} ==> complexity C = {}".format(s, c1))
    c2 = lempel_ziv_complexity_cython(s)
    c3 = lempel_ziv_complexity_numba(s)
    assert c1 == c2 == c3, "Error: the sequence {} gave different values of the Lempel-Ziv complexity from 3 functions ({}, {}, {})...".format(s, c1, c2, c3)
    return c1
In [30]:
tests_3_functions(5)
Out[30]:
In [31]:
tests_3_functions(20)
Out[31]:
In [32]:
tests_3_functions(50)
Out[32]:
In [33]:
tests_3_functions(500)
Out[33]:
In [34]:
tests_3_functions(5000)
Out[34]:
In [102]:
%timeit lempel_ziv_complexity('100111101100001000001010')
%timeit lempel_ziv_complexity_cython('100111101100001000001010')
%timeit lempel_ziv_complexity_numba('100111101100001000001010')
In [103]:
%timeit lempel_ziv_complexity('10011110110000100000101000100100101010010111111011001111111110101001010110101010')
%timeit lempel_ziv_complexity_cython('10011110110000100000101000100100101010010111111011001111111110101001010110101010')
%timeit lempel_ziv_complexity_numba('10011110110000100000101000100100101010010111111011001111111110101001010110101010')
Let's check the time used by the three functions, for longer and longer sequences:
In [104]:
%timeit tests_3_functions(10, debug=False)
%timeit tests_3_functions(20, debug=False)
%timeit tests_3_functions(40, debug=False)
%timeit tests_3_functions(80, debug=False)
%timeit tests_3_functions(160, debug=False)
%timeit tests_3_functions(320, debug=False)
In [105]:
def test_cython(n):
    s = random_binary_sequence(n)
    c = lempel_ziv_complexity_cython(s)
    return c
In [39]:
%timeit test_cython(10)
%timeit test_cython(20)
%timeit test_cython(40)
%timeit test_cython(80)
%timeit test_cython(160)
%timeit test_cython(320)
In [40]:
%timeit test_cython(640)
%timeit test_cython(1280)
%timeit test_cython(2560)
%timeit test_cython(5120)
In [41]:
%timeit test_cython(10240)
%timeit test_cython(20480)
In [42]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(context="notebook", style="darkgrid", palette="hls", font="sans-serif", font_scale=1.4)
It's dirty, but let us manually capture the times given by the experiments above.
In [43]:
x = [10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120, 10240, 20480]
y = [18, 30, 55, 107, 205, 471, 977, 2270, 5970, 17300, 56600, 185000]
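For reference, a hypothetical way (not in the original notebook) to collect such timings programmatically, with the standard timeit module instead of copying %timeit outputs by hand, could look like this:

import timeit

sizes = [10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120, 10240, 20480]
times_in_us = []
for n in sizes:
    s_n = random_binary_sequence(n)
    # best of 5 repetitions of 100 calls each, converted to microseconds per call
    best = min(timeit.repeat(lambda: lempel_ziv_complexity_cython(s_n), number=100, repeat=5))
    times_in_us.append(1e6 * best / 100)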
In [44]:
plt.figure()
plt.plot(x, y, 'o-')
plt.xlabel("Length $n$ of the binary sequence $S$")
plt.ylabel(r"Time in $\mu\;\mathrm{s}$")
plt.title("Time complexity of Lempel-Ziv complexity")
plt.show()
Out[44]: (figure: Time complexity of Lempel-Ziv complexity)
In [45]:
plt.figure()
plt.semilogx(x, y, 'o-')
plt.xlabel("Length $n$ of the binary sequence $S$")
plt.ylabel(r"Time in $\mu\;\mathrm{s}$")
plt.title("Time complexity of Lempel-Ziv complexity, semilogx scale")
plt.show()
Out[45]: (figure: Time complexity of Lempel-Ziv complexity, semilogx scale)
In [46]:
plt.figure()
plt.semilogy(x, y, 'o-')
plt.xlabel("Length $n$ of the binary sequence $S$")
plt.ylabel(r"Time in $\mu\;\mathrm{s}$")
plt.title("Time complexity of Lempel-Ziv complexity, semilogy scale")
plt.show()
Out[46]: (figure: Time complexity of Lempel-Ziv complexity, semilogy scale)
In [47]:
plt.figure()
plt.loglog(x, y, 'o-')
plt.xlabel("Length $n$ of the binary sequence $S$")
plt.ylabel(r"Time in $\mu\;\mathrm{s}$")
plt.title("Time complexity of Lempel-Ziv complexity, loglog scale")
plt.show()
Out[47]: (figure: Time complexity of Lempel-Ziv complexity, loglog scale)
The curve is roughly a straight line in log-log scale, meaning the running time follows a power law in $n$, and its slope (the exponent) is close to $1$: the algorithm indeed seems to have an (almost) linear complexity.
To sum up, for a sequence $S$ of length $n$, it takes $\mathcal{O}(n)$ basic operations to compute its Lempel-Ziv complexity $\mathrm{Lempel}$-$\mathrm{Ziv}(S)$.
The Lempel-Ziv complexity is not too hard to implement, and it indeed represents a certain complexity of a binary sequence, capturing its regularity and reproducibility.
Using Cython was quite useful, giving a $\simeq \times 100$ speedup over our naive Python implementation!
The algorithm is not easy to analyze: we have a trivial $\mathcal{O}(n^2)$ bound, but experiments suggest it is more likely $\mathcal{O}(n \log n)$ in the worst case, and $\mathcal{O}(n)$ in practice for "not too complicated" sequences (or on average, for random sequences).
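As a quick sanity check (not in the original notebook), the empirical exponent can be estimated by fitting a straight line to the log-log data captured above: a slope close to 1 indicates (quasi-)linear growth, while a slope close to 2 would indicate quadratic growth.

import numpy as np

# x = sequence lengths, y = measured times in microseconds (the lists captured above)
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
print("Empirical exponent of the power law: {:.2f}".format(slope))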
In [16]:
%%time
%%script julia
"""Lempel-Ziv complexity for a binary sequence, in simple Julia code."""
function lempel_ziv_complexity(binary_sequence)
    u, v, w = 0, 1, 1
    v_max = 1
    size = length(binary_sequence)
    complexity = 1
    while true
        if binary_sequence[u + v] == binary_sequence[w + v]
            v += 1
            if w + v >= size
                complexity += 1
                break
            end
        else
            if v > v_max
                v_max = v
            end
            u += 1
            if u == w
                complexity += 1
                w += v_max
                if w > size
                    break
                else
                    u = 0
                    v = 1
                    v_max = 1
                end
            else
                v = 1
            end
        end
    end
    return complexity
end

s = "1001111011000010"
lempel_ziv_complexity(s) # 1 / 0 / 01 / 1110 / 1100 / 0010

M = 100;
N = 10000;
for _ in 1:M
    s = join(rand(0:1, N));
    lempel_ziv_complexity(s);
end

lempel_ziv_complexity(s) # 1 / 0 / 01 / 1110 / 1100 / 0010
And to compare it fairly, let us run the same naive Python code with PyPy.
In [17]:
%%time
%%pypy
def lempel_ziv_complexity(binary_sequence):
    """Lempel-Ziv complexity for a binary sequence, in simple Python code."""
    u, v, w = 0, 1, 1
    v_max = 1
    length = len(binary_sequence)
    complexity = 1
    while True:
        if binary_sequence[u + v - 1] == binary_sequence[w + v - 1]:
            v += 1
            if w + v >= length:
                complexity += 1
                break
        else:
            if v > v_max:
                v_max = v
            u += 1
            if u == w:
                complexity += 1
                w += v_max
                if w > length:
                    break
                else:
                    u = 0
                    v = 1
                    v_max = 1
            else:
                v = 1
    return complexity

s = "1001111011000010"
lempel_ziv_complexity(s)  # 1 / 0 / 01 / 1110 / 1100 / 0010

from random import random

M = 100
N = 10000
for _ in range(M):
    s = ''.join(str(int(random() < 0.5)) for _ in range(N))
    lempel_ziv_complexity(s)
So we can check that on these 100 random trials on strings of size 10000, the naive Julia version is about twice as fast as the naive Python version (executed with PyPy for speed).
That's good, but it's not much...
Thanks for reading! My implementation is now open-source and available on GitHub, at https://github.com/Naereen/Lempel-Ziv_Complexity.
It will be available from PyPI very soon, see https://pypi.python.org/pypi/lempel_ziv_complexity.
See this repo on GitHub for more notebooks, or on nbviewer.jupyter.org.
That's it for this demo! See you, folks!