In this short Jupyter notebook aims at defining and explaining the Lempel-Ziv complexity.
I will give examples, and benchmarks of different implementations.
The Lempel-Ziv complexity is defined as the number of different substrings encountered as the stream is viewed from begining to the end.
As an example:
>>> s = '1001111011000010'
>>> lempel_ziv_complexity(s) # 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010
8
Marking in the different substrings, this sequence $s$ has complexity $\mathrm{Lempel}$-$\mathrm{Ziv}(s) = 6$ because $s = 1001111011000010 = 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010$.
Other examples:
>>> lempel_ziv_complexity('1010101010101010') # 1, 0, 10, 101, 01, 010, 1010
7
>>> lempel_ziv_complexity('1001111011000010000010') # 1, 0, 01, 11, 10, 110, 00, 010, 000
9
>>> lempel_ziv_complexity('100111101100001000001010') # 1, 0, 01, 11, 10, 110, 00, 010, 000, 0101
10
In [112]:
def lempel_ziv_complexity(sequence):
"""Lempel-Ziv complexity for a binary sequence, in simple Python code."""
sub_strings = set()
n = len(sequence)
ind = 0
inc = 1
# this while loop runs at most n times
while True:
if ind + inc > len(sequence):
break
# this can take some time, takes O(inc)
sub_str = sequence[ind : ind + inc]
# and this also, takes a O(log |size set|) in worst case
# max value for inc = n / size set at the end
# so worst case is that the set contains sub strings of the same size
# and the worst loop takes a O(n / |S| * log(|S|))
# ==> so if n/|S| is constant, it gives O(n log(n)) at the end
# but if n/|S| = O(n) then it gives O(n^2)
if sub_str in sub_strings:
inc += 1
else:
sub_strings.add(sub_str)
ind += inc
inc = 1
return len(sub_strings)
In [4]:
s = '1001111011000010'
lempel_ziv_complexity(s) # 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010
Out[4]:
In [5]:
%timeit lempel_ziv_complexity(s)
In [6]:
lempel_ziv_complexity('1010101010101010') # 1, 0, 10, 101, 01, 010, 1010
Out[6]:
In [7]:
lempel_ziv_complexity('1001111011000010000010') # 1, 0, 01, 11, 10, 110, 00, 010, 000
Out[7]:
In [8]:
lempel_ziv_complexity('100111101100001000001010') # 1, 0, 01, 11, 10, 110, 00, 010, 000, 0101
Out[8]:
In [9]:
%timeit lempel_ziv_complexity('100111101100001000001010')
In [11]:
import random
def random_string(size, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
return "".join(random.choices(alphabet, k=size))
def random_binary_sequence(size):
return random_string(size, alphabet="01")
In [13]:
random_string(100)
random_binary_sequence(100)
Out[13]:
Out[13]:
In [43]:
for (r, name) in zip(
[random_string, random_binary_sequence],
["random strings in A..Z", "random binary sequences"]
):
print("\nFor {}...".format(name))
for n in [10, 100, 1000, 10000, 100000]:
print(" of sizes {}, Lempel-Ziv complexity runs in:".format(n))
%timeit lempel_ziv_complexity(r(n))
We can start to see that the time complexity of this function seems to grow linearly as the size grows.
As this blog post explains it, we can easily try to use Cython in a notebook cell.
See the Cython documentation for more information.
In [16]:
%load_ext cython
In [45]:
%%cython
import cython
ctypedef unsigned int DTYPE_t
@cython.boundscheck(False) # turn off bounds-checking for entire function, quicker but less safe
def lempel_ziv_complexity_cython(str sequence not None):
"""Lempel-Ziv complexity for a string, in simple Cython code (C extension)."""
cdef set sub_strings = set()
cdef str sub_str = ""
cdef DTYPE_t n = len(sequence)
cdef DTYPE_t ind = 0
cdef DTYPE_t inc = 1
while True:
if ind + inc > len(sequence):
break
sub_str = sequence[ind : ind + inc]
if sub_str in sub_strings:
inc += 1
else:
sub_strings.add(sub_str)
ind += inc
inc = 1
return len(sub_strings)
Let try it!
In [37]:
s = '1001111011000010'
lempel_ziv_complexity_cython(s) # 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010
Out[37]:
In [29]:
%timeit lempel_ziv_complexity(s)
%timeit lempel_ziv_complexity_cython(s)
In [30]:
lempel_ziv_complexity_cython('1010101010101010') # 1, 0, 10, 101, 01, 010, 1010
Out[30]:
In [31]:
lempel_ziv_complexity_cython('1001111011000010000010') # 1, 0, 01, 11, 10, 110, 00, 010, 000
Out[31]:
In [32]:
lempel_ziv_complexity_cython('100111101100001000001010') # 1, 0, 01, 11, 10, 110, 00, 010, 000, 0101
Out[32]:
Now for a test of the speed?
In [46]:
for (r, name) in zip(
[random_string, random_binary_sequence],
["random strings in A..Z", "random binary sequences"]
):
print("\nFor {}...".format(name))
for n in [10, 100, 1000, 10000, 100000]:
print(" of sizes {}, Lempel-Ziv complexity in Cython runs in:".format(n))
%timeit lempel_ziv_complexity_cython(r(n))
$\implies$ Yay! It seems faster indeed! but only x2 times faster...
In [48]:
from numba import jit
In [79]:
@jit
def lempel_ziv_complexity_numba(sequence : str) -> int:
"""Lempel-Ziv complexity for a sequence, in Python code using numba.jit() for automatic speedup (hopefully)."""
sub_strings = set()
n : int= len(sequence)
ind : int = 0
inc : int = 1
while True:
if ind + inc > len(sequence):
break
sub_str : str = sequence[ind : ind + inc]
if sub_str in sub_strings:
inc += 1
else:
sub_strings.add(sub_str)
ind += inc
inc = 1
return len(sub_strings)
Let try it!
In [80]:
s = '1001111011000010'
lempel_ziv_complexity_numba(s) # 1 / 0 / 01 / 1110 / 1100 / 0010
Out[80]:
In [53]:
%timeit lempel_ziv_complexity_numba(s)
In [54]:
lempel_ziv_complexity_numba('1010101010101010') # 1, 0, 10, 101, 01, 010, 1010
Out[54]:
In [55]:
lempel_ziv_complexity_numba('1001111011000010000010') # 1, 0, 01, 11, 10, 110, 00, 010, 000
9
Out[55]:
In [57]:
lempel_ziv_complexity_numba('100111101100001000001010') # 1, 0, 01, 11, 10, 110, 00, 010, 000, 0101
Out[57]:
In [58]:
%timeit lempel_ziv_complexity_numba('100111101100001000001010')
$\implies$ Well... It doesn't seem that much faster from the naive Python code. We specified the signature when calling
@numba.jit
, and used the more appropriate data structure (string is probably the smaller, numpy array are probably faster). But even these tricks didn't help that much.I tested, and without specifying the signature, the fastest approach is using string, compared to using lists or numpy arrays. Note that the
@jit
-powered function is compiled at runtime when first being called, so the signature used for the first call is determining the signature used by the compile function
In [59]:
from numpy.random import binomial
def bernoulli(p, size=1):
"""One or more samples from a Bernoulli of probability p."""
return binomial(1, p, size)
In [60]:
bernoulli(0.5, 20)
Out[60]:
That's probably not optimal, but we can generate a string with:
In [61]:
''.join(str(i) for i in bernoulli(0.5, 20))
Out[61]:
In [62]:
def random_binary_sequence(n, p=0.5):
"""Uniform random binary sequence of size n, with rate of 0/1 being p."""
return ''.join(str(i) for i in bernoulli(p, n))
In [63]:
random_binary_sequence(50)
random_binary_sequence(50, p=0.1)
random_binary_sequence(50, p=0.25)
random_binary_sequence(50, p=0.5)
random_binary_sequence(50, p=0.75)
random_binary_sequence(50, p=0.9)
Out[63]:
Out[63]:
Out[63]:
Out[63]:
Out[63]:
Out[63]:
And so, this function can test to check that the three implementations (naive, Cython-powered, Numba-powered) always give the same result.
In [64]:
def tests_3_functions(n, p=0.5, debug=True):
s = random_binary_sequence(n, p=p)
c1 = lempel_ziv_complexity(s)
if debug:
print("Sequence s = {} ==> complexity C = {}".format(s, c1))
c2 = lempel_ziv_complexity_cython(s)
c3 = lempel_ziv_complexity_numba(s)
assert c1 == c2 == c3, "Error: the sequence {} gave different values of the Lempel-Ziv complexity from 3 functions ({}, {}, {})...".format(s, c1, c2, c3)
return c1
In [65]:
tests_3_functions(5)
Out[65]:
In [66]:
tests_3_functions(20)
Out[66]:
In [67]:
tests_3_functions(50)
Out[67]:
In [68]:
tests_3_functions(500)
Out[68]:
In [69]:
tests_3_functions(5000)
Out[69]:
In [70]:
%timeit lempel_ziv_complexity('100111101100001000001010')
%timeit lempel_ziv_complexity_cython('100111101100001000001010')
%timeit lempel_ziv_complexity_numba('100111101100001000001010')
In [71]:
%timeit lempel_ziv_complexity('10011110110000100000101000100100101010010111111011001111111110101001010110101010')
%timeit lempel_ziv_complexity_cython('10011110110000100000101000100100101010010111111011001111111110101001010110101010')
%timeit lempel_ziv_complexity_numba('10011110110000100000101000100100101010010111111011001111111110101001010110101010')
Let check the time used by all the three functions, for longer and longer sequences:
In [72]:
%timeit tests_3_functions(10, debug=False)
%timeit tests_3_functions(20, debug=False)
%timeit tests_3_functions(40, debug=False)
%timeit tests_3_functions(80, debug=False)
%timeit tests_3_functions(160, debug=False)
%timeit tests_3_functions(320, debug=False)
In [73]:
def test_cython(n):
s = random_binary_sequence(n)
c = lempel_ziv_complexity_cython(s)
return c
In [74]:
%timeit test_cython(10)
%timeit test_cython(20)
%timeit test_cython(40)
%timeit test_cython(80)
%timeit test_cython(160)
%timeit test_cython(320)
In [75]:
%timeit test_cython(640)
%timeit test_cython(1280)
%timeit test_cython(2560)
%timeit test_cython(5120)
In [76]:
%timeit test_cython(10240)
%timeit test_cython(20480)
In [95]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(context="notebook", style="darkgrid", palette="hls", font="sans-serif", font_scale=1.4)
In [96]:
import numpy as np
import timeit
In [109]:
sizes = np.array(np.trunc(np.logspace(1, 6, 30)), dtype=int)
times = np.array([
timeit.timeit(
stmt="lempel_ziv_complexity_cython(random_string({}))".format(n),
globals=globals(),
number=10,
)
for n in sizes
])
In [110]:
plt.figure(figsize=(15, 10))
plt.plot(sizes, times, 'o-')
plt.xlabel("Length $n$ of the binary sequence $S$")
plt.ylabel(r"Time in $\mu\;\mathrm{s}$")
plt.title("Time complexity of Lempel-Ziv complexity")
plt.show()
Out[110]:
Out[110]:
Out[110]:
Out[110]:
Out[110]:
In [111]:
plt.figure(figsize=(15, 10))
plt.loglog(sizes, times, 'o-')
plt.xlabel("Length $n$ of the binary sequence $S$")
plt.ylabel(r"Time in $\mu\;\mathrm{s}$")
plt.title("Time complexity of Lempel-Ziv complexity, loglog scale")
plt.show()
Out[111]:
Out[111]:
Out[111]:
Out[111]:
Out[111]:
It is linear in $\log\log$ scale, so indeed the algorithm seems to have a linear complexity.
To sum-up, for a sequence $S$ of length $n$, it takes $\mathcal{O}(n)$ basic operations to compute its Lempel-Ziv complexity $\mathrm{Lempel}-\mathrm{Ziv}(S)$.
The Lempel-Ziv complexity is not too hard to implement, and it indeed represents a certain complexity of a binary sequence, capturing the regularity and reproducibility of the sequence.
Using the Cython was quite useful to have a $\simeq \times 100$ speed up on our manual naive implementation !
The algorithm is not easy to analyze, we have a trivial $\mathcal{O}(n^2)$ bound but experiments showed it is more likely to be $\mathcal{O}(n \log n)$ in the worst case, and $\mathcal{O}(n)$ in practice for "not too complicated sequences" (or in average, for random sequences).
In [118]:
%%time
%%script julia
"""Lempel-Ziv complexity for a sequence, in simple Julia code."""
function lempel_ziv_complexity(sequence)
sub_strings = Set()
n = length(sequence)
ind = 1
inc = 1
while true
if ind + inc > n
break
end
sub_str = sequence[ind : ind + inc]
if sub_str in sub_strings
inc += 1
else
push!(sub_strings, sub_str)
ind += inc
inc = 1
end
end
return length(sub_strings)
end
s = "1001111011000010"
lempel_ziv_complexity(s) # 1 / 0 / 01 / 1110 / 1100 / 0010
M = 1000;
N = 10000;
for _ in 1:M
s = join(rand(0:1, N));
lempel_ziv_complexity(s);
end
lempel_ziv_complexity(s) # 1 / 0 / 01 / 1110 / 1100 / 0010
And to compare it fairly, let us use Pypy for comparison.
In [122]:
%%time
%%pypy
def lempel_ziv_complexity(sequence):
"""Lempel-Ziv complexity for a binary sequence, in simple Python code."""
sub_strings = set()
n = len(sequence)
ind = 0
inc = 1
while True:
if ind + inc > len(sequence):
break
sub_str = sequence[ind : ind + inc]
if sub_str in sub_strings:
inc += 1
else:
sub_strings.add(sub_str)
ind += inc
inc = 1
return len(sub_strings)
s = "1001111011000010"
lempel_ziv_complexity(s) # 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010
from random import random
M = 1000
N = 10000
for _ in range(M):
s = ''.join(str(int(random() < 0.5)) for _ in range(N))
lempel_ziv_complexity(s)
So we can check that on these 1000 random trials on strings of size 10000, the naive Julia version is slower than the naive Python version (executed by Pypy for speedup).
Thanks for reading! My implementation is now open-source and available on GitHub, on https://github.com/Naereen/Lempel-Ziv_Complexity.
It will be available from PyPi very soon, see https://pypi.python.org/pypi/lempel_ziv_complexity.
See this repo on GitHub for more notebooks, or on nbviewer.jupyter.org.
That's it for this demo! See you, folks!