Short study of the Lempel-Ziv complexity

In this short Jupyter notebook aims at defining and explaining the Lempel-Ziv complexity.

I will give examples, and benchmarks of different implementations.

  • Reference: Abraham Lempel and Jacob Ziv, « On the Complexity of Finite Sequences », IEEE Trans. on Information Theory, January 1976, p. 75–81, vol. 22, n°1.

Short definition

The Lempel-Ziv complexity is defined as the number of different substrings encountered as the stream is viewed from begining to the end.

As an example:

>>> s = '1001111011000010'
>>> lempel_ziv_complexity(s)  # 1 / 0 / 01 / 1110 / 1100 / 0010
6

Marking in the different substrings, this sequence $s$ has complexity $\mathrm{Lempel}$-$\mathrm{Ziv}(s) = 6$ because $s = 1001111011000010 = 1 / 0 / 01 / 1110 / 1100 / 0010$.

Other examples:

>>> lempel_ziv_complexity('1010101010101010')  # 1 / 0 / 10
3
>>> lempel_ziv_complexity('1001111011000010000010')  # 1 / 0 / 01 / 1110 / 1100 / 0010 / 000 / 010
7
>>> lempel_ziv_complexity('100111101100001000001010')  # 1 / 0 / 01 / 1110 / 1100 / 0010 / 000 / 010 / 10
8

Python implementation


In [1]:
def lempel_ziv_complexity(binary_sequence):
    """Lempel-Ziv complexity for a binary sequence, in simple Python code."""
    u, v, w = 0, 1, 1
    v_max = 1
    length = len(binary_sequence)
    complexity = 1
    while True:
        if binary_sequence[u + v - 1] == binary_sequence[w + v - 1]:
            v += 1
            if w + v >= length:
                complexity += 1
                break
        else:
            if v > v_max:
                v_max = v
            u += 1
            if u == w:
                complexity += 1
                w += v_max
                if w > length:
                    break
                else:
                    u = 0
                    v = 1
                    v_max = 1
            else:
                v = 1
    return complexity

Tests (1/2)


In [2]:
s = '1001111011000010'
lempel_ziv_complexity(s)  # 1 / 0 / 01 / 1110 / 1100 / 0010


Out[2]:
6

In [3]:
%timeit lempel_ziv_complexity(s)


6.1 µs ± 173 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [4]:
lempel_ziv_complexity('1010101010101010')  # 1 / 0 / 10


Out[4]:
3

In [5]:
lempel_ziv_complexity('1001111011000010000010')  # 1 / 0 / 01 / 1110


Out[5]:
7

In [6]:
lempel_ziv_complexity('100111101100001000001010')  # 1 / 0 / 01 / 1110 / 1100 / 0010 / 000 / 010 / 10


Out[6]:
8

In [7]:
%timeit lempel_ziv_complexity('100111101100001000001010')


19.4 µs ± 2.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

We can start to see that the time complexity of this function seems to grow exponentially as the complexity grows.


Cython implementation

As this blog post explains it, we can easily try to use Cython in a notebook cell.

See the Cython documentation for more information.


In [7]:
%load_ext cython

In [8]:
%%cython
from __future__ import division
import cython

ctypedef unsigned int DTYPE_t

@cython.boundscheck(False) # turn off bounds-checking for entire function, quicker but less safe
def lempel_ziv_complexity_cython(str binary_sequence not None):
    """Lempel-Ziv complexity for a binary sequence, in simple Cython code (C extension)."""
    cdef DTYPE_t u = 0
    cdef DTYPE_t v = 1
    cdef DTYPE_t w = 1
    cdef DTYPE_t v_max = 1
    cdef DTYPE_t length = len(binary_sequence)
    cdef DTYPE_t complexity = 1
    # that was the only needed part, typing statically all the variables
    while True:
        if binary_sequence[u + v - 1] == binary_sequence[w + v - 1]:
            v += 1
            if w + v >= length:
                complexity += 1
                break
        else:
            if v > v_max:
                v_max = v
            u += 1
            if u == w:
                complexity += 1
                w += v_max
                if w > length:
                    break
                else:
                    u = 0
                    v = 1
                    v_max = 1
            else:
                v = 1
    return complexity

Let try it!


In [9]:
s = '1001111011000010'
lempel_ziv_complexity_cython(s)  # 1 / 0 / 01 / 1110 / 1100 / 0010


Out[9]:
6

In [10]:
%timeit lempel_ziv_complexity_cython(s)


131 ns ± 5.22 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [12]:
lempel_ziv_complexity_cython('1010101010101010')  # 1 / 0 / 10


Out[12]:
3

In [13]:
lempel_ziv_complexity_cython('1001111011000010000010')  # 1 / 0 / 01 / 1110


Out[13]:
7

In [14]:
lempel_ziv_complexity_cython('100111101100001000001010')  # 1 / 0 / 01 / 1110 / 1100 / 0010 / 000 / 010 / 10


Out[14]:
8

In [15]:
%timeit lempel_ziv_complexity_cython('100111101100001000001010')


259 ns ± 13.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

$\implies$ Yay! It seems faster indeed!


Numba implementation

As this blog post explains it, we can also try to use Numba in a notebook cell.


In [87]:
from numba import jit

In [94]:
@jit("int32(boolean[:])")
def lempel_ziv_complexity_numba_x(binary_sequence):
    """Lempel-Ziv complexity for a binary sequence, in Python code using numba.jit() for automatic speedup (hopefully)."""
    u, v, w = 0, 1, 1
    v_max = 1
    length = len(binary_sequence)
    complexity = 1
    while True:
        if binary_sequence[u + v - 1] == binary_sequence[w + v - 1]:
            v += 1
            if w + v >= length:
                complexity += 1
                break
        else:
            if v > v_max:
                v_max = v
            u += 1
            if u == w:
                complexity += 1
                w += v_max
                if w > length:
                    break
                else:
                    u = 0
                    v = 1
                    v_max = 1
            else:
                v = 1
    return complexity

def str_to_numpy(s):
    """str to np.array of bool"""
    return np.array([int(i) for i in s], dtype=np.bool)

def lempel_ziv_complexity_numba(s):
    return lempel_ziv_complexity_numba_x(str_to_numpy(s))

Let try it!


In [95]:
str_to_numpy(s)


Out[95]:
array([ True, False, False,  True,  True,  True,  True, False,  True,
        True, False, False, False, False,  True, False], dtype=bool)

In [96]:
s = '1001111011000010'
lempel_ziv_complexity_numba(s)  # 1 / 0 / 01 / 1110 / 1100 / 0010


Out[96]:
6

In [97]:
%timeit lempel_ziv_complexity_numba(s)


6.16 µs ± 228 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [98]:
lempel_ziv_complexity_numba('1010101010101010')  # 1 / 0 / 10


Out[98]:
3

In [99]:
lempel_ziv_complexity_numba('1001111011000010000010')  # 1 / 0 / 01 / 1110


Out[99]:
7

In [100]:
lempel_ziv_complexity_numba('100111101100001000001010')  # 1 / 0 / 01 / 1110 / 1100 / 0010 / 000 / 010 / 10


Out[100]:
8

In [101]:
%timeit lempel_ziv_complexity_numba('100111101100001000001010')


9.35 µs ± 1.22 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

$\implies$ Well... It doesn't seem that much faster from the naive Python code. We specified the signature when calling @numba.jit, and used the more appropriate data structure (string is probably the smaller, numpy array are probably faster). But even these tricks didn't help that much.

I tested, and without specifying the signature, the fastest approach is using string, compared to using lists or numpy arrays. Note that the @jit-powered function is compiled at runtime when first being called, so the signature used for the first call is determining the signature used by the compile function


Tests (2/2)

To test more robustly, let us generate some (uniformly) random binary sequences.


In [57]:
from numpy.random import binomial

def bernoulli(p, size=1):
    """One or more samples from a Bernoulli of probability p."""
    return binomial(1, p, size)

In [25]:
bernoulli(0.5, 20)


Out[25]:
array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

That's probably not optimal, but we can generate a string with:


In [26]:
''.join(str(i) for i in bernoulli(0.5, 20))


Out[26]:
'10011100011111111011'

In [58]:
def random_binary_sequence(n, p=0.5):
    """Uniform random binary sequence of size n, with rate of 0/1 being p."""
    return ''.join(str(i) for i in bernoulli(p, n))

In [60]:
random_binary_sequence(50)
random_binary_sequence(50, p=0.1)
random_binary_sequence(50, p=0.25)
random_binary_sequence(50, p=0.5)
random_binary_sequence(50, p=0.75)
random_binary_sequence(50, p=0.9)


Out[60]:
'11100011000111010111000110001110001110010010010010'
Out[60]:
'00000100100000100000000000000000000100000000000000'
Out[60]:
'00000010000010011100111010010000000000000010100010'
Out[60]:
'11000111100001111001100000101011011011011010110110'
Out[60]:
'11111001110101111111111101111111010111111111110111'
Out[60]:
'11111111111111101111101111110111111001111110111111'

And so, this function can test to check that the three implementations (naive, Cython-powered, Numba-powered) always give the same result.


In [29]:
def tests_3_functions(n, p=0.5, debug=True):
    s = random_binary_sequence(n, p=p)
    c1 = lempel_ziv_complexity(s)
    if debug:
        print("Sequence s = {} ==> complexity C = {}".format(s, c1))
    c2 = lempel_ziv_complexity_cython(s)
    c3 = lempel_ziv_complexity_numba(s)
    assert c1 == c2 == c3, "Error: the sequence {} gave different values of the Lempel-Ziv complexity from 3 functions ({}, {}, {})...".format(s, c1, c2, c3)
    return c1

In [30]:
tests_3_functions(5)


Sequence s = 11010 ==> complexity C = 3
Out[30]:
3

In [31]:
tests_3_functions(20)


Sequence s = 00011000000010010111 ==> complexity C = 6
Out[31]:
6

In [32]:
tests_3_functions(50)


Sequence s = 00111111110000010000100011000010001100110001001101 ==> complexity C = 8
Out[32]:
8

In [33]:
tests_3_functions(500)


Sequence s = 01011110110110010010101101110000101110100011011101100101001000001100111000011000101000000010010010010100101101101010111010001100011000100111111101101111000010001111001110010000001100011011111001001011101110001000111101100110100111011010101000010100011000010001101101000101111101001011111001100111001010100010010001111000101100110000100000010100111111000110110011110101000100110101000011001010111010101111101100111010101011110001100001111111000100100000001000100100101011010111011001101110110111111110 ==> complexity C = 60
Out[33]:
60

In [34]:
tests_3_functions(5000)


Sequence s = 00001100000110000110100011111111010111110000010110100111111000000111011011000001011101101101011100101110010001010000011011110100001011001011000101100001000000110100110011111100010111000101000011100010010100011110011110011101011101110011110001110110100100111110010010110010100100100101001010110100111111000010011110101011110001010010000110101000000111001010101000101001101000111100001100101111010010001011111111110001111001100011010011101011001111110110011101100011001000001000110101001010000000101011111101010110111110000000001000111000000011110101001110101011111100000101011111000100000000110110011011011111001111101000110001011111011010011100100110010001011011001001100111000001100111101011000010100010111100010000111001111001100001010001010010100110101111001010011010110001100001100011110011101101000110000010010010111110101011000000101110110011100110110011111111010011001101011101010001110001000001000000001111010000000010000101011001111001101110100110000100110000010010011110111001001111110100101000010101001101011111111010010110011000011001101001110010100010111100001011001000011000001000011111001000011110001010111111111100000000110101110110001100010111100101100010001001101010101000110010100101010100001111010001001001111010100010101101101110010110011001010110101111010011000100011110001110011000001000010010000000000001000001011010100010111001001000101000001101100100011100011110011010010010110011000101110011110000010000100110100100000011011100110111101110110101110010111000011110100110001011010111000111001110101110001100000111011100011100000010000101010110010011000000001111100101010101110100111111101011001100101111000011101111010101001000111001110100100011110000110001110100010101001100101010000101101000010110010000011100000111001111101110100010011010111101000010100011011100111001110000001000011110111001110100110010110000010011111101000011101010000001011101011111100110111010111000101110010011110110110000000110110000110010000011110000111000101001001101011101101101010011010100101000011101010110011100000110100010001101110101001011111110100101010011111110100111000110010111010100000100101001010100000000000000101001111111001010010111011000000001100111111011011100100011011100010101000111000000100101100000110111100101100100110010001000000111000010011100001010111010001011111111011101010101101101000100010110101110001110100101001111101111101100110100101100110111001011011110011110100010100010110101110110000111000100000010101011100101010101100001011100101000100110100110010100101100010010010100010100110001010101001011001100001001111110000101000000111110001101010110001000000111111111101100001101011011000110000010111111101111010001100111011100111001100010110111010110011011100111101110010100011011011000001011111111001000001010001010001100011010101011100010000101111111101000100110110110001110111100011011101010100101010100111011111111001001000000001010001100100111111000111010110111100101001101100001100101101000001111010110000010101000010101000110110001100101010110111110111110110110101101000010101000001111101100001100010111111111001101111000111110100111001010111111110000111111100100110010000101011111010100001001101010101010111111110100010010010101000100011100101011101100001010011100011101111111100000100100111101011011000110010011111110000011010011000111110110110110000111010101101100101101100010100101011010001110111011111100010011011100110000000011111011010001101110110000110001101011111110011011110001111100111011001100011111010010000111000000110001101010101011010000110110011111010101111101001100000000110001111010110110001111111010100001100101000011011101010111000010110110010100001010011010010101011011100001010101011100111011110001100110101000011001000010100111000100110001000110111100010100110111011111101110011000100000011100010011001010001110010011011111110111100000111001010110000011011010011010011000000111100011101101110110101010010001111100100011100001101011001010101001100101101101011010011001011001100100011010001011000101011000011000101010110000110000011111111101000001100011100111110110001000111000111111011111110101000001100001100010111001101011100001010011000001110100010011001001101101111010100100001010011110100110001000001010010110111110001011101000000110011111001110011011011100111000101010101010011100100001000011001100001101100011100110101010010011110101110000111000001111010100011111100011101000010101111010001101100010111111110110000101010111000000101111001001111110011110001110111101000010010001101000001011100010000011000000011101010100000101001001100101001001001001001111111101100111001110110001000001100001010011001001110111111101111100011100011101001000000100000010111101001101001100010101001110100101000010110101011100011111100001100110111010111111110010001000011000101110111110010100011111111001001011110010101111101110111101000000001111010000010001100011011010001101110010110111011000110110111100111111011001000101101011000101110010011010010111001000101000111010101000000010111111010011001110010110001010001111011110001101001100000110000001 ==> complexity C = 420
Out[34]:
420

Benchmarks

On two example of strings (binary sequences), we can compare our three implementation.


In [102]:
%timeit lempel_ziv_complexity('100111101100001000001010')
%timeit lempel_ziv_complexity_cython('100111101100001000001010')
%timeit lempel_ziv_complexity_numba('100111101100001000001010')


16.7 µs ± 471 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
249 ns ± 6.71 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
7.98 µs ± 236 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [103]:
%timeit lempel_ziv_complexity('10011110110000100000101000100100101010010111111011001111111110101001010110101010')
%timeit lempel_ziv_complexity_cython('10011110110000100000101000100100101010010111111011001111111110101001010110101010')
%timeit lempel_ziv_complexity_numba('10011110110000100000101000100100101010010111111011001111111110101001010110101010')


139 µs ± 4.33 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1.94 µs ± 83.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
22 µs ± 364 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Let check the time used by all the three functions, for longer and longer sequences:


In [104]:
%timeit tests_3_functions(10, debug=False)
%timeit tests_3_functions(20, debug=False)
%timeit tests_3_functions(40, debug=False)
%timeit tests_3_functions(80, debug=False)
%timeit tests_3_functions(160, debug=False)
%timeit tests_3_functions(320, debug=False)


28.2 µs ± 2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
52.3 µs ± 2.75 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
108 µs ± 1.42 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
359 µs ± 91.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
862 µs ± 64.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.16 ms ± 299 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [105]:
def test_cython(n):
    s = random_binary_sequence(n)
    c = lempel_ziv_complexity_cython(s)
    return c

In [39]:
%timeit test_cython(10)
%timeit test_cython(20)
%timeit test_cython(40)
%timeit test_cython(80)
%timeit test_cython(160)
%timeit test_cython(320)


17.1 µs ± 490 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
28.8 µs ± 959 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
62.4 µs ± 6.77 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
125 µs ± 7.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
216 µs ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
459 µs ± 16.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [40]:
%timeit test_cython(640)
%timeit test_cython(1280)
%timeit test_cython(2560)
%timeit test_cython(5120)


1.18 ms ± 214 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.38 ms ± 144 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.93 ms ± 71.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
17.1 ms ± 168 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [41]:
%timeit test_cython(10240)
%timeit test_cython(20480)


52.1 ms ± 584 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
178 ms ± 6.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Complexity ?

$\implies$ The function lempel_ziv_complexity_cython seems to be indeed (almost) linear in $n$, the length of the binary sequence $S$.

But let check more precisely, as it could also have a complexity of $\mathcal{O}(n \log n)$.


In [42]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(context="notebook", style="darkgrid", palette="hls", font="sans-serif", font_scale=1.4)

It's durty, but let us capture manually the times given by the experiments above.


In [43]:
x = [10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120, 10240, 20480]
y = [18, 30, 55, 107, 205, 471, 977, 2270, 5970, 17300, 56600, 185000]

In [44]:
plt.figure()
plt.plot(x, y, 'o-')
plt.xlabel("Length $n$ of the binary sequence $S$")
plt.ylabel(r"Time in $\mu\;\mathrm{s}$")
plt.title("Time complexity of Lempel-Ziv complexity")
plt.show()


Out[44]:
<matplotlib.figure.Figure at 0x7f54a1380f60>
Out[44]:
[<matplotlib.lines.Line2D at 0x7f54a0e2b550>]
Out[44]:
<matplotlib.text.Text at 0x7f54a0ec7240>
Out[44]:
<matplotlib.text.Text at 0x7f54a0e5f0b8>
Out[44]:
<matplotlib.text.Text at 0x7f54a0e82f60>

In [45]:
plt.figure()
plt.semilogx(x, y, 'o-')
plt.xlabel("Length $n$ of the binary sequence $S$")
plt.ylabel(r"Time in $\mu\;\mathrm{s}$")
plt.title("Time complexity of Lempel-Ziv complexity, semilogx scale")
plt.show()


Out[45]:
<matplotlib.figure.Figure at 0x7f549d598400>
Out[45]:
[<matplotlib.lines.Line2D at 0x7f549d539470>]
Out[45]:
<matplotlib.text.Text at 0x7f549d5741d0>
Out[45]:
<matplotlib.text.Text at 0x7f54a0dfc860>
Out[45]:
<matplotlib.text.Text at 0x7f549d54a358>

In [46]:
plt.figure()
plt.semilogy(x, y, 'o-')
plt.xlabel("Length $n$ of the binary sequence $S$")
plt.ylabel(r"Time in $\mu\;\mathrm{s}$")
plt.title("Time complexity of Lempel-Ziv complexity, semilogy scale")
plt.show()


Out[46]:
<matplotlib.figure.Figure at 0x7f549d4b7198>
Out[46]:
[<matplotlib.lines.Line2D at 0x7f549d552470>]
Out[46]:
<matplotlib.text.Text at 0x7f549d4cf240>
Out[46]:
<matplotlib.text.Text at 0x7f549d4ab160>
Out[46]:
<matplotlib.text.Text at 0x7f549d552278>

In [47]:
plt.figure()
plt.loglog(x, y, 'o-')
plt.xlabel("Length $n$ of the binary sequence $S$")
plt.ylabel(r"Time in $\mu\;\mathrm{s}$")
plt.title("Time complexity of Lempel-Ziv complexity, loglog scale")
plt.show()


Out[47]:
<matplotlib.figure.Figure at 0x7f549d53f358>
Out[47]:
[<matplotlib.lines.Line2D at 0x7f549d29bcf8>]
Out[47]:
<matplotlib.text.Text at 0x7f549d326e10>
Out[47]:
<matplotlib.text.Text at 0x7f549d3180b8>
Out[47]:
<matplotlib.text.Text at 0x7f549d2abac8>

It is linear in $\log\log$ scale, so indeed the algorithm seems to have a linear complexity.

To sum-up, for a sequence $S$ of length $n$, it takes $\mathcal{O}(n)$ basic operations to compute its Lempel-Ziv complexity $\mathrm{Lempel}-\mathrm{Ziv}(S)$.


Conclusion

  • The Lempel-Ziv complexity is not too hard to implement, and it indeed represents a certain complexity of a binary sequence, capturing the regularity and reproducibility of the sequence.

  • Using the Cython was quite useful to have a $\simeq \times 100$ speed up on our manual naive implementation !

  • The algorithm is not easy to analyze, we have a trivial $\mathcal{O}(n^2)$ bound but experiments showed it is more likely to be $\mathcal{O}(n \log n)$ in the worst case, and $\mathcal{O}(n)$ in practice for "not too complicated sequences" (or in average, for random sequences).


(Experimental) Julia implementation

I want to (quickly) try to see if I can use Julia to write a faster version of this function. See issue #1.

Disclaimer: I am still learning Julia!


In [16]:
%%time
%%script julia

"""Lempel-Ziv complexity for a binary sequence, in simple Julia code."""
function lempel_ziv_complexity(binary_sequence)
    u, v, w = 0, 1, 1
    v_max = 1
    size = length(binary_sequence)
    complexity = 1
    while true
        if binary_sequence[u + v] == binary_sequence[w + v]
            v += 1
            if w + v >= size
                complexity += 1
                break
            end
        else
            if v > v_max
                v_max = v
            end
            u += 1
            if u == w
                complexity += 1
                w += v_max
                if w > size
                    break
                else
                    u = 0
                    v = 1
                    v_max = 1
                end
            else
                v = 1
            end
        end
    end
    return complexity
end

s = "1001111011000010"
lempel_ziv_complexity(s)  # 1 / 0 / 01 / 1110 / 1100 / 0010

M = 100;
N = 10000;
for _ in 1:M
    s = join(rand(0:1, N));
    lempel_ziv_complexity(s);
end
lempel_ziv_complexity(s)  # 1 / 0 / 01 / 1110 / 1100 / 0010


"Lempel-Ziv complexity for a binary sequence, in simple Julia code."
lempel_ziv_complexity (generic function with 1 method)
"1001111011000010"
6
100
10000
778
CPU times: user 0 ns, sys: 4 ms, total: 4 ms
Wall time: 4.86 s

And to compare it fairly, let us use Pypy for comparison.


In [17]:
%%time
%%pypy

def lempel_ziv_complexity(binary_sequence):
    """Lempel-Ziv complexity for a binary sequence, in simple Python code."""
    u, v, w = 0, 1, 1
    v_max = 1
    length = len(binary_sequence)
    complexity = 1
    while True:
        if binary_sequence[u + v - 1] == binary_sequence[w + v - 1]:
            v += 1
            if w + v >= length:
                complexity += 1
                break
        else:
            if v > v_max:
                v_max = v
            u += 1
            if u == w:
                complexity += 1
                w += v_max
                if w > length:
                    break
                else:
                    u = 0
                    v = 1
                    v_max = 1
            else:
                v = 1
    return complexity

s = "1001111011000010"
lempel_ziv_complexity(s)  # 1 / 0 / 01 / 1110 / 1100 / 0010

from random import random

M = 100
N = 10000
for _ in range(M):
    s = ''.join(str(int(random() < 0.5)) for _ in range(N))
    lempel_ziv_complexity(s)


CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 7.89 s

So we can check that on these 100 random trials on strings of size 10000, the naive Julia version is about twice as fast as the naive Python version (executed by Pypy for speedup).

That's good, but it's not much...


Ending notes

Thanks for reading! My implementation is now open-source and available on GitHub, on https://github.com/Naereen/Lempel-Ziv_Complexity.

It will be available from PyPi very soon, see https://pypi.python.org/pypi/lempel_ziv_complexity.

See this repo on GitHub for more notebooks, or on nbviewer.jupyter.org.

That's it for this demo! See you, folks!