CS1001.py

Extended Introduction to Computer Science with Python, Tel-Aviv University, Spring 2013

Recitation 10 - 16-20.5.2013

Last update: 16.5.2013

Rabin-Karp algorithm for string matching

We have a text of length n and a pattern of length m, and we want to find all occurrences of the pattern in the text.

Naively, we would compare each of the n-m+1 substrings of length m in the text to the pattern, but this has a complexity of $O((n-m+1) \cdot m)=O(nm)$.

We want to improve this using a hash table (as was done for finding repeating substrings in recitation 8). Then we can calculate the hash of the pattern and of all substrings of length m in the text and compare the hashes. Calculating each hash, though, takes $O(m) \;$ and therefore we still have a complexity of $O(nm)$.
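
For reference, here is a minimal sketch of the naive approach mentioned above (the name find_matches_naive is ours, for illustration only):

def find_matches_naive(pattern, text):
    """ straightforward O(nm) search: compare the pattern
    against every substring of the same length """
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]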

Rolling hash

The trick of the Rabin-Karp algorithm is using a rolling hash:

if we have already calculated the hash of the substring of length m that starts at position i, then calculating the hash of the substring of length m that starts at position i+1 takes only $O(1)$.

The complexity is now much better - we need to calculate the hash of the pattern and of the substring that starts at position 0, each with complexity $O(m)$. In addition, we need to calculate the hash of each of the remaining n-m substrings, each taking $O(1)$ thanks to the rolling hash. Altogether this is $O(n+m) \;$, much smaller than $O(nm)$.
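
Concretely, writing the hash that arithmetize (below) computes for the substring of length m starting at position i as $h_i = \left( \sum_{j=0}^{m-1} \mathrm{ord}(t_{i+j}) \cdot b^{m-1-j} \right) \bmod r \;$, where b is the basis and r is the modulus, the rolling update used by arithmetize_text is

$h_{i+1} = \left( \left( h_i - \mathrm{ord}(t_i) \cdot b^{m-1} \right) \cdot b + \mathrm{ord}(t_{i+m}) \right) \bmod r$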


In [3]:
def arithmetize(text, basis=2**16, r=2**32-3):
    """ convert substring to number using basis powers
    employs Horner method modulo r """
    partial_sum = 0
    for ch in text:
        partial_sum = (partial_sum * basis + ord(ch)) % r
    return partial_sum

In [4]:
ord("b"), ord("e"), ord("n")


Out[4]:
(98, 101, 110)

In [5]:
arithmetize("b"), arithmetize("be"), arithmetize("ben")


Out[5]:
(98, 6422629, 6619540)
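
Tracing Horner's method by hand for the last call, with $b = 2^{16} \;$ and $r = 2^{32}-3 \;$, reproduces the output above:

$\mathrm{arithmetize}(\text{"ben"}) = \big( (98 \cdot 2^{16} + 101) \cdot 2^{16} + 110 \big) \bmod (2^{32}-3) = 420913414254 \bmod 4294967293 = 6619540$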

In [6]:
def arithmetize_text_naive(text, m, basis=2**16, r=2**32-3):
    """ computes arithmization of all m long substrings
    of text, using basis powers """
    return [arithmetize(text[i:i + m], basis, r) for i in range(len(text) - m + 1)]

def arithmetize_text(text, m, basis=2**16, r=2**32-3):
    """ efficiently computes arithmetization of all m long
    substrings of text, using basis powers """
    b_power = basis ** (m - 1)
    lst = [None]*(len(text) - m + 1) 
    # lst[0] equals first word arithmetization
    lst[0] = arithmetize(text[0:m], basis, r)
    for i in range(1, len(lst)):
        rolling_hash = (lst[i - 1] - ord(text[i - 1]) * b_power) * basis + ord(text[i + m - 1])
        rolling_hash %= r
        lst[i] = rolling_hash # hash of the substring starting at position i
    return lst

In [7]:
# Hebrew song lyrics, used here as a short sample text
song = '''אני כל כך עצוב לי ושמש על העיר
ודיזנגוף נראה לי כמו רכבת לילה לקהיר 
בין כל הצלילים מחפש סימן 
יושב בצד הדרך 
יושב ליד הזמן 

בחדר הסגור בדידות מקיר לקיר 
ואם אצא החוצה המצב רק יחמיר 
אז אני שומע צליל מאוד מוכר 
מגיע מלמטה רחוק מהכיכר'''

In [8]:
print(arithmetize_text(song, 1) == arithmetize_text_naive(song, 1))


True

In [9]:
print(arithmetize_text(song, 2)[:10])


[97519072, 98567641, 98107424, 2098651, 98239964, 98304032, 2098651, 98239962, 98172960, 2098658]

In [10]:
print(arithmetize_text(song, 2, 2**16 - 1)[:10])


[97517584, 98566137, 98105927, 2098619, 98238465, 98302532, 2098619, 98238463, 98171462, 2098626]

In [11]:
%timeit -n 5 arithmetize_text_naive(song, 5)
%timeit -n 5 arithmetize_text(song, 5)


5 loops, best of 3: 1.19 ms per loop
5 loops, best of 3: 417 us per loop

In [12]:
import urllib.request
with urllib.request.urlopen("http://www.gutenberg.org/cache/epub/2701/pg2701.txt") as f:
    book = f.read().decode('utf8')

In [13]:
print(book[:book.index('\n\r\n\r')])


The Project Gutenberg EBook of Moby Dick; or The Whale, by Herman Melville

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org

In [14]:
len(book)


Out[14]:
1257258

In [15]:
%timeit -n 3 arithmetize_text_naive(book[:len(book)//10], 3)
%timeit -n 3 arithmetize_text_naive(book, 3)


3 loops, best of 3: 310 ms per loop
3 loops, best of 3: 3.14 s per loop

In [16]:
%timeit -n 3 arithmetize_text(book[:len(book)//10], 3)
%timeit -n 3 arithmetize_text(book, 3)


3 loops, best of 3: 222 ms per loop
3 loops, best of 3: 2.23 s per loop

In [17]:
%timeit -n 3 arithmetize_text_naive(book[:len(book)//10], 3)
%timeit -n 3 arithmetize_text_naive(book[:len(book)//10], 5)
%timeit -n 3 arithmetize_text_naive(book[:len(book)//10], 10)


3 loops, best of 3: 310 ms per loop
3 loops, best of 3: 480 ms per loop
3 loops, best of 3: 903 ms per loop

In [18]:
%timeit -n 3 arithmetize_text(book[:len(book)//10], 3)
%timeit -n 3 arithmetize_text(book[:len(book)//10], 5)
%timeit -n 3 arithmetize_text(book[:len(book)//10], 10)


3 loops, best of 3: 220 ms per loop
3 loops, best of 3: 229 ms per loop
3 loops, best of 3: 246 ms per loop

In [19]:
def find_matches(pattern, text, basis=2**16, r=2**32-3):
    """ find all occurances of pattern in text
    using efficient arithmetization of text """
    assert len(pattern) <= len(text)
    pattern_hash = arithmetize(pattern, basis, r)
    text_hashes = arithmetize_text(text, len(pattern), basis, r)
    return [i for i,hs in enumerate(text_hashes) if hs == pattern_hash]

In [20]:
matches = find_matches("moby", book.lower())
print(matches[0], len(matches), book[matches[0]:matches[0] + 4])
matches = find_matches("aye, aye, sir", book.lower())
print(matches[0], len(matches), book[matches[0]:matches[0] + 13])


32 124 Moby
383330 5 Aye, aye, sir

Safe R-K

Now, remember that a substring can have the same hash as the pattern even if it is not equal to it (a false positive).

Here is a safe version in which we check that the substring does indeed equal the pattern:


In [21]:
def find_matches_safe(pattern, text, basis=2**16, r=2**32-3):
    """ find all occurances of pattern in text
    using efficient arithmetization of text """
    assert len(pattern) <= len(text)
    pattern_hash = arithmetize(pattern, basis, r)
    text_hashes = arithmetize_text(text, len(pattern), basis, r)
    matches = [i for i,hs in enumerate(text_hashes) if hs == pattern_hash and text[i:i+len(pattern)] == pattern]
    return matches

The following cell demonstrates short-circuit evaluation of the and operator: the second operand is evaluated only when the first one is True. This is why, in find_matches_safe, the full string comparison is performed only at positions whose hash matches the pattern's hash.


In [22]:
def foo(v, t):
    print(t)
    return v

print(foo(True, "foo") and foo(True, "bar"))   # both operands are evaluated
print(foo(False, "foo") and foo(True, "bar"))  # second operand is skipped


foo
bar
True
foo
False

Worst case

The worst case is when every substring matches the pattern: the safe version must then verify each of the n-m+1 candidates with a full m-character comparison, bringing it back to $O(nm)$:


In [23]:
text = "a" * 10**5
for pattern in ["a"*10**2, "a"*10**3, "a"*10**4]:
    for f in [find_matches, find_matches_safe]:
        %timeit -n 1 f(pattern, text)


1 loops, best of 3: 479 ms per loop
1 loops, best of 3: 567 ms per loop
1 loops, best of 3: 3.09 s per loop
1 loops, best of 3: 3.33 s per loop
1 loops, best of 3: 25.9 s per loop
1 loops, best of 3: 28.9 s per loop

But for normal texts there is hardly any difference:


In [24]:
for pattern in ["moby", "aye, aye, sir", "moby dick was his name"]:
    for f in [find_matches, find_matches_safe]:
        %timeit -n 1 f(pattern, book.lower())


1 loops, best of 3: 2.7 s per loop
1 loops, best of 3: 2.69 s per loop
1 loops, best of 3: 2.9 s per loop
1 loops, best of 3: 2.88 s per loop
1 loops, best of 3: 3.08 s per loop
1 loops, best of 3: 3.1 s per loop

Choice of r


In [26]:
find_matches("da","abracadabra", basis=2**16, r=2**16)


Out[26]:
[2, 4, 6, 9]

In [27]:
find_matches_safe("da","abracadabra", basis=2**16, r=2**16)


Out[27]:
[6]

Because the basis equals $r$, reducing modulo $r$ keeps only the rightmost character: every higher-order term is a multiple of $r$ and vanishes. When searching for "da" we actually search for "?a", i.e., any two-character substring that ends with "a".

$r$ should not be a power of the basis used for arithmetization of the string; more generally, we prefer that $r$ is not a power of small primes. Large primes are a good choice.
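
A quick way to see this with the functions defined above (the three values below should all equal ord("a")):

# with basis == r, only the last character survives the reduction modulo r
arithmetize("da", basis=2**16, r=2**16), arithmetize("?a", basis=2**16, r=2**16), ord("a")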

Alternative rolling hash

We will try a different hash function in which the hash is the sum of the characters of the string.


In [28]:
def arithmetize_sum(text, r=2**32-3):
    partial_sum = 0
    for ch in text:
        partial_sum = (partial_sum + ord(ch)) % r
    return partial_sum

In [29]:
def arithmetize_text_sum(text, m, r=2**32-3):
    lst = []
    lst.append(arithmetize_sum(text[:m], r))
    for i in range(1, len(text) - m + 1):
        rolling_hash = (lst[i-1] - ord(text[i-1]) + ord(text[i + m - 1]))
        rolling_hash %= r
        lst.append(rolling_hash)
    return lst

In [30]:
%timeit -n 3 arithmetize_text(song, 3)
%timeit -n 3 arithmetize_text_sum(song, 3)
%timeit -n 3 arithmetize_text(song, 30)
%timeit -n 3 arithmetize_text_sum(song, 30)


3 loops, best of 3: 411 us per loop
3 loops, best of 3: 309 us per loop
3 loops, best of 3: 540 us per loop
3 loops, best of 3: 290 us per loop

In [31]:
%timeit -n 3 arithmetize_text(book, 3)
%timeit -n 3 arithmetize_text_sum(book, 3)
%timeit -n 3 arithmetize_text(book, 10)
%timeit -n 3 arithmetize_text_sum(book, 10)


3 loops, best of 3: 2.36 s per loop
3 loops, best of 3: 1.76 s per loop
3 loops, best of 3: 2.66 s per loop
3 loops, best of 3: 1.73 s per loop

This rolling hash function is efficient - it seems to be at least as good as the previous one, if not faster, probably because less arithmetic is involved (the sum hash avoids the multiplications by $basis^{m-1} \;$ used in the polynomial hash).


In [32]:
arithmetize_sum("I am Lord Voldemort".lower())


Out[32]:
1828

In [33]:
arithmetize_sum("Tom Marvolo Riddle ".lower())


Out[33]:
1828

The problem is that anagrams collide - permutations of the same string get the same hash value.
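
As a sketch (find_matches_sum is our name, not part of the course code), the sum hash can still be used safely by verifying every candidate, but each anagram of the pattern triggers the expensive verification step:

def find_matches_sum(pattern, text, r=2**32-3):
    """ Rabin-Karp with the character-sum hash: every anagram
    of the pattern becomes a candidate and must be ruled out
    by the explicit string comparison """
    assert len(pattern) <= len(text)
    pattern_hash = arithmetize_sum(pattern, r)
    text_hashes = arithmetize_text_sum(text, len(pattern), r)
    return [i for i, hs in enumerate(text_hashes)
            if hs == pattern_hash and text[i:i + len(pattern)] == pattern]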

Fin

This notebook is part of the Extended introduction to computer science course at Tel-Aviv University.

The notebook was written using Python 3.2 and IPython 0.13.1.

The code is available at https://raw.github.com/yoavram/CS1001.py/master/recitation10.ipynb.

The notebook can be viewed online at http://nbviewer.ipython.org/urls/raw.github.com/yoavram/CS1001.py/master/recitation10.ipynb.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.