We have a text of length n and a pattern of length m, and we want to find all occurrences of the pattern in the text.
Naively, we would compare each of the n-m+1 substrings of length m in the text to the pattern, which has complexity $O((n-m)m)=O(nm)$.
We want to improve this using hashing (as was done for finding repeating substrings in recitation 8): we calculate the hash of the pattern and of every length-m substring of the text and compare the hashes. Calculating each hash, though, takes $O(m)$, so the complexity is still $O(nm)$.
The trick of the Rabin-Karp algorithm is to use a rolling hash: if we have already calculated the hash of the length-$m$ substring starting at position $i$, the hash of the length-$m$ substring starting at position $i+1$ can be calculated in $O(1)$.
The complexity now is much better: we calculate the hash of the pattern and of the substring starting at position 0, each in $O(m)$. Each of the remaining n-m substrings is then hashed in $O(1)$ thanks to the rolling update. Altogether this is $O(n+m)$, much smaller than $O(nm)$.
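Concretely, with basis $b$ and modulus $r$, and writing $t[i]$ for the character code at position $i$ (the same quantities used in the code below), the hash of the length-$m$ window starting at position $i$ is

$$h_i = \left(\sum_{j=0}^{m-1} t[i+j]\, b^{m-1-j}\right) \bmod r$$

and the rolling update that yields the next window's hash in $O(1)$ is

$$h_{i+1} = \left(\left(h_i - t[i]\, b^{m-1}\right)\cdot b + t[i+m]\right) \bmod r.$$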
In [3]:
def arithmetize(text, basis=2**16, r=2**32-3):
    """ convert substring to number using basis powers
        employs Horner method modulo r """
    partial_sum = 0
    for ch in text:
        partial_sum = (partial_sum * basis + ord(ch)) % r
    return partial_sum
In [4]:
ord("b"), ord("e"), ord("n")
Out[4]:
(98, 101, 110)
In [5]:
arithmetize("b"), arithmetize("be"), arithmetize("ben")
Out[5]:
(98, 6422629, 6619540)
In [6]:
def arithmetize_text_naive(text, m, basis=2**16, r=2**32-3):
    """ computes arithmetization of all m long substrings
        of text, using basis powers """
    return [arithmetize(text[i:i + m], basis, r) for i in range(len(text) - m + 1)]

def arithmetize_text(text, m, basis=2**16, r=2**32-3):
    """ efficiently computes arithmetization of all m long
        substrings of text, using basis powers """
    b_power = basis ** (m - 1)
    lst = [None] * (len(text) - m + 1)
    # lst[0] equals the arithmetization of the first window
    lst[0] = arithmetize(text[0:m], basis, r)
    for i in range(1, len(lst)):
        # rolling update: drop the leftmost character, bring in the next one
        rolling_hash = (lst[i - 1] - ord(text[i - 1]) * b_power) * basis + ord(text[i + m - 1])
        rolling_hash %= r
        lst[i] = rolling_hash
    return lst
In [7]:
# sample text: Hebrew song lyrics
song = '''אני כל כך עצוב לי ושמש על העיר
ודיזנגוף נראה לי כמו רכבת לילה לקהיר
בין כל הצלילים מחפש סימן
יושב בצד הדרך
יושב ליד הזמן
בחדר הסגור בדידות מקיר לקיר
ואם אצא החוצה המצב רק יחמיר
אז אני שומע צליל מאוד מוכר
מגיע מלמטה רחוק מהכיכר'''
In [8]:
print(arithmetize_text(song, 1) == arithmetize_text_naive(song, 1))
In [9]:
print(arithmetize_text(song, 2)[:10])
In [10]:
print(arithmetize_text(song, 2, 2**16 - 1)[:10])
In [11]:
%timeit -n 5 arithmetize_text_naive(song, 5)
%timeit -n 5 arithmetize_text(song, 5)
In [12]:
import urllib.request

with urllib.request.urlopen("http://www.gutenberg.org/cache/epub/2701/pg2701.txt") as f:
    book = f.read().decode('utf8')
In [13]:
print(book[:book.index('\n\r\n\r')])
In [14]:
len(book)
Out[14]:
In [15]:
%timeit -n 3 arithmetize_text_naive(book[:len(book)//10], 3)
%timeit -n 3 arithmetize_text_naive(book, 3)
In [16]:
%timeit -n 3 arithmetize_text(book[:len(book)//10], 3)
%timeit -n 3 arithmetize_text(book, 3)
In [17]:
%timeit -n 3 arithmetize_text_naive(book[:len(book)//10], 3)
%timeit -n 3 arithmetize_text_naive(book[:len(book)//10], 5)
%timeit -n 3 arithmetize_text_naive(book[:len(book)//10], 10)
In [18]:
%timeit -n 3 arithmetize_text(book[:len(book)//10], 3)
%timeit -n 3 arithmetize_text(book[:len(book)//10], 5)
%timeit -n 3 arithmetize_text(book[:len(book)//10], 10)
In [19]:
def find_matches(pattern, text, basis=2**16, r=2**32-3):
    """ find all occurrences of pattern in text
        using efficient arithmetization of text """
    assert len(pattern) <= len(text)
    pattern_hash = arithmetize(pattern, basis, r)
    text_hashes = arithmetize_text(text, len(pattern), basis, r)
    return [i for i, hs in enumerate(text_hashes) if hs == pattern_hash]
In [20]:
matches = find_matches("moby", book.lower())
print(matches[0], len(matches), book[matches[0]:matches[0] + 4])
matches = find_matches("aye, aye, sir", book.lower())
print(matches[0], len(matches), book[matches[0]:matches[0] + 13])
Hash collisions can produce false positives, so a safe version verifies each candidate match by comparing the actual substring to the pattern:
In [21]:
def find_matches_safe(pattern, text, basis=2**16, r=2**32-3):
    """ find all occurrences of pattern in text
        using efficient arithmetization of text;
        verifies candidate matches to rule out hash collisions """
    assert len(pattern) <= len(text)
    pattern_hash = arithmetize(pattern, basis, r)
    text_hashes = arithmetize_text(text, len(pattern), basis, r)
    matches = [i for i, hs in enumerate(text_hashes)
               if hs == pattern_hash and text[i:i + len(pattern)] == pattern]
    return matches
Note that `and` short-circuits: the right-hand operand is evaluated only when the left-hand one is true, so the substring comparison in find_matches_safe only runs when the hashes already agree:
In [22]:
def foo(v, t):
    print(t)
    return v

print(foo(True, "foo") and foo(True, "bar"))
print(foo(False, "foo") and foo(True, "bar"))
On an adversarial input where every substring matches the pattern, the extra verification in find_matches_safe is costly:
In [23]:
text = "a" * 10**5
for pattern in ["a" * 10**2, "a" * 10**3, "a" * 10**4]:
    for f in [find_matches, find_matches_safe]:
        %timeit -n 1 f(pattern, text)
But for normal texts there is hardly any difference:
In [24]:
for pattern in ["moby", "aye, aye, sir", "moby dick was his name"]:
    for f in [find_matches, find_matches_safe]:
        %timeit -n 1 f(pattern, book.lower())
In [26]:
find_matches("da","abracadabra", basis=2**16, r=2**16)
Out[26]:
In [27]:
find_matches_safe("da","abracadabra", basis=2**16, r=2**16)
Out[27]:
Because $b = r$, taking the result modulo $r$ keeps only the rightmost character: when searching for "da" we actually search for "?a". Therefore $r$ had better not be a power of the basis used for the arithmetization of the string, and preferably not a power of small primes at all. Large primes are a good choice.
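To make the failure mode concrete, here is a minimal self-contained sketch (re-defining `arithmetize` as above so the snippet runs on its own): with `r` equal to the basis, every string ending in the same character hashes to that character's code, regardless of the preceding characters.

```python
def arithmetize(text, basis=2**16, r=2**32 - 3):
    # Horner evaluation of the string as a base-`basis` number, modulo r
    partial_sum = 0
    for ch in text:
        partial_sum = (partial_sum * basis + ord(ch)) % r
    return partial_sum

# With r == basis, the term partial_sum * basis vanishes modulo r,
# so only the last character survives:
print(arithmetize("da", basis=2**16, r=2**16))  # 97 == ord("a")
print(arithmetize("xa", basis=2**16, r=2**16))  # 97 as well - a collision
print(arithmetize("da") == arithmetize("xa"))   # False with the default r
```

With the default large modulus the two strings are distinguished, so the choice of $r$ alone determines whether this family of collisions appears.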
In [28]:
def arithmetize_sum(text, r=2**32-3):
    """ sum of the character values of text, modulo r """
    partial_sum = 0
    for ch in text:
        partial_sum = (partial_sum + ord(ch)) % r
    return partial_sum
In [29]:
def arithmetize_text_sum(text, m, r=2**32-3):
    """ computes the additive hash of all m long substrings of text,
        updating the sum in O(1) per substring """
    lst = [arithmetize_sum(text[:m], r)]
    for i in range(1, len(text) - m + 1):
        rolling_hash = (lst[i - 1] - ord(text[i - 1]) + ord(text[i + m - 1])) % r
        lst.append(rolling_hash)
    return lst
In [30]:
%timeit -n 3 arithmetize_text(song, 3)
%timeit -n 3 arithmetize_text_sum(song, 3)
%timeit -n 3 arithmetize_text(song, 30)
%timeit -n 3 arithmetize_text_sum(song, 30)
In [31]:
%timeit -n 3 arithmetize_text(book, 3)
%timeit -n 3 arithmetize_text_sum(book, 3)
%timeit -n 3 arithmetize_text(book, 10)
%timeit -n 3 arithmetize_text_sum(book, 10)
This rolling hash function is efficient: it seems at least as fast as the previous one, and sometimes faster, probably because less arithmetic is involved.
In [32]:
arithmetize_sum("I am Lord Voldemort".lower())
Out[32]:
In [33]:
arithmetize_sum("Tom Marvolo Riddle ".lower())
Out[33]:
The problem is that anagrams collide - permutations of the same string get the same hash value.
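A minimal self-contained illustration (re-defining both hash functions as above so it runs on its own): any two permutations of the same characters collide under the additive hash, while the positional, base-power hash distinguishes them.

```python
def arithmetize_sum(text, r=2**32 - 3):
    # order-insensitive hash: just the sum of character codes, modulo r
    return sum(ord(ch) for ch in text) % r

def arithmetize(text, basis=2**16, r=2**32 - 3):
    # order-sensitive hash: Horner evaluation in base `basis`, modulo r
    partial_sum = 0
    for ch in text:
        partial_sum = (partial_sum * basis + ord(ch)) % r
    return partial_sum

print(arithmetize_sum("listen") == arithmetize_sum("silent"))  # True - anagrams collide
print(arithmetize("listen") == arithmetize("silent"))          # False - positions matter
```

This is why the base-power arithmetization is used for Rabin-Karp despite the additive hash being slightly cheaper per step.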
This notebook is part of the Extended introduction to computer science course at Tel-Aviv University.
The notebook was written using Python 3.2 and IPython 0.13.1.
The code is available at https://raw.github.com/yoavram/CS1001.py/master/recitation10.ipynb.
The notebook can be viewed online at http://nbviewer.ipython.org/urls/raw.github.com/yoavram/CS1001.py/master/recitation10.ipynb.
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.