In [1]:
def f2(lst):
    i = 1
    while i < len(lst):
        j = 1
        while j < i:
            print(lst[i], lst[j])
            j += 5
        i *= 2
The inner loop does $O(i/5)$ iterations, because $j$ is bounded by $i$ and advances in steps of 5.
The value of $i$ after $k$ iterations of the outer loop is $2^k$, and the outer loop goes on until $2^k = n$, i.e. $k = \log_{2}{n}$.
So the total complexity is:
$$ O\left(\sum_{k=1}^{\log_{2}{n}}{\frac{2^k}{5}}\right) = O\left(\frac{1}{5}\sum_{k=1}^{\log_{2}{n}}{2^k}\right) = O\left(2^{\log_{2}{n}+1}-2\right) = O\left(2^{\log_{2}{n}}\right) = O(n) $$
where we used the geometric sum $\sum_{k=1}^{K}{2^k} = 2^{K+1}-2$.
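As a sanity check, here is a hypothetical counting variant of f2 (not part of the original lesson) that tallies inner-loop iterations instead of printing; the count should roughly double when the input length doubles:

def f2_count(n):
    """ count how many times f2's inner loop body runs for input length n """
    count = 0
    i = 1
    while i < n:
        j = 1
        while j < i:
            count += 1  # stands in for the print
            j += 5
        i *= 2
    return count

for n in [1000, 2000, 4000, 8000]:
    print(n, f2_count(n))  # grows linearly, matching the O(n) analysis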
In [2]:
def f3(lst):
    i = len(lst)
    while i > 0:
        for j in range(i):
            for k in range(j, 10**5):
                print(i)
        i -= 2
The innermost loop does an $O(1)$ operation and at most $10^5$ iterations, so its complexity is $O(1)$. Note that when $j \geq 10^5$, range(j, 10**5) is empty and the loop body never runs, because a range whose start is not smaller than its stop produces no iterations:
In [7]:
list(range(5,2))
Out[7]:
[]
The middle loop does at most $n$ iterations on the first pass of the outer loop, then $n-2$, etc.: $$ n + (n-2) + (n-4) + \ldots + 1 = O(n^2) $$
And therefore the total complexity is $O(n^2)$.
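To make the bound concrete, a hypothetical counting version of f3 (an assumption of this sketch, not from the lesson) can tally the innermost iterations analytically, without actually running them; the count should roughly quadruple when n doubles:

def f3_count(n):
    """ count f3's innermost-loop iterations for input length n, without printing """
    count = 0
    i = n
    while i > 0:
        for j in range(i):
            count += max(0, 10**5 - j)  # number of elements in range(j, 10**5)
        i -= 2
    return count

for n in [200, 400, 800]:
    print(n, f3_count(n))  # roughly quadruples when n doubles -> O(n^2)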
We want a dynamic dictionary supporting insert, remove, and search.
The keys come from a very large set $U$, but the number of distinct elements we expect to store is $n$, where $n \ll |U|$.
We will use a table of size $m$, where $m \approx n$.
For example, $U$ is the set of 8-digit ID numbers, so $|U| = 10^8$.
$n$ is the size of the Israeli population, so $m \approx n \approx 10^7$.
We need a function $h: U \to \{0, 1, \ldots, m-1\}$.
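For example, composing Python's built-in hash with a modulo gives such a function (a minimal illustration; the exact bucket indices vary between runs, since Python randomizes string hashing by default):

m = 100
h = lambda x: hash(x) % m  # h: U -> {0, 1, ..., m-1}
print(h(12345678), h("yoav"))  # two keys from a huge universe mapped into m buckets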
In [1]:
class MyHashtable:
    def __init__(self, m, hash_func=hash):
        """ initialize hash table with m empty entries """
        self.data = [[] for i in range(m)]  # collisions are resolved by chaining in lists
        self.hash_mod = lambda x: hash_func(x) % m

    def find(self, item):
        """ returns True if item is in the hashtable, False otherwise """
        i = self.hash_mod(item)
        try:
            self.data[i].index(item)  # raises ValueError if item is not in the bucket
            return True
        except ValueError:
            return False

    def insert(self, item):
        """ insert an item into the table, unless it is already there """
        i = self.hash_mod(item)
        try:
            self.data[i].index(item)
        except ValueError:
            self.data[i].append(item)
In [2]:
ht = MyHashtable(100)
In [3]:
ht.insert("yoav")
ht.insert("amir")
ht.insert("amiram")
ht.insert("haim")
In [4]:
ht.find("yoav")
Out[4]:
True
In [5]:
ht.find("inon")
Out[5]:
False
In [6]:
def repeat_naive(st, k):
    for i in range(len(st) - k + 1):
        for j in range(i + 1, len(st) - k + 1):
            if st[i : i + k] == st[j : j + k]:
                return True
    return False
In [9]:
song = '''So, so you think you can tell
Heaven from Hell,
Blue sky's from pain.
Can you tell a green field
From a cold steel rail?
A smile from a veil?
Do you think you can tell?
And did they get you to trade
Your heroes for ghosts?
Hot ashes for trees?
Hot air for a cool breeze?
Cold comfort for change?
And did you exchange
A walk on part in the war
For a lead role in a cage?
How I wish, how I wish you were here.
We're just two lost souls
Swimming in a fish bowl,
Year after year,
Running over the same old ground.
And how we found
The same old fears.
Wish you were here.'''
In [8]:
print(repeat_naive(song, 5))
%timeit -n 10 repeat_naive(song, 5)
print(repeat_naive(song, 30))
%timeit -n 10 repeat_naive(song, 30)
This is fast but it's a short string.
Let's try it with a larger string - "The Adventures of Tom Sawyer" by Mark Twain, from Project Gutenberg.
We use the builtin urllib.request.urlopen to get the book from the web; this function returns a file-like object. The text is read as bytes, so we have to tell Python to decode it, usually using UTF-8, one of the Unicode encodings.
This decoding allows us to read text in non-Latin scripts (like Hebrew, Arabic and Russian).
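A minimal illustration of decoding (the byte values below are simply the UTF-8 encoding of a Hebrew word):

data = b'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'  # UTF-8 bytes of the Hebrew word "שלום"
print(data.decode('utf-8'))  # prints: שלום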
In [16]:
import urllib.request

with urllib.request.urlopen("http://www.gutenberg.org/cache/epub/74/pg74.txt") as r:
    book = r.read().decode('utf-8')
In [17]:
print(book[:book.index('\n\r')])
Looking for a short repeat is quick:
In [18]:
print(repeat_naive(book, 5))
%timeit -n 3 repeat_naive(book, 5)
But searching for a longer repeat takes a very long time:
In [110]:
print(repeat_naive(book, 30))
%timeit -n 1 repeat_naive(book, 30)
The worst case happens when there is no such repeat, which is likely when $k$ is large.
The number of iterations is $O((n-k)^2)$. In each iteration we compare substrings of length $k$, which is $O(k)$, so the total complexity is $O(k(n-k)^2)$. Because we consider $k = O(1)$, we get $O(n^2)$.
With "Wish You Were Here" we had $n=589$, so the number of character comparisons was about $589^2 = 346,921$.
But with "The Adventures of Tom Sawyer" we had $n=421869$, and roughly $421869^2 \approx 177,973,453,161$ character comparisons.
In [5]:
def repeat_hash1(st, k, m=0):
    if m == 0:  # default hash table size is ~number of substrings to be inserted
        m = len(st) - k
    htable = MyHashtable(m)
    for i in range(len(st) - k + 1):
        if htable.find(st[i : i + k]):
            return True
        else:
            htable.insert(st[i : i + k])
    return False
In [6]:
%timeit -n 10 repeat_hash1(song, 5)
%timeit -n 10 repeat_hash1(song, 30)
In [7]:
%timeit -n 3 repeat_hash1(book, 5)
%timeit -n 1 repeat_hash1(book, 30)
In [8]:
print(repeat_hash1(book, 1000))
%timeit -n 1 repeat_hash1(book, 1000)
Creating the hashtable takes $O(m) = O(n-k)$. The number of iterations is $O(n-k)$. In each iteration we search a list whose expected length is $O(1)$, and comparing $k$-character substrings costs $O(k)$, so each iteration is $O(k)$ on average. The total average complexity is $O(k(n-k)) = O(n)$ for $k = O(1)$.
The worst case is not as good: all elements get the same hash value and enter the same list, so the complexity is $O(k(n-k)^2)$. But the probability of this is very low.
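The worst case can be forced by passing MyHashtable a degenerate hash function that sends every key to the same bucket (a small sketch, using the hash_func parameter defined above):

bad_ht = MyHashtable(100, hash_func=lambda x: 0)  # every item hashes to bucket 0
for word in ["yoav", "amir", "amiram", "haim"]:
    bad_ht.insert(word)
print(len(bad_ht.data[0]))  # 4 - all items chain in one list, so find degrades to a linear scan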
Which of Python's data structures is appropriate?

- list - mutable and ordered
- string, tuple - immutable and ordered
- dict - mutable and unordered with unique immutable keys
- set - mutable and unordered with unique immutable values

set is just the right structure - we need a structure we can insert items into and check if items are already in it.
In [9]:
def repeat_hash2(st, k, m=0):  # m is unused; kept for interface compatibility with repeat_hash1
    htable = set()  # Python sets use hash functions for fast lookup
    for i in range(len(st) - k + 1):
        if st[i : i + k] not in htable:
            htable.add(st[i : i + k])
        else:
            return True
    return False
In [122]:
%timeit -n 10 repeat_hash2(song, 5)
%timeit -n 10 repeat_hash2(song, 30)
In [123]:
%timeit -n 10 repeat_hash2(book, 5)
%timeit -n 10 repeat_hash2(book, 30)
In [125]:
%timeit -n 1 repeat_hash2(book, 1000)
An interesting use for this kind of repeat function is to find a recurring DNA motif in a DNA string.
In this case the alphabet is much smaller - 4 letters: A, G, C, T - so the chance of repetition is higher and the worst case happens less frequently.
But DNA strings are very long.
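A back-of-the-envelope check of the alphabet-size argument: there are only $4^k$ distinct length-$k$ substrings over this alphabet, so once a string has more than $4^k$ windows of length $k$, a repeat is guaranteed by the pigeonhole principle (the length used here is an assumption, roughly that of the lacZ sequence below):

n = 3072  # assumed sequence length, roughly that of lacZ
for k in [5, 10, 12]:
    windows = n - k + 1
    print(k, 4**k, windows, windows > 4**k)  # True means a repeat is guaranteed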
Let's look at the lacZ gene sequence (from http://berry.engin.umich.edu/gene2oligo/lacZ.txt):
In [10]:
import urllib.request
In [11]:
with urllib.request.urlopen("http://berry.engin.umich.edu/gene2oligo/lacZ.txt") as r:
    lacZ = r.read().decode('utf-8')
In [12]:
print(lacZ)
We need to remove the first two lines, remove newlines and change lowercase letters to uppercase:
In [13]:
lacZ = ''.join(lacZ.split('\n')[2:]).upper()
print(len(lacZ))
Now we can use the repeat functions to check for repeating subsequences:
In [149]:
print(repeat_hash2(lacZ, 10))
print(repeat_hash2(lacZ, 12))
In [14]:
%timeit -n 10 repeat_hash1(lacZ,12)
%timeit -n 10 repeat_hash2(lacZ,12)
This notebook is part of the Extended introduction to computer science course at Tel-Aviv University.
The notebook was written using Python 3.2 and IPython 0.13.1.
The code is available at https://raw.github.com/yoavram/CS1001.py/master/recitation8.ipynb.
The notebook can be viewed online at http://nbviewer.ipython.org/urls/raw.github.com/yoavram/CS1001.py/master/recitation8.ipynb.
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.