In [1]:
def f2(lst):
    i = 1
    while i < len(lst):
        j = 1
        while j < i:
            print(lst[i], lst[j])
            j += 5
        i *= 2
The inner loop does $O(i/5)$ iterations, because $j$ is bounded by $i$ and advances in steps of 5.
The value of $i$ after $k$ iterations of the outer loop is $2^k$, and the outer loop goes on until $2^k = n$, i.e. $k = \log_{2}{n}$.
So the total complexity is:
$$ O\left(\sum_{k=1}^{\log_{2}{n}}{\frac{2^k}{5}}\right) = O\left(\frac{1}{5}\sum_{k=1}^{\log_{2}{n}}{2^k}\right) = O\left(2^{\log_{2}{n}+1}-2\right) = O\left(2^{\log_{2}{n}}\right) = O(n) $$
where we used the geometric sum $\sum_{k=1}^{K}{2^k} = 2^{K+1}-2$.
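As a sanity check, here is a hypothetical counting variant of f2 (not part of the original lesson) that tallies inner-loop iterations instead of printing; the count should roughly double when the input length doubles:

def f2_count(n):
    """ count how many times f2's inner loop body runs for input length n """
    count = 0
    i = 1
    while i < n:
        j = 1
        while j < i:
            count += 1  # stands in for the print
            j += 5
        i *= 2
    return count

for n in [1000, 2000, 4000, 8000]:
    print(n, f2_count(n))  # grows linearly, matching the O(n) analysis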
In [2]:
def f3(lst):
    i = len(lst)
    while i > 0:
        for j in range(i):
            for k in range(j, 10**5):
                print(i)
        i -= 2
The innermost loop does an $O(1)$ operation and at most $10^5$ iterations, so its complexity is $O(1)$. Note that when $j \geq 10^5$, range(j, 10**5) is empty and the loop body never runs, because a range whose start is not smaller than its stop produces no iterations:
In [7]:
list(range(5,2))
Out[7]:
[]
The middle loop does at most $n$ iterations on the first pass of the outer loop, then $n-2$, etc.: $$ n + (n-2) + (n-4) + \ldots + 1 = O(n^2) $$
And therefore the total complexity is $O(n^2)$.
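To make the bound concrete, a hypothetical counting version of f3 (an assumption of this sketch, not from the lesson) can tally the innermost iterations analytically, without actually running them; the count should roughly quadruple when n doubles:

def f3_count(n):
    """ count f3's innermost-loop iterations for input length n, without printing """
    count = 0
    i = n
    while i > 0:
        for j in range(i):
            count += max(0, 10**5 - j)  # number of elements in range(j, 10**5)
        i -= 2
    return count

for n in [200, 400, 800]:
    print(n, f3_count(n))  # roughly quadruples when n doubles -> O(n^2)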
We want a dynamic dictionary supporting insert, remove, and search.
The keys come from a very large set $U$, but the number of distinct elements we expect to store is $n$, where $n \ll |U|$.
We will use a table of size $m$, where $m \approx n$.
For example, $U$ is the set of 8-digit ID numbers, so $|U| = 10^8$.
$n$ is the size of the Israeli population, so $m \approx n \approx 10^7$.
We need a function $h: U \to \{0, 1, \ldots, m-1\}$.
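For example, composing Python's built-in hash with a modulo gives such a function (a minimal illustration; the exact bucket indices vary between runs, since Python randomizes string hashing by default):

m = 100
h = lambda x: hash(x) % m  # h: U -> {0, 1, ..., m-1}
print(h(12345678), h("yoav"))  # two keys from a huge universe mapped into m buckets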
In [1]:
class MyHashtable:
    def __init__(self, m, hash_func=hash):
        """ initialize hash table with m empty entries """
        self.data = [[] for i in range(m)]  # collisions are resolved by chaining in lists
        self.hash_mod = lambda x: hash_func(x) % m

    def find(self, item):
        """ returns True if item is in the hashtable, False otherwise """
        i = self.hash_mod(item)
        try:
            self.data[i].index(item)  # raises ValueError if item is not in the bucket
            return True
        except ValueError:
            return False

    def insert(self, item):
        """ insert an item into the table, unless it is already there """
        i = self.hash_mod(item)
        try:
            self.data[i].index(item)
        except ValueError:
            self.data[i].append(item)
In [2]:
ht = MyHashtable(100)
In [3]:
ht.insert("yoav")
ht.insert("amir")
ht.insert("amiram")
ht.insert("haim")
In [4]:
ht.find("yoav")
Out[4]:
True
In [5]:
ht.find("inon")
Out[5]:
False
In [6]:
def repeat_naive(st, k):
    for i in range(len(st) - k + 1):
        for j in range(i + 1, len(st) - k + 1):
            if st[i : i + k] == st[j : j + k]:
                return True
    return False
In [9]:
song = '''So, so you think you can tell
Heaven from Hell,
Blue sky's from pain.
Can you tell a green field
From a cold steel rail?
A smile from a veil?
Do you think you can tell?
And did they get you to trade
Your heroes for ghosts?
Hot ashes for trees?
Hot air for a cool breeze?
Cold comfort for change?
And did you exchange
A walk on part in the war
For a lead role in a cage?
How I wish, how I wish you were here.
We're just two lost souls
Swimming in a fish bowl,
Year after year,
Running over the same old ground.
And how we found
The same old fears.
Wish you were here.'''
In [8]:
print(repeat_naive(song, 5))
%timeit -n 10 repeat_naive(song, 5)
print(repeat_naive(song, 30))
%timeit -n 10 repeat_naive(song, 30)
This is fast but it's a short string.
Let's try it with a larger string - "The Adventures of Tom Sawyer" by Mark Twain, from Project Gutenberg.
We use the builtin urllib.request.urlopen to get the book from the web; this function returns a file-like object. The text is read as bytes, so we have to tell Python to decode it, usually using UTF-8, one of the Unicode encodings.
This decoding allows us to read text in non-Latin scripts (like Hebrew, Arabic and Russian).
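A minimal illustration of decoding (the byte values below are simply the UTF-8 encoding of a Hebrew word):

data = b'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'  # UTF-8 bytes of the Hebrew word "שלום"
print(data.decode('utf-8'))  # prints: שלום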
In [16]:
import urllib.request

with urllib.request.urlopen("http://www.gutenberg.org/cache/epub/74/pg74.txt") as r:
    book = r.read().decode('utf-8')
In [17]:
print(book[:book.index('\n\r')])
Looking for a short repeat is quick:
In [18]:
print(repeat_naive(book, 5))
%timeit -n 3 repeat_naive(book, 5)
But searching for a longer repeat takes a very long time:
In [110]:
print(repeat_naive(book, 30))
%timeit -n 1 repeat_naive(book, 30)
The worst case happens when there is no such repeat, which is likely when $k$ is large.
The number of iterations is $O((n-k)^2)$. In each iteration we compare substrings of length $k$, which is $O(k)$, so the total complexity is $O(k(n-k)^2)$. Because we consider $k = O(1)$, we get $O(n^2)$.
With "Wish You Were Here" we had $n=589$, so the number of character comparisons was about $589^2 = 346,921$.
But with "The Adventures of Tom Sawyer" we had $n=421869$, and roughly $421869^2 \approx 177,973,453,161$ character comparisons.
In [5]:
def repeat_hash1(st, k, m=0):
    if m == 0:  # default hash table size is ~number of substrings to be inserted
        m = len(st) - k
    htable = MyHashtable(m)
    for i in range(len(st) - k + 1):
        if htable.find(st[i : i + k]):
            return True
        else:
            htable.insert(st[i : i + k])
    return False
In [6]:
%timeit -n 10 repeat_hash1(song, 5)
%timeit -n 10 repeat_hash1(song, 30)
In [7]:
%timeit -n 3 repeat_hash1(book, 5)
%timeit -n 1 repeat_hash1(book, 30)
In [8]:
print(repeat_hash1(book, 1000))
%timeit -n 1 repeat_hash1(book, 1000)
Creating the hashtable takes $O(m) = O(n-k)$. The number of iterations is $O(n-k)$. In each iteration we search a list whose expected length is $O(1)$, and comparing $k$-character substrings costs $O(k)$, so each iteration is $O(k)$ on average. The total average complexity is $O(k(n-k)) = O(n)$ for $k = O(1)$.
The worst case is not as good: all elements get the same hash value and enter the same list, so the complexity is $O(k(n-k)^2)$. But the probability of this is very low.
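The worst case can be forced by passing MyHashtable a degenerate hash function that sends every key to the same bucket (a small sketch, using the hash_func parameter defined above):

bad_ht = MyHashtable(100, hash_func=lambda x: 0)  # every item hashes to bucket 0
for word in ["yoav", "amir", "amiram", "haim"]:
    bad_ht.insert(word)
print(len(bad_ht.data[0]))  # 4 - all items chain in one list, so find degrades to a linear scan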
Which of Python's data structures is appropriate?

- list - mutable and ordered
- string, tuple - immutable and ordered
- dict - mutable and unordered with unique immutable keys
- set - mutable and unordered with unique immutable values

set is just the right structure - we need a structure we can insert items into and check if items are already in it.
In [9]:
def repeat_hash2(st, k, m=0):  # m is unused; kept for interface compatibility with repeat_hash1
    htable = set()  # Python sets use hash functions for fast lookup
    for i in range(len(st) - k + 1):
        if st[i : i + k] not in htable:
            htable.add(st[i : i + k])
        else:
            return True
    return False
In [122]:
%timeit -n 10 repeat_hash2(song, 5)
%timeit -n 10 repeat_hash2(song, 30)
In [123]:
%timeit -n 10 repeat_hash2(book, 5)
%timeit -n 10 repeat_hash2(book, 30)
In [125]:
%timeit -n 1 repeat_hash2(book, 1000)
An interesting use for this kind of repeat function is to find a recurring DNA motif in a DNA string.
In this case the alphabet is much smaller - 4 letters: A, G, C, T - so the chance of repetition is higher and the worst case happens less frequently.
But DNA strings are very long.
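A back-of-the-envelope check of the alphabet-size argument: there are only $4^k$ distinct length-$k$ substrings over this alphabet, so once a string has more than $4^k$ windows of length $k$, a repeat is guaranteed by the pigeonhole principle (the length used here is an assumption, roughly that of the lacZ sequence below):

n = 3072  # assumed sequence length, roughly that of lacZ
for k in [5, 10, 12]:
    windows = n - k + 1
    print(k, 4**k, windows, windows > 4**k)  # True means a repeat is guaranteed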
Let's look at the lacZ gene sequence (from http://berry.engin.umich.edu/gene2oligo/lacZ.txt):
In [10]:
import urllib.request
In [11]:
with urllib.request.urlopen("http://berry.engin.umich.edu/gene2oligo/lacZ.txt") as r:
    lacZ = r.read().decode('utf-8')
In [12]:
print(lacZ)
We need to remove the first two lines, remove newlines and change lowercase letters to uppercase:
In [13]:
lacZ = ''.join(lacZ.split('\n')[2:]).upper()
print(len(lacZ))
Now we can use the repeat functions to check for repeating subsequences:
In [149]:
print(repeat_hash2(lacZ, 10))
print(repeat_hash2(lacZ, 12))
In [14]:
%timeit -n 10 repeat_hash1(lacZ,12)
%timeit -n 10 repeat_hash2(lacZ,12)
This notebook is part of the Extended introduction to computer science course at Tel-Aviv University.
The notebook was written using Python 3.2 and IPython 0.13.1.
The code is available at https://raw.github.com/yoavram/CS1001.py/master/recitation8.ipynb.
The notebook can be viewed online at http://nbviewer.ipython.org/urls/raw.github.com/yoavram/CS1001.py/master/recitation8.ipynb.
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.