Coupon Collector Problem
Collecting Coupons: This is a famous problem from Mosteller's Fifty Challenging Problems in Probability. Coupons in cereal boxes are numbered 1 to 5, and a set of one of each is required for a prize. With one coupon per box, how many boxes, on average, are required to make a complete set?


In [21]:
import random

no_simulation = 1000
total_coupon_numbers = 5
boxes_no = 0  # running total of boxes bought across all trials

for i in range(no_simulation):
    counter = 0
    coupon_number_set = set()

    # buy boxes until every coupon number has been seen;
    # adding the coupon *before* testing completeness avoids
    # counting one extra box per trial
    while len(coupon_number_set) < total_coupon_numbers:
        coupon = random.randint(1, total_coupon_numbers)
        counter += 1
        coupon_number_set.add(coupon)

    boxes_no += counter

print(boxes_no)
avg_boxes = boxes_no / no_simulation
print('avg number of boxes to make a complete set: ', avg_boxes)


With the completeness check fixed (the original tested the set before adding the drawn coupon, so every trial counted one extra box), a typical run reports an average close to the theoretical 11.42 rather than the ~12.36 the buggy version printed.
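The simulation can be sanity-checked against the closed-form answer: with n coupon types, the expected number of boxes is n·H_n = n·(1 + 1/2 + … + 1/n). A minimal sketch of that check:

```python
from fractions import Fraction

def expected_boxes(n):
    """Expected boxes to collect all n coupon types: n * H_n."""
    return n * sum(Fraction(1, k) for k in range(1, n + 1))

e = expected_boxes(5)
print(e, float(e))  # 137/12 ≈ 11.4167
```

So a correct simulation with 1000 trials should land near 11.42, in line with Mosteller's answer.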

Print the most frequent N-grams in a given file.

Problem description: build a tool that receives a corpus of text, analyses it, and reports the top 10 most frequent bigrams, trigrams, and four-grams (i.e. the most frequently occurring two-, three-, and four-word consecutive combinations).

In terms of performance, the solution is O(N × M), where N is the number of words in the text and M is the number of n-gram lengths being counted. Here we count bigrams, trigrams, and four-grams, so M = 3 and the running time is O(N × 3) = O(N), i.e. linear.


In [155]:
import collections
import re
import sys

In [156]:
# file we use is available at  http://www.gutenberg.org/cache/epub/10/pg10.txt

def tokenize(string):
    "Split a string into lower-cased words, ignoring punctuation, and return the list of words."
    return re.findall(r'\w+', string.lower())

def count_ngrams(lines, min_length=2, max_length=4):
    "Iterate over the given line iterator and return n-gram frequencies."

    # n-gram lengths to count, from min_length to max_length
    lengths = range(min_length, max_length + 1)

    # dictionary mapping each length to a Counter of n-grams of that length,
    # e.g. {2: Counter(), 3: Counter(), 4: Counter()}
    ngrams = {length: collections.Counter() for length in lengths}

    # sliding window over the most recent max_length words
    queue = collections.deque(maxlen=max_length)

    def add_queue():
        current = tuple(queue)
        for length in lengths:
            if len(current) >= length:
                ngrams[length][current[:length]] += 1

    # iterate over all the lines
    for line in lines:
        for word in tokenize(line):
            queue.append(word)
            if len(queue) >= max_length:
                # once the window is full, count the n-grams starting at its
                # front; the deque drops its oldest word on the next append
                add_queue()

    # flush the tail: count the shorter n-grams left in the window
    while len(queue) > min_length:
        queue.popleft()
        add_queue()

    return ngrams

def common_ngrams(ngrams, num=10):
    "Print the num most common n-grams of each size."

    for n in sorted(ngrams):
        print('----- {} most common {} grams -----'.format(num, n))
        for gram, count in ngrams[n].most_common(num):
            print('{0} {1} '.format(' '.join(gram), count))
        print('\n')

In [157]:
# sample run
# min_length=2
# max_length=4
# line=['i am samwise gamjee the king of good times']
# print(count_ngrams(line))

# print(ngrams[2])
# print(ngrams[3])
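The commented sample above can be run concretely. Here is a self-contained sketch of the same sliding-window idea for bigrams only, on a toy sentence (the sentence is just made-up data):

```python
import collections
import re

def tokenize(string):
    return re.findall(r'\w+', string.lower())

# slide a two-word window over the text and count each bigram,
# mirroring what count_ngrams does with its deque
queue = collections.deque(maxlen=2)
bigrams = collections.Counter()
for word in tokenize('the cat and the dog and the cat'):
    queue.append(word)
    if len(queue) == 2:
        bigrams[tuple(queue)] += 1

print(bigrams.most_common(2))
```

Because the deque has maxlen=2, each append silently evicts the oldest word, so every consecutive word pair is counted exactly once.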

In [158]:
data_file='./data/pg10.txt'
min_length=2
max_length=4
with open(data_file) as f:
    lines=f.readlines()
print('total lines in the file:  ', len(lines))

ngrams=count_ngrams(lines,min_length,max_length)


total lines in the file:   100222

In [159]:
for i in range(min_length, max_length + 1):
    print('total ngrams of size %d : ' % i, len(ngrams[i]))

print('\n')
common_ngrams(ngrams, 10)


total ngrams of size 2 :  177083
total ngrams of size 3 :  480781
total ngrams of size 4 :  682406


----- 10 most common 2 grams -----
of the 11572 
the lord 7035 
and the 6271 
in the 5045 
and he 2791 
shall be 2462 
to the 2168 
all the 2146 
and they 2086 
unto the 2032 


----- 10 most common 3 grams -----
of the lord 1775 
the son of 1450 
the children of 1355 
the house of 883 
saith the lord 854 
out of the 805 
and i will 672 
children of israel 647 
the land of 617 
and the lord 571 


----- 10 most common 4 grams -----
the children of israel 638 
it came to pass 453 
thus saith the lord 415 
and it came to 396 
of the children of 374 
the lord thy god 304 
the house of the 279 
the word of the 266 
word of the lord 258 
saith the lord god 257 


