Coupon Collector Problem
Collecting Coupons: this is a famous problem from Frederick Mosteller's book, Fifty Challenging Problems in Probability.
Coupons in cereal boxes are numbered 1 to 5, and a set of one of each is required for a prize. With one coupon per box, how many boxes, on average, are required to make a complete set?
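A quick analytic cross-check before simulating: while k distinct coupons are already in hand, each new box yields a new coupon with probability (5 - k)/5, so the expected wait for the next new coupon is 5/(5 - k) boxes. Summing over k = 0, 1, 2, 3, 4 gives 5/5 + 5/4 + 5/3 + 5/2 + 5/1 = 137/12 ≈ 11.42 boxes, which the simulation below should approximate.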
In [21]:
import random

no_simulation = 1000
total_coupon_numbers = 5
boxes_no = 0  # running total of boxes drawn across all simulations

for i in range(no_simulation):
    counter = 0
    coupon_number_set = set()
    # keep buying boxes until every coupon number has been seen;
    # count the coupon as soon as it is drawn, so the box that
    # completes the set is included exactly once
    while len(coupon_number_set) < total_coupon_numbers:
        coupon = random.randint(1, total_coupon_numbers)
        counter += 1
        coupon_number_set.add(coupon)
    boxes_no += counter

print boxes_no
avg_boxes = (boxes_no * 1.0) / no_simulation
print 'avg number of boxes to make a complete set: ', avg_boxes
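As a minimal cross-check (no assumptions beyond the 5 coupon numbers used above), the exact expectation 5 · H₅ from the derivation can be computed directly and compared with avg_boxes:

# exact expectation: 5/5 + 5/4 + 5/3 + 5/2 + 5/1 = 137/12
exact = sum(5.0 / k for k in range(1, 6))
print 'exact expected number of boxes: ', exact  # 11.4166...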
Print the most frequent N-grams in a given file.
Problem description: build a tool that receives a corpus of text, analyses it, and reports the top 10 most frequent bigrams, trigrams, and four-grams (i.e. the most frequently occurring two-, three-, and four-word consecutive combinations).
In terms of performance, the solution is O(N x M), where N is the number of words in the text and M is the number of n-gram lengths being counted. Here we count bigrams, trigrams, and four-grams, so M is 3 and the running time is O(N * 3) = O(N), in other words, linear time.
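For intuition about what is being counted, here is a minimal, self-contained sketch (separate from the tool below, using a made-up sentence) that lists the n-grams of a short word list via offset slices:

words = 'the quick brown fox jumps'.split()
for n in (2, 3, 4):
    # zipping n offset copies of the list yields consecutive n-word windows
    print [' '.join(gram) for gram in zip(*(words[i:] for i in range(n)))]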
In [155]:
import collections
import re
In [156]:
# the file we use is available at http://www.gutenberg.org/cache/epub/10/pg10.txt

def tokenize(string):
    "Split a string into lower-case words, ignoring punctuation; returns a list of words."
    # e.g. tokenize('Hello, World!') -> ['hello', 'world']
    return re.findall(r'\w+', string.lower())

def count_ngrams(lines, min_length=2, max_length=4):
    "Iterate through the given line iterator and return n-gram frequencies."
    # n-gram lengths to count, from min_length to max_length
    lengths = range(min_length, max_length + 1)
    # dictionary mapping each length to its own Counter,
    # e.g. {2: Counter(), 3: Counter(), 4: Counter()}
    ngrams = {length: collections.Counter() for length in lengths}
    # sliding window over the text; a deque with maxlen automatically
    # discards the oldest word once the window is full
    queue = collections.deque(maxlen=max_length)

    def add_queue():
        "Count every n-gram that starts at the front of the current window."
        current = tuple(queue)
        for length in lengths:
            if len(current) >= length:
                ngrams[length][current[:length]] += 1

    # slide the window over all the words in all the lines
    for line in lines:
        for word in tokenize(line):
            queue.append(word)
            if len(queue) >= max_length:
                add_queue()

    # the words at the tail of the text still start shorter n-grams;
    # shrink the window from the left and count those too
    while len(queue) > min_length:
        queue.popleft()
        add_queue()

    return ngrams

def common_ngrams(ngrams, num=10):
    "Print the num most common n-grams of each length."
    for n in sorted(ngrams):
        print '----- {} most common {}-grams -----'.format(num, n)
        for gram, count in ngrams[n].most_common(num):
            print '{0} {1}'.format(' '.join(gram), count)
        print '\n'
In [157]:
# sample run on a single made-up line of text
line = ['i am samwise gamgee the king of good times']
sample = count_ngrams(line, min_length=2, max_length=4)
print sample[2].most_common(3)
print sample[3].most_common(3)
In [158]:
data_file = './data/pg10.txt'
min_length = 2
max_length = 4
with open(data_file) as f:
    lines = f.readlines()
print 'total lines in the file: ', len(lines)
ngrams = count_ngrams(lines, min_length, max_length)
In [159]:
for i in range(min_length, max_length + 1):
    print 'total ngrams of size %d : ' % i, len(ngrams[i])
print '\n'
# common_ngrams prints its results and returns None, so call it
# directly rather than wrapping it in a print
common_ngrams(ngrams, 10)