Homework 1

Unigrams, bigrams, and in general n-grams are sequences of 1, 2, or n words that appear consecutively in a single sentence. Consider the sentence:

"to know you is to love you."

This sentence contains:

  • Unigrams (single words): "to" (2 times), "know" (1 time), "you" (2 times), "is" (1 time), "love" (1 time)
  • Bigrams: "to know", "know you", "you is", "is to", "to love", "love you" (all 1 time)
  • Trigrams: "to know you", "know you is", "you is to", "is to love", "to love you" (all 1 time)
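
For illustration, here is the slice-and-zip trick (plain Python, no Spark) that enumerates these n-grams; the find_ngrams helper used later in this notebook is built on the same idea:

words = "to know you is to love you".split()
bigrams = zip(words, words[1:])               # n = 2
trigrams = zip(words, words[1:], words[2:])   # n = 3
# bigrams  -> [('to', 'know'), ('know', 'you'), ('you', 'is'),
#              ('is', 'to'), ('to', 'love'), ('love', 'you')]
# trigrams -> [('to', 'know', 'you'), ('know', 'you', 'is'), ('you', 'is', 'to'),
#              ('is', 'to', 'love'), ('to', 'love', 'you')]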

The goal of this HW is to find the most common n-grams in the text of Moby Dick.

Your task is to:

  • Convert all text to lower case and remove all punctuation. (After this step, the text should contain only letters, numbers, and spaces.)
  • Count the occurrences of each word and of each 2-, 3-, 4-, and 5-gram.
  • List the 5 most common elements for each order (word, bigram, trigram, ...). For each element, list the sequence of words and the number of occurrences.

Basically, you need to change every punctuation mark to a space and define as a word any maximal string of non-whitespace characters (strings consisting only of whitespace should not be considered words). The important thing here is to be simple, not to be 100% correct in terms of parsing English. Evaluation will be based primarily on identifying the 5 most frequent n-grams, in the correct order, for all values of n. Some slack will be allowed in the frequency counts to allow flexibility in text processing.
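
As a concrete (non-Spark) sketch of this tokenization rule, in the Python 2 used throughout this notebook:

import string
# Map every punctuation character to a space; split() with no argument then
# splits on runs of whitespace and drops empty strings, so whitespace-only
# tokens never become words.
table = string.maketrans(string.punctuation, ' ' * len(string.punctuation))
print "to know you, is -- to love you!".lower().translate(table).split()
# ['to', 'know', 'you', 'is', 'to', 'love', 'you']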

This text is short enough to process on a single core using standard Python. However, you are required to solve it using RDDs for the whole process, as sketched below. Only at the very end may you use .take(5) to bring the results to the driver node for printing.
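
The required shape of the computation is the classic RDD count-and-sort pattern. Here is a minimal sketch on a toy RDD (assuming the SparkContext sc created in the setup cell below); the real pipeline over the book follows the same steps:

toy = sc.parallelize(["to know you is to love you"])
print toy.flatMap(lambda s: s.split()) \
         .map(lambda w: (w, 1)) \
         .reduceByKey(lambda a, b: a + b) \
         .map(lambda kv: (kv[1], kv[0])) \
         .sortByKey(False) \
         .take(5)   # only now do results reach the driver
# e.g. [(2, 'to'), (2, 'you'), (1, 'is'), (1, 'know'), (1, 'love')]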

The code for reading the file and splitting it into sentences is shown below:


In [3]:
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext()

In [4]:
# Read the file with one record per paragraph (records are delimited by
# blank lines, "\r\n\r\n"), keeping only the text and dropping the offsets.
textRDD = sc.newAPIHadoopFile('Data/Moby-Dick.txt',
                              'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
                              'org.apache.hadoop.io.LongWritable',
                              'org.apache.hadoop.io.Text',
                               conf={'textinputformat.record.delimiter': "\r\n\r\n"}) \
            .map(lambda x: x[1])

# Split each paragraph into sentences on ". " and encode as UTF-8 byte strings.
sentences = textRDD.flatMap(lambda x: x.split(". ")).map(lambda x: x.encode('utf-8'))

In [5]:
def find_ngrams(input_list, n):
    # Return all n-grams in input_list as a list of n-tuples, by zipping
    # the list with n-1 shifted copies of itself, e.g.
    # find_ngrams(['to', 'know', 'you'], 2) -> [('to', 'know'), ('know', 'you')]
    return zip(*[input_list[i:] for i in range(n)])

In [6]:
import string
# Translation table mapping every punctuation character to a space
# (used by the final solution further below).
replace_punctuation = string.maketrans(string.punctuation, ' '*len(string.punctuation))

# Exploratory run: the 100 most frequent 5-grams. Note that this cell
# deletes punctuation outright (translate(None, ...)), which is why e.g.
# "gutenberg-tm" appears as "gutenbergtm" in the output below, whereas the
# final solution maps punctuation to spaces using replace_punctuation.
sentences.map(lambda x: ' '.join(x.split()).lower())\
    .map(lambda x: x.translate(None, string.punctuation))\
    .flatMap(lambda x: find_ngrams(x.split(" "), 5))\
    .map(lambda x: (x, 1))\
    .reduceByKey(lambda x, y: x + y)\
    .map(lambda x: (x[1], x[0]))\
    .sortByKey(False)\
    .take(100)


Out[6]:
[(13, ('the', 'project', 'gutenberg', 'literary', 'archive')),
 (13, ('project', 'gutenberg', 'literary', 'archive', 'foundation')),
 (10, ('and', 'at', 'the', 'same', 'time')),
 (9, ('the', 'bottom', 'of', 'the', 'sea')),
 (7, ('the', 'terms', 'of', 'this', 'agreement')),
 (6, ('and', 'not', 'only', 'that', 'but')),
 (6, ('to', 'the', 'project', 'gutenberg', 'literary')),
 (6, ('moby', 'dick', 'or', 'the', 'whale')),
 (6, ('here', 'be', 'it', 'said', 'that')),
 (5, ('hast', 'seen', 'the', 'white', 'whale')),
 (5, ('the', 'other', 'side', 'of', 'the')),
 (5, ('in', 'the', 'middle', 'of', 'the')),
 (5, ('the', 'cape', 'of', 'good', 'hope')),
 (5, ('the', 'full', 'project', 'gutenbergtm', 'license')),
 (5, ('i', 'should', 'like', 'to', 'know')),
 (5, ('out', 'of', 'sight', 'of', 'land')),
 (4, ('of', 'the', 'sperm', 'whale', 'is')),
 (4, ('the', 'greenland', 'or', 'right', 'whale')),
 (4, ('was', 'it', 'not', 'so', 'o')),
 (4, ('as', 'in', 'the', 'case', 'of')),
 (4, ('of', 'the', 'great', 'sperm', 'whale')),
 (4, ('the', 'upper', 'part', 'of', 'the')),
 (4, ('whats', 'the', 'matter', 'with', 'you')),
 (4, ('as', 'if', 'it', 'had', 'been')),
 (4, ('not', 'to', 'speak', 'of', 'the')),
 (4, ('from', 'one', 'to', 'the', 'other')),
 (4, ('sleep', 'two', 'in', 'a', 'bed')),
 (4, ('as', 'the', 'case', 'might', 'be')),
 (4, ('down', 'into', 'the', 'cabin', 'to')),
 (4, ('the', 'middle', 'of', 'the', 'room')),
 (4, ('gave', 'me', 'to', 'understand', 'that')),
 (4, ('now', 'that', 'i', 'think', 'of')),
 (4, ('in', 'one', 'hand', 'and', 'a')),
 (4, ('the', 'seven', 'hundred', 'and', 'seventyseventh')),
 (4, ('all', 'this', 'as', 'it', 'may')),
 (4, ('of', 'project', 'gutenbergtm', 'electronic', 'works')),
 (4, ('at', 'one', 'and', 'the', 'same')),
 (4, ('as', 'if', 'it', 'were', 'a')),
 (4, ('the', 'sperm', 'whale', 'and', 'the')),
 (4, ('that', 'i', 'think', 'of', 'it')),
 (4, ('go', 'to', 'sea', 'as', 'a')),
 (4, ('on', 'the', 'other', 'side', 'of')),
 (4, ('of', 'the', 'whale', 'in', 'his')),
 (4, ('be', 'all', 'this', 'as', 'it')),
 (4, ('to', 'the', 'bottom', 'of', 'the')),
 (4, ('it', 'came', 'to', 'pass', 'that')),
 (3, ('have', 'mercy', 'on', 'us', 'all')),
 (3, ('look', 'ye', 'look', 'they', 'look')),
 (3, ('at', 'the', 'time', 'of', 'the')),
 (3, ('upon', 'the', 'top', 'of', 'the')),
 (3, ('the', 'skin', 'of', 'the', 'whale')),
 (3, ('in', 'the', 'face', 'of', 'all')),
 (3, ('tell', 'you', 'what', 'it', 'is')),
 (3, ('as', 'if', 'he', 'were', 'a')),
 (3, ('you', 'would', 'have', 'almost', 'thought')),
 (3, ('sporty', 'gamy', 'jesty', 'joky', 'hokypoky')),
 (3, ('owner', 'of', 'the', 'project', 'gutenbergtm')),
 (3, ('the', 'lower', 'part', 'of', 'the')),
 (3, ('all', 'the', 'world', 'like', 'a')),
 (3, ('business', 'is', 'that', 'of', 'yours')),
 (3, ('funny', 'sporty', 'gamy', 'jesty', 'joky')),
 (3, ('he', 'looks', 'we', 'look', 'ye')),
 (3, ('of', 'the', 'spare', 'boats', 'and')),
 (3, ('jesty', 'joky', 'hokypoky', 'lad', 'is')),
 (3, ('his', 'hand', 'as', 'with', 'a')),
 (3, ('i', 'look', 'you', 'look', 'he')),
 (3, ('it', 'seems', 'to', 'me', 'that')),
 (3, ('seemed', 'on', 'the', 'point', 'of')),
 (3, ('at', 'the', 'bottom', 'of', 'the')),
 (3, ('the', 'precise', 'bearing', 'of', 'the')),
 (3, ('a', 'project', 'gutenbergtm', 'electronic', 'work')),
 (3, ('you', 'look', 'he', 'looks', 'we')),
 (3, ('in', 'the', 'minds', 'of', 'the')),
 (3, ('look', 'you', 'look', 'he', 'looks')),
 (3, ('lad', 'is', 'the', 'ocean', 'oh')),
 (3, ('but', 'a', 'fastfish', 'what', 'is')),
 (3, ('pilot', 'of', 'the', 'living', 'god')),
 (3, ('on', 'top', 'of', 'his', 'head')),
 (3, ('hokypoky', 'lad', 'is', 'the', 'ocean')),
 (3, ('on', 'the', 'summit', 'of', 'the')),
 (3, ('on', 'land', 'and', 'on', 'sea')),
 (3, ('a', 'species', 'of', 'the', 'leviathan')),
 (3, ('the', 'isles', 'of', 'the', 'sea')),
 (3, ('a', 'funny', 'sporty', 'gamy', 'jesty')),
 (3, ('but', 'be', 'all', 'this', 'as')),
 (3, ('bear', 'me', 'out', 'in', 'it')),
 (3, ('what', 'business', 'is', 'that', 'of')),
 (3, ('for', 'all', 'the', 'world', 'like')),
 (3, ('it', 'would', 'be', 'hard', 'to')),
 (3, ('moving', 'his', 'hand', 'as', 'with')),
 (3, ('story', 'of', 'jonah', 'and', 'the')),
 (3, ('the', 'whole', 'of', 'the', 'law')),
 (3, ('morning', 'to', 'ye', 'shipmates', 'morning')),
 (3, ('of', 'jonah', 'and', 'the', 'whale')),
 (3, ('between', 'the', 'whale', 'and', 'the')),
 (3, ('is', 'by', 'far', 'the', 'most')),
 (3, ('on', 'the', 'part', 'of', 'the')),
 (3, ('this', 'work', 'or', 'any', 'other')),
 (3, ('looks', 'we', 'look', 'ye', 'look')),
 (3, ('position', 'of', 'the', 'whales', 'eyes'))]


Note: To run this notebook on the cluster, change the file path to '/data/Moby-Dick.txt'.

Let freq_ngramRDD be the final result RDD containing the n-grams sorted by their frequency in descending order. Use the following function to print your final output:


In [7]:
def printOutput(n,freq_ngramRDD):
    top=freq_ngramRDD.take(5)
    print '\n============ %d most frequent %d-grams'%(5,n)
    print '\nindex\tcount\tngram'
    for i in range(5):
        print '%d.\t%d: \t"%s"'%(i+1,top[i][0],' '.join(top[i][1]))

Your output for unigrams should look like:

============ 5 most frequent 1-grams

index	count	ngram
1.	40: 	"a"
2.	25: 	"the"
3.	21: 	"and"
4.	16: 	"to"
5.	9: 	"of"

Note: This is just a sample output; it does not reflect the actual results.

Your final program should generate an output using the following code:


In [8]:
for n in range(1,6):
    # Generate the sorted n-gram RDD and store it in freq_ngramRDD:
    # lower-case, map punctuation to spaces, collapse runs of whitespace,
    # extract the n-grams, count them, and sort by count (descending).
    freq_ngramRDD = sentences.map(lambda x: x.lower())\
        .map(lambda x: x.translate(replace_punctuation))\
        .flatMap(lambda x: find_ngrams(' '.join(x.split()).split(" "), n))\
        .map(lambda x: (x, 1))\
        .reduceByKey(lambda x, y: x + y)\
        .map(lambda x: (x[1], x[0]))\
        .sortByKey(False)
    printOutput(n, freq_ngramRDD)


============ 5 most frequent 1-grams

index	count	ngram
1.	14620: 	"the"
2.	6732: 	"of"
3.	6502: 	"and"
4.	4799: 	"a"
5.	4706: 	"to"

============ 5 most frequent 2-grams

index	count	ngram
1.	1906: 	"of the"
2.	1193: 	"in the"
3.	746: 	"to the"
4.	444: 	"from the"
5.	413: 	"the whale"

============ 5 most frequent 3-grams

index	count	ngram
1.	116: 	"the sperm whale"
2.	109: 	"of the whale"
3.	88: 	"the white whale"
4.	64: 	"one of the"
5.	60: 	"of the sea"

============ 5 most frequent 4-grams

index	count	ngram
1.	43: 	"of the sperm whale"
2.	27: 	"the sperm whale s"
3.	20: 	"at the same time"
4.	18: 	"project gutenberg tm electronic"
5.	18: 	"of the whale s"

============ 5 most frequent 5-grams

index	count	ngram
1.	13: 	"the project gutenberg literary archive"
2.	13: 	"project gutenberg literary archive foundation"
3.	12: 	"project gutenberg tm electronic works"
4.	11: 	"of the sperm whale s"
5.	10: 	"and at the same time"
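
One optional refinement, not required by the assignment: since the same sentences RDD is traversed once per value of n, caching it before the loop avoids re-reading and re-splitting the file on each of the five passes.

sentences.cache()   # persist the sentence RDD in memory across the five passes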
