In [3]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import mrjob as mr  # the MapReduce package is published as "mrjob"

Homework 5

Copy this notebook. Rename it as: YOURNAME-HW4-mapreduce-XX

with your name replacing YOURNAME and XX replaced with the date you submit or copy this homework.

Upload your completed Jupyter notebook to the eLearning site as your homework submission. Do not put this notebook on your GitHub.

Do all the homework problems below. As noted, completing the homework earns a 3 out of 5. To score higher, extend the homework by implementing a TF-IDF algorithm (see below).

Use the data/bible+shakes.nonpunc.txt file as the source of your analysis in this homework.

Homework 5.1

A bigram is a combination of two adjacent words. Find the 10 most common bigrams in the text. Word order counts in the bigram: for example, "in the" is not the same bigram as "the in".
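As a rough illustration of order-sensitive bigram counting, here is a minimal sketch on a toy sentence. The names `ordered_bigrams` and `sample` are hypothetical and the sentence is made up; it is not the homework text. In a MapReduce job, the mapper would emit each pair with a count of 1 and the reducer would sum the counts; `Counter` stands in for both steps here.

```python
from collections import Counter

def ordered_bigrams(words):
    """Yield each pair of adjacent words, keeping their original order."""
    return zip(words, words[1:])

# Toy sentence used only for illustration (an assumption, not the homework data).
sample = "in the house in the garden the in crowd".split()

# Count every adjacent pair; ("in", "the") and ("the", "in") stay separate keys.
bigram_counts = Counter(ordered_bigrams(sample))
top_bigrams = bigram_counts.most_common(10)
```

Because the key is an ordered tuple, "in the" and "the in" are tallied independently.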


In [ ]:

Homework 5.2

Now do the same analysis, but make the word order not count, so that "in the" == "the in". Find the 10 most common unordered bigrams in the text.
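One possible way to make word order irrelevant is to normalize each pair before counting it. The sketch below sorts the two words so both orderings map to the same key. The names `unordered_bigrams` and `sample` are hypothetical and the sentence is a toy example.

```python
from collections import Counter

def unordered_bigrams(words):
    """Yield adjacent word pairs with the two words sorted alphabetically."""
    for first, second in zip(words, words[1:]):
        yield tuple(sorted((first, second)))

# Toy sentence used only for illustration (an assumption, not the homework data).
sample = "in the house in the garden the in crowd".split()

# "in the" and "the in" both become the key ("in", "the").
unordered_counts = Counter(unordered_bigrams(sample))
top_unordered = unordered_counts.most_common(10)
```

A sorted tuple is used rather than a `frozenset` because a set would collapse a repeated pair such as "the the" into a single element.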


In [ ]:

Homework 5.3

A trigram is a combination of three adjacent words. Find the 10 most common unordered trigrams in the text. The order of the words does not count in the trigram: for example, "in the air" is the same trigram as "the air in" or "air in the".
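The same normalization idea extends to three words. This is a minimal sketch on a made-up sentence, with the hypothetical names `unordered_trigrams` and `sample`; it sorts each window of three adjacent words so every permutation shares one key.

```python
from collections import Counter

def unordered_trigrams(words):
    """Yield each window of three adjacent words, sorted alphabetically."""
    for triple in zip(words, words[1:], words[2:]):
        yield tuple(sorted(triple))

# Toy sentence used only for illustration (an assumption, not the homework data).
sample = "in the air the air in air in the sky".split()

# "in the air", "the air in" and "air in the" all become ("air", "in", "the").
trigram_counts = Counter(unordered_trigrams(sample))
top_trigrams = trigram_counts.most_common(10)
```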


In [ ]:

Homework 5.4

Create graphs that explain the relationship between the frequencies of monograms (single words), bigrams, and trigrams.
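One way to compare the three distributions is a rank-frequency view: sort the counts in descending order and pair each with its rank. The sketch below only prepares the data series on a toy sentence. The helper names `ngram_counts` and `rank_frequency` are hypothetical, and the choice of a rank-frequency comparison is an assumption, not a requirement of the assignment.

```python
from collections import Counter

def ngram_counts(words, n):
    """Count ordered n-grams (tuples of n adjacent words)."""
    return Counter(zip(*(words[i:] for i in range(n))))

def rank_frequency(counts):
    """Return (ranks, freqs) with frequencies sorted from highest to lowest."""
    freqs = sorted(counts.values(), reverse=True)
    ranks = list(range(1, len(freqs) + 1))
    return ranks, freqs

# Toy sentence used only for illustration (an assumption, not the homework data).
sample = "in the house in the garden the in crowd".split()

# One (ranks, freqs) series each for monograms, bigrams and trigrams.
series = {n: rank_frequency(ngram_counts(sample, n)) for n in (1, 2, 3)}
```

Each series could then be drawn with something like `plt.loglog(ranks, freqs, label="bigrams")`, which on a large text usually makes the differences between the three curves easier to see than linear axes.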


In [ ]:

For a score greater than 3

Create a TF-IDF implementation and use it to analyze the Project Gutenberg text versions of the following Sherlock Holmes books:

The Adventures of Sherlock Holmes - http://www.gutenberg.org/ebooks/1661.txt.utf-8

A Study in Scarlet - http://www.gutenberg.org/files/244/244-0.txt

The Hound of the Baskervilles - http://www.gutenberg.org/files/2852/2852-0.txt

The Return of Sherlock Holmes - http://www.gutenberg.org/files/108/108-0.txt

The Sign of the Four - http://www.gutenberg.org/ebooks/2097.txt.utf-8

Display the scores for the 20 highest-frequency terms and show how they relate to each of the books.
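For orientation, here is a minimal sketch of one common TF-IDF variant, assuming term frequency is the term count divided by the document length and inverse document frequency is the natural log of (number of documents / number of documents containing the term). Other weightings exist, and the assignment does not specify one. The function name `tf_idf` and the toy documents are hypothetical stand-ins for the real books.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Map each document name to a {term: tf-idf score} dictionary.

    documents: dict of name -> list of words (each list must be non-empty).
    """
    n_docs = len(documents)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for words in documents.values():
        doc_freq.update(set(words))

    scores = {}
    for name, words in documents.items():
        counts = Counter(words)
        total = len(words)
        scores[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores

# Tiny made-up documents, used only for illustration.
toy_books = {
    "book_a": "holmes watson hound moor".split(),
    "book_b": "holmes watson scarlet".split(),
    "book_c": "holmes sign four four".split(),
}
scores = tf_idf(toy_books)
```

With this weighting, a term that appears in every document (here "holmes") scores zero, while terms concentrated in a single document score highest for that document.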


In [ ]: