Algorithms Exercise 1

Imports


In [1]:
%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np

Word counting

Write a function tokenize that takes a string of English text and returns a list of words. It should also remove stop words, which are common short words that are often removed before natural language processing. Your function should have the following logic:

  • Split the string into lines using splitlines.
  • Split each line into a list of words and merge the lists for each line.
  • Use Python's builtin filter function to remove all punctuation (illustrated just after this list).
  • If stop_words is a list, remove all occurrences of the words in the list.
  • If stop_words is a space-delimited string of words, split them and remove them.
  • Remove any remaining empty words.
  • Make all words lowercase.
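As a small illustration of the filter step above, this sketch strips punctuation from a single word (the word and variable names are made up for the example; punctuation is the same default string used by tokenize below):

punctuation = '`~!@#$%^&*()_-+={[}]|\:;"<,>.?/}\t'
word = 'month,'                                     # hypothetical input word
cleaned = ''.join(filter(lambda c: c not in punctuation, word))
print(cleaned.lower())                              # -> 'month'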

In [6]:
wasteland = """
APRIL is the cruellest month, breeding
Lilacs out of the dead land, mixing
Memory and desire, stirring
Dull roots with spring rain.
"""

def tokenize(s, stop_words=None, punctuation='`~!@#$%^&*()_-+={[}]|\:;"<,>.?/}\t'):
    """Split a string into a list of words, removing punctuation and stop words."""
    if isinstance(stop_words, str):   # accept a space-delimited string of stop words
        stop_words = stop_words.split()
    words = []
    for line in s.splitlines():
        for word in line.split():
            # strip punctuation with filter, lowercase, then drop empty and stop words
            word = ''.join(filter(lambda c: c not in punctuation, word)).lower()
            if word and (stop_words is None or word not in stop_words):
                words.append(word)
    return words


In [ ]:
assert tokenize("This, is the way; that things will end", stop_words=['the', 'is']) == \
    ['this', 'way', 'that', 'things', 'will', 'end']
wasteland = """
APRIL is the cruellest month, breeding
Lilacs out of the dead land, mixing
Memory and desire, stirring
Dull roots with spring rain.
"""

assert tokenize(wasteland, stop_words='is the of and') == \
    ['april','cruellest','month','breeding','lilacs','out','dead','land',
     'mixing','memory','desire','stirring','dull','roots','with','spring',
     'rain']

Write a function count_words that takes a list of words and returns a dictionary where the keys in the dictionary are the unique words in the list and the values are the word counts.
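For a quick cross-check while developing, the standard library's collections.Counter builds the same word-to-count mapping in one call (shown here only as a sanity check, not as the required implementation):

from collections import Counter
print(dict(Counter(['a', 'b', 'a'])))   # {'a': 2, 'b': 1}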


In [ ]:
def count_words(data):
    """Return a word count dictionary from the list of words in data."""
    counts = {}
    for word in data:
        if word in counts:
            counts[word] += 1   # word already seen: increment its count
        else:
            counts[word] = 1    # first occurrence: start the count at 1
    return counts

In [ ]:
assert count_words(tokenize('this and the this from and a a a')) == \
    {'a': 3, 'and': 2, 'from': 1, 'the': 1, 'this': 2}

Write a function sort_word_counts that returns a list of sorted word counts:

  • Each element of the list should be a (word, count) tuple.
  • The list should be sorted by the word counts, with the highest counts coming first.
  • To perform this sort, look at using the sorted function with a custom key and reverse argument, demonstrated just below.
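As a short demonstration of sorted with a key and reverse=True, using a made-up counts dictionary:

counts = {'and': 2, 'a': 4, 'the': 1}   # hypothetical word counts
pairs = sorted(counts.items(), key=lambda item: item[1], reverse=True)
print(pairs)                            # [('a', 4), ('and', 2), ('the', 1)]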

In [ ]:
def sort_word_counts(wc):
    """Return a list of 2-tuples of (word, count), sorted by count descending."""
    # sorted() with a key that selects the count and reverse=True puts the
    # highest counts first.
    return sorted(wc.items(), key=lambda item: item[1], reverse=True)

"""
A hand-rolled alternative would be a bubble sort: repeatedly compare adjacent
counts, swap when the later count is larger, and keep passing over the data
until a full pass makes no swaps.
"""

In [ ]:
assert sort_word_counts(count_words(tokenize('this and a the this this and a a a'))) == \
    [('a', 4), ('this', 3), ('and', 2), ('the', 1)]

Perform a word count analysis on Chapter 1 of Moby Dick, whose text can be found in the file mobydick_chapter1.txt:

  • Read the file into a string.
  • Tokenize with stop words of 'the of and a to in is it that as'.
  • Perform a word count, then sort the counts and save the result in a variable named swc.
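A minimal sketch of these steps, assuming mobydick_chapter1.txt is in the working directory and that the functions above are implemented (chapter1 and tokens are illustrative names, not required by the grader):

with open('mobydick_chapter1.txt') as f:   # file named in the exercise
    chapter1 = f.read()
tokens = tokenize(chapter1, stop_words='the of and a to in is it that as')
swc = sort_word_counts(count_words(tokens))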

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()

In [ ]:
assert swc[0]==('i',43)
assert len(swc)==848

Create a "Cleveland Style" dotplot of the counts of the top 50 words using Matplotlib. If you don't know what a dotplot is, you will have to do some research...


In [ ]:
# YOUR CODE HERE
raise NotImplementedError()

In [ ]:
assert True # use this for grading the dotplot