Algorithms Exercise 1

Imports


In [1]:
%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np

Word counting

Write a function tokenize that takes a string of English text and returns a list of words. It should also remove stop words, which are common short words that are often removed before natural language processing. Your function should have the following logic:

  • Split the string into lines using splitlines.
  • Split each line into a list of words and merge the lists for each line.
  • Use Python's builtin filter function to remove all punctuation.
  • If stop_words is a list, remove all occurrences of the words in the list.
  • If stop_words is a space-delimited string of words, split them and remove them.
  • Remove any remaining empty words.
  • Make all words lowercase.
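
The punctuation step can be sketched on a single word. This is only an illustration: the short punctuation string and the sample word below are assumptions, not the default used by tokenize. Iterating a string with filter yields its characters one at a time, so the survivors have to be joined back into a string.

```python
# Hypothetical, shortened punctuation set used only for this illustration.
punctuation = ',;.!?'
word = 'way;'

# filter keeps the characters for which the lambda returns True;
# ''.join turns the resulting iterator of characters back into a string.
cleaned = ''.join(filter(lambda c: c not in punctuation, word))
print(cleaned)
```

A word made up entirely of punctuation becomes an empty string here, which is why the logic above includes a step to remove empty words.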

In [2]:
def tokenize(s, stop_words=None, punctuation='`~!@#$%^&*()_-+={[}]|\\:;"<,>.?/}\t'):
    """Split a string into a list of words, removing punctuation and stop words."""
    # YOUR CODE HERE
    # Split into lines, split each line into words, and merge the per-line lists.
    words = [word for line in s.splitlines() for word in line.split()]
    # Remove punctuation characters from each word with the builtin filter.
    words = [''.join(filter(lambda c: c not in punctuation, word)) for word in words]
    # Accept stop words as None, a list, or a space-delimited string.
    if stop_words is None:
        stop_words = []
    elif isinstance(stop_words, str):
        stop_words = stop_words.split()
    # Compare in lowercase, because the words themselves are lowercased below.
    stop = {w.lower() for w in stop_words}
    # Lowercase everything, then drop empty words and stop words in one pass.
    words = [w.lower() for w in words]
    return [w for w in words if w and w not in stop]
print(tokenize("This, is the way; that things will end"))


['this', 'is', 'the', 'way', 'that', 'things', 'will', 'end']

In [3]:
assert tokenize("This, is the way; that things will end", stop_words=['the', 'is']) == \
    ['this', 'way', 'that', 'things', 'will', 'end']
wasteland = """
APRIL is the cruellest month, breeding
Lilacs out of the dead land, mixing
Memory and desire, stirring
Dull roots with spring rain.
"""

assert tokenize(wasteland, stop_words='is the of and') == \
    ['april','cruellest','month','breeding','lilacs','out','dead','land',
     'mixing','memory','desire','stirring','dull','roots','with','spring',
     'rain']

Write a function count_words that takes a list of words and returns a dictionary where the keys in the dictionary are the unique words in the list and the values are the word counts.


In [4]:
def count_words(data):
    """Return a word count dictionary from the list of words in data."""
    # YOUR CODE HERE
    wc = {}
    for word in data:
        if word in wc:
            wc[word] = wc[word]+1
        else:
            wc[word] = 1
    return wc
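
As a cross-check on a hand-written counting loop, the standard library's collections.Counter builds the same kind of mapping. This is a sketch of an alternative, not the approach the exercise asks for, and the word list is made up for the illustration.

```python
from collections import Counter

# Sample word list, assumed for illustration only.
words = ['this', 'and', 'the', 'this', 'from', 'and', 'a', 'a', 'a']

# Counter is a dict subclass; dict(...) converts it to a plain dictionary
# so it compares equal to the result of a manual counting loop.
counts = dict(Counter(words))
print(counts)
```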

In [5]:
assert count_words(tokenize('this and the this from and a a a')) == \
    {'a': 3, 'and': 2, 'from': 1, 'the': 1, 'this': 2}

Write a function sort_word_counts that returns a list of sorted word counts:

  • Each element of the list should be a (word, count) tuple.
  • The list should be sorted by the word counts, with the highest counts coming first.
  • To perform this sort, look at using the sorted function with a custom key and reverse argument.
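
The key and reverse arguments can be sketched on a small dictionary. The data below is made up for the illustration. Python's sort is stable even with reverse=True, so words with equal counts keep their original relative order; the second call shows one possible way to make ties deterministic, which the exercise does not require.

```python
# Sample counts, assumed for illustration only.
counts = {'the': 1, 'a': 4, 'and': 2, 'this': 3, 'from': 1}

# Sort (word, count) pairs by count, highest first.
# 'the' and 'from' tie at 1 and stay in insertion order.
by_count = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

# Optional variant: break ties alphabetically by sorting on (-count, word).
tie_broken = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

print(by_count)
print(tie_broken)
```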

In [6]:
def sort_word_counts(wc):
    """Return a list of 2-tuples of (word, count), sorted by count descending."""
    # YOUR CODE HERE
    # sorted already returns a list, so no extra list() call is needed.
    return sorted(wc.items(), key=lambda x: x[1], reverse=True)

In [7]:
assert sort_word_counts(count_words(tokenize('this and a the this this and a a a'))) == \
    [('a', 4), ('this', 3), ('and', 2), ('the', 1)]

Perform a word count analysis on Chapter 1 of Moby Dick, whose text can be found in the file mobydick_chapter1.txt:

  • Read the file into a string.
  • Tokenize with stop words of 'the of and a to in is it that as'.
  • Perform a word count, then sort and save the result in a variable named swc.

In [10]:
# YOUR CODE HERE
with open('mobydick_chapter1.txt') as f:
    rawtext = f.read()

wl = tokenize(rawtext, stop_words='the of and a to in is it that as')
wc = count_words(wl)
swc = sort_word_counts(wc)
# print(rawtext, wl)

In [11]:
assert swc[0]==('i',43)
assert len(swc)==848


---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-11-79f30673d9f4> in <module>()
      1 assert swc[0]==('i',43)
----> 2 assert len(swc)==848

AssertionError: 

Create a "Cleveland Style" dotplot of the counts of the top 50 words using Matplotlib. If you don't know what a dotplot is, you will have to do some research...


In [ ]:
# YOUR CODE HERE
raise NotImplementedError()

In [ ]:
assert True # use this for grading the dotplot
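
A Cleveland-style dot plot puts one category per row and marks its value with a single dot on a shared horizontal axis. The following is a minimal sketch of that idea, not a reference solution: the function name dotplot, the sample data, and the figure sizing are all assumptions made for the illustration. In the notebook, the sorted (word, count) list from the previous step would be passed in place of sample.

```python
from matplotlib import pyplot as plt

def dotplot(sorted_counts, top=50):
    """Draw a dot plot of the first `top` (word, count) pairs (hypothetical helper)."""
    pairs = sorted_counts[:top]
    words = [w for w, _ in pairs]
    counts = [c for _, c in pairs]
    positions = list(range(len(pairs)))

    # Make the figure taller as the number of words grows (arbitrary sizing).
    fig, ax = plt.subplots(figsize=(6, max(2, 0.25 * len(pairs))))
    ax.plot(counts, positions, 'o')       # one dot per word, no connecting line
    ax.set_yticks(positions)
    ax.set_yticklabels(words)             # label each row with its word
    ax.invert_yaxis()                     # first (highest) count at the top
    ax.grid(axis='y', linestyle=':')      # faint guide line along each row
    ax.set_xlabel('count')
    return ax

# Sample data, assumed for illustration only.
sample = [('a', 4), ('this', 3), ('and', 2), ('the', 1)]
ax = dotplot(sample)
```

Plotting counts on the horizontal axis keeps the word labels horizontal and readable, which is the main reason to prefer this layout over a bar chart with 50 rotated tick labels.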