# Algorithms Exercise 1

## Imports

``````

In :

%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np

``````

## Word counting

Write a function `tokenize` that takes a string of English text and returns a list of words. It should also remove stop words, which are common short words that are often removed before natural language processing. Your function should have the following logic:

• Split the string into lines using `splitlines`.
• Split each line into a list of words and merge the lists for each line.
• Use Python's builtin `filter` function to remove all punctuation.
• If `stop_words` is a list, remove all occurrences of the words in the list.
• If `stop_words` is a space-delimited string of words, split them and remove them.
• Remove any remaining empty words.
• Make all words lowercase.
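
The steps above can be traced on a toy string before wrapping them in a function. This is only a sketch: the sample `text`, the shortened `punctuation` set, and the variable names are assumptions made for illustration, not part of the exercise.

```python
text = "Hello, world!\nthis is the end."
punctuation = ',.!'        # shortened punctuation set, for illustration only
stop_words = 'is the'      # space-delimited form; a list would skip the split

# Split into lines, split each line into words, and merge the per-line lists.
words = []
for line in text.splitlines():
    words.extend(line.split())

# filter keeps only the characters that pass the test; join rebuilds the word.
words = [''.join(filter(lambda ch: ch not in punctuation, w)) for w in words]

# Normalize the stop words to a list, then drop them along with empty strings.
if isinstance(stop_words, str):
    stop_words = stop_words.split()
words = [w for w in words if w and w not in stop_words]

# Lowercase last, following the order of the steps listed above.
result = [w.lower() for w in words]
print(result)  # ['hello', 'world', 'this', 'end']
```

Because lowercasing comes last in the listed order, stop words are matched case-sensitively in this sketch; lowercasing first would be a reasonable variation if capitalized stop words should also be dropped.
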
``````

In :

# Read the chapter; the with-block closes the file automatically.
with open('mobydick_chapter1.txt') as f:
    mobydick = f.read()
# Rough word count before any cleaning.
print(len(mobydick.split()))

``````
``````

2190

``````
``````

In :

def tokenize(s, stop_words=None, punctuation='`~!@#$%^&*()_-+={[}]|\\:;"<,>.?/}\t'):
    """Split a string into a list of words, removing punctuation and stop words."""
    # Accept stop words as a list or as a space-delimited string.
    if stop_words is None:
        stop_words = []
    elif isinstance(stop_words, str):
        stop_words = stop_words.split()
    # Split into lines, then split each line into words and merge the lists.
    words = []
    for line in s.splitlines():
        words.extend(line.split())
    # Use filter to drop punctuation characters from each word.
    words = [''.join(filter(lambda c: c not in punctuation, w)) for w in words]
    # Remove stop words and empty words, then lowercase (the order listed above).
    return [w.lower() for w in words if w and w not in stop_words]

``````
``````

In :

assert tokenize("This, is the way; that things will end", stop_words=['the', 'is']) == \
    ['this', 'way', 'that', 'things', 'will', 'end']
wasteland = """
APRIL is the cruellest month, breeding
Lilacs out of the dead land, mixing
Memory and desire, stirring
Dull roots with spring rain.
"""

assert tokenize(wasteland, stop_words='is the of and') == \
    ['april','cruellest','month','breeding','lilacs','out','dead','land',
     'mixing','memory','desire','stirring','dull','roots','with','spring',
     'rain']

``````

Write a function `count_words` that takes a list of words and returns a dictionary where the keys in the dictionary are the unique words in the list and the values are the word counts.

``````

In :

from collections import Counter

def count_words(data):
    """Return a word count dictionary from the list of words in data."""
    return dict(Counter(data))

``````
``````

In :

assert count_words(tokenize('this and the this from and a a a')) == \
    {'a': 3, 'and': 2, 'from': 1, 'the': 1, 'this': 2}

``````

Write a function `sort_word_counts` that returns a list of sorted word counts:

• Each element of the list should be a `(word, count)` tuple.
• The list should be sorted by the word counts, with the highest counts coming first.
• To perform this sort, look at using the `sorted` function with a custom `key` and `reverse` argument.
``````

In :

def sort_word_counts(wc):
    """Return a list of 2-tuples of (word, count), sorted by count descending."""
    # Sort the (word, count) pairs by the count, largest first.
    return sorted(wc.items(), key=lambda item: item[1], reverse=True)

``````
``````

In :

assert sort_word_counts(count_words(tokenize('this and a the this this and a a a'))) == \
    [('a', 4), ('this', 3), ('and', 2), ('the', 1)]

``````

Perform a word count analysis on Chapter 1 of Moby Dick, whose text can be found in the file `mobydick_chapter1.txt`:

• Read the file into a string.
• Tokenize with stop words of `'the of and a to in is it that as'`.
• Perform a word count, then sort it and save the result in a variable named `swc`.
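
The read, tokenize, count, and sort pipeline described above can be sketched in a compact, self-contained form. Note that this is a stand-in rather than the exercise's own functions: `sorted_word_counts` and `sample` are hypothetical names, and it swaps in `string.punctuation` and `Counter.most_common` from the standard library, so its counts on the real chapter may not match the values the grading cell expects.

```python
from collections import Counter
import string

STOP_WORDS = 'the of and a to in is it that as'

def sorted_word_counts(text, stop_words=STOP_WORDS):
    """Hypothetical helper: tokenize, count, and sort in one pass."""
    stop = set(stop_words.split())
    # Translation table that deletes every character in string.punctuation.
    strip = str.maketrans('', '', string.punctuation)
    words = (w.translate(strip) for w in text.split())
    # Drop empty words and stop words, then lowercase what remains.
    kept = (w.lower() for w in words if w and w not in stop)
    # most_common() returns (word, count) tuples, highest count first.
    return Counter(kept).most_common()

# Demo on a short sample; for the exercise, the text would instead be the
# contents of mobydick_chapter1.txt.
sample = "Call me Ishmael. I saw the sea, and the sea saw me."
swc_sample = sorted_word_counts(sample)
print(swc_sample[0], len(swc_sample))  # ('me', 2) 6
```

The first element of the result is the most frequent word with its count, and the length of the list is the number of distinct words left after stop-word removal.
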
``````

In [ ]:

``````
``````

In :

assert swc[0]==('i',43)
assert len(swc)==848

``````
``````

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-9-79f30673d9f4> in <module>()
----> 1 assert swc[0]==('i',43)
2 assert len(swc)==848

NameError: name 'swc' is not defined

``````

Create a "Cleveland Style" dotplot of the counts of the top 50 words using Matplotlib. If you don't know what a dotplot is, you will have to do some research...
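
As a starting point, a Cleveland-style dot plot puts one category per row and marks each value with a single dot, with light guide lines running across each row. The sketch below is one possible Matplotlib layout, not a reference solution: `cleveland_dotplot` is a hypothetical helper, and the three-word input is made-up sample data standing in for the sorted word counts.

```python
import matplotlib.pyplot as plt

def cleveland_dotplot(word_counts, n=50):
    """Hypothetical helper: dot plot of the first n (word, count) tuples."""
    top = word_counts[:n]
    words = [w for w, _ in top]
    counts = [c for _, c in top]
    positions = list(range(len(top)))

    # Scale the figure height with the number of rows so labels stay legible.
    fig, ax = plt.subplots(figsize=(6, max(2, 0.25 * len(top))))
    ax.plot(counts, positions, 'o')       # one dot per word, no connecting line
    ax.set_yticks(positions)
    ax.set_yticklabels(words)
    ax.invert_yaxis()                     # first (highest) count at the top
    ax.grid(axis='y', linestyle=':')      # guide line across each row
    ax.set_xlabel('count')
    ax.set_title('Top %d words' % len(top))
    return ax

# Made-up sample data; in the exercise this would be the sorted word counts.
ax = cleveland_dotplot([('me', 2), ('sea', 2), ('call', 1)])
```

Passing the full sorted list with the default `n=50` would plot the 50 most frequent words, assuming the list is already sorted by count in descending order.
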

``````

In [ ]:

``````
``````

In :

assert True # use this for grading the dotplot

``````