Wordcount example

This exercise uses a simple example to demonstrate how virtual environments can be used to facilitate reproducible research.

We will run the wordcount code used in previous exercises inside a virtual environment, with the help of Docker.
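
As a rough sketch (the image name swc/wordcount and the mount point below are hypothetical and depend on how the container was built), the notebook could be launched inside a container with a command along the lines of:

docker run -it -p 8888:8888 -v "$PWD":/home/work swc/wordcount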

1. Download a book

First, we'll download a book (The Picture of Dorian Gray, by Oscar Wilde) from Project Gutenberg.


In [14]:
!wget http://www.gutenberg.org/files/174/174.zip -O dorian.zip


--2015-03-24 18:34:33--  http://www.gutenberg.org/files/174/174.zip
Resolving www.gutenberg.org... 152.19.134.47
Connecting to www.gutenberg.org|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 175175 (171K) [application/zip]
Saving to: 'dorian.zip'

dorian.zip          100%[=====================>] 171.07K  --.-KB/s   in 0.002s 

2015-03-24 18:34:34 (85.6 MB/s) - 'dorian.zip' saved [175175/175175]
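
If wget is not available in the environment, the same archive could be fetched from Python instead. This is just an alternative sketch, assuming Python 3:

# Download the zipped book from Project Gutenberg and save it as dorian.zip
import urllib.request

url = "http://www.gutenberg.org/files/174/174.zip"
urllib.request.urlretrieve(url, "dorian.zip")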

2. Review the text


In [15]:
!unzip dorian.zip
!mv 174.txt dorian.txt
!head dorian.txt


Archive:  dorian.zip
  inflating: 174.txt

[The first ten lines of dorian.txt, beginning with the Project Gutenberg header, are printed here.]

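Equivalently, the archive could be unpacked from Python with the standard library. This optional sketch assumes, as above, that the zip file contains a single file named 174.txt:

# Extract 174.txt from the archive and rename it to dorian.txt
import os
import zipfile

with zipfile.ZipFile("dorian.zip") as zf:
    zf.extract("174.txt")
os.rename("174.txt", "dorian.txt")
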
3. Review the wordcount script

The commands used to carry out the analysis are contained within the countwords script:


In [17]:
!cat countwords


#!/bin/sh
# Script that counts how many times each word appears in a book

# Has a filename for the book been provided?
thebook="$1"
if [ -z "$thebook" ] ; then
    echo "Please provide the filename of the book."
    exit 1
elif [ ! -f "$thebook" ] ; then
    echo "$thebook was not found."
    exit 1
fi

# Create output directory
mkdir -p "output"

# Count the words and output the results
OUTPUTFILE="output/wordcount.txt"
cat "$thebook" | ./mapper.py | sort | ./reducer.py > "$OUTPUTFILE"
echo "Word count saved as $OUTPUTFILE"
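
The script pipes the text through two helper programs, mapper.py and reducer.py, whose contents are not shown in this exercise. A minimal sketch of what such a pair might look like (an illustration only, not the actual code used here) is a mapper that emits one "word<TAB>1" line per token, and a reducer that totals the counts and prints the words in decreasing order of frequency:

# mapper.py (hypothetical sketch): emit one "word<TAB>1" line per token read from stdin
import re
import sys

for line in sys.stdin:
    for word in re.findall(r"[a-z']+", line.lower()):
        print(word + "\t1")

# reducer.py (hypothetical sketch): total the counts for each word and print
# them in decreasing order of frequency, matching the output format shown below
import sys
from collections import Counter

counts = Counter()
for line in sys.stdin:
    word, count = line.split("\t")
    counts[word] += int(count)

for word, count in counts.most_common():
    print(word + " \t" + str(count))

Because this sketch of the reducer accumulates everything in a Counter, it does not strictly need the sort step in the pipeline; a streaming reducer that sums runs of identical words would.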

4. Run the analysis


In [18]:
!./countwords dorian.txt


Word count saved as output/wordcount.txt

In [7]:
!head output/wordcount.txt


the 	3948
of 	2298
and 	2279
to 	2181
a 	1730
i 	1694
he 	1544
you 	1515
that 	1376
it 	1358
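
As a quick sanity check (not part of the original exercise), the total number of words counted can be recovered by summing the second column of the output file:

# Sum the counts in output/wordcount.txt to get the total number of words
total = 0
with open("output/wordcount.txt") as f:
    for line in f:
        word, count = line.split()
        total += int(count)
print("Total words counted:", total)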

5. Plot a bar chart showing word frequency

Plot and save a bar chart showing how often each of the n most frequent words occurs.


In [19]:
# Load the plotting library
import matplotlib.pyplot as plt
%matplotlib inline

# Open the file
filename = "output/wordcount.txt"
f = open(filename, 'r')

# Create lists to hold X and Y data
wordlist = []
frequency = []

In [20]:
# Populate the word and frequency lists
for line in f:
    line = line.split()
    wordlist.append(line[0])
    frequency.append(int(line[1]))

In [21]:
# Snip the lists to the first ten items
n_words = 10
wordlist = wordlist[:n_words]
frequency = frequency[:n_words]

# Check lists by printing them
print(wordlist)
print(frequency)


['the', 'of', 'and', 'to', 'a', 'i', 'he', 'you', 'that', 'it']
[3948, 2298, 2279, 2181, 1730, 1694, 1544, 1515, 1376, 1358]

In [22]:
# Plot the barchart
bookname = "Dorian Gray"
plt.barh(range(len(frequency)), frequency, align='center')
plt.title('Top ' + str(n_words) + ' words in ' + bookname)
plt.xlabel('Word frequency')
plt.yticks(range(len(frequency)), wordlist)
plt.show()
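
The exercise also asks for the chart to be saved. One way to do that (the filename below is arbitrary) is to write the figure to the output directory before displaying it:

# Redraw the bar chart and save it alongside the word counts
plt.barh(range(len(frequency)), frequency, align='center')
plt.title('Top ' + str(n_words) + ' words in ' + bookname)
plt.xlabel('Word frequency')
plt.yticks(range(len(frequency)), wordlist)
plt.savefig('output/wordcount_barchart.png')
plt.show()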


6. Plot the dispersion of words

Using the Python Natural Language Toolkit (NLTK), we will also look at the dispersion of selected words throughout the book.


In [23]:
# import the toolkit
import nltk, re, pprint
from nltk import word_tokenize
nltk.download('punkt')


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/tompollard/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Out[23]:
True

In [24]:
# open the book
f = open('dorian.txt')
rawtext = f.read() # string of contents

In [25]:
# tokenize the text
tokens = word_tokenize(rawtext)
print(tokens[:10])


['The', 'Project', 'Gutenberg', 'EBook', 'of', 'The', 'Picture', 'of', 'Dorian', 'Gray']

In [26]:
# convert the tokens to NLTK formatted text
doriantext = nltk.Text(tokens)
type(doriantext)


Out[26]:
nltk.text.Text

In [141]:
# select the words for our dispersion plot
words = ['Dorian','Sibyl','James','Charming','love','marriage','beautiful','corruption','horrible']

In [142]:
# plot the appearance of our words throughout the book
nltk.draw.dispersion_plot(doriantext, words)


[A lexical dispersion plot is displayed, showing where each of the selected words appears across the text.]
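
Equivalently, the same plot can be produced by calling the dispersion_plot method on the NLTK Text object itself:

# Same plot via the Text object's method
doriantext.dispersion_plot(words)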