Wordcount example

This exercise uses a simple example to demonstrate how virtual environments can be used to facilitate reproducible research.

We will run the wordcount code used in previous exercises inside a virtual environment, with the help of Docker.
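
As a rough sketch (the image name swc/wordcount and the mount point below are hypothetical and depend on how the container was built), the notebook could be launched inside a container with a command along the lines of:

docker run -it -p 8888:8888 -v "$PWD":/home/work swc/wordcount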

1. Download a book

First, we'll download a book (The Picture of Dorian Gray, by Oscar Wilde) from Project Gutenberg.


In [14]:
!wget http://www.gutenberg.org/files/174/174.zip -O dorian.zip


--2015-03-24 18:34:33--  http://www.gutenberg.org/files/174/174.zip
Resolving www.gutenberg.org... 152.19.134.47
Connecting to www.gutenberg.org|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 175175 (171K) [application/zip]
Saving to: 'dorian.zip'

dorian.zip          100%[=====================>] 171.07K  --.-KB/s   in 0.002s 

2015-03-24 18:34:34 (85.6 MB/s) - 'dorian.zip' saved [175175/175175]
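
If wget is not available in the environment, the same archive could be fetched from Python instead. This is just an alternative sketch, assuming Python 3:

# Download the zipped book from Project Gutenberg and save it as dorian.zip
import urllib.request

url = "http://www.gutenberg.org/files/174/174.zip"
urllib.request.urlretrieve(url, "dorian.zip")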

2. Review the text


In [15]:
!unzip dorian.zip
!mv 174.txt dorian.txt
!head dorian.txt


Archive:  dorian.zip
  inflating: 174.txt

[The first ten lines of dorian.txt, beginning with the Project Gutenberg header, are printed here.]

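Equivalently, the archive could be unpacked from Python with the standard library. This optional sketch assumes, as above, that the zip file contains a single file named 174.txt:

# Extract 174.txt from the archive and rename it to dorian.txt
import os
import zipfile

with zipfile.ZipFile("dorian.zip") as zf:
    zf.extract("174.txt")
os.rename("174.txt", "dorian.txt")
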
3. Review the wordcount script

The commands used to carry out the analysis are contained within the countwords script:


In [17]:
!cat countwords


#!/bin/sh
# Script that counts how many times each word appears in a book

# Has a filename for the book been provided?
thebook="$1"
if [ -z "$thebook" ] ; then
    echo "Please provide the filename of the book."
    exit 1
elif [ ! -f "$thebook" ] ; then
    echo "$thebook was not found."
    exit 1
fi

# Create output directory
mkdir -p "output"

# Count the words and output the results
OUTPUTFILE="output/wordcount.txt"
cat "$thebook" | ./mapper.py | sort | ./reducer.py > "$OUTPUTFILE"
echo "Word count saved as $OUTPUTFILE"
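
The script pipes the text through two helper programs, mapper.py and reducer.py, whose contents are not shown in this exercise. A minimal sketch of what such a pair might look like (an illustration only, not the actual code used here) is a mapper that emits one "word<TAB>1" line per token, and a reducer that totals the counts and prints the words in decreasing order of frequency:

# mapper.py (hypothetical sketch): emit one "word<TAB>1" line per token read from stdin
import re
import sys

for line in sys.stdin:
    for word in re.findall(r"[a-z']+", line.lower()):
        print(word + "\t1")

# reducer.py (hypothetical sketch): total the counts for each word and print
# them in decreasing order of frequency, matching the output format shown below
import sys
from collections import Counter

counts = Counter()
for line in sys.stdin:
    word, count = line.split("\t")
    counts[word] += int(count)

for word, count in counts.most_common():
    print(word + " \t" + str(count))

Because this sketch of the reducer accumulates everything in a Counter, it does not strictly need the sort step in the pipeline; a streaming reducer that sums runs of identical words would.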

4. Run the analysis


In [18]:
!./countwords dorian.txt


Word count saved as output/wordcount.txt

In [7]:
!head output/wordcount.txt


the 	3948
of 	2298
and 	2279
to 	2181
a 	1730
i 	1694
he 	1544
you 	1515
that 	1376
it 	1358
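
As a quick sanity check (not part of the original exercise), the total number of words counted can be recovered by summing the second column of the output file:

# Sum the counts in output/wordcount.txt to get the total number of words
total = 0
with open("output/wordcount.txt") as f:
    for line in f:
        word, count = line.split()
        total += int(count)
print("Total words counted:", total)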

5. Plot a bar chart showing word frequency

Plot and save a bar chart showing how often each of the n most frequent words occurs.


In [19]:
# Load the plotting library
import matplotlib.pyplot as plt
%matplotlib inline

# Open the file
filename = "output/wordcount.txt"
f = open(filename, 'r')

# Create lists to hold X and Y data
wordlist = []
frequency = []

In [20]:
# Populate the word and frequency lists
for line in f:
    line = line.split()
    wordlist.append(line[0])
    frequency.append(int(line[1]))

In [21]:
# Snip the lists to the first ten items
n_words = 10
wordlist = wordlist[:n_words]
frequency = frequency[:n_words]

# Check lists by printing them
print(wordlist)
print(frequency)


['the', 'of', 'and', 'to', 'a', 'i', 'he', 'you', 'that', 'it']
[3948, 2298, 2279, 2181, 1730, 1694, 1544, 1515, 1376, 1358]

In [22]:
# Plot the barchart
bookname = "Dorian Gray"
plt.barh(range(len(frequency)), frequency, align='center')
plt.title('Top ' + str(n_words) + ' words in ' + bookname)
plt.xlabel('Word frequency')
plt.yticks(range(len(frequency)), wordlist)
plt.show()
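
The exercise also asks for the chart to be saved. One way to do that (the filename below is arbitrary) is to write the figure to the output directory before displaying it:

# Redraw the bar chart and save it alongside the word counts
plt.barh(range(len(frequency)), frequency, align='center')
plt.title('Top ' + str(n_words) + ' words in ' + bookname)
plt.xlabel('Word frequency')
plt.yticks(range(len(frequency)), wordlist)
plt.savefig('output/wordcount_barchart.png')
plt.show()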


6. Plot the dispersion of words

Using the Python Natural Language Toolkit (NLTK), we will also look at the dispersion of selected words throughout the book.


In [23]:
# import the toolkit
import nltk, re, pprint
from nltk import word_tokenize
nltk.download('punkt')


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/tompollard/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Out[23]:
True

In [24]:
# open the book
f = open('dorian.txt')
rawtext = f.read() # string of contents

In [25]:
# tokenize the text
tokens = word_tokenize(rawtext)
print(tokens[:10])


['The', 'Project', 'Gutenberg', 'EBook', 'of', 'The', 'Picture', 'of', 'Dorian', 'Gray']

In [26]:
# convert the tokens to NLTK formatted text
doriantext = nltk.Text(tokens)
type(doriantext)


Out[26]:
nltk.text.Text

In [141]:
# select the words for our dispersion plot
words = ['Dorian','Sibyl','James','Charming','love','marriage','beautiful','corruption','horrible']

In [142]:
# plot the appearance of our words throughout the book
nltk.draw.dispersion_plot(doriantext, words)


[A lexical dispersion plot is displayed, showing where each of the selected words appears across the text.]
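
Equivalently, the same plot can be produced by calling the dispersion_plot method on the NLTK Text object itself:

# Same plot via the Text object's method
doriantext.dispersion_plot(words)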