This exercise uses a simple example to demonstrate how virtual environments can facilitate reproducible research.
We run the word-count code from previous exercises inside a virtual environment provided by Docker.
First, we'll download a book from Project Gutenberg.
In [14]:
!wget http://www.gutenberg.org/files/174/174.zip -O dorian.zip
In [15]:
!unzip dorian.zip
!mv 174.txt dorian.txt
!head dorian.txt
In [17]:
!cat countwords
In [18]:
!./countwords dorian.txt
In [7]:
!head output/wordcount.txt
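The contents of `countwords` are only displayed with `cat` and are not reproduced in this text. As a rough idea of the kind of counting such a script performs, here is a minimal Python sketch. The function name `count_words`, the lowercasing, and the tokenization rule are all assumptions made for illustration; the real script may count differently.

```python
import re
from collections import Counter


def count_words(text):
    """Return (word, count) pairs sorted by descending count."""
    # Lowercase the text and keep runs of letters and apostrophes.
    # This tokenization rule is an assumption, not taken from countwords.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()


sample = "the cat and the hat and the bat"
for word, count in count_words(sample)[:2]:
    print(word, count)
```

Printing one `word count` pair per line matches the layout that the plotting cells expect when they split each line of `output/wordcount.txt` on whitespace.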
In [19]:
# Load the plotting library
import matplotlib.pyplot as plt
%matplotlib inline
# Open the file
filename = "output/wordcount.txt"
f = open(filename, 'r')
# Create lists to hold X and Y data
wordlist = []
frequency = []
In [20]:
# Populate the word and frequency lists
for line in f:
    line = line.split()
    wordlist.append(line[0])
    frequency.append(int(line[1]))
In [21]:
# Snip the lists to the first ten items
n_words = 10
wordlist = wordlist[:n_words]
frequency = frequency[:n_words]
# Check lists by printing them
print(wordlist)
print(frequency)
In [22]:
# Plot the barchart
bookname = "Dorian Gray"
plt.barh(range(len(frequency)),frequency,align='center')
plt.title('Top ' + str(n_words) + ' words in ' + bookname)
plt.xlabel('Word frequency')
plt.yticks(range(len(frequency)),wordlist)
plt.show()
Using the Python Natural Language Toolkit (NLTK), we will also look at the dispersion of selected words throughout the book.
In [23]:
# import the toolkit
import nltk, re, pprint
from nltk import word_tokenize
nltk.download('punkt')
In [24]:
# open the book
f = open('dorian.txt')
rawtext = f.read() # string of contents
In [25]:
# tokenize the text
tokens = word_tokenize(rawtext)
print(tokens[:10])
In [26]:
# convert the tokens to NLTK formatted text
doriantext = nltk.Text(tokens)
type(doriantext)
In [141]:
# select the words for our dispersion plot
words = ['Dorian','Sibyl','James','Charming','love','marriage','beautiful','corruption','horrible']
In [142]:
# plot the appearance of our words throughout the book
nltk.draw.dispersion_plot(doriantext,words)
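A dispersion plot marks, for each selected word, the positions in the token sequence where that word occurs. The sketch below illustrates that idea in plain Python. The helper `word_offsets` is hypothetical and is not part of NLTK; it only shows the kind of data that such a plot visualizes.

```python
def word_offsets(tokens, targets):
    """Map each target word to the token positions where it occurs."""
    offsets = {word: [] for word in targets}
    for position, token in enumerate(tokens):
        # Matching is exact and case-sensitive in this sketch.
        if token in offsets:
            offsets[token].append(position)
    return offsets


tokens = "Dorian saw Sibyl and Dorian felt love".split()
print(word_offsets(tokens, ["Dorian", "love", "marriage"]))
# → {'Dorian': [0, 4], 'love': [6], 'marriage': []}
```

A word that never appears, such as `marriage` in this toy sentence, simply gets an empty list, which would correspond to an empty row in the plot.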