This notebook was originally prepared for use during a workshop called "An Introduction to Visualizing Text with Python," which took place during Columbia's Art of Data Visualization week in April 2016. But you can run these commands yourself. To begin, you'll need Python 3, along with the Jupyter, NLTK, pandas, matplotlib, and wordcloud packages.
There are lots of different ways to get this software: you can either install it on your computer, or run it in the cloud. Here are a few ways of doing that. When you see text in a monospace typeface, those are commands to be entered in the terminal. On a Mac, open a terminal by typing "Terminal" into Spotlight. On Windows, press Win+R and type cmd to get a terminal. On a Debian- or Ubuntu-based Linux machine, these commands will install everything and start the notebook server:
sudo apt-get update
sudo apt-get install python3 python3-pip python3-pandas
sudo pip3 install jupyter nltk wordcloud
jupyter notebook
If you use Anaconda instead, you can get the wordcloud package from the conda-forge channel:
conda install -c conda-forge wordcloud
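Alternatively, if you already have Python 3 and pip on macOS or Windows, a pip-only install should also work (this sketch assumes the pip command for Python 3 is called pip3 on your system; on some setups it's just pip):
pip3 install jupyter nltk wordcloud pandas matplotlib
jupyter notebook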
In [1]:
# Get the Natural Language Processing Toolkit
import nltk
In [ ]:
nltk.download('book') # You only need to run this command once, to get the NLTK book data.
In [3]:
# Get the data science package Pandas
import pandas as pd
# Get the library matplotlib for making pretty charts
import matplotlib
import matplotlib.pyplot as plt
# Make plots appear here in this notebook
%matplotlib inline
# This just makes the plot size bigger, so that we can see it more easily.
plt.rcParams['figure.figsize'] = (12,4)
# We'll use the OS module to download things.
import os
# Get all the example books from the NLTK textbook
from nltk.book import *
In [4]:
# Download Alice in Wonderland
os.system('wget http://www.gutenberg.org/files/11/11-0.txt')
Out[4]:
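If wget isn't installed on your machine (it usually isn't on Windows or macOS), here's a sketch of an alternative that fetches the same file using only Python's standard library:
import urllib.request # Standard-library HTTP download, no wget required
urllib.request.urlretrieve('http://www.gutenberg.org/files/11/11-0.txt', '11-0.txt') # Save it as 11-0.txt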
In [5]:
# Tokenize it (break it into words), and make an NLTK Text object out of it.
aliceRaw = open('11-0.txt', encoding='utf-8').read() # The Gutenberg file is UTF-8 encoded
aliceWords = nltk.word_tokenize(aliceRaw)
alice = nltk.Text(aliceWords)
In [6]:
alice
Out[6]:
In [7]:
text1.collocations() # Find frequent multi-word phrases ("collocations") in Moby Dick
In [8]:
text2.collocations() # And in Sense and Sensibility
But what if we get tired of doing that for each text, and want to do it with all of them? Let's put the texts into a list.
In [9]:
alltexts = [text1, text2, text3, text4, text5, text6, text7, text8, text9, alice]
Let's look at it to make sure it's all there.
In [10]:
alltexts
Out[10]:
Now that we have a list of all the texts, we can loop through each one, running the collocations() function on each:
In [11]:
for text in alltexts: # For each text in the list "alltexts,"
    text.collocations() # get the collocations
    print('---') # and print a divider between the collocations
In [12]:
text6.concordance('shrubbery') # Show every occurrence of "shrubbery" in Monty Python and the Holy Grail, with context
Not bad. But what if we want to see visually where those words occur over the course of the text? We can use the function dispersion_plot:
In [13]:
text6.dispersion_plot(['shrubbery', 'ni'])
Let's try that on Moby Dick:
In [14]:
text1.dispersion_plot(['Ahab', 'Ishmael', 'whale'])
By looking at dispersion plots of characters' names, we can almost tell which characters in Sense and Sensibility have romantic relationships:
In [15]:
text2.dispersion_plot(['Elinor', 'Marianne', 'Edward', 'Willoughby'])
In [16]:
len(text1) # Count the total number of words in Moby Dick
Out[16]:
And we can do this for all the texts at once by building a lookup table (a dictionary) that maps each text's name to its length, like this:
In [17]:
lengths = {text.name: len(text) for text in alltexts}
lengths
Out[17]:
If we load this table into Pandas, we can read the data a little more easily:
In [18]:
pd.Series(lengths)
Out[18]:
And by plotting it, we can get a better visual representation:
In [19]:
pd.Series(lengths).plot(kind='bar')
Out[19]:
But word counts themselves are not very interesting, so let's see if we can not only count the words, but also measure the vocabulary of a text. To do that, we can use set(), which keeps just one copy of each distinct word.
In [20]:
porky_sentence = "the the the the the that's all folks"
porky_words = porky_sentence.split()
porky_words
Out[20]:
We can count the words in the sentence easily:
In [21]:
len(porky_words)
Out[21]:
To get the distinct words, ignoring repetitions, we can use the function set():
In [22]:
set(porky_words)
Out[22]:
So if we count this set, we can determine the vocabulary of a text:
In [23]:
len(set(porky_words))
Out[23]:
Let's see if we can find the vocabulary of Moby Dick.
In [24]:
len(set(text1))
Out[24]:
Pretty big, but then again, Moby Dick is kind of a long novel. We can adjust for its length by dividing the total number of words by the number of unique words:
In [25]:
len(text1) / len(set(text1))
Out[25]:
This would get tedious if we did this for every text, so let's write a function!
In [26]:
def vocab(text): # Define a function called `vocab` that takes the input `text`
    return len(text) / len(set(text)) # Divide the number of words by the number of unique words.
In [27]:
vocab(porky_words)
Out[27]:
Let's go through each text, get its vocabulary score, and put it in a table.
In [28]:
vocabularies = {text.name: vocab(text) for text in alltexts}
Let's put that table into Pandas so we can see it better:
In [29]:
pd.Series(vocabularies)
Out[29]:
Now let's plot that:
In [30]:
pd.Series(vocabularies).plot(kind='bar')
Out[30]:
OK, now let's make one of those famous word clouds from a text. A word cloud takes the most frequent words in a text and plots them so that the size of each word corresponds to its frequency.
In [31]:
from wordcloud import WordCloud # Get the library
In [40]:
rawtext = ' '.join(text1.tokens) # Stitch it back together.
wc = WordCloud(width=1000, height=600, background_color='white').generate(rawtext)
# This just makes the plot size bigger, so that we can see it more easily.
plt.rcParams['figure.figsize'] = (12,4)
plt.figure()
plt.axis('off') # Turn off axis ticks
plt.imshow(wc, interpolation="bilinear");# Plot it
Now let's take a look at the inaugural address corpus in detail.
In [33]:
from nltk.corpus import inaugural
We'll set up a conditional word frequency distribution for it, pairing off a list of words with the list of inaugural addresses.
In [34]:
plt.rcParams['figure.figsize'] = (14,5) # Adjust the plot size.
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
cfd.plot()
You can replace the words 'america' and 'citizen' here with whatever words you want, to further explore this corpus.
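For example, here is a sketch that traces 'freedom' and 'war' instead (the word choices are just illustrative):
cfd2 = nltk.ConditionalFreqDist(
    (target, fileid[:4]) # Pair each target word with the year of the address
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['freedom', 'war'] # Swap in any words you'd like to trace
    if w.lower().startswith(target))
cfd2.plot()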
Now let's play around with the Brown corpus. It's a categorized text corpus. Let's see all the categories:
In [35]:
nltk.corpus.brown.categories()
Out[35]:
Now let's create another conditional frequency distribution, this time based on these genres.
In [36]:
genres = ['adventure', 'romance', 'science_fiction']
words = ['can', 'could', 'may', 'might', 'must', 'will']
cfdist = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in genres
    for word in nltk.corpus.brown.words(categories=genre)
    if word in words)
In [37]:
cfdist
Out[37]:
Finally, we can plot these words by genre:
In [38]:
pd.DataFrame(cfdist).T.plot(kind='bar')
Out[38]:
Play around with this CFD a bit by changing the genres and words used above.
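For example, here is a sketch comparing a few other modal verbs in the 'news', 'religion', and 'humor' categories (the choices are just illustrative; any categories listed by nltk.corpus.brown.categories() will work):
genres2 = ['news', 'religion', 'humor'] # Other categories from the Brown corpus
words2 = ['shall', 'should', 'would']
cfdist2 = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in genres2
    for word in nltk.corpus.brown.words(categories=genre)
    if word in words2)
pd.DataFrame(cfdist2).T.plot(kind='bar')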
To learn more, check out the NLTK book (from which a lot of the examples here were adapted): http://nltk.org/book
To see what's possible with more advanced techniques, using the SpaCy library, check out my workshop notebook on advanced text analysis.
Read about some experiments in text analysis on my blog: http://jonreeve.com