Text Analysis and Visualization with Python and the NLTK

This notebook was originally prepared for use during a workshop called "An Introduction to Visualizing Text with Python," which took place during Columbia's Art of Data Visualization week in April 2016. But you can run these commands yourself. To begin, you'll need this software:

  • Python 3
  • These Python 3 packages (make sure you don't install their Python 2 versions):
    • Jupyter (formerly called IPython Notebook)
    • NLTK (the Natural Language Processing Toolkit)
    • Pandas (a data science library)
    • Wordcloud

There are several ways to get this software: you can install it on your computer, or run it in the cloud. A few options are described below. When you see text in a monospace typeface, those are commands to be entered in the terminal. On a Mac, open a terminal by typing "Terminal" into Spotlight. On Windows, press Win+R, type cmd, and press Enter to open a terminal.

Installation on Linux

  1. Make sure your package list is up to date: sudo apt-get update
  2. Get Python 3: sudo apt-get install python3 python3-pip python3-pandas
  3. Get the Python packages: sudo pip3 install jupyter nltk wordcloud
  4. Start a Jupyter notebook: jupyter notebook

Installation on Mac or Windows

  1. Get Anaconda with Python 3 from https://www.continuum.io/downloads
  2. Anaconda comes with Pandas, NLTK, and Jupyter, so just install Wordcloud: conda install -c conda-forge wordcloud (if conda can't find it, see the note after this list)
  3. Start a Jupyter notebook: jupyter notebook
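
If conda can't find the wordcloud package, installing it with pip from inside your Anaconda environment should also work: pip install wordcloud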

Or Use DHBox

  1. Go to http://dhbox.org and click "log in." Log into the workshop box with the credentials I gave earlier.
  2. In the "dataviz-2017" profile menu in the top right, click "Apps."
  3. Click the "Jupyter Notebook" tab.

In [ ]:
# Get the Natural Language Processing Toolkit
import nltk

In [ ]:
nltk.download('book') # You only need to run this command once, to get the NLTK book data.

In [ ]:
# Get the data science package Pandas
import pandas as pd

# Get the library matplotlib for making pretty charts
import matplotlib
import matplotlib.pyplot as plt

# Make plots appear here in this notebook
%matplotlib inline

# Make the default plot size bigger, so the plots are easier to see.
plt.rcParams['figure.figsize'] = (12,4)

# We'll use the os module to run shell commands (here, to download a file).
import os

# Get all the example books from the NLTK textbook
from nltk.book import *

Working with Our Own Text


In [ ]:
# Download Alice in Wonderland
os.system('wget http://www.gutenberg.org/files/11/11-0.txt')
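
The wget command is generally available on Linux (and on DHBox), but may be missing on Mac or Windows. If so, here's a minimal stand-in that uses only Python's standard library:

In [ ]:
# Download the same file without wget, using urllib from the standard library
import urllib.request
urllib.request.urlretrieve('http://www.gutenberg.org/files/11/11-0.txt', '11-0.txt')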

In [ ]:
# Tokenize it (break it into words), and make an NLTK Text object out of it.
aliceRaw = open('11-0.txt', encoding='utf-8').read()  # Project Gutenberg files are UTF-8
aliceWords = nltk.word_tokenize(aliceRaw)
alice = nltk.Text(aliceWords, name='Alice in Wonderland')  # Give it a name, for nicer labels later

In [ ]:
alice

Exploring Texts

Let's explore these texts a little. There are lots of things we can do with these texts. To see a list, type text1. and press <Tab>. One thing we can do is look at statistically significant co-occurring two-word phrases, here known as collocations:


In [ ]:
text1.collocations()

In [ ]:
text2.collocations()

But what if we get tired of doing that for each text, and want to do it with all of them? Let's put the texts into a list.


In [ ]:
alltexts = [text1, text2, text3, text4, text5, text6, text7, text8, text9, alice]

Let's look at it to make sure it's all there.


In [ ]:
alltexts

Now that we have a list of all the texts, we can loop through each one, running the collocations() function on each:


In [ ]:
for text in alltexts:     # For each text in the list "alltexts,"
    text.collocations()   # Get the collocations
    print('---')          # Print a divider between the collocations

Concordances and Dispersion Plots

Now let's look up an individual word in a text, and have NLTK give us some context:


In [ ]:
text6.concordance('shrubbery')

Not bad. But what if we want to see visually where those words occur over the course of the text? We can use the function dispersion_plot:


In [ ]:
text6.dispersion_plot(['shrubbery', 'ni'])

Let's try that on Moby Dick:


In [ ]:
text1.dispersion_plot(['Ahab', 'Ishmael', 'whale'])

By looking at dispersion plots of characters' names, we can almost tell which characters in Sense and Sensibility have romantic relationships:


In [ ]:
text2.dispersion_plot(['Elinor', 'Marianne', 'Edward', 'Willoughby'])

Measuring Text Vocabulary

We can use the len (length) function to count the total number of words in a text:


In [ ]:
len(text1)

And we can do this for all the texts at once by building a dictionary that maps each text's name to its length, like this:


In [ ]:
lengths = {text.name: len(text) for text in alltexts}
lengths

If we import this table into Pandas, we can see the data a little more easily:


In [ ]:
pd.Series(lengths)

And by plotting it, we can get a better visual representation:


In [ ]:
pd.Series(lengths).plot(kind='bar')
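
If you'd rather compare the bars in order, sorting the series before plotting makes the chart easier to read:

In [ ]:
pd.Series(lengths).sort_values().plot(kind='bar') # Sort the texts by length first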

But word counts themselves are not very interesting, so let's see if we can not only count the words, but measure the vocabulary of a text. To do that, we can use set(), which keeps just one copy of each distinct word.


In [ ]:
porky_sentence = "the the the the the that's all folks"
porky_words = porky_sentence.split()
porky_words

We can count the words in the sentence easily:


In [ ]:
len(porky_words)

To ignore repeated words, we can turn the list into a set with the function set():


In [ ]:
set(porky_words)

So if we count this set, we can determine the vocabulary of a text:


In [ ]:
len(set(porky_words))

Let's see if we can find the vocabulary of Moby Dick.


In [ ]:
len(set(text1))

Pretty big, but then again, Moby Dick is kind of a long novel. We can adjust for the length of the text by dividing the total number of words by the number of unique words:


In [ ]:
len(text1) / len(set(text1))

This would get tedious if we did this for every text, so let's write a function!


In [ ]:
def vocab(text):                       # Define a function called `vocab` that takes the input `text` 
    return len(text) / len(set(text))  # Divide the number of words by the number of unique words.

In [ ]:
vocab(porky_words)

Let's go through each text, get its vocabulary score (the average number of times each unique word is used), and put it in a table.


In [ ]:
vocabularies = {text.name: vocab(text) for text in alltexts}

Let's put that table into Pandas so we can see it better:


In [ ]:
pd.Series(vocabularies)

Now let's plot that:


In [ ]:
pd.Series(vocabularies).plot(kind='bar')

OK, now let's make a famous word cloud from a text. This just takes the most frequent words and plots them so that the size of each word corresponds to its frequency.


In [ ]:
from wordcloud import WordCloud # Get the library

In [ ]:
rawtext = ' '.join(text1.tokens) # Stitch the tokens back together into one string.
wc = WordCloud(width=1000, height=600, background_color='white').generate(rawtext)
plt.axis('off')                           # Turn off axis ticks
plt.imshow(wc, interpolation='bilinear'); # Plot it (the semicolon hides the text output)
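
If you want to keep the image, the cloud can also be written out to a file (pick any filename you like):

In [ ]:
wc.to_file('moby-wordcloud.png') # Save the word cloud as a PNG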

Plotting Words (Conditional Frequency Distributions)

Now let's take a look at the inaugural address corpus in detail.


In [ ]:
from nltk.corpus import inaugural

We'll set up a conditional word frequency distribution for it, pairing each target word with the years of the addresses it appears in.


In [ ]:
plt.rcParams['figure.figsize'] = (14,5) # Adjust the plot size. 
cfd = nltk.ConditionalFreqDist(
           (target, fileid[:4])  # Pair each target word with the address's year (the first four characters of its filename)
           for fileid in inaugural.fileids()
           for w in inaugural.words(fileid)
           for target in ['america', 'citizen']
           if w.lower().startswith(target))
cfd.plot()

You can replace the words 'america' and 'citizen' here with whatever words you want, to further explore this corpus.
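
For example, here's the same plot for two other words (any words will do):

In [ ]:
cfd = nltk.ConditionalFreqDist(
           (target, fileid[:4])
           for fileid in inaugural.fileids()
           for w in inaugural.words(fileid)
           for target in ['freedom', 'war']
           if w.lower().startswith(target))
cfd.plot()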

Now let's play around with the Brown corpus. It's a categorized text corpus. Let's see all the categories:


In [ ]:
nltk.corpus.brown.categories()

Now let's create another conditional frequency distribution, this time based on these genres.


In [ ]:
genres = ['adventure', 'romance', 'science_fiction']
words = ['can', 'could', 'may', 'might', 'must', 'will']
cfdist = nltk.ConditionalFreqDist(
              (genre, word)
              for genre in genres
              for word in nltk.corpus.brown.words(categories=genre)
              if word in words)

In [ ]:
cfdist

Finally, we can plot these words by genre:


In [ ]:
pd.DataFrame(cfdist).T.plot(kind='bar')

Play around with this CFD a bit by changing the genres and words used above.
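
For example, here it is again with different genres and words, chosen arbitrarily:

In [ ]:
cfdist = nltk.ConditionalFreqDist(
              (genre, word)
              for genre in ['news', 'romance', 'humor']
              for word in nltk.corpus.brown.words(categories=genre)
              if word in ['love', 'money', 'time'])
pd.DataFrame(cfdist).T.plot(kind='bar')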

Further resources

To learn more, check out the NLTK book (from which a lot of the examples here were adapted): http://nltk.org/book

To see what's possible with more advanced techniques, using the SpaCy library, check out my workshop notebook on advanced text analysis.

Read about some experiments in text analysis on my blog: http://jonreeve.com

