Textual Analysis in Python (for DHers, etc.)

Part Two: Natural Language Toolkit (NLTK)

Author: A. Sean Pue

A lesson for AL340 Digital Humanities Seminar (Spring 2015)

--> Double-click here to start <--

Welcome!

This is an IPython Notebook. It contains numerous cells, which can be of different types (Markdown, code, headings). This is lesson two. You may need to install the Natural Language Toolkit (NLTK) before you begin.

Your turn

Select the cell above this one by double-clicking it with your mouse.

You can see that it contains text in a format called Markdown. To execute the cell, press Shift+Enter. To complete this tutorial (which is meant for a classroom), execute the cells in order and follow the instructions.
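
If you don't have NLTK installed yet, a cell like the one below should take care of it (this assumes pip is available on your machine; nltk.download() with no argument opens an interactive downloader instead).


In [ ]:
# install NLTK from inside the notebook (assumes pip is available)
!pip install nltk

import nltk
nltk.download('stopwords')  # stopword list used in the normalization step later on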


In [ ]:
# the next line imports the NLTK module
import nltk
# the next line makes graphics appear in the notebook rather than in a separate window
%matplotlib inline

Let's build a corpus.

Here we will work with this file: http://www.gutenberg.org/cache/epub/11/pg11.txt

First, let's download it.


In [ ]:
import urllib2  # Python 2 library for opening URLs (urllib.request in Python 3)
url = 'http://www.gutenberg.org/cache/epub/11/pg11.txt'

# download the full text and decode the bytes as UTF-8
t = urllib2.urlopen(url).read().decode('utf8')
assert t  # make sure we actually got some text
len(t)    # number of characters in the raw text

In [ ]:
text_starts = t.index('CHAPTER I')  # find where the first chapter starts
text_ends = t.index('THE END')      # find where the book's text ends
t = t[text_starts:text_ends]        # keep only the body of the book, dropping the Gutenberg header and footer
len(t)

Let's split the text into different chapters


In [ ]:
import re # imports regular expression module (which allows fancy text searching)
# The regular expression matches each chapter: from 'CHAPTER' up to (but not including)
# the next 'CHAPTER' or the end of the text; re.DOTALL lets '.' match newlines too.
chapters = re.findall('CHAPTER .+?(?=CHAPTER|$)', t, re.DOTALL)
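
As a quick sanity check, we can count the chapters we found; Alice's Adventures in Wonderland has twelve.


In [ ]:
len(chapters)  # should be 12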

In [ ]:
# Here is a sample of the beginning of chapter 2 (chapters[1], since Python counts from zero)
chapters[1][0:500]

Now let's write the chapters to separate text files


In [ ]:
import os
if not os.path.exists('data/alice'):
    os.makedirs('data/alice')  # create the output directory if it does not already exist
for chapter_number, chapter_text in enumerate(chapters):
    file_name = 'data/alice/chapter-' + str(chapter_number + 1) + '.txt'
    with open(file_name, 'w') as f:
        f.write(chapter_text.encode('utf8'))  # encode the Unicode text back to UTF-8 bytes before writing
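
As an optional check, we can list the directory to make sure the chapter files were actually written.


In [ ]:
import os
sorted(os.listdir('data/alice'))  # should show chapter-1.txt through chapter-12.txt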

Now, using NLTK, let's create a corpus from those files


In [ ]:
from nltk.corpus.reader.plaintext import PlaintextCorpusReader


corpusdir = 'data/alice/'  # directory holding the corpus files
corpus0 = PlaintextCorpusReader(corpusdir, '.*')  # read every file in that directory
corpus  = nltk.Text(corpus0.words())  # wrap the corpus words in an nltk.Text object for analysis

Analyzing text

How many words are there?


In [ ]:
len(corpus)

How many unique words are there?


In [ ]:
len(set(corpus))
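
A related measure is lexical diversity: the proportion of unique words among all the words. (This cell is an extra step, not in the original lesson.)


In [ ]:
len(set(corpus)) / float(len(corpus))  # unique words divided by total words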

How do we access particular words?


In [ ]:
corpus[0:10]

How do we find words in contexts?


In [ ]:
corpus.concordance('Alice')  # pass a lines argument, e.g. corpus.concordance('Alice', lines=100), to show more matches

How do we find words that appear in similar contexts?


In [ ]:
corpus.similar('caterpillar')

How do we find the contexts that two or more words share?


In [ ]:
corpus.common_contexts(['hatter','queen'])

Now let's build a frequency distribution


In [ ]:
fd = nltk.FreqDist(corpus)

How many times does a word occur?


In [ ]:
fd['Alice']

What are all the words?


In [ ]:
fd.keys()
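
To see the most frequent words together with their counts, FreqDist also offers most_common (this assumes NLTK 3, where a FreqDist behaves like a Counter).


In [ ]:
fd.most_common(20)  # the twenty most frequent words and how often each occurs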

Which words occur only once? (hapax legomena)


In [ ]:
fd.hapaxes()

How do you plot the most frequent words?


In [ ]:
fd.plot(20,cumulative=False)

How do we normalize the text?


In [ ]:
new_corpus = [w.lower() for w in corpus if w.isalpha()] # keep only alphabetic words and lowercase them
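
It is worth peeking at the result to confirm the normalization worked.


In [ ]:
new_corpus[0:10]  # the first ten words, now lowercased and punctuation-free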

In [ ]:
stopwords = nltk.corpus.stopwords.words('english')  # common English function words ('the', 'of', 'and', ...)
print stopwords

In [ ]:
new_corpus = [w for w in new_corpus if w not in stopwords] # remove the stopwords
corpus = nltk.Text(new_corpus)  # rebuild the nltk.Text from the normalized words

In [ ]:
fd2 = nltk.FreqDist(corpus) # create a new frequency distribution from the normalized corpus

In [ ]:
fd2.plot(20,cumulative=False)

How do you see where words occur across the text?


In [ ]:
corpus.dispersion_plot(['alice','rabbit','queen'])  # plot where each word occurs across the length of the corpus

How do you find collocations (words that frequently occur together)?


In [ ]:
corpus.collocations(num=1000)  # print up to 1000 collocations found in the corpus
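
For finer control, NLTK's collocations module can rank word pairs by an association measure such as pointwise mutual information (PMI). Here is a minimal sketch (an extra step beyond the original lesson):


In [ ]:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(corpus)  # count adjacent word pairs in the corpus
finder.apply_freq_filter(3)                          # ignore pairs that occur fewer than three times
finder.nbest(bigram_measures.pmi, 10)                # the ten pairs with the highest PMI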

How do we find bigrams, trigrams, and n-grams?


In [ ]:
for x in nltk.bigrams(corpus):  # every pair of adjacent words (this prints a long list)
    print x

In [ ]:
for x in nltk.trigrams(corpus):  # every run of three adjacent words
    print x

In [ ]:
for x in nltk.ngrams(corpus,7):  # every run of seven adjacent words
    print x
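
Rather than printing every n-gram, we can count which ones are most frequent (this uses most_common, which assumes NLTK 3).


In [ ]:
bigram_fd = nltk.FreqDist(nltk.bigrams(corpus))  # frequency distribution over adjacent word pairs
bigram_fd.most_common(10)                        # the ten most frequent pairs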

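How do you tag parts of speech?

The next cell labels each word with a part-of-speech tag (noun, verb, adjective, and so on). If NLTK complains about a missing tagger model, run nltk.download() and install the tagger resource it names.
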
In [ ]:
nltk.pos_tag(corpus[0:1000])  # tag the first 1000 words with their parts of speech

The End!


In [ ]: