Textual Analysis in Python (for DHers, etc.)

Part Two: Natural Language Toolkit (NLTK)

Author: A. Sean Pue

A lesson for AL340 Digital Humanities Seminar (Spring 2015)

--> Double-click here to start <--

Welcome!

This is an IPython Notebook. It contains numerous cells, which can be of different types (Markdown, code, headings). This is lesson two. You may need to install the Natural Language Toolkit (NLTK) before you begin.

Your turn

Select the cell above this one by double-clicking it with your mouse.

You can see that it contains text in a format called Markdown. To execute the cell, press Shift+Enter. To complete this tutorial (which is meant for a classroom), execute the cells in order and follow the instructions.
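
If you don't have NLTK installed yet, a cell like the one below should take care of it (this assumes pip is available on your machine; nltk.download() with no argument opens an interactive downloader instead).


In [ ]:
# install NLTK from inside the notebook (assumes pip is available)
!pip install nltk

import nltk
nltk.download('stopwords')  # stopword list used in the normalization step later on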


In [ ]:
# the next line imports the NLTK module
import nltk
# the next line makes graphics appear in the notebook rather than in a separate window
%matplotlib inline

Let's build a corpus.

Here we will work with this file: http://www.gutenberg.org/cache/epub/11/pg11.txt

First, let's download it.


In [ ]:
import urllib2  # Python 2 library for opening URLs (urllib.request in Python 3)
url = 'http://www.gutenberg.org/cache/epub/11/pg11.txt'

# download the full text and decode the bytes as UTF-8
t = urllib2.urlopen(url).read().decode('utf8')
assert t  # make sure we actually got some text
len(t)    # number of characters in the raw text

In [ ]:
text_starts = t.index('CHAPTER I')  # find where the first chapter starts
text_ends = t.index('THE END')      # find where the book's text ends
t = t[text_starts:text_ends]        # keep only the body of the book, dropping the Gutenberg header and footer
len(t)

Let's split the text into different chapters


In [ ]:
import re # imports regular expression module (which allows fancy text searching)
# The regular expression matches each chapter: from 'CHAPTER' up to (but not including)
# the next 'CHAPTER' or the end of the text; re.DOTALL lets '.' match newlines too.
chapters = re.findall('CHAPTER .+?(?=CHAPTER|$)', t, re.DOTALL)
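
As a quick sanity check, we can count the chapters we found; Alice's Adventures in Wonderland has twelve.


In [ ]:
len(chapters)  # should be 12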

In [ ]:
# Here is a sample of the beginning of chapter 2 (chapters[1], since Python counts from zero)
chapters[1][0:500]

Now let's write the chapters to separate text files


In [ ]:
import os
if not os.path.exists('data/alice'):
    os.makedirs('data/alice')  # create the output directory if it does not already exist
for chapter_number, chapter_text in enumerate(chapters):
    file_name = 'data/alice/chapter-' + str(chapter_number + 1) + '.txt'
    with open(file_name, 'w') as f:
        f.write(chapter_text.encode('utf8'))  # encode the Unicode text back to UTF-8 bytes before writing
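
As an optional check, we can list the directory to make sure the chapter files were actually written.


In [ ]:
import os
sorted(os.listdir('data/alice'))  # should show chapter-1.txt through chapter-12.txt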

Now, using NLTK, let's create a corpus from those files


In [ ]:
from nltk.corpus.reader.plaintext import PlaintextCorpusReader


corpusdir = 'data/alice/'  # directory holding the corpus files
corpus0 = PlaintextCorpusReader(corpusdir, '.*')  # read every file in that directory
corpus  = nltk.Text(corpus0.words())  # wrap the corpus words in an nltk.Text object for analysis

Analyzing text

How many words are there?


In [ ]:
len(corpus)

How many unique words are there?


In [ ]:
len(set(corpus))
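
A related measure is lexical diversity: the proportion of unique words among all the words. (This cell is an extra step, not in the original lesson.)


In [ ]:
len(set(corpus)) / float(len(corpus))  # unique words divided by total words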

How do we access particular words?


In [ ]:
corpus[0:10]

How do we find words in contexts?


In [ ]:
corpus.concordance('Alice')  # pass a lines argument, e.g. corpus.concordance('Alice', lines=100), to show more matches

How do we find words that appear in similar contexts?


In [ ]:
corpus.similar('caterpillar')

How do we find the contexts that two or more words share?


In [ ]:
corpus.common_contexts(['hatter','queen'])

Now let's build a frequency distribution


In [ ]:
fd = nltk.FreqDist(corpus)

How many times does a word occur?


In [ ]:
fd['Alice']

What are all the words?


In [ ]:
fd.keys()
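
To see the most frequent words together with their counts, FreqDist also offers most_common (this assumes NLTK 3, where a FreqDist behaves like a Counter).


In [ ]:
fd.most_common(20)  # the twenty most frequent words and how often each occurs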

Which words occur only once? (hapax legomena)


In [ ]:
fd.hapaxes()

How do you plot the most frequent words?


In [ ]:
fd.plot(20,cumulative=False)

How do we normalize the text?


In [ ]:
new_corpus = [w.lower() for w in corpus if w.isalpha()] # keep only alphabetic words and lowercase them
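
It is worth peeking at the result to confirm the normalization worked.


In [ ]:
new_corpus[0:10]  # the first ten words, now lowercased and punctuation-free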

In [ ]:
stopwords = nltk.corpus.stopwords.words('english')  # common English function words ('the', 'of', 'and', ...)
print stopwords

In [ ]:
new_corpus = [w for w in new_corpus if w not in stopwords] # remove the stopwords
corpus = nltk.Text(new_corpus)  # rebuild the nltk.Text from the normalized words

In [ ]:
fd2 = nltk.FreqDist(corpus) # create a new frequency distribution from the normalized corpus

In [ ]:
fd2.plot(20,cumulative=False)

How do you see where words occur across the text?


In [ ]:
corpus.dispersion_plot(['alice','rabbit','queen'])  # plot where each word occurs across the length of the corpus

How do you find collocations (words that frequently occur together)?


In [ ]:
corpus.collocations(num=1000)  # print up to 1000 collocations found in the corpus
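
For finer control, NLTK's collocations module can rank word pairs by an association measure such as pointwise mutual information (PMI). Here is a minimal sketch (an extra step beyond the original lesson):


In [ ]:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(corpus)  # count adjacent word pairs in the corpus
finder.apply_freq_filter(3)                          # ignore pairs that occur fewer than three times
finder.nbest(bigram_measures.pmi, 10)                # the ten pairs with the highest PMI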

How do we find bigrams, trigrams, and n-grams?


In [ ]:
for x in nltk.bigrams(corpus):  # every pair of adjacent words (this prints a long list)
    print x

In [ ]:
for x in nltk.trigrams(corpus):  # every run of three adjacent words
    print x

In [ ]:
for x in nltk.ngrams(corpus,7):  # every run of seven adjacent words
    print x
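
Rather than printing every n-gram, we can count which ones are most frequent (this uses most_common, which assumes NLTK 3).


In [ ]:
bigram_fd = nltk.FreqDist(nltk.bigrams(corpus))  # frequency distribution over adjacent word pairs
bigram_fd.most_common(10)                        # the ten most frequent pairs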

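How do you tag parts of speech?

The next cell labels each word with a part-of-speech tag (noun, verb, adjective, and so on). If NLTK complains about a missing tagger model, run nltk.download() and install the tagger resource it names.
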
In [ ]:
nltk.pos_tag(corpus[0:1000])  # tag the first 1000 words with their parts of speech

The End!


In [ ]: