[Data, the Humanist's New Best Friend](index.ipynb)
*Class 11*

In this class you are expected to learn:

  • NLTK
  • Tokenization
  • Concordance
  • Co-Occurrence and similarity
  • Word and phrase frequencies
  • Dispersion plots
  • TextBlob
*It really is awesome!*

Natural Language Processing (NLP)

Extracted from Tooling up for Digital Humanities: The Text Deluge (highly recommended reading!):

According to one estimate, human beings created some 150 exabytes (billion gigabytes) of data in 2005 alone. This year, we will create approximately 1,200 exabytes. The Library of Congress announced its decision to archive Twitter, which includes the addition of some 50 million tweets per day. A search in Google Books for the phrase “slave trade” in July 2010, for example, returned the following: “About 1,600,000 results (0.21 seconds).” Scholars once accustomed to studying a handful of letters or a couple hundred diary entries are now faced with massive amounts of data that cannot possibly be analyzed in traditional ways.

The trend towards an increasing deluge of information raises the question posed by Gregory Crane in 2006: “What do you do with a million books?” “My answer to that question,” wrote Tanya Clement and others in a 2008 article, “is that whatever you do, you don't read them, because you can’t.”

And that's the key for text analysis today: to not read, which is still kind of ironic. But then, if we can't read one million books, or blogs, or a trillion tweets, or a hundred thousand margin notes, how are we supposed to analyze them? The answer is Natural Language Processing, or NLP.

There are a bunch of things that NLP can do for us; let's see some of them.

For most of them there is a package in Python, and most of the time that package is the Natural Language Toolkit, usually abbreviated as NLTK.

NLTK

The Natural Language Toolkit is a huge package that covers almost every text processing need you might have. It was designed with four primary goals in mind:

  • Simplicity: To provide an intuitive framework along with substantial building blocks, giving users a practical knowledge of NLP without getting bogged down in the tedious house-keeping usually associated with processing annotated language data.
  • Consistency: To provide a uniform framework with consistent interfaces and data structures, and easily-guessable method names.
  • Extensibility: To provide a structure into which new software modules can be easily accommodated, including alternative implementations and competing approaches to the same task.
  • Modularity: To provide components that can be used independently without needing to understand the rest of the toolkit.

The list of features is overwhelming. Unfortunately, we'll only see a fraction of them.

| Language processing task | NLTK modules | Functionality |
| --- | --- | --- |
| Accessing corpora | `nltk.corpus` | standardized interfaces to corpora and lexicons |
| String processing | `nltk.tokenize`, `nltk.stem` | tokenizers, sentence tokenizers, stemmers |
| Collocation discovery | `nltk.collocations` | t-test, chi-squared, point-wise mutual information |
| Part-of-speech tagging | `nltk.tag` | n-gram, backoff, Brill, HMM, TnT |
| Classification | `nltk.classify`, `nltk.cluster` | decision tree, maximum entropy, naive Bayes, EM, k-means |
| Chunking | `nltk.chunk` | regular expression, n-gram, named-entity |
| Parsing | `nltk.parse` | chart, feature-based, unification, probabilistic, dependency |
| Semantic interpretation | `nltk.sem`, `nltk.inference` | lambda calculus, first-order logic, model checking |
| Evaluation metrics | `nltk.metrics` | precision, recall, agreement coefficients |
| Probability and estimation | `nltk.probability` | frequency distributions, smoothed probability distributions |
| Applications | `nltk.app`, `nltk.chat` | graphical concordancer, parsers, WordNet browser, chatbots |
| Linguistic fieldwork | `nltk.toolbox` | manipulate data in SIL Toolbox format |

If this is the first time you've used NLTK (and I'm pretty sure it is), you need to download some files that NLTK needs: books, corpora, information for the tagger, dictionaries, etc. NLTK brings its own downloader; all you have to do is import the module and invoke download(). The downloader will then ask you what you want to do, so you type d for download, and then all to download everything, everything! It may take some time, but you'll do this only once.


In [5]:
import nltk 
nltk.download()


NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> l
Packages:
  [-] book_grammars....... Grammars from NLTK Book
  [-] brown............... Brown Corpus
  [ ] framenet_v15........ FrameNet 1.5
  [-] maxent_ne_chunker... ACE Named Entity Chunker (Maximum entropy)
  [-] maxent_treebank_pos_tagger Treebank Part of Speech Tagger (Maximum entropy)
  [-] punkt............... Punkt Tokenizer Models
  [-] sample_grammars..... Sample Grammars
  [-] tagsets............. Help on Tagsets
  [-] udhr2............... Universal Declaration of Human Rights Corpus
                           (Unicode Version)
  [ ] universal_tagset.... Mappings to the Universal Part-of-Speech Tagset

Collections:
  [-] all-corpora......... All the corpora
  [-] all................. All packages
  [-] book................ Everything used in the NLTK Book

([*] marks installed packages; [-] marks out-of-date or corrupt packages)

Download which package (l=list; x=cancel)?
  Identifier> book_grammars all
    Downloading package book_grammars to /home/me/nltk_data...
      Unzipping grammars/book_grammars.zip.
    Downloading collection 'all'
       | 
       | Downloading package abc to /home/me/nltk_data...
       |   Package abc is already up-to-date!
       | Downloading package alpino to /home/me/nltk_data...
       |   Package alpino is already up-to-date!
       | Downloading package biocreative_ppi to
       |     /home/me/nltk_data...
       |   Package biocreative_ppi is already up-to-date!
       | Downloading package brown to /home/me/nltk_data...
       |   Unzipping corpora/brown.zip.
       | Downloading package brown_tei to /home/me/nltk_data...
       |   Package brown_tei is already up-to-date!
       | Downloading package cess_cat to /home/me/nltk_data...
       |   Package cess_cat is already up-to-date!
       | Downloading package cess_esp to /home/me/nltk_data...
       |   Package cess_esp is already up-to-date!
       | Downloading package chat80 to /home/me/nltk_data...
       |   Package chat80 is already up-to-date!
       | Downloading package city_database to
       |     /home/me/nltk_data...
       |   Package city_database is already up-to-date!
       | Downloading package cmudict to /home/me/nltk_data...
       |   Package cmudict is already up-to-date!
       | Downloading package comtrans to /home/me/nltk_data...
       |   Package comtrans is already up-to-date!
       | Downloading package conll2000 to /home/me/nltk_data...
       |   Package conll2000 is already up-to-date!
       | Downloading package conll2002 to /home/me/nltk_data...
       |   Package conll2002 is already up-to-date!
       | Downloading package conll2007 to /home/me/nltk_data...
       |   Package conll2007 is already up-to-date!
       | Downloading package dependency_treebank to
       |     /home/me/nltk_data...
       |   Package dependency_treebank is already up-to-date!
       | Downloading package europarl_raw to /home/me/nltk_data...
       |   Package europarl_raw is already up-to-date!
       | Downloading package floresta to /home/me/nltk_data...
       |   Package floresta is already up-to-date!
       | Downloading package framenet_v15 to /home/me/nltk_data...
       |   Unzipping corpora/framenet_v15.zip.
       | Downloading package gazetteers to /home/me/nltk_data...
       |   Package gazetteers is already up-to-date!
       | Downloading package genesis to /home/me/nltk_data...
       |   Package genesis is already up-to-date!
       | Downloading package gutenberg to /home/me/nltk_data...
       |   Package gutenberg is already up-to-date!
       | Downloading package ieer to /home/me/nltk_data...
       |   Package ieer is already up-to-date!
       | Downloading package inaugural to /home/me/nltk_data...
       |   Package inaugural is already up-to-date!
       | Downloading package indian to /home/me/nltk_data...
       |   Package indian is already up-to-date!
       | Downloading package jeita to /home/me/nltk_data...
       |   Package jeita is already up-to-date!
       | Downloading package kimmo to /home/me/nltk_data...
       |   Package kimmo is already up-to-date!
       | Downloading package knbc to /home/me/nltk_data...
       |   Package knbc is already up-to-date!
       | Downloading package langid to /home/me/nltk_data...
       |   Package langid is already up-to-date!
       | Downloading package lin_thesaurus to
       |     /home/me/nltk_data...
       |   Package lin_thesaurus is already up-to-date!
       | Downloading package mac_morpho to /home/me/nltk_data...
       |   Package mac_morpho is already up-to-date!
       | Downloading package machado to /home/me/nltk_data...
       |   Package machado is already up-to-date!
       | Downloading package movie_reviews to
       |     /home/me/nltk_data...
       |   Package movie_reviews is already up-to-date!
       | Downloading package names to /home/me/nltk_data...
       |   Package names is already up-to-date!
       | Downloading package nombank.1.0 to /home/me/nltk_data...
       |   Package nombank.1.0 is already up-to-date!
       | Downloading package nps_chat to /home/me/nltk_data...
       |   Package nps_chat is already up-to-date!
       | Downloading package paradigms to /home/me/nltk_data...
       |   Package paradigms is already up-to-date!
       | Downloading package pil to /home/me/nltk_data...
       |   Package pil is already up-to-date!
       | Downloading package pl196x to /home/me/nltk_data...
       |   Package pl196x is already up-to-date!
       | Downloading package ppattach to /home/me/nltk_data...
       |   Package ppattach is already up-to-date!
       | Downloading package problem_reports to
       |     /home/me/nltk_data...
       |   Package problem_reports is already up-to-date!
       | Downloading package propbank to /home/me/nltk_data...
       |   Package propbank is already up-to-date!
       | Downloading package ptb to /home/me/nltk_data...
       |   Package ptb is already up-to-date!
       | Downloading package oanc_masc to /home/me/nltk_data...
       |   Package oanc_masc is already up-to-date!
       | Downloading package qc to /home/me/nltk_data...
       |   Package qc is already up-to-date!
       | Downloading package reuters to /home/me/nltk_data...
       |   Package reuters is already up-to-date!
       | Downloading package rte to /home/me/nltk_data...
       |   Package rte is already up-to-date!
       | Downloading package semcor to /home/me/nltk_data...
       |   Package semcor is already up-to-date!
       | Downloading package senseval to /home/me/nltk_data...
       |   Package senseval is already up-to-date!
       | Downloading package shakespeare to /home/me/nltk_data...
       |   Package shakespeare is already up-to-date!
       | Downloading package sinica_treebank to
       |     /home/me/nltk_data...
       |   Package sinica_treebank is already up-to-date!
       | Downloading package smultron to /home/me/nltk_data...
       |   Package smultron is already up-to-date!
       | Downloading package state_union to /home/me/nltk_data...
       |   Package state_union is already up-to-date!
       | Downloading package stopwords to /home/me/nltk_data...
       |   Package stopwords is already up-to-date!
       | Downloading package swadesh to /home/me/nltk_data...
       |   Package swadesh is already up-to-date!
       | Downloading package switchboard to /home/me/nltk_data...
       |   Package switchboard is already up-to-date!
       | Downloading package timit to /home/me/nltk_data...
       |   Package timit is already up-to-date!
       | Downloading package toolbox to /home/me/nltk_data...
       |   Package toolbox is already up-to-date!
       | Downloading package treebank to /home/me/nltk_data...
       |   Package treebank is already up-to-date!
       | Downloading package udhr to /home/me/nltk_data...
       |   Package udhr is already up-to-date!
       | Downloading package udhr2 to /home/me/nltk_data...
       |   Unzipping corpora/udhr2.zip.
       | Downloading package unicode_samples to
       |     /home/me/nltk_data...
       |   Package unicode_samples is already up-to-date!
       | Downloading package verbnet to /home/me/nltk_data...
       |   Package verbnet is already up-to-date!
       | Downloading package webtext to /home/me/nltk_data...
       |   Package webtext is already up-to-date!
       | Downloading package wordnet to /home/me/nltk_data...
       |   Package wordnet is already up-to-date!
       | Downloading package wordnet_ic to /home/me/nltk_data...
       |   Package wordnet_ic is already up-to-date!
       | Downloading package words to /home/me/nltk_data...
       |   Package words is already up-to-date!
       | Downloading package ycoe to /home/me/nltk_data...
       |   Package ycoe is already up-to-date!
       | Downloading package rslp to /home/me/nltk_data...
       |   Package rslp is already up-to-date!
       | Downloading package hmm_treebank_pos_tagger to
       |     /home/me/nltk_data...
       |   Package hmm_treebank_pos_tagger is already up-to-date!
       | Downloading package maxent_treebank_pos_tagger to
       |     /home/me/nltk_data...
       |   Unzipping taggers/maxent_treebank_pos_tagger.zip.
       | Downloading package universal_tagset to
       |     /home/me/nltk_data...
       |   Unzipping taggers/universal_tagset.zip.
       | Downloading package maxent_ne_chunker to
       |     /home/me/nltk_data...
       |   Unzipping chunkers/maxent_ne_chunker.zip.
       | Downloading package punkt to /home/me/nltk_data...
       |   Unzipping tokenizers/punkt.zip.
       | Downloading package book_grammars to
       |     /home/me/nltk_data...
       |   Package book_grammars is already up-to-date!
       | Downloading package sample_grammars to
       |     /home/me/nltk_data...
       |   Unzipping grammars/sample_grammars.zip.
       | Downloading package spanish_grammars to
       |     /home/me/nltk_data...
       |   Package spanish_grammars is already up-to-date!
       | Downloading package basque_grammars to
       |     /home/me/nltk_data...
       |   Package basque_grammars is already up-to-date!
       | Downloading package large_grammars to
       |     /home/me/nltk_data...
       |   Package large_grammars is already up-to-date!
       | Downloading package tagsets to /home/me/nltk_data...
       | 
     Done downloading collection all

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> q
Out[5]:
True

After downloading everything, we've gained access to a corpus of books to play with. One of them is Moby Dick by Herman Melville, under nltk.book.text1; another is Sense and Sensibility by Jane Austen, under nltk.book.text2.


In [1]:
from nltk.book import text1 as moby_dick
moby_dick


*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
Out[1]:
<Text: Moby Dick by Herman Melville 1851>

In [2]:
from nltk.book import text2 as sense_sensibility
sense_sensibility


Out[2]:
<Text: Sense and Sensibility by Jane Austen 1811>

Searching text

These included books are actually instances of Text, which is a class defined by NLTK that behaves like a very rich collection of strings. However, regular operations, like checking whether a word is in a text or slicing part of the text, are done with the same syntax as for strings.


In [3]:
type(sense_sensibility)


Out[3]:
nltk.text.Text

In [4]:
"love" in sense_sensibility


Out[4]:
True

In [5]:
sense_sensibility.index("love")


Out[5]:
1447

In [6]:
sense_sensibility[1447:1452]


Out[6]:
['love', 'for', 'all', 'her', 'three']

Notice that slicing a Text gives us words and punctuation symbols, or tokens, instead of characters.

There are many ways to examine the context of a text apart from simply reading it. A concordance view shows us every occurrence of a given word, together with some context. Here we look up the word "love" in our two books; unsurprisingly, there are way more matches in Sense and Sensibility than in Moby Dick.


In [7]:
sense_sensibility.concordance("love")


Building index...
Displaying 25 of 77 matches:
priety of going , and her own tender love for all her three children determine
es ." " I believe you are right , my love ; it will be better that there shoul
 . It implies everything amiable . I love him already ." " I think you will li
sentiment of approbation inferior to love ." " You may esteem him ." " I have 
n what it was to separate esteem and love ." Mrs . Dashwood now took pains to 
oner did she perceive any symptom of love in his behaviour to Elinor , than sh
 how shall we do without her ?" " My love , it will be scarcely a separation .
ise . Edward is very amiable , and I love him tenderly . But yet -- he is not 
ll never see a man whom I can really love . I require so much ! He must have a
ry possible charm ." " Remember , my love , that you are not seventeen . It is
f I do not now . When you tell me to love him as a brother , I shall no more s
hat Colonel Brandon was very much in love with Marianne Dashwood . She rather 
e were ever animated enough to be in love , must have long outlived every sens
hirty - five anything near enough to love , to make him a desirable companion 
roach would have been spared ." " My love ," said her mother , " you must not 
pect that the misery of disappointed love had already been known to him . This
 most melancholy order of disastrous love . CHAPTER 12 As Elinor and Marianne 
hen she considered what Marianne ' s love for him was , a quarrel seemed almos
ctory way ;-- but you , Elinor , who love to doubt where you can -- it will no
 man whom we have all such reason to love , and no reason in the world to thin
ded as he must be of your sister ' s love , should leave her , and leave her p
cannot think that . He must and does love her I am sure ." " But with a strang
 I believe not ," cried Elinor . " I love Willoughby , sincerely love him ; an
or . " I love Willoughby , sincerely love him ; and suspicion of his integrity
deed a man could not very well be in love with either of her daughters , witho

In [8]:
moby_dick.concordance("love")


Building index...
Displaying 24 of 24 matches:
 to bespeak a monument for her first love , who had been killed by a whale in 
erlasting itch for things remote . I love to sail forbidden seas , and land on
astic our stiff prejudices grow when love once comes to bend them . For now I 
ng . Now , it was plainly a labor of love for Captain Sleet to describe , as h
he whole , I greatly admire and even love the brave , the honest , and learned
to - night with hearts as light , To love , as gay and fleeting As bubbles tha
he fleece of celestial innocence and love ; and hence , by bringing together t
s this visible world seems formed in love , the invisible spheres were formed 
tism in them , still , while for the love of it they give chase to Moby Dick ,
own hearty good - will and brotherly love about it at all . As touching Slave 
stubborn , as malicious . He did not love Steelkilt , and Steelkilt knew it . 
 of sea - usages and the instinctive love of neatness in seamen ; some of whom
g to let that rascal beat ye ? Do ye love brandy ? A hogshead of brandy , then
h a sog ! such a sogger ! Don ' t ye love sperm ? There goes three thousand do
atever they may reveal of the divine love in the Son , the soft , curled , her
 come to deadly battle , and all for love . They fence with their long lower j
de overtakes the sated Turk ; then a love of ease and virtue supplants the lov
ove of ease and virtue supplants the love for maidens ; our Ottoman enters upo
go , the Virgin ! that ' s our first love ; we marry and think to be happy for
Tranquo , being gifted with a devout love for all matters of barbaric vertu , 
ght worship is defiance . To neither love nor reverence wilt thou be kind ; an
 is woe . Come in thy lowest form of love , and I will kneel and kiss thee ; b
es , yet full of the sweet things of love and gratitude . Come ! I feel proude
dihood of a Nantucketer ' s paternal love , had thus early sought to initiate 

Activity

What would you expect when searching for the word "monstrous" in these two books? Are you sure? Let's see!


In [ ]:
moby_dick.concordance("monstrous")

In [ ]:
sense_sensibility.concordance("monstrous")

Activity

The *NPS Chat Corpus*, under `nltk.book.text5`, is uncensored. Try searching for words like "lol".

A concordance permits us to see words in context. For example, we saw that "love" occurred surrounded by words such as "of", "to", and "in". What other words appear in a similar range of contexts? We can find out by using the similar() function.


In [9]:
moby_dick.similar("love")


Building word-context index...
man sea it ship ahab air blubber boats bone by captain chase death
fear gush hand head him hope land

In [10]:
sense_sensibility.similar("love")


Building word-context index...
affection heart mother see sister time town dear elinor it life
marianne me word bed do family head her him

Observe that we get different results for different texts. Austen uses this word quite differently from Melville; for her, "love" has connotations related to the family, and usually goes along with "him". The function common_contexts() allows us to examine just the contexts that are shared by two or more words, such as "love" with "him" or "her", by passing them as a list.


In [11]:
sense_sensibility.common_contexts(["love", "him"])


in_by of_in to_and to_but to_you

In [12]:
sense_sensibility.common_contexts(["love", "her"])


in_more to_and to_but to_to to_you

Taking the last example, this means that in the text the words "love" and "her" appear in the same specific surroundings. Each context is written as the word before and the word after a blank, so "to_and" means that both "to love and" and "to her and" occur somewhere in the text. The shared contexts are:

  • in ___ more
  • to ___ and
  • to ___ but
  • to ___ to
  • to ___ you

Notice that punctuation is ignored by this and other functions.

Dispersion plots

It is one thing to automatically detect that a particular word occurs in a text, and to display some words that appear in the same context. However, we can also determine the location of a word in the text: how many words from the beginning it appears. This positional information can be displayed using a dispersion plot. Each stripe represents an instance of a word, and each row represents the entire text.


In [15]:
# Due to an issue in NLTK, we need to use the IPython magic %pylab, nothing serious
%pylab inline
pylab.rcParams['figure.figsize'] = (12.0, 6.0)
from nltk.draw.dispersion import dispersion_plot


Populating the interactive namespace from numpy and matplotlib

In [16]:
dispersion_plot(moby_dick, ["monstrous", "love", "sail", "death", "dead"])


[Dispersion plot showing the positions of "monstrous", "love", "sail", "death", and "dead" across Moby Dick]

Activity

Get dispersion plots for words of your choice from *Sense and Sensibility*.

Counting words

One thing that comes out of the previous examples is that the two books use different sets of words, or vocabularies, and we can measure how different they are. Let's begin by finding out the length of a text from start to finish, in terms of the words and punctuation symbols that appear.


In [17]:
len(moby_dick)


Out[17]:
260819

In Python there is another data structure called the set, which is like a list with no duplicates. So in order to get the vocabulary used in a text, we need to remove the duplicate words.


In [20]:
len(set(moby_dick))


Out[20]:
19317

But that number includes numbers and punctuation symbols. Let's take a look at some of the elements. We can sort the words using the built-in function sorted().


In [31]:
sorted(set(moby_dick))[275:290]


Out[31]:
["?'--'",
 '?--',
 '?--"',
 "?--'",
 'A',
 'ABOUT',
 'ACCOUNT',
 'ADDITIONAL',
 'ADVANCING',
 'ADVENTURES',
 'AFFGHANISTAN',
 'AFRICA',
 'AFTER',
 'AGAINST',
 'AHAB']

So in order to calculate the number of different words, we must start at position 279. We discover the size of the vocabulary indirectly, by asking for the number of items in the set, and again we can use len() to obtain this number.


In [33]:
len(sorted(set(moby_dick))[279:])


Out[33]:
19038

Although it has 260,819 tokens, this book has only 19,038 distinct words, or word types. A word type is the form or spelling of the word independently of its specific occurrences in a text — that is, the word considered as a unique item of vocabulary. Our previous count of 19,317, which included punctuation symbols, is usually called a count of unique item types rather than word types.

Now, let's calculate a measure of the lexical richness, or lexical diversity, of the text: the average number of times that each word is used in the text. For this measure we will include punctuation symbols. The next example shows us that each word is used about 13 times on average in Moby Dick.


In [34]:
len(moby_dick) / len(set(moby_dick))


Out[34]:
13.502044830977896

Next, let's focus on particular words. We can count how often a word occurs in a text, and compute what percentage of the text is taken up by a specific word:


In [35]:
moby_dick.count("death")


Out[35]:
71

In [38]:
100 * moby_dick.count('the') / len(moby_dick)


Out[38]:
5.260736372733581

Activity

Create two functions: 1) `lexical_richness(text)` receives a list of words or a `Text` and returns its lexical richness; and 2) `word_percentage(text, word)` receives a list of words or a `Text` and a word, and returns the percentage of the text taken up by that word.

For example, `lexical_richness(moby_dick)` should return `13.502044830977896`; and `word_percentage(moby_dick, "the")` should return `5.260736372733581`.

Use these new functions to calculate the lexical richness of `nltk.book.text3`, `nltk.book.text4`, and `nltk.book.text5`, as well as the percentage of the following words: a, the, this, those, these, I.

Frequency Distributions

The preceding percentage measure is nice for comparing words between different texts, but it doesn't help us identify the words of a text that are most informative about its topic and genre. Imagine how you might go about finding the 50 most frequent words of a book. One method would be to keep a tally for each vocabulary item. The tally would need thousands of rows, and it would be an exceedingly laborious process — so laborious that we would rather assign the task to a machine. A tally like that is known as a frequency distribution, and it tells us the frequency of each vocabulary item in the text. It is a distribution because it tells us how the total number of word tokens in the text are distributed across the vocabulary items. Since we often need frequency distributions in language processing, NLTK provides built-in support for them. Let's use a FreqDist to find the most frequent words of Moby Dick.


In [39]:
from nltk import FreqDist

In [60]:
moby_dick_fdist = FreqDist(moby_dick)
moby_dick_fdist


Out[60]:
<FreqDist with 19317 samples and 260819 outcomes>

In [66]:
moby_dick_fdist["whale"]


Out[66]:
906

In [67]:
moby_dick_fdist.freq("whale")  # Frequency


Out[67]:
0.003473673313677301

If we want to get the 50 most common words, we need to sort moby_dick_fdist, which is like a dictionary, by value in descending order: the highest counts first and the lowest last. In Python there is a trick to sort dictionary keys by their values: use the key parameter of sorted().


In [57]:
x = {1: 2, 3: 4, 4:3, 2:1, 0:0}
sorted_x = sorted(x, key=x.get)
sorted_x


Out[57]:
[0, 2, 1, 4, 3]

And now we just reverse the resulting list by invoking its reverse() method.


In [58]:
sorted_x.reverse()
sorted_x


Out[58]:
[3, 4, 1, 2, 0]

Let's put it all together to get the 50 most common words in Moby Dick.


In [61]:
sorted_moby_dick = sorted(moby_dick_fdist, key=moby_dick_fdist.get)
sorted_moby_dick.reverse()
sorted_moby_dick[:50]


Out[61]:
[',',
 'the',
 '.',
 'of',
 'and',
 'a',
 'to',
 ';',
 'in',
 'that',
 "'",
 '-',
 'his',
 'it',
 'I',
 's',
 'is',
 'he',
 'with',
 'was',
 'as',
 '"',
 'all',
 'for',
 'this',
 '!',
 'at',
 'by',
 'but',
 'not',
 '--',
 'him',
 'from',
 'be',
 'on',
 'so',
 'whale',
 'one',
 'you',
 'had',
 'have',
 'there',
 'But',
 'or',
 'were',
 'now',
 'which',
 '?',
 'me',
 'like']

Activity

Create a function, `most_common(text, n)`, that receives a list of words or a `Text` and a number `n`, and returns the `n` most common words.

For example, `most_common(moby_dick, 5)` should return the 5 most common words: `[',', 'the', '.', 'of', 'and']`.

Do any words produced in the last example help us grasp the topic or genre of this text? Only one word, whale, is slightly informative! It occurs over 900 times. The rest of the words tell us nothing about the text; they're just English "plumbing" or stop words. What proportion of the text is taken up with such words? We can generate a cumulative frequency plot for these words, which is pretty similar to the histogram we've already seen in past classes.


In [64]:
moby_dick_fdist.plot(50, cumulative=True)


[Cumulative frequency plot of the 50 most common words in Moby Dick]

These 50 words account for nearly half the book!
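If you prefer an exact number to reading it off the plot, here is a minimal sketch that computes that proportion directly, reusing moby_dick_fdist and sorted_moby_dick from the cells above:

# What fraction of all tokens do the 50 most common words account for?
top_50 = sorted_moby_dick[:50]
coverage = sum(moby_dick_fdist[word] for word in top_50) / len(moby_dick)
coverage  # should come out close to 0.5, matching the "nearly half" claim above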

If the frequent words don't help us, how about the words that occur once only, the so-called hapaxes? View them by calling moby_dick_fdist.hapaxes(). This list contains lexicographer, cetological, contraband, expostulations, and about 9,000 others. It seems that there are too many rare words, and without seeing the context we probably can't guess what half of the hapaxes mean in any case! Since neither frequent nor infrequent words help, we need to try something else... in the next class!
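Before moving on, a minimal sketch of how to peek at those hapaxes, reusing moby_dick_fdist from above (which words you see depends on where you look in the alphabetical list):

# Words that occur exactly once in Moby Dick
hapaxes = moby_dick_fdist.hapaxes()
len(hapaxes)          # about 9,000 of them, as mentioned above
sorted(hapaxes)[:10]  # a small alphabetical sample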

Activity

Using Python list comprehensions we can get words from a vocabulary that meet certain conditions. For example, `[w for w in set(sense_sensibility) if len(w) > 15]` returns a list of unique words that are longer than 15 characters. What if we also wanted words that are longer than 10 characters and occur more than 5 times in the text?

Tokenization

But before we finish: a few words on tokenization. In the previous examples from the NLTK corpora, the books were already in Text format, but how can we build those lists of words from a text? That's tokenization, which is basically the process by which you split a text into parts. What you use to take the text apart is up to you; it can be line breaks, words, commas, etc. In text processing, splitting by words is so common that NLTK includes that tokenizer by default.

For example, let's tokenize the book Crime and Punishment by Fyodor Dostoyevsky. We first load the content from the file.


In [73]:
crime_and_punishment_txt = open("data/crime_and_punishment.txt").read()

And then we tokenize it. It's that simple.


In [76]:
from nltk import tokenize
word_tokenizer = tokenize.WordPunctTokenizer()  # We need to create an instance (named so it doesn't shadow nltk's word_tokenize)
word_tokenizer.tokenize(crime_and_punishment_txt)[:13]


Out[76]:
['\ufeff',
 'The',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'Crime',
 'and',
 'Punishment',
 ',',
 'by',
 'Fyodor',
 'Dostoevsky']

The last step is to convert this list into a Text object.


In [85]:
from nltk import Text
Text(word_tokenizer.tokenize(crime_and_punishment_txt))


Out[85]:
<Text:  The Project Gutenberg EBook of Crime and...>

And there are many more tokenizers, so we can use a sentence tokenizer, like PunktSentenceTokenizer, and calculate the same measures for sentences instead of words. Some tokenizers, like word and sentence tokenizers, are so common that NLTK has handy functions ready for them: nltk.tokenize.word_tokenize() and nltk.tokenize.sent_tokenize().
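For instance, a minimal sketch using those two helpers on the same file (assuming crime_and_punishment_txt is still in memory from the cells above; both rely on the punkt models downloaded earlier):

from nltk.tokenize import word_tokenize, sent_tokenize

words = word_tokenize(crime_and_punishment_txt)      # a flat list of word tokens
sentences = sent_tokenize(crime_and_punishment_txt)  # a list of sentence strings
words[:10], len(sentences)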

Activity

Spend some time playing around with the other tokenizers.

TextBlob

In the remote case that those tokenizers seemed difficult to you, let me introduce you to TextBlob. From its website, "TextBlob aims to provide access to common text-processing operations through a familiar interface." We will see more of TextBlob in future classes, but for now, just a small preview to show how easy it is to tokenize.


In [79]:
from textblob import TextBlob
textblob = TextBlob(crime_and_punishment_txt)

In [81]:
textblob.words[:10]


Out[81]:
WordList(['\ufeffThe', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', 'by', 'Fyodor'])

In [84]:
textblob.sentences[:5]


Out[84]:
[Sentence("The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky
 
 This eBook is for the use of anyone anywhere at no cost and with
 almost no restrictions whatsoever."),
 Sentence("You may copy it, give it away or
 re-use it under the terms of the Project Gutenberg License included
 with this eBook or online at www.gutenberg.org
 
 
 Title: Crime and Punishment
 
 Author: Fyodor Dostoevsky
 
 Release Date: March 28, 2006 [EBook #2554]
 [Last updated: November 15, 2011]
 
 Language: English
 
 
 *** START OF THIS PROJECT GUTENBERG EBOOK CRIME AND PUNISHMENT ***
 
 
 
 
 Produced by John Bickers; and Dagny
 
 
 
 
 
 CRIME AND PUNISHMENT
 
 By Fyodor Dostoevsky
 
 
 
 Translated By Constance Garnett
 
 
 
 
 TRANSLATOR'S PREFACE
 
 A few words about Dostoevsky himself may help the English reader to
 understand his work."),
 Sentence("Dostoevsky was the son of a doctor."),
 Sentence("His parents were very hard-working
 and deeply religious people, but so poor that they lived with their five
 children in only two rooms."),
 Sentence("The father and mother spent their evenings
 in reading aloud to their children, generally from books of a serious
 character.")]

For the next class