In this class you are expected to learn:
Extracted from Tooling up for Digital Humanities: The Text Deluge (highly recommended reading!):
According to one estimate, human beings created some 150 exabytes (billion gigabytes) of data in 2005 alone. This year, we will create approximately 1,200 exabytes. The Library of Congress announced its decision to archive Twitter, which includes the addition of some 50 million tweets per day. A search in Google Books for the phrase “slave trade” in July 2010, for example, returned the following: “About 1,600,000 results (0.21 seconds).” Scholars once accustomed to studying a handful of letters or a couple hundred diary entries are now faced with massive amounts of data that cannot possibly be analyzed in traditional ways.
The trend towards an increasing deluge of information raises the question posed by Gregory Crane in 2006: “What do you do with a million books?” “My answer to that question,” wrote Tanya Clement and others in a 2008 article, “is that whatever you do, you don't read them, because you can’t.”
And that's the key to text analysis today: not reading, which is still kind of ironic. But then, if we can't read a million books, or blogs, or a trillion tweets, or a hundred thousand margin notes, how are we supposed to analyze them? The answer is Natural Language Processing, or NLP.
There are plenty of things that NLP can do for us; let's see some of them:
For most of them there is a package in Python, and most of the time that package is the Natural Language Toolkit, usually abbreviated as NLTK.
The Natural Language Toolkit is a huge package that covers almost every text processing need you might have. It was designed with four primary goals in mind:
The list of features is overwhelming. Unfortunately, we'll only see a fraction of them.
Language processing task | NLTK modules | Functionality |
---|---|---|
Accessing corpora | `nltk.corpus` | standardized interfaces to corpora and lexicons |
String processing | `nltk.tokenize`, `nltk.stem` | tokenizers, sentence tokenizers, stemmers |
Collocation discovery | `nltk.collocations` | t-test, chi-squared, point-wise mutual information |
Part-of-speech tagging | `nltk.tag` | n-gram, backoff, Brill, HMM, TnT |
Classification | `nltk.classify`, `nltk.cluster` | decision tree, maximum entropy, naive Bayes, EM, k-means |
Chunking | `nltk.chunk` | regular expression, n-gram, named-entity |
Parsing | `nltk.parse` | chart, feature-based, unification, probabilistic, dependency |
Semantic interpretation | `nltk.sem`, `nltk.inference` | lambda calculus, first-order logic, model checking |
Evaluation metrics | `nltk.metrics` | precision, recall, agreement coefficients |
Probability and estimation | `nltk.probability` | frequency distributions, smoothed probability distributions |
Applications | `nltk.app`, `nltk.chat` | graphical concordancer, parsers, WordNet browser, chatbots |
Linguistic fieldwork | `nltk.toolbox` | manipulate data in SIL Toolbox format |
If this is the first time you've used NLTK (and I'm pretty sure it is), you need to download some files that NLTK needs: books, corpora, data for the taggers, dictionaries, etc. NLTK brings its own downloader; all you have to do is import the module and invoke `download()`. The downloader will then ask you what you want to do, so you type `d` for download, and then `all` to download everything, everything! It may take some time, but you'll only do this once.
In [5]:
import nltk
nltk.download()
Out[5]:
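If you'd rather skip the interactive menu, `nltk.download()` also accepts the identifier of what to fetch as an argument. A minimal sketch, assuming you want the `book` collection (the data used in these classes); passing `"all"` works too:
In [ ]:
import nltk

# Non-interactive download: fetch the "book" collection (texts, corpora and
# taggers used in these classes) without going through the menu
nltk.download("book")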
After downloading everything, we've gained access to a corpus of books to play with. One of them is *Moby Dick* by Herman Melville, under `nltk.book.text1`; another is *Sense and Sensibility* by Jane Austen, under `nltk.book.text2`.
In [1]:
from nltk.book import text1 as moby_dick
moby_dick
Out[1]:
In [2]:
from nltk.book import text2 as sense_sensibility
sense_sensibility
Out[2]:
These included books are actually instances of `Text`, which is a class defined by NLTK that behaves like a very rich collection of strings. However, regular operations like checking whether a word is in a text, or slicing part of the text, are done the same way as with strings.
In [3]:
type(sense_sensibility)
Out[3]:
In [4]:
"love" in sense_sensibility
Out[4]:
In [5]:
sense_sensibility.index("love")
Out[5]:
In [6]:
sense_sensibility[1447:1452]
Out[6]:
Notice that slicing a `Text` gives us words and punctuation symbols, or tokens, instead of characters.
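If the difference isn't obvious, compare it with slicing a plain string. The string below is just a throwaway example, not part of any corpus:
In [ ]:
# Slicing a plain string returns individual characters...
print("sense and sensibility"[0:5])
# ...while slicing a Text returns whole tokens (words and punctuation)
print(sense_sensibility[1447:1452])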
There are many ways to examine the context of a text apart from simply reading it. A concordance view shows us every occurrence of a given word, together with some context. Here we look up the word "love" in our two books; unsurprisingly, there are way more matches in Sense and Sensibility than in Moby Dick.
In [7]:
sense_sensibility.concordance("love")
In [8]:
moby_dick.concordance("love")
Activity
What would you expect when searching for the word "monstrous" in these two books? Are you sure? Let's see!
In [ ]:
moby_dick.concordance("monstrous")
In [ ]:
sense_sensibility.concordance("monstrous")
Activity
The *NPS Chat Corpus*, under `nltk.book.text5`, is uncensored. Try searching for words like "lol".
A concordance permits us to see words in context. For example, we saw that "love" occurred surrounded by words such as "of" and "and". What other words appear in a similar range of contexts? We can find out by using the `similar()` function.
In [9]:
moby_dick.similar("love")
In [10]:
sense_sensibility.similar("love")
Observe that we get different results for different texts. Austen uses this word quite differently from Melville; for her, "love" has connotations related to the family, and usually goes along with "him". The function `common_contexts()` allows us to examine just the contexts that are shared by two or more words, such as "love" with "him" or "her", by passing them as a list.
In [11]:
sense_sensibility.common_contexts(["love", "him"])
In [12]:
sense_sensibility.common_contexts(["love", "her"])
This means that in the text, the words "love" and "him" appear together within those specific surroundings:
Notice that punctuation is ignored by this and other functions.
It is one thing to automatically detect that a particular word occurs in a text, and to display some words that appear in the same context. However, we can also determine the location of a word in the text: how many words from the beginning it appears. This positional information can be displayed using a dispersion plot. Each stripe represents an instance of a word, and each row represents the entire text.
In [15]:
# Due to an issue in NLTK, we need to use the IPython magic %pylab, nothing serious
%pylab inline
pylab.rcParams['figure.figsize'] = (12.0, 6.0)
from nltk.draw.dispersion import dispersion_plot
In [16]:
dispersion_plot(moby_dick, ["monstrous", "love", "sail", "death", "dead"])
Activity
Get dispersion plots for words of your choice from *Sense and Sensibility*.
One thing that comes out of the previous examples is that the two books use different sets of words, or vocabularies, and we can see how different they are. Let's begin by finding out the length of a text from start to finish, in terms of the words and punctuation symbols that appear.
In [17]:
len(moby_dick)
Out[17]:
Python has another data structure called the `set`, which is like a list with no duplicates. So, in order to get the vocabulary used in a text, we need to remove the duplicate words.
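Before applying it to a whole novel, here is what `set` does on a tiny made-up list of words:
In [ ]:
# Duplicates disappear; note that a set has no particular order
words = ["to", "be", "or", "not", "to", "be"]
set(words)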
In [20]:
len(set(moby_dick))
Out[20]:
But that number includes numbers and punctuation symbols. Let's take a look at some of the elements. We can sort the words by using the built-in function `sorted()`.
In [31]:
sorted(set(moby_dick))[275:290]
Out[31]:
So, in order to calculate the number of different words, we must start at position 279. We discover the size of the vocabulary indirectly, by asking for the number of items in the set, and again we can use `len()` to obtain this number.
In [33]:
len(sorted(set(moby_dick))[279:])
Out[33]:
Although it has 260,819 tokens, this book has only 19,038 distinct words, or word types. A word type is the form or spelling of a word independently of its specific occurrences in a text; that is, the word considered as a unique item of vocabulary. Our previous count of 19,317, which included punctuation symbols, is usually called a count of unique item types rather than word types.
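Slicing at a fixed position works for this particular book, but it is fragile. As an alternative sketch (not the approach used above, and it applies a slightly different criterion for what counts as a word, so the number may differ a bit), we could keep only the purely alphabetic tokens:
In [ ]:
# Keep only tokens made entirely of letters, dropping numbers and punctuation
word_types = set(w for w in moby_dick if w.isalpha())
len(word_types)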
Now, let's calculate a measure of the lexical richness, or lexical diversity, of the text: the average number of times each word is used in the text. For this measure we will include punctuation symbols. The next example shows that each word is used about 13 times on average in *Moby Dick*.
In [34]:
len(moby_dick) / len(set(moby_dick))
Out[34]:
Next, let's focus on particular words. We can count how often a word occurs in a text, and compute what percentage of the text is taken up by a specific word:
In [35]:
moby_dick.count("death")
Out[35]:
In [38]:
100 * moby_dick.count('the') / len(moby_dick)
Out[38]:
Activity
Create two functions: 1) `lexical_richness(text)` receives a list of words or a `Text` and returns its lexical richness; and 2) `word_percentage(text, word)` receives a list of words or a `Text` and a word, and returns the percentage of the text taken up by that word.
For example, `lexical_richness(moby_dick)` should return `13.502044830977896`; and `word_percentage(moby_dick, "the")` should return `5.260736372733581`.
Use these new functions to calculate lexical richness of `nltk.book.text3`, `nltk.book.text4`, and `nltk.book.text5`, as well as the percentage of the following words: a, the, this, those, these, I.
The preceding percentage measure is nice for comparing words between different texts, but it doesn't help us identify the words of a text that are most informative about its topic and genre. Imagine how you might go about finding the 50 most frequent words of a book. One method would be to keep a tally for each vocabulary item. The tally would need thousands of rows, and it would be an exceedingly laborious process, so laborious that we would rather assign the task to a machine.
A tally like that is known as a frequency distribution, and it tells us the frequency of each vocabulary item in the text. It is a distribution because it tells us how the total number of word tokens in the text is distributed across the vocabulary items. Since we often need frequency distributions in language processing, NLTK provides built-in support for them. Let's use a `FreqDist` to find the most frequent words of *Moby Dick*.
In [39]:
from nltk import FreqDist
In [60]:
moby_dick_fdist = FreqDist(moby_dick)
moby_dick_fdist
Out[60]:
In [66]:
moby_dick_fdist["whale"]
Out[66]:
In [67]:
moby_dick_fdist.freq("whale")  # Relative frequency: count divided by the total number of tokens
Out[67]:
If we want to get the 50 most common words, we need to sort `moby_dick_fdist`, which is like a dictionary, by value in descending order: first the higher counts, then the lower ones. In Python there is a trick to sort dictionary keys by their values: use the `key` parameter of `sorted()`.
In [57]:
x = {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}
sorted_x = sorted(x, key=x.get)
sorted_x
Out[57]:
And now we just reverse the resulting list by invoking its `reverse()` method.
In [58]:
sorted_x.reverse()
sorted_x
Out[58]:
Let's put it all together to get the 50 most common words in *Moby Dick*.
In [61]:
sorted_moby_dick = sorted(moby_dick_fdist, key=moby_dick_fdist.get)
sorted_moby_dick.reverse()
sorted_moby_dick[:50]
Out[61]:
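By the way, `sorted()` can produce the descending order in a single step through its `reverse` parameter, so the intermediate call to `reverse()` is optional; same result, just a shortcut:
In [ ]:
# Sort the keys by their frequency, highest first, and keep the top 50
sorted(moby_dick_fdist, key=moby_dick_fdist.get, reverse=True)[:50]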
Activity
Create a function, `most_common(text, n)`, that receives a list of words or a `Text` and a number `n`, and returns the `n` most common words.
For example, `most_common(moby_dick, 5)` should return the 5 most common words: `[',', 'the', '.', 'of', 'and']`.
Do any words produced in the last example help us grasp the topic or genre of this text? Only one word, whale, is slightly informative! It occurs over 900 times. The rest of the words tell us nothing about the text; they're just English "plumbing" or stop words. What proportion of the text is taken up with such words? We can generate a cumulative frequency plot for these words, which is pretty similar to the histogram we've already seen in past classes.
In [64]:
moby_dick_fdist.plot(50, cumulative=True)
These 50 words account for nearly half the book!
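We can check that claim by adding up the counts of those 50 words and dividing by the total number of tokens, reusing the names defined above:
In [ ]:
# Proportion of all tokens covered by the 50 most common words
sum(moby_dick_fdist[w] for w in sorted_moby_dick[:50]) / len(moby_dick)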
If the frequent words don't help us, how about the words that occur only once, the so-called hapaxes? View them by calling `moby_dick_fdist.hapaxes()`, as in the quick peek below. This list contains lexicographer, cetological, contraband, expostulations, and about 9,000 others. It seems that there are too many rare words, and without seeing the context we probably can't guess what half of the hapaxes mean in any case! Since neither frequent nor infrequent words help, we need to try something else... in the next class!
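Here is that quick peek at the hapaxes; `hapaxes()` returns a plain list, and the exact words you get back depend, of course, on the text:
In [ ]:
hapaxes = moby_dick_fdist.hapaxes()
# How many words occur exactly once, and a small sample of them
len(hapaxes), hapaxes[:10]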
Activity
Using Python list comprehensions we can get the words from a vocabulary that meet certain conditions. For example, `[w for w in set(sense_sensibility) if len(w) > 15]` returns a list of unique words that are longer than 15 characters. What if we also wanted words that are longer than 10 characters and appear more than 5 times in the text?
But before we finish, a few words on tokenization. In the previous examples from the NLTK corpora, the books already came as `Text` objects, but how can we build such a list of words from a raw text? That's tokenization, which is basically the process of splitting a text into parts. What you use to take the text apart is up to you; it can be line breaks, words, commas, etc. In text processing, splitting by words is so common that NLTK includes that tokenizer by default.
For example, let's tokenize the book Crime and Punishment by Fyodor Dostoyevsky. We first load the content from the file.
In [73]:
# Read the whole plain-text file into a single string
with open("data/crime_and_punishment.txt") as f:
    crime_and_punishment_txt = f.read()
And then we tokenize it, that simple.
In [76]:
from nltk import tokenize
word_tokenize = tokenize.WordPunctTokenizer() # We need to create an instance
word_tokenize.tokenize(crime_and_punishment_txt)[:13]
Out[76]:
The last step is to convert this list into a `Text` object.
In [85]:
from nltk import Text
Text(word_tokenize.tokenize(crime_and_punishment_txt))
Out[85]:
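The resulting object supports the same methods we used on the corpus texts. For instance (the search word below is just an illustrative choice; if it never occurred, `concordance()` would simply report no matches):
In [ ]:
crime_and_punishment = Text(word_tokenize.tokenize(crime_and_punishment_txt))
# The same concordance view we used on the corpus texts now works here
crime_and_punishment.concordance("Petersburg")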
And there are many more tokenizers, so we could use a sentence tokenizer, like `PunktSentenceTokenizer`, and calculate the same measures for sentences instead of words. Some tokenizers, like the word and sentence tokenizers, are so common that NLTK has handy functions ready for them: `nltk.tokenize.word_tokenize()` and `nltk.tokenize.sent_tokenize()`.
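Here is a minimal sketch of those two helpers applied to the same raw text; they rely on pre-trained models (the `punkt` data), which were included in the full download we did at the beginning:
In [ ]:
from nltk import tokenize

# sent_tokenize splits the raw string into sentences...
sentences = tokenize.sent_tokenize(crime_and_punishment_txt)
# ...and word_tokenize splits a string into word and punctuation tokens
tokenize.word_tokenize(sentences[0])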
Activity
Spend some time playing around with the other tokenizers.
In the remote case that those tokenizers seemed difficult to you, let me introduce you to TextBlob. From its website: "TextBlob aims to provide access to common text-processing operations through a familiar interface." We will see more of TextBlob in future classes, but for now, here is just a small preview to show how easy it is to tokenize.
In [79]:
from textblob import TextBlob
textblob = TextBlob(crime_and_punishment_txt)
In [81]:
textblob.words[:10]
Out[81]:
In [84]:
textblob.sentences[:5]
Out[84]: