[Data, the Humanist's New Best Friend](index.ipynb)
Class 12

In this class you are expected to learn:

Corpora
Conditional Frequency Distributions
Sources of data
Language detection
Machine translation

*Morfeo, always kidding*

Corpora

NLTK is distributed with corpora that can be accessed using nltk.corpus package. First we import the Brown Corpus, the oldest million-word, part-of-speech tagged (stuff like subject, verb, etc.) electronic corpus of English. Each of the sections represents a different genre, or category of text. A list of these categories can be accessed using categories(), and a list of the files in each section can be accessed using fileids().



In [1]:

    
from nltk.corpus import brown



In [94]:

    
brown.fileids()[:10]









    Out[94]:





['ca01',
 'ca02',
 'ca03',
 'ca04',
 'ca05',
 'ca06',
 'ca07',
 'ca08',
 'ca09',
 'ca10']



In [3]:

    
brown.categories()









    Out[3]:





['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']



In [95]:

    
brown.fileids(['adventure', 'romance'])[:10]









    Out[95]:





['cn01',
 'cn02',
 'cn03',
 'cn04',
 'cn05',
 'cn06',
 'cn07',
 'cn08',
 'cn09',
 'cn10']



In [6]:

    
brown.categories(['ca01', 'cp28'])









    Out[6]:





['news', 'romance']

The methods words() and sents() provide access to the corpus as a list of words, or a list of sentences, respectively.



In [9]:

    
brown.words('ca01')









    Out[9]:





['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]



In [10]:

    
brown.sents('ca01')









    Out[10]:





[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]



In [11]:

    
brown.sents('ca01')[0]









    Out[11]:





['The',
 'Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'an',
 'investigation',
 'of',
 "Atlanta's",
 'recent',
 'primary',
 'election',
 'produced',
 '``',
 'no',
 'evidence',
 "''",
 'that',
 'any',
 'irregularities',
 'took',
 'place',
 '.']



In [12]:

    
brown.sents(brown.fileids(['adventure']))









    Out[12]:





[['Dan', 'Morgan', 'told', 'himself', 'he', 'would', 'forget', 'Ann', 'Turner', '.'], ['He', 'was', 'well', 'rid', 'of', 'her', '.'], ...]

Activity

Write a function, `word_freq()`, that takes a word and the name of a section of the Brown Corpus as arguments, and computes the frequency of the word in that section of the corpus.

But there are many many more corpora in NLTK, take a look to the complete list. Another interesting corpus is the Project Gutenberg, although NLTK does not include the whole collection of about 25,000 books.



In [97]:

    
from nltk import Text
from nltk.corpus import gutenberg
Text(gutenberg.raw('melville-moby_dick.txt'))









    Out[97]:





<Text: [ M o b y   D i...>

Activity

Write a program to create a table of word frequencies by genre. Choose your own words and try to find words whose presence (or absence) is typical of a genre. Discuss your findings with your partner.

Conditional Frequency Distributions

The adventage of having your data as a corpus is that you can generalize the idea of frequency distributions. When the texts of a corpus are divided into several categories, by genre, topic, author, etc, we can maintain separate frequency distributions for each category. This will allow us to study systematic differences between the categories. We achieve this using NLTK's ConditionalFreqDist data type. A conditional frequency distribution is a collection of frequency distributions, each one for a different "condition". The condition will often be the category of the text.

A frequency distribution counts observable events, such as the appearance of words in a text. A conditional frequency distribution needs to pair each event with a condition. So instead of processing a sequence of words, we have to process a sequence of pairs.

text = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...]

Each pair has the form (condition, event). If we were processing the entire Brown Corpus by genre there would be 15 conditions (one per genre), and 1,161,192 events (one per word).

So ConditionalFreqDist() takes a list of pairs.



In [68]:

    
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
          (genre, word)
          for genre in brown.categories()
          for word in brown.words(categories=genre))

Let's break this down, and look at just two genres, news and romance. For each genre [2], we loop over every word in the genre [3], producing pairs consisting of the genre and the word [1]:



In [69]:

    
genre_word = [(genre, word)  # [1]
              for genre in ['news', 'romance']  # [2]
              for word in brown.words(categories=genre)]  # [3]
len(genre_word)









    Out[69]:





170576

So, as we can see below, pairs at the beginning of the list genre_word will be of the form ('news', word), while those at the end will be of the form ('romance', word).



In [70]:

    
genre_word[:4]









    Out[70]:





[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')]



In [71]:

    
genre_word[-4:]









    Out[71]:





[('romance', 'afraid'),
 ('romance', 'not'),
 ('romance', "''"),
 ('romance', '.')]

We can now use this list of pairs to create a ConditionalFreqDist, and save it in a variable cfd. As usual, we can type the name of the variable to inspect it [1], and verify it has two conditions [2]:



In [74]:

    
cfd = nltk.ConditionalFreqDist(genre_word)
cfd









    Out[74]:





<ConditionalFreqDist with 2 conditions>



In [75]:

    
cfd.conditions()









    Out[75]:





['news', 'romance']

Let's access the two conditions, and satisfy ourselves that each is just a frequency distribution.



In [76]:

    
cfd['news']









    Out[76]:





<FreqDist with 14394 samples and 100554 outcomes>



In [77]:

    
cfd['romance']









    Out[77]:





<FreqDist with 8452 samples and 70022 outcomes>



In [93]:

    
list(cfd['romance'])[:5]









    Out[93]:





[',', '.', 'the', 'and', 'to']



In [79]:

    
cfd['romance']['could']









    Out[79]:





193

As a cheat sheet, here are some of the most commonly used methods. We haven't seen yet plot() or tabulate(), though.

Example	Description
`cfdist = ConditionalFreqDist(pairs)`	create a conditional frequency distribution from a list of pairs
`cfdist.conditions()`	alphabetically sorted list of conditions
`cfdist[condition]`	the frequency distribution for this condition
`cfdist[condition][sample]`	frequency for the given sample for this condition
`cfdist.tabulate()`	tabulate the conditional frequency distribution
`cfdist.tabulate(samples, conditions)`	tabulation limited to the specified samples and conditions
`cfdist.plot()`	graphical plot of the conditional frequency distribution
`cfdist.plot(samples, conditions)`	graphical plot limited to the specified samples and conditions
`cfdist1 < cfdist2`	test if samples in `cfdist1` occur less frequently than in `cfdist2`

Activity

Zipf's Law: Let $f(w)$ be the frequency of a word $w$ in free text. Suppose that all the words of a text are ranked according to their frequency, with the most frequent word first. Zipf's law states that the frequency of a word type is inversely proportional to its rank (i.e. $f × r = k$, for some constant $k$). For example, the 50th most common word type should occur three times as frequently as the 150th most common word type.

Write a function to process a large text and plot word frequency against word rank using regular `matplotlib` plot. Do you confirm Zipf's law? (*Hint*: it helps to use a logarithmic scale). What is going on at the extreme ends of the plotted line?

Generate random text, e.g., using `random.choice("abcdefg ")`, taking care to include the space character. You will need to import random first. Use the string concatenation operator to accumulate characters into a (very) long string. Then tokenize this string, and generate the Zipf plot as before, and compare the two plots. What do you make of Zipf's Law in the light of this?

Loading your own Corpus

So if a corpus is so useful, why not create your own? How hard could it be? And the answer is... easy! As (almost) everything else in Python!

If you have your own collection of text files that you would like to access using the above methods, you can easily load them with the help of NLTK's PlaintextCorpusReader. Check the path of your files on your computer. Let's say that we have all we need at /usr/share/dict. Whatever the location, set this to be the value of corpus_root, the first parameter for PlaintextCorpusReader. The second parameter can be a list of fileids, like ['a.txt', 'test/b.txt'], or a pattern that matches all fileids, like '[abc]/.*\.txt' (you don't know yet about regular expressions, but give it a chance).



In [81]:

    
from nltk.corpus import PlaintextCorpusReader
corpus_root = '/usr/share/dict'
wordlists = PlaintextCorpusReader(corpus_root, '.*')
wordlists.fileids()









    Out[81]:





['README.select-wordlist',
 'american-english',
 'british-english',
 'cracklib-small',
 'spanish',
 'words',
 'words.pre-dictionaries-common']



In [83]:

    
wordlists.words('words')









    Out[83]:





['A', 'A', "'", 's', 'AA', "'", 's', 'AB', "'", 's', ...]

It was easy, wasn't it?

Sources of data

It's really nice that NLTK has all those books and corpora ready to use. But the reality is that very often the data you need to analyze is on the Internet, in a blog, a newspaper or it comes from Twitter or Facebook. So let's see some ways to scrape content like that.

The most basic library in the Standard Python Library to fetch content from the Internet is urllib. There is a urllib2, and even a urllib3, fortunately in Python 3 eveything is under urllib. However, in the past years, more powerful tools have shown up. Probably the best is requests, a package with the moto HTTP for Humans. Well, there is no such a thing, but still is really good and intuitive.

The most basic example is fetching plain text content.



In [63]:

    
from urllib.request import urlopen
url = "http://www.gutenberg.org/files/2554/2554.txt"
html = urlopen(url).read()
html[:75]









    Out[63]:





b'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'

We can also fetch content of a website like Globe and Mail.



In [40]:

    
import nltk
from urllib.request import urlopen
html = urlopen("http://www.theglobeandmail.com/").read()
html[:300]









    Out[40]:





b'<!DOCTYPE html>\n<!--[if lt IE 7]><html lang="en-ca" class="ie6" xmlns:fb="http://www.facebook.com/2008/fbml" ><![endif]-->\n<!--[if IE 7]><html lang="en-ca" class="ie7" xmlns:fb="http://www.facebook.com/2008/fbml" ><![endif]-->\n<!--[if IE 8]><html lang="en-ca" class="ie8" xmlns:fb="http://www.faceboo'

We can filter out only the text. To do so, we remove all HTML tags by using BeautifulSoup, which is not necessary to master unless you want to parse HTML, which is not the case. Almost all we need to know so far about BeautifulSoup is how remove HTML tags, and that's done by calling the method get_text() after creating an instance of BeautifulSoup. A worthwhile reading is Intro to Beautiful Soup, from The Programming Historian.



In [42]:

    
from bs4 import BeautifulSoup
globe_mail = BeautifulSoup(html).get_text()
print(globe_mail[:30])









    




Home - The Globe and Mail

And that's pretty much it. From here we can tokenize in words, sentences, or whatever we need to do.



In [61]:

    
print(nltk.sent_tokenize(globe_mail)[19])









    



Quebec Decides


Did Quebec vote for stability, or change?

Usually news sites and blogs have what is called sindication, that allows to access the content in cleaner way, avoiding the pain of dealing with HTML as much as possible. This is called a RSS feed. With the help of a third-party Python library called the Universal Feed Parser we can access the content of a blog, as shown below:



In [64]:

    
import feedparser
llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
llog['feed']['title']









    Out[64]:





'Language Log'



In [65]:

    
len(llog.entries)









    Out[65]:





15



In [66]:

    
post = llog.entries[2]
post.title









    Out[66]:





'Male aunty'



In [67]:

    
content = post.content[0].value
content[:70]









    Out[67]:





'<p>Joel Martinsen came across <a href="http://www.douban.com/people/sw'

Activity

Play around with the objects and methods from `feedparser` and `BeautifulSoup`.

Language detection

Language detection has always been a very difficult task. Althouh it's really important when scrapping and analyzing text in order to filter out only what you are interested in. It is possible to achieve good results by using just NLTK, but it gets too hard. What we are really looking for is TextBlob.

TextBlob is built on top of NLTK and other library called pattern, and one of those task that makes dead-simple for us is language detection. It's so simple, and gives so good results, that it's enough for us.



In [86]:

    
from textblob import TextBlob



In [87]:

    
TextBlob(u"Simple is better than complex.").detect_language()









    Out[87]:





'en'



In [88]:

    
TextBlob("Simple es mejor que complejo.").detect_language()









    Out[88]:





'es'



In [89]:

    
TextBlob(u"美丽优于丑陋").detect_language()









    Out[89]:





'zh-CN'



In [90]:

    
TextBlob(u"بسيط هو أفضل من مجمع").detect_language()









    Out[90]:





'ar'

Simply amazing! The language codes, 'en' for English, 'es' for Spanish, zh-CN for Chinese, or 'ar' for Arabic, are in ISO639-1 format.

Machine translation

The other task that TextBlob improves is translation. There is not much to say, just take a look to the functions.



In [92]:

    
en_blob = TextBlob(u"Simple is better than complex.")
en_blob.translate(to="es")









    Out[92]:





TextBlob("Simple es mejor que complejo .")

If no source language is specified, TextBlob will attempt to detect the language. You can specify the source language explicitly, like so.



In [91]:

    
chinese_blob = TextBlob(u"美丽优于丑陋")
chinese_blob.translate(from_lang="zh-CN", to='en')









    Out[91]:





TextBlob("Beautiful is better than ugly")

The translation service is provided by the Google Translate API, but please, don't trust it that much. It may work pretty good for short sentences and words, but it's not very reliable in large chunks of text, so tokenize first.

Activity

Write a program that loads feeds from the [Spanish Blog in Digital Humanities](http://humanidadesdigitales.net/blog/feed/), get the first 10 entries using `feedparser`, and for each, returns the next in English and withouth stopwords (*Hint*: take a look to the stopwords in NLTK under `nltk.corpus.stopwords.words('spanish')`):

Title
Number of sentences
Number of words
Number of unique words (vocabulary)
Number of hapaxes
Top 10 most frequent words

For the next class

Read Chapter 5, from Natural Language Processing with Python.

[Data, the Humanist's New Best Friend](index.ipynb)*Class 12*

Corpora

Conditional Frequency Distributions

Loading your own Corpus

Sources of data

Language detection

Machine translation

For the next class

[Data, the Humanist's New Best Friend](index.ipynb)
Class 12