In [1]:
%matplotlib inline
Natural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. However, as data scientists, we have a richer view of the natural language world - unstructured data that by its very nature has latent information that is important to humans. NLP practitioners have benefited from machine learning techniques to unlock meaning from large corpora, and in this class we’ll explore how to do that particularly with Python and with the Natural Language Toolkit (NLTK).
NLTK is an excellent library for machine-learning based NLP, written in Python by experts from both academia and industry. Python allows you to create rich data applications rapidly, iterating on hypotheses. The combination of Python + NLTK means that you can easily add language-aware data products to your larger analytical workflows and applications.
NLTK stands for the Natural Language Toolkit and is written by two eminent computational linguists, Steven Bird (Senior Research Associate of the LDC and professor at the University of Melbourne) and Ewan Klein (Professor of Linguistics at Edinburgh University). NLTK provides a combination of natural language corpora, lexical resources, and example grammars with language processing algorithms, methodologies and demonstrations for a very pythonic "batteries included" view of Natural Language Processing.
As such, NLTK is perfect for research-driven (hypothesis-driven) workflows for agile data science. Its suite of libraries spans corpus readers, tokenization, tagging, stemming and lemmatization, frequency distributions, parsing, and classification - all of which we will touch on in this notebook.
NLTK is a useful pedagogical resource for learning NLP with Python and serves as a starting place for producing production-grade code that requires natural language analysis. It is also important to understand what NLTK is not:
NLTK provides a variety of tools that can be used to explore the linguistic domain, but it is not a lightweight dependency that can be easily included in other workflows, especially those that require unit and integration testing or other build processes. This stems from the fact that NLTK includes a lot of extra code as well as a rich and complete library of corpora that power the built-in algorithms. In particular, be aware of the parts of NLTK that are not intended for production use:
- Syntactic parsing (NLTK does not ship a broad-coverage grammar)
- The sem package (a teaching-oriented toolkit for logical semantics)
- Lots of extra stuff (heavyweight dependency)
Knowing the good and the bad parts will help you explore NLTK further - looking into the source code to extract the material you need, then moving that code to production. We will explore NLTK in more detail in the rest of this notebook.
This notebook has a few dependencies, most of which can be installed via the Python package manager, pip.
Once you have Python and pip installed you can install NLTK as follows:
~$ pip install nltk
~$ pip install matplotlib
~$ pip install beautifulsoup4
~$ pip install gensim
Note that these will also install NumPy and SciPy if they aren't already installed.
To download the corpora, open a Python interpreter:
In [2]:
import nltk
In [3]:
nltk.download()
Out[3]:
This will open up a window with which you can download the various corpora and models to a specified location. For now, go ahead and download it all, as we will be exploring as much of NLTK as we can. Also take note of the download_directory - you're going to want to know where that is so you can get a detailed look at the corpora that are included. I usually export an environment variable to track this:
~$ export NLTK_DATA=/path/to/nltk_data
Take a moment to explore what is in this directory.
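For a quick peek at what was downloaded, you can list the top-level directories (a minimal sketch, assuming the NLTK_DATA variable above or the default ~/nltk_data location):
In [ ]:
import os

# Corpora, taggers, tokenizers, etc. each live in their own subdirectory.
nltk_data = os.environ.get('NLTK_DATA', os.path.expanduser('~/nltk_data'))
print os.listdir(nltk_data)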
In [4]:
moby = nltk.text.Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
The nltk.text.Text class is a wrapper around a sequence of simple (string) tokens, intended only for the initial exploration of text, usually via the Python REPL. It exposes a handful of exploratory methods - concordance, similarity, common contexts, and dispersion plots - that we will use below.
You shouldn't use this class in production-level systems, but it is useful to explore (small) snippets of text in a meaningful fashion.
The concordance function performs a search for the given token and then also provides the surrounding context:
In [5]:
moby.concordance("ship")
Given some context surrounding a word, we can discover similar words, e.g. words that occur frequently in the same context and with a similar distribution (distributional similarity):
In [6]:
moby.similar("monstrous")
austen = nltk.text.Text(nltk.corpus.gutenberg.words('austen-sense.txt'))
print
austen.similar("monstrous")
As you can see, this takes a bit of time to build the index in memory, one of the reasons it's not suggested to use this class in production code. Now that we can do searching and similarity, find the common contexts of a set of words:
In [7]:
moby.common_contexts(["sea", "whale"])
Your turn: go ahead and explore similar words and contexts - what does the common context mean?
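If you want a starting point, a sketch like the following works (the word choices are arbitrary):
In [ ]:
# Words distributionally similar to "ship", and the contexts shared by "ship" and "whale".
moby.similar("ship")
moby.common_contexts(["ship", "whale"])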
NLTK also uses matplotlib and pylab to display graphs and charts that can show dispersions and frequency. This is especially interesting for the corpus of inaugural addresses given by U.S. presidents.
In [8]:
inaugural = nltk.text.Text(nltk.corpus.inaugural.words())
inaugural.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
To explore much of the built-in corpora, use the following methods:
In [9]:
# Lists the various corpora and CorpusReader classes in the nltk.corpus module
for name in dir(nltk.corpus):
    if name.islower() and not name.startswith('_'): print name
In [10]:
# For a specific corpus, list the fileids that are available:
print nltk.corpus.shakespeare.fileids()
In [11]:
print nltk.corpus.gutenberg.fileids()
In [12]:
print nltk.corpus.stopwords.fileids()
These corpora export several vital methods:
In [13]:
corpus = nltk.corpus.brown
print corpus.paras()
In [14]:
print corpus.sents()
In [15]:
print corpus.words()
In [16]:
print corpus.raw()[:200] # Be careful!
Your turn! Explore some of the text in the available corpora
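For example, you might wrap another Gutenberg text and run a concordance (a sketch to get you started; the file and search term are arbitrary choices):
In [ ]:
hamlet = nltk.text.Text(nltk.corpus.gutenberg.words('shakespeare-hamlet.txt'))
hamlet.concordance("king")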
In statistical machine learning approaches to NLP, the very first thing we need to do is count things - especially the unigrams that appear in the text and their relationships to each other. NLTK provides two excellent classes to enable these frequency analyses:
FreqDist
ConditionalFreqDist
And these two classes serve as the foundation for most of the probability and statistical analyses that we will conduct.
First we will compute the following:
In [17]:
reuters = nltk.corpus.reuters # Corpus of news articles
counts = nltk.FreqDist(reuters.words())
vocab = len(counts.keys())
words = sum(counts.values())
lexdiv = float(words) / float(vocab)
print "Corpus has %i types and %i tokens for a lexical diversity of %0.3f" % (vocab, words, lexdiv)
In [18]:
counts.B()
Out[18]:
In [19]:
print counts.most_common(40) # The n most common tokens in the corpus
In [20]:
print counts.max() # The most frequent token in the corpus
In [21]:
print counts.hapaxes()[0:10] # A list of all hapax legomena
In [22]:
counts.freq('with') * 100 # percentage of the corpus for this token
Out[22]:
In [23]:
counts.plot(20, cumulative=False)
In [24]:
from itertools import chain
brown = nltk.corpus.brown
categories = brown.categories()
counts = nltk.ConditionalFreqDist(chain(*[[(cat, word) for word in brown.words(categories=cat)] for cat in categories]))
for category, dist in counts.items():
    vocab  = len(dist.keys())
    tokens = sum(dist.values())
    lexdiv = float(tokens) / float(vocab)
    print "%s: %i types with %i tokens and lexical diversity of %0.3f" % (category, vocab, tokens, lexdiv)
Your turn: compute the conditional frequency distribution of bigrams in a corpus
Hint:
In [25]:
for bigram in nltk.bigrams(["The", "bear", "walked", "in", "the", "woods", "at", "midnight"]):
    print bigram
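One possible sketch: since nltk.bigrams yields (first word, second word) pairs, they can be fed directly to ConditionalFreqDist, conditioning on the first word (the choice of the Reuters corpus and the word "crude" are just examples):
In [ ]:
reuters = nltk.corpus.reuters
bigram_counts = nltk.ConditionalFreqDist(nltk.bigrams(reuters.words()))

# The most common words that follow "crude" in the Reuters corpus.
print bigram_counts['crude'].most_common(10)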
NLTK is great at the preprocessing of raw text - it provides the following tools for dividing text into its constituent parts:
sent_tokenize: a Punkt sentence tokenizer
This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.
However, Punkt is designed to learn parameters (a list of abbreviations, etc.) unsupervised from a corpus similar to the target domain. The pre-packaged models may therefore be unsuitable: use PunktSentenceTokenizer(text) to learn parameters from the given text (a sketch of this appears after this list).
word_tokenize: a Treebank tokenizer
The Treebank tokenizer uses regular expressions to tokenize text as in the Penn Treebank. It is the method invoked by word_tokenize(), and it assumes that the text has already been segmented into sentences, e.g. using sent_tokenize().
pos_tag: a maximum entropy tagger trained on the Penn Treebank
There are several other taggers, including (notably) the BrillTagger, as well as the BrillTrainer to train your own tagger or tagset.
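As noted above, you can learn Punkt parameters from your own corpus instead of relying on the pre-packaged model. A minimal sketch, reusing the Moby Dick raw text as a stand-in for domain plaintext:
In [ ]:
import nltk
from nltk.tokenize.punkt import PunktSentenceTokenizer

# In practice, train on a large plaintext sample from your target domain;
# Moby Dick is only a convenient stand-in here.
train_text = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')

# Learns abbreviations, collocations, and sentence starters from the training text.
domain_tokenizer = PunktSentenceTokenizer(train_text)

for sent in domain_tokenizer.tokenize(train_text[:500]):
    print sent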
In [26]:
text = u"Medical personnel returning to New York and New Jersey from the Ebola-riddled countries in West Africa will be automatically quarantined if they had direct contact with an infected person, officials announced Friday. New York Gov. Andrew Cuomo (D) and New Jersey Gov. Chris Christie (R) announced the decision at a joint news conference Friday at 7 World Trade Center. “We have to do more,” Cuomo said. “It’s too serious of a situation to leave it to the honor system of compliance.” They said that public-health officials at John F. Kennedy and Newark Liberty international airports, where enhanced screening for Ebola is taking place, would make the determination on who would be quarantined. Anyone who had direct contact with an Ebola patient in Liberia, Sierra Leone or Guinea will be quarantined. In addition, anyone who traveled there but had no such contact would be actively monitored and possibly quarantined, authorities said. This news came a day after a doctor who had treated Ebola patients in Guinea was diagnosed in Manhattan, becoming the fourth person diagnosed with the virus in the United States and the first outside of Dallas. And the decision came not long after a health-care worker who had treated Ebola patients arrived at Newark, one of five airports where people traveling from West Africa to the United States are encountering the stricter screening rules."
for sent in nltk.sent_tokenize(text):
    print sent
    print
In [27]:
for sent in nltk.sent_tokenize(text):
    print list(nltk.word_tokenize(sent))
    print
In [28]:
for sent in nltk.sent_tokenize(text):
    print list(nltk.pos_tag(nltk.word_tokenize(sent)))
    print
All of these taggers work pretty well - but you can (and should) train them on your own corpora.
We have an immense number of word forms, as you can see from our various counts in the FreqDist above - it is helpful for many applications to normalize these word forms (especially applications like search) into some canonical word for further exploration. In English (and many other languages), morphological context indicates gender, tense, quantity, etc., but these subtleties might not be necessary:
Stemming = chop off affixes to get the root stem of the word:
running --> run
flowers --> flower
geese --> geese
Lemmatization = look up word form in a lexicon to get canonical lemma
women --> woman
foxes --> fox
sheep --> sheep
There are several stemmers available:
- Lancaster (English, newer and aggressive)
- Porter (English, original stemmer)
- Snowball (many languages, newest)
The Lemmatizer uses the WordNet lexicon.
In [29]:
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.porter import PorterStemmer
text = list(nltk.word_tokenize("The women running in the fog passed bunnies working as computer scientists."))
snowball = SnowballStemmer('english')
lancaster = LancasterStemmer()
porter = PorterStemmer()
for stemmer in (snowball, lancaster, porter):
    stemmed_text = [stemmer.stem(t) for t in text]
    print stemmed_text
In [30]:
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in text]
print lemmas
Note that the lemmatizer has to load the WordNet corpus which takes a bit.
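Also note that lemmatize assumes the noun part of speech by default; passing a WordNet POS tag changes the result. A small illustration:
In [ ]:
# lemmatize() defaults to the noun POS ('n'); pass 'v', 'a', etc. to change it.
print lemmatizer.lemmatize("running")       # 'running' (as a noun, already a lemma)
print lemmatizer.lemmatize("running", "v")  # 'run'
print lemmatizer.lemmatize("geese")         # 'goose'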
Typical normalization of text for use as features in machine learning models looks something like this:
In [31]:
import string
## Module constants
lemmatizer = WordNetLemmatizer()
stopwords = set(nltk.corpus.stopwords.words('english'))
punctuation = string.punctuation
def normalize(text):
    for token in nltk.word_tokenize(text):
        token = token.lower()
        token = lemmatizer.lemmatize(token)
        if token not in stopwords and token not in punctuation:
            yield token

print list(normalize("The eagle ... flies at midnight."))
In [32]:
print nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize("John Smith is from the United States of America and works at Microsoft Research Labs")))
You can also wrap the Stanford NER system, which many of you are probably already used to using.
In [33]:
import os
from nltk.tag.stanford import NERTagger
# change the paths below to point to wherever you unzipped the Stanford NER download file
stanford_root = '/Users/benjamin/Development/stanford-ner-2014-01-04'
stanford_data = os.path.join(stanford_root, 'classifiers/english.all.3class.distsim.crf.ser.gz')
stanford_jar = os.path.join(stanford_root, 'stanford-ner-2014-01-04.jar')
st = NERTagger(stanford_data, stanford_jar, 'utf-8')
for i in st.tag("John Bengfort is from the United States of America and works at Microsoft Research Labs".split()):
    print '[' + i[1] + '] ' + i[0]
The primary responsibility you will have before any task involving NLP is to ingest and transform raw text into a corpus that can then be used for performing further evaluations. NLTK provides many corpora for you to work with for exploration, but you must become able to design and construct your own corpora, and to implement nltk.CorpusReader objects - classes that read and analyze entire corpora in a memory-safe and efficient way.
Many people get away with the nltk.corpus.PlaintextCorpusReader, which uses built-in taggers and tokenizers to deal with raw text. However, this methodology leaves you at the mercy of the tagging model that you have provided, and does not allow you to make corrections that are saved between runs. Instead you should preprocess your text so that it can be read by the nltk.corpus.TaggedCorpusReader or, for fully parsed corpora, by the nltk.corpus.BracketParseCorpusReader.
In this task, you will transform raw text into a format that can then be read by the nltk.corpus.TaggedCorpusReader. See the documentation at http://www.nltk.org/api/nltk.corpus.reader.html for more information on this reader.
You will find 20-40 documents of recent tech articles from Engadget and TechCrunch at the following link: http://bit.ly/nlpnltkcorpus - please download them to your local file system. Write a Python program that uses NLTK to preprocess these documents into a format that can be easily read by the nltk.corpus.TaggedCorpusReader.
Note that you will have to process these files to remove HTML tags, and you might have to do other cleanup tasks as well; to do this I suggest you use the third-party library BeautifulSoup, which can be found at http://www.crummy.com/software/BeautifulSoup/. See also Chapter 3 in the NLTK book for more information.
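One way to structure this preprocessing (a minimal sketch, not a complete solution - the RAW_DIR and OUT_DIR paths are assumptions, and error handling is omitted):
In [ ]:
import os
import codecs
import nltk
from bs4 import BeautifulSoup

RAW_DIR = 'articles_html'    # the downloaded Engadget/TechCrunch HTML files
OUT_DIR = 'articles_tagged'  # destination for the tagged plaintext corpus

if not os.path.isdir(OUT_DIR):
    os.makedirs(OUT_DIR)

def preprocess(html):
    # Strip HTML, then segment, tokenize, and tag each sentence.
    text = BeautifulSoup(html, 'html.parser').get_text()
    for sent in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sent))
        # TaggedCorpusReader's default format: word/TAG tokens, one sentence per line.
        yield ' '.join('%s/%s' % (word, tag) for word, tag in tagged)

for name in os.listdir(RAW_DIR):
    with codecs.open(os.path.join(RAW_DIR, name), 'r', 'utf-8') as infile:
        html = infile.read()
    outpath = os.path.join(OUT_DIR, os.path.splitext(name)[0] + '.txt')
    with codecs.open(outpath, 'w', 'utf-8') as outfile:
        outfile.write('\n'.join(preprocess(html)))

corpus = nltk.corpus.reader.TaggedCorpusReader(OUT_DIR, r'.*\.txt')
print corpus.tagged_sents()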
(Hint: you may want to make use of the nltk.corpus.stopwords.words('english') list.)
Given a seed inventory of pre-terminal and non-terminal symbols (grammatical categories) and a sample lexicon, write a grammar for English noun phrases. Your grammar should cover all legal structures of noun phrases used by the grammatical categories provided, covering at minimum the symbol inventory and lexicon listed below.
Note: You do not need to cover more than one PP in a row, more than one adjective in a row, noun-noun compounds of length > 2, quantifiers followed by determiners ("all of these") or mass nouns ("beer", "sincerity")
You should then write a program that uses an NLTK parser and the grammar you constructed, and that returns a syntactic tree if the input is a noun phrase or None if the input is ungrammatical. Your program will have to take the input sentence, tokenize it, and then tag it according to the lexicon (you can assume that words in this lexicon do not have multiple senses) - you'll then have to pass the tag sequence to the parser. A minimal sketch of the parsing machinery appears after the lexicon below.
N = noun
NP = noun phrase
Adj = adjective
AdjP = adjective phrase
Adv = adverb
Prep = preposition
PP = prepositional phrase
Quant = quantifier
Ord = ordinal numeral
Card = cardinal numeral
Rel-Cl = relative clause
Rel-Pro = relative pronoun
V = verb
S = sentence
Det = determiner
Dem-Det = demonstrative determiner
Wh-Det = wh-determiner
PPron = personal pronoun
PoPron = possessive pronoun
a Det
an Det
at Prep
airplane NSg
airplanes NPl
airport NSg
airports NPl
any Quant
beautiful Adj
big Adj
eat V
eats V3Sg
finished VPastPP
four Card
fourth Ord
he PPron
his PoPron
in Prep
many Quant
my PoPron
new Adj
of Prep
offered VPastPP
on Prep
restaurant NSg
restaurants NPl
runway NSg
runways NPl
second Ord
some Quant
that Dem-DetSg
that Rel-Pro
the Det
this Dem-DetSg
these Dem-DetPl
third Ord
those Dem-DetPl
three Card
two Card
very Adv
which Wh-Det
who Wh-Det
you PPron
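To give a feel for the parsing machinery (not the full grammar you are asked to write), here is a minimal sketch with a toy grammar over a few of the tags above; extend the rules to cover the whole inventory and lexicon:
In [ ]:
import nltk

# A toy noun-phrase grammar over tag symbols; the rules shown are illustrative only.
grammar = nltk.CFG.fromstring("""
    NP   -> Det Nom
    Nom  -> Adj Nom | N | N PP
    PP   -> Prep NP
    Det  -> 'Det'
    Adj  -> 'Adj'
    N    -> 'NSg' | 'NPl'
    Prep -> 'Prep'
""")

parser = nltk.ChartParser(grammar)

def parse_tags(tags):
    # Return the first parse tree if the tag sequence is a grammatical NP, else None.
    for tree in parser.parse(tags):
        return tree
    return None

# "the big airport" -> Det Adj NSg (tags looked up in the sample lexicon)
print parse_tags(['Det', 'Adj', 'NSg'])
# "at the airport" is not a noun phrase, so this returns None.
print parse_tags(['Prep', 'Det', 'NSg'])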
In the first week you created an ingestion mechanism and an NLTK corpus reader for a set of RSS feeds. These feeds potentially have topics associated with them (broad tags like tech, news, sports, etc.). In this question you'll build a classifier on a data set of RSS feeds that is provided in the course materials to decide whether or not you can categorize the various topics using one of the classifiers you learned about this week.
The corpus is constructed as follows. Each individual blog post is in its own HTML file stored in a directory labeled with the topic. Use the nltk.CategorizedCorpusReader or the nltk.CategorizedPlaintextCorpusReader to construct your corpora (you may review how the movie reviews data set is structured). To do this you need to pass the corpus the path to the root of your corpus and a regular expression to match file names. You also need to use a regular expression passed as the cat_pattern keyword argument, which is used to match the category labels. Here is an example for the spam corpus:
from nltk.corpus import CategorizedPlaintextCorpusReader as EmailCorpus
corpus = EmailCorpus("./data/nbspam", r'(?!\.).*\.[a-f0-9]+',
cat_pattern=r'(spam|ham)/.*', encoding='iso-8859-1')
print corpus.categories()
print corpus.fileids()
Create a test set, a dev test set, and a training set from randomly shuffled documents that are in the corpus to use in your development. Save these sets to disk with pickles to ensure that you can develop easily with them.
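A minimal sketch of the shuffling, splitting, and pickling, assuming corpus is the categorized corpus you constructed above (the 70/10/20 split points and the file names are arbitrary choices):
In [ ]:
import random
import pickle

documents = [(list(corpus.words(fileid)), category)
             for category in corpus.categories()
             for fileid in corpus.fileids(category)]
random.shuffle(documents)

n = len(documents)
splits = {
    'train':   documents[:int(n * 0.7)],
    'devtest': documents[int(n * 0.7):int(n * 0.9)],
    'test':    documents[int(n * 0.9):],
}

# Persist each split so development does not require rebuilding the corpus.
for name, docs in splits.items():
    with open('%s.pickle' % name, 'wb') as f:
        pickle.dump(docs, f)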
Create a function that extracts features per document. Choose any features you would like. One idea is to use the most common unigrams; you might be able to use common bigrams as well. If you can think of any other features, feel free to include them as well (maybe an includes_recipe feature, etc.). You may want to consider a TF-IDF feature to improve your results.
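For instance, a unigram contains(word) feature extractor might look like this (a sketch; the 2,000-word cutoff is an assumption you should tune):
In [ ]:
all_words = nltk.FreqDist(word.lower() for word in corpus.words())
word_features = [word for word, count in all_words.most_common(2000)]

def document_features(words):
    # Boolean features: does the document contain each of the most frequent words?
    words = set(word.lower() for word in words)
    return dict(('contains(%s)' % feature, feature in words) for feature in word_features)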
Train the classifier of your choice on the training data, and then improve it with your dev set. Report your final accuracy and the most informative features by running the accuracy checker on the final test set.
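A sketch of that loop, assuming document_features and the train/devtest/test splits from above (Naive Bayes is used here only as an example classifier):
In [ ]:
train_feats   = [(document_features(words), label) for words, label in splits['train']]
devtest_feats = [(document_features(words), label) for words, label in splits['devtest']]
test_feats    = [(document_features(words), label) for words, label in splits['test']]

classifier = nltk.NaiveBayesClassifier.train(train_feats)
print nltk.classify.accuracy(classifier, devtest_feats)  # iterate on features with the dev test set
print nltk.classify.accuracy(classifier, test_feats)     # report the final accuracy
classifier.show_most_informative_features(20)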
The second question involves comparing and contrasting the Naive Bayes classifier with the Maximum Entropy classifier. You will be given an abbreviated data set of product names and their descriptions as well as their label (tops, bottoms, shoes, etc.). Similarly to question one, create a corpus that can read the CSV file - you may want to look at the nltk.corpus.WordListCorpusReader for inspiration about how to create such a corpus (each product is on a single line).
Create test and training sets from the data, then build both a NaiveBayes and a Maxent classifier - make sure that you save these classifiers to disk using the pickle module! The Maxent classifier in particular will take a long time to train. Once they're trained, report the accuracy of each as well as the most informative features. Are there any surprises? Which classifier performs better?
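A minimal sketch of training and pickling both classifiers, assuming a train_feats list of (features, label) pairs built as in the previous question (max_iter is kept small only so the sketch finishes quickly):
In [ ]:
import pickle

nb_classifier = nltk.NaiveBayesClassifier.train(train_feats)
me_classifier = nltk.classify.MaxentClassifier.train(train_feats, max_iter=10)

# Persist both models so the (slow) Maxent training does not have to be repeated.
for name, clf in (('naivebayes', nb_classifier), ('maxent', me_classifier)):
    with open('%s.classifier.pickle' % name, 'wb') as f:
        pickle.dump(clf, f)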
In [ ]: