NLP with NLTK

Today's talk will address various concepts in the Natural Language Processing pipeline through the use of NLTK. A fundamental understanding of Python is necessary. We will cover:

  1. Pre-processing
  2. Preparing and declaring your own corpus
  3. POS-Tagging
  4. Dependency Parsing
  5. NER
  6. Sentiment Analysis

You will need:

  • NLTK ($ pip install nltk)
  • the parser wrapper requires the Stanford Parser (in Java)
  • the NER wrapper requires the Stanford NER (in Java)

1) Pre-processing

This won't be covered much today, but regex and basic Python string methods are the most important tools for preprocessing tasks. NLTK does, however, offer an array of tokenizers and stemmers for various languages.
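
As a quick illustration (not covered further today; the example string and the exact cleanup steps are just assumptions for demonstration), a minimal regex-based cleanup might look like this:


In [ ]:
import re

raw = "  Hello!!   Visit   http://example.com for more  info...  "

no_urls = re.sub(r'http\S+', '', raw)            # drop URLs
cleaned = re.sub(r'\s+', ' ', no_urls).strip()   # collapse whitespace
print(cleaned.lower())                           # normalize case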

Tokenizing


In [ ]:
text = '''Hello, my name is Chris. 
I'll be talking about the python library NLTK today. 
NLTK is a popular tool to conduct text processing tasks in NLP.'''

In [ ]:
from nltk.tokenize import word_tokenize

print("Notice the difference!")
print()
print(word_tokenize(text))

print()
print("vs.")
print()

print(text.split())

You can also tokenize sentences.


In [ ]:
from nltk.tokenize import sent_tokenize
print(sent_tokenize(text))

In [ ]:
tokenized_text = [word_tokenize(sent) for sent in sent_tokenize(text)]
print(tokenized_text)

A list of sentences, each itself a list of word tokens, is the format most analysis libraries expect.

Stemming/Lemmatizing


In [ ]:
from nltk import SnowballStemmer

snowball = SnowballStemmer('english')

print(snowball.stem('running'))
print(snowball.stem('eats'))
print(snowball.stem('embarrassed'))

But watch out for errors:


In [ ]:
print(snowball.stem('cylinder'))
print(snowball.stem('cylindrical'))

Or collision:


In [ ]:
print(snowball.stem('vacation'))
print(snowball.stem('vacate'))

This is why lemmatizing, when computing power and time allow, is usually preferable:


In [ ]:
from nltk import WordNetLemmatizer
wordnet = WordNetLemmatizer()

print(wordnet.lemmatize('vacation'))
print(wordnet.lemmatize('vacate'))
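
Note that WordNetLemmatizer assumes every word is a noun unless told otherwise; passing a part-of-speech argument changes the result:


In [ ]:
print(wordnet.lemmatize('running'))           # treated as a noun -> unchanged
print(wordnet.lemmatize('running', pos='v'))  # treated as a verb -> 'run'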

So why is this important?


In [ ]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

categories = ['talk.politics.mideast', 'rec.autos', 'sci.med']
twenty = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

data_no_stems = [[word_tokenize(sent) for sent in sent_tokenize(text.lower())] for text in twenty.data]
data_stems = [[[snowball.stem(word) for word in word_tokenize(sent)]
               for sent in sent_tokenize(text)] for text in twenty.data]

In [ ]:
print(data_no_stems[400][5])

In [ ]:
print(data_stems[400][5])

In [ ]:
data_no_stems = [' '.join([item for sublist in l for item in sublist]) for l in data_no_stems]
data_stems = [' '.join([item for sublist in l for item in sublist]) for l in data_stems]

vectorizer = TfidfVectorizer()
X_data_no_stems = vectorizer.fit_transform(data_no_stems)

vectorizer2 = TfidfVectorizer()
X_data_stems = vectorizer2.fit_transform(data_stems)

In [ ]:
from sklearn.model_selection import train_test_split
from sklearn import ensemble

X_train, X_test, y_train, y_test = train_test_split(X_data_no_stems, twenty.target,
                                                    train_size=0.75, test_size=0.25)

rf_classifier = ensemble.RandomForestClassifier(n_estimators=10,  # number of trees
                       criterion='gini',  # or 'entropy' for information gain
                       max_depth=None,  # how deep tree nodes can go
                       min_samples_split=2,  # samples needed to split node
                       min_samples_leaf=1,  # samples needed for a leaf
                       min_weight_fraction_leaf=0.0,  # weight of samples needed for a node
                       max_features='sqrt',  # number of features for best split
                       max_leaf_nodes=None,  # max nodes
                       min_impurity_decrease=0.0,  # minimum impurity decrease needed to split
                       n_jobs=1,  # CPUs to use
                       class_weight="balanced")  # adjusts weights inverse of freq, also "balanced_subsample" or None

model = rf_classifier.fit(X_train, y_train)
print(model.score(X_test, y_test))

In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X_data_stems, twenty.target,
                                                    train_size=0.75, test_size=0.25)

rf_classifier = ensemble.RandomForestClassifier(n_estimators=10,  # number of trees
                       criterion='gini',  # or 'entropy' for information gain
                       max_depth=None,  # how deep tree nodes can go
                       min_samples_split=2,  # samples needed to split node
                       min_samples_leaf=1,  # samples needed for a leaf
                       min_weight_fraction_leaf=0.0,  # weight of samples needed for a node
                       max_features='sqrt',  # number of features for best split
                       max_leaf_nodes=None,  # max nodes
                       min_impurity_decrease=0.0,  # minimum impurity decrease needed to split
                       n_jobs=1,  # CPUs to use
                       class_weight="balanced")  # adjusts weights inverse of freq, also "balanced_subsample" or None

model = rf_classifier.fit(X_train, y_train)
print(model.score(X_test, y_test))
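
A single random split can be noisy, so the two scores above may fluctuate from run to run. As a rough sketch (reusing the vectorized matrices above), cross-validation gives a steadier comparison of the stemmed and unstemmed features:


In [ ]:
from sklearn.model_selection import cross_val_score

for name, X in [("no stems", X_data_no_stems), ("stems", X_data_stems)]:
    scores = cross_val_score(ensemble.RandomForestClassifier(n_estimators=10, class_weight="balanced"),
                             X, twenty.target, cv=5)
    print(name, round(scores.mean(), 3))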

2) Declaring a corpus in NLTK

While you can use NLTK on strings and lists of sentences, it's better to formally declare your corpus: this handles the preprocessing above for you and provides methods to access the results. For our purposes today, we'll use a corpus of book summaries, which I've converted into a folder of .txt files for demonstration.


In [ ]:
from nltk.corpus import PlaintextCorpusReader

corpus_root = "texts/"  # relative path to texts.
my_texts = PlaintextCorpusReader(corpus_root, '.*txt')

We now have a text corpus, on which we can run all the basic preprocessing methods. To list all the files in our corpus:


In [ ]:
my_texts.fileids()[:10]

In [ ]:
my_texts.words('To Kill A Mockingbird.txt')  # word-tokenized; PlaintextCorpusReader uses WordPunctTokenizer by default

In [ ]:
my_texts.sents('To Kill A Mockingbird.txt')

It also adds a paragraph method:


In [ ]:
my_texts.paras('To Kill A Mockingbird.txt')[0]

Let's save these to a variable to look at the next step on a low level:


In [ ]:
m_sents = my_texts.sents('To Kill A Mockingbird.txt')
print(m_sents)

We now have a corpus, or text, from which we can get any of the statistics you learned in Day 3 of the Python workshop. We will review some of these functions once we have some more information.
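
For example, a minimal sketch of a few such statistics on this one summary (reusing m_sents and my_texts from above):


In [ ]:
from nltk import FreqDist

m_words = my_texts.words('To Kill A Mockingbird.txt')

print(len(m_sents))                                           # number of sentences
print(len(m_words))                                           # number of word tokens
print(len(set(w.lower() for w in m_words)) / len(m_words))    # lexical diversity
print(FreqDist(w.lower() for w in m_words).most_common(10))   # most frequent tokens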

3) POS-Tagging

There are many situations in which "tagging" words (or really anything) may be useful for determining or calculating trends, or for further text analysis to extract meaning. NLTK contains several methods to achieve this, from simple regex to more advanced machine learning models.

It is important to note that in Natural Language Processing (NLP), POS (Part of Speech) tagging is the most common use for tagging, but the tag itself can be anything. Other applications include sentiment analysis and NER (Named Entity Recognition). Tagging simply labels each word with a category via a (word, tag) tuple.

Nevertheless, for training more advanced tagging models, POS tagging is nearly essential. If you are defining a machine learning model to predict patterns in your text, these patterns will most likely rely on, among other things, POS features. You will therefore first tag POS and then use the POS as a feature in your model.

On a low-level

Tagging creates a tuple of (word, tag) for every word in a text or corpus. For example, "My name is Chris" may be POS-tagged as:

My/PossessivePronoun name/Noun is/Verb Chris/ProperNoun ./Period

NB: type 'nltk.data.path' to find the path on your computer to your downloaded nltk corpora. You can explore these files to see how large corpora are formatted.

You'll notice how the text is annotated, using a forward slash to match the word to its tag. So how can we get this to a useful form for Python?


In [ ]:
from nltk.tag import str2tuple

line = "My/Possessive_Pronoun name/Noun is/Verb Chris/Proper_Noun ./Period"
tagged_sent = [str2tuple(t) for t in line.split()]

print(tagged_sent)

Further analysis of tags with NLTK requires a list of sentences; otherwise you will get an index error from higher-level methods.
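
NLTK also provides the inverse helpers, untag and tuple2str, if you need to get back from tuples to plain words or to the slash-annotated string:


In [ ]:
from nltk.tag import untag, tuple2str

print(untag(tagged_sent))                           # just the words
print(' '.join(tuple2str(t) for t in tagged_sent))  # back to word/tag form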

Naturally, these tags are a bit verbose. The standard tagging conventions follow the Penn Treebank (more in a second): https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

Automatic Tagging

NLTK's stock English pos_tag tagger is a perceptron tagger:


In [ ]:
from nltk import pos_tag
m_tagged_sent = pos_tag(m_sents[0])
print(m_tagged_sent)

What do these tags mean?


In [ ]:
from nltk import help
help.upenn_tagset()
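
If the Penn tags are too fine-grained, pos_tag can also map to the coarser universal tagset (this assumes the universal_tagset resource has been fetched via nltk.download):


In [ ]:
print(pos_tag(m_sents[0], tagset='universal'))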

In [ ]:
m_tagged_all = [pos_tag(sent) for sent in m_sents]
print(m_tagged_all[:3])

We can find and aggregate certain parts of speech too:


In [ ]:
from nltk import ConditionalFreqDist
def find_tags(tag_prefix, tagged_text):
    cfd = ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                  if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(5)) for tag in cfd.conditions())  # cfd.conditions() yields all possible tags

In [ ]:
m_tagged_words = [item for sublist in m_tagged_all for item in sublist]

tagdict = find_tags('JJ', m_tagged_words)
for tag in sorted(tagdict):
    print(tag, tagdict[tag])

We can begin to quantify syntax by looking at the environments of words. For example, what commonly follows a verb?


In [ ]:
import nltk

tags = [b[1] for (a, b) in nltk.bigrams(m_tagged_words) if a[1].startswith('VB')]
fd1 = nltk.FreqDist(tags)

print("To Kill A Mockingbird")
fd1.tabulate(10)

Creating a tagged corpus

Now that we know how tagging works, we can quickly tag our documents; here we'll only tag a sample (every tenth file) of the much larger corpus.


In [ ]:
tagged_sents = {}
for fid in my_texts.fileids()[::10]:
    tagged_sents[fid.split(".")[0]] = [pos_tag(sent) for sent in my_texts.sents(fid)]

In [ ]:
tagged_sents.keys()

In [ ]:
tagged_sents["Harry Potter and the Prisoner of Azkaban"]

Absolute frequencies are available through NLTK's FreqDist class:


In [ ]:
all_tags = []
all_tups = []

for k in tagged_sents.keys():
    for s in tagged_sents[k]:
        for t in s:
            all_tags.append(t[1])
            all_tups.append(t)

nltk.FreqDist(all_tags).tabulate(10)
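
FreqDist also exposes the counts directly (most_common, indexing by tag, and the total via N()), which is handy if you want proportions rather than a table:


In [ ]:
fd = nltk.FreqDist(all_tags)
print(fd.most_common(5))
print(fd['NN'] / fd.N())  # proportion of singular-noun tags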

In [ ]:
tags = ['NN', 'VB', 'JJ']
for t in tags:
    tagdict = find_tags(t, all_tups)
    for tag in sorted(tagdict):
        print(tag, tagdict[tag])

We can compare this to other genres:


In [ ]:
from nltk.corpus import brown

for c in brown.categories():
    tagged_words = brown.tagged_words(categories=c) #not universal tagset
    tag_fd = nltk.FreqDist(tag for (word, tag) in tagged_words)
    print(c.upper())
    tag_fd.tabulate(10)
    print()
    tags = ['NN', 'VB', 'JJ']
    for t in tags:
        tagdict = find_tags(t, tagged_words)
        for tag in sorted(tagdict):
            print(tag, tagdict[tag])
    print()
    print()

We can also look at the linguistic environments of words at a low level; the cell below lists all the words preceding "love" in the romance category:


In [ ]:
brown_news_text = brown.words(categories='romance')
sorted(set(a for (a, b) in nltk.bigrams(brown_news_text) if b == 'love'))
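
For a more readable view of those environments, nltk.Text provides a concordance (it prints matches rather than returning them):


In [ ]:
from nltk import Text

romance_text = Text(brown.words(categories='romance'))
romance_text.concordance('love', lines=10)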

4) Dependency Parsing

While tagging parts of speech can be helpful for certain NLP tasks, dependency parsing is better at extracting the actual grammatical relationships within a sentence.


In [ ]:
from nltk.parse.stanford import StanfordDependencyParser

dependency_parser = StanfordDependencyParser(path_to_jar = "/Users/chench/Documents/stanford-parser-full-2015-12-09/stanford-parser.jar",
                                             path_to_models_jar = "/Users/chench/Documents/stanford-parser-full-2015-12-09/stanford-parser-3.6.0-models.jar")

result = dependency_parser.raw_parse_sents(['I shot an elephant in my sleep.', 'It was great.'])

As the parser takes longer to run, I will not run it on the entire corpus, but an example is below:


In [ ]:
for r in result:
    for o in r:
        trips = list(o.triples())  # ((head word, head tag), rel, (dep word, dep tag))
        for t in trips:
            print(t)
            print()
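
The same dependency graphs can also be written out in CoNLL format, which is convenient for saving results to disk. A minimal sketch (re-parsing, since the iterator above is already exhausted; to_conll(4) gives word, tag, head index, and relation columns):


In [ ]:
result = dependency_parser.raw_parse_sents(['I shot an elephant in my sleep.', 'It was great.'])
for r in result:
    for o in r:
        print(o.to_conll(4))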

5) Named Entity Recognition

After tokenizing, tagging, and parsing, one of the last steps in the pipeline is NER. Identifying named entities can be useful in determining many different relationships, and often serves as a prerequisite to mapping textual relationships within a set of documents.


In [ ]:
from nltk.tag.stanford import StanfordNERTagger

ner_tag = StanfordNERTagger(
        '/Users/chench/Documents/stanford-ner-2015-12-09/classifiers/english.all.3class.distsim.crf.ser.gz',
        '/Users/chench/Documents/stanford-ner-2015-12-09/stanford-ner.jar')
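
As a quick sanity check before looping over whole books (the example sentence here is made up), the tagger can be run on a single tokenized sentence:


In [ ]:
print(ner_tag.tag(word_tokenize("Atticus Finch works as a lawyer in Maycomb, Alabama.")))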

In [ ]:
import pyprind

ner_sents = {}
books = ["To Kill A Mockingbird.txt", "Harry Potter and the Prisoner of Azkaban.txt"]

for fid in books:
    bar = pyprind.ProgBar(len(my_texts.sents(fid)), monitor=True, bar_char="#")
    tagged_sents = []
    for sent in my_texts.sents(fid):
        tagged_sents.append(ner_tag.tag(sent))
        bar.update()
    ner_sents[fid.split(".")[0]] = tagged_sents
    print()

We can look at a single summary at a low level:


In [ ]:
print(ner_sents["To Kill A Mockingbird"])

In [ ]:
from itertools import groupby
from nltk import FreqDist

NER = {"LOCATION": [],
       "PERSON": [],
       "ORGANIZATION": [],
       }

for sentence in ner_sents["To Kill A Mockingbird"]:
    for tag, chunk in groupby(sentence, lambda x: x[1]):
        if tag != "O":
            NER[tag].append(" ".join(w for w, t in chunk))

if NER["LOCATION"]:
    print("Locations:")
    FreqDist(NER["LOCATION"]).tabulate()
    print()

if NER["PERSON"]:
    print("Persons:")
    FreqDist(NER["PERSON"]).tabulate()
    print()

if NER["ORGANIZATION"]:
    print("Organizations")
    FreqDist(NER["ORGANIZATION"]).tabulate()

Or across both books:


In [ ]:
NER = {"LOCATION": [],
       "PERSON": [],
       "ORGANIZATION": [],
       }

for k in ner_sents.keys():
    for sentence in ner_sents[k]:
        for tag, chunk in groupby(sentence, lambda x: x[1]):
            if tag != "O":
                NER[tag].append(" ".join(w for w, t in chunk))

if NER["LOCATION"]:
    print("Locations:")
    FreqDist(NER["LOCATION"]).tabulate()
    print()

if NER["PERSON"]:
    print("Persons:")
    FreqDist(NER["PERSON"]).tabulate()
    print()

if NER["ORGANIZATION"]:
    FreqDist(NER["ORGANIZATION"]).tabulate()

6) Sentiment Analysis

While earlier sentiment analysis was based on simple dictionary look-up methods denoting words as positive or negative, or assigning numerical values to words, newer methods are better able to take a word's or sentence's environment into account. VADER (Valence Aware Dictionary and sEntiment Reasoner) is one such example.


In [ ]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import numpy as np

sid = SentimentIntensityAnalyzer()

print(sid.polarity_scores("I really don't like that book.")["compound"])
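
To see the context sensitivity described above, compare a few variants of the same sentence (the example sentences are made up):


In [ ]:
for s in ["I really like that book.",
          "I really don't like that book.",
          "I really don't like that book at all!"]:
    print(s, sid.polarity_scores(s)["compound"])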

In [ ]:
for fid in books:
    print(fid.upper())
    raw_sents = sent_tokenize(my_texts.raw(fid))
    sent_pols = [sid.polarity_scores(s)["compound"] for s in raw_sents]
    for s, pol in zip(raw_sents, sent_pols):
        print(s, pol)
        print()
    
    print()
    print("Mean: ", np.mean(sent_pols))
    print()
    print("="*100)
    print()