Today's talk will address various concepts in the Natural Language Processing pipeline through the use of NLTK. A fundamental understanding of Python is necessary. We will cover:
You will need:
This won't be covered much today, but regex and basic Python string methods are the most important tools for preprocessing tasks. NLTK does, however, offer an array of tokenizers and stemmers for various languages.
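For a sense of what that kind of preprocessing looks like, here is a minimal sketch using only the standard library's re module and string methods (the example string is made up):
In [ ]:
import re

raw = "  Hello, World!! Visit http://example.com for MORE info...  "
clean = raw.lower().strip()                 # basic string methods
clean = re.sub(r"http\S+", "", clean)       # drop URLs
clean = re.sub(r"[^a-z\s]", "", clean)      # keep only letters and whitespace
clean = re.sub(r"\s+", " ", clean).strip()  # collapse repeated whitespace
print(clean)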
In [ ]:
text = '''Hello, my name is Chris.
I'll be talking about the python library NLTK today.
NLTK is a popular tool to conduct text processing tasks in NLP.'''
In [ ]:
from nltk.tokenize import word_tokenize
print("Notice the difference!")
print()
print(word_tokenize(text))
print()
print("vs.")
print()
print(text.split())
You can also tokenize sentences.
In [ ]:
from nltk.tokenize import sent_tokenize
print(sent_tokenize(text))
In [ ]:
tokenized_text = [word_tokenize(sent) for sent in sent_tokenize(text)]
print(tokenized_text)
A list of sentences, each itself a list of tokens, is the format most analysis libraries expect.
In [ ]:
from nltk import SnowballStemmer
snowball = SnowballStemmer('english')
print(snowball.stem('running'))
print(snowball.stem('eats'))
print(snowball.stem('embarrassed'))
But watch out for errors:
In [ ]:
print(snowball.stem('cylinder'))
print(snowball.stem('cylindrical'))
Or collisions:
In [ ]:
print(snowball.stem('vacation'))
print(snowball.stem('vacate'))
This is why lemmatizing, when computing power and time allow, is generally preferable:
In [ ]:
from nltk import WordNetLemmatizer
wordnet = WordNetLemmatizer()
print(wordnet.lemmatize('vacation'))
print(wordnet.lemmatize('vacate'))
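One caveat: WordNetLemmatizer assumes every word is a noun unless you pass a part of speech via the pos argument, so supplying POS information usually improves results. A quick sketch:
In [ ]:
# Without a POS hint, verbs and adjectives are often left untouched
print(wordnet.lemmatize('running'))           # treated as a noun: 'running'
print(wordnet.lemmatize('running', pos='v'))  # as a verb: 'run'
print(wordnet.lemmatize('better', pos='a'))   # as an adjective: 'good'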
So why is this important?
In [ ]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
categories = ['talk.politics.mideast', 'rec.autos', 'sci.med']
twenty = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
data_no_stems = [[word_tokenize(sent) for sent in sent_tokenize(text.lower())] for text in twenty.data]
data_stems = [[[snowball.stem(word) for word in word_tokenize(sent)]
               for sent in sent_tokenize(text.lower())] for text in twenty.data]
In [ ]:
print(data_no_stems[400][5])
In [ ]:
print(data_stems[400][5])
In [ ]:
data_no_stems = [' '.join([item for sublist in l for item in sublist]) for l in data_no_stems]
data_stems = [' '.join([item for sublist in l for item in sublist]) for l in data_stems]
vectorizer = TfidfVectorizer()
X_data_no_stems = vectorizer.fit_transform(data_no_stems)
vectorizer2 = TfidfVectorizer()
X_data_stems = vectorizer2.fit_transform(data_stems)
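One quick way to see what stemming buys us is vocabulary size; the stemmed corpus should need noticeably fewer features. A small check using the fitted vectorizers' vocabulary_ attribute:
In [ ]:
# Stemming collapses inflected forms, shrinking the feature space
print("Vocabulary without stemming:", len(vectorizer.vocabulary_))
print("Vocabulary with stemming:", len(vectorizer2.vocabulary_))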
In [ ]:
from sklearn.model_selection import train_test_split
from sklearn import ensemble
X_train, X_test, y_train, y_test = train_test_split(X_data_no_stems, twenty.target,
train_size=0.75, test_size=0.25)
rf_classifier = ensemble.RandomForestClassifier(n_estimators=10, # number of trees
                                                criterion='gini', # or 'entropy' for information gain
                                                max_depth=None, # how deep tree nodes can go
                                                min_samples_split=2, # samples needed to split a node
                                                min_samples_leaf=1, # samples needed for a leaf
                                                min_weight_fraction_leaf=0.0, # weighted fraction of samples needed for a leaf
                                                max_features='sqrt', # number of features considered for the best split
                                                max_leaf_nodes=None, # max leaf nodes
                                                min_impurity_decrease=0.0, # minimum impurity decrease needed to split (replaces the removed min_impurity_split)
                                                n_jobs=1, # CPUs to use
                                                class_weight="balanced") # weights inverse to class frequency; also "balanced_subsample" or None
model = rf_classifier.fit(X_train, y_train)
print(model.score(X_test, y_test))
In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X_data_stems, twenty.target,
train_size=0.75, test_size=0.25)
rf_classifier = ensemble.RandomForestClassifier(n_estimators=10, # number of trees
                                                criterion='gini', # or 'entropy' for information gain
                                                max_depth=None, # how deep tree nodes can go
                                                min_samples_split=2, # samples needed to split a node
                                                min_samples_leaf=1, # samples needed for a leaf
                                                min_weight_fraction_leaf=0.0, # weighted fraction of samples needed for a leaf
                                                max_features='sqrt', # number of features considered for the best split
                                                max_leaf_nodes=None, # max leaf nodes
                                                min_impurity_decrease=0.0, # minimum impurity decrease needed to split
                                                n_jobs=1, # CPUs to use
                                                class_weight="balanced") # weights inverse to class frequency; also "balanced_subsample" or None
model = rf_classifier.fit(X_train, y_train)
print(model.score(X_test, y_test))
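A single train/test split can be noisy, so the two scores above may swap places from run to run. If you want a steadier comparison, cross-validation is one option; a minimal sketch using scikit-learn's cross_val_score:
In [ ]:
from sklearn.model_selection import cross_val_score

# Mean accuracy over 5 folds is a more stable estimate than one split
for name, X in [("no stems", X_data_no_stems), ("stems", X_data_stems)]:
    scores = cross_val_score(ensemble.RandomForestClassifier(n_estimators=10), X, twenty.target, cv=5)
    print(name, scores.mean())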
While you can use NLTK on strings and lists of sentences, it's better to formally declare your corpus, as this takes care of the steps above for you and provides methods to access them. For our purposes today, we'll use a corpus of book summaries, which I've converted into a folder of .txt files for demonstration.
In [ ]:
from nltk.corpus import PlaintextCorpusReader
corpus_root = "texts/" # relative path to texts.
my_texts = PlaintextCorpusReader(corpus_root, '.*txt')
We now have a text corpus, on which we can run all the basic preprocessing methods. To list all the files in our corpus:
In [ ]:
my_texts.fileids()[:10]
In [ ]:
my_texts.words('To Kill A Mockingbird.txt') # tokenized with the reader's default WordPunctTokenizer
In [ ]:
my_texts.sents('To Kill A Mockingbird.txt')
It also adds a paragraph method:
In [ ]:
my_texts.paras('To Kill A Mockingbird.txt')[0]
Let's save these to a variable to look at the next step on a low level:
In [ ]:
m_sents = my_texts.sents('To Kill A Mockingbird.txt')
print (m_sents)
We now have a corpus, or text, from which we can get any of the statistics you learned in Day 3 of the Python workshop. We will review some of these functions once we have some more information.
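As a quick refresher, here are a few of those basic statistics computed on this summary (a small sketch using FreqDist and plain Python):
In [ ]:
from nltk import FreqDist

m_words = [w.lower() for sent in m_sents for w in sent]
print("Sentences:", len(m_sents))
print("Tokens:", len(m_words))
print("Vocabulary size:", len(set(m_words)))
print("Lexical diversity:", len(set(m_words)) / len(m_words))
FreqDist(m_words).tabulate(10)  # ten most frequent tokens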
There are many situations in which "tagging" words (or really anything) may be useful for identifying trends or for further text analysis to extract meaning. NLTK contains several methods to achieve this, from simple regex matching to more advanced machine learning models.
It is important to note that in Natural Language Processing (NLP), POS (Part of Speech) tagging is the most common use for tagging, but the actual tag can be anything. Other applications include sentiment analysis and NER (Named Entity Recognition). Tagging simply assigns a word to a specific category via a tuple.
Nevertheless, for training more advanced tagging models, POS tagging is nearly essential. If you are defining a machine learning model to predict patterns in your text, these patterns will most likely rely on, among other things, POS features. You will therefore first tag POS and then use the POS tags as features in your model.
Tagging is creating a tuple of (word, tag) for every word in a text or corpus. For example: "My name is Chris" may be tagged for POS as:
My/PossessivePronoun name/Noun is/Verb Chris/ProperNoun ./Period
NB: type 'nltk.data.path' to find the path on your computer to your downloaded nltk corpora. You can explore these files to see how large corpora are formatted.
You'll notice how the text is annotated, using a forward slash to match the word to its tag. So how can we get this to a useful form for Python?
In [ ]:
from nltk.tag import str2tuple
line = "My/Possessive_Pronoun name/Noun is/Verb Chris/Proper_Noun ./Period"
tagged_sent = [str2tuple(t) for t in line.split()]
print (tagged_sent)
Further analysis of tags with NLTK requires a list of sentences; otherwise you will get index errors from higher-level methods.
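To illustrate the expected shape, here is a small sketch using NLTK's UnigramTagger, which trains on a list of tagged sentences (note the extra brackets around tagged_sent):
In [ ]:
from nltk.tag import UnigramTagger

# The trainer expects [[(word, tag), ...], ...], i.e. a list of sentences
toy_tagger = UnigramTagger([tagged_sent])
print(toy_tagger.tag("My name is Chris .".split()))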
Naturally, these tags are a bit verbose; the standard tagging conventions follow the Penn Treebank (more on that in a second): https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
NLTK's stock English pos_tag tagger is a perceptron tagger:
In [ ]:
from nltk import pos_tag
m_tagged_sent = pos_tag(m_sents[0])
print (m_tagged_sent)
What do these tags mean?
In [ ]:
from nltk import help
help.upenn_tagset()
In [ ]:
m_tagged_all = [pos_tag(sent) for sent in m_sents]
print(m_tagged_all[:3])
We can find and aggregate certain parts of speech too:
In [ ]:
from nltk import ConditionalFreqDist
def find_tags(tag_prefix, tagged_text):
    cfd = ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                              if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(5)) for tag in cfd.conditions())  # cfd.conditions() yields all possible tags
In [ ]:
m_tagged_words = [item for sublist in m_tagged_all for item in sublist]
tagdict = find_tags('JJ', m_tagged_words)
for tag in sorted(tagdict):
print(tag, tagdict[tag])
We can begin to quantify syntax by looking at the environments of words: what commonly follows a verb?
In [ ]:
import nltk

tags = [b[1] for (a, b) in nltk.bigrams(m_tagged_words) if a[1].startswith('VB')]
fd1 = nltk.FreqDist(tags)
print ("To Kill A Mockingbird")
fd1.tabulate(10)
Now that we know how tagging works, we can quickly tag all of our documents, but we'll only do a few hundred from the much larger corpus.
In [ ]:
tagged_sents = {}
for fid in my_texts.fileids()[::10]:
tagged_sents[fid.split(".")[0]] = [pos_tag(sent) for sent in my_texts.sents(fid)]
In [ ]:
tagged_sents.keys()
In [ ]:
tagged_sents["Harry Potter and the Prisoner of Azkaban"]
Absolute frequencies are available through NLTK's FreqDist class:
In [ ]:
all_tags = []
all_tups = []
for k in tagged_sents.keys():
for s in tagged_sents[k]:
for t in s:
all_tags.append(t[1])
all_tups.append(t)
nltk.FreqDist(all_tags).tabulate(10)
In [ ]:
tags = ['NN', 'VB', 'JJ']
for t in tags:
tagdict = find_tags(t, all_tups)
for tag in sorted(tagdict):
print(tag, tagdict[tag])
We can compare this to other genres:
In [ ]:
from nltk.corpus import brown
for c in brown.categories():
tagged_words = brown.tagged_words(categories=c) #not universal tagset
tag_fd = nltk.FreqDist(tag for (word, tag) in tagged_words)
print(c.upper())
tag_fd.tabulate(10)
print()
tags = ['NN', 'VB', 'JJ']
for t in tags:
tagdict = find_tags(t, tagged_words)
for tag in sorted(tagdict):
print(tag, tagdict[tag])
print()
print()
We can also look at a word's linguistic environment at a low level; the code below lists all the words preceding "love" in the romance category:
In [ ]:
brown_romance = brown.words(categories='romance')
sorted(set(a for (a, b) in nltk.bigrams(brown_romance) if b == 'love'))
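For a little more context than single preceding words, NLTK's Text class offers a concordance view; a quick sketch:
In [ ]:
import nltk

romance_text = nltk.Text(brown.words(categories='romance'))
romance_text.concordance('love', lines=10)  # 'love' shown within its surrounding context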
While tagging parts of speech can be helpful for certain NLP tasks, dependency parsing is better at extracting real relationships within a sentence.
In [ ]:
from nltk.parse.stanford import StanfordDependencyParser
dependency_parser = StanfordDependencyParser(path_to_jar = "/Users/chench/Documents/stanford-parser-full-2015-12-09/stanford-parser.jar",
path_to_models_jar = "/Users/chench/Documents/stanford-parser-full-2015-12-09/stanford-parser-3.6.0-models.jar")
result = dependency_parser.raw_parse_sents(['I shot an elephant in my sleep.', 'It was great.'])
Since the parser takes a while to run, I will not run it on the entire corpus, but an example is below:
In [ ]:
for r in result:
for o in r:
trips = list(o.triples()) # ((head word, head tag), rel, (dep word, dep tag))
for t in trips:
print(t)
print()
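To pull out just the core grammatical relations, you can filter the triples by their relation label. A sketch (the exact labels, e.g. 'nsubj' and 'dobj', depend on the dependency scheme your parser version uses):
In [ ]:
# Keep only subject and object relations from the dependency triples
parse = next(dependency_parser.raw_parse('I shot an elephant in my sleep.'))
for (head, rel, dep) in parse.triples():
    if rel in ('nsubj', 'dobj', 'obj'):  # 'obj' in newer dependency schemes
        print(rel, ":", dep[0], "->", head[0])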
After tokenizing, tagging, and parsing, one of the last steps in the pipeline is NER. Identifying named entities can be useful in determining many different relationships, and often serves as a prerequisite to mapping textual relationships within a set of documents.
In [ ]:
from nltk.tag.stanford import StanfordNERTagger
ner_tag = StanfordNERTagger(
'/Users/chench/Documents/stanford-ner-2015-12-09/classifiers/english.all.3class.distsim.crf.ser.gz',
'/Users/chench/Documents/stanford-ner-2015-12-09/stanford-ner.jar')
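If you don't have the Stanford jars set up, NLTK also ships a built-in chunker, ne_chunk, which works on POS-tagged tokens; results are generally rougher than the Stanford tagger's, but it is a quick way to try NER. A sketch on a made-up sentence:
In [ ]:
from nltk import ne_chunk, pos_tag, word_tokenize

# ne_chunk returns a tree whose subtrees are labeled PERSON, GPE, ORGANIZATION, ...
tree = ne_chunk(pos_tag(word_tokenize("Scout Finch lives in Maycomb, Alabama.")))
for subtree in tree.subtrees():
    if subtree.label() != "S":
        print(subtree.label(), " ".join(w for w, t in subtree.leaves()))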
In [ ]:
import pyprind
ner_sents = {}
books = ["To Kill A Mockingbird.txt", "Harry Potter and the Prisoner of Azkaban.txt"]
for fid in books:
bar = pyprind.ProgBar(len(my_texts.sents(fid)), monitor=True, bar_char="#")
    book_ner = []  # use a new name so we don't clobber the tagged_sents dict from the POS section
    for sent in my_texts.sents(fid):
        book_ner.append(ner_tag.tag(sent))
        bar.update()
    ner_sents[fid.split(".")[0]] = book_ner
print()
We can look at a single summary at a low level:
In [ ]:
print(ner_sents["To Kill A Mockingbird"])
In [ ]:
from itertools import groupby
from nltk import FreqDist
NER = {"LOCATION": [],
"PERSON": [],
"ORGANIZATION": [],
}
for sentence in ner_sents["To Kill A Mockingbird"]:
for tag, chunk in groupby(sentence, lambda x: x[1]):
if tag != "O":
NER[tag].append(" ".join(w for w, t in chunk))
if NER["LOCATION"]:
print("Locations:")
FreqDist(NER["LOCATION"]).tabulate()
print()
if NER["PERSON"]:
print("Persons:")
FreqDist(NER["PERSON"]).tabulate()
print()
if NER["ORGANIZATION"]:
print("Organizations")
FreqDist(NER["ORGANIZATION"]).tabulate()
Or between the two:
In [ ]:
NER = {"LOCATION": [],
"PERSON": [],
"ORGANIZATION": [],
}
for k in ner_sents.keys():
for sentence in ner_sents[k]:
for tag, chunk in groupby(sentence, lambda x: x[1]):
if tag != "O":
NER[tag].append(" ".join(w for w, t in chunk))
if NER["LOCATION"]:
print("Locations:")
FreqDist(NER["LOCATION"]).tabulate()
print()
if NER["PERSON"]:
print("Persons:")
FreqDist(NER["PERSON"]).tabulate()
print()
if NER["ORGANIZATION"]:
FreqDist(NER["ORGANIZATION"]).tabulate()
While earlier sentiment analysis was based on simple dictionary look-up methods denoting words as positive or negative, or assigning numerical values to words, newer methods are better able to take a word's or sentence's environment into account. VADER (Valence Aware Dictionary and sEntiment Reasoner) is one such example.
In [ ]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import numpy as np
sid = SentimentIntensityAnalyzer()
print(sid.polarity_scores("I really don't like that book.")["compound"])
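VADER's valence awareness shows up when you vary negation, intensifiers, and punctuation; a quick comparison on some made-up sentences:
In [ ]:
for s in ["I like that book.",
          "I really like that book!",
          "I don't like that book.",
          "I really don't like that book!!"]:
    print(sid.polarity_scores(s)["compound"], "\t", s)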
In [ ]:
for fid in books:
print(fid.upper())
sent_pols = [sid.polarity_scores(s)["compound"] for s in sent_tokenize(my_texts.raw(fid))]
    for s, pol in zip(my_texts.sents(fid), sent_pols):  # zip avoids index errors if the two sentence splits disagree slightly
        print(s, pol)
print()
print()
print("Mean: ", np.mean(sent_pols))
print()
print("="*100)
print()