Author: Clare Corthell, Luminant Data
Conference: Talking Machines, Manila
Date: 18 February 2016
Description: Much of human knowledge is “locked up” in a type of data called text. Humans are great at reading, but are computers? This workshop leads you through open source data science libraries in python that turn text into valuable data, then tours an open source system built for the Wordnik dictionary to source definitions of words from across the internet.
Goal: Learn the basics of text manipulation and analytics with open source tools in Python.
There are many great libraries and programmatic resources out there in languages other than Python. For the purposes of a contained intro, I'll focus solely on Python today.
Give yourself a good chunk of time to troubleshoot installation if you're doing this for the first time. These resources are available for most platforms, including OSX and Windows.
In [1]:
%pwd
# make sure we're running our script from the right place;
# imports like "filename" are relative to where we're running ipython
Out[1]:
Collection of TED talk transcripts from the ted.com website:
In [43]:
# file containing transcripts from roughly half the history of TED talks (http://ted.com)
# I've already preprocessed this into an easy-to-consume .csv file
filename = 'data/tedtalks.csv'
In [44]:
# pandas, a handy and powerful data manipulation library
import pandas as pd
In [45]:
# this file has a header that includes column names
df = pd.DataFrame.from_csv(filename, encoding='utf8')
The pandas data structure DataFrame is like a spreadsheet.
It's easy to select columns, records, or data points. There are a multitude of handy features in this library for manipulating data.
In [46]:
# look at a slice (sub-selection of records) of the first four records
df[:4]
Out[46]:
In [47]:
df.shape
Out[47]:
In [48]:
# look at a slice of one column
df['headline'][:4]
Out[48]:
In [49]:
# select one data point
df['headline'][2]
Out[49]:
Linguistics, or the scientific study of language, plays a large role in how we build methods to understand text.
Computational Linguistics is the "field concerned with the statistical or rule-based modeling of natural language from a computational perspective." - Wikipedia
In [50]:
from textblob import TextBlob
In [51]:
# create a textblob object with one transcript
t = TextBlob(df['transcript'][18])
print "Reading the transcript for '%s'" % df['headline'][18]
From the TextBlob object, we can get things like:
We'll use the questions we pose to motivate these methods and explain them throughout the workshop.
In [52]:
t.sentences[21]
Out[52]:
So we might say this sentence is about...
In [53]:
t.sentences[21].noun_phrases
Out[53]:
These are noun phrases, a useful grammatical lens for extracting topics from text.
"A noun phrase or nominal phrase (abbreviated NP) is a phrase which has a noun (or indefinite pronoun) as its head word, or which performs the same grammatical function as such a phrase." - Wikipedia
Noun phrases are slightly more inclusive than nouns alone: they encompass the whole "idea" of the noun, along with any modifiers it may have.
If we do this across the whole transcript, we see roughly what it's about without reading it word-for-word:
In [54]:
t.noun_phrases
Out[54]:
If we pick a few random topics from the talk, maybe we can get a general sense of what it's about:
In [56]:
import random
rand_nps = random.sample(list(t.noun_phrases), 5)
print "This text might be about: \n%s" % ', and '.join(rand_nps)
The computer can't read and summarize on our behalf yet - but so far it's interesting!
Alternatively, we can look at noun phrases that occur more than twice -
In [57]:
np_cnt = t.np_counts
[(n, np_cnt[n]) for n in np_cnt if np_cnt[n] > 2] # pythonic list comprehension
Out[57]:
In [14]:
# get textblobs and noun phrase counts for everything -- this takes a while
blobs = [TextBlob(b).np_counts for b in df['transcript']]
In [58]:
blobs[2:3]
Out[58]:
Text is hard to work with because it is invariably dirty data. Misspelled words, poorly-formed sentences, corrupted files, wrong encodings, cut off text, long-winded writing styles, and a multitude of other problems plague this data type. Because of that, you'll find yourself writing many special cases, cleaning data, and modifying existing solutions to fit your dataset. That's normal. And it's part of the reason that these approaches don't work for every dataset out of the box.
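As a minimal sketch of what that cleaning can look like (the specific steps here are assumptions about what a transcript might need, not requirements of this dataset):

import re

def clean_transcript(text):
    # collapse runs of whitespace (newlines, tabs) into single spaces
    text = re.sub(r'\s+', ' ', text)
    # drop parenthetical stage directions, e.g. "(Laughter)" or "(Applause)"
    text = re.sub(r'\((?:Laughter|Applause|Music)\)', '', text)
    return text.strip()

print clean_transcript(df['transcript'][18])[:200]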
In [59]:
# as we did before, pull the higher incident themes
np_themes = [[n for n in b if b[n] > 2] for b in blobs] # (list comprehension inception)
# pair the speaker with their top themes
speaker_themes = zip(df['speaker'], np_themes)
speaker_themes_df = pd.DataFrame(speaker_themes, columns=['speaker','themes'])
speaker_themes_df[:10]
Out[59]:
Great! But how do we see these themes across speakers' talks?
PAUSE - We'll come back to this later.
Sidebar: Unicode
If you see text like this:
u'\u266b'
don't worry. It's unicode. Text is usually unicode in Python, and this is the escaped representation of a special character. If you print it, you'll see the character it encodes:
♫
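A quick sketch of the difference between the escaped representation and the printed character (Python 2 semantics):

note = u'\u266b'
print repr(note)            # the escaped representation: u'\u266b'
print note                  # the character itself
print note.encode('utf8')   # the raw UTF-8 bytes, e.g. for writing to a file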
In [60]:
print "There are %s sentences" % len(t.sentences)
print "And %s words" % len(t.words)
Psst -- I'll let you in on a secret: most of NLP concerns counting things.
(that isn't always easy, though, as you'll notice)
In text analytics, we use two important terms to refer to units of words and phrases: tokens (individual units such as words, contractions, or numbers) and n-grams (sequences of n consecutive tokens).
Example tokens: ("won't", "giraffes", "1998")
Example 3-grams or tri-grams: ("You won't dance", "giraffes really smell", "that's so 1998")
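TextBlob exposes tokens directly; a quick look before we slice n-grams below (t is the TextBlob object we built above):

print t.tokens[:8]   # tokens include punctuation
print t.words[:8]    # words are just the word tokens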
In [61]:
# to get ngrams from our TextBlob (slice of the first five ngrams)
t.ngrams(n=3)[:5]
Out[61]:
Note the overlap here - the window moves over one token to slice the next ngram.
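To make that sliding window concrete, here is a minimal sketch that rebuilds the trigrams by hand from the token list:

words = t.words
# each trigram starts one token later than the previous one
manual_trigrams = [words[i:i+3] for i in range(len(words) - 2)]
print manual_trigrams[:5]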
Language is made up of many components, but one way to think about language is by the types of words used. Words are categorized into functional "Parts of Speech", which describe the role they play. The most common parts of speech in English are:
In an example sentence:
We can think about Parts of Speech as an abstraction of the words' behavior within a sentence, rather than the things they describe in the world.
POS tagging is hard because language and grammatical structure can be, and often are, ambiguous. Humans can easily identify the subject and object of a sentence to identify a compound noun or a sub-clause, but a computer cannot. The computer needs to learn from examples to tell the difference.
Learning about the most likely tag that a given word should have is called incorporating "prior knowledge" and is the main task of training supervised machine learning models. Models use prior knowledge to infer the correct tag for new sentences.
For example,
He said he banks on the river.
He said the banks of the river were high.
Here, the context for the use of the word "banks" is important to determine whether it's a noun or a verb. Before the 1990s, a more popular technique for part of speech tagging was the rules-based approach, which involved writing a lengthy litany of rules that described these contextual differences in detail. This worked well in some cases, but never generalized. Statistical inference approaches, which describe these differences by learning from data, are now more popular and more widely used.
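We can check what TextBlob's default tagger does with the two "banks" sentences above. There's no guarantee it disambiguates them correctly; that uncertainty is rather the point:

# TextBlob was imported earlier in the notebook
for s in ["He said he banks on the river.",
          "He said the banks of the river were high."]:
    print TextBlob(s).tags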
This becomes important in identifying sub-clauses, which then allow disambiguation for other tasks, such as sentiment analysis. For example:
He said she was sad.
We could be led to believe that the subject of the sentence is "He", and the sentiment is negative ("sad"), but connecting "He" to "sad" would be incorrect. Parsing the clause "she was sad" allows us to achieve greater accuracy in sentiment analysis, especially if we are concerned with attributing sentiment to actors in text.
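As a quick illustration, TextBlob's sentence-level sentiment (which we'll use properly later) assigns the negativity to the whole sentence, with no notion of who is sad:

print TextBlob("He said she was sad.").sentiment.polarity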
So let's take an example sentence:
In [62]:
sent = t.sentences[0] # the first sentence in the transcript
sent_tags = sent.tags # get the part of speech tags - you'll notice it takes a second
print "The full sentence:\n", sent
print "\nThe sentence with tags by word: \n", sent_tags
print "\nThe tags of the sentence in order: \n", " ".join([b for a, b in sent_tags])
Note that "I'm" resolves to a personal pronoun 'PRP' and a verb 'VBP'. It's a contraction!
In [63]:
t.sentences[35]
Out[63]:
Write this sentence out. What are the part of speech tags? Use the Part-of-Speech Code table above.
And the answer is:
In [64]:
answer_sent = t.sentences[35].tags
print ' '.join(['/'.join(i) for i in answer_sent])
And for fun, What are the noun phrases?
hint: noun phrases are defined slightly differently by different people, so this is sometimes subjective
Answer:
In [65]:
t.sentences[35].noun_phrases
Out[65]:
But don't worry - it's not just the computer that has trouble understanding what roles words play in a sentence. Some sentences are ambiguous or difficult for humans, too:
Mr/NNP Calhoun/NNP never/RB got/VBD around/RP to/TO joining/VBG
All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT block/NN
Suite/NNP Mondant/NNP costs/VBZ around/RB 250/CD
This is a great example of why you will always have some error - sometimes even a human can't tell the difference, and in those cases, the computer will fail, too.
Part-of-speech taggers have achieved quite high-quality results in recent years, but can still suffer from underwhelming runtime performance. This is because they are usually not heuristic-based, i.e. derived from empirical rules, but instead use machine-learned models that require significant pre-processing and feature generation to predict tags. In practice, data scientists might develop their own heuristics or domain-trained models for POS tagging. This is a great example of a case where domain-specificity of the model will improve accuracy, even if based on heuristics alone (see further reading).
Something important to note: NLTK, a popular tool that is underneath much of TextBlob, is a massive framework that was built originally as an academic and teaching tool. It was not built with production performance in mind.
Unfortunately, as Matthew Honnibal notes,
Up-to-date knowledge about natural language processing is mostly locked away in academia.
In [66]:
print t.sentiment.polarity
So this text is mildly positive. Does that tell us anything useful? Not really. But what about the change of tone over the course of the talk?
In [23]:
tone_change = list()
for sentence in t.sentences:
    tone_change.append(sentence.sentiment.polarity)
print tone_change
Ok, so we see some significant range here. Let's make it easier to see with a visualization.
In [24]:
# this will show the graph here in the notebook
import matplotlib.pyplot as plt
%matplotlib inline
In [67]:
# dataframes have a handy plot method
pd.DataFrame(tone_change).plot(title='Polarity of Transcript by Sentence')
Out[67]:
Interesting trend - does the talk seem to become more positive over time?
Anecdotally, we know that TED talks seek to motivate and inspire, which could be one explanation for this sentiment polarity pattern.
Play around with this data yourself!
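One way to play: smooth the sentence-by-sentence polarity with a rolling mean so the overall drift is easier to see (the window size of 10 sentences is an arbitrary choice):

polarity_df = pd.DataFrame(tone_change, columns=['polarity'])
# rolling mean over 10 sentences smooths out single-sentence spikes
# (on newer pandas, use polarity_df['polarity'].rolling(window=10).mean())
pd.rolling_mean(polarity_df['polarity'], 10).plot(title='Smoothed Polarity by Sentence')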
Text is messy and requires pre-processing. Once we've pre-processed and normalized the text, how do we use computation to understand it?
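Before answering that, here's a minimal sketch of two common normalization steps, lowercasing and lemmatization; which steps you actually need depends on your task and your data:

from textblob import Word

t_lower = t.lower()                   # lowercasing collapses "The" and "the" into one token
print Word('geese').lemmatize()       # lemmatization collapses inflected forms: -> goose
print Word('running').lemmatize('v')  # pass a part-of-speech hint for verbs: -> run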
We may want to do a number of things computationally, but I'll focus generally on finding differences and similarities. In the case study, we'll look at a supervised classification example, where these goals are key.
So let's start where we always do - counting words.
In [68]:
dict(t.word_counts)
Out[68]:
In [69]:
# put this in a dataframe for easy viewing and sorting
word_count_df = pd.DataFrame.from_dict(t.word_counts, orient='index')
word_count_df.columns = ['count']
What are the most common words?
In [70]:
word_count_df.sort('count', ascending=False)[:10]
Out[70]:
Hm. So the most common words will not tell us much about what's going on in this text.
In general, the more frequent a word is in the English language, or across all the documents in a corpus, the less it will likely tell us about any one document. This concept is well known in text mining as "term frequency-inverse document frequency" (tf-idf). We can represent text using the tf-idf statistic to weigh how important a term is in a particular document: terms that are frequent in that document but rare across the corpus get higher weight, more accurately representing the importance of different terms or ngrams.
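As a sketch of the idea, scikit-learn's TfidfVectorizer is one common implementation; the stop word list and feature cap below are arbitrary choices, and the top terms are just whatever these transcripts produce:

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# weigh terms by how frequent they are in a document vs. across all documents
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
tfidf_matrix = tfidf.fit_transform(df['transcript'])

# the highest-weighted terms for the example talk we've been using (row 18)
row = tfidf_matrix[18].toarray().ravel()
terms = np.array(tfidf.get_feature_names())
print terms[row.argsort()[::-1][:10]]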
In [29]:
one = df.ix[16]
two = df.ix[20]
one['headline'], two['headline']
Out[29]:
In [30]:
len(one['transcript']), len(two['transcript'])
Out[30]:
In [31]:
one_blob = TextBlob(one['transcript'])
two_blob = TextBlob(two['transcript'])
In [32]:
one_set = set(one_blob.tokenize())
two_set = set(two_blob.tokenize())
In [33]:
# How many words did the two talks use commonly?
len(one_set.intersection(two_set))
Out[33]:
In [34]:
# How many different words did they use total?
total_diff = len(one_set.difference(two_set)) + len(two_set.difference(one_set))
total_diff
Out[34]:
In [35]:
proportion = len(one_set.intersection(two_set)) / float(total_diff)
print "Proportion of vocabulary that is common:", round(proportion, 4)*100, "%"
In [36]:
print one_set.intersection(two_set)
So what we start to see is that much of the shared vocabulary consists of common English words; if we removed those from this set, a few shared themes between the two talks would emerge.
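For instance, filtering the intersection against a standard English stop word list (scikit-learn ships one, but any list would do) leaves the more topical overlap:

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

shared = one_set.intersection(two_set)
# drop stop words and non-alphabetic tokens to surface the topical overlap
topical = [w for w in shared if w.lower() not in ENGLISH_STOP_WORDS and w.isalpha()]
print sorted(topical)[:30]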
Let's go back to the dataframe of the most frequently used noun phrases across transcripts.
In [71]:
themes_list = speaker_themes_df['themes'].tolist()
speaker_themes_df[7:10]
Out[71]:
In [38]:
# sci-kit learn is a machine learning library for python
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
vocab = set([phrase for themes in themes_list for phrase in themes])
# we'll look at ngrams of 2 to 4 words to see some richer topics
cv = CountVectorizer(stop_words=None, vocabulary=vocab, ngram_range=(2, 4))
In [39]:
# going to turn these back into documents
document_list = [','.join(themes) for themes in themes_list]
In [40]:
data = cv.fit_transform(document_list).toarray()
A vectorizer is a method that, as the name suggests, creates vectors from data, and ultimately a matrix. Each row corresponds to a document (here, a transcript) and each column to a token in the vocabulary; a column vector thus records the incidence of one token across all the documents.
Sparsity
In text analysis, any matrix representing a set of documents against a vocabulary will be sparse. This is because not every word in the vocabulary occurs in every document - quite the contrary. Most of each vector is empty.
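We can check just how sparse our matrix is; a quick sanity check, nothing more:

# fraction of cells in the document-term matrix that are non-zero
print "Matrix shape:", data.shape
print "Fraction of non-zero entries: %.4f" % (np.count_nonzero(data) / float(data.size))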
In [41]:
feature_names = cv.get_feature_names()  # feature names, in the same order as the matrix columns
dist = np.sum(data, axis=0)             # total count of each feature across all documents
counts_list = list()
for tag, count in zip(feature_names, dist):
    counts_list.append((count, tag))
Which themes were most common across speakers?
In [42]:
count_df = pd.DataFrame(counts_list, columns=['count','feature'])
count_df.sort('count', ascending=False)[:20]
Out[42]:
So the common theme is - Good News!
In Natural Language Processing, Vectorization is the most common way to abstract text to make it programmatically manipulable. It's your best friend once you get past document-level statistics!
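As one example of what vectorization buys you: once documents are vectors, comparing them is just linear algebra. Here is a sketch using cosine similarity on the tf-idf matrix from the earlier sketch (assuming you ran that cell; rows 16 and 20 are the two talks we compared by raw vocabulary overlap):

from sklearn.metrics.pairwise import cosine_similarity

print cosine_similarity(tfidf_matrix[16], tfidf_matrix[20])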
Further Reading: