NLP Workshop

Author: Clare Corthell, Luminant Data

Conference: Talking Machines, Manila

Date: 18 February 2016

Description: Much of human knowledge is “locked up” in a type of data called text. Humans are great at reading, but are computers? This workshop leads you through open source data science libraries in python that turn text into valuable data, then tours an open source system built for the Wordnik dictionary to source definitions of words from across the internet.

Goal: Learn the basics of text manipulation and analytics with open source tools in Python.


Requirements

There are many great libraries and programmatic resources out there in languages other than Python. For the purposes of a contained intro, I'll focus solely on Python today.

Give yourself a good chunk of time to troubleshoot installation if you're doing this for the first time. These resources are available for most platforms, including OSX and Windows.
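
If you haven't installed these libraries before, pip is the usual route. The package names below are the ones this notebook relies on; adjust for your own environment:

pip install ipython pandas numpy matplotlib scikit-learn textblob

python -m textblob.download_corpora   # fetches the NLTK corpora that TextBlob uses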

Setup

Go to the root of this repository

cd < root directory >

run the notebook

ipython notebook


In [1]:
%pwd 
# make sure we're running our script from the right place; 
# imports like "filename" are relative to where we're running ipython


Out[1]:
u'/Users/clareputer/Documents/summer/nlp_workshop'

Load A Text Dataset

Collection of TED talk transcripts from the ted.com website:


In [43]:
# file of half of the history of TED talks from http://ted.com
# i've already preprocessed this into an easy-to-consume .csv file
filename = 'data/tedtalks.csv'

Storage and File types for text

Common file types for storing text include .json, .csv, and .txt.

It is also common to store text with other data in relational databases or indexes.


In [44]:
# pandas, a handy and powerful data manipulation library
import pandas as pd

In [45]:
# this file has a header that includes column names
df = pd.DataFrame.from_csv(filename, encoding='utf8')

The pandas data structure DataFrame is like a spreadsheet.

It's easy to select columns, records, or data points. There are a multitude of handy features in this library for manipulating data.


In [46]:
# look at a slice (sub-selection of records) of the first four records
df[:4]


Out[46]:
date headline speaker transcript url
0 2014-10-21 00:00:00 A dance in a hurricane of paper, wind and light Aakash Odedra (Music) (Applause) http://www.ted.com/talks/aakash_odedra_a_dance...
1 2010-09-19 00:00:00 America's native prisoners of war Aaron Huey I'm here today to show my photographs of the ... http://www.ted.com/talks/aaron_huey
2 2011-03-02 00:00:00 Visualizing ourselves ... with crowd-sourced data Aaron Koblin So I think data can actually make us more hum... http://www.ted.com/talks/aaron_koblin
3 2011-03-02 00:00:00 Making sense of a visible quantum object Aaron O'Connell This is a representation of your brain, and y... http://www.ted.com/talks/aaron_o_connell_makin...

In [47]:
df.shape


Out[47]:
(1000, 5)

In [48]:
# look at a slice of one column
df['headline'][:4]


Out[48]:
0      A dance in a hurricane of paper, wind and light
1                    America's native prisoners of war
2    Visualizing ourselves ... with crowd-sourced data
3             Making sense of a visible quantum object
Name: headline, dtype: object

In [49]:
# select one data point
df['headline'][2]


Out[49]:
u'Visualizing ourselves ... with crowd-sourced data'

The Basics

We're going to use TextBlob to manipulate our text data. It wraps a number of handy text manipulation tools like NLTK and pattern, making those easy to use under one library.

Method and Python Package

We'll walk through all these methods by posing questions of the data. I'll number them for easy reference.

Back to Background

Linguistics, or the scientific study of language, plays a large role in how we build methods to understand text.

Computational Linguistics is "the field concerned with the statistical or rule-based modeling of natural language from a computational perspective." - Wikipedia

1. Intuition for Text Analysis


In [50]:
from textblob import TextBlob

In [51]:
# create a textblob object with one transcript
t = TextBlob(df['transcript'][18])
print "Reading the transcript for '%s'" % df['headline'][18]


Reading the transcript for 'Three myths about corruption'

From the TextBlob object, we can get things like:

  • Frequency Analysis
  • Noun Phrases
  • Part-of-Speech Tags
  • Tokenization
  • Parsing
  • Sentiment Polarity
  • Word inflection
  • Spelling correction

Using the questions we pose, we'll motivate using these methods and explain them throughout this workshop.
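
Two of these that we won't revisit below are word inflection and spelling correction. Here is a minimal sketch of both using the TextBlob API; the example words are chosen purely for illustration:

from textblob import TextBlob, Word

# inflect individual words
print Word("talk").pluralize()        # e.g. "talks"
print Word("speeches").singularize()  # e.g. "speech"

# naive spelling correction over a whole blob
print TextBlob("I havv goood speling!").correct()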

Q1. What is this text about?

There are many ways we could think about answering this question, but the first might be to look at the topics that the text describes.

Let's look at a sentence in the middle:


In [52]:
t.sentences[21]


Out[52]:
Sentence("We're coming out today from Trinidad and Tobago, a resource-rich, small Caribbean country, and in the early 1970s we had a massive increase in the country's wealth, and that increase was caused by the increase in world oil prices.")

So we might say this sentence is about...


In [53]:
t.sentences[21].noun_phrases


Out[53]:
WordList([u'trinidad', u'tobago', u'caribbean', u'early 1970s', u'massive increase', u"country 's wealth", u'world oil prices'])

These are noun phrases, a useful grammatical lens for extracting topics from text.

Noun Phrases

"A noun phrase or nominal phrase (abbreviated NP) is a phrase which has a noun (or indefinite pronoun) as its head word, or which performs the same grammatical function as such a phrase." - Wikipedia

Noun phrases are slightly more inclusive than just nouns; they encompass the whole "idea" of the noun, along with any modifiers it may have.

If we do this across the whole transcript, we see roughly what it's about without reading it word-for-word:


In [54]:
t.noun_phrases


Out[54]:
WordList([u'okay', u'okay', u'big myth', u'belmont', u'diego', u'marabella', u'honest truth', u'national security', u'economic crime', u'private corruption', u'private sector', u'massive amount', u'private sector', u'private sector', u'public sector corruption', u'private sector', u'important myth', u'important myth', u'fact corruption', u'small problem', u'small problem', u'dangerous myth', u'public mischief', u'trinidad', u'tobago', u'caribbean', u'early 1970s', u'massive increase', u"country 's wealth", u'world oil prices', u"'s ironic", u'central bank', u"history 's", u'central bank', u'central bank', u'okay', u'public office', u'ministry', u'okay', u'applause', u'government-to-government arrangements', u'government-to-government arrangements', u'britain', u'france', u'twin towers', u'whole situation', u'ballah', u'government-to-government arrangements', u'then-prime minister', u'budget speech', u'young man', u'fact \u2014', u'pure mischief', u'forget', u'big people', u'okay', u'okay', u'ballah', u'international audience', u'constitutional outrage', u'suspicious piece', u'suspicious piece', u'suspicious time', u'laughter', u'piarco', u'own lexicon', u'piarco', u'constitutional outrage', u'plot', u'pervert', u'financial nature', u'different forms', u'american embassy', u'word lawyers', u'whole course', u'suspicious passage', u'nasty way', u'mass protests', u'okay', u'piarco', u'superior double bluff kind', u'bit mysterious', u'entire project cost', u'trinidad', u'tobago', u'suspicious activity', u'corrupt activity', u'big thing', u'prosecutions', u'offshore bank accounts', u'offshore bank accounts', u'suspicious person', u'different things', u'november', u'wall street', u'zuccotti', u'wall', u'occupy wall', u'street movement', u'simple sign', u'blonde lady', u'bristol', u'well', u'spy novels', u'laughter', u'spy novels', u'laughter', u'construction sector corruption', u'joint consultative', u'new public procurement system', u'public money', u'private campaigns', u'cl financial', u'cl financial', u'caribbean', u'january', u'unprecedented \u2014', u'lack context', u'wall', u'wall', u'trinidad', u'tobago', u'different laws', u'applause', u'okay', u'listen', u'wall street', u'london', u'europe', u'africa', u'nigeria', u'major commercial banks', u'nigerian', u'nowhere', u'statutory entitlements', u'personal work', u'afraraymond.com', u'not-for-profit blog', u'laughter', u'bitter experience', u'pervert parliament', u'bitter experience', u'august', u'september', u'trinidad', u'tobago', u'okay', u'information', u'may', u'ministry', u'ministry', u'ministry', u'information', u'central bank', u'information', u"ca n't", u"'ll relate", u'short form', u'cl financial', u"ca n't show", u'new laws', u'u.s.', u'caribbean', u'okay', u'clarity understanding', u"n't matter", u'information', u'okay', u"n't dignify", u'information', u'famous case', u'scholarship scandal', u'government money', u'information', u'fantastic', u'information', u'public money', u'ministry', u'permanent', u'ministry', u'cl financial', u'integrity', u'public life act', u'integrity', u'public life act', u"nation 's interest", u'public officials', u'basic safeguards', u"ca n't", u"country 's history", u'public corruption', u'reality check', u'public money', u'russia', u'nigeria', u'alaska', u'ministry', u'jcc', u'trinidad', u'tobago', u'international example', u'journalist [', u'heather', u'brooke', u'government corruption', u'alaveteli.com', u'alaveteli.com', u'open database', u'information', u'collective database', u'collective 
understanding', u'final thing', u'lovely website', u'india', u'ipaidabribe.com', u'international branches', u'ipaidabribe.com', u'discard', u'discard', u'big thing', u'huge problem', u'economic crime', u'thank'])

If we pick a few random topics from the talk, maybe we can generalize about what it's about:


In [56]:
import random

rand_nps = random.sample(list(t.noun_phrases),  5)
print "This text might be about: \n%s" % ', and '.join(rand_nps)


This text might be about: 
u.s., and blonde lady, and suspicious piece, and russia, and information

The computer can't read and summarize on our behalf yet - but so far it's interesting!

Alternatively, we can look at noun phrases that occur more than twice -


In [57]:
np_cnt = t.np_counts
[(n, np_cnt[n]) for n in np_cnt if np_cnt[n] > 2] # pythonic list comprehension


Out[57]:
[(u'cl financial', 4),
 (u'caribbean', 3),
 (u'tobago', 5),
 (u'government-to-government arrangements', 3),
 (u'ministry', 7),
 (u'wall', 3),
 (u'trinidad', 5),
 (u'piarco', 3),
 (u'central bank', 4),
 (u'laughter', 4),
 (u'private sector', 4),
 (u'public money', 3),
 (u'okay', 11),
 (u'information', 8)]

What are all the texts about?

It's interesting to look at one text, but looking at what's going on across these TED talks may be more interesting. We care about similarities and differences among aggregates.


In [14]:
# get textblobs and noun phrase counts for everything -- this takes a while
blobs = [TextBlob(b).np_counts for b in df['transcript']]

In [58]:
blobs[2:3]


Out[58]:
[defaultdict(int,
             {u"'d look sweet": 1,
              u"'ll look sweet": 1,
              u"'ll pay": 1,
              u"'re hearing": 1,
              u"'s data": 1,
              u"'s head": 1,
              u"'s music": 1,
              u"'s relationship": 1,
              u'... \u266b': 1,
              u'20th century culture': 1,
              u'24-hour offset': 1,
              u'24-hour period': 1,
              u'3d space': 1,
              u'aaron koblin': 1,
              u'abstract': 1,
              u'abstract version': 1,
              u'ai': 3,
              u"ai n't": 2,
              u'air traffic controllers': 1,
              u'airplane traffic': 1,
              u'ak': 5,
              u'amazing interfaces': 1,
              u'amazing robot': 1,
              u'amazing stories': 2,
              u'amazing stuff': 1,
              u'amazing things': 1,
              u'amazon': 1,
              u'america': 1,
              u'amsterdam': 1,
              u'applause': 2,
              u'approximate': 1,
              u'arcade': 1,
              u'archival footage': 1,
              u'arizona': 1,
              u'art': 1,
              u"artist 's name": 1,
              u'at': 1,
              u'atlanta': 1,
              u'back burner': 1,
              u'bad things': 1,
              u'baron wolfgang': 1,
              u'beautiful stipple renderings': 1,
              u'bell labs': 1,
              u'bicycle built': 1,
              u'bicyclebuiltfortwothousand.com': 1,
              u'big low-traffic zones': 1,
              u'body ... \u266b': 1,
              u"ca n't": 5,
              u'carriage \u266b \u266b': 2,
              u'century culture': 1,
              u'certain things': 1,
              u'charlie chaplin': 2,
              u'chess player': 1,
              u'chorus': 1,
              u'chris': 1,
              u'chris milk': 2,
              u'coast planes': 1,
              u'collaborative memorial': 1,
              u'collaborative music video project': 1,
              u'collaborative project': 1,
              u'collaborator': 6,
              u'collection rate': 1,
              u'couple years': 1,
              u'crazy \u266b \u266b': 2,
              u'creative process': 1,
              u'creative toil': 1,
              u'currentcity.org': 1,
              u'daily ebb': 1,
              u'daisy': 6,
              u'daisy bell': 2,
              u'daniel massey': 1,
              u'demographic info': 1,
              u'different airports': 1,
              u'different contributions': 1,
              u'different parts': 3,
              u'different patterns': 1,
              u'different perimeters': 1,
              u'different phases': 1,
              u'different sheep': 1,
              u'different styles': 1,
              u'different types': 1,
              u'different ways': 1,
              u'dollar bill': 1,
              u'dollar bills': 1,
              u'drawing tool': 2,
              u'draws sheep': 1,
              u'elastic mind': 1,
              u'embedded network sensing': 1,
              u'entire thing': 1,
              u'european flights': 1,
              u'fake hundred-dollar bills': 1,
              u'favorite bands': 1,
              u'favorite parts': 1,
              u'federal government': 1,
              u'final album': 1,
              u'flash': 1,
              u'flips directions': 1,
              u'florida': 1,
              u'followed': 1,
              u'good friend': 1,
              u'good things': 1,
              u'google': 1,
              u'google code': 1,
              u'google maps': 1,
              u'grad school': 1,
              u'grave': 1,
              u'grave \u266b \u266b': 3,
              u'great stories': 1,
              u'ground \u266b \u266b': 1,
              u'hal': 1,
              u'happy': 1,
              u'hawaii': 1,
              u'honk': 1,
              u'hope': 2,
              u'html5': 1,
              u'humanity versus': 1,
              u'hundred': 1,
              u'individual airports': 1,
              u'individual contributions': 1,
              u'individual frame': 1,
              u'individual frames': 1,
              u'individual pieces': 1,
              u'individual sheep': 1,
              u'individual thumbnail': 1,
              u'individual thumbnails': 1,
              u'information panel': 1,
              u'interactive tool': 1,
              u'interesting clip': 1,
              u'interesting content': 1,
              u'international cities': 1,
              u'international communications': 1,
              u'ip': 1,
              u'james frost': 1,
              u'james surowieki': 1,
              u'japan': 1,
              u'javascript': 1,
              u'john kelly': 1,
              u'johnny cash': 5,
              u'johnny cash project': 1,
              u'johnnycashproject.com': 1,
              u'kempelen': 1,
              u'kind': 1,
              u'knoxville': 1,
              u'l.a.': 2,
              u'laptop per child': 1,
              u'laptop project': 1,
              u'laughter': 4,
              u'left side': 1,
              u'left-hand corner': 1,
              u'legless man': 1,
              u'lego thom yorke': 1,
              u'live feed': 1,
              u'live globe': 1,
              u'long beach': 1,
              u'long time': 1,
              u'los angeles': 2,
              u'lots': 1,
              u'major airports': 1,
              u'major changes': 1,
              u'max mathews': 1,
              u'mechanical chess': 1,
              u'mechanical process': 1,
              u'mechanical turk': 6,
              u'mit': 1,
              u'modern web browsers': 1,
              u'mr.': 1,
              u'music video': 6,
              u'music video director': 1,
              u"n't meet": 1,
              u"narrator 's": 1,
              u'nerdy side': 2,
              u'nevada': 1,
              u'new year': 1,
              u"new year 's eve": 1,
              u'noise': 1,
              u'obviously': 1,
              u'odyssey': 1,
              u'online music video': 1,
              u'ooh': 3,
              u'original frame': 1,
              u'own address': 1,
              u'own interpretation': 1,
              u'own versions': 1,
              u'own voice': 1,
              u'pain \u266b \u266b': 1,
              u'patterns': 1,
              u"people 's minds": 1,
              u"people 's shoes": 1,
              u'perfect project': 1,
              u"person 's life": 1,
              u'personal contribution': 1,
              u'petit': 1,
              u'plate block': 1,
              u'pointillist version': 1,
              u'powerful feeling': 1,
              u'powerful narrative device': 1,
              u'pretty valid question': 1,
              u'production traits': 1,
              u"queen 's day": 1,
              u'radiohead': 1,
              u'real hundred-dollar bills': 1,
              u'realistic version': 1,
              u'realistic versions': 1,
              u'really': 1,
              u'recently': 1,
              u'recording': 1,
              u'red-eye flights': 1,
              u'relevant data': 1,
              u'revolution': 1,
              u'ricardo cabello': 1,
              u'rick rubin': 1,
              u'right person': 1,
              u'right side': 1,
              u'right-hand corner': 1,
              u'right-hand side': 1,
              u'san francisco': 2,
              u'seat \u266b \u266b': 2,
              u'sensible cities lab': 2,
              u'sheep': 1,
              u'sheep-like criteria': 1,
              u'shoe makers': 1,
              u'short clip': 1,
              u'short clips': 1,
              u'simple audio clip': 1,
              u'sixteen-by-nine window': 1,
              u'sketchy version': 1,
              u'small group': 1,
              u'sms': 3,
              u'sound \u266b \u266b': 1,
              u'source code': 1,
              u'streetview': 1,
              u'stylish marriage \u266b \u266b': 2,
              u'sum total': 1,
              u'takashi kawashima': 1,
              u'teeny pieces': 1,
              u'tennessee': 1,
              u'tenthousandscents.com': 1,
              u'thanks': 1,
              u'thesheepmarket.com': 1,
              u'thom yorke': 2,
              u'time-lapse image': 1,
              u'turk': 1,
              u'tweeted': 1,
              u'u.s.': 1,
              u'upper right-hand corner': 1,
              u'venezuela': 1,
              u'video': 2,
              u'virtual resurrection': 1,
              u'visualize laser scanners': 1,
              u'wait': 1,
              u'watching': 1,
              u'web': 1,
              u'web service': 2,
              u'wee battles': 1,
              u'weird way': 1,
              u'whole bunch': 1,
              u'whole idea': 1,
              u'win butler': 1,
              u'wise media theorist': 1,
              u"world 's": 1,
              u'wow': 1,
              u'york': 3,
              u'youtube': 1,
              u'\u266b \u266b': 19})]

Note: Dirty Data

Text is hard to work with because it is invariably dirty data. Misspelled words, poorly-formed sentences, corrupted files, wrong encodings, cut off text, long-winded writing styles, and a multitude of other problems plague this data type. Because of that, you'll find yourself writing many special cases, cleaning data, and modifying existing solutions to fit your dataset. That's normal. And it's part of the reason that these approaches don't work for every dataset out of the box.
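
For example, these transcripts are sprinkled with stage directions like "(Applause)", "(Laughter)", and "(Music)" that later show up as noun phrases. A minimal cleanup sketch follows; the marker list and regex are purely illustrative, not an exhaustive fix:

import re

def clean_transcript(text):
    # strip parenthesized stage directions such as (Applause), (Laughter), (Music)
    text = re.sub(r'\((Applause|Laughter|Music)\)', ' ', text)
    # collapse any whitespace left behind
    return re.sub(r'\s+', ' ', text).strip()

cleaned_transcripts = df['transcript'].apply(clean_transcript)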


In [59]:
# as we did before, pull the higher incident themes
np_themes = [[n for n in b if b[n] > 2] for b in blobs] # (list comprehension inception)

# pair the speaker with their top themes
speaker_themes = zip(df['speaker'], np_themes)

speaker_themes_df = pd.DataFrame(speaker_themes, columns=['speaker','themes'])
speaker_themes_df[:10]


Out[59]:
speaker themes
0 Aakash Odedra []
1 Aaron Huey [lakota nation, black, nation, laramie, u.s., ...
2 Aaron Koblin [ooh, ca n't, mechanical turk, sms, york, dais...
3 Aaron O'Connell [quantum mechanics, different places, logical ...
4 Abe Davis [silent video, regular video, little, mary, la...
5 Abha Dawesar []
6 Abigail Washburn [♫ ♫, applause, shady, wong, china]
7 Abraham Verghese [fildes, burntisland, laennec, bell, subsequen...
8 Achenyo Idachaba [water hyacinth, gbe'borun, nigeria]
9 Adam Davidson [fiscal cliff, vast majority, u.s., fiscal iss...

Great! But how do we see these themes across speakers' talks?

PAUSE - We'll come back to this later.

Sidebar: Unicode

If you see text like this:

u'\u266b'

don't worry. It's unicode. Text is usually handled as unicode in Python, and this is the escaped representation of a special character. If you print it, you'll see the character it represents:

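For instance (assuming your terminal or notebook can display the character):

print u'\u266b'    # displays the beamed eighth notes character: ♫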

Let's take another seemingly simple question:

Q2. What's in the text?


In [60]:
print "There are %s sentences" % len(t.sentences)
print "And %s words" % len(t.words)


There are 224 sentences
And 3331 words

Psst -- I'll let you in on a secret: most of NLP concerns counting things.

(that isn't always easy, though, as you'll notice)

Parsing

In text analytics, we use two important terms to refer to types of words and phrases:

  • token - a single "word", or set of contiguous characters

Example tokens: ("won't", "giraffes", "1998")

  • n-gram - a contiguous sequence of n tokens

Example 3-grams or tri-grams: ("You won't dance", "giraffes really smell", "that's so 1998")


In [61]:
# to get ngrams from our TextBlob (slice of the first five ngrams)
t.ngrams(n=3)[:5]


Out[61]:
[WordList([u'Okay', u'this', u'morning']),
 WordList([u'this', u'morning', u'I']),
 WordList([u'morning', u'I', u"'m"]),
 WordList([u'I', u"'m", u'speaking']),
 WordList([u"'m", u'speaking', u'on'])]

Note the overlap here - the window moves over one token to slice the next ngram.

What's a sentence made up of?

The Building Blocks of Language

Language is made up of many components, but one way to think about language is by the types of words used. Words are categorized into functional "Parts of Speech", which describe the role they play. The most common parts of speech in English are:

  • noun
  • verb
  • adjective
  • adverb
  • pronoun
  • preposition
  • conjunction
  • interjection
  • article or determiner

In an example sentence, we can think about Parts of Speech as an abstraction of the words' behavior within the sentence, rather than the things they describe in the world.

What's the big deal about Part-of-Speech Tagging?

POS tagging is hard because language and grammatical structure can be, and often are, ambiguous. Humans can easily identify the subject and object of a sentence, or pick out a compound noun or a sub-clause, but a computer cannot do this out of the box. The computer needs to learn from examples to tell the difference.

Learning about the most likely tag that a given word should have is called incorporating "prior knowledge" and is the main task of training supervised machine learning models. Models use prior knowledge to infer the correct tag for new sentences.

For example,

He said he banks on the river.

He said the banks of the river were high.

Here, the context around the word "banks" determines whether it's a verb or a noun. Before the 1990s, the more popular technique for part-of-speech tagging was the rules-based approach, which involved writing a lengthy litany of rules describing these contextual differences in detail. This worked well in some cases, but it never generalized. Statistical inference approaches, which learn these differences from data, are now more popular and more widely used.
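
We can check what the default TextBlob tagger does with those two sentences (an illustrative check only; whether it disambiguates "banks" correctly depends on the model it wraps):

from textblob import TextBlob

for s in ["He said he banks on the river.",
          "He said the banks of the river were high."]:
    # .tags returns (word, part-of-speech) pairs for the sentence
    print TextBlob(s).tags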

This becomes important in identifying sub-clauses, which then allow disambiguation for other tasks, such as sentiment analysis. For example:

He said she was sad.

We could be led to believe that the subject of the sentence is "He", and the sentiment is negative ("sad"), but connecting "He" to "sad" would be incorrect. Parsing the clause "she was sad" allows us to achieve greater accuracy in sentiment analysis, especially if we are concerned with attributing sentiment to actors in text.

So let's take an example sentence:


In [62]:
sent = t.sentences[0] # the first sentence in the transcript
sent_tags = sent.tags # get the part of speech tags - you'll notice it takes a second
print "The full sentence:\n", sent
print "\nThe sentence with tags by word: \n", sent_tags
print "\nThe tags of the sentence in order: \n", " ".join([b for a, b in sent_tags])


The full sentence:
 Okay, this morning I'm speaking on the question of corruption.

The sentence with tags by word: 
[(u'Okay', u'NNP'), (u'this', u'DT'), (u'morning', u'NN'), (u'I', u'PRP'), (u"'m", u'VBP'), (u'speaking', u'VBG'), (u'on', u'IN'), (u'the', u'DT'), (u'question', u'NN'), (u'of', u'IN'), (u'corruption', u'NN')]

The tags of the sentence in order: 
NNP DT NN PRP VBP VBG IN DT NN IN NN

Note that "I'm" resolves to a personal pronoun 'PRP' and a verb 'VBP'. It's a contraction!

See the notation legend for part-of-speech tags here


Practice Example

So let's pick another sentence and check our understanding of labeling:


In [63]:
t.sentences[35]


Out[63]:
Sentence("The then-Prime Minister went to Parliament to give a budget speech, and he said some things that I'll never forget.")

Write this sentence out. What are the part of speech tags? Use the Part-of-Speech Code table above.

And the answer is:


In [64]:
answer_sent = t.sentences[35].tags
print ' '.join(['/'.join(i) for i in answer_sent])


The/DT then-Prime/JJ Minister/NNP went/VBD to/TO Parliament/NNP to/TO give/VB a/DT budget/NN speech/NN and/CC he/PRP said/VBD some/DT things/NNS that/IN I/PRP 'll/MD never/RB forget/VB

And for fun, What are the noun phrases?

hint: noun phrases are defined slightly differently by different people, so this is sometimes subjective

Answer:


In [65]:
t.sentences[35].noun_phrases


Out[65]:
WordList([u'then-prime minister', u'budget speech'])

Bad things are bad

But don't worry - it's not just the computer that has trouble understanding what roles words play in a sentence. Some sentences are ambiguous or difficult for humans, too:

Mr/NNP Calhoun/NNP never/RB got/VBD around/RP to/TO joining/VBG

All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT block/NN

Suite/NNP Mondant/NNP costs/VBZ around/RB 250/CD

This is a great example of why you will always have error - sometimes even a human can't tell the difference, and in those cases, the computer will fail, too.

Part-of-Speech Tagging - Runtime

Part-of-Speech taggers have achieved quite high-quality results in recent years, but still suffer from underwhelming runtime performance. This is because they usually are not heuristic-based, i.e. derived from empirical rules, but rather use machine-learned models that require significant pre-processing and feature generation to predict tags. In practice, Data Scientists might develop their own heuristics or domain-trained models for POS tagging. This is a great example of a case where domain-specificity of the model will improve accuracy, even if the model is based on heuristics alone (see further reading).

Something important to note: NLTK, a popular tool that is underneath much of TextBlob, is a massive framework that was built originally as an academic and teaching tool. It was not built with production performance in mind.

Unfortunately, as Matthew Honnibal notes,

Up-to-date knowledge about natural language processing is mostly locked away in academia.

Further Reading:


Q3. What tone does this transcript take?

Sentiment refers to the emotive value of language.

Polarity is the generalized sentiment, measured from negative to positive. A sentiment polarity score falls in the range [-1.0, 1.0].

So the general polarity of the text is:


In [66]:
print t.sentiment.polarity


0.104258062274

So this text is mildly positive. Does that tell us anything useful? Not really. But what about the change of tone over the course of the talk?


In [23]:
tone_change = list()
for sentence in t.sentences:
     tone_change.append(sentence.sentiment.polarity)

print tone_change


[0.5, 0.0, 0.5, 0.0, 0.0, 0.0, 0.049999999999999996, 0.0, 0.6, 0.0, 0.5, 0.0, 0.2, 0.0, 0.0, 0.0, 0.0, -0.036538461538461534, -0.69, 0.0, -0.09375, -0.049999999999999996, 0.0, 0.0, 0.1, 0.375, 0.06666666666666667, 0.5, 0.0, 0.0, 0.5, 0.25, 0.0, 0.0, -0.4, 0.0, 0.2857142857142857, 0.1, 0.2857142857142857, 0.0, 0.0, -0.15000000000000002, 0.21428571428571427, 0.0, 0.0, -0.1875, 0.0, 0.5, 0.0, 0.5, -0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.0, 0.6, 0.0, 0.25, 0.0, 0.1, -0.3333333333333333, 0.0, 0.0, 0.15000000000000002, 0.0, 0.0, 0.1875, 0.6, -0.05000000000000001, 0.0, 0.5, 0.20625, 0.0, 0.4333333333333333, 0.0, 0.0, 0.0, 0.0, -0.0928571428571429, -0.25, 0.05, 0.0, 0.0, 0.0, 0.0, -0.10000000000000002, 0.0, 0.0, 0.0, 0.35, 0.0, -0.15, 0.0, 0.19999999999999998, 0.09285714285714285, 0.45, 0.45, 0.0, 0.0, 0.0, 0.0, 0.2857142857142857, 0.25, 0.25, 0.0, 0.0, 0.16666666666666669, 0.0, 0.04545454545454545, 0.375, -0.05555555555555555, 0.25, -0.015, 0.4666666666666666, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, -0.3333333333333333, 0.0, 0.0, 0.0, 0.0, 0.0, 0.020833333333333332, 0.2333333333333333, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2375, -0.06, 0.0, 0.5, 0.0, 0.0, 0.0, -0.125, -0.16666666666666666, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.06818181818181818, 0.0, 0.0, -0.15625, 0.0, 0.0, -0.1, -0.20833333333333331, 0.5, -0.16666666666666666, 0.0, 0.08333333333333333, 0.0, 0.0, 0.5, 0.4, 0.0, 0.0, 0.0, 0.19999999999999998, 0.75, 0.4, 0.11428571428571428, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0, -0.07142857142857142, -0.03571428571428571, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.04545454545454545, 0.0, -0.0625, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25, 0.2, 0.55, 0.0, 0.0, 0.25, 0.0, 0.4000000000000001, 0.2, 0.0, 0.0]

Ok, so we see some significant range here. Let's make it easier to see with a visualization.


In [24]:
# this will show the graph here in the notebook
import matplotlib.pyplot as plt
%matplotlib inline

In [67]:
# dataframes have a handy plot method
pd.DataFrame(tone_change).plot(title='Polarity of Transcript by Sentence')


Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0x1141d0690>

Interesting trend - does the talk seem to become more positive over time?

Anecdotally, we know that TED talks seek to motivate and inspire, which could be one explanation for this sentiment polarity pattern.

Play around with this data yourself!
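
One place to start: smooth the sentence-level polarity with a rolling mean so the trend is easier to see. A sketch (the window size of 20 is arbitrary, and newer pandas versions spell this tone_df['polarity'].rolling(window=20).mean() instead):

tone_df = pd.DataFrame(tone_change, columns=['polarity'])
# 20-sentence rolling average of polarity
tone_df['smoothed'] = pd.rolling_mean(tone_df['polarity'], window=20)
tone_df.plot(title='Smoothed Polarity of Transcript by Sentence')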


2. Representations of Text for Computation

Text is messy and requires pre-processing. Once we've pre-processed and normalized the text, how do we use computation to understand it?

We may want to do a number of things computationally, but I'll focus generally on finding differences and similarities. In the case study, I'll focus on a supervised classification example, where these goals are key.

So let's start where we always do - counting words.


In [68]:
dict(t.word_counts)


Out[68]:
{u'09': 1,
 u'1.6': 2,
 u'10': 2,
 u'15': 2,
 u'1970s': 1,
 u'1982': 2,
 u'1999': 1,
 u'24': 2,
 u'30': 4,
 u'34': 4,
 u'40': 1,
 u'50th': 1,
 u'60': 2,
 u'a': 69,
 u'able': 1,
 u'about': 24,
 u'abuse': 1,
 u'according': 1,
 u'accountability': 3,
 u'accounts': 4,
 u'accused': 6,
 u'across': 1,
 u'act': 9,
 u'activity': 2,
 u'actually': 1,
 u'admit': 2,
 u'advance': 2,
 u'advertised': 1,
 u'afraraymond.com': 1,
 u'africa': 1,
 u'again': 2,
 u'against': 2,
 u'ago': 5,
 u'airport': 4,
 u'alaska': 1,
 u'alaveteli.com': 2,
 u'all': 9,
 u'also': 3,
 u'always': 2,
 u'am': 3,
 u'american': 1,
 u'amount': 1,
 u'an': 12,
 u'and': 125,
 u'anniversary': 1,
 u'another': 1,
 u'answer': 1,
 u'any': 2,
 u'anything': 2,
 u'anywhere': 5,
 u'applause': 2,
 u'application': 1,
 u'applications': 1,
 u'applied': 1,
 u'applying': 1,
 u'appointed': 1,
 u'are': 16,
 u'around': 4,
 u'arrangements': 3,
 u'as': 12,
 u'ask': 2,
 u'asked': 6,
 u'assets': 1,
 u'at': 13,
 u'attention': 2,
 u'audience': 1,
 u'august': 1,
 u'autumn': 1,
 u'back': 4,
 u'backhanders': 1,
 u'bailed': 1,
 u'bailout': 1,
 u'bailouts': 4,
 u'ballah': 2,
 u'bank': 6,
 u'banks': 1,
 u'basic': 1,
 u'battered-looking': 1,
 u'battle': 1,
 u'be': 9,
 u'bearings': 1,
 u'became': 1,
 u'because': 12,
 u'been': 15,
 u'before': 1,
 u'being': 4,
 u'belmont': 1,
 u'benefit': 1,
 u'betterment': 1,
 u'bid-rigging': 1,
 u'big': 4,
 u'billion': 8,
 u'biology': 1,
 u'bit': 3,
 u'bitter': 2,
 u'blog': 1,
 u'blonde': 1,
 u'bluff': 1,
 u'board': 1,
 u'brain': 1,
 u'branches': 1,
 u'bribes': 1,
 u'bring': 5,
 u'bristol': 1,
 u'britain': 1,
 u'brooke': 1,
 u'budget': 1,
 u'build': 2,
 u'building': 2,
 u'bursting': 1,
 u'but': 13,
 u'by': 2,
 u'ca': 3,
 u'call': 3,
 u'called': 4,
 u'came': 1,
 u'campaigns': 1,
 u'can': 7,
 u'career': 2,
 u'carefully': 2,
 u'caribbean': 3,
 u'case': 3,
 u'caught': 1,
 u'cause': 2,
 u'caused': 1,
 u'celebrated': 1,
 u'celebrating': 1,
 u'center': 1,
 u'central': 4,
 u'changed': 1,
 u'changing': 1,
 u'check': 2,
 u'children': 1,
 u'cl': 4,
 u'clarity': 1,
 u'collapsed': 3,
 u'collective': 2,
 u'comes': 2,
 u'coming': 3,
 u'commercial': 1,
 u'commission': 1,
 u'commissioner': 1,
 u'commitment': 1,
 u'compared': 1,
 u'complained': 1,
 u'complexes': 1,
 u'conducting': 1,
 u'conglomerate': 1,
 u'connection': 1,
 u'consciousness': 1,
 u'consisted': 1,
 u'constitutional': 2,
 u'constructed': 1,
 u'construction': 1,
 u'consultative': 1,
 u'context': 7,
 u'continue': 6,
 u'contradiction': 1,
 u'cool': 1,
 u'corrupt': 1,
 u'corruption': 16,
 u'cost': 2,
 u'could': 4,
 u'council': 1,
 u'countries': 1,
 u'country': 10,
 u'courage': 2,
 u'course': 2,
 u'court': 3,
 u'creditors': 3,
 u'crime': 11,
 u'd': 2,
 u'damp': 1,
 u'dangerous': 2,
 u'dark': 1,
 u'database': 2,
 u'day': 2,
 u'deal': 2,
 u'dealing': 3,
 u'deeper': 1,
 u'defined': 2,
 u'demonstrate': 1,
 u'destroy': 2,
 u'details': 1,
 u'develop': 1,
 u'diego': 1,
 u'different': 3,
 u'digging': 1,
 u'dignify': 1,
 u'director': 1,
 u'directors': 1,
 u'discard': 2,
 u'discarded': 1,
 u'disclosure': 2,
 u'discovered': 1,
 u'discuss': 1,
 u'dismantle': 1,
 u'dispersed': 1,
 u'do': 8,
 u'does': 1,
 u'doing': 1,
 u'dollars': 11,
 u'double': 1,
 u'drinking': 1,
 u'each': 1,
 u'early': 1,
 u'economic': 3,
 u'effect': 1,
 u'eh': 1,
 u'either': 1,
 u'else': 2,
 u'embarked': 2,
 u'embarrassing': 1,
 u'embassy': 1,
 u'entire': 1,
 u'entitlements': 1,
 u'equal': 2,
 u'equation': 1,
 u'europe': 1,
 u'even': 3,
 u'events': 1,
 u'ever': 1,
 u'every': 1,
 u'example': 3,
 u'examples': 1,
 u'excellent': 1,
 u'excess': 1,
 u'exchanges': 1,
 u'exempt': 1,
 u'expenditure': 2,
 u'experience': 3,
 u'fact': 12,
 u'family': 3,
 u'famous': 1,
 u'fantastic': 1,
 u'fashion': 1,
 u'fear': 1,
 u'feel': 1,
 u'fiasco': 2,
 u'figures': 1,
 u'file': 2,
 u'filing': 1,
 u'filings': 1,
 u'final': 1,
 u'finance': 9,
 u'financial': 5,
 u'financiers': 2,
 u'find': 1,
 u'finding': 1,
 u'first': 7,
 u'fit': 2,
 u'flowed': 1,
 u'focusing': 1,
 u'follow': 1,
 u'for': 19,
 u'force': 2,
 u'forced': 1,
 u'forever': 2,
 u'forget': 3,
 u'form': 1,
 u'forms': 1,
 u'fortified': 1,
 u'found': 2,
 u'four': 3,
 u'framework': 1,
 u'france': 1,
 u'free': 1,
 u'freedom': 8,
 u'friends': 3,
 u'from': 5,
 u'gave': 2,
 u'generosity': 3,
 u'generous': 1,
 u'get': 8,
 u'getting': 5,
 u'give': 1,
 u'giving': 1,
 u'go': 4,
 u'going': 13,
 u'good': 1,
 u'got': 2,
 u'government': 7,
 u'government-to-government': 3,
 u'group': 1,
 u'grown': 1,
 u'had': 13,
 u'half': 3,
 u'happened': 3,
 u'happening': 1,
 u'has': 8,
 u'have': 31,
 u'he': 11,
 u'heart': 1,
 u'heather': 1,
 u'helps': 1,
 u'her': 1,
 u'here': 14,
 u'highest': 1,
 u'his': 3,
 u'history': 2,
 u'hit': 2,
 u'honest': 1,
 u'hope': 1,
 u'how': 6,
 u'huge': 1,
 u'i': 71,
 u'idea': 1,
 u'if': 12,
 u'ignored': 1,
 u'immediately': 1,
 u'immune': 1,
 u'important': 5,
 u'importantly': 1,
 u'in': 77,
 u'increase': 4,
 u'independence': 3,
 u'india': 1,
 u'individual': 1,
 u'information': 10,
 u'inquiry': 1,
 u'institution': 1,
 u'integrity': 3,
 u'interest': 1,
 u'interested': 1,
 u'interesting': 1,
 u'international': 3,
 u'into': 10,
 u'introduced': 1,
 u'involved': 1,
 u'involving': 1,
 u'ipaidabribe.com': 2,
 u'ironic': 1,
 u'ironies': 1,
 u'irony': 1,
 u'irresponsibility': 1,
 u'is': 65,
 u'it': 79,
 u'january': 1,
 u'jcc': 1,
 u'jcc.org.tt': 1,
 u'joining': 1,
 u'joint': 1,
 u'joke': 1,
 u'journalist': 1,
 u'jubilee': 1,
 u'just': 7,
 u'kind': 5,
 u'know': 1,
 u'labeled': 1,
 u'lack': 1,
 u'lady': 2,
 u'largely': 1,
 u'largest': 4,
 u'last': 1,
 u'laughter': 4,
 u'law': 8,
 u'laws': 3,
 u'lawyers': 1,
 u'lead': 1,
 u'leaders': 1,
 u'leading': 1,
 u'legal': 1,
 u'let': 6,
 u'lexicon': 1,
 u'liabilities': 1,
 u'life': 2,
 u'like': 12,
 u'listen': 2,
 u'little': 5,
 u'll': 2,
 u'located': 1,
 u'log': 1,
 u'london': 1,
 u'look': 3,
 u'looking': 1,
 u'looting': 1,
 u'lot': 7,
 u'lovely': 1,
 u'm': 20,
 u'made': 6,
 u'major': 1,
 u'make': 1,
 u'making': 3,
 u'man': 1,
 u'marabella': 1,
 u'marker': 1,
 u'mass': 1,
 u'massive': 3,
 u'matter': 1,
 u'matters': 1,
 u'maturation': 1,
 u'may': 1,
 u'maybe': 1,
 u'me': 10,
 u'mean': 1,
 u'million': 2,
 u'minister': 4,
 u'ministry': 7,
 u'mischief': 2,
 u'miseducated': 1,
 u'money': 11,
 u'months': 1,
 u'more': 1,
 u'morning': 2,
 u'most': 2,
 u'motivated': 1,
 u'movement': 1,
 u'must': 1,
 u'my': 10,
 u'myself': 1,
 u'mysterious': 1,
 u'myth': 7,
 u'myths': 1,
 u"n't": 14,
 u'name': 3,
 u'nancy-story': 1,
 u'nasty': 1,
 u'nation': 1,
 u'national': 1,
 u'nature': 1,
 u'nearly': 1,
 u'need': 5,
 u'never': 1,
 u'new': 3,
 u'next': 3,
 u'nigeria': 2,
 u'nigerian': 1,
 u'no': 2,
 u'nobody': 1,
 u'not': 13,
 u'not-for-profit': 2,
 u'nothing': 1,
 u'novels': 2,
 u'november': 1,
 u'now': 4,
 u'nowhere': 1,
 u'number': 2,
 u'occupy': 1,
 u'of': 123,
 u'office': 1,
 u'officials': 2,
 u'offshore': 2,
 u'oil': 1,
 u'okay': 14,
 u'on': 26,
 u'one': 12,
 u'only': 4,
 u'open': 1,
 u'or': 15,
 u'order': 1,
 u'other': 3,
 u'our': 17,
 u'ours': 1,
 u'out': 13,
 u'outdated': 1,
 u'outrage': 3,
 u'outraged': 3,
 u'outrageous': 1,
 u'outwitted': 1,
 u'over': 2,
 u'own': 2,
 u'parallel': 1,
 u'park': 1,
 u'parliament': 6,
 u'part': 3,
 u'participates': 1,
 u'particular': 1,
 u'parts': 1,
 u'passage': 2,
 u'passed': 2,
 u'passing': 2,
 u'pause': 6,
 u'paying': 2,
 u'people': 5,
 u'percent': 2,
 u'permanent': 1,
 u'person': 4,
 u'personal': 1,
 u'personally': 1,
 u'pervert': 2,
 u'perverted': 1,
 u'perverts': 1,
 u'petitions': 1,
 u'petrodollars': 2,
 u'physics': 1,
 u'piarco': 3,
 u'piece': 3,
 u'place': 6,
 u'planet': 2,
 u'please': 1,
 u'plot': 2,
 u'plunged': 1,
 u'point': 9,
 u'police': 1,
 u'political': 1,
 u'popular': 1,
 u'position': 1,
 u'press': 2,
 u'previous': 1,
 u'prices': 1,
 u'private': 6,
 u'privilege': 1,
 u'probably': 1,
 u'problem': 6,
 u'procurement': 1,
 u'produce': 1,
 u'project': 2,
 u'projects': 1,
 u'proper': 2,
 u'prosecutions': 1,
 u'protest': 2,
 u'protesters': 1,
 u'protests': 1,
 u'provisions': 2,
 u'public': 13,
 u'pure': 1,
 u'put': 1,
 u'question': 7,
 u'questions': 2,
 u'rapidly': 1,
 u're': 18,
 u'read': 2,
 u'reality': 2,
 u'really': 7,
 u'reason': 1,
 u'recalculate': 2,
 u'reconstruct': 1,
 u'reconvened': 1,
 u'relate': 1,
 u'relates': 1,
 u'relation': 1,
 u'relationship': 1,
 u'release': 1,
 u'relying': 1,
 u'remember': 1,
 u'repaid': 1,
 u'repay': 1,
 u'repeal': 1,
 u'repealed': 2,
 u'replies': 1,
 u'reply': 1,
 u'report': 1,
 u'reported': 2,
 u'required': 1,
 u'resolution': 1,
 u'resource-rich': 2,
 u'responsible': 1,
 u'reversed': 1,
 u'rich': 1,
 u'ridicule': 1,
 u'right': 6,
 u'room': 1,
 u'run': 1,
 u'russia': 1,
 u's': 48,
 u'safeguard': 1,
 u'safeguards': 1,
 u'said': 11,
 u'same': 2,
 u'saw': 1,
 u'say': 6,
 u'saying': 1,
 u'says': 1,
 u'scandal': 1,
 u'scholarship': 1,
 u'scholarships': 2,
 u'second': 6,
 u'secret': 1,
 u'secretary': 1,
 u'secrets': 1,
 u'section': 4,
 u'sector': 6,
 u'security': 1,
 u'see': 11,
 u'segue': 1,
 u'september': 1,
 u'series': 3,
 u'serious': 2,
 u'she': 1,
 u'short': 1,
 u'show': 1,
 u'sign': 4,
 u'signed': 1,
 u'signing': 1,
 u'simple': 1,
 u'since': 2,
 u'single': 2,
 u'situation': 3,
 u'six': 1,
 u'size': 1,
 u'slide': 1,
 u'small': 4,
 u'so': 27,
 u'so-and-sos': 1,
 u'society': 2,
 u'some': 11,
 u'somebody': 1,
 u'something': 3,
 u'sometimes': 1,
 u'sort': 1,
 u'speak': 2,
 u'speaking': 9,
 u'speech': 1,
 u'speeches': 1,
 u'spent': 2,
 u'spy': 2,
 u'stability': 1,
 u'stand': 1,
 u'standing': 3,
 u'start': 1,
 u'started': 2,
 u'statement': 1,
 u'statements': 1,
 u'states': 1,
 u'statutory': 1,
 u'step': 1,
 u'stolen': 2,
 u'stopped': 1,
 u'street': 6,
 u'struggle': 2,
 u'stuff': 3,
 u'subject': 1,
 u'suffered': 1,
 u'superior': 1,
 u'supposed': 3,
 u'sure': 1,
 u'suspects': 1,
 u'suspicious': 8,
 u'sustainability': 1,
 u'swiftly': 1,
 u'system': 1,
 u'table': 2,
 u'take': 3,
 u'takes': 1,
 u'talk': 2,
 u'talking': 2,
 u'taxpayers': 3,
 u'tell': 3,
 u'telling': 2,
 u'temple': 1,
 u'terms': 2,
 u'terrace': 1,
 u'thank': 1,
 u'that': 72,
 u'the': 175,
 u'their': 1,
 u'them': 6,
 u'then-prime': 1,
 u'there': 14,
 u'these': 4,
 u'they': 20,
 u'thing': 11,
 u'things': 5,
 u'thinking': 1,
 u'third': 1,
 u'thirds': 1,
 u'this': 33,
 u'those': 6,
 u'thought': 1,
 u'three': 4,
 u'through': 2,
 u'time': 7,
 u'to': 112,
 u'tobago': 5,
 u'today': 7,
 u'together': 3,
 u'told': 3,
 u'too': 2,
 u'took': 4,
 u'tower': 2,
 u'towers': 1,
 u'traced': 1,
 u'transacted': 1,
 u'transparency': 3,
 u'treasury': 1,
 u'treated': 2,
 u'trinidad': 5,
 u'trust': 1,
 u'truth': 1,
 u'try': 1,
 u'trying': 2,
 u'tune': 1,
 u'tv': 1,
 u'twin': 1,
 u'two': 3,
 u'u.s': 1,
 u'under': 2,
 u'understand': 7,
 u'understanding': 2,
 u'understood': 1,
 u'united': 1,
 u'unprecedented': 3,
 u'up': 5,
 u'us': 17,
 u'use': 3,
 u'used': 1,
 u'using': 5,
 u've': 9,
 u'very': 5,
 u'walking': 2,
 u'wall': 6,
 u'want': 9,
 u'was': 39,
 u'wasted': 2,
 u'waters': 1,
 u'way': 2,
 u'we': 66,
 u'wealth': 1,
 u'website': 2,
 u'weekend': 1,
 u'well': 3,
 u'went': 3,
 u'were': 11,
 u'what': 32,
 u'whatever': 1,
 u'when': 5,
 u'where': 5,
 u'whether': 3,
 u'which': 6,
 u'who': 8,
 u'whole': 3,
 u'will': 7,
 u'with': 19,
 u'within': 2,
 u'without': 4,
 u'word': 2,
 u'words': 1,
 u'work': 8,
 u'worked': 1,
 u'working': 1,
 u'works': 1,
 u'world': 2,
 u'would': 2,
 u'writing': 1,
 u'written': 3,
 u'wrongs': 1,
 u'wrote': 2,
 u'yeah': 1,
 u'year': 2,
 u'years': 7,
 u'you': 32,
 u'young': 1,
 u'your': 6,
 u'yourself': 1,
 u'zuccotti': 1,
 u'\u2014': 6}

In [69]:
# put this in a dataframe for easy viewing and sorting
word_count_df = pd.DataFrame.from_dict(t.word_counts, orient='index')
word_count_df.columns = ['count']

What are the most common words?


In [70]:
word_count_df.sort('count', ascending=False)[:10]


Out[70]:
count
the 175
and 125
of 123
to 112
it 79
in 77
that 72
i 71
a 69
we 66

Hm. So the most common words will not tell us much about what's going on in this text.

In general, the more frequent a word is in the English language or in a given text, the less important it is likely to be to us. This idea is captured in a well-known text mining statistic called "term frequency–inverse document frequency" (tf-idf). We can represent text using tf-idf weights that express how important a term is in a particular document relative to the rest of the collection, more accurately reflecting the importance of different terms or n-grams.
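
As a sketch of what that looks like in practice, scikit-learn's TfidfVectorizer can weight each term by how distinctive it is to a document (default settings here; row 18 is the corruption talk we've been reading):

from sklearn.feature_extraction.text import TfidfVectorizer

# drop common English stop words, weight everything else by tf-idf
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['transcript'])

# ten highest-weighted terms for the corruption talk
row = tfidf_matrix[18].toarray().ravel()
terms = tfidf.get_feature_names()
print sorted(zip(row, terms), reverse=True)[:10]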

Compare two transcripts

Let's look at two transcripts -


In [29]:
one = df.ix[16]
two = df.ix[20]
one['headline'], two['headline']


Out[29]:
(u'A second opinion on developmental disorders',
 u'The mothers who found forgiveness, friendship')

In [30]:
len(one['transcript']), len(two['transcript'])


Out[30]:
(5799, 5341)

In [31]:
one_blob = TextBlob(one['transcript'])
two_blob = TextBlob(two['transcript'])

In [32]:
one_set = set(one_blob.tokenize())
two_set = set(two_blob.tokenize())

In [33]:
# How many words did the two talks use commonly?
len(one_set.intersection(two_set))


Out[33]:
118

In [34]:
# How many different words did they use total?
total_diff = len(one_set.difference(two_set)) + len(two_set.difference(one_set))
total_diff


Out[34]:
550

In [35]:
proportion = len(one_set.intersection(two_set)) / float(total_diff)
print "Proportion of vocabulary that is common:", round(proportion, 4)*100, "%"


Proportion of vocabulary that is common: 21.45 %
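
Note that the cell above normalizes by the count of words that differ; another common convention is to divide the intersection by the union of the two vocabularies (the Jaccard similarity), which reads as "what fraction of all the words used appears in both talks":

jaccard = len(one_set.intersection(two_set)) / float(len(one_set.union(two_set)))
print "Jaccard similarity:", round(jaccard, 4)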

In [36]:
print one_set.intersection(two_set)


set([u'all', u'just', u'human', u'suffer', u'children', u'(', u'had', u',', u'day', u'to', u'going', u'parents', u'suffered', u'then', u'Applause', u'very', u'me', u'suffering', u'words', u'not', u'now', u'him', u'like', u'these', u'each', u'For', u'are', u'our', u'special', u'really', u"n't", u'what', u'still', u'for', u'find', u'told', u'be', u'we', u'knew', u'This', u'never', u'here', u'change', u'on', u'about', u'her', u'of', u'could', u'or', u'When', u'one', u'been', u'story', u'from', u'would', u'there', u'three', u'But', u'.', u'their', u'much', u'too', u':', u'was', u'tell', u'today', u'knows', u'that', u'but', u'it', u'So', u'child', u'with', u'those', u'he', u'And', u'this', u'up', u'us', u'will', u'stories', u'can', u'were', u'my', u'called', u'and', u'is', u'mind', u'mine', u'an', u'say', u'something', u'have', u'in', u'as', u'if', u')', u'six', u'when', u'other', u'which', u'you', u'out', u"'s", u'I', u'who', u'most', u"'d", u'such', u"'m", u'a', u'later', u'It', u'so', u'fact', u'time', u'the', u'came'])

So what we start to see is that if we removed frequently used words (stop words) from this set, we'd be left with a few themes the two talks share, but again, most of the overlapping vocabulary is just everyday words.
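
A quick way to do that filtering, assuming NLTK's English stop word list is available (it ships with the corpora TextBlob downloads):

from nltk.corpus import stopwords

stops = set(stopwords.words('english'))
# keep only the shared words that aren't stop words or punctuation
shared_content_words = [w for w in one_set.intersection(two_set)
                        if w.lower() not in stops and w.isalpha()]
print shared_content_words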

Vectorizers

Let's go back to the dataframe of most frequently used noun phrases across transcripts


In [71]:
themes_list = speaker_themes_df['themes'].tolist()
speaker_themes_df[7:10]


Out[71]:
speaker themes
7 Abraham Verghese [fildes, burntisland, laennec, bell, subsequen...
8 Achenyo Idachaba [water hyacinth, gbe'borun, nigeria]
9 Adam Davidson [fiscal cliff, vast majority, u.s., fiscal iss...

In [38]:
# sci-kit learn is a machine learning library for python
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

vocab = set([phrase for themes in themes_list for phrase in themes])

# restrict the vectorizer to n-grams of length 2 to 4 to surface some richer topics
cv = CountVectorizer(stop_words=None, vocabulary=vocab, ngram_range=(2, 4))

In [39]:
# going to turn these back into documents 
document_list = [','.join(t) for t in themes_list]

In [40]:
data = cv.fit_transform(document_list).toarray()

A vectorizer is a method that, as the name suggests, creates vectors from data, and ultimately a matrix. Each vector contains the incidence (in this case, the count) of a token across all the documents (in this case, transcripts).

Sparsity

In text analysis, any matrix representing a set of documents against a vocabulary will be sparse. This is because not every word in the vocabulary occurs in every document - quite the contrary. Most of each vector is empty.
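
We can check how sparse our own matrix is (a quick sketch over the dense array built above):

# fraction of entries in the document-term matrix that are zero
sparsity = 1.0 - (np.count_nonzero(data) / float(data.size))
print "Sparsity: %.2f%%" % (sparsity * 100)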


In [41]:
cv.get_feature_names() # names of features

dist = np.sum(data, axis=0) 

counts_list = list()
for tag, count in zip(vocab, dist):
     counts_list.append((count, tag))

Which themes were most common across speakers?


In [42]:
count_df = pd.DataFrame(counts_list, columns=['count','feature'])
count_df.sort('count', ascending=False)[:20]


Out[42]:
count feature
670 14 young people
2025 10 solar system
1144 9 good news
1772 8 breast cancer
211 7 human beings
829 7 high school
963 6 long time
225 6 cell phone
3456 6 stem cells
2918 6 social media
1997 5 quantum mechanics
1818 5 health care
1651 5 real world
1635 5 death row
3379 5 dark energy
2916 5 san francisco
1102 5 video games
1725 4 world war
1022 4 world peace
925 4 los angeles

So the common theme is - Good News!

In Natural Language Processing, Vectorization is the most common way to abstract text to make it programmatically manipulable. It's your best friend once you get past document-level statistics!

Further Reading:

Don't forget -

If you don't know where to start, just start counting things.