[Data, the Humanist's New Best Friend](index.ipynb)
*Class 14*

In this class you are expected to learn:

  • Information extraction
  • Named Entity Recognition
  • Mining HTML, Twitter and Facebook
  • Basic Sentiment Analysis (although the approach is very rough)

In [1]:
%matplotlib inline
import nltk
import textblob as tb

Information extraction

From NLTK book: "Information comes in many shapes and sizes. One important form is structured data, where there is a regular and predictable organization of entities and relationships. For example, we might be interested in the relation between companies and locations. Given a particular company, we would like to be able to identify the locations where it does business; conversely, given a location, we would like to discover which companies do business in that location. If our data is in tabular form [...] then answering these queries is straightforward."

However, getting similar outcomes from plain text is a bit trickier. For example, consider the following snippet (from nltk.corpus.ieer, fileid NYT19980315.0085).

The fourth Wells account moving to another agency is the packaged paper-products division of Georgia-Pacific Corp., which arrived at Wells only last fall. Like Hertz and the History Channel, it is also leaving for an Omnicom-owned agency, the BBDO South unit of BBDO Worldwide. BBDO South in Atlanta, which handles corporate advertising for Georgia-Pacific, will assume additional duties for brands like Angel Soft toilet tissue and Sparkle paper towels, said Ken Haldin, a spokesman for Georgia-Pacific in Atlanta.

After reading it, it wouldn't be very hard for you to answer a question like "Which organizations operate in Atlanta?" Making a machine understand the text and come up with the answer is a much harder task. This is because machines are not very good at dealing with unstructured information, like the snippet above.

It would be very nice if a machine could understand the meaning of the text; in fact, that's one approach to the problem. However, because understanding meaning is beyond the scope of this course, we will focus on creating structured information from text that we can later query using some other method, such as Structured Query Language (SQL). This method of getting meaning from text is called Information Extraction.

"Information Extraction has many applications, including business intelligence, resume harvesting, media analysis, sentiment detection, patent search, and email scanning. A particularly important area of current research involves the attempt to extract structured data out of electronically-available scientific literature, especially in the domain of biology and medicine."

Named Entity Recognition

The techniques that we have seen so far, i.e. segmenting, tokenizing, and part-of-speech tagging, are all necessary steps for performing information extraction. Once we have POS tags, we can search for specific types of entities. Noun phrases are the first kind of entity we can recognize; however, they don't provide the fine-grained information we are looking for. Common named entity types include ORGANIZATION, PERSON, LOCATION, DATE, TIME, MONEY, and GPE (geo-political entity). These are usually called classes (alluding to the machine learning algorithms used).


| NE Type      | Examples                                 |
|--------------|------------------------------------------|
| ORGANIZATION | Georgia-Pacific Corp., WHO               |
| PERSON       | Eddy Bonte, President Obama              |
| LOCATION     | Murray River, Mount Everest              |
| DATE         | June, 2008-06-29                         |
| TIME         | two fifty a m, 1:30 p.m.                 |
| MONEY        | 175 million Canadian Dollars, GBP 10.40  |
| PERCENT      | twenty pct, 18.75 %                      |
| FACILITY     | Washington Monument, Stonehenge          |
| GPE          | South East Asia, Midlothian              |
[Commonly Used Types of Named Entity](http://www.nltk.org/book/ch07.html)

The goal of a named entity recognition (NER) system is to identify all textual mentions of the named entities. This can be broken down into two sub-tasks: identifying the boundaries of the NE, and identifying its type.

The task of tagging the different named entities with their types is known as a supervised learning task, where the machine is fed with training data (already tagged named entities), and outputs a classifier able to label new entities. NLTK comes with a built-in classifier that has already been trained to recognize named entities. The function nltk.ne_chunk() receives POS tagged words and produces trees with the named entity types. If we set the parameter binary=True, then named entities are just tagged as NE; otherwise, the classifier adds category labels such as PERSON, ORGANIZATION, and GPE.

Let's take the first snippet from the frontpage of the New York Times and extract just the named entities.


In [2]:
text = (
    u"Israel holds national elections Tuesday to determine if "
    u"Prime Minister Benjamin Netanyahu and his Likud Party will "
    u"earn a third consecutive term, or if the Zionist Union will "
    u"garner enough seats to form a new government."
)

sent = nltk.pos_tag(nltk.tokenize.word_tokenize(text))
print(nltk.ne_chunk(sent, binary=True))


(S
  (NE Israel/NNP)
  holds/NNS
  national/JJ
  elections/NNS
  Tuesday/NNP
  to/TO
  determine/VB
  if/IN
  Prime/NNP
  Minister/NNP
  (NE Benjamin/NNP Netanyahu/NNP)
  and/CC
  his/PRP$
  (NE Likud/NNP Party/NNP)
  will/MD
  earn/VB
  a/DT
  third/JJ
  consecutive/JJ
  term/NN
  ,/,
  or/CC
  if/IN
  the/DT
  (NE Zionist/NNP Union/NNP)
  will/MD
  garner/VB
  enough/RB
  seats/NNS
  to/TO
  form/NN
  a/DT
  new/JJ
  government/NN
  ./.)

By removing binary=True, we get the named entity types.


In [3]:
print(nltk.ne_chunk(sent))


(S
  (GPE Israel/NNP)
  holds/NNS
  national/JJ
  elections/NNS
  Tuesday/NNP
  to/TO
  determine/VB
  if/IN
  Prime/NNP
  Minister/NNP
  (PERSON Benjamin/NNP Netanyahu/NNP)
  and/CC
  his/PRP$
  (ORGANIZATION Likud/NNP Party/NNP)
  will/MD
  earn/VB
  a/DT
  third/JJ
  consecutive/JJ
  term/NN
  ,/,
  or/CC
  if/IN
  the/DT
  (ORGANIZATION Zionist/NNP Union/NNP)
  will/MD
  garner/VB
  enough/RB
  seats/NNS
  to/TO
  form/NN
  a/DT
  new/JJ
  government/NN
  ./.)

Extracting NEs is just a matter of traversing the tree and filtering by the NEs we want.


In [4]:
def filter_nes(tree):
    return tree.label() in ["ORGANIZATION", "PERSON", "LOCATION", "DATE", "TIME", "MONEY", "GPE"]

tree = nltk.ne_chunk(sent)
for subtree in tree.subtrees(filter=filter_nes):
    print(subtree)


(GPE Israel/NNP)
(PERSON Benjamin/NNP Netanyahu/NNP)
(ORGANIZATION Likud/NNP Party/NNP)
(ORGANIZATION Zionist/NNP Union/NNP)
*Should we be able to recognize that name? I don't think so...*

Stanford Named Entity Recognizer

Sometimes, the accuracy achieved by the built-in NE tagger in NLTK might not be enough. For those cases, NLTK can interface with the state-of-the-art NE and IE systems from Stanford University, which have been in development for almost 10 years now.

First, we need a Java Runtime Environment (JRE). To see whether you need to install it, just execute the next cell. If you see an error of any kind, then you need to install Java.


In [5]:
! java -version


java version "1.7.0_76"
Java(TM) SE Runtime Environment (build 1.7.0_76-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.76-b04, mixed mode)

The next step is to download the Stanford Named Entity Recognizer (3.4.1) and uncompress it to a path of your choice. In my case, I put it under $HOME/bin/stanford-ner. To check that everything is OK, just execute the next cell, changing $HOME/bin/stanford-ner to your path.

Note: If you have Java 8, download version 3.5.1 instead.


In [6]:
! export NER=$HOME/bin/stanford-ner; \
  java -mx500m -cp $NER/stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
       -loadClassifier $NER/classifiers/english.all.3class.distsim.crf.ser.gz \
       -textFile $NER/sample.txt


CRFClassifier invoked on Thu Mar 19 01:24:22 EDT 2015 with arguments:
   -loadClassifier /home/versae/bin/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz -textFile /home/versae/bin/stanford-ner/sample.txt
loadClassifier=/home/versae/bin/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz
textFile=/home/versae/bin/stanford-ner/sample.txt
Loading classifier from /home/versae/bin/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz ... done [3.2 sec].
The/O fate/O of/O Lehman/ORGANIZATION Brothers/ORGANIZATION ,/O the/O beleaguered/O investment/O bank/O ,/O hung/O in/O the/O balance/O on/O Sunday/O as/O Federal/ORGANIZATION Reserve/ORGANIZATION officials/O and/O the/O leaders/O of/O major/O financial/O institutions/O continued/O to/O gather/O in/O emergency/O meetings/O trying/O to/O complete/O a/O plan/O to/O rescue/O the/O stricken/O bank/O ./O 
Several/O possible/O plans/O emerged/O from/O the/O talks/O ,/O held/O at/O the/O Federal/ORGANIZATION Reserve/ORGANIZATION Bank/ORGANIZATION of/ORGANIZATION New/ORGANIZATION York/ORGANIZATION and/O led/O by/O Timothy/PERSON R./PERSON Geithner/PERSON ,/O the/O president/O of/O the/O New/ORGANIZATION York/ORGANIZATION Fed/ORGANIZATION ,/O and/O Treasury/ORGANIZATION Secretary/O Henry/PERSON M./PERSON Paulson/PERSON Jr./PERSON ./O 
CRFClassifier tagged 85 words in 2 documents at 240.79 words per second.

The last step is to integrate Stanford NER into NLTK by using the nltk.tag.stanford module.

The class nltk.tag.stanford.NERTagger for NER tagging with the Stanford tagger receives as inputs:

  • a model trained on training data, available in the classifiers folder of the Stanford NER zip;
  • the path to the Stanford tagger jar file (if not specified here, this jar file must be specified in the CLASSPATH environment variable);
  • optionally, the encoding of the training data (default: ASCII).

In [7]:
from nltk.tag.stanford import NERTagger
ner, = ! echo $HOME/bin/stanford-ner
st = NERTagger(ner + '/classifiers/english.all.3class.distsim.crf.ser.gz',
               ner + '/stanford-ner.jar') 
st.tag(nltk.tokenize.word_tokenize(text))


Out[7]:
[('Israel', 'LOCATION'),
 ('holds', 'O'),
 ('national', 'O'),
 ('elections', 'O'),
 ('Tuesday', 'O'),
 ('to', 'O'),
 ('determine', 'O'),
 ('if', 'O'),
 ('Prime', 'O'),
 ('Minister', 'O'),
 ('Benjamin', 'PERSON'),
 ('Netanyahu', 'PERSON'),
 ('and', 'O'),
 ('his', 'O'),
 ('Likud', 'ORGANIZATION'),
 ('Party', 'ORGANIZATION'),
 ('will', 'O'),
 ('earn', 'O'),
 ('a', 'O'),
 ('third', 'O'),
 ('consecutive', 'O'),
 ('term', 'O'),
 (',', 'O'),
 ('or', 'O'),
 ('if', 'O'),
 ('the', 'O'),
 ('Zionist', 'ORGANIZATION'),
 ('Union', 'ORGANIZATION'),
 ('will', 'O'),
 ('garner', 'O'),
 ('enough', 'O'),
 ('seats', 'O'),
 ('to', 'O'),
 ('form', 'O'),
 ('a', 'O'),
 ('new', 'O'),
 ('government', 'O'),
 ('.', 'O')]

Included with Stanford NER are a 4 class model trained for CoNLL, a 7 class model trained for MUC, and a 3 class model trained on both data sets for the intersection of those class sets.

  • 3 class: Location, Person, Organization
  • 4 class: Location, Person, Organization, Misc
  • 7 class: Time, Location, Organization, Person, Money, Percent, Date

And these are the models included (models for Spanish, German and Chinese are also available):

  • english.all.3class.distsim.crf.ser.gz
  • english.all.7class.distsim.crf.ser.gz
  • english.conll.4class.distsim.crf.ser.gz
  • english.muc.7class.distsim.crf.ser.gz
  • english.nowiki.3class.distsim.crf.ser.gz

Stanford NER also has a web demo of the system.

Activity

Write a function `extract_geo(text)` that receives `text` and returns a list of the geo-political entities found. For example, `extract_geo("I was born in Seville, Spain")` should return `['Seville', 'Spain']`. One possible solution sketch follows the expected output below.


In [8]:
def extract_geo(text):
    ...

extract_geo("I was born in Seville, Spain")


Out[8]:
['Seville', 'Spain']
[*The Barber of Seville*](https://en.wikipedia.org/wiki/The_Barber_of_Seville)
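One possible solution sketch, reusing nltk.ne_chunk and the subtree traversal shown above (any approach returning the same list is fine):

def extract_geo(text):
    tagged = nltk.pos_tag(nltk.tokenize.word_tokenize(text))
    # Join the leaves of every subtree labeled as a geo-political entity
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in nltk.ne_chunk(tagged).subtrees(
                filter=lambda tree: tree.label() == "GPE")]

extract_geo("I was born in Seville, Spain")  # expected: ['Seville', 'Spain']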

Relationship Extraction

Finally, the information extraction system looks at entities that are mentioned near one another in the text, and tries to determine whether specific relationships hold between those entities.

We will typically be looking for relations between specified types of named entity. One way of approaching this task is to initially look for all triples of the form (X, α, Y), where X and Y are named entities of the required types, and α is the string of words that intervenes between X and Y (in clausal form, α(X, Y)). We can then use regular expressions to pull out just those instances of α that express the relation we are looking for.

For example, let's take the following text:


In [9]:
doc = nltk.corpus.ieer.parsed_docs('NYT_19980315')[10]
nyt = " ".join([leaf for leaf in doc.text.leaves()])
nyt


Out[9]:
"NEW YORK _ Time has run out for the besieged Wells BDDP , the onetime Madison Avenue powerhouse that had recently stumbled into a stunning free fall as large clients left amid executive turmoil and ownership changes. GGT Group PLC , the parent of Wells , said on Friday that Wells would be closed after 32 years , effective on May 13 . The shutdown will affect 133 employees in New York , though at the start of the year Wells had more than twice that number. Efforts will be made to place employees with affiliates of Omnicom Group , the giant agency company that agreed in late January to acquire GGT . They will be eligible for ``at least six months to as much as a year '' in severance pay, said John Wren , the president and chief executive at Omnicom in New York . The shutdown will abruptly end Wells ' battle for survival during which billings fell to less than $200 million from a peak of almost $1 billion seven years ago. Just since November , Wells has lost more than $330 million in billings from clients like Bristol-Myers Squibb , Heineken USA Inc. , Liberty Mutual Insurance Co. , Procter &AMP; Gamble Co. and Tag Heuer USA . The closing also ends more than three decades of advertising achievement that included such familiar campaigns as ``Quality is Job 1'' for Ford , ``I can't believe I ate the whole thing'' for Alka-Seltzer , ``Oh, the disadvantages'' for Benson &AMP; Hedges cigarettes and ``A totally organic experience'' for Clairol Herbal Essences shampoo. `` Wells and the kind of work it did was one of the reasons I got into this business,'' said Steve Davis , who joined Wells only six weeks ago as chairman and chief executive after the dismissal of Frank Assumma . ``To look at the agency's reel was to see the best and the brightest,'' Davis said. ``But it got to the point there was not enough critical mass to keep going. ``And for us to dance our way into a `merger' with another agency would have been fairly transparent,'' he added, ``because there's not that much left to merge with.'' The client roster of Wells and its affiliate, Moss/Dragoti , had dwindled to five accounts. With the closing date posted, four of the five have started leaving for other agencies; the fifth, Chase Manhattan Corp. , had already placed its account in review. Omnicom moved to acquire GGT after the loss of Wells ' largest client, Procter &AMP; Gamble _ with billings estimated at $125 million _ plunged GGT into crisis. The hope had been widespread that Wells would somehow remain in business, perhaps as a unit of an Omnicom agency like DDB Needham Worldwide or TBWA Chiat/Day . But ``there was virtually no revenue left'' because of the substantial account losses, Wren said. That sealed Wells ' fate, on a Friday the 13th no less. <ANNOTATION> (STORY CAN END HERE _ OPTIONAL MATERIAL FOLLOWS) </ANNOTATION> ``I feel terrible about this,'' said Charlie Moss , the chairman of Moss/Dragoti , who was one of the initial employees of Wells when it opened as Wells, Rich, Greene in 1966 . ``It's very sad to see it happen.'' ``We should have a memorial service someday,'' he added, ``to say goodbye in a nice way.'' Two clients that had been handled by Moss/Dragoti are headed for DDB Needham in New York along with Moss and his longtime partner, Stan Dragoti . One is Hertz Corp. and the other is the History Channel , the cable television network owned by Walt Disney Co. , Hearst Corp. and the NBC unit of General Electric Co. Combined billings are estimated at more than $40 million . 
Ken Kaess , president of the DDB Needham U.S. division, was pleased to land two additional clients unexpectedly but was dismayed at the circumstances. ``It's too bad,'' he said. `` Wells is a terrific brand name.'' The Wells name had for years been burnished by Mary Wells Lawrence , whose fierce devotion to clients and creative instincts made her an advertising legend. Her fledgling agency grew quickly from the 1960s into the 1970s , attracting blue-chip clients like Ford Motor Co. ; ITT Sheraton Corp. ; IBM ; Miles Inc. , now part of Bayer ; Philip Morris Cos. ; and P&AMP;G . `` Mary created an environment and an attitude here,'' Davis said, ``and underpinned it with great people who delivered on that promise. Wells could always be depended on for something new and different.'' But after the agency was sold in 1990 to BDDP , a French agency company, Ms. Lawrence withdrew from active involvement, and financial problems began to impede Wells ' performance. Wells then suffered through waves of account losses and management tumult as well as another change in ownership when BDDP was acquired last year by GGT . For instance, one six-month period in 1995 brought the sudden departures from Wells of the chairman and chief executive, the president and the chief financial officer. ``It was a big roller coaster,'' said Linda Kaplan Thaler , who worked at Wells in a top creative post from 1994 to 1997 . ``I went on a maternity leave, and when I came back, nobody was there. ``It was an unfortunate sequence of happenstances,'' she added. ``No one person is responsible; everyone there had the best of intentions.'' Kaplan Thaler Group in New York , an advertising and production company that Ms. Kaplan Thaler opened after leaving Wells , is being awarded the account of another Wells client, Toys ``R'' Us , with billings estimated at $30 million to $40 million . Ms. Kaplan Thaler had worked for Toys ``R'' Us at Wells and at the J. Walter Thompson New York unit of WPP Group , where she wrote the Toys ``R'' Us jingle. Ms. Kaplan Thaler had been bound by a noncompetition agreement with Wells that expires in July . But the closing led Wells to waive the stricture so the account could move now. The fourth Wells account moving to another agency is the packaged paper-products division of Georgia-Pacific Corp. , which arrived at Wells only last fall . Like Hertz and the History Channel , it is also leaving for an Omnicom -owned agency, the BBDO South unit of BBDO Worldwide . BBDO South in Atlanta , which handles corporate advertising for Georgia-Pacific , will assume additional duties for brands like Angel Soft toilet tissue and Sparkle paper towels, said Ken Haldin , a spokesman for Georgia-Pacific in Atlanta . Billings were estimated at $30 million to $40 million . Omnicom anticipates completing the acquisition of GGT in the next two weeks , Wren said, adding that he planned to sponsor a `` three-day job fair to try to place as many Wells employees as possible'' at Omnicom agencies and subsidiaries. ``The employees are the innocent victims of all these events,'' he added. Davis , who said he would consider his next move after Wells closed, agreed with Wren . ``I told the staff this didn't have anything to do with them,'' he said, referring to his remarks at a meeting at the Wells office on Friday afternoon. ``I've never seen such passion and conviction in the face of having to read about the agency in the papers every day.''"

Now say that we want to extract the relationship in. Occurrences of in can be shown by using concordance in NLTK; however, this gives us no way to use information about named entities to improve the filtering.


In [10]:
nltk.Text(nltk.word_tokenize(nyt)).concordance("in")


Displaying 21 of 21 matches:
he shutdown will affect 133 employees in New York , though at the start of the
 the giant agency company that agreed in late January to acquire GGT . They wi
st six months to as much as a year '' in severance pay , said John Wren , the 
sident and chief executive at Omnicom in New York . The shutdown will abruptly
ells has lost more than $ 330 million in billings from clients like Bristol-My
orp. , had already placed its account in review . Omnicom moved to acquire GGT
pread that Wells would somehow remain in business , perhaps as a unit of an Om
en it opened as Wells , Rich , Greene in 1966 . `` It 's very sad to see it ha
day , '' he added , `` to say goodbye in a nice way . '' Two clients that had 
ss/Dragoti are headed for DDB Needham in New York along with Moss and his long
nt . '' But after the agency was sold in 1990 to BDDP , a French agency compan
ment tumult as well as another change in ownership when BDDP was acquired last
. For instance , one six-month period in 1995 brought the sudden departures fr
a Kaplan Thaler , who worked at Wells in a top creative post from 1994 to 1997
f intentions . '' Kaplan Thaler Group in New York , an advertising and product
ion agreement with Wells that expires in July . But the closing led Wells to w
h unit of BBDO Worldwide . BBDO South in Atlanta , which handles corporate adv
din , a spokesman for Georgia-Pacific in Atlanta . Billings were estimated at 
tes completing the acquisition of GGT in the next two weeks , Wren said , addi
ever seen such passion and conviction in the face of having to read about the 
ce of having to read about the agency in the papers every day . ''

The idea is to define the tuple (X, α, Y), where X and Y are named entities, and α is a regular expression that joins the named entities.

Let's say that we want to find only relationships of the type ORGANIZATION in GPE. Therefore, we need to define our α as the pattern given by the regular expression .*\bin\b(?!\b.+ing), which includes a negative lookahead assertion that allows us to disregard strings such as success in supervising the transition of, where in is followed by a gerund.
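We can see the lookahead at work on its own; the filler strings below are just illustrative:

import re

pattern = re.compile(r'.*\bin\b(?!\b.+ing)')

print(pattern.match("in"))                                # a match: bare "in"
print(pattern.match("in supervising the transition of"))  # None: a gerund follows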

To extract these relationship tuples, NLTK provides the function extract_rels(), which receives X, Y, a chunked sentence, and the pattern (α). The functions rtuple() and clause() each receive a relationship and print its tuple or clause version, respectively.


In [11]:
import re
from nltk.sem import extract_rels, rtuple, clause

sentences = nltk.sent_tokenize(nyt)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]

pattern = re.compile(r'.*\bin\b(?!\b.+ing)')

for sent in tagged_sentences:
    rels = extract_rels('ORGANIZATION', 'GPE', nltk.ne_chunk(sent), pattern=pattern)
    for rel in rels:
        print(clause(rel, "in"), "\n\t", rtuple(rel))


in('ddb_needham', 'new_york') 
	 [ORG: 'DDB/NNP Needham/NNP'] 'in/IN' [GPE: 'New/NNP York/NNP']
in('bbdo_south', 'atlanta') 
	 [ORG: 'BBDO/NNP South/NNP'] 'in/IN' [GPE: 'Atlanta/NNP']

Basic Sentiment Analysis

Another application of machine learning in NLP is the use of classifiers to determine whether a text expresses something positive or negative. This is usually referred to as Sentiment Analysis.

A sentiment analyzer is basically a classifier trained on a dataset. Let's say that we want to analyze sentiment in tweets and we have the next dataset with tagged tweets:


In [12]:
dataset = [
    ('I love this sandwich.', 'pos'),
    ('This is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('This is my best work.', 'pos'),
    ("What an awesome view", 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('He is my sworn enemy!', 'neg'),
    ('My boss is horrible.', 'neg'),
    ('The beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ("I feel amazing!", 'pos'),
    ('Gary is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'neg')
]

First of all, we need to split our dataset into training and testing data, so we can train our classifier and then test how well it works. There is much more to this topic, but we will not cover it in this class. For now, let's say that we split the training and testing sets as follows.


In [13]:
train = dataset[:10]
test = dataset[10:]
train + test


Out[13]:
[('I love this sandwich.', 'pos'),
 ('This is an amazing place!', 'pos'),
 ('I feel very good about these beers.', 'pos'),
 ('This is my best work.', 'pos'),
 ('What an awesome view', 'pos'),
 ('I do not like this restaurant', 'neg'),
 ('I am tired of this stuff.', 'neg'),
 ("I can't deal with this", 'neg'),
 ('He is my sworn enemy!', 'neg'),
 ('My boss is horrible.', 'neg'),
 ('The beer was good.', 'pos'),
 ('I do not enjoy my job', 'neg'),
 ("I ain't feeling dandy today.", 'neg'),
 ('I feel amazing!', 'pos'),
 ('Gary is a friend of mine.', 'pos'),
 ("I can't believe I'm doing this.", 'neg')]

Using the TextBlob API, we create a new classifier by passing training data into the constructor of a NaiveBayesClassifier. Then, we can start classifying arbitrary text by calling the NaiveBayesClassifier.classify(text) method.


In [14]:
from textblob.classifiers import NaiveBayesClassifier
cl = NaiveBayesClassifier(train)
cl.classify("Their burgers are amazing")  # "pos"


Out[14]:
'pos'

In [15]:
cl.classify("I don't like their pizza.")  # "neg"


Out[15]:
'neg'

Remember that we left part of the dataset for testing; now we check the accuracy on the test set. As a good practice, you want to avoid having only training data, since then your classifier or model will overfit the data and we will have no way to evaluate its accuracy. On the other hand, different partitions of the dataset can lead to different accuracies. To reduce this variability, a technique known as cross-validation is usually used.
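The intuition behind cross-validation is simple: split the data into k folds, hold each fold out for testing once while training on the rest, and average the accuracies. A minimal, illustrative sketch on our toy dataset (a real workflow would shuffle the data first and use a dedicated library such as scikit-learn):

from textblob.classifiers import NaiveBayesClassifier

def cross_validate(data, k=4):
    fold_size = len(data) // k
    accuracies = []
    for i in range(k):
        # Hold out the i-th fold for testing, train on the rest
        held_out = data[i * fold_size:(i + 1) * fold_size]
        rest = data[:i * fold_size] + data[(i + 1) * fold_size:]
        accuracies.append(NaiveBayesClassifier(rest).accuracy(held_out))
    return sum(accuracies) / k

cross_validate(dataset)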


In [16]:
cl.accuracy(test)


Out[16]:
0.8333333333333334

An accuracy of 0.83, or 83%, means that 5 of the 6 test sentences are assessed correctly. It might seem good enough but, for this use case of sentiment analysis, it is actually not that good. The way this classifier works is usually known as bag-of-words, where each document is seen as a bag containing its words. Using the training set, and according to different measures, such as the frequency distribution of individual words in the sentences of each class, feature scores are calculated. We can see which were the most informative features used in our very naive model.


In [17]:
# Most Informative Features
cl.show_informative_features(5)


Most Informative Features
          contains(this) = True              neg : pos    =      2.3 : 1.0
          contains(this) = False             pos : neg    =      1.8 : 1.0
            contains(an) = False             neg : pos    =      1.6 : 1.0
          contains(This) = False             neg : pos    =      1.6 : 1.0
             contains(I) = True              neg : pos    =      1.4 : 1.0

Therefore, sentences containing the word "this" but not containing the word "an" tend to be negative.
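These contains(word) features are exactly what the classifier builds from each sentence. TextBlob exposes that step through the classifier's extract_features method, so we can peek at the bag-of-words representation of any sentence (the result is a dictionary with one boolean per word seen during training):

# One contains(word) boolean per word in the training vocabulary
cl.extract_features("I feel amazing!")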

One way to improve our accuracy is by adding more training data, for example using the nltk.corpus.movie_reviews corpus, which contains movie reviews with their associated labels, as sketched below. We could also collect our own data if we were interested in a specific topic, or even in classes other than positive and negative.
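A rough sketch of that idea, assuming the corpus has been downloaded with nltk.download('movie_reviews') (full reviews are long documents, so training on many of them gets slow):

import random
from nltk.corpus import movie_reviews

# Build (text, label) pairs from the corpus and shuffle them
extra = [(movie_reviews.raw(fileid), category)
         for category in movie_reviews.categories()
         for fileid in movie_reviews.fileids(category)]
random.shuffle(extra)

cl.update(extra[:20])  # feed the classifier a small sample of new data
cl.accuracy(test)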

TextBlob comes with two already trained sentiment analyzers. One is the default, PatternAnalyzer, which measures not only sentiment (polarity) but also subjectivity; the second one, NaiveBayesAnalyzer, actually comes from NLTK. Both return named tuples. Let's see which one has better accuracy.


In [18]:
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer, PatternAnalyzer

In [19]:
cl1 = NaiveBayesAnalyzer()
TextBlob("I love the smell of napalm in the morning", analyzer=cl1).sentiment


Out[19]:
Sentiment(classification='pos', p_pos=0.6406723473877074, p_neg=0.3593276526122936)

In [20]:
cl2 = PatternAnalyzer()
TextBlob("I love the smell of napalm in the morning", analyzer=cl2).sentiment


Out[20]:
Sentiment(polarity=0.5, subjectivity=0.6)

Accuracy is just the proportion of true results (both true positives and true negatives) among the total number of cases examined.


In [21]:
sum([cl1.analyze(s).classification == tag for (s, tag) in dataset]) / len(dataset)


Out[21]:
0.625

In [22]:
sum([cl2.analyze(s).polarity > 0 if tag == "pos" else cl2.analyze(s).polarity < 0 for (s, tag) in dataset]) / len(dataset)


Out[22]:
0.625

Interestingly, for our toy dataset both classifiers perform equally badly :D

*Even Twitter supports it now!*

Mining HTML, Twitter and Facebook

Sentiment analysis can also be used to analyze news, political programs, or even plays. One common use is to analyze how people react on social media. For example, let's say that you are a brand releasing a new product, and a couple of days after the release you want to know how the public reacted to it. Sentiment analysis can come in handy in this situation, because analyzing tweets or Facebook statuses can tell you how people are actually receiving the new product and how they feel about it.

There are basically two ways of collecting data from the web: web scraping, and programmatic APIs.

Web Scraping

Web scraping allows us to extract information from websites. Usually, scraping programs simulate human browsing by either implementing the low-level Hypertext Transfer Protocol (HTTP) or embedding a fully-fledged web browser, such as Internet Explorer or Mozilla Firefox.

For basic extraction of the generated or static HTML of websites, Python provides several utilities. Perhaps the easiest to use is requests, which allows us to perform all the verbs specified in the HTTP protocol. The two most common verbs are GET and POST. GET is used every time a website is accessed, either by a browser or by another piece of software. POST is what happens when a form is submitted on a website. There are more (PUT, PATCH, DELETE, etc.), but those have special uses.
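As a quick illustration of both verbs, we can hit httpbin.org, a small service that simply echoes back what it receives (the parameters here are made up):

import requests

# GET with query-string parameters, as a browser does when loading a URL
requests.get("http://httpbin.org/get", params={"q": "advertising"})

# POST with form data, as a browser does when submitting a form
requests.post("http://httpbin.org/post", data={"name": "Wells"})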

For example, let's say we want to retrieve the frontpage of the Globe and Mail.


In [23]:
import requests

response = requests.get("http://www.theglobeandmail.com/")
response


Out[23]:
<Response [200]>

After a request is performed with any of the verbs, the server replies with a code indicating whether there was an error or not. The code 200 means that everything went well. Other codes include 404 for not found, or 500 for server error.


In [24]:
if response.ok:
    print(response.text[:1000])


<!DOCTYPE html>
<!--[if lt IE 7]><html lang="en-ca" class="ie6 ltie9" xmlns:fb="http://www.facebook.com/2008/fbml" ><![endif]-->
<!--[if IE 7]><html lang="en-ca" class="ie7 ltie9" xmlns:fb="http://www.facebook.com/2008/fbml" ><![endif]-->
<!--[if IE 8]><html lang="en-ca" class="ie8 ltie9" xmlns:fb="http://www.facebook.com/2008/fbml" ><![endif]-->
<!--[if IE 9 ]><html lang="en-ca" class="ie9" xmlns:fb="http://www.facebook.com/2008/fbml" ><![endif]-->
<!--[if (gt IE 9)|!(IE)]><!-->
<html lang="en-ca" xmlns:fb="http://www.facebook.com/2008/fbml"><!--<![endif]-->
<head>
<title>Home - The Globe and Mail</title>
<link rel="dns-prefetch" href="http://static.theglobeandmail.ca">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta http-equiv="expires" content="0">
<meta http-equiv="pragma" content="no-cache">
<meta http-equiv="window-target" content="_top">
<meta http-equiv="Content-Language" content="en-ca">
<

Even after printing 1000 characters, there is still almost no content. That happens because websites are written using HTML, a markup language that embeds how the text is shown along with the text itself. All those symbols enclosed in angle brackets, <>, are only interpreted by the browser, which handles them and renders the proper style or layout for the user. For example, <strong>Truth is out there</strong> is actually rendered as **Truth is out there**: the <strong> tag makes the browser print the enclosed text in bold.

HTML documents are actually trees, and as such they can be traversed and accessed. BeautifulSoup is a Python library that makes it easier to handle HTML documents.


In [25]:
from bs4 import BeautifulSoup

html = BeautifulSoup(response.text)
print(html.title)
print(html.title.name, ":", html.title.string)


<title>Home - The Globe and Mail</title>
title : Home - The Globe and Mail

But most of the time we will just want a clean version of the text.


In [26]:
print("...", html.text[4360:4430], "...")


... 
Are chilly Canadian-U.S. cross-border relations about to get frostier ...

Unfortunately, modern websites usually include some sort of interactivity in the form of JavaScript that modifies the content of the page in real time for the user. That makes it virtually impossible for requests to capture the actual content, since it is JavaScript that actually writes that content in the browser. There are ways to solve this, like driving PhantomJS or SlimerJS through something like Selenium, but it's a bit more complicated than just using requests.
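For reference, a minimal sketch of that approach with Selenium driving PhantomJS (assuming both are installed; exact APIs vary between versions):

from selenium import webdriver

driver = webdriver.PhantomJS()  # a headless browser that executes JavaScript
driver.get("http://www.theglobeandmail.com/")
rendered = driver.page_source   # the HTML after the scripts have run
driver.quit()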

Programmatic APIs

A programmatic API is an endpoint provided by a website that, instead of returning HTML, returns serialized information such as JSON. JSON stands for JavaScript Object Notation and is probably the most used serialization format on the web. Python can write its own data structures, such as lists and dictionaries, into JSON, and read them back.


In [27]:
import json

d = {"key": [1, 2, 3]}
json.dumps(d)


Out[27]:
'{"key": [1, 2, 3]}'

In [28]:
json.loads('{"key": [1, 2, 3]}')


Out[28]:
{'key': [1, 2, 3]}

In [29]:
json.loads(json.dumps(d)) == d


Out[29]:
True

Major sites usually provide a public API, so instead of scraping the data (which, depending on the site, can even be illegal), you can just use their API to interact with the site. After retrieving the data and loading it into Python from a string, you will have a functional Python data structure. For example, let's access the public GitHub API for events and print the username of each event.


In [30]:
gh_resp = requests.get('https://api.github.com/events')
gh = json.loads(gh_resp.text)
for event in gh:
    print(event["actor"]["login"])


SupSaiYaJin
mathslinux
leyleo
nathanchen
maxbilbow
freepbx-tango
fang289040324
f0o
f0o
geekonweb
masakyst
theemathas
pierrejoye
Ming-Tang
hdl
Ming-Tang
d-plaindoux
snockminder
infinitus
williamshowalter
niranjv
Dasharathgoswami
mkc188
masakyst
ciceropablo
Dasharathgoswami
cngo-github
overtrue
keyansf
jackstine

In fact, some APIs are so popular that they have their own Python clients, which let you access them without worrying about the underlying requests being performed.
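Twitter is a good example: its API is usually accessed through the tweepy client. A minimal sketch, assuming tweepy is installed and using placeholder credentials (you would need to register an application with Twitter to obtain real ones):

import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth)

# Each status text could then be fed to one of the sentiment analyzers above
for status in api.search(q="new product"):
    print(status.text)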

Activity

Write a function `wikipedia_entities(url, named_entity)` that receives a `url` from the English Wikipedia and returns the frequency of the terms tagged as `named_entity`. For example, after executing `wikipedia_entities("http://en.wikipedia.org/wiki/Iraq_War", "GPE")`, the first 15 results should be as follows (one possible solution sketch appears after the expected results).

    Iraq              649
    Iraqi             262
    U.S.              199
    Baghdad            82
    United States      73
    Iraqis             35
    New York Times     35
    US                 26
    British            22
    Iranian            22
    American           21
    Iran               21
    Guardian           20
    London             19
    Main               15

*Next class will be class 17*