In this class you are expected to learn:
In [1]:
%matplotlib inline
import nltk
import textblob as tb
From NLTK book: "Information comes in many shapes and sizes. One important form is structured data, where there is a regular and predictable organization of entities and relationships. For example, we might be interested in the relation between companies and locations. Given a particular company, we would like to be able to identify the locations where it does business; conversely, given a location, we would like to discover which companies do business in that location. If our data is in tabular form [...] then answering these queries is straightforward."
However, getting similar outcomes from text is a bit trickier. For example, consider the following snippet (from `nltk.corpus.ieer`, for fileid NYT19980315.0085).
The fourth Wells account moving to another agency is the packaged paper-products division of Georgia-Pacific Corp., which arrived at Wells only last fall. Like Hertz and the History Channel, it is also leaving for an Omnicom-owned agency, the BBDO South unit of BBDO Worldwide. BBDO South in Atlanta, which handles corporate advertising for Georgia-Pacific, will assume additional duties for brands like Angel Soft toilet tissue and Sparkle paper towels, said Ken Haldin, a spokesman for Georgia-Pacific in Atlanta.
After reading it, it wouldn't be very hard for you to answer a question like "Which organizations operate in Atlanta?" Making a machine understand the text and come up with the answer is a much harder task. This is because machines are not very good at dealing with unstructured information, like the snippet above.
It would be very nice if a machine could understand the meaning of the text; in fact, that's one approach to the problem. However, because understanding meaning is beyond the scope of this course, we will focus on creating structured information from text that we can later query using another method, such as Structured Query Language (SQL), for example. This method of getting meaning from text is called Information Extraction.
"Information Extraction has many applications, including business intelligence, resume harvesting, media analysis, sentiment detection, patent search, and email scanning. A particularly important area of current research involves the attempt to extract structured data out of electronically-available scientific literature, especially in the domain of biology and medicine."
The techniques that we have seen so far, i.e. segmenting, tokenizing, and part-of-speech tagging, are all necessary steps for performing information extraction (a short sketch of this pipeline follows the table below). Once we have POS tags, we can search for specific types of entities. Noun phrases are the first kind of entities we can recognize; however, they don't provide the fine-grained information we are looking for. Common named entity types include ORGANIZATION, PERSON, LOCATION, DATE, TIME, MONEY, and GPE (geo-political entity). These are usually called classes (alluding to the machine learning algorithms used).
| NE Type | Examples |
|---|---|
| ORGANIZATION | Georgia-Pacific Corp., WHO |
| PERSON | Eddy Bonte, President Obama |
| LOCATION | Murray River, Mount Everest |
| DATE | June, 2008-06-29 |
| TIME | two fifty a m, 1:30 p.m. |
| MONEY | 175 million Canadian Dollars, GBP 10.40 |
| PERCENT | twenty pct, 18.75 % |
| FACILITY | Washington Monument, Stonehenge |
| GPE | South East Asia, Midlothian |
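To make that pipeline concrete, here is a minimal sketch that segments a string into sentences, tokenizes them, and POS-tags them; the example string is adapted from the snippet above, and any text would work:

```python
raw = ("BBDO South in Atlanta handles corporate advertising for Georgia-Pacific. "
       "It will assume additional duties for brands like Angel Soft.")
sentences = nltk.sent_tokenize(raw)                   # segmentation
tokens = [nltk.word_tokenize(s) for s in sentences]   # tokenization
tagged = [nltk.pos_tag(t) for t in tokens]            # part-of-speech tagging
tagged[0][:6]                                         # first few (word, tag) pairs
```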
The goal of a named entity recognition (NER) system is to identify all textual mentions of the named entities. This can be broken down into two sub-tasks: identifying the boundaries of the NE, and identifying its type.
The task of tagging the different named entities with their types is known as a supervised learning task, where the machine is fed with training data (already tagged named entities), and outputs a classifier able to label new entities. NLTK comes with a built-in classifier that has already been trained to recognize named entities. The function `nltk.ne_chunk()` receives POS-tagged words and produces trees with the named entity types. If we set the parameter `binary=True`, then named entities are just tagged as NE; otherwise, the classifier adds category labels such as `PERSON`, `ORGANIZATION`, and `GPE`.
Let's take the first snippet from the frontpage of the New York Times and extract just the named entities.
In [2]:
text = (
u"Israel holds national elections Tuesday to determine if "
u"Prime Minister Benjamin Netanyahu and his Likud Party will "
u"earn a third consecutive term, or if the Zionist Union will "
u"garner enough seats to form a new government."
)
sent = nltk.pos_tag(nltk.tokenize.word_tokenize(text))
print(nltk.ne_chunk(sent, binary=True))
By removing the `binary=True` argument, we will get the named entity types.
In [3]:
print(nltk.ne_chunk(sent))
Extracting NEs is just a matter of traversing the tree and filtering by the NEs we want.
In [4]:
def filter_nes(tree):
    return tree.label() in ["ORGANIZATION", "PERSON", "LOCATION", "DATE", "TIME", "MONEY", "GPE"]

tree = nltk.ne_chunk(sent)
for subtree in tree.subtrees(filter=filter_nes):
    print(subtree)
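If we only want the entity strings themselves rather than the subtrees, we can join the words in each subtree's leaves. A minimal sketch, reusing the `tree` and `filter_nes` defined in the cell above:

```python
# Each leaf of an ne_chunk subtree is a (word, POS-tag) pair
entities = [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(filter=filter_nes)]
entities
```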
Sometimes, the accuracy achieved by the built-in NE tagger in NLTK might not be enough. For those cases, NLTK can interface with the state-of-the-art NE and IE systems from Stanford University, which have been in development for almost 10 years now.
First we need to have a Java runtime environment (JRE). In order to see if you need to install it, just execute the next cell. If you see an error of any kind, then you need to install Java.
In [5]:
! java -version
The next step is to download the Stanford Named Entity Recognizer (3.4.1) and uncompress it to a path of your choice. In my case I put it under `$HOME/bin/stanford-ner`. To check that everything is OK, just execute the next cell, changing `$HOME/bin/stanford-ner` to your path.
Note: If you have Java 8, download version 3.5.1 instead.
In [6]:
! export NER=$HOME/bin/stanford-ner; \
java -mx500m -cp $NER/stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
-loadClassifier $NER/classifiers/english.all.3class.distsim.crf.ser.gz \
-textFile $NER/sample.txt
The last step is to integrate Stanford NER into NLTK by using the `nltk.tag.stanford` module.
The class `nltk.tag.stanford.NERTagger`, used for NE tagging with the Stanford tagger, receives as inputs the path to a trained `classifier` model and the path to the `stanford-ner.jar` file, both found inside the Stanford NER zip.
In [7]:
from nltk.tag.stanford import NERTagger
ner, = ! echo $HOME/bin/stanford-ner
st = NERTagger(ner + '/classifiers/english.all.3class.distsim.crf.ser.gz',
ner + '/stanford-ner.jar')
st.tag(nltk.tokenize.word_tokenize(text))
Out[7]:
Included with Stanford NER are a 4 class model trained for CoNLL, a 7 class model trained for MUC, and a 3 class model trained on both data sets for the intersection of those class sets.
These are the English models included (models for Spanish, German and Chinese are also available):
english.all.3class.distsim.crf.ser.gz
english.all.7class.distsim.crf.ser.gz
english.conll.4class.distsim.crf.ser.gz
english.muc.7class.distsim.crf.ser.gz
english.nowiki.3class.distsim.crf.ser.gz
Stanford NER also has a web demo of their system.
Activity
Write a function `extract_geo(text)` that receives `text` and returns a list of the geo-political entities found. For example, `extract_geo("I was born in Seville, Spain")` should return `['Seville', 'Spain']`.
In [8]:
def extract_geo(text):
    ...
extract_geo("I was born in Seville, Spain")
Out[8]:
Finally, the information extraction system looks at entities that are mentioned near one another in the text, and tries to determine whether specific relationships hold between those entities.
We will typically be looking for relations between specified types of named entity. One way of approaching this task is to initially look for all triples of the form (X, α, Y), where X and Y are named entities of the required types, and α is the string of words that intervenes between X and Y (in clausal form, α(X, Y)). We can then use regular expressions to pull out just those instances of α that express the relation that we are looking for.
Let's take the following text as an example:
In [9]:
doc = nltk.corpus.ieer.parsed_docs('NYT_19980315')[10]
nyt = " ".join([leaf for leaf in doc.text.leaves()])
nyt
Out[9]:
And say that we want to extract the relationship `in`. Occurrences of `in` can be shown by using a concordance in NLTK; however, no information about named entities can be used to improve the filtering.
In [10]:
nltk.Text(nltk.word_tokenize(nyt)).concordance("in")
The idea is to define the tuple (X, α, Y), where X and Y are named entities, and α is a regular expression that joins the named entities.
Let's say that we want to find only relationships of the type ORGANIZATION `in` GPE. Therefore, we need to define our α as the pattern given by the regular expression `.*\bin\b(?!\b.+ing)`, which includes a negative lookahead assertion that allows us to disregard strings such as "success in supervising the transition of", where `in` is followed by a gerund.
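As a quick, informal check of that lookahead (the first phrase is taken from the earlier snippet, the second from the example above):

```python
import re

pattern = re.compile(r'.*\bin\b(?!\b.+ing)')

print(pattern.match("BBDO South in Atlanta"))                     # matches: "in" precedes a GPE
print(pattern.match("success in supervising the transition of"))  # None: "in" precedes a gerund
```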
To extract these relation tuples, NLTK provides the function `extract_rels()`, which receives X, Y, a chunked sentence, and the pattern (α). The functions `rtuple()` and `clause()` both receive a relation and print its tuple version or its clause version, respectively.
In [11]:
import re
from nltk.sem import extract_rels, rtuple, clause
sentences = nltk.sent_tokenize(nyt)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
pattern = re.compile(r'.*\bin\b(?!\b.+ing)')
for sent in tagged_sentences:
    rels = extract_rels('ORGANIZATION', 'GPE', nltk.ne_chunk(sent), pattern=pattern)
    for rel in rels:
        print(clause(rel, "in"), "\n\t", rtuple(rel))
Another application of machine learning in NLP is the use of classifiers to determine whether a text expresses something positive or negative. This is usually referred to as Sentiment Analysis.
A sentiment analyzer is basically a classifier trained on a dataset. Let's say that we want to analyze sentiment in tweets and we have the following dataset of tagged tweets:
In [12]:
dataset = [
('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg'),
('The beer was good.', 'pos'),
('I do not enjoy my job', 'neg'),
("I ain't feeling dandy today.", 'neg'),
("I feel amazing!", 'pos'),
('Gary is a friend of mine.', 'pos'),
("I can't believe I'm doing this.", 'neg')
]
First of all, we need to split our dataset into training and testing data, so we can train our classifier and then test how well it works. There is much more on this topic, but we will not cover it in this class. For now, let's say that we split the training and testing sets as follows.
In [13]:
train = dataset[:10]
test = dataset[10:]
train + test
Out[13]:
Using the TextBlob API, we create a new classifier by passing training data into the constructor for a `NaiveBayesClassifier`. Then, we can start classifying arbitrary text by calling the `NaiveBayesClassifier.classify(text)` method.
In [14]:
from textblob.classifiers import NaiveBayesClassifier
cl = NaiveBayesClassifier(train)
cl.classify("Their burgers are amazing") # "pos"
Out[14]:
In [15]:
cl.classify("I don't like their pizza.") # "neg"
Out[15]:
Remember that we left part of the dataset for testing; now we can check the accuracy on the test set. As a good practice, you want to avoid having only training data, since then your classifier or model will overfit the data and you will not have a way to evaluate its accuracy. On the other hand, different partitions of the dataset can lead to different accuracies. To reduce this variability, a technique known as cross-validation is usually used (a small sketch is included a few cells below).
In [16]:
cl.accuracy(test)
Out[16]:
An accuracy of 0.83, or 83%, means that 5 out of 6 test sentences are assessed correctly. It might seem good enough but, for this use case of sentiment analysis, it is actually not that good. The way this classifier works is usually known as bag-of-words, where each document is seen as a bag containing words. Using the training set, scores are calculated according to different measures, such as the frequency distribution of individual words across the sentences of each class. We can see which were the most informative features used in our very naive model.
In [17]:
# Most Informative Features
cl.show_informative_features(5)
Therefore, sentences containing the word "this" but not containing the word "an" tend to be negative.
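To make the bag-of-words representation more concrete, here is a minimal sketch of the kind of feature dictionary such a classifier works with; the helper name and the exact feature format are illustrative, and TextBlob's built-in feature extractor may differ in its details:

```python
def bag_of_words(document):
    # Illustrative feature extractor: record which words appear in the document
    tokens = nltk.word_tokenize(document.lower())
    return {"contains({})".format(word): True for word in tokens}

bag_of_words("I love this sandwich.")
```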
One way to improve our accuracy is by adding more training data, for example using the `nltk.corpus.movie_reviews` corpus, which contains movie reviews and their associated tags. We could also collect our own data if we were interested in a specific topic, or even in classes other than positive and negative.
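As mentioned above, cross-validation reduces the variability introduced by a single train/test split. Below is a minimal sketch of k-fold cross-validation over our toy dataset; the `cross_validate` helper is illustrative (a real project would shuffle the data first and typically rely on a library implementation):

```python
from textblob.classifiers import NaiveBayesClassifier

def cross_validate(data, k=4):
    """Average accuracy over k held-out folds (no shuffling, for simplicity)."""
    fold_size = len(data) // k
    scores = []
    for i in range(k):
        held_out = data[i * fold_size:(i + 1) * fold_size]            # fold used for testing
        training = data[:i * fold_size] + data[(i + 1) * fold_size:]  # everything else
        classifier = NaiveBayesClassifier(training)
        scores.append(classifier.accuracy(held_out))
    return sum(scores) / k

cross_validate(dataset, k=4)
```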
TextBlob has two already-trained sentiment classifiers. One is the default, `PatternAnalyzer`, which measures not only sentiment (polarity) but also subjectivity; the second one is actually from NLTK, the `NaiveBayesAnalyzer`. Both return named tuples. Let's see which one has better accuracy.
In [18]:
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer, PatternAnalyzer
In [19]:
cl1 = NaiveBayesAnalyzer()
TextBlob("I love the smell of napalm in the morning", analyzer=cl1).sentiment
Out[19]:
In [20]:
cl2 = PatternAnalyzer()
TextBlob("I love the smell of napalm in the morning", analyzer=cl2).sentiment
Out[20]:
Accuracy is just the proportion of true results (both true positives and true negatives) among the total number of cases examined.
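In terms of the usual confusion-matrix counts, that is:

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$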
In [21]:
sum([cl1.analyze(s).classification == tag for (s, tag) in dataset]) / len(dataset)
Out[21]:
In [22]:
sum([cl2.analyze(s).polarity > 0 if tag == "pos" else cl2.analyze(s).polarity < 0 for (s, tag) in dataset]) / len(dataset)
Out[22]:
Interestingly, for our toy dataset both classifiers perform equally badly :D
Sentiment analysis can also be used to analyze news, political programs, or even plays. One common use is to analyze how people react on social media. For example, let's say that you are a brand releasing a new product, and a couple of days after the release you want to know how the public reacted to it. Sentiment analysis can come in handy in this situation, because analyzing tweets or Facebook statuses can tell you how people are actually receiving the new product and how they feel about it.
There are basically two ways of collecting data from the web: web scraping, and programmatic APIs.
Web scraping allows us to extract information from websites. Usually, such software programs simulate human exploration by either implementing the low-level Hypertext Transfer Protocol (HTTP) or embedding a fully-fledged web browser, such as Internet Explorer or Mozilla Firefox.
For basic extraction of the generated or static HTML of websites, Python provides several utilities. Perhaps the easiest to use is `requests`, which allows us to perform all the verbs specified in the HTTP protocol. The two most common verbs are GET and POST. GET is used every time a website is accessed by either a browser or another piece of software. POST is what happens when a form is submitted on a website. There are more (PUT, PATCH, DELETE, etc.), but those have special uses.
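Only GET is needed for the rest of this class, but as a brief, hedged illustration of POST, here is a minimal sketch that submits form-style data to the httpbin.org testing service (the URL and the form fields are purely illustrative):

```python
import requests

# httpbin.org echoes back the form data it receives
response = requests.post("https://httpbin.org/post", data={"query": "globe and mail"})
response.status_code  # 200 if the request succeeded
```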
For example, let's say we want to retrieve the frontpage of the Globe and Mail.
In [23]:
import requests
response = requests.get("http://www.theglobeandmail.com/")
response
Out[23]:
After a request is performed with any of the verbs, the server replies with a code indicating whether there was an error or not. The code 200 means that everything went well. Other codes include 404 for not found, or 500 for a server error.
In [24]:
if response.ok:
    print(response.text[:1000])
Even after printing 1000 characters, there is still almost no content. That happens because websites are written using HTML, a markup language that embeds how the text is shown alongside the text itself. All those symbols enclosed in angle brackets, `<>`, are only interpreted by the browser, which handles them and renders the proper style or layout for the user. For example, `<strong>Truth is out there</strong>` is actually rendered as **Truth is out there**. The `<strong>` tag makes the browser render the enclosed text in bold.
HTML documents are actually trees, and as such can be traversed and accessed. `BeautifulSoup` is a Python library that makes it easier to handle HTML documents.
In [25]:
from bs4 import BeautifulSoup
html = BeautifulSoup(response.text)
print(html.title)
print(html.title.name, ":", html.title.string)
But most of the time we will just want a clean version of the text.
In [26]:
print("...", html.text[4360:4430], "...")
Unfortunately, modern websites usually include some sort of interactivity in the form of JavaScript that modifies the content of the page in real time for the user. That makes it virtually impossible for `requests` to capture the actual content, since it is JavaScript that actually writes that content in the browser. There are ways to solve this, like using PhantomJS or SlimerJS through something like Selenium, but it's a bit more complicated than just using `requests`.
A programmatic API is an endpoint provided by a website that, instead of returning HTML, returns serialized information such as JSON. JSON stands for JavaScript Object Notation and is probably the most widely used serialization format on the web. Python can read and write its own data structures, such as lists and dictionaries, to and from JSON.
In [27]:
import json
d = {"key": [1, 2, 3]}
json.dumps(d)
Out[27]:
In [28]:
json.loads('{"key": [1, 2, 3]}')
Out[28]:
In [29]:
json.loads(json.dumps(d)) == d
Out[29]:
Major sites usually provide a public API, so instead of scraping the data (which, depending on the site, can even be illegal), you can just use their API to interact with the site. After retrieving the data and loading it into Python from a string, you will have a functional Python data structure. For example, let's access the public GitHub API for events and print the username of the actor behind each event.
In [30]:
gh_resp = requests.get('https://api.github.com/events')
gh = json.loads(gh_resp.text)
for event in gh:
    print(event["actor"]["login"])
In fact, some APIs are so popular that they have their own Python clients to access them without worrying about the underlying requests being performed.
Activity
Write a function `wikipedia_entities(url, named_entity)` that receives a `url` from the English Wikipedia and returns the frequency of the terms tagged as `named_entity`. For example, after executing `wikipedia_entities("http://en.wikipedia.org/wiki/Iraq_War", "GPE")`, the first 15 results should be:
Iraq 649
Iraqi 262
U.S. 199
Baghdad 82
United States 73
Iraqis 35
New York Times 35
US 26
British 22
Iranian 22
American 21
Iran 21
Guardian 20
London 19
Main 15