This tutorial is going to introduce some simple tools for detecting sentiment in Tweets. We will be using a set of tools called the Natural Language Toolkit (NLTK). This is a collection of software written in the Python programming language. An important design goal behind Python is that it should be easy to read and fun to use, making it well-suited for beginners. A similar motivation inspired NLTK: it should make complex tasks easy to carry out, and it should be written in a way that allows users to inspect and understand the code.
Why is this relevant? Well, a lot of software these days is built to be easy to use, but hard to inspect. For example, smartphones have a lot of slick apps on them, but very few people have the expertise to look under the hood to find out how they work. NLTK has quite the opposite approach: you are actively encouraged to discover how the code works. However, your level of understanding will depend heavily on how far you get to grips with Python itself.
This tutorial is written using the IPython framework. This allows text to be interspersed with fragments of code, occurring in special "cells". Just below is a cell where we use Python to do a simple calculation:
In [100]:
3 + 4
Out[100]:
7
Some of the cells will contain snippets of code that are necessary for the overall story to work, but which you don't need to understand. We'll try to make it clear when it's important for you to pay attention to one of the cells.
As you know, people are tweeting all the time. The rate varies, with about 6,000 per second being the average, but when I last checked, the rate was over 10,000 Tweets per second. So, a lot. Twitter kindly allows people to tap into a small sample of this stream — unless you're able to pay, the sample is at most 1% of the total stream.
Here's a tiny snapshot of Tweets, reflecting the Twitter public stream at the point this tutorial was last executed. By using the keywords 'love, hate', we restrict our sample to just those Tweets containing one or both of those words.
In [68]:
import nltk # load up the NLTK library
from nltk.twitter import Twitter
tw = Twitter() # start a new client that connects to Twitter
tw.tweets(keywords='love, hate', limit=25)  # filter Tweets from the public stream
You too can sample Tweets in this way, but you'll need to set up your Twitter API keys according to these instructions, and also install NLTK (and IPython if you want) on your own computer. Since this is a bit of a hassle, for the rest of this tutorial we'll focus our attention on a sample of 20,000 English-language Tweets that were collected at the end of April 2015. In order to focus on Tweets about the UK general election, the public stream was filtered with the following set of terms:
david cameron, miliband, milliband, sturgeon, clegg, farage, tory, tories, ukip, snp, libdem
The following code cell allows us to get hold of this collection, and prints out the text of the first 20 Tweets. You don't need to worry about the details of how this happens.
In [4]:
from nltk.corpus import twitter_samples
strings = twitter_samples.strings('tweets.20150430-223406.json')
In [5]:
for string in strings[:20]:
    print(string)
When we talk about understanding natural language, we often focus on 'who did what to whom'. Yet in many situations, we are more interested in attitudes and opinions. When someone writes about a movie, did they like it or hate it? Is a product review for a water bottle on Amazon positive or negative? Is a Tweet about the US President supportive or critical? We might also care about the intensity of the views expressed: "this is a fine movie" is different from "WOW! This movie is soooooo great!!!!" even though both are positive.
Sentiment analysis (or opinion mining) is a broad term for a range of techniques that try to identify the subjective views expressed in texts. Many organisations care deeply about public opinion — whether these concern commercial products, creative works, or political parties and policies — and have consequently turned to sentiment analysis as a way of gleaning valuable insights from voluminous bodies of online text. This in turn has stimulated much activity in the area, ranging from academic research to commercial applications and industry-focussed conferences.
However, it's worth saying at the outset that sentiment analysis is hard. Although it is designed to work with written text, the way in which people express their feelings often goes far beyond what they literally say. In spoken language, intonation will be important. And of course we often express emotion using no words at all, as illustrated in this picture from Darwin's book The Expression of the Emotions.
Let's say that we want to classify a sentence into one of three categories: positive, negative or neutral. Each of these can be illustrated by posts on Twitter collected during the UK General Election in 2015.
The term polarity is often used to refer to whether a piece of text is judged to be positive or negative.
The easiest approach to classifying examples like these is to get hold of two lists of words: positive ones such as good, excellent, fine, triumph, well, succeed, ... and negative ones such as bad, poor, dismal, lying, fail, disaster, .... We then figure out an overall polarity score based on the ratio of positive tokens to negative ones in a given string. A sentence with neither positive nor negative tokens (or possibly an equal number of each) will be categorised as neutral. This simple approach is likely to yield roughly correct results for the Twitter examples above.
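To make this concrete, here is a minimal sketch of such a word-list classifier. The two word lists are just the example words from the previous paragraph, and the function simple_polarity is our own illustration; a real system would use much larger lexicons.
In [ ]:
from nltk.tokenize import wordpunct_tokenize

# Illustrative word lists, taken from the examples above; real lexicons are much larger
POSITIVE = set(['good', 'excellent', 'fine', 'triumph', 'well', 'succeed'])
NEGATIVE = set(['bad', 'poor', 'dismal', 'lying', 'fail', 'disaster'])

def simple_polarity(text):
    """Classify text as 'positive', 'negative' or 'neutral' by comparing
    the number of positive and negative tokens it contains."""
    toks = [t.lower() for t in wordpunct_tokenize(text)]
    pos = sum(1 for t in toks if t in POSITIVE)
    neg = sum(1 for t in toks if t in NEGATIVE)
    if pos > neg:
        return 'positive'
    elif neg > pos:
        return 'negative'
    return 'neutral'    # no sentiment tokens, or an equal number of each

simple_polarity("The result was a triumph, and the campaign went well")   # 'positive'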
Things become more complicated when negation enters the picture. The next example is mildly positive (at least in British English), so we need to ensure that not reverses the polarity of bad in appropriate contexts:
Given Miliband personal ratings still 20 points behind Cameron, I'd say that not a bad margin for Labour leader https://t.co/ILQP93VYLF
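A crude way to handle this, sketched below, is to flip the polarity of a sentiment word whenever a negation word occurs among the two tokens immediately before it. The two-token window and the tiny NEGATIONS set are our own simplifications, building on the simple_polarity sketch above; real systems track negation scope much more carefully.
In [ ]:
NEGATIONS = set(['not', 'no', 'never'])

def polarity_with_negation(text):
    """Like simple_polarity, but a negation word within the two preceding
    tokens flips the polarity of a sentiment word."""
    toks = [t.lower() for t in wordpunct_tokenize(text)]
    pos = neg = 0
    for i, t in enumerate(toks):
        # is there a negation word in the two-token window before this one?
        negated = len(NEGATIONS & set(toks[max(0, i - 2):i])) > 0
        if t in POSITIVE:
            if negated:
                neg += 1
            else:
                pos += 1
        elif t in NEGATIVE:
            if negated:
                pos += 1
            else:
                neg += 1
    if pos > neg:
        return 'positive'
    elif neg > pos:
        return 'negative'
    return 'neutral'

polarity_with_negation("not a bad margin for the Labour leader")   # 'positive'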
VADER is a system for determining the sentiment of texts which has been incorporated into NLTK. It is based on the idea of looking for positive and negative words, but adds two important new elements. First, it uses a lexicon of 7,500 items which have been manually annotated for both polarity and intensity. Second, the overall score for an input text is computed using a complex set of rules that take into account not just words (and negation), but also the boosting effect of devices like capitalisation and punctuation.
In [69]:
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
In [80]:
sia.polarity_scores("I REALLY adore Starwars!!!!! :-)")
Out[80]:
In [6]:
full_tweets = twitter_samples.docs('tweets.20150430-223406.json')
In the next example, we are going to create a table of Tweets using the pandas library. We will use the name data to refer to this table.
In [105]:
import pandas as pd
from numpy import nan
data = pd.DataFrame()
data['text'] = [t['text'] for t in full_tweets] # add a column corresponding to the text of each Tweet
Next, we will try to add labels for political parties and party leaders in a way that corresponds to the text of the Tweets. However, in some cases it may not be possible or appropriate to add a label, and instead we want to have a 'blank cell' that will be ignored by pandas. We'll do this by inserting the value NaN (Not a Number).
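As a quick illustration of why NaN makes a good 'blank cell', pandas skips NaN values when computing summary statistics. The Series below is a made-up example:
In [ ]:
import pandas as pd
from numpy import nan

s = pd.Series([0.5, nan, -0.5, nan, 1.0])
s.count()   # 3 -- the two NaN cells are ignored
s.mean()    # 0.333... -- the mean of 0.5, -0.5 and 1.0 only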
In [108]:
parties = {}
parties['conservative'] = set(['osborne', 'portillo', 'pickles', 'tory', 'tories',
'torie', 'voteconservative', 'conservative', 'conservatives', 'bullingdon', 'telegraph'])
parties['labour'] = set(['uklabour', 'scottishlabour', 'labour', 'lab', 'murphy'])
parties['libdem'] = set(['libdem', 'libdems', 'dems', 'alexander'])
parties['ukip'] = set(['ukip', 'davidcoburnukip'])
parties['snp'] = set(['salmond', 'snp', 'snpwin', 'votesnp', 'snpbecause', 'scotland',
'scotlands', 'scottish', 'indyref', 'independence', 'celebs4indy'])
leaders = {}
leaders['cameron'] = set(['cameron', 'david_cameron', 'davidcameron','dave', 'davecamm'])
leaders['miliband'] = set(['miliband', 'ed_miliband', 'edmiliband', 'edm', 'milliband', 'ed', 'edforchange', 'edforpm', 'milifandom'])
leaders['clegg'] = set(['clegg'])
leaders['farage'] = set(['farage', 'nigel_farage', 'nsegel', 'askfarage', 'asknigelfarage', 'asknigelfar'])
leaders['sturgeon'] = set(['sturgeon', 'nicola_sturgeon', 'nicolasturgeon', 'nicola'])
from nltk.tokenize import wordpunct_tokenize
import operator

def tweet_classify(text, keywords):
    """Return the keyword-group label whose terms best match the text,
    or NaN if no group matches at all."""
    label = nan
    toks = wordpunct_tokenize(text)          # split the Tweet into tokens
    toks_lower = [t.lower() for t in toks]   # normalise to lower case
    d = {}
    for k in keywords:
        # count how many of this group's terms occur in the Tweet
        d[k] = len(keywords[k] & set(toks_lower))
    # pick the group with the highest count of matching terms
    best = max(d.items(), key=operator.itemgetter(1))
    if best[1] > 0:
        label = best[0]
    return label
data['party'] = data['text'].apply(lambda t: tweet_classify(t, parties))
data['leader'] = data['text'].apply(lambda t: tweet_classify(t, leaders))
data.head(25)
Out[108]:
To add a sentiment column, we will use the polarity_scores() method from VADER that we briefly described earlier. We'll only look at the overall 'compound' polarity score.
In [109]:
data['sentiment'] = data['text'].apply(lambda t: sia.polarity_scores(t)['compound'])
In [48]:
data.describe() # summarise the table
Out[48]:
Let's inspect the 25 most positive Tweets:
In [57]:
data.sort_values(by="sentiment", ascending=False).head(25)
Out[57]:
Let's print out the text of the Tweet in row 15079.
In [63]:
print(data.iloc[15079]['text'])
Now let's have a peek at the 25 most negative Tweets.
In [110]:
data.sort_values(by="sentiment").head(25)
Out[110]:
And here is the text of the Tweet at row 5069:
In [84]:
print(data.iloc[5069]['text'])
In the next few examples, we group the Tweets together either by leader or by party, and then look at some summary statistics.
In [34]:
grouped_leader = data['sentiment'].groupby(data['leader'])
grouped_leader.mean()
Out[34]:
In [30]:
grouped_party = data['sentiment'].groupby(data['party'])
grouped_party.mean()
Out[30]:
In [35]:
grouped_leader.count()
Out[35]:
In [36]:
grouped_party.count()
Out[36]:
In [86]:
grouped_party.max()
Out[86]:
In [87]:
grouped_leader.max()
Out[87]:
It's not hard to find examples where something close to full natural language understanding is required to determine the correct polarity.
In [113]:
sia.polarity_scores("David Cameron doesn't seem to have done too badly until now." +
"Otherwise #milifandom and #cleggers would be attacking him for these bad things.")
Out[113]:
A further challenge in sentiment analysis is deciding on the right level of granularity for the topic under discussion. Often we can agree on the overall polarity of a sentence (or even of larger texts) because there is a single dominant topic. But in a list-like construction such as the following, different sentiments are associated with different entities, and there is no sensible way of aggregating them into a combined polarity score for the text as a whole:
<i>@hugorifkind Audience - good. Mili - bad. Clegg - a bit sad. Cam - unscathed</i>
Finally, as we have already seen, current approaches to language processing struggle with sarcasm, irony and satire, since these devices (intentionally) lead to polarity reversals.