For sentiment classification, one approach is to use the SentiWordNet (SWN) dictionary, which builds on top of the original Princeton WordNet by adding sentiment scores to each word sense. In addition to positivity / negativity, it also gives an objectivity score, which measures how free of sentiment a word is. For example, "awesome" is very subjective, while "red" is very objective.
The three scores add up to 1 and are split between positivity/negativity and objectivity. An objectivity score of 1 necessarily implies zero positive or negative sentiment. An objectivity score of 0.5 means the remaining 0.5 is distributed between the positive and negative scores in some way.
These word scores can then be aggregated into article scores, which can in turn be converted into daily scores.
It is possible to download the original SentiWordNet data files, but luckily NLTK provides an already parsed version of the same corpus, along with convenience functions to search the database.
In [6]:
import nltk
from nltk.corpus import sentiwordnet as swn
This notebook requires not only NLTK, but also additional corpora that come with NLTK to perform various functions:
In [27]:
# To download the SentiWordNet and stopword corpora, run the commands below
nltk.download("sentiwordnet")
nltk.download("stopwords")
Now let's try it on one word, "plant". Specifically, "plant.n.01":
In [23]:
plant01 = swn.senti_synset("plant.n.01")
In [28]:
plant01.neg_score(), plant01.pos_score(), plant01.obj_score()
Out[28]:
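As a quick check of the claim that the three scores sum to 1 (reusing the plant01 synset from above):
In [ ]:
# pos + neg + obj should come out to 1.0 (up to floating point rounding)
plant01.pos_score() + plant01.neg_score() + plant01.obj_score()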
You'll notice that this is a very specific "sense" of a word: "plant.n.01" means the first sense of the word "plant" used as a noun. WordNet provides many senses per word, and you can search for all of them like so:
In [49]:
swn.senti_synsets("plant")
Out[49]:
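In recent NLTK versions senti_synsets() returns a lazy iterator, so to inspect the individual senses and their WordNet definitions you can wrap it in a list, for example:
In [ ]:
for senti_synset in list(swn.senti_synsets("plant"))[:5]:
    print(senti_synset.synset.name(), senti_synset.synset.definition())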
The results are approximately sorted by frequency of usage. Now we can use the SWN dictionary when parsing actual sentences, like so:
In [29]:
from nltk.tag.perceptron import PerceptronTagger
from nltk.tokenize import word_tokenize
First, we take a short example text and split it into tokens: "This red car is horrible. It is a very broken car. I want to punch it."
In [133]:
article = "This red car is horrible. It is a very broken car. I want to punch it."
tokens = nltk.word_tokenize(article)
tokens
Out[133]:
Then we tag these tokens according to what parts of speech they might be (noun, verb, etc):
In [69]:
tagged_tokens = nltk.pos_tag(tokens)
tagged_tokens
Out[69]:
Now, we have enough information to look up words properly.
Here is the meaning of all the tags (e.g. JJ, NN, VB, etc): https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
Unfortunately, SentiWordNet and NLTK use different notations for these. Here is one way to map the tags from one system to the other:
In [70]:
nltk_to_sentiwordnet = {
"NN": "n",
"VB": "v",
"JJ": "a",
"RB": "r",
}
Now that we have that, we can look up each token in SentiWordNet and print the top few senses we find, along with their positive, negative, and objectivity scores and definitions. As noted above, the three scores are normalized to add up to 1.
In [77]:
for word, pos in tagged_tokens:
    swn_pos = nltk_to_sentiwordnet.get(pos[:2], None)
    if swn_pos is None:
        continue
    # senti_synsets returns a lazy iterator in recent NLTK versions
    synsets = list(swn.senti_synsets(word.lower(), pos=swn_pos))
    if len(synsets) == 0:
        continue
    print("{}:".format(word))
    for synset in synsets[:3]:
        print(synset, synset.pos_score(), synset.neg_score(), synset.obj_score())
        print("  ", synset.synset.definition())
    print("------")
OK, now this leaves us with a few problems to solve:
Which sense of a word should we use? A naive option is to choose the first sense - senses with opposite sentiment are an edge case, and mistakes here shouldn't have a huge effect given our large article sizes. Among hundreds of words, this is not a huge deal. A more advanced solution is to use Word Sense Disambiguation (e.g. the Lesk algorithm) to get a best guess of which sense of the word is being used. This is computationally more expensive, but there are libraries for it, e.g. pyWSD (see the sketch after this list).
What about words that aren't in SentiWordNet? This is easy: we can just skip them - it's the same as having no sentiment data.
How do we combine the word scores into an overall score? Since the scores are already scaled by objectivity, we can subtract the negative from the positive to get an overall score. We can also return the word count for any scaling / weighting by word count that might be necessary.
Another thought is to filter out most of the highly objective words (e.g. "potato" rather than "great"). These don't necessarily add much value in terms of sentiment information, and they dilute the real information. However, words with high objectivity scores already have low positive / negative scores, so this is probably not necessary. One thing we could do is exaggerate this effect to skew the weighting further, if needed.
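Before reaching for pyWSD (used below), note that NLTK itself ships a basic Lesk implementation in nltk.wsd. A minimal sketch of disambiguating a single word against its sentence context, on an illustrative sentence, might look like this:
In [ ]:
from nltk.wsd import lesk

# lesk() picks the synset whose definition overlaps most with the context tokens
context = nltk.word_tokenize("It is a very broken car.")
best_sense = lesk(context, "broken")
if best_sense is not None:
    senti = swn.senti_synset(best_sense.name())
    print(best_sense, best_sense.definition())
    print(senti.pos_score(), senti.neg_score(), senti.obj_score())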
In [78]:
# To install pyWSD, run the command below:
# !pip install git+https://github.com/alvations/pywsd.git@master
In [100]:
from pywsd import disambiguate
In [108]:
disambiguated_sentence = disambiguate(article)
for word, synset in disambiguated_sentence:
    if synset:
        senti_synset = swn.senti_synset(synset.name())
        print(word, senti_synset, senti_synset.obj_score())
This seems to swallow some words, though. For example, where is "horrible"? It turns out that the disambiguate() function doesn't return a synset if the word does not have more than one meaning. We can try to remedy that:
In [ ]:
disambiguated_sentence = disambiguate(article)
for word, synset in disambiguated_sentence:
    if synset:
        senti_synset = swn.senti_synset(synset.name())
        print(word, senti_synset, senti_synset.obj_score())
    else:
        # Fall back to the single SentiWordNet entry when the word is unambiguous
        senti_synsets = list(swn.senti_synsets(word))
        if len(senti_synsets) == 1:
            print(word, senti_synsets[0], senti_synsets[0].obj_score())
Some words are still missing, though. Where is "red"? Since this doesn't seem to give great results, and our initial prediction was that it would bring only a marginal benefit anyway, I'll opt not to use WSD.
In [183]:
import nltk
from nltk.corpus import stopwords
from nltk.corpus import sentiwordnet as swn
import numpy as np

# nltk.download("sentiwordnet")
# nltk.download("stopwords")

flatten = lambda l: [item for sublist in l for item in sublist]

english_stopwords = set(stopwords.words("english"))

nltk_to_sentiwordnet = {
    "NN": "n",
    "VB": "v",
    "JJ": "a",
    "RB": "r",
}

def get_sentiment(article):
    sentences = nltk.sent_tokenize(article)
    sentence_words = [nltk.word_tokenize(sentence) for sentence in sentences]
    tagged_sentence_words = flatten(nltk.pos_tag_sents(sentence_words))
    # Filter stopwords (each item is a (token, tag) pair, so compare the token)
    tagged_sentence_words = [word for word in tagged_sentence_words
                             if word[0].lower() not in english_stopwords]
    pos_scores = []
    neg_scores = []
    subj_scores = []
    for word, pos in tagged_sentence_words:
        swn_pos = nltk_to_sentiwordnet.get(pos[:2], None)
        if swn_pos is None:
            continue
        # senti_synsets returns a lazy iterator in recent NLTK versions
        synsets = list(swn.senti_synsets(word.lower(), pos=swn_pos))
        if len(synsets) == 0:
            continue
        # Naively take the first (most frequent) sense
        for synset in synsets[:1]:
            pos_scores.append(synset.pos_score())
            neg_scores.append(synset.neg_score())
            subj_scores.append(1 - synset.obj_score())
    return (np.average(pos_scores, weights=subj_scores),
            np.average(neg_scores, weights=subj_scores),
            np.mean(subj_scores))
In [184]:
get_sentiment(article)
Out[184]:
Example scores for "This car is amazing. I really love the detailing and the finish. It's just the way I like it.":
In [185]:
get_sentiment("This car is amazing. I really love the detailing and the finish. It's just the way I like it.")
Out[185]:
The actual implementation used is in sentiwordnet.py in the repository.
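As for turning article scores into daily scores (mentioned at the start), the repository implementation is the reference, but a minimal sketch, assuming a hypothetical pandas DataFrame with a date and a text column, could look like this:
In [ ]:
import pandas as pd

# Hypothetical input: one row per article with its publication date and raw text
articles = pd.DataFrame({
    "date": ["2017-01-01", "2017-01-01", "2017-01-02"],
    "text": [
        "This car is amazing. I really love the detailing.",
        "It is a very broken car. I want to punch it.",
        "This red car is horrible.",
    ],
})

# Score each article with get_sentiment(), then average the scores per day
scores = articles["text"].apply(
    lambda text: pd.Series(get_sentiment(text), index=["pos", "neg", "subj"]))
daily_scores = pd.concat([articles["date"], scores], axis=1).groupby("date").mean()
daily_scores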