In [1]:
from IPython.display import HTML

HTML('''<script>
code_show=true;
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')


Out[1]:

Sentiment Classification Using SentiWordNet

For sentiment classification, one approach is to use the SentiWordNet (SWN) word dictionary, which builds on top of the original Princeton WordNet dictionary by adding sentiment scores to each word. In addition to positivity / negativity, it also gives an objectivity score, which is a measure of how much sentiment a word has. E.g. "awesome" is very subjective, "red" is very objective.

The scores all add up to 1, and are split between positivity/negativity and objectivity. An objectivity score of 1 necessarily implies 0 negative or positive sentiment. An objectivity score of 0.5 means there is 0.5 remaining that could be distributed to the positive or negative sentiment in any way.

These scores can then be agrregated into article scores, which can then be converted into daily scores.

It is possible to download the original SentiWordNet data files, but luckily NLTK provides an already parsed version of the same corpus, along with convenience functions to search the database.


In [6]:
import nltk
from nltk.corpus import sentiwordnet as swn

This notebook requires not only NLTK, but also additional corpora that come with NLTK to perform various functions:


In [27]:
# To download the SentiWordNet and stopword corpora, run the commands below
nltk.download("sentiwordnet")
nltk.download("stopwords")

Now let's try it on one word, "plant". Specifically, "plant.n.01":


In [23]:
plant01 = swn.senti_synset("plant.n.01")

In [28]:
plant01.neg_score(), plant01.pos_score(), plant01.obj_score()


Out[28]:
(0.0, 0.0, 1.0)

You'll notice that this is a very specific "sense" of a word. "plant.n.01" means "the first meaning of the word "plant", that is a noun. Wordnet provides many meanings of words. You can search for multiple, like so.


In [49]:
swn.senti_synsets("plant")


Out[49]:
[SentiSynset('plant.n.01'),
 SentiSynset('plant.n.02'),
 SentiSynset('plant.n.03'),
 SentiSynset('plant.n.04'),
 SentiSynset('plant.v.01'),
 SentiSynset('implant.v.01'),
 SentiSynset('establish.v.02'),
 SentiSynset('plant.v.04'),
 SentiSynset('plant.v.05'),
 SentiSynset('plant.v.06')]

The results are approximately sorted in frequency of usage. Now, we can use the SWN dictionary like so, when parsing actual sentences:


In [29]:
from nltk.tag.perceptron import PerceptronTagger
from nltk.tokenize import word_tokenize

First, we take a sentence and split it into tokens: "This red car is horrible. It is a very broken car. I want to punch it."


In [133]:
article = "This red car is horrible. It is a very broken car. I want to punch it."
tokens = nltk.word_tokenize(article)
tokens


Out[133]:
['This',
 'red',
 'car',
 'is',
 'horrible',
 '.',
 'It',
 'is',
 'a',
 'very',
 'broken',
 'car',
 '.',
 'I',
 'want',
 'to',
 'punch',
 'it',
 '.']

Then we tag these tokens according to what parts of speech they might be (noun, verb, etc):


In [69]:
tagged_tokens = nltk.pos_tag(tokens)
tagged_tokens


Out[69]:
[('This', 'DT'),
 ('red', 'JJ'),
 ('car', 'NN'),
 ('is', 'VBZ'),
 ('horrible', 'JJ'),
 ('.', '.'),
 ('It', 'PRP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('very', 'RB'),
 ('broken', 'JJ'),
 ('car', 'NN'),
 ('.', '.'),
 ('I', 'PRP'),
 ('want', 'VBP'),
 ('to', 'TO'),
 ('punch', 'VB'),
 ('it', 'PRP'),
 ('.', '.')]

Now, we have enough information to look up words properly.

Here is the meaning of all the tags (e.g. JJ, NN, VB, etc): https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

Unfortunately SentiWordNet and NLTK use different notations for these. Here is one way where we can map the tags from one system to another:

  • NN: n
  • VB: v
  • JJ: a
  • RB: r

In [70]:
nltk_to_sentiwordnet = {
    "NN": "n",
    "VB": "v",
    "JJ": "a",
    "RB": "r",
}

Now that we have that, we can try looking up the sentiments for each token in the sentence in SentiWordNet, and print the top results for meanings that we find, along with the definition, the positive score, the negative score and the objectivity score. These scores are normalized such that they add up to 1.


In [77]:
for word, pos in tagged_tokens:
    
    swn_pos = nltk_to_sentiwordnet.get(pos[:2], None)
    
    if swn_pos == None:
        continue
    
    synsets = swn.senti_synsets(word.lower(), pos=swn_pos)
    
    if len(synsets) == 0:
        continue
    
    print("{}:".format(word))
    for synset in synsets[:3]:
        print(synset, synset.pos_score(), synset.neg_score(), synset.obj_score())
        print "    ", synset.synset.definition()
    print("------")


red:
(SentiSynset('red.s.01'), 0.0, 0.0, 1.0)
     of a color at the end of the color spectrum (next to orange); resembling the color of blood or cherries or tomatoes or rubies
(SentiSynset('crimson.s.02'), 0.25, 0.625, 0.125)
     characterized by violence or bloodshed
(SentiSynset('crimson.s.03'), 0.0, 0.25, 0.75)
     (especially of the face) reddened or suffused with or as if with blood from emotion or exertion
------
car:
(SentiSynset('car.n.01'), 0.0, 0.0, 1.0)
     a motor vehicle with four wheels; usually propelled by an internal combustion engine
(SentiSynset('car.n.02'), 0.0, 0.0, 1.0)
     a wheeled vehicle adapted to the rails of railroad
(SentiSynset('car.n.03'), 0.0, 0.0, 1.0)
     the compartment that is suspended from an airship and that carries personnel and the cargo and the power plant
------
is:
(SentiSynset('be.v.01'), 0.25, 0.125, 0.625)
     have the quality of being; (copula, used with an adjective or a predicate noun)
(SentiSynset('be.v.02'), 0.0, 0.0, 1.0)
     be identical to; be someone or something
(SentiSynset('be.v.03'), 0.0, 0.0, 1.0)
     occupy a certain position or area; be somewhere
------
horrible:
(SentiSynset('atrocious.s.03'), 0.0, 0.625, 0.375)
     provoking horror
------
is:
(SentiSynset('be.v.01'), 0.25, 0.125, 0.625)
     have the quality of being; (copula, used with an adjective or a predicate noun)
(SentiSynset('be.v.02'), 0.0, 0.0, 1.0)
     be identical to; be someone or something
(SentiSynset('be.v.03'), 0.0, 0.0, 1.0)
     occupy a certain position or area; be somewhere
------
very:
(SentiSynset('very.r.01'), 0.25, 0.25, 0.5)
     used as intensifiers; `real' is sometimes used informally for `really'; `rattling' is informal
(SentiSynset('very.r.02'), 0.25, 0.0, 0.75)
     precisely so
------
broken:
(SentiSynset('broken.a.01'), 0.0, 0.125, 0.875)
     physically and forcibly separated into pieces or cracked or split
(SentiSynset('broken.a.02'), 0.0, 0.125, 0.875)
     not continuous in space, time, or sequence or varying abruptly
(SentiSynset('broken.s.03'), 0.0, 0.375, 0.625)
     subdued or brought low in condition or status
------
car:
(SentiSynset('car.n.01'), 0.0, 0.0, 1.0)
     a motor vehicle with four wheels; usually propelled by an internal combustion engine
(SentiSynset('car.n.02'), 0.0, 0.0, 1.0)
     a wheeled vehicle adapted to the rails of railroad
(SentiSynset('car.n.03'), 0.0, 0.0, 1.0)
     the compartment that is suspended from an airship and that carries personnel and the cargo and the power plant
------
want:
(SentiSynset('desire.v.01'), 0.25, 0.0, 0.75)
     feel or have a desire for; want strongly
(SentiSynset('want.v.02'), 0.125, 0.125, 0.75)
     have need of
(SentiSynset('want.v.03'), 0.0, 0.0, 1.0)
     hunt or look for; want for a particular reason
------
punch:
(SentiSynset('punch.v.01'), 0.0, 0.0, 1.0)
     deliver a quick blow to
(SentiSynset('punch.v.02'), 0.0, 0.0, 1.0)
     drive forcibly as if by a punch
(SentiSynset('punch.v.03'), 0.0, 0.0, 1.0)
     make a hole into or between, as for ease of separation
------

OK, now this leaves us with a few problems to solve further:

How do we disambiguate between the multiple senses of words?

A naive option is to choose the first word - senses that have opposite sentiment are an edge case, and mistakes here shouldn't have a huge effect given that our large article size. Among hundreds of words, this is not a huge deal.

A more advanced solution is to use Word Sense Disambiguation (e.g. the Lesk algorithm) to get a best guess of which sense of the word is being used. This is computationally expensive, but there are libraries to do this, e.g. pyWSD.

What do we do with words that we don't find?

This is easy, we can just skip the words - it's the same as no sentiment data.

How do we combine the three scores (Positivity, Negativity, Objectivity) to get one score?

Since the scores are already scaled by objectivity, we can subtract the negative from the positive to get an overall score. We can also return the word count for any scaling / weighting by word count that might be necessary.

Do we filter out useless words during that process?

Another thought is to filter most of the high objectivity words (e.g. "Potato" rather than "Great"). These don't necessarily add much value in terms of sentiment information, and dilute real information. However again, words that have high objectivity scores have lower P / N scores, so this is probably not necessary. One thing we could do is to exaggerate this effect to perhaps skew the weighting more, if necessary.


In [ ]:

Word Sense Disambiguation

An experiment in Word Sense Disambiguation:


In [78]:
# To install pyWSD, run the command below: 
# !pip install git+https://github.com/alvations/pywsd.git@master

In [100]:
from pywsd import disambiguate

In [108]:
disambiguated_sentence = disambiguate(sentence)
for word, synset in disambiguated_sentence:
    if synset:
        senti_synset = swn.senti_synset(synset._name)
        print word, senti_synset, senti_synset.obj_score()


car <car.n.04: PosScore=0.0 NegScore=0.0> 1.0
broken <broken.a.04: PosScore=0.0 NegScore=0.125> 0.875
car <car.n.04: PosScore=0.0 NegScore=0.0> 1.0
want <want.v.05: PosScore=0.0 NegScore=0.625> 0.375
punch <punch.v.03: PosScore=0.0 NegScore=0.0> 1.0

For some reason this seems to eat up some words. For example, where is "horrible"? It turns out that the disambiguate() function doesn't return anything if the word does not have more than one meaning. We can try to remedy that:


In [ ]:
disambiguated_sentence = disambiguate(sentence)
for word, synset in disambiguated_sentence:
    if synset:
        senti_synset = swn.senti_synset(synset._name)
        print word, senti_synset, senti_synset.obj_score()
    else:
        senti_synsets = swn.senti_synsets(word)
        if len(senti_synsets) == 1:
            print word, senti_synsets[0], senti_synsets[0].obj_score()

Still some stuff is missing. Where is "red"? Since this doesn't seem to be giving great results, and our initial prediction is that it'll have a marginal benefit anyway, I'll opt to not use WSD.

A full implementation of SentiWordNet scoring


In [183]:
import nltk
from nltk.corpus import stopwords
from nltk.corpus import sentiwordnet as swn
import numpy as np

#nltk.download("sentiwordnet")
#nltk.download("stopwords")

flatten = lambda l: [item for sublist in l for item in sublist]
english_stopwords = set(stopwords.words("english"))

nltk_to_sentiwordnet = {
    "NN": "n",
    "VB": "v",
    "JJ": "a",
    "RB": "r",
}

def get_sentiment(article):
    
    sentences = nltk.sent_tokenize(article)
    sentence_words = [nltk.word_tokenize(sentence) for sentence in sentences]
    tagged_sentence_words = flatten(nltk.pos_tag_sents(sentence_words))
    
    # Filter stopwords
    tagged_sentence_words = [word for word in tagged_sentence_words if word[1] not in english_stopwords]
    
    pos_scores = []
    neg_scores = []
    subj_scores = []

    for word, pos in tagged_sentence_words:
        
        swn_pos = nltk_to_sentiwordnet.get(pos[:2], None)
    
        if swn_pos == None:
            continue
    
        synsets = swn.senti_synsets(word.lower(), pos=swn_pos)
    
        if len(synsets) == 0:
            continue
    
        #print("{}:".format(word))
        for synset in synsets[:1]:
            pos_scores.append(synset.pos_score())
            neg_scores.append(synset.neg_score())
            subj_scores.append(1 - synset.obj_score())
        
    return np.average(pos_scores, weights=subj_scores) , np.average(neg_scores, weights=subj_scores), np.mean(subj_scores)

In [184]:
get_sentiment(article)


Out[184]:
(0.16666666666666666, 0.27777777777777779, 0.22500000000000001)

Example scores for "This car is amazing. I really love the detailing and the finish. It's just the way I like it.":


In [185]:
get_sentiment("This car is amazing. I really love the detailing and the finish. It's just the way I like it.")


Out[185]:
(0.39166666666666666, 0.066666666666666666, 0.20833333333333334)

The actual implementation used is in sentiwordnet.py in the repository.