Extract sentiment vectors from text using NRC lexicon


In [1]:
import pensieve as pv
import pandas as pd
pd.options.display.max_rows = 6
import numpy as np
import re
from tqdm import tqdm_notebook as tqdm
%matplotlib notebook
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

Read raw text

The seven Harry Potter books in plain-text format, converted from rich text format.

Each book is accessible by setting book to one of

  • book1.txt
  • ...
  • book7.txt

The raw text is read through pv.Doc and split into paragraph objects.


In [29]:
book = 'book7'
doc = pv.Doc('../../clusterpot/%s.txt' %book)
print("title: %s" %doc.paragraphs[0].text)


title: Harry Potter and the Deathly Hallows

Read NRC lexicon

The NRC emotion lexicon is a curated list of English words annotated with

  • sentiments: positive and negative
  • emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, trust

Details of this work may be found at Saif Mohammad's homepage.

The HDF5 file used here was produced by cleaning up the original NRC text file; a sketch of this conversion follows the column list below.

The resulting pandas dataframe has the following columns:

  • word | string
    • a word in the lexicon
  • emo | string
    • one of the sentiment or emotion labels above
  • binary | int
    • 0 if the word does not elicit the sentiment or emotion
    • 1 if the word elicits the sentiment or emotion
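
The cleaning step is not shown in this notebook. A minimal sketch of how such a dataframe could be built, assuming the tab-separated word/emotion/association layout of the distributed NRC word-level file (the filename below is an assumption):

import pandas as pd

# assumed layout: word<TAB>emotion<TAB>association, one entry per line
nrc = pd.read_csv('NRC-Emotion-Lexicon-Wordlevel-v0.92.txt', sep='\t',
                  names=['word', 'emo', 'binary'])
nrc.to_hdf('./NRC.h5', key='nrc', mode='w')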

In [30]:
df = pd.read_hdf('./NRC.h5')
df


Out[30]:
word emo binary
0 aback anger 0
1 aback anticipation 0
2 aback disgust 0
... ... ... ...
141817 zoom sadness 0
141818 zoom surprise 0
141819 zoom trust 0

141820 rows × 3 columns

Generate emotional vectors for every paragraph

  • for every paragraph (p) in the text
    • tokenize the paragraph with .split()
    • initialize the emotion vector emo_d
    • for every word (w) in paragraph (p)
      • strip special characters
      • query for the word in the NRC lexicon and read out the emo column where binary = 1
      • for every returned emotion (e)
        • if e is one of the emotions tracked in emo_d, add 1 to that emotion
    • append the emotion vector emo_d to a list

In [31]:
emo_list = []   # one emotion vector (dict) per paragraph
moo_d = dict()  # per-paragraph list of words that triggered a tracked emotion
for p in tqdm(range(len(doc.paragraphs)), desc='par'):
    s = doc.paragraphs[p].text.split()
    emo_d = {'joy':0, 'fear':0, 'surprise':0, 'sadness':0, 'disgust':0, 'anger':0}
    moo_d[p] = []
    for w in s:
        # strip special characters before looking the word up in the lexicon
        w = re.sub("[!@#$%^&*()_+:;,.\\?']", "", w)
        # sentiments/emotions this word elicits (binary == 1) in the NRC lexicon
        emo = df.query("word=='%s'" %w).query("binary==1")['emo'].as_matrix()
        for e in emo:
            if e in emo_d:
                emo_d[e] += 1
                moo_d[p].append(w)
    # keep each eliciting word only once per paragraph
    moo_d[p] = np.unique(moo_d[p]).tolist()
    emo_list.append(emo_d)



Write emotional vectors to HDF5

The HDF5 file is structured as

  • bookn
    • len(doc.paragraphs) x len(emo_d.keys()) array of integers

where n spans 1 through 7.
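
For reference, a sketch of loading every per-book table back from this file in one pass (assuming all seven books have been processed and written under the keys above):

emo_by_book = {'book%d' % n: pd.read_hdf('./book_emo.h5', key='book%d' % n)
               for n in range(1, 8)}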


In [32]:
pd.DataFrame(emo_list).to_hdf('./book_emo.h5',book)
pd.DataFrame.from_dict(moo_d, orient='index').to_hdf('./book_moo.h5', book)


/Users/cchang5/anaconda/envs/cdips2017/lib/python3.6/site-packages/pandas/core/generic.py:1282: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]]

  return pytables.to_hdf(path_or_buf, key, self, **kwargs)

Read emo vectors back in from HDF5 to analyse the text

Read in along with the corresponding text.


In [12]:
rbook = 'book2'
doc = pv.Doc('../../clusterpot/%s.txt' %rbook)
book_emotions = pd.read_hdf('book_emo.h5',key=rbook)
book_mood = pd.read_hdf('book_moo.h5',key=rbook)
book_emotions.join(book_mood)


Out[12]:
anger disgust fear joy sadness surprise 0 1 2 3 4 5 6 7 8 9 10 11
0 0 0 0 0 0 0 None None None None None None None None None None None None
1 0 0 0 0 0 0 None None None None None None None None None None None None
2 0 0 0 0 0 0 None None None None None None None None None None None None
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3223 1 0 1 0 1 0 crazy None None None None None None None None None None None
3224 0 0 0 0 0 0 None None None None None None None None None None None None
3225 0 0 0 0 0 0 None None None None None None None None None None None None

3226 rows × 18 columns

Simple example: choose emotion to sort by

Choose emo_kw from the list ['anger', 'disgust', 'fear', 'joy', 'sadness', 'surprise']


In [44]:
emo_kw = 'anger'
emo_rank = np.argsort(book_emotions[emo_kw].as_matrix())[::-1]
emo_rank


Out[44]:
array([ 522, 2645, 3036, ..., 2043, 2042,    0])

pidx | int

  • starts from 0; selects the paragraph corresponding to the pidx-th highest value of the chosen emotion

In [48]:
pidx = 1
print("paragraph from text\n%s\n" %doc.paragraphs[emo_rank[pidx]].text)
print("moo_d emo Harry feels\n%s\n" %book_mood.iloc[emo_rank[pidx]].dropna())
print("emo vector\n%s" %book_emotions.iloc[emo_rank[pidx]])


paragraph from text
"That is because it is a monstrous thing, to slay a unicorn," said Firenze. "Only one who has nothing to lose, and everything to gain, would commit such a crime. The blood of a unicorn will keep you alive, even if you are an inch from death, but at a terrible price. You have slain something pure and defenseless to save yourself, and you will have but a half-life, a cursed life, from the moment the blood touches your lips."

moo_d emo Harry feels
0       alive
1       crime
2      cursed
       ...   
7        save
8        slay
9    terrible
Name: 2645, Length: 10, dtype: object

emo vector
anger       6
disgust     3
fear        5
joy         3
sadness     5
surprise    2
Name: 2645, dtype: int64

Plot emotional vectors

Plots three components of the emotional vectors against each other in a 3D scatter plot.

rescale | boolean

  • True rescales the vectors by paragraph length.

In [29]:
rescale = True
if rescale:
    slength = [float(len(doc.paragraphs[i].text.split())) for i in range(len(doc.paragraphs))]
else:
    slength = 1
x = book_emotions['anger'].as_matrix()/slength
y = book_emotions['disgust'].as_matrix()/slength
z = book_emotions['sadness'].as_matrix()/slength
fig = plt.figure('emotional correlation',figsize=(7,7))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(xs=x,ys=y,zs=z)
ax.set_xlabel('anger')
ax.set_ylabel('disgust')
ax.set_zlabel('sadness')
plt.draw()
plt.show()


/Users/cchang5/anaconda/envs/cdips2017/lib/python3.6/site-packages/ipykernel_launcher.py:6: RuntimeWarning: invalid value encountered in true_divide
  
/Users/cchang5/anaconda/envs/cdips2017/lib/python3.6/site-packages/ipykernel_launcher.py:7: RuntimeWarning: invalid value encountered in true_divide
  import sys
/Users/cchang5/anaconda/envs/cdips2017/lib/python3.6/site-packages/ipykernel_launcher.py:8: RuntimeWarning: invalid value encountered in true_divide
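
The RuntimeWarnings above come from paragraphs that tokenize to zero words, so the 0/0 divisions produce NaN entries. A minimal guard (not used above) is to mask the zero lengths before dividing:

slength = np.array([len(doc.paragraphs[i].text.split())
                    for i in range(len(doc.paragraphs))], dtype=float)
slength[slength == 0] = np.nan  # empty paragraphs become NaN and are skipped when plotted
x = book_emotions['anger'].as_matrix() / slength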
  

Test code


In [123]:
slength = [float(len(doc.paragraphs[i].text.split())) for i in range(len(doc.paragraphs))]
y = np.sum(book_emotions.as_matrix(),axis=1)
plt.figure('paragraph length vs. emotion sum',figsize=(7,7))
ax = plt.axes([0.15,0.15,0.8,0.8])
ax.scatter(x=slength,y=y)
#ax.set_xlim([0,1])
#ax.set_ylim([0,1])
#ax.set_xscale('log')
plt.draw()
plt.show()



In [34]:
s = "Upon the signature of the International Statute of Secrecy in 1689, wizards went into hiding for good. It was natural, perhaps, that they formed their own small communities within a community. Many small villages and hamlets attracted several magical families, who banded together for mutual support and protection. The villages of Tinworsh in Cornwall, Upper Flagley in Yorkshire, and Ottery St. Catchpole on the south coast of England were notable homes to knots of Wizarding families who lived alongside tolerant and sometimes Confunded Muggles. Most celebrated of these half-magical dwelling places is, perhaps, Godrics Hollow, the West Country village where the great wizard Godric Gryffindor was born, and where Bowman Wright, Wizarding smith, forged the first Golden Snitch. The graveyard is full of the names of ancient magical families, and this accounts, no doubt, for the stories of hauntings that have dogged the little church beside it for many centuries."
w = s.split()[0]
%timeit re.sub("[!@#$%^&*()_+:;,.\\?']", "", w)


The slowest run took 4.14 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.04 µs per loop

In [33]:
%%timeit
punctuation = ["!","@","#","$","%","^","&","*","(",")","_","+",":",";",",",".","?","'","\\"]
for punc in punctuation:
    w.replace("%s" %punc,'')


100000 loops, best of 3: 4.67 µs per loop
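
For comparison, Python's built-in str.translate removes a fixed character set in a single pass; a sketch of the equivalent strip (not timed here):

strip_table = str.maketrans('', '', "!@#$%^&*()_+:;,.?'\\")
w.translate(strip_table)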

In [31]:
moodf = pd.DataFrame.from_dict(moo_d, orient='index')

In [38]:
moodf.iloc[25].dropna()


Out[38]:
0    forgotten
1       uneasy
2        words
Name: 25, dtype: object

In [26]:
moodf.to_hdf('./book_moo_v2.h5', book)


/Users/cchang5/anaconda/envs/cdips2017/lib/python3.6/site-packages/pandas/core/generic.py:1282: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]]

  return pytables.to_hdf(path_or_buf, key, self, **kwargs)

In [39]:
book_mood = pd.read_hdf('book_moo_v2.h5',key=rbook)

In [40]:
book_mood


Out[40]:
0 1 2 3 4 5 6 7 8 9 10 11
0 None None None None None None None None None None None None
1 None None None None None None None None None None None None
2 None None None None None None None None None None None None
... ... ... ... ... ... ... ... ... ... ... ... ...
3237 good unpleasant None None None None None None None None None None
3238 fun grin surprised None None None None None None None None None
3239 None None None None None None None None None None None None

3240 rows × 12 columns


In [ ]: