Extract sentiment vectors from text using NRC lexicon


In [1]:
import pensieve as pv
import pandas as pd
pd.options.display.max_rows = 6
import numpy as np
import re
from tqdm import tqdm_notebook as tqdm
%matplotlib notebook
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

Read raw text

The seven Harry Potter books in plain-text format, converted from rich text format.

Each book is accessible by setting book to one of

  • book1.txt
  • ...
  • book7.txt

The raw text is read through pv.Doc and split into paragraph objects.


In [29]:
book = 'book7'
doc = pv.Doc('../../clusterpot/%s.txt' %book)
print("title: %s" %doc.paragraphs[0].text)


title: Harry Potter and the Deathly Hallows

Read NRC lexicon

The NRC emotion lexicon is a curated list of English words annotated with

  • sentiments: positive and negative
  • emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, trust

Details of this work may be found at Saif Mohammad's homepage.

The HDF5 file used here was produced by cleaning up the original NRC text file; a sketch of this conversion follows the column list below.

The resulting pandas dataframe has the following columns:

  • word | string
    • a word in the lexicon
  • emo | string
    • one of the sentiment or emotion labels above
  • binary | int
    • 0 if the word does not elicit the sentiment or emotion
    • 1 if the word elicits the sentiment or emotion
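
The cleaning step is not shown in this notebook. A minimal sketch of how such a dataframe could be built, assuming the tab-separated word/emotion/association layout of the distributed NRC word-level file (the filename below is an assumption):

import pandas as pd

# assumed layout: word<TAB>emotion<TAB>association, one entry per line
nrc = pd.read_csv('NRC-Emotion-Lexicon-Wordlevel-v0.92.txt', sep='\t',
                  names=['word', 'emo', 'binary'])
nrc.to_hdf('./NRC.h5', key='nrc', mode='w')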

In [30]:
df = pd.read_hdf('./NRC.h5')
df


Out[30]:
word emo binary
0 aback anger 0
1 aback anticipation 0
2 aback disgust 0
... ... ... ...
141817 zoom sadness 0
141818 zoom surprise 0
141819 zoom trust 0

141820 rows × 3 columns

Generate emotional vectors for every paragraph

  • for every paragraph (p) in the text
    • tokenize the paragraph with .split()
    • initialize the emotion vector emo_d
    • for every word (w) in paragraph (p)
      • strip special characters
      • query for the word in the NRC lexicon and read out the emo column where binary = 1
      • for every returned emotion (e)
        • if e is one of the emotions tracked in emo_d, add 1 to that emotion
    • append the emotion vector emo_d to a list

In [31]:
emo_list = []   # one emotion vector (dict) per paragraph
moo_d = dict()  # per-paragraph list of words that triggered a tracked emotion
for p in tqdm(range(len(doc.paragraphs)), desc='par'):
    s = doc.paragraphs[p].text.split()
    emo_d = {'joy':0, 'fear':0, 'surprise':0, 'sadness':0, 'disgust':0, 'anger':0}
    moo_d[p] = []
    for w in s:
        # strip special characters before looking the word up in the lexicon
        w = re.sub("[!@#$%^&*()_+:;,.\\?']", "", w)
        # sentiments/emotions this word elicits (binary == 1) in the NRC lexicon
        emo = df.query("word=='%s'" %w).query("binary==1")['emo'].as_matrix()
        for e in emo:
            if e in emo_d:
                emo_d[e] += 1
                moo_d[p].append(w)
    # keep each eliciting word only once per paragraph
    moo_d[p] = np.unique(moo_d[p]).tolist()
    emo_list.append(emo_d)



Write emotional vectors to HDF5

The HDF5 file is structured as

  • bookn
    • len(doc.paragraphs) x len(emo_d.keys()) array of integers

where n spans 1 through 7.
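
For reference, a sketch of loading every per-book table back from this file in one pass (assuming all seven books have been processed and written under the keys above):

emo_by_book = {'book%d' % n: pd.read_hdf('./book_emo.h5', key='book%d' % n)
               for n in range(1, 8)}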


In [32]:
pd.DataFrame(emo_list).to_hdf('./book_emo.h5',book)
pd.DataFrame.from_dict(moo_d, orient='index').to_hdf('./book_moo.h5', book)


/Users/cchang5/anaconda/envs/cdips2017/lib/python3.6/site-packages/pandas/core/generic.py:1282: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]]

  return pytables.to_hdf(path_or_buf, key, self, **kwargs)

Read emo vectors back in from HDF5 to analyse the text

Read in along with the corresponding text.


In [12]:
rbook = 'book2'
doc = pv.Doc('../../clusterpot/%s.txt' %rbook)
book_emotions = pd.read_hdf('book_emo.h5',key=rbook)
book_mood = pd.read_hdf('book_moo.h5',key=rbook)
book_emotions.join(book_mood)


Out[12]:
anger disgust fear joy sadness surprise 0 1 2 3 4 5 6 7 8 9 10 11
0 0 0 0 0 0 0 None None None None None None None None None None None None
1 0 0 0 0 0 0 None None None None None None None None None None None None
2 0 0 0 0 0 0 None None None None None None None None None None None None
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3223 1 0 1 0 1 0 crazy None None None None None None None None None None None
3224 0 0 0 0 0 0 None None None None None None None None None None None None
3225 0 0 0 0 0 0 None None None None None None None None None None None None

3226 rows × 18 columns

Simple example: choose emotion to sort by

Choose emo_kw from the list ['anger', 'disgust', 'fear', 'joy', 'sadness', 'surprise']


In [44]:
emo_kw = 'anger'
emo_rank = np.argsort(book_emotions[emo_kw].as_matrix())[::-1]
emo_rank


Out[44]:
array([ 522, 2645, 3036, ..., 2043, 2042,    0])

pidx | int

  • starts from 0; selects the paragraph corresponding to the pidx-th highest value of the chosen emotion

In [48]:
pidx = 1
print("paragraph from text\n%s\n" %doc.paragraphs[emo_rank[pidx]].text)
print("moo_d emo Harry feels\n%s\n" %book_mood.iloc[emo_rank[pidx]].dropna())
print("emo vector\n%s" %book_emotions.iloc[emo_rank[pidx]])


paragraph from text
"That is because it is a monstrous thing, to slay a unicorn," said Firenze. "Only one who has nothing to lose, and everything to gain, would commit such a crime. The blood of a unicorn will keep you alive, even if you are an inch from death, but at a terrible price. You have slain something pure and defenseless to save yourself, and you will have but a half-life, a cursed life, from the moment the blood touches your lips."

moo_d emo Harry feels
0       alive
1       crime
2      cursed
       ...   
7        save
8        slay
9    terrible
Name: 2645, Length: 10, dtype: object

emo vector
anger       6
disgust     3
fear        5
joy         3
sadness     5
surprise    2
Name: 2645, dtype: int64

Plot emotional vectors

Plots three components of the emotional vectors against each other in a 3D scatter plot.

rescale | boolean

  • True rescales the vectors by paragraph length.

In [29]:
rescale = True
if rescale:
    slength = [float(len(doc.paragraphs[i].text.split())) for i in range(len(doc.paragraphs))]
else:
    slength = 1
x = book_emotions['anger'].as_matrix()/slength
y = book_emotions['disgust'].as_matrix()/slength
z = book_emotions['sadness'].as_matrix()/slength
fig = plt.figure('emotional correlation',figsize=(7,7))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(xs=x,ys=y,zs=z)
ax.set_xlabel('anger')
ax.set_ylabel('disgust')
ax.set_zlabel('sadness')
plt.draw()
plt.show()


/Users/cchang5/anaconda/envs/cdips2017/lib/python3.6/site-packages/ipykernel_launcher.py:6: RuntimeWarning: invalid value encountered in true_divide
  
/Users/cchang5/anaconda/envs/cdips2017/lib/python3.6/site-packages/ipykernel_launcher.py:7: RuntimeWarning: invalid value encountered in true_divide
  import sys
/Users/cchang5/anaconda/envs/cdips2017/lib/python3.6/site-packages/ipykernel_launcher.py:8: RuntimeWarning: invalid value encountered in true_divide
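
The RuntimeWarnings above come from paragraphs that tokenize to zero words, so the 0/0 divisions produce NaN entries. A minimal guard (not used above) is to mask the zero lengths before dividing:

slength = np.array([len(doc.paragraphs[i].text.split())
                    for i in range(len(doc.paragraphs))], dtype=float)
slength[slength == 0] = np.nan  # empty paragraphs become NaN and are skipped when plotted
x = book_emotions['anger'].as_matrix() / slength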
  

Test code


In [123]:
slength = [float(len(doc.paragraphs[i].text.split())) for i in range(len(doc.paragraphs))]
y = np.sum(book_emotions.as_matrix(),axis=1)
plt.figure('paragraph length vs. emotion sum',figsize=(7,7))
ax = plt.axes([0.15,0.15,0.8,0.8])
ax.scatter(x=slength,y=y)
#ax.set_xlim([0,1])
#ax.set_ylim([0,1])
#ax.set_xscale('log')
plt.draw()
plt.show()



In [34]:
s = "Upon the signature of the International Statute of Secrecy in 1689, wizards went into hiding for good. It was natural, perhaps, that they formed their own small communities within a community. Many small villages and hamlets attracted several magical families, who banded together for mutual support and protection. The villages of Tinworsh in Cornwall, Upper Flagley in Yorkshire, and Ottery St. Catchpole on the south coast of England were notable homes to knots of Wizarding families who lived alongside tolerant and sometimes Confunded Muggles. Most celebrated of these half-magical dwelling places is, perhaps, Godrics Hollow, the West Country village where the great wizard Godric Gryffindor was born, and where Bowman Wright, Wizarding smith, forged the first Golden Snitch. The graveyard is full of the names of ancient magical families, and this accounts, no doubt, for the stories of hauntings that have dogged the little church beside it for many centuries."
w = s.split()[0]
%timeit re.sub("[!@#$%^&*()_+:;,.\\?']", "", w)


The slowest run took 4.14 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.04 µs per loop

In [33]:
%%timeit
punctuation = ["!","@","#","$","%","^","&","*","(",")","_","+",":",";",",",".","?","'","\\"]
for punc in punctuation:
    w.replace("%s" %punc,'')


100000 loops, best of 3: 4.67 µs per loop
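
For comparison, Python's built-in str.translate removes a fixed character set in a single pass; a sketch of the equivalent strip (not timed here):

strip_table = str.maketrans('', '', "!@#$%^&*()_+:;,.?'\\")
w.translate(strip_table)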

In [31]:
moodf = pd.DataFrame.from_dict(moo_d, orient='index')

In [38]:
moodf.iloc[25].dropna()


Out[38]:
0    forgotten
1       uneasy
2        words
Name: 25, dtype: object

In [26]:
moodf.to_hdf('./book_moo_v2.h5', book)


/Users/cchang5/anaconda/envs/cdips2017/lib/python3.6/site-packages/pandas/core/generic.py:1282: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]]

  return pytables.to_hdf(path_or_buf, key, self, **kwargs)

In [39]:
book_mood = pd.read_hdf('book_moo_v2.h5',key=rbook)

In [40]:
book_mood


Out[40]:
0 1 2 3 4 5 6 7 8 9 10 11
0 None None None None None None None None None None None None
1 None None None None None None None None None None None None
2 None None None None None None None None None None None None
... ... ... ... ... ... ... ... ... ... ... ... ...
3237 good unpleasant None None None None None None None None None None
3238 fun grin surprised None None None None None None None None None
3239 None None None None None None None None None None None None

3240 rows × 12 columns


In [ ]: