Applying stanford CoreNLP sentiment to the speechacts

Can only do sentiment in batch processing, so first move all the speechacts to files:


In [88]:
import os
import pickle
import collections
import pandas as pd
import numpy as np

After this, I split the file into one-line parts using the following command:

$ split -l 1 -a 7 speechacts.txt

and removing the original speechacts.txt.

This ensures that each speechact is its' own file, and it receives its' own corenlp object analysis.

then, we compute the overall sentiment for each speechact by batch-parsing the files. parsed is a generator, so no computation will be done until we start iterating over it.

the split files are in alphabetical order, and batch_parse iterates through the files alphabetically, so iterating over the parsed files should retain the order from the dataframe.

this will take around 4 hours to compute. be sure to save intermediary results in a file. (108sec for 26 speechacts, ~26*100 total speechacts, 10800 seconds, ~3hrs)

(see batch-parse-sentiment.py)


In [3]:
sent = pickle.load(open('pickles/corenlp_sentiment/corenlp_sentiment_FINAL.p', 'rb'))

In [5]:
sent[:10]


Out[5]:
[[('xaaaaaaaaaa', u'Negative', 1), ('xaaaaaaaaaa', u'Neutral', 2)],
 [('xaaaaaaamaa', u'Neutral', 2)],
 [('xaaaaaaamab', u'Negative', 1)],
 [('xaaaaaaamac', u'Neutral', 2)],
 [('xaaaaaaamad', u'Neutral', 2)],
 [('xaaaaaaamae', u'Neutral', 2), ('xaaaaaaamae', u'Negative', 1)],
 [('xaaaaaaamaf', u'Neutral', 2)],
 [('xaaaaaaamag', u'Negative', 1)],
 [('xaaaaaaamah', u'Negative', 1)],
 [('xaaaaaaamai', u'Neutral', 2)]]

re-format the list


In [22]:
s = []
for l in sent:
    if type(l) != list:
        s.append(l)
        continue
    fname = l[0][0]
    newl = []
    for tup in l:
        newl.append(tup[1:])
    s.append((fname, newl))

s[:10]


Out[22]:
[('xaaaaaaaaaa', [(u'Negative', 1), (u'Neutral', 2)]),
 ('xaaaaaaamaa', [(u'Neutral', 2)]),
 ('xaaaaaaamab', [(u'Negative', 1)]),
 ('xaaaaaaamac', [(u'Neutral', 2)]),
 ('xaaaaaaamad', [(u'Neutral', 2)]),
 ('xaaaaaaamae', [(u'Neutral', 2), (u'Negative', 1)]),
 ('xaaaaaaamaf', [(u'Neutral', 2)]),
 ('xaaaaaaamag', [(u'Negative', 1)]),
 ('xaaaaaaamah', [(u'Negative', 1)]),
 ('xaaaaaaamai', [(u'Neutral', 2)])]

Make sure we only have unique filenames


In [15]:
s_no_dupes = [l for n,l in enumerate(s) if l not in s[:n] and l not in s[n+1:]]

In [17]:
print len(s_no_dupes)
print len(s)


62
49252

In [18]:
a = [1,2,3,4, 4, 8, 9, 3, 1, 10]
print a[:2]
print a[2:]
a_no_dupes = [l for n,l in enumerate(a) if l not in a[:n] and l not in a[n+1:]]
print a_no_dupes


[1, 2]
[3, 4, 4, 8, 9, 3, 1, 10]
[2, 8, 9, 10]

In [43]:
filenames = [t[0] for t in s if type(t) == tuple]

In [44]:
len(list(set(filenames)))


Out[44]:
4255

make a dict of filename -> sentiment


In [46]:
s = [item for item in s if type(item) == tuple]
sent_filename_dict = dict(s)

So, now we have all the filenames. Now what?

make a dataframe with the speechacts, try to correlate with the speechact data from previous analyses.

Joining with previous data

Get the speechacts for each filename


In [47]:
fpath = '/Users/dan/classes/research/huac-testimony/pickles/speechacts_old/'
speech_dict = {}
for filename in filenames:
    filepath = os.path.join(fpath, filename)
    speechact = ""
    with open(filepath, 'rb') as f:
        speechact = f.readline()
    
    speech_dict[speechact] = sent_filename_dict[filename]

In [48]:
speech_dict.items()[:5]


Out[48]:
[(' Absolutely, sir. If someone had said to me, "Come on to a meeting of the Republican Party and meet a Democrat," I would have gone.\n',
  [(u'Negative', 1), (u'Negative', 1)]),
 (" That is the way it looked to me. \x0c526 COMMUNIrT! ACTIVITIES IN THlE LOS ANa1RLES'AREA \n",
  [(u'Neutral', 2), (u'Neutral', 2), (u'Negative', 1)]),
 (' Now they set minority against majority to create those things to enable them to move in, Isthat a correct statement I\n',
  [(u'Positive', 3)]),
 (' May I ask a question at this point. Mr. Blankfort, upon what do you base your statement that Communists were not permitted to associate with anti-Communists when there is ample testimony in the record before this committee that Communists were directed to maintain entirely cordial relations both in church, in 95829-52-pt. 7- \x0c2336 COMMUNISM IN HOLLYWOOD MOTION-PICTURE INDUSTRY lodges, in political registration with non-Communists for the purpose of influencing ?\n',
  [(u'Neutral', 2), (u'Neutral', 2), (u'Negative', 1)]),
 (' Well, I would like to add this: Since my disassociation from the Communist Party I feel much freer, as though a burden were taken off of my mind, because as I said, for some time the struggle had been going on within me, whether I was doing the right thing by still being attached to something that was so definitely opposed to American democratic tradition, and that having severed all connec- \x0c934 COMMUNIST ACTIVITIES IN THE LOS ANGELES AREA tions and bonds with the Communist Party, I can think and conduct myself, I believe, more in the American tradition. \n',
  [(u'Verynegative', 0)])]

make the new column in the dataframe

read the pickle


In [180]:
df = pd.read_pickle('pickles/final/final_analysis.p')

add the column for corenlp/sentiment data.


In [49]:
import Levenshtein
corenlp_sentiment = []
for n, row in df.iterrows():
    speechact = row['speechact']
    found_match = False
    for sa, sent in speech_dict.items():
        if Levenshtein.ratio(sa, speechact) > 0.85:
            corenlp_sentiment.append(sent)
            found_match = True
            break # don't want multiple. just take the first.
    if not found_match:
        corenlp_sentiment.append(np.nan)

In [50]:
corenlp_sentiment[:10]


Out[50]:
[[(u'Negative', 1), (u'Neutral', 2)],
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan]

In [53]:
pickle.dump(corenlp_sentiment, open('pickles/final/corenlp_sentiment_on_speechacts_list.p', 'wb'))

In [156]:
sentiment = pickle.load(open('pickles/final/corenlp_sentiment_on_speechacts_list.p', 'rb'))

In [165]:
sentiment[:5]


Out[165]:
[[(u'Negative', 1), (u'Neutral', 2)], nan, nan, nan, nan]

Encode the sentiment as a usable number

Neutral sentiment isn't interesting. We should remove neutral sentiment from the series.

So then the encoding looks like this:

p -> 0 - 5

n -> 0 - -5

vp -> 5 - 10

vn -> -5 - -10


In [171]:
def base_and_multiplier_from_category(category):
    categories = ["Verynegative", "Negative", "Positive", "Verypositive"]
    base = (categories.index(category) - 1)/2 * 5
    multiplier = categories.index(category) > 1
    if multiplier:
        return (base, 1)
    else:
        return (base, -1)

def category_and_score_to_number(t):
    "produces a number that represents the given intensity and score."
    category, score = t
    base, multiplier = base_and_multiplier_from_category(category)
    return base + multiplier*score

In [172]:
coded_sentiment = []

for s in sentiment:
    new_s = []
    if type(s) == float:
        coded_sentiment.append(np.nan)
        continue
    for pair in s:
        if pair[0] == "Neutral":
            new_s.append(np.nan)
        else:
            new_s.append(category_and_score_to_number(pair))
    coded_sentiment.append(new_s)

In [173]:
coded_sentiment[:5]


Out[173]:
[[-1, nan], nan, nan, nan, nan]

In [174]:
pickle.dump(coded_sentiment, open('pickles/final/corenlp_sentiment_on_speechacts_list_coded.p', 'wb'))

In [181]:
sentiment = pickle.load(open('pickles/final/corenlp_sentiment_on_speechacts_list_coded.p', 'rb'))

In [182]:
df['corenlp_sentiment_by_sentence'] = sentiment

Now, we have the corenlp sentiment by-sentence for each speechact.

Time to create the graph.


In [183]:
df.head()


Out[183]:
is_interviewee is_response liwc_categories_by_sentence liwc_categories_for_speechact liwc_sentiment_by_sentence liwc_sentiment_for_speechact liwc_sentiment_towards_entities_with_anaphora liwc_sentiment_towards_entities_without_anaphora liwc_sentiment_towards_only_anaphora mention_list_by_sentence_with_anaphora mention_list_by_sentence_without_anaphora mention_list_for_speechact_without_anaphora speechact speaker mention_list_by_sentence_only_anaphora corenlp_sentiment_by_sentence
0 False False [{}, {u'School': 0.111111111111, u'Pronoun': 0... {u'School': 0.0555555555556, u'Pronoun': 0.055... [0.0, 0.0] 0.000000 NaN {} NaN NaN [[], []] [] 'I'AvENNEit. Have both counsel identified them... macia NaN [-1, nan]
1 False False [{}, {u'Time': 0.25, u'Incl': 0.25, u'Space': ... {u'Space': 0.125, u'Incl': 0.125, u'Time': 0.125} [0.0, 0.0] 0.000000 NaN {} NaN NaN [[], []] [] AI'AENNEI1. When and where- macia NaN NaN
2 False False [{}, {u'Negate': 0.5}, {u'Pronoun': 0.125, u'F... {u'Pronoun': 0.0416666666667, u'Future': 0.041... [0.0, 0.0, -0.2] -0.066667 NaN {u'JACK': 0.0} NaN NaN [[JACK], [], []] [JACK] ,JACK ON. No. I am sorry, you cannot. macia NaN NaN
3 False False [{u'Cogmech': 0.0588235294118, u'Posemo': 0.05... {u'Cogmech': 0.0444147355912, u'Tentat': 0.020... [-0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.022222 NaN {} NaN NaN [[], [], [], [], [], [], [], [], []] [] The request, in line with the rules of the co... jackson NaN NaN
4 False False [{}, {u'Eating': 0.0833333333333, u'Pronoun': ... {u'Eating': 0.0416666666667, u'Pronoun': 0.083... [0.0, 0.0] 0.000000 NaN {} NaN NaN [[], []] [] ''M NER. illt volt (Otill emtoyvl its italelo ... macia NaN NaN

In [184]:
df.to_pickle('pickles/final/with_corenlp_sentiment_df.p')

Graphs

Constructing the graph with category scores

We want to construct a graph that contains every edge corenlp picked up as having sentiment. Each edge should have the corenlp sentiment measure as well as the liwc sentiment measure, and the liwc pos/neg categories.

Then, we can compare the two in a graph.

name disambiguation


In [185]:
disambiguated_names = pickle.load(open('pickles/final/disambiguated_names.p', 'rb'))

def get_key(mention, l):
    "returns the numerical key for the given mention"
    mention = mention.lower()
    for chunk in l:
        if mention in chunk:
            return l.index(chunk)

some utilities


In [ ]:
import difflib
transcript_dir = os.path.join("testimony/text/hearings")

interviewee_names = [f.replace(".txt", "") for f in os.listdir(transcript_dir)]
interviewee_names = map(lambda s: s.replace("-", " "), interviewee_names)

def is_interviewer(name):
    return not difflib.get_close_matches(name, interviewee_names)


graph_data = sentiment_graph_data = collections.defaultdict(lambda : collections.defaultdict(int))

In [192]:
import difflib
transcript_dir = os.path.join("testimony/text/hearings")

interviewee_names = [f.replace(".txt", "") for f in os.listdir(transcript_dir)]
interviewee_names = map(lambda s: s.replace("-", " "), interviewee_names)

relevant_categories = ['Posemo', 'Negemo', 'Anger', 'Posfeel']

def is_interviewer(name):
    return not difflib.get_close_matches(name, interviewee_names)

# the graph data has to be stored in a separate dict until we
# construct the graph; adding multiple edges between two nodes just
# replaces the attributes. We want to average/accumulate them.

nn_categories_by_sentence = []

nn_graph_data = collections.defaultdict(lambda : collections.defaultdict(dict))
sentiment_graph_data = collections.defaultdict(lambda : collections.defaultdict(int))
liwc_sentiment_graph_data = collections.defaultdict(lambda : collections.defaultdict(int))
count_data = collections.defaultdict(int)

skipped = 0
for n, row in df.iterrows():
    if n % 500 == 0:
        print n, "rows analyzed"

    speaker = row['speaker']
    if is_interviewer(speaker):
        skipped += 1
        continue
        
    # there's anaphora
    mention_list_w_anaphora = row['mention_list_by_sentence_with_anaphora']
    mention_list_wo_anaphora = row['mention_list_by_sentence_without_anaphora']
    corenlp_sentiment_by_sentence = row['corenlp_sentiment_by_sentence']
    liwc_sentiment_by_sentence = row['liwc_sentiment_by_sentence']
    
    if type(corenlp_sentiment_by_sentence) == float:
        continue
        skipped += 1
        
        

    speaker = get_key(speaker, disambiguated_names)

    if type(mention_list_w_anaphora) == list:
       sentiment_towards_mentions = {}
       for n, mentions in enumerate(mention_list_w_anaphora):
           for mention in mentions:
               mention = get_key(mention, disambiguated_names)                
               if speaker == mention or not mention or type(corenlp_sentiment_by_sentence[n]) == float:
                   skipped += 1
                   continue
               sentiment_graph_data[speaker][mention] += corenlp_sentiment_by_sentence[n]
               liwc_sentiment_graph_data[speaker][mention] += liwc_sentiment_by_sentence[n]
               count_data[(speaker, mention)] += 1
        
    elif type(mention_list_wo_anaphora) == list:
        categories_towards_mentions = {}
        for n, mentions in enumerate(mention_list_wo_anaphora):
            for mention in mentions:
                mention = get_key(mention, disambiguated_names)
                if speaker == mention or not mention or type(corenlp_sentiment_by_sentence[n]) == float:
                    skipped += 1
                    continue
                sentiment_graph_data[speaker][mention] += corenlp_sentiment_by_sentence[n]
                liwc_sentiment_graph_data[speaker][mention] += liwc_sentiment_by_sentence[n]
                count_data[(speaker, mention)] += 1
                
print "skipped", skipped
print "sentiment_graph_data and liwc_sentiment_graph_data are now populated."


0 rows analyzed
500 rows analyzed
1000 rows analyzed
1500 rows analyzed
2000 rows analyzed
2500 rows analyzed
3000 rows analyzed
3500 rows analyzed
4000 rows analyzed
4500 rows analyzed
5000 rows analyzed
5500 rows analyzed
6000 rows analyzed
6500 rows analyzed
7000 rows analyzed
7500 rows analyzed
8000 rows analyzed
8500 rows analyzed
9000 rows analyzed
9500 rows analyzed
10000 rows analyzed
10500 rows analyzed
11000 rows analyzed
11500 rows analyzed
12000 rows analyzed
12500 rows analyzed
skipped 7081
sentiment_graph_data and liwc_sentiment_graph_data are now populated.

In [206]:
sentiment_graph_data.items()[:5]


Out[206]:
[(135,
  defaultdict(<type 'int'>, {416: -1, 1221: -1, 643: -1, 836: -4, 453: -2, 6: -2, 263: -5, 777: -1, 1227: -2, 143: -2, 1170: 3, 451: -3, 948: -1, 1013: -1, 1079: -1, 1242: 3, 986: -1, 191: -1})),
 (136,
  defaultdict(<type 'int'>, {515: -2, 534: -2, 583: 0, 489: -2, 394: -5, 940: 3, 1358: -1, 381: -4, 1328: -10, 721: -1, 946: -1, 179: 3, 1300: -2, 1046: -1, 271: -1, 925: -1})),
 (144,
  defaultdict(<type 'int'>, {386: -1, 1315: -1, 700: -1, 42: -2, 555: -1, 514: -1, 142: -1, 52: -3, 756: -1, 1238: -1, 1332: -1, 508: -2, 20: -1, 191: -6})),
 (148,
  defaultdict(<type 'int'>, {226: -1, 1143: -1, 453: -1, 1351: -5, 167: -1, 233: -1, 10: -1, 395: -2, 492: -1, 898: -5, 686: -5, 624: -1, 594: -5, 404: -1, 297: -1, 698: -5, 767: -2, 1322: -1, 191: -5})),
 (670,
  defaultdict(<type 'int'>, {1024: -1, 262: -5, 264: -7, 907: -1, 531: -2, 879: -5, 412: -15, 30: -1, 1311: -5, 112: -3, 1186: -1, 806: -1, 1063: -1, 938: -4, 813: -1, 814: -2, 944: -2, 946: -2, 691: 6, 1081: -2, 58: -2, 415: -9, 1084: -3, 1342: -2, 191: -7, 842: -15, 1356: -1, 418: -3, 975: -3, 851: -1, 859: -1, 351: -9, 491: -6, 367: -5, 240: -5, 885: -1}))]

In [212]:
pos_stanford = 0
neg_stanford = 0
pos_liwc = 0
neg_liwc = 0
import networkx as nx
G = nx.DiGraph()
for source, targets in sentiment_graph_data.items():
    for accused, sent in targets.items():
        attrs = {'stanford_sent': sent, 'liwc_sent': liwc_sentiment_graph_data[source][accused]}
        if sent > 0:
            pos_stanford += 1
        elif sent < 0:
            neg_stanford += 1
        
        if liwc_sentiment_graph_data[source][accused] > 0:
            pos_liwc += 1
        elif liwc_sentiment_graph_data[source][accused] < 0:
            neg_liwc += 1
            
        G.add_edge(source, accused, attrs)
        
        
for speaker, mentions in anaphora_graph_data.items():
    if n % 10 == 0:
        print "analyzing", n, "mentions."
    for mentioned, attrs in mentions.items():
        count = anaphora_count_data[(speaker, mentioned)]
        normalized_attrs = normalize_dict(attrs, count)
        filtered = filter_categories(normalized_attrs, relevant_categories)
        n += 1
        try:
            dominant = max(filtered.items(), key=lambda p:p[1])
            dominant = {dominant[0] : dominant[1]}
            G_only_anaphora_with_dominant_categories.add_edge(speaker, mentioned, dominant)
        except ValueError:
            G_only_anaphora_with_dominant_categories.add_edge(speaker, mentioned)


  File "<ipython-input-212-189a35e0b1b9>", line 24
    if n % 10 == 0:
    ^
IndentationError: expected an indented block

In [211]:
nx.write_gml(G, 'graphs/final/corenlp_vs_liwc_sentiment.gml')

In [209]:
G.node


Out[209]:
{3: {},
 6: {},
 7: {},
 10: {},
 20: {},
 26: {},
 29: {},
 30: {},
 38: {},
 39: {},
 40: {},
 42: {},
 47: {},
 52: {},
 58: {},
 66: {},
 67: {},
 69: {},
 71: {},
 74: {},
 84: {},
 99: {},
 102: {},
 104: {},
 107: {},
 109: {},
 110: {},
 112: {},
 118: {},
 120: {},
 132: {},
 135: {},
 136: {},
 142: {},
 143: {},
 144: {},
 148: {},
 157: {},
 158: {},
 162: {},
 167: {},
 169: {},
 178: {},
 179: {},
 182: {},
 184: {},
 185: {},
 191: {},
 193: {},
 195: {},
 198: {},
 199: {},
 204: {},
 206: {},
 208: {},
 210: {},
 220: {},
 226: {},
 233: {},
 236: {},
 240: {},
 246: {},
 247: {},
 258: {},
 262: {},
 263: {},
 264: {},
 271: {},
 272: {},
 277: {},
 279: {},
 289: {},
 297: {},
 304: {},
 312: {},
 323: {},
 326: {},
 329: {},
 334: {},
 342: {},
 348: {},
 351: {},
 352: {},
 354: {},
 364: {},
 367: {},
 368: {},
 379: {},
 381: {},
 385: {},
 386: {},
 387: {},
 390: {},
 391: {},
 394: {},
 395: {},
 397: {},
 400: {},
 401: {},
 404: {},
 408: {},
 412: {},
 414: {},
 415: {},
 416: {},
 418: {},
 419: {},
 423: {},
 428: {},
 439: {},
 440: {},
 445: {},
 446: {},
 449: {},
 451: {},
 453: {},
 458: {},
 460: {},
 463: {},
 470: {},
 482: {},
 483: {},
 486: {},
 489: {},
 491: {},
 492: {},
 496: {},
 497: {},
 503: {},
 504: {},
 508: {},
 510: {},
 512: {},
 514: {},
 515: {},
 522: {},
 525: {},
 529: {},
 531: {},
 534: {},
 535: {},
 548: {},
 552: {},
 553: {},
 554: {},
 555: {},
 558: {},
 559: {},
 569: {},
 573: {},
 577: {},
 583: {},
 591: {},
 592: {},
 594: {},
 603: {},
 604: {},
 606: {},
 607: {},
 613: {},
 616: {},
 617: {},
 618: {},
 624: {},
 629: {},
 633: {},
 643: {},
 653: {},
 658: {},
 660: {},
 669: {},
 670: {},
 674: {},
 678: {},
 684: {},
 686: {},
 690: {},
 691: {},
 698: {},
 700: {},
 706: {},
 707: {},
 708: {},
 709: {},
 710: {},
 712: {},
 721: {},
 728: {},
 738: {},
 739: {},
 744: {},
 755: {},
 756: {},
 767: {},
 776: {},
 777: {},
 787: {},
 789: {},
 791: {},
 804: {},
 805: {},
 806: {},
 811: {},
 813: {},
 814: {},
 834: {},
 836: {},
 842: {},
 844: {},
 851: {},
 859: {},
 860: {},
 868: {},
 879: {},
 885: {},
 891: {},
 892: {},
 894: {},
 898: {},
 904: {},
 905: {},
 907: {},
 921: {},
 922: {},
 925: {},
 929: {},
 938: {},
 940: {},
 944: {},
 946: {},
 948: {},
 952: {},
 959: {},
 964: {},
 969: {},
 971: {},
 972: {},
 975: {},
 977: {},
 978: {},
 981: {},
 986: {},
 989: {},
 990: {},
 1000: {},
 1001: {},
 1006: {},
 1013: {},
 1017: {},
 1023: {},
 1024: {},
 1032: {},
 1039: {},
 1040: {},
 1046: {},
 1050: {},
 1063: {},
 1067: {},
 1072: {},
 1077: {},
 1079: {},
 1081: {},
 1084: {},
 1093: {},
 1106: {},
 1107: {},
 1108: {},
 1112: {},
 1114: {},
 1117: {},
 1125: {},
 1131: {},
 1132: {},
 1134: {},
 1137: {},
 1143: {},
 1146: {},
 1148: {},
 1156: {},
 1160: {},
 1169: {},
 1170: {},
 1171: {},
 1173: {},
 1186: {},
 1189: {},
 1195: {},
 1196: {},
 1198: {},
 1212: {},
 1219: {},
 1221: {},
 1226: {},
 1227: {},
 1235: {},
 1236: {},
 1238: {},
 1240: {},
 1242: {},
 1245: {},
 1247: {},
 1256: {},
 1292: {},
 1299: {},
 1300: {},
 1301: {},
 1304: {},
 1306: {},
 1311: {},
 1315: {},
 1322: {},
 1328: {},
 1329: {},
 1332: {},
 1342: {},
 1351: {},
 1356: {},
 1358: {},
 1366: {},
 1367: {},
 1368: {},
 1371: {},
 1372: {}}

In [204]:
count_data[(229, 287)]


Out[204]:
0

In [ ]: