Chapter 3, Table 2

This notebook explains how I used the Harvard General Inquirer to *streamline* interpretation of a predictive model.

I'm italicizing the word "streamline" because I want to emphasize that I place very little weight on the Inquirer: as I say in the text, "The General Inquirer has no special authority, and I have tried not to make it a load-bearing element of this argument."

To interpret a model, I actually spend a lot of time looking at lists of features, as well as predictions about individual texts. But to explain my interpretation, I need some relatively simple summary. Given real-world limits on time and attention, going on about lists of individual words for five pages is rarely an option. So, although wordlists are crude and arbitrary devices, flattening out polysemy and historical change, I am willing to lean on them rhetorically, where I find that they do in practice echo observations I have made in other ways.

I should also acknowledge that I'm not using the General Inquirer as it was designed to be used. The full version of this tool is not just a set of wordlists; it's a software package that tries to get around polysemy by disambiguating different word senses. I haven't tried to use it in that way: I think it would complicate my explanation, in order to project an impression of accuracy and precision that I don't particularly want to project. Instead, I have stressed that wordlists are crude tools, and I'm using them only as crude approximations.

That said, how do I do it?

To start with, we'll load an array of modules: some standard ones, and some utilities that I've written myself.


In [2]:
# some standard modules

import csv, os, sys
from collections import Counter
import numpy as np
from scipy.stats import pearsonr

# now a module that I wrote myself, located
# a few directories up, in the software
# library for this repository

sys.path.append('../../lib')
import FileCabinet as filecab

Loading the General Inquirer.

This takes some doing, because the General Inquirer doesn't start out as a set of wordlists. I have to translate it into that form.

I start by loading an English dictionary.


In [2]:
# start by loading the dictionary

dictionary = set()

with open('../../lexicons/MainDictionary.txt', encoding = 'utf-8') as f:
    reader = csv.reader(f, delimiter = '\t')
    for row in reader:
        word = row[0]
        count = int(row[2])
        if count < 10000:
            continue
            # that ignores very rare words
            # we end up with about 42,700 common ones
        else:
            dictionary.add(word)

The next stage is to translate the Inquirer. It begins as a table where word senses are row labels, and the Inquirer categories are columns (except for two columns at the beginning and two at the end). This is, by the way, the "basic spreadsheet" described at this site: http://www.wjh.harvard.edu/~inquirer/spreadsheet_guide.htm

I translate this into a dictionary where the keys are Inquirer categories, and the values are sets of words associated with each category.

But to do that, I have to do some filtering and expanding. Different senses of a word are broken out in the spreadsheet thus:

ABOUT#1

ABOUT#2

ABOUT#3

etc.

I need to split off the part after the hash sign, which is the sense number. Also, because I don't want to give rare senses of a word too much power, I ignore everything but the first sense of a word.

However, I also want to allow singular verb forms and plural nouns to count. So there's some code below that expands words by adding -s, -ed, etc. to the end. See the suffixes defined below for more details. Note that I use the English dictionary to determine which of the possible forms are real words.


In [3]:
inquirer = dict()

suffixes = dict()
suffixes['verb'] = ['s', 'es', 'ed', 'd', 'ing']
suffixes['noun'] = ['s', 'es']

allinquirerwords = set()

with open('../../lexicons/inquirerbasic.csv', encoding = 'utf-8') as f:
    reader = csv.DictReader(f)
    fields = reader.fieldnames[2:-2]
    for field in fields:
        inquirer[field] = set()

    for row in reader:
        term = row['Entry']

        if '#' in term:
            parts = term.split('#')
            word = parts[0].lower()
            sense = int(parts[1].strip('_ '))
            partialsense = True
        else:
            word = term.lower()
            sense = 0
            partialsense = False

        if sense > 1:
            continue
            # we're ignoring uncommon senses

        pos = row['Othtags']
        if 'Noun' in pos:
            pos = 'noun'
        elif 'SUPV' in pos:
            pos = 'verb'

        forms = {word}
        if pos == 'noun' or pos == 'verb':
            for suffix in suffixes[pos]:
                if word + suffix in dictionary:
                    forms.add(word + suffix)
                if pos == 'verb' and word.rstrip('e') + suffix in dictionary:
                    forms.add(word.rstrip('e') + suffix)

        for form in forms:
            for field in fields:
                if len(row[field]) > 1:
                    inquirer[field].add(form)
                    allinquirerwords.add(form)
                    
print('Inquirer loaded')
print('Total of ' + str(len(allinquirerwords)) + " words.")


Inquirer loaded
Total of 13707 words.
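
As a small illustration of the two steps described above (splitting off the sense number, then expanding a word with suffixes that the dictionary confirms), here is a minimal sketch on invented data. The helper names `parse_entry` and `expand_forms` and the toy vocabulary are assumptions made for this example; they are not part of the notebook's own code, though the logic mirrors the cell above.

```python
# Toy illustration only: the vocabulary is invented, and
# parse_entry / expand_forms are hypothetical helper names.

SUFFIXES = {'verb': ['s', 'es', 'ed', 'd', 'ing'], 'noun': ['s', 'es']}

def parse_entry(term):
    """Split an entry like 'ABOUT#2' into (word, sense).
    Entries without a '#' get sense 0, as in the cell above."""
    if '#' in term:
        parts = term.split('#')
        return parts[0].lower(), int(parts[1].strip('_ '))
    return term.lower(), 0

def expand_forms(word, pos, vocabulary):
    """Return the word plus any suffixed forms found in vocabulary."""
    forms = {word}
    for suffix in SUFFIXES.get(pos, []):
        if word + suffix in vocabulary:
            forms.add(word + suffix)
        # for verbs, also try the stem with trailing 'e' stripped
        stem = word.rstrip('e')
        if pos == 'verb' and stem + suffix in vocabulary:
            forms.add(stem + suffix)
    return forms

toy_vocabulary = {'love', 'loves', 'loved', 'loving', 'box', 'boxes'}

print(parse_entry('ABOUT#2'))
print(sorted(expand_forms('love', 'verb', toy_vocabulary)))
print(sorted(expand_forms('box', 'noun', toy_vocabulary)))
```

Because every candidate form is checked against the vocabulary, impossible strings such as "loveing" or "boxs" never make it into a wordlist.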

Load model predictions about volumes

The next step is to create some vectors that store predictions about volumes. In this case, these are predictions about the probability that a volume is fiction, rather than biography.


In [3]:
# the folder where wordcounts will live
# we're only going to load predictions
# that correspond to files located there
sourcedir = '../sourcefiles/'

docs = []
logistic = []

with open('../modeloutput/fullpoetry.results.csv', encoding = 'utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        genre = row['realclass']
        docid = row['volid']
        if not os.path.exists(sourcedir + docid + '.tsv'):
            continue
        docs.append(row['volid'])
        logistic.append(float(row['logistic']))

logistic = np.array(logistic)
numdocs = len(docs)

assert numdocs == len(logistic)

print("We have information about " + str(numdocs) + " volumes.")


We have information about 718 volumes.

And get the wordcounts themselves

This cell of the notebook is very short (one line), but it takes a long time to execute. A lot of file I/O happens inside the function get_wordfreqs, in the FileCabinet module, which is invoked here. We come away with a dictionary of wordcounts, keyed in the first instance by volume ID.

Note that these are normalized frequencies rather than the raw integer counts we had in the analogous notebook in chapter 1.


In [5]:
wordcounts = filecab.get_wordfreqs(sourcedir, '.tsv', docs)
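
For readers who don't have the FileCabinet module at hand, the sketch below shows the general shape of the result: a dictionary keyed by volume ID, whose values map words to normalized frequencies. It is a hypothetical stand-in, not the real `get_wordfreqs`. The function name `toy_get_wordfreqs` and the assumed file layout (tab-separated word and count on each line) are guesses made for illustration only.

```python
import csv
import os
import tempfile

def toy_get_wordfreqs(folder, extension, docids):
    """Hypothetical stand-in, NOT the real FileCabinet.get_wordfreqs.
    Assumes each file holds tab-separated (word, count) rows."""
    wordfreqs = dict()
    for docid in docids:
        path = os.path.join(folder, docid + extension)
        counts = dict()
        with open(path, encoding='utf-8', newline='') as f:
            for row in csv.reader(f, delimiter='\t'):
                if len(row) < 2:
                    continue
                try:
                    n = float(row[1])
                except ValueError:
                    continue  # skip a header row or a malformed line
                counts[row[0]] = counts.get(row[0], 0) + n
        total = sum(counts.values())
        # normalize so that each volume's frequencies sum to 1
        wordfreqs[docid] = {w: c / total for w, c in counts.items()} if total else {}
    return wordfreqs

# Write one invented volume to a temporary folder and read it back
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, 'vol1.tsv'), 'w', encoding='utf-8') as f:
        f.write('the\t6\nsky\t3\nblue\t1\n')
    toy_freqs = toy_get_wordfreqs(folder, '.tsv', ['vol1'])

print(toy_freqs['vol1'])
```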

Now calculate the representation of each Inquirer category in each doc

We normalize by the total wordcount for a volume.

This cell also takes a long time to run. I've added a counter so you have some confidence that it's still running.


In [19]:
# Initialize empty category vectors

categories = dict()
for field in fields:
    categories[field] = np.zeros(numdocs)
    
# Now fill them

for i, doc in enumerate(docs):
    ctcat = Counter()
    allcats = 0
    for word, count in wordcounts[doc].items():
        if word in dictionary:
            allcats += count
        
        if word not in allinquirerwords:
            continue
        for field in fields:
            if word in inquirer[field]:
                ctcat[field] += count
    for field in fields:
        categories[field][i] = ctcat[field] / (allcats + 0.00000001)
        # The tiny constant in the denominator is there to avoid div by zero
        # (a guard, rather than Laplacian smoothing in the strict sense).
        # Notice that, since these are normalized freqs, it has to be a very small decimal.
        # If these are really normalized freqs, it may not matter very much
        # that we divide at all. The denominator should always be 1, more or less.
        # But I'm not 100% sure about that.
    
    if i % 100 == 1:
        print(i)


1 0.7838509316769926
11 0.8651605569763775
21 0.7174196705268423
31 0.8678146980354263
41 0.8039324286900688
51 0.7632880231188856
61 0.7567096549320351
71 0.847472150814058
81 0.8436090225563799
91 0.7778357235984412
101 0.6135623133315267
111 0.781688291222284
121 0.8080980895352475
131 0.7836375474282049
141 0.8415015641293005
151 0.8345644938712387
161 0.8080959520239721
171 0.7218596059113523
181 0.7066381156316935
191 0.7893436459340282
201 0.7962865281147707
211 0.8313030454318386
221 0.7777552654684182
231 0.8052537813952096
241 0.7845427109303368
251 0.8056564195298831
261 0.7652328969686045
271 0.8052928770664367
281 0.8028547411448701
291 0.7717540488608615
301 0.8204300486950971
311 0.8652380952380974
321 0.7913853799958256
331 0.8534031413612552
341 0.7957053385028167
351 0.803631886734394
361 0.827405305529892
371 0.8089768339768412
381 0.7873943699876046
391 0.7998656618369504
401 0.7738795376191088
411 0.844063647490825
421 0.7778606733189832
431 0.7934141546526723
441 0.8238838432824744
451 0.7760287213477132
461 0.8007641237629965
471 0.855802519151478
481 0.772222540209518
491 0.7782063645130278
501 0.7965763579997797
511 0.8214667521915727
521 0.824754996527495
531 0.8524998085904365
541 0.7627150719929309
551 0.8141807806714141
561 0.8183532021070257
571 0.8088604067985892
581 0.7687674511368456
591 0.7971046585655905
601 0.800838723434482
611 0.8372031662269065
621 0.8345553292599913
631 0.7650273224043721
641 0.8200688432252157
651 0.8171022027636816
661 0.8076453066523904
671 0.7808284023668559
681 0.7714485003487778
691 0.7609374079962161
701 0.7615881213686239
711 0.8076940860960765
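
To make the normalization concrete, here is a minimal sketch of the same calculation for a single invented volume. The category labels, the toy dictionary, and the function name `category_shares` are all assumptions made for this example. One thing it shows is that the denominator only adds up frequencies of words found in the dictionary, so it can fall below 1 when a volume contains tokens that the dictionary doesn't recognize.

```python
from collections import Counter

# Invented data: a tiny dictionary, two made-up categories, one volume.
toy_dictionary = {'the', 'sky', 'is', 'blue', 'and', 'good'}
toy_inquirer = {'colors': {'blue', 'red'}, 'positive': {'good'}}
# 'zxq' stands in for a token that is not in the dictionary
toy_doc = {'the': 0.4, 'sky': 0.2, 'is': 0.1, 'blue': 0.1, 'zxq': 0.2}

def category_shares(wordfreqs, lexicon, vocabulary, epsilon=1e-8):
    """Share of each category among the dictionary words of one volume."""
    in_vocab = sum(f for w, f in wordfreqs.items() if w in vocabulary)
    totals = Counter()
    for word, freq in wordfreqs.items():
        for category, words in lexicon.items():
            if word in words:
                totals[category] += freq
    # epsilon only guards against an empty volume (division by zero)
    return {c: totals[c] / (in_vocab + epsilon) for c in lexicon}

toy_shares = category_shares(toy_doc, toy_inquirer, toy_dictionary)
print(toy_shares)
```

In this toy case the dictionary words sum to 0.8, so "colors" comes out at roughly 0.1 / 0.8 = 0.125 rather than 0.1.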

Calculate correlations

Now that we have all the information, calculating correlations is easy. We iterate through Inquirer categories, in each case calculating the correlation between a vector of model predictions for docs, and a vector of category-frequencies for docs.


In [20]:
logresults = []

for inq_category in fields:
    # pearsonr returns (r, p-value); we keep only the coefficient r
    r = pearsonr(logistic, categories[inq_category])[0]
    logresults.append((r, inq_category))

logresults.sort()
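
For readers who want to see what the first element returned by pearsonr represents, here is a minimal pure-Python sketch of the Pearson correlation coefficient, run on invented vectors. The function name `pearson_r` and the toy numbers are assumptions made for illustration; the notebook itself relies on scipy.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the
    standard deviations (the 1/n factors cancel, so they are omitted)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented model predictions and category frequencies for four volumes
toy_predictions = [0.1, 0.4, 0.5, 0.9]
toy_category = [0.02, 0.03, 0.05, 0.06]

toy_r = pearson_r(toy_predictions, toy_category)
print(round(toy_r, 3))
```

A value near 1 means the category becomes more frequent as the predicted probability rises; a value near -1 means the opposite.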

Load expanded names of Inquirer categories

The terms used in the Inquirer spreadsheet are not very transparent. DAV, for instance, is "descriptive action verbs." BodyPt is "body parts." To make these more transparent, I have provided expanded names for many categories that turned out to be relevant in the book, trying to base my description on the accounts provided here: http://www.wjh.harvard.edu/~inquirer/homecat.htm

We load these into a dictionary.


In [33]:
short2long = dict()
with open('../../lexicons/long_inquirer_names.csv', encoding = 'utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        short2long[row['short_name']] = row['long_name']

I print the strongest positive and the strongest negative correlations (up to 15 of each), skipping categories that are drawn from the "Laswell value dictionary." The Laswell categories are very finely discriminated (things like "enlightenment gain" or "power loss"), and I have little faith that they're meaningful. I especially doubt that they could remain meaningful when the Inquirer is used crudely as a source of wordlists.


In [34]:
print('Printing the correlations of General Inquirer categories')
print('with the predicted probabilities of being fiction in allsubset2.csv:')
print()
print('First, top positive correlations: ')
print()
for prob, n in reversed(logresults[-15 : ]):
    if n in short2long:
        n = short2long[n]
    if 'Laswell' in n:
        continue
    else:
        print(str(prob) + '\t' + n)

print()
print('Now, negative correlations: ')
print()
for prob, n in logresults[0 : 15]:
    if n in short2long:
        n = short2long[n]
    if 'Laswell' in n:
        continue
    else:
        print(str(prob) + '\t' + n)


Printing the correlations of General Inquirer categories
with the predicted probabilities of being fiction in allsubset2.csv:

First, top positive correlations: 

0.39796171844	colors
0.280520737318	body parts
0.244834624764	Sky
0.233408541549	first-person singular
0.228474599837	natural objects
0.204147472536	No
0.19396323244	weakness
0.189284393516	descending motion
0.180165504565	negation or reversal
0.172768109304	dimension (high, large, little)
0.161806446105	physical adjectives
0.157261185448	ordinal numbers
0.14261552527	numbers generally
0.140993507796	parts of buildings

Now, negative correlations: 

-0.529607189911	positive sentiment
-0.525754278995	Virtue
-0.50481679168	positive sentiment
-0.4394300908	verbs that imply an interpretation or explanation of an action
-0.424711499308	power
-0.414416204731	also power
-0.378644328097	dependence or obligation
-0.362161151432	political terms
-0.362141165152	judgment and evaluation
-0.35880038631	human collectivities

Comments

If you compare the printout above to the book's version of Table 3.2, you may notice a few things have been dropped. In particular, I have skipped categories that contain a small number of words, like "Sky" (34), and "No" (7). "Sky" is in effect rolled into "natural objects."

Redundant categories are collapsed; the Inquirer has a couple of different lists for "positive sentiment" and "power." And, finally, "verbs that imply an interpretation or explanation of an action" has been skipped, because I simply don't know how to convey that clearly in a table. In the Inquirer, there's a contrast between DAV and IAV, but it would take a paragraph to explain, and the whole point of this exercise is to produce something concise.


In [ ]: