Chapter 3, Table 3

This notebook explains how I used the Harvard General Inquirer to *streamline* interpretation of a predictive model.

I'm italicizing the word "streamline" because I want to emphasize that I place very little weight on the Inquirer: as I say in the text, "The General Inquirer has no special authority, and I have tried not to make it a load-bearing element of this argument."

To interpret a model, I actually spend a lot of time looking at lists of features, as well as predictions about individual texts. But to explain my interpretation, I need some relatively simple summary. Given real-world limits on time and attention, going on about lists of individual words for five pages is rarely an option. So, although wordlists are crude and arbitrary devices, flattening out polysemy and historical change, I am willing to lean on them rhetorically, where I find that they do in practice echo observations I have made in other ways.

I should also acknowledge that I'm not using the General Inquirer as it was designed to be used. The full version of this tool is not just a set of wordlists; it's a software package that tries to get around polysemy by disambiguating different word senses. I haven't tried to use it that way, because doing so would complicate my explanation only to project an impression of accuracy and precision that I don't particularly want to project. Instead, I have stressed that wordlists are crude tools, and I'm using them only as crude approximations.

That said, how do I do it?

To start with, we'll load an array of modules: some standard, some utilities that I've written myself.


In [1]:
# some standard modules

import csv, os, sys
from collections import Counter
import numpy as np
from scipy.stats import pearsonr

# now a module that I wrote myself, located
# a few directories up, in the software
# library for this repository

sys.path.append('../../lib')
import FileCabinet as filecab

Loading the General Inquirer.

This takes some doing, because the General Inquirer doesn't start out as a set of wordlists. I have to translate it into that form.

I start by loading an English dictionary.


In [2]:
# start by loading the dictionary

dictionary = set()

with open('../../lexicons/MainDictionary.txt', encoding = 'utf-8') as f:
    reader = csv.reader(f, delimiter = '\t')
    for row in reader:
        word = row[0]
        count = int(row[2])
        if count < 10000:
            # ignore very rare words;
            # we end up with about 42,700 common ones
            continue
        else:
            dictionary.add(word)
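
If you want to confirm the size of the resulting set, a one-line check (not part of the original notebook, but harmless to run) is:

print(len(dictionary))   # should come out around 42,700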

The next stage is to translate the Inquirer. It begins as a table where word senses are row labels, and the Inquirer categories are columns (except for two columns at the beginning and two at the end). This is, by the way, the "basic spreadsheet" described at this site: http://www.wjh.harvard.edu/~inquirer/spreadsheet_guide.htm

I translate this into a dictionary where the keys are Inquirer categories, and the values are sets of words associated with each category.
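
The finished structure looks roughly like the sketch below. The category names are real ones that come up later in this notebook, but the member words shown here are illustrative guesses rather than values read from the spreadsheet.

# a sketch of the target shape, with made-up membership
inquirer_example = {
    'Sky': {'sky', 'skies', 'cloud', 'clouds'},
    'BodyPt': {'arm', 'arms', 'ankle', 'ankles'}
}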

But to do that, I have to do some filtering and expanding. Different senses of a word are broken out in the spreadsheet thus:

ABOUT#1

ABOUT#2

ABOUT#3

etc.

I need to strip off the part after the '#'. Also, because I don't want to give rare senses of a word too much power, I ignore everything but the first sense of each word.

However, I also want to allow singular verb forms and plural nouns to count. So there's some code below that expands words by adding -s, -ed, etc. to the end; see the suffixes defined below for more details. Note that I use the English dictionary to determine which possible forms are real words.
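
As a miniature preview of what the next cell does, here is the same logic applied by hand to one entry. 'ABOUT#1' is a real entry form from the spreadsheet; the verb 'walk' is just a hypothetical example, and the forms that survive depend on what happens to be in MainDictionary.txt.

# split off the sense number from a spreadsheet entry
term = 'ABOUT#1'
word = term.split('#')[0].lower()    # 'about'
sense = int(term.split('#')[1])      # 1; senses above 1 get skipped

# expand a hypothetical verb with suffixes, keeping only forms
# that the English dictionary vouches for
verb = 'walk'
expanded = {verb} | {verb + s for s in ['s', 'es', 'ed', 'd', 'ing']
                     if verb + s in dictionary}
print(word, sense, expanded)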


In [3]:
inquirer = dict()

suffixes = dict()
suffixes['verb'] = ['s', 'es', 'ed', 'd', 'ing']
suffixes['noun'] = ['s', 'es']

allinquirerwords = set()

with open('../../lexicons/inquirerbasic.csv', encoding = 'utf-8') as f:
    reader = csv.DictReader(f)
    fields = reader.fieldnames[2:-2]
    for field in fields:
        inquirer[field] = set()

    for row in reader:
        term = row['Entry']

        if '#' in term:
            parts = term.split('#')
            word = parts[0].lower()
            sense = int(parts[1].strip('_ '))
            partialsense = True
        else:
            word = term.lower()
            sense = 0
            partialsense = False

        if sense > 1:
            # we're ignoring uncommon senses
            continue

        # use the part-of-speech tag to decide which suffixes to try
        pos = row['Othtags']
        if 'Noun' in pos:
            pos = 'noun'
        elif 'SUPV' in pos:
            pos = 'verb'

        forms = {word}
        if pos == 'noun' or pos == 'verb':
            for suffix in suffixes[pos]:
                if word + suffix in dictionary:
                    forms.add(word + suffix)
                if pos == 'verb' and word.rstrip('e') + suffix in dictionary:
                    # also try dropping a final -e, so that e.g. 'hope' + 'ing' yields 'hoping'
                    forms.add(word.rstrip('e') + suffix)

        # add every accepted form to every category this row belongs to
        for form in forms:
            for field in fields:
                if len(row[field]) > 1:
                    inquirer[field].add(form)
                    allinquirerwords.add(form)
                    
print('Inquirer loaded')
print('Total of ' + str(len(allinquirerwords)) + " words.")


Inquirer loaded
Total of 13707 words.

Load model predictions about volumes

The next step is to create some vectors that store predictions about volumes. In this case, these are predictions about the probability that a volume is fiction, rather than biography.


In [4]:
# the folder where wordcounts will live
# we're only going to load predictions
# that correspond to files located there
sourcedir = '../sourcefiles/'

docs = []
logistic = []

with open('../modeloutput/fullfiction.results.csv', encoding = 'utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        genre = row['realclass']
        docid = row['volid']
        if not os.path.exists(sourcedir + docid + '.tsv'):
            continue
        docs.append(row['volid'])
        logistic.append(float(row['logistic']))

logistic = np.array(logistic)
numdocs = len(docs)

assert numdocs == len(logistic)

print("We have information about " + str(numdocs) + " volumes.")


We have information about 1200 volumes.

And get the wordcounts themselves

This cell of the notebook is very short (one line), but it takes a long time to execute. There's a lot of file I/O happening inside the function get_wordfreqs, in the FileCabinet module, which is invoked here. We come away with a dictionary of wordcounts, keyed in the first instance by volume ID.

Note that these are normalized frequencies rather than the raw integer counts we had in the analogous notebook in chapter 1.


In [5]:
wordcounts = filecab.get_wordfreqs(sourcedir, '.tsv', docs)
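
If you want to peek at what came back, the outer dictionary is keyed by volume ID and each value maps individual words to normalized frequencies. A quick look (not part of the original notebook; the exact words and numbers will depend on the volume):

sample_doc = docs[0]
sample_items = list(wordcounts[sample_doc].items())[:5]
print(sample_doc, sample_items)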

Now calculate the representation of each Inquirer category in each doc

We normalize by the total wordcount for a volume.

This cell also takes a long time to run. I've added a counter so you have some confidence that it's still running.


In [6]:
# Initialize empty category vectors

categories = dict()
for field in fields:
    categories[field] = np.zeros(numdocs)
    
# Now fill them

for i, doc in enumerate(docs):
    ctcat = Counter()
    allcats = 0
    for word, count in wordcounts[doc].items():
        if word in dictionary:
            allcats += count
        
        if word not in allinquirerwords:
            continue
        for field in fields:
            if word in inquirer[field]:
                ctcat[field] += count
    for field in fields:
        categories[field][i] = ctcat[field] / (allcats + 0.00000001)
        # The tiny constant in the denominator just guards against division by zero.
        # Since the wordcounts are normalized frequencies, allcats should already be
        # close to 1 for each volume, so the division itself probably changes little;
        # but I'm not 100% sure about that, and the guard costs nothing.
    
    if i % 100 == 1:
        print(i, allcats)


1
101
201
301
401
501
601
701
801
901
1001
1101

Calculate correlations

Now that we have all the information, calculating correlations is easy. We iterate through Inquirer categories, in each case calculating the correlation between a vector of model predictions for docs, and a vector of category-frequencies for docs.


In [7]:
logresults = []

for inq_category in fields:
    l = pearsonr(logistic, categories[inq_category])[0]
    logresults.append((l, inq_category))

logresults.sort()
# sorted in ascending order: the strongest negative correlations come first,
# the strongest positive correlations last
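
For anyone who wants to see what pearsonr is computing, here is the same quantity worked out by hand for a single category. This is just an equivalent formula for Pearson's r, offered as a cross-check, not a method used anywhere in the book.

# Pearson's r computed manually for one category
x = logistic - logistic.mean()
y = categories[fields[0]] - categories[fields[0]].mean()
r_manual = (x * y).sum() / np.sqrt((x * x).sum() * (y * y).sum())
print(fields[0], r_manual)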

Load expanded names of Inquirer categories

The terms used in the Inquirer spreadsheet are not very transparent. DAV, for instance, is "descriptive action verbs"; BodyPt is "body parts." To make these more transparent, I have provided expanded names for many categories that turned out to be relevant in the book, basing my descriptions on the accounts provided here: http://www.wjh.harvard.edu/~inquirer/homecat.htm

We load these into a dictionary.


In [10]:
short2long = dict()
with open('../../lexicons/long_inquirer_names.csv', encoding = 'utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        short2long[row['short_name']] = row['long_name']
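
A quick illustration of how the mapping gets used below: look up a short name, and fall back to it unchanged if no expanded name was provided. (Whether 'DAV' actually appears in long_inquirer_names.csv is something I haven't checked here.)

print(short2long.get('DAV', 'DAV'))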

I print roughly the top dozen and bottom dozen correlations: the fifteen most extreme in each direction, skipping categories that are drawn from the "Laswell value dictionary." The Laswell categories are very finely discriminated (things like "enlightenment gain" or "power loss"), and I have little faith that they're meaningful. I especially doubt that they could remain meaningful when the Inquirer is used crudely as a source of wordlists.


In [11]:
print('Printing the correlations of General Inquirer categories')
print('with the predicted probabilities of being fiction in allsubset2.csv:')
print()
print('First, top positive correlations: ')
print()
for prob, n in reversed(logresults[-15 : ]):
    if n in short2long:
        n = short2long[n]
    if 'Laswell' in n:
        continue
    else:
        print(str(prob) + '\t' + n)

print()
print('Now, negative correlations: ')
print()
for prob, n in logresults[0 : 15]:
    if n in short2long:
        n = short2long[n]
    if 'Laswell' in n:
        continue
    else:
        print(str(prob) + '\t' + n)


Printing the correlations of General Inquirer categories
with the predicted probabilities of being fiction in allsubset2.csv:

First, top positive correlations: 

0.293671490879	knowledge and awareness
0.195418805214	natural processes
0.194304389796	Sky
0.191206378091	natural objects
0.175201137607	understatement and qualification
0.169734025673	comparison
0.161911364442	physical adjectives
0.161131715046	frequency and recurrence
0.153801715607	Exprsv
0.146299378496	Abs@
0.145243883974	negation or reversal
0.138093587693	organized systems of belief or knowledge

Now, negative correlations: 

-0.446582408127	Active
-0.418206345146	social relations
-0.399323070214	verbs that imply an interpretation or explanation of an action
-0.393963099557	Travel
-0.347042135206	dependence or obligation
-0.333877641403	achievement and completion
-0.326476252853	actions taken to reach a goal
-0.314548159068	Strong
-0.284164858576	Affil
-0.274709445693	Fetch
-0.273774095828	also power

Comments

If you compare the printout above to the book's version of Table 3.3, you may notice that a few things have been dropped. In particular, I have skipped categories that contain a small number of words, like "Sky" (34 words). "Sky" is in effect rolled into "natural objects."

"Verbs that imply an interpretation or explanation of an action" has also been skipped--because I simply don't know how to convey that clearly in a table. In the Inquirer, there's a contrast between DAV and IAV, but it would take a paragraph to explain, and the whole point of this exercise is to produce something concise.

However, on the whole, Table 3.3 corresponds very closely to the list above.

