Chapter 1, Table 1

This notebook explains how I used the Harvard General Inquirer to *streamline* interpretation of a predictive model.

I'm italicizing the word "streamline" because I want to emphasize that I place very little weight on the Inquirer: as I say in the text, "The General Inquirer has no special authority, and I have tried not to make it a load-bearing element of this argument."

To interpret a model, I actually spend a lot of time looking at lists of features, as well as predictions about individual texts. But to explain my interpretation, I need some relatively simple summary. Given real-world limits on time and attention, going on about lists of individual words for five pages is rarely an option. So, although wordlists are crude and arbitrary devices, flattening out polysemy and historical change, I am willing to lean on them rhetorically, where I find that they do in practice echo observations I have made in other ways.

I should also acknowledge that I'm not using the General Inquirer as it was designed to be used. The full version of this tool is not just a set of wordlists; it's a software package that tries to get around polysemy by disambiguating different word senses. I haven't tried to use it in that way: doing so would complicate my explanation, only to project an impression of accuracy and precision that I don't particularly want to project. Instead, I have stressed that wordlists are crude tools, and I'm using them only as crude approximations.

That said, how do I do it?

To start with, we'll load an assortment of modules: some are standard, and some are utilities that I've written myself.


In [1]:
# some standard modules

import csv, os, sys
from collections import Counter
import numpy as np
from scipy.stats import pearsonr

# now a module that I wrote myself, located
# a few directories up, in the software
# library for this repository

sys.path.append('../../lib')
import FileCabinet as filecab

Loading the General Inquirer

This takes some doing, because the General Inquirer doesn't start out as a set of wordlists. I have to translate it into that form.

I start by loading an English dictionary.


In [2]:
# start by loading the dictionary

dictionary = set()

with open('../../lexicons/MainDictionary.txt', encoding = 'utf-8') as f:
    reader = csv.reader(f, delimiter = '\t')
    for row in reader:
        word = row[0]
        count = int(row[2])
        if count < 10000:
            continue
            # that ignores very rare words
            # we end up with about 42,700 common ones
        else:
            dictionary.add(word)
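
A quick sanity check, if you want one: the size of the set should come out in the neighborhood of 42,700 words, as the comment above suggests.

# optional sanity check: roughly 42,700 common words expected
print(len(dictionary))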

The next stage is to translate the Inquirer. It begins as a table where word senses are row labels, and the Inquirer categories are columns (except for two columns at the beginning and two at the end). This is, by the way, the "basic spreadsheet" described at this site: http://www.wjh.harvard.edu/~inquirer/spreadsheet_guide.htm

I translate this into a dictionary where the keys are Inquirer categories, and the values are sets of words associated with each category.

But to do that, I have to do some filtering and expanding. Different senses of a word are broken out in the spreadsheet thus:

ABOUT#1
ABOUT#2
ABOUT#3
etc.

I need to strip off the '#' and the sense number. Also, because I don't want to give rare senses of a word too much power, I ignore everything but the first sense of each word.

However, I also want singular verb forms and plural nouns to count. So there's some code below that expands words by adding -s, -ed, and so on to the end: a verb entry like TALK, for instance, will also count "talks," "talked," and "talking," as long as those forms appear in the dictionary loaded above. See the suffixes dictionary defined below for more details.


In [3]:
inquirer = dict()

suffixes = dict()
suffixes['verb'] = ['s', 'es', 'ed', 'd', 'ing']
suffixes['noun'] = ['s', 'es']

allinquirerwords = set()

with open('../../lexicons/inquirerbasic.csv', encoding = 'utf-8') as f:
    reader = csv.DictReader(f)
    fields = reader.fieldnames[2:-2]
    for field in fields:
        inquirer[field] = set()

    for row in reader:
        term = row['Entry']

        if '#' in term:
            parts = term.split('#')
            word = parts[0].lower()
            sense = int(parts[1].strip('_ '))
            partialsense = True
        else:
            word = term.lower()
            sense = 0
            partialsense = False

        if sense > 1:
            continue
            # we're ignoring uncommon senses

        pos = row['Othtags']
        if 'Noun' in pos:
            pos = 'noun'
        elif 'SUPV' in pos:
            pos = 'verb'

        forms = {word}
        if pos == 'noun' or pos == 'verb':
            for suffix in suffixes[pos]:
                if word + suffix in dictionary:
                    forms.add(word + suffix)
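                # for verbs ending in e, also try dropping the e
                # before adding a suffix (e.g. love -> loving)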
                if pos == 'verb' and word.rstrip('e') + suffix in dictionary:
                    forms.add(word.rstrip('e') + suffix)

        for form in forms:
            for field in fields:
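                # a non-empty cell in a category column marks membership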
                if len(row[field]) > 1:
                    inquirer[field].add(form)
                    allinquirerwords.add(form)
                    
print('Inquirer loaded')
print('Total of ' + str(len(allinquirerwords)) + " words.")


Inquirer loaded
Total of 13707 words.
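
At this point, inquirer is a dictionary mapping each category name to a set of word forms. If you want to spot-check the structure, you can peek at a single category; BodyPt (the Inquirer's "body parts" list) is one we'll meet again below. The exact membership will depend on the dictionary cutoff above, so treat this as a sketch of the structure rather than a fixed result.

# peek at one category: each value is a set of lowercased word forms
print(len(inquirer['BodyPt']))
print(sorted(inquirer['BodyPt'])[0:10])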

Load model predictions about volumes

The next step is to create some vectors that store predictions about volumes. In this case, these are predictions about the probability that a volume is fiction, rather than biography.


In [4]:
# the folder where wordcounts will live
# we're only going to load predictions
# that correspond to files located there
sourcedir = '../sourcefiles/'

docs = []
logistic = []

with open('../plotdata/the900.csv', encoding = 'utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        genre = row['realclass']
        docid = row['volid']
        if not os.path.exists(sourcedir + docid + '.tsv'):
            continue
        docs.append(row['volid'])
        logistic.append(float(row['logistic']))

logistic = np.array(logistic)
numdocs = len(docs)

assert numdocs == len(logistic)

print("We have information about " + str(numdocs) + " volumes.")


We have information about 890 volumes.

And get the wordcounts themselves

This cell of the notebook is very short (one line), but it takes a long time to execute, because a lot of file I/O happens inside the get_wordcounts function in the FileCabinet module invoked here. We come away with a dictionary of wordcounts, keyed in the first instance by volume ID.


In [5]:
wordcounts = filecab.get_wordcounts(sourcedir, '.tsv', docs)
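
If you don't have the FileCabinet module in front of you, here is a rough sketch of the shape of get_wordcounts, not the actual implementation: it reads one .tsv file per volume (assuming each line holds a word and a count, separated by a tab) and returns a dictionary of Counters keyed by volume ID.

# a hypothetical sketch of get_wordcounts, not the real FileCabinet code;
# it assumes each line of a volume's .tsv file is "word<TAB>count"

def sketch_get_wordcounts(path, extension, docids):
    wordcounts = dict()
    for docid in docids:
        counts = Counter()
        with open(path + docid + extension, encoding = 'utf-8') as f:
            for line in f:
                fields = line.strip().split('\t')
                if len(fields) < 2:
                    continue
                counts[fields[0]] += int(fields[1])
        wordcounts[docid] = counts
    return wordcounts

Whatever the details, the structure the rest of the notebook relies on is just wordcounts[docid]: a mapping from each word to its count in that volume.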

Now calculate the representation of each Inquirer category in each doc

We normalize by the total count of dictionary words in each volume.

This cell also takes a long time to run. I've added a progress printout every hundred volumes so you have some confidence that it's still running.


In [7]:
# Initialize empty category vectors

categories = dict()
for field in fields:
    categories[field] = np.zeros(numdocs)
    
# Now fill them

for i, doc in enumerate(docs):
    ctcat = Counter()
    allcats = 0
    for word, count in wordcounts[doc].items():
        if word in dictionary:
            allcats += count
        if word not in allinquirerwords:
            continue
        for field in fields:
            if word in inquirer[field]:
                ctcat[field] += count
    for field in fields:
        categories[field][i] = ctcat[field] / (allcats + 0.1)
        # the 0.1 in the denominator just guards against division by zero
        # for a volume that happens to contain no dictionary words
    
    if i % 100 == 1:
        print(i, allcats)


1 91011
101 84002
201 16285
301 56847
401 51395
501 185568
601 93254
701 84775
801 85951

Calculate correlations

Now that we have all the information, calculating correlations is easy. We iterate through Inquirer categories, in each case calculating the correlation between a vector of model predictions for docs, and a vector of category-frequencies for docs.


In [8]:
logresults = []

for inq_category in fields:
    l = pearsonr(logistic, categories[inq_category])[0]
    logresults.append((l, inq_category))

logresults.sort()

Load expanded names of Inquirer categories

The terms used in the Inquirer spreadsheet are not very transparent: DAV, for instance, is "descriptive action verbs," and BodyPt is "body parts." To make them more legible, I have provided expanded names for many of the categories that turned out to be relevant in the book, basing my descriptions on the accounts provided here: http://www.wjh.harvard.edu/~inquirer/homecat.htm

We load these into a dictionary.


In [9]:
short2long = dict()
with open('../../lexicons/long_inquirer_names.csv', encoding = 'utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        short2long[row['short_name']] = row['long_name']

I print the top 12 correlations and the bottom 12, skipping categories that are drawn from the "Laswell value dictionary." The Laswell categories are very finely discriminated (things like "enlightenment gain" or "power loss"), and I have little faith that they're meaningful. I especially doubt that they could remain meaningful when the Inquirer is used crudely as a source of wordlists.


In [10]:
print('Printing the correlations of General Inquirer categories')
print('with the predicted probabilities of being fiction in allsubset2.csv:')
print()
print('First, top positive correlations: ')
print()
for prob, n in reversed(logresults[-12 : ]):
    if n in short2long:
        n = short2long[n]
    if 'Laswell' in n:
        continue
    else:
        print(str(prob) + '\t' + n)

print()
print('Now, negative correlations: ')
print()
for prob, n in logresults[0 : 12]:
    if n in short2long:
        n = short2long[n]
    if 'Laswell' in n:
        continue
    else:
        print(str(prob) + '\t' + n)


Printing the correlations of General Inquirer categories
with the predicted probabilities of being fiction in allsubset2.csv:

First, top positive correlations: 

0.814883672084	action verbs
0.723336865012	body parts
0.719677253657	verbs of sensory perception
0.683865798179	verbs of dialogue
0.683177448649	physical adjectives
0.64713747568	second-person pronouns (likely in dialogue)
0.622209843367	weakness
0.618004178737	interjections and exclamations
0.615809862443	Work
0.598951530674	Stay
0.596355158769	understatement and qualification

Now, negative correlations: 

-0.740594049611	political terms
-0.729728271214	organized systems of belief or knowledge
-0.725883030105	abstract means
-0.692310519992	also power
-0.685490522417	power
-0.674993375953	economic terms
-0.669967392984	political terms
-0.665847187129	human collectivities
-0.599582771882	ABS

Comments

If you compare the printout above to the book's version of Table 1.1, you will notice very slight differences. For instance, "power" appears twice above; in the book, those lines have been fused.

Titlecased terms are the terms originally used in the Inquirer. Lowercased terms are my explanations.