This notebook explains how I used the Harvard General Inquirer to *streamline* interpretation of a predictive model.
I'm italicizing the word "streamline" because I want to emphasize that I place very little weight on the Inquirer: as I say in the text, "The General Inquirer has no special authority, and I have tried not to make it a load-bearing element of this argument."
To interpret a model, I actually spend a lot of time looking at lists of features, as well as predictions about individual texts. But to explain my interpretation, I need some relatively simple summary. Given real-world limits on time and attention, going on about lists of individual words for five pages is rarely an option. So, although wordlists are crude and arbitrary devices, flattening out polysemy and historical change, I am willing to lean on them rhetorically, where I find that they do in practice echo observations I have made in other ways.
I should also acknowledge that I'm not using the General Inquirer as it was designed to be used. The full version of this tool is not just a set of wordlists; it's a software package that tries to get around polysemy by disambiguating different word senses. I haven't tried to use it that way: I think it would complicate my explanation merely to project an impression of accuracy and precision that I don't particularly want to project. Instead, I have stressed that word lists are crude tools, and I'm using them only as crude approximations.
That said, how do I do it?
To start with, we'll load an array of modules: some standard, some utilities that I've written myself.
In [1]:
# some standard modules
import csv, os, sys
from collections import Counter
import numpy as np
from scipy.stats import pearsonr
# now a module that I wrote myself, located
# a few directories up, in the software
# library for this repository
sys.path.append('../../lib')
import FileCabinet as filecab
In [2]:
# start by loading the dictionary
dictionary = set()
with open('../../lexicons/MainDictionary.txt', encoding = 'utf-8') as f:
    reader = csv.reader(f, delimiter = '\t')
    for row in reader:
        word = row[0]
        count = int(row[2])
        if count < 10000:
            continue
            # that ignores very rare words
            # we end up with about 42,700 common ones
        else:
            dictionary.add(word)
The next stage is to translate the Inquirer. It begins as a table where word senses are row labels, and the Inquirer categories are columns (except for two columns at the beginning and two at the end). This is, by the way, the "basic spreadsheet" described at this site: http://www.wjh.harvard.edu/~inquirer/spreadsheet_guide.htm
I translate this into a dictionary where the keys are Inquirer categories, and the values are sets of words associated with each category.
But to do that, I have to do some filtering and expanding. Different senses of a word are broken out in the spreadsheet thus:
ABOUT#1
ABOUT#2
ABOUT#3
etc.
I need to strip off the part after the '#'. Also, because I don't want to give rare senses of a word too much power, I ignore everything but the first sense of each word.
However, I also want inflected verb forms and plural nouns to count. So there's some code below that expands words by adding -s, -ed, etc., to the end. See the suffixes dictionary defined below for more details.
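To make that concrete, here is a toy illustration of the sense-splitting and suffix expansion just described. It is not part of the pipeline; the ABOUT#1 entry comes from the spreadsheet, while 'accept' and the tiny stand-in dictionary are hypothetical examples.

# toy illustration only -- not part of the pipeline below
term = 'ABOUT#1'
word, sense = term.split('#')
word = word.lower()          # 'about'
sense = int(sense)           # 1; senses greater than 1 get ignored

# expand a hypothetical verb with suffixes, keeping only forms
# that occur in a (stand-in) dictionary of common words
toy_suffixes = ['s', 'es', 'ed', 'd', 'ing']
toy_dictionary = {'accept', 'accepts', 'accepted', 'accepting'}
forms = {'accept'}
for suffix in toy_suffixes:
    if 'accept' + suffix in toy_dictionary:
        forms.add('accept' + suffix)
print(forms)                 # all four attested forms of 'accept'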
In [3]:
inquirer = dict()
suffixes = dict()
suffixes['verb'] = ['s', 'es', 'ed', 'd', 'ing']
suffixes['noun'] = ['s', 'es']
allinquirerwords = set()
with open('../../lexicons/inquirerbasic.csv', encoding = 'utf-8') as f:
    reader = csv.DictReader(f)
    fields = reader.fieldnames[2:-2]
    for field in fields:
        inquirer[field] = set()
    for row in reader:
        term = row['Entry']
        if '#' in term:
            parts = term.split('#')
            word = parts[0].lower()
            sense = int(parts[1].strip('_ '))
            partialsense = True
        else:
            word = term.lower()
            sense = 0
            partialsense = False
        if sense > 1:
            continue
            # we're ignoring uncommon senses
        pos = row['Othtags']
        if 'Noun' in pos:
            pos = 'noun'
        elif 'SUPV' in pos:
            pos = 'verb'
        forms = {word}
        if pos == 'noun' or pos == 'verb':
            for suffix in suffixes[pos]:
                if word + suffix in dictionary:
                    forms.add(word + suffix)
                if pos == 'verb' and word.rstrip('e') + suffix in dictionary:
                    forms.add(word.rstrip('e') + suffix)
        for form in forms:
            for field in fields:
                if len(row[field]) > 1:
                    inquirer[field].add(form)
            allinquirerwords.add(form)
print('Inquirer loaded')
print('Total of ' + str(len(allinquirerwords)) + " words.")
In [4]:
# the folder where wordcounts will live
# we're only going to load predictions
# that correspond to files located there
sourcedir = '../sourcefiles/'
docs = []
logistic = []
with open('../plotdata/the900.csv', encoding = 'utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        genre = row['realclass']
        docid = row['volid']
        if not os.path.exists(sourcedir + docid + '.tsv'):
            continue
        docs.append(row['volid'])
        logistic.append(float(row['logistic']))
logistic = np.array(logistic)
numdocs = len(docs)
assert numdocs == len(logistic)
print("We have information about " + str(numdocs) + " volumes.")
This cell of the notebook is very short (one line), but it takes a lot of time to execute. There's a lot of file I/O happening inside the function get_wordcounts, in the FileCabinet module, which is invoked here. We come away with a dictionary of wordcounts, keyed in the first instance by volume ID.
In [5]:
wordcounts = filecab.get_wordcounts(sourcedir, '.tsv', docs)
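FileCabinet lives in the repository's lib directory, so I won't reproduce it here, but the shape of its return value matters for the next cell: a dictionary mapping each volume ID to a counter of word frequencies. The sketch below is only a rough approximation of what get_wordcounts does, written on the assumption that each .tsv file pairs a word with a count on each tab-separated line; consult FileCabinet.py for the real implementation.

# rough sketch only; see FileCabinet.py for the actual code
import csv, os
from collections import Counter

def sketch_get_wordcounts(sourcedir, suffix, docids):
    wordcounts = dict()
    for docid in docids:
        counts = Counter()
        with open(os.path.join(sourcedir, docid + suffix), encoding='utf-8') as f:
            for row in csv.reader(f, delimiter='\t'):
                counts[row[0]] += int(row[1])
        wordcounts[docid] = counts
    return wordcounts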
In [7]:
# Initialize empty category vectors
categories = dict()
for field in fields:
    categories[field] = np.zeros(numdocs)

# Now fill them
for i, doc in enumerate(docs):
    ctcat = Counter()
    allcats = 0
    for word, count in wordcounts[doc].items():
        if word in dictionary:
            allcats += count
        if word not in allinquirerwords:
            continue
        for field in fields:
            if word in inquirer[field]:
                ctcat[field] += count
    for field in fields:
        categories[field][i] = ctcat[field] / (allcats + 0.1)
        # Laplacian smoothing there to avoid div by zero, among other things.
    if i % 100 == 1:
        print(i, allcats)
In [8]:
logresults = []
for inq_category in fields:
    l = pearsonr(logistic, categories[inq_category])[0]
    logresults.append((l, inq_category))
logresults.sort()
The terms used in the Inquirer spreadsheet are not very transparent. DAV, for instance, is "descriptive action verbs," and BodyPt is "body parts." To make these easier to read, I have provided expanded names for many categories that turned out to be relevant in the book, basing my descriptions on the accounts provided here: http://www.wjh.harvard.edu/~inquirer/homecat.htm
We load these into a dictionary.
In [9]:
short2long = dict()
with open('../../lexicons/long_inquirer_names.csv', encoding = 'utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        short2long[row['short_name']] = row['long_name']
I print the top 12 correlations and the bottom 12, skipping categories that are drawn from the "Laswell value dictionary." The Laswell categories are very finely discriminated (things like "enlightenment gain" or "power loss"), and I have little faith that they're meaningful. I especially doubt that they could remain meaningful when the Inquirer is used crudely as a source of wordlists.
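One small wrinkle in the cell below: because logresults is a list of (correlation, category) tuples sorted in ascending order, the strongest positive correlations sit at the end of the list, so reversed(logresults[-12:]) prints them from most to least positive, while logresults[0:12] already runs from most to least negative. A toy example with made-up values:

# toy illustration of the slicing logic, using invented numbers
toy = [(0.02, 'MidCat'), (-0.31, 'NegCat'), (0.47, 'PosCat')]
toy.sort()
print(list(reversed(toy[-2:])))   # [(0.47, 'PosCat'), (0.02, 'MidCat')]
print(toy[0:2])                   # [(-0.31, 'NegCat'), (0.02, 'MidCat')]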
In [10]:
print('Printing the correlations of General Inquirer categories')
print('with the predicted probabilities of being fiction in allsubset2.csv:')
print()
print('First, top positive correlations: ')
print()
for prob, n in reversed(logresults[-12 : ]):
    if n in short2long:
        n = short2long[n]
    if 'Laswell' in n:
        continue
    else:
        print(str(prob) + '\t' + n)
print()
print('Now, negative correlations: ')
print()
for prob, n in logresults[0 : 12]:
    if n in short2long:
        n = short2long[n]
    if 'Laswell' in n:
        continue
    else:
        print(str(prob) + '\t' + n)