This notebook attempts a slight improvement on the methods deployed in my 2015 article, "The Life Cycles of Genres."
In 2015, I used a fixed number of features and a fixed regularization constant. Now I optimize n (the number of features) and c (the regularization constant) through grid search, running multiple cross-validations on a train/test set to find the best constants for a given sample.
To avoid exaggerating accuracy through multiple trials, I have also moved to a train/test/validation split: constants are optimized through cross-validation on the train/test set, but the model is then tested on a separate validation set. I repeat that process on random train/test/validation splits in order to visualize model accuracy as a distribution.
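As a rough sketch of what that optimization involves, the function below uses generic scikit-learn conventions purely for illustration; the grid search actually used in this notebook lives in versatiletrainer2 and differs in detail (for instance, in how candidate features are ranked).

def gridsearch_n_and_c(X, y, n_range, c_range, folds = 10):
    # Illustrative only: score every (number of features, C) pair by
    # cross-validated accuracy and return the best combination.
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline

    best_acc, best_n, best_c = 0, None, None
    for n in n_range:
        for c in c_range:
            pipe = Pipeline([('select', SelectKBest(f_classif, k = n)),
                             ('model', LogisticRegression(C = c, max_iter = 1000))])
            acc = cross_val_score(pipe, X, y, cv = folds).mean()
            if acc > best_acc:
                best_acc, best_n, best_c = acc, n, c
    return best_acc, best_n, best_c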
Getting the train/test vs. validation split right can be challenging, because we want to avoid repeating authors from the train/test set in validation. (Or in both train and test, for that matter.) Authorial diction is constant enough that repeated authors could become an unfair advantage for genres with a few prolific authors. We also want to ensure that the positive and negative classes within a given set have a similar distribution across historical time. (Otherwise the model will become a model of language change.) Building sets where all these conditions hold is more involved than taking a random sample of volumes.
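If author exclusivity were the only constraint, a grouped split would be enough; the sketch below uses scikit-learn's GroupShuffleSplit with 'author' as the group key, just to illustrate that one condition. It does not handle the date matching described above, which is why the notebook builds its own split instead.

def author_exclusive_split(meta, test_size = 0.2, seed = 0):
    # Keep every author entirely on one side of the split.
    from sklearn.model_selection import GroupShuffleSplit
    splitter = GroupShuffleSplit(n_splits = 1, test_size = test_size, random_state = seed)
    train_idx, valid_idx = next(splitter.split(meta, groups = meta['author']))
    return meta.iloc[train_idx], meta.iloc[valid_idx]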
Most of the code in this notebook is concerned with creating the train/test-vs-validation split. The actual modeling happens in versatiletrainer2, which we import in the first cell.
In [1]:
import sys
import os, csv, random
import numpy as np
import pandas as pd
import versatiletrainer2
import metaselector
import matplotlib.pyplot as plt
from scipy import stats
%matplotlib inline
The functions defined below are used to create a train/test/validation divide, while also ensuring that authors fall on only one side of the divide and that positive and negative classes have similar date distributions.
But the best way to understand the overall workflow may be to scan down a few cells to the bottom function, train_and_validate().
In [2]:
def evenlymatchdate(meta, tt_positives, v_positives, negatives):
    '''
    Given a metadata DataFrame, two lists of positive indexes and a (larger) list
    of negative indexes, this assigns negatives that match the date distribution
    of the two positive lists as closely as possible, working randomly so that
    neither list gets "a first shot" at maximally close matches.

    The task is complicated by our goal of ensuring that authors are only
    represented in the train/test OR the validation set. To do this while
    using as much of our sample as we can, we encourage the algorithm to choose
    works from already-selected authors when they fit the date parameters needed.
    This is the function of the neg_unmatched lists: works by authors we have
    chosen, not yet matched to a positive work.
    '''

    assert len(negatives) > (len(tt_positives) + len(v_positives))

    authors = dict()
    authors['tt'] = set(meta.loc[tt_positives, 'author'])
    authors['v'] = set(meta.loc[v_positives, 'author'])

    neg_matched = dict()
    neg_matched['tt'] = []
    neg_matched['v'] = []

    neg_unmatched = dict()
    neg_unmatched['v'] = []
    neg_unmatched['tt'] = []

    negative_meta = meta.loc[negatives, : ]

    allpositives = [(x, 'tt') for x in tt_positives]
    allpositives.extend([(x, 'v') for x in v_positives])
    random.shuffle(allpositives)

    for idx, settype in allpositives:
        if settype == 'v':
            inversetype = 'tt'
        else:
            inversetype = 'v'

        date = meta.loc[idx, 'firstpub']
        found = False

        negative_meta = negative_meta.assign(diff = np.abs(negative_meta['firstpub'] - date))

        for idx2 in neg_unmatched[settype]:
            matchdate = meta.loc[idx2, 'firstpub']
            if abs(matchdate - date) < 3:
                neg_matched[settype].append(idx2)
                location = neg_unmatched[settype].index(idx2)
                neg_unmatched[settype].pop(location)
                found = True
                break

        if not found:
            candidates = []
            for i in range(200):
                aspirants = negative_meta.index[negative_meta['diff'] == i].tolist()

                # the following section ensures that authors in
                # traintest don't end up also in validation
                for a in aspirants:
                    asp_author = meta.loc[a, 'author']
                    if asp_author not in authors[inversetype]:
                        # don't even consider books by authors already
                        # in the other set
                        candidates.append(a)

                if len(candidates) > 0:
                    break

            chosen = random.sample(candidates, 1)[0]
            chosenauth = negative_meta.loc[chosen, 'author']
            allbyauth = negative_meta.index[negative_meta['author'] == chosenauth].tolist()
            authors[settype].add(chosenauth)
            if len(allbyauth) < 1:
                print('error')

            for idx3 in allbyauth:
                if idx3 == chosen:
                    # the one we actually chose
                    neg_matched[settype].append(idx3)
                else:
                    # others by the same author, to be considered first in future
                    neg_unmatched[settype].append(idx3)

            negative_meta.drop(allbyauth, inplace = True)

        if len(negative_meta) == 0:
            print('Exhausted negatives! This is surprising.')
            break

    # other books by already-chosen authors can be added to the set in the end
    tt_neg = neg_matched['tt'] + neg_unmatched['tt']
    v_neg = neg_matched['v'] + neg_unmatched['v']
    remaining_neg = negative_meta.index.tolist()

    return tt_neg, v_neg, remaining_neg
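A hypothetical toy example (invented data, not drawn from the corpus) illustrates the call signature: five negatives get date-matched to two train/test positives and two validation positives, leaving one negative unassigned.

toy = pd.DataFrame({
    'author':   ['A', 'B', 'C', 'C', 'D', 'E', 'F', 'G', 'H'],
    'firstpub': [1890, 1900, 1891, 1902, 1899, 1950, 1889, 1903, 1910]})
# rows 0-1 are train/test positives, rows 2-3 validation positives, rows 4-8 negatives
tt_neg, v_neg, leftover = evenlymatchdate(toy, [0, 1], [2, 3], [4, 5, 6, 7, 8])
print(tt_neg, v_neg, leftover)   # two matched negatives per set, one left over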
In [3]:
def tags2tagset(x):
    ''' A function that will be applied to transform a pipe-delimited string like
    'fantasy | science-fiction' into the set {'fantasy', 'science-fiction'}. '''

    if type(x) == float:
        # an empty genretags cell is read in as NaN, which is a float
        return set()
    else:
        return set(x.split(' | '))

def divide_training_from_validation(tags4positive, tags4negative, sizecap, metadatapath):
    ''' This function divides a dataset into two parts: a training-and-test set, and a
    validation set. We ensure that authors are represented in one set *or* the other,
    not both.

    A model is optimized by gridsearch and crossvalidation on the training-and-test set.
    Then this model is applied to the validation set, and accuracy is recorded.
    '''

    meta = pd.read_csv(metadatapath)
    column_of_sets = meta['genretags'].apply(tags2tagset)
    meta = meta.assign(tagset = column_of_sets)

    overlap = []
    negatives = []
    positives = []

    for idx, row in meta.iterrows():
        if 'drop' in row['tagset']:
            # these works were dropped and will not be present in the data folder
            continue

        posintersect = len(row['tagset'] & tags4positive)
        negintersect = len(row['tagset'] & tags4negative)

        if posintersect and negintersect:
            overlap.append(idx)
        elif posintersect:
            positives.append(idx)
        elif negintersect:
            negatives.append(idx)

    print()
    print('-------------')
    print('Begin construction of validation split.')
    print("Positives/negatives:", len(positives), len(negatives))
    random.shuffle(overlap)
    print('Overlap (assigned to pos class): ' + str(len(overlap)))
    positives.extend(overlap)

    # We do selection by author
    positiveauthors = list(set(meta.loc[positives, 'author'].tolist()))
    random.shuffle(positiveauthors)

    traintest_pos = []
    validation_pos = []
    donewithtraintest = False

    for auth in positiveauthors:
        this_auth_indices = meta.index[meta['author'] == auth].tolist()
        confirmed_auth_indices = []
        for idx in this_auth_indices:
            if idx in positives:
                confirmed_auth_indices.append(idx)

        if not donewithtraintest:
            traintest_pos.extend(confirmed_auth_indices)
        else:
            validation_pos.extend(confirmed_auth_indices)

        if len(traintest_pos) > sizecap:
            # that's deliberately > rather than >= because we want a cushion
            donewithtraintest = True

    # Now let's get a set of negatives that match the positives' distribution
    # across the time axis.
    traintest_neg, validation_neg, remaining_neg = evenlymatchdate(meta, traintest_pos, validation_pos, negatives)

    traintest = meta.loc[traintest_pos + traintest_neg, : ]
    realclass = ([1] * len(traintest_pos)) + ([0] * len(traintest_neg))
    traintest = traintest.assign(realclass = realclass)

    print("Traintest pos/neg:", len(traintest_pos), len(traintest_neg))

    if len(validation_neg) > len(validation_pos):
        # we want the balance of pos and neg examples to be even
        validation_neg = validation_neg[0: len(validation_pos)]

    print("Validation pos/neg:", len(validation_pos), len(validation_neg))

    validation = meta.loc[validation_pos + validation_neg, : ]
    realclass = ([1] * len(validation_pos)) + ([0] * len(validation_neg))
    validation = validation.assign(realclass = realclass)

    return traintest, validation
Because we have a relatively small number of data points for our positive classes, there's a fair amount of variation in model accuracy depending on the exact sample chosen. It's therefore necessary to run the whole train/test/validation cycle multiple times to get a distribution and a median value.
The best way to understand the overall workflow may be to look first at the bottom function, train_and_validate(). Essentially we create a split between train/test and validation sets, and write both as temporary files. Then the train/test file is passed to a function that runs a grid search on it (via cross-validation). We get back some parameters, including cross-validated accuracy; the model and associated objects (e.g., the vocabulary and scaler) are pickled and written to disk.
Finally, we apply the pickled model to the held-out validation set in order to get validation accuracy.
We do all of that multiple times to get a sense of the distribution of possible outcomes.
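Since the point is to look at accuracy as a distribution rather than a single number, a plotting cell like the sketch below could be used to visualize the *_models.tsv file written by train_and_validate(). This is only an illustrative sketch, not part of the workflow itself; the filename is an example matching the calls further down.

def plot_validation_distribution(path):
    # histogram of validation accuracies across random splits,
    # with the median marked by a dashed line
    df = pd.read_csv(path, sep = '\t')
    plt.hist(df.validationacc, bins = 10)
    plt.axvline(np.median(df.validationacc), color = 'red', linestyle = '--')
    plt.xlabel('validation accuracy')
    plt.ylabel('number of train/test/validation splits')
    plt.show()

# for instance: plot_validation_distribution('BoWGothic_models.tsv')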
In [4]:
def tune_a_model(name, tags4positive, tags4negative, sizecap, sourcefolder, metadatapath):
    '''
    This tunes a model through gridsearch, and puts the resulting model in a ../temp
    folder, where it can be retrieved.
    '''

    vocabpath = '../lexica/' + name + '.txt'
    modeloutpath = '../temp/' + name + '.csv'

    c_range = [.0001, .001, .003, .01, .03, 0.1, 1, 10, 100, 300, 1000]
    featurestart = 1000
    featureend = 7000
    featurestep = 500
    modelparams = 'logistic', 10, featurestart, featureend, featurestep, c_range
    forbiddenwords = {}
    floor = 1700
    ceiling = 2020

    metadata, masterdata, classvector, classdictionary, orderedIDs, authormatches, vocablist = versatiletrainer2.get_simple_data(sourcefolder, metadatapath, vocabpath, tags4positive, tags4negative, sizecap, extension = '.fic.tsv', excludebelow = floor, excludeabove = ceiling,
        forbid4positive = {'drop'}, forbid4negative = {'drop'}, force_even_distribution = False, forbiddenwords = forbiddenwords)

    matrix, maxaccuracy, metadata, coefficientuples, features4max, best_regularization_coef = versatiletrainer2.tune_a_model(metadata, masterdata, classvector, classdictionary, orderedIDs, authormatches,
        vocablist, tags4positive, tags4negative, modelparams, name, modeloutpath)

    meandate = int(round(np.sum(metadata.firstpub) / len(metadata.firstpub)))
    floor = np.min(metadata.firstpub)
    ceiling = np.max(metadata.firstpub)

    os.remove(vocabpath)

    return floor, ceiling, meandate, maxaccuracy, features4max, best_regularization_coef, modeloutpath
def confirm_separation(df1, df2):
    '''
    Just some stats on the train/test vs validation split.
    '''

    authors1 = set(df1['author'])
    authors2 = set(df2['author'])
    overlap = authors1.intersection(authors2)
    if len(overlap) > 0:
        print('Overlap: ', overlap)

    # realclass is 1 for positives and 0 for negatives (assigned above)
    pos1date = np.mean(df1.loc[df1.realclass == 1, 'firstpub'])
    neg1date = np.mean(df1.loc[df1.realclass == 0, 'firstpub'])
    pos2date = np.mean(df2.loc[df2.realclass == 1, 'firstpub'])
    neg2date = np.mean(df2.loc[df2.realclass == 0, 'firstpub'])

    print("Traintest mean date pos:", pos1date, "neg:", neg1date)
    print("Validation mean date pos:", pos2date, "neg:", neg2date)
    print()
def train_and_validate(modelname, tags4positive, tags4negative, sizecap, sourcefolder, metadatapath):
    outmodels = modelname + '_models.tsv'

    if not os.path.isfile(outmodels):
        with open(outmodels, mode = 'w', encoding = 'utf-8') as f:
            outline = 'name\tsize\tfloor\tceiling\tmeandate\ttestacc\tvalidationacc\tfeatures\tregularization\ti\n'
            f.write(outline)

    for i in range(10):
        name = modelname + str(i)
        traintest, validation = divide_training_from_validation(tags4positive, tags4negative, sizecap, metadatapath)
        confirm_separation(traintest, validation)
        traintest.to_csv('../temp/traintest.csv', index = False)
        validation.to_csv('../temp/validation.csv', index = False)

        floor, ceiling, meandate, testacc, features4max, best_regularization_coef, modeloutpath = tune_a_model(name, tags4positive, tags4negative, sizecap, sourcefolder, '../temp/traintest.csv')

        modelinpath = modeloutpath.replace('.csv', '.pkl')
        results = versatiletrainer2.apply_pickled_model(modelinpath, sourcefolder, '.fic.tsv', '../temp/validation.csv')

        right = 0
        wrong = 0
        # predicted probabilities for the positive class come back in this column
        columnname = 'alien_model'

        for idx, row in results.iterrows():
            if float(row['realclass']) >= 0.5 and row[columnname] >= 0.5:
                right += 1
            elif float(row['realclass']) <= 0.5 and row[columnname] <= 0.5:
                right += 1
            else:
                wrong += 1

        validationacc = right / (right + wrong)
        validoutpath = modeloutpath.replace('.csv', '.validate.csv')
        results.to_csv(validoutpath)

        print()
        print('Validated: ', validationacc)

        with open(outmodels, mode = 'a', encoding = 'utf-8') as f:
            outline = '\t'.join([name, str(sizecap), str(floor), str(ceiling), str(meandate), str(testacc), str(validationacc), str(features4max), str(best_regularization_coef), str(i)]) + '\n'
            f.write(outline)
In [17]:
train_and_validate('BoWGothic', {'lochorror', 'pbgothic', 'locghost', 'stangothic', 'chihorror'},
{'random', 'chirandom'}, 125, '../newdata/', '../meta/finalmeta.csv')
In [19]:
train_and_validate('BoWSF', {'anatscifi', 'locscifi', 'chiscifi', 'femscifi'},
{'random', 'chirandom'}, 125, '../newdata/', '../meta/finalmeta.csv')
In [18]:
train_and_validate('BoWMystery', {'locdetective', 'locdetmyst', 'chimyst', 'det100'},
{'random', 'chirandom'}, 125, '../newdata/', '../meta/finalmeta.csv')
In [5]:
# Note: this redefines tune_a_model from the cell above, with a narrower feature range
# and a wider c_range, for the smaller '../reduced_data/' sample used in the next cell.

def tune_a_model(name, tags4positive, tags4negative, sizecap, sourcefolder, metadatapath):
    '''
    This tunes a model through gridsearch, and puts the resulting model in a ../temp
    folder, where it can be retrieved.
    '''

    vocabpath = '../lexica/' + name + '.txt'
    modeloutpath = '../temp/' + name + '.csv'

    c_range = [.00001, .0001, .001, .003, .01, .03, 0.1, 1, 10, 100, 300, 1000]
    featurestart = 10
    featureend = 1500
    featurestep = 100
    modelparams = 'logistic', 10, featurestart, featureend, featurestep, c_range
    forbiddenwords = {}
    floor = 1700
    ceiling = 2020

    metadata, masterdata, classvector, classdictionary, orderedIDs, authormatches, vocablist = versatiletrainer2.get_simple_data(sourcefolder, metadatapath, vocabpath, tags4positive, tags4negative, sizecap, extension = '.fic.tsv', excludebelow = floor, excludeabove = ceiling,
        forbid4positive = {'drop'}, forbid4negative = {'drop'}, force_even_distribution = False, forbiddenwords = forbiddenwords)

    matrix, maxaccuracy, metadata, coefficientuples, features4max, best_regularization_coef = versatiletrainer2.tune_a_model(metadata, masterdata, classvector, classdictionary, orderedIDs, authormatches,
        vocablist, tags4positive, tags4negative, modelparams, name, modeloutpath)

    meandate = int(round(np.sum(metadata.firstpub) / len(metadata.firstpub)))
    floor = np.min(metadata.firstpub)
    ceiling = np.max(metadata.firstpub)

    os.remove(vocabpath)

    return floor, ceiling, meandate, maxaccuracy, features4max, best_regularization_coef, modeloutpath
In [6]:
train_and_validate('BoWShrunkenGothic', {'lochorror', 'pbgothic', 'locghost', 'stangothic', 'chihorror'},
{'random', 'chirandom'}, 40, '../reduced_data/', '../meta/finalmeta.csv')
In [5]:
sf = pd.read_csv('../results/ABsfembeds_models.tsv', sep = '\t')
sf.head()
Out[5]:
In [6]:
sf.shape
Out[6]:
In [14]:
new = sf.loc[[x for x in range(31,41)], : ]
old = sf.loc[[x for x in range(21,31)], : ]
In [17]:
print(np.median(new.testacc), np.median(old.testacc))
In [19]:
print(np.mean(new.validationacc), np.mean(old.validationacc))
In [20]:
new
Out[20]:
In [21]:
old
Out[21]:
In [22]:
print(np.mean(new.features), np.mean(old.features))
In [54]:
hist = pd.read_csv('../results/HistShrunkenGothic_models.tsv', sep = '\t')
hist1990 = pd.read_csv('../results/Hist1990ShrunkenGothic_models.tsv', sep = '\t')
In [42]:
bow = pd.read_csv('../results/BoWShrunkenGothic_models.tsv', sep = '\t')
In [57]:
glove = pd.read_csv('../results/GloveShrunkenGothic_models.tsv', sep = '\t')
In [58]:
print(np.mean(hist.testacc), np.mean(bow.testacc), np.mean(glove.testacc))
In [59]:
print(np.mean(hist.validationacc), np.mean(hist1990.validationacc), np.mean(bow.validationacc), np.mean(glove.validationacc))
In [53]:
print(np.mean(hist.features), np.mean(hist1990.features), np.mean(bow.features), np.mean(glove.features))
In [17]:
print(np.mean(myst.testacc[0:10]), np.mean(myst.testacc[10: ]))
In [24]:
hist = pd.read_csv('../results/HistGothic_models.tsv', sep = '\t')
In [27]:
print(np.mean(hist.validationacc), np.mean(bowgoth.validationacc))
In [60]:
hist = pd.read_csv('../results/HistGothic_models.tsv', sep = '\t')
hist1990 = pd.read_csv('../results/Hist1990Gothic_models.tsv', sep = '\t')
bow = pd.read_csv('../results/BoWGothic_models.tsv', sep = '\t')
print(np.mean(hist.validationacc), np.mean(hist1990.validationacc), np.mean(bow.validationacc))
In [13]:
bow = pd.read_csv('BoWMystery_models.tsv', sep = '\t')
In [14]:
np.mean(bow.validationacc[0:30])
Out[14]:
In [15]:
np.mean(bow.validationacc[30: ])
Out[15]:
In [ ]: