This notebook attempts a slight improvement on the methods deployed in my 2015 article, "The Life Cycles of Genres."
In 2015, I used a fixed number of features and a fixed regularization constant. Now I optimize n (the number of features) and c (the regularization constant) through grid search, running multiple cross-validations on a train/test set to find the best constants for a given sample.
To avoid exaggerating accuracy through multiple trials, I have also moved to a train/test/validation split: constants are optimized through cross-validation on the train/test set, but the model is then tested on a separate validation set. I repeat that process on random train/test/validation splits in order to visualize model accuracy as a distribution.
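As a rough sketch of what that optimization involves, the function below uses generic scikit-learn conventions purely for illustration; the grid search actually used in this notebook lives in versatiletrainer2 and differs in detail (for instance, in how candidate features are ranked).

def gridsearch_n_and_c(X, y, n_range, c_range, folds = 10):
    # Illustrative only: score every (number of features, C) pair by
    # cross-validated accuracy and return the best combination.
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline

    best_acc, best_n, best_c = 0, None, None
    for n in n_range:
        for c in c_range:
            pipe = Pipeline([('select', SelectKBest(f_classif, k = n)),
                             ('model', LogisticRegression(C = c, max_iter = 1000))])
            acc = cross_val_score(pipe, X, y, cv = folds).mean()
            if acc > best_acc:
                best_acc, best_n, best_c = acc, n, c
    return best_acc, best_n, best_c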
Getting the train/test vs. validation split right can be challenging, because we want to avoid repeating authors from the train/test set in validation. (Or in both train and test, for that matter.) Authorial diction is constant enough that repeated authors could become an unfair advantage for genres with a few prolific authors. We also want to ensure that the positive and negative classes within a given set have a similar distribution across historical time. (Otherwise the model will become a model of language change.) Building sets where all these conditions hold is more involved than taking a random sample of volumes.
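If author exclusivity were the only constraint, a grouped split would be enough; the sketch below uses scikit-learn's GroupShuffleSplit with 'author' as the group key, just to illustrate that one condition. It does not handle the date matching described above, which is why the notebook builds its own split instead.

def author_exclusive_split(meta, test_size = 0.2, seed = 0):
    # Keep every author entirely on one side of the split.
    from sklearn.model_selection import GroupShuffleSplit
    splitter = GroupShuffleSplit(n_splits = 1, test_size = test_size, random_state = seed)
    train_idx, valid_idx = next(splitter.split(meta, groups = meta['author']))
    return meta.iloc[train_idx], meta.iloc[valid_idx]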
Most of the code in this notebook is concerned with creating the train/test-vs-validation split. The actual modeling happens in versatiletrainer2, which we import in the first cell.
In [1]:
import sys
import os, csv, random
import numpy as np
import pandas as pd
import versatiletrainer2
import metaselector
import matplotlib.pyplot as plt
from scipy import stats
%matplotlib inline
The functions defined below are used to create a train/test/validation divide, while also ensuring that authors fall on only one side of the divide and that positive and negative classes have similar date distributions.
But the best way to understand the overall workflow may be to scan down a few cells to the bottom function, train_and_validate().
In [2]:
def evenlymatchdate(meta, tt_positives, v_positives, negatives):
    '''
    Given a metadata DataFrame, two lists of positive indexes and a (larger) list
    of negative indexes, this assigns negatives that match the date distribution
    of the two positive lists as closely as possible, working randomly so that
    neither list gets "a first shot" at maximally close matches.

    The task is complicated by our goal of ensuring that authors are only
    represented in the train/test OR the validation set. To do this while
    using as much of our sample as we can, we encourage the algorithm to choose
    works from already-selected authors when they fit the date parameters needed.
    This is the function of the neg_unmatched lists: works by authors we have
    chosen, not yet matched to a positive work.
    '''

    assert len(negatives) > (len(tt_positives) + len(v_positives))

    authors = dict()
    authors['tt'] = set(meta.loc[tt_positives, 'author'])
    authors['v'] = set(meta.loc[v_positives, 'author'])

    neg_matched = dict()
    neg_matched['tt'] = []
    neg_matched['v'] = []

    neg_unmatched = dict()
    neg_unmatched['v'] = []
    neg_unmatched['tt'] = []

    negative_meta = meta.loc[negatives, : ]

    allpositives = [(x, 'tt') for x in tt_positives]
    allpositives.extend([(x, 'v') for x in v_positives])
    random.shuffle(allpositives)

    for idx, settype in allpositives:
        if settype == 'v':
            inversetype = 'tt'
        else:
            inversetype = 'v'

        date = meta.loc[idx, 'firstpub']
        found = False

        negative_meta = negative_meta.assign(diff = np.abs(negative_meta['firstpub'] - date))

        for idx2 in neg_unmatched[settype]:
            matchdate = meta.loc[idx2, 'firstpub']
            if abs(matchdate - date) < 3:
                neg_matched[settype].append(idx2)
                location = neg_unmatched[settype].index(idx2)
                neg_unmatched[settype].pop(location)
                found = True
                break

        if not found:
            candidates = []
            for i in range(200):
                aspirants = negative_meta.index[negative_meta['diff'] == i].tolist()

                # the following section ensures that authors in
                # traintest don't end up also in validation
                for a in aspirants:
                    asp_author = meta.loc[a, 'author']
                    if asp_author not in authors[inversetype]:
                        # don't even consider books by authors already
                        # in the other set
                        candidates.append(a)

                if len(candidates) > 0:
                    break

            chosen = random.sample(candidates, 1)[0]
            chosenauth = negative_meta.loc[chosen, 'author']
            allbyauth = negative_meta.index[negative_meta['author'] == chosenauth].tolist()
            authors[settype].add(chosenauth)
            if len(allbyauth) < 1:
                print('error')

            for idx3 in allbyauth:
                if idx3 == chosen:
                    # the one we actually chose
                    neg_matched[settype].append(idx3)
                else:
                    # others by the same author, to be considered first in future
                    neg_unmatched[settype].append(idx3)

            negative_meta.drop(allbyauth, inplace = True)

        if len(negative_meta) == 0:
            print('Exhausted negatives! This is surprising.')
            break

    # other books by already-chosen authors can be added to the set in the end
    tt_neg = neg_matched['tt'] + neg_unmatched['tt']
    v_neg = neg_matched['v'] + neg_unmatched['v']
    remaining_neg = negative_meta.index.tolist()

    return tt_neg, v_neg, remaining_neg
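A hypothetical toy example (invented data, not drawn from the corpus) illustrates the call signature: five negatives get date-matched to two train/test positives and two validation positives, leaving one negative unassigned.

toy = pd.DataFrame({
    'author':   ['A', 'B', 'C', 'C', 'D', 'E', 'F', 'G', 'H'],
    'firstpub': [1890, 1900, 1891, 1902, 1899, 1950, 1889, 1903, 1910]})
# rows 0-1 are train/test positives, rows 2-3 validation positives, rows 4-8 negatives
tt_neg, v_neg, leftover = evenlymatchdate(toy, [0, 1], [2, 3], [4, 5, 6, 7, 8])
print(tt_neg, v_neg, leftover)   # two matched negatives per set, one left over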
In [3]:
def tags2tagset(x):
    ''' A function that will be applied to transform a pipe-delimited string like
    'fantasy | science-fiction' into the set {'fantasy', 'science-fiction'}. '''

    if type(x) == float:
        # an empty genretags cell is read in as NaN, which is a float
        return set()
    else:
        return set(x.split(' | '))

def divide_training_from_validation(tags4positive, tags4negative, sizecap, metadatapath):
    ''' This function divides a dataset into two parts: a training-and-test set, and a
    validation set. We ensure that authors are represented in one set *or* the other,
    not both.

    A model is optimized by gridsearch and crossvalidation on the training-and-test set.
    Then this model is applied to the validation set, and accuracy is recorded.
    '''

    meta = pd.read_csv(metadatapath)
    column_of_sets = meta['genretags'].apply(tags2tagset)
    meta = meta.assign(tagset = column_of_sets)

    overlap = []
    negatives = []
    positives = []

    for idx, row in meta.iterrows():
        if 'drop' in row['tagset']:
            # these works were dropped and will not be present in the data folder
            continue

        posintersect = len(row['tagset'] & tags4positive)
        negintersect = len(row['tagset'] & tags4negative)

        if posintersect and negintersect:
            overlap.append(idx)
        elif posintersect:
            positives.append(idx)
        elif negintersect:
            negatives.append(idx)

    print()
    print('-------------')
    print('Begin construction of validation split.')
    print("Positives/negatives:", len(positives), len(negatives))
    random.shuffle(overlap)
    print('Overlap (assigned to pos class): ' + str(len(overlap)))
    positives.extend(overlap)

    # We do selection by author
    positiveauthors = list(set(meta.loc[positives, 'author'].tolist()))
    random.shuffle(positiveauthors)

    traintest_pos = []
    validation_pos = []
    donewithtraintest = False

    for auth in positiveauthors:
        this_auth_indices = meta.index[meta['author'] == auth].tolist()
        confirmed_auth_indices = []
        for idx in this_auth_indices:
            if idx in positives:
                confirmed_auth_indices.append(idx)

        if not donewithtraintest:
            traintest_pos.extend(confirmed_auth_indices)
        else:
            validation_pos.extend(confirmed_auth_indices)

        if len(traintest_pos) > sizecap:
            # that's deliberately > rather than >= because we want a cushion
            donewithtraintest = True

    # Now let's get a set of negatives that match the positives' distribution
    # across the time axis.
    traintest_neg, validation_neg, remaining_neg = evenlymatchdate(meta, traintest_pos, validation_pos, negatives)

    traintest = meta.loc[traintest_pos + traintest_neg, : ]
    realclass = ([1] * len(traintest_pos)) + ([0] * len(traintest_neg))
    traintest = traintest.assign(realclass = realclass)

    print("Traintest pos/neg:", len(traintest_pos), len(traintest_neg))

    if len(validation_neg) > len(validation_pos):
        # we want the balance of pos and neg examples to be even
        validation_neg = validation_neg[0: len(validation_pos)]

    print("Validation pos/neg:", len(validation_pos), len(validation_neg))

    validation = meta.loc[validation_pos + validation_neg, : ]
    realclass = ([1] * len(validation_pos)) + ([0] * len(validation_neg))
    validation = validation.assign(realclass = realclass)

    return traintest, validation
Because we have a relatively small number of data points for our positive classes, there's a fair amount of variation in model accuracy depending on the exact sample chosen. It's therefore necessary to run the whole train/test/validation cycle multiple times to get a distribution and a median value.
The best way to understand the overall workflow may be to look first at the bottom function, train_and_validate(). Essentially we create a split between train/test and validation sets, and write both as temporary files. Then the train/test file is passed to a function that runs a grid search on it (via cross-validation). We get back some parameters, including cross-validated accuracy; the model and associated objects (e.g., the vocabulary and scaler) are pickled and written to disk.
Finally, we apply the pickled model to the held-out validation set in order to get validation accuracy.
We do all of that multiple times to get a sense of the distribution of possible outcomes.
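Since the point is to look at accuracy as a distribution rather than a single number, a plotting cell like the sketch below could be used to visualize the *_models.tsv file written by train_and_validate(). This is only an illustrative sketch, not part of the workflow itself; the filename is an example matching the calls further down.

def plot_validation_distribution(path):
    # histogram of validation accuracies across random splits,
    # with the median marked by a dashed line
    df = pd.read_csv(path, sep = '\t')
    plt.hist(df.validationacc, bins = 10)
    plt.axvline(np.median(df.validationacc), color = 'red', linestyle = '--')
    plt.xlabel('validation accuracy')
    plt.ylabel('number of train/test/validation splits')
    plt.show()

# for instance: plot_validation_distribution('BoWGothic_models.tsv')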
In [4]:
def tune_a_model(name, tags4positive, tags4negative, sizecap, sourcefolder, metadatapath):
    '''
    This tunes a model through gridsearch, and puts the resulting model in a ../temp
    folder, where it can be retrieved.
    '''

    vocabpath = '../lexica/' + name + '.txt'
    modeloutpath = '../temp/' + name + '.csv'

    c_range = [.0001, .001, .003, .01, .03, 0.1, 1, 10, 100, 300, 1000]
    featurestart = 1000
    featureend = 7000
    featurestep = 500
    modelparams = 'logistic', 10, featurestart, featureend, featurestep, c_range
    forbiddenwords = {}
    floor = 1700
    ceiling = 2020

    metadata, masterdata, classvector, classdictionary, orderedIDs, authormatches, vocablist = versatiletrainer2.get_simple_data(sourcefolder, metadatapath, vocabpath, tags4positive, tags4negative, sizecap, extension = '.fic.tsv', excludebelow = floor, excludeabove = ceiling,
        forbid4positive = {'drop'}, forbid4negative = {'drop'}, force_even_distribution = False, forbiddenwords = forbiddenwords)

    matrix, maxaccuracy, metadata, coefficientuples, features4max, best_regularization_coef = versatiletrainer2.tune_a_model(metadata, masterdata, classvector, classdictionary, orderedIDs, authormatches,
        vocablist, tags4positive, tags4negative, modelparams, name, modeloutpath)

    meandate = int(round(np.sum(metadata.firstpub) / len(metadata.firstpub)))
    floor = np.min(metadata.firstpub)
    ceiling = np.max(metadata.firstpub)

    os.remove(vocabpath)

    return floor, ceiling, meandate, maxaccuracy, features4max, best_regularization_coef, modeloutpath
def confirm_separation(df1, df2):
    '''
    Just some stats on the train/test vs validation split.
    '''

    authors1 = set(df1['author'])
    authors2 = set(df2['author'])
    overlap = authors1.intersection(authors2)
    if len(overlap) > 0:
        print('Overlap: ', overlap)

    # realclass is 1 for positives and 0 for negatives (assigned above)
    pos1date = np.mean(df1.loc[df1.realclass == 1, 'firstpub'])
    neg1date = np.mean(df1.loc[df1.realclass == 0, 'firstpub'])
    pos2date = np.mean(df2.loc[df2.realclass == 1, 'firstpub'])
    neg2date = np.mean(df2.loc[df2.realclass == 0, 'firstpub'])

    print("Traintest mean date pos:", pos1date, "neg:", neg1date)
    print("Validation mean date pos:", pos2date, "neg:", neg2date)
    print()
def train_and_validate(modelname, tags4positive, tags4negative, sizecap, sourcefolder, metadatapath):
    outmodels = modelname + '_models.tsv'

    if not os.path.isfile(outmodels):
        with open(outmodels, mode = 'w', encoding = 'utf-8') as f:
            outline = 'name\tsize\tfloor\tceiling\tmeandate\ttestacc\tvalidationacc\tfeatures\tregularization\ti\n'
            f.write(outline)

    for i in range(10):
        name = modelname + str(i)
        traintest, validation = divide_training_from_validation(tags4positive, tags4negative, sizecap, metadatapath)
        confirm_separation(traintest, validation)
        traintest.to_csv('../temp/traintest.csv', index = False)
        validation.to_csv('../temp/validation.csv', index = False)

        floor, ceiling, meandate, testacc, features4max, best_regularization_coef, modeloutpath = tune_a_model(name, tags4positive, tags4negative, sizecap, sourcefolder, '../temp/traintest.csv')

        modelinpath = modeloutpath.replace('.csv', '.pkl')
        results = versatiletrainer2.apply_pickled_model(modelinpath, sourcefolder, '.fic.tsv', '../temp/validation.csv')

        right = 0
        wrong = 0
        # predicted probabilities for the positive class come back in this column
        columnname = 'alien_model'

        for idx, row in results.iterrows():
            if float(row['realclass']) >= 0.5 and row[columnname] >= 0.5:
                right += 1
            elif float(row['realclass']) <= 0.5 and row[columnname] <= 0.5:
                right += 1
            else:
                wrong += 1

        validationacc = right / (right + wrong)
        validoutpath = modeloutpath.replace('.csv', '.validate.csv')
        results.to_csv(validoutpath)

        print()
        print('Validated: ', validationacc)

        with open(outmodels, mode = 'a', encoding = 'utf-8') as f:
            outline = '\t'.join([name, str(sizecap), str(floor), str(ceiling), str(meandate), str(testacc), str(validationacc), str(features4max), str(best_regularization_coef), str(i)]) + '\n'
            f.write(outline)
In [17]:
train_and_validate('BoWGothic', {'lochorror', 'pbgothic', 'locghost', 'stangothic', 'chihorror'},
{'random', 'chirandom'}, 125, '../newdata/', '../meta/finalmeta.csv')
In [19]:
train_and_validate('BoWSF', {'anatscifi', 'locscifi', 'chiscifi', 'femscifi'},
{'random', 'chirandom'}, 125, '../newdata/', '../meta/finalmeta.csv')
In [18]:
train_and_validate('BoWMystery', {'locdetective', 'locdetmyst', 'chimyst', 'det100'},
{'random', 'chirandom'}, 125, '../newdata/', '../meta/finalmeta.csv')
In [5]:
# Note: this redefines tune_a_model from the cell above, with a narrower feature range
# and a wider c_range, for the smaller '../reduced_data/' sample used in the next cell.

def tune_a_model(name, tags4positive, tags4negative, sizecap, sourcefolder, metadatapath):
    '''
    This tunes a model through gridsearch, and puts the resulting model in a ../temp
    folder, where it can be retrieved.
    '''

    vocabpath = '../lexica/' + name + '.txt'
    modeloutpath = '../temp/' + name + '.csv'

    c_range = [.00001, .0001, .001, .003, .01, .03, 0.1, 1, 10, 100, 300, 1000]
    featurestart = 10
    featureend = 1500
    featurestep = 100
    modelparams = 'logistic', 10, featurestart, featureend, featurestep, c_range
    forbiddenwords = {}
    floor = 1700
    ceiling = 2020

    metadata, masterdata, classvector, classdictionary, orderedIDs, authormatches, vocablist = versatiletrainer2.get_simple_data(sourcefolder, metadatapath, vocabpath, tags4positive, tags4negative, sizecap, extension = '.fic.tsv', excludebelow = floor, excludeabove = ceiling,
        forbid4positive = {'drop'}, forbid4negative = {'drop'}, force_even_distribution = False, forbiddenwords = forbiddenwords)

    matrix, maxaccuracy, metadata, coefficientuples, features4max, best_regularization_coef = versatiletrainer2.tune_a_model(metadata, masterdata, classvector, classdictionary, orderedIDs, authormatches,
        vocablist, tags4positive, tags4negative, modelparams, name, modeloutpath)

    meandate = int(round(np.sum(metadata.firstpub) / len(metadata.firstpub)))
    floor = np.min(metadata.firstpub)
    ceiling = np.max(metadata.firstpub)

    os.remove(vocabpath)

    return floor, ceiling, meandate, maxaccuracy, features4max, best_regularization_coef, modeloutpath
In [6]:
train_and_validate('BoWShrunkenGothic', {'lochorror', 'pbgothic', 'locghost', 'stangothic', 'chihorror'},
{'random', 'chirandom'}, 40, '../reduced_data/', '../meta/finalmeta.csv')
In [5]:
sf = pd.read_csv('../results/ABsfembeds_models.tsv', sep = '\t')
sf.head()
Out[5]:
In [6]:
sf.shape
Out[6]:
In [14]:
new = sf.loc[[x for x in range(31,41)], : ]
old = sf.loc[[x for x in range(21,31)], : ]
In [17]:
print(np.median(new.testacc), np.median(old.testacc))
In [19]:
print(np.mean(new.validationacc), np.mean(old.validationacc))
In [20]:
new
Out[20]:
In [21]:
old
Out[21]:
In [22]:
print(np.mean(new.features), np.mean(old.features))
In [54]:
hist = pd.read_csv('../results/HistShrunkenGothic_models.tsv', sep = '\t')
hist1990 = pd.read_csv('../results/Hist1990ShrunkenGothic_models.tsv', sep = '\t')
In [42]:
bow = pd.read_csv('../results/BoWShrunkenGothic_models.tsv', sep = '\t')
In [57]:
glove = pd.read_csv('../results/GloveShrunkenGothic_models.tsv', sep = '\t')
In [58]:
print(np.mean(hist.testacc), np.mean(bow.testacc), np.mean(glove.testacc))
In [59]:
print(np.mean(hist.validationacc), np.mean(hist1990.validationacc), np.mean(bow.validationacc), np.mean(glove.validationacc))
In [53]:
print(np.mean(hist.features), np.mean(hist1990.features), np.mean(bow.features), np.mean(glove.features))
In [17]:
print(np.mean(myst.testacc[0:10]), np.mean(myst.testacc[10: ]))
In [24]:
hist = pd.read_csv('../results/HistGothic_models.tsv', sep = '\t')
In [27]:
print(np.mean(hist.validationacc), np.mean(bowgoth.validationacc))
In [60]:
hist = pd.read_csv('../results/HistGothic_models.tsv', sep = '\t')
hist1990 = pd.read_csv('../results/Hist1990Gothic_models.tsv', sep = '\t')
bow = pd.read_csv('../results/BoWGothic_models.tsv', sep = '\t')
print(np.mean(hist.validationacc), np.mean(hist1990.validationacc), np.mean(bow.validationacc))
In [13]:
bow = pd.read_csv('BoWMystery_models.tsv', sep = '\t')
In [14]:
np.mean(bow.validationacc[0:30])
Out[14]:
In [15]:
np.mean(bow.validationacc[30: ])
Out[15]:
In [ ]: