This notebook will show you how to produce figures 1.3 and 1.4 after the predictive modeling is completed.
The predictive modeling itself, unfortunately, doesn't fit in a notebook. The number-crunching can take several hours, and although logistic regression itself is not complicated, the practical details -- dates, authors, multiprocessing to speed things up, etc. -- turn it into a couple thousand lines of code. (If you want to dig into that, see chapter1/code/biomodel.py, and the scripts in /logistic at the top level of the repo.)
Without covering those tangled details, this notebook can still explore the results of modeling in enough depth to give you a sense of some important choices made along the way.
I start by finding an optimal number of features for the model, along with a value for C (the regularization constant). To do this I run a "grid search" that tests different values of both parameters. (I use the "gridsearch" option in biomodel, i.e., python3 biomodel.py gridsearch.) The result looks like this:
where darker red squares indicate higher accuracies. I haven't labeled the axes properly, but the vertical axis here is the number of features (from 800 to 2500), and the horizontal axis is the C parameter (from 0.0012 to 10, spaced logarithmically).
It's important to use the same sample size for this test that you plan to use in the final model: in this case a rather small group of 150 volumes (75 positive and 75 negative), because I want to be able to run models in periods as small as 20 years. With such a small sample, it's important to run the gridsearch several times, since the selection of a particular 150 volumes introduces considerable random variability into the process.
One could tune the C parameter for each sample, and I try that in a different chapter, but my experience is that it introduces complexity without actually changing results--plus I get anxious about overfitting through parameter selection. Probably better just to confirm results with multiple samples and multiple C settings. A robust result should hold up.
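The actual gridsearch lives in biomodel.py and is too tangled to reproduce here, but the underlying logic can be sketched in a few lines. The cell below is a simplified stand-in, not the code that was actually run: it assumes a hypothetical document-term matrix X (one row per volume, columns ordered by corpus frequency) and a vector of genre labels y, and it substitutes scikit-learn's ordinary cross-validation for the sampling and author-segregation logic in biomodel.py. Each freshly drawn 150-volume sample would supply a new X and y, and the resulting grids would be averaged.
In [ ]:
# A simplified sketch of the gridsearch logic -- not the code in biomodel.py.
# X is assumed to be a (volumes x words) frequency matrix with columns ordered by
# corpus frequency, and y a vector of 0/1 genre labels; both are hypothetical here.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def grid_accuracy(X, y, feature_counts, c_values, folds = 5):
    ''' Cross-validated accuracy for each (number of features, C) pair. '''
    grid = np.zeros((len(feature_counts), len(c_values)))
    for i, n_feat in enumerate(feature_counts):
        subset = X[:, :n_feat]          # keep only the n_feat most common words
        for j, c in enumerate(c_values):
            model = make_pipeline(StandardScaler(), LogisticRegression(C = c))
            grid[i, j] = np.mean(cross_val_score(model, subset, y, cv = folds))
    return grid

feature_counts = [800, 1100, 1500, 2000, 2500]
c_values = np.logspace(np.log10(0.0012), np.log10(10), 8)
# accuracies = grid_accuracy(X, y, feature_counts, c_values)
# plt.imshow(accuracies, cmap = 'Reds', origin = 'lower')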
I've tested the differentiation of genres with multiple parameter settings, and it does hold up. But for figure 1.3, I settled on 1100 features (words) and C = 0.015 as settings that fairly consistently produce good results for the biography / fiction boundary. Then it's possible to run the final sequence of models.
I do this by running python3 biomodel.py usenewdata (the contrast between 'new' and 'old' metadata will become relevant later in this notebook). That produces a file of results visualized below.
In [74]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import random
In [77]:
accuracy_df = pd.read_csv('../modeloutput/finalbiopredicts.csv')
accuracy_df.head()
Out[77]:
In [78]:
# I "jitter" results horizontally because we often have multiple results with the same x and y coordinates.
def jitteraframe(df, yname):
    jitter = dict()
    for i in df.index:
        x = df.loc[i, 'center']
        y = df.loc[i, yname]
        if x not in jitter:
            jitter[x] = set()
        elif y in jitter[x]:
            dodge = random.choice([-6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6])
            x = x + dodge
            df.loc[i, 'center'] = x
            if x not in jitter:
                jitter[x] = set()
        jitter[x].add(y)
jitteraframe(accuracy_df, 'accuracy')
fig, ax = plt.subplots(figsize = (9, 9))
ax.margins(0.1)
ax.plot(accuracy_df.center, accuracy_df.accuracy, marker = 'o', linestyle = '', alpha = 0.5)
ax.annotate('accuracy', xy = (1700,1), fontsize = 16)
plt.show()
There's a lot of random variation at this small sample size, but it's still perfectly clear that accuracy rises across the timeline. The relationship may not be linear: the boundary between fiction and biography looks sharpest around 1910, and rather than a smooth rise, the pattern might be two regimes divided around 1850. But the overall trend is unmistakable: if we modeled it simply as a linear correlation, it would be strong and significant.
In [35]:
from scipy.stats import pearsonr
pearsonr(accuracy_df.floor, accuracy_df.accuracy)
Out[35]:
The first number is the correlation coefficient; the second a p value.
In a sense, plotting individual volumes is extremely simple. My modeling process writes files that record the metadata for each volume, along with a column, logistic, that reports the predicted probability of being in the positive class (in this case, fiction). We can just plot those probabilities on the y axis and the dates used for modeling on the x axis. I've done that below.
In [14]:
root = '../modeloutput/'
frames = []
for floor in range(1700, 2000, 50):
    sourcefile = root + 'theninehundred' + str(floor) + '.csv'
    thisframe = pd.read_csv(sourcefile)
    frames.append(thisframe)
df = pd.concat(frames)
df.head()
Out[14]:
In [15]:
groups = df.groupby('realclass')
groupnames = {0: 'biography', 1: 'fiction'}
groupcolors = {0: 'k', 1: 'r'}
fig, ax = plt.subplots(figsize = (9, 9))
ax.margins(0.1)
for code, group in groups:
    ax.plot(group.dateused, group.logistic, marker='o', linestyle='', ms=6, alpha = 0.66, color = groupcolors[code], label=groupnames[code])
ax.legend(numpoints = 1, loc = 'upper left')
plt.show()
The pattern you see above is real, and makes a nice visual emblem of generic differentiation. However, there are some choices involved that are worth reflecting on. The probabilities plotted above were produced by six models, trained on 50-year segments of the timeline, using 1100 features and a C setting of 0.00008. That C setting works fine, but it's much lower than the one I chose as optimal for assessing accuracy. What happens if we instead use C = 0.015, and in fact simply reuse the evidence from figure 1.3 unchanged?
The accuracies recorded in finalbiopredicts.csv come from a series of models named cleanpredictbio (plus a period floor and a date). I haven't saved all of them, but we do have the last model in each sequence of 15. We can plot those probabilities.
In [73]:
root = '../modeloutput/'
frames = []
for floor in range(1700, 2000, 20):
    if floor == 1720:
        # the first model covers 40 years
        continue
    sourcefile = root + 'cleanpredictbio' + str(floor) + '2017-10-15.csv'
    thisframe = pd.read_csv(sourcefile)
    frames.append(thisframe)
df = pd.concat(frames)
bio = []
fic = []
for i in range(1710, 1990):
    segment = df[(df.dateused > (i - 10)) & (df.dateused < (i + 10))]
    bio.append(np.mean(segment[segment.realclass == 0].logistic))
    fic.append(np.mean(segment[segment.realclass == 1].logistic))
groups = df.groupby('realclass')
groupnames = {0: 'biography', 1: 'fiction'}
groupcolors = {0: 'k', 1: 'r'}
fig, ax = plt.subplots(figsize = (9, 9))
ax.margins(0.1)
for code, group in groups:
    ax.plot(group.dateused, group.logistic, marker='o', linestyle='', ms=6, alpha = 0.5, color = groupcolors[code], label=groupnames[code])
ax.plot(list(range(1710,1990)), bio, c = 'k')
ax.plot(list(range(1710,1990)), fic, c = 'r')
ax.legend(numpoints = 1, loc = 'upper left')
plt.show()
Whoa, that's a different picture!
If you look closely, there's still a pattern of differentiation: probabilities are more dispersed in the early going, and the probabilities of fiction and biography overlap more. Later on, a space opens up between the genres. I've plotted the mean trend lines to confirm the divergence.
But the picture looks very different. This model uses less aggressive regularization (the bigger C constant makes it more confident), so most probabilities hit the walls around 1.0 or 0.0.
This makes it less obvious, visually, that differentiation is a phenomenon affecting the whole genre. We actually do see a significant change in medians here, as well as means. But it would be hard to see with your eyeballs, because the trend lines are squashed toward the edges.
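Here's a quick, rough way to check the point about medians, using the df assembled above. The 1850 split is just an illustrative cutoff, not the analysis reported in the book.
In [ ]:
# A rough check of the claim about medians, using the df built above.
# The 1850 split point is arbitrary, chosen only to contrast early and late periods.
early = df[df.dateused < 1850]
late = df[df.dateused >= 1850]
for name, subset in [('before 1850', early), ('1850 and after', late)]:
    fic_median = subset[subset.realclass == 1].logistic.median()
    bio_median = subset[subset.realclass == 0].logistic.median()
    print(name + ': median gap between genres =', round(fic_median - bio_median, 3))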
So I've chosen to use more aggressive regularization (and a smaller number of examples) for the illustration in the book. That's a debatable choice, and a consequential one: as I acknowledge above, it changes the way we understand the word differentiation. I think there are valid reasons for the choice. Neither of the illustrations above is "truer" than the other; they are alternate, valid perspectives on the same evidence. But if you want to use this kind of visualization, it's important to recognize that tuning the regularization constant will very predictably confront you with this kind of choice. It can't make a pattern of differentiation appear out of thin air, but it absolutely does change the distribution of probabilities across the y axis. It's a visual-rhetorical choice that needs acknowledging.
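That effect is easy to demonstrate on synthetic data. The toy example below has nothing to do with the book's corpus: it generates two overlapping classes at random, fits the same logistic regression with a very small and a larger C, and shows how the larger C spreads the predicted probabilities toward 0 and 1, while accuracy changes much less than the probability spread does.
In [ ]:
# A toy demonstration, on synthetic data unrelated to the book's corpus, of how C
# changes the spread of predicted probabilities without changing the underlying pattern.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-0.5, 1, (200, 20)), rng.normal(0.5, 1, (200, 20))])
y = np.array([0] * 200 + [1] * 200)

for c in [0.00008, 0.015]:
    model = LogisticRegression(C = c).fit(X, y)
    probs = model.predict_proba(X)[:, 1]
    print('C =', c, '| std of predicted probabilities:', round(probs.std(), 3),
          '| accuracy:', round(model.score(X, y), 3))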
In [ ]: