This notebook doesn't reproduce everything from scratch. To do that, you would need to go into the code folder for this chapter and recreate the models using one of the "reproduce" scripts. My goal here is just to run data analysis on existing models and metadata files, in order to document some of the figures I quote in the chapter.
In [5]:
import pandas as pd
import numpy as np
In [81]:
poetry = pd.read_csv('metadata/prestigepoemeta.csv')
fiction = pd.read_csv('metadata/prestigeficmeta.csv')
print(poetry.shape, fiction.shape)
In [3]:
def forcefloat(astring):
    # Coerce a string to float; treat unparseable values as 0.
    try:
        fval = float(astring)
    except (ValueError, TypeError):
        fval = 0
    return fval

fiction['yrrev'] = fiction['yrrev'].apply(forcefloat)
grouped = fiction.loc[:, ['earliestdate', 'yrrev', 'pubname']].groupby('pubname')
venues = grouped.aggregate(['min', 'max', 'count'])
venues
Out[3]:
In [5]:
poetry['yrrev'] = poetry['yrrev'].apply(forcefloat)
grouped = poetry.loc[:, ['firstpub', 'yrrev', 'pubname']].groupby('pubname')
poevenues = grouped.aggregate(['min', 'max', 'count'])
poevenues
Out[5]:
In [64]:
def forceint(astring):
    # Coerce a string to int; unparseable values become NaN.
    try:
        intval = int(astring)
    except (ValueError, TypeError):
        intval = float('nan')
    return intval
prefic = pd.read_csv('/Users/tunder/work/genre/metadata/ficmeta.csv', encoding='latin-1', low_memory=False)
prefic['startdate'] = prefic.startdate.apply(forceint)
prefic.fillna(0, inplace=True)
prepoetry = pd.read_csv('/Users/tunder/work/genre/metadata/poemeta.csv')
prepoetry['startdate'] = prepoetry.startdate.apply(forceint)
prepoetry.fillna(0, inplace=True)
numficafter1850 = sum(prefic.startdate > 1849)
numpoeafter1820 = sum(prepoetry.startdate > 1819)
print(str(numficafter1850) + " fiction")
print(str(numpoeafter1820) + " poetry")
In [14]:
del prefic
del prepoetry
In [67]:
def forceintup(astring):
    # Coerce a string to int; unparseable values become 3000,
    # a sentinel date that will fail any "before 1950" test.
    try:
        intval = int(astring)
    except (ValueError, TypeError):
        intval = 3000
    return intval

postfic = pd.read_csv('/Users/tunder/Dropbox/python/train20/subfiction/filteredfiction.csv', dtype={'metadatasuspicious': object})
postfic['inferreddate'] = postfic.inferreddate.apply(forceintup)
postfic.fillna(3000, inplace=True)
numficbefore1950 = sum(postfic.inferreddate < 1949)
print(str(numficbefore1950) + " fiction")
Just adding up fiction from different sources:
In [17]:
89077 + 23549
Out[17]:
112626
The code below assesses accuracy on a century of fiction, using four models, each trained on only a quarter-century of the evidence.
Note that instead of using a fixed threshold of 0.5, I use each model's mean prediction as the threshold for assessing accuracy. I think this is a principled solution, because models perceive their ancestors as less likely to be reviewed, and their descendants as more likely to be reviewed.
As it happens, this doesn't actually improve accuracy; you get a marginally higher number if you just use 0.5 as the midpoint for all models. But I'm leaving it in, because I think it's in principle the right way to address the diachronic dilemma here. (We could of course also make publication date a variable in the model, but that would tend to obscure an issue I want to foreground.)
In [70]:
accuracies = []
for i in range(1850, 1950, 25):
    inpath = 'results/segment' + str(i) + '.applied.csv'
    res = pd.read_csv(inpath)
    colname = 'segment' + str(i)
    # Use this model's mean prediction as its classification threshold.
    midpoint = np.mean(res[colname])
    right = 0
    wrong = 0
    for idx in res.index:
        sample = res.loc[idx, 'tags']
        logistic = res.loc[idx, colname]
        if sample == 'vulgar':
            if logistic <= midpoint:
                right += 1
            else:
                wrong += 1
        elif sample == 'elite':
            if logistic >= midpoint:
                right += 1
            else:
                wrong += 1
    accuracy = right / (right + wrong)
    print(accuracy)
    accuracies.append(accuracy)

print()
print(sum(accuracies) / len(accuracies))
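For comparison, here is the fixed-threshold variant mentioned above, as a sketch (not a cell from the original notebook; it assumes the same results files and tag values):

def accuracy_at_threshold(res, colname, threshold):
    # Score one model's predictions against a fixed probability threshold.
    right = 0
    wrong = 0
    for idx in res.index:
        sample = res.loc[idx, 'tags']
        logistic = res.loc[idx, colname]
        if sample == 'vulgar':
            if logistic <= threshold:
                right += 1
            else:
                wrong += 1
        elif sample == 'elite':
            if logistic >= threshold:
                right += 1
            else:
                wrong += 1
    return right / (right + wrong)

fixed = []
for i in range(1850, 1950, 25):
    res = pd.read_csv('results/segment' + str(i) + '.applied.csv')
    fixed.append(accuracy_at_threshold(res, 'segment' + str(i), 0.5))
print(sum(fixed) / len(fixed))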
Same drill, just on poetry. Accuracy drops further, but note that most of the loss comes in the first quarter-century, which we already know is much less accurate than the others.
Here, it's worth noting, the "midpoint" variable serves us well: change is more rapid in poetry.
In [72]:
accuracies = []
for i in range(1820, 1920, 25):
    inpath = 'results/poe_quarter_' + str(i) + '.applied.csv'
    res = pd.read_csv(inpath)
    colname = 'poe_quarter_' + str(i)
    midpoint = np.mean(res[colname])
    right = 0
    wrong = 0
    for idx in res.index:
        sample = res.loc[idx, 'tags']
        logistic = res.loc[idx, colname]
        if sample == 'random':
            if logistic <= midpoint:
                right += 1
            else:
                wrong += 1
        elif sample == 'reviewed':
            if logistic >= midpoint:
                right += 1
            else:
                wrong += 1
    accuracy = right / (right + wrong)
    print(accuracy)
    accuracies.append(accuracy)

print()
print(sum(accuracies) / len(accuracies))
In a cell above, I wrote that "change is more rapid in poetry." How much more?
To put this more precisely: when you establish a synchronic axis of distinction, how rapidly does the midpoint drift as you move forward in time?
In [13]:
# One way to think about this is r-squared.
# In that sense, it's a small factor, but
# definitely significant.
from scipy.stats import pearsonr
fpoe = pd.read_csv('results/fullpoetry.results.csv')
r, p = pearsonr(fpoe.logistic, fpoe.dateused)
print(r, r**2, p)
In [14]:
# A more intelligible measure of effect size is to
# ask how much the midpoint of the probabilistic
# prediction moves in a year.
import statsmodels.formula.api as smf
# create a fitted model in one line
lm = smf.ols(formula='logistic ~ dateused', data=fpoe).fit()
# print the coefficients
lm.params
Out[14]:
Interpretation: multiplying the coefficient by ten, I read that as a change of 2% per decade in poetry. The underlying unit is a probabilistic scale, running from a 0% to a 100% chance of being reviewed.
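To read the slope directly as change per decade, here is a small sketch (not a cell from the original notebook; it reuses lm from the cell above, where 'dateused' is the predictor in the formula):

# The coefficient on dateused is change per year; multiply by ten
# to express it per decade.
slope_per_decade = lm.params['dateused'] * 10
print('{:.1%} per decade'.format(slope_per_decade))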
In [15]:
ffic = pd.read_csv('results/fullfiction.results.csv')
lm = smf.ols(formula='logistic ~ dateused', data=ffic).fit()
# print the coefficients
lm.params
Out[15]:
Interpretation: multiplying by ten again, that's a change of about 1% per decade in fiction.
In [16]:
r, p = pearsonr(ffic.logistic, ffic.dateused)
print(r, r**2, p)
In [39]:
fiction = pd.read_csv('metadata/prestigeficmeta.csv')
fiction['gender'] = fiction.gender.str.strip()
grouped = fiction.loc[:, ['tags', 'earliestdate', 'gender']].groupby(['gender'])
ficbygender = grouped.aggregate(['count'])
ficbygender
Out[39]:
In [40]:
poetry = pd.read_csv('metadata/prestigepoemeta.csv')
poetry['gender'] = poetry.gender.str.strip()
grouped = poetry.loc[:, ['tags', 'earliestdate', 'gender']].groupby(['gender'])
poebygender = grouped.aggregate(['count'])
poebygender
Out[40]:
In [75]:
fiction = pd.read_csv('metadata/prestigeficmeta.csv')
fiction['gender'] = fiction.gender.str.strip()
grouped = fiction.loc[:, ['tags', 'earliestdate', 'gender']].groupby(['tags', 'gender'])
ficbygender = grouped.aggregate(['count'])
ficbygender
Out[75]:
36% of reviewed volumes are by women, but 42% of random volumes are.
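Those percentages can be read off the counts above. As a minimal sketch of the calculation (hypothetical, not a cell in the original notebook):

# Women's share of volumes within each tag category, from the same
# fiction metadata loaded above.
counts = fiction.groupby(['tags', 'gender']).size()
shares = counts / counts.groupby(level='tags').transform('sum')
print(shares.round(2))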
In [77]:
poetry = pd.read_csv('metadata/prestigepoemeta.csv')
poetry['gender'] = poetry.gender.str.strip()
grouped = poetry.loc[:, ['tags', 'earliestdate', 'gender']].groupby(['tags', 'gender'])
poebygender = grouped.aggregate(['count'])
poebygender
Out[77]:
26% of reviewed volumes are by women, but only 24% of random volumes are. Although women are deeply underrepresented in the poetry dataset as a whole, their distribution across the "reviewed" boundary is favorable.
In [80]:
poetry['nationality'] = poetry.nationality.str.strip()
grouped = poetry.loc[:, ['tags', 'earliestdate', 'nationality']].groupby(['tags', 'nationality'])
poebynation = grouped.aggregate(['count'])
poebynation
Out[80]:
In poetry, the US accounts for 65% of random volumes but only 43% of reviewed volumes.
In [74]:
fiction = pd.read_csv('metadata/prestigeficmeta.csv')
fiction['nationality'] = fiction.nationality.str.strip()
grouped = fiction.loc[:, ['tags', 'earliestdate', 'nationality']].groupby(['nationality', 'tags'])
ficbynation = grouped.aggregate(['count'])
ficbynation
Out[74]:
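The four cells below ask how much predictions change when models are trained on gender-balanced or nationality-balanced samples, by correlating each balanced model's predictions with the original model's. They all follow one pattern, which could be condensed into a helper like this (a hypothetical sketch, not part of the original notebook):

def compare_to_original(balanced_path, original_path, newcol):
    # Join a balanced model's predictions to the original model's
    # on volume id, then correlate the two probability columns.
    balanced = pd.read_csv(balanced_path, index_col='volid')
    balanced[newcol] = balanced.logistic
    original = pd.read_csv(original_path, index_col='volid')
    original['originalprob'] = original.logistic
    inboth = balanced.loc[:, [newcol]].join(original.loc[:, 'originalprob'], how='inner')
    return pearsonr(inboth[newcol], inboth.originalprob)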
In [54]:
gbf = pd.read_csv('results/gender_balanced_fiction.results.csv', index_col='volid')
print(gbf.shape)
gbf['genderbalancedprob'] = gbf.logistic
ff = pd.read_csv('results/fullfiction.results.csv', index_col='volid')
ff['originalprob'] = ff.logistic
print(ff.shape)
inboth = gbf.loc[:, ['genderbalancedprob']].join(ff.loc[:, 'originalprob'], how='inner')
print(inboth.shape)
r, p = pearsonr(inboth.genderbalancedprob, inboth.originalprob)
print(r, r**2, p)
In [55]:
nbf = pd.read_csv('results/nation_balanced_fiction.csv', index_col='volid')
print(nbf.shape)
nbf['nationbalancedprob'] = nbf.logistic
ff = pd.read_csv('results/fullfiction.results.csv', index_col='volid')
ff['originalprob'] = ff.logistic
print(ff.shape)
inboth = nbf.loc[:, ['nationbalancedprob']].join(ff.loc[:, 'originalprob'], how='inner')
print(inboth.shape)
r, p = pearsonr(inboth.nationbalancedprob, inboth.originalprob)
print(r, r**2, p)
In [58]:
gbp = pd.read_csv('results/gender_balanced_poetry.results.csv', index_col='volid')
print(gbp.shape)
gbp['genderbalancedprob'] = gbp.logistic
fp = pd.read_csv('results/fullpoetry.results.csv', index_col='volid')
fp['originalprob'] = fp.logistic
print(fp.shape)
inboth = gbp.loc[:, ['genderbalancedprob']].join(fp.loc[:, 'originalprob'], how='inner')
print(inboth.shape)
r, p = pearsonr(inboth.genderbalancedprob, inboth.originalprob)
print(r, r**2, p)
In [60]:
nbp = pd.read_csv('results/nation_balanced_poetry.results.csv', index_col='volid')
print(nbp.shape)
nbp['nationbalancedprob'] = nbp.logistic
fp = pd.read_csv('results/fullpoetry.results.csv', index_col='volid')
fp['originalprob'] = fp.logistic
print(fp.shape)
inboth = nbp.loc[:, ['nationbalancedprob']].join(fp.loc[:, 'originalprob'], how='inner')
print(inboth.shape)
r, p = pearsonr(inboth.nationbalancedprob, inboth.originalprob)
print(r, r**2, p)
In [23]:
fiction.head()
Out[23]: