Chapter Three analysis

This notebook doesn't reproduce everything from scratch. To do that, you would need to go into the code folder for this chapter and recreate models using one of the "reproduce" scripts. My goal here is just to do data analysis on existing models and metadata files, in order to document some of the figures I quote in the chapter.


In [5]:
import pandas as pd
import numpy as np

In [81]:
poetry = pd.read_csv('metadata/prestigepoemeta.csv')
fiction = pd.read_csv('metadata/prestigeficmeta.csv')
print(poetry.shape, fiction.shape)


(728, 22) (1245, 25)

Lists of publications used

in fiction


In [3]:
def forcefloat(astring):
    '''Coerce a value to float; unparseable values become 0.'''
    try:
        return float(astring)
    except (ValueError, TypeError):
        return 0

fiction['yrrev'] = fiction['yrrev'].apply(forcefloat)
grouped = fiction.loc[ :, ['earliestdate', 'yrrev', 'pubname']].groupby('pubname')
venues = grouped.aggregate(['min', 'max', 'count'])
venues


Out[3]:
earliestdate yrrev
min max count min max count
pubname
1854 1854 1 0.0 0.0 1
ADL 1920 1949 11 1923.0 1949.0 9
ATL 1859 1948 197 1860.0 1949.0 197
BM 1861 1896 71 1861.0 1896.0 71
CRIS 1921 1938 6 1922.0 1938.0 6
CRIT 1924 1938 12 1924.0 1939.0 12
DIAL 1913 1926 5 1921.0 1927.0 5
DUB 1929 1947 10 1930.0 1949.0 10
EGO 1913 1918 9 1914.0 1918.0 9
ER 1850 1859 18 1850.0 1859.0 18
FR 1865 1911 102 1865.0 1912.0 102
HOR 1939 1945 5 1940.0 1946.0 5
LM 1919 1936 11 1920.0 1936.0 11
MM 1879 1883 4 1880.0 1884.0 4
NR 1915 1947 33 1915.0 1947.0 29
NY 1925 1945 17 1925.0 1947.0 16
Pulitzer 1924 1936 3 NaN NaN 0
QR 1851 1851 3 1857.0 1857.0 3
SCRU 1940 1946 3 1940.0 1946.0 3
TEM 1851 1859 33 1852.0 1859.0 33
TLR 1914 1915 3 1914.0 1915.0 3
TLS 1920 1928 7 1920.0 1923.0 5
TNA 1903 1913 19 1907.0 1920.0 19
YAL 1930 1949 16 1930.0 1949.0 16
YB 1885 1895 7 1895.0 1895.0 7

and in poetry


In [5]:
poetry['yrrev'] = poetry['yrrev'].apply(forcefloat)
grouped = poetry.loc[ :, ['firstpub', 'yrrev', 'pubname']].groupby('pubname')
poevenues = grouped.aggregate(['min', 'max', 'count'])
poevenues


Out[5]:
firstpub yrrev
min max count min max count
pubname
ATL 1845 1905 122 1859.0 1905.0 122
BM 1838 1896 16 1841.0 1896.0 16
COR 1877 1878 2 1877.0 1879.0 2
EGO 1912 1918 17 1914.0 1919.0 17
ER 1819 1856 36 1820.0 1858.0 36
FR 1863 1912 59 1865.0 1914.0 59
GrM 1827 1855 19 1840.0 1856.0 19
MM 1881 1881 2 1882.0 1882.0 2
PMV 1910 1916 32 1912.0 1917.0 32
QR 1816 1851 20 1820.0 1841.0 20
SAV 1896 1897 2 1896.0 1896.0 2
TNA 1907 1908 2 1908.0 1908.0 2
WR 1828 1867 29 1842.0 1867.0 29
YB 1893 1895 2 1895.0 1896.0 2

Get total numbers of poetry and fiction

In this section I'm simply counting the numbers of volumes I have available in my HathiTrust samples of poetry and fiction, in the relevant date ranges.


In [64]:
def forceint(astring):
    '''Coerce a value to int; unparseable values become NaN.'''
    try:
        return int(astring)
    except (ValueError, TypeError):
        return float('nan')

prefic = pd.read_csv('/Users/tunder/work/genre/metadata/ficmeta.csv', encoding = 'latin-1', low_memory = False)
prefic['startdate'] = prefic.startdate.apply(forceint)
prefic.fillna(0, inplace = True)
prepoetry = pd.read_csv('/Users/tunder/work/genre/metadata/poemeta.csv')
prepoetry['startdate'] = prepoetry.startdate.apply(forceint)
prepoetry.fillna(0, inplace = True)
numficafter1850 = sum(prefic.startdate > 1849)
numpoeafter1820 = sum(prepoetry.startdate > 1819)
print(str(numficafter1850) + " fiction")
print(str(numpoeafter1820) + " poetry")


89077 fiction
56766 poetry

In [14]:
del(prefic)
del(prepoetry)

In [67]:
def forceintup(astring):
    '''Coerce a value to int; unparseable dates become 3000, an
    out-of-range sentinel excluded by the date filter below.'''
    try:
        return int(astring)
    except (ValueError, TypeError):
        return 3000

postfic = pd.read_csv('/Users/tunder/Dropbox/python/train20/subfiction/filteredfiction.csv', dtype = {'metadatasuspicious': object})
postfic['inferreddate'] = postfic.inferreddate.apply(forceintup)
postfic.fillna(3000, inplace = True)

numficbefore1950 = sum(postfic.inferreddate < 1949)
print(str(numficbefore1950) + " fiction")


23549 fiction

Just adding up fiction from different sources:


In [17]:
89077 + 23549


Out[17]:
112626

Can a quarter-century of fiction predict prestige in the rest of the century?

The code below assesses accuracy on a century of fiction, using four models, each only trained on a quarter-century of the evidence.

Note that instead of using a threshold at 0.5, I use each model's mean prediction as its threshold for assessing accuracy. I think this is a principled solution, because models perceive their ancestors as less likely to be reviewed, and their descendants as more likely to be reviewed.

As it happens, this doesn't actually improve accuracy; you get a marginally higher number if you just use 0.5 as the midpoint for all models. But I'm leaving it in, because I think it's in principle the right way to address the diachronic dilemma here. (We could of course also make publication date a variable in the model, but that would tend to obscure an issue I want to foreground.)


In [70]:
accuracies = []

for i in range(1850, 1950, 25):
    inpath = 'results/segment' + str(i) + '.applied.csv'
    res = pd.read_csv(inpath)
    colname = 'segment' + str(i)
    midpoint = np.mean(res[colname])
    
    right = 0
    wrong = 0
    
    for idx in res.index:
        sample = res.loc[idx, 'tags']
        logistic = res.loc[idx, colname]
        if sample == 'vulgar':
            if logistic <= midpoint:
                right += 1
            else:
                wrong += 1
        elif sample == 'elite':
            if logistic >= midpoint:
                right += 1
            else:
                wrong += 1
    
    accuracy = right/(right + wrong)
    print(accuracy)
    accuracies.append(accuracy)
    
print()
print(sum(accuracies) / len(accuracies))


0.7287716405605935
0.6924979389942292
0.7518549051937345
0.6661170651277823

0.7098103874690849

Can a quarter-century of poetry predict prestige in the rest of the century?

Same drill, just on poetry. We lose more accuracy, but note that most of that is in the first quarter-century, which we know is much less accurate than the others.

Here, it's worth noting, the "midpoint" variable serves us well. Change is more rapid in poetry.


In [72]:
accuracies = []

for i in range(1820, 1920, 25):
    inpath = 'results/poe_quarter_' + str(i) + '.applied.csv'
    res = pd.read_csv(inpath)
    colname = 'poe_quarter_' + str(i)
    midpoint = np.mean(res[colname])
    
    right = 0
    wrong = 0
    
    for idx in res.index:
        sample = res.loc[idx, 'tags']
        logistic = res.loc[idx, colname]
        if sample == 'random':
            if logistic <= midpoint:
                right += 1
            else:
                wrong += 1
        elif sample == 'reviewed':
            if logistic >= midpoint:
                right += 1
            else:
                wrong += 1
    
    accuracy = right/(right + wrong)
    print(accuracy)
    accuracies.append(accuracy)
    
print()
print(sum(accuracies) / len(accuracies))


0.7222222222222222
0.7930555555555555
0.7555555555555555
0.7652777777777777

0.7590277777777776
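The per-volume loops above can be restated in vectorized pandas. This is a sketch using made-up sample data (the real results files follow the same column pattern, but `prob` here stands in for the model-specific column name); ties at the midpoint are counted as predicted-reviewed, a negligible simplification of the loop.

```python
import pandas as pd

# Illustrative stand-in for one model's results file; the 'prob'
# column plays the role of the model's logistic prediction.
res = pd.DataFrame({
    'tags': ['reviewed', 'random', 'reviewed', 'random'],
    'prob': [0.8, 0.3, 0.7, 0.6],
})

midpoint = res['prob'].mean()            # model-specific threshold
predicted = res['prob'] >= midpoint      # True = predicted reviewed
actual = res['tags'] == 'reviewed'       # True = actually reviewed
accuracy = (predicted == actual).mean()
print(accuracy)                          # 0.75 with this toy data
```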

Calculating effect size for diachronic drift relative to synchronic standards

In a cell above, I wrote that "change is more rapid in poetry." How much more?

To put this more precisely: when you establish a synchronic axis of distinction, how rapidly does the midpoint drift as you move forward in time?


In [13]:
# One way to think about this is r-squared.
# In that sense, it's a small factor, but
# definitely significant.

from scipy.stats import pearsonr
fpoe = pd.read_csv('results/fullpoetry.results.csv')
r, p = pearsonr(fpoe.logistic, fpoe.dateused)
print(r, r**2, p)


0.170358321185 0.0290219575969 4.42221194349e-06

In [14]:
# A more intelligible measure of effect size is to
# ask how much the midpoint of the probabilistic
# prediction moves in a year.

import statsmodels.formula.api as smf
# create a fitted model in one line
lm = smf.ols(formula='logistic ~ dateused', data=fpoe).fit()

# print the coefficients
lm.params


Out[14]:
Intercept   -3.133496
dateused     0.001939
dtype: float64

Interpretation: multiplying the yearly slope by ten, that's a drift of roughly 2% per decade in poetry. The underlying unit is the model's predicted probability of being reviewed, on a scale from 0% to 100%.
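The conversion is just the yearly slope times ten; a quick check using the coefficient printed above:

```python
# Yearly OLS slope for poetry, copied from the lm.params output above.
slope_per_year = 0.001939
drift_per_decade = slope_per_year * 10
print(round(drift_per_decade, 3))   # ~0.019, i.e. about 2 points per decade
```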


In [15]:
ffic = pd.read_csv('results/fullfiction.results.csv')
lm = smf.ols(formula='logistic ~ dateused', data=ffic).fit()
# print the coefficients
lm.params


Out[15]:
Intercept   -1.401773
dateused     0.001003
dtype: float64

Interpretation: multiplying by ten, that's a drift of roughly 1% per decade in fiction.


In [16]:
r, p = pearsonr(ffic.logistic, ffic.dateused)
print(r, r**2, p)


0.138069581885 0.0190632094418 1.57957034213e-06

Interactions with gender and nationality

First, we might simply want to know how many men and women there are in these datasets. Technically, we'll be counting volumes rather than authors.


In [39]:
fiction = pd.read_csv('metadata/prestigeficmeta.csv')
fiction['gender'] = fiction.gender.str.strip()
grouped = fiction.loc[ :, ['tags', 'earliestdate', 'gender']].groupby(['gender'])
ficbygender = grouped.aggregate(['count'])
ficbygender


Out[39]:
tags earliestdate
count count
gender
f 466 466
m 734 734
u 6 6

In [40]:
poetry = pd.read_csv('metadata/prestigepoemeta.csv')
poetry['gender'] = poetry.gender.str.strip()
grouped = poetry.loc[ :, ['tags', 'earliestdate', 'gender']].groupby(['gender'])
poebygender = grouped.aggregate(['count'])
poebygender


Out[40]:
tags earliestdate
count count
gender
f 171 171
m 513 513

Gender and reviewed status


In [75]:
fiction = pd.read_csv('metadata/prestigeficmeta.csv')
fiction['gender'] = fiction.gender.str.strip()
grouped = fiction.loc[ :, ['tags', 'earliestdate', 'gender']].groupby(['tags', 'gender'])
ficbygender = grouped.aggregate(['count'])
ficbygender


Out[75]:
earliestdate
count
tags gender
elite f 214
m 378
u 2
presentaselite f 5
m 21
remove f 2
m 2
vulgar f 245
m 333
u 4

36% of reviewed volumes are by women, but 42% of random volumes are.
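Those percentages can be recomputed from the counts in Out[75]; a sanity-check sketch with the values copied by hand rather than re-read from the metadata file:

```python
import pandas as pd

# Counts copied from the grouped table above
# (elite = reviewed sample, vulgar = random sample).
counts = pd.DataFrame(
    {'f': [214, 245], 'm': [378, 333], 'u': [2, 4]},
    index=['elite', 'vulgar'])

share_women = counts['f'] / counts.sum(axis=1)
print(share_women.round(2))   # elite 0.36, vulgar 0.42
```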


In [77]:
poetry = pd.read_csv('metadata/prestigepoemeta.csv')
poetry['gender'] = poetry.gender.str.strip()
grouped = poetry.loc[ :, ['tags', 'earliestdate', 'gender']].groupby(['tags', 'gender'])
poebygender = grouped.aggregate(['count'])
poebygender


Out[77]:
earliestdate
count
tags gender
addcanon f 3
m 5
random f 76
m 247
reviewed f 92
m 261

26% of reviewed volumes are by women, but only 24% of random volumes. Although women are deeply underrepresented in the poetry dataset as a whole, their distribution across the "reviewed" boundary is favorable.

Nationality and reviewed status


In [80]:
poetry['nationality'] = poetry.nationality.str.strip()
grouped = poetry.loc[ :, ['tags', 'earliestdate', 'nationality']].groupby(['tags', 'nationality'])
poebynation = grouped.aggregate(['count'])
poebynation


Out[80]:
earliestdate
count
tags nationality
addcanon uk 5
us 3
random au 2
ca 8
fr 1
ir 4
uk 98
us 202
reviewed ca 2
ir 15
uk 183
us 152

The US is 65% of random volumes, but 43% of reviewed volumes in poetry.


In [74]:
fiction = pd.read_csv('metadata/prestigeficmeta.csv')
fiction['nationality'] = fiction.nationality.str.strip()
grouped = fiction.loc[ :, ['tags', 'earliestdate', 'nationality']].groupby(['nationality', 'tags'])
ficbynation = grouped.aggregate(['count'])
ficbynation


Out[74]:
earliestdate
count
nationality tags
argentine vulgar 1
au elite 2
vulgar 6
austrian vulgar 1
bengali vulgar 1
ca elite 5
vulgar 7
cu elite 1
dutch vulgar 1
es vulgar 1
fi vulgar 1
fr elite 2
french remove 1
vulgar 4
ger elite 1
vulgar 1
german remove 1
vulgar 3
hu elite 1
hungary vulgar 1
in elite 1
ir elite 22
presentaselite 1
vulgar 10
is elite 1
no elite 1
vulgar 1
nz vulgar 1
ru elite 1
vulgar 2
russian elite 2
serbian vulgar 1
spanish vulgar 1
sw vulgar 1
uk elite 338
presentaselite 13
remove 2
vulgar 246
us elite 210
presentaselite 12
remove 1
vulgar 286
za elite 3

Variance shared between balanced models and the original models


In [54]:
gbf = pd.read_csv('results/gender_balanced_fiction.results.csv', index_col = 'volid')
print(gbf.shape)
gbf['genderbalancedprob'] = gbf.logistic
ff = pd.read_csv('results/fullfiction.results.csv', index_col = 'volid')
ff['originalprob'] = ff.logistic
print(ff.shape)
inboth = gbf.loc[: ,['genderbalancedprob']].join(ff.loc[ :, 'originalprob'], how = 'inner')
print(inboth.shape)
r, p = pearsonr(inboth.genderbalancedprob, inboth.originalprob)
print(r, r**2, p)


(788, 14)
(1200, 15)
(784, 2)
0.916169711148 0.839366939626 9.43137481736e-313

In [55]:
nbf = pd.read_csv('results/nation_balanced_fiction.csv', index_col = 'volid')
print(nbf.shape)
nbf['nationbalancedprob'] = nbf.logistic
ff = pd.read_csv('results/fullfiction.results.csv', index_col = 'volid')
ff['originalprob'] = ff.logistic
print(ff.shape)
inboth = nbf.loc[: ,['nationbalancedprob']].join(ff.loc[ :, 'originalprob'], how = 'inner')
print(inboth.shape)
r, p = pearsonr(inboth.nationbalancedprob, inboth.originalprob)
print(r, r**2, p)


(628, 14)
(1200, 15)
(625, 2)
0.870883362284 0.758437830704 2.38698470976e-194

In [58]:
gbp = pd.read_csv('results/gender_balanced_poetry.results.csv', index_col = 'volid')
print(gbp.shape)
gbp['genderbalancedprob'] = gbp.logistic
fp = pd.read_csv('results/fullpoetry.results.csv', index_col = 'volid')
fp['originalprob'] = fp.logistic
print(fp.shape)
inboth = gbp.loc[: ,['genderbalancedprob']].join(fp.loc[ :, 'originalprob'], how = 'inner')
print(inboth.shape)
r, p = pearsonr(inboth.genderbalancedprob, inboth.originalprob)
print(r, r**2, p)


(284, 14)
(718, 15)
(283, 2)
0.916260097683 0.839532566607 1.17963934458e-113

In [60]:
nbp = pd.read_csv('results/nation_balanced_poetry.results.csv', index_col = 'volid')
print(nbp.shape)
nbp['nationbalancedprob'] = nbp.logistic
fp = pd.read_csv('results/fullpoetry.results.csv', index_col = 'volid')
fp['originalprob'] = fp.logistic
print(fp.shape)
inboth = nbp.loc[: ,['nationbalancedprob']].join(fp.loc[ :, 'originalprob'], how = 'inner')
print(inboth.shape)
r, p = pearsonr(inboth.nationbalancedprob, inboth.originalprob)
print(r, r**2, p)


(284, 14)
(718, 15)
(283, 2)
0.833740715511 0.695123580701 1.8832060417e-74

In [23]:
fiction.head()


Out[23]:
docid actualdate earliestdate firstpub tags recordid OCLC author imprint enumcron ... pubname birthdate gender nationality othername notes canon path authsvols publisher
0 uc2.ark+=13960=t55d8p14j 1854 1843 1843.0 vulgar NaN NaN Stephens, Ann S Philadelphia;T. B. Peterson;c1854. NaN ... NaN 1810.0 f us NaN NaN NaN uc2/pairtree_root/ar/k+/=1/39/60/=t/55/d8/p1/4... 47 Peterson
1 nyp.33433075741789 1852 1845 1845.0 vulgar NaN NaN Reynolds, George W. M London;J. Dicks;1852-64. NaN ... NaN 1814.0 m uk NaN NaN NaN nyp/pairtree_root/33/43/30/75/74/17/89/3343307... 61 Dicks
2 uc1.b249620 1853 1848 1848.0 vulgar NaN NaN Peppergrass, Paul Boston;P. Donahoe;1853. NaN ... NaN 1810.0 m ir NaN NaN NaN uc1/pairtree_root/$b/24/96/20/$b249620/uc1.$b2... 4 Donahoe
3 njp.32101073308494 1852 1849 1849.0 vulgar NaN NaN Manning, Anne New York;D. Appleton;1852. NaN ... NaN 1807.0 f uk NaN NaN NaN njp/pairtree_root/32/10/10/73/30/84/94/3210107... 66 Appleton
4 uc2.ark+=13960=t4wh2fh9b 1850 1850 1850.0 vulgar NaN NaN Leighton, John London;William Tegg and Co.;1850. NaN ... NaN 1822.0 m uk NaN NaN NaN uc2/pairtree_root/ar/k+/=1/39/60/=t/4w/h2/fh/9... 2 William Tegg

5 rows × 25 columns
