Chapter Three analysis

This notebook doesn't reproduce everything from scratch. To do that, you would need to go into the code folder for this chapter and recreate models using one of the "reproduce" scripts. My goal here is just to do data analysis on existing models and metadata files, in order to document some of the figures I quote in the chapter.


In [5]:
import pandas as pd
import numpy as np

In [81]:
poetry = pd.read_csv('metadata/prestigepoemeta.csv')
fiction = pd.read_csv('metadata/prestigeficmeta.csv')
print(poetry.shape, fiction.shape)


(728, 22) (1245, 25)

Lists of publications used

in fiction


In [3]:
def forcefloat(astring):
    '''Coerce a value to float; unparseable values become 0.'''
    try:
        return float(astring)
    except (ValueError, TypeError):
        return 0

fiction['yrrev'] = fiction['yrrev'].apply(forcefloat)
grouped = fiction.loc[ :, ['earliestdate', 'yrrev', 'pubname']].groupby('pubname')
venues = grouped.aggregate(['min', 'max', 'count'])
venues


Out[3]:
earliestdate yrrev
min max count min max count
pubname
1854 1854 1 0.0 0.0 1
ADL 1920 1949 11 1923.0 1949.0 9
ATL 1859 1948 197 1860.0 1949.0 197
BM 1861 1896 71 1861.0 1896.0 71
CRIS 1921 1938 6 1922.0 1938.0 6
CRIT 1924 1938 12 1924.0 1939.0 12
DIAL 1913 1926 5 1921.0 1927.0 5
DUB 1929 1947 10 1930.0 1949.0 10
EGO 1913 1918 9 1914.0 1918.0 9
ER 1850 1859 18 1850.0 1859.0 18
FR 1865 1911 102 1865.0 1912.0 102
HOR 1939 1945 5 1940.0 1946.0 5
LM 1919 1936 11 1920.0 1936.0 11
MM 1879 1883 4 1880.0 1884.0 4
NR 1915 1947 33 1915.0 1947.0 29
NY 1925 1945 17 1925.0 1947.0 16
Pulitzer 1924 1936 3 NaN NaN 0
QR 1851 1851 3 1857.0 1857.0 3
SCRU 1940 1946 3 1940.0 1946.0 3
TEM 1851 1859 33 1852.0 1859.0 33
TLR 1914 1915 3 1914.0 1915.0 3
TLS 1920 1928 7 1920.0 1923.0 5
TNA 1903 1913 19 1907.0 1920.0 19
YAL 1930 1949 16 1930.0 1949.0 16
YB 1885 1895 7 1895.0 1895.0 7

and in poetry


In [5]:
poetry['yrrev'] = poetry['yrrev'].apply(forcefloat)
grouped = poetry.loc[ :, ['firstpub', 'yrrev', 'pubname']].groupby('pubname')
poevenues = grouped.aggregate(['min', 'max', 'count'])
poevenues


Out[5]:
firstpub yrrev
min max count min max count
pubname
ATL 1845 1905 122 1859.0 1905.0 122
BM 1838 1896 16 1841.0 1896.0 16
COR 1877 1878 2 1877.0 1879.0 2
EGO 1912 1918 17 1914.0 1919.0 17
ER 1819 1856 36 1820.0 1858.0 36
FR 1863 1912 59 1865.0 1914.0 59
GrM 1827 1855 19 1840.0 1856.0 19
MM 1881 1881 2 1882.0 1882.0 2
PMV 1910 1916 32 1912.0 1917.0 32
QR 1816 1851 20 1820.0 1841.0 20
SAV 1896 1897 2 1896.0 1896.0 2
TNA 1907 1908 2 1908.0 1908.0 2
WR 1828 1867 29 1842.0 1867.0 29
YB 1893 1895 2 1895.0 1896.0 2

Get total numbers of poetry and fiction

In this section I'm simply counting the numbers of volumes I have available in my HathiTrust samples of poetry and fiction, in the relevant date ranges.


In [64]:
def forceint(astring):
    '''Coerce a value to int; unparseable values become NaN.'''
    try:
        return int(astring)
    except (ValueError, TypeError):
        return float('nan')

prefic = pd.read_csv('/Users/tunder/work/genre/metadata/ficmeta.csv', encoding = 'latin-1', low_memory = False)
prefic['startdate'] = prefic.startdate.apply(forceint)
prefic.fillna(0, inplace = True)
prepoetry = pd.read_csv('/Users/tunder/work/genre/metadata/poemeta.csv')
prepoetry['startdate'] = prepoetry.startdate.apply(forceint)
prepoetry.fillna(0, inplace = True)
numficafter1850 = sum(prefic.startdate > 1849)
numpoeafter1820 = sum(prepoetry.startdate > 1819)
print(str(numficafter1850) + " fiction")
print(str(numpoeafter1820) + " poetry")


89077 fiction
56766 poetry

In [14]:
del(prefic)
del(prepoetry)

In [67]:
def forceintup(astring):
    '''Coerce a value to int; unparseable dates become 3000, an
    out-of-range sentinel excluded by the date filter below.'''
    try:
        return int(astring)
    except (ValueError, TypeError):
        return 3000

postfic = pd.read_csv('/Users/tunder/Dropbox/python/train20/subfiction/filteredfiction.csv', dtype = {'metadatasuspicious': object})
postfic['inferreddate'] = postfic.inferreddate.apply(forceintup)
postfic.fillna(3000, inplace = True)

numficbefore1950 = sum(postfic.inferreddate < 1949)
print(str(numficbefore1950) + " fiction")


23549 fiction

Just adding up fiction from different sources:


In [17]:
89077 + 23549


Out[17]:
112626

Can a quarter-century of fiction predict prestige in the rest of the century?

The code below assesses accuracy on a century of fiction, using four models, each only trained on a quarter-century of the evidence.

Note that instead of using a threshold at 0.5, I use each model's mean prediction as its threshold for assessing accuracy. I think this is a principled solution, because models perceive their ancestors as less likely to be reviewed, and their descendants as more likely to be reviewed.

As it happens, this doesn't actually improve accuracy; you get a marginally higher number if you just use 0.5 as the midpoint for all models. But I'm leaving it in, because I think it's in principle the right way to address the diachronic dilemma here. (We could of course also make publication date a variable in the model, but that would tend to obscure an issue I want to foreground.)


In [70]:
accuracies = []

for i in range(1850, 1950, 25):
    inpath = 'results/segment' + str(i) + '.applied.csv'
    res = pd.read_csv(inpath)
    colname = 'segment' + str(i)
    midpoint = np.mean(res[colname])
    
    right = 0
    wrong = 0
    
    for idx in res.index:
        sample = res.loc[idx, 'tags']
        logistic = res.loc[idx, colname]
        if sample == 'vulgar':
            if logistic <= midpoint:
                right += 1
            else:
                wrong += 1
        elif sample == 'elite':
            if logistic >= midpoint:
                right += 1
            else:
                wrong += 1
    
    accuracy = right/(right + wrong)
    print(accuracy)
    accuracies.append(accuracy)
    
print()
print(sum(accuracies) / len(accuracies))


0.7287716405605935
0.6924979389942292
0.7518549051937345
0.6661170651277823

0.7098103874690849

Can a quarter-century of poetry predict prestige in the rest of the century?

Same drill, just on poetry. We lose more accuracy, but note that most of that is in the first quarter-century, which we know is much less accurate than the others.

Here, it's worth noting, the "midpoint" variable serves us well. Change is more rapid in poetry.


In [72]:
accuracies = []

for i in range(1820, 1920, 25):
    inpath = 'results/poe_quarter_' + str(i) + '.applied.csv'
    res = pd.read_csv(inpath)
    colname = 'poe_quarter_' + str(i)
    midpoint = np.mean(res[colname])
    
    right = 0
    wrong = 0
    
    for idx in res.index:
        sample = res.loc[idx, 'tags']
        logistic = res.loc[idx, colname]
        if sample == 'random':
            if logistic <= midpoint:
                right += 1
            else:
                wrong += 1
        elif sample == 'reviewed':
            if logistic >= midpoint:
                right += 1
            else:
                wrong += 1
    
    accuracy = right/(right + wrong)
    print(accuracy)
    accuracies.append(accuracy)
    
print()
print(sum(accuracies) / len(accuracies))


0.7222222222222222
0.7930555555555555
0.7555555555555555
0.7652777777777777

0.7590277777777776
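The per-volume loops above can be restated in vectorized pandas. This is a sketch using made-up sample data (the real results files follow the same column pattern, but `prob` here stands in for the model-specific column name); ties at the midpoint are counted as predicted-reviewed, a negligible simplification of the loop.

```python
import pandas as pd

# Illustrative stand-in for one model's results file; the 'prob'
# column plays the role of the model's logistic prediction.
res = pd.DataFrame({
    'tags': ['reviewed', 'random', 'reviewed', 'random'],
    'prob': [0.8, 0.3, 0.7, 0.6],
})

midpoint = res['prob'].mean()            # model-specific threshold
predicted = res['prob'] >= midpoint      # True = predicted reviewed
actual = res['tags'] == 'reviewed'       # True = actually reviewed
accuracy = (predicted == actual).mean()
print(accuracy)                          # 0.75 with this toy data
```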

Calculating effect size for diachronic drift relative to synchronic standards

In a cell above, I wrote that "change is more rapid in poetry." How much more?

To put this more precisely: when you establish a synchronic axis of distinction, how rapidly does the midpoint drift as you move forward in time?


In [13]:
# One way to think about this is r-squared.
# In that sense, it's a small factor, but
# definitely significant.

from scipy.stats import pearsonr
fpoe = pd.read_csv('results/fullpoetry.results.csv')
r, p = pearsonr(fpoe.logistic, fpoe.dateused)
print(r, r**2, p)


0.170358321185 0.0290219575969 4.42221194349e-06

In [14]:
# A more intelligible measure of effect size is to
# ask how much the midpoint of the probabilistic
# prediction moves in a year.

import statsmodels.formula.api as smf
# create a fitted model in one line
lm = smf.ols(formula='logistic ~ dateused', data=fpoe).fit()

# print the coefficients
lm.params


Out[14]:
Intercept   -3.133496
dateused     0.001939
dtype: float64

Interpretation: multiplying the yearly slope by ten, that's a drift of roughly 2% per decade in poetry. The underlying unit is the model's predicted probability of being reviewed, on a scale from 0% to 100%.
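The conversion is just the yearly slope times ten; a quick check using the coefficient printed above:

```python
# Yearly OLS slope for poetry, copied from the lm.params output above.
slope_per_year = 0.001939
drift_per_decade = slope_per_year * 10
print(round(drift_per_decade, 3))   # ~0.019, i.e. about 2 points per decade
```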


In [15]:
ffic = pd.read_csv('results/fullfiction.results.csv')
lm = smf.ols(formula='logistic ~ dateused', data=ffic).fit()
# print the coefficients
lm.params


Out[15]:
Intercept   -1.401773
dateused     0.001003
dtype: float64

Interpretation: multiplying by ten, that's a drift of roughly 1% per decade in fiction.


In [16]:
r, p = pearsonr(ffic.logistic, ffic.dateused)
print(r, r**2, p)


0.138069581885 0.0190632094418 1.57957034213e-06

Interactions with gender and nationality

First, we might simply want to know how many men and women there are in these datasets. Technically, we'll be counting volumes rather than authors.


In [39]:
fiction = pd.read_csv('metadata/prestigeficmeta.csv')
fiction['gender'] = fiction.gender.str.strip()
grouped = fiction.loc[ :, ['tags', 'earliestdate', 'gender']].groupby(['gender'])
ficbygender = grouped.aggregate(['count'])
ficbygender


Out[39]:
tags earliestdate
count count
gender
f 466 466
m 734 734
u 6 6

In [40]:
poetry = pd.read_csv('metadata/prestigepoemeta.csv')
poetry['gender'] = poetry.gender.str.strip()
grouped = poetry.loc[ :, ['tags', 'earliestdate', 'gender']].groupby(['gender'])
poebygender = grouped.aggregate(['count'])
poebygender


Out[40]:
tags earliestdate
count count
gender
f 171 171
m 513 513

Gender and reviewed status


In [75]:
fiction = pd.read_csv('metadata/prestigeficmeta.csv')
fiction['gender'] = fiction.gender.str.strip()
grouped = fiction.loc[ :, ['tags', 'earliestdate', 'gender']].groupby(['tags', 'gender'])
ficbygender = grouped.aggregate(['count'])
ficbygender


Out[75]:
earliestdate
count
tags gender
elite f 214
m 378
u 2
presentaselite f 5
m 21
remove f 2
m 2
vulgar f 245
m 333
u 4

36% of reviewed volumes are by women, but 42% of random volumes are.
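Those percentages can be recomputed from the counts in Out[75]; a sanity-check sketch with the values copied by hand rather than re-read from the metadata file:

```python
import pandas as pd

# Counts copied from the grouped table above
# (elite = reviewed sample, vulgar = random sample).
counts = pd.DataFrame(
    {'f': [214, 245], 'm': [378, 333], 'u': [2, 4]},
    index=['elite', 'vulgar'])

share_women = counts['f'] / counts.sum(axis=1)
print(share_women.round(2))   # elite 0.36, vulgar 0.42
```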


In [77]:
poetry = pd.read_csv('metadata/prestigepoemeta.csv')
poetry['gender'] = poetry.gender.str.strip()
grouped = poetry.loc[ :, ['tags', 'earliestdate', 'gender']].groupby(['tags', 'gender'])
poebygender = grouped.aggregate(['count'])
poebygender


Out[77]:
earliestdate
count
tags gender
addcanon f 3
m 5
random f 76
m 247
reviewed f 92
m 261

26% of reviewed volumes are by women, but only 24% of random volumes. Although women are deeply underrepresented in the poetry dataset as a whole, their distribution across the "reviewed" boundary is favorable.

Nationality and reviewed status


In [80]:
poetry['nationality'] = poetry.nationality.str.strip()
grouped = poetry.loc[ :, ['tags', 'earliestdate', 'nationality']].groupby(['tags', 'nationality'])
poebynation = grouped.aggregate(['count'])
poebynation


Out[80]:
earliestdate
count
tags nationality
addcanon uk 5
us 3
random au 2
ca 8
fr 1
ir 4
uk 98
us 202
reviewed ca 2
ir 15
uk 183
us 152

The US is 65% of random volumes, but 43% of reviewed volumes in poetry.


In [74]:
fiction = pd.read_csv('metadata/prestigeficmeta.csv')
fiction['nationality'] = fiction.nationality.str.strip()
grouped = fiction.loc[ :, ['tags', 'earliestdate', 'nationality']].groupby(['nationality', 'tags'])
ficbynation = grouped.aggregate(['count'])
ficbynation


Out[74]:
earliestdate
count
nationality tags
argentine vulgar 1
au elite 2
vulgar 6
austrian vulgar 1
bengali vulgar 1
ca elite 5
vulgar 7
cu elite 1
dutch vulgar 1
es vulgar 1
fi vulgar 1
fr elite 2
french remove 1
vulgar 4
ger elite 1
vulgar 1
german remove 1
vulgar 3
hu elite 1
hungary vulgar 1
in elite 1
ir elite 22
presentaselite 1
vulgar 10
is elite 1
no elite 1
vulgar 1
nz vulgar 1
ru elite 1
vulgar 2
russian elite 2
serbian vulgar 1
spanish vulgar 1
sw vulgar 1
uk elite 338
presentaselite 13
remove 2
vulgar 246
us elite 210
presentaselite 12
remove 1
vulgar 286
za elite 3

Variance shared between balanced models and the original models


In [54]:
gbf = pd.read_csv('results/gender_balanced_fiction.results.csv', index_col = 'volid')
print(gbf.shape)
gbf['genderbalancedprob'] = gbf.logistic
ff = pd.read_csv('results/fullfiction.results.csv', index_col = 'volid')
ff['originalprob'] = ff.logistic
print(ff.shape)
inboth = gbf.loc[: ,['genderbalancedprob']].join(ff.loc[ :, 'originalprob'], how = 'inner')
print(inboth.shape)
r, p = pearsonr(inboth.genderbalancedprob, inboth.originalprob)
print(r, r**2, p)


(788, 14)
(1200, 15)
(784, 2)
0.916169711148 0.839366939626 9.43137481736e-313

In [55]:
nbf = pd.read_csv('results/nation_balanced_fiction.csv', index_col = 'volid')
print(nbf.shape)
nbf['nationbalancedprob'] = nbf.logistic
ff = pd.read_csv('results/fullfiction.results.csv', index_col = 'volid')
ff['originalprob'] = ff.logistic
print(ff.shape)
inboth = nbf.loc[: ,['nationbalancedprob']].join(ff.loc[ :, 'originalprob'], how = 'inner')
print(inboth.shape)
r, p = pearsonr(inboth.nationbalancedprob, inboth.originalprob)
print(r, r**2, p)


(628, 14)
(1200, 15)
(625, 2)
0.870883362284 0.758437830704 2.38698470976e-194

In [58]:
gbp = pd.read_csv('results/gender_balanced_poetry.results.csv', index_col = 'volid')
print(gbp.shape)
gbp['genderbalancedprob'] = gbp.logistic
fp = pd.read_csv('results/fullpoetry.results.csv', index_col = 'volid')
fp['originalprob'] = fp.logistic
print(fp.shape)
inboth = gbp.loc[: ,['genderbalancedprob']].join(fp.loc[ :, 'originalprob'], how = 'inner')
print(inboth.shape)
r, p = pearsonr(inboth.genderbalancedprob, inboth.originalprob)
print(r, r**2, p)


(284, 14)
(718, 15)
(283, 2)
0.916260097683 0.839532566607 1.17963934458e-113

In [60]:
nbp = pd.read_csv('results/nation_balanced_poetry.results.csv', index_col = 'volid')
print(nbp.shape)
nbp['nationbalancedprob'] = nbp.logistic
fp = pd.read_csv('results/fullpoetry.results.csv', index_col = 'volid')
fp['originalprob'] = fp.logistic
print(fp.shape)
inboth = nbp.loc[: ,['nationbalancedprob']].join(fp.loc[ :, 'originalprob'], how = 'inner')
print(inboth.shape)
r, p = pearsonr(inboth.nationbalancedprob, inboth.originalprob)
print(r, r**2, p)


(284, 14)
(718, 15)
(283, 2)
0.833740715511 0.695123580701 1.8832060417e-74

In [23]:
fiction.head()


Out[23]:
docid actualdate earliestdate firstpub tags recordid OCLC author imprint enumcron ... pubname birthdate gender nationality othername notes canon path authsvols publisher
0 uc2.ark+=13960=t55d8p14j 1854 1843 1843.0 vulgar NaN NaN Stephens, Ann S Philadelphia;T. B. Peterson;c1854. NaN ... NaN 1810.0 f us NaN NaN NaN uc2/pairtree_root/ar/k+/=1/39/60/=t/55/d8/p1/4... 47 Peterson
1 nyp.33433075741789 1852 1845 1845.0 vulgar NaN NaN Reynolds, George W. M London;J. Dicks;1852-64. NaN ... NaN 1814.0 m uk NaN NaN NaN nyp/pairtree_root/33/43/30/75/74/17/89/3343307... 61 Dicks
2 uc1.b249620 1853 1848 1848.0 vulgar NaN NaN Peppergrass, Paul Boston;P. Donahoe;1853. NaN ... NaN 1810.0 m ir NaN NaN NaN uc1/pairtree_root/$b/24/96/20/$b249620/uc1.$b2... 4 Donahoe
3 njp.32101073308494 1852 1849 1849.0 vulgar NaN NaN Manning, Anne New York;D. Appleton;1852. NaN ... NaN 1807.0 f uk NaN NaN NaN njp/pairtree_root/32/10/10/73/30/84/94/3210107... 66 Appleton
4 uc2.ark+=13960=t4wh2fh9b 1850 1850 1850.0 vulgar NaN NaN Leighton, John London;William Tegg and Co.;1850. NaN ... NaN 1822.0 m uk NaN NaN NaN uc2/pairtree_root/ar/k+/=1/39/60/=t/4w/h2/fh/9... 2 William Tegg

5 rows × 25 columns
