Chapter 1, figures 1 and 2

The notebooks included in this repository are intended to show you how raw data was transformed into a particular table or graph.

The graphs may not look exactly like the published versions, because those were created in a different language (R). But they should be substantively equivalent.

Figure 1.1

Graphing the frequency of color vocabulary in a subset of volumes.

The list of words we count as "color words" is contained in colors.txt. The list of volumes to be plotted is contained in prestigeset.csv. That file actually contains a range of volumes, not all of which are particularly prestigious! Its name comes from the fact that it records, in one column, whether each volume was included in a list of prestigious/reviewed volumes. (For more on the source of that list, see chapter 3.)
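
To make those inputs concrete, here's a quick way to peek at both files (a minimal sketch; it assumes nothing beyond the paths and the column names that the code below relies on):

with open('../lexicons/colors.txt', encoding = 'utf-8') as f:
    print(f.readline().strip())    # one color word per line

with open('../metadata/prestigeset.csv', encoding = 'utf-8') as f:
    print(f.readline().strip())    # header row naming volid, logistic, prestige, title, and dateused, among other columns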

Counting the frequency of color words

The code below counts words and creates a data file, colorfic.csv.


In [1]:
#!/usr/bin/env python3

import csv, os, sys
from collections import Counter

# add the local lib directory to the path, so the utility modules below can be imported
sys.path.append('../../lib')

import SonicScrewdriver as utils
import FileCabinet as filecab

# start by loading the list of color words

colors = set()

with open('../lexicons/colors.txt', encoding = 'utf-8') as f:
    for line in f:
        colors.add(line.strip())

# per-volume metadata, keyed by volume id
logistic = dict()
realclass = dict()
titles = dict()
dates = dict()

with open('../metadata/prestigeset.csv', encoding = 'utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        logistic[row['volid']] = float(row['logistic'])
        realclass[row['volid']] = row['prestige']
        titles[row['volid']] = row['title']
        dates[row['volid']] = int(row['dateused'])

sourcedir = '../sourcefiles/'

# read wordcounts only for the volumes listed in prestigeset.csv
documents = filecab.get_wordcounts(sourcedir, '.tsv', set(logistic))

outrows = []

for docid, doc in documents.items():
    if docid not in logistic:
        continue

    allwords = 1    # start at one so the division below can never be by zero
    colorct = 0

    for word, count in doc.items():
        allwords += count
        if word in colors:
            colorct += count

    outline = [docid, realclass[docid], logistic[docid], colorct / allwords, dates[docid], titles[docid]]
    outrows.append(outline)

fields = ['docid', 'class', 'logistic', 'colors', 'date', 'title']
with open('../plotdata/colorfic.csv', mode = 'w', encoding = 'utf-8', newline = '') as f:
    writer = csv.writer(f)
    writer.writerow(fields)
    for row in outrows:
        writer.writerow(row)

Loading the data we just created as a data frame

It would have been more elegant to build the data frame in memory, instead of writing the data to a file as an intermediate step and then reading it back in.

But that's not how I originally wrote the process, and rewriting several years of code for pure elegance would be a bit extravagant. So, having written the data out, let's read it back in.
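
For the record, the in-memory alternative would have been a one-liner. Here's a sketch that reuses the fields and outrows variables from the cell above:

import pandas as pd
color_df = pd.DataFrame(outrows, columns = fields)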


In [2]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

We can take a look at what is actually in the data frame.


In [3]:
color_df = pd.read_csv('../plotdata/colorfic.csv')
color_df.head()


Out[3]:
docid class logistic colors date title
0 uc2.ark+=13960=t1ng4nw76 0 0.655477 0.000689 1824 Rothelan;
1 mdp.39015008800537 0 0.605308 0.000270 1766 Letters, written
2 uiuo.ark+=13960=t4cn7c75m 0 0.584766 0.000428 1815 The royal wanderer, or, The exile of England
3 njp.32101074629229 0 0.715176 0.000158 1745 Les amusemens de Spa, or The gallantries of th...
4 loc.ark+=13960=t6154rd2f 0 0.580221 0.001254 1842 Sketches of Newport and its vicinity;

Visualizing the data

I'll use color to distinguish reviewed volumes from those not marked as reviewed in elite journals. (We don't actually know that they weren't ever reviewed.)


In [4]:
groups = color_df.groupby('class')
groupnames = {0: 'unmarked', 1: 'reviewed'}
groupcolors = {0: 'k', 1: 'r'}
fig, ax = plt.subplots(figsize = (9, 9))

ax.margins(0.05)
for code, group in groups:
    ax.plot(group.date, group.colors, marker='o', linestyle='', ms=6,
            color = groupcolors[code], label=groupnames[code])
ax.legend(numpoints = 1, loc = 'upper left')

plt.show()


other analysis, not in the book

Is there any difference between the frequency of color words in reviewed volumes and others? Let's focus on the volumes after 1800.


In [5]:
post1800 = color_df[color_df.date > 1800]
groups = post1800.groupby('class')
# select the numeric columns explicitly, since recent pandas won't average strings
groups[['logistic', 'colors', 'date']].aggregate(np.mean)


Out[5]:
logistic colors date
class
0 0.764169 0.001441 1865.076336
1 0.783950 0.001564 1870.368421

There is a slight difference in the "colors" column: reviewed works refer to colors a little more often. (Ignore the "logistic" column for now; it's inherited from a different process.) But is the difference in frequency of color words statistically significant?


In [6]:
from scipy.stats import ttest_ind

# Welch's t-test, comparing post-1800 reviewed volumes to post-1800 unmarked ones
ttest_ind(post1800[post1800['class'] == 1].colors,
          post1800[post1800['class'] == 0].colors,
          equal_var = False)


Out[6]:
Ttest_indResult(statistic=0.68968816275016209, pvalue=0.49146510616036876)

No. That's not a significant result; there doesn't seem to be any meaningful difference between reviewed and unreviewed books, at least not at this scale of analysis.
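
The power of that test also depends on how many volumes fall in each group, which is easy to check (a quick sketch, reusing the post1800 frame defined above):

post1800['class'].value_counts()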

Figure 1.2

Now let's calculate the frequency of Stanford "hard seeds" in biography and fiction, aggregating by year.

count the "hard seeds"


In [9]:
# load the Stanford "hard seeds" lexicon
stanford = set()

with open('../lexicons/stanford.csv', encoding = 'utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        if row['class'] == 'hard':
            stanford.add(row['word'])

sourcedir = '../sourcefiles/'

pairedpaths = filecab.get_pairedpaths(sourcedir, '.tsv')

docids = [x[0] for x in pairedpaths]

wordcounts = filecab.get_wordcounts(sourcedir, '.tsv', docids)

metapath = '../metadata/allgenremeta.csv'

# map each volume id to its genre, and each year to the volumes dated to it
genredict = dict()
datedict = dict()
with open(metapath, encoding = 'utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        date = int(row['firstpub'])
        genre = row['tags']
        docid = row['docid']
        if date not in datedict:
            datedict[date] = []
        datedict[date].append(docid)
        genredict[docid] = genre

possible_genres = {'fic', 'bio'}
allcounts = dict()
hardseedcounts = dict()
for genre in possible_genres:
    allcounts[genre] = Counter()
    hardseedcounts[genre] = Counter()

for i in range(1700,2000):
    if i in datedict:
        candidates = datedict[i]
        for anid in candidates:
            genre = genredict[anid]
            if genre not in possible_genres:
                continue
            if anid not in wordcounts:
                print('error: no wordcounts found for', anid)
                continue

            for word, count in wordcounts[anid].items():
                allcounts[genre][i] += count
                if word in stanford:
                    hardseedcounts[genre][i] += count

with open('../plotdata/hardaverages.csv', mode = 'w', encoding = 'utf-8') as f:
    f.write('genre,year,hardpct\n')
    for genre in possible_genres:
        for i in range(1700,2000):
            if i in allcounts[genre]:
                pct = hardseedcounts[genre][i] / (allcounts[genre][i] + 1)   # the +1 guards against division by zero
                f.write(genre + ',' + str(i) + ',' + str(pct) + '\n')

look at the data we created


In [10]:
hard_df = pd.read_csv('../plotdata/hardaverages.csv')
hard_df.head()


Out[10]:
genre year hardpct
0 fic 1702 0.037914
1 fic 1703 0.027492
2 fic 1706 0.025752
3 fic 1708 0.032224
4 fic 1709 0.025049

now plot the yearly averages for biography and fiction


In [11]:
groups = hard_df.groupby('genre')
groupcolors = {'bio': 'k', 'fic': 'r', 'poe': 'g'}    # 'poe' never occurs here; the file written above contains only fic and bio
fig, ax = plt.subplots(figsize = (9, 9))

ax.margins(0.05)
for code, group in groups:
    if code == 'poe':
        continue
    ax.plot(group.year, group.hardpct, marker='o', linestyle='', ms=6,
            color = groupcolors[code], label=code)
ax.legend(numpoints = 1, loc = 'upper left')

plt.show()