The notebooks included in this repository are intended to show you how raw data was transformed into a particular table or graph.
The graphs may not look exactly like the published versions, which were created in a different language (R), but they should be substantively equivalent.
Graphing the frequency of color vocabulary in a subset of volumes.
The list of words we count as "color words" is contained in colors.txt. The list of volumes to be plotted is contained in prestigeset.csv. That file actually contains a range of volumes, not all of which are particularly prestigious! Its name comes from the fact that it records, in one column, whether the volume was included in a list of prestigious/reviewed volumes. (For more on the source of that list, see chapter 3.)
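To make the inputs concrete, here is a small sketch that peeks at both files. The formats are inferred from the loading code in the next cell, so treat this as an assumption rather than documentation: colors.txt should hold one color word per line, and prestigeset.csv should contain at least the columns volid, prestige, logistic, title, and dateused.

import csv

# Peek at the first few color words (assumed format: one word per line).
with open('../lexicons/colors.txt', encoding = 'utf-8') as f:
    for line, _ in zip(f, range(5)):
        print(line.strip())

# Confirm the column names the loading code expects.
with open('../metadata/prestigeset.csv', encoding = 'utf-8') as f:
    print(csv.DictReader(f).fieldnames)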
In [1]:
#!/usr/bin/env python3

import csv, os, sys
from collections import Counter

# import utils
sys.path.append('../../lib')
import SonicScrewdriver as utils
import FileCabinet as filecab

# start by loading the hard seeds
colors = set()
with open('../lexicons/colors.txt', encoding = 'utf-8') as f:
    for line in f:
        colors.add(line.strip())

logistic = dict()
realclass = dict()
titles = dict()
dates = dict()

with open('../metadata/prestigeset.csv', encoding = 'utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        logistic[row['volid']] = float(row['logistic'])
        realclass[row['volid']] = row['prestige']
        titles[row['volid']] = row['title']
        dates[row['volid']] = int(row['dateused'])

sourcedir = '../sourcefiles/'

documents = filecab.get_wordcounts(sourcedir, '.tsv', set(logistic))

outrows = []

for docid, doc in documents.items():
    if docid not in logistic:
        continue
    else:
        allwords = 1
        colorct = 0

        for word, count in doc.items():
            allwords += count
            if word in colors:
                colorct += count

        outline = [docid, realclass[docid], logistic[docid], (colorct/allwords), dates[docid], titles[docid]]
        outrows.append(outline)

fields = ['docid', 'class', 'logistic', 'colors', 'date', 'title']

with open('../plotdata/colorfic.csv', mode = 'w', encoding = 'utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(fields)
    for row in outrows:
        writer.writerow(row)
It would have been more elegant to build the data frame in memory, instead of writing the data to a file as an intermediate step and then reading it back in.
But that's not how I originally wrote the process, and rewriting several years of code for pure elegance would be a bit extravagant. So, having written the data out, let's read it back in.
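For what it's worth, a minimal sketch of that in-memory alternative, reusing the outrows and fields variables from the cell above:

import pandas as pd

# Build the data frame directly, skipping the intermediate CSV.
color_df = pd.DataFrame(outrows, columns = fields)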
In [2]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
We can take a look at what is actually in the data frame.
In [3]:
color_df = pd.read_csv('../plotdata/colorfic.csv')
color_df.head()
Out[3]:
In [4]:
groups = color_df.groupby('class')
groupnames = {0: 'unmarked', 1: 'reviewed'}
groupcolors = {0: 'k', 1: 'r'}
fig, ax = plt.subplots(figsize = (9, 9))
ax.margins(0.05)
for code, group in groups:
    ax.plot(group.date, group.colors, marker='o', linestyle='', ms=6, color = groupcolors[code], label=groupnames[code])
ax.legend(numpoints = 1, loc = 'upper left')
plt.show()
In [5]:
post1800 = color_df[color_df.date > 1800]
groups = post1800.groupby('class')
groups.aggregate(np.mean)
Out[5]:
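A compatibility note: recent versions of pandas refuse to apply np.mean to the non-numeric docid and title columns. Assuming a recent pandas, the same group means can be computed this way:

# Restrict the mean to numeric columns on newer pandas.
post1800.groupby('class').mean(numeric_only = True)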
There does seem to be a slight difference in the "colors" column: reviewed works refer to colors a little more often. (Ignore the "logistic" column for now; it's inherited from a different process.) But is that difference in the frequency of color words statistically significant?
In [6]:
from scipy.stats import ttest_ind

# Welch's t-test on post-1800 volumes, so both groups get the same date filter.
reviewed = post1800[post1800['class'] == 1].colors
unreviewed = post1800[post1800['class'] == 0].colors
ttest_ind(reviewed, unreviewed, equal_var = False)
Out[6]:
No. That's not a significant result; there doesn't seem to be any meaningful difference between reviewed and unreviewed books, at least not at this scale of analysis. So let's try a different lexicon: the next cell counts the Stanford "hard seed" words, comparing fiction to biography across time.
In [9]:
# Load the "hard" seed words from the Stanford lexicon.
stanford = set()
with open('../lexicons/stanford.csv', encoding = 'utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        if row['class'] == 'hard':
            stanford.add(row['word'])

# Get wordcounts for every volume in the source directory.
sourcedir = '../sourcefiles/'
pairedpaths = filecab.get_pairedpaths(sourcedir, '.tsv')
docids = [x[0] for x in pairedpaths]
wordcounts = filecab.get_wordcounts(sourcedir, '.tsv', docids)

# Index the volumes by date, and record each volume's genre.
metapath = '../metadata/allgenremeta.csv'
genredict = dict()
datedict = dict()
with open(metapath, encoding = 'utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        date = int(row['firstpub'])
        genre = row['tags']
        docid = row['docid']
        if date not in datedict:
            datedict[date] = []
        datedict[date].append(docid)
        genredict[docid] = genre

possible_genres = {'fic', 'bio'}

allcounts = dict()
hardseedcounts = dict()
for genre in possible_genres:
    allcounts[genre] = Counter()
    hardseedcounts[genre] = Counter()

# Total up all words, and hard-seed words, per genre per year.
for i in range(1700, 2000):
    if i in datedict:
        candidates = datedict[i]
        for anid in candidates:
            genre = genredict[anid]
            if genre not in possible_genres:
                continue
            if anid not in wordcounts:
                print('error: no wordcounts for', anid)
                continue
            else:
                for word, count in wordcounts[anid].items():
                    allcounts[genre][i] += count
                    if word in stanford:
                        hardseedcounts[genre][i] += count

# Write out yearly percentages; the +1 guards against division by zero.
with open('../plotdata/hardaverages.csv', mode = 'w', encoding = 'utf-8') as f:
    f.write('genre,year,hardpct\n')
    for genre in possible_genres:
        for i in range(1700, 2000):
            if i in allcounts[genre]:
                pct = hardseedcounts[genre][i] / (allcounts[genre][i] + 1)
                f.write(genre + ',' + str(i) + ',' + str(pct) + '\n')
In [10]:
hard_df = pd.read_csv('../plotdata/hardaverages.csv')
hard_df.head()
Out[10]:
In [11]:
groups = hard_df.groupby('genre')
groupcolors = {'bio': 'k', 'fic': 'r', 'poe': 'g'}
fig, ax = plt.subplots(figsize = (9, 9))
ax.margins(0.05)
for code, group in groups:
    if code == 'poe':
        continue
    ax.plot(group.year, group.hardpct, marker='o', linestyle='', ms=6, color = groupcolors[code], label=code)
ax.legend(numpoints = 1, loc = 'upper left')
plt.show()