This notebook is designed to reproduce several findings from Andrew Piper's article "Novel Devotions: Conversional Reading, Computational Modeling, and the Modern Novel" (New Literary History 46.1 (2015), 63-98). See especially Fig 2 (p 72), Fig 4 (p 75), and Table 1 (p 79).
Piper has made his research corpus of novels available here: https://figshare.com/articles/txtlab_Novel450/2062002
We'll download it with wget:
In [ ]:
!wget https://ndownloader.figshare.com/files/3686778 -P data/
In [ ]:
%%capture
!unzip data/3686778 -d data/
Today we'll see our first, admittedly primitive, computational model of language, the "Bag of Words" (BoW) model. This model was very popular in early text analysis, and it continues to be used today. In fact, the models that have replaced it are still very difficult to interpret, which gives the BoW approach an advantage when we want to understand why a model makes certain decisions.
Getting into the model, we'll first revisit term frequency (think Counter). We'll then build the Document-Term Matrix (DTM), which we've discussed briefly before, and normalize those counts so we can compare across documents. Then we'll look at the available Python libraries that streamline this process.
Once we have our BoW model we can analyze it in a high-dimensional vector space, which gives us more insights into the similarities and clustering of different texts. We'll then get into Piper's analysis.
We'll build our model from scratch with the numpy and datascience libraries before we get into the higher-level libraries:
In [ ]:
%matplotlib inline
import numpy as np
from datascience import *
Let's read in Augustine's Confessions text:
In [ ]:
with open('data/Augustine-Confessions.txt') as f:
confessions = f.read()
confessions
There should be 13 books, which are fortunately separated by six line breaks:
In [ ]:
confessions_list = confessions.split('\n'*6)
len(confessions_list)
Let's peek at the first:
In [ ]:
confessions_list[0]
In [ ]:
import spacy
nlp = spacy.load('en', parser=False)
In [ ]:
first_book = confessions_list[0]
parsed = nlp(first_book)
first_token_list = [token.text for token in parsed]
first_token_list
Now we can use Counter to get the term frequency:
In [ ]:
from collections import Counter
word_freq = Counter(first_token_list)
word_freq.most_common(20)
If we plan to compare word frequencies across texts, we could collate these Counter dictionaries for each book in Confessions. But we don't want to write all that code! There is a handy scikit-learn tool called CountVectorizer that streamlines the process. We saw it in the first notebook with Moretti, but didn't really explain what it does.
Let's look at the docstring:
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
CountVectorizer?
Cool. So we'll create the CountVectorizer object, then fit and transform it on our list of documents; here, that's the books of Augustine's Confessions.
In [ ]:
cv = CountVectorizer()
dtm = cv.fit_transform(confessions_list)
dtm
What's this? A sparse matrix is a compact way of storing a table in which most cells are zero: only the non-zero counts are actually kept in memory. Why so many zeros? Because the vocabulary is not the same across books, so most words don't appear in most books! Let's try to demonstrate this.
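Before de-sparsifying, we can check just how sparse the matrix is. A minimal check, using the dtm object from the cell above:
In [ ]:
# How many cells actually hold a non-zero count?
n_docs, n_words = dtm.shape
total_cells = n_docs * n_words
nonzero_cells = dtm.nnz  # number of explicitly stored (non-zero) entries
print(n_docs, 'books x', n_words, 'words =', total_cells, 'cells')
print(nonzero_cells, 'non-zero cells, roughly', round(100 * nonzero_cells / total_cells), '% of the matrix')
Now let's convert it to a regular (dense) array and lay it out as a Table: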
In [ ]:
# de-sparsify
desparse = dtm.toarray()
# create labels for columns
word_list = cv.get_feature_names()
# create a new Table
dtm_tb = Table(word_list).with_rows(desparse)
dtm_tb
Welcome to the Document Term Matrix. This is a core concept in NLP and text analysis. It's not that complicated!
We have a column for each word in the entire corpus. Each row corresponds to a document; in our case, that's the books in Confessions. The values are the count of that word in the corresponding document. Note that there are many 0s: a 0 just means that word doesn't show up in that document!
We can call up frequencies for a given word for each chapter easily, since they are the column names:
In [ ]:
dtm_tb['read']
Looks to be 13 counts, one for each book. Let's double-check!
In [ ]:
len(dtm_tb['read'])
In [ ]:
toks_tab = Table()
toks_tab.append_column(label="Word List", values=word_list)
toks_tab.append_column(label="Frequency", values=sum(desparse)) # sum(desparse) adds the rows element-wise, giving each word's total count across all books
toks_tab.show()
Cool, but we already know how to do this much faster with Counter. Let's take this another step further. In order to make apples-to-apples comparisons across Books, we can normalize our values by dividing each word count by the total number of words in its Book. To do that, we'll need to sum on axis=1, which means summing along each row (the total number of words in that Book), as opposed to summing down each column.
Once we have the total number of words in each Book, we can get the percentage of words that one particular word accounts for, and we can do that for every word across the matrix!
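If the axis argument feels abstract, here's a tiny sketch (with made-up numbers) of what axis=0 and axis=1 do:
In [ ]:
# axis=0 sums down each column; axis=1 sums along each row
tiny = np.array([[1, 2, 3],
                 [4, 5, 6]])
print(np.sum(tiny, axis=0))  # column sums: [5 7 9]
print(np.sum(tiny, axis=1))  # row sums: [6 15]
Now for the real thing: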
In [ ]:
row_sums = np.sum(desparse, axis=1)
normed = desparse/row_sums[:,None]
dtm_tb = Table(word_list).with_rows(normed)
dtm_tb
Reading the matrix above, we see that the word "abandoned" accounts for .0145406% of words in Book 1, and .0277855% of words in Book 2.
We can still grab the normalized frequencies of a given word, here 'abandoned', for each book:
In [ ]:
dtm_tb['abandoned']
For a variety of reasons we like to remove words like "the", "of", "and", etc. These are referred to as 'stopwords.' As Piper notes in footnote 24:
I removed stop words and only kept those words that appeared in at least sixty percent of the documents (twelve of the twenty parts). [95]
In [ ]:
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
ENGLISH_STOP_WORDS
Since we are using an older translation of Augustine, we have to remove archaic forms of these stopwords as well.
In [ ]:
ye_olde_stop_words = ['thou','thy','thee', 'thine', 'ye', 'hath','hast', 'wilt','aught',\
'art', 'dost','doth', 'shall', 'shalt','tis','canst','thyself',\
'didst', 'yea', 'wert']
stop_words = list(ENGLISH_STOP_WORDS) + ye_olde_stop_words
# remove stopwords from column list
dtm_tb = dtm_tb.drop(stop_words)
# it is often more efficient to perform operations on arrays rather than tables
dtm_array = dtm_tb.to_array()
dtm_array
In the script above, we normalized term frequencies before removing stopwords. However, it would have been just as easy to do those steps in the opposite order. Are there situations where this decision has more or less of an impact on the output?
Note: Generally stopwords are removed before counting term frequencies and normalization.
That was a lot of work. If this is such a common task, hasn't someone streamlined it? In fact, we can simply instruct CountVectorizer not to include stopwords at all, and another tool, TfidfTransformer, handles the normalization for us.
In [ ]:
from sklearn.feature_extraction.text import TfidfTransformer
cv = CountVectorizer(stop_words = stop_words)
dtm = cv.fit_transform(confessions_list)
tt = TfidfTransformer(norm='l1',use_idf=False)
dtm_tf = tt.fit_transform(dtm)
word_list = cv.get_feature_names()
dtm_array = dtm_tf.toarray()
Note: If you are processing a text that uses only contemporary English, it may be unnecessary to import the list of stopwords explicitly. Simply pass the value "english" into the stop_words argument in CountVectorizer.
In [ ]:
Table(word_list).with_rows(dtm_array)
As Piper describes his method:
My question was: how does a vocabulary that runs throughout the majority of a work change over the course of that work? I then calculated the Euclidean distance between each of the twenty parts of the work based on the frequency of the remaining words and stored those results in a symmetrical distance table. [95]
Great, now we have a matrix with normalized frequencies of all the words in the entire corpus. Right now our corpus is just all the books in Augustine's Confessions.
Let's move away from the table and just create a list of 13 vectors with only the normalized frequency values, one for each Book.
In [ ]:
dtm_array = dtm_tf.toarray()
dtm_array
Each vector has a number of coordinates equal to the number of unique words in the corpus. Let's just take Book 1:
In [ ]:
dtm_array[0]
One way to measure the similarity of texts, which Piper uses in his article, would be to measure the Euclidean distance between their coordinates in space. According to Wikipedia:
The Euclidean distance or Euclidean metric is the "ordinary" straight-line distance between two points in Euclidean space
$\mathrm{d}(\mathbf{b},\mathbf{a})=\sqrt{(a_1-b_1)^2 + (a_2-b_2)^2}$
Let's consider a simple 2-dimensional model. We have two points in space:
In [ ]:
a = (2,6)
b = (5,10)
euc_dist = np.sqrt( (a[0]-b[0])**2 + (a[1]-b[1])**2 )
euc_dist
We can visualize this too:
In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter([a[0], b[0]], [a[1], b[1]])
plt.plot([a[0], b[0]], [a[1], b[1]])
plt.show()
We can think of this 2-dimensional distance between 2 points as comparing 2 different texts. In this very simple 2-d model, though, we only have 2 words in the entire corpus! (2,6) and (5,10) would be the absolute counts of those two words in each text. Imagine:
Document 1:
the dog the dog dog dog dog dog
Document 2:
the dog the dog the dog the dog the dog dog dog dog dog dog
That would yield the comparison above. If we added a third point (document), we could see which 2 documents were closest to one another!
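We can verify those toy counts with CountVectorizer itself. A quick sketch using the two made-up documents above:
In [ ]:
# CountVectorizer on the two toy documents
toy_docs = ['the dog the dog dog dog dog dog',
            'the dog the dog the dog the dog the dog dog dog dog dog dog']
toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_docs).toarray()
print(toy_cv.get_feature_names())  # columns come out in alphabetical order: ['dog', 'the']
print(toy_counts)                  # [[ 6  2] [10  5]], i.e. (the, dog) = (2, 6) and (5, 10)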
Ok, not too bad, but how do we do this with hundreds or thousands of dimensions (words) across hundreds or thousands of points (documents)? Well, it actually scales the same way! Here it is for 3 dimensions:
$\mathrm{d}(\mathbf{b},\mathbf{a})=\sqrt{(a_1-b_1)^2 + (a_2-b_2)^2 + (a_3-b_3)^2}$
In [ ]:
a = (2,6,15)
b = (5,10,3)
euc_dist = np.sqrt( (a[0]-b[0])**2 + (a[1]-b[1])**2 + (a[2]-b[2])**2 )
euc_dist
In [ ]:
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
mpl.rcParams['legend.fontsize'] = 10
fig = plt.figure()
ax = fig.gca(projection='3d')
ax.scatter([a[0], b[0]], [a[1], b[1]], [a[2], b[2]])
ax.plot([a[0], b[0]], [a[1], b[1]], [a[2], b[2]])
plt.show()
We don't have to use our cool formula to calculate this, or to scale it up for n dimensions. That's what scipy is for:
In [ ]:
from scipy.spatial import distance
distance.euclidean(a,b)
Another measure of two vectors, more common for text analysis, is called cosine similarity. According to Wikipedia:
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.
$\text{similarity} = \cos(\theta) = {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\|_2 \|\mathbf{B}\|_2} = \frac{ \sum\limits_{i=1}^{n}{A_i B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}} \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} }$
Essentially we want to take the cosine of the angle formed between two vectors (documents). We start the vector at the origin and measure the angle between the two vectors we're interested in.
In [ ]:
mpl.rcParams['legend.fontsize'] = 10
origin = (0,0,0)
fig = plt.figure()
ax = fig.gca(projection='3d')
ax.scatter([a[0], b[0], origin[0]], [a[1], b[1], origin[1]], [a[2], b[2], origin[2]])
ax.plot([origin[0], a[0]], [origin[1], a[1]], [origin[2], a[2]])
ax.plot([origin[0], b[0]], [origin[1], b[1]], [origin[2], b[2]])
plt.show()
Let's go back to two dimensions for the vanilla numpy calculation:
In [ ]:
a = (2,6)
b = (5,10)
# don't worry about the formula so much as the intuition behind it: angle between vectors
cos_dist = 1 - (a[0]*b[0] + a[1]*b[1]) / ( np.sqrt(a[0]**2 + a[1]**2 ) * np.sqrt(b[0]**2 + b[1]**2 ) )
cos_dist
Of course, scipy has taken care of this for us too:
In [ ]:
distance.cosine(a,b)
For the 3-d model:
In [ ]:
a = (2,6,15)
b = (5,10,3)
distance.cosine(a,b)
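One property worth checking for yourself: because cosine compares orientation rather than magnitude, scaling a vector doesn't change the cosine distance, while the Euclidean distance changes a lot. A quick sketch:
In [ ]:
# Scaling a vector leaves cosine distance unchanged but not Euclidean distance
a_scaled = (20, 60, 150)  # same direction as a, ten times as long
print(distance.cosine(a, b), distance.cosine(a_scaled, b))        # (essentially) identical
print(distance.euclidean(a, b), distance.euclidean(a_scaled, b))  # very different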
Let's walk through this now. Say we have 3 texts, a, b, and c. The whole corpus, again, only has 2 words (dimensions)!
In [ ]:
a = (2,6)
b = (5,10)
c = (14,11)
print(distance.euclidean(a,b))
print(distance.euclidean(a,c))
print(distance.euclidean(b,c))
We'll make a matrix for the points:
In [ ]:
point_matrix = np.array([a,b,c])
point_matrix
Now we can use sklearn's pairwise_distances function to compare each text to every other text:
In [ ]:
from sklearn.metrics import pairwise
pairwise.pairwise_distances(point_matrix, metric='euclidean')
Cool! We got what we calculated. Note: the matrix is mirrored (symmetric) because the rows and columns index the same texts.
We can do the same thing on Augustine's Confessions. Remember, each row here is a Book:
In [ ]:
dist_matrix = pairwise.pairwise_distances(dtm_tf, metric='euclidean')
title_list = ['Book '+str(i+1) for i in range(len(confessions_list))]
Table(title_list).with_rows(dist_matrix)
Visualizing hundreds of dimensions is difficult for us, so we can use multi-dimensional scaling (MDS) to project these distances onto a 2-d graph:
In [ ]:
from sklearn.manifold import MDS
mds = MDS(n_components = 2, dissimilarity="precomputed")
embeddings = mds.fit_transform(dist_matrix)
_, ax = plt.subplots(figsize=(10,10))
ax.scatter(embeddings[:,0], embeddings[:,1], alpha=0)
for i in range(13):
ax.annotate(i+1, ((embeddings[i,0], embeddings[i,1])))
In [ ]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2)
kmeans.fit_predict(dist_matrix)
Compare this output with Piper's observation:
A standard clustering test such as k-means indicates that the two clusters consist of Books 11–12, with Book 13 being grouped with Books 1–10. [71]
This array (length 13) assigns each book to one of the n_clusters we specify, based on the similarity of their vectors. We won't do much more clustering, but just know that k-means is an unsupervised machine learning algorithm for grouping data. We choose how many clusters (categories) there are, and the algorithm decides in which bucket to place each observation.
The first step was to divide each novel into twenty equal parts. Rather than rely on the irregularity of chapter divisions, which can vary within and between works, this process creates standard units of analysis. [95]
Instead of actually using chapter divisions, Piper elects to split each novel into 20 equal parts. We can write a function text_splitter that will take in a str of the text and return a list of 20 equal parts:
In [ ]:
def text_splitter(text):
n = int(len(text)/20) # get length n of each part
text_list = [text[i*n:(i+1)*n] for i in range(20)] # slice out the text
return(text_list)
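Note that this splits on characters rather than words, so the parts are equal in character length (give or take the remainder at the end). A quick sanity check on Confessions:
In [ ]:
# 20 parts, all roughly the same length in characters
parts = text_splitter(confessions)
print(len(parts))
print(len(parts[0]), len(parts[-1]))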
I then calculated the Euclidean distance between each of the twenty parts of the work based on the frequency of the remaining words and stored those results in a symmetrical distance table. In the end, for each work I had a 20x20 table of distances between every part of a work to every other, in which the distances are considered to be measures of the similarity of the language between a work’s individual parts. [95]
Piper then calculates the Euclidean distance between each part and every other part. So we'll have to vectorize each part and use our pairwise method again. We can write a function for that too! To make it easier to chain, let's have it take in the list of texts that our text_splitter will output:
In [ ]:
def text_distances(text_list):
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import pairwise
ye_olde_stop_words = ['thou','thy','thee', 'thine', 'ye', 'hath','hast', 'wilt','aught',\
'art', 'dost','doth', 'shall', 'shalt','tis','canst','thyself',\
'didst', 'yea', 'wert']
stop_words = list(ENGLISH_STOP_WORDS)+ye_olde_stop_words
cv = CountVectorizer(stop_words = stop_words, min_df=0.6)
dtm = cv.fit_transform(text_list)
tt = TfidfTransformer(norm='l1',use_idf=False)
dtm_tf = tt.fit_transform(dtm)
dist_matrix = pairwise.pairwise_distances(dtm_tf, metric='euclidean')
return(dist_matrix)
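A quick check that this gives us the 20x20 symmetrical distance table Piper describes, here applied to Confessions:
In [ ]:
# Distance matrix between the 20 parts of Confessions
conf_dists = text_distances(text_splitter(confessions))
conf_dists.shape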
Piper then introduces two new ideas.
for the in-half distance I took the average distance of each part in the first half of a work to every other part in that half and subtracted it from the average distance of every part of the second half to itself. [95]
Let's write a function that does that, and have it take in the matrix returned by text_distances:
In [ ]:
def in_half_dist(matrix):
n = len(matrix) # length of work, should be 20
d1 = [] # will hold distances for first half
d2 = [] # will hold distances for second half
for i in range(int(n/2)-1): # loop through first half of work (10 in our case)
for j in range(i+1, int(n/2)): # loop through itself (first half again)
d1.append(matrix[i,j]) # append distance between one part to another (in first half)
for i in range(int(n/2), n-1):
for j in range(i+1, n):
d2.append(matrix[i,j])
    return(abs(sum(d1)-sum(d2))/len(d1)) # len(d1) == len(d2), so this is the absolute difference of the two averages
Great! And now for his second measure:
For the cross-half distance, I took the average distance between all of the first ten parts of a work to all of the second ten parts of a work, similar to the process used in group average clustering. [95]
Let's write another function:
In [ ]:
def cross_half_dist(matrix):
n = len(matrix) # number of parts, here 20
    d = [] # will hold distances
for i in range(int(n/2)): # loop through first half
for j in range(int(n/2), n): # loop through second half
d.append(matrix[i,j]) # append distance between first and second
return(sum(d)/len(d)) # take average
Awesome! We can also write ourselves a quick function to call the four functions we just wrote:
In [ ]:
def text_measures(text):
text_list = text_splitter(text)
dist_matrix = text_distances(text_list)
return(cross_half_dist(dist_matrix), in_half_dist(dist_matrix))
text_measures should now return two values: the first is the cross_half_dist and the second is the in_half_dist. Let's test it on Augustine's Confessions:
In [ ]:
text_measures(confessions)
Looks good! Now we can read in the corpus Piper used:
In [ ]:
metadata_tb = Table.read_table('data/2_txtlab_Novel450.csv')
metadata_tb.show(5)
We'll stick with English so we don't have to think about the possible issues of going between languages:
In [ ]:
metadata_tb = metadata_tb.where('language', "English")
metadata_tb.show(5)
We'll slightly change our text_measures function so that it reads in the file for the text we want to analyze, instead of taking the confessions string we already had:
In [ ]:
corpus_path = 'data/2_txtalb_Novel450/'
def text_measures_alt(text_name):
with open(corpus_path+text_name, 'r') as file_in:
text = file_in.read()
text_list = text_splitter(text)
dist_matrix = text_distances(text_list)
return(cross_half_dist(dist_matrix), in_half_dist(dist_matrix))
Now we can use Table's apply method to call the function text_measures_alt on all the files in the corpus:
In [ ]:
measures = metadata_tb.apply(text_measures_alt, 'filename')
measures
Let's add these measures to our Table:
In [ ]:
metadata_tb['Cross-Half'] = measures[:,0]
metadata_tb['In-Half'] = measures[:,1]
metadata_tb.show(5)
If we want to see which novels stick out, we might be interested in the z-score for a particular novel. This is how many standard deviations the novel is away from the mean. Let's write a function:
In [ ]:
def get_zscores(values):
import numpy as np
mn = np.mean(values)
st = np.std(values)
zs = []
for x in values:
z = (x-mn)/st
zs.append(z)
return zs
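For what it's worth, the same calculation can be written as one vectorized numpy expression (scipy.stats.zscore does the same thing). A sketch:
In [ ]:
# Vectorized version of get_zscores
def get_zscores_np(values):
    values = np.asarray(values)
    return (values - values.mean()) / values.std()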
Now we can add these to the Table too:
In [ ]:
metadata_tb['Cross-Z-Score'] = get_zscores(measures[:,0])
metadata_tb['In-Z-Score'] = get_zscores(measures[:,1])
metadata_tb.show(5)
Scatter plot, please!
In [ ]:
metadata_tb.scatter('In-Half', 'Cross-Half')
In [ ]:
Piper includes only words that appeared in at least 60% of the book's sections. How might that shape his findings? What if he had used a 50% threshold?
In [ ]:
Try changing the min_df argument to 0.5. How do the rankings change? Try eliminating min_df altogether.
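One possible starting point for these exercises (a sketch, not the only way): a variant of text_distances that takes min_df as a parameter, so you can lower it or effectively remove it by setting min_df=1.
In [ ]:
# A hypothetical helper for the exercise: same pipeline, adjustable min_df
def text_distances_min_df(text_list, min_df=0.5):
    cv = CountVectorizer(stop_words=stop_words, min_df=min_df)
    dtm = cv.fit_transform(text_list)
    dtm_tf = TfidfTransformer(norm='l1', use_idf=False).fit_transform(dtm)
    return pairwise.pairwise_distances(dtm_tf, metric='euclidean')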
In [ ]:
In [ ]: