Textual Similarity

This notebook is designed to reproduce several findings from Andrew Piper's article "Novel Devotions: Conversional Reading, Computational Modeling, and the Modern Novel" (New Literary History 46.1 (2015), 63-98). See especially Fig 2 (p 72), Fig 4 (p 75), and Table 1 (p 79).

Piper has made his research corpus of novels available here: https://figshare.com/articles/txtlab_Novel450/2062002

We'll download it with wget:


In [ ]:
!wget https://ndownloader.figshare.com/files/3686778 -P data/

In [ ]:
%%capture
!unzip data/3686778 -d data/

Bag of Words (BoW) language model

Today we'll see our first, admittedly primitive, computational model of language, called "Bag of Words" (BoW). This model was very popular in early text analysis and continues to be used today. In fact, the models that have replaced it are still very difficult to interpret, which gives the BoW approach a slight advantage when we want to understand why a model makes certain decisions.

Getting into the model, we'll have to revisit term frequency (think Counter). We'll then see the Document-Term Matrix (DTM), which we've discussed briefly before. We'll have to normalize these counts if we want to compare across documents. Then we'll look at the available Python libraries that streamline this process.

Once we have our BoW model we can analyze it in a high-dimensional vector space, which gives us more insights into the similarities and clustering of different texts. We'll then get into Piper's analysis.

We'll build our model from scratch with numpy and datascience libraries before we get into the higher level libraries:


In [ ]:
%matplotlib inline
import numpy as np
from datascience import *

Let's read in Augustine's Confessions text:


In [ ]:
with open('data/Augustine-Confessions.txt') as f:
    confessions = f.read()

confessions

There should be 13 books, which are fortunately separated by six line breaks:


In [ ]:
confessions_list = confessions.split('\n'*6)
len(confessions_list)

Let's peek at the first:


In [ ]:
confessions_list[0]

Term Frequency Revisited

We'll remember from last week that while split might be a quick way to get tokens, it's not the most accurate because it doesn't separate punctuation or contractions. We'll use spacy again to get tokens.


In [ ]:
import spacy
nlp = spacy.load('en', parser=False)

In [ ]:
first_book = confessions_list[0]
parsed = nlp(first_book)
first_token_list = [token.text for token in parsed]
first_token_list

Now we can use Counter to get the term frequency:


In [ ]:
from collections import Counter
word_freq = Counter(first_token_list)
word_freq.most_common(20)

Challenge

Write some code to get the 20 most common words of the second book. How similar are they to those of the first book?


In [ ]:

Document-Term Matrix

If we plan to compare word frequencies across texts, we could collate these Counter dictionaries for each book in Confessions. But we don't want to write all that code! There is a handy tool called CountVectorizer that streamlines the process. We saw it in the first notebook with Moretti, but didn't really explain what it does.

Let's look at the docstring:


In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
CountVectorizer?

Cool. So we'll create a CountVectorizer object, then fit and transform it on our list of documents; here, that's the books in Augustine's Confessions.


In [ ]:
cv = CountVectorizer()
dtm = cv.fit_transform(confessions_list)
dtm

What's this? A sparse matrix is one in which most cells are zero, so only the non-zero values are stored. Why so many zeros? Because no single book uses more than a fraction of the corpus's full vocabulary! Let's try to demonstrate this.


In [ ]:
# de-sparsify
desparse = dtm.toarray()

# create labels for columns
word_list = cv.get_feature_names()

# create a new Table
dtm_tb = Table(word_list).with_rows(desparse)
dtm_tb

Welcome to the Document-Term Matrix. This is a core concept in NLP and text analysis, and it's not that complicated!

There is a column for each word in the entire corpus and a row for each document; in our case, the documents are the books of Confessions. Each cell holds the count of that word in the corresponding document. Note the many 0s: that word simply doesn't show up in that document!
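
As a quick sanity check on that description (using the desparse array and word_list from the cells above), the matrix should have 13 rows, one column per unique word, and mostly zero entries:


In [ ]:
# rows = documents (the 13 books); columns = unique words in the corpus
print(desparse.shape)
print(len(word_list))

# most of the cells are zero
np.mean(desparse == 0)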

We can easily call up the frequencies of a given word for each book, since the words are the column names:


In [ ]:
dtm_tb['read']

Looks to be 13 values, one for each book. Let's double-check!


In [ ]:
len(dtm_tb['read'])

Normalization

Piper notes:

The words were thus normalized according to their relative importance within the work. [95]

Let's get the total number of occurrences of each word in the whole text. The key to the code below is sum(desparse), which sums each word's column across all the books in our matrix:


In [ ]:
toks_tab = Table()
toks_tab.append_column(label="Word List", values=word_list)
toks_tab.append_column(label="Frequency", values=sum(desparse))  # sum(desparse) sums each word's column, i.e. its total count across all 13 books
toks_tab.show()

Cool, but we already know how to do this much faster with Counter. Let's take this a step further. In order to make apples-to-apples comparisons across Books, we can normalize our values by dividing each word count by the total number of words in its Book. To do that, we'll need to sum on axis=1, which means summing each row (the total number of words in that book), as opposed to summing each column.

Once we have the total number of words in that Book, we can get the percentage of words that one particular word accounts for, and we can do that for every word across the matrix!


In [ ]:
row_sums = np.sum(desparse, axis=1)
normed = desparse/row_sums[:,None]
dtm_tb = Table(word_list).with_rows(normed)
dtm_tb

Reading the matrix above, we see that the word "abandoned" accounts for .0145406% of words in Book 1, and .0277855% of words in Book 2.
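
Since each row was divided by its own total, every row of the normalized matrix should now sum to 1. A quick optional check, using the normed array from the cell above:


In [ ]:
# each book's normalized frequencies should sum to 1
np.sum(normed, axis=1)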

We can still grab out the normalized frequencies of a given word for each book, for example 'abandoned':


In [ ]:
dtm_tb['abandoned']

For a variety of reasons, we like to remove words like "the", "of", "and", etc. These are referred to as 'stopwords.' As Piper notes in footnote 24:

I removed stop words and only kept those words that appeared in at least sixty percent of the documents (twelve of the twenty parts). [95]


In [ ]:
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
ENGLISH_STOP_WORDS

Since we are using an older translation of Augustine, we have to remove archaic forms of these stopwords as well.


In [ ]:
ye_olde_stop_words = ['thou','thy','thee', 'thine', 'ye', 'hath','hast', 'wilt','aught',\
                      'art', 'dost','doth', 'shall', 'shalt','tis','canst','thyself',\
                     'didst', 'yea', 'wert']

stop_words = list(ENGLISH_STOP_WORDS) + ye_olde_stop_words

# remove stopwords from column list
dtm_tb = dtm_tb.drop(stop_words)

# it is often more efficient to perform operations on arrays rather than tables
dtm_array = dtm_tb.to_array()
dtm_array

Question

In the script above, we normalized term frequencies before removing stopwords. However, it would have been just as easy to do those steps in the opposite order. Are there situations where this decision has more or less of an impact on the output?

Note: Generally stopwords are removed before counting term frequencies and normalization.
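
To build intuition for why the order matters, here is a toy example with made-up numbers (not taken from Confessions). If stopwords are dropped first, each remaining word gets a share of a smaller total, and because books differ in how stopword-heavy they are, the comparison between books can shift too:


In [ ]:
# hypothetical counts for a single "book": 100 tokens, 40 of them stopwords,
# and the word "read" appearing 5 times
total_tokens = 100
stopword_tokens = 40
read_count = 5

# normalize first, then drop stopword columns: "read" keeps its share of ALL tokens
norm_then_drop = read_count / total_tokens

# drop stopwords first, then normalize: "read" gets a share of the REMAINING tokens
drop_then_norm = read_count / (total_tokens - stopword_tokens)

print(norm_then_drop, drop_then_norm)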

Streamlining

That was a lot of work. If this is such a common task, hasn't someone streamlined it? In fact, we can simply instruct CountVectorizer not to include stopwords at all, and another tool, TfidfTransformer, handles the normalization.


In [ ]:
from sklearn.feature_extraction.text import TfidfTransformer

cv = CountVectorizer(stop_words = stop_words)
dtm = cv.fit_transform(confessions_list)
tt = TfidfTransformer(norm='l1',use_idf=False)
dtm_tf = tt.fit_transform(dtm)

word_list = cv.get_feature_names()
dtm_array = dtm_tf.toarray()

Note: If you are processing a text that uses only contemporary English, it may be unnecessary to import the list of stopwords explicitly. Simply pass the value "english" into the "stop_words" argument in CountVectorizer.
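
For example, a minimal sketch (the cv_modern object here is just an illustration and isn't used elsewhere in this notebook):


In [ ]:
# for a contemporary English text, the built-in stopword list is usually enough
cv_modern = CountVectorizer(stop_words='english')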


In [ ]:
Table(word_list).with_rows(dtm_array)

Vector Space Model of Language

My question was: how does a vocabulary that runs throughout the majority of a work change over the course of that work? I then calculated the Euclidean distance between each of the twenty parts of the work based on the frequency of the remaining words and stored those results in a symmetrical distance table. [95]

Great, now we have a matrix with normalized frequencies of all the words in the entire corpus. Right now our corpus is just all the books in Augustine's Confessions.

Let's move away from the table and just create a list of 13 vectors with only the normalized frequency values, one for each Book.


In [ ]:
dtm_array = dtm_tf.toarray()
dtm_array

Each vector has a number of coordinates equal to the number of unique words in the corpus. Let's just take Book 1:


In [ ]:
dtm_array[0]

One way to measure the similarity of texts, which Piper uses in his article, would be to measure the Euclidean distance between their coordinates in space. According to Wikipedia:

The Euclidean distance or Euclidean metric is the "ordinary" straight-line distance between two points in Euclidean space

$\mathrm{d}(\mathbf{b},\mathbf{a})=\sqrt{(a_1-b_1)^2 + (a_2-b_2)^2}$

Let's consider a simple 2-dimensional model. We have two points in space:


In [ ]:
a = (2,6)
b = (5,10)

euc_dist = np.sqrt( (a[0]-b[0])**2  +  (a[1]-b[1])**2 )
euc_dist

We can visualize this too:


In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter([a[0], b[0]], [a[1], b[1]])
plt.plot([a[0], b[0]], [a[1], b[1]])
plt.show()

We can think of this 2-dimensional distance between 2 points as comparing 2 different texts. In this very simple 2-d model, though, we only have 2 words in the entire corpus! (2,6) and (5,10) would be the raw counts of those two words in each text. Imagine:

Document 1:

the dog the dog dog dog dog dog

Document 2:

the dog the dog the dog the dog the dog dog dog dog dog dog

That would yield the comparison above. If we added a third point (document), we could see which 2 documents were closest to one another!
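
As a quick check that those two toy documents really do produce the vectors (2,6) and (5,10), we can count them with Counter (doc1 and doc2 are just the sentences written out above):


In [ ]:
doc1 = "the dog the dog dog dog dog dog"
doc2 = "the dog the dog the dog the dog the dog dog dog dog dog dog"

print(Counter(doc1.split()))
print(Counter(doc2.split()))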


Ok, not too bad, but how do we do this with hundreds or thousands of dimensions (words) across hundreds or thousands of points (documents)? Well, it actually scales the same way! Here it is for 3 dimensions:

$\mathrm{d}(\mathbf{b},\mathbf{a})=\sqrt{(a_1-b_1)^2 + (a_2-b_2)^2 + (a_3-b_3)^2}$


In [ ]:
a = (2,6,15)
b = (5,10,3)

euc_dist = np.sqrt( (a[0]-b[0])**2 +  (a[1]-b[1])**2 + (a[2]-b[2])**2 )
euc_dist

In [ ]:
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D

mpl.rcParams['legend.fontsize'] = 10

fig = plt.figure()
ax = fig.gca(projection='3d')
ax.scatter([a[0], b[0]], [a[1], b[1]], [a[2], b[2]])
ax.plot([a[0], b[0]], [a[1], b[1]], [a[2], b[2]])
plt.show()

We don't have to use our cool formula to calculate this, or to scale it up for n dimensions. That's what scipy is for:


In [ ]:
from scipy.spatial import distance
distance.euclidean(a,b)

Another measure of two vectors, more common for text analysis, is called cosine similarity. According to Wikipedia:

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.

$\text{similarity} = \cos(\theta) = {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\|_2 \|\mathbf{B}\|_2} = \frac{ \sum\limits_{i=1}^{n}{A_i B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}} \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} }$

Essentially, we want to take the cosine of the angle formed between two vectors (documents). Both vectors start at the origin, and we measure the angle between them.


In [ ]:
mpl.rcParams['legend.fontsize'] = 10

origin = (0,0,0)

fig = plt.figure()
ax = fig.gca(projection='3d')
ax.scatter([a[0], b[0], origin[0]], [a[1], b[1], origin[1]], [a[2], b[2], origin[2]])
ax.plot([origin[0], a[0]], [origin[1], a[1]], [origin[2], a[2]])
ax.plot([origin[0], b[0]], [origin[1], b[1]], [origin[2], b[2]])
plt.show()

Let's go back to two dimensions for the vanilla numpy calculation:


In [ ]:
a = (2,6)
b = (5,10)

# don't worry about the formula so much as the intuition behind it: angle between vectors
cos_dist = 1 - (a[0]*b[0] + a[1]*b[1]) / ( np.sqrt(a[0]**2 + a[1]**2 ) * np.sqrt(b[0]**2 + b[1]**2 ) )
cos_dist

Of course, scipy has taken care of this for us too:


In [ ]:
distance.cosine(a,b)

For the 3-d model:


In [ ]:
a = (2,6,15)
b = (5,10,3)
distance.cosine(a,b)

Challenge

Try passing different values into both the euclidean and cosine distance functions. What is your intuition about these different measurements? Remember that all values in the Term-Frequency Matrix are positive, between [0,1], and that most are very small.
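
One experiment to get started (just a sketch, not a full answer): compare a vector with a scaled-up copy of itself. Scaling changes the Euclidean distance but not the angle, so the cosine distance stays at (effectively) zero:


In [ ]:
a = (2, 6, 15)
scaled = (4, 12, 30)  # same direction, twice the magnitude

print(distance.euclidean(a, scaled))  # grows with the difference in magnitude
print(distance.cosine(a, scaled))     # effectively 0: the orientation is identical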


In [ ]:

Visualizing Texts in Vector Space

Let's walk through this now. Say we have 3 texts, a, b, and c. The whole corpus, again, only has 2 words (dimensions)!


In [ ]:
a = (2,6)
b = (5,10)
c = (14,11)

print(distance.euclidean(a,b))
print(distance.euclidean(a,c))
print(distance.euclidean(b,c))

We'll make a matrix for the points:


In [ ]:
point_matrix = np.array([a,b,c])
point_matrix

Now we can use sklearn's pairwise_distances method to compare each point to every other point:


In [ ]:
from sklearn.metrics import pairwise
pairwise.pairwise_distances(point_matrix, metric='euclidean')

Cool! We got what we calculated by hand. Note: the results are mirrored across the diagonal because the rows and columns refer to the same texts.

We can do the same thing on Augustine's Confessions; remember, each row of our DTM corresponds to a Book:


In [ ]:
dist_matrix = pairwise.pairwise_distances(dtm_tf, metric='euclidean')

title_list = ['Book '+str(i+1) for i in range(len(confessions_list))]
Table(title_list).with_rows(dist_matrix)

Visualizing hundreds of dimensions is difficult for us, so we can use multi-dimensional scaling (MDS) to project these distances onto a 2-d graph:


In [ ]:
from sklearn.manifold import MDS

mds = MDS(n_components = 2, dissimilarity="precomputed")
embeddings = mds.fit_transform(dist_matrix)

_, ax = plt.subplots(figsize=(10,10))
ax.scatter(embeddings[:,0], embeddings[:,1], alpha=0)
for i in range(13):
    ax.annotate(i+1, ((embeddings[i,0], embeddings[i,1])))

Homework

Try visualizing the textual similarities again using the Cosine distance. How does that change the result? Why?


In [ ]:

Brief Aside: K-Means Clustering

K-means clustering tries to find natural groupings among points, once we tell it how many groups to look for.


In [ ]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2)
kmeans.fit_predict(dist_matrix)

A standard clustering test such as k-means indicates that the two clusters consist of Books 11–12, with Book 13 being grouped with Books 1–10.) [71]

This array (length 13) assigns each book to one of the n_clusters we chose, based on their vector similarities. We won't do much more clustering, but just know that k-means is an unsupervised machine learning algorithm for grouping data: we choose how many clusters (categories) there are, and the algorithm decides in which bucket to place each observation.
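
To see which book landed in which cluster (and compare with Piper's description above), we can line up the labels with the book numbers. This is a quick sketch that assumes the kmeans object from the cell above and the title_list from the MDS section are still in memory; the 0/1 labels themselves are arbitrary:


In [ ]:
# pair each book with its cluster label
for title, label in zip(title_list, kmeans.labels_):
    print(title, label)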


The Conversional Novel

The first step was to divide each novel into twenty equal parts. Rather than rely on the irregularity of chapter divisions, which can vary within and between works, this process creates standard units of analysis. [95]

Instead of actually using chapter divisions, Piper elects to split each novel into 20 equal parts. We can write a function text_splitter that will take in a str of the text and return a list of 20 equal parts:


In [ ]:
def text_splitter(text):
    n = int(len(text)/20)  # get length n of each part
    text_list = [text[i*n:(i+1)*n] for i in range(20)]  # slice out the text
    return(text_list)
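
As a quick check on Confessions: we should get 20 parts, each with the same number of characters (any remainder at the end of the text is simply dropped):


In [ ]:
parts = text_splitter(confessions)
print(len(parts))                      # should be 20
print(len(parts[0]), len(parts[-1]))   # every part has the same character length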

I then calculated the Euclidean distance between each of the twenty parts of the work based on the frequency of the remaining words and stored those results in a symmetrical distance table. In the end, for each work I had a 20x20 table of distances between every part of a work to every other, in which the distances are considered to be measures of the similarity of the language between a work’s individual parts. [95]

Piper then calculates the Euclidean distance from each part to every other part, so we'll have to compute those distances with our pairwise method. We can write a function for that too! To make the pieces fit together, let's have it take in the list of texts that our text_splitter outputs:


In [ ]:
def text_distances(text_list):
    
    from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.metrics import pairwise
    
    ye_olde_stop_words = ['thou','thy','thee', 'thine', 'ye', 'hath','hast', 'wilt','aught',\
                          'art', 'dost','doth', 'shall', 'shalt','tis','canst','thyself',\
                         'didst', 'yea', 'wert']
    stop_words = list(ENGLISH_STOP_WORDS)+ye_olde_stop_words
    cv = CountVectorizer(stop_words = stop_words, min_df=0.6)
    dtm = cv.fit_transform(text_list)
    tt = TfidfTransformer(norm='l1',use_idf=False)
    dtm_tf = tt.fit_transform(dtm)
    dist_matrix = pairwise.pairwise_distances(dtm_tf, metric='euclidean')
    return(dist_matrix)

Piper then introduces two new ideas.

for the in-half distance I took the average distance of each part in the first half of a work to every other part in that half and subtracted it from the average distance of every part of the second half to itself. [95]

Let's write a function that does that, and have it take in our matrix returned by text_distances:


In [ ]:
def in_half_dist(matrix):
    n = len(matrix)  # length of work, should be 20
    d1 = []  # will hold distances for first half
    d2 = []  # will hold distances for second half
    for i in range(int(n/2)-1):  # loop through first half of work (10 in our case)
        for j in range(i+1, int(n/2)):  # loop through itself (first half again)
            d1.append(matrix[i,j])  # append distance between one part to another (in first half)
    for i in range(int(n/2), n-1):
        for j in range(i+1, n):
            d2.append(matrix[i,j])
    return(abs(sum(d1)-sum(d2))/len(d1))  # d1 and d2 have the same length, so this equals the absolute difference of the two averages

Great! And now for his second measure:

For the cross-half distance, I took the average distance between all of the first ten parts of a work to all of the second ten parts of a work, similar to the process used in group average clustering. [95]

Let's write another function:


In [ ]:
def cross_half_dist(matrix):
    n = len(matrix)  # number of parts, here 20
    d = []  # will hold distances
    for i in range(int(n/2)):  # loop through first half
        for j in range(int(n/2), n):  # loop through second half
            d.append(matrix[i,j])  # append distance between first and second
    return(sum(d)/len(d))  # take average

Awesome! We can also write ourselves a quick function to call the four functions we just wrote:


In [ ]:
def text_measures(text):
    text_list = text_splitter(text)
    dist_matrix = text_distances(text_list)
    return(cross_half_dist(dist_matrix), in_half_dist(dist_matrix))

text_measures should now return two values. The first value is the cross_half_dist and the second value is the in_half_dist. Let's test this out on Augustine's Confessions:


In [ ]:
text_measures(confessions)

Looks good! Now we can read in the corpus Piper used:


In [ ]:
metadata_tb = Table.read_table('data/2_txtlab_Novel450.csv')
metadata_tb.show(5)

We'll stick with English so we don't have to think about the possible issues of going between languages:


In [ ]:
metadata_tb = metadata_tb.where('language', "English")
metadata_tb.show(5)

We'll slightly change our text_measures function so that it reads the text in from its file, instead of taking a string we already have in memory:


In [ ]:
corpus_path = 'data/2_txtalb_Novel450/'

def text_measures_alt(text_name):
    with open(corpus_path+text_name, 'r') as file_in:
        text = file_in.read()
    text_list = text_splitter(text)
    dist_matrix = text_distances(text_list)
    return(cross_half_dist(dist_matrix), in_half_dist(dist_matrix))

Now we can use Table's apply method to call the function text_measures_alt on all the files in the corpus:


In [ ]:
measures = metadata_tb.apply(text_measures_alt, 'filename')
measures

Let's add these measures to our Table:


In [ ]:
metadata_tb['Cross-Half'] = measures[:,0]
metadata_tb['In-Half'] = measures[:,1]
metadata_tb.show(5)

If we want to see which novels stick out, we might be interested in the z-score for a particular novel. This is how many standard deviations the novel's value is from the mean. Let's write a function:


In [ ]:
def get_zscores(values):

    import numpy as np
    mn = np.mean(values)
    st = np.std(values)
    zs = []
    
    for x in values:
        z = (x-mn)/st
        zs.append(z)

    return zs
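
As an aside, scipy provides the same computation, so we can optionally double-check our function against it:


In [ ]:
from scipy.stats import zscore

# should match get_zscores(measures[:,0]) up to floating-point precision
zscore(measures[:,0])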

Now we can add these to the Table too:


In [ ]:
metadata_tb['Cross-Z-Score'] = get_zscores(measures[:,0])
metadata_tb['In-Z-Score'] = get_zscores(measures[:,1])
metadata_tb.show(5)

Scatter plot, please!


In [ ]:
metadata_tb.scatter('In-Half', 'Cross-Half')

Homework

Use our z-scores to rank the novels. Which novels are most "conversional"?


In [ ]:

Piper includes only words that appeared in at least 60% of the book's sections. How might that shape his findings? What if he had used a 50% threshold?


In [ ]:

Try changing the min_df argument to 0.5. How do the rankings change? Try eliminating the min_df altogether.


In [ ]:

Bonus (not assigned)

Visualize distances among the twenty sections of the top-ranked conversional novel in the corpus using the MDS technique.


In [ ]: