This lesson explores features of the word embeddings produced by the word2vec model. The questions we ask are guided by Ben Schmidt's blog post Rejecting the Gender Binary.
The primary corpus we will use consists of the 150 English-language novels made available by the .txtLab at McGill University. We also look at a Word2Vec model trained on the ECCO-TCP corpus of 2,350 eighteenth-century literary texts made available by Ryan Heuser. (Note that the model's vocabulary has been cut in half in order to conserve memory.)
For further background on Word2Vec's mechanics, I suggest this brief tutorial by Google, especially the sections "Motivation," "Skip-Gram Model," and "Visualizing."
In the first day of this workshop, we explored the strange way that computers read text: by splitting it into tokens and tallying their frequencies. A novel or album review is reduced to a series of word counts. Since then, we have used simple arithmetic and statistics to identify patterns across those tallies. However, it is often useful to consider these patterns from another perspective: geometry.
Each text can be described as a series of word counts. What if we treated those like coordinates in space?
In [ ]:
%pylab inline
matplotlib.style.use('ggplot')
In [ ]:
# dataframes!
import pandas
# Construct dataframe
columns = ['eggs','sausage','bacon']
indices = ['Novel A', 'Novel B', 'Novel C']
dtm = [[50,60,60],[90,10,10], [20,70,70]]
dtm_df = pandas.DataFrame(dtm, columns = columns, index = indices)
# Show dataframe
dtm_df
In [ ]:
# Plot our points
scatter(dtm_df['eggs'], dtm_df['sausage'])
# Make the graph look good
xlim([0,100]), ylim([0,100])
xlabel('eggs'), ylabel('sausage')
At a glance, two of these points lie closer to one another than either does to the third. We used the frequencies of just two words in order to plot our texts in a two-dimensional plane. The term-frequency "summaries" of Novel A & Novel C are pretty similar to one another: they both share a major concern with "sausage", whereas Novel B seems to focus primarily on "eggs."
This raises a question: how can we operationalize our intuition that spatial distance expresses topical similarity?
A standard way to measure the relationship between these points is Cosine Similarity. Imagine that we draw an arrow from the origin of the graph -- point (0,0) -- to the dot representing each text. This arrow is called a vector. The Cosine Similarity between two vectors measures how much they overlap with one another: technically, the cosine of the angle between them. Since word counts are never negative, the cosine similarity between two texts falls between 0 and 1, so values are easily interpreted and compared.
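To make the calculation concrete, here is a minimal sketch (not part of the original lesson) that computes the cosine similarity between Novel A and Novel C by hand: the dot product of the two count vectors divided by the product of their lengths. It should agree with the scikit-learn result below.
In [ ]:
# Hand-rolled cosine similarity for two of our toy texts (illustrative sketch)
import numpy as np

novel_a = np.array([50, 60, 60])   # counts of eggs, sausage, bacon in Novel A
novel_c = np.array([20, 70, 70])   # counts of eggs, sausage, bacon in Novel C

# Cosine similarity = dot product / (length of A * length of C)
np.dot(novel_a, novel_c) / (np.linalg.norm(novel_a) * np.linalg.norm(novel_c))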
In [ ]:
# We will use scikit-learn's implementation of Cosine Similarity
# (the related Cosine Distance is simply 1 minus the similarity)
from sklearn.metrics.pairwise import cosine_similarity
In [ ]:
# Calculate pairwise cosine similarities among our three texts
cos_sim = cosine_similarity(dtm_df)
In [ ]:
# And we'll make it a little easier to read
np.round(cos_sim, 2)
The above method has a broad range of applications, such as unsupervised clustering. Common techniques include K-Means Clustering and Hierarchical Clustering (often visualized as dendrograms). These attempt to identify groups of texts with shared content, based on these kinds of distance measures.
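As a brief illustration of that clustering idea, here is a minimal sketch using scikit-learn's KMeans on our toy document-term matrix. The choice of two clusters is an assumption made only for illustration, and note that KMeans groups points by Euclidean distance rather than cosine similarity.
In [ ]:
# Sketch: K-Means clustering of the toy texts (n_clusters=2 is an arbitrary choice)
from sklearn.cluster import KMeans

km = KMeans(n_clusters=2, random_state=0)
labels = km.fit_predict(dtm_df)

# Pair each novel with its assigned cluster label
list(zip(dtm_df.index, labels))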
For this lesson, however, we will turn this logic on its head. Rather than produce vectors representing texts based on their words, we will produce vectors for the words based on their contexts.
In [ ]:
# Turn our DTM sideways
dtm_df.T
In [ ]:
# Find the cosine similarities between pairs of word-vectors
cos_sim = cosine_similarity(dtm_df.T)
In [ ]:
# In readable format
np.round(cos_sim, 2)
This last cell indicates that "sausage" and "bacon" perfectly align with one another across texts. If we saw this in a real-world example, we might interpret it to mean that the words share some kind of semantic or thematic relationship. In fact, this method is precisely one that humanists have used in order to find interesting linguistic patterns. (See Ted Underwood's blog post, LSA is a marvellous tool, but....)
Recently, however, a more sophisticated technique for finding semantic relationships between words has enjoyed great popularity: Word2Vec.
Word2Vec draws from the logic of the concordance that we saw on the first day of the workshop. In the example above, a word (row) is described by its frequency within an entire novel (column). The word2vec algorithm instead tries to describe a given word in terms of the ones that appear immediately to its right and left in actual sentences. More precisely, it learns to predict each word's context words.
Without going too deeply into the algorithm, suffice it to say that it involves a two-step process. First, the input word gets compressed into a dense vector. Second, that vector gets decoded into the set of context words. Words that appear in similar contexts end up with similar vector representations at the intermediate step. This intermediate vector is referred to as a word embedding.
Since the word embedding is a vector, we can perform tests like cosine similarity and other kinds of operations. Those operations can reveal many different kinds of relationships between words, as we shall see.
In [ ]:
# Data Wrangling
import os
import numpy as np
import pandas
from scipy.spatial.distance import cosine
from sklearn.metrics import pairwise
from sklearn.manifold import MDS, TSNE
In [ ]:
# Natural Language Processing
import gensim
import nltk
#nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize
In [ ]:
# Custom Tokenizer for Classroom Use
def fast_tokenize(text):

    # Get a list of punctuation marks
    from string import punctuation

    lower_case = text.lower()

    # Iterate through text removing punctuation characters
    no_punct = "".join([char for char in lower_case if char not in punctuation])

    # Split text over whitespace into list of words
    tokens = no_punct.split()

    return tokens
Our corpus is the English-language subset of Andrew Piper's novel corpus, totaling 150 novels by British and American authors spanning the years 1771-1930. These texts reside on disk, each in a separate plaintext file. Metadata is contained in a spreadsheet distributed with the novel files.
In [ ]:
# Import Metadata into Pandas Dataframe
meta_df = pandas.read_csv('resources/txtlab_Novel450_English.csv')
In [ ]:
# Check Metadata
meta_df
In [ ]:
# Set location of corpus folder
fiction_folder = 'txtlab_Novel450_English/'
In [ ]:
# Collect the text of each file in the 'fiction_folder' on the hard drive
# Create empty list, each entry will be the string for a given novel
novel_list = []
# Iterate through filenames in 'fiction_folder'
for filename in os.listdir(fiction_folder):

    # Read novel text as single string
    with open(fiction_folder + filename, 'r') as file_in:
        this_novel = file_in.read()

    # Add novel text as single string to master list
    novel_list.append(this_novel)
In [ ]:
# Inspect first item in novel_list
novel_list[0]
Word2Vec learns about the relationships among words by observing them in context. This means that we want to split our texts into word-units. However, we want to maintain sentence boundaries as well, since the words that end one sentence should not be treated as context for the words that begin the next.
Since novels were imported as single strings, we'll first use sent_tokenize to divide them into sentences, and second, we'll split each sentence into its own list of words.
In [ ]:
# Split each novel into sentences
sentences = [sentence for novel in novel_list for sentence in sent_tokenize(novel)]
In [ ]:
# Inspect first sentence
sentences[0]
In [ ]:
# Split each sentence into tokens
words_by_sentence = [fast_tokenize(sentence) for sentence in sentences]
In [ ]:
# Remove any sentences that contain zero tokens
words_by_sentence = [sentence for sentence in words_by_sentence if sentence != []]
In [ ]:
# Inspect first sentence
words_by_sentence[0]
Word2Vec is the most prominent word embedding algorithm. Word embedding methods generally attempt to identify semantic relationships between words by observing them in context.
Imagine that each word in a novel has its meaning determined by the ones that surround it in a limited window. For example, in Moby Dick's first sentence, "me" is flanked by "Call" and "Ishmael." After observing the windows around every word in the novel (or many novels), the computer notices a pattern in which "me" falls between pairs of words similar to those surrounding "her," "him," or "them." Of course, the computer goes through the same process for the words "Call" and "Ishmael," for which "me" is reciprocally part of their contexts. This chaining of signifiers to one another mirrors some of humanists' most sophisticated interpretative frameworks of language.
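To make the notion of a context window concrete, here is a minimal sketch (not part of the original lesson) that prints the context words a model with a two-word window would observe around each target word in a toy sentence:
In [ ]:
# Sketch: enumerate target words and their context words
# for a toy sentence, using a symmetric window of 2
toy_sentence = ['call', 'me', 'ishmael']
window = 2

for i, target in enumerate(toy_sentence):
    # Context = neighbors within the window, excluding the target itself
    context = toy_sentence[max(0, i - window):i] + toy_sentence[i + 1:i + 1 + window]
    print(target, '->', context)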
The two main flavors of Word2Vec are CBOW (Continuous Bag of Words) and Skip-Gram, which can be distinguished partly by their input and output during training. Skip-Gram takes a word of interest as its input (e.g. "me") and tries to learn how to predict its context words ("Call", "Ishmael"). CBOW does the opposite, taking the context words ("Call", "Ishmael") as a single input and trying to predict the word of interest ("me").
In general, CBOW is faster and does well with frequent words, while Skip-Gram potentially represents rare words better.
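In gensim, the choice between these two architectures is controlled by the sg argument. The cell below is a sketch for reference only; the commented-out call shows how a CBOW model could be trained for comparison with the Skip-Gram model we train next (training takes several minutes, so it is left commented out):
In [ ]:
# sg=0 (gensim's default) trains CBOW; sg=1 trains Skip-Gram, which we use below
# Uncomment to train a CBOW model for comparison:
#cbow_model = gensim.models.Word2Vec(words_by_sentence, sg=0, size=100, window=5, min_count=25)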
Note: Most of the arguments below are gensim's default values, written out explicitly. We raise min_count to 25 (ignoring rare words) and set sg=1 to use the Skip-Gram architecture.
In [ ]:
# Train word2vec model from txtLab corpus
model = gensim.models.Word2Vec(words_by_sentence, size=100, window=5, \
min_count=25, sg=1, alpha=0.025, iter=5, batch_words=10000)
In [ ]:
# Return dense word vector
model['whale']
In [ ]:
# Find cosine similarity between two given word vectors
model.similarity('pride','prejudice')
In [ ]:
# Find nearest word vectors by cosine similarity
model.most_similar('pride')
In [ ]:
# Given a list of words, we can ask which doesn't belong
# Finds mean vector of words in list
# and identifies the word furthest from that mean
model.doesnt_match(['pride','prejudice', 'whale'])
A word embedding may encode both primary and secondary meanings that are present at the same time. In order to identify secondary meanings in a word, we can subtract the vectors of primary (or simply unwanted) meanings. For example, we may wish to remove the sense of river bank from the word bank. This would be written mathematically as BANK - RIVER, which in gensim's interface lists BANK as a positive term and RIVER as a negative one.
In [ ]:
# Get most similar words to BANK, in order
# to get a sense for its primary meaning
model.most_similar('bank')
In [ ]:
# Remove the sense of "river bank" from "bank" and see what is left
model.most_similar(positive=['bank'], negative=['river'])
Analogies are rendered as simple mathematical operations in vector space. For example, the canonic word2vec analogy MAN is to KING as WOMAN is to ?? is rendered as KING - MAN + WOMAN. In the gensim interface, we designate KING and WOMAN as positive terms and MAN as a negative term, since it is subtracted from those.
In [ ]:
# Get most similar words to KING, in order
# to get a sense for its primary meaning
model.most_similar('king')
In [ ]:
# The canonic word2vec analogy: King - Man + Woman -> Queen
model.most_similar(positive=['woman', 'king'], negative=['man'])
In [ ]:
# Feminine Vector
model.most_similar(positive=['she','her','hers','herself'], negative=['he','him','his','himself'])
In [ ]:
# Masculine Vector
model.most_similar(positive=['he','him','his','himself'], negative=['she','her','hers','herself'])
In [ ]:
## EX. Use the most_similar method to find the tokens nearest to 'car' in our model.
## Do the same for 'motorcar'.
## Q. What characterizes each word in our corpus? Does this make sense?
In [ ]:
## EX. How does our model answer the analogy: MADRID is to SPAIN as PARIS is to __________
## Q. What has our model learned about nation-states?
In [ ]:
## EX. Perform the canonic Word2Vec addition again but leave out a term:
## Try 'king' - 'man', 'woman' - 'man', 'woman' + 'king'
## Q. What do these indicate semantically?
In [ ]:
# Dictionary of words in model
model.wv.vocab
#model.vocab # deprecated
In [ ]:
# Visualizing the whole vocabulary would make it hard to read
len(model.wv.vocab)
#len(model.vocab) # deprecated
In [ ]:
# For interpretability, we'll select words that already have a semantic relation
her_tokens = [token for token,weight in model.most_similar(positive=['she','her','hers','herself'], \
negative=['he','him','his','himself'], topn=50)]
In [ ]:
# Inspect list
her_tokens
In [ ]:
# Get the vector for each sampled word
vectors = [model[word] for word in her_tokens]
In [ ]:
# Calculate distances among texts in vector space
dist_matrix = pairwise.pairwise_distances(vectors, metric='cosine')
In [ ]:
# Multi-Dimensional Scaling (Project vectors into 2-D)
mds = MDS(n_components = 2, dissimilarity='precomputed')
embeddings = mds.fit_transform(dist_matrix)
In [ ]:
# Make a pretty graph
_, ax = plt.subplots(figsize=(10,10))
ax.scatter(embeddings[:,0], embeddings[:,1], alpha=0)
for i in range(len(vectors)):
    ax.annotate(her_tokens[i], (embeddings[i,0], embeddings[i,1]))
In [ ]:
# For comparison, here is the same graph using a masculine-pronoun vector
his_tokens = [token for token,weight in model.most_similar(positive=['he','him','his','himself'], \
negative=['she','her','hers','herself'], topn=50)]
vectors = [model[word] for word in his_tokens]
dist_matrix = pairwise.pairwise_distances(vectors, metric='cosine')
mds = MDS(n_components = 2, dissimilarity='precomputed')
embeddings = mds.fit_transform(dist_matrix)
_, ax = plt.subplots(figsize=(10,10))
ax.scatter(embeddings[:,0], embeddings[:,1], alpha=0)
for i in range(len(vectors)):
    ax.annotate(his_tokens[i], (embeddings[i,0], embeddings[i,1]))
In [ ]:
## Q. What kinds of semantic relationships exist in the diagram above?
## Are there any words that seem out of place?
In [ ]:
# Save current model for later use
model.wv.save_word2vec_format('resources/word2vec.txtlab_Novel150_English.txt')
#model.save_word2vec_format('resources/word2vec.txtlab_Novel150_English.txt') # deprecated
In [ ]:
# Load up models from disk
# Model trained on the ECCO-TCP corpus of eighteenth-century texts (~2,350 texts)
# Made available by Ryan Heuser: http://ryanheuser.org/word-vectors-1/
ecco_model = gensim.models.KeyedVectors.load_word2vec_format('resources/word2vec.ECCO-TCP.txt')
#ecco_model = gensim.models.Word2Vec.load_word2vec_format('resources/word2vec.ECCO-TCP.txt') # deprecated
In [ ]:
# What are similar words to BANK?
ecco_model.most_similar('bank')
In [ ]:
# What if we remove the sense of "river bank"?
ecco_model.most_similar(positive=['bank'], negative=['river'])
In [ ]:
## EX. Heuser's blog post explores an analogy in eighteenth-century thought that
## RICHES are to VIRTUE what LEARNING is to GENIUS. How true is this in
## the ECCO-trained Word2Vec model? Is it true in the one we trained?
## Q. How might we compare word2vec models more generally?
In [ ]:
# ECCO model: RICHES are to VIRTUE what LEARNING is to ??
In [ ]:
# txtLab model: RICHES are to VIRTUE what LEARNING is to ??
At this point, we have seen a number of mathematical operations that we can use to explore word2vec's word embeddings. These enable us to ask new and interesting questions about semantics, yet many other questions remain unanswered.
For example: