It Starts with a Research Question...

Word2Vec

This lesson is designed to explore features of word embeddings produced through the word2vec model. The questions we ask in this lesson are guided by Ben Schmidt's blog post Rejecting the Gender Binary.

The primary corpus we will use consists of the 150 English-language novels made available by the .txtLab at McGill University. We will also look at a Word2Vec model trained on the ECCO-TCP corpus of 2,350 eighteenth-century literary texts, made available by Ryan Heuser. (Note that the number of terms in that model has been halved in order to conserve memory.)

For further background on Word2Vec's mechanics, I suggest this brief tutorial by Google, especially the sections "Motivation," "Skip-Gram Model," and "Visualizing."

Workshop Agenda

  • Vector-Space Model of Language
  • Import & Pre-Processing
  • Word2Vec
    • Training
    • Embeddings
    • Visualization
  • Saving/Loading Models

0. Vector-Space Model of Language

On the first day of this workshop, we explored the strange way that computers read text: by splitting it into tokens and tallying their frequencies. A novel or album review is reduced to a series of word counts. Since then, we have used simple arithmetic and statistics to identify patterns across those tallies. However, it is often useful to consider these patterns from another perspective: geometry.

Each text can be described as a series of word counts. What if we treated those like coordinates in space?

Prep for visualization


In [ ]:
%pylab inline
matplotlib.style.use('ggplot')

Create a DTM with a Few Pseudo-Texts


In [ ]:
# dataframes!
import pandas

# Construct dataframe
columns = ['eggs','sausage','bacon']
indices = ['Novel A', 'Novel B', 'Novel C']
dtm = [[50,60,60], [90,10,10], [20,70,70]]
dtm_df = pandas.DataFrame(dtm, columns = columns, index = indices)

# Show dataframe
dtm_df

Visualize


In [ ]:
# Plot our points
scatter(dtm_df['eggs'], dtm_df['sausage'])

# Make the graph look good
xlim([0,100]), ylim([0,100])
xlabel('eggs'), ylabel('sausage')

Vectors

At a glance, two of the points lie closer to one another than to the third. We used the frequencies of just two words in order to plot our texts in a two-dimensional plane. The term-frequency "summaries" of Novel A & Novel C are pretty similar to one another: both share a major concern with "sausage", whereas Novel B seems to focus primarily on "eggs."

This raises a question: how can we operationalize our intuition that spatial distance expresses topical similarity?

The most common way to measure this kind of closeness is Cosine Similarity. Imagine that we draw an arrow from the origin of the graph -- point (0,0) -- to the dot representing each text. This arrow is called a vector. The Cosine Similarity between two vectors measures how closely they point in the same direction, regardless of how long they are. Because word counts are never negative, the cosine similarity between two text vectors falls between 0 and 1, so the values are easy to interpret and compare.
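
To make the formula concrete, here is a minimal sketch that computes the cosine similarity between Novel A and Novel C by hand with numpy; the sklearn function in the next section returns the same quantity for every pair of texts at once. (The variable names novel_a and novel_c are ours, purely for illustration.)


In [ ]:
# Cosine similarity by hand: dot product divided by the product of vector lengths
import numpy as np

novel_a = np.array([50, 60, 60])   # eggs, sausage, bacon counts for Novel A
novel_c = np.array([20, 70, 70])   # eggs, sausage, bacon counts for Novel C

np.dot(novel_a, novel_c) / (np.linalg.norm(novel_a) * np.linalg.norm(novel_c))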

Cosine Similarity


In [ ]:
# sklearn gives us Cosine Similarity directly; Cosine Distance,
# if we wanted it, would simply be 1 minus this value

from sklearn.metrics.pairwise import cosine_similarity

In [ ]:
# Calculate the similarity between every pair of texts (rows)

cos_sim = cosine_similarity(dtm_df)

In [ ]:
# And we'll make it a little easier to read

np.round(cos_sim, 2)

Vector Semantics

The above method has a broad range of applications, such as unsupervised clustering. Common techniques include K-Means Clustering and Hierarchical Clustering (often visualized as a dendrogram). These attempt to identify groups of texts with shared content, based on these kinds of distance measures.

For this lesson, however, we will turn this logic on its head. Rather than produce vectors representing texts based on their words, we will produce vectors for the words based on their contexts.


In [ ]:
# Turn our DTM sideways

dtm_df.T

In [ ]:
# Find the Cosine Similarities between pairs of word-vectors

cos_sim = cosine_similarity(dtm_df.T)

In [ ]:
# In readable format

np.round(cos_sim, 2)

Word2Vec

This last cell indicates that "sausage" and "bacon" perfectly align with one another across texts. If we saw this in a real-world example, we might interpret it to mean that the words share some kind of semantic or thematic relationship. In fact, this method is precisely one that humanists have used in order to find interesting linguistic patterns. (See Ted Underwood's blog post, LSA is a marvellous tool, but....)

Recently, however, a more sophisticated technique for finding semantic relationships between words has enjoyed great popularity: Word2Vec.

Word2Vec draws from the logic of the concordance that we saw on the first day of the workshop. In the example above, a word (row) is described by its frequency within an entire novel (column). The word2vec algorithm instead describes a given word in terms of the ones that appear immediately to its left and right in actual sentences. More precisely, it learns how to predict those context words.

Without going too deeply into the algorithm, suffice it to say that it involves a two-step process. First, the input word gets compressed into a dense vector. Second, that vector gets decoded into the set of context words. Words that appear within similar contexts end up with similar dense vectors. This dense vector is referred to as a word embedding.
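
To build intuition for what counts as "context", the minimal sketch below generates (target, context) pairs from a tokenized sentence with a fixed window. This only illustrates the kind of training data Skip-Gram consumes; it is not gensim's actual implementation, and the helper name context_pairs is ours.


In [ ]:
# A simplified sketch of Skip-Gram style (target, context) pairs

def context_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        # Look up to `window` positions to the left and right of the target
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

context_pairs(['call', 'me', 'ishmael'])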

Since the word embedding is a vector, we can perform tests like cosine similarity and other kinds of operations. Those operations can reveal many different kinds of relationships between words, as we shall see.

1. Import & Pre-Processing

Import Packages


In [ ]:
# Data Wrangling

import os
import numpy as np
import pandas
from scipy.spatial.distance import cosine
from sklearn.metrics import pairwise
from sklearn.manifold import MDS, TSNE

In [ ]:
# Natural Language Processing

import gensim
import nltk
#nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize

In [ ]:
# Custom Tokenizer for Classroom Use

def fast_tokenize(text):
    
    # Get a list of punctuation marks
    from string import punctuation
    
    lower_case = text.lower()
    
    # Iterate through text removing punctuation characters
    no_punct = "".join([char for char in lower_case if char not in punctuation])
    
    # Split text over whitespace into list of words
    tokens = no_punct.split()
    
    return tokens
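
A quick sanity check of the tokenizer on a sample string (the sentence is purely illustrative):


In [ ]:
# Try the tokenizer on a short example string

fast_tokenize("Call me Ishmael.")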

Corpus Description

English-language subset of Andrew Piper's novel corpus, totaling 150 novels by British and American authors spanning the years 1771-1930. These texts reside on disk, each in a separate plaintext file. Metadata is contained in a spreadsheet distributed with the novel files.

Metadata Columns

  1. Filename: Name of file on disk
  2. ID: Unique ID in Piper corpus
  3. Language: Language of novel
  4. Date: Initial publication date
  5. Title: Title of novel
  6. Gender: Authorial gender
  7. Person: Textual perspective
  8. Length: Number of tokens in novel

Import Metadata


In [ ]:
# Import Metadata into Pandas Dataframe

meta_df = pandas.read_csv('resources/txtlab_Novel450_English.csv')

In [ ]:
# Check Metadata

meta_df

Import Corpus


In [ ]:
# Set location of corpus folder

fiction_folder = 'txtlab_Novel450_English/'

In [ ]:
# Collect the text of each file in the 'fiction_folder' on the hard drive

# Create empty list, each entry will be the string for a given novel
novel_list = []

# Iterate through filenames in 'fiction_folder'
for filename in os.listdir(fiction_folder):
    
    # Read novel text as single string
    with open(fiction_folder + filename, 'r') as file_in:
        this_novel = file_in.read()
    
    # Add novel text as single string to master list
    novel_list.append(this_novel)

In [ ]:
# Inspect first item in novel_list

novel_list[0]

Pre-Processing

Word2Vec learns about the relationships among words by observing them in context. This means that we want to split our texts into word-units. However, we also want to preserve sentence boundaries, since a word's context should not spill over from the end of one sentence into the beginning of the next.

Since novels were imported as single strings, we'll first use sent_tokenize to divide them into sentences, and second, we'll split each sentence into its own list of words.


In [ ]:
# Split each novel into sentences

sentences = [sentence for novel in novel_list for sentence in sent_tokenize(novel)]

In [ ]:
# Inspect first sentence

sentences[0]

In [ ]:
# Split each sentence into tokens

words_by_sentence = [fast_tokenize(sentence) for sentence in sentences]

In [ ]:
# Remove any sentences that contain zero tokens

words_by_sentence = [sentence for sentence in words_by_sentence if sentence != []]

In [ ]:
# Inspect first sentence

words_by_sentence[0]

2. Word2Vec

Word Embedding

Word2Vec is the most prominent word embedding algorithm. Word embedding generally attempts to identify semantic relationships between words by observing them in context.

Imagine that each word in a novel has its meaning determined by the ones that surround it in a limited window. For example, in Moby Dick's first sentence, “me” is flanked by “Call” and “Ishmael.” After observing the windows around every word in the novel (or many novels), the computer will notice a pattern in which “me” falls between similar pairs of words to “her,” “him,” or “them.” Of course, the computer goes through the same process for the words “Call” and “Ishmael,” for which “me” is reciprocally part of their contexts. This chaining of signifiers to one another mirrors some of humanists' most sophisticated interpretative frameworks of language.

The two main flavors of Word2Vec are CBOW (Continuous Bag of Words) and Skip-Gram, which can be distinguished partly by their input and output during training. Skip-Gram takes a word of interest as its input (e.g. "me") and tries to learn how to predict its context words ("Call", "Ishmael"). CBOW does the opposite, taking the context words ("Call", "Ishmael") as a single input and trying to predict the word of interest ("me").

In general, CBOW is faster and does well with frequent words, while Skip-Gram tends to represent rare words better.

Word2Vec Features

  • size: Number of dimensions in the word embedding
  • window: Number of context words to observe in each direction
  • min_count: Minimum frequency for a word to be included in the model
  • sg (Skip-Gram): 0 indicates a CBOW model; 1 indicates Skip-Gram
  • alpha: Initial learning rate; keeps the model from over-correcting and enables finer tuning
  • iter: Number of passes through the dataset
  • batch_words: Number of words handed to each training batch

Note: The call below spells out each argument explicitly; most values match gensim's defaults, though we raise min_count and set sg=1 to use Skip-Gram.

Training


In [ ]:
# Train word2vec model from txtLab corpus

model = gensim.models.Word2Vec(words_by_sentence, size=100, window=5, \
                               min_count=25, sg=1, alpha=0.025, iter=5, batch_words=10000)
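
For comparison, flipping the sg flag trains a CBOW model with otherwise identical settings. This optional cell is just a sketch (the variable name cbow_model is ours) and is not used in the rest of the lesson.


In [ ]:
# Optional: train a CBOW model (sg=0) on the same sentences;
# the rest of the lesson uses the Skip-Gram model above

cbow_model = gensim.models.Word2Vec(words_by_sentence, size=100, window=5, \
                               min_count=25, sg=0, alpha=0.025, iter=5, batch_words=10000)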

Embeddings


In [ ]:
# Return dense word vector

model['whale']

Vector-Space Operations

Similarity

Since words are represented as dense vectors, we can ask how similar words' meanings are based on their cosine similarity (essentially, how closely their vectors point in the same direction). gensim has a few out-of-the-box functions that enable different kinds of comparisons.


In [ ]:
# Find cosine similarity between two given word vectors

model.similarity('pride','prejudice')

In [ ]:
# Find nearest word vectors by cosine similarity

model.most_similar('pride')

In [ ]:
# Given a list of words, we can ask which doesn't belong

# Finds mean vector of words in list
# and identifies the word furthest from that mean

model.doesnt_match(['pride','prejudice', 'whale'])

Multiple Valences

A word embedding may encode both primary and secondary meanings that are present at the same time. In order to identify secondary meanings in a word, we can subtract the vectors of primary (or simply unwanted) meanings. For example, we may wish to remove the sense of river bank from the word bank. This would be written mathematically as BANK - RIVER, which in gensim's interface lists BANK as a positive term and RIVER as a negative one.


In [ ]:
# Get most similar words to BANK, in order
# to get a sense for its primary meaning

model.most_similar('bank')

In [ ]:
# Remove the sense of "river bank" from "bank" and see what is left

model.most_similar(positive=['bank'], negative=['river'])

Analogy

Analogies are rendered as simple mathematical operations in vector space. For example, the canonic word2vec analogy MAN is to KING as WOMAN is to ?? is rendered as KING - MAN + WOMAN. In the gensim interface, we designate KING and WOMAN as positive terms and MAN as a negative term, since it is subtracted from them.


In [ ]:
# Get most similar words to KING, in order
# to get a sense for its primary meaning

model.most_similar('king')

In [ ]:
# The canonic word2vec analogy: King - Man + Woman -> Queen

model.most_similar(positive=['woman', 'king'], negative=['man'])

Gendered Vectors

Can we find gender à la Schmidt (2015)? (Note that this method uses vector projection, whereas Schmidt used vector rejection; a sketch of rejection follows the next two cells.)


In [ ]:
# Feminine Vector

model.most_similar(positive=['she','her','hers','herself'], negative=['he','him','his','himself'])

In [ ]:
# Masculine Vector

model.most_similar(positive=['he','him','his','himself'], negative=['she','her','hers','herself'])
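
Schmidt's rejection method, by contrast, removes the gender component from each word vector rather than ranking words along it. The cell below is a minimal numpy sketch of that idea, assuming the gender axis can be approximated by the difference between the summed feminine- and masculine-pronoun vectors; the reject helper is our own, not part of gensim.


In [ ]:
# A minimal sketch of vector rejection (a la Schmidt): subtract from a word
# vector its projection onto a "gender axis", leaving the component that is
# orthogonal to (independent of) that axis

gender_axis = (model['she'] + model['her'] + model['hers'] + model['herself']) \
            - (model['he'] + model['him'] + model['his'] + model['himself'])

def reject(vector, axis):
    # Remove the component of `vector` that lies along `axis`
    return vector - (np.dot(vector, axis) / np.dot(axis, axis)) * axis

# Example: the embedding for 'whale' with its gender component removed
reject(model['whale'], gender_axis)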

Exercises


In [ ]:
## EX. Use the most_similar method to find the tokens nearest to 'car' in our model.
##     Do the same for 'motorcar'.

## Q.  What characterizes each word in our corpus? Does this make sense?

In [ ]:
## EX. How does our model answer the analogy: MADRID is to SPAIN as PARIS is to __________

## Q.  What has our model learned about nation-states?

In [ ]:
## EX. Perform the canonic Word2Vec addition again but leave out a term:
##     Try 'king' - 'man', 'woman' - 'man', 'woman' + 'king'

## Q.  What do these indicate semantically?

Visualization


In [ ]:
# Dictionary of words in model

model.wv.vocab
#model.vocab # deprecated

In [ ]:
# Visualizing the whole vocabulary would make it hard to read

len(model.wv.vocab)
#len(model.vocab) # deprecated

In [ ]:
# For interpretability, we'll select words that already have a semantic relation

her_tokens = [token for token,weight in model.most_similar(positive=['she','her','hers','herself'], \
                                                       negative=['he','him','his','himself'], topn=50)]

In [ ]:
# Inspect list

her_tokens

In [ ]:
# Get the vector for each sampled word

vectors = [model[word] for word in her_tokens]

In [ ]:
# Calculate cosine distances among the word vectors

dist_matrix = pairwise.pairwise_distances(vectors, metric='cosine')

In [ ]:
# Multi-Dimensional Scaling (Project vectors into 2-D)

mds = MDS(n_components = 2, dissimilarity='precomputed')
embeddings = mds.fit_transform(dist_matrix)

In [ ]:
# Make a pretty graph

_, ax = plt.subplots(figsize=(10,10))
ax.scatter(embeddings[:,0], embeddings[:,1], alpha=0)
for i in range(len(vectors)):
    ax.annotate(her_tokens[i], ((embeddings[i,0], embeddings[i,1])))

In [ ]:
# For comparison, here is the same graph using a masculine-pronoun vector

his_tokens = [token for token,weight in model.most_similar(positive=['he','him','his','himself'], \
                                                       negative=['she','her','hers','herself'], topn=50)]
vectors = [model[word] for word in his_tokens]
dist_matrix = pairwise.pairwise_distances(vectors, metric='cosine')
mds = MDS(n_components = 2, dissimilarity='precomputed')
embeddings = mds.fit_transform(dist_matrix)
_, ax = plt.subplots(figsize=(10,10))
ax.scatter(embeddings[:,0], embeddings[:,1], alpha=0)
for i in range(len(vectors)):
    ax.annotate(his_tokens[i], ((embeddings[i,0], embeddings[i,1])))

In [ ]:
## Q. What kinds of semantic relationships exist in the diagram above?
##    Are there any words that seem out of place?

3. Saving/Loading Models


In [ ]:
# Save current model for later use

model.wv.save_word2vec_format('resources/word2vec.txtlab_Novel150_English.txt')
#model.save_word2vec_format('resources/word2vec.txtlab_Novel150_English.txt') # deprecated

In [ ]:
# Load up models from disk

# Model trained on the Eighteenth Century Collections Online (ECCO-TCP) corpus (~2,350 texts)
# Made available by Ryan Heuser: http://ryanheuser.org/word-vectors-1/

ecco_model = gensim.models.KeyedVectors.load_word2vec_format('resources/word2vec.ECCO-TCP.txt')
#ecco_model = gensim.models.Word2Vec.load_word2vec_format('resources/word2vec.ECCO-TCP.txt') # deprecated

In [ ]:
# What are similar words to BANK?

ecco_model.most_similar('bank')

In [ ]:
# What if we remove the sense of "river bank"?

ecco_model.most_similar(positive=['bank'], negative=['river'])

Exercise


In [ ]:
## EX. Heuser's blog post explores an analogy in eighteenth-century thought that
##     RICHES are to VIRTUE what LEARNING is to GENIUS. How true is this in
##     the ECCO-trained Word2Vec model? Is it true in the one we trained?

##  Q. How might we compare word2vec models more generally?

In [ ]:
# ECCO model: RICHES are to VIRTUE what LEARNING is to ??

In [ ]:
# txtLab model: RICHES are to VIRTUE what LEARNING is to ??

4. Open Questions

At this point, we have seen a number of mathematical operations that we may use to explore word2vec's word embeddings. These enable us to ask new and interesting questions about semantics, yet many other questions remain open.

For example:

  1. How to compare word usages in different texts (within the same model)?
  2. How to compare word meanings in different models? How to compare whole models?
  3. What about the space “in between” words?
  4. Do we agree with the Distributional Hypothesis that words with the same contexts share their meanings?
    1. If not, then what information do we think is encoded in a word’s context?
  5. What good, humanistic research questions do analogies shed light on?
    1. shades of meaning?
    2. context similarity?