In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
from datascience import *
import numpy as np
from scipy.spatial.distance import cosine
import gensim
import nltk
from string import punctuation

Word Embedding

This lesson is designed to explore features of word embeddings described by Ben Schmidt in his blog post "Rejecting the Gender Binary".

The primary corpus we use consists of the 150 English-language novels made available by the .txtLab at McGill. We also look at a Word2Vec model trained on the ECCO-TCP corpus of 2,350 eighteenth-century literary texts made available by Ryan Heuser. (I have reduced the number of terms in the model by half in order to conserve memory.)

For background on Word2Vec's mechanics, I suggest this brief tutorial by Google, especially the sections "Motivation," "Skip-Gram Model," and "Visualizing."

We'll read in Andrew Piper's corpus we used in our Topic Modeling notebook:


In [ ]:
metadata_tb = Table.read_table('../09-Topic-Modeling/data/txtlab_Novel150_English.csv')

fiction_path = '../09-Topic-Modeling/data/txtlab_Novel150_English/'

novel_list = []

# Iterate through filenames in metadata table
for filename in metadata_tb['filename']:
    
    # Read in novel text as single string, make lowercase
    with open(fiction_path + filename, 'r') as file_in:
        novel = file_in.read()
    
    # Add novel text as single string to master list
    novel_list.append(novel)

Pre-Processing

Word2Vec learns about the relationships among words by observing them in context. We'll need to tokenize the words in our corpus while retaining sentence boundaries. Since novels were imported as single strings, we'll first use sent_tokenize to divide them into sentences, and second, we'll split each sentence into its own list of words.

We'll use nltk's sentence tokenizer:


In [ ]:
from nltk.tokenize import sent_tokenize
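
If NLTK's punkt sentence models aren't already installed on your machine, you may need to run nltk.download('punkt') once. As a quick illustration on a made-up string, sent_tokenize splits text on sentence boundaries:


In [ ]:
sent_tokenize("It was the best of times. It was the worst of times.")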

Due to memory and time constraints we'll use our quick and dirty tokenizer:


In [ ]:
def fast_tokenize(text):
    
    # Iterate through text removing punctuation characters
    no_punct = "".join([char for char in text if char not in punctuation])
    
    # Split text over whitespace into list of words
    tokens = no_punct.split()
    
    return tokens
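
A quick check on a made-up sentence shows what the function returns:


In [ ]:
fast_tokenize("Call me Ishmael.")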

First get the sentences:


In [ ]:
sentences = [sentence for novel in novel_list for sentence in sent_tokenize(novel)]

Now the words:


In [ ]:
words_by_sentence = [fast_tokenize(sentence.lower()) for sentence in sentences]

We'll double check that we don't have any empty sentences:


In [ ]:
words_by_sentence = [sentence for sentence in words_by_sentence if sentence != []]

We should now have a list of lists with sentences and words:


In [ ]:
words_by_sentence[:2]

Word2Vec

Word Embeddings

Word2Vec is the most prominent word embedding algorithm. Word embedding generally attempts to identify semantic relationships between words by observing them in context.

Imagine that each word in a novel has its meaning determined by the words that surround it in a limited window. For example, in Moby Dick's first sentence, "me" is flanked by "Call" and "Ishmael." After observing the windows around every word in the novel (or many novels), the computer will notice a pattern in which "me" falls between similar pairs of words to "her," "him," or "them." Of course, the computer goes through the same process for the words "Call" and "Ishmael," for which "me" is reciprocally part of their contexts. This chaining of signifiers to one another mirrors some of humanists' most sophisticated interpretative frameworks of language.

The two main flavors of Word2Vec are CBOW (Continuous Bag of Words) and Skip-Gram, which can be distinguished partly by their input and output during training. Skip-Gram takes a word of interest as its input (e.g. "me") and tries to learn how to predict its context words ("Call", "Ishmael"). CBOW does the opposite: it takes the context words ("Call", "Ishmael") as a single input and tries to predict the word of interest ("me").

In general, CBOW is faster and does well with frequent words, while Skip-Gram potentially represents rare words better.
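
To make the difference concrete, here is a toy sketch (not gensim's internal code) of the training pairs each flavor would see, using a window of 1:


In [ ]:
# Toy illustration only: build the two kinds of training pairs from one tokenized sentence
sentence = ['call', 'me', 'ishmael']
window = 1

skipgram_pairs = []   # (input word, context word to predict)
cbow_pairs = []       # (list of context words, word to predict)

for i, target in enumerate(sentence):
    context = [sentence[j] for j in range(max(0, i - window), min(len(sentence), i + window + 1)) if j != i]
    cbow_pairs.append((context, target))
    for c in context:
        skipgram_pairs.append((target, c))

print("Skip-Gram pairs:", skipgram_pairs)
print("CBOW pairs:", cbow_pairs)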

Word2Vec Features

  • `size`: Number of dimensions for the word embedding model
  • `window`: Number of context words to observe in each direction
  • `min_count`: Minimum frequency for a word to be included in the model
  • `sg` (Skip-Gram): 0 indicates CBOW; 1 indicates Skip-Gram
  • `alpha`: Initial learning rate; prevents the model from over-correcting, enables finer tuning
  • `iter`: Number of passes through the dataset
  • `batch_words`: Target number of words per batch passed to the worker threads

Note: the cell below passes the default value for each argument explicitly

Training

We've gotten accustomed to training powerful models in Python with one line of code, so why stop now?


In [ ]:
model = gensim.models.Word2Vec(words_by_sentence, size=100, window=5, \
                               min_count=5, sg=0, alpha=0.025, iter=5, batch_words=10000)

Embeddings

We can return the actual high-dimensional vector by simply indexing the model with the word as the key:


In [ ]:
model['whale']
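
The embedding is just a numpy array with one value per dimension; since we trained with size=100, it has 100 entries:


In [ ]:
model['whale'].shape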

gensim comes with some handy methods to analyze word relationships. similarity will give us a score for how similar two words are. If this sounds like cosine similarity for words, you'd be right! It just takes the cosine similarity of these high-dimensional vectors:


In [ ]:
model.similarity('sense','sensibility')
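
As a sanity check, this should match (up to floating-point error) one minus the scipy cosine distance we imported at the top:


In [ ]:
# scipy's cosine() returns a distance, so 1 - distance gives the similarity
1 - cosine(model['sense'], model['sensibility'])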

We can also compare two clusters of word vectors with n_similarity, which measures the cosine similarity between them. Each cluster is represented by the mean of its word vectors:


In [ ]:
model.n_similarity(['sense','sensibility'],['whale','harpoon'])
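
Under the hood this amounts to a cosine comparison of the two cluster means; here is the same computation by hand (it should agree with n_similarity up to floating-point error):


In [ ]:
# average the vectors within each cluster, then compare the two means
austen_mean = np.mean([model[w] for w in ['sense', 'sensibility']], axis=0)
melville_mean = np.mean([model[w] for w in ['whale', 'harpoon']], axis=0)
1 - cosine(austen_mean, melville_mean)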

We can find words that don't belong with doesnt_match. It computes the mean vector of the words in the list and identifies the word furthest from it:


In [ ]:
model.doesnt_match(['pride','prejudice', 'harpoon'])

The most famous demonstration of this vector arithmetic is the analogy task. What happens if we take:

$$King - Man + Woman = $$

In [ ]:
model.most_similar(positive=['woman', 'king'], negative=['man'])

Schmidt looked at words associated with male and female pronouns to investigate gender. Let's try taking all the female pronouns and subtracting the male pronouns:


In [ ]:
model.most_similar(positive=['she','her','hers','herself'], negative=['he','him','his','himself'])

And the opposite:


In [ ]:
model.most_similar(positive=['he','him','his','himself'], negative=['she','her','hers','herself'])

How about together (genderless in Schmidt's sense)?


In [ ]:
model.most_similar(positive=['she','her','hers','herself','he','him','his','himself'], topn=50)

Homework

Use the most_similar method to find the tokens nearest to 'car' in our model. Do the same for 'motorcar'.


In [ ]:

What characterizes each word in our corpus? Does this make sense?


In [ ]:

Vector addition and subtraction can be thought of in terms of analogy. From the example above: 'man' is to 'king' as 'woman' is to '???'. Use the most_similar method to find: 'paris' is to 'france' as 'london' is to '???'


In [ ]:

What has our model learned about nation-states?


In [ ]:

Perform the canonical Word2Vec addition again, but leave out a term. Try 'king' - 'man', 'woman' - 'man', and 'woman' + 'king'.


In [ ]:

What do these indicate semantically?


In [ ]:


Visualization

We can use multi-dimensional scaling to visualize this space just like we did with the documents before. But there are a lot of words here, so let's limit it to 50 words from our female gendered subset:


In [ ]:
her_tokens = [token for token,weight in model.most_similar(positive=['she','her','hers','herself'], \
                                                       negative=['he','him','his','himself'], topn=50)]

We need to get the vector for each word, just as above, and add it to a list:


In [ ]:
vectors = [model[word] for word in her_tokens]

We can then calculate the pairwise cosine distances:


In [ ]:
from sklearn.metrics import pairwise
dist_matrix = pairwise.pairwise_distances(vectors, metric='cosine')

We'll use MDS to reduce the dimensions to two:


In [ ]:
from sklearn.manifold import MDS
mds = MDS(n_components = 2, dissimilarity='precomputed')
embeddings = mds.fit_transform(dist_matrix)

Some fancy matplotlib code...


In [ ]:
_, ax = plt.subplots(figsize=(10,10))
ax.scatter(embeddings[:,0], embeddings[:,1], alpha=0)
for i in range(len(vectors)):
    ax.annotate(her_tokens[i], (embeddings[i,0], embeddings[i,1]))

What kinds of semantic relationships exist in the diagram above? Are there any words that seem out of place? How do you think they got there?


In [ ]:


Saving/Loading Models

We can save the model as a .txt file with the save_word2vec_format method:


In [ ]:
model.wv.save_word2vec_format('word2vec.txtlab_Novel150_English.txt')

To load up a model, we just ask gensim. Here's a model trained on the ECCO-TCP corpus (~2,350 eighteenth-century texts) made available by Ryan Heuser: http://ryanheuser.org/word-vectors-1/


In [ ]:
ecco_model = gensim.models.KeyedVectors.load_word2vec_format('data/word2vec.ECCO-TCP.txt')

In [ ]:
ecco_model.most_similar(positive=['woman', 'king'], negative=['man'])

In [ ]:
ecco_model.most_similar(positive=['she','her','hers','herself'], negative=['he','him','his','himself'])

How does this differ from our novels model?


In [ ]:


Homework

Heuser's blog post explores an analogy in eighteenth-century thought that Riches are to Virtue what Learning is to Genius. How true is this in the ECCO-trained Word2Vec model? Is it true in the one we trained?

How might we compare word2vec models more generally?


In [ ]:


Alternative features for a classification model

This is really cool, but what implications does it have for our model of language? Word embeddings are simply more precise features of the kind we've been trying to capture already. That means we can use them in the machine learning models we've been building.

Recall our DTM bag of words classifier:


In [ ]:
from nltk.corpus import movie_reviews
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.utils import shuffle

reviews = [movie_reviews.raw(fileid) for fileid in movie_reviews.fileids()]
judgements = [movie_reviews.categories(fileid)[0] for fileid in movie_reviews.fileids()]

np.random.seed(0)

X, y = shuffle(reviews, judgements, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=50)

# get tfidf values (fit the vectorizer on the training set only, to avoid leaking test data)
tfidf = TfidfVectorizer()
tfidf.fit(X_train)
X_train_transformed = tfidf.transform(X_train)
X_test_transformed = tfidf.transform(X_test)

# build and test logit
logit_class = LogisticRegression(penalty='l2', C=1000)
logit_model = logit_class.fit(X_train_transformed, y_train)
logit_model.score(X_test_transformed, y_test)

So how can we use word embeddings as features? Believe it or not, one of the most effective ways is to simply average each dimension of our embedding across all the words in a given document. Recall that our w2v model for the novels was trained with 100 dimensions. Creating the features for a specific document entails extracting the 100-dimensional vector for each word, then averaging each dimension across all words:


In [ ]:
np.mean([model[w] for w in fast_tokenize(X[0]) if w in model], axis=0)

This gives us a single feature array with 100 values, one per embedding dimension. We can write a function to do this for any given string:


In [ ]:
def w2v_featurize(document, model):
    return np.mean([model[w] for w in fast_tokenize(document) if w in model], axis=0)

We can then featurize all of our documents:


In [ ]:
X_train_w2v = [w2v_featurize(d, model) for d in X_train]
X_test_w2v = [w2v_featurize(d, model) for d in X_test]

We can fit and score the machine learning model just as before:


In [ ]:
logit_class = LogisticRegression(random_state=0, penalty='l2', C=1000)
logit_model = logit_class.fit(X_train_w2v, y_train)
logit_model.score(X_test_w2v, y_test)

What about Heuser's model?


In [ ]:
X_train_w2v = [w2v_featurize(d, ecco_model) for d in X_train]
X_test_w2v = [w2v_featurize(d, ecco_model) for d in X_test]
logit_class = LogisticRegression(random_state=0, penalty='l2', C=1000)
logit_model = logit_class.fit(X_train_w2v, y_train)
logit_model.score(X_test_w2v, y_test)

Cool! But wait: what if we want to know why the model is making its decisions? Let's ask for the most positive coefficients:


In [ ]:
np.argsort(logit_model.coef_[0])[-10:]

And the negative:


In [ ]:
np.argsort(logit_model.coef_[0])[:10]

These are the indices of the most important features. But what are these features now?
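
They are no longer words: each feature is one dimension of the averaged embedding, so the coefficients aren't directly readable the way tfidf weights were. One rough way to get a feel for a dimension is to see which vocabulary words score highest on it. A sketch, assuming the pre-4.0 gensim API used above (where the loaded KeyedVectors expose a .vocab dict):


In [ ]:
# Exploratory sketch, not a standard diagnostic: find the dimension with the
# largest positive coefficient, then list the words whose vectors have the
# highest values on that dimension (the classifier was last fit on ecco_model features)
top_dim = np.argsort(logit_model.coef_[0])[-1]
sorted(ecco_model.vocab, key=lambda w: ecco_model[w][top_dim], reverse=True)[:10]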


Note that using our novels w2v model was not as accurate as the BoW tfidf method. That should be expected, given that movie-review language is likely VERY different from the language of our novel corpus. Our novel model likely doesn't even have entries for many of the words used in the movie reviews.

For modern English, most people reach for Stanford's GloVe model, which was trained on 6 billion tokens from Wikipedia and Gigaword! Quite a step up from 150 novels. Even the smallest model is a bit large to be working with on our cloud server, but using this model and the code below, you can see its power:

>>> import os
>>> os.system('python -m gensim.scripts.glove2word2vec -i glove.6B.100d.txt -o glove.6B.100d.w2v.txt')
>>> glove = gensim.models.KeyedVectors.load_word2vec_format('glove.6B.100d.w2v.txt')

>>> X_train_glove = [w2v_featurize(d, glove) for d in X_train]
>>> X_test_glove = [w2v_featurize(d, glove) for d in X_test]

>>> logit_class = LogisticRegression(random_state=0, penalty='l2', C=1000)
>>> logit_model = logit_class.fit(X_train_glove, y_train)
>>> logit_model.score(X_test_glove, y_test)

.8125

While this is not as accurate as our BoW tfidf method, there have been several applications and transformations of word embeddings that have proven more accurate than BoW tfidf on general modern text corpora. And keep in mind, one of the most interesting parts of this is that it uses only 100 dimensions: we can get ~81% accuracy by reducing a movie review to just 100 features (our BoW model had over 39,000!).


In [ ]:
X_train_transformed