In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
from datascience import *
import numpy as np
from scipy.spatial.distance import cosine
import gensim
import nltk
from string import punctuation
This lesson is designed to explore features of word embeddings described by Ben Schmidt in his blog post "Rejecting the Gender Binary".
The primary corpus we use consists of the 150 English-language novels made available by the .txtLab at McGill. We also look at a Word2Vec model trained on the ECCO-TCP corpus of 2,350 eighteenth-century literary texts made available by Ryan Heuser. (I have reduced the number of terms in the model by half in order to conserve memory.)
For background on Word2Vec's mechanics, I suggest this brief tutorial by Google, especially the sections "Motivation," "Skip-Gram Model," and "Visualizing."
We'll read in Andrew Piper's corpus we used in our Topic Modeling notebook:
In [ ]:
metadata_tb = Table.read_table('../09-Topic-Modeling/data/txtlab_Novel150_English.csv')
fiction_path = '../09-Topic-Modeling/data/txtlab_Novel150_English/'
novel_list = []

# Iterate through filenames in metadata table
for filename in metadata_tb['filename']:
    # Read in novel text as a single string
    with open(fiction_path + filename, 'r') as file_in:
        novel = file_in.read()
    # Add novel text to the master list
    novel_list.append(novel)
Word2Vec learns about the relationships among words by observing them in context. We'll need to tokenize the words in our corpus while retaining sentence boundaries. Since novels were imported as single strings, we'll first use sent_tokenize to divide them into sentences, and second, we'll split each sentence into its own list of words.
We'll use nltk's sentence tokenizer:
In [ ]:
from nltk.tokenize import sent_tokenize
Due to memory and time constraints, we'll use our quick and dirty tokenizer:
In [ ]:
def fast_tokenize(text):
    # Iterate through text, removing punctuation characters
    no_punct = "".join([char for char in text if char not in punctuation])
    # Split text over whitespace into list of words
    tokens = no_punct.split()
    return tokens
First get the sentences:
In [ ]:
sentences = [sentence for novel in novel_list for sentence in sent_tokenize(novel)]
Now the words:
In [ ]:
words_by_sentence = [fast_tokenize(sentence.lower()) for sentence in sentences]
We'll double check that we don't have any empty sentences:
In [ ]:
words_by_sentence = [sentence for sentence in words_by_sentence if sentence != []]
We should now have a list of lists with sentences and words:
In [ ]:
words_by_sentence[:2]
Word2Vec is the most prominent word embedding algorithm. Word embedding generally attempts to identify semantic relationships between words by observing them in context.
Imagine that each word in a novel has its meaning determined by the ones that surround it in a limited window. For example, in Moby Dick's first sentence, “me” is flanked by “Call” and “Ishmael.” After observing the windows around every word in the novel (or many novels), the computer will notice a pattern: “me” falls between pairs of words similar to those that surround “her,” “him,” or “them.” Of course, the computer goes through the same process for “Call” and “Ishmael,” for which “me” is reciprocally part of their contexts. This chaining of signifiers to one another mirrors some of humanists' most sophisticated interpretive frameworks of language.
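To make the idea of a context window concrete, here is a small illustrative sketch (a toy loop over a hypothetical tokenized sentence, not part of Word2Vec itself) that lists each target word together with its context words for a window of one word on either side:
In [ ]:
# Toy example: enumerate (target, context) pairs for a window of 1 word on each side
toy_sentence = ['call', 'me', 'ishmael']
window = 1
for i, target in enumerate(toy_sentence):
    context = toy_sentence[max(0, i - window):i] + toy_sentence[i + 1:i + 1 + window]
    print(target, context)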
The two main flavors of Word2Vec are CBOW (Continuous Bag of Words) and Skip-Gram, which can be distinguished partly by their input and output during training. Skip-Gram takes a word of interest as its input (e.g. "me") and tries to learn how to predict its context words ("Call", "Ishmael"). CBOW does the opposite, taking the context words ("Call", "Ishmael") as a single input and trying to predict the word of interest ("me").
In general, CBOW is faster and does well with frequent words, while Skip-Gram potentially represents rare words better.
Note: the cell below passes the default value for each argument explicitly
In [ ]:
model = gensim.models.Word2Vec(words_by_sentence, size=100, window=5, \
                               min_count=5, sg=0, alpha=0.025, iter=5, batch_words=10000)
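If we wanted the Skip-Gram flavor instead, the only change would be sg=1. A sketch only (not run here, since Skip-Gram training is noticeably slower than CBOW):
>>> sg_model = gensim.models.Word2Vec(words_by_sentence, size=100, window=5, \
...                                   min_count=5, sg=1, alpha=0.025, iter=5, batch_words=10000)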
In [ ]:
model['whale']
gensim comes with some handy methods to analyze word relationships. similarity will give us a score for how similar two words are. If this sounds like cosine similarity for words, you'd be right! It just takes the cosine similarity of these high-dimensional vectors (so in principle the score ranges from -1 to 1):
In [ ]:
model.similarity('sense','sensibility')
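To confirm that this is just cosine similarity, we can recompute it with the cosine distance function we imported from scipy at the top of the notebook (cosine distance is 1 minus cosine similarity); the result should match the value above up to floating-point error:
In [ ]:
# Cosine similarity computed by hand from the raw word vectors
1 - cosine(model['sense'], model['sensibility'])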
We can also find the cosine similarity between two clusters of word vectors. Each cluster is represented by the mean of its word vectors:
In [ ]:
model.n_similarity(['sense','sensibility'],['whale','harpoon'])
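Under the hood this is again cosine similarity, now taken between the mean vector of each word list. A quick sketch to check (the exact value may differ slightly depending on how your gensim version normalizes the vectors before averaging):
In [ ]:
# Mean vector for each cluster, then cosine similarity between the means
sense_mean = np.mean([model['sense'], model['sensibility']], axis=0)
whale_mean = np.mean([model['whale'], model['harpoon']], axis=0)
1 - cosine(sense_mean, whale_mean)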
We can find the word that doesn't belong with doesnt_match. It finds the mean vector of the words in the list and identifies the word furthest away from it:
In [ ]:
model.doesnt_match(['pride','prejudice', 'harpoon'])
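We can approximate what doesnt_match is doing by hand: average the word vectors and ask which word sits furthest (by cosine distance) from that mean. This sketch should point at the same outlier, though gensim normalizes the vectors before averaging, so the distances themselves will differ:
In [ ]:
# Find the word with the largest cosine distance from the mean of the group
odd_words = ['pride', 'prejudice', 'harpoon']
mean_vector = np.mean([model[w] for w in odd_words], axis=0)
max(odd_words, key=lambda w: cosine(model[w], mean_vector))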
The most famous application of this vector math is to semantic analogies. What happens if we take:
$$King - Man + Woman = $$
In [ ]:
model.most_similar(positive=['woman', 'king'], negative=['man'])
Schmidt looked at words associated with male and female pronouns to investigate gender. Let's try taking all the female pronouns and subtracting the male pronouns:
In [ ]:
model.most_similar(positive=['she','her','hers','herself'], negative=['he','him','his','himself'])
And the opposite:
In [ ]:
model.most_similar(positive=['he','him','his','himself'], negative=['she','her','hers','herself'])
How about together (genderless in Schmidt's sense)?
In [ ]:
model.most_similar(positive=['she','her','hers','herself','he','him','his','himself'], topn=50)
In [ ]:
What characterizes each word in our corpus? Does this make sense?
In [ ]:
Vector addition and subtraction can be thought of in terms of analogy. From the example above: 'man' is to 'king' as 'woman' is to '???'. Use the most_similar method to find: 'paris' is to 'france' as 'london' is to '???'.
In [ ]:
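One possible setup for the analogy, mirroring the king/queen pattern above and assuming these place names occur often enough in the novels to be in the vocabulary (kept here as a sketch so you can compare it against your own attempt):
In [ ]:
model.most_similar(positive=['france', 'london'], negative=['paris'])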
What has our model learned about nation-states?
In [ ]:
Perform the canonical Word2Vec arithmetic again, but leave out a term. Try 'king' - 'man', 'woman' - 'man', and 'woman' + 'king'.
In [ ]:
What do these indicate semantically?
In [ ]:
Let's grab the 50 words most strongly associated with the female pronouns so we can visualize how they relate to one another:
In [ ]:
her_tokens = [token for token, weight in model.most_similar(positive=['she','her','hers','herself'], \
              negative=['he','him','his','himself'], topn=50)]
We need to get the vector from each word, just like above, and add that to a list:
In [ ]:
vectors = [model[word] for word in her_tokens]
We can then calculate the pairwise cosine distances between them:
In [ ]:
from sklearn.metrics import pairwise
dist_matrix = pairwise.pairwise_distances(vectors, metric='cosine')
We'll use MDS (multi-dimensional scaling) to reduce the dimensions to two for plotting:
In [ ]:
from sklearn.manifold import MDS
mds = MDS(n_components = 2, dissimilarity='precomputed')
embeddings = mds.fit_transform(dist_matrix)
Some fancy matplotlib code...
In [ ]:
_, ax = plt.subplots(figsize=(10,10))
ax.scatter(embeddings[:,0], embeddings[:,1], alpha=0)
for i in range(len(vectors)):
    ax.annotate(her_tokens[i], ((embeddings[i,0], embeddings[i,1])))
What kinds of semantic relationships exist in the diagram above? Are there any words that seem out of place? How do you think they got there?
In [ ]:
We can save our trained model to disk in the standard word2vec format:
In [ ]:
model.wv.save_word2vec_format('word2vec.txtalb_Novel150_English.txt')
To load up a saved model, we just ask gensim. Here's a model trained on the Eighteenth Century Collections Online (ECCO-TCP) corpus made available by Ryan Heuser: http://ryanheuser.org/word-vectors-1/
In [ ]:
ecco_model = gensim.models.KeyedVectors.load_word2vec_format('data/word2vec.ECCO-TCP.txt')
In [ ]:
ecco_model.most_similar(positive=['woman', 'king'], negative=['man'])
In [ ]:
ecco_model.most_similar(positive=['she','her','hers','herself'], negative=['he','him','his','himself'])
How does this differ from our novels model?
In [ ]:
In [ ]:
This is really cool, but what implications does it have for how we model language? Word embeddings are simply more precise features for the things we've already been trying to capture, which means we can use them in the machine learning models we've been building.
Recall our DTM bag of words classifier:
In [ ]:
from nltk.corpus import movie_reviews
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.utils import shuffle
reviews = [movie_reviews.raw(fileid) for fileid in movie_reviews.fileids()]
judgements = [movie_reviews.categories(fileid)[0] for fileid in movie_reviews.fileids()]
np.random.seed(0)
X, y = shuffle(reviews, judgements, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=50)
# get tfidf values
tfidf = TfidfVectorizer()
tfidf.fit(X)
X_train_transformed = tfidf.transform(X_train)
X_test_transformed = tfidf.transform(X_test)
# build and test logit
logit_class = LogisticRegression(penalty='l2', C=1000)
logit_model = logit_class.fit(X_train_transformed, y_train)
logit_model.score(X_test_transformed, y_test)
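As an aside, the Pipeline and cross_val_score imports above go unused in that cell; the same tf-idf plus logistic regression classifier could also be evaluated with cross-validation in a couple of lines (a sketch; cv=5 is an arbitrary choice here):
In [ ]:
# Bundle the vectorizer and classifier, then report mean cross-validated accuracy
pipeline = Pipeline([('tfidf', TfidfVectorizer()), ('logit', LogisticRegression(penalty='l2', C=1000))])
cross_val_score(pipeline, X, y, cv=5).mean()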
So how can we use word embeddings as features? Believe it or not, one of the most effective ways is to simply average each dimension of our embedding across all the words in a given document. Recall that our w2v model for novels was trained with 100 dimensions. Creating the features for a specific document entails first extracting the 100-dimensional vector for each word, then averaging each dimension across all words:
In [ ]:
np.mean([model[w] for w in fast_tokenize(X[0]) if w in model], axis=0)
This gives us a single array with 100 features. We can write a function to do this for us for any given string:
In [ ]:
def w2v_featurize(document, model):
    return np.mean([model[w] for w in fast_tokenize(document) if w in model], axis=0)
We can then featurize all of our documents:
In [ ]:
X_train_w2v = [w2v_featurize(d, model) for d in X_train]
X_test_w2v = [w2v_featurize(d, model) for d in X_test]
We can fit and score the machine learning model just as before:
In [ ]:
logit_class = LogisticRegression(random_state=0, penalty='l2', C=1000)
logit_model = logit_class.fit(X_train_w2v, y_train)
logit_model.score(X_test_w2v, y_test)
What about Heuser's model?
In [ ]:
X_train_w2v = [w2v_featurize(d, ecco_model) for d in X_train]
X_test_w2v = [w2v_featurize(d, ecco_model) for d in X_test]
logit_class = LogisticRegression(random_state=0, penalty='l2', C=1000)
logit_model = logit_class.fit(X_train_w2v, y_train)
logit_model.score(X_test_w2v, y_test)
Cool! But wait, what if we wanted to know why the model was making its decisions? If we ask for the most positive coefficients:
In [ ]:
np.argsort(logit_model.coef_[0])[-10:]
And the negative:
In [ ]:
np.argsort(logit_model.coef_[0])[:10]
These are the indices for the important features. But what are these features now?
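One way to see the difference: the coefficient vector no longer lines up with a vocabulary. Each coefficient now corresponds to an embedding dimension rather than a word, so its length matches the dimensionality of whichever embedding model produced the features:
In [ ]:
# One coefficient per embedding dimension, not per word
logit_model.coef_.shape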
Note that using our novels w2v model was not as accurate as the BoW tf-idf method. That should be expected, given that movie-review language is likely VERY different from the language of our novel corpus, and our novels model likely doesn't even have entries for many of the words used in the movie reviews.
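We can put a rough number on that vocabulary mismatch. This sketch estimates what share of the tokens in the first review is covered by each model, using the same fast_tokenize and membership checks as w2v_featurize (exact figures will vary with your trained model):
In [ ]:
# Fraction of review tokens that have a vector in each model
review_tokens = fast_tokenize(X[0])
print(np.mean([w in model for w in review_tokens]))       # coverage by the novels model
print(np.mean([w in ecco_model for w in review_tokens]))  # coverage by the ECCO model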
For modern English, most people look to Stanford's GloVe model. This was trained on 6 billion tokens from Wikipedia and Gigaword! Quite a step up from 150 novels. Even the smallest model is a bit large to be working with on our cloud server, but using this model and the code below, you can see its power:
>>> import os
>>> os.system('python -m gensim.scripts.glove2word2vec -i glove.6B.100d.txt -o glove.6B.100d.w2v.txt')
>>> glove = gensim.models.KeyedVectors.load_word2vec_format('glove.6B.100d.w2v.txt')
>>> X_train_glove = [w2v_featurize(d, glove) for d in X_train]
>>> X_test_glove = [w2v_featurize(d, glove) for d in X_test]
>>> logit_class = LogisticRegression(random_state=0, penalty='l2', C=1000)
>>> logit_model = logit_class.fit(X_train_glove, y_train)
>>> logit_model.score(X_test_glove, y_test)
.8125
While this is not as accurate as our BoW tfidf method, there have been several applications and transformations of word embeddings that have proven to be more accurate than a BoW tfidf on general modern text corpora. And keep in mind, one of the most interesting parts of this is that it only uses 100 dimensions, i.e., we can get ~81% accuracy by reducing a movie review to only 100 different features (our BoW model had over 39000!).
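To see that size difference directly, we can compare the number of tf-idf features with the length of a single averaged embedding (the tf-idf vocabulary size depends on the corpus and vectorizer settings, so your exact count may differ):
In [ ]:
print(len(tfidf.vocabulary_))            # number of BoW tf-idf features
print(len(w2v_featurize(X[0], model)))   # number of word2vec features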
In [ ]:
X_train_transformed