Getting Started with Jupyter Notebooks

This file is a "Jupyter Notebook". Jupyter Notebooks are files that allow one to write and evaluate Python (and R, and Julia...) alongside documentation, which makes them great for exploratory code investigations.

To run this notebook locally on your machine, we recommend that you follow these steps.

Installing Anaconda (Optional)

To follow along, the first step is to install Anaconda, a distribution of the Python programming language that makes it easier to manage Python packages and environments.

Once Anaconda is installed, open a new terminal window. (If you are on Windows, you should open an Anaconda terminal by going to Programs -> Anaconda3 (64-bit) -> Anaconda Prompt). Then you can create and activate a virtual environment:

# create a virtual environment with Python 3.6 named "3.6"
conda create python=3.6 --name=3.6

# activate the virtual environment
# (newer versions of conda, and Windows users, may need `conda activate 3.6` instead)
source activate 3.6

Running the Workshop Notebook

You should now see (3.6) prepended to your terminal prompt. Once you see that prefix, you can fetch and start the workshop notebook with the following commands:

git clone https://github.com/YaleDHLab/lab-workshops
cd lab-workshops/word-vectors
pip install -r requirements.txt
jupyter notebook word-vectors.ipynb

Once the notebook is open, you can evaluate a code cell by clicking on that cell, then clicking Cell -> Run Cells. Alternatively, after clicking on a cell you can hold Control and press Enter to execute the code in the cell. To run all the cells in a notebook (which I recommend you do for this notebook), you can click Cell -> Run All.

If you want to add a new cell, click the "+" button near the top of the page (below and between File and Edit). In that new cell, you can type Python code, like import this, then run the cell and immediately see the output. I encourage you to add and modify cells as we move through the discussion below, as interacting with the code is one of the best ways to grow comfortable with these techniques.
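
For example, a new cell containing just the single line below will print "The Zen of Python" when you run it:


In [ ]:
import this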

Use Cases and Motivations for Studying Word Vectors

Word vectors are data structures that map sequences of alphabetic characters (aka "words") to lists of numbers (aka "vectors"). These peculiar data structures have been used in all sorts of research applications, and are certainly worth exploring if your work touches on any of the following domains:

These are of course only a few of the many possible use cases for word vectors. The field is still new, so there are plenty of new applications you can discover!

Introduction to Word Vectors

"Word vectors", or "word embeddings" are ways of representing semantic and syntactic information about words in a vector form. As you may know, the word "vector" is just a fancy term for a list of numbers (or sometimes a list of number lists). Word vectors are a way of representing each word in a vocabulary with a list of numbers, such that those numbers can tell us useful information about the words in the vocabulary.

The simplest kind of word vector system represents each distinct word in a vocabulary with a w-dimensional vector (or list of w numbers), where w equals the number of distinct words in the vocabulary. Each word's vector consists of zeros at every position except the i-th, which is set to 1, where i indicates the given word's index position within the vocabulary. This is known as a "one-hot encoding," as the vector contains 0s in all but one position.

For example, suppose one has a vocabulary consisting only of five words: King, Queen, Man, Woman, and Child. In that case one could encode the word "Queen" as:
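
Given the order of that vocabulary, "Queen" occupies the second index position, so its one-hot encoding would be:

[0, 1, 0, 0, 0]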

While it's easy to represent words with this kind of one-hot encoding, these vectors don't give us any way to compare words except to check if they're equal, which isn't very helpful.
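
To make this concrete, below is a minimal sketch of a one-hot encoder for the five-word vocabulary above (the one_hot helper is purely illustrative). The dot product of any two distinct one-hot vectors is always 0, which is why these vectors can't tell us how closely two words are related:


In [ ]:
import numpy as np

# a minimal one-hot encoder for the five-word vocabulary above
vocab = ['King', 'Queen', 'Man', 'Woman', 'Child']

def one_hot(word):
  '''Return a vector with a 1 at the word's index in `vocab` and 0s elsewhere'''
  vec = np.zeros(len(vocab))
  vec[vocab.index(word)] = 1
  return vec

print(one_hot('Queen'))                    # [0. 1. 0. 0. 0.]
print(one_hot('Queen') @ one_hot('King'))  # 0.0 -- distinct one-hot vectors never overlap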

Recent approaches to word vectors, including Google's Word2Vec and Stanford University's GloVe embeddings, create more insightful word vectors by representing each word in a language with a dense k-dimensional vector, where k is a value chosen by the user who creates the word vectors. (Dense vectors are lists of numbers with few if any zeros; unlike one-hot vectors, which contain all zeros except for a single 1, dense vectors consist almost entirely of non-zero values.) The meaning of each word is thereby represented by a list of k values, and each of the k dimensions contributes some information to each word's representation.

If one were to label the dimensions in a word vector, the result might look something like the following:
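
(The dimension labels and values below are invented purely for illustration; the dimensions of real word vectors don't come with human-readable labels.)

               King   Queen   Man    Woman   Child
  royalty      0.99   0.99    0.01   0.02    0.01
  masculinity  0.94   0.06    0.99   0.02    0.45
  age          0.70   0.68    0.62   0.60    0.10
  ...          ...    ...     ...    ...     ...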

Each of these word vectors (the columns above) gives a representation of the semantic and syntactic function of a word in a language. By comparing these vectors, we can study relationships between words in ways that were previously not possible. To see how this works, let's dive into some code below.

Preparing the Code

The following block loads some dependencies, silences warnings, and makes random number generation reproducible.


In [ ]:
%load_ext autoreload
# reimport modules when evaluating cells
%autoreload 2

import warnings
warnings.filterwarnings('ignore') # ignore all warnings

import numpy as np
np.random.seed(0) # make random number generation consistent

Loading Pretrained Word Vectors

The following section will allow us to get started with word vectors by loading a model pretrained by Google. This model has already learned the mapping from each word to a 300-dimensional vector, so we won't need to "train" it ourselves. The download is roughly 1.5GB, so it may take a few minutes, but that's much faster than training a model from scratch, which could take days or weeks!


In [ ]:
import requests, os

# step one: download (1.5GB) Google's model to your current directory
if not os.path.exists('GoogleNews-vectors-negative300.bin.gz'):
  url = 'https://s3.amazonaws.com/lab-data-collections/GoogleNews-vectors-negative300.bin.gz'
  with requests.get(url, allow_redirects=True, stream=True) as response:
    with open('GoogleNews-vectors-negative300.bin.gz', 'wb') as out:
      for chunk in response.iter_content(chunk_size=1024*1024): # stream 1MB at a time to avoid holding 1.5GB in memory
        out.write(chunk)

In [ ]:
import gzip
import shutil

# step two: unzip the gzipped model file we just downloaded
if not os.path.exists('pretrained-model.bin'):
  with gzip.open('GoogleNews-vectors-negative300.bin.gz', 'rb') as f:
    with open('pretrained-model.bin', 'wb') as out:
      shutil.copyfileobj(f, out) # stream the decompressed bytes to disk

In [ ]:
from gensim.models import KeyedVectors

# step three: load the pretrained model we just downloaded and unzipped
model = KeyedVectors.load_word2vec_format('pretrained-model.bin', binary=True)

Exploring the Model

Now that we've downloaded and loaded a pretrained model, let's see what this model can do.

To do so, let's use one of the most useful Python commands, dir(), which displays all of the attributes (data attached to an object) and methods (functions we can call on an object) defined on the model we just instantiated.


In [ ]:
# investigate all attributes on the model
dir(model)

There are two attributes that are of particular interest in this list: vocab and wv. The former lists all of the words in the model, while the latter lets us fetch the vector associated with a given word:


In [ ]:
print(list(model.vocab)[-100:])

In [ ]:
print(model.wv['tiptop'])

We'll use these attributes below to investigate the model more thoroughly. (For information on some of the other attributes of model listed above, check out the Gensim documentation).

Comparing Word Vectors

Now that we've downloaded and loaded a pretrained model, let's use it to compare how similar words are. We should expect words we associate with one-another to have high similarity values, and words that we don't associate together to have low similarity values.


In [ ]:
from scipy.spatial.distance import cosine
import numpy as np

# get the vectors associated with two sample words
vec_a = model.wv['beach']
vec_b = model.wv['ocean']

# find the n most similar terms to a given query vector
print(model.wv.similar_by_vector(vec_a, topn=3))

# show the cosine similarity between those two vectors
print(1 - cosine(vec_a, vec_b))
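
Beyond pairwise similarity, simple vector arithmetic can surface relationships between words. One classic demonstration asks which words are closest to the vector for 'king' minus 'man' plus 'woman'; with the Google News model, the top result is typically 'queen':


In [ ]:
# ask which word vectors lie closest to vector('king') - vector('man') + vector('woman')
print(model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))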

In [ ]:
# if you try to get the vector for a word that isn't in the model's vocabulary, a KeyError is raised
word = 'cats_in_trees'

try:
  model.wv[word]
except KeyError:
  print('! Word missing from vocabulary !', word)

In [ ]:
# find all words in the model
words = list(model.wv.vocab.keys())

words[:100] # show just the first hundred words in the model

Visualizing Word Vector Fields

In the code above, we examined the similarity between two word vectors. In what follows, we'll visualize the similarity between hundreds of words at once. To do so, we'll reduce the vector representation of each word to just two dimensions, then we'll create a visualization that renders each word at its two-dimensional position.


In [ ]:
%matplotlib inline

# visualize the field of terms fed as input
from sklearn.manifold import TSNE
from adjustText import adjust_text
import matplotlib.pyplot as plt
from umap import UMAP
import numpy as np
import operator

def plot_word_field(model, method='umap', skip=0, max_words=1000, jitter=0, margin=0.05, figsize=(20, 14), min_dist=0.1):
  '''
  Given a gensim.model instance, plot the positions of terms
  '''
  if method not in ['tsne', 'umap']: raise Exception(' ! Requested method is not supported')
  words = get_model_words(model, max_words=max_words, skip=skip)
  if not words: raise Exception(' ! No words were found in model -- exiting')
  word_vectors = np.array([model.wv[word] for word in words]) # array of vecs, one per word
  print(' * creating layout for', len(word_vectors), 'words')
  if method == 'umap':
    X = UMAP(n_neighbors=5, min_dist=min_dist).fit_transform(word_vectors) # X.shape = len(words), 2
  elif method == 'tsne':
    X = TSNE(n_components=2).fit_transform(word_vectors) # X.shape = len(words), 2
  plot_words(X, words, figsize=figsize, jitter=jitter, margin=margin)

  
def get_model_words(model, max_words=1000, skip=0):
  '''Get the words from `model`. If `max_words` is provided, return up to that many words'''
  words = list(model.wv.vocab.keys()) # find all words in the model
  if max_words: # get the n most popular words if user requested max words
    word_to_count = {w: model.wv.vocab[w].count for w in words}
    sorted_by_count = sorted(word_to_count.items(), key=operator.itemgetter(1))
    sorted_by_count.reverse()
    words = [i[0] for i in sorted_by_count[skip:skip+max_words]]
  return words


def plot_words(X, words, jitter=False, figsize=(10,6), margin=0.5, labels={}):
  '''Given `X` where shape = words,2 and list of strings `words` plot the words at positions in `X`'''
  plt.figure(figsize=figsize)
  # draw each word at its 2D position and keep the resulting text artists
  texts = [plt.text(X[idx][0], X[idx][1], word) for idx, word in enumerate(words)]
  if jitter: adjust_text(texts, lim=jitter) # nudge overlapping labels apart
  if labels.get('x', False): plt.xlabel(labels['x'])
  if labels.get('y', False): plt.ylabel(labels['y'])
  # set the axis ranges
  x_vals = [t.get_position()[0] for t in texts]
  y_vals = [t.get_position()[1] for t in texts]
  plt.xlim(( min(x_vals)-margin, max(x_vals)+margin ))
  plt.ylim(( min(y_vals)-margin, max(y_vals)+margin ))
  plt.show()

In [ ]:
plot_word_field(model, skip=5000, max_words=500, jitter=0, min_dist=0.25, method='umap')

Finding Terms Related to a Concept

Previously, we found and compared vectors for individual words. We can also compare vectors related to concepts, or groups of related words.

In this example, we create two lists of words that we've determined relate to a concept: veggie_words and meat_words. The code below fetches the word vector for each word in each list, finds the centroid of each list's vectors, and then builds a new list of other words that are highly similar to each centroid.

In the end, this should return a list of words that relate to each identified concept; in our case either meat or veggie.


In [ ]:
import json
import os

def find_centroid(words):
  '''Given a list of words, return the centroid (mean) of those words' vectors'''
  vecs = np.vstack([model.wv[w] for w in words if w in model.wv])
  return np.mean(vecs, axis=0) # average each dimension across the word vectors


def find_similar_by_vec(vec, n=50):
  '''Return the words for the `n` most similar words to a query vector'''
  words, sims = zip(*model.wv.similar_by_vector(vec, topn=n))
  curated = []
  seen = set()
  for i in words:
    if i.lower() not in seen: # skip duplicates that differ only in casing
      seen.add(i.lower())
      curated.append(i)
  return curated[:n]


def find_similar_by_words(words, n=50):
  '''Return the words for the `n` most similar words to a list of query words'''
  centroid = find_centroid(words)
  return find_similar_by_vec(centroid, n=n)

In [ ]:
# create lists of words in a conceptual category        
veggie_words = ['asparagus', 'artichoke', 'avocado', 'beets', 'broccoli', 'carrot', 'celery', 'cauliflower', 'cucumber', 'eggplant', 'kale', 'lentils', 'lettuce', 'mushroom', 'olive', 'onion', 'pea', 'potato', 'salad', 'spinach', 'squash', 'tomato', 'turnip', 'yam', 'zucchini']
meat_words = ['bacon', 'beef', 'chicken', 'crab', 'duck', 'goose', 'meat', 'meatball', 'mutton', 'offal', 'partridge', 'pheasant', 'pork', 'quail', 'rabbit', 'turkey', 'veal', 'venison']

other_veggie_words = find_similar_by_words(veggie_words)
other_meat_words = find_similar_by_words(meat_words)

In [ ]:
print(other_meat_words)
print(other_veggie_words)

Plotting Relationships Between Concepts

Again we can reduce our vector representations to two dimensions so we can plot our concepts.

For this example, rather than plotting our words based on their individual word vectors, we are plotting words based on their relation to our two concepts: meat and veggie.

In the plot below, the x-axis represents the 'veggieness' of each word and the y-axis represents the 'meatiness' of each word. A higher value means a greater similarity to the concept. So a higher x value implies the word is more similar to our original concept of veggie_words, and a higher y value implies the word is more similar to our concept meat_words.


In [ ]:
from scipy.spatial.distance import cosine as dist

def plot_distances_to_concepts(words, vec_one, vec_two, jitter=0, labels={}):
  '''Plot each item in `words` according to its distance from `vec_one` and `vec_two`'''
  # 2D array where subarrays = [dist_from_vec_one, dist_from_vec_two]
  words = [w for w in words if w in model.wv]
  distances = [[1-dist(model.wv[w], vec_one), 1-dist(model.wv[w], vec_two)] for w in words]
  plot_words(np.vstack(distances), words, jitter=jitter, labels=labels) 

# find the centroids for each cluster of words
veggie_centroid = find_centroid(veggie_words)
meat_centroid = find_centroid(meat_words)

# plot the words in the concept space
words = other_veggie_words + other_meat_words
plot_distances_to_concepts(words, veggie_centroid, meat_centroid, jitter=4, labels={'x': 'veggieness', 'y': 'meatiness'})

Creating Custom Word2Vec Models with Gensim

In the previous examples we used the pre-built model obtained from Google. We can also build a custom model using our own text documents. In the examples below, we download three documents from Project Gutenberg, create a list of the words in each document, then generate a model from those word lists using the gensim class Word2Vec. We also save our model to a file named word2vec.model, so we can reuse it in other applications or share it with colleagues.

Finally, we plot the results in two dimensions. We limit our plot to 1000 words to decrease processing time. The full model can also be plotted if desired, but note that this will take much longer to run and display.


In [ ]:
from gensim.models import Word2Vec
import requests as r
import os

def download_text(url, out_dir='texts'):
  '''Download the content at `url` into the `out_dir` directory'''
  if not os.path.exists(out_dir): os.makedirs(out_dir)
  response = r.get(url, allow_redirects=True)
  out_path = os.path.join(out_dir, os.path.basename(url)) # name the file after the last segment of the url
  open(out_path, 'wb').write(response.content)
  print(' * downloaded', url)

# specify the urls to one or more text files
urls = [
  'http://www.gutenberg.org/cache/epub/14/pg14.txt',
  'http://www.gutenberg.org/cache/epub/25/pg25.txt',
  'http://www.gutenberg.org/cache/epub/48/pg48.txt',
]

# download each of the text files specified above
for i in urls: download_text(i)

In [ ]:
import codecs
import glob

# generate a list of lists, where sublists contain words from a file (in order)
word_lists = []
for i in glob.glob('texts/*'):
  with codecs.open(i) as f:
    word_lists.append(f.read().lower().split())

# build a model with the custom word lists
model = Word2Vec(word_lists, size=100, window=5, min_count=20, workers=4)
model.save('word2vec.model')

In [ ]:
# plot the words
plot_word_field(model, method='umap', min_dist=0.9)

Hierarchical Word Modelling

In the cells above we examined a few different ways of analyzing the similarity between word vectors. In this final example, we'll explore another way of analyzing similarity: "dendrograms". Dendrograms are tree-like structures that give a visual representation of the similarities between objects. As an example, we could take a look at the points below:
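
The coordinates in the cell below are invented purely for illustration, chosen so that E and F are the closest pair of points and D sits near both; the cell plots the points next to the dendrogram produced by hierarchically clustering them:


In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# six made-up 2D points; E and F are closest together, and D sits near both
labels = ['A', 'B', 'C', 'D', 'E', 'F']
points = np.array([
  [0.0, 0.0], [2.5, 0.5], [4.5, 3.0],
  [8.2, 6.6], [9.0, 7.0], [9.2, 7.1],
])

# draw the points on the left and the corresponding dendrogram on the right
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(points[:, 0], points[:, 1])
for label, (x, y) in zip(labels, points):
  axes[0].annotate(label, (x, y))
axes[0].set_title('points in 2D space')
dendrogram(linkage(points, 'ward'), labels=labels, ax=axes[1])
axes[1].set_title('dendrogram')
plt.show()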

One can see that E and F are the closest two points in the 2D space on the left. Likewise, the distance one must travel along the tree structure of the dendrogram to connect them is also the shortest. D is the next most similar to E and F, and it is therefore the next closest link in the dendrogram as well. In general, the closer two observations are in the 2D space, the closer they are in the dendrogram.


In [ ]:
from scipy.cluster.hierarchy import dendrogram, linkage

max_words = 1000
skip = 1000

words = get_model_words(model, max_words=max_words, skip=skip)
vecs = [model.wv[i] for i in words]

Z = linkage(vecs, 'ward')

In [ ]:
fig, ax = plt.subplots(figsize=(max_words/4, 8))
dendrogram(Z, leaf_rotation=90.0, leaf_font_size=14.0, labels=words)
plt.show()