PyTorch is a Python package for performing tensor computation, automatic differentiation, and dynamically defining neural networks. It makes it particularly easy to accelerate model training with a GPU. In recent years it has gained a large following in the NLP community.
Instructions for installing PyTorch can be found on the home page of their website: http://pytorch.org/. The PyTorch developers recommend using the conda package manager to install the library (in my experience pip works fine as well).
One thing to be aware of is that the package name differs depending on whether you intend to use a GPU. If you do plan on using a GPU, then you will need to install CUDA and cuDNN before installing PyTorch. Detailed instructions can be found at NVIDIA's website: https://docs.nvidia.com/cuda/. The following versions of CUDA are supported: 7.5, 8, and 9.
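If you installed a CUDA-enabled build, a quick sanity check (a minimal snippet, not part of the official installation instructions) is to confirm that PyTorch can see the GPU:
import torch
print(torch.__version__)          # Installed PyTorch version
print(torch.cuda.is_available())  # True if a CUDA-capable GPU is usable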
The PyTorch API is designed to very closely resemble NumPy. The central object for performing computation is the Tensor, which is PyTorch's version of NumPy's array.
In [1]:
import numpy as np
import torch
In [2]:
# Create a 3 x 2 array
np.ndarray((3, 2))
Out[2]:
In [3]:
# Create a 3 x 2 Tensor
torch.Tensor(3, 2)
Out[3]:
All of the basic arithmetic operations are supported.
In [4]:
a = torch.Tensor([1,2])
b = torch.Tensor([3,4])
print('a + b:', a + b)
print('a - b:', a - b)
print('a * b:', a * b)
print('a / b:', a / b)
Indexing/slicing also behaves the same.
In [5]:
a = torch.Tensor(5, 5)
print('a:', a)
# Slice using ranges
print('a[2:4, 3:4]', a[2:4, 3:4])
# Can count backwards using negative indices
print('a[:, -1]', a[:, -1])
# Skipping elements
print('a[::2, ::3]', a[::2, ::3])
Converting a Tensor to and from an array is also quite simple:
In [6]:
# Tensor from array
arr = np.array([1,2])
torch.from_numpy(arr)
Out[6]:
In [7]:
# Tensor to array
t = torch.Tensor([1, 2])
t.numpy()
Out[7]:
Moving Tensors to the GPU is also quite simple:
In [8]:
t = torch.Tensor([1, 2]) # on CPU
if torch.cuda.is_available():
    t = t.cuda()  # on GPU
Derivatives and gradients are critical to a large number of machine learning algorithms. One of the key benefits of PyTorch is that these can be computed automatically.
We'll demonstrate this using the following example. Suppose we have some data $x$ and $y$, and want to fit a model: $$ \hat{y} = mx + b $$ by minimizing the loss function: $$ L(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2 $$
In [9]:
# Data
x = torch.tensor([1., 2, 3, 4]) # requires_grad = False by default
y = torch.tensor([0., -1, -2, -3])
# Initialize variables
m = torch.rand(1, requires_grad=True)
b = torch.rand(1, requires_grad=True)
# Define function
y_hat = m * x + b
# Define loss
loss = torch.mean(0.5 * (y - y_hat)**2)
To obtain the gradients of $L$ w.r.t. $m$ and $b$, you need only run:
In [10]:
loss.backward() # Backprop the gradients of the loss w.r.t other variables
# Gradients
print('dL/dm: %0.4f' % m.grad)
print('dL/db: %0.4f' % b.grad)
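As a sanity check (an extra illustration, not part of the original flow), the gradients of this particular loss can also be written in closed form: $\frac{\partial L}{\partial m}$ is the mean of $(\hat{y} - y)\,x$ and $\frac{\partial L}{\partial b}$ is the mean of $\hat{y} - y$. Computing them directly should reproduce the values reported by autograd:
with torch.no_grad():
    dL_dm = torch.mean((y_hat - y) * x)
    dL_db = torch.mean(y_hat - y)
print('analytic dL/dm: %0.4f' % dL_dm)
print('analytic dL/db: %0.4f' % dL_db)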
While automatic differentiation is in itself a useful feature, it can be quite tedious to keep track of all of the different parameters and gradients for more complicated models. To make life simpler, PyTorch defines a torch.nn.Module class which handles these details for you. To paraphrase the PyTorch documentation, this is the base class for all neural network modules, and whenever you define a model it should be a subclass of this class.
Here is an example implementation of the simple linear model given above:
In [11]:
import torch.nn as nn
class LinearModel(nn.Module):
    def __init__(self):
        """This method is called when you instantiate a new LinearModel object.
        You should use it to define the parameters/layers of your model.
        """
        # Whenever you define a new nn.Module you should start the __init__()
        # method with the following line. Remember to replace `LinearModel`
        # with whatever you are calling your model.
        super(LinearModel, self).__init__()
        # Now we define the parameters used by the model.
        self.m = torch.nn.Parameter(torch.rand(1))
        self.b = torch.nn.Parameter(torch.rand(1))

    def forward(self, x):
        """This method computes the output of the model.

        Args:
            x: The input data.
        """
        return self.m * x + self.b

# Example forward pass. Note that we use model(x) not model.forward(x) !!!
model = LinearModel()
y_hat = model(x)
To train this model we need to pick an optimizer such as SGD, AdaDelta, Adam, etc. There are many options in torch.optim. When initializing an optimizer, the first argument is the collection of parameters you want optimized. To obtain all of the trainable parameters of a model you can call the nn.Module.parameters() method. For example, the following code initializes an SGD optimizer for the model defined above:
In [12]:
import torch.optim as optim
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
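If you want to see exactly which tensors the optimizer will be updating, you can iterate over named_parameters() (a quick inspection, not required for training):
for name, param in model.named_parameters():
    print(name, tuple(param.shape), param.requires_grad)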
Training is done in a loop. The general structure of each iteration is: zero out the stale gradients with optimizer.zero_grad(), compute the model output, compute the loss, call loss.backward() to compute the gradients, and call optimizer.step() to update the parameters.
For example, we can train our linear model by running:
In [13]:
import time
for i in range(5001):
    optimizer.zero_grad()
    y_hat = model(x)
    loss = torch.mean(0.5 * (y - y_hat)**2)
    loss.backward()
    optimizer.step()
    if i % 1000 == 0:
        time.sleep(1)  # DO NOT INCLUDE THIS IN YOUR CODE !!! Only for demo.
        print(f'Iteration {i} - Loss: {loss.item():0.6f}', end='\r')
Observe that the final parameters are close to what we expect (since the data satisfy $y = -x + 1$, we should find $m \approx -1$ and $b \approx 1$):
In [14]:
print('Final parameters:')
print('m: %0.2f' % model.m)
print('b: %0.2f' % model.b)
Now let's dive into an example that is more relevant to NLP: Word2Vec! The idea of Word2Vec is to create continuous vector representations of words using shallow neural networks, so that words with similar meanings end up close together in the vector space. First introduced by Mikolov et al. (2013), these models greatly improved the state of the art for measuring semantic and syntactic similarity of words. In addition, the learned word embeddings end up being useful for many other downstream tasks such as language modeling, machine translation, and automatic image captioning.
In this notebook, we will go over the continuous bag-of-words (CBOW) model, which predicts a center word from the average of the embeddings of the surrounding context words.
Since the output is a probability distribution over a categorical variable, we will use cross-entropy as our loss function.
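In symbols (mirroring the implementation later in this notebook), with $E$ the embedding matrix, $W$ the output projection, $w_t$ the center word, and $k$ the window size: $$ h = \frac{1}{2k} \sum_{1 \le |j| \le k} E_{w_{t+j}}, \qquad \hat{p} = \mathrm{softmax}(W h), \qquad L = -\log \hat{p}_{w_t} $$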
To start, we'll need some data to train on. We will use the text8 dataset since it does not require tokenization (the tokens can be obtained by splitting the document on spaces). The dataset can be downloaded by running the following shell commands:
In [ ]:
!wget http://mattmahoney.net/dc/text8.zip
!unzip text8.zip
In [15]:
with open('text8', 'r') as f:
    corpus = f.read().split()
Next, we need to get our data into Python and in a form that is usable by PyTorch. For text data this typically entails building a vocabulary of all of the words, sorting the vocabulary in terms of frequency, and then mapping words to integers corresponding to their place in the sorted vocabulary. This can be done as follows:
In [16]:
from collections import Counter
def build_vocabulary(corpus, limit=30000):
    """Builds a vocabulary.

    Args:
        corpus: A list of words.
        limit: Maximum number of words to keep in the vocabulary.
    """
    counts = Counter(corpus)  # Count the word occurrences.
    counts = counts.items()  # Transform Counter to (word, count) tuples.
    counts = sorted(counts, key=lambda x: x[1], reverse=True)  # Sort in terms of frequency.
    counts = counts[:limit]  # Keep only the `limit` most frequent words.
    reverse_vocab = ['<UNK>'] + [x[0] for x in counts]  # Use a list to map indices to words.
    vocab = {x: i for i, x in enumerate(reverse_vocab)}  # Invert that mapping to get the vocabulary.
    data = [vocab[x] if x in vocab else 0 for x in corpus]  # Get ids for all words in the corpus.
    return data, vocab, reverse_vocab
data, vocab, reverse_vocab = build_vocabulary(corpus)
Use the following cell to inspect the output of that function.
In [17]:
reverse_vocab[data[0]]
Out[17]:
We now need to generate batches of word-context pairs from the corpus, to be used during training. The following function illustrates how this can be done.
In [18]:
from random import shuffle
def batch_generator(data, batch_size, window_size):
    # Stores the indices of the words in the order that they will be chosen
    # when generating batches, e.g. the first word chosen in an epoch is
    # data[ids[0]].
    ids = list(range(window_size, len(data) - window_size))
    while True:
        shuffle(ids)  # Randomize at the start of each epoch
        sample_queue = []
        for id in ids:  # Iterate over random words
            w = data[id]
            c = []
            for i in range(1, window_size + 1):  # Iterate over window offsets
                c.append(data[id - i])  # Left context
                c.append(data[id + i])  # Right context
            sample_queue.append((w, c))
            # Once the sample queue is full, dequeue a batch
            if len(sample_queue) >= batch_size:
                batch = [sample_queue.pop() for _ in range(batch_size)]
                w, c = zip(*batch)  # Separate words and contexts
                # Convert data to tensors
                w = torch.LongTensor(w)
                c = torch.LongTensor(c)
                # Shuttle to GPU
                if torch.cuda.is_available():
                    w = w.cuda()
                    c = c.cuda()
                yield w, c
Unfortunately, this approach is quite slow, for two reasons: the context windows are assembled one element at a time in Python loops, and every batch is converted to a tensor and copied over to the GPU individually.
In general, training will run much faster if operations are performed in PyTorch instead of Python. With this in mind, let's see what a better approach looks like:
In [19]:
# Take the list of ids, and turn it into a tensor
data = torch.tensor(data)

# Replace the Python loop-based approach to getting context windows
# with tensor operations.
def window(x, window_size=2):
    chunks = x.unfold(0, 2 * window_size + 1, 1)
    w = chunks[:, window_size]
    c_left = chunks[:, :window_size]
    c_right = chunks[:, window_size + 1:]
    c = torch.cat((c_left, c_right), dim=1)
    return w, c
Let's double-check that the window function behaves as expected.
In [20]:
w, c = window(data)
print(reverse_vocab[w[0]])
print(' '.join(reverse_vocab[_c] for _c in c[0]))
At this point our data is in a tensor, and we can create context windows using only PyTorch operations.
Now we need a way to generate batches of data for training and evaluation.
PyTorch has great built-in utilities for this in the torch.utils.data module.
In [21]:
# As good scientists, we first hold out some data for evaluation
val_size = 100000
train = data[:-val_size]
val = data[-val_size:]
# Since the dataset and model are not too large, we'll move all the data
# onto the GPU ahead of time. Avoiding shuttling between the GPU and CPU
# will considerably increase the speed of training.
#
# Due to resource constraints, this step may not always be possible. If
# that is the case, then you will need to omit the next few lines of
# code, and move the `.cuda()` calls into the training loop (see below).
if torch.cuda.is_available():
    train = train.cuda()
    val = val.cuda()
# Hyperparameters. You should experiment with different settings to
# see what gives the best results.
window_size = 2
batch_size = 128
# There are two main data utilities to be aware of:
# - Datasets: Contain instances of data
# - DataLoaders: Handle batching and shuffling
# Data is generated during training by iterating over the DataLoader.
#
# In the code below we:
# - Create the (word, context) tensors using the `window` function
# - Feed them into a `TensorDataset`
# - Feed the dataset into the `DataLoader` as well as the batch size.
# Also note that shuffling is enabled in the training loader.
# Shuffling is important for maximizing performance during training.
train_w, train_c = window(train, window_size=window_size)
train_dataset = torch.utils.data.TensorDataset(train_w, train_c)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size, shuffle=True)
val_w, val_c = window(val, window_size=window_size)
val_dataset = torch.utils.data.TensorDataset(val_w, val_c)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size)
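As a quick sanity check (optional, just for illustration), you can pull a single batch from the training loader and inspect its shapes; with the settings above it should yield a vector of 128 target word ids and a 128 x 4 matrix of context ids:
w_batch, c_batch = next(iter(train_loader))
print(w_batch.shape)  # Expected: torch.Size([128])
print(c_batch.shape)  # Expected: torch.Size([128, 2 * window_size])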
Now that we can read in the data, it is time to build our model. Note that the code almost perfectly follows the mathematical description of CBOW given above. There are a few noteworthy things:
- We set bias=False in the output projection so that only a matrix multiplication is applied.
- The input embeddings and the output projection share the same weight matrix.
In [22]:
import torch.nn as nn
import torch.nn.functional as F
class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_size):
        super(CBOW, self).__init__()
        # Parameters
        self.vocab_size = vocab_size
        self.embedding_size = embedding_size
        # Layers
        self.embeddings = nn.Embedding(vocab_size, embedding_size)
        self.fc = nn.Linear(embedding_size, vocab_size, bias=False)
        # Share projection weights between the input and output
        # embeddings.
        self.embeddings.weight.data = self.fc.weight.data

    def forward(self, c):
        embeddings = self.embeddings(c)
        mean_of_embeddings = embeddings.mean(dim=1)  # Average the context embeddings
        projection = self.fc(mean_of_embeddings)
        log_probabilities = F.log_softmax(projection, dim=1)
        return log_probabilities
In [23]:
from torch import optim
# Training settings / hyperparameters
vocab_size = len(vocab)
embedding_size = 128
# Create model
model = CBOW(vocab_size, embedding_size)
if torch.cuda.is_available():
    model = model.cuda()
# Initialize optimizer.
# Note: The learning rate is often one of the most important hyperparameters
# to tune.
optimizer = optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-6)
# We will train using negative log-likelihood loss.
# Note: Sometimes the loss is computed inside the model (e.g., in AllenNLP).
# Different people have different preferences.
loss_function = nn.NLLLoss()
In [24]:
epochs = 10
best_val_loss = float('inf')
for epoch in range(epochs):
    # Training loop
    model.train()
    for i, batch in enumerate(train_loader):
        w, c = batch
        optimizer.zero_grad()
        output = model(c)
        loss = loss_function(output, w)
        loss.backward()
        optimizer.step()
        if (i % 100) == 0:
            print(f'Epoch: {epoch}, Iteration: {i} - Train Loss: {loss.item()}', end='\r')

    # Evaluation loop
    model.eval()
    total_loss = 0.0
    total_preds = 0
    for i, batch in enumerate(val_loader):
        w, c = batch
        with torch.no_grad():
            output = model(c)
            total_loss += F.nll_loss(output, w, reduction='sum').item()
            total_preds += w.shape[0]
    val_loss = total_loss / total_preds
    print(f'Epoch: {epoch} - Validation Loss: {val_loss}')

    # Save best model
    # Note: We save the model's state dict, not the model itself.
    if val_loss < best_val_loss:
        print('Best so far')
        torch.save(model.state_dict(), 'cbow_model.pt')
        best_val_loss = val_loss
In [25]:
state_dict = torch.load('cbow_model.pt')
model.load_state_dict(state_dict)
Out[25]:
Notice that instead of saving and loading the entire model, we save and load only its state_dict. This is generally the preferred approach. To understand why, check out this page from the docs: https://pytorch.org/tutorials/beginner/saving_loading_models.html
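As a minimal sketch of why this matters: to restore the model (for example, in a fresh session), you still need the class definition and the original hyperparameters to rebuild the module before loading the weights:
# Hypothetical restore: CBOW, vocab_size, and embedding_size must match what was saved.
restored_model = CBOW(vocab_size, embedding_size)
restored_model.load_state_dict(torch.load('cbow_model.pt'))
restored_model.eval()  # Switch to evaluation mode before inference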
In [26]:
# Extract the embedding weights from the model
embeddings = model.embeddings.weight.data
One interesting visual is to look at a t-SNE plot of the word vectors (note: best viewed in another tab):
In [27]:
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
# If the embeddings are on the GPU we need to move them to the CPU before
# converting to a numpy array.
if torch.cuda.is_available():
    embedding_array = embeddings.cpu().numpy()
else:
    embedding_array = embeddings.numpy()
# Get the t-SNE embeddings.
tsne = TSNE(n_components=2).fit_transform(embedding_array[:2000])
# Create a scatter plot
fig = plt.figure(figsize=(24, 24), dpi=300)
ax = fig.add_subplot(111)
ax.scatter(tsne[:,0], tsne[:,1], color='#ff7f0e')
# Show labels for the 200 most common words
for i in range(1, 201):
    ax.text(tsne[i, 0], tsne[i, 1], reverse_vocab[i], fontsize=12, color='#1f77b4')
Another fun application is to find similar words using cosine distance. This is another good opportunity to demonstrate how math is converted to code. The cosine distance between two vectors is defined as: $$ d(w, q) = 1 - \frac{w \cdot q}{|w||q|} $$ It is often useful to break down expressions like this into multiple steps: first normalize $w$ and $q$ to unit length, then take their dot product, and finally subtract the result from 1.
To find similar words we will need the distance from every word vector (stored as the rows of a matrix $W$) to the query vector $q$. In this case, we can replace the dot product above with a matrix-vector multiplication: $$ d(W, q) = 1 - \hat{W}\hat{q} $$ where $\hat{W}$ is $W$ with each row normalized to unit length and $\hat{q} = q / |q|$. Note that the output is a vector of distances.
This translates into the function below.
In [28]:
def cosine_distance(W, q):
    """
    Parameters:
        W : torch.FloatTensor
            shape (vocab_size, embedding_dim)
        q : torch.FloatTensor
            shape (embedding_dim,)
    """
    # Normalize
    Wp = W / W.norm(dim=1, keepdim=True)
    q = q / q.norm()
    # `torch.mv` is matrix-vector multiplication.
    return 1 - torch.mv(Wp, q)

def similar_words(word):
    print('Most similar words to: %s' % word)
    # Get the embedding for the query word.
    word_idx = vocab[word]
    query = embeddings[word_idx]
    # Compute cosine distance between the query embedding and all other embeddings.
    distance = cosine_distance(embeddings, query)
    # Find the closest embeddings and print out the corresponding words.
    closest_word_ids = distance.argsort()[:10]
    for i, close_word_idx in enumerate(closest_word_ids):
        print('%i - %s' % (i + 1, reverse_vocab[close_word_idx]))
In [29]:
similar_words('chaos')
In [30]:
similar_words('business')
Lastly, we'll perform the obligatory word algebra. We can use the vector representations of words and the cosine distance function to compute things like: $King - Man + Woman \approx Queen$
In [31]:
def word_algebra(a, b, c):
    a_idx = vocab[a]
    b_idx = vocab[b]
    c_idx = vocab[c]
    query = embeddings[a_idx] - embeddings[b_idx] + embeddings[c_idx]
    distance = cosine_distance(embeddings, query)
    closest_word_ids = distance.argsort()[:10]
    for i, close_word_idx in enumerate(closest_word_ids):
        print('%i - %s' % (i + 1, reverse_vocab[close_word_idx]))
In [32]:
word_algebra('king', 'man', 'woman')