Introduction to PyTorch

PyTorch is a Python package for performing tensor computation, automatic differentiation, and dynamically defining neural networks. It makes it particularly easy to accelerate model training with a GPU. In recent years it has gained a large following in the NLP community.

Installing PyTorch

Instructions for installing PyTorch can be found on the home page of their website: http://pytorch.org/. The PyTorch developers recommend using the conda package manager to install the library (in my experience, pip works fine as well).

One thing to be aware of is that the package name differs depending on whether you intend to use a GPU. If you do plan on using a GPU, then you will need to install CUDA and cuDNN before installing PyTorch. Detailed instructions can be found on NVIDIA's website: https://docs.nvidia.com/cuda/. The following versions of CUDA are supported: 7.5, 8, and 9.
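
Once everything is installed, a quick sanity check (a minimal sketch; the exact version strings will depend on your install) confirms that PyTorch can see your GPU:

import torch

print(torch.__version__)          # PyTorch version
print(torch.cuda.is_available())  # True if a usable GPU and CUDA install were found
if torch.cuda.is_available():
    print(torch.version.cuda)             # CUDA version PyTorch was built against
    print(torch.cuda.get_device_name(0))  # name of the first GPU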

PyTorch Basics

The PyTorch API is designed to very closely resemble NumPy. The central object for performing computation is the Tensor, which is PyTorch's version of NumPy's array.


In [1]:
import numpy as np
import torch

In [2]:
# Create a 3 x 2 array
np.ndarray((3, 2))


Out[2]:
array([[4.68565489e-310, 0.00000000e+000],
       [0.00000000e+000, 0.00000000e+000],
       [0.00000000e+000, 0.00000000e+000]])

In [3]:
# Create a 3 x 2 Tensor
torch.Tensor(3, 2)


Out[3]:
tensor([[0.0000e+00, 4.3701e+12],
        [1.8788e+31, 1.7220e+22],
        [2.9926e+21, 1.3613e-05]])
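
Note that, just like np.ndarray above, torch.Tensor(3, 2) allocates memory without initializing it, which is why the values printed above look like garbage. When you want well-defined contents, the explicit constructors below are usually clearer (a brief aside):

torch.empty(3, 2)  # uninitialized, like torch.Tensor(3, 2)
torch.zeros(3, 2)  # all zeros
torch.ones(3, 2)   # all ones
torch.rand(3, 2)   # uniform random values in [0, 1)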

All of the basic arithmetic operations are supported.


In [4]:
a = torch.Tensor([1,2])
b = torch.Tensor([3,4])
print('a + b:', a + b)
print('a - b:', a - b)
print('a * b:', a * b)
print('a / b:', a / b)


a + b: tensor([4., 6.])
a - b: tensor([-2., -2.])
a * b: tensor([3., 8.])
a / b: tensor([0.3333, 0.5000])

Indexing and slicing also behave the same way as in NumPy.


In [5]:
a = torch.Tensor(5, 5)
print('a:', a)

# Slice using ranges
print('a[2:4, 3:4]', a[2:4, 3:4])

# Can count backwards using negative indices
print('a[:, -1]', a[:, -1])

# Skipping elements
print('a[::2, ::3]', a[::2, ::3])


a: tensor([[2.8026e-45, 0.0000e+00, 9.9182e+16, 3.0942e-41, 1.4013e-45],
        [0.0000e+00, 3.9236e-44, 4.5695e-41, 3.7652e+00, 4.5695e-41],
        [1.4013e-45,        nan, 1.4013e-45, 0.0000e+00, 1.4013e-45],
        [5.6052e-45, 3.5032e-44, 4.5695e-41, 9.9182e+16, 3.0942e-41],
        [7.8055e-05, 4.5695e-41, 1.4013e-45, 5.6052e-45, 1.4013e-45]])
a[2:4, 3:4] tensor([[0.0000e+00],
        [9.9182e+16]])
a[:, -1] tensor([1.4013e-45, 4.5695e-41, 1.4013e-45, 3.0942e-41, 1.4013e-45])
a[::2, ::3] tensor([[2.8026e-45, 3.0942e-41],
        [1.4013e-45, 0.0000e+00],
        [7.8055e-05, 5.6052e-45]])

Converting a Tensor to and from a NumPy array is also quite simple:


In [6]:
# Tensor from array
arr = np.array([1,2])
torch.from_numpy(arr)


Out[6]:
tensor([1, 2])

In [7]:
# Tensor to array
t = torch.Tensor([1, 2])
t.numpy()


Out[7]:
array([1., 2.], dtype=float32)
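
One detail worth keeping in mind (a brief aside): torch.from_numpy shares memory with the original array, so modifying one modifies the other.

arr = np.array([1., 2.])
t = torch.from_numpy(arr)
arr[0] = 100.
print(t)  # tensor([100., 2.], dtype=torch.float64)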

Moving Tensors to the GPU is just as easy:


In [8]:
t = torch.Tensor([1, 2]) # on CPU
if torch.cuda.is_available():
    t = t.cuda() # on GPU

Automatic Differentiation

Derivatives and gradients are critical to a large number of machine learning algorithms. One of the key benefits of PyTorch is that these can be computed automatically.

We'll demonstrate this using the following example. Suppose we have some data $x$ and $y$, and want to fit a model: $$ \hat{y} = mx + b $$ by minimizing the loss function: $$ L(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2 $$


In [9]:
# Data
x = torch.tensor([1.,  2,  3,  4])  # requires_grad = False by default
y = torch.tensor([0., -1, -2, -3])

# Initialize the parameters we want to learn
m = torch.rand(1, requires_grad=True)
b = torch.rand(1, requires_grad=True)

# Define function
y_hat = m * x + b

# Define loss
loss = torch.mean(0.5 * (y - y_hat)**2)

To obtain the gradients of $L$ w.r.t. $m$ and $b$ you need only run:


In [10]:
loss.backward() # Backprop the gradients of the loss w.r.t other variables

# Gradients
print('dL/dm: %0.4f' % m.grad)
print('dL/db: %0.4f' % b.grad)


dL/dm: 7.5916
dL/db: 2.3932
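
As an optional sanity check, the same gradients can be computed by hand using the chain rule: $\partial L / \partial m = \frac{1}{N}\sum_i (\hat{y}_i - y_i)x_i$ and $\partial L / \partial b = \frac{1}{N}\sum_i (\hat{y}_i - y_i)$. The minimal sketch below should print the same values as m.grad and b.grad above.

with torch.no_grad():
    dL_dm = torch.mean((y_hat - y) * x)
    dL_db = torch.mean(y_hat - y)
print('dL/dm (manual): %0.4f' % dL_dm)
print('dL/db (manual): %0.4f' % dL_db)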

Training Models

While automatic differentiation is in itself a useful feature, it can be quite tedious to keep track of all of the different parameters and gradients for more complicated models. To make life simpler, PyTorch defines a torch.nn.Module class which handles these details for you. To paraphrase the PyTorch documentation, this is the base class for all neural network modules, and every model you define should be a subclass of it.

Here is an example implementation of the simple linear model given above:


In [11]:
import torch.nn as nn

class LinearModel(nn.Module):
    
    def __init__(self):
        """This method is called when you instantiate a new LinearModel object.
        
        You should use it to define the parameters/layers of your model.
        """
        # Whenever you define a new nn.Module you should start the __init__()
        # method with the following line. Remember to replace `LinearModel` 
        # with whatever you are calling your model.
        super(LinearModel, self).__init__()
        
        # Now we define the parameters used by the model.
        self.m = torch.nn.Parameter(torch.rand(1))
        self.b = torch.nn.Parameter(torch.rand(1))
    
    def forward(self, x):
        """This method computes the output of the model.
        
        Args:
            x: The input data.
        """
        return self.m * x + self.b


# Example forward pass. Note that we use model(x) not model.forward(x) !!! 
model = LinearModel()
y_hat = model(x)

To train this model we need to pick an optimizer such as SGD, AdaDelta, Adam, etc. There are many options in torch.optim. When initializing an optimizer, the first argument is the collection of parameters you want optimized. To obtain all of the trainable parameters of a model you can call the nn.Module.parameters() method. For example, the following code initializes an SGD optimizer for the model defined above:


In [12]:
import torch.optim as optim
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

Training is done in a loop. The general structure is:

  1. Clear the gradients.
  2. Evaluate the model.
  3. Calculate the loss.
  4. Backpropagate.
  5. Perform an optimization step.
  6. (Once in a while) Print monitoring metrics.

For example, we can train our linear model by running:


In [13]:
import time

for i in range(5001):
    optimizer.zero_grad()
    y_hat = model(x)
    loss = torch.mean(0.5 * (y - y_hat)**2)
    loss.backward()
    optimizer.step()
    if i % 1000 == 0:
        time.sleep(1) # DO NOT INCLUDE THIS IN YOUR CODE !!! Only for demo.
        print(f'Iteration {i} - Loss: {loss.item():0.6f}', end='\r')


Iteration 5000 - Loss: 0.000000

Observe that the final parameters are what we expect:


In [14]:
print('Final parameters:')
print('m: %0.2f' % model.m)
print('b: %0.2f' % model.b)


Final parameters:
m: -1.00
b: 1.00

CASE STUDY: Word2Vec

Now let's dive into an example that is more relevant to NLP: Word2Vec! The idea of Word2Vec is to create continuous vector representations of words using shallow neural networks, so that words with similar meanings end up close together in the vector space. First introduced by Mikolov et al. (2013), these models greatly improved the state of the art in measuring semantic and syntactic similarity between words. In addition, the learned word embeddings end up being useful for many other downstream tasks such as language modeling, machine translation, and automatic image captioning.

In this notebook, we will go over the continuous bag-of-words (CBOW) model, which has the following architecture: Given a word $w_i$ in our corpus as well as a context $c = [w_{i-k}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+k}]$ comprised of the words in a window of size $k$ around $w_i$, the goal is to maximize $p(W_i = w_i \mid c)$. In the CBOW model this is done by:

  1. Embedding the context words $w_{i+j}$ to get word vectors $v_{i+j}$.
  2. Summing up these word vectors to get $v = \sum_{j \in [-k, k] \setminus \{0\}} v_{i+j}$.
  3. Feeding $v$ into a fully connected layer with softmax activation to obtain the output probabilities. That is, $P(w_i | c) = \text{softmax}(Av)$.

Since the output is a probability distribution over a categorical variable we will use cross-entropy as our loss function.
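
Concretely, this is why the model defined later returns log-probabilities (via log_softmax) and is trained with NLLLoss: that combination is equivalent to applying cross-entropy directly to the raw scores. A minimal sketch of the equivalence (the tensor names here are just for illustration):

import torch
import torch.nn.functional as F

scores = torch.randn(4, 10)           # a batch of 4 examples, 10 classes
targets = torch.tensor([1, 0, 3, 9])  # the correct class for each example

loss_a = F.cross_entropy(scores, targets)
loss_b = F.nll_loss(F.log_softmax(scores, dim=1), targets)
assert torch.allclose(loss_a, loss_b)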

Dataset

To start, we'll need some data to train on. We will use the text8 dataset since it does not require tokenization (the tokens can be obtained by splitting the document on spaces). The dataset can be downloaded by running the following shell commands:


In [ ]:
!wget http://mattmahoney.net/dc/text8.zip
!unzip text8.zip

In [15]:
with open('text8', 'r') as f:
    corpus = f.read().split()

Next, we need to get our data into a form that is usable by PyTorch. For text data this typically entails building a vocabulary of all of the words, sorting the vocabulary by frequency, and then mapping words to integers corresponding to their position in the sorted vocabulary. This can be done as follows:


In [16]:
from collections import Counter


def build_vocabulary(corpus, limit=30000):
    """Builds a vocabulary.
    
    Args:
        corpus: A list of words.
        limit: Maximum number of words to keep in the vocabulary.
    """
    counts = Counter(corpus) # Count the word occurrences.
    counts = counts.items() # Transform Counter to (word, count) tuples.
    counts = sorted(counts, key=lambda x: x[1], reverse=True) # Sort by frequency.
    counts = counts[:limit] # Keep only the `limit` most frequent words.
    reverse_vocab = ['<UNK>'] + [x[0] for x in counts] # Use a list to map indices to words.
    vocab = {x: i for i, x in enumerate(reverse_vocab)} # Invert that mapping to get the vocabulary.
    data = [vocab[x] if x in vocab else 0 for x in corpus] # Map each word in the corpus to its id (0 = <UNK>).
    return data, vocab, reverse_vocab


data, vocab, reverse_vocab = build_vocabulary(corpus)

Use the following cell to inspect the output of that function.


In [17]:
reverse_vocab[data[0]]


Out[17]:
'anarchism'

We now need to generate batches of word-context pairs from the corpus, to be used during training. The following function illustrates how this can be done.


In [18]:
from random import shuffle


def batch_generator(data, batch_size, window_size):

    # Indices of the words in the order they will be chosen when generating batches,
    # e.g. the first centre word chosen in an epoch is data[ids[0]].
    ids = list(range(window_size, len(data) - window_size))

    while True:
        shuffle(ids) # Randomize at the start of each epoch
        sample_queue = []
        for id in ids: # Iterate over random words
            w = data[id]
            c = []
            for i in range(1, window_size + 1): # Iterate over window sizes
                c.append(data[id - i]) # Left context
                c.append(data[id + i]) # Right context
            sample_queue.append((w, c))
            
            # Once the sample queue is full, dequeue a batch
            if len(sample_queue) >= batch_size:
                
                batch = [sample_queue.pop() for _ in range(batch_size)]
                w, c = zip(*batch) # Separate words and contexts
                
                # Convert data to tensors
                w = torch.LongTensor(w)
                c = torch.LongTensor(c)
                
                # Shuttle to GPU
                if torch.cuda.is_available():
                    w = w.cuda()
                    c = c.cuda()
                    
                yield w, c

Unfortunately, this approach is quite slow, for two reasons:

  • the context windows are re-computed in pure Python on every pass through the data
  • data is frequently shuffled from Python objects to PyTorch tensors, and from the CPU to the GPU.

In general, training will run much faster if operations are performed in PyTorch instead of Python. With this in mind, let's see what a better approach looks like:


In [19]:
# Take the list of ids, and turn it into a tensor
data = torch.tensor(data)

# Replace the Python loop-based approach to getting context windows
# with tensor operations.
def window(x, window_size=2):
    chunks = x.unfold(0, 2 * window_size + 1, 1)
    w = chunks[:, window_size]
    c_left = chunks[:, :window_size]
    c_right = chunks[:, window_size + 1:]
    c = torch.cat((c_left, c_right), dim=1)
    return w, c

Let's double check that the window function behaves as expected.


In [20]:
w, c = window(data)
print(reverse_vocab[w[0]])
print(' '.join(reverse_vocab[_c] for _c in c[0]))


as
anarchism originated a term

At this point our data is in a tensor, and we can create context windows using only PyTorch operations. Now we need a way to generate batches of data for training and evaluation. PyTorch has great built-in utilities for this in the torch.utils.data module.


In [21]:
# As good scientists, we first hold out some data for evaluation
val_size = 100000
train = data[:-val_size]
val = data[-val_size:]

# Since the dataset and model are not too large, we'll move all the data
# onto the GPU ahead of time. Avoiding shuttling between the GPU and CPU
# will considerably increase the speed of training.
# 
# Due to resource constraints, this step may not always be possible. If
# that is the case, then you will need to omit the next few lines of
# code, and move the `.cuda()` calls into the training loop (see below).
if torch.cuda.is_available():
    train = train.cuda()
    val = val.cuda()

# Hyperparameters. You should experiment with different settings to
# see what gives the best results.
window_size = 2
batch_size = 128

# There are two main data utilities to be aware of:
# - Datasets: Contain instances of data
# - DataLoaders: Handle batching and shuffling
# Data is generated during training by iterating over the DataLoader.
#
# In the code below we:
# - Create the (word, context) tensors using the `window` function
# - Feed them into a `TensorDataset`
# - Feed the dataset into the `DataLoader` as well as the batch size.
#   Also note that shuffling is enabled in the training loader.
#   Shuffling is important to maximizing performance during training.
train_w, train_c = window(train, window_size=window_size)
train_dataset = torch.utils.data.TensorDataset(train_w, train_c)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size, shuffle=True)

val_w, val_c = window(val, window_size=window_size)
val_dataset = torch.utils.data.TensorDataset(val_w, val_c)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size)

Model

Now that we can read in the data, it is time to build our model. Note that the code almost perfectly follows the mathematical description of CBOW given above. There are a few noteworthy things:

  • We set bias=False in the output projection so that only matrix multiplication is applied.
  • According to the paper, "the weight matrix between the input and the projection layer is shared". This is often referred to as weight tying, and is done in many models where the inputs and outputs are both words. In the code below, the weights are tied by assigning self.embeddings.weight.data = self.fc.weight.data in __init__().
  • The paper says to sum the word embeddings; we instead take the average. We do this because the average has a smaller norm than the sum, which is more sensible for large window sizes.

In [22]:
import torch.nn as nn
import torch.nn.functional as F


class CBOW(nn.Module):
    
    def __init__(self, vocab_size, embedding_size):
        super(CBOW, self).__init__()
        # Parameters
        self.vocab_size = vocab_size
        self.embedding_size = embedding_size
        # Layers
        self.embeddings = nn.Embedding(vocab_size, embedding_size)
        self.fc = nn.Linear(embedding_size, vocab_size, bias=False)
        # Share projection weights between the input and output 
        # embeddings.
        self.embeddings.weight.data = self.fc.weight.data
    
    def forward(self, c):
        embeddings = self.embeddings(c)
        mean_of_embeddings = embeddings.mean(dim=1)
        projection = self.fc(mean_of_embeddings)
        log_probabilities = F.log_softmax(projection, dim=1)
        return log_probabilities
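
As an optional quick check, we can verify that the input and output embeddings really do share storage after the assignment in __init__() (a minimal sketch; data_ptr() returns the address of a tensor's underlying storage):

tiny = CBOW(vocab_size=10, embedding_size=4)
assert tiny.embeddings.weight.data_ptr() == tiny.fc.weight.data_ptr()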

Training

The training script essentially follows the same pattern that we used for the linear model above. However, we have also added an evaluation step and code for saving model checkpoints.


In [23]:
from torch import optim

# Training settings / hyperparameters
vocab_size = len(vocab)
embedding_size = 128

# Create model
model = CBOW(vocab_size, embedding_size)
if torch.cuda.is_available():
    model = model.cuda()

# Initialize optimizer.
# Note: The learning rate is often one of the most important hyperparameters
# to tune.
optimizer = optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-6)

# We will train using negative log-likelihood loss.
# Note: Sometimes the loss is computed inside the model (e.g., in AllenNLP).
# Different people have different preferences.
loss_function = nn.NLLLoss()

WARNING:

The following code may take some time to run. If you'd like, you can skip it and download the pretrained weights at:


In [24]:
epochs = 10
best_val_loss = float('inf')

for epoch in range(epochs):
    
    # Training loop
    model.train()
    for i, batch in enumerate(train_loader):
        w, c = batch
        optimizer.zero_grad()
        output = model(c)
        loss = loss_function(output, w)
        loss.backward()
        optimizer.step()
        if (i % 100) == 0:
            print(f'Epoch: {epoch}, Iteration: {i} - Train Loss: {loss.item()}', end='\r')
    
    # Evaluation loop
    model.eval()
    total_loss = 0.0
    total_preds = 0
    for i, batch in enumerate(val_loader):
        w, c = batch
        with torch.no_grad():
            output = model(c)
        total_loss += F.nll_loss(output, w, reduction='sum').item()
        total_preds += w.shape[0]
    val_loss = total_loss / total_preds
    print(f'Epoch: {epoch} - Validation Loss: {val_loss:0.4f}')
    
    # Save best model
    # Note: We save the model's state dict, not the model itself.
    if val_loss < best_val_loss:
        print('Best so far')
        torch.save(model.state_dict(), 'cbow_model.pt')
        best_val_loss = val_loss


Epoch: 0 - Validation Loss: 6.0720
Best so far
Epoch: 1 - Validation Loss: 5.9370
Best so far
Epoch: 2 - Validation Loss: 5.8761
Best so far
Epoch: 3 - Validation Loss: 5.8433
Best so far
Epoch: 4 - Validation Loss: 5.8274
Best so far
Epoch: 5 - Validation Loss: 5.8135
Best so far
Epoch: 6 - Validation Loss: 5.8071
Best so far
Epoch: 7 - Validation Loss: 5.8029
Best so far
Epoch: 8 - Validation Loss: 5.7988
Best so far
Epoch: 9 - Validation Loss: 5.7954
Best so far

Loading Trained Models

Loading a pretrained model can be done in two lines:


In [25]:
state_dict = torch.load('cbow_model.pt')
model.load_state_dict(state_dict)


Out[25]:
<All keys matched successfully>

Notice that instead of saving and loading the entire model, we save and load only its state_dict. This is generally the preferred approach. To understand why, check out this page from the docs: https://pytorch.org/tutorials/beginner/saving_loading_models.html
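
For completeness, here is a minimal sketch of the full round trip: the state_dict is saved to disk and later loaded into a freshly constructed instance of the same model class (the model must be instantiated before load_state_dict can be called).

# Save (as done in the training loop above)
torch.save(model.state_dict(), 'cbow_model.pt')

# Load into a new instance of the same architecture
new_model = CBOW(vocab_size, embedding_size)
new_model.load_state_dict(torch.load('cbow_model.pt'))
new_model.eval()  # switch to evaluation mode before running inference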

Fun with Word Vectors

Time for some fun word vector examples. To start we will need to extract our word embeddings from the model. We'll then read them into NumPy arrays so they can be used by SciPy, scikit-learn, etc.


In [26]:
# Extract the embedding weights from the model
embeddings = model.embeddings.weight.data

One interesting visualization is a t-SNE plot of the word vectors (note: best viewed in another tab):


In [27]:
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# If embeddings are on the GPU we need them to be on the CPU before
# converting to a numpy array.
if torch.cuda.is_available():
    embedding_array = embeddings.cpu().numpy()
else:
    embedding_array = embeddings.numpy()

# Get the t-SNE embeddings.
tsne = TSNE(n_components=2).fit_transform(embedding_array[:2000])

# Create a scatter plot
fig = plt.figure(figsize=(24, 24), dpi=300)
ax = fig.add_subplot(111)
ax.scatter(tsne[:,0], tsne[:,1], color='#ff7f0e')

# Show labels for 200 most common words
for i in range(1, 201):
    ax.text(tsne[i, 0], tsne[i, 1], reverse_vocab[i], fontsize=12, color='#1f77b4')


Another fun application is to find similar words using cosine distance. This is another good opportunity to demonstrate how math is converted to code. The cosine distance between two vectors is defined as: $$ d(w, q) = 1 - \frac{w \cdot q}{|w||q|} $$ It is often useful to break down expressions like this into multiple steps:

  • Start by normalizing the vectors (e.g., $w' = w / |w|$).
  • Then $d$ can be written as $d(w, q) = 1 - w' \cdot q'$. Breaking the expression into steps like this usually results in much easier-to-read code.

To find similar words we need the distance from the query vector $q$ to every word vector in $W$ (stored as the rows of a matrix). In this case, we can replace the dot product above with a matrix-vector multiplication: $$ d(W, q) = 1 - W'q' $$ Note that the output is a vector of distances.

This translates into the function below.


In [28]:
def cosine_distance(W, q):
    """
    Parameters:
    W : torch.FloatTensor
        shape (vocab_size, embedding_dim)
    q : torch.FloatTensor
        shape (embedding_dim)
    """
    # Normalize
    Wp = W / W.norm(dim=1, keepdim=True)
    q = q / q.norm()
    return 1 - torch.mv(Wp, q)


def similar_words(word):
    print('Most similar words to: %s' % word)
       
    # Get embedding for query word.
    word_idx = vocab[word]
    query = embeddings[word_idx]
    
    # Compute cosine distance between query word embedding and all other embeddings.
    # `torch.mv` is matrix vector multiplication.
    distance = cosine_distance(embeddings, query)
    
    # Find closest embeddings and print out corresponding words.
    closest_word_ids = distance.argsort()[:10]
    
    for i, close_word_idx in enumerate(closest_word_ids):
        print('%i - %s' % (i + 1, reverse_vocab[close_word_idx]))

In [29]:
similar_words('chaos')


Most similar words to: chaos
1 - chaos
2 - computability
3 - darwinian
4 - bcs
5 - gaia
6 - gravitation
7 - cybernetics
8 - kaluza
9 - superconductivity
10 - causal

In [30]:
similar_words('business')


Most similar words to: business
1 - business
2 - corporate
3 - financial
4 - banking
5 - retail
6 - marketing
7 - private
8 - accounting
9 - finance
10 - economics

Lastly, we'll perform the obligatory word algebra. We can use the vector representations of words and the cosine distance function to compute things like: $King - Man + Woman \approx Queen$


In [31]:
def word_algebra(a, b, c):
    a_idx = vocab[a]
    b_idx = vocab[b]
    c_idx = vocab[c]
    query = embeddings[a_idx] - embeddings[b_idx] + embeddings[c_idx]
    distance = cosine_distance(embeddings, query)
    closest_word_ids = distance.argsort()[:10]
    for i, close_word_idx in enumerate(closest_word_ids):
        print('%i - %s' % (i + 1, reverse_vocab[close_word_idx]))

In [32]:
word_algebra('king', 'man', 'woman')


1 - king
2 - kings
3 - son
4 - emperor
5 - duke
6 - queen
7 - daughter
8 - bishop
9 - vii
10 - prince