Attention-Based Classification Tutorial

Recommended time: 30 minutes

Contributors: nthain, martin-gorner

This tutorial provides an introduction to building text classification models in TensorFlow that use attention to provide insight into how classification decisions are made. We will build our TensorFlow graph following the Embed - Encode - Attend - Predict paradigm introduced by Matthew Honnibal. For more information about this approach, you can refer to:

Slides: https://goo.gl/BYT7au

Video: https://youtu.be/pzOzmxCR37I

Figure 1 below provides a representation of the full TensorFlow graph we will build in this tutorial. The green squares represent RNN cells and the blue trapezoids represent the neural networks that compute attention weights, which will be discussed in more detail below. We will implement each piece of this model graph in a separate function. The whole model then simply calls each of these functions in turn.

This tutorial was created in collaboration with the TensorFlow without a PhD series. To check out more episodes, tutorials, and codelabs from this series, please visit:

https://github.com/GoogleCloudPlatform/tensorflow-without-a-phd

Imports


In [ ]:
%load_ext autoreload
%autoreload 2

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function


import pandas as pd
import tensorflow as tf
import numpy as np
import time
import os
from sklearn import metrics
from visualize_attention import attentionDisplay
from process_figshare import download_figshare, process_figshare

tf.set_random_seed(1234)

Load & Explore Data

Let's begin by downloading the data from Figshare, then cleaning and splitting it for use in training.


In [ ]:
download_figshare()
process_figshare()

We then load these splits as pandas dataframes.


In [ ]:
SPLITS = ['train', 'dev', 'test']

wiki = {}
for split in SPLITS:
    wiki[split] = pd.read_csv('data/wiki_%s.csv' % split)

We display the top few rows of the dataframe to see what we're dealing with. The key columns are 'comment', which contains the text of a comment from a Wikipedia talk page, and 'toxicity', which contains the fraction of annotators who found the comment to be toxic. More information about the other fields and how this data was collected can be found on the project's wiki and in the accompanying research paper.


In [ ]:
wiki['train'].head()

Hyperparameters

Hyperparameters are used to specify various aspects of our model's architecture. In practice, these are often critical to model performance and are carefully tuned using some type of hyperparameter search. For this tutorial, we will choose a reasonable set of hyperparameters and treat them as fixed.


In [ ]:
hparams = {'max_document_length': 60,
           'embedding_size': 50,
           'rnn_cell_size': 128,
           'batch_size': 256,
           'attention_size': 32,
           'attention_depth': 2}

In [ ]:
MAX_LABEL = 2
WORDS_FEATURE = 'words'
NUM_STEPS = 300

Step 0: Text Preprocessing

Before we can build a neural network on comment strings, we first have to complete a number of preprocessing steps. In particular, it is important that we "tokenize" each string, splitting it into an array of tokens. In our case, each token will be a word in the sentence, with tokens separated by spaces and punctuation. Many alternative tokenizers exist: some use characters as tokens, and others include punctuation, emojis, or even cleverly handle misspellings.

Once we've tokenized the sentences, each word is replaced with an integer identifier. This will make the embedding (Step 1) much easier.

Happily, the TensorFlow class VocabularyProcessor takes care of both the tokenization and the integer mapping. We only have to give it the max_document_length argument, which determines the length of the output arrays. Sentences shorter than this length are padded and longer ones are trimmed. The VocabularyProcessor is then fit on the training set to build the initial vocabulary and map words to integers.


In [ ]:
# Initialize the vocabulary processor
vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(hparams['max_document_length'])

def process_inputs(vocab_processor, df, train_label='train', test_label='test'):

    # For simplicity, we call our features x and our outputs y
    x_train = df[train_label].comment
    y_train = df[train_label].is_toxic
    x_test = df[test_label].comment
    y_test = df[test_label].is_toxic

    # Train the vocab_processor from the training set
    x_train = vocab_processor.fit_transform(x_train)
    # Transform our test set with the vocabulary processor
    x_test = vocab_processor.transform(x_test)

    # We need these to be np.arrays instead of generators
    x_train = np.array(list(x_train))
    x_test = np.array(list(x_test))
    y_train = np.array(y_train).astype(int)
    y_test = np.array(y_test).astype(int)

    n_words = len(vocab_processor.vocabulary_)
    print('Total words: %d' % n_words)

    # Return the transformed data and the number of words
    return x_train, y_train, x_test, y_test, n_words

x_train, y_train, x_test, y_test, n_words = process_inputs(vocab_processor, wiki)
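
As a quick sanity check (not required for the rest of the tutorial), we can transform a single sentence with the now-fitted vocab_processor to see the padded array of token ids it produces. The example sentence below is just an illustration.


In [ ]:
# Transform one example sentence into its array of token ids.
example = list(vocab_processor.transform(['Thanks for your help editing this.']))[0]
print(example)        # max_document_length token ids, padded with zeros
print(example.shape)  # (60,)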

Step 1: Embed

Neural networks at their core are a composition of operators from linear algebra and non-linear activation functions. In order to perform these computations on our input sentences, we must first embed them as a vector of numbers. There are two main approaches to perform this embedding:

  1. Pre-trained: It is often beneficial to initialize our embedding matrix using pre-trained embeddings like Word2Vec or GloVe. These embeddings are trained on a huge corpus of text with a general-purpose objective, so they incorporate syntactic and semantic properties of the words being embedded and transfer well to new problems. Once initialized, you can optionally fine-tune them for your specific problem by making the embedding matrix a trainable variable in the TensorFlow graph (a sketch of this approach appears after the embed function below).
  2. Random: Alternatively, embeddings can be "trained from scratch" by initializing the embedding matrix randomly and then training it like any other parameter in the tensorflow graph.

In this notebook, we will use a random initialization. To perform this embedding we use the embed_sequence function from the layers package. It takes our input features, which are the arrays of integers we produced in Step 0, and embeds them using a randomly initialized matrix. The parameters of this matrix are then trained along with the rest of the graph.


In [ ]:
def embed(features):
    word_vectors = tf.contrib.layers.embed_sequence(
        features[WORDS_FEATURE], 
        vocab_size=n_words, 
        embed_dim=hparams['embedding_size'])
    
    return word_vectors
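
If we wanted the pre-trained approach described above instead, one option is to pass an initializer built from a matrix of pre-trained vectors to embed_sequence. The sketch below assumes a hypothetical numpy array pretrained_vectors of shape [n_words, embedding_size], e.g. assembled by looking up each vocabulary word in GloVe; it is not part of the model we train in this tutorial.


In [ ]:
def embed_pretrained(features, pretrained_vectors, trainable=True):
    # pretrained_vectors: a hypothetical [n_words, embedding_size] numpy array of
    # pre-trained word vectors, aligned with vocab_processor's vocabulary.
    word_vectors = tf.contrib.layers.embed_sequence(
        features[WORDS_FEATURE],
        vocab_size=n_words,
        embed_dim=hparams['embedding_size'],
        initializer=tf.constant_initializer(pretrained_vectors),
        trainable=trainable)  # set False to freeze the embeddings
    return word_vectors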

Step 2: Encode

A recurrent neural network (RNN) is a deep learning architecture that is useful for encoding sequential information like sentences. It is built around a single cell which contains one of several standard neural network architectures (e.g. simple RNN, GRU, or LSTM). We will not focus on the details of these architectures; what matters is that at each step in the sequence the cell takes in two inputs and produces two outputs. The inputs are the input token for that step and the state from the previous steps. The outputs are the encoded vector for the current step and a state to pass on to the next step of the sequence.

Figure 2 shows what this looks like for an unrolled RNN. Each cell (represented by a green square) has two input arrows and two output arrows. Note that all of the green squares represent the same cell and share parameters. One major advantage of this cell replication is that, at inference time, it allows us to handle inputs of arbitrary length rather than being restricted to the input sizes seen in our training set.

For our model, we will use a bi-directional RNN. This is simply the concatenation of two RNNs, one which processes the sequence from left to right (the "forward" RNN) and one which processes it from right to left (the "backward" RNN). By using both directions, we get a stronger encoding, as each word can be encoded using the context of its neighbors on both sides rather than just one side. For our cells, we use gated recurrent units (GRUs). Figure 3 gives a visual representation of this.


In [ ]:
def encode(word_vectors):
    # Create gated recurrent unit (GRU) cells with hidden size hparams['rnn_cell_size'].
    # Since the forward and backward RNNs have different parameters, we instantiate two separate GRUs.
    rnn_fw_cell = tf.contrib.rnn.GRUCell(hparams['rnn_cell_size'])
    rnn_bw_cell = tf.contrib.rnn.GRUCell(hparams['rnn_cell_size'])
    
    # Create a bi-directional RNN, unrolled to max_document_length steps,
    # feeding word_vectors as the input at each step.
    outputs, _ = tf.nn.bidirectional_dynamic_rnn(rnn_fw_cell, 
                                                 rnn_bw_cell, 
                                                 word_vectors, 
                                                 dtype=tf.float32, 
                                                 time_major=False)
    
    return outputs
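
To see what encode produces, we can build the embed and encode pieces in a throwaway graph and inspect the static shapes (a quick sanity check, separate from the model we train below): the result is a pair of tensors, one per direction, each of shape [batch_size, max_document_length, rnn_cell_size].


In [ ]:
# Sanity check (not part of the training graph): inspect the bi-RNN output shapes.
with tf.Graph().as_default():
    dummy_features = {WORDS_FEATURE: tf.placeholder(
        tf.int64, [None, hparams['max_document_length']])}
    fw_outputs, bw_outputs = encode(embed(dummy_features))
    print(fw_outputs.shape)  # (?, 60, 128) = [batch, time, rnn_cell_size]
    print(bw_outputs.shape)  # (?, 60, 128)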

Step 3: Attend

There are a number of ways to use the encoded states of a recurrent neural network for prediction. One traditional approach is to simply use the final encoded state of the network, as seen in Figure 2. However, this can lose some of the useful information encoded in the earlier steps of the sequence. In order to keep that information, one could instead use an average of the encoded states output by the RNN. There is no reason to believe, though, that all of the encoded states of the RNN are equally valuable, so we arrive at the idea of using a weighted sum of these encoded states to make our prediction.

We will call the weights of this weighted sum "attention weights", as we will see below that they correspond to how important our model thinks each token of the sequence is when making a prediction. We compute these attention weights simply by building a small fully connected neural network on top of each encoded state. This network has a single-unit final layer whose output is the attention weight we assign to that state. As with the RNN, the parameters of this network are shared across every step of the sequence, allowing us to accommodate variable-length inputs. Figure 4 shows what the graph would look like if we applied attention to a uni-directional RNN.

Again, as our model uses a bi-directional RNN, we first concatenate the hidden states from each RNN before computing the attention weights and applying the weighted sum. Figure 5 below visualizes this step.


In [ ]:
def attend(inputs, attention_size, attention_depth):

  # Concatenate the forward and backward RNN outputs along the feature axis.
  inputs = tf.concat(inputs, axis=2)

  inputs_shape = inputs.shape
  sequence_length = inputs_shape[1].value
  final_layer_size = inputs_shape[2].value

  # Run a small fully connected network over every encoded state to produce
  # one raw attention score (logit) per step of the sequence.
  x = tf.reshape(inputs, [-1, final_layer_size])
  for _ in range(attention_depth - 1):
    x = tf.layers.dense(x, attention_size, activation=tf.nn.relu)
  x = tf.layers.dense(x, 1, activation=None)
  logits = tf.reshape(x, [-1, sequence_length, 1])

  # A softmax over the sequence dimension turns the scores into attention weights.
  alphas = tf.nn.softmax(logits, dim=1)

  # The weighted sum of encoded states is the representation we classify.
  output = tf.reduce_sum(inputs * alphas, 1)

  return output, alphas
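
To make the weighted sum concrete, here is a tiny numpy illustration (a sketch, separate from the graph above) of what attend computes for a single sequence: raw scores become softmax weights, and the encoded states are averaged using those weights.


In [ ]:
# Toy example: three encoded states of size 4 and one raw attention score per step.
toy_states = np.array([[1.0, 0.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0, 0.0],
                       [0.0, 0.0, 1.0, 1.0]])      # [sequence_length, hidden_size]
toy_scores = np.array([0.1, 2.0, 0.5])

toy_alphas = np.exp(toy_scores) / np.exp(toy_scores).sum()   # softmax over the sequence
toy_output = (toy_states * toy_alphas[:, None]).sum(axis=0)  # weighted sum of states

print(toy_alphas)   # attention weights, summing to 1
print(toy_output)   # the vector handed to the prediction step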

Step 4: Predict

To generate a class prediction about whether a comment is toxic or not, the final part of our TensorFlow graph takes the weighted average of hidden states produced in the attention step and uses a fully connected layer with a softmax activation to generate probability scores for each of our prediction classes. During training, the model uses the cross-entropy loss function to learn its parameters.

As we will use the estimator framework to train our model, we write an estimator_spec function to specify how our model is trained and what values to return during the prediction stage. We also specify the evaluation metrics of accuracy and auc, which we will use to evaluate our model in Step 7.


In [ ]:
def estimator_spec_for_softmax_classification(
    logits, labels, mode, alphas):
  """Returns EstimatorSpec instance for softmax classification."""
  predicted_classes = tf.argmax(logits, 1)
  if mode == tf.estimator.ModeKeys.PREDICT:
    return tf.estimator.EstimatorSpec(
        mode=mode,
        predictions={
            'class': predicted_classes,
            'prob': tf.nn.softmax(logits),
            'attention': alphas
        })

  onehot_labels = tf.one_hot(labels, MAX_LABEL, 1, 0)
  loss = tf.losses.softmax_cross_entropy(
      onehot_labels=onehot_labels, logits=logits)
  if mode == tf.estimator.ModeKeys.TRAIN:
    optimizer = tf.train.AdamOptimizer(learning_rate=0.01)
    train_op = optimizer.minimize(loss, 
                                  global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, 
                                      loss=loss, 
                                      train_op=train_op)

  eval_metric_ops = {
      'accuracy': tf.metrics.accuracy(
          labels=labels, predictions=predicted_classes),
      'auc': tf.metrics.auc(
          labels=labels, predictions=predicted_classes),    
  }
  return tf.estimator.EstimatorSpec(
      mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)
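
To make the loss concrete, here is a small numpy illustration (a sketch, separate from the graph) of what the softmax and cross-entropy compute for a single comment with logits over the two classes [non-toxic, toxic].


In [ ]:
toy_logits = np.array([1.0, 3.0])                          # logits for one comment
toy_probs = np.exp(toy_logits) / np.exp(toy_logits).sum()  # softmax probabilities
toy_label = 1                                              # the comment is labeled toxic
toy_loss = -np.log(toy_probs[toy_label])                   # cross-entropy for this example
print(toy_probs, toy_loss)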

The predict component of our graph then takes the output of the attention step, i.e. the weighted average of the bi-RNN hidden states, and adds one more fully connected layer to compute the logits. These logits are fed into our estimator_spec, which uses a softmax to get the final class probabilities and softmax_cross_entropy to build the loss function.


In [ ]:
def predict(encoding, labels, mode, alphas):
    logits = tf.layers.dense(encoding, MAX_LABEL, activation=None)
    return estimator_spec_for_softmax_classification(
          logits=logits, labels=labels, mode=mode, alphas=alphas)

Step 5: Complete Model Architecture

We are now ready to put it all together. As you can see from the bi_rnn_model function below, once you have the components for embed, encode, attend, and predict, putting the whole graph together is extremely simple!


In [ ]:
def bi_rnn_model(features, labels, mode):
  """RNN model to predict from sequence of words to a class."""

  word_vectors = embed(features)
  outputs = encode(word_vectors)
  encoding, alphas = attend(outputs, 
                            hparams['attention_size'], 
                            hparams['attention_depth'])

  return predict(encoding, labels, mode, alphas)

Step 6: Train Model

We will use the estimator framework to train our model. To define our classifier, we just provide it with the complete model graph (i.e. the bi_rnn_model function) and a directory where the models will be saved.


In [ ]:
current_time = str(int(time.time()))
model_dir = os.path.join('checkpoints', current_time)
classifier = tf.estimator.Estimator(model_fn=bi_rnn_model, 
                                    model_dir=model_dir)

The estimator framework also requires us to define an input function. This takes the input data and provides it during model training in batches. We will use the provided numpy_input_fn, which takes numpy arrays as features and labels. We also specify the batch size and whether we want to shuffle the data between epochs.


In [ ]:
# Train.
train_input_fn = tf.estimator.inputs.numpy_input_fn(
  x={WORDS_FEATURE: x_train},
  y=y_train,
  batch_size=hparams['batch_size'],
  num_epochs=None,
  shuffle=True)

Now, it's finally time to train our model! With an estimator, this is as easy as calling the train function and specifying how many steps we'd like to train for.


In [ ]:
classifier.train(input_fn=train_input_fn, 
                 steps=NUM_STEPS)

Step 7: Predict and Evaluate Model

To evaluate the model, we will use it to predict labels for examples from our test set. Again, we define a numpy_input_fn, this time for the test data, and then have the classifier run predictions on this input function.


In [ ]:
# Predict.
test_input_fn = tf.estimator.inputs.numpy_input_fn(
  x={WORDS_FEATURE: x_test},
  y=y_test,
  num_epochs=1,
  shuffle=False)

predictions = classifier.predict(input_fn=test_input_fn)

These predictions are returned to us as a generator. The code below gives an example of how we can extract the class and attention weights for each prediction.


In [ ]:
y_predicted = []
alphas_predicted = []
for p in predictions:
    y_predicted.append(p['class'])
    alphas_predicted.append(p['attention'])

To evaluate our model, we can use the evaluate function provided by estimator to get the accuracy and ROC-AUC scores as we defined them in our estimator_spec.


In [ ]:
scores = classifier.evaluate(input_fn=test_input_fn)
print('Accuracy: {0:f}'.format(scores['accuracy']))
print('AUC: {0:f}'.format(scores['auc']))
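
Since we imported sklearn.metrics at the top of the notebook, we can also cross-check by scoring the predicted toxicity probabilities directly (a sketch: it re-runs prediction to collect the 'prob' output, because the predictions generator above has already been consumed). Note that this AUC is computed from probabilities, whereas the 'auc' metric above is computed from the hard class predictions, so the two numbers may differ.


In [ ]:
# Probability of the toxic class (index 1) for each test example.
probs_toxic = [p['prob'][1] for p in classifier.predict(input_fn=test_input_fn)]
print('sklearn ROC-AUC: {0:f}'.format(metrics.roc_auc_score(y_test, probs_toxic)))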

Step 8: Display Attention

Now that we have a trained attention-based toxicity model, let's use it to visualize how it makes its classification decisions. We use the helpful attentionDisplay class from the visualize_attention package. Given any sentence, this class uses our trained classifier to determine whether the sentence is toxic and also returns a representation of the attention weights. In the arrays below, the redder a word, the more weight the classifier puts on that word's encoding. Try it out on some sentences of your own and see what patterns you can find!

Note: If you are viewing this on Github, the colors in the cells won't display properly. We recommend viewing it locally or with nbviewer to see the correct rendering of the attention weights.


In [ ]:
display = attentionDisplay(vocab_processor, classifier)

In [ ]:
display.display_prediction_attention("Fuck off, you idiot.")

In [ ]:
display.display_prediction_attention("Thanks for your help editing this.")

In [ ]:
display.display_prediction_attention("You're such an asshole. But thanks anyway.")

In [ ]:
display.display_prediction_attention("I'm going to shoot you!")

In [ ]:
display.display_prediction_attention("Oh shoot. Well alright.")

In [ ]:
display.display_prediction_attention("First of all who the fuck died and made you the god.")

In [ ]:
display.display_prediction_attention("Gosh darn it!")

In [ ]:
display.display_prediction_attention("God damn it!")

In [ ]:
display.display_prediction_attention("You're not that smart are you?")

In [ ]: