Introduction

In this notebook we will reproduce the results of Deep Speech: Scaling up end-to-end speech recognition. The core of the system is a bidirectional recurrent neural network (BRNN) trained to ingest speech spectrograms and generate English text transcriptions.

Let a single utterance $x$ and label $y$ be sampled from a training set $S = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots\}$. Each utterance, $x^{(i)}$, is a time-series of length $T^{(i)}$ where every time-slice is a vector of audio features, $x^{(i)}_t$, $t=1,\ldots,T^{(i)}$. We use MFCCs (Mel-frequency cepstral coefficients) as our features; so $x^{(i)}_{t,p}$ denotes the $p$-th MFCC feature in the audio frame at time $t$. The goal of our BRNN is to convert an input sequence $x$ into a sequence of character probabilities for the transcription $y$, with $\hat{y}_t =\mathbb{P}(c_t \mid x)$, where $c_t \in \{a, b, c, \ldots, z, space, apostrophe, blank\}$. (The significance of $blank$ will be explained below.)
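
To make the notation concrete, here is a minimal sketch of the shapes involved, assuming 26 MFCC features and an invented utterance length of 150 frames; the array contents are random placeholders, not real audio features or model outputs.

import numpy as np

# Hypothetical utterance: T = 150 time-slices, each a vector of 26 MFCC features
T, n_mfcc = 150, 26
x = np.random.randn(T, n_mfcc)              # x[t, p] is the p-th MFCC feature at time t

# One probability distribution per time-slice over the 29 characters
# {a, ..., z, space, apostrophe, blank}; random placeholder values here
y_hat = np.random.rand(T, 29)
y_hat /= y_hat.sum(axis=1, keepdims=True)   # each row now sums to 1
assert np.allclose(y_hat.sum(axis=1), 1.0)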

Our BRNN model is composed of $5$ layers of hidden units. For an input $x$, the hidden units at layer $l$ are denoted $h^{(l)}$ with the convention that $h^{(0)}$ is the input. The first three layers are not recurrent. For the first layer, at each time $t$, the output depends on the MFCC frame $x_t$ along with a context of $C$ frames on each side. (We typically use $C \in \{5, 7, 9\}$ for our experiments.) The remaining non-recurrent layers operate on independent data for each time step. Thus, for each time $t$, the first $3$ layers are computed by:

$$h^{(l)}_t = g(W^{(l)} h^{(l-1)}_t + b^{(l)})$$

where $g(z) = \min\{\max\{0, z\}, 20\}$ is a clipped rectified-linear (ReLU) activation function and $W^{(l)}$, $b^{(l)}$ are the weight matrix and bias parameters for layer $l$. The fourth layer is a bidirectional recurrent layer[1]. This layer includes two sets of hidden units: a set with forward recurrence, $h^{(f)}$, and a set with backward recurrence, $h^{(b)}$:

$$h^{(f)}_t = g(W^{(4)} h^{(3)}_t + W^{(f)}_r h^{(f)}_{t-1} + b^{(4)})$$

$$h^{(b)}_t = g(W^{(4)} h^{(3)}_t + W^{(b)}_r h^{(b)}_{t+1} + b^{(4)})$$

Note that $h^{(f)}$ must be computed sequentially from $t = 1$ to $t = T^{(i)}$ for the $i$-th utterance, while the units $h^{(b)}$ must be computed sequentially in reverse from $t = T^{(i)}$ to $t = 1$.
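
The following toy sketch (plain NumPy, invented dimensions and random parameters) makes this sequential dependence explicit; note that it uses the simple recurrence above, whereas the TensorFlow implementation below uses LSTM cells for this layer.

import numpy as np

def clipped_relu(z, clip=20.0):
    return np.minimum(np.maximum(0.0, z), clip)

T, n3, nc = 5, 8, 8                       # toy sizes: time steps, layer-3 width, recurrent width
h3 = np.random.randn(T, n3)               # stand-in for the outputs of layer 3
W4 = np.random.randn(nc, n3) * 0.1
Wf = np.random.randn(nc, nc) * 0.1        # forward recurrent weights W^(f)_r
Wb = np.random.randn(nc, nc) * 0.1        # backward recurrent weights W^(b)_r
b4 = np.zeros(nc)

h_f = np.zeros((T, nc))
h_b = np.zeros((T, nc))
for t in range(T):                        # forward pass: t = 1, ..., T
    prev = h_f[t - 1] if t > 0 else np.zeros(nc)
    h_f[t] = clipped_relu(W4.dot(h3[t]) + Wf.dot(prev) + b4)
for t in reversed(range(T)):              # backward pass: t = T, ..., 1
    nxt = h_b[t + 1] if t < T - 1 else np.zeros(nc)
    h_b[t] = clipped_relu(W4.dot(h3[t]) + Wb.dot(nxt) + b4)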

The fifth (non-recurrent) layer takes both the forward and backward units as inputs

$$h^{(5)} = g(W^{(5)} h^{(4)} + b^{(5)})$$

where $h^{(4)} = h^{(f)} + h^{(b)}$. The output layer consists of standard logits that correspond to the predicted character probabilities for each time slice $t$ and character $k$ in the alphabet:

$$h^{(6)}_{t,k} = \hat{y}_{t,k} = (W^{(6)} h^{(5)}_t)_k + b^{(6)}_k$$

Here $b^{(6)}_k$ denotes the $k$-th bias and $(W^{(6)} h^{(5)}_t)_k$ the $k$-th element of the matrix product.
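
The logits can be converted into the character probabilities $\hat{y}_{t,k}$ with a softmax. The TensorFlow CTC loss used later takes unnormalised logits and applies this normalisation internally, so the following NumPy sketch (random logits, for illustration only) is never needed in the training graph itself:

import numpy as np

logits_t = np.random.randn(29)                # h^(6)_t for one time-slice, 29 characters
probs_t = np.exp(logits_t - logits_t.max())   # numerically stable softmax
probs_t /= probs_t.sum()
assert abs(probs_t.sum() - 1.0) < 1e-6        # a distribution over the alphabet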

Once we have computed a prediction for $\hat{y}_{t,k}$, we compute the CTC loss[2] $\mathcal{L}(\hat{y}, y)$ to measure the error in prediction. During training, we can evaluate the gradient $\nabla \mathcal{L}(\hat{y}, y)$ with respect to the network outputs given the ground-truth character sequence $y$. From this point, computing the gradient with respect to all of the model parameters may be done via back-propagation through the rest of the network. We use the Adam method for training[3].

The complete BRNN model is illustrated in the figure below.

Preliminaries

Imports

Here we first import all of the packages we require to implement the DeepSpeech BRNN.


In [ ]:
import os
import time
import json
import datetime
import tempfile
import subprocess
import numpy as np
from math import ceil
from xdg import BaseDirectory as xdg
import tensorflow as tf
from util.log import merge_logs
from util.gpu import get_available_gpus
from util.shared_lib import check_cupti
from util.text import sparse_tensor_value_to_texts, wers
from tensorflow.python.ops import ctc_ops
from tensorflow.contrib.session_bundle import exporter

ds_importer = os.environ.get('ds_importer', 'ted')
ds_dataset_path = os.environ.get('ds_dataset_path', os.path.join('./data', ds_importer))

import importlib
ds_importer_module = importlib.import_module('util.importers.%s' % ds_importer)

from util.website import maybe_publish

do_fulltrace = bool(len(os.environ.get('ds_do_fulltrace', '')))

if do_fulltrace:
    check_cupti()

Global Constants

Next we introduce several constants used in the algorithm below. In particular, we define

  • learning_rate - The learning rate we will employ in the Adam optimizer[3]
  • training_iters - The number of iterations we will train for
  • batch_size - The number of elements in a batch
  • display_step - The number of epochs we cycle through before displaying progress
  • checkpoint_step - The number of epochs we cycle through before checkpointing the model
  • checkpoint_dir - The directory in which checkpoints are stored
  • export_dir - The directory in which exported models are stored
  • export_version - The version number of the exported model
  • default_stddev - The default standard deviation to use when initialising weights and biases
  • [bh][12356]_stddev - Individual standard deviations to use when initialising particular weights and biases

In [ ]:
learning_rate = float(os.environ.get('ds_learning_rate', 0.001)) # TODO: Determine a reasonable value for this
beta1 = float(os.environ.get('ds_beta1', 0.9))                   # TODO: Determine a reasonable value for this
beta2 = float(os.environ.get('ds_beta2', 0.999))                 # TODO: Determine a reasonable value for this
epsilon = float(os.environ.get('ds_epsilon', 1e-8))              # TODO: Determine a reasonable value for this
training_iters = int(os.environ.get('ds_training_iters', 15))    # TODO: Determine a reasonable value for this
batch_size = int(os.environ.get('ds_batch_size', 64))            # TODO: Determine a reasonable value for this
display_step = int(os.environ.get('ds_display_step', 1))         # TODO: Determine a reasonable value for this
validation_step = int(os.environ.get('ds_validation_step', 1))   # TODO: Determine a reasonable value for this
checkpoint_step = int(os.environ.get('ds_checkpoint_step', 5))   # TODO: Determine a reasonable value for this
checkpoint_dir = os.environ.get('ds_checkpoint_dir', xdg.save_data_path('deepspeech'))
export_dir = os.environ.get('ds_export_dir', None)
export_version = 1
use_warpctc = bool(len(os.environ.get('ds_use_warpctc', '')))
default_stddev = float(os.environ.get('ds_default_stddev', 0.1))
for var in ['b1', 'h1', 'b2', 'h2', 'b3', 'h3', 'b5', 'h5', 'b6', 'h6']:
    locals()['%s_stddev' % var] = float(os.environ.get('ds_%s_stddev' % var, default_stddev))

Note that we use the Adam optimizer[3] instead of Nesterov’s Accelerated Gradient[4], which was used in the original Deep Speech paper, as, at the time of writing, TensorFlow does not provide an implementation of Nesterov’s Accelerated Gradient.

As we will also employ dropout on the feedforward layers of the network, we need to define a parameter dropout_rate that keeps track of the dropout rate for these layers


In [ ]:
dropout_rate = float(os.environ.get('ds_dropout_rate', 0.05))  # TODO: Validate this is a reasonable value

# This global placeholder will be used for all dropout definitions
dropout_rate_placeholder = tf.placeholder(tf.float32)

# The feed_dict used for training employs the given dropout_rate
feed_dict_train = { dropout_rate_placeholder: dropout_rate }

# While the feed_dict used for validation, test and train progress reporting employs zero dropout
feed_dict = { dropout_rate_placeholder: 0.0 }

One more constant required by the non-recurrent layers is the clipping value of the ReLU. We capture that in the value of the variable relu_clip


In [ ]:
relu_clip = int(os.environ.get('ds_relu_clip', 20)) # TODO: Validate this is a reasonable value
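
As a small illustration, the clipped ReLU $g(z) = \min\{\max\{0, z\}, 20\}$ from the introduction can be written directly in terms of relu_clip; this NumPy sketch is for reference only, as the model below uses the equivalent TensorFlow ops tf.minimum and tf.nn.relu:

import numpy as np

def clipped_relu(z, clip=relu_clip):
    # g(z) = min(max(0, z), relu_clip)
    return np.minimum(np.maximum(0.0, z), clip)

assert clipped_relu(-3.0) == 0.0
assert clipped_relu(7.5) == 7.5
assert clipped_relu(100.0) == relu_clip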

Geometric Constants

Now we will introduce several constants related to the geometry of the network.

The network views each speech sample as a sequence of time-slices $x^{(i)}_t$ of length $T^{(i)}$. As the speech samples vary in length, we know that $T^{(i)}$ need not equal $T^{(j)}$ for $i \ne j$. For each batch, the BRNN in TensorFlow needs to know n_steps, which is the maximum $T^{(i)}$ for the batch.

Each of the at most n_steps vectors is a vector of MFCC features of a time-slice of the speech sample. We make the number of MFCC features dependent upon the sample rate of the data set: as a rule of thumb, if the sample rate is 8kHz we use 13 features, and if the sample rate is 16kHz we use 26 features. We capture the dimension of these vectors, equivalently the number of MFCC features, in the variable n_input


In [ ]:
n_input = 26 # TODO: Determine this programmatically from the sample rate
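
The TODO above could, for instance, be resolved with a small helper that maps the sample rate to the feature count following the 8kHz/16kHz rule of thumb; this is a sketch only, and sample_rate is a hypothetical value the current importers do not expose here:

# Hypothetical helper; the notebook currently hard-codes n_input = 26 for 16kHz data
def n_input_for_sample_rate(sample_rate):
    if sample_rate <= 8000:
        return 13   # 8kHz data: 13 MFCC features
    return 26       # 16kHz data: 26 MFCC features

assert n_input_for_sample_rate(8000) == 13
assert n_input_for_sample_rate(16000) == 26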

As previously mentioned, the BRNN is not simply fed the MFCC features of a given time-slice. It is fed, in addition, a context of $C \in \{5, 7, 9\}$ frames on either side of the frame in question. The number of frames in this context is captured in the variable n_context


In [ ]:
n_context = 9 # TODO: Determine the optimal value using a validation data set
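
To make the context window concrete, the following NumPy sketch assembles, for each time-slice, the 2*n_context + 1 frames that are concatenated into one input vector. It pads the utterance boundaries with zeros, which may differ from what the actual importers do; it is an illustration, not the import code.

import numpy as np

def add_context(mfccs, context):
    # mfccs: [T, n_input]  ->  [T, n_input * (2*context + 1)]
    T, n_feat = mfccs.shape
    padding = np.zeros((context, n_feat))
    padded = np.vstack([padding, mfccs, padding])
    return np.array([padded[t:t + 2*context + 1].flatten() for t in range(T)])

dummy_utterance = np.random.randn(100, n_input)        # a fake 100-frame utterance
windowed = add_context(dummy_utterance, n_context)
assert windowed.shape == (100, n_input + 2*n_input*n_context)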

Next we will introduce constants that specify the geometry of some of the non-recurrent layers of the network. We do this by simply specifying the number of units in each of the layers


In [ ]:
n_hidden_1 = n_input + 2*n_input*n_context # Note: This value was not specified in the original paper
n_hidden_2 = n_input + 2*n_input*n_context # Note: This value was not specified in the original paper
n_hidden_5 = n_input + 2*n_input*n_context # Note: This value was not specified in the original paper

where n_hidden_1 is the number of units in the first layer, n_hidden_2 the number of units in the second, and n_hidden_5 the number in the fifth. We haven't forgotten about the third or sixth layer. We will define their unit count below.

An LSTM BRNN consists of a pair of LSTM RNNs: one that works "forward in time" and a second that works "backwards in time".

The dimension of the cell state, the internal memory carried between successive LSTM units, is independent of the input dimension and the same for both the forward and backward LSTM RNNs.

Hence, we are free to choose the dimension of this cell state independent of the input dimension. We capture the cell state dimension in the variable n_cell_dim.


In [ ]:
n_cell_dim = n_input + 2*n_input*n_context # TODO: Is this a reasonable value

The number of units in the third layer, which feeds into the LSTM, is determined by n_cell_dim as follows


In [ ]:
n_hidden_3 = 2 * n_cell_dim

Next, we introduce an additional variable n_character which holds the number of characters in the target language plus one, for the $blank$. For English it is the cardinality of the set $\{a, b, c, \ldots, z, space, apostrophe, blank\}$ we referred to earlier.


In [ ]:
n_character = 29 # TODO: Determine if this should be extended with other punctuation

The number of units in the sixth layer is determined by n_character as follows


In [ ]:
n_hidden_6 = n_character

Graph Creation

Next we concern ourselves with graph creation.

However, before we do so we must introduce a utility function variable_on_cpu() used to create a variable in CPU memory.


In [ ]:
def variable_on_cpu(name, shape, initializer):
    # Use the /cpu:0 device for scoped operations
    with tf.device('/cpu:0'):
        # Create or retrieve the named variable
        var = tf.get_variable(name=name, shape=shape, initializer=initializer)
    return var
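
For example, a bias vector of ten units could be pinned to CPU memory as follows; the name and shape here are purely illustrative and are not used by the model:

# Illustrative only: a [10]-dimensional variable created in CPU memory
example_bias = variable_on_cpu('example_bias', [10], tf.random_normal_initializer(stddev=default_stddev))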

That done, we will define the learned variables, the weights and biases, within the method BiRNN() which also constructs the neural network. The variables named hn, where n is an integer, hold the learned weight variables. The variables named bn, where n is an integer, hold the learned bias variables.

In particular, the first variable h1 holds the learned weight matrix that converts an input vector of dimension n_input + 2*n_input*n_context to a vector of dimension n_hidden_1. Similarly, the second variable h2 holds the weight matrix converting an input vector of dimension n_hidden_1 to one of dimension n_hidden_2. The variables h3, h5, and h6 are similar. Likewise, the biases, b1, b2..., hold the biases for the various layers.

That said, let us introduce the method BiRNN() that takes a batch of data batch_x and performs inference upon it.


In [ ]:
def BiRNN(batch_x, seq_length):
    # Input shape: [batch_size, n_steps, n_input + 2*n_input*n_context]
    batch_x_shape = tf.shape(batch_x)
    # Permute n_steps and batch_size
    batch_x = tf.transpose(batch_x, [1, 0, 2])
    # Reshape to prepare input for first layer
    batch_x = tf.reshape(batch_x, [-1, n_input + 2*n_input*n_context]) # (n_steps*batch_size, n_input + 2*n_input*n_context)
    
    #Hidden layer with clipped RELU activation and dropout
    b1 = variable_on_cpu('b1', [n_hidden_1], tf.random_normal_initializer(stddev=b1_stddev))
    h1 = variable_on_cpu('h1', [n_input + 2*n_input*n_context, n_hidden_1], tf.random_normal_initializer(stddev=h1_stddev))
    layer_1 = tf.minimum(tf.nn.relu(tf.add(tf.matmul(batch_x, h1), b1)), relu_clip)
    layer_1 = tf.nn.dropout(layer_1, (1.0 - dropout_rate_placeholder))
    #Hidden layer with clipped RELU activation and dropout
    b2 = variable_on_cpu('b2', [n_hidden_2], tf.random_normal_initializer(stddev=b2_stddev))
    h2 = variable_on_cpu('h2', [n_hidden_1, n_hidden_2], tf.random_normal_initializer(stddev=h2_stddev))
    layer_2 = tf.minimum(tf.nn.relu(tf.add(tf.matmul(layer_1, h2), b2)), relu_clip)
    layer_2 = tf.nn.dropout(layer_2, (1.0 - dropout_rate_placeholder))
    #Hidden layer with clipped RELU activation and dropout
    b3 = variable_on_cpu('b3', [n_hidden_3], tf.random_normal_initializer(stddev=b3_stddev))
    h3 = variable_on_cpu('h3', [n_hidden_2, n_hidden_3], tf.random_normal_initializer(stddev=h3_stddev))
    layer_3 = tf.minimum(tf.nn.relu(tf.add(tf.matmul(layer_2, h3), b3)), relu_clip)
    layer_3 = tf.nn.dropout(layer_3, (1.0 - dropout_rate_placeholder))
    
    # Define lstm cells with tensorflow
    # Forward direction cell
    lstm_fw_cell = tf.nn.rnn_cell.BasicLSTMCell(n_cell_dim, forget_bias=1.0)
    # Backward direction cell
    lstm_bw_cell = tf.nn.rnn_cell.BasicLSTMCell(n_cell_dim, forget_bias=1.0)
    
    # Reshape data because rnn cell expects shape [max_time, batch_size, input_size]
    layer_3 = tf.reshape(layer_3, [-1, batch_x_shape[0], n_hidden_3])

    # Get lstm cell output
    outputs, output_states = tf.nn.bidirectional_dynamic_rnn(cell_fw=lstm_fw_cell,
                                                             cell_bw=lstm_bw_cell,
                                                             inputs=layer_3,
                                                             dtype=tf.float32,
                                                             time_major=True,
                                                             sequence_length=seq_length)
    
    # Reshape outputs from two tensors each of shape [n_steps, batch_size, n_cell_dim]
    # to a single tensor of shape [n_steps*batch_size, 2*n_cell_dim]
    outputs = tf.concat(2, outputs)
    outputs = tf.reshape(outputs, [-1, 2*n_cell_dim])
    
    #Hidden layer with clipped RELU activation and dropout
    b5 = variable_on_cpu('b5', [n_hidden_5], tf.random_normal_initializer(stddev=b5_stddev))
    h5 = variable_on_cpu('h5', [(2 * n_cell_dim), n_hidden_5], tf.random_normal_initializer(stddev=h5_stddev))
    layer_5 = tf.minimum(tf.nn.relu(tf.add(tf.matmul(outputs, h5), b5)), relu_clip)
    layer_5 = tf.nn.dropout(layer_5, (1.0 - dropout_rate_placeholder))
    #Hidden layer of logits
    b6 = variable_on_cpu('b6', [n_hidden_6], tf.random_normal_initializer(stddev=b6_stddev))
    h6 = variable_on_cpu('h6', [n_hidden_5, n_hidden_6], tf.random_normal_initializer(stddev=h6_stddev))
    layer_6 = tf.add(tf.matmul(layer_5, h6), b6)
    
    # Reshape layer_6 from a tensor of shape [n_steps*batch_size, n_hidden_6]
    # to a tensor of shape [n_steps, batch_size, n_hidden_6]
    layer_6 = tf.reshape(layer_6, [-1, batch_x_shape[0], n_hidden_6])
    
    # Return layer_6
    # Output shape: [n_steps, batch_size, n_hidden_6]
    return layer_6

The first few lines of the function BiRNN

def BiRNN(batch_x, seq_length):
    # Input shape: [batch_size, n_steps, n_input + 2*n_input*n_context]
    batch_x_shape = tf.shape(batch_x)
    # Permute n_steps and batch_size
    batch_x = tf.transpose(batch_x, [1, 0, 2])
    # Reshape to prepare input for first layer
    batch_x = tf.reshape(batch_x, [-1, n_input + 2*n_input*n_context])
    ...

transpose and reshape batch_x, which initially has shape [batch_size, n_steps, n_input + 2*n_input*n_context], into a tensor with shape [n_steps*batch_size, n_input + 2*n_input*n_context]. This is done to prepare the batch for input into the first layer, which expects a tensor of rank 2.
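
The effect of the transpose and reshape on the shapes can be mimicked in NumPy (invented batch and step counts; TensorFlow performs the same manipulation symbolically on tensors whose n_steps is only known at run time):

import numpy as np

example_batch_size, example_n_steps = 4, 10
width = n_input + 2*n_input*n_context
batch = np.zeros((example_batch_size, example_n_steps, width))

batch = batch.transpose([1, 0, 2])        # -> [n_steps, batch_size, width]
batch = batch.reshape((-1, width))        # -> [n_steps*batch_size, width]
assert batch.shape == (example_n_steps * example_batch_size, width)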

The next few lines of BiRNN

#Hidden layer with clipped RELU activation and dropout
    b1 = variable_on_cpu('b1', [n_hidden_1], tf.random_normal_initializer())
    h1 = variable_on_cpu('h1', [n_input + 2*n_input*n_context, n_hidden_1], tf.random_normal_initializer())
    layer_1 = tf.minimum(tf.nn.relu(tf.add(tf.matmul(batch_x, h1), b1)), relu_clip)
    layer_1 = tf.nn.dropout(layer_1, (1.0 - dropout_rate_placeholder))
    ...

pass batch_x through the first layer of the non-recurrent network and then apply dropout to the result.

The next few lines do the same thing, but for the second and third layers

#Hidden layer with clipped RELU activation and dropout
    b2 = variable_on_cpu('b2', [n_hidden_2], tf.random_normal_initializer())
    h2 = variable_on_cpu('h2', [n_hidden_1, n_hidden_2], tf.random_normal_initializer())
    layer_2 = tf.minimum(tf.nn.relu(tf.add(tf.matmul(layer_1, h2), b2)), relu_clip)   
    layer_2 = tf.nn.dropout(layer_2, (1.0 - dropout_rate_placeholder))
    #Hidden layer with clipped RELU activation and dropout
    b3 = variable_on_cpu('b3', [n_hidden_3], tf.random_normal_initializer())
    h3 = variable_on_cpu('h3', [n_hidden_2, n_hidden_3], tf.random_normal_initializer())
    layer_3 = tf.minimum(tf.nn.relu(tf.add(tf.matmul(layer_2, h3), b3)), relu_clip)
    layer_3 = tf.nn.dropout(layer_3, (1.0 - dropout_rate_placeholder))

Next we create the forward and backward LSTM units

# Define lstm cells with tensorflow
    # Forward direction cell
    lstm_fw_cell = tf.nn.rnn_cell.BasicLSTMCell(n_cell_dim, forget_bias=1.0)
    # Backward direction cell
    lstm_bw_cell = tf.nn.rnn_cell.BasicLSTMCell(n_cell_dim, forget_bias=1.0)

both of which have a cell state of dimension n_cell_dim and a bias of 1.0 for the forget gate of the LSTM.

The next line of the function BiRNN does a bit more data preparation.

# Reshape data because rnn cell expects shape [max_time, batch_size, input_size]
    layer_3 = tf.reshape(layer_3, [-1, batch_x_shape[0], n_hidden_3])

It reshapes layer_3 into [n_steps, batch_size, n_hidden_3] (recall that n_hidden_3 = 2*n_cell_dim), as the LSTM BRNN expects its input to be of shape [max_time, batch_size, input_size].

The next line of BiRNN

# Get lstm cell output
    outputs, output_states = tf.nn.bidirectional_dynamic_rnn(cell_fw=lstm_fw_cell,
                                                             cell_bw=lstm_bw_cell,
                                                             inputs=layer_3,
                                                             dtype=tf.float32,
                                                             time_major=True,
                                                             sequence_length=seq_length)

feeds layer_3 to the LSTM BRNN cell and obtains the LSTM BRNN output.

The next lines concatenate the two rank-three output tensors along the feature dimension and reshape the result into a single rank-two tensor in preparation for passing it to the next neural network layer

# Reshape outputs from two tensors each of shape [n_steps, batch_size, n_cell_dim]
    # to a single tensor of shape [n_steps*batch_size, 2*n_cell_dim]
    outputs = tf.concat(2, outputs)
    outputs = tf.reshape(outputs, [-1, 2*n_cell_dim])

The next couple of lines feed outputs to the fifth hidden layer

#Hidden layer with clipped RELU activation and dropout
    b5 = variable_on_cpu('b5', [n_hidden_5], tf.random_normal_initializer())
    h5 = variable_on_cpu('h5', [(2 * n_cell_dim), n_hidden_5], tf.random_normal_initializer())
    layer_5 = tf.minimum(tf.nn.relu(tf.add(tf.matmul(outputs, h5), b5)), relu_clip)
    layer_5 = tf.nn.dropout(layer_5, (1.0 - dropout_rate_placeholder))

The next line of BiRNN

#Hidden layer of logits
    b6 = variable_on_cpu('b6', [n_hidden_6], tf.random_normal_initializer())
    h6 = variable_on_cpu('h6', [n_hidden_5, n_hidden_6], tf.random_normal_initializer())
    layer_6 = tf.add(tf.matmul(layer_5, h6), b6)

applies the weight matrix h6 and bias b6 to the output of layer_5, creating n_hidden_6 dimensional vectors, the logits.

The next lines of BiRNN

# Reshape layer_6 from a tensor of shape [n_steps*batch_size, n_hidden_6]
    # to a tensor of shape [n_steps, batch_size, n_hidden_6]
    layer_6 = tf.reshape(layer_6, [-1, batch_x_shape[0], n_hidden_6])

reshapes layer_6 to the slightly more useful shape [n_steps, batch_size, n_hidden_6]. Note that this differs from the input in that it is time-major.

The final line of BiRNN returns layer_6

# Return layer_6
    # Output shape: [n_steps, batch_size, n_hidden_6]
    return layer_6

Accuracy and Loss

In accord with Deep Speech: Scaling up end-to-end speech recognition, the loss function used by our network should be the CTC loss function[2]. Conveniently, this loss function is implemented in TensorFlow. Thus, we can simply make use of this implementation to define our loss.

To do so, we introduce a utility function calculate_accuracy_and_loss() that beam-search decodes a mini-batch and calculates the loss and accuracy. In addition to the total and average loss, it returns the accuracy, the decoded result, and the batch's original Y.


In [ ]:
def calculate_accuracy_and_loss(batch_set):
    # Obtain the next batch of data
    batch_x, batch_seq_len, batch_y = batch_set.next_batch()

    # Calculate the logits of the batch using BiRNN
    logits = BiRNN(batch_x, tf.to_int64(batch_seq_len))
    
    # Compute the CTC loss
    if use_warpctc:
        total_loss = tf.contrib.warpctc.warp_ctc_loss(logits, batch_y, batch_seq_len)
    else:
        total_loss = ctc_ops.ctc_loss(logits, batch_y, batch_seq_len)
    
    # Calculate the average loss across the batch
    avg_loss = tf.reduce_mean(total_loss)
    
    # Beam search decode the batch
    decoded, _ = ctc_ops.ctc_beam_search_decoder(logits, batch_seq_len)
    
    # Compute the edit (Levenshtein) distance 
    distance = tf.edit_distance(tf.cast(decoded[0], tf.int32), batch_y)
    
    # Compute the accuracy 
    accuracy = tf.reduce_mean(distance)

    # Return results to the caller
    return total_loss, avg_loss, accuracy, decoded, batch_y

The first lines of calculate_accuracy_and_loss()

def calculate_accuracy_and_loss(batch_set):
    # Obtain the next batch of data
    batch_x, batch_seq_len, batch_y = batch_set.next_batch()

simply obtain the next mini-batch of data.

The next line

# Calculate the logits of the batch using BiRNN
    logits = BiRNN(batch_x, tf.to_int64(batch_seq_len))

calls BiRNN() with a batch of data and does inference on the batch.

The next few lines

# Compute the CTC loss
    total_loss = ctc_ops.ctc_loss(logits, batch_y, batch_seq_len)

    # Calculate the average loss across the batch
    avg_loss = tf.reduce_mean(total_loss)

calculate the total and average loss using TensorFlow's ctc_loss operator (with the option of switching to the warp-ctc implementation when use_warpctc is set).

The next lines first beam-search decode the batch and then compute the accuracy based on the Levenshtein distance between the decoded batch and the batch's original Y.

# Beam search decode the batch
    decoded, _ = ctc_ops.ctc_beam_search_decoder(logits, batch_seq_len)

    # Compute the edit (Levenshtein) distance 
    distance = tf.edit_distance(tf.cast(decoded[0], tf.int32), batch_y)

    # Compute the accuracy 
    accuracy = tf.reduce_mean(distance)

Finally, the total_loss, avg_loss, accuracy, the decoded batch and the original batch_y are returned to the caller

# Return results to the caller
    return total_loss, avg_loss, accuracy, decoded, batch_y

Parallel Optimization

Next we will implement optimization of the DeepSpeech model across GPUs on a single host. This parallel optimization can take on various forms. For example, one can use asynchronous updates of the model, synchronous updates of the model, or some combination of the two.

Asynchronous Parallel Optimization

In asynchronous parallel optimization, for example, one places the model initially in CPU memory. Then each of the $G$ GPUs obtains a mini-batch of data along with the current model parameters. Using this mini-batch, each GPU then computes the gradients for all model parameters and sends these gradients back to the CPU when the GPU is done with its mini-batch. The CPU then asynchronously updates the model parameters whenever it receives a set of gradients from a GPU.

Asynchronous parallel optimization has several advantages and several disadvantages. One large advantage is throughput: no GPU will ever sit idle. When a GPU is done processing a mini-batch, it can immediately obtain the next mini-batch to process. It never has to wait on other GPUs to finish their mini-batches. However, this means that the model updates will also be asynchronous, which can cause problems.

For example, one may have model parameters $W$ on the CPU and send mini-batch $n$ to GPU 1 and send mini-batch $n+1$ to GPU 2. As processing is asynchronous, GPU 2 may finish before GPU 1 and thus update the CPU's model parameters $W$ with its gradients $\Delta W_{n+1}(W)$, where the subscript $n+1$ identifies the mini-batch and the argument $W$ the location at which the gradient was evaluated. This results in the new model parameters

$$W + \Delta W_{n+1}(W).$$

Next GPU 1 could finish with its mini-batch and update the parameters to

$$W + \Delta W_{n+1}(W) + \Delta W_{n}(W).$$

The problem with this is that $\Delta W_{n}(W)$ is evaluated at $W$ and not at $W + \Delta W_{n+1}(W)$. Hence, the direction of the gradient $\Delta W_{n}(W)$ is slightly incorrect, as it is evaluated at the wrong location. This can be counteracted through synchronous updates of the model, but that is also problematic.
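
A toy numeric sketch of this staleness, with one scalar parameter, a simple quadratic loss and invented values, shows the second update being taken in a direction that was computed at the old parameter value:

# Toy loss L(w) = (w - 3)^2 with gradient dL/dw = 2*(w - 3); learning rate 0.1
w = 0.0
gradient = lambda w: 2.0 * (w - 3.0)

g_n       = gradient(w)   # GPU 1 evaluates its gradient at w = 0.0 ...
g_n_plus1 = gradient(w)   # ... and GPU 2 evaluates its gradient at the same w = 0.0

w = w - 0.1 * g_n_plus1   # GPU 2 finishes first and updates the parameters
w = w - 0.1 * g_n         # GPU 1's update is applied next, but its gradient was
                          # evaluated at the old w, not at the already-updated one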

Synchronous Optimization

Synchronous optimization solves the problem we saw above. In synchronous optimization, one places the model initially in CPU memory. Then one of the $G$ GPUs is given a mini-batch of data along with the current model parameters. Using the mini-batch, the GPU computes the gradients for all model parameters and sends the gradients back to the CPU. The CPU then updates the model parameters and starts the process of sending out the next mini-batch.

As one can readily see, synchronous optimization does not have the problem we found in the last section, that of incorrect gradients. However, synchronous optimization can only make use of a single GPU at a time. So, when we have a multi-GPU setup, $G > 1$, all but one of the GPUs will remain idle, which is unacceptable. However, there is a third alternative that combines the advantages of asynchronous and synchronous optimization.

Hybrid Parallel Optimization

Hybrid parallel optimization combines most of the benefits of asynchronous and synchronous optimization. It allows for multiple GPUs to be used, but does not suffer from the incorrect gradient problem exhibited by asynchronous optimization.

In hybrid parallel optimization, one places the model initially in CPU memory. Then, as in asynchronous optimization, each of the $G$ GPUs obtains a mini-batch of data along with the current model parameters. Using the mini-batch, each of the GPUs then computes the gradients for all model parameters and sends these gradients back to the CPU. Now, in contrast to asynchronous optimization, the CPU waits until each GPU is finished with its mini-batch, then takes the mean of all the gradients from the $G$ GPUs and updates the model with this mean gradient.

Hybrid parallel optimization has several advantages and few disadvantages. As in asynchronous parallel optimization, hybrid parallel optimization allows for one to use multiple GPUs in parallel. Furthermore, unlike asynchronous parallel optimization, the incorrect gradient problem is not present here. In fact, hybrid parallel optimization performs as if one were working with a single mini-batch that is $G$ times the size of a mini-batch handled by a single GPU. However, hybrid parallel optimization is not perfect. If one GPU is slower than all the others in completing its mini-batch, all other GPUs will have to sit idle until this straggler finishes. This hurts throughput. But, if all GPUs are of the same make and model, this problem should be minimized.
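
In effect, the hybrid scheme applies the mean of the per-GPU gradients, as in this small NumPy sketch with invented numbers for three hypothetical towers:

import numpy as np

# Gradients for the same two parameters as computed by three hypothetical GPUs
tower_grads = [np.array([0.2, -0.1]), np.array([0.4, 0.1]), np.array([0.3, 0.0])]

mean_grad = np.mean(tower_grads, axis=0)   # the gradient actually applied to the model
# Equivalent to the gradient of one mini-batch that is three times as large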

So, relatively speaking, hybrid parallel optimization has more advantages and fewer disadvantages than both asynchronous and synchronous optimization. So we will use this hybrid model for our work.

Adam Optimization

In contrast to Deep Speech: Scaling up end-to-end speech recognition, in which Nesterov’s Accelerated Gradient Descent was used, we will use the Adam method for optimization[3] because, generally, it requires less fine-tuning.


In [ ]:
def create_optimizer():
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate,
                                       beta1=beta1,
                                       beta2=beta2,
                                       epsilon=epsilon)
    return optimizer

Towers

In order to properly make use of multiple GPUs, one must introduce new abstractions, not present when using a single GPU, that facilitate the multi-GPU use case.

In particular, one must introduce a means to isolate the inference and gradient calculations on the various GPUs. The abstraction we introduce for this purpose is called a 'tower'. A tower is specified by two properties (a short sketch follows the list):

  • Scope - A scope, as provided by tf.name_scope(), is a means to isolate the operations within a tower. For example, all operations within "tower 0" could have their name prefixed with tower_0/.
  • Device - A hardware device, as provided by tf.device(), on which all operations within the tower execute. For example, all operations of "tower 0" could execute on the first GPU tf.device('/gpu:0').
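
A bare-bones sketch of how these two properties combine; the names here are illustrative only, and the real tower loop appears in get_tower_results() below:

# Illustrative only: operations created here live on GPU 0 and are prefixed with 'tower_0/'
with tf.device('/gpu:0'):
    with tf.name_scope('tower_0'):
        example_op = tf.constant(0.0, name='example')   # full name: 'tower_0/example:0'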

As we are introducing one tower for each GPU, first we must determine how many GPUs are available


In [ ]:
# Get a list of the available gpu's ['/gpu:0', '/gpu:1'...]
available_devices = get_available_gpus()

# If there are no GPU's use the CPU
if 0 == len(available_devices):
    available_devices = ['/cpu:0']

With this preliminary step out of the way, we can introduce, for each GPU, a tower for whose batch we calculate

  • the CTC decodings decoded,
  • the (total) loss against the outcome (Y) total_loss,
  • the loss averaged over the whole batch avg_loss,
  • the optimization gradient (computed from the averaged loss) and
  • the accuracy of the outcome averaged over the whole batch accuracy

and retain the original labels (Y).

decoded, labels, the optimization gradient, total_loss and avg_loss are collected into the respective arrays tower_decodings, tower_labels, tower_gradients, tower_total_losses, tower_avg_losses (dimension 0 being the tower).

Finally this new method get_tower_results() will return those tower arrays plus the accuracy value averaged across all towers, avg_accuracy.


In [ ]:
def get_tower_results(batch_set, optimizer=None):
    # Tower decodings to return
    tower_decodings = []
    # Tower labels to return
    tower_labels = []
    # Tower gradients to return
    tower_gradients = []
    # Tower total batch losses to return
    tower_total_losses = []
    # Tower avg batch losses to return
    tower_avg_losses = []
    # Tower accuracies to return
    tower_accuracies = []
    
    # Loop over available_devices
    for i in xrange(len(available_devices)):
        # Execute operations of tower i on device i
        with tf.device(available_devices[i]):
            # Create a scope for all operations of tower i
            with tf.name_scope('tower_%d' % i) as scope:
                # Calculate the avg_loss and accuracy and retrieve the decoded 
                # batch along with the original batch's labels (Y) of this tower
                total_loss, avg_loss, accuracy, decoded, labels = calculate_accuracy_and_loss(batch_set)
                                
                # Allow for variables to be re-used by the next tower
                tf.get_variable_scope().reuse_variables()
                
                # Retain tower's decoded batch
                tower_decodings.append(decoded)
                
                # Retain tower's labels (Y)
                tower_labels.append(labels)
                
                # If we are in training, there will be an optimizer given and
                # only then will we compute and retain gradients based on the loss
                if optimizer is not None:
                    # Compute gradients for model parameters using tower's mini-batch
                    gradients = optimizer.compute_gradients(avg_loss)

                    # Retain tower's gradients
                    tower_gradients.append(gradients)
                    
                # Retain tower's total losses
                tower_total_losses.append(total_loss)
                
                # Retain tower's avg losses
                tower_avg_losses.append(avg_loss)
                
                # Retain tower's accuracies
                tower_accuracies.append(accuracy)
                
    # Average accuracies over the 'tower' dimension
    avg_accuracy = tf.reduce_mean(tower_accuracies, 0)

    # Return results to caller
    return tower_decodings, tower_labels, tower_gradients, tower_total_losses, tower_avg_losses, avg_accuracy

Next we want to average the gradients obtained from the GPUs.

We compute the average of the gradients obtained from the GPUs, for each variable, in the function average_gradients()


In [ ]:
def average_gradients(tower_gradients):
    # List of average gradients to return to the caller
    average_grads = []
    
    # Loop over gradient/variable pairs from all towers
    for grad_and_vars in zip(*tower_gradients):
        # Introduce grads to store the gradients for the current variable
        grads = []
        
        # Loop over the gradients for the current variable
        for g, _ in grad_and_vars:
            # Add 0 dimension to the gradients to represent the tower.
            expanded_g = tf.expand_dims(g, 0)
            # Append on a 'tower' dimension which we will average over below.
            grads.append(expanded_g)
            
        # Average over the 'tower' dimension
        grad = tf.concat(0, grads)
        grad = tf.reduce_mean(grad, 0)
        
        # Create a gradient/variable tuple for the current variable with its average gradient
        grad_and_var = (grad, grad_and_vars[0][1])
        
        # Add the current tuple to average_grads
        average_grads.append(grad_and_var)
    
    #Return result to caller
    return average_grads

Note also that this code acts as a synchronization point, as it requires all GPUs to be finished with their mini-batches before it can run to completion.

Next we introduce a function that applies the averaged gradients to update the model's parameters on the CPU


In [ ]:
def apply_gradients(optimizer, average_grads):
    apply_gradient_op = optimizer.apply_gradients(average_grads)
    return apply_gradient_op

Logging

We introduce a function for logging a tensor variable's current state. It logs scalar values for the mean, standard deviation, minimum and maximum. Furthermore it logs a histogram of its state and (if given) of an optimization gradient.


In [ ]:
def log_variable(variable, gradient=None):
    name = variable.name
    mean = tf.reduce_mean(variable)
    tf.scalar_summary(name + '/mean', mean)
    tf.scalar_summary(name + '/stddev', tf.sqrt(tf.reduce_mean(tf.square(variable - mean))))
    tf.scalar_summary(name + '/max', tf.reduce_max(variable))
    tf.scalar_summary(name + '/min', tf.reduce_min(variable))
    tf.histogram_summary(name, variable)
    if gradient is not None:
        if isinstance(gradient, tf.IndexedSlices):
            grad_values = gradient.values
        else:
            grad_values = gradient
        if grad_values is not None:
            tf.histogram_summary(name + "/gradients", grad_values)

Let's also introduce a helper function for logging collections of gradient/variable tuples.


In [ ]:
def log_grads_and_vars(grads_and_vars):
    for gradient, variable in grads_and_vars:
        log_variable(variable, gradient=gradient)

Finally we define the top directory for all logs and our current log sub-directory of it. We also add some log helpers.


In [ ]:
logs_dir = os.environ.get('ds_logs_dir', 'logs')
log_dir = '%s/%s' % (logs_dir, time.strftime("%Y%m%d-%H%M%S"))

def get_git_revision_hash():
    return subprocess.check_output(['git', 'rev-parse', 'HEAD']).strip()

def get_git_branch():
    return subprocess.check_output(['git', 'rev-parse', '--abbrev-ref', 'HEAD']).strip()

Test and Validation

First we need a helper method that creates a plain forward computation without optimization, dropout, or special reporting.


In [ ]:
def decode_batch(data_set):
    # Get gradients for each tower (Runs across all GPU's)
    tower_decodings, tower_labels, _, tower_total_losses, _, _ = get_tower_results(data_set)
    return tower_decodings, tower_labels, tower_total_losses

To report progress and to get an idea of the current state of the model, we create a method that calculates the word error rate (WER) from the (tower) decodings and their respective (original) labels. It will return an array of WER result tuples, each consisting of

  • the original transcription (the ground-truth label),
  • the resulting decoded text,
  • the calculated WER and
  • the total loss for that item,

plus the mean WER across all tuples.


In [ ]:
def calculate_wer(session, tower_decodings, tower_labels, tower_total_losses):
    originals = []
    results = []
    losses = []
    
    # Flatten the nested per-tower decodings into a single list
    tower_decodings = [j for i in tower_decodings for j in i]
    
    # Iterating over the towers
    for i in range(len(tower_decodings)):
        decoded, labels, loss = session.run([tower_decodings[i], tower_labels[i], tower_total_losses[i]], feed_dict)
        originals.extend(sparse_tensor_value_to_texts(labels))
        results.extend(sparse_tensor_value_to_texts(decoded))
        losses.extend(loss)
        
    # Pairwise calculation of all rates
    rates, mean = wers(originals, results)
    return zip(originals, results, rates, losses), mean
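
For reference, the word error rate of a single original/result pair is a word-level Levenshtein distance divided by the number of words in the original (assumed non-empty); the following is a minimal sketch of that computation, not the implementation behind util.text.wers:

def word_error_rate(original, result):
    # Word-level Levenshtein distance divided by the number of words in the original
    ref, hyp = original.split(), result.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return float(d[len(ref)][len(hyp)]) / len(ref)

assert word_error_rate("the cat sat", "the cat sat") == 0.0
assert word_error_rate("the cat sat", "the cat") == 1.0 / 3.0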

Let's introduce a routine to print a WER report under a given caption. It prints the given mean WER plus summaries of the ten lowest-loss items from the given array of WER result tuples (considering only items with a WER greater than zero, and finally ordering them by their WER).


In [ ]:
def print_wer_report(caption, mean, items=[]):
    print
    print "#" * 80
    print "%s WER: %f" % (caption, mean)
    if len(items) > 0:
        # Filter out all items with WER=0
        items = [a for a in items if a[2] > 0]
        # Order the remaining items by their loss (lowest loss on top)
        items.sort(key=lambda a: a[3])
        # Take only the first 10 items
        items = items[:10]
        # Order this top ten items by their WER (lowest WER on top)
        items.sort(key=lambda a: a[2])
        for a in items:
            print "-" * 80
            print " - WER:    %f" % a[2]
            print " - loss:   %f" % a[3]
            print " - source: \"%s\"" % a[0]
            print " - result: \"%s\"" % a[1] 
    print "#" * 80
    print

Plus a convenience method to calculate and print the WER report all at once.


In [ ]:
def calculate_and_print_wer_report(session, caption, tower_decodings, tower_labels, tower_total_losses, show_ranked=True):
    items, mean = calculate_wer(session, tower_decodings, tower_labels, tower_total_losses)
    if show_ranked:
        print_wer_report(caption, mean, items=items)
    else:
        print_wer_report(caption, mean)
    return items, mean

Training

Now that we have prepared all the required operators and methods, we can create the method that trains the network.


In [ ]:
def train(session, data_sets):
    # Calculate the total number of batches
    total_batches = data_sets.train.total_batches
    batches_per_device = float(total_batches) / len(available_devices)
        
    # Create optimizer
    optimizer = create_optimizer()

    # Get gradients for each tower (Runs across all GPU's)
    tower_decodings, \
    tower_labels, \
    tower_gradients, \
    tower_total_losses, \
    tower_avg_losses, \
    avg_accuracy \
    = get_tower_results(data_sets.train, optimizer)
    
    # Validation step preparation
    validation_tower_decodings, \
    validation_tower_labels, \
    validation_tower_total_losses \
    = decode_batch(data_sets.dev)

    # Average tower gradients
    avg_tower_gradients = average_gradients(tower_gradients)

    # Add logging of averaged gradients
    log_grads_and_vars(avg_tower_gradients)

    # Apply gradients to modify the model
    apply_gradient_op = apply_gradients(optimizer, avg_tower_gradients)

    # Create a saver to checkpoint the model
    saver = tf.train.Saver(tf.all_variables())

    # Prepare tensor board logging
    merged = tf.merge_all_summaries()
    writer = tf.train.SummaryWriter(log_dir, session.graph)

    # Init all variables in session
    session.run(tf.initialize_all_variables())
    
    # Start queue runner threads
    tf.train.start_queue_runners(sess=session)
    
    # Start importer's queue threads
    data_sets.start_queue_threads(session)
    
    # Init recent word error rate levels
    last_train_wer = 0.0
    last_validation_wer = 0.0
    
    
    # Loop over the data set for training_epochs epochs
    for epoch in range(training_iters):
        # Define total accuracy for the epoch
        total_accuracy = 0.0

        # Loop over the batches
        for batch in range(int(ceil(batches_per_device))):
            extra_params = { }
            if do_fulltrace:
                loss_run_metadata            = tf.RunMetadata()
                extra_params['options']      = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
                extra_params['run_metadata'] = loss_run_metadata
            
            # Apply the optimization gradients computed for this batch
            session.run(apply_gradient_op, feed_dict_train, **extra_params)

            # Add batch to total_accuracy
            total_accuracy += session.run(avg_accuracy, feed_dict_train)

            # Log all variable states in current step
            step = epoch * total_batches + batch * len(available_devices)
            summary_str = session.run(merged, feed_dict_train)
            writer.add_summary(summary_str, step)
            if do_fulltrace:
                writer.add_run_metadata(loss_run_metadata, 'loss_epoch%d_batch%d'   % (epoch, batch))
            writer.flush()
        
        # Print progress message
        if epoch % display_step == 0:
            print "Epoch:", '%04d' % (epoch+1), "avg_cer=", "{:.9f}".format((total_accuracy / ceil(batches_per_device)))
            _, last_train_wer = calculate_and_print_wer_report( \
                session, \
                "Training", \
                tower_decodings, \
                tower_labels, \
                tower_total_losses)
            print
            
        # Validation step
        if epoch % validation_step == 0:
            _, last_validation_wer = calculate_and_print_wer_report( \
                session, \
                "Validation", \
                validation_tower_decodings, \
                validation_tower_labels, \
                validation_tower_total_losses)
            print

        # Checkpoint the model
        if (epoch % checkpoint_step == 0) or (epoch == training_iters - 1):
            checkpoint_path = os.path.join(checkpoint_dir, 'model.ckpt')
            print "Checkpointing in directory", "%s" % checkpoint_dir
            saver.save(session, checkpoint_path, global_step=epoch)
            print
        
    # Indicate optimization has concluded
    print "Optimization Finished!"
    return last_train_wer, last_validation_wer

With everything prepared, we can now train the network.


In [ ]:
# Define the CPU as the device on which the multi-GPU training is orchestrated
with tf.device('/cpu:0'):
    # Obtain all the data, defaulting to TED LIUM
    data_sets = ds_importer_module.read_data_sets(ds_dataset_path, batch_size, n_input, n_context)
    
    # Create session in which to execute
    session = tf.Session(config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=True))
    
    # Take start time for time measurement
    time_started = datetime.datetime.utcnow()
    
    # Train the network
    last_train_wer, last_validation_wer = train(session, data_sets)
    
    # Take final time for time measurement
    time_finished = datetime.datetime.utcnow()
    
    # Calculate duration in seconds
    duration = time_finished - time_started
    duration = duration.days * 86400 + duration.seconds

Now the trained model is tested using an unbiased test set.


In [ ]:
# Define the CPU as the device on which the multi-GPU testing is orchestrated
with tf.device('/cpu:0'):
    # Test network
    test_decodings, test_labels, test_total_losses = decode_batch(data_sets.test)
    _, test_wer = calculate_and_print_wer_report(session, "Test", test_decodings, test_labels, test_total_losses)

Finally, we restore the trained variables into a simpler graph that we can export for serving.


In [ ]:
# Don't export a model if no export directory has been set
if export_dir:
    with tf.device('/cpu:0'):
        tf.reset_default_graph()
        session = tf.Session(config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=True))

        # Run inference
        # Replace the dropout placeholder with a constant
        dropout_rate_placeholder = tf.constant(0.0)

        # Input tensor will be of shape [batch_size, n_steps, n_input + 2*n_input*n_context]
        input_tensor = tf.placeholder(tf.float32, [None, None, n_input + 2*n_input*n_context])

        # Calculate input sequence length. This is done by tiling n_steps, batch_size times.
        # If there are multiple sequences, it is assumed they are padded with zeros to be of
        # the same length.
        n_items  = tf.slice(tf.shape(input_tensor), [0], [1])
        n_steps = tf.slice(tf.shape(input_tensor), [1], [1])
        seq_length = tf.tile(n_steps, n_items)

        # Calculate the logits of the batch using BiRNN
        logits = BiRNN(input_tensor, tf.to_int64(seq_length))

        # Beam search decode the batch
        decoded, _ = ctc_ops.ctc_beam_search_decoder(logits, seq_length)
        decoded = tf.convert_to_tensor(
            [tf.sparse_tensor_to_dense(sparse_tensor) for sparse_tensor in decoded])

        # TODO: Transform the decoded output to a string

        # Create a saver and exporter using variables from the above newly created graph
        saver = tf.train.Saver(tf.all_variables())
        model_exporter = exporter.Exporter(saver)
        
        # Restore variables from training checkpoint
        # TODO: This restores the most recent checkpoint, but if we use validation to counteract
        #       over-fitting, we may want to restore an earlier checkpoint.
        checkpoint = tf.train.get_checkpoint_state(checkpoint_dir)
        saver.restore(session, checkpoint.model_checkpoint_path)
        print 'Restored checkpoint at training epoch %d' % (int(checkpoint.model_checkpoint_path.split('-')[1]) + 1)

        # Initialise the model exporter and export the model
        model_exporter.init(session.graph.as_graph_def(),
                            named_graph_signatures = {
                                'inputs': exporter.generic_signature(
                                    { 'input': input_tensor }),
                                'outputs': exporter.generic_signature(
                                    { 'outputs': decoded})})
        model_exporter.export(export_dir, tf.constant(export_version), session)

    print 'Model exported at %s' % (export_dir)

Logging Hyper Parameters and Results

Now that training and testing are done, we persist the results along with the hyperparameters involved for further reporting.


In [ ]:
with open('%s/%s' % (log_dir, 'hyper.json'), 'w') as dump_file:
    json.dump({ \
        'context': { \
            'time_started': time_started.isoformat(), \
            'time_finished': time_finished.isoformat(), \
            'git_hash': get_git_revision_hash(), \
            'git_branch': get_git_branch() \
        }, \
        'parameters': { \
            'learning_rate': learning_rate, \
            'beta1': beta1, \
            'beta2': beta2, \
            'epsilon': epsilon, \
            'training_iters': training_iters, \
            'batch_size': batch_size, \
            'validation_step': validation_step, \
            'dropout_rate': dropout_rate, \
            'relu_clip': relu_clip, \
            'n_input': n_input, \
            'n_context': n_context, \
            'n_hidden_1': n_hidden_1, \
            'n_hidden_2': n_hidden_2, \
            'n_hidden_3': n_hidden_3, \
            'n_hidden_5': n_hidden_5, \
            'n_hidden_6': n_hidden_6, \
            'n_cell_dim': n_cell_dim, \
            'n_character': n_character, \
            'total_batches_train': data_sets.train.total_batches, \
            'total_batches_validation': data_sets.dev.total_batches, \
            'total_batches_test': data_sets.test.total_batches, \
            'data_set': { \
                'name': ds_importer \
            }, \
        }, \
        'results': { \
            'duration': duration, \
            'last_train_wer': last_train_wer, \
            'last_validation_wer': last_validation_wer, \
            'test_wer': test_wer \
        } \
    }, dump_file, sort_keys=True, indent = 4)

Let's also re-populate a central JS file that contains all the dumps at once.


In [ ]:
merge_logs(logs_dir)
maybe_publish()