In [1]:
# Copyright 2018 The TensorFlow Hub Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

Predict Shakespeare with Cloud TPUs and Keras

Overview

This example uses tf.keras to build a language model and train it on a Cloud TPU. This language model predicts the next character of text given the text so far. The trained model can generate new snippets of text that read in a style similar to the training text.

The model trains for 10 epochs and completes in approximately 5 minutes.

This notebook is hosted on GitHub. To view it in its original repository, after opening the notebook in Colab, select File > View on GitHub.

Learning objectives

In this notebook, you will learn how to:

  • Build a two-layer, forward-LSTM model.
  • Convert a tf.keras model to an equivalent TPU version and then use the standard Keras methods to train: fit, predict, and evaluate.
  • Use the trained model to make predictions and generate your own Shakespeare-esque play.

Instructions to Train on a Deep Learning VM

  • Start a VM attached to a TPU (a quick check of the resulting TPU_NAME setup is sketched after this list):

      INSTANCE_NAME=mytpu   # CHANGE THIS
      GCP_LOGIN_NAME=google-cloud-customer@gmail.com  # CHANGE THIS
      TPU_NAME=$INSTANCE_NAME
    
      gcloud compute instances create $INSTANCE_NAME \
      --machine-type n1-standard-8 \
      --image-project deeplearning-platform-release \
      --image-family tf-1-12-cpu \
      --scopes cloud-platform \
      --metadata proxy-user-mail="${GCP_LOGIN_NAME}",\
      startup-script="echo export TPU_NAME=$TPU_NAME > /etc/profile.d/tpu-env.sh"
    
      gcloud compute tpus create $TPU_NAME \
       --network default \
       --range 10.240.1.0 \
       --version 1.12
  • Open JupyterLab from the GCP web console and clone the repo that this notebook is hosted in:
         git clone https://www.github.com/GoogleCloudPlatform/training-data-analyst
    and navigate to courses/fast-and-lean-data-science/09_lstm_keras_tpu.ipynb
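
Once JupyterLab is open, you can sanity-check the TPU_NAME setup from a notebook cell. This is a hypothetical check, not part of the original instructions, and it assumes the Jupyter kernel inherits the login-shell environment:

      import os
      # The startup script wrote TPU_NAME into /etc/profile.d/tpu-env.sh.
      # If the kernel inherited that environment, this prints 'mytpu'.
      print(os.environ.get('TPU_NAME'))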

Alternate method: Train on Colab

  1. On the main menu, click Runtime and select Change runtime type. Set "TPU" as the hardware accelerator.
  2. Click Runtime again and select Run all. You can also run the cells manually with Shift-ENTER.

Note that the Colab method is suitable for educational use, but you will not be able to use it for long-term work.


In [1]:
import tensorflow as tf

SEQ_LEN = 100
BATCH_SIZE = 128

Data, model, and training

In this example, you train the model on the combined works of William Shakespeare, then use the model to compose a play in the style of The Great Bard:

Loves that led me no dumbs lack her Berjoy's face with her to-day. The spirits roar'd; which shames which within his powers Which tied up remedies lending with occasion, A loud and Lancaster, stabb'd in me Upon my sword for ever: 'Agripo'er, his days let me free. Stop it of that word, be so: at Lear, When I did profess the hour-stranger for my life, When I did sink to be cried how for aught; Some beds which seeks chaste senses prove burning; But he perforces seen in her eyes so fast; And _

Data

We have downloaded The Complete Works of William Shakespeare as a single text file from Project Gutenberg and stored it in a GCS bucket. Because TPUs run in Google Cloud, they read data directly from Google Cloud Storage (GCS) for optimal performance.

We will use snippets from this file as the training data for the model. The target snippet is offset by one character.
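
As a minimal illustration of the one-character offset (a hypothetical snippet, not from the notebook):

    text = 'Shakespeare'
    seq_len = 10
    source = text[:seq_len]        # 'Shakespear'
    target = text[1:seq_len + 1]   # 'hakespeare'
    # At each position the model reads source[i] and is trained to
    # predict target[i], the character one step ahead.
    print(source, '->', target)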


In [2]:
!gsutil cat gs://cloud-training-demos/tpudemo/shakespeare.txt | head -1000 | tail -10


From hands of falsehood, in sure wards of trust!
But thou, to whom my jewels trifles are,
Most worthy comfort, now my greatest grief,
Thou best of dearest, and mine only care,
Art left the prey of every vulgar thief.
Thee have I not locked up in any chest,
Save where thou art not, though I feel thou art,
Within the gentle closure of my breast,
From whence at pleasure thou mayst come and part,
  And even thence thou wilt be stol’n I fear,

Build the tf.data.Dataset


In [3]:
import numpy as np
import six
import tensorflow as tf
import time
import os

SHAKESPEARE_TXT = 'gs://cloud-training-demos/tpudemo/shakespeare.txt'

tf.logging.set_verbosity(tf.logging.INFO)

def transform(txt, pad_to=None):
  # Keep only characters with code points below 255, so every id fits
  # the 256-way embedding and softmax.
  output = np.asarray([ord(c) for c in txt if ord(c) < 255], dtype=np.int32)
  if pad_to is not None:
    output = output[:pad_to]
    output = np.concatenate([
        # Left-pad with zeros; use len(output), not len(txt), because the
        # filter above may have dropped characters.
        np.zeros([pad_to - len(output)], dtype=np.int32),
        output,
    ])
  return output

def training_generator(seq_len=SEQ_LEN, batch_size=BATCH_SIZE):
  """A generator yields (source, target) arrays for training."""
  with tf.gfile.GFile(SHAKESPEARE_TXT, 'r') as f:
    txt = f.read()

  tf.logging.info('Input text [%d] %s', len(txt), txt[:50])
  source = transform(txt)
  while True:
    offsets = np.random.randint(0, len(source) - seq_len, batch_size)

    # Our model uses sparse crossentropy loss, but Keras requires labels
    # to have the same rank as the input logits.  We add an empty final
    # dimension to account for this.
    yield (
        np.stack([source[idx:idx + seq_len] for idx in offsets]),
        np.expand_dims(
            np.stack([source[idx + 1:idx + seq_len + 1] for idx in offsets]),
            -1),
    )

a = six.next(training_generator(seq_len=10, batch_size=1))
print(a)
#print(tf.convert_to_tensor(a[1]))


INFO:tensorflow:Input text [5796379] 
Project Gutenberg’s The Complete Works of Willi
(array([[115,  32, 116, 111,  32,  97,  32,  98,  97, 119]], dtype=int32), array([[[ 32],
        [116],
        [111],
        [ 32],
        [ 97],
        [ 32],
        [ 98],
        [ 97],
        [119],
        [100]]], dtype=int32))

In [9]:
def create_dataset():
  return tf.data.Dataset.from_generator(
      training_generator,
      (tf.int32, tf.int32),
      (tf.TensorShape([BATCH_SIZE, SEQ_LEN]),
       tf.TensorShape([BATCH_SIZE, SEQ_LEN, 1])))
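
To confirm the dataset produces the expected shapes, here is a quick sketch using the TF 1.x iterator API (not part of the original notebook):

    ds = create_dataset()
    features, labels = ds.make_one_shot_iterator().get_next()
    with tf.Session() as sess:
      f, l = sess.run([features, labels])
      print(f.shape, l.shape)  # expect (128, 100) and (128, 100, 1)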

Build the model

The model is defined as a two-layer forward LSTM, with two changes from the standard tf.keras LSTM definition:

  1. Define the input shape of the model to comply with the XLA compiler's static shape requirement.
  2. Use tf.train.Optimizer instead of a standard Keras optimizer (Keras optimizer support is still experimental).

In [7]:
EMBEDDING_DIM = 512

def lstm_model(seq_len, batch_size, stateful):
  """Language model: predict the next word given the current word."""
  source = tf.keras.Input(
      name='seed', shape=(seq_len,), batch_size=batch_size, dtype=tf.int32)

  embedding = tf.keras.layers.Embedding(input_dim=256, output_dim=EMBEDDING_DIM)(source)
  lstm_1 = tf.keras.layers.LSTM(EMBEDDING_DIM, stateful=stateful, return_sequences=True)(embedding)
  lstm_2 = tf.keras.layers.LSTM(EMBEDDING_DIM, stateful=stateful, return_sequences=True)(lstm_1)
  predicted_char = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(256, activation='softmax'))(lstm_2)
  model = tf.keras.Model(inputs=[source], outputs=[predicted_char])
  model.compile(
      optimizer=tf.train.RMSPropOptimizer(learning_rate=0.01),
      loss='sparse_categorical_crossentropy',
      metrics=['sparse_categorical_accuracy'])
  return model
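
As an optional sanity check (not in the original notebook), you can instantiate the model and print its layer output shapes and parameter counts with Keras's built-in summary:

    lstm_model(seq_len=SEQ_LEN, batch_size=BATCH_SIZE, stateful=False).summary()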

Train the model

The tf.contrib.tpu.keras_to_tpu_model function converts a tf.keras model to an equivalent TPU version. You then use the standard Keras methods to train: fit, predict, and evaluate.


In [10]:
tf.keras.backend.clear_session()

training_model = lstm_model(seq_len=SEQ_LEN, batch_size=BATCH_SIZE, stateful=False)

# Use TPU if it exists, else fall back to GPU
try: # TPU detection
  tpu = tf.contrib.cluster_resolver.TPUClusterResolver()
  training_model = tf.contrib.tpu.keras_to_tpu_model(
        training_model,
        strategy=tf.contrib.tpu.TPUDistributionStrategy(tpu))
  training_input = create_dataset  # Function that returns a dataset
  print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
except ValueError:
  tpu = None
  training_input = create_dataset()  # The dataset itself
  print("Running on GPU or CPU")

# Run fit()
training_model.fit(
    training_input,
    steps_per_epoch=100,
    epochs=10,
)
training_model.save_weights('/tmp/bard.h5', overwrite=True)


INFO:tensorflow:Querying Tensorflow master (grpc://10.240.1.2:8470) for TPU system metadata.
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, -1, 3577263294212491411)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 9025790428816496025)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_GPU:0, XLA_GPU, 17179869184, 12633420940520028490)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 4889055515393833545)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 9632346120050048346)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 17116235810456753864)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, 10387310604824546591)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 1907352312443083400)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, 12795733773701058027)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, 3885015511603421288)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 17179869184, 5505781315448043335)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 17179869184, 4880015139463075197)
WARNING:tensorflow:tpu_model (from tensorflow.contrib.tpu.python.tpu.keras_support) is experimental and may change or be removed at any time, and without warning.
Running on TPU  ['10.240.1.2:8470']
Epoch 1/10
INFO:tensorflow:New input shapes; (re-)compiling: mode=train (# of cores 8), [TensorSpec(shape=(128,), dtype=tf.int32, name=None), TensorSpec(shape=(128, 100), dtype=tf.int32, name=None), TensorSpec(shape=(128, 100, 1), dtype=tf.int32, name=None)]
INFO:tensorflow:Overriding default placeholder.
INFO:tensorflow:Remapping placeholder for seed
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py in _apply_op_helper(self, op_type_name, name, **keywords)
    509                 as_ref=input_arg.is_ref,
--> 510                 preferred_dtype=default_dtype)
    511           except TypeError as err:

/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py in internal_convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, ctx)
   1145     if ret is None:
-> 1146       ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
   1147 

/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py in _TensorTensorConversionFunction(t, dtype, name, as_ref)
    982         "Tensor conversion requested dtype %s for Tensor with dtype %s: %r" %
--> 983         (dtype.name, t.dtype.name, str(t)))
    984   return t

ValueError: Tensor conversion requested dtype int32 for Tensor with dtype int64: 'Tensor("tpu_139640912195936/metrics/sparse_categorical_accuracy/ArgMax:0", shape=(1024, 100), dtype=int64)'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-10-fbd398209b08> in <module>
     20     training_input,
     21     steps_per_epoch=100,
---> 22     epochs=10,
     23 )
     24 tpu_model.save_weights('/tmp/bard.h5', overwrite=True)

/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/keras_support.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
   1505                                   validation_split, validation_data, shuffle,
   1506                                   class_weight, sample_weight, initial_epoch,
-> 1507                                   steps_per_epoch, validation_steps, **kwargs)
   1508       finally:
   1509         self._numpy_to_infeed_manager_list = []

/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/keras_support.py in _pipeline_fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
   1598         initial_epoch=initial_epoch,
   1599         steps_per_epoch=steps_per_epoch,
-> 1600         validation_steps=validation_steps)
   1601 
   1602   def _pipeline_fit_loop(self,

/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/keras_support.py in _pipeline_fit_loop(self, inputs, targets, sample_weights, batch_size, epochs, verbose, callbacks, val_inputs, val_targets, val_sample_weights, shuffle, initial_epoch, steps_per_epoch, validation_steps)
   1685             val_sample_weights=val_sample_weights,
   1686             validation_steps=validation_steps,
-> 1687             epoch_logs=epoch_logs)
   1688       else:
   1689         # Sample-wise fit loop.

/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/keras_support.py in _pipeline_fit_loop_step_wise(self, ins, callbacks, steps_per_epoch, epochs, do_validation, val_inputs, val_targets, val_sample_weights, validation_steps, epoch_logs)
   1813     # Loop prologue
   1814     try:
-> 1815       outs = f.pipeline_run(cur_step_inputs=None, next_step_inputs=ins)
   1816       assert outs is None  # Function shouldn't return anything!
   1817     except errors.OutOfRangeError:

/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/keras_support.py in pipeline_run(***failed resolving arguments***)
   1323           next_input_tensors)
   1324       next_tpu_model_ops = self._tpu_model_ops_for_input_specs(
-> 1325           next_input_specs, next_step_infeed_manager)
   1326       infeed_dict = next_infeed_instance.make_feed_dict(next_tpu_model_ops)
   1327 

/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/keras_support.py in _tpu_model_ops_for_input_specs(self, input_specs, infeed_manager)
   1152           self._tpu_assignment.num_towers, input_specs)
   1153       new_tpu_model_ops = self._specialize_model(input_specs,
-> 1154                                                  infeed_manager)
   1155       self._compilation_cache[shape_key] = new_tpu_model_ops
   1156       self._test_model_compiles(new_tpu_model_ops)

/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/keras_support.py in _specialize_model(self, input_specs, infeed_manager)
   1062     # running on a different logical core.
   1063     compile_op, execute_op = tpu.split_compile_and_replicate(
-> 1064         _model_fn, inputs=[[]] * self._tpu_assignment.num_towers)
   1065 
   1066     # Generate CPU side operations to enqueue features/labels and dequeue

/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu.py in split_compile_and_replicate(***failed resolving arguments***)
    682       vscope.set_custom_getter(custom_getter)
    683 
--> 684       outputs = computation(*computation_inputs)
    685 
    686       vscope.set_use_resource(saved_use_resource)

/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/keras_support.py in _model_fn()
   1003                   weighted_metrics=metrics_module.clone_metrics(
   1004                       self.model.weighted_metrics),
-> 1005                   target_tensors=tpu_targets,
   1006               )
   1007 

/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/checkpointable/base.py in _method_wrapper(self, *args, **kwargs)
    472     self._setattr_tracking = False  # pylint: disable=protected-access
    473     try:
--> 474       method(self, *args, **kwargs)
    475     finally:
    476       self._setattr_tracking = previous_value  # pylint: disable=protected-access

/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/training.py in compile(self, optimizer, loss, metrics, loss_weights, sample_weight_mode, weighted_metrics, target_tensors, distribute, **kwargs)
    646         targets=self.targets,
    647         skip_target_indices=skip_target_indices,
--> 648         sample_weights=self.sample_weights)
    649 
    650     # Prepare gradient updates and state updates.

/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/training.py in _handle_metrics(self, outputs, skip_target_indices, targets, sample_weights, masks)
    311         metric_results.extend(
    312             self._handle_per_output_metrics(self._per_output_metrics[i], target,
--> 313                                             output, output_mask))
    314         metric_results.extend(
    315             self._handle_per_output_metrics(

/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/training.py in _handle_per_output_metrics(self, metrics_dict, y_true, y_pred, mask, weights)
    268               metric_fn)
    269           metric_result = weighted_metric_fn(
--> 270               y_true, y_pred, weights=weights, mask=mask)
    271 
    272         if not context.executing_eagerly():

/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/training_utils.py in weighted(y_true, y_pred, weights, mask)
    596     """
    597     # score_array has ndim >= 2
--> 598     score_array = fn(y_true, y_pred)
    599     if mask is not None:
    600       mask = math_ops.cast(mask, y_pred.dtype)

/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/metrics.py in sparse_categorical_accuracy(y_true, y_pred)
    660     y_pred = math_ops.cast(y_pred, K.floatx())
    661 
--> 662   return math_ops.cast(math_ops.equal(y_true, y_pred), K.floatx())
    663 
    664 

/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_math_ops.py in equal(x, y, name)
   2732   if _ctx is None or not _ctx._eager_context.is_eager:
   2733     _, _, _op = _op_def_lib._apply_op_helper(
-> 2734         "Equal", x=x, y=y, name=name)
   2735     _result = _op.outputs[:]
   2736     _inputs_flat = _op.inputs

/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py in _apply_op_helper(self, op_type_name, name, **keywords)
    544                   "%s type %s of argument '%s'." %
    545                   (prefix, dtypes.as_dtype(attrs[input_arg.type_attr]).name,
--> 546                    inferred_from[input_arg.type_attr]))
    547 
    548           types = [values.dtype]

TypeError: Input 'y' of 'Equal' Op has type int64 that does not match type int32 of argument 'x'.
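
The run above fails with a dtype mismatch: on the TPU, the sparse_categorical_accuracy metric computes its ArgMax as int64 and compares it against the int32 targets. A plausible (unverified) workaround is to drop the metric from model.compile, or to cast the targets to int64 in the input pipeline, for example:

    def create_dataset_int64():
      # Same pipeline as create_dataset, but with targets cast to int64
      # so the metric's Equal op compares matching dtypes.
      return create_dataset().map(
          lambda source, target: (source, tf.cast(target, tf.int64)))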

Make predictions with the model

Use the trained model to make predictions and generate your own Shakespeare-esque play. Start the model off with a seed sentence, then generate 250 characters from it. The model makes five independent predictions from the same initial seed, one per batch element.


In [ ]:
BATCH_SIZE = 5
PREDICT_LEN = 250

# Keras requires the batch size be specified ahead of time for stateful models.
# We use a sequence length of 1, as we will be feeding in one character at a 
# time and predicting the next character.
prediction_model = lstm_model(seq_len=1, batch_size=BATCH_SIZE, stateful=True)
prediction_model.load_weights('/tmp/bard.h5')

# We seed the model with our initial string, copied BATCH_SIZE times

seed_txt = 'Looks it not like the king?  Verily, we must go! '
seed = transform(seed_txt)
seed = np.repeat(np.expand_dims(seed, 0), BATCH_SIZE, axis=0)

# First, run the seed forward to prime the state of the model.
prediction_model.reset_states()
for i in range(len(seed_txt) - 1):
  prediction_model.predict(seed[:, i:i + 1])

# Now we can accumulate predictions!
predictions = [seed[:, -1:]]
for i in range(PREDICT_LEN):
  last_word = predictions[-1]
  next_probits = prediction_model.predict(last_word)[:, 0, :]
  
  # sample from our output distribution
  next_idx = [
      np.random.choice(256, p=next_probits[i])
      for i in range(BATCH_SIZE)
  ]
  predictions.append(np.asarray(next_idx, dtype=np.int32))
  

for i in range(BATCH_SIZE):
  print('PREDICTION %d\n\n' % i)
  p = [predictions[j][i] for j in range(1, PREDICT_LEN + 1)]  # skip the seed character at index 0
  generated = ''.join([chr(c) for c in p])
  print(generated)
  print()
  assert len(generated) == PREDICT_LEN, 'Generated text too short'

What's next

  • Learn about Cloud TPUs, which Google designed and optimized to speed up and scale up ML workloads for training and inference, and to enable ML engineers and researchers to iterate more quickly.
  • Explore the range of Cloud TPU tutorials and Colabs to find other examples that can be used when implementing your ML project.

On Google Cloud Platform, in addition to the GPUs and TPUs available on pre-configured deep learning VMs, you will find AutoML (beta) for training custom models without writing code, and Cloud ML Engine, which allows you to run parallel training and hyperparameter tuning of your custom models on powerful distributed hardware.