Getting Started with Cloud TPUs

Before We Start

This notebook requires that you run it on a GCE VM in a GCP project that has Cloud TPU quota. If you are not using this notebook on a pre-built virtual machine image, here is how you can start a new GCE VM with the right settings:

  1. Create a new GCE VM with the following settings:
    1. Zone: us-central1-c or us-central1-f
    2. Machine Type: n1-standard-8
    3. Operating System: Ubuntu 16.04 LTS with 64 GB of persistent disk
    4. Identity and API access: Enable "Allow full access to all Cloud APIs"
    5. Networking > Network Tag: tpu-jupyterhub-demo
  2. Create a firewall rule under VPC network > Firewall Rules in the GCP Console with the following settings:
    1. Name: tpu-jupyterhub-demo
    2. Target tags: tpu-jupyterhub-demo
    3. Source IP ranges: 0.0.0.0/0
    4. Protocols and ports: Specified protocols and ports, tcp:6006,8888
  3. SSH into the GCE VM you created in Step 1 and run the following:
    1. sudo apt-get update
    2. sudo apt-get -y install python3 python3-pip
    3. sudo -H pip3 install jupyter tf-nightly google-api-python-client
  4. Start Jupyter in the GCE VM with jupyter notebook --no-browser --ip=0.0.0.0
  5. Navigate to http://IP.OF.MY.VM:8888/?token=THE.TOKEN.DISPLAYED.ON.THE.COMMANDLINE in your favorite browser.
  6. Upload this notebook to the Jupyter server by clicking on the Upload button.

Configuration

Please modify the following environment variables as required for this notebook.

  • GCE_PROJECT_NAME: The name of the GCE project this VM (and your Cloud TPU) starts in.
  • TPU_ZONE: The GCE zone in which you want your Cloud TPU to start.
  • TPU_NAME: The name of the Cloud TPU.
  • TPU_IP_RANGE: The IP address range for the Cloud TPU.
  • GCS_DATA_PATH: The GCS path where we will store sample test data for Cloud TPUs. As GCS bucket namespaces are global, you may need to change this.
  • GCS_CKPT_PATH: The GCS path where we will store sample checkpoint data for Cloud TPUs. As GCS bucket namespaces are global, you may need to change this.

Note: We will grant Storage Admin permissions to Cloud TPUs on the GCS buckets specified in GCS_DATA_PATH and GCS_CKPT_PATH. We encourage you to specify new buckets or test buckets containing non-production data.


In [ ]:
%env GCE_PROJECT_NAME my-sample-tpu-project
%env TPU_ZONE us-central1-f
%env TPU_NAME demo-tpu
%env TPU_IP_RANGE 10.240.1.0/29
%env GCS_DATA_PATH gs://cloud-tpu-data-bucket/mnist/
%env GCS_CKPT_PATH gs://cloud-tpu-checkpoint-bucket/mnist/
    
# Automatically get bucket name from GCS paths
import os
os.environ['GCS_DATA_BUCKET'] = os.environ['GCS_DATA_PATH'][5:].split('/')[0]
os.environ['GCS_CKPT_BUCKET'] = os.environ['GCS_CKPT_PATH'][5:].split('/')[0]
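
As an optional sanity check (a convenience we add here, not a required step), the cell below prints the bucket names derived above so you can confirm that they match the paths you configured.


In [ ]:
# Optional sanity check: the bucket name is the first path component
# after the gs:// prefix, as extracted in the cell above.
print('Data bucket:       %s' % os.environ['GCS_DATA_BUCKET'])
print('Checkpoint bucket: %s' % os.environ['GCS_CKPT_BUCKET'])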

Create a new Cloud TPU

You can create a new Cloud TPU by running the command in the cell below.


In [ ]:
!gcloud config set compute/zone $TPU_ZONE
!gcloud alpha compute tpus create $TPU_NAME --range=$TPU_IP_RANGE --accelerator-type=tpu-v2 --version=nightly --zone=$TPU_ZONE
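
TPU provisioning can take a few minutes. Optionally, you can poll the TPU with the describe command below (the same command we reuse later to look up the TPU's service account); the output includes a state field, which should read READY before you proceed.


In [ ]:
!gcloud alpha compute tpus describe $TPU_NAME --zone=$TPU_ZONE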

Create GCS Buckets for Cloud TPU

Here, we will create two GCS buckets -- one for training/test data (GCS_DATA_BUCKET), and the other for TensorFlow checkpoint and TensorBoard metric data (GCS_CKPT_BUCKET).

The first two commands create the buckets in a single region for maximum performance, and the final command grants the Cloud TPU service account admin access to the buckets so that it can read from and write to them.


In [ ]:
!gsutil mb -c regional -l us-central1 gs://$GCS_DATA_BUCKET
!gsutil mb -c regional -l us-central1 gs://$GCS_CKPT_BUCKET

!gsutil iam ch serviceAccount:`gcloud alpha compute tpus describe $TPU_NAME | grep serviceAccount | cut -d' ' -f2`:admin gs://$GCS_DATA_BUCKET gs://$GCS_CKPT_BUCKET && echo 'Successfully set permissions!'
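
Optionally, you can double-check the grant with gsutil iam get, which prints each bucket's IAM policy; the TPU service account from the command above should appear with the roles/storage.admin role.


In [ ]:
!gsutil iam get gs://$GCS_DATA_BUCKET
!gsutil iam get gs://$GCS_CKPT_BUCKET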

Connect to Cloud TPU and Run a Simple AX+Y Calculation

The following code, which runs on your GCE VM, resolves the address of your Cloud TPU using the TPUClusterResolver, connects to it with a tf.Session, and runs a simple calculation on the TPU.

You should see a 3x3 array of random numbers printed out if the computation is successful. It may take up to 60 seconds to run to completion due to TPU initialization overheads.


In [ ]:
import os
import tensorflow as tf
from tensorflow.contrib import tpu
from tensorflow.contrib.cluster_resolver import TPUClusterResolver

def axy_computation(a, x, y):
  return a * x + y

inputs = [
    3.0,
    tf.random_uniform([3, 3], 0, 1, tf.float32),
    tf.random_uniform([3, 3], 0, 1, tf.float32),
]
tpu_computation = tpu.rewrite(axy_computation, inputs)

tpu_cluster_resolver = TPUClusterResolver([os.environ['TPU_NAME']], zone=os.environ['TPU_ZONE'], project=os.environ['GCE_PROJECT_NAME'])
tpu_grpc_url = tpu_cluster_resolver.get_master()

with tf.Session(tpu_grpc_url) as sess:
  print('Initializing TPU...')
  sess.run(tpu.initialize_system())
  print('Initializing global variables...')
  sess.run(tf.global_variables_initializer())
  print('Executing TPU operation...')
  output = sess.run(tpu_computation)
  print(output)
  print('Shutting down TPU...')
  sess.run(tpu.shutdown_system())
  print('Done!')

Creating and Uploading TFRecords from MNIST Test Data

The script below downloads the MNIST data from http://yann.lecun.com/exdb/mnist/, converts it into TFRecord files, and writes them directly to the GCS data bucket (GCS_DATA_BUCKET) that we created earlier. If the run is successful, the last line of the output should read Finished writing TFRecords to followed by your GCS_DATA_PATH.


In [ ]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import os
import sys

import tensorflow as tf

from tensorflow.contrib.learn.python.learn.datasets import mnist


def _int64_feature(value):
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


def _bytes_feature(value):
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def convert_mnist_data_to_tfrecord(data_set, name):
  """Converts a dataset to tfrecords."""
  images = data_set.images
  labels = data_set.labels
  num_examples = data_set.num_examples

  if images.shape[0] != num_examples:
    raise ValueError('Images size %d does not match label size %d.' %
                     (images.shape[0], num_examples))
  rows = images.shape[1]
  cols = images.shape[2]
  depth = images.shape[3]

  filename = os.path.join(os.environ['GCS_DATA_PATH'], name + '.tfrecords')
  print('Writing', filename)
  with tf.python_io.TFRecordWriter(filename) as writer:
    for index in range(num_examples):
      image_raw = images[index].tostring()
      example = tf.train.Example(
          features=tf.train.Features(
              feature={
                  'height': _int64_feature(rows),
                  'width': _int64_feature(cols),
                  'depth': _int64_feature(depth),
                  'label': _int64_feature(int(labels[index])),
                  'image_raw': _bytes_feature(image_raw)
              }))
      writer.write(example.SerializeToString())


def convert_mnist_to_tfrecord():
  # Get the data.
  data_sets = mnist.read_data_sets('/tmp',
                                   dtype=tf.uint8,
                                   reshape=False,
                                   validation_size=5000)

  # Convert to Examples and write the result to TFRecords.
  convert_mnist_data_to_tfrecord(data_sets.train, 'train')
  convert_mnist_data_to_tfrecord(data_sets.validation, 'validation')
  convert_mnist_data_to_tfrecord(data_sets.test, 'test')

convert_mnist_to_tfrecord()
print("Finished writing TFRecords to %s" % os.environ['GCS_DATA_PATH'])

Defining a Simple Neural Network Model with TPU Estimators

We can define a simple model with TensorFlow Estimators that trains on the MNIST dataset to classify digit images. As part of using TensorFlow Estimators, we need to create a model function that defines the model (mnist_model_fn) and an input function that processes the inputs (returned by mnist_get_input_fn).


In [ ]:
from tensorflow.contrib.tpu.python.tpu import tpu_config
from tensorflow.contrib.tpu.python.tpu import tpu_estimator
from tensorflow.contrib.tpu.python.tpu import tpu_optimizer

def mnist_metric_fn(labels, logits):
  """Evaluation metric Fn which runs on CPU."""
  predictions = tf.argmax(logits, 1)
  return {
      "accuracy": tf.metrics.precision(
          labels=labels, predictions=predictions),
  }


def mnist_model_fn(features, labels, mode, params):
  """A simple CNN."""
  del params

  if mode == tf.estimator.ModeKeys.PREDICT:
    raise RuntimeError("mode {} is not supported yet".format(mode))

  input_layer = tf.reshape(features, [-1, 28, 28, 1])
  conv1 = tf.layers.conv2d(
      inputs=input_layer,
      filters=32,
      kernel_size=[5, 5],
      padding="same",
      activation=tf.nn.relu)
  pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)
  conv2 = tf.layers.conv2d(
      inputs=pool1,
      filters=64,
      kernel_size=[5, 5],
      padding="same",
      activation=tf.nn.relu)
  pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2)
  pool2_flat = tf.reshape(pool2, [-1, 7 * 7 * 64])
  dense = tf.layers.dense(inputs=pool2_flat, units=128, activation=tf.nn.relu)
  dropout = tf.layers.dropout(
      inputs=dense, rate=0.4, training=mode == tf.estimator.ModeKeys.TRAIN)
  logits = tf.layers.dense(inputs=dropout, units=10)
  onehot_labels = tf.one_hot(indices=tf.cast(labels, tf.int32), depth=10)

  loss = tf.losses.softmax_cross_entropy(
      onehot_labels=onehot_labels, logits=logits)

  if mode == tf.estimator.ModeKeys.EVAL:
    return tpu_estimator.TPUEstimatorSpec(
        mode=mode,
        loss=loss,
        eval_metrics=(mnist_metric_fn, [labels, logits]))

  # Train.
  learning_rate = tf.train.exponential_decay(0.05,
                                             tf.train.get_global_step(), 100000,
                                             0.96)

  optimizer = tpu_optimizer.CrossShardOptimizer(
      tf.train.GradientDescentOptimizer(learning_rate=learning_rate))

  train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
  return tpu_estimator.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)


def mnist_get_input_fn(filename):
  """Returns an `input_fn` for train and eval."""

  def input_fn(params):
    """A simple input_fn using the experimental input pipeline."""
    # Retrieves the batch size for the current shard. The # of shards is
    # computed according to the input pipeline deployment. See
    # `tf.contrib.tpu.RunConfig` for details.
    batch_size = params["batch_size"]

    def parser(serialized_example):
      """Parses a single tf.Example into image and label tensors."""
      features = tf.parse_single_example(
          serialized_example,
          features={
              "image_raw": tf.FixedLenFeature([], tf.string),
              "label": tf.FixedLenFeature([], tf.int64),
          })
      image = tf.decode_raw(features["image_raw"], tf.uint8)
      image.set_shape([28 * 28])
      # Normalize the values of the image from the range [0, 255] to [-0.5, 0.5]
      image = tf.cast(image, tf.float32) * (1. / 255) - 0.5
      label = tf.cast(features["label"], tf.int32)
      return image, label

    dataset = tf.data.TFRecordDataset(
        filename, buffer_size=None)
    dataset = dataset.map(parser).cache().repeat()
    dataset = dataset.apply(
        tf.contrib.data.batch_and_drop_remainder(batch_size))
    images, labels = dataset.make_one_shot_iterator().get_next()
    return images, labels
  return input_fn

print("Estimator-based MNIST Model Defined Successfully")

Running a TPU Estimator-based Model

To run a TPU Estimator-based model, we define the run configuration (tpu_config.RunConfig), create a TPUEstimator from the model_fn, and then call train and evaluate on the estimator. This computation may take up to 120 seconds to run to completion due to TPU initialization overheads.


In [ ]:
from tensorflow.contrib.cluster_resolver import TPUClusterResolver
import os
import time

tpu_cluster_resolver = TPUClusterResolver([os.environ['TPU_NAME']], zone=os.environ['TPU_ZONE'], project=os.environ['GCE_PROJECT_NAME'])
tpu_grpc_url = tpu_cluster_resolver.get_master()

batch_size = 128
train_file = os.path.join(os.environ['GCS_DATA_PATH'], "train.tfrecords")
train_steps = 1000

eval_file = os.path.join(os.environ['GCS_DATA_PATH'], "validation.tfrecords")
eval_steps = 100

model_dir = os.path.join(os.environ['GCS_CKPT_PATH'], str(int(time.time()))) + "/"
iterations = 50
num_shards = 8

os.environ['MNIST_MODEL_DIR'] = model_dir
  
tf.logging.set_verbosity(tf.logging.INFO)

run_config = tpu_config.RunConfig(
    master=tpu_grpc_url,
    evaluation_master=tpu_grpc_url,
    model_dir=model_dir,
    session_config=tf.ConfigProto(
        allow_soft_placement=True, log_device_placement=True),
    tpu_config=tpu_config.TPUConfig(iterations, num_shards),)

estimator = tpu_estimator.TPUEstimator(
    model_fn=mnist_model_fn,
    use_tpu=True,
    train_batch_size=batch_size,
    eval_batch_size=batch_size,
    config=run_config)

estimator.train(input_fn=mnist_get_input_fn(train_file),
                max_steps=train_steps)

if eval_steps:
  estimator.evaluate(input_fn=mnist_get_input_fn(eval_file),
                     steps=eval_steps)
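
Optionally, you can list the model directory to confirm that checkpoints and TensorBoard event files were written to GCS.


In [ ]:
!gsutil ls -l $MNIST_MODEL_DIR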

Visualize Graphs in TensorBoard

You can visualize the model and training details in TensorBoard. To launch TensorBoard, pass the GCS path of the model directory via the --logdir flag, as in the cell below. Make sure that port 6006 is open (the firewall rule created in the setup steps covers it).

To stop TensorBoard, click on Kernel > Interrupt.


In [ ]:
!echo Visit http://`curl -H "Metadata-Flavor: Google" http://metadata/computeMetadata/v1/instance/network-interfaces/0/access-configs/0/external-ip 2> /dev/null`:6006 for TensorBoard
!which pip3 && pip3 install html5lib==0.99999999  # workaround for TensorBoard dependency error in Python 3
!tensorboard --logdir=$MNIST_MODEL_DIR

Deleting your Cloud TPU

To delete the Cloud TPU you have created, simply run the command below.


In [ ]:
!yes | gcloud alpha compute tpus delete $TPU_NAME
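
Optionally, you can list the TPUs in the zone to confirm the deletion; the demo TPU should no longer appear in the output.


In [ ]:
!gcloud alpha compute tpus list --zone=$TPU_ZONE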

That's All

Congratulations! You have finished the tutorial. Along the way, you have:

  1. Created a new Cloud TPU using the gcloud command.
  2. Created two GCS buckets and added permissions for the Cloud TPU to read from/write to these buckets.
  3. Ran a simple computation to verify that your Cloud TPU works.
  4. Created TFRecord files suitable for Cloud TPU consumption from the MNIST dataset.
  5. Trained an MNIST image recognition model and evaluated the results.
  6. Visualized the results of training using TensorBoard.
  7. Deleted a Cloud TPU after you are done with training and evaluation.

For more information about Cloud TPUs, you can take a look at the official Cloud TPU documentation.