This notebook requires that you run it on a GCE VM in a GCP project that has Cloud TPU quota. If you are not using this notebook on a pre-built virtual machine image, here is how you can start a new GCE VM with the right settings:
1. In a GCP project with Cloud TPU quota (for example, tpu-jupyterhub-demo), create a new GCE VM (for example, also named tpu-jupyterhub-demo) with full access to the Cloud APIs.
2. Create a firewall rule (for example, named tpu-jupyterhub-demo) that allows incoming connections from 0.0.0.0/0 on tcp:6006,8888 so that you can reach TensorBoard and Jupyter.
3. SSH into the VM and install the required packages:
   a. sudo apt-get update
   b. sudo apt-get -y install python3 python3-pip
   c. sudo -H pip3 install jupyter tf-nightly google-api-python-client
4. Start the Jupyter notebook server: jupyter notebook --no-browser --ip=0.0.0.0
5. Open the Jupyter URL in your browser and upload this notebook using the Upload button.
A gcloud equivalent of steps 1 and 2 is sketched below.
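The following commands are a minimal sketch of how steps 1 and 2 could be done from the command line instead of the Cloud Console; the instance name, firewall rule name, zone, and machine type below are example values and should be adjusted for your project:
# Create the demo VM with full access to the Cloud APIs (name, zone, and machine type are examples).
gcloud compute instances create tpu-jupyterhub-demo \
    --zone=us-central1-f \
    --machine-type=n1-standard-2 \
    --scopes=cloud-platform
# Allow incoming connections to Jupyter (8888) and TensorBoard (6006).
gcloud compute firewall-rules create tpu-jupyterhub-demo \
    --allow=tcp:6006,tcp:8888 \
    --source-ranges=0.0.0.0/0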
Please modify the following environment variables as required for the notebook:
GCE_PROJECT_NAME: The name of the GCE project this VM (and your Cloud TPU) starts in.
TPU_ZONE: The GCE zone in which you want your Cloud TPU to start.
TPU_NAME: The name of the Cloud TPU.
TPU_IP_RANGE: The IP address range for the Cloud TPU.
GCS_DATA_PATH: The GCS path where we will store sample test data for Cloud TPUs. As GCS bucket namespaces are global, you may need to change this.
GCS_CKPT_PATH: The GCS path where we will store sample checkpoint data for Cloud TPUs. As GCS bucket namespaces are global, you may need to change this.
Note: We will grant Storage Admin permissions to Cloud TPUs on the GCS buckets specified in GCS_DATA_PATH and GCS_CKPT_PATH. We encourage you to specify new buckets or test buckets containing non-production data.
In [ ]:
%env GCE_PROJECT_NAME my-sample-tpu-project
%env TPU_ZONE us-central1-f
%env TPU_NAME demo-tpu
%env TPU_IP_RANGE 10.240.1.0/29
%env GCS_DATA_PATH gs://cloud-tpu-data-bucket/mnist/
%env GCS_CKPT_PATH gs://cloud-tpu-checkpoint-bucket/mnist/
# Automatically get bucket name from GCS paths
import os
os.environ['GCS_DATA_BUCKET'] = os.environ['GCS_DATA_PATH'][5:].split('/')[0]
os.environ['GCS_CKPT_BUCKET'] = os.environ['GCS_CKPT_PATH'][5:].split('/')[0]
In [ ]:
!gcloud config set compute/zone $TPU_ZONE
!gcloud alpha compute tpus create $TPU_NAME --range=$TPU_IP_RANGE --accelerator-type=tpu-v2 --version=nightly --zone=$TPU_ZONE
Here, we will create two GCS buckets: one for training/test data (GCS_DATA_BUCKET), and the other for TensorFlow checkpoint and TensorBoard metric data (GCS_CKPT_BUCKET).
The first two commands create the buckets in a single region for maximum performance, and the final command grants the Cloud TPU service account Storage Admin access to both buckets so that it can read from and write to them.
In [ ]:
!gsutil mb -c regional -l us-central1 gs://$GCS_DATA_BUCKET
!gsutil mb -c regional -l us-central1 gs://$GCS_CKPT_BUCKET
!gsutil iam ch serviceAccount:`gcloud alpha compute tpus describe $TPU_NAME | grep serviceAccount | cut -d' ' -f2`:admin gs://$GCS_DATA_BUCKET gs://$GCS_CKPT_BUCKET && echo 'Successfully set permissions!'
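If you want to confirm that the permissions were applied, you can optionally inspect the IAM policy on the data bucket and check that the Cloud TPU service account appears with the Storage Admin role:
In [ ]:
!gsutil iam get gs://$GCS_DATA_BUCKET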
The following code runs on this GCE VM: it resolves the IP address of the Cloud TPU using TPUClusterResolver, connects to the Cloud TPU with tf.Session, and runs a simple calculation on the TPU.
You should see a 3x3 array of random numbers being printed out if the command is successful. This computation may take up to 60 seconds to run to completion due to TPU initialization overheads.
In [ ]:
import os
import tensorflow as tf
from tensorflow.contrib import tpu
from tensorflow.contrib.cluster_resolver import TPUClusterResolver
def axy_computation(a, x, y):
return a * x + y
inputs = [
3.0,
tf.random_uniform([3, 3], 0, 1, tf.float32),
tf.random_uniform([3, 3], 0, 1, tf.float32),
]
tpu_computation = tpu.rewrite(axy_computation, inputs)
tpu_cluster_resolver = TPUClusterResolver([os.environ['TPU_NAME']], zone=os.environ['TPU_ZONE'], project=os.environ['GCE_PROJECT_NAME'])
tpu_grpc_url = tpu_cluster_resolver.get_master()
with tf.Session(tpu_grpc_url) as sess:
print('Initializing TPU...')
sess.run(tpu.initialize_system())
print('Initializing global variables...')
sess.run(tf.global_variables_initializer())
print('Executing TPU operation...')
output = sess.run(tpu_computation)
print(output)
print('Shutting down TPU...')
sess.run(tpu.shutdown_system())
print('Done!')
The script below downloads the MNIST data from http://yann.lecun.com/exdb/mnist/, converts it into TFRecord files, and writes them directly to the GCS data path (GCS_DATA_PATH) in the bucket we created earlier. If the run is successful, the last line of the output should read Finished writing TFRecords to followed by your GCS_DATA_PATH.
In [ ]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
import os
import sys
import tensorflow as tf
from tensorflow.contrib.learn.python.learn.datasets import mnist
def _int64_feature(value):
return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
def _bytes_feature(value):
return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
def convert_mnist_data_to_tfrecord(data_set, name):
"""Converts a dataset to tfrecords."""
images = data_set.images
labels = data_set.labels
num_examples = data_set.num_examples
if images.shape[0] != num_examples:
raise ValueError('Images size %d does not match label size %d.' %
(images.shape[0], num_examples))
rows = images.shape[1]
cols = images.shape[2]
depth = images.shape[3]
filename = os.path.join(os.environ['GCS_DATA_PATH'], name + '.tfrecords')
print('Writing', filename)
with tf.python_io.TFRecordWriter(filename) as writer:
for index in range(num_examples):
image_raw = images[index].tostring()
example = tf.train.Example(
features=tf.train.Features(
feature={
'height': _int64_feature(rows),
'width': _int64_feature(cols),
'depth': _int64_feature(depth),
'label': _int64_feature(int(labels[index])),
'image_raw': _bytes_feature(image_raw)
}))
writer.write(example.SerializeToString())
def convert_mnist_to_tfrecord():
# Get the data.
data_sets = mnist.read_data_sets('/tmp',
dtype=tf.uint8,
reshape=False,
validation_size=5000)
# Convert to Examples and write the result to TFRecords.
convert_mnist_data_to_tfrecord(data_sets.train, 'train')
convert_mnist_data_to_tfrecord(data_sets.validation, 'validation')
convert_mnist_data_to_tfrecord(data_sets.test, 'test')
convert_mnist_to_tfrecord()
print("Finished writing TFRecords to %s" % os.environ['GCS_DATA_PATH'])
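To verify that the conversion succeeded, you can optionally list the contents of the data path; you should see train.tfrecords, validation.tfrecords, and test.tfrecords:
In [ ]:
!gsutil ls -l $GCS_DATA_PATH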
We can define a simple model with TensorFlow Estimators that trains on the MNIST dataset to classify images. As part of using TensorFlow Estimators, we need to create a model function that defines the model (mnist_model_fn) and an input function that processes the inputs (the result of mnist_get_input_fn).
In [ ]:
from tensorflow.contrib.tpu.python.tpu import tpu_config
from tensorflow.contrib.tpu.python.tpu import tpu_estimator
from tensorflow.contrib.tpu.python.tpu import tpu_optimizer
def mnist_metric_fn(labels, logits):
  """Evaluation metric Fn which runs on CPU."""
  predictions = tf.argmax(logits, 1)
  return {
      "accuracy": tf.metrics.accuracy(
          labels=labels, predictions=predictions),
  }
def mnist_model_fn(features, labels, mode, params):
"""A simple CNN."""
del params
if mode == tf.estimator.ModeKeys.PREDICT:
raise RuntimeError("mode {} is not supported yet".format(mode))
input_layer = tf.reshape(features, [-1, 28, 28, 1])
conv1 = tf.layers.conv2d(
inputs=input_layer,
filters=32,
kernel_size=[5, 5],
padding="same",
activation=tf.nn.relu)
pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)
conv2 = tf.layers.conv2d(
inputs=pool1,
filters=64,
kernel_size=[5, 5],
padding="same",
activation=tf.nn.relu)
pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2)
pool2_flat = tf.reshape(pool2, [-1, 7 * 7 * 64])
dense = tf.layers.dense(inputs=pool2_flat, units=128, activation=tf.nn.relu)
dropout = tf.layers.dropout(
inputs=dense, rate=0.4, training=mode == tf.estimator.ModeKeys.TRAIN)
logits = tf.layers.dense(inputs=dropout, units=10)
onehot_labels = tf.one_hot(indices=tf.cast(labels, tf.int32), depth=10)
loss = tf.losses.softmax_cross_entropy(
onehot_labels=onehot_labels, logits=logits)
if mode == tf.estimator.ModeKeys.EVAL:
return tpu_estimator.TPUEstimatorSpec(
mode=mode,
loss=loss,
eval_metrics=(mnist_metric_fn, [labels, logits]))
# Train.
learning_rate = tf.train.exponential_decay(0.05,
tf.train.get_global_step(), 100000,
0.96)
optimizer = tpu_optimizer.CrossShardOptimizer(
tf.train.GradientDescentOptimizer(learning_rate=learning_rate))
train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
return tpu_estimator.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)
def mnist_get_input_fn(filename):
"""Returns an `input_fn` for train and eval."""
def input_fn(params):
"""A simple input_fn using the experimental input pipeline."""
# Retrieves the batch size for the current shard. The # of shards is
# computed according to the input pipeline deployment. See
# `tf.contrib.tpu.RunConfig` for details.
batch_size = params["batch_size"]
def parser(serialized_example):
"""Parses a single tf.Example into image and label tensors."""
features = tf.parse_single_example(
serialized_example,
features={
"image_raw": tf.FixedLenFeature([], tf.string),
"label": tf.FixedLenFeature([], tf.int64),
})
image = tf.decode_raw(features["image_raw"], tf.uint8)
image.set_shape([28 * 28])
# Normalize the values of the image from the range [0, 255] to [-0.5, 0.5]
image = tf.cast(image, tf.float32) * (1. / 255) - 0.5
label = tf.cast(features["label"], tf.int32)
return image, label
dataset = tf.data.TFRecordDataset(
filename, buffer_size=None)
dataset = dataset.map(parser).cache().repeat()
dataset = dataset.apply(
tf.contrib.data.batch_and_drop_remainder(batch_size))
images, labels = dataset.make_one_shot_iterator().get_next()
return images, labels
return input_fn
print("Estimator-based MNIST Model Defined Successfully")
To run a TPU Estimator-based model, we define some configuration (tpu_config.RunConfig), create a TPUEstimator from the model_fn, and then call train and evaluate on the Estimator. This computation may take up to 120 seconds to run to completion due to TPU initialization overheads.
In [ ]:
from tensorflow.contrib.cluster_resolver import TPUClusterResolver
import os
import time
tpu_cluster_resolver = TPUClusterResolver([os.environ['TPU_NAME']], zone=os.environ['TPU_ZONE'], project=os.environ['GCE_PROJECT_NAME'])
tpu_grpc_url = tpu_cluster_resolver.get_master()
batch_size = 128
train_file = os.path.join(os.environ['GCS_DATA_PATH'], "train.tfrecords")
train_steps = 1000
eval_file = os.path.join(os.environ['GCS_DATA_PATH'], "validation.tfrecords")
eval_steps = 100
model_dir = os.path.join(os.environ['GCS_CKPT_PATH'], str(int(time.time()))) + "/"
iterations = 50
num_shards = 8
os.environ['MNIST_MODEL_DIR'] = model_dir
tf.logging.set_verbosity(tf.logging.INFO)
run_config = tpu_config.RunConfig(
master=tpu_grpc_url,
evaluation_master=tpu_grpc_url,
model_dir=model_dir,
session_config=tf.ConfigProto(
allow_soft_placement=True, log_device_placement=True),
tpu_config=tpu_config.TPUConfig(iterations, num_shards),)
estimator = tpu_estimator.TPUEstimator(
model_fn=mnist_model_fn,
use_tpu=True,
train_batch_size=batch_size,
eval_batch_size=batch_size,
config=run_config)
estimator.train(input_fn=mnist_get_input_fn(train_file),
max_steps=train_steps)
if eval_steps:
estimator.evaluate(input_fn=mnist_get_input_fn(eval_file),
steps=eval_steps)
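TPUEstimator.evaluate returns a dictionary of metrics (the loss, the global step, and the accuracy metric defined in mnist_metric_fn). If you want to inspect the numbers directly, a minimal variation of the evaluation call above captures and prints them:
In [ ]:
eval_results = estimator.evaluate(input_fn=mnist_get_input_fn(eval_file),
                                  steps=eval_steps)
print("Evaluation results: %s" % eval_results)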
In [ ]:
!echo Visit http://`curl -H "Metadata-Flavor: Google" http://metadata/computeMetadata/v1/instance/network-interfaces/0/access-configs/0/external-ip 2> /dev/null`:6006 for TensorBoard
!which pip3 && pip3 install html5lib==0.99999999 # workaround for TensorBoard dependency error in Python 3
!tensorboard --logdir=$MNIST_MODEL_DIR
In [ ]:
!yes | gcloud alpha compute tpus delete $TPU_NAME
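If you also want to remove the sample GCS buckets created earlier, and they contain no data you want to keep, you can delete them and all of their contents with gsutil:
In [ ]:
!gsutil -m rm -r gs://$GCS_DATA_BUCKET gs://$GCS_CKPT_BUCKET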
Congratulations! You have finished our tutorial. In this tutorial, you have created a Cloud TPU and GCS buckets, run a simple computation directly on the TPU, trained and evaluated an MNIST model with TPUEstimator, viewed the results in TensorBoard, and cleaned up the Cloud TPU with the gcloud command.
For more information about Cloud TPUs, you can take a look at the official Cloud TPU documentation.