In the previous notebook, we learned how to do transfer learning with TensorFlow Hub. In this notebook, we're going to kick up our training speed with TPUs.
First things first. Configure the parameters below to match your own Google Cloud project details.
In [ ]:
import os
os.environ["BUCKET"] = "your-bucket-here"
In order to train on a TPU, we'll need to set up a Python module for training. The skeleton for this has already been built out in tpu_models, with the data processing functions from the previous lab copied into util.py.
Similarly, the model building and training functions are pulled into model.py. This is almost entirely the same as before, except the TensorFlow Hub module path is now a variable to be provided by the user. We'll get into why in a bit.
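To make that concrete, a build_model that takes the hub module path as an argument might look roughly like the sketch below. This is an illustration only, not the actual contents of model.py; the layer sizes, loss, and optimizer are assumptions.

import tensorflow as tf
import tensorflow_hub as hub

def build_model(output_dir, hub_handle):
    """Builds a transfer-learning classifier from a TF Hub feature vector."""
    # output_dir mirrors the call in task.py; the real model.py may use it
    # for checkpoints or TensorBoard logs.
    # hub_handle can be a tfhub.dev URL or, as in this lab, a GCS path to a
    # downloaded copy of the module -- which is why it is now user-provided.
    model = tf.keras.Sequential([
        hub.KerasLayer(hub_handle, trainable=False,
                       input_shape=(224, 224, 3)),  # MobileNet V2 takes 224x224 RGB images
        tf.keras.layers.Dense(5, activation='softmax'),  # 5 flower classes
    ])
    model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',
        metrics=['accuracy'])
    return model

Now let's take a look at the new task.py file.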
We've added five command-line arguments which are standard for cloud training of a TensorFlow model: epochs, steps_per_epoch, train_path, eval_path, and job-dir. There are also two new arguments specific to TPU training: tpu_address and hub_path.
tpu_address is going to be our TPU name as it appears in Compute Engine Instances. We can specify this name with the ctpu up command.
hub_path is going to be a Google Cloud Storage path to a downloaded TensorFlow Hub module.
The other big difference is some code to deploy our model on a TPU. To begin, we'll set up a TPU Cluster Resolver, which will help TensorFlow communicate with the hardware to set up workers for training (more on TensorFlow Cluster Resolvers). Once the resolver connects to and initializes the TPU system, our TensorFlow graphs can be initialized within a TPU distribution strategy, allowing our TensorFlow code to take full advantage of the TPU hardware's capabilities.
TODO #1: Set up a TPU strategy
In [ ]:
%%writefile tpu_models/trainer/task.py
import argparse
import json
import os
import sys

import tensorflow as tf

from . import model
from . import util


def _parse_arguments(argv):
    """Parses command-line arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--epochs',
        help='The number of epochs to train',
        type=int, default=5)
    parser.add_argument(
        '--steps_per_epoch',
        help='The number of steps per epoch to train',
        type=int, default=500)
    parser.add_argument(
        '--train_path',
        help='The path to the training data',
        type=str, default="gs://cloud-ml-data/img/flower_photos/train_set.csv")
    parser.add_argument(
        '--eval_path',
        help='The path to the evaluation data',
        type=str, default="gs://cloud-ml-data/img/flower_photos/eval_set.csv")
    parser.add_argument(
        '--tpu_address',
        help='The name or address of the TPU to train on',
        type=str, required=True)
    parser.add_argument(
        '--hub_path',
        help='The path to the TF Hub module to use in GCS',
        type=str, required=True)
    parser.add_argument(
        '--job-dir',
        help='Directory where to save the given model',
        type=str, required=True)
    return parser.parse_known_args(argv)


def main():
    """Parses command line arguments and kicks off model training."""
    args = _parse_arguments(sys.argv[1:])[0]

    # TODO: define a TPU strategy
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
        tpu=args.tpu_address)
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.experimental.TPUStrategy(resolver)

    with strategy.scope():
        train_data = util.load_dataset(args.train_path)
        eval_data = util.load_dataset(args.eval_path, training=False)
        image_model = model.build_model(args.job_dir, args.hub_path)

    model_history = model.train_and_evaluate(
        image_model, args.epochs, args.steps_per_epoch,
        train_data, eval_data, args.job_dir)


if __name__ == '__main__':
    main()
Before we can start training with this code, we need a way to pull in MobileNet. When working with TPUs in the cloud, the TPU does not have access to the VM's local file system, since the TPU worker acts as a server. Because of this, all data used by our model must be hosted on an external storage system such as Google Cloud Storage. This also makes caching our dataset especially important for speeding up training.
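As a toy illustration of where caching fits in a tf.data input pipeline (this is not the lab's util.py code; the map function and sizes are placeholders):

import tensorflow as tf

# A tiny stand-in dataset; in the lab, the upstream steps would be reading
# image files from GCS and decoding/resizing them.
dataset = tf.data.Dataset.range(10)
dataset = (
    dataset
    .map(lambda x: x * 2)                     # stands in for the expensive decode/resize work
    .cache()                                  # later epochs reuse these results instead of re-reading GCS
    .shuffle(buffer_size=10)
    .batch(4)
    .prefetch(tf.data.experimental.AUTOTUNE)
)
for batch in dataset:
    print(batch.numpy())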
To access MobileNet with these restrictions, we can download a compressed saved version of the model by using the wget command. Adding ?tf-hub-format=compressed at the end of our module handle gives us a download URL.
In [ ]:
!wget https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/feature_vector/4?tf-hub-format=compressed
This model is still compressed, so let's uncompress it with the tar command below and place it in our tpu_models directory.
In [ ]:
%%bash
rm -r tpu_models/hub
mkdir tpu_models/hub
tar xvzf 4?tf-hub-format=compressed -C tpu_models/hub/
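If you'd like to sanity-check the extraction, you can list the directory. A TF2-style hub module unpacks into a SavedModel, so you should see a saved_model.pb file and a variables/ folder.
In [ ]:
!ls tpu_models/hub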
Finally, we need to transfer our materials to the TPU. We'll use GCS as a go-between, using gsutil cp to copy everything.
In [ ]:
!gsutil rm -r gs://$BUCKET/tpu_models
!gsutil cp -r tpu_models gs://$BUCKET/tpu_models
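Optionally, you can confirm the copy landed where expected by listing the bucket path:
In [ ]:
!gsutil ls gs://$BUCKET/tpu_models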
Time to wake up a TPU! Open the Google Cloud Shell and copy the ctpu up command below. Say 'Yes' to the prompts to spin up the TPU.
ctpu up --zone=us-central1-b --tf-version=2.1 --name=my-tpu
It will take about five minutes for the TPU to wake up. After that, Cloud Shell should automatically SSH into the TPU; alternatively, you can SSH in from the Compute Engine interface. You'll know you're running on the TPU when the command line starts with your-username@your-tpu-name.
This is a fresh TPU and it still needs our code. Run the cell below and copy the output into your TPU terminal to copy your model from your GCS bucket. Don't forget to include the . at the end, as it tells gsutil to copy the data into the current directory.
In [ ]:
!echo "gsutil cp -r gs://$BUCKET/tpu_models ."
Time to shine, TPU! Run the cell below and copy the output into your TPU terminal. Training will be slow at first, but it will pick up speed after a few minutes once the TensorFlow graph has been built.
TODO #2 and #3: Specify the tpu_address and hub_path
In [ ]:
!echo "python3 -m tpu_models.trainer.task \
--tpu_address=my-tpu \
--hub_path=gs://$BUCKET/tpu_models/hub/ \
--job-dir=gs://$BUCKET/flowers_tpu_$(date -u +%y%m%d_%H%M%S)"
How did it go? In the previous lab, it took about 2-3 minutes to get through 25 images. On the TPU, it took 5-6 minutes to get through 2500. That's more than 40x faster! And now our accuracy is over 90%! Congratulations!
Time to pack up shop. Run exit in the TPU terminal to close the SSH connection, and ctpu delete --zone=us-central1-b --name=my-tpu in the Cloud Shell to delete the TPU instance. Alternatively, it can be deleted through the Compute Engine Interface.
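If you want to double-check that nothing is left running (and billing), ctpu also has a status subcommand; this check is optional, and the flags below assume the same zone and name used above.

ctpu status --zone=us-central1-b --name=my-tpu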
Copyright 2020 Google Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.