Image Classification from scratch with TPUs on Cloud ML Engine using ResNet

This notebook demonstrates how to do image classification from scratch on a flowers dataset, using TPUs on Cloud ML Engine and the ResNet trainer from the TensorFlow TPU repository.


In [ ]:
import os
PROJECT = 'cloud-training-demos' # REPLACE WITH YOUR PROJECT ID
BUCKET = 'cloud-training-demos-ml' # REPLACE WITH YOUR BUCKET NAME
REGION = 'us-central1' # REPLACE WITH YOUR BUCKET REGION e.g. us-central1

# do not change these
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '1.9'

In [ ]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Convert JPEG images to TensorFlow Records

My dataset consists of JPEG images in Google Cloud Storage. I have two CSV files (one for training, one for evaluation), each formatted as: image-path, category

Instead of decoding the JPEG images every time they are read, we'll convert the data once and store it as TFRecord files.
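
For reference, each line of these CSVs pairs a GCS image path with its label, something like this (illustrative only; the actual file names will differ, and the next cell prints the real first few lines):

gs://cloud-ml-data/img/flower_photos/daisy/<image>.jpg,daisy
gs://cloud-ml-data/img/flower_photos/tulips/<image>.jpg,tulips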


In [ ]:
%%bash
gsutil cat gs://cloud-ml-data/img/flower_photos/train_set.csv | head -5 > /tmp/input.csv
cat /tmp/input.csv

In [ ]:
%%bash
gsutil cat gs://cloud-ml-data/img/flower_photos/train_set.csv  | sed 's/,/ /g' | awk '{print $2}' | sort | uniq > /tmp/labels.txt
cat /tmp/labels.txt

Clone the TPU repo

Let's git clone the repo and get the preprocessing and model files. The model code has imports of the form:

import resnet_model as model_lib

We will need to change these to package-relative imports, because the model files end up inside the trainer package (with setup.py and __init__.py) that is submitted to Cloud ML Engine:

from . import resnet_model as model_lib

In [ ]:
%%writefile copy_resnet_files.sh
#!/bin/bash
rm -rf tpu
git clone https://github.com/tensorflow/tpu
cd tpu
TFVERSION=$1
echo "Switching to version r$TFVERSION"
git checkout r$TFVERSION
cd ..
  
MODELCODE=tpu/models/official/resnet
OUTDIR=mymodel
rm -rf $OUTDIR

# preprocessing
cp -r imgclass $OUTDIR   # brings in setup.py and __init__.py
cp tpu/tools/datasets/jpeg_to_tf_record.py $OUTDIR/trainer/preprocess.py

# model: fix imports
for FILE in $(ls -p $MODELCODE | grep -v /); do
    CMD="cat $MODELCODE/$FILE "
    for f2 in $(ls -p $MODELCODE | grep -v /); do
        MODULE=$(basename $f2 .py)   # strip the .py extension to get the module name
        CMD="$CMD | sed 's/^import ${MODULE}/from . import ${MODULE}/g' "
    done
    CMD="$CMD > $OUTDIR/trainer/$FILE"
    eval $CMD
done
find $OUTDIR
echo "Finished copying files into $OUTDIR"

In [ ]:
!bash ./copy_resnet_files.sh $TFVERSION

Enable TPU service account

Allow Cloud ML Engine to access the TPU and bill usage to your project.


In [ ]:
%%writefile enable_tpu_mlengine.sh
SVC_ACCOUNT=$(curl -H "Authorization: Bearer $(gcloud auth print-access-token)"  \
    https://ml.googleapis.com/v1/projects/${PROJECT}:getConfig \
              | grep tpuServiceAccount | tr '"' ' ' | awk '{print $3}' )
echo "Enabling TPU service account $SVC_ACCOUNT to act as Cloud ML Service Agent"
gcloud projects add-iam-policy-binding $PROJECT \
    --member serviceAccount:$SVC_ACCOUNT --role roles/ml.serviceAgent
echo "Done"

In [ ]:
!bash ./enable_tpu_mlengine.sh
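
If you want to double-check that the binding was added, you can inspect the project's IAM policy, e.g. (a rough check; the exact service account email is the one printed by the script above):

gcloud projects get-iam-policy $PROJECT --format=json | grep -C 3 ml.serviceAgent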

Try preprocessing locally


In [ ]:
%%bash
export PYTHONPATH=${PYTHONPATH}:${PWD}/mymodel
  
rm -rf /tmp/out
python -m trainer.preprocess \
       --train_csv /tmp/input.csv \
       --validation_csv /tmp/input.csv \
       --labels_file /tmp/labels.txt \
       --project_id $PROJECT \
       --output_dir /tmp/out --runner=DirectRunner

In [ ]:
!ls -l /tmp/out

Now run it over the full training and evaluation datasets. This will happen in Cloud Dataflow.


In [ ]:
%%bash
export PYTHONPATH=${PYTHONPATH}:${PWD}/mymodel
gsutil -m rm -rf gs://${BUCKET}/tpu/resnet/data
python -m trainer.preprocess \
       --train_csv gs://cloud-ml-data/img/flower_photos/train_set.csv \
       --validation_csv gs://cloud-ml-data/img/flower_photos/eval_set.csv \
       --labels_file /tmp/labels.txt \
       --project_id $PROJECT \
       --output_dir gs://${BUCKET}/tpu/resnet/data

The above preprocessing step will take 15-20 minutes. Wait for the job to finish before you proceed. Navigate to the Cloud Dataflow section of the GCP web console to monitor the job's progress.
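
If you prefer the command line, you can also check on the job with something like:

gcloud dataflow jobs list --status=active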

Alternatively, you can simply copy my already-preprocessed files and proceed to the next step (if you do, point the --data_dir flag in the training command below at this copied_data directory):

gsutil -m cp gs://cloud-training-demos/tpu/resnet/data/* gs://${BUCKET}/tpu/resnet/copied_data

In [ ]:
%%bash
gsutil ls gs://${BUCKET}/tpu/resnet/data

Train on the Cloud


In [ ]:
%%bash
echo -n "--num_train_images=$(gsutil cat gs://cloud-ml-data/img/flower_photos/train_set.csv | wc -l)  "
echo -n "--num_eval_images=$(gsutil cat gs://cloud-ml-data/img/flower_photos/eval_set.csv | wc -l)  "
echo "--num_label_classes=$(cat /tmp/labels.txt | wc -l)"

In [ ]:
%%bash
TOPDIR=gs://${BUCKET}/tpu/resnet
OUTDIR=${TOPDIR}/trained
JOBNAME=imgclass_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR  # Comment out this line to continue training from the last time
gcloud ml-engine jobs submit training $JOBNAME \
  --region=$REGION \
  --module-name=trainer.resnet_main \
  --package-path=$(pwd)/mymodel/trainer \
  --job-dir=$OUTDIR \
  --staging-bucket=gs://$BUCKET \
  --scale-tier=BASIC_TPU \
  --runtime-version=$TFVERSION --python-version=3.5 \
  -- \
  --data_dir=${TOPDIR}/data \
  --model_dir=${OUTDIR} \
  --resnet_depth=18 \
  --train_batch_size=128 --eval_batch_size=32 --skip_host_call=True \
  --steps_per_eval=250 --train_steps=1000 \
  --num_train_images=3300  --num_eval_images=370  --num_label_classes=5 \
  --export_dir=${OUTDIR}/export

The above training job will take 15-20 minutes. Wait for the job to finish before you proceed. Navigate to the Cloud ML Engine section of the GCP web console to monitor job progress.
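
You can also check on the job from the command line, for example (where $JOBNAME is the job name echoed by the cell above):

gcloud ml-engine jobs describe $JOBNAME
gcloud ml-engine jobs stream-logs $JOBNAME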

The model should finish with an 80-83% accuracy (results will vary):

Eval results: {'global_step': 1000, 'loss': 0.7359053, 'top_1_accuracy': 0.82954544, 'top_5_accuracy': 1.0}

In [ ]:
%%bash
gsutil ls gs://${BUCKET}/tpu/resnet/trained/export/

You can look at the training charts with TensorBoard:


In [ ]:
OUTDIR = 'gs://{}/tpu/resnet/trained/'.format(BUCKET)
from google.datalab.ml import TensorBoard
TensorBoard().start(OUTDIR)

In [ ]:
TensorBoard().stop(11531)  # replace 11531 with the pid reported when TensorBoard was started
print("Stopped Tensorboard")

In the charts I got (with smoothing set to zero), the final blue dot (the evaluation loss) was quite close to the lowest training loss, indicating that the model hasn't overfit. The top_1 accuracy on the evaluation dataset, however, was about 80%, which isn't that great; more data would help.

Deploying and predicting with the model

Deploy the model:


In [ ]:
%%bash
MODEL_NAME="flowers"
MODEL_VERSION=resnet
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/tpu/resnet/trained/export/ | tail -1)
echo "Deleting/deploying $MODEL_NAME $MODEL_VERSION from $MODEL_LOCATION ... this will take a few minutes"

# comment/uncomment the appropriate line to run. The first time around, you will need only the two create calls
# But during development, you might need to replace a version by deleting the version and creating it again

#gcloud ml-engine versions delete --quiet ${MODEL_VERSION} --model ${MODEL_NAME}
#gcloud ml-engine models delete ${MODEL_NAME}
gcloud ml-engine models create ${MODEL_NAME} --regions $REGION
gcloud ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION} --runtime-version=$TFVERSION

We can use saved_model_cli to find out what inputs the model expects:


In [ ]:
%%bash
saved_model_cli show --dir $(gsutil ls gs://${BUCKET}/tpu/resnet/trained/export/ | tail -1) --tag_set serve --signature_def serving_default

As you can see, the model expects image_bytes, which is typically base64-encoded.

To predict with the model, let's take one of the example images available on Google Cloud Storage, base64-encode it, and wrap it in a JSON request.


In [ ]:
import base64, sys, json
import tensorflow as tf
import io
with tf.gfile.GFile('gs://cloud-ml-data/img/flower_photos/sunflowers/1022552002_2b93faf9e7_n.jpg', 'rb') as ifp:
  with io.open('test.json', 'w') as ofp:
    image_data = ifp.read()
    img = base64.b64encode(image_data).decode('utf-8')
    json.dump({"image_bytes": {"b64": img}}, ofp)

In [ ]:
!ls -l test.json

Send it to the prediction service:


In [ ]:
%%bash
gcloud ml-engine predict --model=flowers --version=resnet --json-instances=./test.json

What does CLASS no. 3 correspond to? (Remember that class indices are 0-based, so class 3 is the fourth line of labels.txt.)


In [ ]:
%%bash
head -4 /tmp/labels.txt | tail -1

Here's how you would invoke the prediction service programmatically, without using gcloud:


In [ ]:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import base64, sys, json
import tensorflow as tf

with tf.gfile.GFile('gs://cloud-ml-data/img/flower_photos/sunflowers/1022552002_2b93faf9e7_n.jpg', 'rb') as ifp:
  credentials = GoogleCredentials.get_application_default()
  api = discovery.build('ml', 'v1', credentials=credentials,
            discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1_discovery.json')
  
  request_data = {'instances':
  [
      {"image_bytes": {"b64": base64.b64encode(ifp.read()).decode('utf-8')}}
  ]}

  parent = 'projects/%s/models/%s/versions/%s' % (PROJECT, 'flowers', 'resnet')
  response = api.projects().predict(body=request_data, name=parent).execute()
  print("response={0}".format(response))
# Copyright 2018 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
