By deploying or using this software you agree to comply with the AI Hub Terms of Service and the Google APIs Terms of Service. To the extent of a direct conflict of terms, the AI Hub Terms of Service will control.

Overview

This notebook provides an example workflow of using the ResNet ML container for training an Image classification ML model.

Dataset

The flower dataset container 3670 images with 5 classes: daisy, dandelion, roses, sunflowers, and tulips. The target is an integer variable with class IDs from 1 to 5. They are split to 3300 training and 370 validation images. It is publically available at: gs://cloud-training-demos/tpu/resnet/data.

Objective

The goal of this notebook is to go through a common training workflow:

  • Train an ML model using the AI Platform Training service
  • Monitor the training job with TensorBoard
  • Identify if the model was trained successfully by looking at the generated "Run Report"
  • Deploy the model for serving using the AI Platform Prediction service
  • Use the endpoint for online predictions

Costs

This tutorial uses billable components of Google Cloud Platform (GCP):

  • Cloud AI Platform
  • Cloud Storage

Learn about Cloud AI Platform pricing and Cloud Storage pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.

Set up your local development environment

If you are using Colab or AI Platform Notebooks, your environment already meets all the requirements to run this notebook. You can skip this step.

Otherwise, make sure your environment meets this notebook's requirements. You need the following:

  • The Google Cloud SDK
  • Git
  • Python 3
  • virtualenv
  • Jupyter notebook running in a virtual environment with Python 3

The Google Cloud guide to Setting up a Python development environment and the Jupyter installation guide provide detailed instructions for meeting these requirements. The following steps provide a condensed set of instructions:

  1. Install and initialize the Cloud SDK.

  2. Install Python 3.

  3. Install virtualenv and create a virtual environment that uses Python 3.

  4. Activate that environment and run pip install jupyter in a shell to install Jupyter.

  5. Run jupyter notebook in a shell to launch Jupyter.

  6. Open this notebook in the Jupyter Notebook Dashboard.

Set up your GCP project

The following steps are required, regardless of your notebook environment.

  1. Select or create a GCP project.. When you first create an account, you get a $300 free credit towards your compute/storage costs.

  2. Make sure that billing is enabled for your project.

  3. Enable the AI Platform APIs and Compute Engine APIs.

  4. Enter your project ID in the cell below. Then run the cell to make sure the Cloud SDK uses the right project for all the commands in this notebook.

Note: Jupyter runs lines prefixed with ! as shell commands, and it interpolates Python variables prefixed with $ into these commands.


In [ ]:
PROJECT_ID = "[your-project-id]" #@param {type:"string"}
! gcloud config set project $PROJECT_ID

Authenticate your GCP account

If you are using AI Platform Notebooks, your environment is already authenticated. Skip this step.

If you are using Colab, run the cell below and follow the instructions when prompted to authenticate your account via oAuth.

Otherwise, follow these steps:

  1. In the GCP Console, go to the Create service account key page.

  2. From the Service account drop-down list, select New service account.

  3. In the Service account name field, enter a name.

  4. From the Role drop-down list, select Machine Learning Engine > AI Platform Admin and Storage > Storage Object Admin.

  5. Click Create. A JSON file that contains your key downloads to your local environment.

  6. Enter the path to your service account key as the GOOGLE_APPLICATION_CREDENTIALS variable in the cell below and run the cell.


In [ ]:
import sys

# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

if 'google.colab' in sys.modules:
  from google.colab import auth as google_auth
  google_auth.authenticate_user()

# If you are running this notebook locally, replace the string below with the
# path to your service account key and run this cell to authenticate your GCP
# account.
else:
  %env GOOGLE_APPLICATION_CREDENTIALS ''

Create a Cloud Storage bucket

The following steps are required, regardless of your notebook environment.

You need to have a "workspace" bucket that will hold the dataset and the output from the ML Container. Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.

You may also change the REGION variable, which is used for operations throughout the rest of this notebook. Make sure to choose a region where Cloud AI Platform services are available. You may not use a Multi-Regional Storage bucket for training with AI Platform.


In [ ]:
BUCKET_NAME = "[your-bucket-name]" #@param {type:"string"}
REGION = 'us-central1' #@param {type:"string"}

Only if your bucket doesn't already exist: Run the following cell to create your Cloud Storage bucket.


In [ ]:
! gsutil mb -l $REGION gs://$BUCKET_NAME

Finally, validate access to your Cloud Storage bucket by examining its contents:


In [ ]:
! gsutil ls -al gs://$BUCKET_NAME

Import libraries and define constants


In [ ]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import time
import base64
import requests
import tensorflow as tf
from IPython.core.display import HTML
from googleapiclient import discovery

Cloud training

Accelerator and distribution support

GPU Multi-GPU Node TPU Workers Parameter Server
Yes Yes Yes No No

Note that the ResNet can be trained with a GPU[s] as well.

--master-machine-type standard_gpu

In [ ]:
output_location = os.path.join('gs://', BUCKET_NAME, 'output')

job_name = "resnet_{}".format(time.strftime("%Y%m%d%H%M%S"))
!gcloud beta ai-platform jobs submit training $job_name \
    --master-image-uri gcr.io/aihub-c2t-containers/kfp-components/oob_algorithm/resnet:latest \
    --region $REGION \
    --scale-tier CUSTOM \
    --master-machine-type standard \
    --worker-machine-type cloud_tpu \
    --worker-count 1 \
    --tpu-tf-version 1.14 \
    -- \
    --output-location $output_location \
    --data gs://cloud-training-demos/tpu/resnet/data \
    --number-of-classes 5 \
    --training-steps 100000

Monitor the training with TensorBoard


In [ ]:
try:
  %load_ext tensorboard
  %tensorboard --logdir {output_location}
except:
  !tensorboard --logdir {output_location}

Inspect the Run Report

The "Run Report" will help you identify if the model was successfully trained.


In [11]:
if not tf.io.gfile.exists(os.path.join(output_location, 'report.html')):
  raise RuntimeError('The file report.html was not found. Did the training job finish?')

with tf.io.gfile.GFile(os.path.join(output_location, 'report.html')) as f:
  display(HTML(f.read()))


temp_input_nb
+ Table of Contents

Runtime arguments

value
data gs://cloud-training-demos/tpu/resnet/data
output_location gs://aihub-content-test/resnet_example
number_of_classes 5
use_cache False
resnet_depth 50
training_steps 100000
batch_size 64
learning_rate 0.1
momentum 0.9
remainder None
use_tpu True
tpu cmle-training-15625827900301499714-tpu
gcp_project ee2a81c09470c949f-ml
tpu_zone us-central1-b
num_cores 8
precision bfloat16
num_train_images 3300
num_eval_images 370

Tensorboard snippet

To see the training progress, you can need to install the latest tensorboard with the command: pip install -U tensorboard and then run one of the following commands.

Local tensorboard

tensorboard --logdir gs://aihub-content-test/resnet_example

Publicly shared tensorboard

tensorboard dev upload --logdir gs://aihub-content-test/resnet_example

Datasets

Data reading snippet

import os
import tensorflow as tf


def _parse_example(example_proto):
  # describing the features.  
  features = {
      'image/height': tf.FixedLenFeature([], tf.int64),
      'image/width': tf.FixedLenFeature([], tf.int64),
      'image/channels': tf.FixedLenFeature([], tf.int64),
      'image/class/label': tf.FixedLenFeature([], tf.int64),
      'image/encoded': tf.FixedLenFeature([], tf.string),
  }
  row = tf.parse_single_example(example_proto, features)
  row['image/encoded'] = tf.io.decode_image(row['image/encoded'])
  return row

file_pattern = 'gs://cloud-training-demos/tpu/resnet/data/train*'
batch_size = 10

dataset = tf.data.TFRecordDataset(tf.io.gfile.glob(file_pattern))
dataset = dataset.map(_parse_example)

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
  row = sess.run(next_element)

Training dataset sample

image/channels image/class/label image/encoded image/height image/width
0 3 5 b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x00H\x00H\x00\x00\xff\xfe\x00\x1ccmp3.10.3.2Lq3 0x219a22fe\x00\xff\xdb\x00C\x00\x04\x03\x03\x04\x03\x03\x04\x04\x03\x04\x04\x04\x04\x05\x06\n\x07\x06... 333 500
1 3 3 b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xed\x00\xf8Photoshop 3.0\x008BIM\x04\x04\x00\x00\x00\x00\x00\xae\x1c\x01\x00\x00\x02\x00\x04\x1c\x02\x00\x00\x02\x00\x04\... 240 240
... ... ... ... ... ...
98 3 3 b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xe2\x0cXICC_PROFILE\x00\x01\x01\x00\x00\x0cHLino\x02\x10\x00\x00mntrRGB XYZ \x07\xce\x00\x02\x00\t\x00\x06\x001\x00\x00ac... 240 159
99 3 1 b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x00H\x00H\x00\x00\xff\xe2\x0cXICC_PROFILE\x00\x01\x01\x00\x00\x0cHLino\x02\x10\x00\x00mntrRGB XYZ \x07\xce\x00\x02\x00\t\x00\x06\x001\x00\x00acspMSFT... 333 500

100 rows × 5 columns

Validation dataset sample

image/channels image/class/label image/encoded image/height image/width
0 3 2 b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xdb\x00C\x00\x03\x02\x02\x03\x02\x02\x03\x03\x03\x03\x04\x03\x03\x04\x05\x08\x05\x05\x04\x04\x05\n\x07\x07\x06\x08\x0c\n\... 213 320
1 3 5 b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xe2\x028ICC_PROFILE\x00\x01\x01\x00\x00\x02(ADBE\x02\x10\x00\x00mntrRGB XYZ \x07\xd0\x00\x08\x00\x0b\x00\x13\x004\x00;acs... 221 240
... ... ... ... ... ...
98 3 1 b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xfe\x01\x89Copyright (C) 2006 by Gilles Gonthier.\r\rSOME rights reserved.\r\rThis work is licensed under the\rCreative C... 240 320
99 3 5 b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x00H\x00H\x00\x00\xff\xe2\x0cXICC_PROFILE\x00\x01\x01\x00\x00\x0cHLino\x02\x10\x00\x00mntrRGB XYZ \x07\xce\x00\x02\x00\t\x00\x06\x001\x00\x00acspMSFT... 332 500

100 rows × 5 columns

Predictions

Local predictions snippet

import tensorflow as tf

saved_model = 'gs://aihub-content-test/resnet_example/export/1580771946'
predict_fn = tf.contrib.predictor.from_saved_model(saved_model)
predictions = predict_fn({'input': data['image/encoded']})

Deploy for serving snippet

MODEL_NAME='REPLACE_WITH_YOUR_MODEL_NAME'
MODEL_VERSION='v1'

# create model name
gcloud ai-platform models create $MODEL_NAME

# create version name
gcloud ai-platform versions create $MODEL_VERSION \
  --model $MODEL_NAME \
  --origin gs://aihub-content-test/resnet_example/export/1580771946 \
  --runtime-version=1.15 \
  --framework=tensorflow \
  --python-version=3.7

Predictions

Training predictions sample

probabilities classes
1 2 3 4 5
0 7.5418e-09 2.2828e-08 5.0617e-09 3.6036e-11 1.0000e+00 5
1 4.5365e-10 7.1170e-13 1.0000e+00 1.5757e-09 4.2130e-06 3
... ... ... ... ... ... ...
98 5.7285e-10 3.5125e-10 9.9992e-01 1.0857e-05 6.9853e-05 3
99 1.0000e+00 4.5675e-09 8.3454e-17 1.8794e-13 5.0974e-14 1

100 rows × 6 columns

Validation predictions sample

probabilities classes
1 2 3 4 5
0 7.1049e-04 9.9148e-01 7.2378e-03 5.4523e-04 2.4940e-05 2
1 6.0323e-08 6.0859e-10 1.9088e-03 1.5295e-06 9.9809e-01 5
... ... ... ... ... ... ...
98 1.0000e+00 1.2027e-09 8.1750e-10 5.6300e-09 4.7444e-10 1
99 5.4199e-07 1.1658e-08 3.8212e-04 6.8198e-08 9.9962e-01 5

100 rows × 6 columns

Metrics

Training dataset

Confusion matrix

Count
Predicted
1 2 3 4 5
Actual 1 20 0 0 0 0
2 0 19 0 0 0
3 0 0 18 0 0
4 0 0 0 16 0
5 0 0 0 0 27
Relative
Predicted
1 2 3 4 5
Actual 1 1 0 0 0 0
2 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 1 0
5 0 0 0 0 1
Aggregated metrics
accuracy f1-score precision recall
weighted value 1.0 1.0 1.0 1.0

Classification metrics

Per class metrics
precision recall f1-score support
Label 1 1.0 1.0 1.0 20
2 1.0 1.0 1.0 19
3 1.0 1.0 1.0 18
4 1.0 1.0 1.0 16
5 1.0 1.0 1.0 27

Validation dataset

Confusion matrix

Count
Predicted
1 2 3 4 5
Actual 1 17 1 1 0 0
2 1 26 0 0 1
3 0 1 16 0 0
4 0 0 0 14 0
5 0 1 2 0 19
Relative
Predicted
1 2 3 4 5
Actual 1 0.9 0.05 0.05 0 0
2 0.04 0.9 0 0 0.04
3 0 0.06 0.9 0 0
4 0 0 0 1 0
5 0 0.05 0.09 0 0.9
Aggregated metrics
accuracy f1-score precision recall
weighted value 0.92 0.9202 0.9226 0.92

Classification metrics

Per class metrics
precision recall f1-score support
Label 1 0.9444 0.8947 0.9189 19
2 0.8966 0.9286 0.9123 28
3 0.8421 0.9412 0.8889 17
4 1.0000 1.0000 1.0000 14
5 0.9500 0.8636 0.9048 22

ROC Curve

Training ROC Curve

1 2 3 4 5 Mean AUC
Area Under Curve 1.0 1.0 1.0 1.0 1.0 1.0

Validation ROC Curve

1 2 3 4 5 Mean AUC
Area Under Curve 0.9948 0.997 0.9922 1.0 0.9977 0.9963

Prediction tables

Training data and prediction

Best predictions

Worst predictions

Validation data and prediction

Best predictions

Worst predictions

Deployment parameters


In [ ]:
#@markdown ---
model = 'resnet' #@param {type:"string"}
version = 'v1' #@param {type:"string"}
#@markdown ---

In [ ]:
# the exact location of the model is in model_uri.txt
with tf.io.gfile.GFile(os.path.join(output_location, 'model_uri.txt')) as f:
  model_uri = f.read()

# create a model
! gcloud ai-platform models create $model --regions $REGION

# create a version
! gcloud ai-platform versions create $version \
  --model $model \
  --runtime-version 1.15 \
  --origin $model_uri \
  --project $PROJECT_ID

Get one image for test prediction

Download and encode one image


In [18]:
!wget --output-document /tmp/image.jpeg \
  https://fyf.tac-cdn.net/images/products/large/F-395.jpg

# read the image, decode, resize and base 64 encode it
with tf.Session() as sess:
  img = tf.io.read_file('/tmp/image.jpeg')
  img = tf.image.decode_jpeg(img, channels=3)
  img = tf.image.resize(img, [192, 192])
  img = tf.image.convert_image_dtype(img, tf.uint8)
  img = tf.image.encode_jpeg(img)
  encoded_image = sess.run(img)

encoded_image = base64.b64encode(encoded_image).decode()
encoded_image[:200]


--2020-02-14 10:39:28--  https://fyf.tac-cdn.net/images/products/large/F-395.jpg
Resolving fyf.tac-cdn.net (fyf.tac-cdn.net)... 151.101.41.177
Connecting to fyf.tac-cdn.net (fyf.tac-cdn.net)|151.101.41.177|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26384 (26K) [image/jpeg]
Saving to: ‘/tmp/image.jpeg’

/tmp/image.jpeg     100%[===================>]  25.77K  --.-KB/s    in 0.03s   

2020-02-14 10:39:28 (755 KB/s) - ‘/tmp/image.jpeg’ saved [26384/26384]

Out[18]:
'/9j/4AAQSkZJRgABAQEBLAEsAAD/2wBDAAIBAQEBAQIBAQECAgICAgQDAgICAgUEBAMEBgUGBgYFBgYGBwkIBgcJBwYGCAsICQoKCgoKBggLDAsKDAkKCgr/2wBDAQICAgICAgUDAwUKBwYHCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoK'

Make online prediction

Notes that the image has to be resized before seding for inferences.


In [20]:
# make a REST call for online inference
service = discovery.build('ml', 'v1')
name = 'projects/{project}/models/{model}/versions/{version}'.format(project=PROJECT_ID,
                                                                    model=model,
                                                                    version=version)
body = {'instances': {'input': {'b64': encoded_image}}}

response = service.projects().predict(name=name, body=body).execute()
if 'error' in response:
    raise RuntimeError(response['error'])

print('predicted probabilities: {}'.format(response['predictions'][0]['probabilities']))
print('predicted class: {}'.format(response['predictions'][0]['classes']+1))


/Users/evo/Library/Python/3.7/lib/python/site-packages/google/auth/_default.py:66: UserWarning: Your application has authenticated using end user credentials from Google Cloud SDK. We recommend that most server applications use service accounts instead. If your application continues to use end user credentials from Cloud SDK, you might receive a "quota exceeded" or "API not enabled" error. For more information about service accounts, see https://cloud.google.com/docs/authentication/
  warnings.warn(_CLOUD_SDK_CREDENTIALS_WARNING)
predicted probabilities: [2.028111359408946e-11, 2.5678767472787336e-24, 0.005123268347233534, 0.006565507967025042, 0.988311231136322]
predicted class: 5