Fashion MNIST Image Classification with TensorFlow on Cloud ML Engine

This notebook demonstrates how to train a deep neural network model for image classification and deploy it as an Application Programming Interface (API), i.e. a web service, for online predictions.


In [ ]:
import os
PROJECT = 'my-project-id' # REPLACE WITH YOUR PROJECT ID
BUCKET = 'my-bucket-name' # REPLACE WITH YOUR BUCKET NAME
REGION = 'us-central1' # REPLACE WITH YOUR BUCKET REGION e.g. us-central1
MODEL_TYPE='dnn'  #  'dnn' or 'cnn'

# do not change these
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['MODEL_TYPE'] = MODEL_TYPE
os.environ['TFVERSION'] = '1.8'  # TensorFlow version

In [ ]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Run as a Python module

Since we want to run our code on Cloud ML Engine, we've packaged it as a Python module.

The model.py and task.py files containing the model code are in fashionmodel/trainer
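
For reference, here is a minimal sketch of what the 'dnn' branch of the model code in model.py might look like. The layer sizes and function name are illustrative assumptions, not the exact contents of fashionmodel/trainer; the training commands below use the real module.

import tensorflow as tf

HEIGHT, WIDTH, NCLASSES = 28, 28, 10

def dnn_model(img):
  # Flatten the 28x28 image, then stack fully connected layers.
  # Layer sizes here are placeholders for illustration.
  x = tf.reshape(img, [-1, HEIGHT * WIDTH])
  h1 = tf.layers.dense(x, 300, activation=tf.nn.relu)
  h2 = tf.layers.dense(h1, 100, activation=tf.nn.relu)
  return tf.layers.dense(h2, NCLASSES)  # logits, one per clothing class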


In [ ]:
%%bash
rm -rf fashionmodel.tar.gz fashion_trained
gcloud ml-engine local train \
   --module-name=trainer.task \
   --package-path=${PWD}/fashionmodel/trainer \
   -- \
   --output_dir=${PWD}/fashion_trained \
   --train_steps=1000 \
   --learning_rate=0.01 \
   --train_batch_size=512 \
   --model=$MODEL_TYPE

Local training runs outside of the Jupyter runtime, as an independent process of the node's operating system, and should finish in just a few seconds. You can explore the training metrics using TensorBoard.


In [ ]:
from google.datalab.ml import TensorBoard
TensorBoard().start('fashion_trained')

In [ ]:
for pid in TensorBoard.list()['pid']:
  TensorBoard().stop(pid)
  print('Stopped TensorBoard with pid {}'.format(pid))

Next, let's use Cloud ML Engine so we can train on a GPU (--scale-tier=BASIC_GPU) and then deploy the model as a web service API.

Note that the GPU speed-up depends on the model type. You'll notice that more complex models train substantially faster on GPUs. When you are working with simple models that take just seconds to minutes to train on a single node, keep in mind that Cloud ML Engine adds a few minutes of overhead for training job setup and teardown.


In [ ]:
%%bash
OUTDIR=gs://${BUCKET}/fashion/trained_${MODEL_TYPE}
JOBNAME=fashion_${MODEL_TYPE}_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=${PWD}/fashionmodel/trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=BASIC_GPU \
   --runtime-version=$TFVERSION \
   -- \
   --output_dir=$OUTDIR \
   --train_steps=1000 --learning_rate=0.01 --train_batch_size=512 \
   --model=$MODEL_TYPE
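
While the job runs, you can check its state without leaving the notebook. Below is a sketch that queries the Cloud ML Engine REST API through the Google API client library (available in Datalab); the job name is a placeholder, so substitute the JOBNAME value echoed by the cell above.

In [ ]:
from googleapiclient import discovery

# Build a client for the Cloud ML Engine (ml/v1) REST API.
ml = discovery.build('ml', 'v1')

# Placeholder: replace with the JOBNAME echoed by the training cell above.
job_id = 'fashion_{}_YYMMDD_HHMMSS'.format(MODEL_TYPE)
job = ml.projects().jobs().get(
    name='projects/{}/jobs/{}'.format(PROJECT, job_id)).execute()
print(job['state'])  # e.g. QUEUED, RUNNING, or SUCCEEDED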

Monitoring training with TensorBoard

Use this cell to launch TensorBoard.


In [ ]:
from google.datalab.ml import TensorBoard
TensorBoard().start('gs://{}/fashion/trained_{}'.format(BUCKET, MODEL_TYPE))

In [ ]:
for pid in TensorBoard.list()['pid']:
  TensorBoard().stop(pid)
  print('Stopped TensorBoard with pid {}'.format(pid))

Deploying and predicting with model

Deploy the model:


In [ ]:
%%bash
MODEL_NAME="fashion"
MODEL_VERSION=${MODEL_TYPE}
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/fashion/trained_${MODEL_TYPE}/export/exporter | tail -1)
echo "Deleting and deploying $MODEL_NAME $MODEL_VERSION from $MODEL_LOCATION ... this will take a few minutes"
#gcloud ml-engine versions delete ${MODEL_VERSION} --model ${MODEL_NAME}
#gcloud ml-engine models delete ${MODEL_NAME}
gcloud ml-engine models create ${MODEL_NAME} --regions $REGION
gcloud ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION} --runtime-version=$TFVERSION

Deploying the model can take a few minutes. If it succeeds, you should see output similar to this:

Created ml engine model [projects/qwiklabs-gcp-27eb45524d98e9a5/models/fashion].
Creating version (this might take a few minutes)......
...................................................................................................................done.

Next, download a local copy of the Fashion MNIST dataset to use with Cloud ML Engine for predictions.


In [ ]:
import tensorflow as tf
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.fashion_mnist.load_data()
LABELS = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
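
As a quick sanity check, confirm the shape of the downloaded arrays: each image is a 28x28 array of grayscale pixel values and each label is an integer class id between 0 and 9.

In [ ]:
print(train_images.shape)        # (60000, 28, 28)
print(test_images.shape)         # (10000, 28, 28)
print(test_labels[:10])          # integer class ids between 0 and 9
print(LABELS[test_labels[0]])    # human-readable name of the first test label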

To predict with the model, save one of the test images as a JavaScript Object Notation (JSON) file. Also, plot the image and note the expected class label shown in the title.


In [ ]:
HEIGHT=28
WIDTH=28

IMGNO=12 #CHANGE THIS to get different images

#Convert raw image data to a test.json file and persist it to disk
import json, codecs
jsondata = {'image': test_images[IMGNO].reshape(HEIGHT, WIDTH).tolist()}
json.dump(jsondata, codecs.open('test.json', 'w', encoding='utf-8'))

#Take a look at a sample image and the correct label from the test dataset
import matplotlib.pyplot as plt
plt.imshow(test_images[IMGNO].reshape(HEIGHT, WIDTH))
title = plt.title('{} / Class #{}'.format(LABELS[test_labels[IMGNO]], test_labels[IMGNO]))

Here's how the same image looks when it is saved in the test.json file for use with the prediction API.


In [ ]:
%%bash
cat test.json

Send the file to the prediction service and check whether the model you trained returns the correct prediction.


In [ ]:
%%bash
gcloud ml-engine predict \
   --model=fashion \
   --version=${MODEL_TYPE} \
   --json-instances=./test.json

Here is what my prediction service returned based on a model that I trained with Cloud ML Engine. Notice that the model predicts a probability of roughly 0.87 for the sneaker (class #7).

CLASSES  PROBABILITIES
7        [2.627301398661075e-08, 1.843199037843135e-09, 6.106044111220399e-06, 5.581538289334276e-07, 9.015485602503759e-08, 0.09837665408849716, 3.8953788816797896e-07, 0.8761420249938965, 0.0034074552822858095, 0.02206665836274624]
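
The gcloud command above is handy for testing, but an application would normally call the deployed version through the Cloud ML Engine online prediction REST API. Here is a minimal sketch using the Google API client library; it sends the same instance that was written to test.json.

In [ ]:
from googleapiclient import discovery

# Build a client for the Cloud ML Engine (ml/v1) REST API and address the deployed version.
ml = discovery.build('ml', 'v1')
name = 'projects/{}/models/{}/versions/{}'.format(PROJECT, 'fashion', MODEL_TYPE)

# Send the same 28x28 image that was saved to test.json.
instance = {'image': test_images[IMGNO].reshape(HEIGHT, WIDTH).tolist()}
response = ml.projects().predict(name=name, body={'instances': [instance]}).execute()
print(response['predictions'][0])  # classes and per-class probabilities
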
# Copyright 2017 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
