MNIST Image Classification with TensorFlow on Cloud ML Engine

This notebook demonstrates how to implement different image models on MNIST using Estimator.

Note the MODEL_TYPE setting below; change it to try out different models.

In [6]:
import os
PROJECT = "cloud-training-demos" # REPLACE WITH YOUR PROJECT ID
BUCKET = "cloud-training-demos-ml" # REPLACE WITH YOUR BUCKET NAME
REGION = "us-central1" # REPLACE WITH YOUR BUCKET REGION e.g. us-central1
MODEL_TYPE = "dnn"  # "linear", "dnn", "dnn_dropout", or "cnn"

# Do not change these
os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET
os.environ["REGION"] = REGION
os.environ["MODEL_TYPE"] = MODEL_TYPE
os.environ["TFVERSION"] = "1.13"  # TensorFlow version

In [ ]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Run as a Python module

In the previous notebook (mnist_linear.ipynb) we ran our code directly from the notebook.

Because we now want to run the code on Cloud ML Engine, we've packaged it as a Python module.

The package containing the model code is in mnistmodel/trainer; its entry point is trainer.task.

Let's first run it locally for a few steps to test the code.

In [ ]:
%%bash
rm -rf mnistmodel.tar.gz mnist_trained
gcloud ml-engine local train \
    --module-name=trainer.task \
    --package-path=${PWD}/mnistmodel/trainer \
    -- \
    --output_dir=${PWD}/mnist_trained \
    --train_steps=100 \
    --learning_rate=0.01 \
    --model=$MODEL_TYPE
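
Everything after the bare `--` is handed to the trainer module rather than interpreted by gcloud. The actual trainer/task.py is not shown in this notebook; as a rough sketch, assuming it uses argparse, an entry point that accepts the flags used in this notebook might look like the following. The `parse_args` helper and all default values are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the trainer flags that this notebook passes after the bare `--`."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--output_dir", required=True,
                        help="Local or GCS path for checkpoints and exports")
    parser.add_argument("--train_steps", type=int, default=100)       # assumed default
    parser.add_argument("--learning_rate", type=float, default=0.01)  # assumed default
    parser.add_argument("--train_batch_size", type=int, default=512)  # assumed default
    parser.add_argument("--model", default="linear",
                        choices=["linear", "dnn", "dnn_dropout", "cnn"])
    parser.add_argument("--batch_norm", action="store_true")
    # gcloud forwards --job-dir to the trainer when it is given on submit,
    # so the parser has to accept it even if the code does not use it.
    parser.add_argument("--job-dir", default=None)
    return vars(parser.parse_args(argv))


# Same flags as the local training cell, with MODEL_TYPE expanded to "dnn".
hparams = parse_args([
    "--output_dir=mnist_trained",
    "--train_steps=100",
    "--learning_rate=0.01",
    "--model=dnn",
])
print(hparams["model"], hparams["train_steps"], hparams["batch_norm"])
```

Returning a plain dict keeps the parsed flags easy to pass on to whatever function builds the Estimator.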

Now let's run it on Cloud ML Engine so that we can train on a GPU (--scale-tier=BASIC_GPU).

Note that the GPU speedup depends on the model type. The more complex CNN model trains significantly faster on a GPU, whereas the speedup on the simpler models is not as pronounced.

In [ ]:
%%bash
# Output location; the TensorBoard and deploy cells below read from this path
OUTDIR=gs://${BUCKET}/mnist/trained_${MODEL_TYPE}
JOBNAME=mnist_${MODEL_TYPE}_$(date -u +%y%m%d_%H%M%S)
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
    --region=$REGION \
    --module-name=trainer.task \
    --package-path=${PWD}/mnistmodel/trainer \
    --job-dir=$OUTDIR \
    --staging-bucket=gs://$BUCKET \
    --scale-tier=BASIC_GPU \
    --runtime-version=$TFVERSION \
    -- \
    --output_dir=$OUTDIR \
    --train_steps=10000 --learning_rate=0.01 --train_batch_size=512 \
    --model=$MODEL_TYPE --batch_norm

Monitoring training with TensorBoard

Use this cell to launch TensorBoard:

In [ ]:
from google.datalab.ml import TensorBoard
TensorBoard().start("gs://{}/mnist/trained_{}".format(BUCKET, MODEL_TYPE))

In [ ]:
# Stop every running TensorBoard instance
for pid in TensorBoard.list()["pid"]:
    TensorBoard().stop(pid)
    print("Stopped TensorBoard with pid {}".format(pid))

Here's what it looks like with a linear model for 10,000 steps:

Here are my results:

| Model | Accuracy (%) | Time taken | Model description | Run time parameters |
| --- | --- | --- | --- | --- |
| linear | 91.53 | 3 min | | 100 steps, LR=0.01, Batch=512 |
| linear | 92.73 | 8 min | | 1000 steps, LR=0.01, Batch=512 |
| linear | 92.29 | 18 min | | 10000 steps, LR=0.01, Batch=512 |
| dnn | 98.14 | 15 min | 300-100-30 nodes fully connected | 10000 steps, LR=0.01, Batch=512 |
| dnn | 97.99 | 48 min | 300-100-30 nodes fully connected | 100000 steps, LR=0.01, Batch=512 |
| dnn_dropout | 97.84 | 29 min | 300-100-30-DL(0.1)- nodes | 20000 steps, LR=0.01, Batch=512 |
| cnn | 98.97 | 35 min | maxpool(10 5x5 cnn, 2)-maxpool(20 5x5 cnn, 2)-300-DL(0.25) | 20000 steps, LR=0.01, Batch=512 |
| cnn | 98.93 | 35 min | maxpool(10 11x11 cnn, 2)-maxpool(20 3x3 cnn, 2)-300-DL(0.25) | 20000 steps, LR=0.01, Batch=512 |
| cnn | 99.17 | 35 min | maxpool(10 11x11 cnn, 2)-maxpool(20 3x3 cnn, 2)-300-DL(0.25), batch_norm (logits only) | 20000 steps, LR=0.01, Batch=512 |
| cnn | 99.27 | 35 min | maxpool(10 11x11 cnn, 2)-maxpool(20 3x3 cnn, 2)-300-DL(0.25), batch_norm (logits, deep) | 10000 steps, LR=0.01, Batch=512 |
| cnn | 99.48 | 12 hr | as-above but nfil1=20, nfil2=27, dprob=0.1, lr=0.001, batchsize=233 | (hyperparameter optimization) |

Deploying and predicting with model

Deploy the model:

In [ ]:
%%bash
MODEL_NAME="mnist"
MODEL_VERSION=${MODEL_TYPE}
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/mnist/trained_${MODEL_TYPE}/export/exporter | tail -1)
echo "Deleting and deploying $MODEL_NAME $MODEL_VERSION from $MODEL_LOCATION ... this will take a few minutes"
#gcloud ml-engine versions delete ${MODEL_VERSION} --model ${MODEL_NAME}
#gcloud ml-engine models delete ${MODEL_NAME}
gcloud ml-engine models create ${MODEL_NAME} --regions $REGION
gcloud ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION} --runtime-version=$TFVERSION
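
The `tail -1` in the cell above picks the last entry of the listing. That is the newest export, assuming the exporter writes each SavedModel into a subdirectory named with a numeric timestamp and the listing comes back in sorted order. A minimal Python sketch of the same selection, using made-up bucket paths and a hypothetical `latest_export` helper:

```python
def latest_export(paths):
    """Return the export directory whose trailing numeric timestamp is largest."""
    def dirname(path):
        # "gs://bucket/.../exporter/1550000000/" -> "1550000000"
        return path.rstrip("/").rsplit("/", 1)[-1]

    # Ignore any entry that does not end in a numeric directory name.
    candidates = [p for p in paths if dirname(p).isdigit()]
    return max(candidates, key=lambda p: int(dirname(p)))


# Hypothetical listing; real paths depend on your bucket and model type.
listing = [
    "gs://my-bucket/mnist/trained_dnn/export/exporter/1550000000/",
    "gs://my-bucket/mnist/trained_dnn/export/exporter/1550003600/",
]
print(latest_export(listing))
```

Comparing the timestamps as integers avoids relying on the order of the listing.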

To predict with the model, let's take one of the example images.

In [ ]:
import json, codecs
import matplotlib.pyplot as plt
import tensorflow as tf

HEIGHT = 28
WIDTH = 28

# Get mnist data
mnist = tf.keras.datasets.mnist

(_, _), (x_test, _) = mnist.load_data()

# Scale our features between 0 and 1
x_test = x_test / 255.0

IMGNO = 5 # CHANGE THIS to get different images
jsondata = {"image": x_test[IMGNO].reshape(HEIGHT, WIDTH).tolist()}
# Write the instance as a single line of JSON for the prediction service
with codecs.open("test.json", "w", encoding="utf-8") as f:
    json.dump(jsondata, f)
plt.imshow(x_test[IMGNO].reshape(HEIGHT, WIDTH));
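
The instances file read by `gcloud ml-engine predict` is newline-delimited JSON, so several images can go into one request by writing one JSON object per line. A small sketch of that, assuming the model's serving input expects the key "image" as in the cell above; `write_instances` and the tiny 2x2 stand-in images are illustrative only.

```python
import json
import os
import tempfile


def write_instances(images, path):
    """Write one JSON object per line, one line per image."""
    with open(path, "w") as f:
        for image in images:
            f.write(json.dumps({"image": image}) + "\n")


# Two tiny stand-in "images" as nested lists; real MNIST images are 28x28.
images = [[[0.0, 0.5], [1.0, 0.0]], [[1.0, 1.0], [0.0, 0.0]]]
path = os.path.join(tempfile.mkdtemp(), "test_batch.json")
write_instances(images, path)

# Read the file back to confirm there is exactly one instance per line.
with open(path) as f:
    lines = f.read().splitlines()
print(len(lines))
print(json.loads(lines[0])["image"])
```

With real data, each element of `images` would be something like `x_test[i].tolist()`.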

Send it to the prediction service:

In [ ]:
%%bash
gcloud ml-engine predict \
    --model=mnist \
    --version=${MODEL_TYPE} \
    --json-instances=./test.json

DO NOT RUN anything beyond this point

This shows you what I did, but trying to repeat this will take several hours.

Hyperparameter tuning

This is what hyperparam.yaml looked like:

In [ ]:
trainingInput:
    scaleTier: CUSTOM
    masterType: complex_model_m_gpu
    hyperparameters:
        goal: MAXIMIZE
        maxTrials: 30
        maxParallelTrials: 2
        hyperparameterMetricTag: accuracy
        params:
        - parameterName: train_batch_size
            type: INTEGER
            minValue: 32
            maxValue: 512
            scaleType: UNIT_LINEAR_SCALE
        - parameterName: learning_rate
            type: DOUBLE
            minValue: 0.001
            maxValue: 0.1
            scaleType: UNIT_LOG_SCALE
        - parameterName: nfil1
            type: INTEGER
            minValue: 5
            maxValue: 20
            scaleType: UNIT_LINEAR_SCALE
        - parameterName: nfil2
            type: INTEGER
            minValue: 10
            maxValue: 30
            scaleType: UNIT_LINEAR_SCALE
        - parameterName: dprob
            type: DOUBLE
            minValue: 0.1
            maxValue: 0.6
            scaleType: UNIT_LINEAR_SCALE
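
The scaleType decides how a parameter's range is stretched onto the unit interval before the tuner searches it. As a rough illustration (this is not the tuning service's search algorithm, only the mapping that the two scale types imply), the hypothetical `from_unit` helper below maps a position in [0, 1] back onto a range either linearly or logarithmically:

```python
import math


def from_unit(u, min_value, max_value, scale_type):
    """Map u in [0, 1] onto [min_value, max_value] using the given scale type."""
    if scale_type == "UNIT_LINEAR_SCALE":
        return min_value + u * (max_value - min_value)
    if scale_type == "UNIT_LOG_SCALE":
        # Interpolate in log space; this requires strictly positive bounds.
        log_min, log_max = math.log(min_value), math.log(max_value)
        return math.exp(log_min + u * (log_max - log_min))
    raise ValueError("unsupported scale type: {}".format(scale_type))


# Midpoints of two of the ranges in hyperparam.yaml.
batch_mid = from_unit(0.5, 32, 512, "UNIT_LINEAR_SCALE")
lr_mid = from_unit(0.5, 0.001, 0.1, "UNIT_LOG_SCALE")
print(batch_mid, round(lr_mid, 6))
```

On the log scale the midpoint of 0.001 to 0.1 is 0.01, so each decade of learning rates gets equal weight; on a linear scale the midpoint would be about 0.05 and the small values would be crowded into a narrow slice of the range.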

This takes 13 hours and 250 ML Units, so don't try this at home :)

The key thing here is the --config parameter.

In [ ]:
%%bash
# Assumed path: a separate directory so the tuning trials do not overwrite the trained model
OUTDIR=gs://${BUCKET}/mnist/trained_${MODEL_TYPE}_hparam
JOBNAME=mnist_${MODEL_TYPE}_$(date -u +%y%m%d_%H%M%S)
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=${PWD}/mnistmodel/trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --runtime-version=$TFVERSION \
   --config hyperparam.yaml \
   -- \
   --output_dir=$OUTDIR \
   --model=$MODEL_TYPE --batch_norm
