In [6]:
import os
PROJECT = "cloud-training-demos" # REPLACE WITH YOUR PROJECT ID
BUCKET = "cloud-training-demos-ml" # REPLACE WITH YOUR BUCKET NAME
REGION = "us-central1" # REPLACE WITH YOUR BUCKET REGION e.g. us-central1
MODEL_TYPE = "dnn" # "linear", "dnn", "dnn_dropout", or "cnn"
# Do not change these
os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET
os.environ["REGION"] = REGION
os.environ["MODEL_TYPE"] = MODEL_TYPE
os.environ["TFVERSION"] = "1.13" # Tensorflow version
In [ ]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION
In the previous notebook (mnist_linear.ipynb) we ran our code directly from the notebook.
Now, since we want to run our code on Cloud ML Engine, we've packaged it as a Python module: the model code lives in model.py and task.py inside mnistmodel/trainer.
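For reference, task.py is the entry point: it parses command-line flags and hands them to the model code. A minimal sketch of that pattern follows; the actual file may differ, and model.train_and_evaluate is a hypothetical name:

import argparse
from . import model  # model.py holds the actual model-building code

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # These flags mirror what we pass after the bare `--` separator below
    parser.add_argument("--output_dir", required=True)
    parser.add_argument("--train_steps", type=int, default=100)
    parser.add_argument("--learning_rate", type=float, default=0.01)
    parser.add_argument("--train_batch_size", type=int, default=512)
    parser.add_argument("--model", default="linear")
    parser.add_argument("--batch_norm", action="store_true")
    args = parser.parse_args()
    model.train_and_evaluate(vars(args))  # hypothetical helper in model.py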
Let's first run it locally for a few steps to test the code.
In [ ]:
%%bash
rm -rf mnistmodel.tar.gz mnist_trained
gcloud ml-engine local train \
--module-name=trainer.task \
--package-path=${PWD}/mnistmodel/trainer \
-- \
--output_dir=${PWD}/mnist_trained \
--train_steps=100 \
--learning_rate=0.01 \
--model=$MODEL_TYPE
Now let's run the same code on Cloud ML Engine so we can train on a GPU (--scale-tier=BASIC_GPU).
Note that the GPU speedup depends on the model type: the more complex CNN model trains significantly faster on a GPU, while the speedup on the simpler models is less pronounced.
In [ ]:
%%bash
OUTDIR=gs://${BUCKET}/mnist/trained_${MODEL_TYPE}
JOBNAME=mnist_${MODEL_TYPE}_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
--region=$REGION \
--module-name=trainer.task \
--package-path=${PWD}/mnistmodel/trainer \
--job-dir=$OUTDIR \
--staging-bucket=gs://$BUCKET \
--scale-tier=BASIC_GPU \
--runtime-version=$TFVERSION \
-- \
--output_dir=$OUTDIR \
--train_steps=10000 --learning_rate=0.01 --train_batch_size=512 \
--model=$MODEL_TYPE --batch_norm
In [ ]:
from google.datalab.ml import TensorBoard
TensorBoard().start("gs://{}/mnist/trained_{}".format(BUCKET, MODEL_TYPE))
In [ ]:
# Stop all running TensorBoard instances; TensorBoard.list() returns one
# row per instance, including its process id
for pid in TensorBoard.list()["pid"]:
    TensorBoard().stop(pid)
    print("Stopped TensorBoard with pid {}".format(pid))
Here's what it looks like with a linear model for 10,000 steps:
Here are my results:
Model | Accuracy (%) | Time taken | Model description | Run-time parameters |
---|---|---|---|---|
linear | 91.53 | 3 min | | 100 steps, LR=0.01, Batch=512 |
linear | 92.73 | 8 min | | 1000 steps, LR=0.01, Batch=512 |
linear | 92.29 | 18 min | | 10000 steps, LR=0.01, Batch=512 |
dnn | 98.14 | 15 min | 300-100-30 nodes fully connected | 10000 steps, LR=0.01, Batch=512 |
dnn | 97.99 | 48 min | 300-100-30 nodes fully connected | 100000 steps, LR=0.01, Batch=512 |
dnn_dropout | 97.84 | 29 min | 300-100-30-DL(0.1) nodes | 20000 steps, LR=0.01, Batch=512 |
cnn | 98.97 | 35 min | maxpool(10 5x5 cnn, 2)-maxpool(20 5x5 cnn, 2)-300-DL(0.25) | 20000 steps, LR=0.01, Batch=512 |
cnn | 98.93 | 35 min | maxpool(10 11x11 cnn, 2)-maxpool(20 3x3 cnn, 2)-300-DL(0.25) | 20000 steps, LR=0.01, Batch=512 |
cnn | 99.17 | 35 min | maxpool(10 11x11 cnn, 2)-maxpool(20 3x3 cnn, 2)-300-DL(0.25), batch_norm (logits only) | 20000 steps, LR=0.01, Batch=512 |
cnn | 99.27 | 35 min | maxpool(10 11x11 cnn, 2)-maxpool(20 3x3 cnn, 2)-300-DL(0.25), batch_norm (logits, deep) | 10000 steps, LR=0.01, Batch=512 |
cnn | 99.48 | 12 hr | as above, but nfil1=20, nfil2=27, dprob=0.1 | LR=0.001, Batch=233 (hyperparameter optimization) |
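In the table, "maxpool(10 5x5 cnn, 2)" reads as a convolutional layer with 10 5x5 filters followed by 2x2 max pooling, "300" is a 300-node fully connected layer, and "DL(0.25)" is dropout with probability 0.25. As a rough sketch (not the actual mnistmodel code), the first cnn row corresponds to something like this in TF 1.x layers; nfil1, nfil2, and dprob mirror the parameters we tune later:

import tensorflow as tf

def cnn_model(img, mode, nfil1=10, nfil2=20, dprob=0.25):
    # img is [batch, 28, 28, 1]
    c1 = tf.layers.conv2d(img, filters=nfil1, kernel_size=5,
                          padding="same", activation=tf.nn.relu)
    p1 = tf.layers.max_pooling2d(c1, pool_size=2, strides=2)  # now 14x14
    c2 = tf.layers.conv2d(p1, filters=nfil2, kernel_size=5,
                          padding="same", activation=tf.nn.relu)
    p2 = tf.layers.max_pooling2d(c2, pool_size=2, strides=2)  # now 7x7
    flat = tf.reshape(p2, [-1, 7 * 7 * nfil2])
    h = tf.layers.dense(flat, 300, activation=tf.nn.relu)
    h = tf.layers.dropout(h, rate=dprob,
                          training=(mode == tf.estimator.ModeKeys.TRAIN))
    return tf.layers.dense(h, 10)  # logits for the 10 digit classes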
In [ ]:
%%bash
MODEL_NAME="mnist"
MODEL_VERSION=${MODEL_TYPE}
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/mnist/trained_${MODEL_TYPE}/export/exporter | tail -1)
echo "Deleting and deploying $MODEL_NAME $MODEL_VERSION from $MODEL_LOCATION ... this will take a few minutes"
#gcloud ml-engine versions delete ${MODEL_VERSION} --model ${MODEL_NAME}
#gcloud ml-engine models delete ${MODEL_NAME}
gcloud ml-engine models create ${MODEL_NAME} --regions $REGION
gcloud ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION} --runtime-version=$TFVERSION
To predict with the model, let's take one of the example images.
In [ ]:
import json, codecs
import matplotlib.pyplot as plt
import tensorflow as tf
HEIGHT = 28
WIDTH = 28
# Get mnist data
mnist = tf.keras.datasets.mnist
(_, _), (x_test, _) = mnist.load_data()
# Scale our features between 0 and 1
x_test = x_test / 255.0
IMGNO = 5 # CHANGE THIS to get different images
jsondata = {"image": x_test[IMGNO].reshape(HEIGHT, WIDTH).tolist()}
json.dump(jsondata, codecs.open("test.json", 'w', encoding = "utf-8"))
plt.imshow(x_test[IMGNO].reshape(HEIGHT, WIDTH));
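Before sending the request, we can sanity-check the file we just wrote: each line of a --json-instances file is one instance, and ours is a single object whose "image" key holds the 28x28 pixel array.
In [ ]:
# Sanity check: one instance with a 28x28 "image" field
with open("test.json") as f:
    instance = json.load(f)
print(list(instance.keys()), len(instance["image"]), len(instance["image"][0]))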
Send it to the prediction service.
In [ ]:
%%bash
gcloud ml-engine predict \
--model=mnist \
--version=${MODEL_TYPE} \
--json-instances=./test.json
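You can also call the online prediction service from Python instead of the gcloud CLI. Here is a hedged sketch using the Google API client library; it assumes googleapiclient is installed and application-default credentials are configured in this environment:
In [ ]:
from googleapiclient import discovery

# Build a client for the Cloud ML Engine v1 REST API and request a
# prediction from the deployed model version
api = discovery.build("ml", "v1")
body = {"instances": [{"image": x_test[IMGNO].reshape(HEIGHT, WIDTH).tolist()}]}
name = "projects/{}/models/{}/versions/{}".format(PROJECT, "mnist", MODEL_TYPE)
response = api.projects().predict(name=name, body=body).execute()
print(response)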
In [ ]:
%%writefile hyperparam.yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m_gpu
  hyperparameters:
    goal: MAXIMIZE
    maxTrials: 30
    maxParallelTrials: 2
    hyperparameterMetricTag: accuracy
    params:
    - parameterName: train_batch_size
      type: INTEGER
      minValue: 32
      maxValue: 512
      scaleType: UNIT_LINEAR_SCALE
    - parameterName: learning_rate
      type: DOUBLE
      minValue: 0.001
      maxValue: 0.1
      scaleType: UNIT_LOG_SCALE
    - parameterName: nfil1
      type: INTEGER
      minValue: 5
      maxValue: 20
      scaleType: UNIT_LINEAR_SCALE
    - parameterName: nfil2
      type: INTEGER
      minValue: 10
      maxValue: 30
      scaleType: UNIT_LINEAR_SCALE
    - parameterName: dprob
      type: DOUBLE
      minValue: 0.1
      maxValue: 0.6
      scaleType: UNIT_LINEAR_SCALE
This takes about 13 hours and 250 ML units, so don't try this at home :)
The key thing here is the --config parameter: it points the training job at the hyperparam.yaml file we wrote above.
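For tuning to work, two things must line up with that YAML: the trainer must accept --train_batch_size, --learning_rate, --nfil1, --nfil2, and --dprob as command-line flags (ML Engine passes each trial's sampled values that way), and it must report a metric named accuracy to match hyperparameterMetricTag. A hedged sketch of the metric side, assuming the model is an Estimator built around the hypothetical cnn_model sketched after the results table (not the actual mnistmodel code):

import tensorflow as tf

def model_fn(features, labels, mode, params):
    img = tf.reshape(features["image"], [-1, 28, 28, 1])
    logits = cnn_model(img, mode, params["nfil1"], params["nfil2"],
                       params["dprob"])  # hypothetical, sketched earlier
    predictions = tf.argmax(logits, axis=1)
    loss, train_op, eval_metric_ops = None, None, None
    if mode != tf.estimator.ModeKeys.PREDICT:
        loss = tf.losses.sparse_softmax_cross_entropy(labels, logits)
        eval_metric_ops = {
            # This "accuracy" key is what hyperparameterMetricTag matches
            "accuracy": tf.metrics.accuracy(labels, predictions)
        }
        if mode == tf.estimator.ModeKeys.TRAIN:
            train_op = tf.train.GradientDescentOptimizer(
                params["learning_rate"]).minimize(
                    loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(
        mode, predictions={"classes": predictions}, loss=loss,
        train_op=train_op, eval_metric_ops=eval_metric_ops)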
In [ ]:
%%bash
OUTDIR=gs://${BUCKET}/mnist/trained_${MODEL_TYPE}_hparam
JOBNAME=mnist_${MODEL_TYPE}_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
--region=$REGION \
--module-name=trainer.task \
--package-path=${PWD}/mnistmodel/trainer \
--job-dir=$OUTDIR \
--staging-bucket=gs://$BUCKET \
--runtime-version=$TFVERSION \
--config hyperparam.yaml \
-- \
--output_dir=$OUTDIR \
--model=$MODEL_TYPE --batch_norm
# Copyright 2017 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.