MNIST Image Classification with TensorFlow on Cloud AI Platform

This notebook demonstrates how to implement different image models on MNIST using the tf.keras API.

Learning Objectives

  1. Understand how to build a Dense Neural Network (DNN) for image classification
  2. Understand how to use dropout (DNN) for image classification
  3. Understand how to use Convolutional Neural Networks (CNN)
  4. Know how to deploy and use an image classification model using Google Cloud's AI Platform

First things first. Configure the parameters below to match your own Google Cloud project details.


In [1]:
from datetime import datetime
import os

PROJECT = "your-project-id-here"  # REPLACE WITH YOUR PROJECT ID
BUCKET = "your-bucket-id-here"  # REPLACE WITH YOUR BUCKET NAME
REGION = "us-central1"  # REPLACE WITH YOUR BUCKET REGION e.g. us-central1
MODEL_TYPE = "cnn"  # "linear", "dnn", "dnn_dropout", or "cnn"

# Do not change these
os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET
os.environ["REGION"] = REGION
os.environ["MODEL_TYPE"] = MODEL_TYPE
os.environ["TFVERSION"] = "2.1"  # TensorFlow version
os.environ["IMAGE_URI"] = os.path.join("gcr.io", PROJECT, "mnist_models")

Building a dynamic model

In the previous notebook, mnist_linear.ipynb, we ran our code directly from the notebook. In order to run it on AI Platform, it needs to be packaged as a Python module.

The boilerplate structure for this module has already been set up in the folder mnist_models. The module lives in the sub-folder trainer and is designated as a Python package by the empty __init__.py (mnist_models/trainer/__init__.py) file. It still needs the model and a trainer to run it, so let's make them.

Let's start with the trainer file first. This file parses command line arguments to feed into the model.


In [2]:
%%writefile mnist_models/trainer/task.py
import argparse
import json
import os
import sys

from . import model


def _parse_arguments(argv):
    """Parses command-line arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--model_type',
        help='Which model type to use',
        type=str, default='linear')
    parser.add_argument(
        '--epochs',
        help='The number of epochs to train',
        type=int, default=10)
    parser.add_argument(
        '--steps_per_epoch',
        help='The number of steps per epoch to train',
        type=int, default=100)
    parser.add_argument(
        '--job-dir',
        help='Directory where to save the given model',
        type=str, default='mnist_models/')
    return parser.parse_known_args(argv)


def main():
    """Parses command line arguments and kicks off model training."""
    args = _parse_arguments(sys.argv[1:])[0]

    # Configure path for hyperparameter tuning.
    trial_id = json.loads(
        os.environ.get('TF_CONFIG', '{}')).get('task', {}).get('trial', '')
    output_path = args.job_dir if not trial_id else args.job_dir + '/' + trial_id

    model_layers = model.get_layers(args.model_type)
    image_model = model.build_model(model_layers, output_path)
    model_history = model.train_and_evaluate(
        image_model, args.epochs, args.steps_per_epoch, output_path)


if __name__ == '__main__':
    main()


Overwriting mnist_models/trainer/task.py
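Note that task.py uses parse_known_args rather than parse_args: AI Platform may pass extra flags the trainer doesn't define, and those should be ignored rather than crash the job. A small self-contained illustration of the difference (the extra flag name here is made up):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--model_type', type=str, default='linear')

# parse_known_args() tolerates flags the parser doesn't know about,
# returning them separately instead of raising an error.
args, unknown = parser.parse_known_args(
    ['--model_type=cnn', '--some_platform_flag=1'])
print(args.model_type)  # cnn
print(unknown)          # ['--some_platform_flag=1']
```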

Next, let's group non-model functions into a util file to keep the model file simple. We'll copy over the scale and load_dataset functions from the previous lab.


In [3]:
%%writefile mnist_models/trainer/util.py
import tensorflow as tf


def scale(image, label):
    """Scales images from a 0-255 int range to a 0-1 float range"""
    image = tf.cast(image, tf.float32)
    image /= 255
    image = tf.expand_dims(image, -1)
    return image, label


def load_dataset(
        data, training=True, buffer_size=5000, batch_size=100, nclasses=10):
    """Loads MNIST dataset into a tf.data.Dataset"""
    (x_train, y_train), (x_test, y_test) = data
    x = x_train if training else x_test
    y = y_train if training else y_test
    # One-hot encode the classes
    y = tf.keras.utils.to_categorical(y, nclasses)
    dataset = tf.data.Dataset.from_tensor_slices((x, y))
    dataset = dataset.map(scale).batch(batch_size)
    if training:
        dataset = dataset.shuffle(buffer_size).repeat()
    return dataset


Overwriting mnist_models/trainer/util.py
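The scale function turns each 28x28 uint8 image into a (28, 28, 1) float tensor with values in [0, 1]. A quick NumPy mirror of what it does (illustrative only, using a synthetic image rather than real MNIST data):

```python
import numpy as np

# Synthetic 28x28 uint8 "image" standing in for an MNIST example.
image = np.arange(784, dtype=np.uint8).reshape(28, 28)

# Same steps as scale(): cast, divide by 255, add a channel dimension.
scaled = np.expand_dims(image.astype(np.float32) / 255, -1)
print(scaled.shape)           # (28, 28, 1)
print(float(scaled.max()))    # 1.0
```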

Finally, let's code the models! The tf.keras API accepts an array of layers into a model object, so we can create a dictionary of layers based on the different model types we want to use. The file below has three functions: get_layers, build_model, and train_and_evaluate. We will build the structure of our model in get_layers and compile it in build_model. Last but not least, we'll copy over the training code from the previous lab into train_and_evaluate.

TODO 1: Define the Keras layers for a DNN model
TODO 2: Define the Keras layers for a dropout model
TODO 3: Define the Keras layers for a CNN model

Hint: These models progressively build on each other. Look at the imported tensorflow.keras.layers modules and the default values for the variables defined in get_layers for guidance.
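As a sanity check on the CNN defaults, you can trace the feature-map shapes by hand (assuming 'valid' padding, the Keras default for Conv2D):

```python
def conv_out(size, kernel):
    # 'valid' convolution shrinks each spatial dimension by kernel - 1
    return size - kernel + 1

def pool_out(size, pool):
    # non-overlapping max pooling divides each dimension by the pool size
    return size // pool

s = 28
s = pool_out(conv_out(s, 3), 2)  # Conv2D(64, 3) -> 26, MaxPooling2D(2) -> 13
s = pool_out(conv_out(s, 3), 2)  # Conv2D(32, 3) -> 11, MaxPooling2D(2) -> 5
print(s * s * 32)                # flattened feature count: 800
```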


In [4]:
%%writefile mnist_models/trainer/model.py
import os
import shutil

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.layers import (
    Conv2D, Dense, Dropout, Flatten, MaxPooling2D, Softmax)

from . import util


# Image Variables
WIDTH = 28
HEIGHT = 28


def get_layers(
        model_type,
        nclasses=10,
        hidden_layer_1_neurons=400,
        hidden_layer_2_neurons=100,
        dropout_rate=0.25,
        num_filters_1=64,
        kernel_size_1=3,
        pooling_size_1=2,
        num_filters_2=32,
        kernel_size_2=3,
        pooling_size_2=2):
    """Constructs layers for a keras model based on a dict of model types."""
    model_layers = {
        'linear': [
            Flatten(),
            Dense(nclasses),
            Softmax()
        ],
        'dnn': [
            Flatten(),
            Dense(hidden_layer_1_neurons, activation='relu'),
            Dense(hidden_layer_2_neurons, activation='relu'),
            Dense(nclasses),
            Softmax()
        ],
        'dnn_dropout': [
            Flatten(),
            Dense(hidden_layer_1_neurons, activation='relu'),
            Dense(hidden_layer_2_neurons, activation='relu'),
            Dropout(dropout_rate),
            Dense(nclasses),
            Softmax()
        ],
        'cnn': [
            Conv2D(num_filters_1, kernel_size=kernel_size_1,
                   activation='relu', input_shape=(WIDTH, HEIGHT, 1)),
            MaxPooling2D(pooling_size_1),
            Conv2D(num_filters_2, kernel_size=kernel_size_2,
                   activation='relu'),
            MaxPooling2D(pooling_size_2),
            Flatten(),
            Dense(hidden_layer_1_neurons, activation='relu'),
            Dense(hidden_layer_2_neurons, activation='relu'),
            Dropout(dropout_rate),
            Dense(nclasses),
            Softmax()
        ]
    }
    return model_layers[model_type]


def build_model(layers, output_dir):
    """Compiles keras model for image classification."""
    model = Sequential(layers)
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model


def train_and_evaluate(model, num_epochs, steps_per_epoch, output_dir):
    """Compiles keras model and loads data into it for training."""
    mnist = tf.keras.datasets.mnist.load_data()
    train_data = util.load_dataset(mnist)
    validation_data = util.load_dataset(mnist, training=False)

    callbacks = []
    if output_dir:
        tensorboard_callback = TensorBoard(log_dir=output_dir)
        callbacks = [tensorboard_callback]

    history = model.fit(
        train_data,
        validation_data=validation_data,
        epochs=num_epochs,
        steps_per_epoch=steps_per_epoch,
        verbose=2,
        callbacks=callbacks)

    if output_dir:
        export_path = os.path.join(output_dir, 'keras_export')
        model.save(export_path, save_format='tf')

    return history


Overwriting mnist_models/trainer/model.py

Local Training

With everything set up, let's run the code locally to test it. Some of the previous tests have been copied over into a testing script, mnist_models/trainer/test.py, to make sure the model still passes our previous checks. On line 13, you can specify which model types you would like to check; lines 14 and 15 set the number of epochs and steps per epoch, respectively.

Moment of truth! Run the code below to check your models against the unit tests. If you see "OK" at the end when it's finished running, congrats! You've passed the tests!


In [11]:
!python3 -m mnist_models.trainer.test


2020-01-16 05:18:44.945076: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-01-16 05:18:48.133198: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-01-16 05:18:48.136960: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-16 05:18:48.137440: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.562
pciBusID: 0000:00:04.0
2020-01-16 05:18:48.137493: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-01-16 05:18:48.139017: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-01-16 05:18:48.140478: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-01-16 05:18:48.140937: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-01-16 05:18:48.143407: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-01-16 05:18:48.145015: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-01-16 05:18:48.149588: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-16 05:18:48.149825: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-16 05:18:48.150331: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-16 05:18:48.150660: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-01-16 05:18:48.159213: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2200000000 Hz
2020-01-16 05:18:48.159592: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55adaacb8770 executing computations on platform Host. Devices:
2020-01-16 05:18:48.159638: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2020-01-16 05:18:48.225613: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-16 05:18:48.226166: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55adaad2b9b0 executing computations on platform CUDA. Devices:
2020-01-16 05:18:48.226209: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2020-01-16 05:18:48.226552: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-16 05:18:48.227025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.562
pciBusID: 0000:00:04.0
2020-01-16 05:18:48.227099: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-01-16 05:18:48.227161: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-01-16 05:18:48.227219: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-01-16 05:18:48.227255: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-01-16 05:18:48.227287: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-01-16 05:18:48.227319: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-01-16 05:18:48.227352: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-16 05:18:48.227442: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-16 05:18:48.227963: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-16 05:18:48.228406: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-01-16 05:18:48.228473: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-01-16 05:18:48.754118: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-01-16 05:18:48.754187: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2020-01-16 05:18:48.754204: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2020-01-16 05:18:48.754741: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-16 05:18:48.755241: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-16 05:18:48.755690: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10285 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
..
*** Building model for linear ***

Train for 100 steps, validate for 100 steps
Epoch 1/10
2020-01-16 05:19:00.979496: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
100/100 - 7s - loss: 1.3553 - accuracy: 0.6579 - val_loss: 0.8047 - val_accuracy: 0.8377
Epoch 2/10
100/100 - 1s - loss: 0.6943 - accuracy: 0.8436 - val_loss: 0.5564 - val_accuracy: 0.8715
Epoch 3/10
100/100 - 1s - loss: 0.5415 - accuracy: 0.8682 - val_loss: 0.4654 - val_accuracy: 0.8898
Epoch 4/10
100/100 - 1s - loss: 0.4609 - accuracy: 0.8803 - val_loss: 0.4194 - val_accuracy: 0.8953
Epoch 5/10
100/100 - 1s - loss: 0.3979 - accuracy: 0.8985 - val_loss: 0.3917 - val_accuracy: 0.9005
Epoch 6/10
100/100 - 2s - loss: 0.4024 - accuracy: 0.8901 - val_loss: 0.3669 - val_accuracy: 0.9056
Epoch 7/10
100/100 - 7s - loss: 0.3874 - accuracy: 0.8961 - val_loss: 0.3541 - val_accuracy: 0.9063
Epoch 8/10
100/100 - 1s - loss: 0.3674 - accuracy: 0.9021 - val_loss: 0.3420 - val_accuracy: 0.9093
Epoch 9/10
100/100 - 1s - loss: 0.3477 - accuracy: 0.9090 - val_loss: 0.3360 - val_accuracy: 0.9112
Epoch 10/10
100/100 - 1s - loss: 0.3263 - accuracy: 0.9106 - val_loss: 0.3241 - val_accuracy: 0.9134

*** Building model for dnn ***

100/100 - 7s - loss: 0.5730 - accuracy: 0.8386 - val_loss: 0.3121 - val_accuracy: 0.9089
Epoch 2/10
100/100 - 1s - loss: 0.2515 - accuracy: 0.9274 - val_loss: 0.2085 - val_accuracy: 0.9395
Epoch 3/10
100/100 - 1s - loss: 0.2000 - accuracy: 0.9351 - val_loss: 0.1880 - val_accuracy: 0.9434
Epoch 4/10
100/100 - 1s - loss: 0.1796 - accuracy: 0.9469 - val_loss: 0.1460 - val_accuracy: 0.9574
Epoch 5/10
100/100 - 1s - loss: 0.1520 - accuracy: 0.9531 - val_loss: 0.1406 - val_accuracy: 0.9557
Epoch 6/10
100/100 - 1s - loss: 0.1351 - accuracy: 0.9581 - val_loss: 0.1177 - val_accuracy: 0.9643
Epoch 7/10
100/100 - 6s - loss: 0.1039 - accuracy: 0.9686 - val_loss: 0.1038 - val_accuracy: 0.9685
Epoch 8/10
100/100 - 1s - loss: 0.0944 - accuracy: 0.9725 - val_loss: 0.1103 - val_accuracy: 0.9650
Epoch 9/10
100/100 - 1s - loss: 0.1110 - accuracy: 0.9663 - val_loss: 0.1052 - val_accuracy: 0.9678
Epoch 10/10
100/100 - 1s - loss: 0.0906 - accuracy: 0.9735 - val_loss: 0.0941 - val_accuracy: 0.9708

*** Building model for dnn_dropout ***

Train for 100 steps, validate for 100 steps
Epoch 1/10
100/100 - 7s - loss: 0.6572 - accuracy: 0.8011 - val_loss: 0.2871 - val_accuracy: 0.9177
Epoch 2/10
100/100 - 1s - loss: 0.3085 - accuracy: 0.9079 - val_loss: 0.2138 - val_accuracy: 0.9338
Epoch 3/10
100/100 - 1s - loss: 0.2339 - accuracy: 0.9317 - val_loss: 0.1914 - val_accuracy: 0.9407
Epoch 4/10
100/100 - 1s - loss: 0.1930 - accuracy: 0.9415 - val_loss: 0.1504 - val_accuracy: 0.9537
Epoch 5/10
100/100 - 1s - loss: 0.1867 - accuracy: 0.9411 - val_loss: 0.1413 - val_accuracy: 0.9566
Epoch 6/10
100/100 - 1s - loss: 0.1594 - accuracy: 0.9540 - val_loss: 0.1248 - val_accuracy: 0.9603
Epoch 7/10
100/100 - 6s - loss: 0.1285 - accuracy: 0.9611 - val_loss: 0.1075 - val_accuracy: 0.9668
Epoch 8/10
100/100 - 1s - loss: 0.1308 - accuracy: 0.9602 - val_loss: 0.1077 - val_accuracy: 0.9671
Epoch 9/10
100/100 - 1s - loss: 0.1215 - accuracy: 0.9643 - val_loss: 0.1063 - val_accuracy: 0.9667
Epoch 10/10
100/100 - 1s - loss: 0.1192 - accuracy: 0.9646 - val_loss: 0.1011 - val_accuracy: 0.9685

*** Building model for cnn ***

Train for 100 steps, validate for 100 steps
Epoch 1/10
100/100 - 9s - loss: 0.6915 - accuracy: 0.7779 - val_loss: 0.1829 - val_accuracy: 0.9478
Epoch 2/10
100/100 - 2s - loss: 0.1862 - accuracy: 0.9447 - val_loss: 0.1235 - val_accuracy: 0.9612
Epoch 3/10
100/100 - 2s - loss: 0.1354 - accuracy: 0.9576 - val_loss: 0.0830 - val_accuracy: 0.9735
Epoch 4/10
100/100 - 2s - loss: 0.1206 - accuracy: 0.9615 - val_loss: 0.0649 - val_accuracy: 0.9812
Epoch 5/10
100/100 - 2s - loss: 0.0916 - accuracy: 0.9718 - val_loss: 0.0534 - val_accuracy: 0.9829
Epoch 6/10
100/100 - 2s - loss: 0.1032 - accuracy: 0.9684 - val_loss: 0.0607 - val_accuracy: 0.9811
Epoch 7/10
100/100 - 7s - loss: 0.0644 - accuracy: 0.9811 - val_loss: 0.0542 - val_accuracy: 0.9835
Epoch 8/10
100/100 - 2s - loss: 0.0692 - accuracy: 0.9791 - val_loss: 0.0483 - val_accuracy: 0.9839
Epoch 9/10
100/100 - 2s - loss: 0.0644 - accuracy: 0.9809 - val_loss: 0.0451 - val_accuracy: 0.9857
Epoch 10/10
100/100 - 2s - loss: 0.0673 - accuracy: 0.9790 - val_loss: 0.0432 - val_accuracy: 0.9853
...
----------------------------------------------------------------------
Ran 5 tests in 132.269s

OK

Now that we know our models are working as expected, let's run them on the Google Cloud AI Platform. First, we can run the code as a Python module locally using the command line.

The cell below passes some of our variables to the command line and creates a timestamped job directory.


In [12]:
current_time = datetime.now().strftime("%y%m%d_%H%M%S")
model_type = 'cnn'

os.environ["MODEL_TYPE"] = model_type
os.environ["JOB_DIR"] = "mnist_models/models/{}_{}/".format(
    model_type, current_time)

The cell below runs the local version of the code. The epochs and steps_per_epoch flags can be changed to run for longer or shorter, as defined in our mnist_models/trainer/task.py file.


In [13]:
%%bash
python3 -m mnist_models.trainer.task \
    --job-dir=$JOB_DIR \
    --epochs=5 \
    --steps_per_epoch=50 \
    --model_type=$MODEL_TYPE


Train for 50 steps, validate for 100 steps
Epoch 1/5
50/50 - 10s - loss: 1.0179 - accuracy: 0.6738 - val_loss: 0.4321 - val_accuracy: 0.8669
Epoch 2/5
50/50 - 2s - loss: 0.3472 - accuracy: 0.8968 - val_loss: 0.1736 - val_accuracy: 0.9497
Epoch 3/5
50/50 - 2s - loss: 0.2108 - accuracy: 0.9368 - val_loss: 0.1329 - val_accuracy: 0.9587
Epoch 4/5
50/50 - 2s - loss: 0.1708 - accuracy: 0.9464 - val_loss: 0.1116 - val_accuracy: 0.9641
Epoch 5/5
50/50 - 2s - loss: 0.1560 - accuracy: 0.9514 - val_loss: 0.0938 - val_accuracy: 0.9726
2020-01-16 05:25:13.487462: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-01-16 05:25:15.782968: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-01-16 05:25:15.786337: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-16 05:25:15.786747: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.562
pciBusID: 0000:00:04.0
2020-01-16 05:25:15.786787: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-01-16 05:25:15.788375: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-01-16 05:25:15.789817: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-01-16 05:25:15.790163: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-01-16 05:25:15.792126: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-01-16 05:25:15.793596: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-01-16 05:25:15.798378: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-16 05:25:15.798647: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-16 05:25:15.799173: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-16 05:25:15.799500: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-01-16 05:25:15.811621: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2200000000 Hz
2020-01-16 05:25:15.811990: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x558dad09ee00 executing computations on platform Host. Devices:
2020-01-16 05:25:15.812024: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2020-01-16 05:25:15.879488: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-16 05:25:15.879961: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x558dad1120a0 executing computations on platform CUDA. Devices:
2020-01-16 05:25:15.879997: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2020-01-16 05:25:15.880233: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-16 05:25:15.880554: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.562
pciBusID: 0000:00:04.0
2020-01-16 05:25:15.880604: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-01-16 05:25:15.880640: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-01-16 05:25:15.880666: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-01-16 05:25:15.880689: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-01-16 05:25:15.880711: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-01-16 05:25:15.880727: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-01-16 05:25:15.880750: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-16 05:25:15.880818: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-16 05:25:15.881232: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-16 05:25:15.881623: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-01-16 05:25:15.881705: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-01-16 05:25:16.398516: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-01-16 05:25:16.398584: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2020-01-16 05:25:16.398596: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2020-01-16 05:25:16.399117: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-16 05:25:16.399539: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-16 05:25:16.399897: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10285 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
2020-01-16 05:25:18.352074: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-01-16 05:25:23.815108: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-16 05:25:24.920631: I tensorflow/core/profiler/lib/profiler_session.cc:184] Profiler session started.
2020-01-16 05:25:24.922583: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcupti.so.10.0
2020-01-16 05:25:25.180703: I tensorflow/core/platform/default/device_tracer.cc:588] Collecting 120 kernel records, 8 memcpy records.
WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (0.128515). Check your callbacks.
2020-01-16 05:25:34.101811: W tensorflow/python/util/util.cc:299] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1781: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.

Training on the cloud

Since the TensorFlow version we're using is not yet available as an AI Platform runtime, we can instead use a Deep Learning Container in order to take advantage of libraries and applications not normally packaged with AI Platform. Below is a simple Dockerfile which copies our code to be used in a TF2 environment.


In [14]:
%%writefile mnist_models/Dockerfile
FROM gcr.io/deeplearning-platform-release/tf2-cpu
COPY mnist_models/trainer /mnist_models/trainer
ENTRYPOINT ["python3", "-m", "mnist_models.trainer.task"]


Overwriting mnist_models/Dockerfile

The command below builds the image and pushes it to Google Container Registry so it can be used by AI Platform. When built, it will show up in your project's registry with the name mnist_models. (You may need to enable Cloud Build for your project first.)


In [15]:
!docker build -f mnist_models/Dockerfile -t $IMAGE_URI ./


Sending build context to Docker daemon  6.766MB
Step 1/3 : FROM gcr.io/deeplearning-platform-release/tf2-cpu
 ---> e493f17c90d0
Step 2/3 : COPY mnist_models/trainer /mnist_models/trainer
 ---> 9c7bb60ef956
Step 3/3 : ENTRYPOINT ["python3", "-m", "mnist_models.trainer.task"]
 ---> Running in 3b7135292970
Removing intermediate container 3b7135292970
 ---> 6270b4938691
Successfully built 6270b4938691
Successfully tagged gcr.io/ddetering-experimental/mnist_models:latest

In [16]:
!docker push $IMAGE_URI


The push refers to repository [gcr.io/ddetering-experimental/mnist_models]

latest: digest: sha256:9ff5bfb9f1ae5a3d58063b2e8165e855b5c770549aac317b19d344b9ec336fdc size: 4925

Finally, we can kick off the AI Platform training job. We can pass in our Docker image using the master-image-uri flag.


In [17]:
current_time = datetime.now().strftime("%y%m%d_%H%M%S")
model_type = 'cnn'

os.environ["MODEL_TYPE"] = model_type
os.environ["JOB_DIR"] = "gs://{}/mnist_{}_{}/".format(
    BUCKET, model_type, current_time)
os.environ["JOB_NAME"] = "mnist_{}_{}".format(
    model_type, current_time)

In [18]:
%%bash
echo $JOB_DIR $REGION $JOB_NAME
gcloud ai-platform jobs submit training $JOB_NAME \
    --staging-bucket=gs://$BUCKET \
    --region=$REGION \
    --master-image-uri=$IMAGE_URI \
    --scale-tier=BASIC_GPU \
    --job-dir=$JOB_DIR \
    -- \
    --model_type=$MODEL_TYPE


gs://ddetering-experimental/mnist_cnn_200116_053228/ us-central1 mnist_cnn_200116_053228
jobId: mnist_cnn_200116_053228
state: QUEUED
Job [mnist_cnn_200116_053228] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe mnist_cnn_200116_053228

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs mnist_cnn_200116_053228

Can't wait to see the results? Run the code below and copy the output into the Google Cloud Shell to follow.
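The referenced cell is not preserved here; a minimal equivalent simply echoes the stream-logs command shown above (a placeholder job name is assumed when the JOB_NAME variable is unset):

```python
import os

# Print the command to stream logs for the job submitted above.
# "mnist_cnn_YYMMDD_HHMMSS" is a placeholder, not a real job name.
job_name = os.environ.get("JOB_NAME", "mnist_cnn_YYMMDD_HHMMSS")
print("gcloud ai-platform jobs stream-logs " + job_name)
```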

Deploying and predicting with model

Once you have a model you're proud of, let's deploy it! All we need to do is give AI Platform the location of the model. The cell below uses the keras_export path of the previous job, but ${JOB_DIR}keras_export/ can always be changed to a different path.

Uncomment the delete commands below if you are getting an "already exists error" and want to deploy a new model.


In [21]:
%%bash
MODEL_NAME="mnist"
MODEL_VERSION=${MODEL_TYPE}
MODEL_LOCATION=${JOB_DIR}keras_export/
echo "Deleting and deploying $MODEL_NAME $MODEL_VERSION from $MODEL_LOCATION ... this will take a few minutes"
#yes | gcloud ai-platform versions delete ${MODEL_VERSION} --model ${MODEL_NAME}
#yes | gcloud ai-platform models delete ${MODEL_NAME}
gcloud ai-platform models create ${MODEL_NAME} --regions $REGION
gcloud ai-platform versions create ${MODEL_VERSION} \
    --model ${MODEL_NAME} \
    --origin ${MODEL_LOCATION} \
    --framework tensorflow \
    --runtime-version=2.1


Deleting and deploying mnist cnn from gs://ddetering-experimental/mnist_cnn_200116_053228/keras_export/ ... this will take a few minutes
Created ml engine model [projects/ddetering-experimental/models/mnist].
Creating version (this might take a few minutes)......
.......done.

To predict with the model, let's take one of the example images.

TODO 4: Write a .json file with image data to send to an AI Platform deployed model


In [22]:
import json, codecs
import tensorflow as tf
import matplotlib.pyplot as plt
from mnist_models.trainer import util

HEIGHT = 28
WIDTH = 28
IMGNO = 12

mnist = tf.keras.datasets.mnist.load_data()
(x_train, y_train), (x_test, y_test) = mnist
test_image = x_test[IMGNO]

jsondata = test_image.reshape(HEIGHT, WIDTH, 1).tolist()
json.dump(jsondata, codecs.open("test.json", "w", encoding = "utf-8"))
plt.imshow(test_image.reshape(HEIGHT, WIDTH));
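One detail worth knowing: gcloud's --json-instances flag expects one JSON instance per line, and json.dump conveniently writes its output on a single line. A toy check of that behavior (using a made-up 2x2x1 "image" instead of a real 28x28x1 one):

```python
import io
import json

# Toy stand-in for test_image.reshape(HEIGHT, WIDTH, 1).tolist()
jsondata = [[[0], [128]], [[255], [64]]]

buf = io.StringIO()
json.dump(jsondata, buf)

# json.dump emits a single line, so one image -> one instance line.
lines = buf.getvalue().splitlines()
print(len(lines))  # 1
```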


Finally, we can send it to the prediction service. The returned row holds the model's softmax probabilities for each digit, so the index with a value of (or close to) 1 is the predicted digit. Congrats! You've completed the lab!


In [23]:
%%bash
gcloud ai-platform predict \
    --model=mnist \
    --version=${MODEL_TYPE} \
    --json-instances=./test.json


SOFTMAX_3
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
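To map that row back to a digit, take the index of the largest probability (a quick post-processing snippet, not part of the lab):

```python
# The softmax row returned above: the predicted digit is its argmax.
probs = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
predicted_digit = max(range(len(probs)), key=probs.__getitem__)
print(predicted_digit)  # 9
```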

Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.