Please run this notebook on a GPU backend. Running it on a TPU would require porting the model from Estimator to TPUEstimator.

MNIST with TensorBoard, using the Estimator API

Fun with handwritten digits and TensorBoard.

This notebook shows you how to follow your training and validation curves in TensorBoard and what you can do to address the issues you see there.

Imports


In [0]:
import os, re, math, json, shutil, pprint, datetime
import PIL.Image, PIL.ImageFont, PIL.ImageDraw
import numpy as np
import tensorflow as tf
from matplotlib import pyplot as plt
from tensorflow.python.platform import tf_logging
print("Tensorflow version " + tf.__version__)


Tensorflow version 1.12.0

Parameters


In [0]:
BATCH_SIZE = 32 #@param {type:"integer"}
BUCKET = 'gs://' #@param {type:"string"}

assert re.search(r'gs://.+', BUCKET), 'You need a GCS bucket for your Tensorboard logs. Head to http://console.cloud.google.com/storage and create one.'

training_images_file   = 'gs://mnist-public/train-images-idx3-ubyte'
training_labels_file   = 'gs://mnist-public/train-labels-idx1-ubyte'
validation_images_file = 'gs://mnist-public/t10k-images-idx3-ubyte'
validation_labels_file = 'gs://mnist-public/t10k-labels-idx1-ubyte'

Colab-only auth


In [0]:
# backend identification
IS_COLAB_BACKEND = 'COLAB_GPU' in os.environ  # this is always set on Colab; the value is 0 or 1 depending on GPU presence

# Auth on Colab
# Little wrinkle: without auth, Colab is extremely slow at accessing data from a GCS bucket, even a public one
if IS_COLAB_BACKEND:
  from google.colab import auth
  auth.authenticate_user()

In [0]:
#@title visualization utilities [RUN ME]
"""
This cell contains helper functions used for visualization
and downloads only. You can skip reading it. There is very
little useful Keras/Tensorflow code here.
"""

# Matplotlib config
plt.rc('image', cmap='gray_r')
plt.rc('grid', linewidth=0)
plt.rc('xtick', top=False, bottom=False, labelsize='large')
plt.rc('ytick', left=False, right=False, labelsize='large')
plt.rc('axes', facecolor='F8F8F8', titlesize="large", edgecolor='white')
plt.rc('text', color='a8151a')
plt.rc('figure', facecolor='F0F0F0')

# Matplotlib fonts
MATPLOTLIB_FONT_DIR = os.path.join(os.path.dirname(plt.__file__), "mpl-data/fonts/ttf")

# Pull a batch from the datasets. This code is not very nice; it gets much better in eager mode (TODO)
def dataset_to_numpy_util(training_dataset, validation_dataset, N):
  
  # get one batch from each: 10000 validation digits, N training digits
  unbatched_train_ds = training_dataset.apply(tf.data.experimental.unbatch())
  v_images, v_labels = validation_dataset.make_one_shot_iterator().get_next()
  t_images, t_labels = unbatched_train_ds.batch(N).make_one_shot_iterator().get_next()
  
  # Run once, get one batch. Session.run returns numpy results
  with tf.Session() as ses:
    (validation_digits, validation_labels,
     training_digits, training_labels) = ses.run([v_images, v_labels, t_images, t_labels])
  
  # these were one-hot encoded in the dataset
  validation_labels = np.argmax(validation_labels, axis=1)
  training_labels = np.argmax(training_labels, axis=1)
  
  return (training_digits, training_labels,
          validation_digits, validation_labels)

# create digits from local fonts for testing
def create_digits_from_local_fonts(n):
  font_labels = []
  img = PIL.Image.new('LA', (28*n, 28), color = (0,255)) # format 'LA': black in channel 0, alpha in channel 1
  font1 = PIL.ImageFont.truetype(os.path.join(MATPLOTLIB_FONT_DIR, 'DejaVuSansMono-Oblique.ttf'), 25)
  font2 = PIL.ImageFont.truetype(os.path.join(MATPLOTLIB_FONT_DIR, 'STIXGeneral.ttf'), 25)
  d = PIL.ImageDraw.Draw(img)
  for i in range(n):
    font_labels.append(i%10)
    d.text((7+i*28,0 if i<10 else -4), str(i%10), fill=(255,255), font=font1 if i<10 else font2)
  font_digits = np.array(img.getdata(), np.float32)[:,0] / 255.0 # black in channel 0, alpha in channel 1 (discarded)
  font_digits = np.reshape(np.stack(np.split(np.reshape(font_digits, [28, 28*n]), n, axis=1), axis=0), [n, 28*28])
  return font_digits, font_labels

# utility to display a row of digits with their predictions
def display_digits(digits, predictions, labels, title, n):
  plt.figure(figsize=(13,3))
  digits = np.reshape(digits, [n, 28, 28])
  digits = np.swapaxes(digits, 0, 1)
  digits = np.reshape(digits, [28, 28*n])
  plt.yticks([])
  plt.xticks([28*x+14 for x in range(n)], predictions)
  for i,t in enumerate(plt.gca().xaxis.get_ticklabels()):
    if predictions[i] != labels[i]: t.set_color('red') # bad predictions in red
  plt.imshow(digits)
  plt.grid(None)
  plt.title(title)
  
# utility to display multiple rows of digits, sorted by unrecognized/recognized status
def display_top_unrecognized(digits, predictions, labels, n, lines):
  idx = np.argsort(predictions==labels) # sort order: unrecognized first
  for i in range(lines):
    display_digits(digits[idx][i*n:(i+1)*n], predictions[idx][i*n:(i+1)*n], labels[idx][i*n:(i+1)*n],
                   "{} sample validation digits out of {} with bad predictions in red and sorted first".format(n*lines, len(digits)) if i==0 else "", n)
    
# utility to display training and validation curves
def display_training_curves(training, validation, title, subplot):
  if subplot%10==1: # set up the subplots on the first call
    plt.subplots(figsize=(10,10), facecolor='#F0F0F0')
    plt.tight_layout()
  ax = plt.subplot(subplot)
  ax.grid(linewidth=1, color='white')
  ax.plot(training)
  ax.plot(validation)
  ax.set_title('model '+ title)
  ax.set_ylabel(title)
  ax.set_xlabel('epoch')
  ax.legend(['train', 'valid.'])

tf.data.Dataset: parse files and prepare training and validation datasets

Please read the best practices for building input pipelines with tf.data.Dataset


In [0]:
def read_label(tf_bytestring):
    label = tf.decode_raw(tf_bytestring, tf.uint8)
    label = tf.reshape(label, [])
    label = tf.one_hot(label, 10)
    return label
  
def read_image(tf_bytestring):
    image = tf.decode_raw(tf_bytestring, tf.uint8)
    image = tf.cast(image, tf.float32)/256.0
    image = tf.reshape(image, [28*28])
    return image
  
def load_dataset(image_file, label_file):
    imagedataset = tf.data.FixedLengthRecordDataset(image_file, 28*28, header_bytes=16)
    imagedataset = imagedataset.map(read_image, num_parallel_calls=16)
    labelsdataset = tf.data.FixedLengthRecordDataset(label_file, 1, header_bytes=8)
    labelsdataset = labelsdataset.map(read_label, num_parallel_calls=16)
    dataset = tf.data.Dataset.zip((imagedataset, labelsdataset))
    return dataset 
  
def get_training_dataset(image_file, label_file, batch_size):
    dataset = load_dataset(image_file, label_file)
    dataset = dataset.cache()  # this small dataset can be entirely cached in RAM; on TPU this is important for good performance
    dataset = dataset.shuffle(5000, reshuffle_each_iteration=True)
    dataset = dataset.repeat() # repeat forever: the training duration is controlled by max_steps in the TrainSpec
    dataset = dataset.batch(batch_size, drop_remainder=True) # drop_remainder is important on TPU, batch size must be fixed
    dataset = dataset.prefetch(-1)  # prefetch next batch while training (-1: autotune prefetch buffer size)
    return dataset
  
def get_validation_dataset(image_file, label_file):
    dataset = load_dataset(image_file, label_file)
    dataset = dataset.cache() # this small dataset can be entirely cached in RAM; on TPU this is important for good performance
    dataset = dataset.batch(10000, drop_remainder=True) # 10000 items in eval dataset, all in one batch
    dataset = dataset.repeat() # repeat forever: the evaluation duration is controlled by steps=1 in the EvalSpec
    return dataset

# instantiate the datasets
training_dataset = get_training_dataset(training_images_file, training_labels_file, BATCH_SIZE)
validation_dataset = get_validation_dataset(validation_images_file, validation_labels_file)

# For the Estimator API, we will need functions that return the datasets
training_input_fn = lambda: get_training_dataset(training_images_file, training_labels_file, BATCH_SIZE)
validation_input_fn = lambda: get_validation_dataset(validation_images_file, validation_labels_file)
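
The header_bytes values above come from the MNIST IDX file format: image files start with a 16-byte header (magic number, image count, rows, columns) and label files with an 8-byte header (magic number, label count), all stored as big-endian uint32. If you want to convince yourself, here is a small optional sanity check (not part of the original notebook) that reads the headers directly with tf.gfile:


In [0]:
# Optional sanity check (not in the original notebook): read the IDX headers directly
# to confirm the header_bytes values used in load_dataset. All fields are big-endian uint32.
import struct

with tf.gfile.GFile(training_images_file, 'rb') as f:
  magic, n_images, rows, cols = struct.unpack('>IIII', f.read(16))
print('image file: magic={} count={} size={}x{}'.format(magic, n_images, rows, cols))  # expect magic 2051, 60000 images of 28x28

with tf.gfile.GFile(training_labels_file, 'rb') as f:
  magic, n_labels = struct.unpack('>II', f.read(8))
print('label file: magic={} count={}'.format(magic, n_labels))  # expect magic 2049, 60000 labels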

Let's have a look at the data


In [0]:
N = 24
(training_digits, training_labels,
 validation_digits, validation_labels) = dataset_to_numpy_util(training_dataset, validation_dataset, N)
display_digits(training_digits, training_labels, training_labels, "training digits and their labels", N)
display_digits(validation_digits[:N], validation_labels[:N], validation_labels[:N], "validation digits and their labels", N)
font_digits, font_labels = create_digits_from_local_fonts(N)


Estimator model

If you are not sure what cross-entropy, dropout, softmax or batch-normalization mean, head here for a crash-course: Tensorflow and deep learning without a PhD


In [0]:
# This model trains to 99.4% (sometimes 99.5%) accuracy in 10 epochs
def model_fn(features, labels, mode):
  x = features
  is_training = (mode == tf.estimator.ModeKeys.TRAIN)

  y = tf.reshape(x, [-1, 28, 28, 1])

  # little wrinkle: tf.keras.layers can normally be used in an Estimator but tf.keras.layers.BatchNormalization does not work
  # in an Estimator environment. Using tf.layers everywhere for consistency. tf.layers and tf.keras.layers are carbon copies of each other.
  
  y = tf.layers.Conv2D(filters=6, kernel_size=3, padding='same', use_bias=False)(y) # no bias necessary before batch norm
  y = tf.layers.BatchNormalization(scale=False, center=True)(y, training=is_training) # no batch norm scaling necessary before "relu"
  y = tf.nn.relu(y) # activation after batch norm

  y = tf.layers.Conv2D(filters=12, kernel_size=6, padding='same', use_bias=False, strides=2)(y)
  y = tf.layers.BatchNormalization(scale=False, center=True)(y, training=is_training)
  y = tf.nn.relu(y)

  y = tf.layers.Conv2D(filters=24, kernel_size=6, padding='same', use_bias=False, strides=2)(y)
  y = tf.layers.BatchNormalization(scale=False, center=True)(y, training=is_training)
  y = tf.nn.relu(y)

  y = tf.layers.Flatten()(y)
  y = tf.layers.Dense(200, use_bias=False)(y)
  y = tf.layers.BatchNormalization(scale=False, center=True)(y, training=is_training)
  y = tf.nn.relu(y)
  y = tf.layers.Dropout(0.5)(y, training=is_training)
  
  logits = tf.layers.Dense(10)(y)
  predictions = tf.nn.softmax(logits)
  classes = tf.math.argmax(predictions, axis=-1)
  
  if (mode != tf.estimator.ModeKeys.PREDICT):
    loss = tf.losses.softmax_cross_entropy(labels, logits)

    step = tf.train.get_or_create_global_step()
    lr = 0.0001 + tf.train.exponential_decay(0.01, step, 2000, 1/math.e)
    tf.summary.scalar("learn_rate", lr)

    optimizer = tf.train.AdamOptimizer(lr)
    # little wrinkle: batch norm uses running averages which need updating after each batch. create_train_op does it, optimizer.minimize does not.
    train_op = tf.contrib.training.create_train_op(loss, optimizer)
    #train_op = optimizer.minimize(loss, tf.train.get_or_create_global_step())

    metrics = {'accuracy': tf.metrics.accuracy(classes, tf.math.argmax(labels, axis=-1))}
  else:
    loss = train_op = metrics = None  # None of these can be computed in prediction mode because labels are not available
  
  return tf.estimator.EstimatorSpec(
    mode=mode,
    predictions={"predictions": predictions, "classes": classes},  # name these fields as you like
    loss=loss,
    train_op=train_op,
    eval_metric_ops=metrics
  )
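
About the batch norm "wrinkle" above: tf.contrib.training.create_train_op adds the moving-average update ops for you. If you prefer to avoid tf.contrib, the standard TF 1.x pattern is to make the train op depend explicitly on the UPDATE_OPS collection. A minimal sketch of that alternative (not what this notebook uses):


In [0]:
# Alternative to tf.contrib.training.create_train_op (a sketch, not used above):
# tf.layers.BatchNormalization registers its moving-average updates in the UPDATE_OPS
# collection; running the minimize op inside a control dependency on them keeps the
# batch norm statistics up to date.
def make_train_op(loss, optimizer):
  update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
  with tf.control_dependencies(update_ops):
    return optimizer.minimize(loss, global_step=tf.train.get_or_create_global_step())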

In [0]:
# Called once when the model is saved. This function produces a Tensorflow
# graph of operations that will be prepended to your model graph. When
# your model is deployed as a REST API, the API receives data in JSON format,
# parses it into Tensors, then sends the tensors to the input graph generated by
# this function. The graph can transform the data so it can be sent into your
# model input_fn. You can do anything you want here as long as you do it with
# tf.* functions that produce a graph of operations.
def serving_input_fn():
    # placeholder for the data received by the API (already parsed, no JSON decoding necessary,
    # but each JSON instance must contain a 'serving_input' key with a 28x28 greyscale image as content)
    inputs = {"serving_input": tf.placeholder(tf.float32, [None, 28, 28])}  # the shape of this dict should match the shape of your JSON
    features = inputs['serving_input']  # no transformation needed
    return tf.estimator.export.TensorServingInputReceiver(features, inputs)  # features are the features needed by your model_fn
    # Return a ServingInputReceiver if your features are a dictionary of Tensors, TensorServingInputReceiver if they are a straight Tensor
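
For comparison, the distinction mentioned in the last comment looks like this in code. A sketch only (this notebook does not use it), assuming a hypothetical model_fn that takes a dictionary of features keyed by "image":


In [0]:
# A sketch for comparison only (not used in this notebook): when model_fn expects a
# dictionary of feature Tensors, return a ServingInputReceiver instead.
def serving_input_fn_dict():
    inputs = {"image": tf.placeholder(tf.float32, [None, 28, 28])}  # keys must match the JSON payload
    features = {"image": inputs["image"]}  # keys must match what the (hypothetical) model_fn expects
    return tf.estimator.export.ServingInputReceiver(features, inputs)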

Train and validate the model


In [0]:
EPOCHS = 8
steps_per_epoch = 60000 // BATCH_SIZE  # 60,000 images in training dataset
MODEL_EXPORT_NAME = "mnist"  # name for exporting saved model

tf_logging.set_verbosity(tf_logging.INFO)
now = datetime.datetime.now()
MODEL_DIR = BUCKET+"/mnistjobs/job" + "-{}-{:02d}-{:02d}-{:02d}:{:02d}:{:02d}".format(now.year, now.month, now.day, now.hour, now.minute, now.second)

training_config = tf.estimator.RunConfig(model_dir=MODEL_DIR, save_summary_steps=10, save_checkpoints_steps=steps_per_epoch, log_step_count_steps=steps_per_epoch/4)
export_latest = tf.estimator.LatestExporter(MODEL_EXPORT_NAME, serving_input_receiver_fn=serving_input_fn)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=training_config)

train_spec = tf.estimator.TrainSpec(training_input_fn, max_steps=EPOCHS*steps_per_epoch)
eval_spec = tf.estimator.EvalSpec(validation_input_fn, steps=1, exporters=export_latest, throttle_secs=0) # no eval throttling: evaluates after each checkpoint

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
tf_logging.set_verbosity(tf_logging.WARN)


INFO:tensorflow:Using config: {'_model_dir': 'gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52', '_tf_random_seed': None, '_save_summary_steps': 10, '_save_checkpoints_steps': 1875, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 468.75, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f00aad098d0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps 1875 or save_checkpoints_secs None.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt.
INFO:tensorflow:loss = 3.221363, step = 0
INFO:tensorflow:global_step/sec: 131.309
INFO:tensorflow:loss = 0.06300494, step = 469 (3.573 sec)
INFO:tensorflow:global_step/sec: 141.173
INFO:tensorflow:loss = 0.18602926, step = 938 (3.322 sec)
INFO:tensorflow:global_step/sec: 141.169
INFO:tensorflow:loss = 0.0048653, step = 1407 (3.323 sec)
INFO:tensorflow:Saving checkpoints for 1875 into gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-12-02-21:31:45
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt-1875
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [1/1]
INFO:tensorflow:Finished evaluation at 2018-12-02-21:31:49
INFO:tensorflow:Saving dict for global step 1875: accuracy = 0.9872, global_step = 1875, loss = 0.039856445
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1875: gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt-1875
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Signatures INCLUDED in export for Classify: None
INFO:tensorflow:Signatures INCLUDED in export for Regress: None
INFO:tensorflow:Signatures INCLUDED in export for Predict: ['serving_default']
INFO:tensorflow:Signatures INCLUDED in export for Train: None
INFO:tensorflow:Signatures INCLUDED in export for Eval: None
INFO:tensorflow:Restoring parameters from gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt-1875
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/estimator/estimator.py:1044: calling SavedModelBuilder.add_meta_graph_and_variables (from tensorflow.python.saved_model.builder_impl) with legacy_init_op is deprecated and will be removed in a future version.
Instructions for updating:
Pass your op to the equivalent parameter main_op instead.
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/export/mnist/temp-b'1543786314'/saved_model.pb
INFO:tensorflow:global_step/sec: 11.2124
INFO:tensorflow:loss = 0.17025141, step = 1876 (41.828 sec)
INFO:tensorflow:global_step/sec: 182.519
INFO:tensorflow:loss = 0.008357742, step = 2345 (2.569 sec)
INFO:tensorflow:global_step/sec: 180.884
INFO:tensorflow:loss = 0.018714316, step = 2814 (2.593 sec)
INFO:tensorflow:global_step/sec: 181.33
INFO:tensorflow:loss = 0.021446247, step = 3283 (2.586 sec)
INFO:tensorflow:Saving checkpoints for 3750 into gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-12-02-21:32:35
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt-3750
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [1/1]
INFO:tensorflow:Finished evaluation at 2018-12-02-21:32:38
INFO:tensorflow:Saving dict for global step 3750: accuracy = 0.991, global_step = 3750, loss = 0.025054898
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 3750: gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt-3750
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Signatures INCLUDED in export for Classify: None
INFO:tensorflow:Signatures INCLUDED in export for Regress: None
INFO:tensorflow:Signatures INCLUDED in export for Predict: ['serving_default']
INFO:tensorflow:Signatures INCLUDED in export for Train: None
INFO:tensorflow:Signatures INCLUDED in export for Eval: None
INFO:tensorflow:Restoring parameters from gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt-3750
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/export/mnist/temp-b'1543786359'/saved_model.pb
INFO:tensorflow:global_step/sec: 13.2764
INFO:tensorflow:loss = 0.009053471, step = 3752 (35.326 sec)
INFO:tensorflow:global_step/sec: 181.479
INFO:tensorflow:loss = 0.18442132, step = 4221 (2.584 sec)
INFO:tensorflow:global_step/sec: 183.38
INFO:tensorflow:loss = 0.06688864, step = 4690 (2.564 sec)
INFO:tensorflow:global_step/sec: 142.75
INFO:tensorflow:loss = 0.07341187, step = 5159 (3.279 sec)
INFO:tensorflow:Saving checkpoints for 5625 into gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-12-02-21:33:18
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt-5625
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [1/1]
INFO:tensorflow:Finished evaluation at 2018-12-02-21:33:21
INFO:tensorflow:Saving dict for global step 5625: accuracy = 0.9919, global_step = 5625, loss = 0.02433909
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 5625: gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt-5625
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Signatures INCLUDED in export for Classify: None
INFO:tensorflow:Signatures INCLUDED in export for Regress: None
INFO:tensorflow:Signatures INCLUDED in export for Predict: ['serving_default']
INFO:tensorflow:Signatures INCLUDED in export for Train: None
INFO:tensorflow:Signatures INCLUDED in export for Eval: None
INFO:tensorflow:Restoring parameters from gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt-5625
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/export/mnist/temp-b'1543786402'/saved_model.pb
INFO:tensorflow:global_step/sec: 13.5759
INFO:tensorflow:loss = 0.047259383, step = 5628 (34.547 sec)
INFO:tensorflow:global_step/sec: 181.762
INFO:tensorflow:loss = 0.05273029, step = 6097 (2.580 sec)
INFO:tensorflow:global_step/sec: 182.219
INFO:tensorflow:loss = 0.0065530227, step = 6566 (2.574 sec)
INFO:tensorflow:global_step/sec: 180.293
INFO:tensorflow:loss = 0.24633394, step = 7035 (2.601 sec)
INFO:tensorflow:Saving checkpoints for 7500 into gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-12-02-21:34:00
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt-7500
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [1/1]
INFO:tensorflow:Finished evaluation at 2018-12-02-21:34:03
INFO:tensorflow:Saving dict for global step 7500: accuracy = 0.9934, global_step = 7500, loss = 0.019156769
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 7500: gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt-7500
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Signatures INCLUDED in export for Classify: None
INFO:tensorflow:Signatures INCLUDED in export for Regress: None
INFO:tensorflow:Signatures INCLUDED in export for Predict: ['serving_default']
INFO:tensorflow:Signatures INCLUDED in export for Train: None
INFO:tensorflow:Signatures INCLUDED in export for Eval: None
INFO:tensorflow:Restoring parameters from gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt-7500
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/export/mnist/temp-b'1543786444'/saved_model.pb
INFO:tensorflow:global_step/sec: 13.752
INFO:tensorflow:loss = 0.01225789, step = 7504 (34.104 sec)
INFO:tensorflow:global_step/sec: 181.817
INFO:tensorflow:loss = 0.002070228, step = 7973 (2.580 sec)
INFO:tensorflow:global_step/sec: 182.989
INFO:tensorflow:loss = 0.0005350555, step = 8442 (2.563 sec)
INFO:tensorflow:global_step/sec: 182.368
INFO:tensorflow:loss = 0.026766729, step = 8911 (2.572 sec)
INFO:tensorflow:Saving checkpoints for 9375 into gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-12-02-21:34:43
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt-9375
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [1/1]
INFO:tensorflow:Finished evaluation at 2018-12-02-21:34:46
INFO:tensorflow:Saving dict for global step 9375: accuracy = 0.9938, global_step = 9375, loss = 0.017522633
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 9375: gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt-9375
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Signatures INCLUDED in export for Classify: None
INFO:tensorflow:Signatures INCLUDED in export for Regress: None
INFO:tensorflow:Signatures INCLUDED in export for Predict: ['serving_default']
INFO:tensorflow:Signatures INCLUDED in export for Train: None
INFO:tensorflow:Signatures INCLUDED in export for Eval: None
INFO:tensorflow:Restoring parameters from gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt-9375
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/export/mnist/temp-b'1543786487'/saved_model.pb
INFO:tensorflow:global_step/sec: 12.8443
INFO:tensorflow:loss = 0.15341315, step = 9380 (37.426 sec)
INFO:tensorflow:global_step/sec: 134.156
INFO:tensorflow:loss = 0.0002842332, step = 9849 (2.585 sec)
INFO:tensorflow:global_step/sec: 182.844
INFO:tensorflow:loss = 0.00032512506, step = 10318 (2.569 sec)
INFO:tensorflow:global_step/sec: 181.927
INFO:tensorflow:loss = 0.00846606, step = 10787 (2.574 sec)
INFO:tensorflow:Saving checkpoints for 11250 into gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-12-02-21:35:28
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt-11250
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [1/1]
INFO:tensorflow:Finished evaluation at 2018-12-02-21:35:31
INFO:tensorflow:Saving dict for global step 11250: accuracy = 0.9939, global_step = 11250, loss = 0.017586086
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 11250: gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt-11250
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Signatures INCLUDED in export for Classify: None
INFO:tensorflow:Signatures INCLUDED in export for Regress: None
INFO:tensorflow:Signatures INCLUDED in export for Predict: ['serving_default']
INFO:tensorflow:Signatures INCLUDED in export for Train: None
INFO:tensorflow:Signatures INCLUDED in export for Eval: None
INFO:tensorflow:Restoring parameters from gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt-11250
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/export/mnist/temp-b'1543786532'/saved_model.pb
INFO:tensorflow:global_step/sec: 12.7952
INFO:tensorflow:loss = 0.0036112254, step = 11256 (36.654 sec)
INFO:tensorflow:global_step/sec: 182.804
INFO:tensorflow:loss = 0.00047459383, step = 11725 (2.566 sec)
INFO:tensorflow:global_step/sec: 183.244
INFO:tensorflow:loss = 0.0034961137, step = 12194 (2.559 sec)
INFO:tensorflow:global_step/sec: 183.219
INFO:tensorflow:loss = 0.13055202, step = 12663 (2.560 sec)
INFO:tensorflow:Saving checkpoints for 13125 into gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-12-02-21:36:13
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt-13125
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [1/1]
INFO:tensorflow:Finished evaluation at 2018-12-02-21:36:16
INFO:tensorflow:Saving dict for global step 13125: accuracy = 0.9944, global_step = 13125, loss = 0.017364949
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 13125: gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt-13125
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Signatures INCLUDED in export for Classify: None
INFO:tensorflow:Signatures INCLUDED in export for Regress: None
INFO:tensorflow:Signatures INCLUDED in export for Predict: ['serving_default']
INFO:tensorflow:Signatures INCLUDED in export for Train: None
INFO:tensorflow:Signatures INCLUDED in export for Eval: None
INFO:tensorflow:Restoring parameters from gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt-13125
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/export/mnist/temp-b'1543786577'/saved_model.pb
INFO:tensorflow:global_step/sec: 12.7545
INFO:tensorflow:loss = 0.01629746, step = 13132 (36.772 sec)
INFO:tensorflow:global_step/sec: 183.609
INFO:tensorflow:loss = 0.0024809996, step = 13601 (2.554 sec)
INFO:tensorflow:global_step/sec: 181.988
INFO:tensorflow:loss = 0.007379628, step = 14070 (2.580 sec)
INFO:tensorflow:global_step/sec: 182.966
INFO:tensorflow:loss = 0.0026895313, step = 14539 (2.560 sec)
INFO:tensorflow:Saving checkpoints for 15000 into gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-12-02-21:36:57
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt-15000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [1/1]
INFO:tensorflow:Finished evaluation at 2018-12-02-21:37:00
INFO:tensorflow:Saving dict for global step 15000: accuracy = 0.9944, global_step = 15000, loss = 0.017209128
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 15000: gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt-15000
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Signatures INCLUDED in export for Classify: None
INFO:tensorflow:Signatures INCLUDED in export for Regress: None
INFO:tensorflow:Signatures INCLUDED in export for Predict: ['serving_default']
INFO:tensorflow:Signatures INCLUDED in export for Train: None
INFO:tensorflow:Signatures INCLUDED in export for Eval: None
INFO:tensorflow:Restoring parameters from gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/model.ckpt-15000
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: gs://stagingtemp/mnistjobs/job-2018-12-02-21:30:52/export/mnist/temp-b'1543786622'/saved_model.pb
INFO:tensorflow:Loss for final step: 0.0018769283.
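
The training and validation curves for this run are now in MODEL_DIR on your GCS bucket. A quick way to look at them (a sketch, not part of the original notebook): the TensorBoard notebook integration ships with TensorBoard 1.13 and later; on an older backend you can run the command-line tool instead, for example tensorboard --logdir gs://YOUR-BUCKET/mnistjobs from any machine with access to the bucket.


In [0]:
# A sketch (not in the original notebook): open TensorBoard on the job directory.
# Requires TensorBoard >= 1.13; on TensorBoard 1.13 the module was loaded as tensorboard.notebook too,
# on older backends use the command-line tool as noted above.
from tensorboard import notebook
notebook.start("--logdir " + MODEL_DIR)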

Visualize predictions


In [0]:
# recognize digits from local fonts
predictions = estimator.predict(lambda:  tf.data.Dataset.from_tensor_slices(font_digits).batch(N),
                                  yield_single_examples=False)  # the returned value is a generator that will yield one batch of predictions per next() call
predicted_font_classes = next(predictions)['classes']
display_digits(font_digits, predicted_font_classes, font_labels, "predictions from local fonts (bad predictions in red)", N)

# recognize validation digits
predictions = estimator.predict(validation_input_fn,
                                    yield_single_examples=False)  # the returned value is a generator that will yield one batch of predictions per next() call
predicted_labels = next(predictions)['classes']
display_top_unrecognized(validation_digits, predicted_labels, validation_labels, N, 7)


Deploy the trained model to ML Engine

Push your trained model to production on ML Engine for a serverless, autoscaled, REST API experience.

You will need a GCS bucket and a GCP project for this. Models deployed on ML Engine autoscale to zero when not in use, so there will be no ML Engine charges after you are done testing. Google Cloud Storage does incur charges; empty the bucket after deployment if you want to avoid them. Once the model is deployed, the bucket is no longer needed.

Configuration


In [0]:
PROJECT = "" #@param {type:"string"}
NEW_MODEL = True #@param {type:"boolean"}
MODEL_NAME = "estimator_mnist" #@param {type:"string"}
MODEL_VERSION = "v0" #@param {type:"string"}

assert PROJECT, 'For this part, you need a GCP project. Head to http://console.cloud.google.com/ and create one.'

export_path = os.path.join(MODEL_DIR, 'export', MODEL_EXPORT_NAME)
last_export = sorted(tf.gfile.ListDirectory(export_path))[-1]
export_path = os.path.join(export_path, last_export)
print('Saved model directory found: ', export_path)
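
Before deploying, you can optionally sanity-check the exported SavedModel locally. tf.contrib.predictor can load the export and run its serving signature in-process; a minimal sketch (not part of the original notebook), assuming the signature keys follow the serving_input_fn and predictions dictionaries defined earlier:


In [0]:
# Optional local check of the export before deploying (a sketch, not in the original notebook).
# Loads the SavedModel and runs its default serving signature on a few validation digits.
from tensorflow.contrib import predictor

predict_fn = predictor.from_saved_model(export_path)
local_preds = predict_fn({"serving_input": np.reshape(validation_digits[:N], [N, 28, 28])})
print(local_preds["classes"])  # predicted classes for the first N validation digits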

Deploy the model

This uses the command-line interface. You can do the same thing through the ML Engine UI at https://console.cloud.google.com/mlengine/models


In [0]:
# Create the model
if NEW_MODEL:
  !gcloud ml-engine models create {MODEL_NAME} --project={PROJECT} --regions=us-central1

In [0]:
# Create a version of this model (you can add --async at the end of the line to make this call non-blocking)
# Additional config flags are available: https://cloud.google.com/ml-engine/reference/rest/v1/projects.models.versions
# You can also deploy a model that is stored locally by providing a --staging-bucket=... parameter
!echo "Deployment takes a couple of minutes. You can watch your deployment here: https://console.cloud.google.com/mlengine/models/{MODEL_NAME}"
!gcloud ml-engine versions create {MODEL_VERSION} --model={MODEL_NAME} --origin={export_path} --project={PROJECT} --runtime-version=1.10
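
Once the command returns, you can confirm that the new version exists and is in the READY state (an optional check, not part of the original notebook):


In [0]:
# Optional: list the versions of the deployed model and check their state.
!gcloud ml-engine versions list --model={MODEL_NAME} --project={PROJECT}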

Test the deployed model

Your model is now available as a REST API. Let us try to call it. The cells below use the "gcloud ml-engine" command-line tool, but any tool that can send a JSON payload to a REST endpoint will work.


In [0]:
# prepare digits to send to online prediction endpoint
digits = np.concatenate((font_digits, validation_digits[:100-N]))
labels = np.concatenate((font_labels, validation_labels[:100-N]))
with open("digits.json", "w") as f:
  for digit in digits:
    # the format for ML Engine online predictions is: one JSON object per line
    data = json.dumps({"serving_input": digit.tolist()})  # "serving_input" because that is what you defined in your serving_input_fn: {"serving_input": tf.placeholder(tf.float32, [None, 28, 28])}
    f.write(data+'\n')

In [0]:
# Request online predictions from deployed model (REST API) using the "gcloud ml-engine" command line.
predictions = !gcloud ml-engine predict --model={MODEL_NAME} --json-instances digits.json --project={PROJECT} --version {MODEL_VERSION}

predictions = np.array([int(p.split('[')[0]) for p in predictions[1:]]) # first line is the name of the input layer: drop it, parse the rest
display_top_unrecognized(digits, predictions, labels, N, 100//N)

License


author: Martin Gorner
twitter: @martin_gorner


Copyright 2018 Google LLC

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.


This is not an official Google product but sample code provided for educational purposes.