Training a Keras model on Cloud AI Platform

This notebook illustrates distributed training and hyperparameter tuning on Cloud AI Platform (formerly known as Cloud ML Engine). It uses Keras and requires TensorFlow 2.0.


In [1]:
# change these to try this notebook out
BUCKET = 'cloud-training-demos-ml'
PROJECT = 'cloud-training-demos'
REGION = 'us-central1'

In [2]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '2.0'  # used as the runtime version for the AI Platform jobs submitted later

In [3]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION


Updated property [core/project].
Updated property [compute/region].

In [ ]:
%%bash
if ! gsutil ls gs://${BUCKET}/babyweight/preproc/ > /dev/null 2>&1; then
  gsutil mb -l ${REGION} gs://${BUCKET}
  # copy the canonical set of preprocessed files if you didn't run the previous notebook
  gsutil -m cp -R gs://cloud-training-demos/babyweight gs://${BUCKET}
fi

In [4]:
%%bash
gsutil ls gs://${BUCKET}/babyweight/preproc/*-00000*


gs://cloud-training-demos-ml/babyweight/preproc/eval.csv-00000-of-00012
gs://cloud-training-demos-ml/babyweight/preproc/eval_2000.csv-00000-of-00030
gs://cloud-training-demos-ml/babyweight/preproc/train.csv-00000-of-00043
gs://cloud-training-demos-ml/babyweight/preproc/train_2000.csv-00000-of-00255

Now that we have the Keras wide-and-deep code working on a subset of the data, we can package the TensorFlow code up as a Python module and train it on Cloud AI Platform.

Train on Cloud AI Platform

Training on Cloud AI Platform requires:

  1. Making the code a Python package (see the layout sketch below)
  2. Using gcloud to submit the training code to Cloud AI Platform

Ensure that the AI Platform API is enabled for your project (in the APIs & Services section of the GCP Console).
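
The package you will build over the next few cells has roughly this layout (the empty trainer/__init__.py is an assumption on my part -- the cells below only create the directory, so add it yourself if the trainer module fails to import):

babyweight_tf2/
  trainer/
    task.py         # argument parsing and job setup (Lab Task 1)
    model.py        # input pipeline, wide-and-deep model, training loop (Lab Task 2)
  Dockerfile        # container image used to run the trainer on AI Platform (Lab Task 4)
  push_docker.sh    # builds the image and pushes it to Container Registry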

Lab Task 1

The following cell writes babyweight_tf2/trainer/task.py.


In [ ]:
!mkdir -p babyweight_tf2/trainer

In [ ]:
%%writefile babyweight_tf2/trainer/task.py
import argparse
import json
import os

from . import model

import tensorflow as tf

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--bucket',
        help = 'GCS path to data. We assume that data is in gs://BUCKET/babyweight/preproc/',
        required = True
    )
    parser.add_argument(
        '--output_dir',
        help = 'GCS location to write checkpoints and export models',
        required = True
    )
    parser.add_argument(
        '--batch_size',
        help = 'Number of examples to compute gradient over.',
        type = int,
        default = 512
    )
    parser.add_argument(
        '--job-dir',
        help = 'this model ignores this field, but it is required by gcloud',
        default = 'junk'
    )
    parser.add_argument(
        '--nnsize',
        help = 'Hidden layer sizes to use for DNN feature columns -- provide space-separated layers',
        nargs = '+',
        type = int,
        default=[128, 32, 4]
    )
    parser.add_argument(
        '--nembeds',
        help = 'Embedding size of a cross of n key real-valued parameters',
        type = int,
        default = 3
    )

    ## TODO 1: add the new arguments here 
    parser.add_argument(
        '--train_examples',
        help = 'Number of examples (in thousands) to run the training job over. If this is more than actual # of examples available, it cycles through them. So specifying 1000 here when you have only 100k examples makes this 10 epochs.',
        type = int,
        default = 5000
    )    
    parser.add_argument(
        '--pattern',
        help = 'Specify a pattern that has to be in input files. For example 00001-of will process only one shard',
        default = 'of'
    )
    parser.add_argument(
        '--eval_steps',
        help = 'Positive number of steps for which to evaluate model. Default to None, which means to evaluate until input_fn raises an end-of-input exception',
        type = int,       
        default = None
    )
        
    ## parse all arguments
    args = parser.parse_args()
    arguments = args.__dict__

    # unused args provided by service
    arguments.pop('job_dir', None)
    arguments.pop('job-dir', None)

    ## assign the arguments to the model variables
    output_dir = arguments.pop('output_dir')
    model.BUCKET     = arguments.pop('bucket')
    model.BATCH_SIZE = arguments.pop('batch_size')
    model.TRAIN_EXAMPLES = arguments.pop('train_examples') * 1000
    model.EVAL_STEPS = arguments.pop('eval_steps')    
    print("Will train on {} examples using batch_size={}".format(model.TRAIN_EXAMPLES, model.BATCH_SIZE))
    model.PATTERN = arguments.pop('pattern')
    model.NEMBEDS = arguments.pop('nembeds')
    model.NNSIZE = arguments.pop('nnsize')
    print("Will use DNN size of {}".format(model.NNSIZE))

    # Append trial_id to path if we are doing hptuning
    # This code can be removed if you are not using hyperparameter tuning
    output_dir = os.path.join(
        output_dir,
        json.loads(
            os.environ.get('TF_CONFIG', '{}')
        ).get('task', {}).get('trial', '')
    )

    # Run the training job
    model.train_and_evaluate(output_dir)
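
For reference, the trial id read above comes from the TF_CONFIG environment variable that AI Platform sets during hyperparameter tuning. Here is a minimal illustration of that extraction logic, using a hypothetical TF_CONFIG value and a hypothetical output path:

import json, os

# hypothetical TF_CONFIG for trial number 7 of a hyperparameter tuning job
os.environ['TF_CONFIG'] = json.dumps({'task': {'type': 'master', 'index': 0, 'trial': '7'}})

trial = json.loads(os.environ.get('TF_CONFIG', '{}')).get('task', {}).get('trial', '')
print(os.path.join('gs://my-bucket/babyweight/hyperparam', trial))
# gs://my-bucket/babyweight/hyperparam/7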

Lab Task 2

The following cell writes babyweight_tf2/trainer/model.py.


In [ ]:
%%writefile babyweight_tf2/trainer/model.py
import shutil, os, datetime
import numpy as np
import tensorflow as tf

BUCKET = None  # set from task.py
PATTERN = 'of' # gets all files

# Determine CSV, label, and key columns
CSV_COLUMNS = 'weight_pounds,is_male,mother_age,plurality,gestation_weeks,key'.split(',')
LABEL_COLUMN = 'weight_pounds'
KEY_COLUMN = 'key'

# Set default values for each CSV column
DEFAULTS = [[0.0], ['null'], [0.0], ['null'], [0.0], ['nokey']]

# Define some hyperparameters
TRAIN_EXAMPLES = 1000 * 1000
EVAL_STEPS = None
NUM_EVALS = 10
BATCH_SIZE = 512
NEMBEDS = 3
NNSIZE = [64, 16, 4]

# Create an input function reading a file using the Dataset API
def features_and_labels(row_data):
    for unwanted_col in ['key']:
        row_data.pop(unwanted_col)
    label = row_data.pop(LABEL_COLUMN)
    return row_data, label  # features, label

# load the training data
def load_dataset(pattern, batch_size=1, mode=tf.estimator.ModeKeys.EVAL):
    dataset = (tf.data.experimental.make_csv_dataset(pattern, batch_size, CSV_COLUMNS, DEFAULTS)
               .map(features_and_labels))  # features, label
    if mode == tf.estimator.ModeKeys.TRAIN:
        dataset = dataset.shuffle(1000).repeat()
    dataset = dataset.prefetch(1)  # take advantage of multi-threading; 1=AUTOTUNE
    return dataset

## Build a Keras wide-and-deep model using its Functional API
def rmse(y_true, y_pred):
    return tf.sqrt(tf.reduce_mean(tf.square(y_pred - y_true))) 

# Helper function to handle categorical columns
def categorical_fc(name, values):
    orig = tf.feature_column.categorical_column_with_vocabulary_list(name, values)
    wrapped = tf.feature_column.indicator_column(orig)
    return orig, wrapped

def build_wd_model(dnn_hidden_units = [64, 32], nembeds = 3):
    # input layer
    deep_inputs = {
        colname : tf.keras.layers.Input(name=colname, shape=(), dtype='float32')
           for colname in ['mother_age', 'gestation_weeks']
    }
    wide_inputs = {
        colname : tf.keras.layers.Input(name=colname, shape=(), dtype='string')
           for colname in ['is_male', 'plurality']        
    }
    inputs = {**wide_inputs, **deep_inputs}
    
    # feature columns from inputs
    deep_fc = {
        colname : tf.feature_column.numeric_column(colname)
           for colname in ['mother_age', 'gestation_weeks']
    }
    wide_fc = {}
    is_male, wide_fc['is_male'] = categorical_fc('is_male', ['True', 'False', 'Unknown'])
    plurality, wide_fc['plurality'] = categorical_fc('plurality',
                      ['Single(1)', 'Twins(2)', 'Triplets(3)',
                       'Quadruplets(4)', 'Quintuplets(5)','Multiple(2+)'])
    
    # bucketize the float fields. This makes them wide
    age_buckets = tf.feature_column.bucketized_column(deep_fc['mother_age'],
                                                     boundaries=np.arange(15,45,1).tolist())
    wide_fc['age_buckets'] = tf.feature_column.indicator_column(age_buckets)
    gestation_buckets = tf.feature_column.bucketized_column(deep_fc['gestation_weeks'],
                                                     boundaries=np.arange(17,47,1).tolist())
    wide_fc['gestation_buckets'] = tf.feature_column.indicator_column(gestation_buckets)
    
    # cross all the wide columns. We have to do the crossing before we one-hot encode
    crossed = tf.feature_column.crossed_column(
        [is_male, plurality, age_buckets, gestation_buckets], hash_bucket_size=20000)
    deep_fc['crossed_embeds'] = tf.feature_column.embedding_column(crossed, nembeds)

    # the constructor for DenseFeatures takes a list of numeric columns
    # The Functional API in Keras requires that you specify: LayerConstructor()(inputs)
    wide_inputs = tf.keras.layers.DenseFeatures(wide_fc.values(), name='wide_inputs')(inputs)
    deep_inputs = tf.keras.layers.DenseFeatures(deep_fc.values(), name='deep_inputs')(inputs)
        
    # hidden layers for the deep side
    layers = [int(x) for x in dnn_hidden_units]
    deep = deep_inputs
    for layerno, numnodes in enumerate(layers):
        deep = tf.keras.layers.Dense(numnodes, activation='relu', name='dnn_{}'.format(layerno+1))(deep)        
    deep_out = deep
    
    # linear model for the wide side
    wide_out = tf.keras.layers.Dense(10, activation='relu', name='linear')(wide_inputs)
   
    # concatenate the two sides
    both = tf.keras.layers.concatenate([deep_out, wide_out], name='both')

    # final output is a linear activation because this is regression
    output = tf.keras.layers.Dense(1, activation='linear', name='weight')(both)
    model = tf.keras.models.Model(inputs, output)
    model.compile(optimizer='adam', loss='mse', metrics=[rmse, 'mse'])
    return model



# The main function
def train_and_evaluate(output_dir):
    model = build_wd_model(NNSIZE, NEMBEDS)
    print("Here is our Wide-and-Deep architecture so far:\n")
    model.summary()

    train_file_path = 'gs://{}/babyweight/preproc/{}*{}*'.format(BUCKET, 'train', PATTERN)
    eval_file_path = 'gs://{}/babyweight/preproc/{}*{}*'.format(BUCKET, 'eval', PATTERN)
    trainds = load_dataset(train_file_path, BATCH_SIZE, tf.estimator.ModeKeys.TRAIN)
    evalds = load_dataset(eval_file_path, 1000, tf.estimator.ModeKeys.EVAL)
    if EVAL_STEPS:
        evalds = evalds.take(EVAL_STEPS)
    steps_per_epoch = TRAIN_EXAMPLES // (BATCH_SIZE * NUM_EVALS)
    
    checkpoint_path = os.path.join(output_dir, 'checkpoints/babyweight')
    cp_callback = tf.keras.callbacks.ModelCheckpoint(checkpoint_path, save_weights_only=True, verbose=1)
    
    history = model.fit(trainds, 
                        validation_data=evalds,
                        epochs=NUM_EVALS, 
                        steps_per_epoch=steps_per_epoch,
                        verbose=2, # 0=silent, 1=progress bar, 2=one line per epoch
                        callbacks=[cp_callback])
    
    EXPORT_PATH = os.path.join(output_dir, datetime.datetime.now().strftime('%Y%m%d%H%M%S'))
    tf.saved_model.save(model, EXPORT_PATH) # with default serving function
    print("Exported trained model to {}".format(EXPORT_PATH))
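
Before packaging this up, you can optionally sanity-check the model wiring from the notebook kernel. This is only a sketch; it assumes TensorFlow 2.x is installed in this kernel and that the two files above have been written:

import sys
sys.path.append('babyweight_tf2')  # make the trainer package importable

from trainer import model as babyweight_model

m = babyweight_model.build_wd_model(dnn_hidden_units=[64, 16, 4], nembeds=3)
m.summary()  # prints the wide-and-deep architecture without reading any data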

Lab Task 3

After moving the code to a package, make sure it works as a standalone module. (Note the --pattern and --train_examples flags, which keep me from trying to train on the full dataset on my laptop.) Even then, this takes about 3 minutes, during which you won't see any output ...


In [ ]:
%%bash
echo "bucket=${BUCKET}"
rm -rf babyweight_trained
export PYTHONPATH=${PYTHONPATH}:${PWD}/babyweight_tf2
python3 -m trainer.task \
  --bucket=${BUCKET} \
  --output_dir=babyweight_trained \
  --job-dir=./tmp \
  --pattern="00000-of-" --train_examples=1 --eval_steps=1 --batch_size=10
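
If the standalone run succeeds, it writes checkpoints and a timestamped SavedModel export under babyweight_trained/. A quick way to confirm that the export is loadable (a sketch; it assumes the run above completed and that TensorFlow 2.x is available in this kernel):

import os
import tensorflow as tf

# pick the timestamped export directory written by train_and_evaluate(), skipping checkpoints/
exports = sorted(d for d in os.listdir('babyweight_trained') if d != 'checkpoints')
latest = os.path.join('babyweight_trained', exports[-1])

loaded = tf.saved_model.load(latest)
infer = loaded.signatures['serving_default']
print(infer.structured_input_signature)  # inspect the inputs of the default serving signature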

Lab Task 4

Since we are using the TensorFlow 2.0 preview, we will use a container image to run the code on AI Platform.

Once TensorFlow 2.0 is released, you will be able to submit the job directly (without having to build a container):

gcloud ai-platform jobs submit training $JOBNAME \
  --region=$REGION \
  --module-name=trainer.task \
  --package-path=$(pwd)/babyweight_tf2/trainer \
  --job-dir=$OUTDIR \
  --staging-bucket=gs://$BUCKET \
  --scale-tier=STANDARD_1 \
  --runtime-version=$TFVERSION \
  -- \
  --bucket=${BUCKET} \
  --output_dir=${OUTDIR} \
  --train_examples=200000

In [ ]:
%%writefile babyweight_tf2/Dockerfile
FROM gcr.io/deeplearning-platform-release/tf2-cpu
COPY trainer /babyweight_tf2/trainer
RUN apt update && \
    apt install --yes python3-pip && \
    pip3 install --upgrade --quiet tf-nightly-2.0-preview

ENV PYTHONPATH ${PYTHONPATH}:/babyweight_tf2
CMD ["python3", "-m", "trainer.task"]

In [ ]:
%%writefile babyweight_tf2/push_docker.sh
export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export IMAGE_REPO_NAME=babyweight_training_container
#export IMAGE_TAG=$(date +%Y%m%d_%H%M%S)
#export IMAGE_URI=gcr.io/$PROJECT_ID/$IMAGE_REPO_NAME:$IMAGE_TAG
export IMAGE_URI=gcr.io/$PROJECT_ID/$IMAGE_REPO_NAME

echo "Building  $IMAGE_URI"
docker build -f Dockerfile -t $IMAGE_URI ./
echo "Pushing $IMAGE_URI"
docker push $IMAGE_URI

Note: If you get a permissions/stat error when running push_docker.sh from the Notebooks environment, run it from Cloud Shell instead (open Cloud Shell from the GCP Console).

This step takes 5-10 minutes to run.


In [ ]:
%%bash
cd babyweight_tf2
bash push_docker.sh

Lab Task 5

Once the code works in standalone mode, you can run it on Cloud AI Platform. Because this trains on the entire dataset, it will take a while. The training run took about two hours for me. You can monitor the job from the GCP console in the Cloud AI Platform section.


In [ ]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/trained_model
JOBID=babyweight_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBID
gsutil -m rm -rf $OUTDIR

#IMAGE=gcr.io/deeplearning-platform-release/tf2-cpu
IMAGE=gcr.io/$PROJECT/babyweight_training_container

gcloud beta ai-platform jobs submit training $JOBID \
   --staging-bucket=gs://$BUCKET  --region=$REGION \
   --master-image-uri=$IMAGE \
   --master-machine-type=n1-standard-4 --scale-tier=CUSTOM \
   -- \
   --bucket=${BUCKET} \
   --output_dir=${OUTDIR} \
   --train_examples=200000

When I ran it, I used train_examples=2000000. When training finished, I filtered the Stackdriver logs on the word "dict" and saw that the last line was:

Saving dict for global step 5714290: average_loss = 1.06473, global_step = 5714290, loss = 34882.4, rmse = 1.03186

The final RMSE was 1.03 pounds.

Hyperparameter tuning

All of these are command-line parameters to my program. To do hyperparameter tuning, create hyperparam.yaml and pass it as --config. This step will take up to 2 hours -- you can increase maxParallelTrials or reduce maxTrials to get it done faster. Since maxParallelTrials is the number of initial seeds to start searching from, you don't want it to be too large; otherwise, all you have is a random search.


In [ ]:
%%writefile hyperparam.yaml
trainingInput:
  scaleTier: STANDARD_1
  hyperparameters:
    hyperparameterMetricTag: rmse
    goal: MINIMIZE
    maxTrials: 20
    maxParallelTrials: 5
    enableTrialEarlyStopping: True
    params:
    - parameterName: batch_size
      type: INTEGER
      minValue: 8
      maxValue: 512
      scaleType: UNIT_LOG_SCALE
    - parameterName: nembeds
      type: INTEGER
      minValue: 3
      maxValue: 30
      scaleType: UNIT_LINEAR_SCALE
    - parameterName: nnsize
      type: INTEGER
      minValue: 64
      maxValue: 512
      scaleType: UNIT_LOG_SCALE
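
The hyperparameterMetricTag above assumes the training job reports a metric named rmse back to the tuning service. With this Keras training loop you may need to report it explicitly, for example with the cloudml-hypertune package -- the following is a hedged sketch, not part of the model.py above, and the package would need to be installed wherever the trainer runs (e.g. added to the Dockerfile):

import hypertune  # pip3 install cloudml-hypertune

# at the end of train_and_evaluate(), after model.fit() returns `history`
final_rmse = history.history['val_rmse'][-1]
hpt = hypertune.HyperTune()
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag='rmse',
    metric_value=final_rmse,
    global_step=NUM_EVALS)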

In [ ]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/hyperparam
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ai-platform jobs submit training $JOBNAME \
  --region=$REGION \
  --module-name=trainer.task \
  --package-path=$(pwd)/babyweight_tf2/trainer \
  --job-dir=$OUTDIR \
  --staging-bucket=gs://$BUCKET \
  --scale-tier=STANDARD_1 \
  --config=hyperparam.yaml \
  --runtime-version=$TFVERSION \
  -- \
  --bucket=${BUCKET} \
  --output_dir=${OUTDIR} \
  --eval_steps=10 \
  --train_examples=20000

Repeat training

This time, train with the tuned hyperparameters (note the last line of the command).


In [ ]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/trained_model_tuned
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ai-platform jobs submit training $JOBNAME \
  --region=$REGION \
  --module-name=trainer.task \
  --package-path=$(pwd)/babyweight_tf2/trainer \
  --job-dir=$OUTDIR \
  --staging-bucket=gs://$BUCKET \
  --scale-tier=STANDARD_1 \
  --runtime-version=$TFVERSION \
  -- \
  --bucket=${BUCKET} \
  --output_dir=${OUTDIR} \
  --train_examples=20000 --batch_size=35 --nembeds=16 --nnsize=281

Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License