LAB 5a: Training Keras model on Cloud AI Platform.

Learning Objectives

  1. Set up the environment
  2. Create trainer module's task.py to hold hyperparameter argparsing code
  3. Create trainer module's model.py to hold Keras model code
  4. Run trainer module package locally
  5. Submit training job to Cloud AI Platform
  6. Submit hyperparameter tuning job to Cloud AI Platform

Introduction

Having tested our training pipeline both locally and in the cloud on a subset of the data, we can submit another (much larger) training job to the cloud. It is also a good idea to run a hyperparameter tuning job to make sure we have optimized the hyperparameters of our model.

In this notebook, we'll be training our Keras model at scale using Cloud AI Platform.

In this lab, we will set up the environment, create the trainer module's task.py to hold hyperparameter argparsing code, create the trainer module's model.py to hold Keras model code, run the trainer module package locally, submit a training job to Cloud AI Platform, and submit a hyperparameter tuning job to Cloud AI Platform.

Each learning objective will correspond to a #TODO in this student lab notebook -- try to complete this notebook first and then review the solution notebook.

Set up environment variables and load necessary libraries

Import necessary libraries.


In [ ]:
import os

Lab Task #1: Set environment variables.

Set environment variables so that we can use them throughout the entire lab. We will be using our project name for our bucket, so you only need to change your project and region.


In [ ]:
%%bash
export PROJECT=$(gcloud config list project --format "value(core.project)")
echo "Your current GCP Project Name is: "${PROJECT}

In [ ]:
# TODO: Change these to try this notebook out
PROJECT = "cloud-training-demos"  # Replace with your PROJECT
BUCKET = PROJECT  # defaults to PROJECT
REGION = "us-central1"  # Replace with your REGION

In [ ]:
os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET
os.environ["REGION"] = REGION
os.environ["TFVERSION"] = "2.0"
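
Setting os.environ works here because child processes inherit the notebook kernel's environment, which is how the later %%bash cells can read ${BUCKET} and ${REGION}. The following is a minimal sketch of that behavior. The variable name DEMO_REGION is hypothetical and is used only so the lab's own settings are not overwritten; a second Python interpreter stands in for a %%bash cell.

```python
import os
import subprocess
import sys

# Hypothetical variable, used only for this illustration.
os.environ["DEMO_REGION"] = "us-central1"

# A child process inherits the parent's environment. Here the child is
# another Python interpreter; a %%bash cell is launched in a similar way.
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['DEMO_REGION'])"],
    capture_output=True,
    text=True,
    check=True,
)
child_value = result.stdout.strip()
print(child_value)
```

Note that the reverse does not hold: an export inside a %%bash cell only lasts for that cell, which is why the lab sets the values from Python.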

In [ ]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

In [ ]:
%%bash
if ! gsutil ls | grep -q gs://${BUCKET}; then
    gsutil mb -l ${REGION} gs://${BUCKET}
fi

Check data exists

Verify that you previously created the CSV files we'll be using for training and evaluation. If not, go back to lab 1b_prepare_data_babyweight.ipynb to create them.


In [ ]:
%%bash
gsutil ls gs://${BUCKET}/babyweight/data/*000000000000.csv

Now that we have the Keras wide-and-deep code working on a subset of the data, we can package the TensorFlow code up as a Python module and train it on Cloud AI Platform.

Train on Cloud AI Platform

Training on Cloud AI Platform requires:

  • Making the code a Python package
  • Using gcloud to submit the training code to Cloud AI Platform

Ensure that the AI Platform API is enabled by going to this link.

Move code into a Python package

A Python package is simply a collection of one or more .py files along with an __init__.py file to identify the containing directory as a package. The __init__.py sometimes contains initialization code but for our purposes an empty file suffices.

The bash command touch creates an empty file in the specified location. The babyweight/trainer directory must exist first, which is why the cell below runs mkdir -p before touch.


In [ ]:
%%bash
mkdir -p babyweight/trainer
touch babyweight/trainer/__init__.py
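
To see why an empty __init__.py is enough, here is a small sketch that builds a throwaway package in a temporary directory and imports a module from it. The names demo_pkg, hello, and greet are hypothetical and are not part of the lab.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package in a temporary directory.
# "demo_pkg", "hello" and "greet" are hypothetical names.
root = Path(tempfile.mkdtemp())
pkg_dir = root / "demo_pkg"
pkg_dir.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
(pkg_dir / "__init__.py").touch()           # same effect as `touch`
(pkg_dir / "hello.py").write_text("def greet():\n    return 'hello'\n")

# Once the parent directory is on sys.path, the package is importable.
sys.path.insert(0, str(root))
importlib.invalidate_caches()
hello = importlib.import_module("demo_pkg.hello")
print(hello.greet())
```

The lab relies on the same mechanism later, when it adds a directory to PYTHONPATH so that the trainer package can be found.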

We then use the %%writefile magic to write the contents of the cell below to a file called task.py in the babyweight/trainer folder.

Lab Task #2: Create trainer module's task.py to hold hyperparameter argparsing code.

The cell below writes the file babyweight/trainer/task.py, which sets up our training job. This is where we decide which model parameters to expose as command-line flags, using the argparse module. Look at how batch_size is passed to the model in the code below, and use it as an example to parse arguments for the following variables:

  • nnsize which represents the hidden layer sizes to use for DNN feature columns
  • nembeds which represents the embedding size of a cross of n key real-valued parameters
  • num_epochs which represents the number of epochs to train for
  • train_examples which represents the number of examples (in thousands) to run the training job
  • eval_steps which represents the positive number of steps for which to evaluate model

Be sure to include a default value for each of the parsed arguments above and specify the type if necessary.
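
As a hint on the argparse mechanics, the sketch below shows one way to declare a list-valued flag and an optional integer flag next to the existing batch_size flag. The default values and help strings are assumptions chosen for illustration, not the lab's reference answer.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--batch_size",
    help="Number of examples to compute gradient over.",
    type=int,
    default=512,
)
# List-valued flag: nargs="+" collects one or more integers.
# The default is an assumed placeholder, not the lab's answer.
parser.add_argument(
    "--nnsize",
    help="Hidden layer sizes for the DNN, e.g. --nnsize 64 32",
    nargs="+",
    type=int,
    default=[64, 32],
)
# Optional integer flag; None is an assumed "not set" default.
parser.add_argument(
    "--eval_steps",
    help="Positive number of steps for which to evaluate the model",
    type=int,
    default=None,
)

# Passing an explicit list avoids reading sys.argv inside a notebook.
args = parser.parse_args(["--nnsize", "128", "32", "4", "--eval_steps", "10"])
arguments = args.__dict__
print(arguments)
```

Because type=int is applied to every element, nnsize arrives as a list of integers that can be handed to the model without further conversion.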


In [ ]:
%%writefile babyweight/trainer/task.py
import argparse
import json
import os

from babyweight.trainer import model

import tensorflow as tf

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--job-dir",
        help="this model ignores this field, but it is required by gcloud",
        default="junk"
    )
    parser.add_argument(
        "--train_data_path",
        help="GCS location of training data",
        required=True
    )
    parser.add_argument(
        "--eval_data_path",
        help="GCS location of evaluation data",
        required=True
    )
    parser.add_argument(
        "--output_dir",
        help="GCS location to write checkpoints and export models",
        required=True
    )
    parser.add_argument(
        "--batch_size",
        help="Number of examples to compute gradient over.",
        type=int,
        default=512
    )

    # TODO: Add nnsize argument

    # TODO: Add nembeds argument

    # TODO: Add num_epochs argument

    # TODO: Add train_examples argument

    # TODO: Add eval_steps argument

    # Parse all arguments
    args = parser.parse_args()
    arguments = args.__dict__

    # Unused args provided by service
    arguments.pop("job_dir", None)
    arguments.pop("job-dir", None)

    # Modify some arguments
    arguments["train_examples"] *= 1000

    # Append trial_id to path if we are doing hptuning
    # This code can be removed if you are not using hyperparameter tuning
    arguments["output_dir"] = os.path.join(
        arguments["output_dir"],
        json.loads(
            os.environ.get("TF_CONFIG", "{}")
        ).get("task", {}).get("trial", "")
    )

    # Run the training job
    model.train_and_evaluate(arguments)
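
The output_dir logic at the end of task.py can be tried in isolation. The sketch below wraps the same expression in a hypothetical helper, output_dir_for_trial, and feeds it a made-up TF_CONFIG string. The exact contents of TF_CONFIG on the service are an assumption here; only the task and trial keys that task.py reads are modeled.

```python
import json
import os


def output_dir_for_trial(output_dir, tf_config):
    """Append the trial id found in a TF_CONFIG JSON string, if any."""
    trial = json.loads(tf_config or "{}").get("task", {}).get("trial", "")
    return os.path.join(output_dir, trial)


# Hypothetical bucket name and TF_CONFIG values, for illustration only.
tuning_dir = output_dir_for_trial(
    "gs://my-bucket/babyweight/hyperparam", '{"task": {"trial": "7"}}')
plain_dir = output_dir_for_trial(
    "gs://my-bucket/babyweight/trained_model", "{}")
print(tuning_dir)
print(plain_dir)
```

Each tuning trial therefore writes to its own subdirectory, while a regular training job keeps the original path with only a trailing separator added.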

In the same way we can write to the file model.py the model that we developed in the previous notebooks.

Lab Task #3: Create trainer module's model.py to hold Keras model code.

Complete the TODOs in the code cell below to create our model.py. We'll use the code we wrote for the Wide & Deep model. Look back at your 9_keras_wide_and_deep_babyweight notebook and copy/paste the necessary code from that notebook into its place in the cell below.


In [ ]:
%%writefile babyweight/trainer/model.py
import datetime
import os
import shutil
import numpy as np
import tensorflow as tf

# Determine CSV, label, and key columns
# TODO: Add CSV_COLUMNS and LABEL_COLUMN

# Set default values for each CSV column.
# Treat is_male and plurality as strings.
# TODO: Add DEFAULTS


def features_and_labels(row_data):
    # TODO: Add your code here
    pass


def load_dataset(pattern, batch_size=1, mode='eval'):
    # TODO: Add your code here
    pass


def create_input_layers():
    # TODO: Add your code here
    pass


def categorical_fc(name, values):
    # TODO: Add your code here
    pass


def create_feature_columns(nembeds):
    # TODO: Add your code here
    pass


def get_model_outputs(wide_inputs, deep_inputs, dnn_hidden_units):
    # TODO: Add your code here
    pass


def rmse(y_true, y_pred):
    # TODO: Add your code here
    pass


def build_wide_deep_model(dnn_hidden_units=[64, 32], nembeds=3):
    # TODO: Add your code here
    pass


def train_and_evaluate(args):
    model = build_wide_deep_model(args["nnsize"], args["nembeds"])
    print("Here is our Wide-and-Deep architecture so far:\n")
    print(model.summary())

    trainds = load_dataset(
        args["train_data_path"],
        args["batch_size"],
        'train')

    evalds = load_dataset(
        args["eval_data_path"], 1000, 'eval')
    if args["eval_steps"]:
        evalds = evalds.take(count=args["eval_steps"])

    num_batches = args["batch_size"] * args["num_epochs"]
    steps_per_epoch = args["train_examples"] // num_batches

    checkpoint_path = os.path.join(args["output_dir"], "checkpoints/babyweight")
    cp_callback = tf.keras.callbacks.ModelCheckpoint(
        filepath=checkpoint_path, verbose=1, save_weights_only=True)

    history = model.fit(
        trainds,
        validation_data=evalds,
        epochs=args["num_epochs"],
        steps_per_epoch=steps_per_epoch,
        verbose=2,  # 0=silent, 1=progress bar, 2=one line per epoch
        callbacks=[cp_callback])

    EXPORT_PATH = os.path.join(
        args["output_dir"], datetime.datetime.now().strftime("%Y%m%d%H%M%S"))
    tf.saved_model.save(
        obj=model, export_dir=EXPORT_PATH)  # with default serving function
    print("Exported trained model to {}".format(EXPORT_PATH))
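
The steps_per_epoch arithmetic in train_and_evaluate is easy to misread, so here is a worked example using the flag values that appear in the gcloud commands later in this lab. The helper function is hypothetical; it simply repeats the two lines from model.py together with the multiplication by 1000 from task.py.

```python
def steps_per_epoch(train_examples_thousands, batch_size, num_epochs):
    # task.py multiplies --train_examples by 1000 before calling the model.
    train_examples = train_examples_thousands * 1000
    # model.py names this product num_batches.
    num_batches = batch_size * num_epochs
    return train_examples // num_batches


steps = steps_per_epoch(
    train_examples_thousands=20000, batch_size=32, num_epochs=10)
# Examples processed over the whole run: steps * batch_size * num_epochs.
total_examples_seen = steps * 32 * 10
print(steps, total_examples_seen)
```

With these values each epoch runs 62,500 steps, and the whole run processes 20,000,000 examples. In other words, train_examples is the budget for the entire job and is divided across the epochs.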

Train locally

After moving the code to a package, make sure it works as a standalone module. Note that we incorporated the --train_examples flag so that we don't try to train on the entire dataset while we are developing our pipeline. Once we are sure that everything works on a subset, we can change it to train on all the data. Even on this subset, the run takes about 3 minutes, during which you won't see any output.

Lab Task #4: Run trainer module package locally.

Fill in the missing code in the TODOs below so that we can run a very small training job over a single file with a small batch size, 1 epoch, a train_examples value of 1 (the flag is in thousands), and 1 eval step.


In [ ]:
%%bash
OUTDIR=babyweight_trained
rm -rf ${OUTDIR}
export PYTHONPATH=${PYTHONPATH}:${PWD}/babyweight
python3 -m trainer.task \
    --job-dir=./tmp \
    --train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
    --eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
    --output_dir=${OUTDIR} \
    --batch_size=# TODO: Add batch size
    --num_epochs=# TODO: Add the number of epochs to train for
    --train_examples=# TODO: Add the number of examples to train each epoch for
    --eval_steps=# TODO: Add the number of evaluation batches to run

Dockerized module

Since we are using TensorFlow 2.0 and it is new, we will use a container image to run the code on AI Platform.

Once TensorFlow 2.0 is natively supported on AI Platform, you will be able to simply do (without having to build a container):

gcloud ai-platform jobs submit training ${JOBNAME} \
    --region=${REGION} \
    --module-name=trainer.task \
    --package-path=$(pwd)/babyweight/trainer \
    --job-dir=${OUTDIR} \
    --staging-bucket=gs://${BUCKET} \
    --scale-tier=STANDARD_1 \
    --runtime-version=${TFVERSION} \
    -- \
    --train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
    --eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
    --output_dir=${OUTDIR} \
    --num_epochs=10 \
    --train_examples=20000 \
    --eval_steps=100 \
    --batch_size=32 \
    --nembeds=8

Create Dockerfile

We need to create a container with everything we need to be able to run our model. This includes our trainer module package, python3, as well as the libraries we use such as the most up to date TensorFlow 2.0 version.


In [ ]:
%%writefile babyweight/Dockerfile
FROM gcr.io/deeplearning-platform-release/tf2-cpu
COPY trainer /babyweight/trainer
RUN apt update && \
    apt install --yes python3-pip && \
    pip3 install --upgrade --quiet tensorflow==2.0

ENV PYTHONPATH ${PYTHONPATH}:/babyweight
ENTRYPOINT ["python3", "babyweight/trainer/task.py"]

Build and push container image to repo

Now that we have created our Dockerfile, we need to build and push our container image to our project's container repo. To do this, we'll create a small shell script that we can call from bash.


In [ ]:
%%writefile babyweight/push_docker.sh
export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export IMAGE_REPO_NAME=babyweight_training_container
export IMAGE_URI=gcr.io/${PROJECT_ID}/${IMAGE_REPO_NAME}

echo "Building  $IMAGE_URI"
docker build -f Dockerfile -t ${IMAGE_URI} ./
echo "Pushing $IMAGE_URI"
docker push ${IMAGE_URI}

Note: If you get a permissions/stat error when running push_docker.sh from Notebooks, do it from CloudShell:

Open CloudShell on the GCP Console

This step takes 5-10 minutes to run.


In [ ]:
%%bash
cd babyweight
bash push_docker.sh

Test container locally

Before we submit our training job to Cloud AI Platform, let's make sure our container that we just built and pushed to our project's container repo works perfectly. We can do that by calling our container in bash and passing the necessary user_args for our task.py's parser.


In [ ]:
%%bash
export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export IMAGE_REPO_NAME=babyweight_training_container
export IMAGE_URI=gcr.io/${PROJECT_ID}/${IMAGE_REPO_NAME}
echo "Running  $IMAGE_URI"
docker run ${IMAGE_URI} \
    --train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
    --eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
    --output_dir=gs://${BUCKET}/babyweight/trained_model \
    --batch_size=10 \
    --num_epochs=10 \
    --train_examples=1 \
    --eval_steps=1

Lab Task #5: Train on Cloud AI Platform.

Once the code works in standalone mode, you can run it on Cloud AI Platform. Because this is on the entire dataset, it will take a while. The training run took about two hours for me. You can monitor the job from the GCP console in the Cloud AI Platform section. Complete the #TODOs to make sure you have the necessary user_args for our task.py's parser.


In [ ]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/trained_model
JOBID=babyweight_$(date -u +%y%m%d_%H%M%S)
echo ${OUTDIR} ${REGION} ${JOBID}
gsutil -m rm -rf ${OUTDIR}

IMAGE=gcr.io/${PROJECT}/babyweight_training_container

gcloud ai-platform jobs submit training ${JOBID} \
    --staging-bucket=gs://${BUCKET} \
    --region=${REGION} \
    --master-image-uri=${IMAGE} \
    --master-machine-type=n1-standard-4 \
    --scale-tier=CUSTOM \
    -- \
    --train_data_path=# TODO: Add path to training data in GCS
    --eval_data_path=# TODO: Add path to evaluation data in GCS
    --output_dir=${OUTDIR} \
    --num_epochs=# TODO: Add the number of epochs to train for
    --train_examples=# TODO: Add the number of examples to train each epoch for
    --eval_steps=# TODO: Add the number of evaluation batches to run
    --batch_size=# TODO: Add batch size
    --nembeds=# TODO: Add number of embedding dimensions

When I ran it, I used train_examples=2000000. When training finished, I filtered in the Stackdriver log on the word "dict" and saw that the last line was:

Saving dict for global step 5714290: average_loss = 1.06473, global_step = 5714290, loss = 34882.4, rmse = 1.03186
The final RMSE was 1.03 pounds.
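
The two numbers in that log line are consistent with each other. Assuming average_loss is the mean squared error (an assumption about how the metric was computed, not something the log states), the RMSE is its square root:

```python
import math

# Value copied from the log line above; assumed to be a mean squared error.
average_loss = 1.06473
rmse = math.sqrt(average_loss)
print(round(rmse, 5))
```

The result matches the logged rmse to within rounding.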

Lab Task #6: Hyperparameter tuning.

The values passed above, such as batch_size and nembeds, are all command-line parameters to the trainer program. To do hyperparameter tuning, create hyperparam.yaml and pass it as --config hyperparam.yaml. This step will take up to 2 hours; you can increase maxParallelTrials or reduce maxTrials to get it done faster. Since maxParallelTrials is the number of initial seeds to start searching from, you don't want it to be too large; otherwise, all you have is a random search. Complete the #TODOs in the yaml file and in the gcloud training job bash command so that we can run hyperparameter tuning.


In [ ]:
%%writefile hyperparam.yaml
trainingInput:
    scaleTier: STANDARD_1
    hyperparameters:
        hyperparameterMetricTag: # TODO: Add metric we want to optimize
        goal: # TODO: MAXIMIZE or MINIMIZE?
        maxTrials: 20
        maxParallelTrials: 5
        enableTrialEarlyStopping: True
        params:
        - parameterName: batch_size
          type: # TODO: What datatype?
          minValue: # TODO: Choose a min value
          maxValue: # TODO: Choose a max value
          scaleType: # TODO: UNIT_LINEAR_SCALE or UNIT_LOG_SCALE?
        - parameterName: nembeds
          type: # TODO: What datatype?
          minValue: # TODO: Choose a min value
          maxValue: # TODO: Choose a max value
          scaleType: # TODO: UNIT_LINEAR_SCALE or UNIT_LOG_SCALE?
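
When choosing between UNIT_LINEAR_SCALE and UNIT_LOG_SCALE, it can help to see how each one spreads values across a range. The sketch below is only a conceptual illustration: the helper functions and the range 8 to 512 are hypothetical, and this is not a description of how the tuning service searches internally.

```python
import math


def from_unit_linear(u, min_value, max_value):
    """Map u in [0, 1] onto the range using a linear scale."""
    return min_value + u * (max_value - min_value)


def from_unit_log(u, min_value, max_value):
    """Map u in [0, 1] onto the range using a logarithmic scale."""
    log_min, log_max = math.log(min_value), math.log(max_value)
    return math.exp(log_min + u * (log_max - log_min))


# Hypothetical range, chosen only to make the arithmetic easy to follow.
midpoint_linear = from_unit_linear(0.5, 8, 512)
midpoint_log = from_unit_log(0.5, 8, 512)
print(midpoint_linear, round(midpoint_log, 6))
```

The linear midpoint is 260, while the logarithmic midpoint is 64. A log scale therefore devotes more of the search to small values, which tends to suit parameters that span several powers of two.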

In [ ]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/hyperparam
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
echo ${OUTDIR} ${REGION} ${JOBNAME}
gsutil -m rm -rf ${OUTDIR}

IMAGE=gcr.io/${PROJECT}/babyweight_training_container

gcloud ai-platform jobs submit training ${JOBNAME} \
    --staging-bucket=gs://${BUCKET} \
    --region=${REGION} \
    --master-image-uri=${IMAGE} \
    --master-machine-type=n1-standard-4 \
    --scale-tier=CUSTOM \
    --# TODO: Add config for hyperparam.yaml
    -- \
    --train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
    --eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
    --output_dir=${OUTDIR} \
    --num_epochs=10 \
    --train_examples=20000 \
    --eval_steps=100

Repeat training

This time with tuned parameters for batch_size and nembeds.


In [ ]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/trained_model_tuned
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
echo ${OUTDIR} ${REGION} ${JOBNAME}
gsutil -m rm -rf ${OUTDIR}

IMAGE=gcr.io/${PROJECT}/babyweight_training_container

gcloud ai-platform jobs submit training ${JOBNAME} \
    --staging-bucket=gs://${BUCKET} \
    --region=${REGION} \
    --master-image-uri=${IMAGE} \
    --master-machine-type=n1-standard-4 \
    --scale-tier=CUSTOM \
    -- \
    --train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
    --eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
    --output_dir=${OUTDIR} \
    --num_epochs=10 \
    --train_examples=20000 \
    --eval_steps=100 \
    --batch_size=32 \
    --nembeds=8

Lab Summary:

In this lab, we set up the environment, created the trainer module's task.py to hold hyperparameter argparsing code, created the trainer module's model.py to hold Keras model code, ran the trainer module package locally, submitted a training job to Cloud AI Platform, and submitted a hyperparameter tuning job to Cloud AI Platform.

Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License