LAB 5a: Training a Keras model on Cloud AI Platform.

Learning Objectives

  1. Set up the environment
  2. Create the trainer module's task.py to hold hyperparameter argument-parsing code
  3. Create the trainer module's model.py to hold the Keras model code
  4. Run the trainer module package locally
  5. Submit a training job to Cloud AI Platform
  6. Submit a hyperparameter tuning job to Cloud AI Platform

Introduction

After testing our training pipeline both locally and in the cloud on a subset of the data, we can submit a much larger training job to the cloud. It is also a good idea to run a hyperparameter tuning job to make sure we have optimized the hyperparameters of our model.

In this notebook, we'll be training our Keras model at scale using Cloud AI Platform.

In this lab, we will set up the environment, create the trainer module's task.py to hold the hyperparameter argument-parsing code, create the trainer module's model.py to hold the Keras model code, run the trainer module package locally, submit a training job to Cloud AI Platform, and submit a hyperparameter tuning job to Cloud AI Platform.

Set up environment variables and load necessary libraries

Import necessary libraries.


In [5]:
import os

Set environment variables.

Set environment variables so that we can use them throughout the entire lab. We will be using our project name for our bucket, so you only need to change your project and region.


In [ ]:
%%bash
export PROJECT=$(gcloud config list project --format "value(core.project)")
echo "Your current GCP Project Name is: "${PROJECT}

In [8]:
# TODO: Change these to try this notebook out
PROJECT = "cloud-training-demos"  # Replace with your PROJECT
BUCKET = PROJECT  # defaults to PROJECT
REGION = "us-central1"  # Replace with your REGION

In [22]:
os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET
os.environ["REGION"] = REGION
os.environ["TFVERSION"] = "2.0"

In [ ]:
%%bash
gcloud config set project ${PROJECT}
gcloud config set compute/region ${REGION}

In [13]:
%%bash
if ! gsutil ls | grep -q gs://${BUCKET}; then
    gsutil mb -l ${REGION} gs://${BUCKET}
fi

Check data exists

Verify that the CSV files we'll be using for training and evaluation were created previously. If not, go back to lab 2_prepare_babyweight to create them.


In [ ]:
%%bash
gsutil ls gs://${BUCKET}/babyweight/data/*000000000000.csv

Now that we have the Keras wide-and-deep code working on a subset of the data, we can package the TensorFlow code up as a Python module and train it on Cloud AI Platform.

Train on Cloud AI Platform

Training on Cloud AI Platform requires:

  • Making the code a Python package
  • Using gcloud to submit the training code to Cloud AI Platform

Ensure that the AI Platform API is enabled for your project (you can check this under APIs & Services in the GCP Console).
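
If you prefer the command line, the same API can be enabled with gcloud (a minimal sketch, assuming you have permission to enable services on the project):

gcloud services enable ml.googleapis.com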

Move code into a Python package

A Python package is simply a collection of one or more .py files along with an __init__.py file to identify the containing directory as a package. The __init__.py sometimes contains initialization code, but for our purposes an empty file suffices.

The bash command touch creates an empty file at the specified location; mkdir -p creates the babyweight/trainer directory if it does not already exist.
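
After the following cells run, the trainer package will have this layout (task.py and model.py are written further below):

babyweight/
    trainer/
        __init__.py
        task.py
        model.py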


In [21]:
%%bash
mkdir -p babyweight/trainer
touch babyweight/trainer/__init__.py

We then use the %%writefile magic to write the contents of the cell below to a file called task.py in the babyweight/trainer folder.

Create the trainer module's task.py to hold hyperparameter argument-parsing code.

The cell below writes the file babyweight/trainer/task.py, which sets up our training job. Here is where we determine which parameters of our model to pass as command-line flags during training using argparse. Look at how batch_size is passed to the model in the code below. Use it as an example to parse arguments for the following variables:

  • nnsize which represents the hidden layer sizes to use for the DNN
  • nembeds which represents the embedding size of a cross of n key real-valued parameters
  • train_examples which represents the number of examples (in thousands) over which to run the training job
  • eval_steps which represents the positive number of steps for which to evaluate the model

Be sure to include a default value for the parsed arguments above and specify the type if necessary.
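
As a minimal sketch of how argparse handles such flags (the values here are illustrative only), nargs="+" collects space-separated values into a list and type=int converts each one:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--nnsize", nargs="+", type=int, default=[128, 32, 4])
parser.add_argument("--nembeds", type=int, default=3)
args = parser.parse_args(["--nnsize", "64", "16", "--nembeds", "8"])
print(args.nnsize, args.nembeds)  # prints: [64, 16] 8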


In [ ]:
%%writefile babyweight/trainer/task.py
import argparse
import json
import os

from babyweight.trainer import model

import tensorflow as tf

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--job-dir",
        help="this model ignores this field, but it is required by gcloud",
        default="junk"
    )
    parser.add_argument(
        "--train_data_path",
        help="GCS location of training data",
        required=True
    )
    parser.add_argument(
        "--eval_data_path",
        help="GCS location of evaluation data",
        required=True
    )
    parser.add_argument(
        "--output_dir",
        help="GCS location to write checkpoints and export models",
        required=True
    )
    parser.add_argument(
        "--batch_size",
        help="Number of examples to compute gradient over.",
        type=int,
        default=512
    )
    parser.add_argument(
        "--nnsize",
        help="Hidden layer sizes for DNN -- provide space-separated layers",
        nargs="+",
        type=int,
        default=[128, 32, 4]
    )
    parser.add_argument(
        "--nembeds",
        help="Embedding size of a cross of n key real-valued parameters",
        type=int,
        default=3
    )
    parser.add_argument(
        "--num_epochs",
        help="Number of epochs to train the model.",
        type=int,
        default=10
    )
    parser.add_argument(
        "--train_examples",
        help="""Number of examples (in thousands) to run the training job over.
        If this is more than actual # of examples available, it cycles through
        them. So specifying 1000 here when you have only 100k examples makes
        this 10 epochs.""",
        type=int,
        default=5000
    )
    parser.add_argument(
        "--eval_steps",
        help="""Positive number of steps for which to evaluate model. Default
        to None, which means to evaluate until input_fn raises an end-of-input
        exception""",
        type=int,
        default=None
    )

    # Parse all arguments
    args = parser.parse_args()
    arguments = args.__dict__

    # Unused args provided by service
    arguments.pop("job_dir", None)
    arguments.pop("job-dir", None)

    # Modify some arguments
    arguments["train_examples"] *= 1000

    # Append trial_id to path if we are doing hptuning
    # This code can be removed if you are not using hyperparameter tuning
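    # For example, if AI Platform sets TF_CONFIG='{"task": {"trial": "5"}}',
    # then outputs for that trial go to <output_dir>/5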
    arguments["output_dir"] = os.path.join(
        arguments["output_dir"],
        json.loads(
            os.environ.get("TF_CONFIG", "{}")
        ).get("task", {}).get("trial", "")
    )

    # Run the training job
    model.train_and_evaluate(arguments)

In the same way, we can write the model we developed in the previous notebooks to the file model.py.

Create the trainer module's model.py to hold the Keras model code.

To create our model.py, we'll use the code we wrote for the Wide & Deep model. Look back at your 9_keras_wide_and_deep_babyweight notebook and copy/paste the necessary code from that notebook into its place in the cell below.


In [ ]:
%%writefile babyweight/trainer/model.py
import datetime
import os
import shutil
import numpy as np
import tensorflow as tf

# Determine CSV, label, and key columns
CSV_COLUMNS = ["weight_pounds",
               "is_male",
               "mother_age",
               "plurality",
               "gestation_weeks"]
LABEL_COLUMN = "weight_pounds"

# Set default values for each CSV column.
# Treat is_male and plurality as strings.
DEFAULTS = [[0.0], ["null"], [0.0], ["null"], [0.0]]


def features_and_labels(row_data):
    """Splits features and labels from feature dictionary.

    Args:
        row_data: Dictionary of CSV column names and tensor values.
    Returns:
        Dictionary of feature tensors and label tensor.
    """
    label = row_data.pop(LABEL_COLUMN)

    return row_data, label  # features, label


def load_dataset(pattern, batch_size=1, mode=tf.estimator.ModeKeys.EVAL):
    """Loads dataset using the tf.data API from CSV files.

    Args:
        pattern: str, file pattern to glob into list of files.
        batch_size: int, the number of examples per batch.
        mode: tf.estimator.ModeKeys to determine if training or evaluating.
    Returns:
        `Dataset` object.
    """
    print("mode = {}".format(mode))
    # Make a CSV dataset
    dataset = tf.data.experimental.make_csv_dataset(
        file_pattern=pattern,
        batch_size=batch_size,
        column_names=CSV_COLUMNS,
        column_defaults=DEFAULTS)
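    # Each dataset element is a dictionary mapping column names to batched tensors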

    # Map dataset to features and label
    dataset = dataset.map(map_func=features_and_labels)  # features, label

    # Shuffle and repeat for training
    if mode == tf.estimator.ModeKeys.TRAIN:
        dataset = dataset.shuffle(buffer_size=1000).repeat()

    # Prefetch the next batch to overlap input processing with training
    dataset = dataset.prefetch(buffer_size=1)

    return dataset


def create_input_layers():
    """Creates dictionary of input layers for each feature.

    Returns:
        Dictionary of `tf.Keras.layers.Input` layers for each feature.
    """
    deep_inputs = {
        colname: tf.keras.layers.Input(
            name=colname, shape=(), dtype="float32")
        for colname in ["mother_age", "gestation_weeks"]
    }

    wide_inputs = {
        colname: tf.keras.layers.Input(
            name=colname, shape=(), dtype="string")
        for colname in ["is_male", "plurality"]
    }

    inputs = {**wide_inputs, **deep_inputs}

    return inputs


def categorical_fc(name, values):
    """Helper function to wrap categorical feature by indicator column.

    Args:
        name: str, name of feature.
        values: list, list of strings of categorical values.
    Returns:
        Categorical and indicator column of categorical feature.
    """
    # vocabulary_list feature columns currently don't work correctly when a
    # Keras model trained with TF 2.0 is deployed to CAIP on TF 1.14, so as a
    # workaround we use a hash bucket column as a temporary substitute.
#     cat_column = tf.feature_column.categorical_column_with_vocabulary_list(
#             key=name, vocabulary_list=values)
    cat_column = tf.feature_column.categorical_column_with_hash_bucket(
            key=name, hash_bucket_size=8)
    ind_column = tf.feature_column.indicator_column(
        categorical_column=cat_column)

    return cat_column, ind_column


def create_feature_columns(nembeds):
    """Creates wide and deep dictionaries of feature columns from inputs.

    Args:
        nembeds: int, number of dimensions to embed categorical column down to.
    Returns:
        Wide and deep dictionaries of feature columns.
    """
    deep_fc = {
        colname: tf.feature_column.numeric_column(key=colname)
        for colname in ["mother_age", "gestation_weeks"]
    }
    wide_fc = {}
    is_male, wide_fc["is_male"] = categorical_fc(
        "is_male", ["True", "False", "Unknown"])
    plurality, wide_fc["plurality"] = categorical_fc(
        "plurality", ["Single(1)", "Twins(2)", "Triplets(3)",
                      "Quadruplets(4)", "Quintuplets(5)", "Multiple(2+)"])

    # Bucketize the float fields. This makes them wide
    age_buckets = tf.feature_column.bucketized_column(
        source_column=deep_fc["mother_age"],
        boundaries=np.arange(15, 45, 1).tolist())
    wide_fc["age_buckets"] = tf.feature_column.indicator_column(
        categorical_column=age_buckets)

    gestation_buckets = tf.feature_column.bucketized_column(
        source_column=deep_fc["gestation_weeks"],
        boundaries=np.arange(17, 47, 1).tolist())
    wide_fc["gestation_buckets"] = tf.feature_column.indicator_column(
        categorical_column=gestation_buckets)

    # Cross all the wide columns, have to do the crossing before we one-hot
    crossed = tf.feature_column.crossed_column(
        keys=[age_buckets, gestation_buckets],
        hash_bucket_size=1000)
    deep_fc["crossed_embeds"] = tf.feature_column.embedding_column(
        categorical_column=crossed, dimension=nembeds)

    return wide_fc, deep_fc


def get_model_outputs(wide_inputs, deep_inputs, dnn_hidden_units):
    """Creates model architecture and returns outputs.

    Args:
        wide_inputs: Dense tensor used as inputs to wide side of model.
        deep_inputs: Dense tensor used as inputs to deep side of model.
        dnn_hidden_units: List of integers where length is number of hidden
            layers and ith element is the number of neurons at ith layer.
    Returns:
        Dense tensor output from the model.
    """
    # Hidden layers for the deep side
    layers = [int(x) for x in dnn_hidden_units]
    deep = deep_inputs
    for layerno, numnodes in enumerate(layers):
        deep = tf.keras.layers.Dense(
            units=numnodes,
            activation="relu",
            name="dnn_{}".format(layerno+1))(deep)
    deep_out = deep

    # Linear model for the wide side
    wide_out = tf.keras.layers.Dense(
        units=10, activation="relu", name="linear")(wide_inputs)

    # Concatenate the two sides
    both = tf.keras.layers.concatenate(
        inputs=[deep_out, wide_out], name="both")

    # Final output is a linear activation because this is regression
    output = tf.keras.layers.Dense(
        units=1, activation="linear", name="weight")(both)

    return output


def rmse(y_true, y_pred):
    """Calculates RMSE evaluation metric.

    Args:
        y_true: tensor, true labels.
        y_pred: tensor, predicted labels.
    Returns:
        Tensor with value of RMSE between true and predicted labels.
    """
    return tf.sqrt(tf.reduce_mean(tf.square(y_pred - y_true)))


def build_wide_deep_model(dnn_hidden_units=[64, 32], nembeds=3):
    """Builds wide and deep model using Keras Functional API.

    Returns:
        `tf.keras.models.Model` object.
    """
    # Create input layers
    inputs = create_input_layers()

    # Create feature columns for both wide and deep
    wide_fc, deep_fc = create_feature_columns(nembeds)

    # The constructor for DenseFeatures takes a list of numeric columns
    # The Functional API in Keras requires: LayerConstructor()(inputs)
    wide_inputs = tf.keras.layers.DenseFeatures(
        feature_columns=wide_fc.values(), name="wide_inputs")(inputs)
    deep_inputs = tf.keras.layers.DenseFeatures(
        feature_columns=deep_fc.values(), name="deep_inputs")(inputs)

    # Get output of model given inputs
    output = get_model_outputs(wide_inputs, deep_inputs, dnn_hidden_units)

    # Build model and compile it all together
    model = tf.keras.models.Model(inputs=inputs, outputs=output)
    model.compile(optimizer="adam", loss="mse", metrics=[rmse, "mse"])

    return model


def train_and_evaluate(args):
    model = build_wide_deep_model(args["nnsize"], args["nembeds"])
    print("Here is our Wide-and-Deep architecture so far:\n")
    print(model.summary())
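    # Note: model.summary() prints the table itself and returns None,
    # which is why an extra "None" line appears in the logs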

    trainds = load_dataset(
        args["train_data_path"],
        args["batch_size"],
        tf.estimator.ModeKeys.TRAIN)

    evalds = load_dataset(
        args["eval_data_path"], 1000, tf.estimator.ModeKeys.EVAL)
    if args["eval_steps"]:
        evalds = evalds.take(count=args["eval_steps"])

    num_batches = args["batch_size"] * args["num_epochs"]
    steps_per_epoch = args["train_examples"] // num_batches
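    # Spread train_examples across num_epochs epochs, so that
    # steps_per_epoch * batch_size * num_epochs ~= train_examples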

    checkpoint_path = os.path.join(args["output_dir"], "checkpoints/babyweight")
    cp_callback = tf.keras.callbacks.ModelCheckpoint(
        filepath=checkpoint_path, verbose=1, save_weights_only=True)

    history = model.fit(
        trainds,
        validation_data=evalds,
        epochs=args["num_epochs"],
        steps_per_epoch=steps_per_epoch,
        verbose=2,  # 0=silent, 1=progress bar, 2=one line per epoch
        callbacks=[cp_callback])

    EXPORT_PATH = os.path.join(
        args["output_dir"], datetime.datetime.now().strftime("%Y%m%d%H%M%S"))
    tf.saved_model.save(
        obj=model, export_dir=EXPORT_PATH)  # with default serving function
    print("Exported trained model to {}".format(EXPORT_PATH))

Train locally

After moving the code into a package, make sure it works standalone. Note that we incorporated the --train_examples flag so that we don't try to train on the entire dataset while we are developing our pipeline. Once we are sure that everything is working on a subset, we can change it so that we train on all the data. Even for this subset, this takes about three minutes, during which you won't see any output.

Run trainer module package locally.

We can run a very small training job with a small batch size, 1 epoch, --train_examples=1 (i.e. 1,000 examples, since the flag is in thousands), and 1 eval step.


In [24]:
%%bash
OUTDIR=babyweight_trained
rm -rf ${OUTDIR}
export PYTHONPATH=${PYTHONPATH}:${PWD}/babyweight
python3 -m trainer.task \
    --job-dir=./tmp \
    --train_data_path=gs://${BUCKET}/babyweight/data/train*.csv  \
    --eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
    --output_dir=${OUTDIR} \
    --batch_size=10 \
    --num_epochs=1 \
    --train_examples=1 \
    --eval_steps=1


Here is our Wide-and-Deep architecture so far:

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
gestation_weeks (InputLayer)    [(None,)]            0                                            
__________________________________________________________________________________________________
is_male (InputLayer)            [(None,)]            0                                            
__________________________________________________________________________________________________
mother_age (InputLayer)         [(None,)]            0                                            
__________________________________________________________________________________________________
plurality (InputLayer)          [(None,)]            0                                            
__________________________________________________________________________________________________
deep_inputs (DenseFeatures)     (None, 5)            3000        gestation_weeks[0][0]            
                                                                 is_male[0][0]                    
                                                                 mother_age[0][0]                 
                                                                 plurality[0][0]                  
__________________________________________________________________________________________________
dnn_1 (Dense)                   (None, 128)          768         deep_inputs[0][0]                
__________________________________________________________________________________________________
dnn_2 (Dense)                   (None, 32)           4128        dnn_1[0][0]                      
__________________________________________________________________________________________________
wide_inputs (DenseFeatures)     (None, 71)           0           gestation_weeks[0][0]            
                                                                 is_male[0][0]                    
                                                                 mother_age[0][0]                 
                                                                 plurality[0][0]                  
__________________________________________________________________________________________________
dnn_3 (Dense)                   (None, 4)            132         dnn_2[0][0]                      
__________________________________________________________________________________________________
linear (Dense)                  (None, 10)           720         wide_inputs[0][0]                
__________________________________________________________________________________________________
both (Concatenate)              (None, 14)           0           dnn_3[0][0]                      
                                                                 linear[0][0]                     
__________________________________________________________________________________________________
weight (Dense)                  (None, 1)            15          both[0][0]                       
==================================================================================================
Total params: 8,763
Trainable params: 8,763
Non-trainable params: 0
__________________________________________________________________________________________________
None
mode = train
mode = eval
Train for 10 steps, validate for 1 steps
Epoch 1/10

Epoch 00001: saving model to babyweight_trained/checkpoints/babyweight
10/10 - 3s - loss: 8.7626 - rmse: 2.6177 - mse: 8.7626 - val_loss: 2.8766 - val_rmse: 1.6961 - val_mse: 2.8766
Epoch 2/10

Epoch 00002: saving model to babyweight_trained/checkpoints/babyweight
10/10 - 0s - loss: 3.7634 - rmse: 1.8733 - mse: 3.7634 - val_loss: 2.1518 - val_rmse: 1.4669 - val_mse: 2.1518
Epoch 3/10

Epoch 00003: saving model to babyweight_trained/checkpoints/babyweight
10/10 - 0s - loss: 1.4293 - rmse: 1.1661 - mse: 1.4293 - val_loss: 2.1744 - val_rmse: 1.4746 - val_mse: 2.1744
Epoch 4/10

Epoch 00004: saving model to babyweight_trained/checkpoints/babyweight
10/10 - 0s - loss: 1.9580 - rmse: 1.3507 - mse: 1.9580 - val_loss: 1.5461 - val_rmse: 1.2434 - val_mse: 1.5461
Epoch 5/10

Epoch 00005: saving model to babyweight_trained/checkpoints/babyweight
10/10 - 0s - loss: 1.8938 - rmse: 1.3371 - mse: 1.8938 - val_loss: 1.6900 - val_rmse: 1.3000 - val_mse: 1.6900
Epoch 6/10

Epoch 00006: saving model to babyweight_trained/checkpoints/babyweight
10/10 - 0s - loss: 1.5862 - rmse: 1.2189 - mse: 1.5862 - val_loss: 1.8496 - val_rmse: 1.3600 - val_mse: 1.8496
Epoch 7/10

Epoch 00007: saving model to babyweight_trained/checkpoints/babyweight
10/10 - 0s - loss: 1.7102 - rmse: 1.2807 - mse: 1.7102 - val_loss: 1.3785 - val_rmse: 1.1741 - val_mse: 1.3785
Epoch 8/10

Epoch 00008: saving model to babyweight_trained/checkpoints/babyweight
10/10 - 0s - loss: 1.1809 - rmse: 1.0657 - mse: 1.1809 - val_loss: 1.3404 - val_rmse: 1.1578 - val_mse: 1.3404
Epoch 9/10

Epoch 00009: saving model to babyweight_trained/checkpoints/babyweight
10/10 - 0s - loss: 1.5710 - rmse: 1.1937 - mse: 1.5710 - val_loss: 1.2855 - val_rmse: 1.1338 - val_mse: 1.2855
Epoch 10/10

Epoch 00010: saving model to babyweight_trained/checkpoints/babyweight
10/10 - 0s - loss: 1.2826 - rmse: 1.1179 - mse: 1.2826 - val_loss: 1.2535 - val_rmse: 1.1196 - val_mse: 1.2535
Exported trained model to babyweight_trained/20191030174517
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/feature_column/feature_column_v2.py:4276: IndicatorColumn._variable_shape (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/feature_column/feature_column_v2.py:4331: BucketizedColumn._num_buckets (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
2019-10-30 17:45:10.701202: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2019-10-30 17:45:10.709821: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2200000000 Hz
2019-10-30 17:45:10.710156: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56465037f020 executing computations on platform Host. Devices:
2019-10-30 17:45:10.710186: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2019-10-30 17:45:10.710544: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/feature_column/feature_column_v2.py:4331: VocabularyListCategoricalColumn._num_buckets (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/data/experimental/ops/readers.py:521: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/data/experimental/ops/readers.py:215: shuffle_and_repeat (from tensorflow.python.data.experimental.ops.shuffle_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.shuffle(buffer_size, seed)` followed by `tf.data.Dataset.repeat(count)`. Static tf.data optimizations will take care of using the fused implementation.
2019-10-30 17:45:19.039904: W tensorflow/python/util/util.cc:299] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1781: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.

Dockerized module

Since we are using TensorFlow 2.0, which is not yet natively supported on AI Platform, we will run the code in a custom container image.

Once TensorFlow 2.0 is natively supported on AI Platform, you will be able to simply do (without having to build a container):

gcloud ai-platform jobs submit training ${JOBNAME} \
    --region=${REGION} \
    --module-name=trainer.task \
    --package-path=$(pwd)/babyweight/trainer \
    --job-dir=${OUTDIR} \
    --staging-bucket=gs://${BUCKET} \
    --scale-tier=STANDARD_1 \
    --runtime-version=${TFVERSION} \
    -- \
    --train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
    --eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
    --output_dir=${OUTDIR} \
    --num_epochs=10 \
    --train_examples=20000 \
    --eval_steps=100 \
    --batch_size=32 \
    --nembeds=8
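
With the container we build below, the submission instead points AI Platform at the image. This is only a hedged sketch (the image URI and machine type are assumptions, and --master-image-uri may require a recent or beta version of gcloud):

gcloud ai-platform jobs submit training ${JOBNAME} \
    --region=${REGION} \
    --master-image-uri=gcr.io/${PROJECT}/babyweight_training_container \
    --master-machine-type=n1-standard-4 \
    --scale-tier=CUSTOM \
    -- \
    --train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
    --eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
    --output_dir=${OUTDIR} \
    --num_epochs=10 \
    --train_examples=20000 \
    --eval_steps=100 \
    --batch_size=32 \
    --nembeds=8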

Create Dockerfile

We need to create a container with everything needed to run our model: our trainer module package, Python 3, and the libraries we use, such as the latest TensorFlow 2.0 release.


In [ ]:
%%writefile babyweight/Dockerfile
FROM gcr.io/deeplearning-platform-release/tf2-cpu
COPY trainer /babyweight/trainer
RUN apt update && \
    apt install --yes python3-pip && \
    pip3 install --upgrade --quiet tensorflow==2.0

ENV PYTHONPATH ${PYTHONPATH}:/babyweight
ENTRYPOINT ["python3", "babyweight/trainer/task.py"]

Build and push container image to repo

Now that we have created our Dockerfile, we need to build and push our container image to our project's container registry. To do this, we'll create a small shell script that we can call from bash.


In [ ]:
%%writefile babyweight/push_docker.sh
export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export IMAGE_REPO_NAME=babyweight_training_container
export IMAGE_URI=gcr.io/${PROJECT_ID}/${IMAGE_REPO_NAME}

echo "Building  $IMAGE_URI"
docker build -f Dockerfile -t ${IMAGE_URI} ./
echo "Pushing $IMAGE_URI"
docker push ${IMAGE_URI}

Note: If you get a permissions/stat error when running push_docker.sh from Notebooks, run it from Cloud Shell instead:

Open Cloud Shell in the GCP Console.

This step takes 5-10 minutes to run.


In [30]:
%%bash
cd babyweight
bash push_docker.sh


Building  gcr.io/qwiklabs-gcp-4b437f7e5bfff9dd/babyweight_training_container
Sending build context to Docker daemon  36.35kB
Step 1/5 : FROM gcr.io/deeplearning-platform-release/tf2-cpu
 ---> bed936671274
Step 2/5 : COPY trainer /babyweight/trainer
 ---> Using cache
 ---> 3c07d08c2528
Step 3/5 : RUN apt update &&     apt install --yes python3-pip &&     pip3 install --upgrade --quiet tf-nightly-2.0-preview
 ---> Running in 4600f5d7e84a

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

Get:1 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:2 http://packages.cloud.google.com/apt cloud-sdk-bionic InRelease [6372 B]
Get:3 http://archive.ubuntu.com/ubuntu bionic InRelease [242 kB]
Get:4 http://packages.cloud.google.com/apt cloud-sdk-bionic/main amd64 Packages [92.7 kB]
Get:5 http://security.ubuntu.com/ubuntu bionic-security/multiverse amd64 Packages [5944 B]
Get:6 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages [700 kB]
Get:7 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:8 http://security.ubuntu.com/ubuntu bionic-security/universe amd64 Packages [782 kB]
Get:9 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Get:10 http://security.ubuntu.com/ubuntu bionic-security/restricted amd64 Packages [12.6 kB]
Get:11 http://archive.ubuntu.com/ubuntu bionic/main amd64 Packages [1344 kB]
Get:12 http://archive.ubuntu.com/ubuntu bionic/restricted amd64 Packages [13.5 kB]
Get:13 http://archive.ubuntu.com/ubuntu bionic/multiverse amd64 Packages [186 kB]
Get:14 http://archive.ubuntu.com/ubuntu bionic/universe amd64 Packages [11.3 MB]
Get:15 http://archive.ubuntu.com/ubuntu bionic-updates/restricted amd64 Packages [23.2 kB]
Get:16 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 Packages [1303 kB]
Get:17 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages [995 kB]
Get:18 http://archive.ubuntu.com/ubuntu bionic-updates/multiverse amd64 Packages [9022 B]
Get:19 http://archive.ubuntu.com/ubuntu bionic-backports/main amd64 Packages [2496 B]
Get:20 http://archive.ubuntu.com/ubuntu bionic-backports/universe amd64 Packages [4227 B]
Fetched 17.3 MB in 10s (1770 kB/s)
Reading package lists...
Building dependency tree...
Reading state information...
48 packages can be upgraded. Run 'apt list --upgradable' to see them.

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  dh-python libexpat1-dev libpython3-dev libpython3.6 libpython3.6-dev
  libpython3.6-minimal libpython3.6-stdlib python-pip-whl python3-asn1crypto
  python3-cffi-backend python3-crypto python3-cryptography python3-dev
  python3-distutils python3-idna python3-keyring python3-keyrings.alt
  python3-lib2to3 python3-pkg-resources python3-secretstorage
  python3-setuptools python3-six python3-wheel python3-xdg python3.6
  python3.6-dev python3.6-minimal
Suggested packages:
  python-crypto-doc python-cryptography-doc python3-cryptography-vectors
  gnome-keyring libkf5wallet-bin gir1.2-gnomekeyring-1.0
  python-secretstorage-doc python-setuptools-doc python3.6-venv python3.6-doc
  binfmt-support
The following NEW packages will be installed:
  dh-python libexpat1-dev libpython3-dev libpython3.6-dev python-pip-whl
  python3-asn1crypto python3-cffi-backend python3-crypto python3-cryptography
  python3-dev python3-distutils python3-idna python3-keyring
  python3-keyrings.alt python3-lib2to3 python3-pip python3-pkg-resources
  python3-secretstorage python3-setuptools python3-six python3-wheel
  python3-xdg python3.6-dev
The following packages will be upgraded:
  libpython3.6 libpython3.6-minimal libpython3.6-stdlib python3.6
  python3.6-minimal
5 upgraded, 23 newly installed, 0 to remove and 43 not upgraded.
Need to get 54.1 MB of archives.
After this operation, 89.3 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libpython3.6 amd64 3.6.8-1~18.04.3 [1415 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 python3.6 amd64 3.6.8-1~18.04.3 [202 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libpython3.6-stdlib amd64 3.6.8-1~18.04.3 [1712 kB]
Get:4 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 python3.6-minimal amd64 3.6.8-1~18.04.3 [1610 kB]
Get:5 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libpython3.6-minimal amd64 3.6.8-1~18.04.3 [533 kB]
Get:6 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 python3-lib2to3 all 3.6.8-1~18.04 [76.5 kB]
Get:7 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 python3-distutils all 3.6.8-1~18.04 [141 kB]
Get:8 http://archive.ubuntu.com/ubuntu bionic/main amd64 dh-python all 3.20180325ubuntu2 [89.2 kB]
Get:9 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libexpat1-dev amd64 2.2.5-3ubuntu0.2 [122 kB]
Get:10 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libpython3.6-dev amd64 3.6.8-1~18.04.3 [44.8 MB]
Get:11 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libpython3-dev amd64 3.6.7-1~18.04 [7328 B]
Get:12 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 python-pip-whl all 9.0.1-2.3~ubuntu1.18.04.1 [1653 kB]
Get:13 http://archive.ubuntu.com/ubuntu bionic/main amd64 python3-asn1crypto all 0.24.0-1 [72.8 kB]
Get:14 http://archive.ubuntu.com/ubuntu bionic/main amd64 python3-cffi-backend amd64 1.11.5-1 [64.6 kB]
Get:15 http://archive.ubuntu.com/ubuntu bionic/main amd64 python3-crypto amd64 2.6.1-8ubuntu2 [244 kB]
Get:16 http://archive.ubuntu.com/ubuntu bionic/main amd64 python3-idna all 2.6-1 [32.5 kB]
Get:17 http://archive.ubuntu.com/ubuntu bionic/main amd64 python3-six all 1.11.0-2 [11.4 kB]
Get:18 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 python3-cryptography amd64 2.1.4-1ubuntu1.3 [221 kB]
Get:19 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 python3.6-dev amd64 3.6.8-1~18.04.3 [508 kB]
Get:20 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 python3-dev amd64 3.6.7-1~18.04 [1288 B]
Get:21 http://archive.ubuntu.com/ubuntu bionic/main amd64 python3-secretstorage all 2.3.1-2 [12.1 kB]
Get:22 http://archive.ubuntu.com/ubuntu bionic/main amd64 python3-keyring all 10.6.0-1 [26.7 kB]
Get:23 http://archive.ubuntu.com/ubuntu bionic/main amd64 python3-keyrings.alt all 3.0-1 [16.6 kB]
Get:24 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 python3-pip all 9.0.1-2.3~ubuntu1.18.04.1 [114 kB]
Get:25 http://archive.ubuntu.com/ubuntu bionic/main amd64 python3-pkg-resources all 39.0.1-2 [98.8 kB]
Get:26 http://archive.ubuntu.com/ubuntu bionic/main amd64 python3-setuptools all 39.0.1-2 [248 kB]
Get:27 http://archive.ubuntu.com/ubuntu bionic/universe amd64 python3-wheel all 0.30.0-0.2 [36.5 kB]
Get:28 http://archive.ubuntu.com/ubuntu bionic/main amd64 python3-xdg all 0.25-4ubuntu1 [31.4 kB]
debconf: delaying package configuration, since apt-utils is not installed
Fetched 54.1 MB in 23s (2360 kB/s)
(Reading database ... 85082 files and directories currently installed.)
Preparing to unpack .../00-libpython3.6_3.6.8-1~18.04.3_amd64.deb ...
Unpacking libpython3.6:amd64 (3.6.8-1~18.04.3) over (3.6.8-1~18.04.2) ...
Preparing to unpack .../01-python3.6_3.6.8-1~18.04.3_amd64.deb ...
Unpacking python3.6 (3.6.8-1~18.04.3) over (3.6.8-1~18.04.2) ...
Preparing to unpack .../02-libpython3.6-stdlib_3.6.8-1~18.04.3_amd64.deb ...
Unpacking libpython3.6-stdlib:amd64 (3.6.8-1~18.04.3) over (3.6.8-1~18.04.2) ...
Preparing to unpack .../03-python3.6-minimal_3.6.8-1~18.04.3_amd64.deb ...
Unpacking python3.6-minimal (3.6.8-1~18.04.3) over (3.6.8-1~18.04.2) ...
Preparing to unpack .../04-libpython3.6-minimal_3.6.8-1~18.04.3_amd64.deb ...
Unpacking libpython3.6-minimal:amd64 (3.6.8-1~18.04.3) over (3.6.8-1~18.04.2) ...
Selecting previously unselected package python3-lib2to3.
Preparing to unpack .../05-python3-lib2to3_3.6.8-1~18.04_all.deb ...
Unpacking python3-lib2to3 (3.6.8-1~18.04) ...
Selecting previously unselected package python3-distutils.
Preparing to unpack .../06-python3-distutils_3.6.8-1~18.04_all.deb ...
Unpacking python3-distutils (3.6.8-1~18.04) ...
Selecting previously unselected package dh-python.
Preparing to unpack .../07-dh-python_3.20180325ubuntu2_all.deb ...
Unpacking dh-python (3.20180325ubuntu2) ...
Selecting previously unselected package libexpat1-dev:amd64.
Preparing to unpack .../08-libexpat1-dev_2.2.5-3ubuntu0.2_amd64.deb ...
Unpacking libexpat1-dev:amd64 (2.2.5-3ubuntu0.2) ...
Selecting previously unselected package libpython3.6-dev:amd64.
Preparing to unpack .../09-libpython3.6-dev_3.6.8-1~18.04.3_amd64.deb ...
Unpacking libpython3.6-dev:amd64 (3.6.8-1~18.04.3) ...
Selecting previously unselected package libpython3-dev:amd64.
Preparing to unpack .../10-libpython3-dev_3.6.7-1~18.04_amd64.deb ...
Unpacking libpython3-dev:amd64 (3.6.7-1~18.04) ...
Selecting previously unselected package python-pip-whl.
Preparing to unpack .../11-python-pip-whl_9.0.1-2.3~ubuntu1.18.04.1_all.deb ...
Unpacking python-pip-whl (9.0.1-2.3~ubuntu1.18.04.1) ...
Selecting previously unselected package python3-asn1crypto.
Preparing to unpack .../12-python3-asn1crypto_0.24.0-1_all.deb ...
Unpacking python3-asn1crypto (0.24.0-1) ...
Selecting previously unselected package python3-cffi-backend.
Preparing to unpack .../13-python3-cffi-backend_1.11.5-1_amd64.deb ...
Unpacking python3-cffi-backend (1.11.5-1) ...
Selecting previously unselected package python3-crypto.
Preparing to unpack .../14-python3-crypto_2.6.1-8ubuntu2_amd64.deb ...
Unpacking python3-crypto (2.6.1-8ubuntu2) ...
Selecting previously unselected package python3-idna.
Preparing to unpack .../15-python3-idna_2.6-1_all.deb ...
Unpacking python3-idna (2.6-1) ...
Selecting previously unselected package python3-six.
Preparing to unpack .../16-python3-six_1.11.0-2_all.deb ...
Unpacking python3-six (1.11.0-2) ...
Selecting previously unselected package python3-cryptography.
Preparing to unpack .../17-python3-cryptography_2.1.4-1ubuntu1.3_amd64.deb ...
Unpacking python3-cryptography (2.1.4-1ubuntu1.3) ...
Selecting previously unselected package python3.6-dev.
Preparing to unpack .../18-python3.6-dev_3.6.8-1~18.04.3_amd64.deb ...
Unpacking python3.6-dev (3.6.8-1~18.04.3) ...
Selecting previously unselected package python3-dev.
Preparing to unpack .../19-python3-dev_3.6.7-1~18.04_amd64.deb ...
Unpacking python3-dev (3.6.7-1~18.04) ...
Selecting previously unselected package python3-secretstorage.
Preparing to unpack .../20-python3-secretstorage_2.3.1-2_all.deb ...
Unpacking python3-secretstorage (2.3.1-2) ...
Selecting previously unselected package python3-keyring.
Preparing to unpack .../21-python3-keyring_10.6.0-1_all.deb ...
Unpacking python3-keyring (10.6.0-1) ...
Selecting previously unselected package python3-keyrings.alt.
Preparing to unpack .../22-python3-keyrings.alt_3.0-1_all.deb ...
Unpacking python3-keyrings.alt (3.0-1) ...
Selecting previously unselected package python3-pip.
Preparing to unpack .../23-python3-pip_9.0.1-2.3~ubuntu1.18.04.1_all.deb ...
Unpacking python3-pip (9.0.1-2.3~ubuntu1.18.04.1) ...
Selecting previously unselected package python3-pkg-resources.
Preparing to unpack .../24-python3-pkg-resources_39.0.1-2_all.deb ...
Unpacking python3-pkg-resources (39.0.1-2) ...
Selecting previously unselected package python3-setuptools.
Preparing to unpack .../25-python3-setuptools_39.0.1-2_all.deb ...
Unpacking python3-setuptools (39.0.1-2) ...
Selecting previously unselected package python3-wheel.
Preparing to unpack .../26-python3-wheel_0.30.0-0.2_all.deb ...
Unpacking python3-wheel (0.30.0-0.2) ...
Selecting previously unselected package python3-xdg.
Preparing to unpack .../27-python3-xdg_0.25-4ubuntu1_all.deb ...
Unpacking python3-xdg (0.25-4ubuntu1) ...
Setting up python-pip-whl (9.0.1-2.3~ubuntu1.18.04.1) ...
Processing triggers for mime-support (3.60ubuntu1) ...
Setting up python3-cffi-backend (1.11.5-1) ...
Setting up python3-crypto (2.6.1-8ubuntu2) ...
Setting up python3-idna (2.6-1) ...
Setting up python3-xdg (0.25-4ubuntu1) ...
Setting up python3-six (1.11.0-2) ...
Setting up python3-wheel (0.30.0-0.2) ...
Setting up python3-pkg-resources (39.0.1-2) ...
Setting up libpython3.6-minimal:amd64 (3.6.8-1~18.04.3) ...
Setting up python3-asn1crypto (0.24.0-1) ...
Processing triggers for libc-bin (2.27-3ubuntu1) ...
Setting up libexpat1-dev:amd64 (2.2.5-3ubuntu0.2) ...
Setting up python3-lib2to3 (3.6.8-1~18.04) ...
Setting up python3-distutils (3.6.8-1~18.04) ...
Setting up python3-cryptography (2.1.4-1ubuntu1.3) ...
Setting up libpython3.6-stdlib:amd64 (3.6.8-1~18.04.3) ...
Setting up python3-keyrings.alt (3.0-1) ...
Setting up python3.6-minimal (3.6.8-1~18.04.3) ...
Setting up python3-pip (9.0.1-2.3~ubuntu1.18.04.1) ...
Setting up python3-setuptools (39.0.1-2) ...
Setting up python3-secretstorage (2.3.1-2) ...
Setting up dh-python (3.20180325ubuntu2) ...
Setting up libpython3.6:amd64 (3.6.8-1~18.04.3) ...
Setting up python3.6 (3.6.8-1~18.04.3) ...
Setting up python3-keyring (10.6.0-1) ...
Setting up libpython3.6-dev:amd64 (3.6.8-1~18.04.3) ...
Setting up python3.6-dev (3.6.8-1~18.04.3) ...
Setting up libpython3-dev:amd64 (3.6.7-1~18.04) ...
Setting up python3-dev (3.6.7-1~18.04) ...
Processing triggers for libc-bin (2.27-3ubuntu1) ...
Removing intermediate container 4600f5d7e84a
 ---> bd32de3d9d7d
Step 4/5 : ENV PYTHONPATH ${PYTHONPATH}:/babyweight
 ---> Running in 7f2a7b00dc22
Removing intermediate container 7f2a7b00dc22
 ---> 3f5eb73fe990
Step 5/5 : ENTRYPOINT ["python3", "babyweight/trainer/task.py"]
 ---> Running in 5e7037953f34
Removing intermediate container 5e7037953f34
 ---> 35bb7a827663
Successfully built 35bb7a827663
Successfully tagged gcr.io/qwiklabs-gcp-4b437f7e5bfff9dd/babyweight_training_container:latest
Pushing gcr.io/qwiklabs-gcp-4b437f7e5bfff9dd/babyweight_training_container
The push refers to repository [gcr.io/qwiklabs-gcp-4b437f7e5bfff9dd/babyweight_training_container]
a8f2b0322007: Preparing
c3f121ae99b2: Preparing
bf90a9083c07: Preparing
11d81998383e: Preparing
95d20553f9d8: Preparing
db39d1d921f4: Preparing
30a800c3bce3: Preparing
b4e8d60ebd43: Preparing
e1331da2fed4: Preparing
47b98e0067a4: Preparing
1540993b0bb8: Preparing
aeb8ac7a4b4c: Preparing
3502212822b5: Preparing
a792cec8f1b3: Preparing
fd6e4fea6397: Preparing
e661f78768b4: Preparing
cb4cde3af37a: Preparing
122be11ab4a2: Preparing
7beb13bce073: Preparing
f7eae43028b3: Preparing
6cebf3abed5f: Preparing
1540993b0bb8: Waiting
aeb8ac7a4b4c: Waiting
3502212822b5: Waiting
a792cec8f1b3: Waiting
fd6e4fea6397: Waiting
e661f78768b4: Waiting
cb4cde3af37a: Waiting
122be11ab4a2: Waiting
7beb13bce073: Waiting
f7eae43028b3: Waiting
6cebf3abed5f: Waiting
db39d1d921f4: Waiting
30a800c3bce3: Waiting
b4e8d60ebd43: Waiting
e1331da2fed4: Waiting
47b98e0067a4: Waiting
bf90a9083c07: Layer already exists
95d20553f9d8: Layer already exists
11d81998383e: Layer already exists
db39d1d921f4: Layer already exists
30a800c3bce3: Layer already exists
b4e8d60ebd43: Layer already exists
e1331da2fed4: Layer already exists
47b98e0067a4: Layer already exists
1540993b0bb8: Layer already exists
aeb8ac7a4b4c: Layer already exists
a792cec8f1b3: Layer already exists
fd6e4fea6397: Layer already exists
3502212822b5: Layer already exists
cb4cde3af37a: Layer already exists
e661f78768b4: Layer already exists
122be11ab4a2: Layer already exists
7beb13bce073: Layer already exists
f7eae43028b3: Layer already exists
6cebf3abed5f: Layer already exists
c3f121ae99b2: Pushed
a8f2b0322007: Pushed
latest: digest: sha256:24a648d56f000fd0fb9bbd9bdd25a05c0e589e80ea2ab475f45da19f9eae1b63 size: 4714

Test container locally

Before we submit our training job to Cloud AI Platform, let's make sure the container we just built and pushed to our project's container registry works correctly. We can do that by running the container with bash and passing the necessary arguments to task.py's parser.


In [31]:
%%bash
export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export IMAGE_REPO_NAME=babyweight_training_container
export IMAGE_URI=gcr.io/${PROJECT_ID}/${IMAGE_REPO_NAME}
echo "Running  $IMAGE_URI"
docker run ${IMAGE_URI} \
    --train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
    --eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
    --output_dir=gs://${BUCKET}/babyweight/trained_model \
    --batch_size=10 \
    --num_epochs=10 \
    --train_examples=1 \
    --eval_steps=1


Running  gcr.io/qwiklabs-gcp-4b437f7e5bfff9dd/babyweight_training_container
Here is our Wide-and-Deep architecture so far:

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
gestation_weeks (InputLayer)    [(None,)]            0                                            
__________________________________________________________________________________________________
is_male (InputLayer)            [(None,)]            0                                            
__________________________________________________________________________________________________
mother_age (InputLayer)         [(None,)]            0                                            
__________________________________________________________________________________________________
plurality (InputLayer)          [(None,)]            0                                            
__________________________________________________________________________________________________
deep_inputs (DenseFeatures)     (None, 5)            3000        gestation_weeks[0][0]            
                                                                 is_male[0][0]                    
                                                                 mother_age[0][0]                 
                                                                 plurality[0][0]                  
__________________________________________________________________________________________________
dnn_1 (Dense)                   (None, 128)          768         deep_inputs[0][0]                
__________________________________________________________________________________________________
dnn_2 (Dense)                   (None, 32)           4128        dnn_1[0][0]                      
__________________________________________________________________________________________________
wide_inputs (DenseFeatures)     (None, 71)           0           gestation_weeks[0][0]            
                                                                 is_male[0][0]                    
                                                                 mother_age[0][0]                 
                                                                 plurality[0][0]                  
__________________________________________________________________________________________________
dnn_3 (Dense)                   (None, 4)            132         dnn_2[0][0]                      
__________________________________________________________________________________________________
linear (Dense)                  (None, 10)           720         wide_inputs[0][0]                
__________________________________________________________________________________________________
both (Concatenate)              (None, 14)           0           dnn_3[0][0]                      
                                                                 linear[0][0]                     
__________________________________________________________________________________________________
weight (Dense)                  (None, 1)            15          both[0][0]                       
==================================================================================================
Total params: 8,763
Trainable params: 8,763
Non-trainable params: 0
__________________________________________________________________________________________________
None
mode = train
mode = eval
Train for 10 steps, validate for 1 steps
Epoch 1/10

Epoch 00001: saving model to gs://qwiklabs-gcp-4b437f7e5bfff9dd/babyweight/trained_model/checkpoints/babyweight
10/10 - 17s - loss: 3.8755 - rmse: 1.9246 - mse: 3.8755 - val_loss: 6.2796 - val_rmse: 2.5059 - val_mse: 6.2796
Epoch 2/10

Epoch 00002: saving model to gs://qwiklabs-gcp-4b437f7e5bfff9dd/babyweight/trained_model/checkpoints/babyweight
10/10 - 8s - loss: 4.4790 - rmse: 2.0677 - mse: 4.4790 - val_loss: 4.3228 - val_rmse: 2.0791 - val_mse: 4.3228
Epoch 3/10

Epoch 00003: saving model to gs://qwiklabs-gcp-4b437f7e5bfff9dd/babyweight/trained_model/checkpoints/babyweight
10/10 - 8s - loss: 3.1877 - rmse: 1.7655 - mse: 3.1877 - val_loss: 3.2621 - val_rmse: 1.8061 - val_mse: 3.2621
Epoch 4/10

Epoch 00004: saving model to gs://qwiklabs-gcp-4b437f7e5bfff9dd/babyweight/trained_model/checkpoints/babyweight
10/10 - 8s - loss: 3.2268 - rmse: 1.7080 - mse: 3.2268 - val_loss: 2.9845 - val_rmse: 1.7276 - val_mse: 2.9845
Epoch 5/10

Epoch 00005: saving model to gs://qwiklabs-gcp-4b437f7e5bfff9dd/babyweight/trained_model/checkpoints/babyweight
10/10 - 8s - loss: 2.9043 - rmse: 1.6701 - mse: 2.9043 - val_loss: 3.1875 - val_rmse: 1.7853 - val_mse: 3.1875
Epoch 6/10

Epoch 00006: saving model to gs://qwiklabs-gcp-4b437f7e5bfff9dd/babyweight/trained_model/checkpoints/babyweight
10/10 - 9s - loss: 3.3367 - rmse: 1.8113 - mse: 3.3367 - val_loss: 2.7325 - val_rmse: 1.6530 - val_mse: 2.7325
Epoch 7/10

Epoch 00007: saving model to gs://qwiklabs-gcp-4b437f7e5bfff9dd/babyweight/trained_model/checkpoints/babyweight
10/10 - 7s - loss: 2.8142 - rmse: 1.6597 - mse: 2.8142 - val_loss: 2.6752 - val_rmse: 1.6356 - val_mse: 2.6752
Epoch 8/10

Epoch 00008: saving model to gs://qwiklabs-gcp-4b437f7e5bfff9dd/babyweight/trained_model/checkpoints/babyweight
10/10 - 8s - loss: 2.4923 - rmse: 1.5440 - mse: 2.4923 - val_loss: 2.9194 - val_rmse: 1.7086 - val_mse: 2.9194
Epoch 9/10

Epoch 00009: saving model to gs://qwiklabs-gcp-4b437f7e5bfff9dd/babyweight/trained_model/checkpoints/babyweight
10/10 - 8s - loss: 3.2297 - rmse: 1.7768 - mse: 3.2297 - val_loss: 3.1855 - val_rmse: 1.7848 - val_mse: 3.1855
Epoch 10/10

Epoch 00010: saving model to gs://qwiklabs-gcp-4b437f7e5bfff9dd/babyweight/trained_model/checkpoints/babyweight
10/10 - 8s - loss: 3.0952 - rmse: 1.7117 - mse: 3.0952 - val_loss: 3.4157 - val_rmse: 1.8482 - val_mse: 3.4157
Exported trained model to gs://qwiklabs-gcp-4b437f7e5bfff9dd/babyweight/trained_model/20191030175118
WARNING:tensorflow:From /root/miniconda3/lib/python3.5/site-packages/tensorflow_core/python/feature_column/feature_column_v2.py:4276: IndicatorColumn._variable_shape (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
WARNING:tensorflow:From /root/miniconda3/lib/python3.5/site-packages/tensorflow_core/python/feature_column/feature_column_v2.py:4331: BucketizedColumn._num_buckets (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
2019-10-30 17:49:47.083403: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.

User settings:

   KMP_AFFINITY=granularity=fine,verbose,compact,1,0
   KMP_BLOCKTIME=0
   KMP_SETTINGS=1

Effective settings:

   KMP_ABORT_DELAY=0
   KMP_ADAPTIVE_LOCK_PROPS='1,1024'
   KMP_ALIGN_ALLOC=64
   KMP_ALL_THREADPRIVATE=128
   KMP_ATOMIC_MODE=2
   KMP_BLOCKTIME=0
   KMP_CPUINFO_FILE: value is not defined
   KMP_DETERMINISTIC_REDUCTION=false
   KMP_DEVICE_THREAD_LIMIT=2147483647
   KMP_DISP_HAND_THREAD=false
   KMP_DISP_NUM_BUFFERS=7
   KMP_DUPLICATE_LIB_OK=false
   KMP_FORCE_REDUCTION: value is not defined
   KMP_FOREIGN_THREADS_THREADPRIVATE=true
   KMP_FORKJOIN_BARRIER='2,2'
   KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
   KMP_FORKJOIN_FRAMES=true
   KMP_FORKJOIN_FRAMES_MODE=3
   KMP_GTID_MODE=3
   KMP_HANDLE_SIGNALS=false
   KMP_HOT_TEAMS_MAX_LEVEL=1
   KMP_HOT_TEAMS_MODE=0
   KMP_INIT_AT_FORK=true
   KMP_ITT_PREPARE_DELAY=0
   KMP_LIBRARY=throughput
   KMP_LOCK_KIND=queuing
   KMP_MALLOC_POOL_INCR=1M
   KMP_MWAIT_HINTS=0
   KMP_NUM_LOCKS_IN_BLOCK=1
   KMP_PLAIN_BARRIER='2,2'
   KMP_PLAIN_BARRIER_PATTERN='hyper,hyper'
   KMP_REDUCTION_BARRIER='1,1'
   KMP_REDUCTION_BARRIER_PATTERN='hyper,hyper'
   KMP_SCHEDULE='static,balanced;guided,iterative'
   KMP_SETTINGS=true
   KMP_SPIN_BACKOFF_PARAMS='4096,100'
   KMP_STACKOFFSET=64
   KMP_STACKPAD=0
   KMP_STACKSIZE=8M
   KMP_STORAGE_MAP=false
   KMP_TASKING=2
   KMP_TASKLOOP_MIN_TASKS=0
   KMP_TASK_STEALING_CONSTRAINT=1
   KMP_TEAMS_THREAD_LIMIT=4
   KMP_TOPOLOGY_METHOD=all
   KMP_USER_LEVEL_MWAIT=false
   KMP_USE_YIELD=1
   KMP_VERSION=false
   KMP_WARNINGS=true
   OMP_AFFINITY_FORMAT='OMP: pid %P tid %i thread %n bound to OS proc set {%A}'
   OMP_ALLOCATOR=omp_default_mem_alloc
   OMP_CANCELLATION=false
   OMP_DEBUG=disabled
   OMP_DEFAULT_DEVICE=0
   OMP_DISPLAY_AFFINITY=false
   OMP_DISPLAY_ENV=false
   OMP_DYNAMIC=false
   OMP_MAX_ACTIVE_LEVELS=2147483647
   OMP_MAX_TASK_PRIORITY=0
   OMP_NESTED=false
   OMP_NUM_THREADS: value is not defined
   OMP_PLACES: value is not defined
   OMP_PROC_BIND='intel'
   OMP_SCHEDULE='static'
   OMP_STACKSIZE=8M
   OMP_TARGET_OFFLOAD=DEFAULT
   OMP_THREAD_LIMIT=2147483647
   OMP_TOOL=enabled
   OMP_TOOL_LIBRARIES: value is not defined
   OMP_WAIT_POLICY=PASSIVE
   KMP_AFFINITY='verbose,warnings,respect,granularity=fine,compact,1,0'

2019-10-30 17:49:47.536790: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2200000000 Hz
2019-10-30 17:49:47.537312: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x562631508020 executing computations on platform Host. Devices:
2019-10-30 17:49:47.537355: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
OMP: Info #212: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #210: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 0-3
OMP: Info #156: KMP_AFFINITY: 4 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 2 cores/pkg x 2 threads/core (2 total cores)
OMP: Info #214: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 0 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 1 thread 1 
OMP: Info #250: KMP_AFFINITY: pid 1 tid 1 thread 0 bound to OS proc set 0
2019-10-30 17:49:47.566316: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
WARNING:tensorflow:From /root/miniconda3/lib/python3.5/site-packages/tensorflow_core/python/feature_column/feature_column_v2.py:4331: VocabularyListCategoricalColumn._num_buckets (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
WARNING:tensorflow:From /root/miniconda3/lib/python3.5/site-packages/tensorflow_core/python/data/experimental/ops/readers.py:521: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.
WARNING:tensorflow:From /root/miniconda3/lib/python3.5/site-packages/tensorflow_core/python/data/experimental/ops/readers.py:215: shuffle_and_repeat (from tensorflow.python.data.experimental.ops.shuffle_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.shuffle(buffer_size, seed)` followed by `tf.data.Dataset.repeat(count)`. Static tf.data optimizations will take care of using the fused implementation.
OMP: Info #250: KMP_AFFINITY: pid 1 tid 31 thread 1 bound to OS proc set 1
OMP: Info #250: KMP_AFFINITY: pid 1 tid 33 thread 2 bound to OS proc set 2
OMP: Info #250: KMP_AFFINITY: pid 1 tid 26 thread 3 bound to OS proc set 3
OMP: Info #250: KMP_AFFINITY: pid 1 tid 27 thread 4 bound to OS proc set 0
OMP: Info #250: KMP_AFFINITY: pid 1 tid 38 thread 6 bound to OS proc set 2
OMP: Info #250: KMP_AFFINITY: pid 1 tid 37 thread 5 bound to OS proc set 1
OMP: Info #250: KMP_AFFINITY: pid 1 tid 39 thread 7 bound to OS proc set 3
OMP: Info #250: KMP_AFFINITY: pid 1 tid 40 thread 8 bound to OS proc set 0
OMP: Info #250: KMP_AFFINITY: pid 1 tid 41 thread 9 bound to OS proc set 1
OMP: Info #250: KMP_AFFINITY: pid 1 tid 42 thread 10 bound to OS proc set 2
OMP: Info #250: KMP_AFFINITY: pid 1 tid 28 thread 11 bound to OS proc set 3
OMP: Info #250: KMP_AFFINITY: pid 1 tid 46 thread 12 bound to OS proc set 0
OMP: Info #250: KMP_AFFINITY: pid 1 tid 51 thread 12 bound to OS proc set 0
OMP: Info #250: KMP_AFFINITY: pid 1 tid 56 thread 12 bound to OS proc set 0
OMP: Info #250: KMP_AFFINITY: pid 1 tid 61 thread 12 bound to OS proc set 0
OMP: Info #250: KMP_AFFINITY: pid 1 tid 66 thread 12 bound to OS proc set 0
OMP: Info #250: KMP_AFFINITY: pid 1 tid 71 thread 12 bound to OS proc set 0
OMP: Info #250: KMP_AFFINITY: pid 1 tid 76 thread 12 bound to OS proc set 0
OMP: Info #250: KMP_AFFINITY: pid 1 tid 81 thread 12 bound to OS proc set 0
OMP: Info #250: KMP_AFFINITY: pid 1 tid 86 thread 12 bound to OS proc set 0
OMP: Info #250: KMP_AFFINITY: pid 1 tid 91 thread 12 bound to OS proc set 0
2019-10-30 17:51:19.949620: W tensorflow/python/util/util.cc:299] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
WARNING:tensorflow:From /root/miniconda3/lib/python3.5/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1781: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.

Train on Cloud AI Platform

Once the code works in standalone mode, you can run it on Cloud AI Platform. Because this run trains on the entire dataset, it will take a while; the training run took about two hours for me. You can monitor the job from the GCP console in the Cloud AI Platform section, or from the command line as sketched after the submission cell below.


In [ ]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/trained_model
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
echo ${OUTDIR} ${REGION} ${JOBNAME}
gsutil -m rm -rf ${OUTDIR}

IMAGE=gcr.io/${PROJECT}/babyweight_training_container

gcloud ai-platform jobs submit training ${JOBNAME} \
    --staging-bucket=gs://${BUCKET} \
    --region=${REGION} \
    --master-image-uri=${IMAGE} \
    --master-machine-type=n1-standard-4 \
    --scale-tier=CUSTOM \
    -- \
    --train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
    --eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
    --output_dir=${OUTDIR} \
    --num_epochs=10 \
    --train_examples=20000 \
    --eval_steps=100 \
    --batch_size=32 \
    --nembeds=8
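
To keep an eye on the job from the command line as well, gcloud can report the job's state and stream its logs. This is a minimal sketch; the JOBNAME value shown is a placeholder, since bash variables do not persist across %%bash cells -- substitute the job name printed by the cell above.


In [ ]:
%%bash
# Minimal monitoring sketch: JOBNAME below is a hypothetical placeholder --
# replace it with the job name printed by the submission cell above.
JOBNAME=babyweight_YYMMDD_HHMMSS
gcloud ai-platform jobs describe ${JOBNAME}      # current state and any error message
gcloud ai-platform jobs stream-logs ${JOBNAME}   # tail the training logs until the job ends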

When I ran it, I used train_examples=2000000. When training finished, I filtered the Stackdriver log on the word "dict" (a command-line version of that filter is sketched below) and saw that the last line was:

Saving dict for global step 5714290: average_loss = 1.06473, global_step = 5714290, loss = 34882.4, rmse = 1.03186
The final RMSE was 1.03 pounds.
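
If you prefer to do that filter from the command line, here is a hedged sketch using gcloud logging read. It assumes AI Platform training logs appear under the ml_job resource type (with the job name in the job_id label); JOBNAME is a placeholder for the job you submitted above.


In [ ]:
%%bash
# Hedged sketch: pull the job's Stackdriver (Cloud Logging) entries and keep only
# the evaluation "dict" lines. Assumes the ml_job resource type; JOBNAME is a
# hypothetical placeholder for the training job submitted above.
JOBNAME=babyweight_YYMMDD_HHMMSS
gcloud logging read \
    "resource.type=\"ml_job\" AND resource.labels.job_id=\"${JOBNAME}\"" \
    --limit=1000 --format="value(textPayload)" | grep dict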

Hyperparameter tuning.

All of these are command-line parameters to my program. To do hyperparameter tuning, create hyperparam.yaml and pass it as --config=hyperparam.yaml. This step will take up to two hours -- you can increase maxParallelTrials or reduce maxTrials to get it done faster. Since maxParallelTrials is the number of initial seeds to start searching from, you don't want it to be too large; otherwise, all you have is a random search.


In [ ]:
%%writefile hyperparam.yaml
trainingInput:
    scaleTier: STANDARD_1
    hyperparameters:
        hyperparameterMetricTag: rmse
        goal: MINIMIZE
        maxTrials: 20
        maxParallelTrials: 5
        enableTrialEarlyStopping: True
        params:
        - parameterName: batch_size
          type: INTEGER
          minValue: 8
          maxValue: 512
          scaleType: UNIT_LOG_SCALE
        - parameterName: nembeds
          type: INTEGER
          minValue: 3
          maxValue: 30
          scaleType: UNIT_LINEAR_SCALE

In [ ]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/hyperparam
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
echo ${OUTDIR} ${REGION} ${JOBNAME}
gsutil -m rm -rf ${OUTDIR}

IMAGE=gcr.io/${PROJECT}/babyweight_training_container

gcloud ai-platform jobs submit training ${JOBNAME} \
    --staging-bucket=gs://${BUCKET} \
    --region=${REGION} \
    --master-image-uri=${IMAGE} \
    --master-machine-type=n1-standard-4 \
    --scale-tier=CUSTOM \
    --config=hyperparam.yaml \
    -- \
    --train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
    --eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
    --output_dir=${OUTDIR} \
    --num_epochs=10 \
    --train_examples=20000 \
    --eval_steps=100
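
Once the tuning job finishes, you need the winning values of batch_size and nembeds for the next step. Here is a minimal sketch for pulling them from the job description, assuming the trial results are exposed under trainingOutput.trials (JOBNAME is a placeholder for the tuning job submitted above):


In [ ]:
%%bash
# Hedged sketch: list each trial's hyperparameters and final rmse so you can pick
# the best ones. Assumes the describe output includes trainingOutput.trials;
# JOBNAME is a hypothetical placeholder for the tuning job name printed above.
JOBNAME=babyweight_YYMMDD_HHMMSS
gcloud ai-platform jobs describe ${JOBNAME} --format="json(trainingOutput.trials)"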

Repeat training

This time, plug in the tuned values: replace the --batch_size=32 and --nembeds=8 defaults below with the best values reported by the hyperparameter tuning job (see the trial-listing sketch above).


In [ ]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/trained_model_tuned
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
echo ${OUTDIR} ${REGION} ${JOBNAME}
gsutil -m rm -rf ${OUTDIR}

IMAGE=gcr.io/${PROJECT}/babyweight_training_container

gcloud ai-platform jobs submit training ${JOBNAME} \
    --staging-bucket=gs://${BUCKET} \
    --region=${REGION} \
    --master-image-uri=${IMAGE} \
    --master-machine-type=n1-standard-4 \
    --scale-tier=CUSTOM \
    -- \
    --train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
    --eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
    --output_dir=${OUTDIR} \
    --num_epochs=10 \
    --train_examples=20000 \
    --eval_steps=100 \
    --batch_size=32 \
    --nembeds=8
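
After the job completes, it is worth confirming that model artifacts actually landed in the output directory. A quick sketch, assuming the trainer writes its checkpoints and exported model under output_dir as in the earlier runs:


In [ ]:
%%bash
# Hedged sketch: list everything the finished job wrote to the tuned-model OUTDIR.
gsutil ls -r gs://${BUCKET}/babyweight/trained_model_tuned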

Lab Summary:

In this lab, we set up the environment, created the trainer module's task.py to hold hyperparameter argparsing code, created the trainer module's model.py to hold Keras model code, ran the trainer module package locally, submitted a training job to Cloud AI Platform, and submitted a hyperparameter tuning job to Cloud AI Platform.

Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.