Learning Objectives
Having tested our training pipeline both locally and in the cloud on a subset of the data, we can now submit another (much larger) training job to the cloud. It is also a good idea to run a hyperparameter tuning job to make sure we have optimized the hyperparameters of our model.
This notebook illustrates how to do distributed training and hyperparameter tuning on Cloud AI Platform.
To start, we'll set up our environment variables as before.
In [ ]:
PROJECT = "cloud-training-demos" # Replace with your PROJECT
BUCKET = "cloud-training-bucket" # Replace with your BUCKET
REGION = "us-central1" # Choose an available region for Cloud AI Platform
TFVERSION = "1.14" # TF version for CAIP to use
In [ ]:
import os
os.environ["BUCKET"] = BUCKET
os.environ["PROJECT"] = PROJECT
os.environ["REGION"] = REGION
os.environ["TFVERSION"] = TFVERSION
In [ ]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION
Next, we'll look for the preprocessed data for the babyweight model and copy it over if it's not there.
In [ ]:
%%bash
if ! gsutil ls -r gs://$BUCKET | grep -q gs://$BUCKET/babyweight/preproc; then
  gsutil mb -l ${REGION} gs://${BUCKET}
  # copy canonical set of preprocessed files if you didn't do previous notebook
  gsutil -m cp -R gs://cloud-training-demos/babyweight gs://${BUCKET}
fi
In [ ]:
%%bash
gsutil ls gs://${BUCKET}/babyweight/preproc/*-00000*
In the previous labs we developed our TensorFlow model and got it working on a subset of the data. Now we can package the TensorFlow code up as a Python module and train it on Cloud AI Platform.
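Training on Cloud AI Platform requires two things: the training code organized as a Python package, and a gcloud command that submits that package as a training job.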
A Python package is simply a collection of one or more .py files along with an __init__.py file to identify the containing directory as a package. The __init__.py sometimes contains initialization code, but for our purposes an empty file suffices.
The bash command touch creates an empty file in the specified location. The directory babyweight/trainer must already exist.
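As a rough illustration of the package layout described above, the sketch below builds a throwaway package in a temporary directory and imports a module from it. The names demo_pkg and hello are hypothetical and are not part of the babyweight code.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway directory tree: <root>/demo_pkg/{__init__.py, hello.py}
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "demo_pkg")
os.makedirs(pkg_dir)

# Equivalent of `touch demo_pkg/__init__.py`: an empty file marks the package.
open(os.path.join(pkg_dir, "__init__.py"), "w").close()

# A one-line module inside the package.
with open(os.path.join(pkg_dir, "hello.py"), "w") as f:
    f.write("GREETING = 'hello from demo_pkg'\n")

# Make the parent directory importable, then import the module by its dotted name.
sys.path.insert(0, root)
importlib.invalidate_caches()
hello = importlib.import_module("demo_pkg.hello")
print(hello.GREETING)
```

The same dotted-name convention is what lets trainer.task be run with python -m once babyweight is on PYTHONPATH.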
In [ ]:
%%bash
touch babyweight/trainer/__init__.py
We then use the %%writefile magic to write the contents of the cell below to a file called task.py in the babyweight/trainer folder.
The cell below writes the file babyweight/trainer/task.py, which sets up our training job. Here is where we determine which parameters of our model to pass as flags during training, using the argparse module. Look at how batch_size is passed to the model in the code below. Use this as an example to parse arguments for the following variables:
- nnsize, which represents the hidden layer sizes to use for DNN feature columns
- nembeds, which represents the embedding size of a cross of n key real-valued parameters
- train_examples, which represents the number of examples (in thousands) on which to run the training job
- eval_steps, which represents the positive number of steps for which to evaluate the model
- pattern, which specifies a pattern that has to be in the input files. For example, '00001-of' would process only one shard. For this variable, set 'of' to be the default.
Be sure to include a default value for each of the parsed arguments above and specify the type if necessary.
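The following is a minimal sketch of how a typed flag with a default can be declared with argparse. It is not the reference solution: using nargs="+" for nnsize is an assumption made here so that one or several layer sizes can be passed, and the list default simply mirrors the NNSIZE value in model.py.

```python
import argparse


def build_parser():
    """Return a parser with a few illustrative flags (assumed, not the official solution)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--batch_size",
        help="Number of examples to compute gradient over.",
        type=int,
        default=512,
    )
    # Assumption: accept one or more integers, e.g. --nnsize 128 32
    parser.add_argument(
        "--nnsize",
        help="Hidden layer sizes for the DNN part of the model.",
        type=int,
        nargs="+",
        default=[64, 16, 4],
    )
    parser.add_argument(
        "--pattern",
        help="Substring that must appear in input file names.",
        default="of",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--nnsize", "128", "32", "--pattern", "00001-of"])
print(args.batch_size, args.nnsize, args.pattern)
```

Flags that are omitted on the command line fall back to their defaults, which is why batch_size is still available above even though it was not passed.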
In [ ]:
%%writefile babyweight/trainer/task.py
import argparse
import json
import os
import tensorflow as tf
from . import model
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--bucket",
        help="GCS path to data. We assume that data is in \
        gs://BUCKET/babyweight/preproc/",
        required=True
    )
    parser.add_argument(
        "--output_dir",
        help="GCS location to write checkpoints and export models",
        required=True
    )
    parser.add_argument(
        "--batch_size",
        help="Number of examples to compute gradient over.",
        type=int,
        default=512
    )
    parser.add_argument(
        "--job-dir",
        help="this model ignores this field, but it is required by gcloud",
        default="junk"
    )

    # TODO: Your code goes here

    # TODO: Your code goes here

    # TODO: Your code goes here

    # TODO: Your code goes here

    # TODO: Your code goes here

    # Parse arguments
    args = parser.parse_args()
    arguments = args.__dict__

    # Pop unnecessary args needed for gcloud
    # (argparse stores --job-dir under the key "job_dir")
    arguments.pop("job_dir", None)

    # Assign the arguments to the model variables
    output_dir = arguments.pop("output_dir")
    model.BUCKET = arguments.pop("bucket")
    model.BATCH_SIZE = arguments.pop("batch_size")
    model.TRAIN_STEPS = (
        arguments.pop("train_examples") * 1000) / model.BATCH_SIZE
    model.EVAL_STEPS = arguments.pop("eval_steps")
    print("Will train for {} steps using batch_size={}".format(
        model.TRAIN_STEPS, model.BATCH_SIZE))
    model.PATTERN = arguments.pop("pattern")
    model.NEMBEDS = arguments.pop("nembeds")
    model.NNSIZE = arguments.pop("nnsize")
    print("Will use DNN size of {}".format(model.NNSIZE))

    # Append trial_id to path if we are doing hptuning
    # This code can be removed if you are not using hyperparameter tuning
    output_dir = os.path.join(
        output_dir,
        json.loads(
            os.environ.get("TF_CONFIG", "{}")
        ).get("task", {}).get("trial", "")
    )

    # Run the training job
    model.train_and_evaluate(output_dir)
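The trial-specific output path at the end of task.py can be looked at in isolation. The sketch below wraps the same json and os.path.join logic in a hypothetical helper, trial_output_dir, and takes the TF_CONFIG value as an argument instead of reading the environment; the bucket name is made up.

```python
import json
import os


def trial_output_dir(output_dir, tf_config):
    """Append the hyperparameter-tuning trial id (if any) to output_dir.

    tf_config is the raw TF_CONFIG string, or None when the variable is unset.
    """
    trial = json.loads(tf_config or "{}").get("task", {}).get("trial", "")
    return os.path.join(output_dir, trial)


# During a tuning job, TF_CONFIG carries a trial id under task.trial.
print(trial_output_dir("gs://my-bucket/out", '{"task": {"trial": "7"}}'))

# Outside of tuning there is no trial id, so the base directory is used.
print(trial_output_dir("gs://my-bucket/out", None))
```

Writing each trial to its own subdirectory keeps the checkpoints of different trials from overwriting one another.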
In the same way, we can write the model that we developed in the previous notebooks to the file model.py.
Complete the TODOs in the code cell below to create our model.py. We'll use the code we wrote for the Wide & Deep model. Look back at your 3_tensorflow_wide_deep notebook and copy/paste the necessary code from that notebook into its place in the cell below.
In [ ]:
%%writefile babyweight/trainer/model.py
import shutil
import numpy as np
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.INFO)
BUCKET = None # set from task.py
PATTERN = "of" # gets all files
# Determine CSV and label columns
# TODO: Your code goes here
# Set default values for each CSV column
# TODO: Your code goes here
# Define some hyperparameters
TRAIN_STEPS = 10000
EVAL_STEPS = None
BATCH_SIZE = 512
NEMBEDS = 3
NNSIZE = [64, 16, 4]
# Create an input function reading a file using the Dataset API
# Then provide the results to the Estimator API
def read_dataset(prefix, mode, batch_size):
    def _input_fn():
        def decode_csv(value_column):
            # TODO: Your code goes here

        # Use prefix to create file path
        file_path = "gs://{}/babyweight/preproc/{}*{}*".format(
            BUCKET, prefix, PATTERN)

        # Create list of files that match pattern
        file_list = tf.gfile.Glob(filename=file_path)

        # Create dataset from file list
        # TODO: Your code goes here

        # In training mode, shuffle the dataset and repeat indefinitely
        # TODO: Your code goes here

        dataset = # TODO: Your code goes here

        # This will now return batches of features, label
        return dataset
    return _input_fn

# Define feature columns
def get_wide_deep():
    # TODO: Your code goes here
    return wide, deep

# Create serving input function to be able to serve predictions later using provided inputs
def serving_input_fn():
    # TODO: Your code goes here
    return tf.estimator.export.ServingInputReceiver(
        features=features, receiver_tensors=feature_placeholders)

# create metric for hyperparameter tuning
def my_rmse(labels, predictions):
    pred_values = predictions["predictions"]
    return {"rmse": tf.metrics.root_mean_squared_error(
        labels=labels, predictions=pred_values)}

# Create estimator to train and evaluate
def train_and_evaluate(output_dir):
    # TODO: Your code goes here
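To see how the file_path template in read_dataset narrows the set of input shards, the sketch below imitates the wildcard locally with fnmatch instead of tf.gfile.Glob. The shard names and the helper matching_shards are hypothetical; only the "{}*{}*" template comes from the code above.

```python
import fnmatch

# Hypothetical shard names, used only to illustrate the wildcard.
shards = [
    "train.csv-00000-of-00003",
    "train.csv-00001-of-00003",
    "train.csv-00002-of-00003",
    "eval.csv-00000-of-00001",
]


def matching_shards(prefix, pattern):
    """Return the shard names matched by the prefix*pattern* wildcard."""
    wildcard = "{}*{}*".format(prefix, pattern)
    return fnmatch.filter(shards, wildcard)


# The default pattern "of" matches every training shard.
print(matching_shards("train", "of"))

# A narrower pattern such as "00001-of" selects a single shard.
print(matching_shards("train", "00001-of"))
```

This is why the default PATTERN of "of" reads all files, while a shard-specific pattern is handy for quick development runs.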
After moving the code to a package, make sure it works as a standalone program. Note that we incorporated the --pattern and --train_examples flags so that we don't try to train on the entire dataset while we are developing our pipeline. Once we are sure that everything is working on a subset, we can change the pattern so that we can train on all the data. Even for this subset, training takes about 3 minutes, during which you won't see any output.
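The relationship between --train_examples and the number of training steps can be sketched as below. The helper name train_steps is hypothetical, and integer division is an assumption made here so that the result is a whole number of steps; task.py itself uses plain division.

```python
def train_steps(train_examples_thousands, batch_size):
    """Number of whole batches needed to see the requested number of examples.

    train_examples_thousands follows the flag's convention: it is given in thousands.
    """
    return (train_examples_thousands * 1000) // batch_size


# 200 thousand examples with the default batch size of 512
print(train_steps(200, 512))
```

A small value for --train_examples therefore translates directly into a short run, which is what makes it useful during development.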
In [ ]:
%%bash
echo "bucket=${BUCKET}"
rm -rf babyweight_trained
export PYTHONPATH=${PYTHONPATH}:${PWD}/babyweight
python -m trainer.task \
--bucket= # TODO: Your code goes here
--output_dir= # TODO: Your code goes here
--job-dir=./tmp \
--pattern= # TODO: Your code goes here
--train_examples= # TODO: Your code goes here
--eval_steps= # TODO: Your code goes here
In [ ]:
%%writefile inputs.json
{"is_male": "True", "mother_age": 26.0, "plurality": "Single(1)", "gestation_weeks": 39}
{"is_male": "False", "mother_age": 26.0, "plurality": "Single(1)", "gestation_weeks": 39}
Finish the code in the cell below to run a local prediction job on the inputs.json file we just created. You will need to provide two additional flags:
- model-dir, specifying the location of the model binaries
- json-instances, specifying the location of the JSON file on which you want to predict
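The inputs.json file holds one JSON object per line rather than a single JSON array. As a quick sanity check before predicting, a sketch like the one below parses each line separately; the content is copied from the cell above and the variable names are illustrative.

```python
import json

# Same two instances as in inputs.json, one JSON object per line.
raw = """{"is_male": "True", "mother_age": 26.0, "plurality": "Single(1)", "gestation_weeks": 39}
{"is_male": "False", "mother_age": 26.0, "plurality": "Single(1)", "gestation_weeks": 39}"""

# Parse line by line; json.loads on the whole text would fail.
instances = [json.loads(line) for line in raw.splitlines() if line.strip()]

for instance in instances:
    print(sorted(instance.keys()), instance["is_male"])
```

Note that is_male is passed as the string "True" or "False", not as a JSON boolean.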
In [ ]:
%%bash
MODEL_LOCATION=$(ls -d $(pwd)/babyweight_trained/export/exporter/* | tail -1)
echo $MODEL_LOCATION
gcloud ai-platform local predict # TODO: Your code goes here
In [ ]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/trained_model
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ai-platform jobs submit training $JOBNAME \
--region= # TODO: Your code goes here
--module-name= # TODO: Your code goes here
--package-path= # TODO: Your code goes here
--job-dir= # TODO: Your code goes here
--staging-bucket=gs://$BUCKET \
--scale-tier= #TODO: Your code goes here
--runtime-version= #TODO: Your code goes here
-- \
--bucket=${BUCKET} \
--output_dir=${OUTDIR} \
--train_examples=200000
When I ran it, I used train_examples=2000000. When training finished, I filtered the Stackdriver log on the word "dict" and saw that the last line was:
Saving dict for global step 5714290: average_loss = 1.06473, global_step = 5714290, loss = 34882.4, rmse = 1.03186
The final RMSE was 1.03 pounds.
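The logged numbers can be cross-checked by hand. Assuming that average_loss is the mean squared error (an assumption about the estimator's regression loss, not something the log states), its square root should agree with the reported rmse. The sketch below also includes a plain-Python RMSE on made-up values for reference.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error over two equal-length sequences."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Value copied from the log line above.
logged_average_loss = 1.06473

# If average_loss is the mean squared error, this is the RMSE in pounds.
print(round(math.sqrt(logged_average_loss), 5))

# Made-up labels and predictions, only to exercise the helper.
print(rmse([7.0, 8.0, 6.5], [7.5, 7.0, 6.5]))
```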
All of these are command-line parameters to my program. To do hyperparameter tuning, create hyperparam.yaml and pass it as --config. This step will take 1 hour; you can increase maxParallelTrials or reduce maxTrials to get it done faster. Since maxParallelTrials is the number of initial seeds to start searching from, you don't want it to be too large; otherwise, all you have is a random search.
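Because each tuned hyperparameter reaches the trainer as a command-line flag, the parameterName entries in the tuning configuration need to match the flag names that task.py parses. The sketch below shows that mapping with a hypothetical helper, hyperparams_to_flags; the example values are the ones used in the final training command of this notebook.

```python
def hyperparams_to_flags(hyperparams):
    """Format a dict of hyperparameter values as --name=value flags."""
    return [
        "--{}={}".format(name, value)
        for name, value in sorted(hyperparams.items())
    ]


# Values taken from the tuned training command later in this notebook.
flags = hyperparams_to_flags({"batch_size": 35, "nembeds": 16, "nnsize": 281})
print(" ".join(flags))
```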
In [ ]:
%%writefile hyperparam.yaml
trainingInput:
  scaleTier: STANDARD_1
  hyperparameters:
    hyperparameterMetricTag: rmse
    goal: MINIMIZE
    maxTrials: 20
    maxParallelTrials: 5
    enableTrialEarlyStopping: True
    params:
    - parameterName: batch_size
      type: # TODO: Your code goes here
      minValue: # TODO: Your code goes here
      maxValue: # TODO: Your code goes here
      scaleType: # TODO: Your code goes here
    - parameterName: nembeds
      type: # TODO: Your code goes here
      minValue: # TODO: Your code goes here
      maxValue: # TODO: Your code goes here
      scaleType: # TODO: Your code goes here
    - parameterName: nnsize
      type: # TODO: Your code goes here
      minValue: # TODO: Your code goes here
      maxValue: # TODO: Your code goes here
      scaleType: # TODO: Your code goes here
In [ ]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/hyperparam
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ai-platform jobs submit training $JOBNAME \
--region=$REGION \
--module-name=trainer.task \
--package-path=$(pwd)/babyweight/trainer \
--job-dir=$OUTDIR \
--staging-bucket=gs://$BUCKET \
--scale-tier=STANDARD_1 \
--config=hyperparam.yaml \
--runtime-version=$TFVERSION \
-- \
--bucket=${BUCKET} \
--output_dir=${OUTDIR} \
--eval_steps=10 \
--train_examples=20000
Finally, we repeat the training job, this time with the tuned hyperparameters (note the last line).
In [ ]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/trained_model_tuned
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ai-platform jobs submit training $JOBNAME \
--region=$REGION \
--module-name=trainer.task \
--package-path=$(pwd)/babyweight/trainer \
--job-dir=$OUTDIR \
--staging-bucket=gs://$BUCKET \
--scale-tier=STANDARD_1 \
--runtime-version=$TFVERSION \
-- \
--bucket=${BUCKET} \
--output_dir=${OUTDIR} \
--train_examples=20000 --batch_size=35 --nembeds=16 --nnsize=281
Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License