Learning Objectives
After testing our training pipeline both locally and in the cloud on a subset of the data, we can submit another (much larger) training job to the cloud. It is also a good idea to run a hyperparameter tuning job to make sure we have optimized the hyperparameters of our model.
In this notebook, we'll be training our Keras model at scale using Cloud AI Platform.
In this lab, we will set up the environment, create the trainer module's task.py to hold hyperparameter argparsing code, create the trainer module's model.py to hold Keras model code, run the trainer module package locally, submit a training job to Cloud AI Platform, and submit a hyperparameter tuning job to Cloud AI Platform.
First, we will install the cloudml-hypertune package on our local machine. This is the package we will use to report hyperparameter tuning metrics to Cloud AI Platform. Installing it locally lets us test our trainer package before submitting it.
In [ ]:
!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst
In [ ]:
!pip3 install cloudml-hypertune
Import necessary libraries.
In [ ]:
import os
In [ ]:
%%bash
export PROJECT=$(gcloud config list project --format "value(core.project)")
echo "Your current GCP Project Name is: "${PROJECT}
In [ ]:
# TODO: Change these to try this notebook out
PROJECT = "your-project-name-here" # Replace with your PROJECT
BUCKET = PROJECT # defaults to PROJECT
REGION = "us-central1" # Replace with your REGION
In [ ]:
os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET
os.environ["REGION"] = REGION
os.environ["TFVERSION"] = "2.1"
os.environ["PYTHONVERSION"] = "3.7"
In [ ]:
%%bash
gcloud config set project ${PROJECT}
gcloud config set compute/region ${REGION}
In [ ]:
%%bash
# Create the bucket only if it does not already exist
if ! gsutil ls | grep -q gs://${BUCKET}/; then
    gsutil mb -l ${REGION} gs://${BUCKET}
fi
Verify that you previously created the CSV files we'll use for training and evaluation. If not, go back to lab 1b_prepare_data_babyweight to create them.
In [ ]:
%%bash
gsutil ls gs://${BUCKET}/babyweight/data/*000000000000.csv
Now that we have the Keras wide-and-deep code working on a subset of the data, we can package the TensorFlow code up as a Python module and train it on Cloud AI Platform.
Training on Cloud AI Platform requires packaging the code as a Python package and submitting it with gcloud, as described below. Ensure that the AI Platform API is enabled for your project.
A Python package is simply a collection of one or more .py files along with an __init__.py file that identifies the containing directory as a package. The __init__.py file sometimes contains initialization code, but for our purposes an empty file suffices.
The bash command touch creates an empty file at the specified location. In the cell below, mkdir -p first creates the babyweight/trainer directory (and babyweight itself) if it does not already exist.
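The same package layout can be built and checked from Python. The sketch below uses only the standard library. It creates a hypothetical package called `demo_trainer` in a temporary directory (the name exists only for this illustration), writes an empty `__init__.py` the way `touch` does, and imports the package to confirm it loads as a regular package.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_package(root: Path, name: str) -> Path:
    """Create root/name/__init__.py as an empty file, like `mkdir -p` plus `touch`."""
    pkg_dir = root / name
    pkg_dir.mkdir(parents=True, exist_ok=True)  # mkdir -p
    (pkg_dir / "__init__.py").touch()           # touch: create an empty file
    return pkg_dir


# Hypothetical package name, used only for this demonstration.
root = Path(tempfile.mkdtemp())
pkg_dir = make_package(root, "demo_trainer")
(pkg_dir / "task.py").write_text("GREETING = 'hello from task'\n")

# Adding the parent directory to sys.path does the same job as the
# PYTHONPATH=${PYTHONPATH}:${PWD}/babyweight export in the local run cell.
sys.path.insert(0, str(root))
importlib.invalidate_caches()
pkg = importlib.import_module("demo_trainer")
task = importlib.import_module("demo_trainer.task")

print(Path(pkg.__file__).name)  # __init__.py: a regular package
print(task.GREETING)            # hello from task
```

The explicit `__init__.py` matters beyond importing. Python 3 can also import a directory without one as a namespace package, but packaging helpers such as setuptools' `find_packages()` only discover directories that contain an `__init__.py`.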
In [ ]:
%%bash
mkdir -p babyweight/trainer
touch babyweight/trainer/__init__.py
We then use the %%writefile magic to write the contents of the cell below to a file called task.py in the babyweight/trainer folder.
The cell below writes the file babyweight/trainer/task.py, which sets up our training job. This is where we use the argparse module to decide which parameters of our model are passed as flags during training. Look at how batch_size is passed to the model in the code below, and use it as an example to parse arguments for the following variables:
- nnsize, the hidden layer sizes to use for the DNN feature columns
- nembeds, the embedding size of a cross of n key real-valued parameters
- train_examples, the number of examples (in thousands) to run the training job over
- eval_steps, the positive number of steps for which to evaluate the model
Be sure to include a default value for each of the parsed arguments above, and specify the type if necessary.
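These flags can be tried in isolation. This minimal sketch builds a parser with only the nnsize and nembeds arguments, using the same names, types, and defaults as task.py. It parses hand-written argument lists instead of sys.argv. `nargs="+"` collects one or more space-separated values into a list, and `type=int` is applied to each value separately.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--nnsize",
        help="Hidden layer sizes for DNN -- provide space-separated layers",
        nargs="+",  # one or more values, collected into a list
        type=int,   # converts each value individually
        default=[128, 32, 4])
    parser.add_argument(
        "--nembeds",
        help="Embedding size of a cross of n key real-valued parameters",
        type=int,
        default=3)
    return parser


parser = build_parser()
defaults = parser.parse_args([])
custom = parser.parse_args(["--nnsize", "64", "16", "--nembeds=8"])
print(defaults.nnsize, defaults.nembeds)  # [128, 32, 4] 3
print(custom.nnsize, custom.nembeds)      # [64, 16] 8
```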
In [ ]:
%%writefile babyweight/trainer/task.py
import argparse
import json
import os

from trainer import model

import tensorflow as tf

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--job-dir",
        help="this model ignores this field, but it is required by gcloud",
        default="junk"
    )
    parser.add_argument(
        "--train_data_path",
        help="GCS location of training data",
        required=True
    )
    parser.add_argument(
        "--eval_data_path",
        help="GCS location of evaluation data",
        required=True
    )
    parser.add_argument(
        "--output_dir",
        help="GCS location to write checkpoints and export models",
        required=True
    )
    parser.add_argument(
        "--batch_size",
        help="Number of examples to compute gradient over.",
        type=int,
        default=512
    )
    parser.add_argument(
        "--nnsize",
        help="Hidden layer sizes for DNN -- provide space-separated layers",
        nargs="+",
        type=int,
        default=[128, 32, 4]
    )
    parser.add_argument(
        "--nembeds",
        help="Embedding size of a cross of n key real-valued parameters",
        type=int,
        default=3
    )
    parser.add_argument(
        "--num_epochs",
        help="Number of epochs to train the model.",
        type=int,
        default=10
    )
    parser.add_argument(
        "--train_examples",
        help="""Number of examples (in thousands) to run the training job over.
        If this is more than actual # of examples available, it cycles through
        them. So specifying 1000 here when you have only 100k examples makes
        this 10 epochs.""",
        type=int,
        default=5000
    )
    parser.add_argument(
        "--eval_steps",
        help="""Positive number of steps for which to evaluate model. Default
        to None, which means to evaluate until input_fn raises an end-of-input
        exception""",
        type=int,
        default=None
    )

    # Parse all arguments
    args = parser.parse_args()
    arguments = args.__dict__

    # Unused args provided by service
    arguments.pop("job_dir", None)
    arguments.pop("job-dir", None)

    # Modify some arguments
    arguments["train_examples"] *= 1000

    # Append trial_id to path if we are doing hptuning
    # This code can be removed if you are not using hyperparameter tuning
    arguments["output_dir"] = os.path.join(
        arguments["output_dir"],
        json.loads(
            os.environ.get("TF_CONFIG", "{}")
        ).get("task", {}).get("trial", "")
    )

    # Run the training job
    model.train_and_evaluate(arguments)
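The last few lines of task.py append a trial id to output_dir so that parallel tuning trials do not overwrite each other's checkpoints and exports. The sketch below pulls that expression out into a hypothetical helper, `trial_output_dir`, so it can be called with a TF_CONFIG string directly. The JSON shape (`{"task": {"trial": "3"}}`) is an assumption modeled on the keys task.py reads, and `my-bucket` is a placeholder. The output shown assumes POSIX path separators.

```python
import json
import os


def trial_output_dir(output_dir: str, tf_config_json: str) -> str:
    """Mirror task.py: append TF_CONFIG's task.trial (if present) to output_dir."""
    trial = json.loads(tf_config_json).get("task", {}).get("trial", "")
    return os.path.join(output_dir, trial)


# Not a tuning job: no trial id, so only a trailing "/" is added.
print(trial_output_dir("gs://my-bucket/hp", "{}"))  # gs://my-bucket/hp/
# Tuning trial 3 (hypothetical TF_CONFIG value).
print(trial_output_dir("gs://my-bucket/hp", '{"task": {"trial": "3"}}'))  # gs://my-bucket/hp/3
```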
In the same way, we can write the model that we developed in the previous notebooks to the file model.py.
To create our model.py, we'll use the code we wrote for the Wide & Deep model. Look back at your 9_keras_wide_and_deep_babyweight notebook and copy/paste the necessary code from that notebook into its place in the cell below.
In [ ]:
%%writefile babyweight/trainer/model.py
import datetime
import os
import shutil

import numpy as np
import tensorflow as tf
import hypertune

# Determine CSV, label, and key columns
CSV_COLUMNS = ["weight_pounds",
               "is_male",
               "mother_age",
               "plurality",
               "gestation_weeks"]
LABEL_COLUMN = "weight_pounds"

# Set default values for each CSV column.
# Treat is_male and plurality as strings.
DEFAULTS = [[0.0], ["null"], [0.0], ["null"], [0.0]]


def features_and_labels(row_data):
    """Splits features and labels from feature dictionary.

    Args:
        row_data: Dictionary of CSV column names and tensor values.
    Returns:
        Dictionary of feature tensors and label tensor.
    """
    label = row_data.pop(LABEL_COLUMN)

    return row_data, label  # features, label


def load_dataset(pattern, batch_size=1, mode='eval'):
    """Loads dataset using the tf.data API from CSV files.

    Args:
        pattern: str, file pattern to glob into list of files.
        batch_size: int, the number of examples per batch.
        mode: 'train' | 'eval' to determine if training or evaluating.
    Returns:
        `Dataset` object.
    """
    print("mode = {}".format(mode))
    # Make a CSV dataset. num_epochs=1 makes a single pass over the files;
    # the default (None) repeats forever, so evaluation without
    # --eval_steps would never finish.
    dataset = tf.data.experimental.make_csv_dataset(
        file_pattern=pattern,
        batch_size=batch_size,
        column_names=CSV_COLUMNS,
        column_defaults=DEFAULTS,
        num_epochs=1)

    # Map dataset to features and label
    dataset = dataset.map(map_func=features_and_labels)  # features, label

    # Shuffle and repeat for training
    if mode == 'train':
        dataset = dataset.shuffle(buffer_size=1000).repeat()

    # Prefetch one batch so input processing overlaps with training
    dataset = dataset.prefetch(buffer_size=1)
    return dataset


def create_input_layers():
    """Creates dictionary of input layers for each feature.

    Returns:
        Dictionary of `tf.keras.layers.Input` layers for each feature.
    """
    deep_inputs = {
        colname: tf.keras.layers.Input(
            name=colname, shape=(), dtype="float32")
        for colname in ["mother_age", "gestation_weeks"]
    }

    wide_inputs = {
        colname: tf.keras.layers.Input(
            name=colname, shape=(), dtype="string")
        for colname in ["is_male", "plurality"]
    }

    inputs = {**wide_inputs, **deep_inputs}

    return inputs


def categorical_fc(name, values):
    """Helper function to wrap categorical feature by indicator column.

    Args:
        name: str, name of feature.
        values: list, list of strings of categorical values.
    Returns:
        Categorical and indicator column of categorical feature.
    """
    cat_column = tf.feature_column.categorical_column_with_vocabulary_list(
        key=name, vocabulary_list=values)
    ind_column = tf.feature_column.indicator_column(
        categorical_column=cat_column)

    return cat_column, ind_column


def create_feature_columns(nembeds):
    """Creates wide and deep dictionaries of feature columns from inputs.

    Args:
        nembeds: int, number of dimensions to embed categorical column down to.
    Returns:
        Wide and deep dictionaries of feature columns.
    """
    deep_fc = {
        colname: tf.feature_column.numeric_column(key=colname)
        for colname in ["mother_age", "gestation_weeks"]
    }

    wide_fc = {}
    is_male, wide_fc["is_male"] = categorical_fc(
        "is_male", ["True", "False", "Unknown"])
    plurality, wide_fc["plurality"] = categorical_fc(
        "plurality", ["Single(1)", "Twins(2)", "Triplets(3)",
                      "Quadruplets(4)", "Quintuplets(5)", "Multiple(2+)"])

    # Bucketize the float fields. This makes them wide
    age_buckets = tf.feature_column.bucketized_column(
        source_column=deep_fc["mother_age"],
        boundaries=np.arange(15, 45, 1).tolist())
    wide_fc["age_buckets"] = tf.feature_column.indicator_column(
        categorical_column=age_buckets)

    gestation_buckets = tf.feature_column.bucketized_column(
        source_column=deep_fc["gestation_weeks"],
        boundaries=np.arange(17, 47, 1).tolist())
    wide_fc["gestation_buckets"] = tf.feature_column.indicator_column(
        categorical_column=gestation_buckets)

    # Cross all the wide columns, have to do the crossing before we one-hot
    crossed = tf.feature_column.crossed_column(
        keys=[age_buckets, gestation_buckets],
        hash_bucket_size=1000)
    deep_fc["crossed_embeds"] = tf.feature_column.embedding_column(
        categorical_column=crossed, dimension=nembeds)

    return wide_fc, deep_fc


def get_model_outputs(wide_inputs, deep_inputs, dnn_hidden_units):
    """Creates model architecture and returns outputs.

    Args:
        wide_inputs: Dense tensor used as inputs to wide side of model.
        deep_inputs: Dense tensor used as inputs to deep side of model.
        dnn_hidden_units: List of integers where length is number of hidden
            layers and ith element is the number of neurons at ith layer.
    Returns:
        Dense tensor output from the model.
    """
    # Hidden layers for the deep side
    layers = [int(x) for x in dnn_hidden_units]
    deep = deep_inputs
    for layerno, numnodes in enumerate(layers):
        deep = tf.keras.layers.Dense(
            units=numnodes,
            activation="relu",
            name="dnn_{}".format(layerno+1))(deep)
    deep_out = deep

    # Linear model for the wide side
    wide_out = tf.keras.layers.Dense(
        units=10, activation="relu", name="linear")(wide_inputs)

    # Concatenate the two sides
    both = tf.keras.layers.concatenate(
        inputs=[deep_out, wide_out], name="both")

    # Final output is a linear activation because this is regression
    output = tf.keras.layers.Dense(
        units=1, activation="linear", name="weight")(both)

    return output


def rmse(y_true, y_pred):
    """Calculates RMSE evaluation metric.

    Args:
        y_true: tensor, true labels.
        y_pred: tensor, predicted labels.
    Returns:
        Tensor with value of RMSE between true and predicted labels.
    """
    return tf.sqrt(tf.reduce_mean(tf.square(y_pred - y_true)))


def build_wide_deep_model(dnn_hidden_units=[64, 32], nembeds=3):
    """Builds wide and deep model using Keras Functional API.

    Args:
        dnn_hidden_units: List of integers, neurons per hidden layer on the
            deep side.
        nembeds: int, number of dimensions to embed the feature cross down to.
    Returns:
        `tf.keras.models.Model` object.
    """
    # Create input layers
    inputs = create_input_layers()

    # Create feature columns for both wide and deep
    wide_fc, deep_fc = create_feature_columns(nembeds)

    # The constructor for DenseFeatures takes a list of numeric columns
    # The Functional API in Keras requires: LayerConstructor()(inputs)
    wide_inputs = tf.keras.layers.DenseFeatures(
        feature_columns=wide_fc.values(), name="wide_inputs")(inputs)
    deep_inputs = tf.keras.layers.DenseFeatures(
        feature_columns=deep_fc.values(), name="deep_inputs")(inputs)

    # Get output of model given inputs
    output = get_model_outputs(wide_inputs, deep_inputs, dnn_hidden_units)

    # Build model and compile it all together
    model = tf.keras.models.Model(inputs=inputs, outputs=output)
    model.compile(optimizer="adam", loss="mse", metrics=[rmse, "mse"])

    return model


def train_and_evaluate(args):
    """Trains and exports the model, then reports val RMSE for tuning.

    Args:
        args: dict of parsed command-line arguments from task.py.
    """
    model = build_wide_deep_model(args["nnsize"], args["nembeds"])
    print("Here is our Wide-and-Deep architecture so far:\n")
    model.summary()  # summary() prints itself and returns None

    trainds = load_dataset(
        args["train_data_path"],
        args["batch_size"],
        'train')

    evalds = load_dataset(
        args["eval_data_path"], 1000, 'eval')
    if args["eval_steps"]:
        evalds = evalds.take(count=args["eval_steps"])

    # Spread train_examples over num_epochs epochs of batch_size examples each
    steps_per_epoch = args["train_examples"] // (
        args["batch_size"] * args["num_epochs"])

    checkpoint_path = os.path.join(args["output_dir"], "checkpoints/babyweight")
    cp_callback = tf.keras.callbacks.ModelCheckpoint(
        filepath=checkpoint_path, verbose=1, save_weights_only=True)

    history = model.fit(
        trainds,
        validation_data=evalds,
        epochs=args["num_epochs"],
        steps_per_epoch=steps_per_epoch,
        verbose=2,  # 0=silent, 1=progress bar, 2=one line per epoch
        callbacks=[cp_callback])

    EXPORT_PATH = os.path.join(
        args["output_dir"], datetime.datetime.now().strftime("%Y%m%d%H%M%S"))
    tf.saved_model.save(
        obj=model, export_dir=EXPORT_PATH)  # with default serving function

    # Report the final validation RMSE to the hyperparameter tuning service
    hp_metric = history.history['val_rmse'][-1]

    hpt = hypertune.HyperTune()
    hpt.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag='rmse',
        metric_value=hp_metric,
        global_step=args['num_epochs'])

    print("Exported trained model to {}".format(EXPORT_PATH))
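The bucket and cross sizes in create_feature_columns can be worked out by hand. A bucketized column with N boundaries yields N + 1 buckets: one below the first boundary and one at or above the last. Both np.arange(15, 45, 1) and np.arange(17, 47, 1) have 30 boundaries, so each feature gets 31 buckets. The sketch below is plain Python. It uses bisect to mimic that bucket assignment as an illustration only; it is not TensorFlow's implementation.

```python
import bisect

# Same boundaries as create_feature_columns: np.arange(15, 45, 1), np.arange(17, 47, 1)
AGE_BOUNDARIES = list(range(15, 45))
GESTATION_BOUNDARIES = list(range(17, 47))


def bucket_index(value: float, boundaries: list) -> int:
    """Bucket for value, with buckets (-inf, b0), [b0, b1), ..., [b_last, inf)."""
    return bisect.bisect_right(boundaries, value)


num_age_buckets = len(AGE_BOUNDARIES) + 1
num_gestation_buckets = len(GESTATION_BOUNDARIES) + 1
num_crosses = num_age_buckets * num_gestation_buckets

print(num_age_buckets, num_gestation_buckets)  # 31 31
print(num_crosses)                             # 961 possible (age, gestation) pairs
print(bucket_index(14.0, AGE_BOUNDARIES))      # 0  (below the first boundary)
print(bucket_index(15.0, AGE_BOUNDARIES))      # 1  (15 falls in [15, 16))
print(bucket_index(50.0, AGE_BOUNDARIES))      # 30 (at or above 44)
```

The 961 possible pairs are hashed into hash_bucket_size=1000 buckets, so some distinct pairs can share a bucket. The embedding column then compresses them down to nembeds dimensions.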
After moving the code to a package, make sure it works standalone. Note that we incorporated the --train_examples flag so that we don't try to train on the entire dataset while we are developing our pipeline. Once we are sure that everything works on a subset, we can change it so that we train on all the data. Even for this subset, this takes about 3 minutes, during which you won't see any output.
In [ ]:
%%bash
OUTDIR=babyweight_trained
rm -rf ${OUTDIR}
export PYTHONPATH=${PYTHONPATH}:${PWD}/babyweight
python3 -m trainer.task \
--job-dir=./tmp \
--train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
--eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
--output_dir=${OUTDIR} \
--batch_size=10 \
--num_epochs=1 \
--train_examples=1 \
--eval_steps=1
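The number of training steps per epoch follows from two lines of the trainer: task.py multiplies --train_examples by 1000, and model.py divides the result by batch_size × num_epochs. The sketch below reproduces that arithmetic in a hypothetical helper, `steps_per_epoch`. It uses the flag values from the local run above and from the cloud training job in this notebook.

```python
def steps_per_epoch(train_examples_thousands: int, batch_size: int, num_epochs: int) -> int:
    """Mirror task.py (x1000) and model.py (integer division) arithmetic."""
    train_examples = train_examples_thousands * 1000    # task.py
    return train_examples // (batch_size * num_epochs)  # model.py


# Local smoke test: --train_examples=1 --batch_size=10 --num_epochs=1
print(steps_per_epoch(1, 10, 1))       # 100
# Cloud job: --train_examples=10000 --batch_size=32 --num_epochs=10
print(steps_per_epoch(10000, 32, 10))  # 31250
```

So the local run trains for only 100 steps on 1,000 examples. That is enough to exercise the whole pipeline without waiting on real training.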
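To submit to the cloud, we use gcloud ai-platform jobs submit training [jobname] and specify some additional parameters for the AI Platform Training Service, such as the region, module name, package path, job directory, staging bucket, machine type, scale tier, runtime version, and Python version.
Below the -- \ line, we add the arguments for our task.py file.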
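The bare `--` separates gcloud's own flags on the left from the user arguments on the right, which are forwarded to trainer.task. The sketch below makes that split visible. `build_submit_command` is a hypothetical helper, not part of gcloud. It only assembles the argument list and does not run anything, and the job id and flag values are placeholders.

```python
import shlex


def build_submit_command(job_id: str, gcloud_flags: dict, user_args: dict) -> list:
    """Assemble: gcloud ... training JOB [service flags] -- [task.py flags]."""
    cmd = ["gcloud", "ai-platform", "jobs", "submit", "training", job_id]
    cmd += [f"--{name}={value}" for name, value in gcloud_flags.items()]
    cmd.append("--")  # everything after this is passed through to trainer.task
    cmd += [f"--{name}={value}" for name, value in user_args.items()]
    return cmd


cmd = build_submit_command(
    "babyweight_demo",  # hypothetical job id
    {"region": "us-central1", "module-name": "trainer.task"},
    {"batch_size": 32, "nembeds": 8},
)
print(" ".join(shlex.quote(part) for part in cmd))
```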
In [ ]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/trained_model
JOBID=babyweight_$(date -u +%y%m%d_%H%M%S)
gcloud ai-platform jobs submit training ${JOBID} \
--region=${REGION} \
--module-name=trainer.task \
--package-path=$(pwd)/babyweight/trainer \
--job-dir=${OUTDIR} \
--staging-bucket=gs://${BUCKET} \
--master-machine-type=n1-standard-8 \
--scale-tier=CUSTOM \
--runtime-version=${TFVERSION} \
--python-version=${PYTHONVERSION} \
-- \
--train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
--eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
--output_dir=${OUTDIR} \
--num_epochs=10 \
--train_examples=10000 \
--eval_steps=100 \
--batch_size=32 \
--nembeds=8
The training job should complete within 10 to 15 minutes. You do not need to wait for it to finish before moving forward in the notebook, but you will need a trained model to complete the next lab.
The hyperparameters we tune (batch_size and nembeds) are command-line parameters to our program. To do hyperparameter tuning, create hyperparam.yaml and pass it as --config hyperparam.yaml.
This step will take up to 2 hours; you can increase maxParallelTrials or reduce maxTrials to get it done faster. Since maxParallelTrials is the number of initial seeds to start searching from, you don't want it to be too large; otherwise, all you have is a random search.
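One rough way to see this trade-off: with maxTrials trials and at most maxParallelTrials running at once, the job needs at least ceil(maxTrials / maxParallelTrials) rounds. The sketch below assumes all trials take about the same time and ignores early stopping, so it is an illustration rather than a model of the service. More parallelism means fewer rounds and a shorter wall-clock time, but also fewer rounds in which earlier results can guide later choices. With all 20 trials in parallel, nothing is learned between trials.

```python
import math


def sequential_rounds(max_trials: int, max_parallel_trials: int) -> int:
    """Lower bound on rounds if max_parallel_trials trials run at a time."""
    return math.ceil(max_trials / max_parallel_trials)


for parallel in (1, 5, 10, 20):
    print(parallel, sequential_rounds(20, parallel))
# 1 20
# 5 4
# 10 2
# 20 1
```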
In [ ]:
%%writefile hyperparam.yaml
trainingInput:
  scaleTier: STANDARD_1
  hyperparameters:
    hyperparameterMetricTag: rmse
    goal: MINIMIZE
    maxTrials: 20
    maxParallelTrials: 5
    enableTrialEarlyStopping: True
    params:
    - parameterName: batch_size
      type: INTEGER
      minValue: 8
      maxValue: 512
      scaleType: UNIT_LOG_SCALE
    - parameterName: nembeds
      type: INTEGER
      minValue: 3
      maxValue: 30
      scaleType: UNIT_LINEAR_SCALE
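batch_size uses UNIT_LOG_SCALE and nembeds uses UNIT_LINEAR_SCALE. The tuning service's internals are not shown in this notebook. The sketch below simply assumes the usual mappings from a unit value u in [0, 1] to the parameter range: exp(log(min) + u × (log(max) − log(min))) for log scale, and min + u × (max − min) for linear scale. Under that assumption, the middle of [8, 512] on a log scale is the geometric mean 64, not the arithmetic midpoint 260. Small batch sizes therefore get as much search attention as large ones.

```python
import math


def from_unit_log_scale(u: float, min_value: float, max_value: float) -> float:
    """Map u in [0, 1] to [min_value, max_value], uniformly in log space."""
    lo, hi = math.log(min_value), math.log(max_value)
    return math.exp(lo + u * (hi - lo))


def from_unit_linear_scale(u: float, min_value: float, max_value: float) -> float:
    """Map u in [0, 1] to [min_value, max_value], uniformly in raw space."""
    return min_value + u * (max_value - min_value)


for u in (0.0, 0.25, 0.5, 0.75, 1.0):
    log_value = round(from_unit_log_scale(u, 8, 512))
    linear_value = round(from_unit_linear_scale(u, 8, 512))
    print(u, log_value, linear_value)
```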
In [ ]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/hp_tuning
JOBID=babyweight_$(date -u +%y%m%d_%H%M%S)
gcloud ai-platform jobs submit training ${JOBID} \
--region=${REGION} \
--module-name=trainer.task \
--package-path=$(pwd)/babyweight/trainer \
--job-dir=${OUTDIR} \
--staging-bucket=gs://${BUCKET} \
--master-machine-type=n1-standard-8 \
--scale-tier=CUSTOM \
--config=hyperparam.yaml \
--runtime-version=${TFVERSION} \
--python-version=${PYTHONVERSION} \
-- \
--train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
--eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
--output_dir=${OUTDIR} \
--num_epochs=10 \
--train_examples=5000 \
--eval_steps=100 \
--batch_size=32 \
--nembeds=8
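This tuning command still passes --batch_size=32 and --nembeds=8 after the `--`. This sketch assumes (not shown in this notebook) that the tuning service adds each trial's values as extra command-line flags after the user arguments. If so, argparse keeps the last occurrence of a repeated option, so the trial values win over the fixed ones. The snippet checks only that argparse behavior, using hypothetical trial values.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=512)
parser.add_argument("--nembeds", type=int, default=3)

# Fixed user args first, then hypothetical trial values appended afterwards.
argv = ["--batch_size=32", "--nembeds=8", "--batch_size=128", "--nembeds=17"]
args = parser.parse_args(argv)
print(args.batch_size, args.nembeds)  # 128 17
```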
In [ ]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/trained_model_tuned
JOBID=babyweight_$(date -u +%y%m%d_%H%M%S)
echo ${OUTDIR} ${REGION} ${JOBID}
gsutil -m rm -rf ${OUTDIR}
gcloud ai-platform jobs submit training ${JOBID} \
--region=${REGION} \
--module-name=trainer.task \
--package-path=$(pwd)/babyweight/trainer \
--job-dir=${OUTDIR} \
--staging-bucket=gs://${BUCKET} \
--master-machine-type=n1-standard-8 \
--scale-tier=CUSTOM \
--runtime-version=${TFVERSION} \
--python-version=${PYTHONVERSION} \
-- \
--train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
--eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
--output_dir=${OUTDIR} \
--num_epochs=10 \
--train_examples=20000 \
--eval_steps=100 \
--batch_size=32 \
--nembeds=8
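The cell above hardcodes batch_size=32 and nembeds=8. To use the values the tuning job found instead, one option is to read them from the job description, for example the JSON printed by `gcloud ai-platform jobs describe JOB_ID --format=json`. The field names below (`trainingOutput.trials`, `finalMetric.objectiveValue`, `hyperparameters`) are assumptions based on the AI Platform job resource, so check them against your own job's output. hyperparam.yaml sets goal: MINIMIZE, so the sketch picks the trial with the lowest objective value. `best_trial` and `trainer_flags` are hypothetical helpers.

```python
def best_trial(job: dict) -> dict:
    """Return the trial with the lowest final objective value (goal: MINIMIZE)."""
    trials = job.get("trainingOutput", {}).get("trials", [])
    # Stopped or failed trials may have no finalMetric; skip them.
    finished = [t for t in trials if "objectiveValue" in t.get("finalMetric", {})]
    if not finished:
        raise ValueError("no completed trials with a final metric")
    return min(finished, key=lambda t: float(t["finalMetric"]["objectiveValue"]))


def trainer_flags(trial: dict) -> list:
    """Turn a trial's hyperparameters into task.py flags such as --nembeds=8."""
    return [f"--{name}={value}" for name, value in sorted(trial["hyperparameters"].items())]


# Placeholder job description for illustration only; these are not real results.
example_job = {
    "trainingOutput": {
        "trials": [
            {"trialId": "1", "hyperparameters": {"batch_size": "64", "nembeds": "8"},
             "finalMetric": {"objectiveValue": 2.0}},
            {"trialId": "2", "hyperparameters": {"batch_size": "128", "nembeds": "20"},
             "finalMetric": {"objectiveValue": 1.0}},
            {"trialId": "3", "hyperparameters": {"batch_size": "16", "nembeds": "4"}},
        ]
    }
}
best = best_trial(example_job)
print(best["trialId"], trainer_flags(best))
```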
In this lab, we set up the environment, created the trainer module's task.py to hold hyperparameter argparsing code, created the trainer module's model.py to hold Keras model code, ran the trainer module package locally, submitted a training job to Cloud AI Platform, and submitted a hyperparameter tuning job to Cloud AI Platform.
Though we can directly submit TensorFlow 2.1 models using the gcloud ai-platform jobs submit training command, we can also submit containerized models for training. One advantage of this approach is that we can use frameworks not natively supported by Cloud AI Platform and have more control over the environment in which the training loop runs.
The rest of this notebook is dedicated to the containerized model approach.
In [ ]:
%%writefile babyweight/Dockerfile
FROM gcr.io/deeplearning-platform-release/tf2-cpu
COPY trainer /babyweight/trainer
RUN apt-get update && \
    apt-get install --yes python3-pip && \
    pip3 install --upgrade --quiet tensorflow==2.1 && \
    pip3 install --upgrade --quiet cloudml-hypertune
ENV PYTHONPATH ${PYTHONPATH}:/babyweight
# Absolute path, so the entrypoint does not depend on the image's WORKDIR
ENTRYPOINT ["python3", "/babyweight/trainer/task.py"]
In [ ]:
%%writefile babyweight/push_docker.sh
export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export IMAGE_REPO_NAME=babyweight_training_container
export IMAGE_URI=gcr.io/${PROJECT_ID}/${IMAGE_REPO_NAME}
echo "Building $IMAGE_URI"
docker build -f Dockerfile -t ${IMAGE_URI} ./
echo "Pushing $IMAGE_URI"
docker push ${IMAGE_URI}
Note: If you get a permissions/stat error when running push_docker.sh from Notebooks, run it from Cloud Shell instead, which you can open from the GCP Console.
This step takes 5-10 minutes to run.
In [ ]:
%%bash
cd babyweight
bash push_docker.sh
In [ ]:
%%bash
export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export IMAGE_REPO_NAME=babyweight_training_container
export IMAGE_URI=gcr.io/${PROJECT_ID}/${IMAGE_REPO_NAME}
echo "Running $IMAGE_URI"
docker run ${IMAGE_URI} \
--train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
--eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
--output_dir=gs://${BUCKET}/babyweight/trained_model \
--batch_size=10 \
--num_epochs=10 \
--train_examples=1 \
--eval_steps=1
In [ ]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/trained_model
JOBID=babyweight_$(date -u +%y%m%d_%H%M%S)
echo ${OUTDIR} ${REGION} ${JOBID}
gsutil -m rm -rf ${OUTDIR}
IMAGE=gcr.io/${PROJECT}/babyweight_training_container
gcloud ai-platform jobs submit training ${JOBID} \
--staging-bucket=gs://${BUCKET} \
--region=${REGION} \
--master-image-uri=${IMAGE} \
--master-machine-type=n1-standard-4 \
--scale-tier=CUSTOM \
-- \
--train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
--eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
--output_dir=${OUTDIR} \
--num_epochs=10 \
--train_examples=20000 \
--eval_steps=100 \
--batch_size=32 \
--nembeds=8
When I ran it, I used train_examples=2000000. When training finished, I filtered in the Stackdriver log on the word "dict" and saw that the last line was:
Saving dict for global step 5714290: average_loss = 1.06473, global_step = 5714290, loss = 34882.4, rmse = 1.03186
The final RMSE was 1.03 pounds.
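To pull the metrics out of a log line like the one above in code, a small regular expression over the `name = number` pairs is enough. This sketch assumes the layout shown in that line (a prefix ending in a colon, then comma-separated `name = value` pairs). `parse_metrics` is a hypothetical helper.

```python
import re

LOG_LINE = ("Saving dict for global step 5714290: average_loss = 1.06473, "
            "global_step = 5714290, loss = 34882.4, rmse = 1.03186")


def parse_metrics(line: str) -> dict:
    """Extract `name = number` pairs from the text after the first colon."""
    _, _, metrics_part = line.partition(":")
    pairs = re.findall(r"(\w+)\s*=\s*([-+0-9.eE]+)", metrics_part)
    return {name: float(value) for name, value in pairs}


metrics = parse_metrics(LOG_LINE)
print(round(metrics["rmse"], 2))  # 1.03
```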
The hyperparameters we tune (batch_size and nembeds) are command-line parameters to our program. To do hyperparameter tuning, create hyperparam.yaml and pass it as --config hyperparam.yaml.
This step will take up to 2 hours; you can increase maxParallelTrials or reduce maxTrials to get it done faster. Since maxParallelTrials is the number of initial seeds to start searching from, you don't want it to be too large; otherwise, all you have is a random search.
Note that this is the same hyperparam.yaml file as above, included here for convenience.
In [ ]:
%%writefile hyperparam.yaml
trainingInput:
  scaleTier: STANDARD_1
  hyperparameters:
    hyperparameterMetricTag: rmse
    goal: MINIMIZE
    maxTrials: 20
    maxParallelTrials: 5
    enableTrialEarlyStopping: True
    params:
    - parameterName: batch_size
      type: INTEGER
      minValue: 8
      maxValue: 512
      scaleType: UNIT_LOG_SCALE
    - parameterName: nembeds
      type: INTEGER
      minValue: 3
      maxValue: 30
      scaleType: UNIT_LINEAR_SCALE
In [ ]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/hyperparam
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
echo ${OUTDIR} ${REGION} ${JOBNAME}
gsutil -m rm -rf ${OUTDIR}
IMAGE=gcr.io/${PROJECT}/babyweight_training_container
gcloud ai-platform jobs submit training ${JOBNAME} \
--staging-bucket=gs://${BUCKET} \
--region=${REGION} \
--master-image-uri=${IMAGE} \
--master-machine-type=n1-standard-8 \
--scale-tier=CUSTOM \
--config=hyperparam.yaml \
-- \
--train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
--eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
--output_dir=${OUTDIR} \
--num_epochs=10 \
--train_examples=5000 \
--eval_steps=100
In [ ]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/trained_model_tuned
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
echo ${OUTDIR} ${REGION} ${JOBNAME}
gsutil -m rm -rf ${OUTDIR}
IMAGE=gcr.io/${PROJECT}/babyweight_training_container
gcloud ai-platform jobs submit training ${JOBNAME} \
--staging-bucket=gs://${BUCKET} \
--region=${REGION} \
--master-image-uri=${IMAGE} \
--master-machine-type=n1-standard-4 \
--scale-tier=CUSTOM \
-- \
--train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
--eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
--output_dir=${OUTDIR} \
--num_epochs=10 \
--train_examples=20000 \
--eval_steps=100 \
--batch_size=32 \
--nembeds=8
In this lab, we set up the environment, created the trainer module's task.py to hold hyperparameter argparsing code, created the trainer module's model.py to hold Keras model code, built a container to run the trainer package, ran the trainer module package locally, submitted a training job to Cloud AI Platform, and submitted a hyperparameter tuning job to Cloud AI Platform.
Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License