Learning Objectives
cloudml-hypertune
to report the results for Cloud hyperparameter tuning trial runs.yaml
file for submitting a Cloud hyperparameter tuning jobLet's see if we can improve upon that by tuning our hyperparameters.
Hyperparameters are parameters that are set prior to training a model, as opposed to parameters which are learned during training.
These include learning rate and batch size, but also model design parameters such as type of activation function and number of hidden units.
Here are the four most common ways to finding the ideal hyperparameters:
1. Manual
Traditionally, hyperparameter tuning is a manual trial and error process. A data scientist has some intution about suitable hyperparameters which they use as a starting point, then they observe the result and use that information to try a new set of hyperparameters to try to beat the existing performance.
Pros
Cons
2. Grid Search
On the other extreme we can use grid search. Define a discrete set of values to try for each hyperparameter then try every possible combination.
Pros
Cons
3. Random Search
Alternatively define a range for each hyperparamter (e.g. 0-256) and sample uniformly at random from that range.
Pros
Cons
4. Bayesian Optimization
Unlike Grid Search and Random Search, Bayesian Optimization takes into account information from past trials to select parameters for future trials. The details of how this is done is beyond the scope of this notebook, but if you're interested you can read how it works here here.
Pros
Cons
AI Platform HyperTune
AI Platform HyperTune, powered by Google Vizier, uses Bayesian Optimization by default, but also supports Grid Search and Random Search.
When tuning just a few hyperparameters (say less than 4), Grid Search and Random Search work well, but when tunining several hyperparameters and the search space is large Bayesian Optimization is best.
In [ ]:
PROJECT = <YOUR PROJECT>
BUCKET = <YOUR BUCKET>
REGION = <YOUR REGION>
TFVERSION = "2.1" # TF version for AI Platform to use
In [ ]:
import os
os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET
os.environ["REGION"] = REGION
os.environ["TFVERSION"] = TFVERSION
In the previous lab, we moved our code into a python package for training on Cloud AI Platform. Let's just check that the files are there. You should see the following files in the taxifare/trainer
directory:
__init__.py
model.py
task.py
In [ ]:
!ls -la taxifare/trainer
To use hyperparameter tuning in your training job you must perform the following steps:
Specify the hyperparameter tuning configuration for your training job by including a HyperparameterSpec in your TrainingInput object.
Include the following code in your training application:
Parse the command-line arguments representing the hyperparameters you want to tune, and use the values to set the hyperparameters for your training trial. Add your hyperparameter metric to the summary for your graph.
To submit a hyperparameter tuning job, we must modify model.py
and task.py
to expose any variables we want to tune as command line arguments.
In [ ]:
%%writefile ./taxifare/trainer/model.py
import datetime
import hypertune
import logging
import os
import shutil
import numpy as np
import tensorflow as tf
from tensorflow.keras import activations
from tensorflow.keras import callbacks
from tensorflow.keras import layers
from tensorflow.keras import models
from tensorflow import feature_column as fc
logging.info(tf.version.VERSION)
CSV_COLUMNS = [
'fare_amount',
'pickup_datetime',
'pickup_longitude',
'pickup_latitude',
'dropoff_longitude',
'dropoff_latitude',
'passenger_count',
'key',
]
LABEL_COLUMN = 'fare_amount'
DEFAULTS = [[0.0], ['na'], [0.0], [0.0], [0.0], [0.0], [0.0], ['na']]
DAYS = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']
def features_and_labels(row_data):
for unwanted_col in ['key']:
row_data.pop(unwanted_col)
label = row_data.pop(LABEL_COLUMN)
return row_data, label
def load_dataset(pattern, batch_size, num_repeat):
dataset = tf.data.experimental.make_csv_dataset(
file_pattern=pattern,
batch_size=batch_size,
column_names=CSV_COLUMNS,
column_defaults=DEFAULTS,
num_epochs=num_repeat,
)
return dataset.map(features_and_labels)
def create_train_dataset(pattern, batch_size):
dataset = load_dataset(pattern, batch_size, num_repeat=None)
return dataset.prefetch(1)
def create_eval_dataset(pattern, batch_size):
dataset = load_dataset(pattern, batch_size, num_repeat=1)
return dataset.prefetch(1)
def parse_datetime(s):
if type(s) is not str:
s = s.numpy().decode('utf-8')
return datetime.datetime.strptime(s, "%Y-%m-%d %H:%M:%S %Z")
def euclidean(params):
lon1, lat1, lon2, lat2 = params
londiff = lon2 - lon1
latdiff = lat2 - lat1
return tf.sqrt(londiff*londiff + latdiff*latdiff)
def get_dayofweek(s):
ts = parse_datetime(s)
return DAYS[ts.weekday()]
@tf.function
def dayofweek(ts_in):
return tf.map_fn(
lambda s: tf.py_function(get_dayofweek, inp=[s], Tout=tf.string),
ts_in
)
@tf.function
def fare_thresh(x):
return 60 * activations.relu(x)
def transform(inputs, NUMERIC_COLS, STRING_COLS, nbuckets):
# Pass-through columns
transformed = inputs.copy()
del transformed['pickup_datetime']
feature_columns = {
colname: fc.numeric_column(colname)
for colname in NUMERIC_COLS
}
# Scaling longitude from range [-70, -78] to [0, 1]
for lon_col in ['pickup_longitude', 'dropoff_longitude']:
transformed[lon_col] = layers.Lambda(
lambda x: (x + 78)/8.0,
name='scale_{}'.format(lon_col)
)(inputs[lon_col])
# Scaling latitude from range [37, 45] to [0, 1]
for lat_col in ['pickup_latitude', 'dropoff_latitude']:
transformed[lat_col] = layers.Lambda(
lambda x: (x - 37)/8.0,
name='scale_{}'.format(lat_col)
)(inputs[lat_col])
# Adding Euclidean dist (no need to be accurate: NN will calibrate it)
transformed['euclidean'] = layers.Lambda(euclidean, name='euclidean')([
inputs['pickup_longitude'],
inputs['pickup_latitude'],
inputs['dropoff_longitude'],
inputs['dropoff_latitude']
])
feature_columns['euclidean'] = fc.numeric_column('euclidean')
# hour of day from timestamp of form '2010-02-08 09:17:00+00:00'
transformed['hourofday'] = layers.Lambda(
lambda x: tf.strings.to_number(
tf.strings.substr(x, 11, 2), out_type=tf.dtypes.int32),
name='hourofday'
)(inputs['pickup_datetime'])
feature_columns['hourofday'] = fc.indicator_column(
fc.categorical_column_with_identity(
'hourofday', num_buckets=24))
latbuckets = np.linspace(0, 1, nbuckets).tolist()
lonbuckets = np.linspace(0, 1, nbuckets).tolist()
b_plat = fc.bucketized_column(
feature_columns['pickup_latitude'], latbuckets)
b_dlat = fc.bucketized_column(
feature_columns['dropoff_latitude'], latbuckets)
b_plon = fc.bucketized_column(
feature_columns['pickup_longitude'], lonbuckets)
b_dlon = fc.bucketized_column(
feature_columns['dropoff_longitude'], lonbuckets)
ploc = fc.crossed_column(
[b_plat, b_plon], nbuckets * nbuckets)
dloc = fc.crossed_column(
[b_dlat, b_dlon], nbuckets * nbuckets)
pd_pair = fc.crossed_column([ploc, dloc], nbuckets ** 4)
feature_columns['pickup_and_dropoff'] = fc.embedding_column(
pd_pair, 100)
return transformed, feature_columns
def rmse(y_true, y_pred):
return tf.sqrt(tf.reduce_mean(tf.square(y_pred - y_true)))
def build_dnn_model(nbuckets, nnsize, lr):
# input layer is all float except for pickup_datetime which is a string
STRING_COLS = ['pickup_datetime']
NUMERIC_COLS = (
set(CSV_COLUMNS) - set([LABEL_COLUMN, 'key']) - set(STRING_COLS)
)
inputs = {
colname: layers.Input(name=colname, shape=(), dtype='float32')
for colname in NUMERIC_COLS
}
inputs.update({
colname: layers.Input(name=colname, shape=(), dtype='string')
for colname in STRING_COLS
})
# transforms
transformed, feature_columns = transform(
inputs, NUMERIC_COLS, STRING_COLS, nbuckets=nbuckets)
dnn_inputs = layers.DenseFeatures(feature_columns.values())(transformed)
x = dnn_inputs
for layer, nodes in enumerate(nnsize):
x = layers.Dense(nodes, activation='relu', name='h{}'.format(layer))(x)
output = layers.Dense(1, name='fare')(x)
model = models.Model(inputs, output)
lr_optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
model.compile(optimizer=lr_optimizer, loss='mse', metrics=[rmse, 'mse'])
return model
def train_and_evaluate(hparams):
batch_size = hparams['batch_size']
eval_data_path = hparams['eval_data_path']
nnsize = hparams['nnsize']
nbuckets = hparams['nbuckets']
lr = hparams['lr']
num_evals = hparams['num_evals']
num_examples_to_train_on = hparams['num_examples_to_train_on']
output_dir = hparams['output_dir']
train_data_path = hparams['train_data_path']
timestamp = datetime.datetime.now().strftime('%Y%m%d%H%M%S')
savedmodel_dir = os.path.join(output_dir, 'export/savedmodel')
model_export_path = os.path.join(savedmodel_dir, timestamp)
checkpoint_path = os.path.join(output_dir, 'checkpoints')
tensorboard_path = os.path.join(output_dir, 'tensorboard')
if tf.io.gfile.exists(output_dir):
tf.io.gfile.rmtree(output_dir)
dnn_model = build_dnn_model(nbuckets, nnsize, lr)
logging.info(dnn_model.summary())
trainds = create_train_dataset(train_data_path, batch_size)
evalds = create_eval_dataset(eval_data_path, batch_size)
steps_per_epoch = num_examples_to_train_on // (batch_size * num_evals)
checkpoint_cb = callbacks.ModelCheckpoint(checkpoint_path,
save_weights_only=True,
verbose=1)
tensorboard_cb = callbacks.TensorBoard(tensorboard_path,
histogram_freq=1)
history = dnn_model.fit(
trainds,
validation_data=evalds,
epochs=num_evals,
steps_per_epoch=max(1, steps_per_epoch),
verbose=2, # 0=silent, 1=progress bar, 2=one line per epoch
callbacks=[checkpoint_cb, tensorboard_cb]
)
# Exporting the model with default serving function.
tf.saved_model.save(dnn_model, model_export_path)
# TODO 1
hp_metric = # TODO: Your code goes here
# TODO 1
hpt = # TODO: Your code goes here
# TODO: Your code goes here
return history
In [ ]:
%%writefile taxifare/trainer/task.py
import argparse
import json
import os
from trainer import model
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument(
"--batch_size",
help = "Batch size for training steps",
type = int,
default = 32
)
parser.add_argument(
"--eval_data_path",
help = "GCS location pattern of eval files",
required = True
)
parser.add_argument(
"--nnsize",
help = "Hidden layer sizes (provide space-separated sizes)",
nargs = "+",
type = int,
default=[32, 8]
)
parser.add_argument(
"--nbuckets",
help = "Number of buckets to divide lat and lon with",
type = int,
default = 10
)
parser.add_argument(
"--lr",
help = "learning rate for optimizer",
type = float,
default = 0.001
)
parser.add_argument(
"--num_evals",
help = "Number of times to evaluate model on eval data training.",
type = int,
default = 5
)
parser.add_argument(
"--num_examples_to_train_on",
help = "Number of examples to train on.",
type = int,
default = 100
)
parser.add_argument(
"--output_dir",
help = "GCS location to write checkpoints and export models",
required = True
)
parser.add_argument(
"--train_data_path",
help = "GCS location pattern of train files containing eval URLs",
required = True
)
parser.add_argument(
"--job-dir",
help = "this model ignores this field, but it is required by gcloud",
default = "junk"
)
args, _ = parser.parse_known_args()
hparams = args.__dict__
model.train_and_evaluate(hparams)
Specify the hyperparameter tuning configuration for your training job Create a HyperparameterSpec object to hold the hyperparameter tuning configuration for your training job, and add the HyperparameterSpec as the hyperparameters object in your TrainingInput object.
In your HyperparameterSpec, set the hyperparameterMetricTag to a value representing your chosen metric. If you don't specify a hyperparameterMetricTag, AI Platform Training looks for a metric with the name training/hptuning/metric. The following example shows how to create a configuration for a metric named metric1:
Complete the TODOs below.
In [ ]:
%%writefile hptuning_config.yaml
trainingInput:
scaleTier: BASIC
hyperparameters:
goal: MINIMIZE
maxTrials: # TODO: Your code goes here
maxParallelTrials: # TODO: Your code goes here
hyperparameterMetricTag: # TODO: Your code goes here
enableTrialEarlyStopping: True
params:
- parameterName: lr
# TODO: Your code goes here
- parameterName: nbuckets
# TODO: Your code goes here
- parameterName: batch_size
# TODO: Your code goes here
The way to report your hyperparameter metric to the AI Platform Training service depends on whether you are using TensorFlow for training or not. It also depends on whether you are using a runtime version or a custom container for training.
We recommend that your training code reports your hyperparameter metric to AI Platform Training frequently in order to take advantage of early stopping.
TensorFlow with a runtime version If you use an AI Platform Training runtime version and train with TensorFlow, then you can report your hyperparameter metric to AI Platform Training by writing the metric to a TensorFlow summary. Use one of the following functions.
You may need to install cloudml-hypertune
on your machine to run this code locally.
In [ ]:
!python3 -m pip install cloudml-hypertune
In [ ]:
%%bash
EVAL_DATA_PATH=./taxifare/tests/data/taxi-valid*
TRAIN_DATA_PATH=./taxifare/tests/data/taxi-train*
OUTPUT_DIR=./taxifare-model
rm -rf ${OUTDIR}
export PYTHONPATH=${PYTHONPATH}:${PWD}/taxifare
python3 -m trainer.task \
--eval_data_path $EVAL_DATA_PATH \
--output_dir $OUTPUT_DIR \
--train_data_path $TRAIN_DATA_PATH \
--batch_size 5 \
--num_examples_to_train_on 100 \
--num_evals 1 \
--nbuckets 10 \
--lr 0.001 \
--nnsize 32 8
In [ ]:
%%bash
PROJECT_ID=$(gcloud config list project --format "value(core.project)")
BUCKET=$PROJECT_ID
REGION="us-central1"
TFVERSION="2.1"
# Output directory and jobID
OUTDIR=gs://${BUCKET}/taxifare/trained_model_$(date -u +%y%m%d_%H%M%S)
JOBID=taxifare_$(date -u +%y%m%d_%H%M%S)
echo ${OUTDIR} ${REGION} ${JOBID}
gsutil -m rm -rf ${OUTDIR}
# Model and training hyperparameters
BATCH_SIZE=15
NUM_EXAMPLES_TO_TRAIN_ON=100
NUM_EVALS=10
NBUCKETS=10
LR=0.001
NNSIZE="32 8"
# GCS paths
GCS_PROJECT_PATH=gs://$BUCKET/taxifare
DATA_PATH=$GCS_PROJECT_PATH/data
TRAIN_DATA_PATH=$DATA_PATH/taxi-train*
EVAL_DATA_PATH=$DATA_PATH/taxi-valid*
# TODO
gcloud ai-platform jobs submit training $JOBID \
# TODO: Your code goes here
-- \
--eval_data_path $EVAL_DATA_PATH \
--output_dir $OUTDIR \
--train_data_path $TRAIN_DATA_PATH \
--batch_size $BATCH_SIZE \
--num_examples_to_train_on $NUM_EXAMPLES_TO_TRAIN_ON \
--num_evals $NUM_EVALS \
--nbuckets $NBUCKETS \
--lr $LR \
--nnsize $NNSIZE
Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License