Learning Objectives
In the previous notebook we achieved an RMSE of 4.13. Let's see if we can improve upon that by tuning our hyperparameters.
Hyperparameters are parameters that are set prior to training a model, as opposed to parameters which are learned during training.
These include learning rate and batch size, but also model design parameters such as type of activation function and number of hidden units.
Here are the four most common ways of finding the ideal hyperparameters:
1. Manual
Traditionally, hyperparameter tuning is a manual trial-and-error process. A data scientist starts from some intuition about suitable hyperparameters, observes the result, and then uses that information to pick a new set of hyperparameters in an attempt to beat the existing performance.
Pros
Leverages expert intuition, and each new trial is informed by the results of the previous ones.
Cons
Slow and sequential, hard to reproduce, and the quality of the result depends heavily on the practitioner's experience.
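To make this concrete, here is a minimal sketch of the manual loop in plain Python. train_and_eval is a hypothetical stub standing in for "train a model with these hyperparameters and return its validation RMSE":

def train_and_eval(**hparams):
    # Hypothetical stub: a real version would train a model with these
    # hyperparameters and return its validation RMSE.
    return 4.13 # placeholder value so the sketch runs

results = {}
# Trial 1: start from intuition.
results[(0.01, 128)] = train_and_eval(learning_rate = 0.01, batch_size = 128)
# Trial 2: based on trial 1, try a smaller learning rate and compare.
results[(0.001, 128)] = train_and_eval(learning_rate = 0.001, batch_size = 128)
best_params = min(results, key = results.get)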
2. Grid Search
At the other extreme we can use grid search: define a discrete set of values to try for each hyperparameter, then try every possible combination.
Pros
Simple to implement, exhaustive within the grid, and the trials are independent so they can all run in parallel.
Cons
The number of trials grows exponentially with the number of hyperparameters, and continuous values must be discretized up front.
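As a sketch, reusing the hypothetical train_and_eval stub from above:

import itertools

def train_and_eval(**hparams):
    return 4.13 # hypothetical stub, as in the manual sketch above

grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [64, 128, 256],
}
best_rmse, best_params = float("inf"), None
for combo in itertools.product(*grid.values()): # 3 x 3 = 9 trials, all independent
    params = dict(zip(grid.keys(), combo))
    rmse = train_and_eval(**params)
    if rmse < best_rmse:
        best_rmse, best_params = rmse, params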
3. Random Search
Alternatively, define a range for each hyperparameter (e.g. 0-256) and sample uniformly at random from that range.
Pros
Handles continuous ranges without discretization and in practice often finds good values with far fewer trials than grid search; trials can still run in parallel.
Cons
No guarantee of finding the best combination, and like grid search it learns nothing from past trials.
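As a sketch, again with the hypothetical train_and_eval stub:

import random

def train_and_eval(**hparams):
    return 4.13 # hypothetical stub, as in the sketches above

best_rmse, best_params = float("inf"), None
for _ in range(20): # the trial budget is ours to choose
    params = {
        "learning_rate": 10 ** random.uniform(-4, -1), # log-uniform between 1e-4 and 1e-1
        "hidden_units": random.randint(0, 256), # uniform over the 0-256 range above
    }
    rmse = train_and_eval(**params)
    if rmse < best_rmse:
        best_rmse, best_params = rmse, params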
4. Bayesian Optimization
Unlike Grid Search and Random Search, Bayesian Optimization takes into account information from past trials when selecting hyperparameters for future trials. The details of how it does this are beyond the scope of this notebook, but if you're interested you can read about how it works here.
Pros
Usually reaches a good result in fewer trials, because each new trial is informed by the ones before it.
Cons
Trials are inherently more sequential, so it is harder to parallelize, and it is considerably more complex to implement yourself.
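For a taste of what this looks like in code, here is a minimal sketch using the scikit-optimize library (our choice for illustration only; it is not used elsewhere in this notebook). Its gp_minimize fits a probabilistic surrogate model to the trials seen so far and uses it to pick the next hyperparameters to try:

from skopt import gp_minimize

def train_and_eval(**hparams):
    return 4.13 # hypothetical stub, as in the sketches above

def objective(params):
    learning_rate, hidden_units = params
    return train_and_eval(learning_rate = learning_rate, hidden_units = hidden_units) # RMSE to minimize

result = gp_minimize(
    func = objective,
    dimensions = [(1e-4, 1e-1, "log-uniform"), (8, 256)], # one search range per hyperparameter
    n_calls = 20, # total trial budget
    random_state = 1)
print(result.x, result.fun) # best hyperparameters and the RMSE they achieved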
AI Platform HyperTune
AI Platform HyperTune, powered by Google Vizier, uses Bayesian Optimization by default, but also supports Grid Search and Random Search.
When tuning just a few hyperparameters (say, fewer than four), Grid Search and Random Search work well, but when tuning several hyperparameters and the search space is large, Bayesian Optimization is best.
In [10]:
PROJECT = "cloud-training-demos" # Replace with your PROJECT
BUCKET = "cloud-training-bucket" # Replace with your BUCKET
REGION = "us-central1" # Choose an available region for AI Platform
TFVERSION = "1.14" # TF version for AI Platform to use
In [ ]:
import os
os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET
os.environ["REGION"] = REGION
os.environ["TFVERSION"] = TFVERSION
In [ ]:
%%bash
mkdir -p taxifaremodel
touch taxifaremodel/__init__.py
In [ ]:
%%writefile taxifaremodel/model.py
import tensorflow as tf
import numpy as np
import shutil
print(tf.__version__)
#1. Train and Evaluate Input Functions
CSV_COLUMN_NAMES = ["fare_amount","dayofweek","hourofday","pickuplon","pickuplat","dropofflon","dropofflat"]
CSV_DEFAULTS = [[0.0],[1],[0],[-74.0],[40.0],[-74.0],[40.7]]
def read_dataset(csv_path):
def _parse_row(row):
# Decode the CSV row into list of TF tensors
fields = tf.decode_csv(records = row, record_defaults = CSV_DEFAULTS)
# Pack the result into a dictionary
features = dict(zip(CSV_COLUMN_NAMES, fields))
# NEW: Add engineered features
features = add_engineered_features(features)
# Separate the label from the features
label = features.pop("fare_amount") # remove label from features and store
return features, label
# Create a dataset containing the text lines.
dataset = tf.data.Dataset.list_files(file_pattern = csv_path) # (i.e. data_file_*.csv)
dataset = dataset.flat_map(map_func = lambda filename:tf.data.TextLineDataset(filenames = filename).skip(count = 1))
# Parse each CSV row into correct (features,label) format for Estimator API
dataset = dataset.map(map_func = _parse_row)
return dataset
def train_input_fn(csv_path, batch_size = 128):
#1. Convert CSV into tf.data.Dataset with (features,label) format
dataset = read_dataset(csv_path)
#2. Shuffle, repeat, and batch the examples.
dataset = dataset.shuffle(buffer_size = 1000).repeat(count = None).batch(batch_size = batch_size)
return dataset
def eval_input_fn(csv_path, batch_size = 128):
#1. Convert CSV into tf.data.Dataset with (features,label) format
dataset = read_dataset(csv_path)
#2.Batch the examples.
dataset = dataset.batch(batch_size = batch_size)
return dataset
#2. Feature Engineering
# One hot encode dayofweek and hourofday
fc_dayofweek = tf.feature_column.categorical_column_with_identity(key = "dayofweek", num_buckets = 7)
fc_hourofday = tf.feature_column.categorical_column_with_identity(key = "hourofday", num_buckets = 24)
# Cross features to get combination of day and hour
fc_day_hr = tf.feature_column.crossed_column(keys = [fc_dayofweek, fc_hourofday], hash_bucket_size = 24 * 7)
# Bucketize latitudes and longitudes
NBUCKETS = 16
latbuckets = np.linspace(start = 38.0, stop = 42.0, num = NBUCKETS).tolist()
lonbuckets = np.linspace(start = -76.0, stop = -72.0, num = NBUCKETS).tolist()
fc_bucketized_plat = tf.feature_column.bucketized_column(source_column = tf.feature_column.numeric_column(key = "pickuplat"), boundaries = latbuckets)
fc_bucketized_plon = tf.feature_column.bucketized_column(source_column = tf.feature_column.numeric_column(key = "pickuplon"), boundaries = lonbuckets)
fc_bucketized_dlat = tf.feature_column.bucketized_column(source_column = tf.feature_column.numeric_column(key = "dropofflat"), boundaries = latbuckets)
fc_bucketized_dlon = tf.feature_column.bucketized_column(source_column = tf.feature_column.numeric_column(key = "dropofflon"), boundaries = lonbuckets)
def add_engineered_features(features):
features["dayofweek"] = features["dayofweek"] - 1 # subtract one since our days of week are 1-7 instead of 0-6
features["latdiff"] = features["pickuplat"] - features["dropofflat"] # East/West
features["londiff"] = features["pickuplon"] - features["dropofflon"] # North/South
features["euclidean_dist"] = tf.sqrt(features["latdiff"]**2 + features["londiff"]**2)
return features
feature_cols = [
#1. Engineered using tf.feature_column module
tf.feature_column.indicator_column(categorical_column = fc_day_hr),
fc_bucketized_plat,
fc_bucketized_plon,
fc_bucketized_dlat,
fc_bucketized_dlon,
#2. Engineered in input functions
tf.feature_column.numeric_column(key = "latdiff"),
tf.feature_column.numeric_column(key = "londiff"),
tf.feature_column.numeric_column(key = "euclidean_dist")
]
#3. Serving Input Receiver Function
def serving_input_receiver_fn():
receiver_tensors = {
'dayofweek' : tf.placeholder(dtype = tf.int32, shape = [None]), # shape is vector to allow batch of requests
'hourofday' : tf.placeholder(dtype = tf.int32, shape = [None]),
'pickuplon' : tf.placeholder(dtype = tf.float32, shape = [None]),
'pickuplat' : tf.placeholder(dtype = tf.float32, shape = [None]),
'dropofflat' : tf.placeholder(dtype = tf.float32, shape = [None]),
'dropofflon' : tf.placeholder(dtype = tf.float32, shape = [None]),
}
features = add_engineered_features(receiver_tensors) # 'features' is what is passed on to the model
return tf.estimator.export.ServingInputReceiver(features = features, receiver_tensors = receiver_tensors)
#4. Train and Evaluate
def train_and_evaluate(params):
OUTDIR = params["output_dir"]
model = tf.estimator.DNNRegressor(
hidden_units = [int(units) for units in params["hidden_units"].split(",")], # NEW: parameterize architecture
feature_columns = feature_cols,
model_dir = OUTDIR,
config = tf.estimator.RunConfig(
tf_random_seed = 1, # for reproducibility
save_checkpoints_steps = max(100, params["train_steps"] // 10) # checkpoint every N steps
)
)
# Add custom evaluation metric
def my_rmse(labels, predictions):
pred_values = tf.squeeze(input = predictions["predictions"], axis = -1)
return {"rmse": tf.metrics.root_mean_squared_error(labels = labels, predictions = pred_values)}
model = tf.contrib.estimator.add_metrics(model, my_rmse)
train_spec = tf.estimator.TrainSpec(
input_fn = lambda: train_input_fn(params["train_data_path"]),
max_steps = params["train_steps"])
exporter = tf.estimator.FinalExporter(name = "exporter", serving_input_receiver_fn = serving_input_receiver_fn) # export SavedModel once at the end of training
# Note: alternatively use tf.estimator.BestExporter to export at every checkpoint that has lower loss than the previous checkpoint
eval_spec = tf.estimator.EvalSpec(
input_fn = lambda: eval_input_fn(params["eval_data_path"]),
steps = None,
start_delay_secs = 1, # wait at least N seconds before first evaluation (default 120)
throttle_secs = 1, # wait at least N seconds before each subsequent evaluation (default 600)
exporters = exporter) # export SavedModel once at the end of training
tf.logging.set_verbosity(v = tf.logging.INFO) # so loss is printed during training
shutil.rmtree(path = OUTDIR, ignore_errors = True) # start fresh each time
tf.estimator.train_and_evaluate(model, train_spec, eval_spec)
In [ ]:
%%writefile taxifaremodel/task.py
import argparse
import json
import os
from . import model
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--hidden_units",
help = "Hidden layer sizes to use for DNN feature columns -- provide space-separated layers",
type = str,
default = "10,10"
)
parser.add_argument(
"--train_data_path",
help = "GCS or local path to training data",
required = True
)
parser.add_argument(
"--train_steps",
help = "Steps to run the training job for (default: 1000)",
type = int,
default = 1000
)
parser.add_argument(
"--eval_data_path",
help = "GCS or local path to evaluation data",
required = True
)
parser.add_argument(
"--output_dir",
help = "GCS location to write checkpoints and export models",
required = True
)
parser.add_argument(
"--job-dir",
help="This is not used by our model, but it is required by gcloud",
)
args = parser.parse_args().__dict__
# Append trial_id to path so trials don't overwrite each other
args["output_dir"] = os.path.join(
args["output_dir"],
json.loads(
os.environ.get("TF_CONFIG", "{}")
).get("task", {}).get("trial", "")
)
# Run the training job
model.train_and_evaluate(args)
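A note on the TF_CONFIG handling at the end of task.py: during a hyperparameter tuning job, AI Platform sets the TF_CONFIG environment variable for each trial, with the trial id stored under task.trial. Here is a quick illustration of what that parse produces (the TF_CONFIG value and bucket below are made up for demonstration):

import json
import os

os.environ["TF_CONFIG"] = json.dumps({"task": {"type": "master", "trial": "7"}}) # illustrative value only
trial = json.loads(os.environ.get("TF_CONFIG", "{}")).get("task", {}).get("trial", "")
print(os.path.join("gs://my-bucket/taxifare/trained_hp_tune", trial))
# prints gs://my-bucket/taxifare/trained_hp_tune/7, so each trial writes to its own subdirectory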
We specify:
- how many trials to run in total (maxTrials) and how many of those trials can be run in parallel (maxParallelTrials)
- which search algorithm to use (GRID_SEARCH)
- which metric to optimize (hyperparameterMetricTag)
Full specification options here.
Here we are just tuning one parameter, the number of hidden units, and we'll run all trials in parallel. However, more commonly you would tune multiple hyperparameters.
In [ ]:
%%writefile hyperparam.yaml
trainingInput:
scaleTier: BASIC
hyperparameters:
goal: MINIMIZE
maxTrials: 10
maxParallelTrials: 10
hyperparameterMetricTag: rmse
enableTrialEarlyStopping: True
algorithm: GRID_SEARCH
params:
- parameterName: hidden_units
type: CATEGORICAL
categoricalValues:
- 10,10
- 64,32
- 128,64,32
- 32,64,128
- 128,128,128
- 32,32,32
- 256,128,64,32
- 256,256,256,32
- 256,256,256,256
- 512,256,128,64,32
Same as before, with the addition of --config=hyperparam.yaml to reference the file we just created.
This will take about 20 minutes. Go to the Cloud Console and click on the job id. Once the job is complete, the chosen hyperparameters and the resulting objective value (RMSE in this case) will be shown. Trials will be sorted from best to worst.
In [ ]:
OUTDIR="gs://{}/taxifare/trained_hp_tune".format(BUCKET)
!gsutil -m rm -rf {OUTDIR} # start fresh each time
!gcloud ai-platform jobs submit training taxifare_$(date -u +%y%m%d_%H%M%S) \
--package-path=taxifaremodel \
--module-name=taxifaremodel.task \
--config=hyperparam.yaml \
--job-dir=gs://{BUCKET}/taxifare \
--python-version=3.5 \
--runtime-version={TFVERSION} \
--region={REGION} \
-- \
--train_data_path=gs://{BUCKET}/taxifare/smallinput/taxi-train.csv \
--eval_data_path=gs://{BUCKET}/taxifare/smallinput/taxi-valid.csv \
--train_steps=5000 \
--output_dir={OUTDIR}
In [ ]:
OUTDIR="gs://{}/taxifare/trained_large_tuned".format(BUCKET)
!gsutil -m rm -rf {OUTDIR} # start fresh each time
!gcloud ai-platform jobs submit training taxifare_large_$(date -u +%y%m%d_%H%M%S) \
--package-path=taxifaremodel \
--module-name=taxifaremodel.task \
--job-dir=gs://{BUCKET}/taxifare \
--python-version=3.5 \
--runtime-version={TFVERSION} \
--region={REGION} \
--scale-tier=STANDARD_1 \
-- \
--train_data_path=gs://cloud-training-demos/taxifare/large/taxi-train*.csv \
--eval_data_path=gs://cloud-training-demos/taxifare/small/taxi-valid.csv \
--train_steps=200000 \
--output_dir={OUTDIR} \
--hidden_units="128,64,32"
Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License