In [ ]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
This notebook trains a model on AI Platform using hyperparameter tuning to predict a car's miles per gallon (MPG). It uses the Auto MPG Data Set from the UCI Machine Learning Repository.
Citation: Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Using HP tuning for training can be done in a few steps, which this notebook walks through.
Before you jump in, let’s cover some of the different tools you’ll be using to get HP tuning up and running on AI Platform.
Google Cloud Platform lets you build and host applications and websites, store data, and analyze data on Google's scalable infrastructure.
AI Platform is a managed service that enables you to easily build machine learning models that work on any type of data, of any size.
Google Cloud Storage (GCS) is a unified object storage for developers and enterprises, from live data serving to data analytics/ML to data archiving.
Cloud SDK is a command line tool which allows you to interact with Google Cloud products. In order to run this notebook, make sure that Cloud SDK is installed in the same environment as your Jupyter kernel.
Overview of Hyperparameter Tuning - Hyperparameter tuning takes advantage of the processing infrastructure of Google Cloud Platform to test different hyperparameter configurations when training your model.
These variables will be needed for the following steps.
TRAINER_PACKAGE_PATH <./auto_mpg_hp_tuning> - A packaged training application that will be staged in a Google Cloud Storage location. The model file created below is placed inside this package path.
MAIN_TRAINER_MODULE <auto_mpg_hp_tuning.train> - Tells AI Platform which file to execute. This is formatted as <folder_name.python_file_name>.
JOB_DIR <gs://$BUCKET_ID/scikit_learn_job_dir> - The path to a Google Cloud Storage location to use for job output.
RUNTIME_VERSION <1.9> - The version of AI Platform to use for the job. If you don't specify a runtime version, the training service uses the default AI Platform runtime version 1.0. See the list of runtime versions for more information.
PYTHON_VERSION <3.5> - The Python version to use for the job. Python 3.5 is available with runtime version 1.4 or greater. If you don't specify a Python version, the training service uses Python 2.7.
HPTUNING_CONFIG <hptuning_config.yaml> - Path to the job configuration file.
Replace:
PROJECT_ID <YOUR_PROJECT_ID> - with your project's ID. Use the PROJECT_ID that matches your Google Cloud Platform project.
BUCKET_ID <YOUR_BUCKET_ID> - with the bucket ID you created above.
JOB_DIR <gs://YOUR_BUCKET_ID/scikit_learn_job_dir> - with the bucket ID you created above.
REGION <REGION> - select a region or use the default 'us-central1'. The region is where the model will be deployed.
In [ ]:
%env PROJECT_ID PROJECT_ID
%env BUCKET_ID BUCKET_ID
%env JOB_DIR gs://BUCKET_ID/scikit_learn_job_dir
%env REGION us-central1
%env TRAINER_PACKAGE_PATH ./auto_mpg_hp_tuning
%env MAIN_TRAINER_MODULE auto_mpg_hp_tuning.train
%env RUNTIME_VERSION 1.9
%env PYTHON_VERSION 3.5
%env HPTUNING_CONFIG hptuning_config.yaml
! mkdir auto_mpg_hp_tuning
The Auto MPG Data Set that this sample uses for training is provided by the UC Irvine Machine Learning Repository. We have hosted the data in a public GCS bucket, gs://cloud-samples-data/ml-engine/auto_mpg/. The data has been pre-processed to remove rows with incomplete data, so as not to create additional steps for this notebook.
The training file is auto-mpg.data.
Note: Your typical development process with your own data would require you to upload your data to GCS so that AI Platform can access that data. However, in this case, we have put the data on GCS to avoid the steps of having you download the data from UC Irvine and then upload the data to GCS.
Citation: Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
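If you want a quick look at the hosted data before training, you can list the bucket contents and preview the first few rows with gsutil. This is an optional check and assumes gsutil is installed alongside the Cloud SDK in your environment.
In [ ]:
! gsutil ls gs://cloud-samples-data/ml-engine/auto_mpg/
! gsutil cat gs://cloud-samples-data/ml-engine/auto_mpg/auto-mpg.data | head -n 5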
First, we'll create the Python model file (provided below) that we'll upload to AI Platform. This is similar to your normal process for creating a scikit-learn model. However, there are a few key differences, most notably that it uses the cloudml-hypertune package to track your training job's metrics.
The code in this file first handles the hyperparameters passed to the file from AI Platform. Then it loads the data into a pandas DataFrame that can be used by scikit-learn. Next, the model is fit against the training data and the metrics for that data are shared with AI Platform. Lastly, sklearn's built-in version of joblib is used to save the model to a file that can be uploaded to AI Platform's prediction service.
Note: In normal practice you would want to test your model locally on a small dataset to ensure that it works, before using it with your larger dataset on AI Platform; this avoids wasted time and costs. A minimal local smoke-test sketch is included after the setup.py cell below.
In [ ]:
%%writefile ./auto_mpg_hp_tuning/train.py
#!/usr/bin/env python
# Copyright 2018 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import datetime
import os
import pandas as pd
import subprocess
from google.cloud import storage
import hypertune
from sklearn.externals import joblib
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
In this tutorial, the Lasso regressor is used because it has several parameters that can be used to help demonstrate how to choose HP tuning values. (The ranges of values to try are set below in the HP tuning configuration file.)
In [ ]:
%%writefile -a ./auto_mpg_hp_tuning/train.py
parser = argparse.ArgumentParser()
parser.add_argument(
    '--job-dir',  # handled automatically by AI Platform
    help='GCS location to write checkpoints and export models',
    required=True
)
parser.add_argument(
    '--alpha',  # Specified in the config file
    help='Constant that multiplies the L1 term.',
    default=1.0,
    type=float
)
parser.add_argument(
    '--max_iter',  # Specified in the config file
    help='The maximum number of iterations.',
    default=1000,
    type=int
)
parser.add_argument(
    '--tol',  # Specified in the config file
    help='The tolerance for the optimization: if the updates are smaller than tol, '
         'the optimization code checks the dual gap for optimality and continues '
         'until it is smaller than tol.',
    default=0.0001,
    type=float
)
parser.add_argument(
    '--selection',  # Specified in the config file
    help='Coordinate selection strategy: "cyclic" loops over features sequentially, '
         'while "random" updates a random coefficient every iteration.',
    default='cyclic'
)
args = parser.parse_args()
In [ ]:
%%writefile -a ./auto_mpg_hp_tuning/train.py
# Public bucket holding the auto mpg data
bucket = storage.Client().bucket('cloud-samples-data')
# Path to the data inside the public bucket
blob = bucket.blob('ml-engine/auto_mpg/auto-mpg.data')
# Download the data
blob.download_to_filename('auto-mpg.data')
# ---------------------------------------
# This is where your model code would go. Below is an example model using the auto mpg dataset.
# ---------------------------------------
# Define the format of your input data including unused columns
# (These are the columns from the auto-mpg data files)
COLUMNS = (
    'mpg',
    'cylinders',
    'displacement',
    'horsepower',
    'weight',
    'acceleration',
    'model-year',
    'origin',
    'car-name'
)
# Load the training auto mpg dataset
with open('./auto-mpg.data', 'r') as train_data:
    raw_training_data = pd.read_csv(
        train_data, header=None, names=COLUMNS, delim_whitespace=True)
# Remove the column we are trying to predict ('mpg') from our features list
# Convert the Dataframe to a lists of lists
features = raw_training_data.drop('mpg', axis=1).drop('car-name', axis=1).values.tolist()
# Create our training labels list, convert the Dataframe to a lists of lists
labels = raw_training_data['mpg'].values.tolist()
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.15)
In [ ]:
%%writefile -a ./auto_mpg_hp_tuning/train.py
# Create the regressor, here we will use a Lasso Regressor to demonstrate the use of HP Tuning.
# Here is where we set the variables used during HP Tuning from
# the parameters passed into the python script
regressor = Lasso(
    alpha=args.alpha,
    max_iter=args.max_iter,
    tol=args.tol,
    selection=args.selection)
# Transform the features and fit them to the regressor
regressor.fit(train_features, train_labels)
In [ ]:
%%writefile -a ./auto_mpg_hp_tuning/train.py
# Calculate the coefficient of determination (R^2) of the regressor on the test data and labels.
score = regressor.score(test_features, test_labels)
# The default name of the metric is training/hptuning/metric.
# We recommend that you assign a custom name. The only functional difference is that
# if you use a custom name, you must set the hyperparameterMetricTag value in the
# HyperparameterSpec object in your job request to match your chosen name.
# https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#HyperparameterSpec
hpt = hypertune.HyperTune()
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag='my_metric_tag',
    metric_value=score,
    global_step=1000)
In [ ]:
%%writefile -a ./auto_mpg_hp_tuning/train.py
# Export the model to a file
model_filename = 'model.joblib'
joblib.dump(regressor, model_filename)
# Example: job_dir = 'gs://BUCKET_ID/scikit_learn_job_dir/1'
job_dir = args.job_dir.replace('gs://', '') # Remove the 'gs://'
# Get the Bucket Id
bucket_id = job_dir.split('/')[0]
# Get the path within the bucket (everything after the bucket id)
bucket_path = job_dir[len(bucket_id) + 1:]  # Example: 'scikit_learn_job_dir/1'
# Upload the model to GCS
bucket = storage.Client().bucket(bucket_id)
blob = bucket.blob('{}/{}'.format(
    bucket_path,
    model_filename))
blob.upload_from_filename(model_filename)
In [ ]:
%%writefile ./auto_mpg_hp_tuning/__init__.py
#!/usr/bin/env python
# Copyright 2018 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Note that __init__.py can be an empty file.
Next, we need to set the HP tuning values used to train our model. Check HyperparameterSpec for more info.
In this config file several key things are set:
maxTrials - How many training trials should be attempted to optimize the specified hyperparameters.
maxParallelTrials - The number of training trials to run concurrently.
params - The set of parameters to tune. These are the different parameters to pass into your model and the specified ranges you wish to try.
parameterName - The parameter name, which must be unique amongst all ParameterConfigs.
type - The type of the parameter [INTEGER, DOUBLE, ...].
minValue & maxValue - The range of values that this parameter can take.
scaleType - How the parameter should be scaled to the hypercube. Leave unset for categorical parameters. Some kind of scaling is strongly recommended for real or integral parameters (e.g., UNIT_LINEAR_SCALE).
In [ ]:
%%writefile ./hptuning_config.yaml
#!/usr/bin/env python
# Copyright 2018 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# hyperparam.yaml
trainingInput:
  hyperparameters:
    goal: MAXIMIZE
    maxTrials: 30
    maxParallelTrials: 5
    hyperparameterMetricTag: my_metric_tag
    enableTrialEarlyStopping: TRUE
    params:
      - parameterName: alpha
        type: DOUBLE
        minValue: 0.0
        maxValue: 10.0
        scaleType: UNIT_LINEAR_SCALE
      - parameterName: max_iter
        type: INTEGER
        minValue: 1000
        maxValue: 5000
        scaleType: UNIT_LINEAR_SCALE
      - parameterName: tol
        type: DOUBLE
        minValue: 0.0001
        maxValue: 0.1
        scaleType: UNIT_LINEAR_SCALE
      - parameterName: selection
        type: CATEGORICAL
        categoricalValues: ["cyclic", "random"]
Lastly, we need to install the dependencies used in our model. Check adding_standard_pypi_dependencies for more info.
To do this, AI Platform uses a setup.py file to install your dependencies.
In [ ]:
%%writefile ./setup.py
#!/usr/bin/env python
# Copyright 2018 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from setuptools import find_packages
from setuptools import setup
REQUIRED_PACKAGES = ['cloudml-hypertune']
setup(
    name='auto_mpg_hp_tuning',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    include_package_data=True,
    description='Auto MPG sklearn HP tuning training application'
)
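Optionally, before submitting to AI Platform you can smoke-test the trainer on your own machine. The cell below is a minimal sketch using gcloud ml-engine local train; it assumes your local credentials can read the public data bucket and write to $JOB_DIR, and the hyperparameter values shown are arbitrary examples (trainer arguments are passed after the bare --).
In [ ]:
! gcloud ml-engine local train \
  --package-path $TRAINER_PACKAGE_PATH \
  --module-name $MAIN_TRAINER_MODULE \
  -- \
  --job-dir $JOB_DIR \
  --alpha 1.0 \
  --max_iter 1000 \
  --tol 0.0001 \
  --selection cyclic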
Next we need to submit the job for training on AI Platform. We'll use gcloud to submit the job; the command takes the following flags:
job-name - A name to use for the job (mixed-case letters, numbers, and underscores only, starting with a letter). In this case: auto_mpg_hp_tuning_$(date +"%Y%m%d_%H%M%S")
job-dir - The path to a Google Cloud Storage location to use for job output.
package-path - A packaged training application that is staged in a Google Cloud Storage location. If you are using the gcloud command-line tool, this step is largely automated.
module-name - The name of the main module in your trainer package. The main module is the Python file you call to start the application. If you use the gcloud command to submit your job, specify the main module name in the --module-name argument. Refer to Python Packages to figure out the module name.
region - The Google Cloud Compute region where you want your job to run. You should run your training job in the same region as the Cloud Storage bucket that stores your training data. Select a region from here or use the default 'us-central1'.
runtime-version - The version of AI Platform to use for the job. If you don't specify a runtime version, the training service uses the default AI Platform runtime version 1.0. See the list of runtime versions for more information.
python-version - The Python version to use for the job. Python 3.5 is available with runtime version 1.4 or greater. If you don't specify a Python version, the training service uses Python 2.7.
scale-tier - A scale tier specifying the type of processing cluster to run your job on. This can be the CUSTOM scale tier, in which case you also explicitly specify the number and type of machines to use.
config - Path to the job configuration file. This file should be a YAML document (JSON also accepted) containing a Job resource as defined in the API.
Note: Check to make sure gcloud is set to the current PROJECT_ID.
In [ ]:
! gcloud config set project $PROJECT_ID
Submit the training job.
In [ ]:
! gcloud ml-engine jobs submit training auto_mpg_hp_tuning_$(date +"%Y%m%d_%H%M%S") \
--job-dir $JOB_DIR \
--package-path $TRAINER_PACKAGE_PATH \
--module-name $MAIN_TRAINER_MODULE \
--region $REGION \
--runtime-version=$RUNTIME_VERSION \
--python-version=$PYTHON_VERSION \
--scale-tier BASIC \
--config $HPTUNING_CONFIG
You can view the logs for your training job in the Cloud Console. On the job's page, you can also view the results of each HP tuning trial.
Example:
{
    "trialId": "2",
    "hyperparameters": {
        "selection": "random",
        "max_iter": "1892",
        "tol": "0.0609819896050862",
        "alpha": "4.3704164028167725"
    },
    "finalMetric": {
        "trainingStep": "1000",
        "objectiveValue": 0.8658283435394591
    }
}
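If you prefer the command line to the console, the same job status and trial output can be inspected with gcloud. This is a sketch: JOB_NAME is a placeholder for the name printed by the submit command above (the auto_mpg_hp_tuning_... value).
In [ ]:
# Replace JOB_NAME with the job name printed by the submit command above (hypothetical placeholder).
! gcloud ml-engine jobs describe JOB_NAME
! gcloud ml-engine jobs stream-logs JOB_NAME
The cell below lists the model files each trial exported under $JOB_DIR.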
In [ ]:
! gsutil ls $JOB_DIR/*
The AI Platform online prediction service manages computing resources in the cloud to run your models. Check out the documentation pages that describe the process to get online predictions from these exported models using AI Platform.
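As a rough, hedged sketch of that process (not executed in this notebook), you would create a model resource and then a version that points at the exported model.joblib of your chosen trial. The model name auto_mpg_model, the version name v1, and the trial directory used in --origin are all hypothetical placeholders; pick the trial directory from the gsutil listing above.
In [ ]:
# Hypothetical names; point --origin at the directory that contains model.joblib
# for the trial you want to deploy, e.g. $JOB_DIR/<trialId>.
! gcloud ml-engine models create auto_mpg_model --regions $REGION
! gcloud ml-engine versions create v1 \
  --model auto_mpg_model \
  --origin $JOB_DIR/1 \
  --runtime-version $RUNTIME_VERSION \
  --python-version $PYTHON_VERSION \
  --framework scikit-learn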