In [ ]:
# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
AI Hub is a repository of plug-and-play AI components including end-to-end AI pipelines and out-of-the-box algorithms. The following is an example of using the XGBoost AI Hub container to train a model with the AI Platform Training service and creating a model endpoint using AI Platform Prediction.
AI Hub includes components that make it easy to run training jobs at scale on Google's cloud infrastructure. Without revising any code, users can run distributed training jobs on a variety of hardware (including GPU and TPU devices). These components offer native support for AI Platform Training and export trained model files that can be uploaded to AI Platform Prediction for generating inferences. The components also include a run report that provides practical insights into the behavior of the trained model, and a visual inspection of the training and validation error for each run.
The dataset used in this notebook includes residential real-estate data for homes in Ames, Iowa. The data are stored in tabular (CSV) format and include the sale price of 1,460 homes along with 79 explanatory features.
The dataset comes from Kaggle's House Prices competition. The Kaggle API can be used to download and import the data.
The following notebook provides an example workflow of using an AI Hub component to train an XGBoost regression model and create an endpoint for generating predictions.
The notebook includes a complete ML workflow from data ingestion to model training and deployment. The steps below can be used as a template for creating end-to-end workflows with XGBoost and tabular data.
This tutorial uses billable components of Google Cloud Platform (GCP): AI Platform and Cloud Storage.
Learn about Cloud AI Platform pricing and Cloud Storage pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.
The following steps are required, regardless of your notebook environment.
Select or create a GCP project. When you first create an account, you get a $300 free credit towards your compute/storage costs.
Enter your project ID in the cell below. Then run the cell to make sure the Cloud SDK uses the right project for all the commands in this notebook.
Note: Jupyter runs lines prefixed with ! as shell commands, and it interpolates Python variables prefixed with $ into these commands.
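For example, a Python variable defined in one cell can be reused inside a shell command in a later line. The GREETING variable below exists only to illustrate this interpolation:
In [ ]:
# Illustration only: Jupyter substitutes the value of the Python variable GREETING
# into the shell command before running it.
GREETING = "interpolated from Python"
! echo $GREETING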
In [ ]:
PROJECT_ID = "[your-project-id]"
! gcloud config set project $PROJECT_ID
The following steps are required, regardless of your notebook environment.
When you submit a training job using the Cloud SDK, you upload a Python package containing your training code to a Cloud Storage bucket. AI Platform runs the code from this package. In this tutorial, AI Platform also saves the trained model that results from your job in the same bucket. You can then create an AI Platform model version based on this output in order to serve online predictions.
Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.
You may also change the REGION variable, which is used for operations throughout the rest of this notebook. Make sure to choose a region where Cloud AI Platform services are available. You may not use a Multi-Regional Storage bucket for training with AI Platform.
In [ ]:
BUCKET_NAME = "[your-bucket-name]"
REGION = "us-central1"
Only if your bucket doesn't already exist: Run the following cell to create your Cloud Storage bucket.
In [ ]:
! gsutil mb -l $REGION gs://$BUCKET_NAME
Finally, validate access to your Cloud Storage bucket by examining its contents:
In [ ]:
! gsutil ls -al gs://$BUCKET_NAME
Use pip to install the Kaggle API for downloading the House Prices dataset. Follow the instructions on GitHub for generating a Kaggle API token, then set the KAGGLE_USERNAME and KAGGLE_KEY environment variables accordingly.
In [ ]:
! pip install --user kaggle
In [ ]:
%env KAGGLE_USERNAME YOUR-KAGGLE-USERNAME
%env KAGGLE_KEY YOUR-KAGGLE-KEY
In [ ]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import zipfile
import pandas as pd
import tensorflow as tf
import numpy as np
import os
from IPython.core.display import HTML
import googleapiclient.discovery
In [ ]:
# Import data from Kaggle
# For documentation on using the Kaggle API for Python refer to the official repo: https://github.com/Kaggle/kaggle-api
!~/.local/bin/kaggle competitions download -c house-prices-advanced-regression-techniques
In [ ]:
# If you don't have a Kaggle account:
! gsutil cp gs://cloud-samples-data/ai-hub/house-prices-advanced-regression-techniques.zip .
In [ ]:
# Unzip the training and test datasets
with zipfile.ZipFile('house-prices-advanced-regression-techniques.zip', 'r') as data_zip:
    data_zip.extractall('data')
# Remove the downloaded compressed file
tf.io.gfile.remove('house-prices-advanced-regression-techniques.zip')
In [ ]:
# Import training data
train_data = pd.read_csv('data/train.csv').sample(frac=1)
train_data['set'] = 'train'
In [ ]:
# Partition 10% of training data as a validation set
train_data.iloc[0:(int(train_data.shape[0] * 0.1)), train_data.columns.get_loc("set")] = 'validation'
In [ ]:
# Import test data
test_data = pd.read_csv('data/test.csv')
test_data['SalePrice'] = None
test_data['set'] = 'test'
# Pull Ids for test dataset for writing submission.csv file
test_ids = test_data['Id']
In [ ]:
# Combine training/validation/test sets into single DataFrame
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent operation
all_data = pd.concat([train_data, test_data])
all_data = all_data.drop(labels='Id', axis=1)
# Reorder columns
cols = all_data.columns.tolist()
del cols[-2:]
cols.insert(0, 'SalePrice')
cols.insert(0, 'set')
all_data = all_data[cols]
The data includes a large number of categorical features. For the sake of simplicity, assume that all integer features are ordinal and perform one-hot encoding for each of the string features.
In [ ]:
def one_hot_encode_features(features):
    preprocessed_features = pd.DataFrame()
    # One-hot encode categorical features
    for col_name in features.columns:
        # Assume that all numeric columns are continuous or ordinal
        if col_name in ['set', 'SalePrice'] or features[col_name].dtype in ['int64', 'float64']:
            preprocessed_features = pd.concat((preprocessed_features, features[col_name]), axis=1)
        else:
            preprocessed_features = pd.concat((preprocessed_features, pd.get_dummies(features[col_name])), axis=1)
    return preprocessed_features
all_data = one_hot_encode_features(all_data)
In [ ]:
# Revise column names
col_names = ['set', 'SalePrice']
col_names.extend(['feature_{}'.format(i) for i in range(all_data.shape[1] - 2)])
all_data.columns = col_names
Replace missing values with the mean of each column from the training data. Then standardize the features by subtracting the column mean and dividing by the column standard deviation.
In [ ]:
# Split data into train/validation/test sets
train = all_data.loc[all_data['set'] == 'train']
validation = all_data.loc[all_data['set'] == 'validation']
test = all_data.loc[all_data['set'] == 'test']
# Remove 'set' column
train = train.drop('set', axis=1)
validation = validation.drop('set', axis=1)
test = test.drop('set', axis=1)
In [ ]:
# Pull column-wise mean and standard deviation from training set
train_column_means = train.mean(axis=0)
train_column_sd = train.std(axis=0)
In [ ]:
# Impute missing values with column mean
train.iloc[:, 1:] = train.iloc[:, 1:].fillna(train_column_means[1:])
validation.iloc[:, 1:] = validation.iloc[:, 1:].fillna(train_column_means[1:])
test.iloc[:, 1:] = test.iloc[:, 1:].fillna(train_column_means[1:])
In [ ]:
# Standardize features for the train, validation and test sets
def standardize_features(features, col_means, col_sds):
    for i in range(features.shape[1]):
        # Use positional access so the Series of means/standard deviations lines up
        # with the feature columns; skip constant columns to avoid dividing by zero
        if col_sds.iloc[i] != 0:
            features.iloc[:, i] = features.iloc[:, i].subtract(col_means.iloc[i]).divide(col_sds.iloc[i])
    return features
In [ ]:
train.iloc[:, 1:] = standardize_features(
    features=train.iloc[:, 1:],
    col_means=train_column_means[1:],
    col_sds=train_column_sd[1:])
validation.iloc[:, 1:] = standardize_features(
    features=validation.iloc[:, 1:],
    col_means=train_column_means[1:],
    col_sds=train_column_sd[1:])
test.iloc[:, 1:] = standardize_features(
    features=test.iloc[:, 1:],
    col_means=train_column_means[1:],
    col_sds=train_column_sd[1:])
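As a quick sanity check (not part of the original workflow), the standardized feature columns of the training set should now be roughly zero-mean with unit variance; columns with zero variance were deliberately left unchanged by standardize_features:
In [ ]:
# Optional sanity check: summarize the column-wise means and standard deviations
# of the standardized training features (SalePrice in column 0 is excluded).
print(train.iloc[:, 1:].mean(axis=0).describe())
print(train.iloc[:, 1:].std(axis=0).describe())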
Use TensorFlow's tf.io.gfile module to copy the preprocessed CSV files to a Cloud Storage bucket.
In [ ]:
# Save preprocessed data as CSV files
os.mkdir('data/preprocessed')
train.to_csv('data/preprocessed/train.csv', index=False)
validation.to_csv('data/preprocessed/validation.csv', index=False)
test.to_csv('data/preprocessed/test.csv', index=False)
In [ ]:
# Copy the preprocessed CSV data to a GCS bucket
for dataset in tf.io.gfile.glob('data/preprocessed/*.csv'):
    tf.io.gfile.copy(
        dataset,
        os.path.join('gs://', BUCKET_NAME, 'house_prices_data', os.path.basename(dataset)),
        overwrite=True)
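Optionally, confirm that the three preprocessed CSV files are now in the bucket before submitting the training job:
In [ ]:
# Optional: list the uploaded CSV files in the Cloud Storage bucket
! gsutil ls gs://$BUCKET_NAME/house_prices_data/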
To use an AI Hub component with the AI Platform Training service, navigate to the component's page and click the 'Edit Training Command' button. A pop-up will appear with a list of arguments that the component accepts and the Bash shell command for submitting a training job to AI Platform Training.
The XGBoost AI Hub component is a Docker image hosted on Google Container Registry. The component uses AI Platform Training's custom container feature to run training jobs.
The parameter values below can be revised for your use case and data. For additional information on submitting a training job on AI Platform Training refer to the documentation. To view the status of a training run and inspect the logs, navigate to the GCP console and go to the AI Platform > Jobs page.
In [ ]:
# Set parameter values for a training run
TRAINING_DATA = os.path.join('gs://', BUCKET_NAME, 'house_prices_data/train*')
VALIDATION_DATA = os.path.join('gs://', BUCKET_NAME, 'house_prices_data/val*')
OUTPUT_LOCATION = os.path.join('gs://', BUCKET_NAME, 'xgboost_output')
TARGET_COLUMN = 'SalePrice'
DATA_TYPE = 'csv'
FRESH_START = True
WEIGHT_COLUMN = ""
NUMBER_OF_CLASSES = 1
NUM_ROUND = 250
EARLY_STOPPING_ROUNDS = -1
VERBOSITY = 1
ETA = 0.1
GAMMA = 0.001
MAX_DEPTH = 10
MIN_CHILD_WEIGHT = 1
MAX_DELTA_STEP = 0
SUBSAMPLE = 1
COLSAMPLE_BYTREE = 1
COLSAMPLE_BYLEVEL = 1
COLSAMPLE_BYNODE = 1
REG_LAMBDA = 1
ALPHA = 0
SCALE_POS_WEIGHT = 1
OBJECTIVE = 'reg:gamma'
TREE_METHOD = 'auto'
# AI Platform Training job related arguments:
SCALE_TIER='CUSTOM'
MASTER_MACHINE_TYPE='standard_gpu'
In [ ]:
import uuid
JOB_NAME = "kaggle_xgboost_example_" + uuid.uuid4().hex[:10]
In [ ]:
# Submit AI Platform training job.
!gcloud ai-platform jobs submit training {JOB_NAME} \
--master-image-uri gcr.io/aihub-c2t-containers/kfp-components/trainer/dist_xgboost@sha256:7de885ef326e55b663ff0eb06724d580116953fe6a702383a113b2f306f308ae \
--region {REGION} \
--scale-tier {SCALE_TIER} \
--master-machine-type {MASTER_MACHINE_TYPE} \
--stream-logs \
-- \
--training-data {TRAINING_DATA} \
--target-column {TARGET_COLUMN} \
--validation-data {VALIDATION_DATA} \
--output-location {OUTPUT_LOCATION} \
--data-type {DATA_TYPE} \
--fresh-start {FRESH_START} \
--weight-column {WEIGHT_COLUMN} \
--number-of-classes {NUMBER_OF_CLASSES} \
--num-round {NUM_ROUND} \
--early-stopping-rounds {EARLY_STOPPING_ROUNDS} \
--verbosity {VERBOSITY} \
--eta {ETA} \
--gamma {GAMMA} \
--max-depth {MAX_DEPTH} \
--min-child-weight {MIN_CHILD_WEIGHT} \
--max-delta-step {MAX_DELTA_STEP} \
--subsample {SUBSAMPLE} \
--colsample-bytree {COLSAMPLE_BYTREE} \
--colsample-bylevel {COLSAMPLE_BYLEVEL} \
--colsample-bynode {COLSAMPLE_BYNODE} \
--reg-lambda {REG_LAMBDA} \
--alpha {ALPHA} \
--scale-pos-weight {SCALE_POS_WEIGHT} \
--objective {OBJECTIVE} \
--tree-method {TREE_METHOD}
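Because the command above passes --stream-logs, it blocks until the job completes. You can also check on the job from the notebook at any time; the same information is available in the GCP console under AI Platform > Jobs:
In [ ]:
# Optional: inspect the current state of the training job
! gcloud ai-platform jobs describe {JOB_NAME}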
In [ ]:
MODEL_NAME = 'xgboost_housing_price_predictor'
MODEL_VERSION = 'v1'
In [ ]:
# Delete any existing model version with this name so the cells below can be rerun
! gcloud ai-platform versions delete {MODEL_VERSION} --quiet --model {MODEL_NAME}
# Delete any existing model resource with this name
! gcloud ai-platform models delete {MODEL_NAME} --quiet
In [ ]:
# Create a model resource on AI Platform Prediction. Once this is created, multiple versions
# of a model can be uploaded to this resource.
!gcloud ai-platform models create {MODEL_NAME} --regions {REGION}
In [ ]:
# Create a model version using the exported XGBoost model from the training run
!gcloud ai-platform versions create {MODEL_VERSION} \
--model {MODEL_NAME} \
--origin {OUTPUT_LOCATION} \
--runtime-version=1.14 \
--python-version=3.5 \
--framework XGBOOST
In [ ]:
# Verify that the model endpoint was created successfully
!gcloud ai-platform versions describe {MODEL_VERSION} \
--model {MODEL_NAME}
Once the model is deployed to AI Platform Prediction, the endpoint can be used to serve inferences. Refer to the documentation for additional information on generating online predictions from an AI Platform endpoint.
In [ ]:
service = googleapiclient.discovery.build('ml', 'v1')
name = 'projects/{}/models/{}/versions/{}'.format(PROJECT_ID, MODEL_NAME, MODEL_VERSION)
response = service.projects().predict(
    name=name,
    # Generate inferences for the first 10 examples from the test set
    body={'instances': test.iloc[0:10, :].values.tolist()}
).execute()
if 'error' in response:
    print(response['error'])
else:
    online_results = response['predictions']
    print(online_results)
In [ ]:
tf.io.gfile.copy(
    os.path.join(OUTPUT_LOCATION, 'report.html'),
    'report.html',
    overwrite=True)
In [ ]:
with open('report.html', 'r') as f:
    html_report = f.read()
display(HTML(html_report))
To clean up all GCP resources used in this project, you can delete the GCP project you used for the tutorial. Alternatively, run the following cell to delete the individual resources created in this notebook.
In [ ]:
# Delete model version resource
! gcloud ai-platform versions delete {MODEL_VERSION} --quiet --model {MODEL_NAME}
# Delete model resource
! gcloud ai-platform models delete {MODEL_NAME} --quiet
# If training job is still running, cancel it
! gcloud ai-platform jobs cancel {JOB_NAME} --quiet