In [ ]:
# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Training an XGBoost model with AI Hub

Overview

AI Hub is a repository of plug-and-play AI components, including end-to-end AI pipelines and out-of-the-box algorithms. The following is an example of using the XGBoost AI Hub container to train a model with the AI Platform Training service and create a model endpoint using AI Platform Prediction.

AI Hub includes components that make it easy to run training jobs at scale on Google's cloud infrastructure. Without revising any code, users can run distributed training jobs on a variety of hardware (including GPU and TPU devices). These components offer native support for AI Platform Training and export trained model files that can be uploaded to AI Platform Prediction for generating inferences. The components also include a run report that provides practical insights into the behavior of the trained model, and a visual inspection of the training and validation error for each run.

Dataset

The dataset used in this notebook includes residential real-estate data for homes in Ames, Iowa. The data are stored in a tabular format (.CSV) and include the sale price of 1,460 homes along with 79 explanatory features.

The dataset comes from Kaggle's competition to predict House Prices. The Kaggle API can be used to download and import the data.

Objective

The following notebook provides an example workflow of using an AI Hub component to train an XGBoost regression model and create an endpoint for generating predictions.

The notebook includes a complete ML workflow from data ingestion to model training and deployment. The steps below can be used as a template for creating end-to-end workflows with XGBoost and tabular data.

Costs

This tutorial uses billable components of Google Cloud Platform (GCP):

  • Cloud AI Platform
  • Cloud Storage

Learn about Cloud AI Platform pricing and Cloud Storage pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.

Set up your GCP project

The following steps are required, regardless of your notebook environment.

  1. Select or create a GCP project. When you first create an account, you get a $300 free credit towards your compute/storage costs.

  2. Make sure that billing is enabled for your project.

  3. Enable the AI Platform APIs and Compute Engine APIs.

  4. Enter your project ID in the cell below. Then run the cell to make sure the Cloud SDK uses the right project for all the commands in this notebook.

Note: Jupyter runs lines prefixed with ! as shell commands, and it interpolates Python variables prefixed with $ into these commands.


In [ ]:
PROJECT_ID = "[your-project-id]"
! gcloud config set project $PROJECT_ID
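
Optionally, you can also enable the required APIs directly from this notebook. The cell below is a minimal sketch using the gcloud CLI; it assumes your account has permission to enable services on the project.


In [ ]:
# Enable the AI Platform and Compute Engine APIs (no effect if they are already enabled)
! gcloud services enable ml.googleapis.com compute.googleapis.com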

Authenticate your GCP account

If you are using AI Platform Notebooks, your environment is already authenticated. Skip this step.
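
If you are working in another environment (for example, a local Jupyter installation), authenticate the Cloud SDK before continuing. The cell below is a minimal sketch; it starts a browser-based login flow and assumes the gcloud CLI is installed.


In [ ]:
# Authenticate the gcloud CLI and set application-default credentials
# (skip this cell on AI Platform Notebooks)
! gcloud auth login
! gcloud auth application-default login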

Create a Cloud Storage bucket

The following steps are required, regardless of your notebook environment.

When you submit a training job using the Cloud SDK, you upload a Python package containing your training code to a Cloud Storage bucket. AI Platform runs the code from this package. In this tutorial, AI Platform also saves the trained model that results from your job in the same bucket. You can then create an AI Platform model version based on this output in order to serve online predictions.

Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.

You may also change the REGION variable, which is used for operations throughout the rest of this notebook. Make sure to choose a region where Cloud AI Platform services are available. You may not use a Multi-Regional Storage bucket for training with AI Platform.


In [ ]:
BUCKET_NAME = "[your-bucket-name]"
REGION = "us-central1"
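
If you prefer, you can derive the bucket name from your project ID to help keep it globally unique. The cell below is an optional sketch; the suffix is arbitrary, so uncomment and adjust it as needed.


In [ ]:
# Optional: derive a unique bucket name from the project ID (the suffix is arbitrary)
# BUCKET_NAME = PROJECT_ID + "-aihub-xgboost"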

Only if your bucket doesn't already exist: Run the following cell to create your Cloud Storage bucket.


In [ ]:
! gsutil mb -l $REGION gs://$BUCKET_NAME

Finally, validate access to your Cloud Storage bucket by examining its contents:


In [ ]:
! gsutil ls -al gs://$BUCKET_NAME

Install Kaggle API

Use pip to install the Kaggle API client, which is used to download the House Prices dataset. Follow the instructions on GitHub for generating a Kaggle API token, then set the KAGGLE_USERNAME and KAGGLE_KEY environment variables accordingly.


In [ ]:
! pip install --user kaggle

In [ ]:
%env KAGGLE_USERNAME YOUR-KAGGLE-USERNAME
%env KAGGLE_KEY YOUR-KAGGLE-KEY

Import libraries and download data


In [ ]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import zipfile
import pandas as pd
import tensorflow as tf
import numpy as np
import os

from IPython.core.display import HTML
import googleapiclient.discovery

In [ ]:
# Import data from Kaggle
# For documentation on using the Kaggle API for Python refer to the official repo: https://github.com/Kaggle/kaggle-api
!~/.local/bin/kaggle competitions download -c house-prices-advanced-regression-techniques

In [ ]:
# If you don't have a Kaggle account:
! gsutil cp gs://cloud-samples-data/ai-hub/house-prices-advanced-regression-techniques.zip .

In [ ]:
# Unzip the training and test datasets
with zipfile.ZipFile('house-prices-advanced-regression-techniques.zip', 'r') as data_zip:
    data_zip.extractall('data')
# Remove the downloaded compressed file
tf.io.gfile.remove('house-prices-advanced-regression-techniques.zip')

Preprocess the data

Import and preprocess the train and test datasets before training the model. The training set includes 1,460 labeled examples and the test set includes 1,459 unlabeled examples. Below, 10% of the training data is partitioned off as a validation set.


In [ ]:
# Import the training data and shuffle the rows (sample(frac=1) returns a shuffled copy)
train_data = pd.read_csv('data/train.csv').sample(frac=1)
train_data['set'] = 'train'

In [ ]:
# Partition 10% of training data as a validation set
train_data.iloc[0:(int(train_data.shape[0] * 0.1)), train_data.columns.get_loc("set")] = 'validation'

In [ ]:
# Import test data
test_data = pd.read_csv('data/test.csv')
# Add a placeholder SalePrice column (np.nan keeps the column numeric when the sets are combined)
test_data['SalePrice'] = np.nan
test_data['set'] = 'test'
# Pull Ids for test dataset for writing submission.csv file
test_ids = test_data['Id']

In [ ]:
# Combine training/validation/test sets into a single DataFrame
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
all_data = pd.concat([train_data, test_data])
all_data = all_data.drop(labels='Id', axis=1)
# Reorder columns
cols = all_data.columns.tolist()
del cols[-2:]
cols.insert(0, 'SalePrice')
cols.insert(0, 'set')
all_data = all_data[cols]

The data includes a large number of categorical features. For the sake of simplicity, assume that all integer features are ordinal and perform one-hot encoding for each of the string features.


In [ ]:
def one_hot_encode_features(features):
    preprocessed_features = pd.DataFrame()
    
    # One-hot encode categorical features
    for col_name in features.columns:
        # Assume that all numeric columns are continuous or ordinal
        if col_name in ['set', 'SalePrice'] or features[col_name].dtype in ['int64', 'float64']:
            preprocessed_features = pd.concat((preprocessed_features, features[col_name]), axis=1)
        else:
            preprocessed_features = pd.concat((preprocessed_features, pd.get_dummies(features[col_name])), axis=1)

    return preprocessed_features

all_data = one_hot_encode_features(all_data)

In [ ]:
# Revise column names
col_names = ['set', 'SalePrice']
col_names.extend(['feature_{}'.format(i) for i in range(all_data.shape[1] - 2)])
all_data.columns = col_names

Replace missing values with the mean of each column from the training data. Then standardize the features by subtracting the column mean and dividing by the column standard deviation.


In [ ]:
# Split data into train/validation/test sets
train = all_data.loc[all_data['set'] == 'train']
validation = all_data.loc[all_data['set'] == 'validation']
test = all_data.loc[all_data['set'] == 'test']
# Remove 'set' column
train = train.drop('set', axis=1)
validation = validation.drop('set', axis=1)
test = test.drop('set', axis=1)

In [ ]:
# Pull column-wise mean and standard deviation from training set
train_column_means = train.mean(axis=0)
train_column_sd = train.std(axis=0)

In [ ]:
# Impute missing values with column mean
train.iloc[:, 1:] = train.iloc[:, 1:].fillna(train_column_means[1:])
validation.iloc[:, 1:] = validation.iloc[:, 1:].fillna(train_column_means[1:])
test.iloc[:, 1:] = test.iloc[:, 1:].fillna(train_column_means[1:])

In [ ]:
# Standardize features for the train, validation and test sets
def standardize_features(features, col_means, col_sds):
    # Use positional (.iloc) indexing so the means/SDs line up with the feature columns
    for i in range(features.shape[1]):
        if col_sds.iloc[i] != 0:
            features.iloc[:, i] = features.iloc[:, i].subtract(col_means.iloc[i]).divide(col_sds.iloc[i])
    return features

In [ ]:
train.iloc[:, 1:] = standardize_features(
    features=train.iloc[:, 1:],
    col_means=train_column_means[1:],
    col_sds=train_column_sd[1:])
validation.iloc[:, 1:] = standardize_features(
    features=validation.iloc[:, 1:],
    col_means=train_column_means[1:],
    col_sds=train_column_sd[1:])
test.iloc[:, 1:] = standardize_features(
    features=test.iloc[:, 1:],
    col_means=train_column_means[1:],
    col_sds=train_column_sd[1:])

Write data to CSV files and upload to Google Cloud Storage

Use TensorFlow's tf.io.gfile module to copy the preprocessed CSV files to the Cloud Storage bucket.


In [ ]:
# Save preprocessed data as CSV files
os.makedirs('data/preprocessed', exist_ok=True)
train.to_csv('data/preprocessed/train.csv', index=False)
validation.to_csv('data/preprocessed/validation.csv', index=False)
test.to_csv('data/preprocessed/test.csv', index=False)

In [ ]:
# Copy the preprocessed CSV data to a GCS bucket
for dataset in tf.io.gfile.glob('data/preprocessed/*.csv'):
    tf.io.gfile.copy(
        dataset,
        os.path.join('gs://', BUCKET_NAME, 'house_prices_data', os.path.basename(dataset)),
        overwrite=True)

Submit a training job with the XGBoost AI Hub component

To use an AI Hub component with the AI Platform Training service, navigate to the component's page and click the 'Edit Training Command' button. A pop-up will appear with a list of arguments that the component accepts and the Bash shell command for submitting a training job to AI Platform Training.

The XGBoost AI Hub component is a Docker image hosted on Google Container Registry. The component uses AI Platform Training's custom container feature to run training jobs.

The parameter values below can be revised for your use case and data. For additional information on submitting a training job to AI Platform Training, refer to the documentation. To view the status of a training run and inspect the logs, navigate to the GCP Console and go to the AI Platform > Jobs page.


In [ ]:
# Set parameter values for a training run
TRAINING_DATA = os.path.join('gs://', BUCKET_NAME, 'house_prices_data/train*')
VALIDATION_DATA = os.path.join('gs://', BUCKET_NAME, 'house_prices_data/val*')
OUTPUT_LOCATION = os.path.join('gs://', BUCKET_NAME, 'xgboost_output')
TARGET_COLUMN = 'SalePrice'
DATA_TYPE = 'csv'
FRESH_START = True
WEIGHT_COLUMN = ""
NUMBER_OF_CLASSES = 1
NUM_ROUND = 250
EARLY_STOPPING_ROUNDS = -1
VERBOSITY = 1
ETA = 0.1
GAMMA = 0.001
MAX_DEPTH = 10
MIN_CHILD_WEIGHT = 1
MAX_DELTA_STEP = 0
SUBSAMPLE = 1
COLSAMPLE_BYTREE = 1
COLSAMPLE_BYLEVEL = 1
COLSAMPLE_BYNODE = 1
REG_LAMBDA = 1
ALPHA = 0
SCALE_POS_WEIGHT = 1
OBJECTIVE = 'reg:gamma'
TREE_METHOD = 'auto'

# AI Platform Training job related arguments:
SCALE_TIER = 'CUSTOM'
MASTER_MACHINE_TYPE = 'standard_gpu'

In [ ]:
import uuid

JOB_NAME = "kaggle_xgboost_example_" + uuid.uuid4().hex[:10]

In [ ]:
# Submit AI Platform training job.

!gcloud ai-platform jobs submit training {JOB_NAME} \
    --master-image-uri gcr.io/aihub-c2t-containers/kfp-components/trainer/dist_xgboost@sha256:7de885ef326e55b663ff0eb06724d580116953fe6a702383a113b2f306f308ae \
    --region {REGION} \
    --scale-tier {SCALE_TIER} \
    --master-machine-type {MASTER_MACHINE_TYPE} \
    --stream-logs \
    -- \
    --training-data {TRAINING_DATA} \
    --target-column {TARGET_COLUMN} \
    --validation-data {VALIDATION_DATA} \
    --output-location {OUTPUT_LOCATION} \
    --data-type {DATA_TYPE} \
    --fresh-start {FRESH_START} \
    --weight-column {WEIGHT_COLUMN} \
    --number-of-classes {NUMBER_OF_CLASSES} \
    --num-round {NUM_ROUND} \
    --early-stopping-rounds {EARLY_STOPPING_ROUNDS} \
    --verbosity {VERBOSITY} \
    --eta {ETA} \
    --gamma {GAMMA} \
    --max-depth {MAX_DEPTH} \
    --min-child-weight {MIN_CHILD_WEIGHT} \
    --max-delta-step {MAX_DELTA_STEP} \
    --subsample {SUBSAMPLE} \
    --colsample-bytree {COLSAMPLE_BYTREE} \
    --colsample-bylevel {COLSAMPLE_BYLEVEL} \
    --colsample-bynode {COLSAMPLE_BYNODE} \
    --reg-lambda {REG_LAMBDA} \
    --alpha {ALPHA} \
    --scale-pos-weight {SCALE_POS_WEIGHT} \
    --objective {OBJECTIVE} \
    --tree-method {TREE_METHOD}

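Because the --stream-logs flag is set, the cell above blocks until the job finishes and streams its logs into the notebook. You can also confirm the final state of the job with the gcloud CLI:


In [ ]:
# Check the status and final state of the training job
! gcloud ai-platform jobs describe {JOB_NAME}
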
Deploy the trained model to AI Platform Prediction

After the training run succeeds, a model file (model.bst) is exported to the GCS bucket defined by the OUTPUT_LOCATION parameter. Create a model resource on AI Platform Prediction and deploy a new version using the trained XGBoost model.


In [ ]:
MODEL_NAME = 'xgboost_housing_price_predictor'
MODEL_VERSION = 'v1'

In [ ]:
# Delete any existing model version resource with this name so the create commands
# below succeed (reports an error, which can be safely ignored, if it does not exist yet)
! gcloud ai-platform versions delete {MODEL_VERSION} --quiet --model {MODEL_NAME}

# Delete any existing model resource with this name
! gcloud ai-platform models delete {MODEL_NAME} --quiet

In [ ]:
# Create a model resource on AI Platform Prediction. Once this is created, multiple versions
# of a model can be uploaded to this resource.
!gcloud ai-platform models create {MODEL_NAME} --regions {REGION}

In [ ]:
# Create a model version using the exported XGBoost model from the training run
!gcloud ai-platform versions create {MODEL_VERSION} \
  --model {MODEL_NAME} \
  --origin {OUTPUT_LOCATION} \
  --runtime-version=1.14 \
  --python-version=3.5 \
  --framework XGBOOST

In [ ]:
# Verify that the model endpoint was created successfully
!gcloud ai-platform versions describe {MODEL_VERSION} \
  --model {MODEL_NAME}

Generate inferences from the model endpoint

Once the model is deployed to AI Platform Prediction, the endpoint can be used to serve inferences. Refer to the documentation for additional information on generating online predictions from an AI Platform endpoint.


In [ ]:
service = googleapiclient.discovery.build('ml', 'v1')
name = 'projects/{}/models/{}/versions/{}'.format(PROJECT_ID, MODEL_NAME, MODEL_VERSION)

response = service.projects().predict(
    name=name,
    # Generate inferences for the first 10 examples from the test set,
    # sending only the feature columns (drop the placeholder SalePrice column)
    body={'instances': test.iloc[0:10, 1:].values.tolist()}
).execute()

if 'error' in response:
    print(response['error'])
else:
    online_results = response['predictions']
    print(online_results)
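
The test_ids captured earlier can be used to assemble a Kaggle-style submission file. The cell below is an optional sketch: it requests predictions for the full test set in batches of 100 rows (sending only the feature columns, as above) and writes the results to submission.csv.


In [ ]:
# Optional: generate predictions for the entire test set in batches and
# write a submission file using the test Ids captured earlier
all_predictions = []
for start in range(0, test.shape[0], 100):
    batch = test.iloc[start:start + 100, 1:].values.tolist()
    batch_response = service.projects().predict(
        name=name,
        body={'instances': batch}
    ).execute()
    all_predictions.extend(batch_response['predictions'])

submission = pd.DataFrame({'Id': test_ids.values, 'SalePrice': all_predictions})
submission.to_csv('submission.csv', index=False)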

Inspect the training job

After the training job completes on AI Platform Training, a run report is created in the OUTPUT_LOCATION. The report examines the quality of the trained model and provides a visual inspection of the training and validation error from the training run.


In [ ]:
tf.io.gfile.copy(
    os.path.join(OUTPUT_LOCATION, 'report.html'),
    'report.html',
    overwrite=True)

In [ ]:
with open('report.html', 'r') as f:
    html_report = f.read()

display(HTML(html_report))

Cleaning up

To clean up all GCP resources used in this project, you can delete the GCP project you used for the tutorial.


In [ ]:
# Delete model version resource
! gcloud ai-platform versions delete {MODEL_VERSION} --quiet --model {MODEL_NAME} 

# Delete model resource
! gcloud ai-platform models delete {MODEL_NAME} --quiet

# If training job is still running, cancel it
! gcloud ai-platform jobs cancel {JOB_NAME} --quiet
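
If you want to keep the project, you can also delete the Cloud Storage bucket used in this tutorial. This permanently removes the uploaded data and the training output, so only run the cell below once you no longer need them.


In [ ]:
# Delete the Cloud Storage bucket and all of its contents
! gsutil -m rm -r gs://$BUCKET_NAME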