Cloud AI Platform + What-if Tool: end-to-end XGBoost example

This notebook shows how to:

  • Build a binary classification model with XGBoost trained on a housing dataset to predict whether a house is worth more or less than 160k
  • Deploy the model to Cloud AI Platform
  • Use the What-if Tool on your deployed model

You will need a Google Cloud Platform account and project to run this notebook. Instructions for creating a project can be found here.


In [2]:
import sys
python_version = sys.version_info[0]

In [6]:
# If you're running on Colab, you'll need to install the What-if Tool package and authenticate
# If you're on Cloud AI Platform Notebooks, you'll need to install XGBoost on the TF instance
def pip_install(module):
    if python_version == 2:
        !pip install {module} --quiet
    else:
        !pip3 install {module} --quiet

try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    pip_install('witwidget')

    from google.colab import auth
    auth.authenticate_user()
else:
    pip_install('xgboost')

In [18]:
import pandas as pd
import numpy as np
import witwidget
import xgboost as xgb

from sklearn.utils import shuffle
from sklearn.metrics import accuracy_score
from witwidget.notebook.visualization import WitWidget, WitConfigBuilder

Download and pre-process data

In this section we'll:

  • Download the housing dataset from Google Cloud Storage
  • Shuffle the data and remove columns that contribute little to the model output
  • Turn this into a classification problem by converting labels to 0/1 format indicating whether the house is worth more than $160k
  • Convert all categorical columns to dummy columns (0 or 1 values for each possible category value), because XGBoost requires all columns to be numerical
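
As a rough illustration of the last two steps, here is a minimal plain-Python sketch of what the label conversion and dummy-column encoding do. The notebook itself uses pandas (pd.get_dummies) for this; the toy rows and the one_hot helper below are hypothetical and exist only to show the idea.

```python
# Illustrative only: toy rows, not taken from the housing dataset.
rows = [
    {"LotArea": 9000, "BldgType": "1Fam", "SalePrice": 150000},
    {"LotArea": 3300, "BldgType": "TwnhsE", "SalePrice": 197000},
    {"LotArea": 12000, "BldgType": "1Fam", "SalePrice": 160000},
]

THRESHOLD = 160000  # same cutoff the notebook uses


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category value."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every column except the categorical one being encoded.
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row["{}_{}".format(column, category)] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


# Labels: 1 if the price is strictly greater than the threshold, else 0.
labels = [int(row["SalePrice"] > THRESHOLD) for row in rows]

# Features: drop the label column, then encode the categorical column.
without_price = [{k: v for k, v in row.items() if k != "SalePrice"} for row in rows]
features = one_hot(without_price, "BldgType")
```

Note that the comparison is strict, so a house priced at exactly 160000 falls into the 0 class, which matches the `>` comparison used later in the notebook.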

In [23]:
# Original data source: http://jse.amstat.org/v19n3/decock.pdf
# Dataset on Kaggle: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
!gsutil cp gs://housing_model_data/housing-train.csv .


Copying gs://housing_model_data/housing-train.csv...
/ [1 files][449.9 KiB/449.9 KiB]                                                
Operation completed over 1 objects/449.9 KiB.                                    

In [24]:
data = pd.read_csv('housing-train.csv')

In [25]:
data = shuffle(data, random_state=2)

In [27]:
# Drop columns that don't have much effect on model outcome
data = data.drop(columns=['Id', 'MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Condition2', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterCond', 'Foundation', 'BsmtExposure', 'BsmtFinSF2', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'CentralAir', 'Electrical', 'BsmtHalfBath', 'LowQualFinSF', 'KitchenAbvGr', 'KitchenQual', 'Functional', 'GarageQual', 'GarageCond', 'PavedDrive', 'EnclosedPorch', 'PoolArea', 'PoolQC', 'MiscFeature', 'MiscVal', 'SaleType', 'MoSold', 'SaleCondition'])

In [28]:
train_size = int(len(data) * .8)
data.head()


Out[28]:
LotFrontage LotArea Neighborhood Condition1 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd ... GarageFinish GarageCars GarageArea WoodDeckSF OpenPorchSF 3SsnPorch ScreenPorch Fence YrSold SalePrice
503 100.0 15602 Crawfor Norm 1Fam 1Story 7 8 1959 1997 ... Fin 2 484 0 54 0 161 GdWo 2010 289000
101 77.0 9206 SawyerW Norm 1Fam 2Story 6 5 1985 1985 ... Fin 2 476 192 46 0 0 NaN 2010 178000
608 78.0 12168 Crawfor Norm 1Fam 2Story 8 6 1934 1998 ... Unf 2 380 0 0 0 0 NaN 2007 359100
1089 37.0 3316 Somerst Norm TwnhsE 1Story 8 5 2005 2005 ... Fin 2 550 0 84 0 0 NaN 2006 197000
819 44.0 6371 NridgHt Norm TwnhsE 1Story 7 5 2009 2010 ... RFn 2 484 192 35 0 0 NaN 2010 224000

5 rows × 39 columns


In [29]:
labels = data['SalePrice']
data = data.drop(columns=['SalePrice'])

In [30]:
# Convert categorical columns to dummy columns and preview
data = pd.get_dummies(data) 
data.head()


Out[30]:
LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd BsmtFinSF1 BsmtUnfSF TotalBsmtSF 1stFlrSF ... GarageType_BuiltIn GarageType_CarPort GarageType_Detchd GarageFinish_Fin GarageFinish_RFn GarageFinish_Unf Fence_GdPrv Fence_GdWo Fence_MnPrv Fence_MnWw
503 100.0 15602 7 8 1959 1997 1247 254 1501 1801 ... 0 0 0 1 0 0 0 1 0 0
101 77.0 9206 6 5 1985 1985 0 741 741 977 ... 0 0 0 1 0 0 0 0 0 0
608 78.0 12168 8 6 1934 1998 428 537 965 1940 ... 0 0 0 0 0 1 0 0 0 0
1089 37.0 3316 8 5 2005 2005 1039 208 1247 1247 ... 0 0 0 1 0 0 0 0 0 0
819 44.0 6371 7 5 2009 2010 733 625 1358 1358 ... 0 0 0 0 1 0 0 0 0 0

5 rows × 108 columns


In [32]:
# Convert sale prices to binary labels (1 if over $160k, else 0)
price_more_than_160k = (labels.values > 160000).astype(int)

In [33]:
# Split data into train and test sets
train_data = data[:train_size]
test_data = data[train_size:]

train_labels = price_more_than_160k[:train_size]
test_labels = price_more_than_160k[train_size:]

Train the XGBoost model


In [34]:
bst = xgb.XGBClassifier(objective='binary:logistic')

In [35]:
bst.fit(train_data.values, train_labels)


Out[35]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
              max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
              n_jobs=1, nthread=None, objective='binary:logistic',
              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              seed=None, silent=True, subsample=1)

In [37]:
# Get predictions on the test set and print the accuracy score
y_pred = bst.predict(test_data.values)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(test_labels, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))


Accuracy: 92.47%
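
Accuracy alone hides how the errors split between false positives and false negatives, which the What-if Tool reports later on. The following is a minimal plain-Python sketch of that breakdown; the confusion_counts helper and the toy label lists are hypothetical and are not the notebook's actual test results.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            counts["tp"] += 1
        elif pred == 1 and truth == 0:
            counts["fp"] += 1
        elif pred == 0 and truth == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


# Hypothetical labels and predictions, for illustration only.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(toy_true, toy_pred)
# Accuracy is the share of correct predictions (true positives + true negatives).
toy_accuracy = (counts["tp"] + counts["tn"]) / float(len(toy_true))
```

To apply the same idea to the model above, one could pass test_labels and predictions to a helper like this instead of the toy lists.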

In [38]:
bst.save_model('model.bst')

Deploy model to AI Platform

Copy your saved model file to Cloud Storage and deploy the model to AI Platform. In order for this to work, you'll need the Cloud AI Platform Models API enabled. Update the values in the next cell with the info for your GCP project and Cloud Storage bucket.


In [ ]:
# Define some globals - update these to your own project + model names
GCP_PROJECT = 'YOUR_GCP_PROJECT'
MODEL_BUCKET = 'your_GCS_bucket_name'
VERSION_NAME = 'v1'

In [42]:
!gsutil cp model.bst gs://$MODEL_BUCKET


Copying file://model.bst [Content-Type=application/octet-stream]...
/ [1 files][ 61.4 KiB/ 61.4 KiB]                                                
Operation completed over 1 objects/61.4 KiB.                                     

In [43]:
!gcloud config set project $GCP_PROJECT


Updated property [core/project].

In [44]:
# Create a model version; this takes ~2 minutes to deploy.
# Assumes a model resource named 'housing_classification' already exists in your project.
!gcloud ai-platform versions create $VERSION_NAME \
--model=housing_classification \
--framework='XGBOOST' \
--runtime-version=1.14 \
--origin=gs://$MODEL_BUCKET \
--python-version=3.5


Creating version (this might take a few minutes)......done.                    

Using the What-if Tool to interpret your model

Once your model has deployed, you're ready to connect it to the What-if Tool using the WitWidget.


In [45]:
# Format a subset of the test data to send to the What-if Tool for visualization
# Append ground truth label value to test data

test_examples = np.hstack((test_data.values[:100], test_labels[:100].reshape(-1,1)))

In [ ]:
# Create a What-if Tool visualization; it may take a minute to load
# See the section below this cell for exploration ideas

# This prediction adjustment function is needed as this xgboost model's
# prediction returns just a score for the positive class of the binary
# classification, whereas the What-If Tool expects a list of scores for each
# class (in this case, both the negative class and the positive class).

def adjust_prediction(pred):
  return [1 - pred, pred]

config_builder = (WitConfigBuilder(test_examples.tolist(), data.columns.tolist() + ['SalePrice'])
  .set_ai_platform_model(GCP_PROJECT, 'housing_classification', VERSION_NAME, adjust_prediction=adjust_prediction)
  .set_target_feature('SalePrice')
  .set_label_vocab(['Under160', 'Over160']))
WitWidget(config_builder, height=800)
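
To make the role of adjust_prediction concrete, here is a small sketch of how a single positive-class score becomes a per-class score list and then a label from the vocabulary. The scores and the predicted_label helper are hypothetical, and the threshold comparison is an assumption made for illustration; it is not necessarily how the What-if Tool picks a class internally.

```python
# Same order as the list passed to set_label_vocab: index 0 is the
# negative class, index 1 is the positive class.
LABEL_VOCAB = ['Under160', 'Over160']


def adjust_prediction(pred):
    # Same function as in the notebook: expand one positive-class score
    # into [negative score, positive score].
    return [1 - pred, pred]


def predicted_label(pred, threshold=0.5):
    """Hypothetical helper: map a positive-class score to a vocab label."""
    scores = adjust_prediction(pred)
    index = 1 if scores[1] >= threshold else 0
    return LABEL_VOCAB[index]


# Hypothetical model scores, for illustration only.
toy_scores = [0.5, 0.25, 0.75]
adjusted = [adjust_prediction(score) for score in toy_scores]
toy_labels = [predicted_label(score) for score in toy_scores]
```

The order of the two scores matters: it has to line up with the label vocabulary so that the second entry corresponds to 'Over160'.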

What-if Tool exploration ideas

  • Individual data points: the default graph shows all data points from the test set, colored by their ground truth label (priced over or under 160k)

    • Try selecting data points close to the middle and tweaking some of their feature values. Then run inference again to see if the model prediction changes
    • Select a data point and then select the "Show nearest counterfactual datapoint" radio button. This will highlight a data point with feature values closest to your original one, but with the opposite prediction
  • Binning data: create separate graphs for individual features

    • From the "Binning - X axis" dropdown, try selecting one of the dummy feature columns, for example "ExterQual_Gd". This will create 2 separate graphs, one for houses where the exterior quality was rated as "Good" (graph labeled 1), and one for houses with other exterior quality ratings (graph labeled 0). This shows us that houses with a good exterior quality rating have a higher likelihood of being priced over 160k.
  • Exploring overall performance: Click on the "Performance & Fairness" tab to view overall performance statistics on the model's results on the provided dataset, including confusion matrices, PR curves, and ROC curves.

    • Experiment with the threshold slider, raising and lowering the positive classification score the model needs to return before it prices a house at over 160k, and see how it changes accuracy, false positives, and false negatives.
    • On the left side "Slice by" menu, select "GarageType_Attchd". You'll now see performance on the two subsets of your data: the "0" slice shows when the garage is not attached, and the "1" slice is for when a garage is attached to the house. Check out the accuracy, false positive, and false negative rates between the two slices to look for differences in performance. If you expand the rows to look at the confusion matrices, you can see that the model predicts a price of over 160k more frequently for houses with attached garages.
    • You can use the optimization buttons on the left side to have the tool auto-select different positive classification thresholds for each slice in order to achieve different goals. If you select the "Demographic parity" button, then the two thresholds will be adjusted so that the model predicts over 160k for a similar percentage of houses in both slices. What does this do to the accuracy, false positives and false negatives for each slice?
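
The threshold and slicing ideas above can be sketched in plain Python. Everything in this block is hypothetical: the (score, true label, slice) triples are made up, and the evaluate and positive_rate helpers are illustrative stand-ins, not part of the What-if Tool API.

```python
# Hypothetical (positive-class score, true label, slice value) triples.
# The slice value plays the role of a 0/1 column such as an attached-garage flag.
examples = [
    (0.9, 1, 1), (0.7, 1, 1), (0.6, 0, 1), (0.3, 0, 1),
    (0.8, 1, 0), (0.4, 1, 0), (0.35, 0, 0), (0.1, 0, 0),
]


def evaluate(examples, threshold):
    """Accuracy, false positives and false negatives at a given threshold."""
    correct = fp = fn = 0
    for score, truth, _ in examples:
        pred = int(score >= threshold)
        if pred == truth:
            correct += 1
        elif pred == 1:
            fp += 1
        else:
            fn += 1
    return {"accuracy": correct / float(len(examples)), "fp": fp, "fn": fn}


def positive_rate(examples, slice_value, threshold):
    """Share of one slice that is predicted positive at a given threshold."""
    scores = [score for score, _, group in examples if group == slice_value]
    positives = sum(1 for score in scores if score >= threshold)
    return positives / float(len(scores))


# Raising the threshold trades false positives for false negatives.
at_low = evaluate(examples, 0.35)
at_mid = evaluate(examples, 0.5)
at_high = evaluate(examples, 0.65)

# With one shared threshold the slices get different positive rates...
shared = (positive_rate(examples, 1, 0.5), positive_rate(examples, 0, 0.5))
# ...while per-slice thresholds can equalize them (the demographic parity idea).
per_slice = (positive_rate(examples, 1, 0.65), positive_rate(examples, 0, 0.4))
```

In this toy data a shared threshold of 0.5 predicts "over 160k" for 75% of one slice and 25% of the other, while the per-slice thresholds bring both to 50%. Whether that improves or hurts accuracy in each slice is exactly what the question above asks you to check in the tool.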