Cloud AI Platform + What-if Tool: end-to-end XGBoost example

This notebook shows how to:

Build a binary classification model with XGBoost, trained on a housing dataset, to predict whether a house is worth more or less than $160k
Deploy the model to Cloud AI Platform
Use the What-if Tool on your deployed model

You will need a Google Cloud Platform account and project to run this notebook. Instructions for creating a project can be found here.



In [2]:

    
import sys
python_version = sys.version_info[0]



In [6]:

    
# If you're running on Colab, you'll need to install the What-if Tool package and authenticate
# If you're on Cloud AI Platform Notebooks, you'll need to install XGBoost on the TF instance
def pip_install(module):
    if python_version == 2:  # sys.version_info[0] is an int, not a string
        !pip install {module} --quiet
    else:
        !pip3 install {module} --quiet

try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    pip_install('witwidget')

    from google.colab import auth
    auth.authenticate_user()
else:
    pip_install('xgboost')



In [18]:

    
import pandas as pd
import numpy as np
import witwidget
import xgboost as xgb

from sklearn.utils import shuffle
from sklearn.metrics import accuracy_score
from witwidget.notebook.visualization import WitWidget, WitConfigBuilder

Download and pre-process data

In this section we'll:

Download the housing dataset from Google Cloud Storage
Shuffle the data and remove some columns that don't contribute much to the model output
Turn this into a classification problem by converting labels to 0/1 format indicating if the house is worth more or less than $160k
Because XGBoost requires all columns to be numerical, we'll convert all categorical columns to dummy columns (0 or 1 values for each possible category value)
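
As a rough illustration of the last two steps, here is a minimal sketch on a tiny hand-made DataFrame. The rows and the `toy_*` names are hypothetical and only stand in for the real housing data, which is loaded further down.

```python
import pandas as pd

# Hypothetical toy rows; the notebook itself reads housing-train.csv instead.
toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250, 9550],
    'BldgType': ['1Fam', 'TwnhsE', '1Fam', 'Duplex'],
    'SalePrice': [208500, 140000, 223500, 160000],
})

# Separate the label column and binarize it: 1 if the price is above 160k, else 0.
toy_labels = (toy['SalePrice'] > 160000).astype(int)
toy_features = toy.drop(columns=['SalePrice'])

# One-hot encode the categorical column; numeric columns pass through unchanged.
toy_features = pd.get_dummies(toy_features)

print(toy_features.columns.tolist())
print(toy_labels.tolist())
```

Note that with a strict `>` comparison a house priced at exactly 160000 lands in the 0 class.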



In [23]:

    
# Original data source: http://jse.amstat.org/v19n3/decock.pdf
# Dataset on Kaggle: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
!gsutil cp gs://housing_model_data/housing-train.csv .









    



Copying gs://housing_model_data/housing-train.csv...
/ [1 files][449.9 KiB/449.9 KiB]                                                
Operation completed over 1 objects/449.9 KiB.



In [24]:

    
data = pd.read_csv('housing-train.csv')



In [25]:

    
data = shuffle(data, random_state=2)



In [27]:

    
# Drop columns that don't have much effect on model outcome
data = data.drop(columns=[
    'Id', 'MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour',
    'Utilities', 'LotConfig', 'LandSlope', 'Condition2', 'RoofStyle', 'RoofMatl',
    'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterCond',
    'Foundation', 'BsmtExposure', 'BsmtFinSF2', 'BsmtFinType1', 'BsmtFinType2',
    'Heating', 'CentralAir', 'Electrical', 'BsmtHalfBath', 'LowQualFinSF',
    'KitchenAbvGr', 'KitchenQual', 'Functional', 'GarageQual', 'GarageCond',
    'PavedDrive', 'EnclosedPorch', 'PoolArea', 'PoolQC', 'MiscFeature', 'MiscVal',
    'SaleType', 'MoSold', 'SaleCondition'])



In [28]:

    
train_size = int(len(data) * .8)
data.head()









    Out[28]:







  
    
      
	LotFrontage	LotArea	Neighborhood	Condition1	BldgType	HouseStyle	OverallQual	OverallCond	YearBuilt	YearRemodAdd	...	GarageFinish	GarageCars	GarageArea	WoodDeckSF	OpenPorchSF	3SsnPorch	ScreenPorch	Fence	YrSold	SalePrice
503	100.0	15602	Crawfor	Norm	1Fam	1Story	7	8	1959	1997	...	Fin	2	484	0	54	0	161	GdWo	2010	289000
101	77.0	9206	SawyerW	Norm	1Fam	2Story	6	5	1985	1985	...	Fin	2	476	192	46	0	0	NaN	2010	178000
608	78.0	12168	Crawfor	Norm	1Fam	2Story	8	6	1934	1998	...	Unf	2	380	0	0	0	0	NaN	2007	359100
1089	37.0	3316	Somerst	Norm	TwnhsE	1Story	8	5	2005	2005	...	Fin	2	550	0	84	0	0	NaN	2006	197000
819	44.0	6371	NridgHt	Norm	TwnhsE	1Story	7	5	2009	2010	...	RFn	2	484	192	35	0	0	NaN	2010	224000
    
  

5 rows × 39 columns



In [29]:

    
labels = data['SalePrice']
data = data.drop(columns=['SalePrice'])



In [30]:

    
# Convert categorical columns to dummy columns and preview
data = pd.get_dummies(data) 
data.head()









    Out[30]:







  
    
      
	LotFrontage	LotArea	OverallQual	OverallCond	YearBuilt	YearRemodAdd	BsmtFinSF1	BsmtUnfSF	TotalBsmtSF	1stFlrSF	...	GarageType_BuiltIn	GarageType_CarPort	GarageType_Detchd	GarageFinish_Fin	GarageFinish_RFn	GarageFinish_Unf	Fence_GdPrv	Fence_GdWo	Fence_MnPrv	Fence_MnWw
503	100.0	15602	7	8	1959	1997	1247	254	1501	1801	...	0	0	0	1	0	0	0	1	0	0
101	77.0	9206	6	5	1985	1985	0	741	741	977	...	0	0	0	1	0	0	0	0	0	0
608	78.0	12168	8	6	1934	1998	428	537	965	1940	...	0	0	0	0	0	1	0	0	0	0
1089	37.0	3316	8	5	2005	2005	1039	208	1247	1247	...	0	0	0	1	0	0	0	0	0	0
819	44.0	6371	7	5	2009	2010	733	625	1358	1358	...	0	0	0	0	1	0	0	0	0	0
    
  

5 rows × 108 columns



In [32]:

    
# Convert labels to a classification problem
price_more_than_160k = (labels.values > 160000).astype(int)



In [33]:

    
# Split data into train and test sets
train_data = data[:train_size]
test_data = data[train_size:]

train_labels = price_more_than_160k[:train_size]
test_labels = price_more_than_160k[train_size:]

Train the XGBoost model



In [34]:

    
bst = xgb.XGBClassifier(objective='binary:logistic')



In [35]:

    
bst.fit(train_data.values, train_labels)









    Out[35]:





XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
              max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
              n_jobs=1, nthread=None, objective='binary:logistic',
              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              seed=None, silent=True, subsample=1)



In [37]:

    
# Get predictions on the test set and print the accuracy score
y_pred = bst.predict(test_data.values)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(test_labels, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))









    



Accuracy: 92.47%



In [38]:

    
bst.save_model('model.bst')

Deploy model to AI Platform

Copy your saved model file to Cloud Storage and deploy the model to AI Platform. In order for this to work, you'll need the Cloud AI Platform Models API enabled. Update the values in the next cell with the info for your GCP project and Cloud Storage bucket.



In [ ]:

    
# Define some globals - update these to your own project + model names
GCP_PROJECT = 'YOUR_GCP_PROJECT'
MODEL_BUCKET = 'your_GCS_bucket_name'
VERSION_NAME = 'v1'



In [42]:

    
!gsutil cp model.bst gs://$MODEL_BUCKET









    



Copying file://model.bst [Content-Type=application/octet-stream]...
/ [1 files][ 61.4 KiB/ 61.4 KiB]                                                
Operation completed over 1 objects/61.4 KiB.



In [43]:

    
!gcloud config set project $GCP_PROJECT









    



Updated property [core/project].



In [44]:

    
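# Create a version; this will take ~2 minutes to deploy.
# This assumes the model resource housing_classification already exists
# (it can be created with: gcloud ai-platform models create housing_classification).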
!gcloud ai-platform versions create $VERSION_NAME \
--model=housing_classification \
--framework='XGBOOST' \
--runtime-version=1.14 \
--origin=gs://$MODEL_BUCKET \
--python-version=3.5









    



Creating version (this might take a few minutes)......done.
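
Before wiring the deployed version into the What-if Tool, it can help to see what an online prediction request for it looks like. The sketch below only builds the version resource name and JSON body; `build_predict_request` is a hypothetical helper, not part of any Google library, and the three-value row is a placeholder (a real request needs all 108 feature columns in training order).

```python
import json

def build_predict_request(project, model, version, rows):
    """Return the version resource name and request body for online prediction."""
    name = 'projects/{}/models/{}/versions/{}'.format(project, model, version)
    # Each instance is a plain list of feature values in training column order.
    body = {'instances': [list(row) for row in rows]}
    return name, body

# Placeholder row for illustration; in the notebook you would pass something
# like test_data.values[:2].tolist() so that the values are JSON-serializable.
name, body = build_predict_request(
    'YOUR_GCP_PROJECT', 'housing_classification', 'v1', [[100.0, 15602, 7]])

print(name)
print(json.dumps(body))

# Assumption: with the google-api-python-client package installed and
# credentials configured, the request could be sent roughly like this:
#   service = googleapiclient.discovery.build('ml', 'v1')
#   response = service.projects().predict(name=name, body=body).execute()
```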

Using the What-if Tool to interpret your model

Once your model has deployed, you're ready to connect it to the What-if Tool using the WitWidget.



In [45]:

    
# Format a subset of the test data to send to the What-if Tool for visualization
# Append the ground truth label value to each test example

test_examples = np.hstack((test_data.values[:100], test_labels[:100].reshape(-1,1)))



In [ ]:

    
# Create a What-if Tool visualization; it may take a minute to load
# See the cell below this for exploration ideas

# This prediction adjustment function is needed as this xgboost model's
# prediction returns just a score for the positive class of the binary
# classification, whereas the What-If Tool expects a list of scores for each
# class (in this case, both the negative class and the positive class).

def adjust_prediction(pred):
  return [1 - pred, pred]

config_builder = (WitConfigBuilder(test_examples.tolist(), data.columns.tolist() + ['SalePrice'])
  .set_ai_platform_model(GCP_PROJECT, 'housing_classification', VERSION_NAME, adjust_prediction=adjust_prediction)
  .set_target_feature('SalePrice')
  .set_label_vocab(['Under160', 'Over160']))
WitWidget(config_builder, height=800)
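
To make the role of the prediction adjustment function concrete, here is a small standalone sketch. The score `0.75` is a made-up value standing in for what the deployed model might return for one house, and picking the label by the larger score is only an illustration, not a description of the tool's internals.

```python
def adjust_prediction(pred):
    # Same shape as the helper above: [negative-class score, positive-class score].
    return [1 - pred, pred]

label_vocab = ['Under160', 'Over160']

# Hypothetical positive-class score for a single example.
score = 0.75
scores = adjust_prediction(score)

# The position of the larger score selects the label from the vocab,
# so index 0 maps to 'Under160' and index 1 to 'Over160'.
predicted_label = label_vocab[scores.index(max(scores))]
print(scores, predicted_label)
```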

What-if Tool exploration ideas

Individual data points: the default graph shows all data points from the test set, colored by their ground truth label (priced over or under 160k)
- Try selecting data points close to the middle and tweaking some of their feature values. Then run inference again to see if the model prediction changes
- Select a data point and then select the "Show nearest counterfactual datapoint" radio button. This will highlight a data point with feature values closest to your original one, but with the opposite prediction
Binning data: create separate graphs for individual features
- From the "Binning - X axis" dropdown, try selecting one of the dummy columns, for example "ExterQual_Gd". This will create 2 separate graphs, one for houses where the exterior quality was rated as "Good" (graph labeled 1), and one for houses with other exterior quality ratings (graph labeled 0). This shows us that houses with a good exterior quality rating have a higher likelihood of being priced over 160k.
Exploring overall performance: Click on the "Performance & Fairness" tab to view overall performance statistics on the model's results on the provided dataset, including confusion matrices, PR curves, and ROC curves.
- Experiment with the threshold slider, raising and lowering the positive classification score the model needs to return before it prices a house at over 160k, and see how it changes accuracy, false positives, and false negatives.
- On the left side "Slice by" menu, select "GarageType_Attchd". You'll now see performance on the two subsets of your data: the "0" slice shows when the garage is not attached, and the "1" slice is for when a garage is attached to the house. Check out the accuracy, false positive, and false negative rates between the two slices to look for differences in performance. If you expand the rows to look at the confusion matrices, you can see that the model predicts a price of over 160k more frequently for houses with attached garages.
- You can use the optimization buttons on the left side to have the tool auto-select different positive classification thresholds for each slice in order to achieve different goals. If you select the "Demographic parity" button, then the two thresholds will be adjusted so that the model predicts over 160k for a similar percentage of houses in both slices. What does this do to the accuracy, false positives and false negatives for each slice?
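
The effect of the threshold slider can be reproduced offline with a few lines of Python. This is a minimal sketch on made-up scores and labels; `confusion_counts` is a hypothetical helper, and treating a score greater than or equal to the threshold as positive is an assumption about how the cutoff is applied.

```python
def confusion_counts(scores, labels, threshold):
    """Count (tp, fp, tn, fn) when scores >= threshold are classified as positive."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

# Made-up positive-class scores and ground truth labels (1 = over 160k).
scores = [0.1, 0.4, 0.45, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 1, 0, 1]

for threshold in (0.3, 0.5, 0.7):
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    accuracy = float(tp + tn) / len(labels)
    print('threshold=%.1f accuracy=%.2f false_pos=%d false_neg=%d'
          % (threshold, accuracy, fp, fn))
```

Raising the threshold trades false positives for false negatives, which is the same trade-off the slider exposes in the "Performance & Fairness" tab.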
