This notebook shows how to:
- Train an XGBoost classifier on a housing price dataset
- Deploy the trained model to Cloud AI Platform
- Explore the deployed model's predictions with the What-If Tool
You will need a Google Cloud Platform account and project to run this notebook. Instructions for creating a project can be found in the Google Cloud documentation.
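If you prefer the command line, a project can also be created with the gcloud CLI. A minimal sketch (the project ID below is a hypothetical placeholder; project IDs must be globally unique):
In [ ]:
# Hypothetical project ID -- replace with a globally unique ID of your own
!gcloud projects create your-unique-project-id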
In [2]:
import sys
python_version = sys.version_info[0]
In [6]:
# If you're running on Colab, you'll need to install the What-If Tool package and authenticate
# If you're on Cloud AI Platform Notebooks, you'll need to install XGBoost on the TF instance
def pip_install(module):
    if python_version == 2:
        !pip install {module} --quiet
    else:
        !pip3 install {module} --quiet

try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    pip_install('witwidget')
    from google.colab import auth
    auth.authenticate_user()
else:
    pip_install('xgboost')
In [18]:
import pandas as pd
import numpy as np
import witwidget
import xgboost as xgb
from sklearn.utils import shuffle
from sklearn.metrics import accuracy_score
from witwidget.notebook.visualization import WitWidget, WitConfigBuilder
In this section we'll:
- Download the housing dataset and drop columns we won't use
- Convert categorical features to dummy columns and turn the sale price into a binary label
- Split the data, train an XGBoost classifier, and evaluate its accuracy
In [23]:
# Original data source: http://jse.amstat.org/v19n3/decock.pdf
# Dataset on Kaggle: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
!gsutil cp gs://housing_model_data/housing-train.csv .
In [24]:
data = pd.read_csv('housing-train.csv')
In [25]:
data = shuffle(data, random_state=2)
In [27]:
# Drop columns that don't have much effect on model outcome
data = data.drop(columns=['Id', 'MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape',
                          'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Condition2',
                          'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
                          'MasVnrArea', 'ExterCond', 'Foundation', 'BsmtExposure', 'BsmtFinSF2',
                          'BsmtFinType1', 'BsmtFinType2', 'Heating', 'CentralAir', 'Electrical',
                          'BsmtHalfBath', 'LowQualFinSF', 'KitchenAbvGr', 'KitchenQual',
                          'Functional', 'GarageQual', 'GarageCond', 'PavedDrive', 'EnclosedPorch',
                          'PoolArea', 'PoolQC', 'MiscFeature', 'MiscVal', 'SaleType', 'MoSold',
                          'SaleCondition'])
In [28]:
train_size = int(len(data) * .8)
data.head()
Out[28]:
In [29]:
labels = data['SalePrice']
data = data.drop(columns=['SalePrice'])
In [30]:
# Convert categorical columns to dummy columns and preview
data = pd.get_dummies(data)
data.head()
Out[30]:
In [32]:
# Turn the regression target into a binary classification label: is the sale price above $160k?
price_more_than_160k = (labels.values > 160000).astype(int)
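Before splitting the data, it can be worth sanity-checking the balance of the new binary label. This optional check (not part of the original flow) counts how many examples fall on each side of the $160k threshold:
In [ ]:
# Optional sanity check: counts of label 0 (under $160k) and label 1 (over $160k)
np.bincount(price_more_than_160k)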
In [33]:
# Split data into train and test sets
train_data = data[:train_size]
test_data = data[train_size:]
train_labels = price_more_than_160k[:train_size]
test_labels = price_more_than_160k[train_size:]
In [34]:
bst = xgb.XGBClassifier(objective='binary:logistic')
In [35]:
bst.fit(train_data.values, train_labels)
Out[35]:
In [37]:
# Get predictions on the test set and print the accuracy score
y_pred = bst.predict(test_data.values)  # XGBClassifier.predict returns class labels directly
accuracy = accuracy_score(test_labels, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
In [38]:
bst.save_model('model.bst')
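As an optional sanity check before deploying, you can reload the saved file with the low-level Booster API and confirm its predictions agree with the in-memory model. This is a sketch under the assumption that the booster outputs positive-class probabilities (it does for binary:logistic); `loaded` is a name introduced here:
In [ ]:
# Optional: reload the saved model and verify it matches the in-memory predictions
loaded = xgb.Booster()
loaded.load_model('model.bst')
probs = loaded.predict(xgb.DMatrix(test_data.values))  # positive-class probabilities
assert ((probs > 0.5).astype(int) == y_pred).all()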
In [ ]:
# Define some globals - update these to your own project, bucket, and version names.
# The GCS bucket must already exist.
GCP_PROJECT = 'YOUR_GCP_PROJECT'
MODEL_BUCKET = 'your_GCS_bucket_name'
VERSION_NAME = 'v1'
In [42]:
!gsutil cp model.bst gs://$MODEL_BUCKET
In [43]:
!gcloud config set project $GCP_PROJECT
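The version created below is attached to a model resource named housing_classification. If that model doesn't exist in your project yet, create it first (one-time setup; the region here is an assumption, pick whichever suits you):
In [ ]:
# One-time setup: create the model resource the version will be attached to
!gcloud ai-platform models create housing_classification --regions us-central1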
In [44]:
# Create a version; this will take ~2 minutes to deploy
!gcloud ai-platform versions create $VERSION_NAME \
  --model=housing_classification \
  --framework='XGBOOST' \
  --runtime-version=1.14 \
  --origin=gs://$MODEL_BUCKET \
  --python-version=3.5
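Once the version finishes deploying, you can spot-check it from the notebook with gcloud. A hedged sketch: `predictions.json` is a hypothetical file name, each line of a --json-instances file must be one JSON instance (for XGBoost, a list of feature values), and NaNs are zeroed out here as a simplification so the line is valid JSON:
In [ ]:
# Optional: send one test example to the deployed model and print its prediction
import json
with open('predictions.json', 'w') as f:
    # Replace NaNs with 0 for JSON serialization (a simplification for this spot check)
    f.write(json.dumps(np.nan_to_num(test_data.values[0]).tolist()))
!gcloud ai-platform predict --model=housing_classification --version=$VERSION_NAME --json-instances=predictions.json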
In [45]:
# Format a subset of the test data to send to the What-If Tool for visualization
# Append the ground truth label value to each test example
test_examples = np.hstack((test_data.values[:100], test_labels[:100].reshape(-1, 1)))
In [ ]:
# Create a What-If Tool visualization; it may take a minute to load.
# See the cell below this for exploration ideas.
# This prediction adjustment function is needed because this xgboost model's
# prediction returns just a score for the positive class of the binary
# classification, whereas the What-If Tool expects a list of scores for each
# class (in this case, both the negative class and the positive class).
def adjust_prediction(pred):
    return [1 - pred, pred]

config_builder = (WitConfigBuilder(test_examples.tolist(), data.columns.tolist() + ['SalePrice'])
                  .set_ai_platform_model(GCP_PROJECT, 'housing_classification', VERSION_NAME,
                                         adjust_prediction=adjust_prediction)
                  .set_target_feature('SalePrice')
                  .set_label_vocab(['Under160', 'Over160']))
WitWidget(config_builder, height=800)
Some ideas for exploring the What-If Tool visualization above:
- Individual data points: the default graph shows all data points from the test set, colored by their ground truth label (priced over or under $160k)
- Binning data: create separate graphs for individual features
- Exploring overall performance: click on the "Performance & Fairness" tab to view overall performance statistics for the model on the provided dataset, including confusion matrices, PR curves, and ROC curves