Online Prediction with XGBoost on AI Platform

This notebook uses the Census Income Data Set to create a simple XGBoost model, upload the model to Ai Platform, and query it for predictions.

How to bring your model to AI Platform

Getting your model ready for predictions can be done in 5 steps:

  1. Save your model to a file
  2. Upload the saved model to Google Cloud Storage
  3. Create a model resource on AI Platform
  4. Create a model version (linking your XGBoost model)
  5. Make an online prediction

Prerequisites

Before we begin, let’s cover some of the different tools you’ll use to get online prediction up and running on AI Platform.

Google Cloud Platform (GCP) lets you build and host applications and websites, store data, and analyze data on Google's scalable infrastructure.

AI Platform is a managed service that enables you to easily build machine learning models that work on any type of data, of any size.

Google Cloud Storage (GCS) is a unified object storage for developers and enterprises, from live data serving to data analytics/ML to data archiving.

Cloud SDK is a command line tool which allows you to interact with Google Cloud products. In order to run this notebook, make sure that Cloud SDK is installed in the same environment as your Jupyter kernel.

Part 0: Setup

These variables will be needed for the following steps.

In the cell below, replace the following highlighted elements:

  • <YOUR_PROJECT_ID> - with your project's id. Use the PROJECT_ID that matches your Google Cloud Platform project.
  • <YOUR_BUCKET_ID> - with the bucket id you created above.
  • <YOUR_MODEL_NAME> - with your model name, such as "census"
  • <YOUR_VERSION> - with your version name, such as "v1"
  • <ML_ENGINE_REGION> - select a region from here or use the default 'us-central1'. The region is where the model will be deployed.
  • data.json - a JSON file that contains the data used as input to your model’s predict method. (You'll create this in the code below)

In [1]:
%env PROJECT_ID <YOUR_PROJECT_ID>
%env BUCKET_ID <YOUR_BUCKET_ID>
%env MODEL_NAME <YOUR_MODEL_NAME>
%env VERSION_NAME <YOUR_VERSION>
%env REGION <ML_ENGINE_REGION>
%env INPUT_DATA_FILE data.json


env: PROJECT_ID=<YOUR_PROJECT_ID>
env: BUCKET_ID=<YOUR_BUCKET_ID>
env: MODEL_NAME=<YOUR_MODEL_NAME>
env: VERSION_NAME=<YOUR_VERSION>
env: REGION=<ML_ENGINE_REGION>
env: INPUT_DATA_FILE=data.json

Download the data

The Census Income Data Set that this sample uses for training is hosted by the UC Irvine Machine Learning Repository:

  • Training file is adult.data
  • Evaluation file is adult.test

Disclaimer

This dataset is provided by a third party. Google provides no representation, warranty, or other guarantees about the validity or any other aspects of this dataset.


In [2]:
# Create a directory to hold the data
! mkdir census_data

# Download the data
! curl https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data --output census_data/adult.data
! curl https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test --output census_data/adult.test


mkdir: cannot create directory ‘census_data’: File exists
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3881k  100 3881k    0     0  4617k      0 --:--:-- --:--:-- --:--:-- 4614k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1956k  100 1956k    0     0  1885k      0  0:00:01  0:00:01 --:--:-- 1886k

Part 1: Train/Save the model

First, the data is loaded into a pandas DataFrame. Then a simple model is created with the training set. Lastly, the model is saved to a .bst file that can then be uploaded to AI Platform.


In [3]:
import xgboost as xgb
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# these are the column labels from the census data files
COLUMNS = (
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
    'income-level'
)

# categorical columns contain data that need to be turned into numerical values before being used by XGBoost
CATEGORICAL_COLUMNS = (
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country'
)

# load training set
with open('./census_data/adult.data', 'r') as train_data:
    raw_training_data = pd.read_csv(train_data, header=None, names=COLUMNS)
# remove column we are trying to predict ('income-level') from features list
train_features = raw_training_data.drop('income-level', axis=1)
# create training labels list
train_labels = (raw_training_data['income-level'] == ' >50K')


# load test set
with open('./census_data/adult.test', 'r') as test_data:
    raw_testing_data = pd.read_csv(test_data, names=COLUMNS, skiprows=1)
# remove column we are trying to predict ('income-level') from features list
test_features = raw_testing_data.drop('income-level', axis=1)
# create training labels list
test_labels = (raw_testing_data['income-level'] == ' >50K.')

# convert data in categorical columns to numerical values
encoders = {col:LabelEncoder() for col in CATEGORICAL_COLUMNS}
for col in CATEGORICAL_COLUMNS:
    train_features[col] = encoders[col].fit_transform(train_features[col])
for col in CATEGORICAL_COLUMNS:
    test_features[col] = encoders[col].fit_transform(test_features[col])

# load data into DMatrix object
dtrain = xgb.DMatrix(train_features, train_labels)
dtest = xgb.DMatrix(test_features)

print 'data loading complete'


data loading complete

In [4]:
# train XGBoost model
bst = xgb.train({}, dtrain, 20)
bst.save_model('./model.bst')

print 'model trained and saved'


model trained and saved

Part 2: Upload the model

To use your model with AI Platform, it needs to be uploaded to Google Cloud Storage (GCS). So next, we'll take your local ‘model.bst’ file and upload it to GCS via the Cloud SDK using gsutil.

Before continuing, make sure you're properly authenticated and have access to the bucket. This next command sets your project to the one specified above.

Note: If you get an error below, make sure the Cloud SDK is installed in the kernel's environment.


In [5]:
! gcloud config set project $PROJECT_ID


Updated property [core/project].

Note: The exact file name of of the exported model you upload to GCS is important! Your model must be named “model.joblib”, “model.pkl”, or “model.bst” with respect to the library you used to export it. This restriction ensures that the model will be safely reconstructed later by using the same technique for import as was used during export.


In [6]:
! gsutil cp ./model.bst gs://$BUCKET_ID/model.bst


Copying file://./model.bst [Content-Type=application/octet-stream]...
/ [1 files][ 77.8 KiB/ 77.8 KiB]                                                
Operation completed over 1 objects/77.8 KiB.                                     

Part 3: Create a model resource

AI Platform organizes your trained models using model and version resources. An AI Platform model is a container for the versions of your machine learning model. For more information on model resources and model versions look here.

At this step, you create a container that you can use to hold several different versions of your actual model.


In [7]:
! gcloud ml-engine models create $MODEL_NAME --regions $REGION


Created AI Platform model [projects/demos-191212/models/census].

Part 4: Create a model version

Now it’s time to get your model online and ready for predictions. The model version requires a few components as specified here.

  • name - The name specified for the version when it was created. This will be the VERSION_NAME variable you declared at the beginning.
  • deployment Uri (curl) - The Google Cloud Storage location of the trained model used to create the version. This is the bucket that you uploaded the model to with your BUCKET_ID
  • runtime version - The Google Cloud ML runtime version to use for this deployment. This is set to 1.4
  • framework - The framework specifies if you are using: TENSORFLOW, SCIKIT_LEARN, XGBOOST. This is set to XGBOOST

Note: Runtime version 1.5 uses XGBoost 0.7. Runtime version 1.4 uses XGBoost 0.6. Please refer to the runtime version dependency list.

Note: It can take several minutes for you model to be available.


In [8]:
! curl -X POST -H "Content-Type: application/json" \
   -d '{"name": "'$VERSION_NAME'", "deploymentUri": "gs://'$BUCKET_ID'/", "runtimeVersion": "1.4", "framework": "XGBOOST"}' \
   -H "Authorization: Bearer `gcloud auth print-access-token`" \
    https://ml.googleapis.com/v1/projects/$PROJECT_ID/models/$MODEL_NAME/versions


{
  "name": "projects/demos-191212/operations/create_census_v1-1519973406339",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.ml.v1.OperationMetadata",
    "createTime": "2018-03-02T06:50:06Z",
    "operationType": "CREATE_VERSION",
    "modelName": "projects/demos-191212/models/census",
    "version": {
      "name": "projects/demos-191212/models/census/versions/v1",
      "deploymentUri": "gs://xgboost-tutorial-notebook-test/",
      "createTime": "2018-03-02T06:50:06Z",
      "runtimeVersion": "1.4",
      "framework": "XGBOOST"
    }
  }
}

Part 5: Make an online prediction

It’s time to make an online prediction with your newly deployed model. Before you begin, you'll need to take some of the test data and prepare it, so that the test data can be used by the deployed model.

To get online predictions, the data needs to be converted from a numpy array to a json array.


In [9]:
import json
import numpy as np

data = []
for i in range(len(test_features)):
  data.append([])
  for col in COLUMNS[:-1]: # ignore 'income-level' column as it isn't in feature set.
    # convert from numpy integers to standard integers
    data[i].append(int(np.uint64(test_features[col][i]).item()))

# write the test data to a json file
with open('data.json', 'w') as outfile:
  json.dump(data, outfile)

# get one person that makes <=50K and one that makes >50K to test our model.
print('Show a person that makes <=50K:')
print('\tFeatures: {0} --> Label: {1}\n'.format(data[0], test_labels[0]))

with open('less_than_50K.json', 'w') as outfile:
  json.dump(data[0], outfile)

  
print('Show a person that makes >50K:')
print('\tFeatures: {0} --> Label: {1}'.format(data[2], test_labels[2]))

with open('more_than_50K.json', 'w') as outfile:
  json.dump(data[2], outfile)


Show a person that makes <=50K:
	Features: [25, 4, 226802, 1, 7, 4, 7, 3, 2, 1, 0, 0, 40, 38] --> Label: False

Show a person that makes >50K:
	Features: [28, 2, 336951, 7, 12, 2, 11, 0, 4, 1, 0, 0, 40, 38] --> Label: True

Use Gcloud to make online predictions

Use the two people (as seen in the table) gathered in the previous step for the gcloud predictions.

Person age workclass fnlwgt education education-num marital-status occupation
1 25 4 226802 1 7 4 7
2 28 2 336951 7 12 2 11
Person relationship race sex capital-gain capital-loss hours-per-week native-country (Label) income-level
1 3 2 1 0 0 40 38 False (<=50K)
2 0 4 1 0 0 40 38 True (>50K)

Creating a model version can take several minutes, check the status of your model version to see if it is available.


In [10]:
! gcloud ml-engine versions list --model $MODEL_NAME


NAME  DEPLOYMENT_URI                        STATE
v1    gs://xgboost-tutorial-notebook-test/  CREATING

In [11]:
! gcloud ml-engine versions list --model $MODEL_NAME


NAME  DEPLOYMENT_URI                        STATE
v1    gs://xgboost-tutorial-notebook-test/  READY

Test the model with an online prediction using the data of a person who makes <=50K.

Note: If you see an error, the model from Part 4 may not be created yet as it takes several minutes for a new model version to be created.


In [12]:
! gcloud ml-engine predict --model $MODEL_NAME --version $VERSION_NAME --json-instances less_than_50K.json


[0.09139516949653625]

Test the model with an online prediction using the data of a person who makes >50K.


In [13]:
! gcloud ml-engine predict --model $MODEL_NAME --version $VERSION_NAME --json-instances more_than_50K.json


[0.4006653428077698]

Realise how the cells above return floats instead of booleans. Let's deal with that below so the output type of the predictions match those of the test set labels. We'll set the prediction to True if it's greater than 0.5 and to False otherwise.

Use Python to make online predictions

We'll test the model with the entire test set and print out some of the results.

Note: If running notebook server on Compute Engine, make sure to "allow full access to all Cloud APIs".


In [14]:
import googleapiclient.discovery
import os

PROJECT_ID = os.environ['PROJECT_ID']
VERSION_NAME = os.environ['VERSION_NAME']
MODEL_NAME = os.environ['MODEL_NAME']

service = googleapiclient.discovery.build('ml', 'v1')
name = 'projects/{}/models/{}'.format(PROJECT_ID, MODEL_NAME)
name += '/versions/{}'.format(VERSION_NAME)

response = service.projects().predict(
    name=name,
    body={'instances': data}
).execute()

if 'error' in response:
  print (response['error'])
else:
  online_results = response['predictions']
  # convert floats to booleans
  converted_responses = [x > 0.5 for x in online_results]
  # Print the first 10 responses
  for i, response in enumerate(converted_responses[:5]):
    print('Prediction: {}\tLabel: {}'.format(response, test_labels[i]))


Prediction: False	Label: False
Prediction: False	Label: False
Prediction: False	Label: True
Prediction: True	Label: True
Prediction: False	Label: False

[Optional] Part 6: Verify Results

Let's visualise our predictions with a confusion matrix.


In [15]:
actual = pd.Series(test_labels, name='actual')
online = pd.Series(converted_responses, name='online')

pd.crosstab(actual,online)


Out[15]:
online False True
actual
False 11842 593
True 1520 2326

Let's compare this with the confusion matrix of our local model.


In [16]:
local_results = bst.predict(dtest)
converted_local = [x > 0.5 for x in local_results]
local = pd.Series(converted_local, name='local')

pd.crosstab(actual,local)


Out[16]:
local False True
actual
False 11842 593
True 1520 2326

Better, let's compare the raw results (pre-boolean-conversion) of our local and online models.


In [17]:
identical = 0
different = 0

for i in xrange(len(online_results)):
    if online_results[i] == local_results[i]:
        identical += 1
    else:
        different += 1
        
print 'identical: {}, different: {}'.format(identical,different)


identical: 16281, different: 0

If all results are identical, it means we've successfully uploaded our local model to AI Platform and performed online predictions correctly.