AI Explanations: Explaining a tabular data model

Overview

In this tutorial we will perform the following steps:

  1. Build and train a Keras model.
  2. Export the Keras model as a TF 1 SavedModel and deploy the model on Cloud AI Platform.
  3. Compute explanations for our model's predictions using Explainable AI on Cloud AI Platform.

Dataset

The dataset used for this tutorial was created from a BigQuery Public Dataset: NYC 2018 Yellow Taxi data.

Objective

The goal is to train a model using the Keras functional API that predicts the total amount a customer pays (fare plus tolls) for a taxi ride, given the pickup location, dropoff location, day of the week, and hour of the day.

This tutorial focuses more on deploying the model to AI Explanations than on the design of the model itself. We will be using preprocessed data for this lab. If you wish to know more about the data and how it was preprocessed, please see this notebook.

Before you begin

This notebook was written to run in Google Colaboratory. It will also run on Cloud AI Platform Notebooks or in your local environment if the proper packages are installed.

Make sure you're running this notebook in a GPU runtime if you have that option. In Colab, select Runtime --> Change runtime type and set the Hardware accelerator to GPU.
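
To confirm that a GPU is actually attached (an optional check, assuming an NVIDIA GPU runtime), you can run the cell below.


In [0]:
# Show the attached GPU, if any (prints an error message on CPU-only runtimes)
!nvidia-smi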

Authenticate your GCP account

If you are using AI Platform Notebooks, your environment is already authenticated. You should skip this step.

Be sure to change the PROJECT_ID below to your project before running the cell!


In [0]:
import os

PROJECT_ID = "michaelabel-gcp-training" 
os.environ["PROJECT_ID"] = PROJECT_ID

If you are using Colab, run the cell below and follow the instructions when prompted to authenticate your account via OAuth. Ignore the error message related to tensorflow-serving-api.


In [0]:
import sys
import warnings
warnings.filterwarnings('ignore')
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 
# If you are running this notebook in Colab, follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

if 'google.colab' in sys.modules:
  from google.colab import auth as google_auth
  google_auth.authenticate_user()
  !pip install witwidget --quiet
  !pip install tensorflow==1.15.2 --quiet
  !gcloud config set project $PROJECT_ID

elif "DL_PATH" in os.environ:
  !sudo pip install tabulate --quiet

Create a Cloud Storage bucket

The following steps are required, regardless of your notebook environment.

When you submit a training job using the Cloud SDK, you upload a Python package containing your training code to a Cloud Storage bucket. AI Platform runs the code from this package. In this tutorial, AI Platform also saves the trained model that results from your job in the same bucket. You can then create an AI Platform model version based on this output in order to serve online predictions.

Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.

You may also change the REGION variable, which is used for operations throughout the rest of this notebook. Make sure to choose a region where Cloud AI Platform services are available. Note that you may not use a Multi-Regional Storage bucket for training with AI Platform.


In [0]:
BUCKET_NAME = "michaelabel-gcp-training-ml" 
REGION = "us-central1"

os.environ['BUCKET_NAME'] = BUCKET_NAME
os.environ['REGION'] = REGION

Run the following cell to create your Cloud Storage bucket if it does not already exist.


In [0]:
%%bash
exists=$(gsutil ls -d | grep -w gs://${BUCKET_NAME}/)

if [ -n "$exists" ]; then
   echo -e "Bucket gs://${BUCKET_NAME} already exists."
    
else
   echo "Creating a new GCS bucket."
   gsutil mb -l ${REGION} gs://${BUCKET_NAME}
   echo -e "\nHere are your current buckets:"
   gsutil ls
fi

Import libraries for creating model

Import the libraries we'll be using in this tutorial. This tutorial has been tested with TensorFlow 1.15.2.


In [0]:
%tensorflow_version 1.x
import tensorflow as tf 
import tensorflow.feature_column as fc
import pandas as pd
import numpy as np 
import json
import time

# Should be 1.15.2
print(tf.__version__)

Downloading and preprocessing data

In this section you'll download the data to train and evaluate your model from a public GCS bucket. The original data has been preprocessed from the public BigQuery dataset linked above.


In [0]:
%%bash
# Copy the data to your notebook instance
mkdir -p taxi_preproc
gsutil cp -r gs://cloud-training/bootcamps/serverlessml/taxi_preproc/*_xai.csv ./taxi_preproc
ls -l taxi_preproc

Read the data with Pandas

We'll use Pandas to read the training and validation data into DataFrames. We will only use the first 7 columns of the CSV files for our model.


In [0]:
CSV_COLUMNS = ['fare_amount', 'dayofweek', 'hourofday', 'pickuplon',
             'pickuplat', 'dropofflon', 'dropofflat']

DAYS = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']
DTYPES = ['float32', 'str' , 'int32', 'float32' , 'float32' , 'float32' , 'float32' ]

def prepare_data(file_path):

  df = pd.read_csv(file_path, usecols = range(7), names = CSV_COLUMNS,
                   dtype = dict(zip(CSV_COLUMNS, DTYPES)), skiprows=1)
  
  labels = df['fare_amount'] 
  df = df.drop(columns=['fare_amount'])

  df['dayofweek'] = df['dayofweek'].map(dict(zip(DAYS, range(7)))).astype('float32')

  return df, labels

train_data, train_labels = prepare_data('./taxi_preproc/train_xai.csv')
valid_data, valid_labels = prepare_data('./taxi_preproc/valid_xai.csv')

In [0]:
# Preview the first 5 rows of training data
train_data.head()
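
For a quick look at the range and distribution of each feature, pandas' describe method prints summary statistics (an optional check).


In [0]:
# Summary statistics (count, mean, quartiles) for the numeric features
train_data.describe()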

Build, train, and evaluate our model with Keras

We'll use tf.keras to build our ML model, which takes our features as input and predicts the fare amount.

But first, we will do some feature engineering. We will use tf.feature_column and tf.keras.layers.Lambda to implement the feature engineering inside the model graph, which will simplify our serving_input_fn later.


In [0]:
# Create functions to compute engineered features in later Lambda layers
def euclidean(params):
  lat1, lon1, lat2, lon2 = params
  londiff = lon2 - lon1
  latdiff = lat2 - lat1
  return tf.sqrt(londiff*londiff + latdiff*latdiff)

In [0]:
NUMERIC_COLS = ['pickuplon', 'pickuplat', 'dropofflon', 'dropofflat', 'hourofday', 'dayofweek']

def transform(inputs):

  transformed = inputs.copy()

  transformed['euclidean'] = tf.keras.layers.Lambda(euclidean, name='euclidean')([
              inputs['pickuplat'],
              inputs['pickuplon'],
              inputs['dropofflat'],
              inputs['dropofflon']])
  
  feat_cols = {colname: fc.numeric_column(colname)
           for colname in NUMERIC_COLS}

  feat_cols['euclidean'] = fc.numeric_column('euclidean')

  print("BEFORE TRANSFORMATION")
  print("INPUTS:", inputs.keys())

  print("AFTER TRANSFORMATION")
  print("TRANSFORMED:", transformed.keys())
  print("FEATURES", feat_cols.keys()) 

  return transformed, feat_cols

def build_model():

  raw_inputs = {
          colname : tf.keras.layers.Input(name=colname, shape=(), dtype='float32')
            for colname in NUMERIC_COLS
      }
  
  transformed, feat_cols = transform(raw_inputs)

  dense_inputs = tf.keras.layers.DenseFeatures(feat_cols.values(),
                                               name = 'dense_input')(transformed)

  h1 = tf.keras.layers.Dense(64, activation='relu', name='h1')(dense_inputs)
  h2 = tf.keras.layers.Dense(32, activation='relu', name='h2')(h1)
  output = tf.keras.layers.Dense(1, activation='linear', name = 'output')(h2)

  model = tf.keras.models.Model(raw_inputs, output)

  return model

model = build_model()
model.summary()

In [0]:
# Compile the model and see a summary
optimizer = tf.keras.optimizers.Adam(0.001)

model.compile(loss='mean_squared_error', optimizer=optimizer,
              metrics = [tf.keras.metrics.RootMeanSquaredError()])

tf.keras.utils.plot_model(model, to_file='model_plot.png', show_shapes=True, 
                          show_layer_names=True, rankdir="TB")

Create an input data pipeline with tf.data

Per best practice, we will use tf.data to create our input data pipeline. Our data fits in an in-memory DataFrame, so we will use tf.data.Dataset.from_tensor_slices to create our pipeline.


In [0]:
def load_dataset(features, labels, mode):

  dataset = tf.data.Dataset.from_tensor_slices(({"dayofweek" : features["dayofweek"],
                                                 "hourofday" : features["hourofday"],
                                                 "pickuplat" : features["pickuplat"],
                                                 "pickuplon" : features["pickuplon"],
                                                 "dropofflat" : features["dropofflat"],
                                                 "dropofflon" : features["dropofflon"]},
                                                  labels
                                                    ))


  if mode == tf.estimator.ModeKeys.TRAIN:
    # Shuffle individual examples before batching for better randomization
    dataset = dataset.shuffle(buffer_size=256 * 10).repeat().batch(256)
  else:
    dataset = dataset.batch(256)

  return dataset.prefetch(1)


train_dataset = load_dataset(train_data, train_labels, tf.estimator.ModeKeys.TRAIN)
valid_dataset = load_dataset(valid_data, valid_labels, tf.estimator.ModeKeys.EVAL)

Train the model

Now we train the model. We will specify the number of epochs for which to train and tell Keras how many steps to expect per epoch; steps_per_epoch below is computed from the size of the training set (426,433 rows) and the batch size of 256.


In [0]:
tf.keras.backend.get_session().run(tf.tables_initializer(name='init_all_tables'))

steps_per_epoch = 426433 // 256

model.fit(train_dataset, steps_per_epoch=steps_per_epoch, validation_data=valid_dataset, epochs=10)

In [0]:
# Send test instances to model for prediction
predict = model.predict(valid_dataset, steps = 1)
predict[:5]
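
To get a rough sense of how close these predictions are, the sketch below compares the first few of them with the corresponding validation labels (the validation dataset is not shuffled, so rows line up).


In [0]:
# Compare the first few predictions with the actual fares
for pred, actual in zip(predict[:5].flatten(), valid_labels[:5].values):
  print('Predicted fare: {:.2f}  Actual fare: {:.2f}'.format(pred, actual))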

Export the model as a TF 1 SavedModel

In order to deploy our model in a format compatible with AI Explanations, we'll follow the steps below to convert our Keras model to a TF Estimator, and then use the export_saved_model method to generate the SavedModel and save it in GCS.


In [0]:
## Convert our Keras model to an estimator
keras_estimator = tf.keras.estimator.model_to_estimator(keras_model=model, model_dir='export')

In [0]:
print(model.input)

# We need this serving input function to export our model in the next cell
serving_fn = tf.estimator.export.build_raw_serving_input_receiver_fn(
    model.input
)

In [0]:
export_path = keras_estimator.export_saved_model(
  'gs://' + BUCKET_NAME + '/explanations',
  serving_input_receiver_fn=serving_fn
).decode('utf-8')

Use TensorFlow's saved_model_cli to inspect the model's SignatureDef. We'll use this information when we deploy our model to AI Explanations in the next section.


In [0]:
!saved_model_cli show --dir $export_path --all
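
If you only want the serving signature (the tensor names we'll reference in the explanation metadata below), you can narrow the output to serving_default.


In [0]:
# Show only the serving_default SignatureDef
!saved_model_cli show --dir $export_path --tag_set serve --signature_def serving_default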

Deploy the model to AI Explanations

In order to deploy the model to AI Explanations, we need to generate an explanation_metadata.json file and upload it to the Cloud Storage directory that contains our SavedModel. Then we'll deploy the model using gcloud.

Prepare explanation metadata

We need to tell AI Explanations the names of the input and output tensors our model is expecting, which we print below.

The value for input_baselines tells the explanations service what the baseline input should be for our model. Here we use the median of each location feature and the mode of the day-of-week and hour-of-day features. That means the baseline prediction for this model will be the fare our model predicts for this baseline input.


In [0]:
# Print the names of our tensors
print('Model input tensors: ', model.input)
print('Model output tensor: ', model.output.name)

In [0]:
baselines_med = train_data.median().values.tolist()
baselines_mode = train_data.mode().values.tolist()
print(baselines_med)
print(baselines_mode)

explanation_metadata = {
    "inputs": {
      "dayofweek": {
        "input_tensor_name": "dayofweek:0",
        "input_baselines": [baselines_mode[0][0]] # Thursday
      },
      "hourofday": {
        "input_tensor_name": "hourofday:0",
        "input_baselines": [baselines_mode[0][1]] # 8pm
      },
      "dropofflon": {
        "input_tensor_name": "dropofflon:0",
        "input_baselines": [baselines_med[4]] 
      },
      "dropofflat": {
        "input_tensor_name": "dropofflat:0",
        "input_baselines": [baselines_med[5]] 
      },
      "pickuplon": {
        "input_tensor_name": "pickuplon:0",
        "input_baselines": [baselines_med[2]] 
      },
      "pickuplat": {
        "input_tensor_name": "pickuplat:0",
        "input_baselines": [baselines_med[3]] 
      },
    },
    "outputs": {
      "dense": {
        "output_tensor_name": "output/BiasAdd:0"
      }
    },
  "framework": "tensorflow"
  }

print(explanation_metadata)

Since this is a regression model (predicting a numerical value), the baseline prediction will be the same for every example we send to the model. If this were instead a classification model, each class would have a different baseline prediction.
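
As an optional sanity check, you can compute the baseline prediction locally by feeding the baseline values into the Keras model; the result should be close to the baseline_score the explanations service returns later. This is a minimal sketch using the baselines computed above.


In [0]:
# Predict on the baseline input locally (indices follow the train_data column order)
baseline_input = {
    'dayofweek': np.array([baselines_mode[0][0]], dtype='float32'),
    'hourofday': np.array([baselines_mode[0][1]], dtype='float32'),
    'pickuplon': np.array([baselines_med[2]], dtype='float32'),
    'pickuplat': np.array([baselines_med[3]], dtype='float32'),
    'dropofflon': np.array([baselines_med[4]], dtype='float32'),
    'dropofflat': np.array([baselines_med[5]], dtype='float32'),
}
print(model.predict(baseline_input))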


In [0]:
# Write the json to a local file
with open('explanation_metadata.json', 'w') as output_file:
  json.dump(explanation_metadata, output_file)

In [0]:
!gsutil cp explanation_metadata.json $export_path

Create the model

Now we will create our model on Cloud AI Platform if it does not already exist.


In [0]:
MODEL = 'taxifare_explain'
os.environ["MODEL"] = MODEL

In [0]:
%%bash
exists=$(gcloud ai-platform models list | grep ${MODEL})

if [ -n "$exists" ]; then
   echo -e "Model ${MODEL} already exists."
    
else
   echo "Creating a new model."
   gcloud ai-platform models create ${MODEL}
fi

Create the model version

Creating the version will take ~5-10 minutes. Note that your first deploy may take longer.


In [0]:
# Each time you create a version the name should be unique
import datetime
now = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
VERSION_IG = 'v_IG_{}'.format(now)
VERSION_SHAP = 'v_SHAP_{}'.format(now)

In [0]:
# Create the version with gcloud
!gcloud beta ai-platform versions create $VERSION_IG \
--model $MODEL \
--origin $export_path \
--runtime-version 1.15 \
--framework TENSORFLOW \
--python-version 3.7 \
--machine-type n1-standard-4 \
--explanation-method 'integrated-gradients' \
--num-integral-steps 25

!gcloud beta ai-platform versions create $VERSION_SHAP \
--model $MODEL \
--origin $export_path \
--runtime-version 1.15 \
--framework TENSORFLOW \
--python-version 3.7 \
--machine-type n1-standard-4 \
--explanation-method 'sampled-shapley' \
--num-paths 50

In [0]:
# Make sure the model deployed correctly. State should be `READY` in the following log
!gcloud ai-platform versions describe $VERSION_IG --model $MODEL
!echo "---"
!gcloud ai-platform versions describe $VERSION_SHAP --model $MODEL

Getting predictions and explanations from the deployed model

Now that your model is deployed, you can use the AI Platform Prediction API to get feature attributions. We'll pass it a single test example and see which features were most important to the model's prediction, using gcloud to call our deployed model.

Format our request for gcloud

To use gcloud to make our AI Explanations request, we need to write the JSON to a file. Our example here is for a ride from the Google office in downtown Manhattan to LaGuardia Airport at 5pm on a Tuesday.

Note that we encode the day of the week as "2" rather than "Tue" (following the same Sun = 0, ..., Sat = 6 mapping used in prepare_data), since the days of the week were encoded outside of our model and serving input function.


In [0]:
# Format data for prediction to our model
!rm -f taxi-data.txt
!touch taxi-data.txt
prediction_json = {"dayofweek": "2", "hourofday": "17", "pickuplon": "-74.0026", "pickuplat": "40.7410", "dropofflat": "40.7790", "dropofflon": "-73.8772"}
with open('taxi-data.txt', 'a') as outfile:
  json.dump(prediction_json, outfile)

In [0]:
# Preview the contents of the data file
!cat taxi-data.txt

Making the explain request

Now we make the explanation requests. We will do this for both Integrated Gradients and Sampled Shapley, using the prediction JSON from above.


In [0]:
resp_obj = !gcloud beta ai-platform explain --model $MODEL --version $VERSION_IG --json-instances='taxi-data.txt'
response_IG = json.loads(resp_obj.s)
resp_obj

In [0]:
resp_obj = !gcloud beta ai-platform explain --model $MODEL --version $VERSION_SHAP --json-instances='taxi-data.txt'
response_SHAP = json.loads(resp_obj.s)
resp_obj
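
If you prefer to call the service from Python rather than shelling out to gcloud, the sketch below uses the google-api-python-client library (this assumes the library and application default credentials are available in your environment; the gcloud commands above are all this tutorial requires).


In [0]:
# Optional: call the explain endpoint with the Google API Python client
from googleapiclient import discovery

service = discovery.build('ml', 'v1')
name = 'projects/{}/models/{}/versions/{}'.format(PROJECT_ID, MODEL, VERSION_IG)

response = service.projects().explain(
    name=name,
    body={'instances': [prediction_json]}
).execute()

print(response['explanations'][0]['attributions_by_label'][0]['example_score'])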

Understanding the explanations response

First, let's look at the difference between the baseline prediction and the predicted taxi fare for our example.


In [0]:
explanations_IG = response_IG['explanations'][0]['attributions_by_label'][0]
explanations_SHAP = response_SHAP['explanations'][0]['attributions_by_label'][0]

predicted = round(explanations_SHAP['example_score'], 2)
baseline = round(explanations_SHAP['baseline_score'], 2 )
print('Baseline taxi fare: ' + str(baseline) + ' dollars')
print('Predicted taxi fare: ' + str(predicted) + ' dollars')

Next, let's look at the feature attributions for this particular example. A positive attribution value means that a feature pushed the model's prediction up by that amount, while a negative value means it pushed the prediction down. Which features seem most important? For this example, the location features are the most important.


In [0]:
from tabulate import tabulate

feature_names = valid_data.columns.tolist()
attributions_IG = explanations_IG['attributions']
attributions_SHAP = explanations_SHAP['attributions']
rows = []
for feat in feature_names:
  rows.append([feat, prediction_json[feat], attributions_IG[feat], attributions_SHAP[feat]])
print(tabulate(rows, headers=['Feature name', 'Feature value', 'Attribution value (IG)', 'Attribution value (SHAP)']))
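
For a quick visual comparison of the two attribution methods, you can plot the values side by side with matplotlib (an optional sketch; matplotlib is assumed to be available, as it is by default in Colab).


In [0]:
import matplotlib.pyplot as plt

# Collect the attribution values in a consistent feature order
ig_vals = [attributions_IG[feat] for feat in feature_names]
shap_vals = [attributions_SHAP[feat] for feat in feature_names]

x = np.arange(len(feature_names))
width = 0.35

plt.figure(figsize=(8, 4))
plt.bar(x - width / 2, ig_vals, width, label='Integrated Gradients')
plt.bar(x + width / 2, shap_vals, width, label='Sampled Shapley')
plt.xticks(x, feature_names, rotation=45)
plt.ylabel('Attribution (dollars)')
plt.title('Feature attributions for the example ride')
plt.legend()
plt.tight_layout()
plt.show()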