The dataset used for this tutorial was created from a BigQuery Public Dataset: NYC 2018 Yellow Taxi data.
The goal is to train a model using the Keras functional API that predicts how much a customer will be charged (fares + tolls) for a taxi ride, given the pickup location, dropoff location, the day of the week, and the hour of the day.
This tutorial focuses more on deploying the model to AI Explanations than on the design of the model itself. We will be using preprocessed data for this lab. If you wish to know more about the data and how it was preprocessed, please see this notebook.
This notebook was written with Google Colaboratory in mind. It will also run on Cloud AI Platform Notebooks or in your local environment if the proper packages are installed.
Make sure you're running this notebook in a GPU runtime if you have that option. In Colab, select Runtime --> Change runtime type and select GPU for Hardware accelerator.
In [0]:
import os
PROJECT_ID = "michaelabel-gcp-training"
os.environ["PROJECT_ID"] = PROJECT_ID
If you are using Colab, run the cell below and follow the instructions when prompted to authenticate your account via OAuth. You can ignore any error message related to tensorflow-serving-api.
In [0]:
import sys
import warnings

warnings.filterwarnings('ignore')
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

# If you are running this notebook in Colab, follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.
if 'google.colab' in sys.modules:
    from google.colab import auth as google_auth
    google_auth.authenticate_user()
    !pip install witwidget --quiet
    !pip install tensorflow==1.15.2 --quiet
    !gcloud config set project $PROJECT_ID
elif "DL_PATH" in os.environ:
    !sudo pip install tabulate --quiet
The following steps are required, regardless of your notebook environment.
When you submit a training job using the Cloud SDK, you upload a Python package containing your training code to a Cloud Storage bucket. AI Platform runs the code from this package. In this tutorial, AI Platform also saves the trained model that results from your job in the same bucket. You can then create an AI Platform model version based on this output in order to serve online predictions.
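This tutorial trains the Keras model inside the notebook and only uses the bucket for the exported SavedModel, so no training package is submitted here. For reference, a packaged training job submitted through the Cloud SDK generally takes the shape of the commented sketch below; the trainer package, module name, and job name are placeholders, not files created in this lab.
In [0]:
%%bash
# Reference sketch only -- do not run as-is; ./trainer and trainer.task are
# placeholders for a real training package, which this lab does not create.
#
# gcloud ai-platform jobs submit training taxifare_training_$(date +%Y%m%d_%H%M%S) \
#   --package-path ./trainer \
#   --module-name trainer.task \
#   --staging-bucket gs://${BUCKET_NAME} \
#   --region ${REGION} \
#   --runtime-version 1.15 \
#   --python-version 3.7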
Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.
You may also change the REGION variable, which is used for operations throughout the rest of this notebook. Make sure to choose a region where Cloud AI Platform services are available. Note that you may not use a Multi-Regional Storage bucket for training with AI Platform.
In [0]:
BUCKET_NAME = "michaelabel-gcp-training-ml"
REGION = "us-central1"
os.environ['BUCKET_NAME'] = BUCKET_NAME
os.environ['REGION'] = REGION
Run the following cell to create your Cloud Storage bucket if it does not already exist.
In [0]:
%%bash
exists=$(gsutil ls -d | grep -w gs://${BUCKET_NAME}/)

if [ -n "$exists" ]; then
    echo -e "Bucket gs://${BUCKET_NAME} already exists."
else
    echo "Creating a new GCS bucket."
    gsutil mb -l ${REGION} gs://${BUCKET_NAME}
    echo -e "\nHere are your current buckets:"
    gsutil ls
fi
In [0]:
%tensorflow_version 1.x
import tensorflow as tf
import tensorflow.feature_column as fc
import pandas as pd
import numpy as np
import json
import time
# Should be 1.15.2
print(tf.__version__)
In [0]:
%%bash
# Copy the data to your notebook instance
mkdir taxi_preproc
gsutil cp -r gs://cloud-training/bootcamps/serverlessml/taxi_preproc/*_xai.csv ./taxi_preproc
ls -l taxi_preproc
In [0]:
CSV_COLUMNS = ['fare_amount', 'dayofweek', 'hourofday', 'pickuplon',
               'pickuplat', 'dropofflon', 'dropofflat']
DAYS = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']
DTYPES = ['float32', 'str', 'int32', 'float32', 'float32', 'float32', 'float32']

def prepare_data(file_path):
    df = pd.read_csv(file_path, usecols=range(7), names=CSV_COLUMNS,
                     dtype=dict(zip(CSV_COLUMNS, DTYPES)), skiprows=1)
    labels = df['fare_amount']
    df = df.drop(columns=['fare_amount'])
    # Encode the day of week as an integer: Sun = 0, Mon = 1, ..., Sat = 6
    df['dayofweek'] = df['dayofweek'].map(dict(zip(DAYS, range(7)))).astype('float32')
    return df, labels

train_data, train_labels = prepare_data('./taxi_preproc/train_xai.csv')
valid_data, valid_labels = prepare_data('./taxi_preproc/valid_xai.csv')
In [0]:
# Preview the first 5 rows of training data
train_data.head()
We'll use tf.keras to build our ML model, which takes our features as input and predicts the fare amount.
But first, we will do some feature engineering. We will use tf.feature_column and tf.keras.layers.Lambda to implement the feature engineering inside the model graph, which simplifies our serving_input_fn later.
In [0]:
# Create functions to compute engineered features in later Lambda layers
def euclidean(params):
    lat1, lon1, lat2, lon2 = params
    londiff = lon2 - lon1
    latdiff = lat2 - lat1
    return tf.sqrt(londiff*londiff + latdiff*latdiff)
In [0]:
NUMERIC_COLS = ['pickuplon', 'pickuplat', 'dropofflon', 'dropofflat', 'hourofday', 'dayofweek']

def transform(inputs):
    transformed = inputs.copy()
    transformed['euclidean'] = tf.keras.layers.Lambda(euclidean, name='euclidean')([
        inputs['pickuplat'],
        inputs['pickuplon'],
        inputs['dropofflat'],
        inputs['dropofflon']])
    feat_cols = {colname: fc.numeric_column(colname)
                 for colname in NUMERIC_COLS}
    feat_cols['euclidean'] = fc.numeric_column('euclidean')
    print("BEFORE TRANSFORMATION")
    print("INPUTS:", inputs.keys())
    print("AFTER TRANSFORMATION")
    print("TRANSFORMED:", transformed.keys())
    print("FEATURES", feat_cols.keys())
    return transformed, feat_cols

def build_model():
    raw_inputs = {
        colname: tf.keras.layers.Input(name=colname, shape=(), dtype='float32')
        for colname in NUMERIC_COLS
    }
    transformed, feat_cols = transform(raw_inputs)
    dense_inputs = tf.keras.layers.DenseFeatures(feat_cols.values(),
                                                 name='dense_input')(transformed)
    h1 = tf.keras.layers.Dense(64, activation='relu', name='h1')(dense_inputs)
    h2 = tf.keras.layers.Dense(32, activation='relu', name='h2')(h1)
    output = tf.keras.layers.Dense(1, activation='linear', name='output')(h2)
    model = tf.keras.models.Model(raw_inputs, output)
    return model

model = build_model()
model.summary()
In [0]:
# Compile the model and visualize its architecture
optimizer = tf.keras.optimizers.Adam(0.001)
model.compile(loss='mean_squared_error', optimizer=optimizer,
metrics = [tf.keras.metrics.RootMeanSquaredError()])
tf.keras.utils.plot_model(model, to_file='model_plot.png', show_shapes=True,
show_layer_names=True, rankdir="TB")
In [0]:
def load_dataset(features, labels, mode):
    dataset = tf.data.Dataset.from_tensor_slices(({"dayofweek": features["dayofweek"],
                                                   "hourofday": features["hourofday"],
                                                   "pickuplat": features["pickuplat"],
                                                   "pickuplon": features["pickuplon"],
                                                   "dropofflat": features["dropofflat"],
                                                   "dropofflon": features["dropofflon"]},
                                                  labels))
    if mode == tf.estimator.ModeKeys.TRAIN:
        dataset = dataset.repeat().batch(256).shuffle(256 * 10)
    else:
        dataset = dataset.batch(256)
    return dataset.prefetch(1)

train_dataset = load_dataset(train_data, train_labels, tf.estimator.ModeKeys.TRAIN)
valid_dataset = load_dataset(valid_data, valid_labels, tf.estimator.ModeKeys.EVAL)
In [0]:
tf.keras.backend.get_session().run(tf.tables_initializer(name='init_all_tables'))
steps_per_epoch = 426433 // 256  # number of training examples // batch size
model.fit(train_dataset, steps_per_epoch=steps_per_epoch, validation_data=valid_dataset, epochs=10)
In [0]:
# Send test instances to model for prediction
predict = model.predict(valid_dataset, steps = 1)
predict[:5]
In [0]:
## Convert our Keras model to an estimator
keras_estimator = tf.keras.estimator.model_to_estimator(keras_model=model, model_dir='export')
In [0]:
print(model.input)
# We need this serving input function to export our model in the next cell
serving_fn = tf.estimator.export.build_raw_serving_input_receiver_fn(
model.input
)
In [0]:
export_path = keras_estimator.export_saved_model(
'gs://' + BUCKET_NAME + '/explanations',
serving_input_receiver_fn=serving_fn
).decode('utf-8')
Use TensorFlow's saved_model_cli to inspect the model's SignatureDef. We'll use this information when we deploy our model to AI Explanations in the next section.
In [0]:
!saved_model_cli show --dir $export_path --all
We need to tell AI Explanations the names of the input and output tensors our model is expecting, which we print below.
The value for input_baselines tells the explanations service what the baseline input should be for our model. Here we're using the mode for the day of week and hour of day and the median for the location features. That means the baseline prediction for this model will be the fare our model predicts for those baseline feature values.
In [0]:
# Print the names of our tensors
print('Model input tensors: ', model.input)
print('Model output tensor: ', model.output.name)
In [0]:
baselines_med = train_data.median().values.tolist()
baselines_mode = train_data.mode().values.tolist()
print(baselines_med)
print(baselines_mode)

explanation_metadata = {
    "inputs": {
        "dayofweek": {
            "input_tensor_name": "dayofweek:0",
            "input_baselines": [baselines_mode[0][0]]  # Thursday
        },
        "hourofday": {
            "input_tensor_name": "hourofday:0",
            "input_baselines": [baselines_mode[0][1]]  # 8pm
        },
        "dropofflon": {
            "input_tensor_name": "dropofflon:0",
            "input_baselines": [baselines_med[4]]
        },
        "dropofflat": {
            "input_tensor_name": "dropofflat:0",
            "input_baselines": [baselines_med[5]]
        },
        "pickuplon": {
            "input_tensor_name": "pickuplon:0",
            "input_baselines": [baselines_med[2]]
        },
        "pickuplat": {
            "input_tensor_name": "pickuplat:0",
            "input_baselines": [baselines_med[3]]
        },
    },
    "outputs": {
        "dense": {
            "output_tensor_name": "output/BiasAdd:0"
        }
    },
    "framework": "tensorflow"
}

print(explanation_metadata)
Since this is a regression model (predicting a numerical value), the baseline prediction will be the same for every example we send to the model. If this were instead a classification model, each class would have a different baseline prediction.
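As an optional sanity check (an illustrative sketch, not part of the original flow), we can reproduce that baseline prediction locally by feeding the same baseline values into our in-memory Keras model: the mode for dayofweek and hourofday, and the median for the four location features. The result should be close to the baseline_score that AI Explanations returns with each explanation later on.
In [0]:
# Build the baseline input in train_data column order:
# dayofweek, hourofday, pickuplon, pickuplat, dropofflon, dropofflat
baseline_values = [baselines_mode[0][0], baselines_mode[0][1],
                   baselines_med[2], baselines_med[3],
                   baselines_med[4], baselines_med[5]]
baseline_input = {col: np.array([val], dtype='float32')
                  for col, val in zip(train_data.columns, baseline_values)}

# This local prediction should roughly match the baseline_score returned
# by the explain calls below.
print(model.predict(baseline_input))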
In [0]:
# Write the json to a local file
with open('explanation_metadata.json', 'w') as output_file:
json.dump(explanation_metadata, output_file)
In [0]:
!gsutil cp explanation_metadata.json $export_path
In [0]:
MODEL = 'taxifare_explain'
os.environ["MODEL"] = MODEL
In [0]:
%%bash
exists=$(gcloud ai-platform models list | grep ${MODEL})

if [ -n "$exists" ]; then
    echo -e "Model ${MODEL} already exists."
else
    echo "Creating a new model."
    gcloud ai-platform models create ${MODEL}
fi
In [0]:
# Each time you create a version the name should be unique
import datetime
now = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
VERSION_IG = 'v_IG_{}'.format(now)
VERSION_SHAP = 'v_SHAP_{}'.format(now)
In [0]:
# Create the version with gcloud
!gcloud beta ai-platform versions create $VERSION_IG \
--model $MODEL \
--origin $export_path \
--runtime-version 1.15 \
--framework TENSORFLOW \
--python-version 3.7 \
--machine-type n1-standard-4 \
--explanation-method 'integrated-gradients' \
--num-integral-steps 25
!gcloud beta ai-platform versions create $VERSION_SHAP \
--model $MODEL \
--origin $export_path \
--runtime-version 1.15 \
--framework TENSORFLOW \
--python-version 3.7 \
--machine-type n1-standard-4 \
--explanation-method 'sampled-shapley' \
--num-paths 50
In [0]:
# Make sure the model deployed correctly. State should be `READY` in the following log
!gcloud ai-platform versions describe $VERSION_IG --model $MODEL
!echo "---"
!gcloud ai-platform versions describe $VERSION_SHAP --model $MODEL
Now that your model is deployed, you can use the AI Platform Prediction API to get feature attributions. We'll pass it a single test example here and see which features were most important in the model's prediction. Here we'll use gcloud to call our deployed model.
To use gcloud to make our AI Explanations request, we need to write the JSON to a file. Our example here is for a ride from the Google office in downtown Manhattan to LaGuardia Airport at 5pm.
Note that the day of the week has to be passed as an integer rather than as a string like "Tue", since we encoded the days of the week outside of our model and serving input function. The short cell below shows how to derive that integer from the DAYS list used in prepare_data.
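As a quick illustration (not part of the original flow), here is how the integer encoding can be derived from the DAYS list defined earlier:
In [0]:
# Reuse the DAYS list from prepare_data to turn a day name into the integer
# encoding the deployed model expects (Sun = 0, ..., Sat = 6).
day_to_int = dict(zip(DAYS, range(7)))
print(day_to_int['Wed'])  # -> 3, the dayofweek value used in the request below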
In [0]:
# Format data for prediction to our model
!rm -f taxi-data.txt
!touch taxi-data.txt

prediction_json = {"dayofweek": "3", "hourofday": "17", "pickuplon": "-74.0026",
                   "pickuplat": "40.7410", "dropofflat": "40.7790", "dropofflon": "-73.8772"}

with open('taxi-data.txt', 'a') as outfile:
    json.dump(prediction_json, outfile)
In [0]:
# Preview the contents of the data file
!cat taxi-data.txt
In [0]:
resp_obj = !gcloud beta ai-platform explain --model $MODEL --version $VERSION_IG --json-instances='taxi-data.txt'
response_IG = json.loads(resp_obj.s)
resp_obj
In [0]:
resp_obj = !gcloud beta ai-platform explain --model $MODEL --version $VERSION_SHAP --json-instances='taxi-data.txt'
response_SHAP = json.loads(resp_obj.s)
resp_obj
In [0]:
explanations_IG = response_IG['explanations'][0]['attributions_by_label'][0]
explanations_SHAP = response_SHAP['explanations'][0]['attributions_by_label'][0]
predicted = round(explanations_SHAP['example_score'], 2)
baseline = round(explanations_SHAP['baseline_score'], 2 )
print('Baseline taxi fare: ' + str(baseline) + ' dollars')
print('Predicted taxi fare: ' + str(predicted) + ' dollars')
Next let's look at the feature attributions for this particular example. Positive attribution values mean a particular feature pushed the model's prediction up by that amount, and negative values mean it pushed the prediction down. For this ride, the location features turn out to be the most important. After the table, we'll also sanity-check that the attributions approximately account for the gap between the baseline fare and the predicted fare.
In [0]:
from tabulate import tabulate

feature_names = valid_data.columns.tolist()
attributions_IG = explanations_IG['attributions']
attributions_SHAP = explanations_SHAP['attributions']

rows = []
for feat in feature_names:
    rows.append([feat, prediction_json[feat], attributions_IG[feat], attributions_SHAP[feat]])

print(tabulate(rows, headers=['Feature name', 'Feature value', 'Attribution value (IG)', 'Attribution value (SHAP)']))
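Both integrated gradients and Sampled Shapley satisfy an approximate completeness property: the baseline score plus the sum of the per-feature attributions should come close to the predicted fare. The cell below is a small sanity-check sketch (not part of the original tutorial flow); it assumes, as in the table above, that each attribution value is a plain scalar.
In [0]:
# Sanity check: baseline_score + sum of attributions should approximately
# equal example_score for both attribution methods.
approx_shap = explanations_SHAP['baseline_score'] + sum(attributions_SHAP.values())
approx_ig = explanations_IG['baseline_score'] + sum(attributions_IG.values())

print('SHAP: baseline + attributions = {:.2f}, example_score = {:.2f}'.format(
    approx_shap, explanations_SHAP['example_score']))
print('IG:   baseline + attributions = {:.2f}, example_score = {:.2f}'.format(
    approx_ig, explanations_IG['example_score']))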