In [ ]:
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Slicing AutoML Tables Evaluation Results with BigQuery


Overview

This colab assumes that you've created a dataset with AutoML Tables and used that dataset to train a classification model. Once the model has finished training, you also need to export the evaluation results table to BigQuery; instructions for this, along with more detailed setup steps, appear below.

This colab will walk you through the process of using BigQuery to visualize data slices, showing you one simple way to evaluate your model for bias.

Dataset

You'll need to use the AutoML Tables frontend or service to create a model and export its evaluation results to BigQuery. You should find a link on the Evaluate tab to view your evaluation results in BigQuery once you've finished training your model. Then navigate to BigQuery in your GCP console and you'll see your new results table in the list of tables to which your project has access.

For demo purposes, we'll be using the Default of Credit Card Clients dataset for analysis.

Note: Although the data we use in this demo is public, you'll need to enter your own Google Cloud project ID in the parameter below to authenticate to it.

Objective

This dataset was collected to help compare different methods of predicting credit card default. Using this colab to analyze your own dataset may require a little adaptation. The code below samples the data if you ask it to; alternatively, set sample_count to be as large as or larger than your dataset to use the whole thing for analysis.

Costs

This tutorial uses billable components of Google Cloud Platform (GCP):

  • Cloud AI Platform
  • BigQuery

Learn about Cloud AI Platform pricing and BigQuery pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.

Set up your local development environment

If you are using Colab or AI Platform Notebooks, your environment already meets all the requirements to run this notebook. If you are using AI Platform Notebooks, make sure the machine configuration type is 1 vCPU, 3.75 GB RAM or above and that the environment is Python or TensorFlow Enterprise 1.15. You can skip this step.

Otherwise, make sure your environment meets this notebook's requirements. You need the following:

  • The Google Cloud SDK
  • Git
  • Python 3
  • virtualenv
  • Jupyter notebook running in a virtual environment with Python 3

The Google Cloud guide to Setting up a Python development environment and the Jupyter installation guide provide detailed instructions for meeting these requirements. The following steps provide a condensed set of instructions:

  1. Install and initialize the Cloud SDK.

  2. Install Python 3.

  3. Install virtualenv and create a virtual environment that uses Python 3.

  4. Activate that environment and run pip install jupyter in a shell to install Jupyter.

  5. Run jupyter notebook in a shell to launch Jupyter.

  6. Open this notebook in the Jupyter Notebook Dashboard.
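
For example, steps 3 through 5 above might look like the following shell session on Linux or macOS (the environment name automl-tables-env is just an illustration):

  pip3 install --user virtualenv
  virtualenv -p python3 automl-tables-env   # create a Python 3 virtual environment
  source automl-tables-env/bin/activate     # activate the environment
  pip install jupyter                       # install Jupyter inside it
  jupyter notebook                          # launch Jupyter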

Set up your GCP project

The following steps are required, regardless of your notebook environment.

  1. Select or create a GCP project. When you first create an account, you get a $300 free credit towards your compute/storage costs.

  2. Make sure that billing is enabled for your project.

  3. Enable the AI Platform APIs and Compute Engine APIs.
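
For step 3 above, if you prefer the command line, a single gcloud command (assuming the Cloud SDK is installed and pointed at your project) can enable both APIs:

  gcloud services enable ml.googleapis.com compute.googleapis.com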

PIP Install Packages and dependencies

Install additional dependencies not already installed in the notebook environment.


In [ ]:
! pip install --upgrade --quiet --user scikit-learn
! pip install --upgrade --quiet --user witwidget
! pip install --upgrade --quiet --user tensorflow==1.15
! pip install --upgrade --quiet --user tensorflow_model_analysis
! pip install --upgrade --quiet --user pandas-gbq

Note: Try installing with sudo if the above commands throw any permission errors. You can ignore other errors and continue to the next steps.

Skip the below cell if you are using Colab.

If you are using AI Platform Notebooks > JupyterLab, install the following packages.


In [ ]:
! sudo jupyter labextension install wit-widget
! sudo jupyter labextension install @jupyter-widgets/jupyterlab-manager
! sudo jupyter labextension install wit-widget@1.3
! sudo jupyter labextension install jupyter-matplotlib

Skip the below cell if you are using Colab.

If you are using AI Platform Notebooks > Classic Notebook, or a local environment, install and enable the following dependencies to link WitWidget and TFMA with the notebook extensions.


In [ ]:
! jupyter nbextension enable --py widgetsnbextension
! jupyter nbextension install --py --symlink tensorflow_model_analysis
! jupyter nbextension enable --py tensorflow_model_analysis

Note: Try installing with --user if the above commands throw any permission errors.

Restart the kernel so that the newly installed libraries can be imported in Jupyter Notebooks.


In [ ]:
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

Refresh the browser so that visualizations render correctly when running in Jupyter Notebooks.

Set up your GCP Project Id

Enter your project ID in the cell below. Then run the cell to make sure the Cloud SDK uses the right project for all the commands in this notebook.


In [ ]:
PROJECT_ID = "[your-project-id]" #@param {type:"string"}
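
If you also want the gcloud CLI to use this project for the shell commands run from this notebook, you can set it explicitly; this assumes the Cloud SDK is available in your environment.


In [ ]:
! gcloud config set project $PROJECT_ID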

Authenticate your GCP account

If you are using AI Platform Notebooks, your environment is already authenticated. Skip this step.

Otherwise, follow these steps:

  1. In the GCP Console, go to the Create service account key page.

  2. From the Service account drop-down list, select New service account.

  3. In the Service account name field, enter a name.

  4. From the Role drop-down list, select BigQuery > BigQuery User.

  5. Click Create. A JSON file that contains your key downloads to your local environment.

Note: Jupyter runs lines prefixed with ! as shell commands, and it interpolates Python variables prefixed with $ into these commands.


In [ ]:
import sys

# Upload the downloaded JSON file that contains your key.
if 'google.colab' in sys.modules:    
  from google.colab import files
  keyfile_upload = files.upload()
  keyfile = list(keyfile_upload.keys())[0]
  %env GOOGLE_APPLICATION_CREDENTIALS $keyfile
  ! gcloud auth activate-service-account --key-file $keyfile

If you are running the notebook locally, enter the path to your service account key as the GOOGLE_APPLICATION_CREDENTIALS variable in the cell below and run the cell.


In [ ]:
# If you are running this notebook locally, replace the string below with the
# path to your service account key and run this cell to authenticate your GCP
# account.

%env GOOGLE_APPLICATION_CREDENTIALS /path/to/service/account
! gcloud auth activate-service-account --key-file '/path/to/service/account'

Import libraries and define constants

Import relevant packages.


In [ ]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

In [ ]:
import numpy as np
import os
import pandas as pd
import sys
sys.path.append('./python')
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score
from sklearn.metrics import precision_recall_curve
from collections import OrderedDict

In [ ]:
# For facets.
from IPython.core.display import display, HTML
import base64
import witwidget.notebook.visualization as visualization

In [ ]:
# Tensorflow model analysis
import apache_beam as beam
import tempfile
from google.protobuf import text_format
from tensorflow_model_analysis import post_export_metrics
from tensorflow_model_analysis import types
from tensorflow_model_analysis.api import model_eval_lib
from tensorflow_model_analysis.evaluators import aggregate
from tensorflow_model_analysis.extractors import slice_key_extractor
from tensorflow_model_analysis.model_agnostic_eval import model_agnostic_evaluate_graph
from tensorflow_model_analysis.model_agnostic_eval import model_agnostic_extractor
from tensorflow_model_analysis.model_agnostic_eval import model_agnostic_predict
from tensorflow_model_analysis.proto import metrics_for_slice_pb2
from tensorflow_model_analysis import slicer
from tensorflow_model_analysis.view.widget_view import render_slicing_metrics

In [ ]:
# Tensorflow versions
import tensorflow as tf
print('Tensorflow version: {}'.format(tf.__version__))
import tensorflow_model_analysis as tfma
print('TFMA version: {}'.format(tfma.version.VERSION_STRING))

Populate the following cell with the necessary constants and run it to initialize them.


In [ ]:
#@title Constants { vertical-output: true }

TABLE_NAME = 'bigquery-public-data.ml_datasets.credit_card_default' #@param {type:"string"}

Query Dataset


In [ ]:
sample_count = 3000 #@param {type:"integer"}

row_count = pd.read_gbq('''
  SELECT
    COUNT(*) as total
  FROM `%s`''' % (TABLE_NAME), project_id=PROJECT_ID).total[0]
nested_df = pd.read_gbq('''
  SELECT
    *
  FROM
    `%s`
  WHERE RAND() < %d/%d
  ''' % (TABLE_NAME, sample_count, row_count),
         project_id=PROJECT_ID)

print('Full dataset has %d rows' % row_count)
nested_df.describe()

Unnest the columns


In [ ]:
from collections import OrderedDict
import json

def unnest_df(nested_df):
    rows_list = []
    for index, row in nested_df.iterrows():
        for i in row["predicted_default_payment_next_month"]:
            # Preserve column order while converting the row to a dict.
            row_dict = json.loads(row.to_json(), object_pairs_hook=OrderedDict)
            row_dict["predicted_default_payment_next_month_tables_score"] = i["tables"]["score"]
            row_dict["predicted_default_payment_next_month_tables_value"] = i["tables"]["value"]
            rows_list.append(row_dict) 

    unnested_df = pd.DataFrame(rows_list, columns=list(rows_list[0].keys()))
    unnested_df = unnested_df.drop(
                  ["predicted_default_payment_next_month"], axis=1)
    return unnested_df

df = unnest_df(nested_df)
print("Unnested completed")

Data Preprocessing

Many of the tools we use to analyze models and data expect to find their inputs in the tensorflow.Example format. Here, we'll preprocess our data into tf.Examples and also extract the predicted class from our binary classifier.
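
For reference, a tf.train.Example is just a protocol buffer that maps feature names to typed value lists. The tiny, hand-built example below is purely illustrative (the feature names are made up) and is not used by the rest of the pipeline.


In [ ]:
# Illustrative only: build a minimal tf.train.Example with one numeric and one
# string feature to show the format the analysis tools expect.
toy_example = tf.train.Example()
toy_example.features.feature['some_int_feature'].int64_list.value.append(20000)
toy_example.features.feature['some_text_feature'].bytes_list.value.append(b'example')
print(toy_example)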


In [ ]:
#@title Columns { vertical-output: true }

unique_id_field = 'id' #@param {type: 'string'}
prediction_field_score = 'predicted_default_payment_next_month_tables_score'  #@param
prediction_field_value = 'predicted_default_payment_next_month_tables_value'  #@param

In [ ]:
def extract_top_class(prediction_tuples):
  # Predictions from Tables arrive as (class value, score) pairs; keep the highest-scoring class.
  best_score = 0
  best_class = u''
  for val, sco in prediction_tuples:
    if sco > best_score:
      best_score = sco
      best_class = val
  return (best_class, best_score)

In [ ]:
def df_to_examples(df, columns=None):
  examples = []
  if columns is None:
    columns = df.columns.values.tolist()
  for example_id in df[unique_id_field].unique():
    example = tf.train.Example()
    prediction_tuples = zip(
        df.loc[df[unique_id_field] == example_id][prediction_field_value],
        df.loc[df[unique_id_field] == example_id][prediction_field_score])
    row = df.loc[df[unique_id_field] == example_id].iloc[0]
    for col in columns:
      if col == prediction_field_score or col == prediction_field_value:
        # Deal with prediction fields separately.
        continue
      elif df[col].dtype == np.dtype(np.int64):
        example.features.feature[col].int64_list.value.append(int(row[col]))
      elif df[col].dtype == np.dtype(np.float64):
        example.features.feature[col].float_list.value.append(row[col])
      elif row[col] is None:
        continue
      elif row[col] == row[col]:
        # A value that equals itself is not NaN; treat it as a string feature.
        example.features.feature[col].bytes_list.value.append(
            row[col].encode('utf-8'))
    cla, sco = extract_top_class(prediction_tuples)
    example.features.feature['predicted_class'].int64_list.value.append(int(cla))
    example.features.feature['predicted_class_score']\
                    .float_list.value.append(sco)
    examples.append(example)
  return examples

In [ ]:
# Fix up some types so analysis is consistent. 
# This code is specific to the dataset.
df = df.astype({"pay_5":float, "pay_6":float})

# Converts a dataframe column into a column of 0's and 1's based on the provided test.
def make_label_column_numeric(df, label_column, test):
  df[label_column] = np.where(test(df[label_column]), 1, 0)
  
# Convert label types to numeric. This code is specific to the dataset.
make_label_column_numeric(df, 
                          'predicted_default_payment_next_month_tables_value', 
                          lambda val: val == '1')
make_label_column_numeric(df, 'default_payment_next_month', 
                               lambda val:  val == '1')

examples = df_to_examples(df)
print("Preprocessing complete!")

What-If Tool

First, we'll explore the data and predictions using the What-If Tool, a powerful visual interface for exploring data, models, and predictions. Because we're reading our results from BigQuery, we aren't able to use the features of the What-If Tool that query the model directly, but we can still learn a lot about this dataset from the exploration it enables.

Imagine that you're curious to discover whether there's a discrepancy in the predictive power of your model depending on the marital status of the person whose credit history is being analyzed. You can use the What-If Tool to see at a glance the relative sizes of the data samples for each class. In this dataset, marital status is encoded as 1 = married; 2 = single; 3 = divorced; 0 = others. The What-If Tool shows that there are very few samples for classes other than married or single, which might indicate that performance on those classes could be compromised. If this lack of representation concerns you, you could consider collecting more data for underrepresented classes, downsampling overrepresented classes, or upweighting underrepresented examples during training, depending on your use case and data availability.
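
Before opening the tool, you can also get a quick sense of this class balance directly from the dataframe; the sketch below assumes the marital status column is named marital_status, as it is in this table.


In [ ]:
# Count how many sampled rows fall into each marital status class
# (1 = married, 2 = single, 3 = divorced, 0 = others).
df['marital_status'].value_counts()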


In [ ]:
#@title WitWidget Configuration { vertical-output: false }

WitWidget = visualization.WitWidget
WitConfigBuilder = visualization.WitConfigBuilder

num_datapoints = 2965  #@param {type: "number"}
tool_height_in_px = 700  #@param {type: "number"}

# Setup the tool with the test examples and the trained classifier.
config_builder = WitConfigBuilder(examples[:num_datapoints])
# Need to call this so we have inference_address and model_name initialized.
config_builder = config_builder.set_estimator_and_feature_spec('', '')
config_builder = config_builder.set_compare_estimator_and_feature_spec('', '')

In [ ]:
WitWidget(config_builder, height=tool_height_in_px)

Tensorflow Model Analysis

Next, let's examine some sliced metrics. This section of the tutorial uses TFMA's model-agnostic analysis capabilities.

TFMA generates sliced metrics graphs and confusion matrices. We can use these to dig deeper into how well this model performs for different marital status classes. The model was built to optimize for the AUC ROC metric, and it does fairly well for all of the classes, though there is a small performance gap for the "divorced" category. But when we look at the AUC-PR metric slices, we can see that the "divorced" and "other" classes are served much more poorly by the model than the more common classes. AUC-PR measures how well the model trades off precision against recall in its predictions. If we're concerned about this gap, we could consider retraining with AUC-PR as the optimization metric and see whether that model makes more equitable predictions.
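
As a rough cross-check of the sliced metrics TFMA produces below, you can also compute per-slice ROC AUC and average precision (a close proxy for AUC-PR) directly with scikit-learn. This sketch assumes the column names used earlier in this notebook and that the unnested dataframe contains one row per example per predicted class.


In [ ]:
# Rough per-slice metrics computed with scikit-learn as a sanity check on the
# TFMA results below.
from sklearn.metrics import roc_auc_score, average_precision_score

# Keep only the rows carrying the score for the positive ("default") class.
pos = df[df['predicted_default_payment_next_month_tables_value'] == 1]

for status, group in pos.groupby('marital_status'):
  y_true = group['default_payment_next_month']
  y_score = group['predicted_default_payment_next_month_tables_score']
  if y_true.nunique() < 2:
    continue  # AUC is undefined when a slice contains only one class.
  print('marital_status=%s  n=%d  ROC AUC=%.3f  avg precision=%.3f' % (
      status, len(group), roc_auc_score(y_true, y_score),
      average_precision_score(y_true, y_score)))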


In [ ]:
# To set up model agnostic extraction, need to specify features and labels of
# interest in a feature map.
feature_map = OrderedDict()

for i, column in enumerate(df.columns):
  col_type = df.dtypes[i]
  if column == prediction_field_score or column == prediction_field_value:
    continue
  elif col_type == np.dtype(np.float64):
    feature_map[column] = tf.io.FixedLenFeature([], tf.float32)
  elif col_type == np.dtype(np.object):
    feature_map[column] = tf.io.FixedLenFeature([], tf.string)
  elif col_type == np.dtype(np.int64):
    feature_map[column] = tf.io.FixedLenFeature([], tf.int64)
  elif col_type == np.dtype(np.bool):
    feature_map[column] = tf.io.FixedLenFeature([], tf.bool)
  elif col_type == np.dtype(np.datetime64):
    # Datetime columns (none in this dataset) are parsed as strings here.
    feature_map[column] = tf.io.FixedLenFeature([], tf.string)

feature_map['predicted_class'] = tf.io.FixedLenFeature([], tf.int64)
feature_map['predicted_class_score'] = tf.io.FixedLenFeature([], tf.float32)

serialized_examples = [e.SerializeToString() for e in examples]

In [ ]:
BASE_DIR = tempfile.gettempdir()
OUTPUT_DIR = os.path.join(BASE_DIR, 'output')

In [ ]:
#@title TFMA Inputs { vertical-output: false }

slice_column = 'marital_status' #@param {type: 'string'}
predicted_labels = 'predicted_class' #@param {type: 'string'}
actual_labels = 'default_payment_next_month' #@param {type: 'string'}
predicted_class_score = 'predicted_class_score' #@param {type: 'string'}

In [ ]:
with beam.Pipeline() as pipeline:
    model_agnostic_config = model_agnostic_predict.ModelAgnosticConfig(
              label_keys=[actual_labels],
              prediction_keys=[predicted_labels],
              feature_spec=feature_map)

    extractors = [
            model_agnostic_extractor.ModelAgnosticExtractor(
                model_agnostic_config=model_agnostic_config,
                desired_batch_size=3),
              slice_key_extractor.SliceKeyExtractor([
                  slicer.SingleSliceSpec(columns=[slice_column])
              ])
        ]

    auc_roc_callback = post_export_metrics.auc(
        labels_key=actual_labels,
        target_prediction_keys=[predicted_labels])

    auc_pr_callback = post_export_metrics.auc(
        curve='PR',
        labels_key=actual_labels,
        target_prediction_keys=[predicted_labels])

    confusion_matrix_callback = post_export_metrics\
    .confusion_matrix_at_thresholds(
        labels_key=actual_labels,
        target_prediction_keys=[predicted_labels],
        example_weight_key=predicted_class_score,
        thresholds=[0.0, 0.5, 0.8, 1.0])

    # Create our model agnostic aggregator.
    eval_shared_model = types.EvalSharedModel(
        construct_fn=model_agnostic_evaluate_graph.make_construct_fn(
            add_metrics_callbacks=[confusion_matrix_callback,
                                    auc_roc_callback,
                                    auc_pr_callback,
                                    post_export_metrics.example_count()],
            config=model_agnostic_config))

    # Run Model Agnostic Eval.
    _ = (
        pipeline
        | beam.Create(serialized_examples)
        | 'ExtractEvaluateAndWriteResults' >>
          model_eval_lib.ExtractEvaluateAndWriteResults(
              eval_shared_model=eval_shared_model,
              output_path=OUTPUT_DIR,
              extractors=extractors))

eval_result = tfma.load_eval_result(output_path=OUTPUT_DIR)

In [ ]:
render_slicing_metrics(eval_result, slicing_column=slice_column)