What-If Tool on COMPAS

Copyright 2019 Google LLC. SPDX-License-Identifier: Apache-2.0

This notebook shows use of the What-If Tool on the COMPAS dataset.

For ML fairness background on COMPAS, see ProPublica's "Machine Bias" analysis of the algorithm.

The dataset is from the COMPAS Kaggle page.

This notebook trains a linear classifier on the COMPAS dataset to mimic the behavior of the COMPAS recidivism classifier. We can then analyze our COMPAS proxy model for fairness using the What-If Tool.

The specific binary classification task for this model is to determine if a person belongs in the "Low" risk class according to COMPAS (negative class), or the "Medium" or "High" risk class (positive class).


In [0]:
#@title Install the What-If Tool widget if running in colab {display-mode: "form"}

try:
  import google.colab
  !pip install --upgrade witwidget
except:
  pass

In [0]:
#@title Define helper functions {display-mode: "form"}

import pandas as pd
import numpy as np
import tensorflow as tf
import functools

# Creates a tf feature spec from the dataframe and columns specified.
def create_feature_spec(df, columns=None):
    feature_spec = {}
    if columns is None:
        columns = df.columns.values.tolist()
    for f in columns:
        if df[f].dtype is np.dtype(np.int64):
            feature_spec[f] = tf.io.FixedLenFeature(shape=(), dtype=tf.int64)
        elif df[f].dtype is np.dtype(np.float64):
            feature_spec[f] = tf.io.FixedLenFeature(shape=(), dtype=tf.float32)
        else:
            feature_spec[f] = tf.io.FixedLenFeature(shape=(), dtype=tf.string)
    return feature_spec

# Creates simple numeric and categorical feature columns from a feature spec and a
# list of columns from that spec to use.
#
# NOTE: Models might perform better with some feature engineering such as bucketed
# numeric columns and hash-bucket/embedding columns for categorical features.
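#
# NOTE: Relies on the global dataframe `df` (loaded in a later cell) to build
# the vocabulary lists for categorical feature columns.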
def create_feature_columns(columns, feature_spec):
    ret = []
    for col in columns:
        if feature_spec[col].dtype is tf.int64 or feature_spec[col].dtype is tf.float32:
            ret.append(tf.feature_column.numeric_column(col))
        else:
            ret.append(tf.feature_column.indicator_column(
                tf.feature_column.categorical_column_with_vocabulary_list(col, list(df[col].unique()))))
    return ret

# An input function for providing input to a model from tf.Examples
def tfexamples_input_fn(examples, feature_spec, label, mode=tf.estimator.ModeKeys.EVAL,
                       num_epochs=None, 
                       batch_size=64):
    def ex_generator():
        for i in range(len(examples)):
            yield examples[i].SerializeToString()
    dataset = tf.data.Dataset.from_generator(
      ex_generator, tf.dtypes.string, tf.TensorShape([]))
    if mode == tf.estimator.ModeKeys.TRAIN:
        dataset = dataset.shuffle(buffer_size=2 * batch_size + 1)
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(lambda tf_example: parse_tf_example(tf_example, label, feature_spec))
    dataset = dataset.repeat(num_epochs)
    return dataset

# Parses Tf.Example protos into features for the input function.
def parse_tf_example(example_proto, label, feature_spec):
    parsed_features = tf.io.parse_example(serialized=example_proto, features=feature_spec)
    target = parsed_features.pop(label)
    return parsed_features, target

# Converts a dataframe into a list of tf.Example protos.
def df_to_examples(df, columns=None):
    examples = []
    if columns is None:
        columns = df.columns.values.tolist()
    for index, row in df.iterrows():
        example = tf.train.Example()
        for col in columns:
            if df[col].dtype is np.dtype(np.int64):
                example.features.feature[col].int64_list.value.append(int(row[col]))
            elif df[col].dtype is np.dtype(np.float64):
                example.features.feature[col].float_list.value.append(row[col])
            elif row[col] == row[col]:  # Skips NaN values (NaN != NaN).
                example.features.feature[col].bytes_list.value.append(row[col].encode('utf-8'))
        examples.append(example)
    return examples

# Converts a dataframe column into a column of 0's and 1's based on the provided test.
# Used to force label columns to be numeric for binary classification using a TF estimator.
def make_label_column_numeric(df, label_column, test):
  df[label_column] = np.where(test(df[label_column]), 1, 0)

In [0]:
#@title Read training dataset from CSV {display-mode: "form"}

import pandas as pd

df = pd.read_csv('https://storage.googleapis.com/what-if-tool-resources/computefest2019/cox-violent-parsed_filt.csv')
df

In [0]:
#@title Specify input columns and column to predict {display-mode: "form"}
import numpy as np

# Filter out entries with no indication of recidivism or no COMPAS score
df = df[df['is_recid'] != -1]
df = df[df['decile_score'] != -1]

# Rename recidivism column
df['recidivism_within_2_years'] = df['is_recid']

# Make the COMPAS label column numeric (0 and 1), for use in our model
df['COMPASS_determination'] = np.where(df['score_text'] == 'Low', 0, 1)

# Set column to predict
label_column = 'COMPASS_determination'

# Get list of all columns from the dataset we will use for model input or output.
input_features = ['sex', 'age', 'race', 'priors_count', 'juv_fel_count', 'juv_misd_count', 'juv_other_count']
features_and_labels = input_features + [label_column]

features_for_file = input_features + ['recidivism_within_2_years', 'COMPASS_determination']
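
Before converting the dataset to tf.Example protos, it can help to sanity-check the label balance and per-race base rates directly in pandas. The optional cell below is a minimal sketch using only columns defined above; it is not part of the original training pipeline.

In [0]:
#@title (Optional) Check label balance and per-race base rates

# Overall balance of the binary COMPAS label (0 = "Low", 1 = "Medium"/"High").
print(df['COMPASS_determination'].value_counts(normalize=True))

# Mean COMPAS determination and observed recidivism for each race slice.
print(df.groupby('race')[['COMPASS_determination', 'recidivism_within_2_years']].mean())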

In [0]:
#@title Convert dataset to tf.Example protos {display-mode: "form"}

examples = df_to_examples(df, features_for_file)

In [0]:
#@title Create and train the classifier {display-mode: "form"}

num_steps = 2000  #@param {type: "number"}

# Create a feature spec for the classifier
feature_spec = create_feature_spec(df, features_and_labels)

# Define and train the classifier
train_inpf = functools.partial(tfexamples_input_fn, examples, feature_spec, label_column)
classifier = tf.estimator.LinearClassifier(
    feature_columns=create_feature_columns(input_features, feature_spec))
classifier.train(train_inpf, steps=num_steps)
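
As a quick programmatic check alongside the What-If Tool, the optional cell below is a minimal sketch that evaluates the classifier with the estimator API, reusing the input function defined above in its default EVAL mode. Note that it evaluates on the training examples, since this notebook does not hold out a test split.

In [0]:
#@title (Optional) Evaluate the trained classifier

# One pass over the examples (num_epochs=1), with no shuffling (default EVAL mode).
eval_inpf = functools.partial(tfexamples_input_fn, examples, feature_spec,
                              label_column, num_epochs=1)
print(classifier.evaluate(eval_inpf))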

What-If Tool analysis

We can see the same unfairness that ProPublica found in their analysis by:

  1. Going to the "Performance + Fairness" tab
  2. Setting "Ground Truth Feature" to "recidivism_within_2_years"
  3. Selecting "race" in the "Slice by" dropdown menu
  4. Looking at the confusion matrices of the "African-American" and "Caucasian" slices.
    • They have very similar accuracy, (TP+TN)/Total.
    • But the FP rate is MUCH higher for African-Americans and the FN rate is MUCH higher for Caucasians.
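
These rates can also be cross-checked directly from the dataframe. The optional cell below is a rough pandas sketch that computes accuracy, FP rate, and FN rate per race slice for the original COMPAS determinations (not for the proxy model's predictions).

In [0]:
#@title (Optional) Cross-check per-race error rates in pandas

for race in ['African-American', 'Caucasian']:  # slice values as they appear in the dataset
    slice_df = df[df['race'] == race]
    y_true = slice_df['recidivism_within_2_years']
    y_pred = slice_df['COMPASS_determination']
    tp = ((y_pred == 1) & (y_true == 1)).sum()
    tn = ((y_pred == 0) & (y_true == 0)).sum()
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    print('%s: accuracy=%.3f, FP rate=%.3f, FN rate=%.3f' % (
        race, (tp + tn) / len(slice_df), fp / (fp + tn), fn / (fn + tp)))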

In [0]:
#@title Invoke What-If Tool for test data and the trained models {display-mode: "form"}


num_datapoints = 10000  #@param {type: "number"}
tool_height_in_px = 1000  #@param {type: "number"}

from witwidget.notebook.visualization import WitConfigBuilder
from witwidget.notebook.visualization import WitWidget

# Setup the tool with the test examples and the trained classifier
config_builder = WitConfigBuilder(examples[0:num_datapoints]).set_estimator_and_feature_spec(
    classifier, feature_spec)
WitWidget(config_builder, height=tool_height_in_px)

Exploration ideas

  • Organize datapoints by "inference score" (you can do this through binning or a scatter plot) to see points ordered by how likely they were determined to re-offend.
    • Select a point near the boundary line (where red points turn to blue points)
    • Find the nearest counterfactual to see a similar person with a different decision. What is different?
    • Look at the partial dependence plots for the selected person. What changes in what features would change the decision on this person?
  • In "Performance and Fairness" tab, slice the dataset by different features (such as race or sex)

    • Look at the confusion matrices for each slice - How does performance compare in those slices? What from the training data may have caused the difference in performance between the slices? What root causes could exist?
    • Use the threshold optimization buttons to optimize positive classification thresholds for each slice based on any of the possible fairness constraints - How different do the thresholds have to be to achieve that constraint? How varied are the thresholds depending on the fairness constraint chosen?
  • In the "Performance + Fairness" tab, change the cost ratio so that you can optimize the threshold based off of a non-symmetric cost of false positives vs false negatives. Then click the "optimize threshold" button and see the effect on the confusion matrix.

    • Slice the dataset by a feature, such as sex or race. How has the new cost ratio affected the disparity in performance between slices? Click the different threshold optimization buttons to see how the changed cost ratio affects the disparity given different fairness constraints.
  • Try adding/removing features from the set of input features that the model uses during training. The model trained by this notebook only uses 7 of the columns from the dataset, as defined in the "Specify input columns and column to predict" cell. How does your new set of input features affect the model performance (overall and across slices)?

  • If you set the ground truth feature to "COMPASS_determination" in the "Performance + Fairness" tab, you will see the confusion matrix and ROC curve showing how well our model serves as a proxy for the COMPAS model itself (as opposed to how well it predicts recidivism).

    • How well is our model doing at its task? What types of errors does it have?
    • Try improving the performance of the model. Options include adding more input features, changing the model architecture, and training for more steps.
    • After you've improved our proxy COMPAS model, what (if any) change in unfairness do you see when evaluating against "recidivism_within_2_years"?
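
To make the proxy-quality and recidivism questions above concrete, the optional cell below is a minimal sketch that scores the examples with the trained estimator and computes ROC AUC against both labels using scikit-learn (assumed to be available, as it is in Colab). The scores come from the training examples, so the proxy-quality number will be optimistic.

In [0]:
#@title (Optional) Score the examples and compute AUC against both labels

from sklearn.metrics import roc_auc_score

# One un-shuffled pass over the examples, in the same order as `examples`.
predict_inpf = functools.partial(tfexamples_input_fn, examples, feature_spec,
                                 label_column, num_epochs=1)
scores = [p['probabilities'][1] for p in classifier.predict(predict_inpf)]

def int_feature(name):
    # Pulls an int64 feature back out of the tf.Example protos created above.
    return [ex.features.feature[name].int64_list.value[0] for ex in examples]

print('AUC vs COMPASS_determination (proxy quality): %.3f' %
      roc_auc_score(int_feature('COMPASS_determination'), scores))
print('AUC vs recidivism_within_2_years: %.3f' %
      roc_auc_score(int_feature('recidivism_within_2_years'), scores))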