Learning Objectives:
Apply remediation techniques to mitigate the unintended gender-related bias identified in a toxicity classifier
Use Fairness Indicators and the What-If Tool to evaluate whether the remediation efforts were successful
Prerequisites
This exercise builds on Fairness Exercise 1. It is strongly recommended that you complete that exercise before working through this one.
In Fairness Exercise 1, you trained a toxicity classifier on the Civil Comments dataset and used Fairness Indicators to identify some unintended bias issues related to gender. In this exercise, you'll apply remediation techniques and retrain the model to mitigate this bias. You'll then use Fairness Indicators and the What-If Tool to evaluate the results and confirm that the remediation efforts were successful.
In [0]:
!pip install fairness-indicators \
"absl-py==0.8.0" \
"pyarrow==0.15.1" \
"apache-beam==2.17.0" \
"avro-python3==1.9.1" \
"tfx-bsl==0.21.4" \
"tensorflow-data-validation==0.21.5"
Next, import all the dependencies we'll use in this exercise, which include Fairness Indicators, TensorFlow Model Analysis (tfma), and the What-If tool (WIT):
In [0]:
%tensorflow_version 2.x
import os
import tempfile
import apache_beam as beam
import numpy as np
import pandas as pd
from datetime import datetime
import tensorflow_hub as hub
import tensorflow as tf
import tensorflow_model_analysis as tfma
from tensorflow_model_analysis.addons.fairness.post_export_metrics import fairness_indicators
from tensorflow_model_analysis.addons.fairness.view import widget_view
from witwidget.notebook.visualization import WitConfigBuilder
from witwidget.notebook.visualization import WitWidget
Run the following code to download and import the training and validation datasets. By default, it loads the preprocessed data (see Fairness Exercise 1 for more details). If you prefer, you can enable the download_original_data
checkbox at right to download the original dataset and preprocess it as described in the previous section (this may take 5-10 minutes).
In [0]:
download_original_data = False #@param {type:"boolean"}

if download_original_data:
  # Preprocessing helper from the Fairness Indicators examples package.
  from fairness_indicators.examples import util

  train_tf_file = tf.keras.utils.get_file(
      'train_tf.tfrecord',
      'https://storage.googleapis.com/civil_comments_dataset/train_tf.tfrecord')
  validate_tf_file = tf.keras.utils.get_file(
      'validate_tf.tfrecord',
      'https://storage.googleapis.com/civil_comments_dataset/validate_tf.tfrecord')

  # The identity terms will be grouped together by their categories
  # (see 'IDENTITY_COLUMNS') at a threshold of 0.5. Only the identity term
  # columns, text column, and label column are kept after processing.
  train_tf_file = util.convert_comments_data(train_tf_file)
  validate_tf_file = util.convert_comments_data(validate_tf_file)
else:
  train_tf_file = tf.keras.utils.get_file(
      'train_tf_processed.tfrecord',
      'https://storage.googleapis.com/civil_comments_dataset/train_tf_processed.tfrecord')
  validate_tf_file = tf.keras.utils.get_file(
      'validate_tf_processed.tfrecord',
      'https://storage.googleapis.com/civil_comments_dataset/validate_tf_processed.tfrecord')
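Optionally, before training, you can inspect a single record from the processed training set to confirm it contains the comment text, the toxicity label, and the identity columns. The following minimal sanity-check cell is not part of the original exercise:
In [0]:
# Optional sanity check: read one serialized tf.Example from the processed
# training file and list its feature keys.
raw_dataset = tf.data.TFRecordDataset(filenames=[train_tf_file]).take(1)
for raw_record in raw_dataset:
  example = tf.train.Example.FromString(raw_record.numpy())
  print(sorted(example.features.feature.keys()))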
Next, train the original model from Fairness Exercise 1, which we'll use as the baseline model for this exercise:
In [0]:
#@title Run this cell to train the baseline model from Exercise 1
TEXT_FEATURE = 'comment_text'
LABEL = 'toxicity'

FEATURE_MAP = {
    # Label:
    LABEL: tf.io.FixedLenFeature([], tf.float32),
    # Text:
    TEXT_FEATURE: tf.io.FixedLenFeature([], tf.string),
    # Identities:
    'sexual_orientation': tf.io.VarLenFeature(tf.string),
    'gender': tf.io.VarLenFeature(tf.string),
    'religion': tf.io.VarLenFeature(tf.string),
    'race': tf.io.VarLenFeature(tf.string),
    'disability': tf.io.VarLenFeature(tf.string),
}

def train_input_fn():
  def parse_function(serialized):
    parsed_example = tf.io.parse_single_example(
        serialized=serialized, features=FEATURE_MAP)
    # Adds a weight column to deal with unbalanced classes.
    parsed_example['weight'] = tf.add(parsed_example[LABEL], 0.1)
    return (parsed_example, parsed_example[LABEL])

  train_dataset = tf.data.TFRecordDataset(
      filenames=[train_tf_file]).map(parse_function).batch(512)
  return train_dataset

BASE_DIR = tempfile.gettempdir()
model_dir = os.path.join(BASE_DIR, 'train', datetime.now().strftime(
    "%Y%m%d-%H%M%S"))

embedded_text_feature_column = hub.text_embedding_column(
    key=TEXT_FEATURE,
    module_spec='https://tfhub.dev/google/nnlm-en-dim128/1')

classifier = tf.estimator.DNNClassifier(
    hidden_units=[500, 100],
    weight_column='weight',
    feature_columns=[embedded_text_feature_column],
    optimizer=tf.optimizers.Adagrad(learning_rate=0.003),
    loss_reduction=tf.losses.Reduction.SUM,
    n_classes=2,
    model_dir=model_dir)

classifier.train(input_fn=train_input_fn, steps=1000)
In the next section, we'll apply bias-remediation techniques on our data and then train a revised model on the updated data.
To remediate bias in our model, we'll first need to define the remediation metrics we'll use to gauge success and choose an appropriate remediation technique. Then we'll retrain the model using the technique we've selected.
Before we can apply bias-remediation techniques to our model, we first need to define what successful remediation looks like in the context of our particular problem. As we saw in Fairness Exercise 1, there are often tradeoffs that come into play when optimizing a model (for example, adjustments that decrease false positives may increase false negatives), so we need to choose the evaluation metrics that best align with our priorities.
For our toxicity classifier, we've identified that our primary concern is ensuring that gender-related comments are not disproportionately misclassified as toxic, which could result in constructive discourse being suppressed. So here, we will define successful remediation as a decrease in the FPR (false-positive rate) for gender subgroups relative to the overall FPR.
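As a refresher, FPR is the fraction of actual-negative (nontoxic) examples that the model classifies as positive (toxic). The following cell is an illustrative sketch, using made-up confusion-matrix counts, of how we'll compare a subgroup's FPR to the overall FPR:
In [0]:
# Illustrative only: FPR = FP / (FP + TN). The counts below are hypothetical.
def false_positive_rate(fp, tn):
  return fp / (fp + tn)

overall_fpr = false_positive_rate(fp=230, tn=770)   # hypothetical counts
subgroup_fpr = false_positive_rate(fp=42, tn=108)   # hypothetical counts

# Relative gap between the subgroup FPR and the overall FPR. We'll consider
# remediation successful if this gap shrinks for the gender subgroups.
relative_gap = (subgroup_fpr - overall_fpr) / overall_fpr
print('overall FPR: %.2f, subgroup FPR: %.2f, relative gap: %+.0f%%' %
      (overall_fpr, subgroup_fpr, relative_gap * 100))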
To reduce the false-positive rate for gender subgroups, we want to help the model "unlearn" any false correlations it has learned between gender-related terminology and toxicity. We've determined that this false correlation likely stems from an insufficient number of training examples in which gender terminology was used in nontoxic contexts.
One excellent way to remediate this issue would be to add more nontoxic examples to each gender subgroup to balance out the dataset, and then retrain on the amended data. However, we've already trained on all the data we have, so what can we do? This is a common problem ML engineers face. Collecting additional data can be costly, resource-intensive, and time-consuming, and as a result, it may just not be feasible in certain circumstances.
One alternative solution is to simulate additional data by upweighting the existing examples in the disproportionately underrepresented group (increasing the loss penalty for errors for these examples) so they carry more weight and are not as easily overwhelmed by the rest of the data.
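To make the effect of upweighting concrete: the baseline input function above assigns each example a weight of toxicity + 0.1, so nontoxic examples get a weight of 0.1. Raising that weight scales the example's contribution to the weighted training loss, which behaves roughly like duplicating the example. Here's a small illustrative calculation (the 0.5 value is the weight we'll assign to nontoxic female examples below):
In [0]:
# Illustrative only: upweighting a nontoxic example from 0.1 to 0.5 makes it
# count five times as much in the weighted loss -- roughly equivalent to
# adding four more copies of that example to the training set.
baseline_weight = 0.1     # weight assigned to nontoxic examples by label + 0.1
upweighted_weight = 0.5   # weight we'll assign to nontoxic `female` examples
print('effective duplication factor:', upweighted_weight / baseline_weight)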
Let's update the input function of our model to implement upweighting for nontoxic examples belonging to one or more gender subgroups. In the UPDATES FOR UPWEIGHTING section of the code below, we've increased the weight values for nontoxic examples that contain a gender value of transgender, female, or male:
In [0]:
def train_input_fn_with_remediation():
  def parse_function(serialized):
    parsed_example = tf.io.parse_single_example(
        serialized=serialized, features=FEATURE_MAP)
    # Adds a weight column to deal with unbalanced classes.
    parsed_example['weight'] = tf.add(parsed_example[LABEL], 0.1)

    # BEGIN UPDATES FOR UPWEIGHTING
    # Up-weight nontoxic examples to balance toxic and nontoxic examples
    # for the gender slices.
    values = parsed_example['gender'].values
    # A 'toxicity' label of zero means the example is nontoxic.
    if tf.equal(parsed_example[LABEL], 0):
      # We tuned the upweighting hyperparameters and found we got good
      # results by setting weights of 0.4 for `transgender`, 0.5 for
      # `female`, and 0.7 for `male`.
      # NOTE: `other_gender` is not upweighted separately, because all
      # examples tagged with `other_gender` were also tagged with one of
      # the other values below.
      if tf.greater(tf.math.count_nonzero(tf.equal(values, 'transgender')), 0):
        parsed_example['weight'] = tf.constant(0.4)
      if tf.greater(tf.math.count_nonzero(tf.equal(values, 'female')), 0):
        parsed_example['weight'] = tf.constant(0.5)
      if tf.greater(tf.math.count_nonzero(tf.equal(values, 'male')), 0):
        parsed_example['weight'] = tf.constant(0.7)
    # END UPDATES FOR UPWEIGHTING

    return (parsed_example, parsed_example[LABEL])

  train_dataset = tf.data.TFRecordDataset(
      filenames=[train_tf_file]).map(parse_function).batch(512)
  return train_dataset
In [0]:
BASE_DIR = tempfile.gettempdir()
model_dir_with_remediation = os.path.join(
    BASE_DIR, 'train', datetime.now().strftime("%Y%m%d-%H%M%S"))

embedded_text_feature_column = hub.text_embedding_column(
    key=TEXT_FEATURE,
    module_spec='https://tfhub.dev/google/nnlm-en-dim128/1')

classifier_with_remediation = tf.estimator.DNNClassifier(
    hidden_units=[500, 100],
    weight_column='weight',
    feature_columns=[embedded_text_feature_column],
    n_classes=2,
    optimizer=tf.optimizers.Adagrad(learning_rate=0.003),
    loss_reduction=tf.losses.Reduction.SUM,
    model_dir=model_dir_with_remediation)

classifier_with_remediation.train(
    input_fn=train_input_fn_with_remediation, steps=1000)
Now that we've retrained the model, let's recompute our fairness metrics. First, export the model:
In [0]:
def eval_input_receiver_fn():
  serialized_tf_example = tf.compat.v1.placeholder(
      dtype=tf.string, shape=[None], name='input_example_placeholder')

  receiver_tensors = {'examples': serialized_tf_example}

  features = tf.io.parse_example(serialized_tf_example, FEATURE_MAP)
  features['weight'] = tf.ones_like(features[LABEL])

  return tfma.export.EvalInputReceiver(
      features=features,
      receiver_tensors=receiver_tensors,
      labels=features[LABEL])

tfma_export_dir_with_remediation = tfma.export.export_eval_savedmodel(
    estimator=classifier_with_remediation,
    export_dir_base=os.path.join(BASE_DIR, 'tfma_eval_model_with_remediation'),
    eval_input_receiver_fn=eval_input_receiver_fn)
Next, run the fairness evaluation using TFMA:
In [0]:
tfma_eval_result_path_with_remediation = os.path.join(BASE_DIR, 'tfma_eval_result_with_remediation')
slice_selection = 'gender'
compute_confidence_intervals = False
# Define slices that you want the evaluation to run on.
slice_spec = [
    tfma.slicer.SingleSliceSpec(),  # Overall slice
    tfma.slicer.SingleSliceSpec(columns=['gender']),
]

# Add the fairness metrics.
add_metrics_callbacks = [
    tfma.post_export_metrics.fairness_indicators(
        thresholds=[0.1, 0.3, 0.5, 0.7, 0.9],
        labels_key=LABEL
    )
]

eval_shared_model_with_remediation = tfma.default_eval_shared_model(
    eval_saved_model_path=tfma_export_dir_with_remediation,
    add_metrics_callbacks=add_metrics_callbacks)

validate_dataset = tf.data.TFRecordDataset(filenames=[validate_tf_file])

# Run the fairness evaluation.
with beam.Pipeline() as pipeline:
  _ = (
      pipeline
      | 'ReadData' >> beam.io.ReadFromTFRecord(validate_tf_file)
      | 'ExtractEvaluateAndWriteResults' >>
      tfma.ExtractEvaluateAndWriteResults(
          eval_shared_model=eval_shared_model_with_remediation,
          slice_spec=slice_spec,
          compute_confidence_intervals=compute_confidence_intervals,
          output_path=tfma_eval_result_path_with_remediation)
  )
eval_result_with_remediation = tfma.load_eval_result(output_path=tfma_eval_result_path_with_remediation)
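If you'd like to inspect the computed metrics outside of the widgets, the following optional cell prints the raw per-slice metrics stored in the TFMA evaluation result. (The exact nesting and naming of the metrics dictionary can vary across TFMA versions, so this sketch simply prints each slice's metrics as-is.)
In [0]:
# Optional: print the raw per-slice metrics from the TFMA evaluation result.
# `slicing_metrics` is a list of (slice_key, metrics) pairs; the Fairness
# Indicators metrics are keyed by name and threshold (exact names may vary).
for slice_key, metrics in eval_result_with_remediation.slicing_metrics:
  print('Slice:', slice_key)
  print(metrics)
  print('-' * 40)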
Run the following two cells to load results in the What-If tool and Fairness Indicators.
In the What-If tool, we'll load 1,000 examples with the corresponding predictions returned from both the baseline model and the remediated model.
In [0]:
DEFAULT_MAX_EXAMPLES = 1000

# Load 100000 examples in memory. When first rendered, the What-If Tool only
# displays 1000 of these examples to ensure data loads successfully for most
# browser/machine configurations.
def wit_dataset(file, num_examples=100000):
  dataset = tf.data.TFRecordDataset(
      filenames=[file]).take(num_examples)
  return [tf.train.Example.FromString(d.numpy()) for d in dataset]

wit_data = wit_dataset(train_tf_file)

# Configure WIT with 1000 examples, the FEATURE_MAP we defined above, and
# a label of 1 for positive (toxic) examples and 0 for negative (nontoxic)
# examples.
config_builder = WitConfigBuilder(
    wit_data[:DEFAULT_MAX_EXAMPLES]).set_estimator_and_feature_spec(
        classifier, FEATURE_MAP).set_compare_estimator_and_feature_spec(
        classifier_with_remediation, FEATURE_MAP).set_label_vocab(
        ['0', '1']).set_target_feature(LABEL)
wit = WitWidget(config_builder)
In Fairness Indicators, we'll display the remediated model's evaluation results on the validation set.
In [0]:
# Link Fairness Indicators widget with WIT widget above,
# so that clicking a slice in FI below will load its data in WIT above.
event_handlers = {
    'slice-selected':
        wit.create_selection_callback(wit_data, DEFAULT_MAX_EXAMPLES)}

widget_view.render_fairness_indicator(
    eval_result=eval_result_with_remediation,
    slicing_column=slice_selection,
    event_handlers=event_handlers)
In [0]:
#@title Alternative: Run this cell only if you intend to skip the What-If tool exercises (see Warning above)
# Link Fairness Indicators widget with WIT widget above,
# so that clicking a slice in FI below will load its data in WIT above.
widget_view.render_fairness_indicator(eval_result=eval_result_with_remediation,
slicing_column=slice_selection)
When we evaluated our model against the validation set, we got an FPR of 0.28 for the male subgroup and 0.24 for the female subgroup. The overall FPR was 0.23.
The FPR for male is now approximately 20% higher than the overall rate, and the FPR for female is now approximately 5% higher than the overall rate. This is a significant improvement over our previous model, where the FPRs for male and female were 83% and 69% higher, respectively, than the overall FPR.
NOTE: Model training is not deterministic, so your exact results may vary slightly from ours.
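For reference, here's how those relative differences can be computed from the FPR values reported above. (Because the FPRs are rounded to two decimal places, the printed percentages are only approximate; substitute your own values if your results differ.)
In [0]:
# Relative difference between each subgroup's FPR and the overall FPR,
# using the rounded values reported above.
overall_fpr = 0.23
for subgroup, fpr in [('male', 0.28), ('female', 0.24)]:
  relative_diff = (fpr - overall_fpr) / overall_fpr
  print('%s FPR is %+.1f%% relative to the overall FPR' %
        (subgroup, relative_diff * 100))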
We should also review the FNR (false-negative rate).
A model optimized solely to decrease FPR could learn to always predict the negative class ("nontoxic"), which would result in an FPR of 0. However, this would cause the FNR to skyrocket, because every actual positive ("toxic") example would be misclassified as a false negative.
While our primary metric for evaluating remediation is FPR, we still want to make sure we're OK with any tradeoff in increased FNR that we incur to decrease FPR.
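To make that degenerate case concrete, here's a quick illustrative calculation with hypothetical validation counts: a classifier that labels every comment nontoxic achieves an FPR of 0 but an FNR of 1.
In [0]:
# Degenerate "always predict nontoxic" classifier on hypothetical counts:
# no false positives (FPR = 0), but every toxic example is a false negative
# (FNR = 1).
num_toxic, num_nontoxic = 200, 800   # hypothetical validation-set counts
fp, tn = 0, num_nontoxic             # nothing is ever predicted toxic
fn, tp = num_toxic, 0                # every toxic comment is missed
print('FPR:', fp / (fp + tn), 'FNR:', fn / (fn + tp))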
If we take a look at FNR results for the revised model, we see that the overall FNR is 0.34, male FNR is 1% lower at 0.33, and female FNR is 12% higher at 0.38. So we can confirm that our subgroup FNRs are not dramatically higher than the overall FNR, and the overall FNR itself is not sky-high.
NOTE: Model training is not deterministic, so your exact results may vary slightly from ours.
If we hover over the other_gender slice, as shown above, we see that there are only 6 examples in this slice. This is an extremely small number of examples in comparison to the male and female groups, which each have over 15,000 examples.
NOTE: Model training is not deterministic, so your exact results may vary slightly from the results shown above.
With an other_gender slice this small, we can't make any statistically significant assertions about the model's performance on this subgroup (changing the classification of just one example would cause a swing of 16.6% in FNR or FPR). Upweighting is not sufficient here; we're going to need to add more examples to the other_gender subgroup that the model can learn from.
Now let's use the What-If Tool to compare the two models' predictions on the female subgroup. Click on the bar of the female slice in the Fairness Indicators widget to load the corresponding individual female examples in the What-If Tool widget above. Then create a scatterplot that plots toxicity scores for the baseline model (Inference Score 1) against toxicity scores for the revised model (Inference Score 2), with each example color-coded by ground-truth label (toxicity).
Here's our graph, with toxicity scores for the baseline model plotted along the x-axis, and toxicity scores for the revised model plotted along the y-axis. Actual toxic examples are colored red, and actual nontoxic examples are colored blue.
NOTE: Model training is not deterministic, so your exact results may vary slightly from ours.
The relationship between the two scores is generally linear, but we can see a few clusters of blue outliers (circled above) where the revised model predicts a significantly lower toxicity score than the baseline model. We can extrapolate that the revised model does a better job of assigning low toxicity scores to a portion of nontoxic female examples (though there's still room for further improvement).