Learning Objectives:
Apply remediation techniques to mitigate the unintended gender-related bias identified in a toxicity classifier
Use Fairness Indicators and the What-If Tool to evaluate whether the remediation efforts were successful
Prerequisites
This exercise builds on Fairness Exercise 1. It is strongly recommended that you complete that exercise before working through this one.
In Fairness Exercise 1, you trained a toxicity classifier on the Civil Comments dataset and used Fairness Indicators to identify some unintended bias issues related to gender. In this exercise, you'll apply remediation techniques and retrain the model to mitigate this bias. You'll then use Fairness Indicators and the What-If Tool to evaluate the results and confirm that the remediation efforts were successful.
In [0]:
!pip install fairness-indicators \
"absl-py==0.8.0" \
"pyarrow==0.15.1" \
"apache-beam==2.17.0" \
"avro-python3==1.9.1" \
"tfx-bsl==0.21.4" \
"tensorflow-data-validation==0.21.5"
Next, import all the dependencies we'll use in this exercise, which include Fairness Indicators, TensorFlow Model Analysis (tfma), and the What-If tool (WIT):
In [0]:
%tensorflow_version 2.x
import os
import tempfile
import apache_beam as beam
import numpy as np
import pandas as pd
from datetime import datetime
import tensorflow_hub as hub
import tensorflow as tf
import tensorflow_model_analysis as tfma
from tensorflow_model_analysis.addons.fairness.post_export_metrics import fairness_indicators
from tensorflow_model_analysis.addons.fairness.view import widget_view
from witwidget.notebook.visualization import WitConfigBuilder
from witwidget.notebook.visualization import WitWidget
Run the following code to download and import the training and validation datasets. By default, it loads the preprocessed data (see Fairness Exercise 1 for more details). If you prefer, you can enable the download_original_data
checkbox at right to download the original dataset and preprocess it as described in the previous section (this may take 5-10 minutes).
In [0]:
download_original_data = False #@param {type:"boolean"}

if download_original_data:
  # Preprocessing helper from the Fairness Indicators examples package.
  from fairness_indicators.examples import util

  train_tf_file = tf.keras.utils.get_file(
      'train_tf.tfrecord',
      'https://storage.googleapis.com/civil_comments_dataset/train_tf.tfrecord')
  validate_tf_file = tf.keras.utils.get_file(
      'validate_tf.tfrecord',
      'https://storage.googleapis.com/civil_comments_dataset/validate_tf.tfrecord')

  # The identity terms will be grouped together by their categories
  # (see 'IDENTITY_COLUMNS') at a threshold of 0.5. Only the identity term
  # columns, text column, and label column are kept after processing.
  train_tf_file = util.convert_comments_data(train_tf_file)
  validate_tf_file = util.convert_comments_data(validate_tf_file)
else:
  train_tf_file = tf.keras.utils.get_file(
      'train_tf_processed.tfrecord',
      'https://storage.googleapis.com/civil_comments_dataset/train_tf_processed.tfrecord')
  validate_tf_file = tf.keras.utils.get_file(
      'validate_tf_processed.tfrecord',
      'https://storage.googleapis.com/civil_comments_dataset/validate_tf_processed.tfrecord')
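Optionally, before training, you can inspect a single record from the processed training set to confirm it contains the comment text, the toxicity label, and the identity columns. The following minimal sanity-check cell is not part of the original exercise:
In [0]:
# Optional sanity check: read one serialized tf.Example from the processed
# training file and list its feature keys.
raw_dataset = tf.data.TFRecordDataset(filenames=[train_tf_file]).take(1)
for raw_record in raw_dataset:
  example = tf.train.Example.FromString(raw_record.numpy())
  print(sorted(example.features.feature.keys()))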
Next, train the original model from Fairness Exercise 1, which we'll use as the baseline model for this exercise:
In [0]:
#@title Run this cell to train the baseline model from Exercise 1
TEXT_FEATURE = 'comment_text'
LABEL = 'toxicity'

FEATURE_MAP = {
    # Label:
    LABEL: tf.io.FixedLenFeature([], tf.float32),
    # Text:
    TEXT_FEATURE: tf.io.FixedLenFeature([], tf.string),
    # Identities:
    'sexual_orientation': tf.io.VarLenFeature(tf.string),
    'gender': tf.io.VarLenFeature(tf.string),
    'religion': tf.io.VarLenFeature(tf.string),
    'race': tf.io.VarLenFeature(tf.string),
    'disability': tf.io.VarLenFeature(tf.string),
}

def train_input_fn():
  def parse_function(serialized):
    parsed_example = tf.io.parse_single_example(
        serialized=serialized, features=FEATURE_MAP)
    # Adds a weight column to deal with unbalanced classes.
    parsed_example['weight'] = tf.add(parsed_example[LABEL], 0.1)
    return (parsed_example, parsed_example[LABEL])

  train_dataset = tf.data.TFRecordDataset(
      filenames=[train_tf_file]).map(parse_function).batch(512)
  return train_dataset

BASE_DIR = tempfile.gettempdir()
model_dir = os.path.join(BASE_DIR, 'train', datetime.now().strftime(
    "%Y%m%d-%H%M%S"))

embedded_text_feature_column = hub.text_embedding_column(
    key=TEXT_FEATURE,
    module_spec='https://tfhub.dev/google/nnlm-en-dim128/1')

classifier = tf.estimator.DNNClassifier(
    hidden_units=[500, 100],
    weight_column='weight',
    feature_columns=[embedded_text_feature_column],
    optimizer=tf.optimizers.Adagrad(learning_rate=0.003),
    loss_reduction=tf.losses.Reduction.SUM,
    n_classes=2,
    model_dir=model_dir)

classifier.train(input_fn=train_input_fn, steps=1000)
In the next section, we'll apply bias-remediation techniques on our data and then train a revised model on the updated data.
To remediate bias in our model, we'll first need to define the remediation metrics we'll use to gauge success and choose an appropriate remediation technique. Then we'll retrain the model using the technique we've selected.
Before we can apply bias-remediation techniques to our model, we first need to define what successful remediation looks like in the context of our particular problem. As we saw in Fairness Exercise 1, there are often tradeoffs that come into play when optimizing a model (for example, adjustments that decrease false positives may increase false negatives), so we need to choose the evaluation metrics that best align with our priorities.
For our toxicity classifier, we've identified that our primary concern is ensuring that gender-related comments are not disproportionately misclassified as toxic, which could result in constructive discourse being suppressed. So here, we will define successful remediation as a decrease in the FPR (false-positive rate) for gender subgroups relative to the overall FPR.
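As a refresher, FPR is the fraction of actual-negative (nontoxic) examples that the model classifies as positive (toxic). The following cell is an illustrative sketch, using made-up confusion-matrix counts, of how we'll compare a subgroup's FPR to the overall FPR:
In [0]:
# Illustrative only: FPR = FP / (FP + TN). The counts below are hypothetical.
def false_positive_rate(fp, tn):
  return fp / (fp + tn)

overall_fpr = false_positive_rate(fp=230, tn=770)   # hypothetical counts
subgroup_fpr = false_positive_rate(fp=42, tn=108)   # hypothetical counts

# Relative gap between the subgroup FPR and the overall FPR. We'll consider
# remediation successful if this gap shrinks for the gender subgroups.
relative_gap = (subgroup_fpr - overall_fpr) / overall_fpr
print('overall FPR: %.2f, subgroup FPR: %.2f, relative gap: %+.0f%%' %
      (overall_fpr, subgroup_fpr, relative_gap * 100))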
To reduce the false-positive rate for gender subgroups, we want to help the model "unlearn" any false correlations it has learned between gender-related terminology and toxicity. We've determined that this false correlation likely stems from an insufficient number of training examples in which gender terminology was used in nontoxic contexts.
One excellent way to remediate this issue would be to add more nontoxic examples to each gender subgroup to balance out the dataset, and then retrain on the amended data. However, we've already trained on all the data we have, so what can we do? This is a common problem ML engineers face. Collecting additional data can be costly, resource-intensive, and time-consuming, and as a result, it may just not be feasible in certain circumstances.
One alternative solution is to simulate additional data by upweighting the existing examples in the disproportionately underrepresented group (increasing the loss penalty for errors for these examples) so they carry more weight and are not as easily overwhelmed by the rest of the data.
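To make the effect of upweighting concrete: the baseline input function above assigns each example a weight of toxicity + 0.1, so nontoxic examples get a weight of 0.1. Raising that weight scales the example's contribution to the weighted training loss, which behaves roughly like duplicating the example. Here's a small illustrative calculation (the 0.5 value is the weight we'll assign to nontoxic female examples below):
In [0]:
# Illustrative only: upweighting a nontoxic example from 0.1 to 0.5 makes it
# count five times as much in the weighted loss -- roughly equivalent to
# adding four more copies of that example to the training set.
baseline_weight = 0.1     # weight assigned to nontoxic examples by label + 0.1
upweighted_weight = 0.5   # weight we'll assign to nontoxic `female` examples
print('effective duplication factor:', upweighted_weight / baseline_weight)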
Let's update the input function of our model to implement upweighting for nontoxic examples belonging to one or more gender subgroups. In the UPDATES FOR UPWEIGHTING section of the code below, we've increased the weight values for nontoxic examples that contain a gender value of transgender, female, or male:
In [0]:
def train_input_fn_with_remediation():
  def parse_function(serialized):
    parsed_example = tf.io.parse_single_example(
        serialized=serialized, features=FEATURE_MAP)
    # Adds a weight column to deal with unbalanced classes.
    parsed_example['weight'] = tf.add(parsed_example[LABEL], 0.1)

    # BEGIN UPDATES FOR UPWEIGHTING
    # Up-weight nontoxic examples to balance toxic and nontoxic examples
    # for the gender slices.
    values = parsed_example['gender'].values
    # A 'toxicity' label of zero means the example is nontoxic.
    if tf.equal(parsed_example[LABEL], 0):
      # We tuned the upweighting hyperparameters and found we got good
      # results by setting weights of 0.4 for `transgender`, 0.5 for
      # `female`, and 0.7 for `male`.
      # NOTE: `other_gender` is not upweighted separately, because all
      # examples tagged with `other_gender` were also tagged with one of
      # the other values below.
      if tf.greater(tf.math.count_nonzero(tf.equal(values, 'transgender')), 0):
        parsed_example['weight'] = tf.constant(0.4)
      if tf.greater(tf.math.count_nonzero(tf.equal(values, 'female')), 0):
        parsed_example['weight'] = tf.constant(0.5)
      if tf.greater(tf.math.count_nonzero(tf.equal(values, 'male')), 0):
        parsed_example['weight'] = tf.constant(0.7)
    # END UPDATES FOR UPWEIGHTING

    return (parsed_example, parsed_example[LABEL])

  train_dataset = tf.data.TFRecordDataset(
      filenames=[train_tf_file]).map(parse_function).batch(512)
  return train_dataset
In [0]:
BASE_DIR = tempfile.gettempdir()
model_dir_with_remediation = os.path.join(
    BASE_DIR, 'train', datetime.now().strftime("%Y%m%d-%H%M%S"))

embedded_text_feature_column = hub.text_embedding_column(
    key=TEXT_FEATURE,
    module_spec='https://tfhub.dev/google/nnlm-en-dim128/1')

classifier_with_remediation = tf.estimator.DNNClassifier(
    hidden_units=[500, 100],
    weight_column='weight',
    feature_columns=[embedded_text_feature_column],
    n_classes=2,
    optimizer=tf.optimizers.Adagrad(learning_rate=0.003),
    loss_reduction=tf.losses.Reduction.SUM,
    model_dir=model_dir_with_remediation)

classifier_with_remediation.train(
    input_fn=train_input_fn_with_remediation, steps=1000)
Now that we've retrained the model, let's recompute our fairness metrics. First, export the model:
In [0]:
def eval_input_receiver_fn():
  serialized_tf_example = tf.compat.v1.placeholder(
      dtype=tf.string, shape=[None], name='input_example_placeholder')

  receiver_tensors = {'examples': serialized_tf_example}

  features = tf.io.parse_example(serialized_tf_example, FEATURE_MAP)
  features['weight'] = tf.ones_like(features[LABEL])

  return tfma.export.EvalInputReceiver(
      features=features,
      receiver_tensors=receiver_tensors,
      labels=features[LABEL])

tfma_export_dir_with_remediation = tfma.export.export_eval_savedmodel(
    estimator=classifier_with_remediation,
    export_dir_base=os.path.join(BASE_DIR, 'tfma_eval_model_with_remediation'),
    eval_input_receiver_fn=eval_input_receiver_fn)
Next, run the fairness evaluation using TFMA:
In [0]:
tfma_eval_result_path_with_remediation = os.path.join(BASE_DIR, 'tfma_eval_result_with_remediation')
slice_selection = 'gender'
compute_confidence_intervals = False
# Define slices that you want the evaluation to run on.
slice_spec = [
    tfma.slicer.SingleSliceSpec(),  # Overall slice
    tfma.slicer.SingleSliceSpec(columns=['gender']),
]

# Add the fairness metrics.
add_metrics_callbacks = [
    tfma.post_export_metrics.fairness_indicators(
        thresholds=[0.1, 0.3, 0.5, 0.7, 0.9],
        labels_key=LABEL
    )
]

eval_shared_model_with_remediation = tfma.default_eval_shared_model(
    eval_saved_model_path=tfma_export_dir_with_remediation,
    add_metrics_callbacks=add_metrics_callbacks)

validate_dataset = tf.data.TFRecordDataset(filenames=[validate_tf_file])

# Run the fairness evaluation.
with beam.Pipeline() as pipeline:
  _ = (
      pipeline
      | 'ReadData' >> beam.io.ReadFromTFRecord(validate_tf_file)
      | 'ExtractEvaluateAndWriteResults' >>
      tfma.ExtractEvaluateAndWriteResults(
          eval_shared_model=eval_shared_model_with_remediation,
          slice_spec=slice_spec,
          compute_confidence_intervals=compute_confidence_intervals,
          output_path=tfma_eval_result_path_with_remediation)
  )
eval_result_with_remediation = tfma.load_eval_result(output_path=tfma_eval_result_path_with_remediation)
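If you'd like to inspect the computed metrics outside of the widgets, the following optional cell prints the raw per-slice metrics stored in the TFMA evaluation result. (The exact nesting and naming of the metrics dictionary can vary across TFMA versions, so this sketch simply prints each slice's metrics as-is.)
In [0]:
# Optional: print the raw per-slice metrics from the TFMA evaluation result.
# `slicing_metrics` is a list of (slice_key, metrics) pairs; the Fairness
# Indicators metrics are keyed by name and threshold (exact names may vary).
for slice_key, metrics in eval_result_with_remediation.slicing_metrics:
  print('Slice:', slice_key)
  print(metrics)
  print('-' * 40)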
Run the following two cells to load results in the What-If tool and Fairness Indicators.
In the What-If tool, we'll load 1,000 examples with the corresponding predictions returned from both the baseline model and the remediated model.
In [0]:
DEFAULT_MAX_EXAMPLES = 1000

# Load 100000 examples in memory. When first rendered, the What-If Tool only
# displays 1000 of these examples to ensure data loads successfully for most
# browser/machine configurations.
def wit_dataset(file, num_examples=100000):
  dataset = tf.data.TFRecordDataset(
      filenames=[file]).take(num_examples)
  return [tf.train.Example.FromString(d.numpy()) for d in dataset]

wit_data = wit_dataset(train_tf_file)

# Configure WIT with 1000 examples, the FEATURE_MAP we defined above, and
# a label of 1 for positive (toxic) examples and 0 for negative (nontoxic)
# examples.
config_builder = WitConfigBuilder(
    wit_data[:DEFAULT_MAX_EXAMPLES]).set_estimator_and_feature_spec(
        classifier, FEATURE_MAP).set_compare_estimator_and_feature_spec(
        classifier_with_remediation, FEATURE_MAP).set_label_vocab(
        ['0', '1']).set_target_feature(LABEL)
wit = WitWidget(config_builder)
In Fairness Indicators, we'll display the remediated model's evaluation results on the validation set.
In [0]:
# Link Fairness Indicators widget with WIT widget above,
# so that clicking a slice in FI below will load its data in WIT above.
event_handlers = {
    'slice-selected':
        wit.create_selection_callback(wit_data, DEFAULT_MAX_EXAMPLES)}

widget_view.render_fairness_indicator(
    eval_result=eval_result_with_remediation,
    slicing_column=slice_selection,
    event_handlers=event_handlers)
In [0]:
#@title Alternative: Run this cell only if you intend to skip the What-If tool exercises (see Warning above)
# Link Fairness Indicators widget with WIT widget above,
# so that clicking a slice in FI below will load its data in WIT above.
widget_view.render_fairness_indicator(eval_result=eval_result_with_remediation,
slicing_column=slice_selection)
When we evaluated our model against the validation set, we got an FPR of 0.28 for the male subgroup and 0.24 for the female subgroup. The overall FPR was 0.23.
The FPR for male is now approximately 20% higher than the overall rate, and the FPR for female is now approximately 5% higher than the overall rate. This is a significant improvement over our previous model, where the FPRs for male and female were 83% and 69% higher, respectively, than the overall FPR.
NOTE: Model training is not deterministic, so your exact results may vary slightly from ours.
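For reference, here's how those relative differences can be computed from the FPR values reported above. (Because the FPRs are rounded to two decimal places, the printed percentages are only approximate; substitute your own values if your results differ.)
In [0]:
# Relative difference between each subgroup's FPR and the overall FPR,
# using the rounded values reported above.
overall_fpr = 0.23
for subgroup, fpr in [('male', 0.28), ('female', 0.24)]:
  relative_diff = (fpr - overall_fpr) / overall_fpr
  print('%s FPR is %+.1f%% relative to the overall FPR' %
        (subgroup, relative_diff * 100))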
We should also review the FNR (false-negative rate).
A model optimized solely to decrease FPR could learn to always predict the negative class ("nontoxic"), which would result in an FPR of 0. However, this would cause the FNR to skyrocket, because every actual positive ("toxic") example would be misclassified as a false negative.
While our primary metric for evaluating remediation is FPR, we still want to make sure we're OK with any tradeoff in increased FNR that we incur to decrease FPR.
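To make that degenerate case concrete, here's a quick illustrative calculation with hypothetical validation counts: a classifier that labels every comment nontoxic achieves an FPR of 0 but an FNR of 1.
In [0]:
# Degenerate "always predict nontoxic" classifier on hypothetical counts:
# no false positives (FPR = 0), but every toxic example is a false negative
# (FNR = 1).
num_toxic, num_nontoxic = 200, 800   # hypothetical validation-set counts
fp, tn = 0, num_nontoxic             # nothing is ever predicted toxic
fn, tp = num_toxic, 0                # every toxic comment is missed
print('FPR:', fp / (fp + tn), 'FNR:', fn / (fn + tp))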
If we take a look at FNR results for the revised model, we see that the overall FNR is 0.34, male FNR is 1% lower at 0.33, and female FNR is 12% higher at 0.38. So we can confirm that our subgroup FNRs are not dramatically higher than the overall FNR, and the overall FNR itself is not sky-high.
NOTE: Model training is not deterministic, so your exact results may vary slightly from ours.
If we hover over the other_gender slice, as shown above, we see that there are only 6 examples in this slice. This is an extremely small number of examples in comparison to the male and female groups, which each have over 15,000 examples.
NOTE: Model training is not deterministic, so your exact results may vary slightly from the results shown above.
With an other_gender slice this small, we can't make any statistically significant assertions about the model's performance on this subgroup (changing the classification of just one example would cause a swing of 16.6% in FNR or FPR). Upweighting is not sufficient here; we're going to need to add more examples to the other_gender subgroup that the model can learn from.
Now let's use the What-If Tool to compare the two models' predictions on the female subgroup. Click on the bar of the female slice in the Fairness Indicators widget to load the corresponding individual female examples in the What-If Tool widget above. Then create a scatterplot that plots toxicity scores for the baseline model (Inference Score 1) against toxicity scores for the revised model (Inference Score 2), with each example color-coded by ground-truth label (toxicity).
Here's our graph, with toxicity scores for the baseline model plotted along the x-axis, and toxicity scores for the revised model plotted along the y-axis. Actual toxic examples are colored red, and actual nontoxic examples are colored blue.
NOTE: Model training is not deterministic, so your exact results may vary slightly from ours.
The relationship between the two scores is generally linear, but we can see a few clusters of blue outliers (circled above) where the revised model predicts a significantly lower toxicity score than the baseline model. We can extrapolate that the revised model does a better job of assigning low toxicity scores to a portion of nontoxic female examples (though there's still room for further improvement).