Learning Objectives:
Overview
In this exercise, you'll use Fairness Indicators to evaluate a toxicity classifier trained exclusively on the text comments in the Civil Comments dataset.
About Fairness Indicators
Fairness Indicators is a suite of tools built on top of TensorFlow Model Analysis that enable regular evaluation of fairness metrics in product pipelines.
Fairness Indicators makes it easy for you to ask questions about how your model performs for different groups of users. The suite includes TensorFlow Data Validation, TensorFlow Model Analysis, and the What-If tool.
These tools help you compute common classification fairness metrics and evaluate model performance for defined groups of users, and visualize comparisons to a baseline slice. You can evaluate pipelines of all sizes, and compare your results using different thresholds and confidence levels. Fairness Indicators allows you to deep-dive into individual slices and interrogate your dataset, adjusting confidence intervals and evaluations at multiple thresholds.
Fairness Indicators is packaged with TensorFlow Data Validation and the What-If Tool, allowing you to evaluate the distribution of your datasets and probe model behavior down to the level of individual datapoints.
For a closer look at the Fairness Indicators suite, check out the Fairness Indicators documentation. To get started with Fairness Indicators, keep reading.
Run the following cells to install and import the dependencies for the libraries we'll be using in this exercise.
First, run the cell below to install Fairness Indicators.
NOTE: You MUST RESTART the Colab runtime after doing this installation, either by clicking the RESTART RUNTIME button at the bottom of this cell or by selecting Runtime->Restart runtime... from the menu bar above.
In [0]:
!pip install fairness-indicators \
"absl-py==0.8.0" \
"pyarrow==0.15.1" \
"apache-beam==2.17.0" \
"avro-python3==1.9.1" \
"tfx-bsl==0.21.4" \
"tensorflow-data-validation==0.21.5"
Next, import all the dependencies we'll use in this exercise, which include Fairness Indicators, TensorFlow Model Analysis (tfma), TensorFlow Data Validation (tfdv), and the What-If tool (WIT):
In [0]:
%tensorflow_version 2.x
import os
import tempfile
import apache_beam as beam
import numpy as np
import pandas as pd
from datetime import datetime
import tensorflow_hub as hub
import tensorflow as tf
import tensorflow_model_analysis as tfma
import tensorflow_data_validation as tfdv
from tensorflow_model_analysis.addons.fairness.post_export_metrics import fairness_indicators
from tensorflow_model_analysis.addons.fairness.view import widget_view
from fairness_indicators.examples import util
from witwidget.notebook.visualization import WitConfigBuilder
from witwidget.notebook.visualization import WitWidget
The Civil Comments dataset comprises approximately 2 million public comments that were submitted to the Civil Comments platform. Jigsaw sponsored the effort to compile and annotate these comments for ongoing research; they've also hosted competitions on Kaggle to help classify toxic comments as well as minimize unintended model bias.
Within the Civil Comments data, a subset of comments are tagged with a variety of identity attributes pertaining to gender, sexual orientation, religion, race, and ethnicity. Each identity annotation column contains a value that represents the percentage of annotators who categorized a comment as containing references to that identity. Multiple identities may be present in a comment.
NOTE: These identity attributes are intended for evaluation purposes only, to assess how well a classifier trained solely on the comment text performs on different tag sets.
To collect these identity labels, each comment was reviewed by up to 10 annotators, who were asked to indicate all identities that were mentioned in the comment. For example, annotators were posed the question: "What genders are mentioned in the comment?", and asked to select all of the categories that applied.
NOTE: We recognize the limitations of the categories used in the original dataset, and acknowledge that these terms do not encompass the full range of vocabulary used in describing gender.
Jigsaw used these ratings to generate an aggregate score for each identity attribute representing the percentage of raters who said the identity was mentioned in the comment. For example, if 10 annotators reviewed a comment, and 6 said that the comment mentioned the identity "female" and 0 said that the comment mentioned the identity "male," the comment would receive a female score of 0.6 and a male score of 0.0.
NOTE: For the purposes of annotation, a comment was considered to "mention" gender if it contained a comment about gender issues (e.g., a discussion about feminism, wage gap between men and women, transgender rights, etc.), gendered language, or gendered insults. Use of "he," "she," or gendered names (e.g., Donald, Margaret) did not require a gender label.
Each comment was also rated by up to 10 annotators for toxicity, with each annotator assigning one toxicity rating (e.g., "Very Toxic," "Toxic," or "Not Toxic").
Again, Jigsaw used these ratings to generate an aggregate toxicity "score" for each comment (ranging from 0.0 to 1.0) to serve as the label, representing the fraction of annotators who labeled the comment either "Very Toxic" or "Toxic." For example, if 10 annotators rated a comment, and 3 of them labeled it "Very Toxic" and 5 of them labeled it "Toxic", the comment would receive a toxicity score of 0.8.
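To make the aggregation concrete, here's a minimal sketch in Python (the aggregate_score helper is hypothetical and not part of the dataset tooling) that computes the fraction of raters whose label falls in a given set of labels:

# Hypothetical helper (not part of the Civil Comments tooling) illustrating
# how a fraction-of-raters score is derived from individual annotations.
def aggregate_score(rater_labels, positive_labels):
  """Returns the fraction of labels that fall in `positive_labels`."""
  return sum(label in positive_labels for label in rater_labels) / len(rater_labels)

# Toxicity example from above: 3 "Very Toxic" + 5 "Toxic" + 2 other ratings.
ratings = ["Very Toxic"] * 3 + ["Toxic"] * 5 + ["Not Toxic"] * 2
print(aggregate_score(ratings, {"Very Toxic", "Toxic"}))  # 0.8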
NOTE: For more information on the Civil Comments labeling schema, see the Data section of the Jigsaw Unintended Bias in Toxicity Classification Kaggle competition.
Here are the feature values for one example in the dataset:
comment_text: i'm a white woman in my late 60's and believe me, they are not too crazy about me either!!
female: 1.0
white: 1.0
All raters tagged this comment with the labels female and white, giving the example scores of 1.0 for each of these identity mention labels.
NOTE: All other identity labels (e.g., male, asian) had values of 0.0.
Here's the label for this example:
toxicity: 0.0
All raters labeled the above comment "not toxic," which resulted in a toxicity label of 0.0.
For the purposes of this exercise, we converted toxicity and identity columns to booleans in order to work with our neural net and metrics calculations. In the preprocessed dataset, we considered any value ≥ 0.5 as True (i.e., a comment is considered toxic if 50% or more crowd raters labeled it as toxic).
For identity labels, the same 0.5 threshold was applied, and the identities were then grouped together by category. For example, if a comment has the raw identity scores { male: 0.3, female: 1.0, transgender: 0.0, heterosexual: 0.8, homosexual_gay_or_lesbian: 1.0 }, after processing the data will be { gender: [female], sexual_orientation: [heterosexual, homosexual_gay_or_lesbian] }.
NOTE: Missing identity fields were converted to False.
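As an illustration only, here's a minimal sketch of the thresholding-and-grouping logic just described. The IDENTITY_GROUPS mapping and group_identities helper below are hypothetical and abbreviated; the exercise itself relies on util.convert_comments_data (used later in this notebook) for the real preprocessing:

# Minimal sketch of the >= 0.5 thresholding and category grouping described
# above. The grouping map is abbreviated and hypothetical.
IDENTITY_GROUPS = {
    'gender': ['male', 'female', 'transgender'],
    'sexual_orientation': ['heterosexual', 'homosexual_gay_or_lesbian'],
}

def group_identities(raw_scores, threshold=0.5):
  """Maps raw identity scores to lists of identities meeting the threshold."""
  # Missing fields default to 0.0, which mirrors "missing converted to False."
  return {
      category: [term for term in terms if raw_scores.get(term, 0.0) >= threshold]
      for category, terms in IDENTITY_GROUPS.items()
  }

raw = {'male': 0.3, 'female': 1.0, 'transgender': 0.0,
       'heterosexual': 0.8, 'homosexual_gay_or_lesbian': 1.0}
print(group_identities(raw))
# {'gender': ['female'], 'sexual_orientation': ['heterosexual', 'homosexual_gay_or_lesbian']}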
After preprocessing, here's the revised feature and label data for the example from above:
comment_text: i'm a white woman in my late 60's and believe me, they are not too crazy about me either!!
gender: [female]
race: [white]
disability: []
religion: []
sexual_orientation: []
toxicity: 0.0
We've posted copies of both the original Civil Comments dataset and our preprocessed data on Google Cloud Platform (in TFRecord format) to make it easy to import into this notebook.
Run the following cell to download and import the training and validation datasets. By default, the following code will load the preprocessed data. If you prefer, you can enable the download_original_data checkbox at right to download the original dataset and preprocess it as described in the previous section (this may take 5-10 minutes).
In [0]:
download_original_data = False #@param {type:"boolean"}

if download_original_data:
  train_tf_file = tf.keras.utils.get_file('train_tf.tfrecord',
                                          'https://storage.googleapis.com/civil_comments_dataset/train_tf.tfrecord')
  validate_tf_file = tf.keras.utils.get_file('validate_tf.tfrecord',
                                             'https://storage.googleapis.com/civil_comments_dataset/validate_tf.tfrecord')

  # The identity terms list will be grouped together by their categories
  # (see 'IDENTITY_COLUMNS') at threshold 0.5. Only the identity term column,
  # text column and label column will be kept after processing.
  train_tf_file = util.convert_comments_data(train_tf_file)
  validate_tf_file = util.convert_comments_data(validate_tf_file)

else:
  train_tf_file = tf.keras.utils.get_file('train_tf_processed.tfrecord',
                                          'https://storage.googleapis.com/civil_comments_dataset/train_tf_processed.tfrecord')
  validate_tf_file = tf.keras.utils.get_file('validate_tf_processed.tfrecord',
                                             'https://storage.googleapis.com/civil_comments_dataset/validate_tf_processed.tfrecord')
Before we train the model, let's do a quick audit of our training data using TensorFlow Data Validation, so we can better understand our data distribution.
NOTE: The following cell may take 2–3 minutes to run.
In [0]:
stats = tfdv.generate_statistics_from_tfrecord(data_location=train_tf_file)
tfdv.visualize_statistics(stats)
There are 1.08 million total examples in the training dataset.
The count column tells us how many examples there are for a given feature. Each feature (sexual_orientation, comment_text, gender, etc.) has 1.08 million examples. The missing column tells us what percentage of examples are missing that feature.
Each feature is missing from 0% of examples, so we know that the per-feature example count of 1.08 million is also the total number of examples in the dataset.
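If you'd rather check these numbers in code than in the visualization, the object returned by tfdv.generate_statistics_from_tfrecord is a DatasetFeatureStatisticsList proto; assuming the standard single-dataset layout, a short sketch like this prints the total example count:

# Sketch: read the total example count directly from the TFDV statistics
# proto (assumes the standard DatasetFeatureStatisticsList layout with a
# single dataset entry).
print("Total examples: %d" % stats.datasets[0].num_examples)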
How many unique values are there for the gender feature? What are they, and what are the frequencies of each of these values?
NOTE #1: gender and the other identity features (sexual_orientation, religion, disability, and race) are included in this dataset for evaluation purposes only, so we can assess model performance on different identity slices. The only feature we will use for model training is comment_text.
NOTE #2: We recognize the limitations of the categories used in the original dataset, and acknowledge that these terms do not encompass the full range of vocabulary used in describing gender.
The unique column of the Categorical Features table tells us that there are 4 unique values for the gender feature.
To view the 4 values and their frequencies, we can click on the SHOW RAW DATA button:
The raw data table shows that there are 32,208 examples with a gender value of female, 26,758 examples with a value of male, 1,551 examples with a value of transgender, and 4 examples with a value of other gender.
NOTE: As described earlier, a gender feature can contain zero or more of these 4 values, depending on the content of the comment. For example, a comment containing the text "I am a transgender man" will have both transgender and male as gender values, whereas a comment that does not reference gender at all will have an empty/false gender value.
NOTE: In this dataset, a toxicity value of 0 signifies "not toxic," and a toxicity value of 1 signifies "toxic."
7.98 percent of examples are toxic.
Under Numeric Features, we can see the distribution of values for the toxicity feature. 92.02% of examples have a value of 0 (which signifies "non-toxic"), so 7.98% of examples are toxic.
This is a class-imbalanced dataset, as the overwhelming majority of examples (over 90%) are classified as nontoxic.
In [0]:
#@title Calculate label distribution for gender-related examples
raw_dataset = tf.data.TFRecordDataset(train_tf_file)

toxic_gender_examples = 0
nontoxic_gender_examples = 0

# There are 1,082,924 examples in the dataset
for raw_record in raw_dataset.take(1082924):
  example = tf.train.Example()
  example.ParseFromString(raw_record.numpy())
  if str(example.features.feature["gender"].bytes_list.value) != "[]":
    if str(example.features.feature["toxicity"].float_list.value) == "[1.0]":
      toxic_gender_examples += 1
    else:
      nontoxic_gender_examples += 1

print("Toxic Gender Examples: %s" % toxic_gender_examples)
print("Nontoxic Gender Examples: %s" % nontoxic_gender_examples)
There are 7,189 gender-related examples that are labeled toxic, which represent 14.7% of all gender-related examples.
The percentage of gender-related examples that are toxic (14.7%) is nearly double the percentage of toxic examples overall (7.98%). In other words, in our dataset, gender-related comments are almost two times more likely than comments overall to be labeled as toxic.
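As a quick sanity check, you can derive that comparison directly from the counts printed by the previous cell (a small optional snippet, not part of the exercise):

# Derives the toxic share of gender-related examples from the counts computed
# in the previous cell, and compares it to the overall toxic rate reported by
# TFDV above.
total_gender_examples = toxic_gender_examples + nontoxic_gender_examples
print("Toxic share of gender-related examples: %.1f%%"
      % (100.0 * toxic_gender_examples / total_gender_examples))
print("Overall toxic share (from TFDV): 7.98%")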
This skew suggests that a model trained on this dataset might learn a correlation between gender-related content and toxicity. This raises fairness considerations, as the model might be more likely to classify nontoxic comments as toxic if they contain gender terminology, which could lead to disparate impact for gender subgroups.
In this section, you'll train a classifier on the Civil Comments dataset to predict whether a given text comment is toxic or not.
In order to feed data into our model, we'll need to define both a feature map and an input function.
The feature map configures the features and label we'll be using, and their corresponding data types. The only feature we'll use for training is comment_text. However, we'll also use the features for the five different identity categories (sexual_orientation, gender, religion, race, and disability) for evaluation purposes. Our label is toxicity, which (after our data preprocessing) has a value of either 0 ("not toxic") or 1 ("toxic").
In [0]:
TEXT_FEATURE = 'comment_text'
LABEL = 'toxicity'

FEATURE_MAP = {
    # Label:
    LABEL: tf.io.FixedLenFeature([], tf.float32),
    # Text:
    TEXT_FEATURE: tf.io.FixedLenFeature([], tf.string),

    # Identities:
    'sexual_orientation': tf.io.VarLenFeature(tf.string),
    'gender': tf.io.VarLenFeature(tf.string),
    'religion': tf.io.VarLenFeature(tf.string),
    'race': tf.io.VarLenFeature(tf.string),
    'disability': tf.io.VarLenFeature(tf.string),
}
The input function below specifies how to preprocess and batch the training data into the model.
Because we uncovered a class imbalance when auditing our dataset with TFDV earlier, we'll preprocess our data to add a weight column to each example.
We'll set the weight value for each example to LABEL + 0.1, resulting in a weight of 0.1 for nontoxic examples and a weight of 1.1 for toxic examples. During model training, TensorFlow will multiply each example's loss by its weight, upweighting the toxic examples by increasing the penalty for an error on a toxic example relative to the penalty for an error on a nontoxic example.
Then we'll feed our data into the model in batches of 512 examples.
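To make the effect of this weighting scheme concrete before we build the input function, here's a tiny illustration (not part of the pipeline):

# Illustration only: the per-example weights produced by the LABEL + 0.1
# scheme, and the resulting relative weighting of toxic vs. nontoxic examples.
nontoxic_weight = 0.0 + 0.1   # weight for a nontoxic example (label 0)
toxic_weight = 1.0 + 0.1      # weight for a toxic example (label 1)
print("Toxic examples are weighted %.0fx nontoxic examples."
      % (toxic_weight / nontoxic_weight))  # 11x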
In [0]:
def train_input_fn():
  def parse_function(serialized):
    parsed_example = tf.io.parse_single_example(
        serialized=serialized, features=FEATURE_MAP)
    # Adds a weight column to deal with unbalanced classes.
    parsed_example['weight'] = tf.add(parsed_example[LABEL], 0.1)
    return (parsed_example, parsed_example[LABEL])
  train_dataset = tf.data.TFRecordDataset(
      filenames=[train_tf_file]).map(parse_function).batch(512)
  return train_dataset
Next, create a deep neural network model, and train it on the data. Run the code below to create a DNNClassifier model with 2 hidden layers.
NOTE: For training, the only feature we will feed into the model is an embedding of our comment text (embedded_text_feature_column). The identity features we configured above will only be used to assess model performance later on, in the evaluation phase.
In [0]:
BASE_DIR = tempfile.gettempdir()
model_dir = os.path.join(BASE_DIR, 'train', datetime.now().strftime(
    "%Y%m%d-%H%M%S"))

embedded_text_feature_column = hub.text_embedding_column(
    key=TEXT_FEATURE,
    module_spec='https://tfhub.dev/google/nnlm-en-dim128/1')

classifier = tf.estimator.DNNClassifier(
    hidden_units=[500, 100],
    weight_column='weight',
    feature_columns=[embedded_text_feature_column],
    optimizer=tf.optimizers.Adagrad(learning_rate=0.003),
    loss_reduction=tf.losses.Reduction.SUM,
    n_classes=2,
    model_dir=model_dir)

classifier.train(input_fn=train_input_fn, steps=1000)
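Optionally, before moving on to the fairness analysis, you can run the estimator's built-in evaluation on the validation split as an aggregate sanity check. This sketch is not part of the original exercise; it assumes a validate_input_fn that simply reuses the parsing logic from train_input_fn on validate_tf_file:

# Optional sanity check (not part of the exercise): run the estimator's
# built-in evaluation on the validation split.
def validate_input_fn():
  def parse_function(serialized):
    parsed_example = tf.io.parse_single_example(
        serialized=serialized, features=FEATURE_MAP)
    parsed_example['weight'] = tf.add(parsed_example[LABEL], 0.1)
    return parsed_example, parsed_example[LABEL]
  return tf.data.TFRecordDataset(
      filenames=[validate_tf_file]).map(parse_function).batch(512)

print(classifier.evaluate(input_fn=validate_input_fn, steps=100))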
First, let's export the model we trained in the previous section, so that we can analyze the results using TensorFlow Model Analysis (TFMA).
In [0]:
def eval_input_receiver_fn():
  serialized_tf_example = tf.compat.v1.placeholder(
      dtype=tf.string, shape=[None], name='input_example_placeholder')

  # This *must* be a dictionary containing a single key 'examples', which
  # points to the input placeholder.
  receiver_tensors = {'examples': serialized_tf_example}

  features = tf.io.parse_example(serialized_tf_example, FEATURE_MAP)
  features['weight'] = tf.ones_like(features[LABEL])

  return tfma.export.EvalInputReceiver(
      features=features,
      receiver_tensors=receiver_tensors,
      labels=features[LABEL])

tfma_export_dir = tfma.export.export_eval_savedmodel(
    estimator=classifier,
    export_dir_base=os.path.join(BASE_DIR, 'tfma_eval_model'),
    eval_input_receiver_fn=eval_input_receiver_fn)
Next, run the following code to compute fairness metrics on the model output. Here, we'll compute metrics on our 4 gender slices (female, male, transgender, and other_gender).
NOTE: Depending on your configuration, this step will take 2–10 minutes to run. For this exercise, we recommend leaving compute_confidence_intervals disabled to decrease computation time.
In [0]:
tfma_eval_result_path = os.path.join(BASE_DIR, 'tfma_eval_result')

# NOTE: If you want to explore slicing by other categories, you can change
# the slice_selection value to "sexual_orientation", "religion", "race",
# or "disability".
slice_selection = 'gender'

# Computing confidence intervals can help you make better decisions
# regarding your data, but it requires computing multiple resamples, so it
# takes significantly longer to run, particularly in Colab (which cannot
# take advantage of parallelization). We therefore leave it disabled here.
compute_confidence_intervals = False

# Define slices that you want the evaluation to run on.
slice_spec = [
    tfma.slicer.SingleSliceSpec(),  # Overall slice
    tfma.slicer.SingleSliceSpec(columns=[slice_selection]),
]

# Add the fairness metrics.
add_metrics_callbacks = [
    tfma.post_export_metrics.fairness_indicators(
        thresholds=[0.1, 0.3, 0.5, 0.7, 0.9],
        labels_key=LABEL
    )
]

eval_shared_model = tfma.default_eval_shared_model(
    eval_saved_model_path=tfma_export_dir,
    add_metrics_callbacks=add_metrics_callbacks)

# Run the fairness evaluation.
with beam.Pipeline() as pipeline:
  _ = (
      pipeline
      | 'ReadData' >> beam.io.ReadFromTFRecord(validate_tf_file)
      | 'ExtractEvaluateAndWriteResults' >>
          tfma.ExtractEvaluateAndWriteResults(
              eval_shared_model=eval_shared_model,
              slice_spec=slice_spec,
              compute_confidence_intervals=compute_confidence_intervals,
              output_path=tfma_eval_result_path)
  )
eval_result = tfma.load_eval_result(output_path=tfma_eval_result_path)
Finally, render the Fairness Indicators widget with the exported evaluation results.
In [0]:
widget_view.render_fairness_indicator(eval_result=eval_result, slicing_column=slice_selection)
NOTE: The categories above are not mutually exclusive, as examples can be tagged with zero or more of these gender-identity terms. An example with gender values of both transgender and female will be represented in both the gender:transgender and gender:female slices.
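The widget is the easiest way to explore these results, but if you prefer to inspect them programmatically, the EvalResult object exposes the computed metrics per slice. The exact nesting of the metrics dictionary can vary across TFMA versions, so treat the following as a rough sketch:

# Rough sketch: print the computed metrics for each slice. Inspect the printed
# structure before relying on specific keys, since the nesting of the metrics
# dict depends on the TFMA version.
for slice_key, metrics in eval_result.slicing_metrics:
  print(slice_key)
  print(metrics)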
In the previous TensorFlow Data Validation exercise, we determined that the relatively small proportion of examples with associated gender values, combined with the class-imbalanced nature of the dataset, might result in some bias in the model's predictions.
Now that we've trained the model, we can actually evaluate for gender-related bias. In particular, we can take a closer look at gender-group performance on two metrics related to misclassifications: false positive rate (FPR) and false negative rate (FNR).
Use the Fairness Indicators widget above to answer the following questions.
Select a threshold of 0.5 in the dropdown at the top of the widget. To view overall FPR results, enable the post_export_metrics/false_positive_rate checkbox in the left panel and locate the Overall value in the table below the bar graph. Similarly, to view overall FNR results, enable the post_export_metrics/false_negative_rate checkbox in the left panel, and locate the Overall value in the table.
Our results show that overall FPR@0.5 is 0.28, and overall FNR@0.5 is 0.27.
NOTE: Model training is not deterministic, so your exact evaluation results here may vary slightly from ours.
Select a threshold of 0.5 in the dropdown at the top of the widget. To view FPR results for gender subgroups, enable the post_export_metrics/false_positive_rate checkbox in the left panel. To view FNR results for gender subgroups, enable the post_export_metrics/false_negative_rate checkbox in the left panel.
FPR@0.5
- male: 0.51
- female: 0.47
FNR@0.5
- male: 0.13
- female: 0.15
NOTE: Model training is not deterministic, so your exact evaluation results here may vary slightly from ours.
The Diff w/ baseline column in the Fairness Indicators widget tells us the percent difference between a given subgroup's metric performance and the aggregate (overall) metric performance.
False negative rate is lower for both male and female subgroups (–51% and –45%, respectively) than it is overall. In other words, the model is less likely to misclassify a male- or female-related toxic comment as "nontoxic" than it is to misclassify toxic comments as "nontoxic" overall.
In contrast, the false positive rate is higher for both male and female subgroups (+83% and +69%) than it is overall. In other words, the model is more likely to misclassify a male- or female-related nontoxic comment as "toxic" than it is to misclassify nontoxic comments as "toxic" overall.
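As a quick check of how the Diff w/ baseline percentages are computed, here's the arithmetic using the example FPR values reported above (expect small rounding differences against the widget, and different values for your own run):

# Reproduces the "Diff w/ baseline" calculation for FPR using the example
# values reported above.
overall_fpr = 0.28
for name, fpr in [('male', 0.51), ('female', 0.47)]:
  diff = (fpr - overall_fpr) / overall_fpr
  print('%s FPR diff w/ baseline: %+.0f%%' % (name, 100 * diff))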
NOTE: Model training is not deterministic, so your exact evaluation results here may vary slightly from ours.
This higher FPR raises issues of fairness that should be remediated. If gender-related comments are more likely to be misclassified as "toxic," then in practice, this could result in gender discourse being disproportionately suppressed.
In this section, you'll use the What-If Tool's interactive visual interface to improve your understanding of how the toxic text classifier classifies individual examples, from which you can extrapolate larger insights.
Launch What-If Tool with 1,000 training examples displayed:
In [0]:
# Limit the number of examples to 1000, so that data loads successfully
# for most browser/machine configurations.
DEFAULT_MAX_EXAMPLES = 1000

# Load 100,000 examples into memory.
def wit_dataset(file, num_examples=100000):
  dataset = tf.data.TFRecordDataset(
      filenames=[file]).take(num_examples)
  return [tf.train.Example.FromString(d.numpy()) for d in dataset]

wit_data = wit_dataset(train_tf_file)

# Configure WIT with 1000 examples, the FEATURE_MAP we defined above, and
# a label of 1 for positive (toxic) examples and 0 for negative (nontoxic)
# examples.
config_builder = WitConfigBuilder(
    wit_data[:DEFAULT_MAX_EXAMPLES]).set_estimator_and_feature_spec(
        classifier, FEATURE_MAP).set_label_vocab(
            ['0', '1']).set_target_feature(LABEL)
wit = WitWidget(config_builder)
Using the Binning, Color By, Label by, and Scatter dropdowns at the top of the What-If widget, create a visualization that groups examples by gender, and displays both how each example was categorized by the model (Inference label) and whether the classification was correct (Inference correct).
Here is one possible configuration that groups examples by gender, visualizing both the inference label and whether inference was correct or incorrect for each example:
NOTE: Model training is not deterministic, so your exact results in each category may vary slightly from ours.
In the above visualization, we first set Binning | Y-Axis to gender to bucket examples by gender on the vertical axis. (We've left both Scatter dropdowns at their defaults to clump all the data points together, but you could also scatter by Inference correct or Inference label to split apart different classifications within each gender group.)
Next, we set Color By to Inference correct to color-code each example by whether the inference correctly predicted the ground-truth label. Correct predictions are colored blue, and incorrect predictions are colored red.
Finally, we set Label By to Inference label to add a text label to each example that indicates how the model classified the example. Examples that the model classified positive (toxic) are labeled 1, and examples that the model classified negative (non-toxic) are labeled 0.
False positives are the red examples labeled 1 (outlined in yellow below).
In our visualization, there are 5 false positives in the female bucket.
NOTE: Model training is not deterministic, so your false-positive count may vary slightly from ours.
Can you determine what aspects of the comment text might have influenced the model to incorrectly predict the positive class for the examples you found in Task 2?
Click on one of the false positives you found, and make some edits to the text in the comment_text field in the left panel. Then click Run inference below to see what label the model predicts for the revised text. What changes in the text will result in the model predicting a lower toxicity score?
Try removing gender identity terms from the comments (e.g., women, girl), and see if that results in lower toxicity scores.