Fairness Exercise 1: Explore the Model

Learning Objectives:

  • Train a classifier to predict toxicity of text comments.
  • Explore the Civil Comments dataset and examine the toxic-text classifier's predictions using the What-If Tool.
  • Install and use Fairness Indicators to evaluate the toxic-text classifier's results.
  • Identify the source of bias in the classifier's predictions.

Overview

In this exercise, you'll use Fairness Indicators to evaluate a toxicity classifier trained exclusively on the text comments in the Civil Comments dataset.

About Fairness Indicators

Fairness Indicators is a suite of tools built on top of TensorFlow Model Analysis that enables regular evaluation of fairness metrics in product pipelines.

Fairness Indicators makes it easy for you to ask questions about how your model performs for different groups of users. The suite includes TensorFlow Data Validation, TensorFlow Model Analysis, and the What-If tool.

These tools help you compute common classification fairness metrics, evaluate model performance for defined groups of users, and visualize comparisons against a baseline slice. You can evaluate pipelines of any size, compare results across different thresholds and confidence levels, and deep-dive into individual slices to interrogate your dataset.

Fairness Indicators is packaged with TensorFlow Data Validation and the What-If Tool, so you can evaluate the distribution of your datasets and probe your model down to individual data points.

For a closer look at the Fairness Indicators suite, check out this link. To get started with Fairness Indicators, keep reading.

Setup

Run the following cells to install and import the dependencies we'll be using in this exercise.

First, run the cell below to install Fairness Indicators.

NOTE: You MUST RESTART the Colab runtime after doing this installation, either by clicking the RESTART RUNTIME button at the bottom of this cell or by selecting Runtime->Restart runtime... from the menu bar above.


In [0]:
!pip install fairness-indicators \
  "absl-py==0.8.0" \
  "pyarrow==0.15.1" \
  "apache-beam==2.17.0" \
  "avro-python3==1.9.1" \
  "tfx-bsl==0.21.4" \
  "tensorflow-data-validation==0.21.5"

Next, import all the dependencies we'll use in this exercise, which include Fairness Indicators, TensorFlow Model Analysis (tfma), TensorFlow Data Validation (tfdv), and the What-If tool (WIT):


In [0]:
%tensorflow_version 2.x
import os
import tempfile
import apache_beam as beam
import numpy as np
import pandas as pd
from datetime import datetime

import tensorflow_hub as hub
import tensorflow as tf
import tensorflow_model_analysis as tfma
import tensorflow_data_validation as tfdv
from tensorflow_model_analysis.addons.fairness.post_export_metrics import fairness_indicators
from tensorflow_model_analysis.addons.fairness.view import widget_view
from fairness_indicators.examples import util

from witwidget.notebook.visualization import WitConfigBuilder
from witwidget.notebook.visualization import WitWidget

Part I: Audit the Data

In this section, you'll audit the Civil Comments dataset to proactively identify fairness considerations prior to training the model.

About the Civil Comments dataset

Click below to learn more about the Civil Comments dataset, and how we've preprocessed it for this exercise.

The Civil Comments dataset comprises approximately 2 million public comments that were submitted to the Civil Comments platform. Jigsaw sponsored the effort to compile and annotate these comments for ongoing research; they've also hosted competitions on Kaggle to help classify toxic comments as well as minimize unintended model bias.

Features

Within the Civil Comments data, a subset of comments are tagged with a variety of identity attributes pertaining to gender, sexual orientation, religion, race, and ethnicity. Each identity annotation column contains a value that represents the percentage of annotators who categorized a comment as containing references to that identity. Multiple identities may be present in a comment.

NOTE: These identity attributes are intended for evaluation purposes only, to assess how well a classifier trained solely on the comment text performs on different tag sets.

To collect these identity labels, each comment was reviewed by up to 10 annotators, who were asked to indicate all identities that were mentioned in the comment. For example, annotators were posed the question: "What genders are mentioned in the comment?", and asked to choose all of the following categories that were applicable.

  • Male
  • Female
  • Transgender
  • Other gender
  • No gender mentioned

NOTE: We recognize the limitations of the categories used in the original dataset, and acknowledge that these terms do not encompass the full range of vocabulary used in describing gender.

Jigsaw used these ratings to generate an aggregate score for each identity attribute representing the percentage of raters who said the identity was mentioned in the comment. For example, if 10 annotators reviewed a comment, and 6 said that the comment mentioned the identity "female" and 0 said that the comment mentioned the identity "male," the comment would receive a female score of 0.6 and a male score of 0.0.

NOTE: For the purposes of annotation, a comment was considered to "mention" gender if it contained a comment about gender issues (e.g., a discussion about feminism, wage gap between men and women, transgender rights, etc.), gendered language, or gendered insults. Use of "he," "she," or gendered names (e.g., Donald, Margaret) did not require a gender label.

Label

Each comment was rated by up to 10 annotators for toxicity, who each classified it with one of the following ratings.

  • Very Toxic
  • Toxic
  • Hard to Say
  • Not Toxic

Again, Jigsaw used these ratings to generate an aggregate toxicity "score" for each comment (ranging from 0.0 to 1.0) to serve as the label, representing the fraction of annotators who labeled the comment either "Very Toxic" or "Toxic." For example, if 10 annotators rated a comment, and 3 of them labeled it "Very Toxic" and 5 of them labeled it "Toxic", the comment would receive a toxicity score of 0.8.
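
To make the aggregation concrete, here's a minimal sketch of that arithmetic in code. The helper function and vote counts below are purely illustrative; they are not part of Jigsaw's actual annotation pipeline.


In [0]:
# Illustrative only: recompute the aggregate scores described above from
# hypothetical annotator vote counts.

def aggregate_score(positive_votes, total_annotators):
  """Fraction of annotators who gave the rating in question."""
  return positive_votes / total_annotators

# Identity example: 6 of 10 annotators said the comment mentions "female",
# and 0 of 10 said it mentions "male".
female_score = aggregate_score(6, 10)   # 0.6
male_score = aggregate_score(0, 10)     # 0.0

# Toxicity example: of 10 annotators, 3 chose "Very Toxic" and 5 chose "Toxic".
toxicity_score = aggregate_score(3 + 5, 10)  # 0.8

print(female_score, male_score, toxicity_score)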

NOTE: For more information on the Civil Comments labeling schema, see the Data section of the Jigsaw Unintended Bias in Toxicity Classification Kaggle competition.

Example

Here are the feature values for one example in the dataset:

  • comment_text: i'm a white woman in my late 60's and believe me, they are not too crazy about me either!!
  • female: 1.0
  • white: 1.0

All raters tagged this comment with the labels female and white, giving the example scores of 1.0 for each of these identity mention labels.

NOTE: All other identity labels (e.g., male, asian) had values of 0.0.

Here's the label for this example:

  • toxicity: 0.0

All raters labeled the above comment "not toxic," which resulted in a toxicity label of 0.0.

Preprocessing the data

For the purposes of this exercise, we converted toxicity and identity columns to booleans in order to work with our neural net and metrics calculations. In the preprocessed dataset, we considered any value ≥ 0.5 as True (i.e., a comment is considered toxic if 50% or more crowd raters labeled it as toxic).

For identity labels, we applied the same 0.5 threshold and then grouped the individual identities together by category. For example, if a comment has { male: 0.3, female: 1.0, transgender: 0.0, heterosexual: 0.8, homosexual_gay_or_lesbian: 1.0 }, after processing the data will be { gender: [female], sexual_orientation: [heterosexual, homosexual_gay_or_lesbian] }.

NOTE: Missing identity fields were converted to False.
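
If you're curious what this thresholding and grouping looks like in code, here's a minimal sketch. The actual conversion is performed by util.convert_comments_data (used in the next section); the category mapping and function below are simplified assumptions for illustration only.


In [0]:
# Illustrative sketch of the preprocessing described above; the real
# conversion is done by fairness_indicators.examples.util. The category
# mapping here is a simplified assumption covering only two categories.
IDENTITY_CATEGORIES = {
    'gender': ['male', 'female', 'transgender', 'other_gender'],
    'sexual_orientation': ['heterosexual', 'homosexual_gay_or_lesbian'],
}

def group_identities(raw_scores, threshold=0.5):
  """Groups identity terms by category, keeping those at or above threshold."""
  grouped = {}
  for category, terms in IDENTITY_CATEGORIES.items():
    # Missing identity fields default to 0.0 (i.e., False).
    grouped[category] = [
        term for term in terms if raw_scores.get(term, 0.0) >= threshold]
  return grouped

raw = {'male': 0.3, 'female': 1.0, 'transgender': 0.0,
       'heterosexual': 0.8, 'homosexual_gay_or_lesbian': 1.0}
print(group_identities(raw))
# {'gender': ['female'],
#  'sexual_orientation': ['heterosexual', 'homosexual_gay_or_lesbian']}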

Example

After preprocessing, here's the revised feature and label data for the example from above:

  • comment_text: i'm a white woman in my late 60's and believe me, they are not too crazy about me either!!
  • gender: [female]
  • race: [white]
  • disability: []
  • religion: []
  • sexual_orientation: []
  • toxicity: 0.0

Load the data

We've posted copies of both the original Civil Comments dataset and our preprocessed data on Google Cloud Platform (in TFRecord format) to make it easy to import into this notebook.

Run the following cell to download and import the training and validation datasets. By default, the following code will load the preprocessed data. If you prefer, you can enable the download_original_data checkbox at right to download the original dataset and preprocess it as described in the previous section (this may take 5-10 minutes).


In [0]:
download_original_data = False #@param {type:"boolean"}

if download_original_data:
  train_tf_file = tf.keras.utils.get_file('train_tf.tfrecord',
                                          'https://storage.googleapis.com/civil_comments_dataset/train_tf.tfrecord')
  validate_tf_file = tf.keras.utils.get_file('validate_tf.tfrecord',
                                             'https://storage.googleapis.com/civil_comments_dataset/validate_tf.tfrecord')

  # The identity terms list will be grouped together by their categories
  # (see 'IDENTITY_COLUMNS') at a threshold of 0.5. Only the identity term
  # columns, text column, and label column will be kept after processing.
  train_tf_file = util.convert_comments_data(train_tf_file)
  validate_tf_file = util.convert_comments_data(validate_tf_file)

else:
  train_tf_file = tf.keras.utils.get_file('train_tf_processed.tfrecord',
                                          'https://storage.googleapis.com/civil_comments_dataset/train_tf_processed.tfrecord')
  validate_tf_file = tf.keras.utils.get_file('validate_tf_processed.tfrecord',
                                             'https://storage.googleapis.com/civil_comments_dataset/validate_tf_processed.tfrecord')

Explore the data distribution in TFDV

Before we train the model, let's do a quick audit of our training data using TensorFlow Data Validation, so we can better understand our data distribution.

NOTE: The following cell may take 2–3 minutes to run.


In [0]:
stats = tfdv.generate_statistics_from_tfrecord(data_location=train_tf_file)
tfdv.visualize_statistics(stats)

Exercise

Use the TensorFlow Data Validation widget above to answer the following questions.

1. How many total examples are in the training dataset?

Solution

Click below for the solution.

There are 1.08 million total examples in the training dataset.

The count column tells us how many examples there are for a given feature. Each feature (sexual_orientation, comment_text, gender, etc.) has 1.08 million examples. The missing column tells us what percentage of examples are missing that feature.

Each feature is missing from 0% of examples, so we know that the per-feature example count of 1.08 million is also the total number of examples in the dataset.
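
If you'd rather confirm the count programmatically than read it off the widget, the statistics proto returned by TFDV records it directly. Here's a quick sketch using the stats variable from the cell above:


In [0]:
# The DatasetFeatureStatisticsList returned by TFDV stores the total example
# count for each dataset it summarizes.
print(stats.datasets[0].num_examples)  # ~1,082,924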

2. How many unique values are there for the gender feature? What are they, and what are the frequencies of each of these values?

NOTE #1: gender and the other identity features (sexual_orientation, religion, disability, and race) are included in this dataset for evaluation purposes only, so we can assess model performance on different identity slices. The only feature we will use for model training is comment_text.

NOTE #2: We recognize the limitations of the categories used in the original dataset, and acknowledge that these terms do not encompass the full range of vocabulary used in describing gender.

Solution

Click below for the solution.

The unique column of the Categorical Features table tells us that there are 4 unique values for the gender feature.

To view the 4 values and their frequencies, we can click on the SHOW RAW DATA button:

The raw data table shows that there are 32,208 examples with a gender value of female, 26,758 examples with a value of male, 1,551 examples with a value of transgender, and 4 examples with a value of other gender.

NOTE: As described earlier, a gender feature can contain zero or more of these 4 values, depending on the content of the comment. For example, a comment containing the text "I am a transgender man" will have both transgender and male as gender values, whereas a comment that does not reference gender at all will have an empty/false gender value.

3. What percentage of total examples are labeled toxic? Overall, is this a class-balanced dataset (relatively even split of examples between positive and negative classes) or a class-imbalanced dataset (majority of examples are in one class)?

NOTE: In this dataset, a toxicity value of 0 signifies "not toxic," and a toxicity value of 1 signifies "toxic."

Solution

Click below for the solution.

7.98 percent of examples are toxic.

Under Numeric Features, we can see the distribution of values for the toxicity feature. 92.02% of examples have a value of 0 (which signifies "non-toxic"), so 7.98% of examples are toxic.

This is a class-imbalanced dataset, as the overwhelming majority of examples (over 90%) are classified as nontoxic.

4. Run the following code to analyze the label distribution for the subset of examples that contain a gender value.


In [0]:
#@title Calculate label distribution for gender-related examples
raw_dataset = tf.data.TFRecordDataset(train_tf_file)

toxic_gender_examples = 0
nontoxic_gender_examples = 0

# There are 1,082,924 examples in the dataset
for raw_record in raw_dataset.take(1082924):
  example = tf.train.Example()
  example.ParseFromString(raw_record.numpy())
  if example.features.feature["gender"].bytes_list.value:
    if list(example.features.feature["toxicity"].float_list.value) == [1.0]:
      toxic_gender_examples += 1
    else:
      nontoxic_gender_examples += 1

print("Toxic Gender Examples: %s" % toxic_gender_examples)
print("Nontoxic Gender Examples: %s" % nontoxic_gender_examples)

What percentage of gender examples are labeled toxic? Compare this percentage to the percentage of total examples that are labeled toxic from #3 above. What, if any, fairness concerns can you identify based on this comparison?

Solution

Click below for one possible solution.

There are 7,189 gender-related examples that are labeled toxic, which represent 14.7% of all gender-related examples.

The percentage of gender-related examples that are toxic (14.7%) is nearly double the percentage of toxic examples overall (7.98%). In other words, in our dataset, gender-related comments are almost two times more likely than comments overall to be labeled as toxic.
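
As a quick sanity check, here's the arithmetic behind that comparison, using the counts printed by the cell above (the 7.98% overall rate comes from the TFDV results in #3; your exact numbers may vary slightly):


In [0]:
# Back-of-the-envelope comparison using the counts computed above.
gender_toxic_rate = toxic_gender_examples / (
    toxic_gender_examples + nontoxic_gender_examples)
overall_toxic_rate = 0.0798  # from the toxicity distribution in #3

print("Gender-related toxic rate: %.1f%%" % (100 * gender_toxic_rate))
print("Ratio to overall toxic rate: %.1fx"
      % (gender_toxic_rate / overall_toxic_rate))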

This skew suggests that a model trained on this dataset might learn a correlation between gender-related content and toxicity. This raises fairness considerations, as the model might be more likely to classify nontoxic comments as toxic if they contain gender terminology, which could lead to disparate impact for gender subgroups.

Part II: Train the model

In this section, you'll train a classifier on the Civil Comments dataset to predict whether a given text comment is toxic or not.

Configure model input

In order to feed data into our model, we'll need to define both a feature map and an input function.

The feature map configures the features and label we'll be using, and their corresponding data types. The only feature we'll use for training is the comment_text. However, we'll also use the features for the five different identity categories (sexual_orientation, gender, religion, race, and disability) for evaluation purposes. Our label is toxicity, which (after our data preprocessing) has a value of either 0 ("not toxic") or 1 ("toxic").


In [0]:
TEXT_FEATURE = 'comment_text'
LABEL = 'toxicity'

FEATURE_MAP = {
    # Label:
    LABEL: tf.io.FixedLenFeature([], tf.float32),
    # Text:
    TEXT_FEATURE:  tf.io.FixedLenFeature([], tf.string),

    # Identities:
    'sexual_orientation':tf.io.VarLenFeature(tf.string),
    'gender':tf.io.VarLenFeature(tf.string),
    'religion':tf.io.VarLenFeature(tf.string),
    'race':tf.io.VarLenFeature(tf.string),
    'disability':tf.io.VarLenFeature(tf.string),
}

The input function below specifies how to preprocess and batch the training data into the model.

Because we uncovered a class imbalance when auditing our dataset with TFDV earlier, we'll preprocess our data to add a weight column to each example. We'll set the weight value for each example to LABEL + 0.1, resulting in a weight of 0.1 for nontoxic examples and a weight of 1.1 for toxic examples. During model training, TensorFlow will multiply each example's loss by its weight, which upweights the toxic examples: an error on a toxic example is penalized 11 times as heavily as an error on a nontoxic example.

Then we'll feed our data into the model in batches of 512 examples.


In [0]:
def train_input_fn():
  def parse_function(serialized):
    parsed_example = tf.io.parse_single_example(
        serialized=serialized, features=FEATURE_MAP)
    # Adds a weight column to deal with unbalanced classes.
    parsed_example['weight'] = tf.add(parsed_example[LABEL], 0.1)
    return (parsed_example,
            parsed_example[LABEL])
  train_dataset = tf.data.TFRecordDataset(
      filenames=[train_tf_file]).map(parse_function).batch(512)
  return train_dataset

Train the model

Next, create a deep neural network model and train it on the data. Run the code below to create a DNNClassifier model with two hidden layers.

NOTE: For training, the only feature we will feed into the model is an embedding of our comment text (embedded_text_feature_column). The identity features we configured above will only be used to assess model performance later on in the evaluation phase.


In [0]:
BASE_DIR = tempfile.gettempdir()

model_dir = os.path.join(BASE_DIR, 'train', datetime.now().strftime(
    "%Y%m%d-%H%M%S"))

embedded_text_feature_column = hub.text_embedding_column(
    key=TEXT_FEATURE,
    module_spec='https://tfhub.dev/google/nnlm-en-dim128/1')

classifier = tf.estimator.DNNClassifier(
    hidden_units=[500, 100],
    weight_column='weight',
    feature_columns=[embedded_text_feature_column],
    optimizer=tf.optimizers.Adagrad(learning_rate=0.003),
    loss_reduction=tf.losses.Reduction.SUM,
    n_classes=2,
    model_dir=model_dir)

classifier.train(input_fn=train_input_fn, steps=1000)

Part III: Run Fairness Indicators

In this section you'll use Fairness Indicators to evaluate the model's results for different subgroups of comments. Specifically, you'll take a closer look at performance for different gender categories.

Export the model

First, let's export the model we trained in the previous section, so that we can analyze the results using TensorFlow Model Analysis (TFMA).


In [0]:
def eval_input_receiver_fn():
  serialized_tf_example = tf.compat.v1.placeholder(
      dtype=tf.string, shape=[None], name='input_example_placeholder')

  # This *must* be a dictionary containing a single key 'examples', which
  # points to the input placeholder.
  receiver_tensors = {'examples': serialized_tf_example}

  features = tf.io.parse_example(serialized_tf_example, FEATURE_MAP)
  features['weight'] = tf.ones_like(features[LABEL])

  return tfma.export.EvalInputReceiver(
    features=features,
    receiver_tensors=receiver_tensors,
    labels=features[LABEL])

tfma_export_dir = tfma.export.export_eval_savedmodel(
  estimator=classifier,
  export_dir_base=os.path.join(BASE_DIR, 'tfma_eval_model'),
  eval_input_receiver_fn=eval_input_receiver_fn)

Compute fairness metrics

Next, run the following code to compute fairness metrics on the model output. Here, we'll compute metrics on our 4 gender slices (female, male, transgender, and other_gender).

NOTE: Depending on your configurations, this step will take 2–10 minutes to run. For this exercise, we recommend leaving compute_confidence_intervals disabled to decrease computation time.


In [0]:
tfma_eval_result_path = os.path.join(BASE_DIR, 'tfma_eval_result')

# NOTE: If you want to explore slicing by other categories, you can change
# the slice_selection value to "sexual_orientation", "religion", "race",
# or "disability".
slice_selection = 'gender' 

# Computing confidence intervals can help you make better decisions 
# regarding your data, but it requires computing multiple resamples, 
# so it takes significantly longer to run, particularly in Colab 
# (which cannot take advantage of parallelization), 
# so we leave it disabled here.
compute_confidence_intervals = False

# Define slices that you want the evaluation to run on.
slice_spec = [
    tfma.slicer.SingleSliceSpec(), # Overall slice
    tfma.slicer.SingleSliceSpec(columns=[slice_selection]),
]

# Add the fairness metrics.
add_metrics_callbacks = [
  tfma.post_export_metrics.fairness_indicators(
      thresholds=[0.1, 0.3, 0.5, 0.7, 0.9],
      labels_key=LABEL
      )
]

eval_shared_model = tfma.default_eval_shared_model(
    eval_saved_model_path=tfma_export_dir,
    add_metrics_callbacks=add_metrics_callbacks)

# Run the fairness evaluation.
with beam.Pipeline() as pipeline:
  _ = (
      pipeline
      | 'ReadData' >> beam.io.ReadFromTFRecord(validate_tf_file)
      | 'ExtractEvaluateAndWriteResults' >>
       tfma.ExtractEvaluateAndWriteResults(
                 eval_shared_model=eval_shared_model,
                 slice_spec=slice_spec,
                 compute_confidence_intervals=compute_confidence_intervals,
                 output_path=tfma_eval_result_path)
  )

eval_result = tfma.load_eval_result(output_path=tfma_eval_result_path)

Finally, render the Fairness Indicators widget with the exported evaluation results.


In [0]:
widget_view.render_fairness_indicator(eval_result=eval_result, slicing_column=slice_selection)

NOTE: The categories above are not mutually exclusive, as examples can be tagged with zero or more of these gender-identity terms. An example with gender values of both transgender and female will be represented in both the gender:transgender and gender:female slices.

Exercise

In the Part I data audit with TensorFlow Data Validation, we determined that gender-related examples make up a relatively small proportion of the dataset and are labeled toxic at nearly twice the overall rate, which might result in some bias in the model's predictions.

Now that we've trained the model, we can actually evaluate for gender-related bias. In particular, we can take a closer look at gender-group performance on the following two metrics related to misclassifications:

  • False positive rate (FPR), which tells us the percentage of actual "not toxic" comments that were incorrectly classified as "toxic"
  • False negative rate (FNR), which tells us the percentage of actual "toxic" comments that were incorrectly classified as "not toxic"
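
As a concrete reference, here's a minimal sketch of how these two rates are computed from a confusion matrix. The counts below are made up for illustration; Fairness Indicators computes the real values for you.


In [0]:
# Illustrative confusion-matrix counts (not taken from the actual model).
true_positives = 80    # toxic comments correctly classified as "toxic"
false_negatives = 20   # toxic comments misclassified as "not toxic"
true_negatives = 850   # nontoxic comments correctly classified as "not toxic"
false_positives = 50   # nontoxic comments misclassified as "toxic"

# FPR: share of actual "not toxic" comments classified as "toxic".
fpr = false_positives / (false_positives + true_negatives)
# FNR: share of actual "toxic" comments classified as "not toxic".
fnr = false_negatives / (false_negatives + true_positives)

print("FPR: %.3f  FNR: %.3f" % (fpr, fnr))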

Use the Fairness Indicators widget above to answer the following questions.

1. What are the overall false positive rate (FPR) and false negative rate (FNR) for the model at a classification threshold of 0.5?

Solution

Click below for the solution.

Select a threshold of 0.5 in the dropdown at the top of the widget. To view overall FPR results, enable the post_export_metrics/false_positive_rate checkbox in the left panel and locate the Overall value in the table below the bar graph. Similarly, to view overall FNR results, enable the post_export_metrics/false_negative_rate checkbox in the left panel, and locate the Overall value in the table.

Our results show that overall FPR@0.5 is 0.28, and overall FNR@0.5 is 0.27.

NOTE: Model training is not deterministic, so your exact evaluation results here may vary slightly from ours.

2. What are the FPR@0.5 and FNR@0.5 for the following gender subgroups:

  • male
  • female

Solution

Click below for the solution.

Select a threshold of 0.5 in the dropdown at the top of the widget. To view FPR results for gender subgroups, enable the post_export_metrics/false_positive_rate checkbox in the left panel. To view FNR results for gender subgroups, enable the post_export_metrics/false_negative_rate checkbox in the left panel.

FPR@0.5

  • male: 0.51
  • female: 0.47

FNR@0.5

  • male: 0.13
  • female: 0.15

NOTE: Model training is not deterministic, so your exact evaluation results here may vary slightly from ours.

3. What fairness considerations can you identify by comparing aggregate FPR and FNR from #1 above to subgroup FPR and FNR from #2 above?

Solution

Click below for a solution.

The Diff w/ baseline column in the Fairness Indicators widget tells us the percent difference between a given subgroup's metric performance and the aggregate (overall) metric performance.

False negative rate is lower for both male and female subgroups (–51% and –45%, respectively) than it is overall. In other words, the model is less likely to misclassify a male- or female-related toxic comment as "nontoxic" than it is to misclassify toxic comments as "nontoxic" overall.

In contrast, the false positive rate is higher for both male and female subgroups (+83% and +69%) than it is overall. In other words, the model is more likely to misclassify a male- or female-related nontoxic comment as "toxic" than it is to misclassify nontoxic comments as "toxic" overall.
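
Here's a quick sketch of how those Diff w/ baseline percentages follow from the metric values quoted above (the widget computes these for you; small differences from the quoted percentages come from rounding of the displayed metrics):


In [0]:
def diff_with_baseline(subgroup_value, overall_value):
  """Percent difference between a subgroup metric and the overall metric."""
  return 100 * (subgroup_value - overall_value) / overall_value

# Using the rounded FNR/FPR values from #1 and #2 above.
print(diff_with_baseline(0.13, 0.27))  # male FNR:   about -52%
print(diff_with_baseline(0.15, 0.27))  # female FNR: about -44%
print(diff_with_baseline(0.51, 0.28))  # male FPR:   about +82%
print(diff_with_baseline(0.47, 0.28))  # female FPR: about +68%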

NOTE: Model training is not deterministic, so your exact evaluation results here may vary slightly from ours.

This higher FPR raises issues of fairness that should be remediated. If gender-related comments are more likely to be misclassified as "toxic," then in practice, this could result in gender discourse being disproportionately suppressed.

Part IV: Dig Deeper into the Data

In this section, you'll use the What-If Tool's interactive visual interface to improve your understanding of how the toxic text classifier classifies individual examples, from which you can extrapolate larger insights.

WARNING: When you launch the What-If tool widget below, the left panel will display the full text of individual comments from the Civil Comments dataset. Some of these comments include profanity, offensive statements, and offensive statements involving identity terms. Feel free to skip Part IV if this is a concern.

Launch What-If Tool with 1,000 training examples displayed:


In [0]:
# Limit the number of examples to 1000, so that data loads successfully
# for most browser/machine configurations. 
DEFAULT_MAX_EXAMPLES = 1000

# Load up to 100,000 examples into memory.
def wit_dataset(file, num_examples=100000):
  dataset = tf.data.TFRecordDataset(
      filenames=[file]).take(num_examples)
  return [tf.train.Example.FromString(d.numpy()) for d in dataset]

wit_data = wit_dataset(train_tf_file)

# Configure WIT with 1000 examples, the FEATURE_MAP we defined above, and
# a label of 1 for positive (toxic) examples and 0 for negative (nontoxic)
# examples
config_builder = WitConfigBuilder(wit_data[:DEFAULT_MAX_EXAMPLES]).set_estimator_and_feature_spec(
    classifier, FEATURE_MAP).set_label_vocab(['0', '1']).set_target_feature(LABEL)
wit = WitWidget(config_builder)

Exercise

Use the What-If tool to complete the following tasks and answer the associated questions.

Task 1

Using the Binning, Color By, Label by, and Scatter dropdowns at the top of the What-If widget, create a visualization that groups examples by gender, and displays both how each example was categorized (Inference label) by the model and whether the classification was correct (Inference correct).

Solution

Click below for one possible solution.

Here is one possible configuration that groups examples by gender, visualizing both the inference label and whether inference was correct or incorrect for each example:

NOTE: Model training is not deterministic, so your exact results in each category may vary slightly from ours.

In the above visualization, we first set Binning | Y-Axis to gender to bucket examples by gender on the vertical axis. (We've set both Scatter dropdowns to default to clump all the data points together, but you could also scatter by Inference correct or Inference label to split apart different classifications within each gender group.)

Next, we set Color By to Inference correct to color-code each example by whether the inference correctly predicted the ground-truth label. Correct predictions are colored blue, and incorrect predictions are colored red.

Finally, we set Label By to Inference label to add a text label to each example that indicates how the model classified the example. Examples that the model classified positive (toxic) are labeled 1, and examples that the model classified negative (non-toxic) are labeled 0.

Task 2

Use the visualization you created in Task 1 to locate the false positives (examples where the ground-truth label is "nontoxic" but the model predicted "toxic") in the female bucket. How many false positives are there?

Solution

Click below for the solution.

False positives are the red examples labeled 1. In our visualization, there are 5 false positives in the female bucket.

NOTE: Model training is not deterministic, so your false-positive count may vary slightly from ours.

Task 3

Can you determine what aspects of the comment text might have influenced the model to incorrectly predict the positive class for the examples you found in Task 2?

Click on one of the false positives you found, and make some edits to the text in the comment_text field in the left panel. Then click Run inference below to see what label the model predicts for the revised text. What changes in the text will result in the model predicting a lower toxicity score?

Solution

Click below for one possible avenue to pursue.

Try removing gender identity terms from the comments (e.g., women, girl), and see if that results in lower toxicity scores.