Hands-on Tutorial: Measuring Unintended Bias in Text Classification Models with Real Data

Copyright 2019 Google LLC. SPDX-License-Identifier: Apache-2.0

Unintended bias is a major challenge for machine learning systems. In this tutorial, we will demonstrate a way to measure unintended bias in a text classification model using a large set of online comments which have been labeled for toxicity and identity references. We will provide participants with starter code that builds and evaluates a machine learning model, written using open source Python libraries. Using this code they can explore different ways to measure and visualize model bias. At the end of this tutorial, participants should walk away with new techniques for bias measurement.

WARNING: Some text examples in this notebook include profanity, offensive statements, and offensive statements involving identity terms. If you would rather not read such material, please feel free to skip this notebook.

To get started, please click "CONNECT" in the top right of the screen. You can use SHIFT + ↲ to run cells in this notebook. Please be sure to run each cell before moving on to the next cell in the notebook.


In [0]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import datetime
import os
import pandas as pd
import numpy as np
import pkg_resources
import matplotlib.pyplot as plt
import seaborn as sns
import time
import scipy.stats as stats

from sklearn import metrics

from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding
from keras.layers import Input
from keras.layers import Conv1D
from keras.layers import MaxPooling1D
from keras.layers import Flatten
from keras.layers import Dropout
from keras.layers import Dense
from keras.optimizers import RMSprop
from keras.models import Model
from keras.models import load_model

%matplotlib inline

# autoreload makes it easier to interactively work on code in imported libraries
%load_ext autoreload
%autoreload 2

# Set pandas display options so we can read more of the comment text.
pd.set_option('max_colwidth', 300)

# Download and unzip files used in this colab
!curl -O -J -L https://storage.googleapis.com/civil_comments/fat_star_tutorial/fat-star.zip
!unzip -o fat-star.zip

# Seed for Pandas sampling, to get consistent sampling results
RANDOM_STATE = 123456789

Install library and data dependencies


Load and pre-process data sets


In [0]:
# Read the initial train, test, and validate data into Pandas dataframes.
train_df_float = pd.read_csv('public_train.csv')
test_df_float = pd.read_csv('public_test.csv')
validate_df_float = pd.read_csv('public_validate.csv')

print('training data has %d rows' % len(train_df_float))
print('validation data has %d rows' % len(validate_df_float))
print('test data has %d rows' % len(test_df_float))
print('training data columns are: %s' % train_df_float.columns)

Let's examine some rows in these datasets.


In [0]:
train_df_float.head()

Understanding the data

There are many columns in the data set; some of the columns you may want to pay closer attention to are:

  • comment_text: this is the text which we will pass into our model.
  • toxicity: this is the percentage of raters who labeled this comment as being toxic.
  • identity columns, such as "male", "female", "white", "black", and others: these are the percentages of raters who labeled this comment as referring to a given identity. Unlike comment_text and toxicity, these columns may be missing for many rows and will initially display as NaN.

Let's now look at some unprocessed rows. We will filter the output to show only the "toxicity", "male", and "comment_text" columns, but keep in mind that there are 24 identity columns in total.


In [0]:
pd.concat([
    # Select 3 rows where 100% of raters said it applied to the male identity.
    train_df_float[['toxicity', 'male', 'comment_text']].query('male == 1').head(3),
    # Select 3 rows where 50% of raters said it applied to the male identity.
    train_df_float[['toxicity', 'male', 'comment_text']].query('male == 0.5').head(3),
    # Select 3 rows where 0% of raters said it applied to the male identity.
    train_df_float[['toxicity', 'male', 'comment_text']].query('male == 0.0').head(3),
    # Select 3 rows that were not labeled for the male identity (have NaN values).
    # See https://stackoverflow.com/questions/26535563 if you would like to
    # understand this Pandas behavior.
    train_df_float[['toxicity', 'male', 'comment_text']].query('male != male').head(3)])

We will need to convert the toxicity and identity columns to booleans in order to work with our neural net and metrics calculations. For this tutorial, we will consider any value >= 0.5 as True (i.e. a comment is considered toxic if 50% or more crowd raters labeled it as toxic). Note that this code also converts missing identity fields to False.


In [0]:
# List all identities
identity_columns = [
    'male', 'female', 'transgender', 'other_gender', 'heterosexual',
    'homosexual_gay_or_lesbian', 'bisexual', 'other_sexual_orientation', 'christian',
    'jewish', 'muslim', 'hindu', 'buddhist', 'atheist', 'other_religion', 'black',
    'white', 'asian', 'latino', 'other_race_or_ethnicity',
    'physical_disability', 'intellectual_or_learning_disability',
    'psychiatric_or_mental_illness', 'other_disability']

def convert_to_bool(df, col_name):
  df[col_name] = np.where(df[col_name] >= 0.5, True, False)

def convert_dataframe_to_bool(df):
  bool_df = df.copy()
  for col in ['toxicity'] + identity_columns:
      convert_to_bool(bool_df, col)
  return bool_df

train_df = convert_dataframe_to_bool(train_df_float)
validate_df = convert_dataframe_to_bool(validate_df_float)
test_df = convert_dataframe_to_bool(test_df_float)
    
train_df[['toxicity', 'male', 'comment_text']].sample(5, random_state=RANDOM_STATE)

Exercise #1

  • Count the number of comments in the training set which are labeled as referring to the "female" group.
  • What percentage of comments which are labeled as referring to the "female" group are toxic?
  • How does this percentage compare to other identity groups in the training set?
  • How does this compare to the percentage of toxic comments in the entire training set?

In [0]:
# Your code here
#
# HINT: you can query dataframes for identities using code like:
#   train_df.query('black == True')
# and 
#   train_df.query('toxicity == True')
#
# You can print the identity_columns variable to see the full list of identities
# labeled by crowd raters.
#
# Pandas Dataframe documentation is available at https://pandas.pydata.org/pandas-docs/stable/api.html#dataframe

Solution (click to expand)


In [0]:
def print_count_and_percent_toxic(df, identity):
  # Query all comments in df where the identity column equals True.
  identity_comments = df.query(identity + ' == True')

  # Query which of those comments also have "toxicity" equal to True.
  toxic_identity_comments = identity_comments.query('toxicity == True')
  # Alternatively, you could write a single query using & (and), e.g.:
  # toxic_identity_comments = df.query(identity + ' == True & toxicity == True')

  # Print the results.
  num_comments = len(identity_comments)
  percent_toxic = len(toxic_identity_comments) / num_comments 
  print('%d comments refer to the %s identity, %.2f%% are toxic' % (
    num_comments,
    identity,
    # multiply percent_toxic by 100 for easier reading.
    100 * percent_toxic))

# Print values for comments labeled as referring to the female identity
print_count_and_percent_toxic(train_df, 'female')

# Compare this with comments labeled as referring to the male identity
print_count_and_percent_toxic(train_df, 'male')

# Print the percent toxicity for the entire training set
all_toxic_df = train_df.query('toxicity == True')
print('%.2f%% of all comments are toxic' %
  (100 * len(all_toxic_df) / len(train_df)))

Define a text classification model

This code creates and trains a convolutional neural net using the Keras framework. The neural net accepts a text comment, encoded using GloVe embeddings, and outputs a probability that the comment is toxic. Don't worry if you do not understand all of this code, as we will treat this neural net as a black box later in the tutorial.

Note that for this colab, we will be loading pretrained models from disk, rather than using this code to train a new model which would take over 30 minutes.


In [0]:
MAX_NUM_WORDS = 10000
TOXICITY_COLUMN = 'toxicity'
TEXT_COLUMN = 'comment_text'

# Create a text tokenizer.
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(train_df[TEXT_COLUMN])

# All comments must be truncated or padded to be the same length.
MAX_SEQUENCE_LENGTH = 250
def pad_text(texts, tokenizer):
    return pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=MAX_SEQUENCE_LENGTH)

# Load the first model from disk.
model = load_model('model_2_3_4.h5')

Optional: dive into model architecture

Expand this code to see how our text classification model is defined, and optionally train your own model. Warning: training a new model may take over 30 minutes.


In [0]:
EMBEDDINGS_PATH = 'glove.6B.100d.txt'
EMBEDDINGS_DIMENSION = 100
DROPOUT_RATE = 0.3
LEARNING_RATE = 0.00005
NUM_EPOCHS = 10
BATCH_SIZE = 128

def train_model(train_df, validate_df, tokenizer):
    # Prepare data
    train_text = pad_text(train_df[TEXT_COLUMN], tokenizer)
    train_labels = to_categorical(train_df[TOXICITY_COLUMN])
    validate_text = pad_text(validate_df[TEXT_COLUMN], tokenizer)
    validate_labels = to_categorical(validate_df[TOXICITY_COLUMN])

    # Load embeddings
    embeddings_index = {}
    with open(EMBEDDINGS_PATH) as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs

    embedding_matrix = np.zeros((len(tokenizer.word_index) + 1,
                                 EMBEDDINGS_DIMENSION))
    num_words_in_embedding = 0
    for word, i in tokenizer.word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            num_words_in_embedding += 1
            # words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector

    # Create model layers.
    def get_convolutional_neural_net_layers():
        """Returns (input_layer, output_layer)"""
        sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
        embedding_layer = Embedding(len(tokenizer.word_index) + 1,
                                    EMBEDDINGS_DIMENSION,
                                    weights=[embedding_matrix],
                                    input_length=MAX_SEQUENCE_LENGTH,
                                    trainable=False)
        x = embedding_layer(sequence_input)
        x = Conv1D(128, 2, activation='relu', padding='same')(x)
        x = MaxPooling1D(5, padding='same')(x)
        x = Conv1D(128, 3, activation='relu', padding='same')(x)
        x = MaxPooling1D(5, padding='same')(x)
        x = Conv1D(128, 4, activation='relu', padding='same')(x)
        x = MaxPooling1D(40, padding='same')(x)
        x = Flatten()(x)
        x = Dropout(DROPOUT_RATE)(x)
        x = Dense(128, activation='relu')(x)
        preds = Dense(2, activation='softmax')(x)
        return sequence_input, preds

    # Compile model.
    input_layer, output_layer = get_convolutional_neural_net_layers()
    model = Model(input_layer, output_layer)
    model.compile(loss='categorical_crossentropy',
                  optimizer=RMSprop(lr=LEARNING_RATE),
                  metrics=['acc'])

    # Train model.
    model.fit(train_text,
              train_labels,
              batch_size=BATCH_SIZE,
              epochs=NUM_EPOCHS,
              validation_data=(validate_text, validate_labels),
              verbose=2)

    return model

# Uncomment this code to run model training
# model = train_model(train_df, validate_df, tokenizer)

Score test set with our text classification model

Using our new model, we can score the set of test comments for toxicity.


In [0]:
# Use the model to score the test set.
test_comments_padded = pad_text(test_df[TEXT_COLUMN], tokenizer)
MODEL_NAME = 'fat_star_tutorial'
test_df[MODEL_NAME] = model.predict(test_comments_padded)[:, 1]

Let's see how our model performed against the test set. We can compare the model's predictions against the actual labels, and calculate the overall ROC-AUC for the model.


In [0]:
# Print some records to compare our model results with the correct labels
pd.concat([
    test_df.query('toxicity == False').sample(3, random_state=RANDOM_STATE),
    test_df.query('toxicity == True').sample(3, random_state=RANDOM_STATE)])[[TOXICITY_COLUMN, MODEL_NAME, TEXT_COLUMN]]

Evaluate the overall ROC-AUC

This calculates the model's performance on the entire test set using the ROC-AUC metric.


In [0]:
def calculate_overall_auc(df, model_name):
    true_labels = df[TOXICITY_COLUMN]
    predicted_labels = df[model_name]
    return metrics.roc_auc_score(true_labels, predicted_labels)

calculate_overall_auc(test_df, MODEL_NAME)

Compute Bias Metrics

Using metrics based on ROC-AUC, we can measure bias in our model against different identity groups. To minimize noise, we only calculate bias metrics for identities that are referred to in 100 or more comments.

The three bias metrics compare different subsets of the data; each is described alongside the heatmap below.


In [0]:
# Get a list of identity columns that have >= 100 True records.  This will remove groups such
# as "other_disability" which do not have enough records to calculate meaningful metrics.
identities_with_over_100_records = []
for identity in identity_columns:
    num_records = len(test_df.query(identity + '==True'))
    if num_records >= 100:
        identities_with_over_100_records.append(identity)

SUBGROUP_AUC = 'subgroup_auc'
BACKGROUND_POSITIVE_SUBGROUP_NEGATIVE_AUC = 'background_positive_subgroup_negative_auc'
BACKGROUND_NEGATIVE_SUBGROUP_POSITIVE_AUC = 'background_negative_subgroup_positive_auc'

def compute_auc(y_true, y_pred):
  try:
    return metrics.roc_auc_score(y_true, y_pred)
  except ValueError:
    return np.nan


def compute_subgroup_auc(df, subgroup, label, model_name):
  subgroup_examples = df[df[subgroup]]
  return compute_auc(subgroup_examples[label], subgroup_examples[model_name])


def compute_background_positive_subgroup_negative_auc(df, subgroup, label, model_name):
  """Computes the AUC of the within-subgroup negative examples and the background positive examples."""
  subgroup_negative_examples = df[df[subgroup] & ~df[label]]
  non_subgroup_positive_examples = df[~df[subgroup] & df[label]]
  examples = pd.concat([subgroup_negative_examples, non_subgroup_positive_examples])
  return compute_auc(examples[label], examples[model_name])


def compute_background_negative_subgroup_positive_auc(df, subgroup, label, model_name):
  """Computes the AUC of the within-subgroup positive examples and the background negative examples."""
  subgroup_positive_examples = df[df[subgroup] & df[label]]
  non_subgroup_negative_examples = df[~df[subgroup] & ~df[label]]
  examples = pd.concat([subgroup_positive_examples, non_subgroup_negative_examples])
  return compute_auc(examples[label], examples[model_name])


def compute_bias_metrics_for_model(dataset,
                                   subgroups,
                                   model,
                                   label_col,
                                   include_asegs=False):
  """Computes per-subgroup metrics for all subgroups and one model."""
  records = []
  for subgroup in subgroups:
    record = {
        'subgroup': subgroup,
        'subgroup_size': len(dataset[dataset[subgroup]])
    }
    record[SUBGROUP_AUC] = compute_subgroup_auc(
        dataset, subgroup, label_col, model)
    record[BACKGROUND_POSITIVE_SUBGROUP_NEGATIVE_AUC] = compute_background_positive_subgroup_negative_auc(
        dataset, subgroup, label_col, model)
    record[BACKGROUND_NEGATIVE_SUBGROUP_POSITIVE_AUC] = compute_background_negative_subgroup_positive_auc(
        dataset, subgroup, label_col, model)
    records.append(record)
  return pd.DataFrame(records).sort_values('subgroup_auc', ascending=True)

bias_metrics_df = compute_bias_metrics_for_model(test_df, identities_with_over_100_records, MODEL_NAME, TOXICITY_COLUMN)

Plot a heatmap of bias metrics

Plot a heatmap of the bias metrics. Higher scores indicate better results. A toy example illustrating the BPSN metric follows this list.

  • Subgroup AUC measures the ability to separate toxic and non-toxic comments within this identity.
  • Background positive, subgroup negative (BPSN) AUC measures the ability to separate non-toxic comments for this identity from toxic comments in the background distribution.
  • Background negative, subgroup positive (BNSP) AUC measures the ability to separate toxic comments for this identity from non-toxic comments in the background distribution.
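
As a quick, self-contained illustration of how a false-positive bias shows up in these numbers, here is a toy example that reuses the helper functions defined above. The toy_df dataframe and toy_model scores are invented purely for illustration: a model that ranks every non-toxic subgroup comment above every toxic background comment gets a BPSN AUC of 0.


In [0]:
# Toy illustration (invented data, not part of the tutorial dataset): a model that
# over-scores non-toxic comments from a subgroup gets a very low BPSN AUC.
toy_df = pd.DataFrame({
    'toxicity':  [False, False, False, True, True, True],
    'subgroup':  [True, True, True, False, False, False],
    # Hypothetical scores: every non-toxic subgroup comment scores higher than
    # every toxic background comment.
    'toy_model': [0.7, 0.8, 0.9, 0.4, 0.5, 0.6]})

compute_background_positive_subgroup_negative_auc(
    toy_df, 'subgroup', 'toxicity', 'toy_model')  # Returns 0.0 for this toy data.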

In [0]:
def plot_auc_heatmap(bias_metrics_results, models):
  metrics_list = [SUBGROUP_AUC, BACKGROUND_POSITIVE_SUBGROUP_NEGATIVE_AUC, BACKGROUND_NEGATIVE_SUBGROUP_POSITIVE_AUC]
  df = bias_metrics_results.set_index('subgroup')
  columns = []
  vlines = [i * len(models) for i in range(len(metrics_list))]
  for metric in metrics_list:
    for model in models:
      columns.append(metric)
  num_rows = len(df)
  num_columns = len(columns)
  fig = plt.figure(figsize=(num_columns, 0.5 * num_rows))
  ax = sns.heatmap(df[columns], annot=True, fmt='.2', cbar=True, cmap='Reds_r',
                   vmin=0.5, vmax=1.0)
  ax.xaxis.tick_top()
  plt.xticks(rotation=90)
  ax.vlines(vlines, *ax.get_ylim())
  return ax

plot_auc_heatmap(bias_metrics_df, [MODEL_NAME])

Exercise #2

Examine the bias heatmap above - what biases can you spot? Do the biases appear to be false positives (non-toxic comments incorrectly classified as toxic) or false negatives (toxic comments incorrectly classified as non-toxic)?

Solution (click to expand)

Some groups have lower subgroup AUC scores, for example "heterosexual", "transgender", and "homosexual_gay_or_lesbian". Because the BPSN (background positive, subgroup negative) AUC is lower than the BNSP (background negative, subgroup positive) AUC for these groups, they appear to suffer more false positives, i.e. many non-toxic comments referring to these identities score higher for toxicity than genuinely toxic comments about other topics.
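
One way to check this against the numbers rather than the heatmap colors is to pull the relevant rows out of bias_metrics_df, which was computed above:


In [0]:
# Show the metric rows for the subgroups discussed above.
bias_metrics_df[bias_metrics_df['subgroup'].isin(
    ['heterosexual', 'transgender', 'homosexual_gay_or_lesbian'])][
        ['subgroup', SUBGROUP_AUC,
         BACKGROUND_POSITIVE_SUBGROUP_NEGATIVE_AUC,
         BACKGROUND_NEGATIVE_SUBGROUP_POSITIVE_AUC]]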

Plot histograms showing comment scores

We can graph a histogram of comment scores for each identity. In the following graphs, the X axis represents the toxicity score given by our new model, and the Y axis represents the comment count. Blue bars are comments whose true label is non-toxic, while red bars are comments whose true label is toxic.


In [0]:
def plot_histogram(non_toxic_scores, toxic_scores, description):
  NUM_BINS=10
  sns.distplot(non_toxic_scores, norm_hist=True, bins=NUM_BINS, color="skyblue", label='non-toxic ' + description, kde=False)
  ax = sns.distplot(toxic_scores, norm_hist=True, bins=NUM_BINS, color="red", label='toxic ' + description, kde=False)
  ax.set(xlabel='model toxicity score', ylabel='relative % of comments', yticklabels=[])
  plt.legend()
  plt.figure()

# Plot toxicity distributions of different identities to visualize bias.
def plot_histogram_for_identity(df, identity):
  toxic_scores = df.query(identity + ' == True & toxicity == True')[MODEL_NAME]
  non_toxic_scores = df.query(identity + ' == True & toxicity == False')[MODEL_NAME]
  plot_histogram(non_toxic_scores, toxic_scores, 'labeled for ' + identity)

def plot_background_histogram(df):
  toxic_scores = df.query('toxicity == True')[MODEL_NAME]
  non_toxic_scores = df.query('toxicity == False')[MODEL_NAME]
  plot_histogram(non_toxic_scores, toxic_scores, 'for all test data')

# Plot the histogram for the background data, and for a few identities
plot_background_histogram(test_df)
plot_histogram_for_identity(test_df, 'heterosexual')
plot_histogram_for_identity(test_df, 'transgender')
plot_histogram_for_identity(test_df, 'homosexual_gay_or_lesbian')
plot_histogram_for_identity(test_df, 'atheist')
plot_histogram_for_identity(test_df, 'christian')
plot_histogram_for_identity(test_df, 'asian')

Exercise #3

By comparing the toxicity histograms for comments that refer to different groups with each other, and with the background distribution, what additional information can we learn about bias in our model?


In [0]:
# Your code here
#
# HINT: you can display the background distribution by running:
#   plot_background_histogram(test_df)
#
# You can plot the distribution for a given identity by running
#   plot_histogram_for_identity(test_df, identity_name)
#   e.g. plot_histogram_for_identity(test_df, 'male')

Solution (click to expand)

This is one possible interpretation of the data. We encourage you to explore other identity categories and come up with your own conclusions.

We can see that for some identities, such as "asian", the model scores most non-toxic comments below 0.2 and most toxic comments above 0.2. This indicates that for the "asian" identity, our model is able to distinguish between toxic and non-toxic comments. However, for the "black" identity, there are many non-toxic comments with scores over 0.5, along with many toxic comments with scores under 0.5. This shows that for the "black" identity, our model is less accurate at separating toxic comments from non-toxic comments. The model also has difficulty separating toxic from non-toxic comments labeled as applying to the "white" identity. The sketch after the plots below puts rough numbers on this.


In [0]:
plot_histogram_for_identity(test_df, 'asian')
plot_histogram_for_identity(test_df, 'black')
plot_histogram_for_identity(test_df, 'white')
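
To put rough numbers behind the histograms above, the sketch below counts how often non-toxic comments for each of these identities receive a score above 0.5. The 0.5 cut-off is only an illustrative threshold, not a calibrated decision boundary.


In [0]:
# Sketch: fraction of non-toxic comments scoring above an illustrative 0.5 threshold.
def percent_non_toxic_above_threshold(df, identity, threshold=0.5):
  non_toxic_scores = df.query(identity + ' == True & toxicity == False')[MODEL_NAME]
  return 100 * (non_toxic_scores > threshold).mean()

for identity in ['asian', 'black', 'white']:
  print('%s: %.1f%% of non-toxic comments score above 0.5' % (
      identity, percent_non_toxic_above_threshold(test_df, identity)))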

Additional topics to explore

  • How do toxicity and bias change if we restrict the dataset to long or short comments? (A starter sketch follows this list.)
  • What patterns exist for comments containing multiple identities? Do some identities often appear together? Are these comments more likely to be toxic? Is our model more or less biased against these comments?
  • What biases exist when classifying the other "toxicity subtypes" (obscene, sexual_explicit, identity_attack, insult, and threat)?
  • Are there other ways we might be able to mitigate bias?
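
As a starting point for the first question above, here is a minimal sketch that restricts the test set to shorter comments and recomputes the bias metrics. The 1000-character cut-off is an arbitrary choice for illustration; try other values, or the complementary long-comment slice.


In [0]:
# Sketch: recompute bias metrics on short comments only.  Note that some subgroups
# will have fewer records after filtering, which makes their metrics noisier.
short_test_df = test_df[test_df[TEXT_COLUMN].str.len() < 1000]
short_bias_metrics_df = compute_bias_metrics_for_model(
    short_test_df, identities_with_over_100_records, MODEL_NAME, TOXICITY_COLUMN)
plot_auc_heatmap(short_bias_metrics_df, [MODEL_NAME])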