This notebook creates a test set "fuzzed" over a set of identity terms. This fuzzed test set can be used for analyzing bias in a model.
The idea is that, for the most part, the specific identity term used should not be the key feature determining whether a comment is toxic or non-toxic. For example, the sentence "I had a x friend" should receive the same toxicity label for any term x in our terms set.
Given a set of terms, this code finds comments that mention those terms and replaces each instance with a random other term in the set. This fuzzed test set can be used to evaluate a model for bias: if the model performs worse on the fuzzed test set than on the non-fuzzed test set, it is likely relying on the identity terms themselves as signals of toxicity, which indicates bias.
In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import pandas as pd
import urllib
import matplotlib.pyplot as plt
%matplotlib inline
In [4]:
COMMENTS = '../data/toxicity_annotated_comments.tsv'
ANNOTATIONS = '../data/toxicity_annotations.tsv'
In [122]:
comments = pd.read_csv(COMMENTS, sep='\t')
annotations = pd.read_csv(ANNOTATIONS, sep='\t')
# convert rev_id from float to int
comments['rev_id'] = comments['rev_id'].astype(int)
annotations['rev_id'] = annotations['rev_id'].astype(int)
# remove newline and tab tokens
comments['comment'] = comments['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
comments['comment'] = comments['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))
In [124]:
comments.head()
Out[124]:
In [126]:
# label a comment as toxic if the majority of annotators did so
comments.set_index('rev_id', inplace=True)
comments['toxic'] = annotations.groupby('rev_id')['toxicity'].mean() > 0.5
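As a quick sanity check (an added cell, not part of the original run), we can confirm that every comment received a label and look at the class balance:
In [ ]:
# A NaN here would mean a rev_id present in comments but absent
# from the annotations table.
assert not comments['toxic'].isnull().any()
comments['toxic'].value_counts(normalize=True)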
In [154]:
import re
def word_bound(pat):
    """Adds word boundary matchers to pattern."""
    return r'\b{}\b'.format(pat)

IDENTITY_TERMS = ['christian', 'catholic', 'protestant', 'muslim', 'sikh', 'jewish', 'jew',
                  'lesbian', 'gay', 'transgender', 'queer', 'homosexual', 'heterosexual']
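As a quick illustration (an added cell, not in the original run), the word boundaries keep a term from matching inside longer words:
In [ ]:
# r'\bjew\b' matches the standalone word but not substrings like 'jewelry'.
print(bool(re.search(word_bound('jew'), 'he is a jew', flags=re.IGNORECASE)))       # True
print(bool(re.search(word_bound('jew'), 'the jewelry store', flags=re.IGNORECASE))) # False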
In [130]:
%%time
test_comments = comments[comments.split == 'test']
IDENTITY_TERMS_BOUNDED = [word_bound(term) for term in IDENTITY_TERMS]
identity_comments = test_comments[test_comments.comment.str.contains('|'.join(IDENTITY_TERMS_BOUNDED), case=False)]
In [143]:
identity_comments[identity_comments.comment.str.len() < 30].comment
Out[143]:
In [165]:
import random

def fuzz_comment(text, identity_terms):
    """Replaces each identity term in text with a randomly chosen term from the set."""
    any_term_pat = word_bound('(?:{})'.format('|'.join(identity_terms)))
    # Substitute in a single pass so that a term inserted by one replacement
    # can never be matched (and "overwritten") by a later replacement.
    return re.sub(any_term_pat,
                  lambda _match: random.choice(identity_terms),
                  text, flags=re.IGNORECASE)
In [166]:
fuzz_comment("Gay is a term that primarily refers to a homosexual person or the trait of being homosexual", IDENTITY_TERMS)
Out[166]:
In [168]:
identity_comments[identity_comments.comment.str.len() < 30].comment.apply(lambda s: fuzz_comment(s, IDENTITY_TERMS))
Out[168]:
We also randomly sample comments that don't mention identity terms, because absolute score ranges matter. AUC measures ranking, so it can remain high on identity-term comments alone even if all of them receive elevated scores relative to other comments. Pooling in non-identity-term comments causes AUC to drop when that kind of uniform elevation is present.
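To see why, here is a toy illustration with made-up scores (an added cell; it uses sklearn, which this notebook doesn't otherwise import): ranking within the identity-term group is perfect, so group-level AUC is 1.0 despite a uniform score elevation, and pooling in non-identity comments exposes the elevation.
In [ ]:
from sklearn.metrics import roc_auc_score

# Hypothetical scores: identity-term comments are uniformly elevated,
# but toxic still outranks non-toxic within the group.
identity_labels, identity_scores = [0, 1], [0.5, 0.9]
print(roc_auc_score(identity_labels, identity_scores))  # 1.0

# Pooled with non-identity comments, the elevated non-toxic identity
# comment (0.5) now outranks the toxic non-identity comment (0.4).
all_labels = identity_labels + [0, 1]
all_scores = identity_scores + [0.1, 0.4]
print(roc_auc_score(all_labels, all_scores))  # 0.75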
In [146]:
len(test_comments)
Out[146]:
In [148]:
len(identity_comments)
Out[148]:
In [157]:
_non = test_comments.drop(identity_comments.index)
In [201]:
def build_fuzzed_testset(comments, identity_terms=IDENTITY_TERMS):
    """Builds a test set 'fuzzed' over the given identity terms.

    Returns both a fuzzed and a non-fuzzed test set. Both comprise the same
    comments: in the fuzzed version the identity-term comments have been
    fuzzed, whereas the non-fuzzed comments are unmodified.
    """
    any_terms_pat = '|'.join(word_bound(term) for term in identity_terms)
    test_comments = comments[comments.split == 'test'][['comment', 'toxic']].copy()
    identity_comments = test_comments[test_comments.comment.str.contains(any_terms_pat, case=False)]
    non_identity_comments = test_comments.drop(identity_comments.index).sample(len(identity_comments))
    fuzzed_identity_comments = identity_comments.copy()
    fuzzed_identity_comments.loc[:, 'comment'] = fuzzed_identity_comments['comment'].apply(
        lambda s: fuzz_comment(s, identity_terms))
    nonfuzzed_testset = pd.concat([identity_comments, non_identity_comments]).sort_index()
    fuzzed_testset = pd.concat([fuzzed_identity_comments, non_identity_comments]).sort_index()
    return {'fuzzed': fuzzed_testset, 'nonfuzzed': nonfuzzed_testset}
In [202]:
testsets = build_fuzzed_testset(comments)
In [204]:
testsets['fuzzed'].query('comment.str.len() < 50').sample(15)
Out[204]:
In [208]:
testsets['fuzzed'].to_csv('../eval_datasets/toxicity_fuzzed_testset.csv')
testsets['nonfuzzed'].to_csv('../eval_datasets/toxicity_nonfuzzed_testset.csv')
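As a final sanity check (an added cell, assuming the paths above), the saved files round-trip cleanly:
In [ ]:
# Reload the fuzzed set and confirm the rev_id index, columns, and row
# count survive serialization.
reloaded = pd.read_csv('../eval_datasets/toxicity_fuzzed_testset.csv', index_col='rev_id')
assert list(reloaded.columns) == ['comment', 'toxic']
assert len(reloaded) == len(testsets['fuzzed'])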