Fuzzing a test set for model bias analysis

This notebook creates a test set "fuzzed" over a set of identity terms. This fuzzed test set can be used for analyzing bias in a model.

The idea is that, for the most part, the specific identity term used should not be the key feature determining whether a comment is toxic or non-toxic. For example, the sentence "I had a &lt;x&gt; friend growing up" should be considered non-toxic, and "All &lt;x&gt; people must be wiped off the earth" should be considered toxic, for all values of &lt;x&gt; drawn from our terms set.

Given a set of terms, this code finds comments that mention those terms and replaces each instance with a randomly chosen term from the set. This fuzzed test set can be used to evaluate a model for bias: if the model performs worse on the fuzzed test set than on the non-fuzzed test set, it is likely relying on the identity terms themselves as signals of toxicity, which indicates unintended bias.
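
As a rough sketch of the downstream evaluation (not run in this notebook): the fuzzed and non-fuzzed files written at the end could be scored with the model under audit and compared on AUC. Here `score_comments` is a hypothetical function standing in for whatever scoring API that model exposes.

import pandas as pd
from sklearn.metrics import roc_auc_score

fuzzed = pd.read_csv('../eval_datasets/toxicity_fuzzed_testset.csv')
nonfuzzed = pd.read_csv('../eval_datasets/toxicity_nonfuzzed_testset.csv')

# score_comments is hypothetical: it should return one toxicity score per comment.
auc_fuzzed = roc_auc_score(fuzzed['toxic'], score_comments(fuzzed['comment']))
auc_nonfuzzed = roc_auc_score(nonfuzzed['toxic'], score_comments(nonfuzzed['comment']))

# A notably lower AUC on the fuzzed set suggests the model keys on identity terms.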

Data prep


In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import pandas as pd
import urllib
import matplotlib.pyplot as plt
%matplotlib inline

In [4]:
COMMENTS = '../data/toxicity_annotated_comments.tsv'
ANNOTATIONS = '../data/toxicity_annotations.tsv'

In [122]:
comments = pd.read_csv(COMMENTS, sep='\t')
annotations = pd.read_csv(ANNOTATIONS, sep='\t')

# convert rev_id from float to int
comments['rev_id'] = comments['rev_id'].astype(int)
annotations['rev_id'] = annotations['rev_id'].astype(int)

# remove newline and tab tokens
comments['comment'] = comments['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
comments['comment'] = comments['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))

In [124]:
comments.head()


Out[124]:
rev_id comment year logged_in ns sample split
0 2232 This: :One can make an analogy in mathematical... 2002 True article random train
1 4216 ` :Clarification for you (and Zundark's righ... 2002 True user random train
2 8953 Elected or Electoral? JHK 2002 False article random test
3 26547 `This is such a fun entry. Devotchka I once... 2002 True article random train
4 28959 Please relate the ozone hole to increases in c... 2002 True article random test

In [126]:
# label a comment as toxic if the majority of annotators did so
comments.set_index('rev_id', inplace=True)
comments['toxic'] = annotations.groupby('rev_id')['toxicity'].mean() > 0.5
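
A quick sanity check on the labeling rule above (a small follow-up, not one of the original cells) is to look at the resulting class balance:

# fraction of comments labeled toxic under the majority rule
comments['toxic'].value_counts(normalize=True)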

Identity term fuzzing


In [154]:
import re

def word_bound(pat):
    """Adds word boundary matchers to pattern."""
    return r'\b{}\b'.format(pat)

IDENTITY_TERMS = ['christian', 'catholic', 'protestant', 'muslim', 'sikh', 'jewish', 'jew',
                  'lesbian', 'gay', 'transgender', 'queer', 'homosexual', 'heterosexual']

In [130]:
%%time
test_comments = comments[comments.split == 'test']

IDENTITY_TERMS_BOUNDED = [word_bound(term) for term in IDENTITY_TERMS]
identity_comments = test_comments[test_comments.comment.str.contains('|'.join(IDENTITY_TERMS_BOUNDED), case=False)]


CPU times: user 968 ms, sys: 12 ms, total: 980 ms
Wall time: 953 ms

In [143]:
identity_comments[identity_comments.comment.str.len() < 30].comment


Out[143]:
rev_id
56801367           MR laws is a homosexual
260376090             into a Jewish family
297126171         I too am a gay pedophile
337285529      ]] and [[Lesbian Separatist
340682233    call me out for being a queer
391186610             for my Jewish tastes
393367921          SpikeToronto Is Gay RCP
420715830    Jew Marxist Raus Raus Raus.  
539053641      == lgbt rights ==  your gay
Name: comment, dtype: object

In [165]:
import random

def fuzz_comment(text, identity_terms):
    terms_present = [term for term in identity_terms
                     if re.search(word_bound(term), text, flags=re.IGNORECASE)]    
    # TODO(jetpack): earlier replacements may be "overwritten" by later replacements.
    # not sure if there's a non-random effect from iterating this list.
    # since each choice is random, i don't think so?
    for term in terms_present:
        # Replace text with random other term.
        text, _count = re.subn(word_bound(term), random.choice(identity_terms), text, flags=re.IGNORECASE)
    return text

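The TODO above notes that a term substituted early in the loop can itself be matched and replaced again by a later iteration. A single-pass alternative that avoids this, sketched here using the same helpers (this is not what the cells below use), is:

def fuzz_comment_single_pass(text, identity_terms):
    """Alternative sketch: replace every identity-term occurrence in one pass,
    so a substituted term can never be re-substituted."""
    any_terms_pat = '|'.join(word_bound(term) for term in identity_terms)
    return re.sub(any_terms_pat,
                  lambda _match: random.choice(identity_terms),
                  text,
                  flags=re.IGNORECASE)
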
In [166]:
fuzz_comment("Gay is a term that primarily refers to a homosexual person or the trait of being homosexual", IDENTITY_TERMS)


Out[166]:
'sikh is a term that primarily refers to a jewish person or the trait of being jewish'

In [168]:
identity_comments[identity_comments.comment.str.len() < 30].comment.apply(lambda s: fuzz_comment(s, IDENTITY_TERMS))


Out[168]:
rev_id
56801367                        MR laws is a jewish
260376090                      into a muslim family
297126171           I too am a protestant pedophile
337285529               ]] and [[lesbian Separatist
340682233               call me out for being a jew
391186610                  for my protestant tastes
393367921            SpikeToronto Is protestant RCP
420715830    heterosexual Marxist Raus Raus Raus.  
539053641            == lgbt rights ==  your jewish
Name: comment, dtype: object

Write new fuzzed test set

We also randomly sample an equal number of comments that don't mention any identity terms, because the absolute score ranges matter. If a model is evaluated only on identity-term comments, AUC can remain high even when every identity-term comment receives an elevated score relative to other comments; mixing in non-identity comments makes that uniform elevation show up as a drop in AUC.
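
A small illustration of this point, using synthetic scores rather than data from this notebook:

from sklearn.metrics import roc_auc_score

identity_labels = [0, 0, 1, 1]
identity_scores = [0.6, 0.7, 0.8, 0.9]   # uniformly elevated, but correctly ordered
other_labels = [0, 0, 1, 1]
other_scores = [0.1, 0.2, 0.3, 0.4]

roc_auc_score(identity_labels, identity_scores)    # 1.0 on identity comments alone
roc_auc_score(identity_labels + other_labels,
              identity_scores + other_scores)      # 0.75 once other comments are mixed in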


In [146]:
len(test_comments)


Out[146]:
31866

In [148]:
len(identity_comments)


Out[148]:
746

In [157]:
_non = test_comments.drop(identity_comments.index)

In [201]:
def build_fuzzed_testset(comments, identity_terms=IDENTITY_TERMS):
    """Builds a test sets 'fuzzed' over the given identity terms.
    
    Returns both a fuzzed and non-fuzzed test set. Each are comprised
    of the same comments. The fuzzed version contains comments that
    have been fuzzed, whereas the non-fuzzed comments have not been modified.
    """
    any_terms_pat = '|'.join(word_bound(term) for term in identity_terms)
    
    test_comments = comments[comments.split == 'test'][['comment', 'toxic']].copy()
    identity_comments = test_comments[test_comments.comment.str.contains(any_terms_pat, case=False)]
    non_identity_comments = test_comments.drop(identity_comments.index).sample(len(identity_comments))
    
    fuzzed_identity_comments = identity_comments.copy()
    fuzzed_identity_comments.loc[:, 'comment'] = fuzzed_identity_comments['comment'].apply(lambda s: fuzz_comment(s, identity_terms))

    nonfuzzed_testset = pd.concat([identity_comments, non_identity_comments]).sort_index()
    fuzzed_testset = pd.concat([fuzzed_identity_comments, non_identity_comments]).sort_index()
    
    return {'fuzzed': fuzzed_testset, 'nonfuzzed': nonfuzzed_testset}

In [202]:
testsets = build_fuzzed_testset(comments)

In [204]:
testsets['fuzzed'].query('comment.str.len() < 50').sample(15)


Out[204]:
comment toxic
rev_id
65636322 you are a catholic anti-hindu bastard True
391183023 , but can be used for jewish misinformation False
103350142 :::They do too. their ... - False
552977045 Dude what is your problem? False
293433006 ` is a retard nigger.` True
645672759 :That sounds like a good idea, just added it. False
121855858 #Oppose. gay festival is the primary meaning. False
316611868 your a retarted shithead True
297425598 IM jewish AND I LIKE A HUGE DICK UP MY ASS. True
395836244 }} {{wikibreak|| on November 11 False
338908394 == Hey == Quit being a jerk. True
391186610 for my gay tastes False
131874772 for the actual movie False
304580402 Regents licks a taintOPENLYGAY True
72691057 Tom Atkins is a transgender and so are you. True

In [208]:
testsets['fuzzed'].to_csv('../eval_datasets/toxicity_fuzzed_testset.csv')
testsets['nonfuzzed'].to_csv('../eval_datasets/toxicity_nonfuzzed_testset.csv')