The Utility-Fairness Tradeoff

In this post, I'll take a deep dive into the capabilities of themis_ml as a tool to measure and mitigate discriminatory patterns in training data and in the predictions of machine learning algorithms trained for socially sensitive decision processes.

The overall goal of this research is to develop a reasonable way of thinking about how to make machine learning algorithms more fair. While a mathematical formalization of fairness is not sufficient to solve the problem of discrimination, the ability to understand and articulate what it means for an algorithm to be fair is a step in the right direction.

Since the "discrimination" is an value-laden term in this context, I'll refer to the opposite of fairness as potential discrimination (PD) since the any socially biased patterns we'll be measuring in the training data did not necessarily arise from discriminatory processes.

I'll be using the German Credit dataset, which consists of ~1,000 loan applications with roughly 20 input variables (including foreign_worker, housing, and credit_history) and 1 binary target variable, credit_risk, which is either good or bad.

In the context of a good/bad credit_risk binary prediction task and an explicit definition of fairness, our objectives will be to:

  1. Measure the degree of discrimination in the dataset with respect to some discrimination metric and protected class.
  2. Establish a baseline performance level with respect to utility and fairness metrics with models trained on a fairness-unaware machine learning pipeline.
  3. Measure the same metrics for fairness-aware models and compare them against the baseline.

Load Data


In [89]:
from themis_ml import datasets
from themis_ml.datasets.german_credit_data_map import \
    preprocess_german_credit_data
from themis_ml.metrics import mean_difference, normalized_mean_difference, \
    mean_confidence_interval

In [13]:
german_credit = datasets.german_credit()
german_credit[
    ["credit_risk", "purpose", "age_in_years", "foreign_worker"]].head()


Out[13]:
credit_risk purpose age_in_years foreign_worker
0 1 radio/television 67 1
1 0 radio/television 22 1
2 1 education 49 1
3 1 furniture/equipment 45 1
4 0 car_(new) 53 1

In [14]:
german_credit_preprocessed = (
    preprocess_german_credit_data(german_credit)
    # the following binary variable indicates whether someone is female or
    # not since the unique values in `personal_status` are:
    # 'personal_status_and_sex_female_divorced/separated/married'
    # 'personal_status_and_sex_male_divorced/separated'
    # 'personal_status_and_sex_male_married/widowed'
    # 'personal_status_and_sex_male_single'
    .assign(female=lambda df:
            df["personal_status_and_sex_female_divorced/separated/married"])
    # we're going to hypothesize here that young people, aged 25 or below,
    # might be considered to have bad credit risk more so than other groups
    .assign(age_below_25=lambda df: df["age_in_years"] <= 25)
)

Measure Social Bias

target variable: credit_risk

  • 1 = low risk (good)
  • 0 = high risk (bad)

In [15]:
credit_risk = german_credit_preprocessed.credit_risk
credit_risk.value_counts()


Out[15]:
1    700
0    300
Name: credit_risk, dtype: int64

protected class: sex

  • advantaged group: men
  • disadvantaged group: women
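
Before computing anything, it's worth spelling out the metric. The mean difference is simply the gap in "good outcome" rates between the advantaged (s = 0) and disadvantaged (s = 1) groups. Here's a minimal sketch of the core computation (themis_ml's mean_difference also returns a 95% confidence interval around this quantity):

import numpy as np

def mean_difference_sketch(y, s):
    # d(y, s) = p(y = 1 | s = 0) - p(y = 1 | s = 1), where s = 1 indicates
    # membership in the disadvantaged group. Positive values mean the
    # advantaged group receives the "good" outcome at a higher rate.
    y, s = np.asarray(y), np.asarray(s)
    return y[s == 0].mean() - y[s == 1].mean()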

In [16]:
is_female = german_credit_preprocessed.female
is_female.value_counts()


Out[16]:
0    690
1    310
Name: female, dtype: int64

In [17]:
def report_metric(metric, mean_diff, lower, upper):
    print("{metric}: {md:0.02f} - 95% CI [{lower:0.02f}, {upper:0.02f}]"
          .format(metric=metric, md=mean_diff, lower=lower, upper=upper))

report_metric(
    "mean difference",
    *map(lambda x: x * 100, mean_difference(credit_risk, is_female)))
report_metric(
    "normalized mean difference",
    *map(lambda x: x * 100, normalized_mean_difference(credit_risk, is_female)))


mean difference: 7.48 - 95% CI [1.35, 13.61]
normalized mean difference: 7.73 - 95% CI [1.39, 14.06]

protected class: immigration status

  • advantaged group: citizen worker
  • disadvantaged group: foreign worker

In [18]:
is_foreign = german_credit_preprocessed.foreign_worker
is_foreign.value_counts()


Out[18]:
1    963
0     37
Name: foreign_worker, dtype: int64

In [19]:
report_metric(
    "mean difference",
    *map(lambda x: x * 100, mean_difference(credit_risk, is_foreign)))
report_metric(
    "normalized mean difference",
    *map(lambda x: x * 100, normalized_mean_difference(credit_risk, is_foreign)))


mean difference: 19.93 - 95% CI [4.91, 34.94]
normalized mean difference: 63.96 - 95% CI [15.76, 100.00]

protected class: age

  • advantaged group: age above 25
  • disadvantaged group: age 25 and below

In [20]:
age_below_25 = german_credit_preprocessed.age_below_25
age_below_25.value_counts()


Out[20]:
False    810
True     190
Name: age_below_25, dtype: int64

In [21]:
report_metric(
    "mean difference",
    *map(lambda x: x * 100, mean_difference(credit_risk, age_below_25)))
report_metric(
    "normalized mean difference",
    *map(lambda x: x * 100, normalized_mean_difference(credit_risk, age_below_25)))


mean difference: 14.94 - 95% CI [7.76, 22.13]
normalized mean difference: 17.29 - 95% CI [8.97, 25.61]

These mean differences and confidence interval bounds suggest that on average:

  • men have "good" credit risk at a 7.48% higher rate than women, with a lower bound of 1.35% and upper bound of 13.61%.
  • citizen workers have "good" credit risk at a 19.93% higher rate than foreign workers, with a lower bound of 4.91% and upper bound of 34.94%.
  • people above the age of 25 have "good" credit risk at a 14.94% higher rate than those aged 25 and below, with a lower bound of 7.76% and upper bound of 22.13%.

Establish Baseline Metrics

Suppose that Unjust Bank wants to use these data to train a machine learning algorithm to classify new observations into the "good credit risk"/"bad credit risk" buckets.

In scenario 1, let's also suppose that the data scientists at Unjust Bank are using typical, fairness-unaware modeling techniques. Furthermore, they give absolutely no thought to which inputs go into the learning process. Using this kitchen-sink approach, they plan on using variables like sex, age_below_25, and foreign_worker to train the classifier.

However, a rogue element on the data science team is interested in at least measuring the potentially discriminatory (PD) patterns in the learned algorithms. So, in addition to measuring performance with metrics like accuracy or ROC area under the curve, they also measure the degree to which the algorithm generates PD predictions that favor one social group over another.

Procedure

  1. Specify model hyperparameter settings for training models.
  2. Partition the training data into 10 validation folds (repeated 5 times with different random partitions).
  3. For each validation fold, train a model on the rest of the data under each of the hyperparameter settings.
  4. Evaluate the performance of the model on the validation fold.
  5. Pick model with the best average performance to deploy to production.

Below we use RepeatedStratifiedKFold so that we can partition our data into folds stratified by both the target and the protected class of interest, and train the following models:

  • LogisticRegression
  • DecisionTreeClassifier
  • RandomForest

In [22]:
import itertools
import numpy as np
import pandas as pd

from sklearn.model_selection import StratifiedKFold, RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.metrics import (
    accuracy_score, roc_auc_score, f1_score)

In [23]:
# specify feature set. Note that we're excluding the derived `female`
# and `age_below_25` columns created above, though the underlying
# information is still present via the personal_status_and_sex_* and
# age_in_years variables.
feature_set_1 = [
    'duration_in_month',
    'credit_amount',
    'installment_rate_in_percentage_of_disposable_income',
    'present_residence_since',
    'age_in_years',
    'number_of_existing_credits_at_this_bank',
    'number_of_people_being_liable_to_provide_maintenance_for',
    'status_of_existing_checking_account',
    'savings_account/bonds',
    'present_employment_since',
    'job',
    'telephone',
    'foreign_worker',
    'credit_history_all_credits_at_this_bank_paid_back_duly',
    'credit_history_critical_account/other_credits_existing_not_at_this_bank',
    'credit_history_delay_in_paying_off_in_the_past',
    'credit_history_existing_credits_paid_back_duly_till_now',
    'credit_history_no_credits_taken/all_credits_paid_back_duly',
    'purpose_business',
    'purpose_car_(new)',
    'purpose_car_(used)',
    'purpose_domestic_appliances',
    'purpose_education',
    'purpose_furniture/equipment',
    'purpose_others',
    'purpose_radio/television',
    'purpose_repairs',
    'purpose_retraining',
    'personal_status_and_sex_female_divorced/separated/married',
    'personal_status_and_sex_male_divorced/separated',
    'personal_status_and_sex_male_married/widowed',
    'personal_status_and_sex_male_single',
    'other_debtors/guarantors_co-applicant',
    'other_debtors/guarantors_guarantor',
    'other_debtors/guarantors_none',
    'property_building_society_savings_agreement/life_insurance',
    'property_car_or_other',
    'property_real_estate',
    'property_unknown/no_property',
    'other_installment_plans_bank',
    'other_installment_plans_none',
    'other_installment_plans_stores',
    'housing_for free',
    'housing_own',
    'housing_rent',
]

In [24]:
N_SPLITS = 10
N_REPEATS = 5
RANDOM_STATE = 1000

def get_estimator_name(e):
    return "".join([x for x in str(type(e)).split(".")[-1]
                    if x.isalpha()])


def get_grid_params(grid_params_dict):
    """Get outer product of grid search parameters."""
    return [
        dict(params) for params in itertools.product(
            *[[(k, v_i) for v_i in v] for
              k, v in grid_params_dict.items()])]


def fit_with_s(estimator):
    """Determine whether the estimator needs the protected attribute at fit time."""
    has_relabeller = getattr(estimator, "relabeller", None) is not None
    child_estimator = getattr(estimator, "estimator", None)
    estimator_fit_with_s = getattr(estimator, "S_ON_FIT", False)
    child_estimator_fit_with_s = getattr(child_estimator, "S_ON_FIT", False)
    return has_relabeller or estimator_fit_with_s or \
        child_estimator_fit_with_s


def predict_with_s(estimator):
    """Determine whether the estimator needs the protected attribute at predict time."""
    estimator_pred_with_s = getattr(estimator, "S_ON_PREDICT", False)
    child_estimator = getattr(estimator, "estimator", None)
    return estimator_pred_with_s or \
        getattr(child_estimator, "S_ON_PREDICT", False)

def cross_validation_experiment(estimators, X, y, s, s_name, verbose=True):
    msg = "Training models: protected_class = %s" % s_name
    if verbose:
        print(msg)
        print("-" * len(msg))
    performance_scores = []
    # stratified groups try to balance out both y and s across folds
    groups = [i + j for i, j in
              zip(y.astype(str), s.astype(str))]
    cv = RepeatedStratifiedKFold(
        n_splits=N_SPLITS,
        n_repeats=N_REPEATS,
        random_state=RANDOM_STATE)
    for e_name, e in estimators:
        if verbose:
            print("%s, fold:" % e_name),
        for i, (train, test) in enumerate(cv.split(X, y, groups=groups)):
            if verbose:
                print(i),
            # create train and validation fold partitions
            X_train, X_test = X[train], X[test]
            y_train, y_test = y[train], y[test]
            s_train, s_test = s[train], s[test]

            # fit model and generate train and test predictions
            if fit_with_s(e):
                e.fit(X_train, y_train, s_train)
            else:
                e.fit(X_train, y_train)
                
            train_pred_args = (X_train, s_train) if predict_with_s(e) \
                else (X_train, )
            test_pred_args = (X_test, s_test) if predict_with_s(e) \
                else (X_test, )
                
            train_pred_prob = e.predict_proba(*train_pred_args)[:, 1]
            train_pred = e.predict(*train_pred_args)
            test_pred_prob = e.predict_proba(*test_pred_args)[:, 1]
            test_pred = e.predict(*test_pred_args)

            # train scores
            performance_scores.append([
                s_name, e_name, i, "train",
                # regular metrics
                roc_auc_score(y_train, train_pred_prob),

                # fairness metrics
                mean_difference(train_pred, s_train)[0],
            ])
            # test scores
            performance_scores.append([
                s_name, e_name, i, "test",
                # regular metrics
                roc_auc_score(y_test, test_pred_prob),
                # fairness metrics
                mean_difference(test_pred, s_test)[0]
            ])
        if verbose:
            print("")
    if verbose:
        print("")
    return pd.DataFrame(
        performance_scores,
        columns=[
            "protected_class", "estimator", "cv_fold", "fold_type",
            "auc", "mean_diff"])
            
# training and target data
X = german_credit_preprocessed[feature_set_1].values
y = german_credit_preprocessed["credit_risk"].values
s_female = german_credit_preprocessed["female"].values
s_foreign = german_credit_preprocessed["foreign_worker"].values
s_age_below_25 = german_credit_preprocessed["age_below_25"].values

LOGISTIC_REGRESSION = LogisticRegression(
    penalty="l2", C=0.001, class_weight="balanced")
DECISION_TREE_CLF = DecisionTreeClassifier(
    criterion="entropy", max_depth=10, min_samples_leaf=10, max_features=10,
    class_weight="balanced")
RANDOM_FOREST_CLF = RandomForestClassifier(
    criterion="entropy", n_estimators=50, max_depth=10, max_features=10,
    min_samples_leaf=10, class_weight="balanced")
estimators = [
    ("LogisticRegression", LOGISTIC_REGRESSION),
    ("DecisionTree", DECISION_TREE_CLF),
    ("RandomForest", RANDOM_FOREST_CLF)
]
experiment_baseline_female = cross_validation_experiment(
    estimators, X, y, s_female, "female")
experiment_baseline_foreign = cross_validation_experiment(
    estimators, X, y, s_foreign, "foreign_worker")
experiment_baseline_age_below_25 = cross_validation_experiment(
    estimators, X, y, s_age_below_25, "age_below_25")


Training models: protected_class = female
-----------------------------------------
LogisticRegression, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
DecisionTree, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
RandomForest, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 

Training models: protected_class = foreign_worker
-------------------------------------------------
LogisticRegression, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
/usr/local/miniconda3/envs/themis_ml_dev/lib/python2.7/site-packages/numpy/core/_methods.py:135: RuntimeWarning: Degrees of freedom <= 0 for slice
  keepdims=keepdims)
/usr/local/miniconda3/envs/themis_ml_dev/lib/python2.7/site-packages/numpy/core/_methods.py:105: RuntimeWarning: invalid value encountered in true_divide
  arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
/usr/local/miniconda3/envs/themis_ml_dev/lib/python2.7/site-packages/numpy/core/_methods.py:127: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/Users/Niels/git/themis-ml/themis_ml/metrics.py:30: RuntimeWarning: Mean of empty slice.
  mean_diff = y[s == 0].mean() - y[s == 1].mean()
/usr/local/miniconda3/envs/themis_ml_dev/lib/python2.7/site-packages/numpy/core/_methods.py:80: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/Users/Niels/git/themis-ml/themis_ml/metrics.py:31: RuntimeWarning: divide by zero encountered in double_scalars
  margin_error = t.ppf(ci, df) * std_n0n1 * sqrt(1/n0 + 1 / float(n1))
 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
DecisionTree, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
RandomForest, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 

Training models: protected_class = age_below_25
-----------------------------------------------
LogisticRegression, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
DecisionTree, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
RandomForest, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 


In [25]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

UTILITY_METRICS = ["auc"]
FAIRNESS_METRICS = ["mean_diff"]

def summarize_experiment_results(experiment_df):
    return (
        experiment_df
        .drop("cv_fold", axis=1)
        .groupby(["protected_class", "estimator", "fold_type"])
        .mean())

experiment_baseline = pd.concat([
    experiment_baseline_female,
    experiment_baseline_foreign,
    experiment_baseline_age_below_25
])
experiment_baseline_summary = summarize_experiment_results(
    experiment_baseline)
experiment_baseline_summary.query("fold_type == 'test'")


Out[25]:
auc mean_diff
protected_class estimator fold_type
age_below_25 DecisionTree test 0.697690 0.200348
LogisticRegression test 0.672438 0.322763
RandomForest test 0.787733 0.251217
female DecisionTree test 0.696033 0.049169
LogisticRegression test 0.672438 0.095023
RandomForest test 0.783771 0.091083
foreign_worker DecisionTree test 0.698857 0.116408
LogisticRegression test 0.672438 0.293640
RandomForest test 0.788571 0.222311

In [126]:
baseline_df = (
    experiment_baseline
    .query("fold_type == 'test' and estimator == 'LogisticRegression'")
)
sns.factorplot(y="protected_class", x="mean_diff", orient="h", data=baseline_df,
               size=4, aspect=2, join=False)


Out[126]:
<seaborn.axisgrid.FacetGrid at 0x1146707d0>

In [114]:
protected_classes = ["female", "foreign_worker", "age_below_25"]
for s in protected_classes:
    mean_ci = mean_confidence_interval(
        baseline_df.query("protected_class == @s").mean_diff.dropna())
    print(
        "grand_mean(mean_diff) for %s - mean: %0.03f, 95%% CI(%0.03f, %0.03f)" %
        (s, mean_ci[0], mean_ci[1], mean_ci[2]))


grand_mean(mean_diff) for female - mean: 0.095, 95% CI(0.064, 0.126)
grand_mean(mean_diff) for foreign_worker - mean: 0.294, 95% CI(0.243, 0.345)
grand_mean(mean_diff) for age_below_25 - mean: 0.323, 95% CI(0.292, 0.354)

In [26]:
def plot_experiment_results(experiment_results):
    return (
        experiment_results
        .query("fold_type == 'test'")
        .drop(["fold_type", "cv_fold"], axis=1)
        .pipe(pd.melt, id_vars=["protected_class", "estimator"],
              var_name="metric", value_name="score")
        .pipe((sns.factorplot, "data"), y="metric",
              x="score", hue="estimator", col="protected_class", col_wrap=3,
              size=3.5, aspect=1.2, join=False, dodge=0.4))

plot_experiment_results(experiment_baseline);


It appears that the variance of the fairness metric across the 50 cross-validation folds is higher for normalized_mean_difference than for mean_difference, likely because the normalization factor d_max depends on the rate of positive labels in the data. The RuntimeWarnings in the foreign_worker run above are related: with only 37 non-foreign workers in the dataset, some validation folds contain no members of the advantaged group at all, so the group mean (and hence the metric) is undefined for those folds.


In [27]:
from IPython.display import Markdown, display

def print_best_metrics(experiment_results, protected_classes):
    for pclass in protected_classes:
        msg = "#### protected class = %s:" % pclass
        display(Markdown(msg))
        exp_df = experiment_results[
            (experiment_results["protected_class"] == pclass) &
            (experiment_results["fold_type"] == "test")]
        msg = ""
        
        for m in UTILITY_METRICS:
            utility_msg = \
                "- best utility measured by %s (higher is better)" % m
            best_model = (
                exp_df
                .sort_values(m, ascending=False)
                .drop(["fold_type"], axis=1)
                .iloc[0][[m, "estimator"]])
            msg += utility_msg + " = %0.03f: %s\n" % \
                   (best_model[0], best_model[1])
        
        for m in FAIRNESS_METRICS:
            fairness_msg = \
                "- best fairness measured by %s (lower is better)" % m
            best_model = (
                exp_df
                # score closer to zero is better
                .assign(abs_measure=lambda df: df[m].abs())
                .sort_values("abs_measure")
                .drop(["abs_measure", "fold_type"], axis=1)
                .iloc[0][[m, "estimator"]])
            msg += fairness_msg + " = %0.03f: %s\n" % \
                   (best_model[0], best_model[1])
        display(Markdown(msg))
        
print_best_metrics(
    experiment_baseline_summary.reset_index(),
    ["female", "foreign_worker", "age_below_25"])


protected class = female:

  • best utility measured by auc (higher is better) = 0.784: RandomForest
  • best fairness measured by mean_diff (lower is better) = 0.049: DecisionTree

protected class = foreign_worker:

  • best utility measured by auc (higher is better) = 0.789: RandomForest
  • best fairness measured by mean_diff (lower is better) = 0.116: DecisionTree

protected class = age_below_25:

  • best utility measured by auc (higher is better) = 0.788: RandomForest
  • best fairness measured by mean_diff (lower is better) = 0.200: DecisionTree

Naive Fairness-aware Approach: Remove Protected Class

The naive approach to training fairness-aware models is to remove the protected class variables from the input data. While at face value this might seem like a good way to prevent the model from learning discriminatory patterns in the raw data, it doesn't preclude the possibility that other, non-protected variables are highly correlated with the protected class variables.

A well-known example of this is how zipcode correlates with race: zipcode essentially serves as a proxy for race in the training data even when race is excluded from the inputs. (A quick proxy check is sketched after the reduced feature sets are defined below.)


In [28]:
# create feature sets that remove variables with protected class information
feature_set_no_sex = [
    f for f in feature_set_1 if
    f not in [
        'personal_status_and_sex_female_divorced/separated/married',
        'personal_status_and_sex_male_divorced/separated',
        'personal_status_and_sex_male_married/widowed',
        'personal_status_and_sex_male_single']]
feature_set_no_foreign = [f for f in feature_set_1 if f != "foreign_worker"]
feature_set_no_age = [f for f in feature_set_1 if f != "age_in_years"]
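
As a quick sanity check on the proxy problem, we can try to predict the protected class itself from the scrubbed feature set: if a simple classifier recovers it with an AUC well above 0.5, proxy information is still present. This check is a sketch and not part of the original pipeline:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# can we recover `female` from the feature set that excludes sex?
proxy_auc = cross_val_score(
    LogisticRegression(class_weight="balanced"),
    german_credit_preprocessed[feature_set_no_sex].values,
    german_credit_preprocessed["female"].values,
    scoring="roc_auc", cv=5)
print("proxy AUC for female: %0.3f" % proxy_auc.mean())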

In [29]:
# training and target data
X_no_sex = german_credit_preprocessed[feature_set_no_sex].values
X_no_foreign = german_credit_preprocessed[feature_set_no_foreign].values
X_no_age = german_credit_preprocessed[feature_set_no_age].values

experiment_naive_female = cross_validation_experiment(
    estimators, X_no_sex, y, s_female, "female")
experiment_naive_foreign = cross_validation_experiment(
    estimators, X_no_foreign, y, s_foreign, "foreign_worker")
experiment_naive_age_below_25 = cross_validation_experiment(
    estimators, X_no_age, y, s_age_below_25, "age_below_25")


Training models: protected_class = female
-----------------------------------------
LogisticRegression, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
DecisionTree, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
RandomForest, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 

Training models: protected_class = foreign_worker
-------------------------------------------------
LogisticRegression, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
DecisionTree, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
RandomForest, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 

Training models: protected_class = age_below_25
-----------------------------------------------
LogisticRegression, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
DecisionTree, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
RandomForest, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 


In [30]:
experiment_naive = pd.concat([
    experiment_naive_female,
    experiment_naive_foreign,
    experiment_naive_age_below_25
])
experiment_naive_summary = summarize_experiment_results(experiment_naive)
experiment_naive_summary.query("fold_type == 'test'")


Out[30]:
auc mean_diff
protected_class estimator fold_type
age_below_25 DecisionTree test 0.698590 0.116459
LogisticRegression test 0.683210 0.111665
RandomForest test 0.787333 0.171979
female DecisionTree test 0.704805 0.059771
LogisticRegression test 0.671352 0.075295
RandomForest test 0.791790 0.058341
foreign_worker DecisionTree test 0.701500 0.126806
LogisticRegression test 0.672362 0.293423
RandomForest test 0.787114 0.183946

In [31]:
plot_experiment_results(experiment_naive);



In [32]:
print_best_metrics(
    experiment_naive_summary.reset_index(),
    ["female", "foreign_worker", "age_below_25"])


protected class = female:

  • best utility measured by auc (higher is better) = 0.792: RandomForest
  • best fairness measured by mean_diff (lower is better) = 0.058: RandomForest

protected class = foreign_worker:

  • best utility measured by auc (higher is better) = 0.787: RandomForest
  • best fairness measured by mean_diff (lower is better) = 0.127: DecisionTree

protected class = age_below_25:

  • best utility measured by auc (higher is better) = 0.787: RandomForest
  • best fairness measured by mean_diff (lower is better) = 0.112: LogisticRegression

Fairness-aware Method: Relabelling

In this and the following fairness-aware modeling runs, we exclude the protected class variables, as in the Naive Fairness-aware Approach section above, in addition to applying the explicit fairness-aware technique.
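
For context, relabelling (sometimes called "massaging") flips the labels of borderline training observations: it promotes some negatives in the disadvantaged group and demotes some positives in the advantaged group, so that the positive-label rates of the two groups move toward each other. Below is a rough sketch of the idea, not themis_ml's actual Relabeller implementation:

import numpy as np
from sklearn.linear_model import LogisticRegression

def massage_labels(X, y, s, n_swaps):
    # rank observations by a helper model's score for the positive class
    scores = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
    y = np.asarray(y).copy()
    # promote the highest-scoring negatives in the disadvantaged group
    promote = np.where((s == 1) & (y == 0))[0]
    promote = promote[np.argsort(-scores[promote])][:n_swaps]
    # demote the lowest-scoring positives in the advantaged group
    demote = np.where((s == 0) & (y == 1))[0]
    demote = demote[np.argsort(scores[demote])][:n_swaps]
    y[promote] = 1
    y[demote] = 0
    return y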


In [33]:
from sklearn.base import clone

from themis_ml.preprocessing.relabelling import Relabeller
from themis_ml.meta_estimators import FairnessAwareMetaEstimator

# here we use the Relabeller class to create new y vectors for each of the
# protected class contexts.

# we also use the FairnessAwareMetaEstimator as a convenience class to
# compose together different fairness-aware methods. This wraps around the
# estimators that we defined in the previous section.
relabeller = Relabeller()
relabelling_estimators = [
    (name, FairnessAwareMetaEstimator(e, relabeller=relabeller))
    for name, e in estimators]

experiment_relabel_female = cross_validation_experiment(
    relabelling_estimators, X_no_sex, y, s_female, "female")
experiment_relabel_foreign = cross_validation_experiment(
    relabelling_estimators, X_no_foreign, y, s_foreign, "foreign_worker")
experiment_relabel_age_below_25 = cross_validation_experiment(
    relabelling_estimators, X_no_age, y, s_age_below_25, "age_below_25")


Training models: protected_class = female
-----------------------------------------
LogisticRegression, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
DecisionTree, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
RandomForest, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 

Training models: protected_class = foreign_worker
-------------------------------------------------
LogisticRegression, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
DecisionTree, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
RandomForest, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 

Training models: protected_class = age_below_25
-----------------------------------------------
LogisticRegression, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
DecisionTree, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
RandomForest, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 


In [34]:
experiment_relabel = pd.concat([
    experiment_relabel_female,
    experiment_relabel_foreign,
    experiment_relabel_age_below_25
])
experiment_relabel_summary = summarize_experiment_results(experiment_relabel)
experiment_relabel_summary.query("fold_type == 'test'")


Out[34]:
auc mean_diff
protected_class estimator fold_type
age_below_25 DecisionTree test 0.695581 0.057754
LogisticRegression test 0.678210 0.080572
RandomForest test 0.779476 0.082186
female DecisionTree test 0.709595 0.051679
LogisticRegression test 0.669229 0.058147
RandomForest test 0.779895 0.050829
foreign_worker DecisionTree test 0.694357 0.154480
LogisticRegression test 0.671314 0.291044
RandomForest test 0.785429 0.161829

In [35]:
plot_experiment_results(experiment_relabel);



In [36]:
print_best_metrics(
    experiment_relabel_summary.reset_index(),
    ["female", "foreign_worker", "age_below_25"])


protected class = female:

  • best utility measured by auc (higher is better) = 0.780: RandomForest
  • best fairness measured by mean_diff (lower is better) = 0.051: RandomForest

protected class = foreign_worker:

  • best utility measured by auc (higher is better) = 0.785: RandomForest
  • best fairness measured by mean_diff (lower is better) = 0.154: DecisionTree

protected class = age_below_25:

  • best utility measured by auc (higher is better) = 0.779: RandomForest
  • best fairness measured by mean_diff (lower is better) = 0.058: DecisionTree

Validation Curve: Logistic Regression

To see how the utility and fairness metrics respond to model capacity, we sweep the logistic regression L2 regularization parameter C (with relabelling applied) and record train and test AUC and mean difference for each protected class.


In [37]:
LOGREG_L2_PARAM = [
    3, 1, 3e-1, 1e-1, 3e-2, 1e-2, 3e-3, 1e-3,
    3e-4, 1e-4, 3e-5, 1e-5, 3e-6, 1e-6, 3e-7, 1e-7, 3e-8, 1e-8]

def validation_curve_experiment(
        estimator_name, estimator, param_name, param_list, update_func):
    validation_curve_results = []
    for param in param_list:
        e = clone(estimator)
        e = update_func(e, param_name, param)
        estimators = [(estimator_name, e)]
        experiment_relabel_female = cross_validation_experiment(
            estimators, X_no_sex, y, s_female, "female",
            verbose=False)
        experiment_relabel_foreign = cross_validation_experiment(
            estimators, X_no_foreign, y, s_foreign, "foreign_worker",
            verbose=False)
        experiment_relabel_age_below_25 = cross_validation_experiment(
            estimators, X_no_age, y, s_age_below_25, "age_below_25",
            verbose=False)
        validation_curve_results.extend(
            [experiment_relabel_female.assign(**{param_name: param}),
             experiment_relabel_foreign.assign(**{param_name: param}),
             experiment_relabel_age_below_25.assign(**{param_name: param})])
    return pd.concat(validation_curve_results)


def update_relabeller(e, param_name, param):
    e = clone(e)
    child_estimator = clone(e.estimator)
    child_estimator.set_params(**{param_name: param})
    e.set_params(estimator=child_estimator)
    return e
        

relabel_validation_curve_experiment = validation_curve_experiment(
    "LogisticRegression", FairnessAwareMetaEstimator(
        LOGISTIC_REGRESSION, relabeller=Relabeller()),
    "C", LOGREG_L2_PARAM, update_relabeller)

In [38]:
def validation_curve_plot(x, y, **kwargs):
    ax = plt.gca()
    lw = 2.5
    data = kwargs.pop("data")
    train_data = data.query("fold_type == 'train'")
    test_data = data.query("fold_type == 'test'")
    grp_data_train = train_data.groupby(x)
    grp_data_test = test_data.groupby(x)
    mean_data_train = grp_data_train[y].mean()
    mean_data_test = grp_data_test[y].mean()
    std_data_train = grp_data_train[y].std()
    std_data_test = grp_data_test[y].std()
    ax.semilogx(mean_data_train.index, mean_data_train,
                label="train", color="#848484", lw=lw)
    ax.semilogx(mean_data_test.index, mean_data_test,
                label="test", color="#ae33bf", lw=lw)
    # add error region around the train and test curves
    ax.fill_between(mean_data_train.index, mean_data_train - std_data_train,
                    mean_data_train + std_data_train, alpha=0.2,
                    color="darkorange", lw=lw)
    ax.fill_between(mean_data_test.index, mean_data_test - std_data_test,
                    mean_data_test + std_data_test, alpha=0.1,
                    color="navy", lw=lw)

relabel_validation_curve_experiment_df = (
    relabel_validation_curve_experiment
    .pipe(pd.melt,
          id_vars=["protected_class", "estimator", "cv_fold", "fold_type",
                   "C"],
          value_vars=["auc", "mean_diff"],
          var_name="metric", value_name="score")
    .assign(
        protected_class=lambda df: df.protected_class.str.replace("_", " "),
        metric=lambda df: df.metric.str.replace("_", " "))
    .rename(columns={"score": "mean score"})
)

g = sns.FacetGrid(
    relabel_validation_curve_experiment_df,
    row="protected_class",
    col="metric", size=2.5, aspect=1.1, sharey=False,
    margin_titles=False)
g = g.map_dataframe(validation_curve_plot, "C", "mean score")
g.set_titles(template="{row_name}, {col_name}")
# g.add_legend()
# g.add_legend(bbox_to_anchor=(0.275, 0.91))
g.add_legend(bbox_to_anchor=(0.28, 0.9))
g.fig.tight_layout()
g.savefig("IMG/logistic_regression_validation_curve.png");


Fairness-aware Method: Additive Counterfactually Fair Model
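
The idea behind the additive counterfactually fair (ACF) approach is to strip each input variable of its linear dependence on the protected attribute: regress each feature on s, keep only the residuals, and train the classifier on those residuals. Below is a rough sketch of that idea; the helper is hypothetical and only assumes LinearACFClassifier works along these lines:

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def fit_acf_sketch(X, y, s):
    s = np.asarray(s).reshape(-1, 1)
    # one regressor per feature: predict the feature from the protected
    # attribute, then keep the part the protected attribute can't explain
    feature_regressors = [
        LinearRegression().fit(s, X[:, j]) for j in range(X.shape[1])]
    residuals = np.column_stack(
        [X[:, j] - reg.predict(s)
         for j, reg in enumerate(feature_regressors)])
    # the downstream classifier only ever sees the residuals
    return feature_regressors, LogisticRegression().fit(residuals, y)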


In [40]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

from themis_ml.linear_model.counterfactually_fair_models import \
    LinearACFClassifier

LINEAR_REG = LinearRegression()
DECISION_TREE_REG = DecisionTreeRegressor(max_depth=10, min_samples_leaf=10)
RANDOM_FOREST_REG = RandomForestRegressor(
    n_estimators=50, max_depth=10, min_samples_leaf=10)

# use the estimators defined above to define the linear additive
# counterfactually fair models
linear_acf_estimators = [
    (name, LinearACFClassifier(
         target_estimator=e,
         binary_residual_type="absolute"))
    for name, e in estimators]

experiment_acf_female = cross_validation_experiment(
    linear_acf_estimators, X_no_sex, y, s_female, "female")
experiment_acf_foreign = cross_validation_experiment(
    linear_acf_estimators, X_no_foreign, y, s_foreign, "foreign_worker")
experiment_acf_age_below_25 = cross_validation_experiment(
    linear_acf_estimators, X_no_age, y, s_age_below_25, "age_below_25")


Training models: protected_class = female
-----------------------------------------
LogisticRegression, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
DecisionTree, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
RandomForest, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 

Training models: protected_class = foreign_worker
-------------------------------------------------
LogisticRegression, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
DecisionTree, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
RandomForest, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 

Training models: protected_class = age_below_25
-----------------------------------------------
LogisticRegression, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
DecisionTree, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
RandomForest, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 


In [41]:
experiment_acf = pd.concat([
    experiment_acf_female,
    experiment_acf_foreign,
    experiment_acf_age_below_25
])
experiment_acf_summary = summarize_experiment_results(experiment_acf)
experiment_acf_summary.query("fold_type == 'test'")


Out[41]:
auc mean_diff
protected_class estimator fold_type
age_below_25 DecisionTree test 0.693614 0.267286
LogisticRegression test 0.668181 -0.033848
RandomForest test 0.782543 0.397064
female DecisionTree test 0.676395 0.113156
LogisticRegression test 0.670010 0.028742
RandomForest test 0.782667 0.150053
foreign_worker DecisionTree test 0.697519 0.335715
LogisticRegression test 0.667190 0.118259
RandomForest test 0.784257 0.381327

In [43]:
plot_experiment_results(experiment_acf);



In [44]:
print_best_metrics(
    experiment_acf_summary.reset_index(),
    ["female", "foreign_worker", "age_below_25"])


protected class = female:

  • best utility measured by auc (higher is better) = 0.783: RandomForest
  • best fairness measured by mean_diff (lower is better) = 0.029: LogisticRegression

protected class = foreign_worker:

  • best utility measured by auc (higher is better) = 0.784: RandomForest
  • best fairness measured by mean_diff (lower is better) = 0.118: LogisticRegression

protected class = age_below_25:

  • best utility measured by auc (higher is better) = 0.783: RandomForest
  • best fairness measured by mean_diff (lower is better) = -0.034: LogisticRegression

Fairness-aware Method: Reject-option Classification
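
Reject-option classification is a post-processing technique: wherever the classifier is least certain, i.e. for predicted probabilities inside a "critical region" around the 0.5 decision boundary, predictions are flipped in favor of the disadvantaged group. Below is a rough sketch of the idea; the theta parameter for the critical-region width is an assumption for illustration, not SingleROClassifier's actual API:

import numpy as np

def reject_option_predict(proba, s, theta=0.1):
    # proba: P(y = 1) from a fitted classifier; s == 1 marks the
    # disadvantaged group
    pred = (proba > 0.5).astype(int)
    # inside the critical region [0.5 - theta, 0.5 + theta], decide in
    # favor of the disadvantaged group
    critical = np.abs(proba - 0.5) <= theta
    pred[critical & (s == 1)] = 1
    pred[critical & (s == 0)] = 0
    return pred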


In [45]:
from themis_ml.postprocessing.reject_option_classification import \
    SingleROClassifier

# use the estimators defined above to define the single reject-option
# classification models
single_roc_clf_estimators = [
    (name, SingleROClassifier(estimator=e))
    for name, e in estimators]

experiment_single_roc_female = cross_validation_experiment(
    single_roc_clf_estimators, X_no_sex, y, s_female, "female")
experiment_single_roc_foreign = cross_validation_experiment(
    single_roc_clf_estimators, X_no_foreign, y, s_foreign, "foreign_worker")
experiment_single_roc_age_below_25 = cross_validation_experiment(
    single_roc_clf_estimators, X_no_age, y, s_age_below_25, "age_below_25")


Training models: protected_class = female
-----------------------------------------
LogisticRegression, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
DecisionTree, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
RandomForest, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 

Training models: protected_class = foreign_worker
-------------------------------------------------
LogisticRegression, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
DecisionTree, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
RandomForest, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 

Training models: protected_class = age_below_25
-----------------------------------------------
LogisticRegression, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
DecisionTree, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
RandomForest, fold: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 


In [46]:
experiment_single_roc = pd.concat([
    experiment_single_roc_female,
    experiment_single_roc_foreign,
    experiment_single_roc_age_below_25
])
experiment_single_roc_summary = summarize_experiment_results(
    experiment_single_roc)
experiment_single_roc_summary.query("fold_type == 'test'")


Out[46]:
auc mean_diff
protected_class estimator fold_type
age_below_25 DecisionTree test 0.696662 0.123793
LogisticRegression test 0.532438 -0.040159
RandomForest test 0.764200 0.123144
female DecisionTree test 0.697976 0.032355
LogisticRegression test 0.619505 -0.038017
RandomForest test 0.770295 0.026264
foreign_worker DecisionTree test 0.687981 0.150091
LogisticRegression test 0.619476 0.081522
RandomForest test 0.758657 0.117810

In [47]:
plot_experiment_results(experiment_single_roc);



In [48]:
print_best_metrics(
    experiment_single_roc_summary.reset_index(),
    ["female", "foreign_worker", "age_below_25"])


protected class = female:

  • best utility measured by auc (higher is better) = 0.770: RandomForest
  • best fairness measured by mean_diff (lower is better) = 0.026: RandomForest

protected class = foreign_worker:

  • best utility measured by auc (higher is better) = 0.759: RandomForest
  • best fairness measured by mean_diff (lower is better) = 0.082: LogisticRegression

protected class = age_below_25:

  • best utility measured by auc (higher is better) = 0.764: RandomForest
  • best fairness measured by mean_diff (lower is better) = -0.040: LogisticRegression

Comparison of Fairness-aware Techniques


In [49]:
compare_experiments = (
    pd.concat([
        # experiment codes: B = baseline, RPA = remove protected attribute,
        # RTV = relabel target variable, CFM = counterfactually fair model,
        # ROC = reject-option classification
        experiment_baseline.assign(experiment="B"),
        experiment_naive.assign(experiment="RPA"),
        experiment_relabel.assign(experiment="RTV"),
        experiment_acf.assign(experiment="CFM"),
        experiment_single_roc.assign(experiment="ROC")
    ])
    .assign(
        protected_class=lambda df: df.protected_class.str.replace("_", " "),
    )
)
compare_experiments.head()


Out[49]:
protected_class estimator cv_fold fold_type auc mean_diff experiment
0 female LogisticRegression 0 train 0.683563 0.090909 B
1 female LogisticRegression 0 test 0.662381 0.116484 B
2 female LogisticRegression 1 train 0.675397 0.102691 B
3 female LogisticRegression 1 test 0.737143 0.062500 B
4 female LogisticRegression 2 train 0.686843 0.126646 B

In [135]:
comparison_palette = sns.color_palette("Dark2", n_colors=8)

def compare_experiment_results_multiple_model(experiment_results):
    g = (
        experiment_results
        .query("fold_type == 'test'")
        .drop(["cv_fold"], axis=1)
        .pipe(pd.melt, id_vars=["experiment", "protected_class", "estimator",
                                "fold_type"],
              var_name="metric", value_name="score")
        .assign(
            metric=lambda df: df.metric.str.replace("_", " "))
        .pipe((sns.factorplot, "data"), y="experiment",
              x="score", hue="metric",
              col="protected_class", row="estimator",
              join=False, size=3, aspect=1.1, dodge=0.3,
              palette=comparison_palette, margin_titles=True, legend=False))
    g.set_axis_labels("mean score (95% CI)")
    for ax in g.axes.ravel():
        ax.set_ylabel("")
        plt.setp(ax.texts, text="")
    g.set_titles(row_template="{row_name}", col_template="{col_name}")
    plt.legend(title="metric", loc=9, bbox_to_anchor=(-0.65, -0.4))
    g.fig.legend(loc=9, bbox_to_anchor=(0.5, -0.3))
    g.fig.tight_layout()
    g.savefig("IMG/fairness_aware_comparison.png", dpi=500);

compare_experiment_results_multiple_model(
    compare_experiments.query("estimator == 'LogisticRegression'"));


We can make some interesting observations when comparing the results from the different fairness-aware techniques. For example, with LogisticRegression the test-set mean difference for age_below_25 drops from about 0.32 at baseline to 0.11 when the protected attribute is removed, 0.08 with relabelling, and roughly zero with the counterfactually fair and reject-option methods, though the reject-option classifier pays for this with a substantial drop in AUC (0.53 versus roughly 0.67-0.68 for the other methods).


In [51]:
from scipy import stats

def compute_corr_pearson(x, y, ci=0.95):
    """Pearson correlation with a confidence interval computed via the
    Fisher z-transformation."""
    corr = stats.pearsonr(x, y)
    z = np.arctanh(corr[0])
    # arctanh(r) is approximately normal with standard error 1 / sqrt(n - 3)
    sigma = 1 / ((len(x) - 3) ** 0.5)
    cint = z + np.array([-1, 1]) * sigma * stats.norm.ppf((1 + ci) / 2)
    return corr, np.tanh(cint)
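
To unpack the function above: the sampling distribution of the correlation coefficient r is skewed, but z = arctanh(r) is approximately normal with standard error 1/sqrt(n - 3) (Fisher's z-transformation). The confidence interval is therefore constructed in z-space as z ± z_crit / sqrt(n - 3), with z_crit = 1.96 for a 95% interval, and mapped back to the correlation scale with tanh.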

In [52]:
black_palette = sns.color_palette(["#222222"])

def plot_utility_fairness_tradeoff(x, y, **kwargs):
    ax = plt.gca()
    data = kwargs.pop("data")
    sns_ax = sns.regplot(x=x, y=y, data=data, scatter_kws={'alpha':0.5},
                         **kwargs)
    (corr, p_val), ci = compute_corr_pearson(data[x], data[y])
    r_text = 'r = %0.02f (%0.02f, %0.02f)' % \
        (corr, ci[0], ci[1])
    sns_ax.annotate(
        r_text, xy=(0.7, 0),
        xytext=(0.07, 0.91),
        textcoords='axes fraction',
        fontweight="bold",
        fontsize=9,
        color="gray"
    )
    bottom_padding = 0.05
    top_padding = 0.5
    ylim = (data[y].min() - bottom_padding, data[y].max() + top_padding)
    sns_ax.set_ylim(*ylim)


g = sns.FacetGrid(
    (
        compare_experiments
        .drop("cv_fold", axis=1)
        .reset_index()
        .query("fold_type == 'test'")
        .rename(
            columns={"mean_diff": "mean diff"})
    ),
    col="protected_class",
    row="experiment",
    hue="experiment",
    size=2.0, aspect=1.3, sharey=True,
    palette=black_palette)
g.map_dataframe(plot_utility_fairness_tradeoff, "auc", "mean diff")
g.set_titles(template="{row_name}, {col_name}")
g.fig.tight_layout()
g.savefig("IMG/fairness_utility_tradeoff.png", dpi=500);



In [53]:
g = sns.FacetGrid(
    (
        compare_experiments
        .drop("cv_fold", axis=1)
        .reset_index()
        .query("fold_type == 'test'")
        .rename(
            columns={"mean_diff": "mean diff"})
    ),
    col="protected_class",
    row="estimator",
    hue="estimator",
    size=3.5, aspect=1,
    sharey=True, sharex=False,
    palette=black_palette)
g.map_dataframe(plot_utility_fairness_tradeoff, "auc", "mean diff")
g.set_titles(template="{row_name}, {col_name}")
g.fig.tight_layout()