In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# See the License for the specific language governing permissions and
# limitations under the License.

Introduction to ML Fairness


This exercise explores just a small subset of ideas and techniques relevant to fairness in machine learning; it is not the whole story!

Learning Objectives

  • Increase awareness of different types of biases that can manifest in model data.
  • Explore feature data to proactively identify potential sources of bias before training a model
  • Evaluate model performace by subgroup rather than in aggregate


In this exercise, you'll explore explore datasets and evaluate classifiers with fairness in mind, noting the ways undesirable biases can creep into machine learning (ML).

Throughout, you will see FairAware tasks, which provide opportunities to contextualize ML processes with respect to fairness. In performing these tasks, you'll identify biases and consider the long-term impact of model predictions if these biases are not addressed.

About the Dataset and Prediction Task

In this exercise, you'll work with the Adult Census Income dataset, which is commonly used in machine learning literature. This data was extracted from the 1994 Census bureau database by Ronny Kohavi and Barry Becker.

Each example in the dataset contains the following demographic data for a set of individuals who took part in the 1994 Census:

Numeric Features

  • age: The age of the individual in years.
  • fnlwgt: The number of individuals the Census Organizations believes that set of observations represents.
  • education_num: An enumeration of the categorical representation of education. The higher the number, the higher the education that individual achieved. For example, an education_num of 11 represents Assoc_voc (associate degree at a vocational school), an education_num of 13 represents Bachelors, and an education_num of 9 represents HS-grad (high school graduate).
  • capital_gain: Capital gain made by the individual, represented in US Dollars.
  • capital_loss: Capital loss mabe by the individual, represented in US Dollars.
  • hours_per_week: Hours worked per week.

Categorical Features

  • workclass: The individual's type of employer. Examples include: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, and Never-worked.
  • education: The highest level of education achieved for that individual.
  • marital_status: Marital status of the individual. Examples include: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, and Married-AF-spouse.
  • occupation: The occupation of the individual. Example include: tech-support, Craft-repair, Other-service, Sales, Exec-managerial and more.
  • relationship: The relationship of each individual in a household. Examples include: Wife, Own-child, Husband, Not-in-family, Other-relative, and Unmarried.
  • gender: Gender of the individual available only in binary choices: Female or Male.
  • race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Black, and Other.
  • native_country: Country of origin of the individual. Examples include: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, and more.

Prediction Task

The prediction task is to determine whether a person makes over $50,000 US Dollar a year.


  • income_bracket: Whether the person makes more than $50,000 US Dollars annually.

Notes on Data Collection

All the examples extracted for this dataset meet the following conditions:

  • age is 16 years or older.
  • The adjusted gross income (used to calculate income_bracket) is greater than $100 USD annually.
  • fnlwgt is greater than 0.
  • hours_per_week is greater than 0.


First, import some modules that will be used throughout this notebook.

In [0]:
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import tensorflow as tf
import tempfile
!pip install seaborn==0.8.1
import seaborn as sns
import itertools
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import precision_recall_curve
from google.colab import widgets
# For facets
from IPython.core.display import display, HTML
import base64
!pip install -q hopsfacets 
import hopsfacets as facets
from hopsfacets.feature_statistics_generator import FeatureStatisticsGenerator

print('Modules are imported.')

Load the Adult Dataset

With the modules now imported, we can load the Adult dataset into a pandas DataFrame data structure.

In [0]:
COLUMNS = ["age", "workclass", "fnlwgt", "education", "education_num",
           "marital_status", "occupation", "relationship", "race", "gender",
           "capital_gain", "capital_loss", "hours_per_week", "native_country",

train_df = pd.read_csv(
test_df = pd.read_csv(

# Drop rows with missing values
train_df = train_df.dropna(how="any", axis=0)
test_df = test_df.dropna(how="any", axis=0)

print('UCI Adult Census Income dataset loaded.')

Analyzing the Adult Dataset with Facets

As mentioned in MLCC, it is important to understand your dataset before diving straight into the prediction task.

Some important questions to investigate when auditing a dataset for fairness:

  • Are there missing feature values for a large number of observations?
  • Are there features that are missing that might affect other features?
  • Are there any unexpected feature values?
  • What signs of data skew do you see?

To start, we can use Facets Overview, an interactive visualization tool that can help us explore the dataset. With Facets Overview, we can quickly analyze the distribution of values across the Adult dataset.

In [0]:
#@title Visualize the Data in Facets
fsg = FeatureStatisticsGenerator()
dataframes = [
    {'table': train_df, 'name': 'trainData'}]
censusProto = fsg.ProtoFromDataFrames(dataframes)
protostr = base64.b64encode(censusProto.SerializeToString()).decode("utf-8")

HTML_TEMPLATE = """<link rel="import" href="">
        <facets-overview id="elem"></facets-overview>
          document.querySelector("#elem").protoInput = "{protostr}";
html = HTML_TEMPLATE.format(protostr=protostr)

FairAware Task #1

Review the descriptive statistics and histograms for each numerical and continuous feature. Click the Show Raw Data button above the histograms for categorical features to see the distribution of values per category.

Then, try to answer the following questions from earlier:

  1. Are there missing feature values for a large number of observations?
  2. Are there features that are missing that might affect other features?
  3. Are there any unexpected feature values?
  4. What signs of data skew do you see?


Click below for some insights we uncovered.

We can see from reviewing the missing columns for both numeric and categorical features that there are no missing feature values, so that is not a concern here.

By looking at the min/max values and histograms for each numeric feature, we can pinpoint any extreme outliers in our data set. For hours_per_week, we can see that the minimum is 1, which might be a bit surprising, given that most jobs typically require multiple hours of work per week. For capital_gain and capital_loss, we can see that over 90% of values are 0. Given that capital gains/losses are only registered by individuals who make investments, it's certainly plausible that less than 10% of examples would have nonzero values for these feature, but we may want to take a closer look to verify the values for these features are valid.

In looking at the histogram for gender, we see that over two-thirds (approximately 67%) of examples represent males. This strongly suggests data skew, as we would expect the breakdown between genders to be closer to 50/50.

A Deeper Dive

To futher explore the dataset, we can use Facets Dive, a tool that provides an interactive interface where each individual item in the visualization represents a data point. But to use Facets Dive, we need to convert our data to a JSON array. Thankfully the DataFrame method to_json() takes care of this for us.

Run the cell below to perform the data transform to JSON and also load Facets Dive.

In [0]:
#@title Set the Number of Data Points to Visualize in Facets Dive

SAMPLE_SIZE = 2500 #@param
train_dive = train_df.sample(SAMPLE_SIZE).to_json(orient='records')

HTML_TEMPLATE = """<link rel="import" href="">
        <facets-dive id="elem" height="600"></facets-dive>
          var data = {jsonstr};
          document.querySelector("#elem").data = data;
html = HTML_TEMPLATE.format(jsonstr=train_dive)

FairAware Task #2

Use the menus on the left panel of the visualization to change how the data is organized:

  1. In the Faceting | X-Axis menu, select education, and in the Display | Color and Display | Type menus, select income_bracket. How would you describe the relationship between education level and income bracket?

  2. Next, in the Faceting | X-Axis menu, select marital_status, and in the Display | Color and Display | Type menus, select gender. What noteworthy observations can you make about the gender distributions for each marital-status category?

As you perform the above tasks, keep the following fairness-related questions in mind:

  • What's missing?
  • What's being overgeneralized?
  • What's being underrepresented?
  • How do the variables, and their values, reflect the real world?
  • What might we be leaving out?


Click below for some insights we uncovered.

  1. In our data set, higher education levels generally tend to correlate with a higher income bracket. An income level of greater than $50,000 is more heavily represented in examples where education level is Bachelor's degree or higher.

  2. In most marital-status categories, the distribution of male vs. female values is close to 1:1. The one notable exception is "married-civ-spouse", where male outnumbers female by more than 5:1. Given that we already discovered in Task #1 that there is a disproportionately high representation of men in our data set, we can now infer that it's married women specifically that are underrepresented in our data.


Plotting histograms, ranking most-to-least common examples, identifying duplicate or missing examples, making sure the training and test sets are similar, computing feature quantiles—these are all critical analyses to perform on your data.

The better you know what's going on in your data, the more insight you'll have as to where unfairness might creep in!

FairAware Task #3

Now that you've explored the dataset using Facets, see if you can identify some of the problems that may arise with regard to fairness based on what you've learned about its features.

Which of the following features might pose a problem with regard to fairness?

Choose a feature from the drop-down options in the cell below, and then run the cell to check your answer. Then explore the rest of the options to get more insight about how each influences the model's predictions.

In [0]:
feature = 'capital_gain / capital_loss' #@param ["", "hours_per_week", "fnlwgt", "gender", "capital_gain / capital_loss", "age"] {allow-input: false}

if feature == "hours_per_week":
'''It does seem a little strange to see 'hours_per_week' max out at 99 hours,
which could lead to data misrepresentation. One way to address this is by
representing 'hours_per_week' as a binary "working 40 hours/not working 40
hours" feature. Also keep in mind that data was extracted based on work hours
being greater than 0. In other words, this feature representation exclude a
subpopulation of the US that is not working. This could skew the outcomes of the
if feature == "fnlwgt":
"""'fnlwgt' represents the weight of the observations. After fitting the model
to this data set, if certain group of individuals end up performing poorly 
compared to other groups, then we could explore ways of reweighting each data 
point using this feature.""")
if feature == "gender":
"""Looking at the ratio between men and women shows how disproportionate the data
is compared to the real world where the ratio (at least in the US) is closer to
1:1. This could pose a huge probem in performance across gender. Considerable
measures may need to be taken to upsample the underrepresented group (in this
case, women).""")
if feature == "capital_gain / capital_loss":
"""Both 'capital_gain' and 'capital_loss' have very low variance, which might
suggest they don't contribute a whole lot of information for predicting income. It
may be okay to omit these features rather than giving the model more noise.""")
if feature == "age":
'''"age" has a lot of variance, so it might benefit from bucketing to learn
fine-grained correlations between income and age, as well as to prevent
overfitting. "age" has a lot of variance, so it might benefit from bucketization
to learn fine-grained correlations between income and age, as well as to prevent

Prediction Using TensorFlow Estimators

Now that we have a better sense of the Adult dataset, we can now begin with creating a neural network to predict income. In this section, we will be using TensorFlow's Estimator API to access the DNNClassifier class

Convert Adult Dataset into Tensors

We first have to define our input fuction, which will take the Adult dataset that is in a pandas DataFrame and converts it into tensors using the tf.estimator.inputs.pandas_input_fn() function.

In [0]:
def csv_to_pandas_input_fn(data, batch_size=100, num_epochs=1, shuffle=False):
  return tf.estimator.inputs.pandas_input_fn(
      x=data.drop('income_bracket', axis=1),
      y=data['income_bracket'].apply(lambda x: ">50K" in x).astype(int),

print 'csv_to_pandas_input_fn() defined.'

Represent Features in TensorFlow

TensorFlow requires that data maps to a model. To accomplish this, you have to use tf.feature_columns to ingest and represent features in TensorFlow.

In [0]:
#@title Categorical Feature Columns

# Since we don't know the full range of possible values with occupation and
# native_country, we'll use categorical_column_with_hash_bucket() to help map
# each feature string into an integer ID.
occupation = tf.feature_column.categorical_column_with_hash_bucket(
    "occupation", hash_bucket_size=1000)
native_country = tf.feature_column.categorical_column_with_hash_bucket(
    "native_country", hash_bucket_size=1000)

# For the remaining categorical features, since we know what the possible values
# are, we can be more explicit and use categorical_column_with_vocabulary_list()
gender = tf.feature_column.categorical_column_with_vocabulary_list(
    "gender", ["Female", "Male"])
race = tf.feature_column.categorical_column_with_vocabulary_list(
    "race", [
        "White", "Asian-Pac-Islander", "Amer-Indian-Eskimo", "Other", "Black"
education = tf.feature_column.categorical_column_with_vocabulary_list(
    "education", [
        "Bachelors", "HS-grad", "11th", "Masters", "9th",
        "Some-college", "Assoc-acdm", "Assoc-voc", "7th-8th",
        "Doctorate", "Prof-school", "5th-6th", "10th", "1st-4th",
        "Preschool", "12th"
marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
    "marital_status", [
        "Married-civ-spouse", "Divorced", "Married-spouse-absent",
        "Never-married", "Separated", "Married-AF-spouse", "Widowed"
relationship = tf.feature_column.categorical_column_with_vocabulary_list(
    "relationship", [
        "Husband", "Not-in-family", "Wife", "Own-child", "Unmarried",
workclass = tf.feature_column.categorical_column_with_vocabulary_list(
    "workclass", [
        "Self-emp-not-inc", "Private", "State-gov", "Federal-gov",
        "Local-gov", "?", "Self-emp-inc", "Without-pay", "Never-worked"

print 'Categorical feature columns defined.'

In [0]:
#@title Numeric Feature Columns
# For Numeric features, we can just call on feature_column.numeric_column()
# to use its raw value instead of having to create a map between value and ID.
age = tf.feature_column.numeric_column("age")
fnlwgt = tf.feature_column.numeric_column("fnlwgt")
education_num = tf.feature_column.numeric_column("education_num")
capital_gain = tf.feature_column.numeric_column("capital_gain")
capital_loss = tf.feature_column.numeric_column("capital_loss")
hours_per_week = tf.feature_column.numeric_column("hours_per_week")

print 'Numeric feature columns defined.'

Make Age a Categorical Feature

If you chose age when completing FairAware Task #3, you noticed that we suggested that age might benefit from bucketing (also known as binning), grouping together similar ages into different groups. This might help the model generalize better across age. As such, we will convert age from a numeric feature (technically, an ordinal feature) to a categorical feature.

In [0]:
age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])

Consider Key Subgroups

When performing feature engineering, it's important to keep in mind that you may be working with data drawn from individuals belonging to subgroups, for which you'll want to evaluate model performance separately.

NOTE: In this context, a subgroup is defined as a group of individuals who share a given characteristic—such as race, gender, or sexual orientation—that merits special consideration when evaluating a model with fairness in mind.

When we want our models to mitigate, or leverage, the learned signal of a characteristic pertaining to a subgroup, we will want to use different kinds of tools and techniques—most of which are still open research at this point.

As you work with different variables and define tasks for them, it can be useful to think about what comes next. For example, where are the places where the interaction of the variable and the task could be a concern?

Define the Model Features

Now we can explicitly define which feature we will include in our model.

We'll consider gender a subgroup and save it in a separate subgroup_variables list, so we can add special handling for it as needed.

In [0]:
# List of variables, with special handling for gender subgroup.
variables = [native_country, education, occupation, workclass,
             relationship, age_buckets]
subgroup_variables = [gender]
feature_columns = variables + subgroup_variables

Train a Deep Neural Net Model on Adult Dataset

With the features now ready to go, we can try predicting income using deep learning.

For the sake of simplicity, we are going to keep the neural network architecture light by simply defining a feed-forward neural network with two hidden layers.

But first, we have to convert our high-dimensional categorical features into a low-dimensional and dense real-valued vector, which we call an embedding vector. Luckily, indicator_column (think of it as one-hot encoding) and embedding_column (that converts sparse features into dense features) helps us streamline the process.

The following cell creates the deep columns needed to move forward with defining the model.

In [0]:
deep_columns = [
    tf.feature_column.embedding_column(native_country, dimension=8),
    tf.feature_column.embedding_column(occupation, dimension=8),

print deep_columns
print 'Deep columns created.'

Will all our data preprocessing taken care of, we can now define the deep neural net model. Start by using the parameters defined below. (Later on, after you've defined evaluation metrics and evaluated the model, you can come back and tweak these parameters to compare results.)

In [0]:
#@title Define Deep Neural Net Model

HIDDEN_UNITS = [1024, 512] #@param
LEARNING_RATE = 0.1 #@param

model_dir = tempfile.mkdtemp()
single_task_deep_model = tf.estimator.DNNClassifier(

print 'Deep neural net model defined.'

To keep things simple, we will train for 1000 steps—but feel free to play around with this parameter.

In [0]:
#@title Fit Deep Neural Net Model to the Adult Training Dataset

STEPS = 1000 #@param

    input_fn=csv_to_pandas_input_fn(train_df, num_epochs=None, shuffle=True),

print "Deep neural net model is done fitting."

We can now evalute the overall model's performance using the held-out test set.

In [0]:
#@title Evaluate Deep Neural Net Performance

results = single_task_deep_model.evaluate(
    input_fn=csv_to_pandas_input_fn(test_df, num_epochs=1, shuffle=False),
print("model directory = %s" % model_dir)
print("---- Results ----")
for key in sorted(results):
  print("%s: %s" % (key, results[key]))

You can try retraining the model using different parameters. In the end, you will find that a deep neural net does a decent job in predicting income.

But what is missing here is evaluation metrics with respect to subgroups. We will cover some of the ways you can evaluate at the subgroup level in the next section.

Evaluating for Fairness Using a Confusion Matrix

While evaluating the overall performance of the model gives us some insight into its quality, it doesn't give us much insight into how well our model performs for different subgroups.

When evaluating a model for fairness, it's important to determine whether prediction errors are uniform across subgroups or whether certain subgroups are more susceptible to certain prediction errors than others.

A key tool for comparing the prevalence of different types of model errors is a confusion matrix. Recall from the Classification module of Machine Learning Crash Course that a confusion matrix is a grid that plots predictions vs. ground truth for your model, and tabulates statistics summarizing how often your model made the correct prediction and how often it made the wrong prediction.

Let's start by creating a binary confusion matrix for our income-prediction model—binary because our label (income_bracket) has only two possible values (<50K or >50K). We'll define an income of >50K as our positive label, and an income of <50k as our negative label.

NOTE: Positive and negative in this context should not be interpreted as value judgments (we are not suggesting that someone who earns more than 50k a year is a better person than someone who earns less than 50k). They are just standard terms used to distinguish between the two possible predictions the model can make.

Cases where the model makes the correct prediction (the prediction matches the ground truth) are classified as true, and cases where the model makes the wrong prediction are classified as false.

Our confusion matrix thus represents four possible states:

  • true positive: Model predicts >50K, and that is the ground truth.
  • true negative: Model predicts <50K, and that is the ground truth.
  • false positive: Model predicts >50K, and that contradicts reality.
  • false negative: Model predicts <50K, and that contradicts reality.

NOTE: If desired, we can use the number of outcomes in each of these states to calculate secondary evaluation metrics, such as precision and recall.

Plot the Confusion Matrix

The following cell define a function that uses the sklearn.metrics.confusion_matrix module to calculate all the instances (true positive, true negative, false positive, and false negative) needed to compute our binary confusion matrix and evaluation metrics.

In [0]:
#@test {"output": "ignore"}
#@title Define Function to Compute Binary Confusion Matrix Evaluation Metrics
def compute_eval_metrics(references, predictions):
  tn, fp, fn, tp = confusion_matrix(references, predictions).ravel()
  precision = tp / float(tp + fp)
  recall = tp / float(tp + fn)
  false_positive_rate = fp / float(fp + tn)
  false_omission_rate = fn / float(tn + fn)
  return precision, recall, false_positive_rate, false_omission_rate

print 'Binary confusion matrix and evaluation metrics defined.'

We will also need help plotting the binary confusion matrix. The function below combines various third-party modules (pandas DataFame, Matplotlib, Seaborn) to draw the confusion matrix.

In [0]:
#@title Define Function to Visualize Binary Confusion Matrix
def plot_confusion_matrix(confusion_matrix, class_names, figsize = (8,6)):
    # We're taking our calculated binary confusion matrix that's already in form 
    # of an array and turning it into a Pandas DataFrame because it's a lot 
    # easier to work with when visualizing a heat map in Seaborn.
    df_cm = pd.DataFrame(
        confusion_matrix, index=class_names, columns=class_names, 
    fig = plt.figure(figsize=figsize)
    # Combine the instance (numercial value) with its description
    strings = np.asarray([['True Positives', 'False Negatives'],
                          ['False Positives', 'True Negatives']])
    labels = (np.asarray(
        ["{0:d}\n{1}".format(value, string) for string, value in zip(
            strings.flatten(), confusion_matrix.flatten())])).reshape(2, 2)

    heatmap = sns.heatmap(df_cm, annot=labels, fmt="");
        heatmap.yaxis.get_ticklabels(), rotation=0, ha='right')
        heatmap.xaxis.get_ticklabels(), rotation=45, ha='right')
    return fig

print "Binary confusion matrix visualization defined."

Now that we have all the necessary functions defined, we can now compute the binary confusion matrix and evaluation metrics using the outcomes from our deep neural net model. The output of this cell is a tabbed view, which allows us to toggle between the confusion matrix and evaluation metrics table.

FairAware Task #4

Use the form below to generate confusion matrices for the two gender subgroups: Female and Male. Compare the number of False Positives and False Negatives for each subgroup. Are there any significant disparities in error rates that suggest the model performs better for one subgroup than another?

In [0]:
#@title Visualize Binary Confusion Matrix and Compute Evaluation Metrics Per Subgroup
CATEGORY  =  "gender" #@param {type:"string"}
SUBGROUP =  "Male" #@param {type:"string"}

# Given define subgroup, generate predictions and obtain its corresponding 
# ground truth.
predictions_dict = single_task_deep_model.predict(input_fn=csv_to_pandas_input_fn(
    test_df.loc[test_df[CATEGORY] == SUBGROUP], num_epochs=1, shuffle=False))
predictions = []
for prediction_item, in zip(predictions_dict):
actuals = list(
    test_df.loc[test_df[CATEGORY] == SUBGROUP]['income_bracket'].apply(
        lambda x: '>50K' in x).astype(int))
classes = ['Over $50K', 'Less than $50K']

# To stay consistent, we have to flip the confusion 
# matrix around on both axes because sklearn's confusion matrix module by
# default is rotated.
rotated_confusion_matrix = np.fliplr(confusion_matrix(actuals, predictions))
rotated_confusion_matrix = np.flipud(rotated_confusion_matrix)

tb = widgets.TabBar(['Confusion Matrix', 'Evaluation Metrics'], location='top')

with tb.output_to('Confusion Matrix'):
  plot_confusion_matrix(rotated_confusion_matrix, classes);

with tb.output_to('Evaluation Metrics'):
  grid = widgets.Grid(2,4)

  p, r, fpr, fomr = compute_eval_metrics(actuals, predictions)

  with grid.output_to(0, 0):
    print " Precision "
  with grid.output_to(1, 0):
    print " %.4f " % p

  with grid.output_to(0, 1):
    print " Recall "
  with grid.output_to(1, 1):
    print " %.4f " % r

  with grid.output_to(0, 2):
    print " False Positive Rate "
  with grid.output_to(1, 2):
    print " %.4f " % fpr

  with grid.output_to(0, 3):
    print " False Omission Rate "
  with grid.output_to(1, 3):
    print " %.4f " % fomr


Click below for some insights we uncovered

Using default model parameters, you may find that the model performs better for male than female. Specifically, in our run, we found that both precision and recall for male (0.7490 and 0.4795, respectively) outperformed female (0.6787 and 0.3716, respectively).

Hopefully going through this confusion matrix demonstration you find that the results varies slightly from the overall performance metrics, highlighting the importance of evaluating model performance across subgroup rather than in aggregate.

In your work, make sure that you make a good decision about the trade-offs between false positives, false negatives, true positives, and true negatives. For example, you may want very a low false positive rate, but a high true positive rate. Or you may want a high precision, but a low recall is okay.

Choose your evaluation metrics in light of these desired tradeoffs.