In [0]:
#@title Copyright 2020 Google LLC. Double-click here for license information.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
In this exercise, you'll explore datasets and evaluate classifiers with fairness in mind, noting the ways undesirable biases can creep into machine learning (ML).
Throughout, you will see FairAware tasks, which provide opportunities to contextualize ML processes with respect to fairness. In performing these tasks, you'll identify biases and consider the long-term impact of model predictions if these biases are not addressed.
In this exercise, you'll work with the Adult Census Income dataset, which is commonly used in machine learning literature. This data was extracted from the 1994 Census bureau database by Ronny Kohavi and Barry Becker.
Each example in the dataset contains the following demographic data for a set of individuals who took part in the 1994 Census:
age
: The age of the individual in years.fnlwgt
: The number of individuals the Census Organizations believes that set of observations represents.education_num
: An enumeration of the categorical representation of education. The higher the number, the higher the education that individual achieved. For example, an education_num
of 11
represents Assoc_voc
(associate degree at a vocational school), an education_num
of 13
represents Bachelors
, and an education_num
of 9
represents HS-grad
(high school graduate).capital_gain
: Capital gain made by the individual, represented in US Dollars.capital_loss
: Capital loss mabe by the individual, represented in US Dollars.hours_per_week
: Hours worked per week.workclass
: The individual's type of employer. Examples include: Private
, Self-emp-not-inc
, Self-emp-inc
, Federal-gov
, Local-gov
, State-gov
, Without-pay
, and Never-worked
.education
: The highest level of education achieved for that individual.marital_status
: Marital status of the individual. Examples include: Married-civ-spouse
, Divorced
, Never-married
, Separated
, Widowed
, Married-spouse-absent
, and Married-AF-spouse
.occupation
: The occupation of the individual. Example include: tech-support
, Craft-repair
, Other-service
, Sales
, Exec-managerial
and more.relationship
: The relationship of each individual in a household. Examples include: Wife
, Own-child
, Husband
, Not-in-family
, Other-relative
, and Unmarried
.gender
: Gender of the individual available only in binary choices: Female
or Male
.race
: White
, Asian-Pac-Islander
, Amer-Indian-Eskimo
, Black
, and Other
. native_country
: Country of origin of the individual. Examples include: United-States
, Cambodia
, England
, Puerto-Rico
, Canada
, Germany
, Outlying-US(Guam-USVI-etc)
, India
, Japan
, and more.The prediction task is to determine whether a person makes over $50,000 US Dollar a year.
income_bracket
: Whether the person makes more than $50,000 US Dollars annually.All the examples extracted for this dataset meet the following conditions:
age
is 16 years or older.income_bracket
) is greater than $100 USD annually.fnlwgt
is greater than 0.hours_per_week
is greater than 0.
In [0]:
#@title Run on TensorFlow 2.x
%tensorflow_version 2.x
from __future__ import absolute_import, division, print_function, unicode_literals
Next, we'll import the necessary modules to run the code in the rest of this Colaboratory notebook.
In addition to importing the usual libraries, this setup code cell also installs Facets, an open-source tool created by PAIR that contains two robust visualizations we'll be using to aid in understanding and analyzing ML datasets.
In [0]:
#@title Import revelant modules and install Facets
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers
from matplotlib import pyplot as plt
from matplotlib import rcParams
import seaborn as sns
# The following lines adjust the granularity of reporting.
pd.options.display.max_rows = 10
pd.options.display.float_format = "{:.1f}".format
from google.colab import widgets
# For facets
from IPython.core.display import display, HTML
import base64
!pip install facets-overview==1.0.0
from facets_overview.feature_statistics_generator import FeatureStatisticsGenerator
In [0]:
COLUMNS = ["age", "workclass", "fnlwgt", "education", "education_num",
"marital_status", "occupation", "relationship", "race", "gender",
"capital_gain", "capital_loss", "hours_per_week", "native_country",
"income_bracket"]
train_csv = tf.keras.utils.get_file('adult.data',
'https://download.mlcc.google.com/mledu-datasets/adult_census_train.csv')
test_csv = tf.keras.utils.get_file('adult.data',
'https://download.mlcc.google.com/mledu-datasets/adult_census_test.csv')
train_df = pd.read_csv(train_csv, names=COLUMNS, sep=r'\s*,\s*',
engine='python', na_values="?")
test_df = pd.read_csv(test_csv, names=COLUMNS, sep=r'\s*,\s*', skiprows=[0],
engine='python', na_values="?")
As mentioned in MLCC, it is important to understand your dataset before diving straight into the prediction task.
Some important questions to investigate when auditing a dataset for fairness:
To start, we can use Facets Overview, an interactive visualization tool that can help us explore the dataset. With Facets Overview, we can quickly analyze the distribution of values across the Adult dataset.
In [0]:
#@title Visualize the Data in Facets
fsg = FeatureStatisticsGenerator()
dataframes = [
{'table': train_df, 'name': 'trainData'}]
censusProto = fsg.ProtoFromDataFrames(dataframes)
protostr = base64.b64encode(censusProto.SerializeToString()).decode("utf-8")
HTML_TEMPLATE = """<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
<link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
<facets-overview id="elem"></facets-overview>
<script>
document.querySelector("#elem").protoInput = "{protostr}";
</script>"""
html = HTML_TEMPLATE.format(protostr=protostr)
display(HTML(html))
Review the descriptive statistics and histograms for each numerical and continuous feature. Click the Show Raw Data button above the histograms for categorical features to see the distribution of values per category.
Then, try to answer the following questions from earlier:
We can see from reviewing the missing column that the following categorical features contain missing values:
Now, because it's only a small percentage of samples that contain either a missing workclass value or occupation value, we can safely drop those rows from the data set. If that percentage was much higher, then we would have to consider using a different data set that is more complete.
Luckily, in Pandas, there is a convenient way to drop any row containing a missing value in the data set:
# pandas.DataFrame.dropna(how="any", axis=0, inplace=True)
We will use this method prior to training the model when we convert a Pandas DataFrame to a Numpy array.
As for the remaining data that does not contain any missing values: if we look at the min/max values and histograms for each numeric feature, then we can pinpoint any extreme outliers in our data set.
For hours_per_week
, we can see that the minimum is 1, which might be a bit surprising, given that most jobs typically require multiple hours of work per week. For capital_gain
and capital_loss
, we can see that over 90% of values are 0. Given that capital gains/losses are only registered by individuals who make investments, it's certainly plausible that less than 10% of examples would have nonzero values for these feature, but we may want to take a closer look to verify the values for these features are valid.
In looking at the histogram for gender, we see that over two-thirds (approximately 67%) of examples represent males. This strongly suggests data skew, as we would expect the breakdown between genders to be closer to 50/50.
To futher explore the dataset, we can use Facets Dive, a tool that provides an interactive interface where each individual item in the visualization represents a data point. But to use Facets Dive, we need to convert the data to a JSON array.
Thankfully the DataFrame method to_json()
takes care of this for us.
Run the cell below to perform the data transform to JSON and also load Facets Dive.
In [0]:
#@title Set the Number of Data Points to Visualize in Facets Dive
SAMPLE_SIZE = 5000 #@param
train_dive = train_df.sample(SAMPLE_SIZE).to_json(orient='records')
HTML_TEMPLATE = """<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
<link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
<facets-dive id="elem" height="600"></facets-dive>
<script>
var data = {jsonstr};
document.querySelector("#elem").data = data;
</script>"""
html = HTML_TEMPLATE.format(jsonstr=train_dive)
display(HTML(html))
Use the menus on the left panel of the visualization to change how the data is organized:
In the Binning | X-Axis menu, select education, and in the Color By and Label By menus, select income_bracket. How would you describe the relationship between education level and income bracket?
Next, in the Binning | X-Axis menu, select marital_status, and in the Color By and Label By menus, select gender. What noteworthy observations can you make about the gender distributions for each marital-status category?
As you perform the above tasks, keep the following fairness-related questions in mind:
In the data set, higher education levels generally tend to correlate with a higher income bracket. An income level of greater than $50,000 is more heavily represented in examples where education level is Bachelor's degree or higher.
In most marital-status categories, the distribution of male vs. female values is close to 1:1. The one notable exception is "married-civ-spouse", where male outnumbers female by more than 5:1. Given that we already discovered in Task #1 that there is a disproportionately high representation of men in the data set, we can now infer that it's married women specifically that are underrepresented in the data.
Plotting histograms, ranking most-to-least common examples, identifying duplicate or missing examples, making sure the training and test sets are similar, computing feature quantiles—these are all critical analyses to perform on your data.
The better you know what's going on in your data, the more insight you'll have as to where unfairness might creep in!
Now that you've explored the dataset using Facets, see if you can identify some of the problems that may arise with regard to fairness based on what you've learned about its features.
Which of the following features might pose a problem with regard to fairness?
Choose a feature from the drop-down options in the cell below, and then run the cell to check your answer. Then explore the rest of the options to get more insight about how each influences the model's predictions.
In [0]:
feature = 'fnlwgt' #@param ["", "hours_per_week", "fnlwgt", "gender", "capital_gain / capital_loss", "age"] {allow-input: false}
if feature == "hours_per_week":
print(
'''It does seem a little strange to see 'hours_per_week' max out at 99 hours,
which could lead to data misrepresentation. One way to address this is by
representing 'hours_per_week' as a binary "working 40 hours/not working 40
hours" feature. Also keep in mind that data was extracted based on work hours
being greater than 0. In other words, this feature representation exclude a
subpopulation of the US that is not working. This could skew the outcomes of the
model.''')
if feature == "fnlwgt":
print(
"""'fnlwgt' represents the weight of the observations. After fitting the model
to this data set, if certain group of individuals end up performing poorly
compared to other groups, then we could explore ways of reweighting each data
point using this feature.""")
if feature == "gender":
print(
"""Looking at the ratio between men and women shows how disproportionate the data
is compared to the real world where the ratio (at least in the US) is closer to
1:1. This could pose a huge probem in performance across gender. Considerable
measures may need to be taken to upsample the underrepresented group (in this
case, women).""")
if feature == "capital_gain / capital_loss":
print(
"""As alluded to in Task #1, both 'capital_gain' and 'capital_loss' could be
indicative of income status as only individuals who make investments register
their capital gains and losses. The caveat is that over 90% of the values in
both 'capital_gain' and 'capital_loss' are 0, and it's not entirely clear from
the description of the data set why that is the case. That is, we don't know
whether we should interpret all these 0s as "no investment gain/loss or "
investment gain/loss is unknown." Lack of context is always a flag for concern,
and one that could trigger fairness-related issues later on. For now, we are
going to omit these features from the model, but you are more than welcome to
experiment with them if you come up with an idea on how capital gains and
losses should be handled.""")
if feature == "age":
print(
'''"age" has a lot of variance, so it might benefit from bucketing to learn
fine-grained correlations between income and age, as well as to prevent
overfitting.''')
Now that we have a better sense of the Adult dataset, we can now begin with creating a neural network to predict income. In this section, as with previous exercises, we will be using TensorFlow's Keras API (specifically, tf.keras.Sequential
) to construct our neural network model.
We first have to define our input fuction, which will take the Adult dataset that is in a pandas DataFrame and convert it a Numpy array.
While a pandas DataFrame is great — especially when working with Facets and other Python modules that visualize data — tf.keras.Sequential
doesn't accept a pandas DataFrame as a data type. Luckily for us, it's quite trivial to convert a pandas DataFrame into a Numpy array, which is an accepted data type.
In [0]:
def pandas_to_numpy(data):
'''Convert a pandas DataFrame into a Numpy array'''
# Drop empty rows.
data = data.dropna(how="any", axis=0)
# Separate DataFrame into two Numpy arrays"
labels = np.array(data['income_bracket'] == ">50K")
features = data.drop('income_bracket', axis=1)
features = {name:np.array(value) for name, value in features.items()}
return features, labels
In [0]:
#@title Create categorical feature columns
# Since we don't know the full range of possible values with occupation and
# native_country, we'll use categorical_column_with_hash_bucket() to help map
# each feature string into an integer ID.
occupation = tf.feature_column.categorical_column_with_hash_bucket(
"occupation", hash_bucket_size=1000)
native_country = tf.feature_column.categorical_column_with_hash_bucket(
"native_country", hash_bucket_size=1000)
# For the remaining categorical features, since we know what the possible values
# are, we can be more explicit and use categorical_column_with_vocabulary_list()
gender = tf.feature_column.categorical_column_with_vocabulary_list(
"gender", ["Female", "Male"])
race = tf.feature_column.categorical_column_with_vocabulary_list(
"race", [
"White", "Asian-Pac-Islander", "Amer-Indian-Eskimo", "Other", "Black"
])
education = tf.feature_column.categorical_column_with_vocabulary_list(
"education", [
"Bachelors", "HS-grad", "11th", "Masters", "9th",
"Some-college", "Assoc-acdm", "Assoc-voc", "7th-8th",
"Doctorate", "Prof-school", "5th-6th", "10th", "1st-4th",
"Preschool", "12th"
])
marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
"marital_status", [
"Married-civ-spouse", "Divorced", "Married-spouse-absent",
"Never-married", "Separated", "Married-AF-spouse", "Widowed"
])
relationship = tf.feature_column.categorical_column_with_vocabulary_list(
"relationship", [
"Husband", "Not-in-family", "Wife", "Own-child", "Unmarried",
"Other-relative"
])
workclass = tf.feature_column.categorical_column_with_vocabulary_list(
"workclass", [
"Self-emp-not-inc", "Private", "State-gov", "Federal-gov",
"Local-gov", "?", "Self-emp-inc", "Without-pay", "Never-worked"
])
In [0]:
#@title Create numeric feature columns
# For Numeric features, we can just call on feature_column.numeric_column()
# to use its raw value instead of having to create a map between value and ID.
age = tf.feature_column.numeric_column("age")
fnlwgt = tf.feature_column.numeric_column("fnlwgt")
education_num = tf.feature_column.numeric_column("education_num")
capital_gain = tf.feature_column.numeric_column("capital_gain")
capital_loss = tf.feature_column.numeric_column("capital_loss")
hours_per_week = tf.feature_column.numeric_column("hours_per_week")
If you chose age
when completing FairAware Task #3, you will have noticed that we suggested bucketing (also known as binning) this feature, grouping together similar ages into different groups. This might help the model generalize better across age. As such, we will convert age
from a numeric feature (technically, an ordinal feature) to a categorical feature.
In [0]:
age_buckets = tf.feature_column.bucketized_column(
age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
When performing feature engineering, it's important to keep in mind that you may be working with data drawn from individuals belonging to subgroups, for which you'll want to evaluate model performance separately.
NOTE: In this context, a subgroup is defined as a group of individuals who share a given characteristic—such as race, gender, or sexual orientation—that merits special consideration when evaluating a model with fairness in mind.
When we want our models to mitigate, or leverage, the learned signal of a characteristic pertaining to a subgroup, we will want to use different kinds of tools and techniques—most of which are still actively being researched and developed. You can find a list of related research work and techniques at our Responsible AI Practices page.
As you work with different variables and define tasks for them, it can be useful to think about what comes next. For example, where are the places where the interaction of the variable and the task could be a concern?
In [0]:
# List of variables, with special handling for gender subgroup.
variables = [native_country, education, occupation, workclass,
relationship, age_buckets]
subgroup_variables = [gender]
feature_columns = variables + subgroup_variables
With the features now ready to go, we can try predicting income using deep learning.
For the sake of simplicity, we are going to keep the neural network architecture light by simply defining a feed-forward neural network with two hidden layers.
But first, we have to convert our high-dimensional categorical features into a low-dimensional and dense real-valued vector, which we call an embedding vector. Luckily, indicator_column
(think of it as one-hot encoding) and embedding_column
(that converts sparse features into dense features) helps us streamline the process.
Based on our analysis of the data set from previous FairAware Tasks, we are going to move forward with the following features:
workclass
education
age_buckets
relationship
native_country
occupation
All other features will be omitted from training — but you are welcome to experiment. gender
is the only feature that will be used to filter the test set for subgroup evaluation purposes.
The following cell creates the deep columns required to define the input layer of the model:
In [0]:
deep_columns = [
tf.feature_column.indicator_column(workclass),
tf.feature_column.indicator_column(education),
tf.feature_column.indicator_column(age_buckets),
tf.feature_column.indicator_column(relationship),
tf.feature_column.embedding_column(native_country, dimension=8),
tf.feature_column.embedding_column(occupation, dimension=8),
]
With all the data preprocessing taken care of, we can now define and compile the deep neural net model. Start by using the parameters defined below. (Later on, after you've defined evaluation metrics and evaluated the model, you can come back and tweak these parameters to compare results.)
In [0]:
#@title Define Deep Neural Net Model
# Parameters from form fill-ins
HIDDEN_UNITS_LAYER_01 = 128 #@param
HIDDEN_UNITS_LAYER_02 = 64 #@param
LEARNING_RATE = 0.1 #@param
L1_REGULARIZATION_STRENGTH = 0.001 #@param
L2_REGULARIZATION_STRENGTH = 0.001 #@param
RANDOM_SEED = 512
tf.random.set_seed(RANDOM_SEED)
# List of built-in metrics that we'll need to evaluate performance.
METRICS = [
tf.keras.metrics.TruePositives(name='tp'),
tf.keras.metrics.FalsePositives(name='fp'),
tf.keras.metrics.TrueNegatives(name='tn'),
tf.keras.metrics.FalseNegatives(name='fn'),
tf.keras.metrics.BinaryAccuracy(name='accuracy'),
tf.keras.metrics.Precision(name='precision'),
tf.keras.metrics.Recall(name='recall'),
tf.keras.metrics.AUC(name='auc'),
]
regularizer = tf.keras.regularizers.l1_l2(
l1=L1_REGULARIZATION_STRENGTH, l2=L2_REGULARIZATION_STRENGTH)
model = tf.keras.Sequential([
layers.DenseFeatures(deep_columns),
layers.Dense(
HIDDEN_UNITS_LAYER_01, activation='relu', kernel_regularizer=regularizer),
layers.Dense(
HIDDEN_UNITS_LAYER_02, activation='relu', kernel_regularizer=regularizer),
layers.Dense(
1, activation='sigmoid', kernel_regularizer=regularizer)
])
model.compile(optimizer=tf.keras.optimizers.Adagrad(LEARNING_RATE),
loss=tf.keras.losses.BinaryCrossentropy(),
metrics=METRICS)
To keep things simple, we'll pass through the full training data 10 times.
In [0]:
#@title Fit Deep Neural Net Model to the Adult Training Dataset
EPOCHS = 10 #@param
BATCH_SIZE = 500 #@param
features, labels = pandas_to_numpy(train_df)
model.fit(x=features, y=labels, epochs=EPOCHS, batch_size=BATCH_SIZE)
We can now evalute the overall model's performance using the test set.
In [0]:
#@title Evaluate Deep Neural Net Performance
features, labels = pandas_to_numpy(test_df)
model.evaluate(x=features, y=labels);
You can try retraining the model using different parameters. If you leave the parameters as is, then you see that this relatively simple deep neural net does a decent job in predicting income with an overall accuracy of 0.8317 and an AUC of 0.8817.
But evaluation metrics with respect to subgroups are missing. We will cover some of the ways you can evaluate at the subgroup level in the next section.
While evaluating the overall performance of the model gives us some insight into its quality, it doesn't give us much insight into how well our model performs for different subgroups.
When evaluating a model for fairness, it's important to determine whether prediction errors are uniform across subgroups or whether certain subgroups are more susceptible to certain prediction errors than others.
A key tool for comparing the prevalence of different types of model errors is a confusion matrix. Recall from the Classification module of Machine Learning Crash Course that a confusion matrix is a grid that plots predictions vs. ground truth for your model, and tabulates statistics summarizing how often your model made the correct prediction and how often it made the wrong prediction.
Let's start by creating a binary confusion matrix for our income-prediction model—binary because our label (income_bracket
) has only two possible values (<50K
or >50K
). We'll define an income of >50K
as our positive label, and an income of <50k
as our negative label.
NOTE: Positive and negative in this context should not be interpreted as value judgments (we are not suggesting that someone who earns more than 50k a year is a better person than someone who earns less than 50k). They are just standard terms used to distinguish between the two possible predictions the model can make.
Cases where the model makes the correct prediction (the prediction matches the ground truth) are classified as true, and cases where the model makes the wrong prediction are classified as false.
Our confusion matrix thus represents four possible states:
>50K
, and that is the ground truth.<50K
, and that is the ground truth.>50K
, and that contradicts reality.<50K
, and that contradicts reality.NOTE: If desired, we can use the number of outcomes in each of these states to calculate secondary evaluation metrics, such as precision and recall.
Since we've already defined which metrics we're interested in back when we defined and compiled our model, all we have to do now is:
tf.keras.Model.predict()
for evaluation.
In [0]:
#@title Define Function to Visualize Binary Confusion Matrix
def plot_confusion_matrix(
confusion_matrix, class_names, subgroup, figsize = (8,6)):
# We're taking our calculated binary confusion matrix that's already in the
# form of an array and turning it into a pandas DataFrame because it's a lot
# easier to work with a pandas DataFrame when visualizing a heat map in
# Seaborn.
df_cm = pd.DataFrame(
confusion_matrix, index=class_names, columns=class_names,
)
rcParams.update({
'font.family':'sans-serif',
'font.sans-serif':['Liberation Sans'],
})
sns.set_context("notebook", font_scale=1.25)
fig = plt.figure(figsize=figsize)
plt.title('Confusion Matrix for Performance Across ' + subgroup)
# Combine the instance (numercial value) with its description
strings = np.asarray([['True Positives', 'False Negatives'],
['False Positives', 'True Negatives']])
labels = (np.asarray(
["{0:g}\n{1}".format(value, string) for string, value in zip(
strings.flatten(), confusion_matrix.flatten())])).reshape(2, 2)
heatmap = sns.heatmap(df_cm, annot=labels, fmt="",
linewidths=2.0, cmap=sns.color_palette("GnBu_d"));
heatmap.yaxis.set_ticklabels(
heatmap.yaxis.get_ticklabels(), rotation=0, ha='right')
heatmap.xaxis.set_ticklabels(
heatmap.xaxis.get_ticklabels(), rotation=45, ha='right')
plt.ylabel('References')
plt.xlabel('Predictions')
return fig
Now that we have all the necessary functions defined, we can now compute the binary confusion matrix and evaluation metrics using the outcomes from our deep neural net model. The output of this cell is a tabbed view, which allows us to toggle between the confusion matrix and evaluation metrics table.
Use the form below to generate confusion matrices for the two gender subgroups: Female
and Male
. Compare the number of False Positives and False Negatives for each subgroup. Are there any significant disparities in error rates that suggest the model performs better for one subgroup than another?
In [0]:
#@title Visualize Binary Confusion Matrix and Compute Evaluation Metrics Per Subgroup
CATEGORY = "gender" #@param {type:"string"}
SUBGROUP = "Male" #@param {type:"string"}
# Labels for annotating axes in plot.
classes = ['Over $50K', 'Less than $50K']
# Given define subgroup, generate predictions and obtain its corresponding
# ground truth.
subgroup_filter = test_df.loc[test_df[CATEGORY] == SUBGROUP]
features, labels = pandas_to_numpy(subgroup_filter)
subgroup_results = model.evaluate(x=features, y=labels, verbose=0)
confusion_matrix = np.array([[subgroup_results[1], subgroup_results[4]],
[subgroup_results[2], subgroup_results[3]]])
subgroup_performance_metrics = {
'ACCURACY': subgroup_results[5],
'PRECISION': subgroup_results[6],
'RECALL': subgroup_results[7],
'AUC': subgroup_results[8]
}
performance_df = pd.DataFrame(subgroup_performance_metrics, index=[SUBGROUP])
pd.options.display.float_format = '{:,.4f}'.format
plot_confusion_matrix(confusion_matrix, classes, SUBGROUP);
performance_df
Using default parameters, you may find that the model performs better for female than male. Specifically, in our run, we found that both accuracy and AUC for female (0.9137 and 0.9089, respectively) outperformed male (0.7923 and 0.8549, respectively). What is going on here?
Notice the number of true positives (top-left corner) for female is way lower compared to male (479 to 3822). Recall that in Task #1 we noticed a disproportionately high representation of male in the data set (almost 2-to-1). If you further explore the data set using Facets Dive in Task #2 by setting the color to income_bracket
and one of the axes to gender
, then you will also find a disproportionately small number of female examples in the higher income bracket, our positive label.
What this is all suggesting is that the model is overfitting, particuarly with respect to female and lower income bracket. In other words, this model will not generalize well, particularly with female data, as it does not have enough positive examples for the model to learn from. It is not doing that much better with male, either, as there is a disproportionately small number of high income bracket compared to low income bracket — though not nearly as poorly represented as with female.
Hopefully going through this confusion matrix demonstration you find that the results varies slightly from the overall performance metrics, highlighting the importance of evaluating model performance across subgroup rather than in aggregate.
In your work, make sure that you make a good decision about the tradeoffs between false positives, false negatives, true positives, and true negatives. For example, you may want a very low false positive rate, but a high true positive rate. Or you may want a high precision, but a low recall is okay.
Choose your evaluation metrics in light of these desired tradeoffs.