In [ ]:
# Only execute if you haven't already. Make sure to restart the kernel if these libraries have not been previously installed.
!pip install xgboost==0.82 --user
!pip install scikit-learn==0.20.4 --user

Import Python packages

Execute the command below (Shift + Enter) to load all the Python libraries we'll need for the lab.


In [ ]:
import datetime
import pickle
import os

import pandas as pd
import xgboost as xgb
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.utils import shuffle
from sklearn.base import clone
from sklearn.model_selection import train_test_split

from witwidget.notebook.visualization import WitWidget, WitConfigBuilder

import custom_transforms

import warnings
warnings.filterwarnings(action='ignore', category=DeprecationWarning)

Before we continue, note that we'll be using your Qwiklabs project ID a lot in this notebook. For convenience, set it as an environment variable using the command below:


In [ ]:
os.environ['QWIKLABS_PROJECT_ID'] = ''

Download and process data

The models you'll build will predict an individual's income level, specifically whether it is more than or at most $50,000 per year, given 14 data points about that individual. You'll train your models on this UCI Census Income Dataset.

We'll read the data into a Pandas DataFrame to see what we'll be working with. It's important to shuffle our data in case the original dataset is ordered in a specific way. We use an sklearn utility called shuffle to do this, which we imported in the first cell:


In [ ]:
train_csv_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'

COLUMNS = (
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
    'income-level'
)

raw_train_data = pd.read_csv(train_csv_path, names=COLUMNS, skipinitialspace=True)
raw_train_data = shuffle(raw_train_data, random_state=4)

raw_train_data.head() lets us preview the first five rows of our dataset in Pandas.


In [ ]:
raw_train_data.head()

The income-level column is the label our model will predict: a binary indicator of whether the individual makes more than $50,000 per year. To see the distribution of income levels in the dataset, run the following:


In [ ]:
print(raw_train_data['income-level'].value_counts(normalize=True))

As explained in this paper, each entry in the dataset contains the following information about an individual:

  • age: the age of an individual
  • workclass: a general term to represent the employment status of an individual
  • fnlwgt: final weight. In other words, this is the number of people the census believes the entry represents...
  • education: the highest level of education achieved by an individual.
  • education-num: the highest level of education achieved in numerical form.
  • marital-status: marital status of an individual.
  • occupation: the general type of occupation of an individual
  • relationship: how this individual relates to others. For example, an individual could be a Husband. Each entry has only one relationship attribute, which is somewhat redundant with marital-status.
  • race: Descriptions of an individual’s race
  • sex: the biological sex of the individual
  • capital-gain: capital gains for an individual
  • capital-loss: capital loss for an individual
  • hours-per-week: the hours an individual has reported to work per week
  • native-country: country of origin for an individual
  • income-level: whether or not an individual makes more than $50,000 annually

An important concept in machine learning is train / test split. We'll take the majority of our data and use it to train our model, and we'll set aside the rest for testing our model on data it's never seen before. There are many ways to create training and test datasets. Fortunately, for our census data we can simply download a pre-defined test set.
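
For reference, if a pre-defined test set weren't available, we could create our own split with scikit-learn's train_test_split, which we imported in the first cell. A minimal sketch, shown only for illustration; the rest of this lab uses the downloaded test file instead:

# Illustration only: hold out 20% of the shuffled training data as a test set.
_train_df, _test_df = train_test_split(raw_train_data, test_size=0.2, random_state=4)
print(_train_df.shape, _test_df.shape)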


In [ ]:
test_csv_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test'
raw_test_data = pd.read_csv(test_csv_path, names=COLUMNS, skipinitialspace=True, skiprows=1)

In [ ]:
raw_test_data.head()

Since we don't want to train a model on our labels, we're going to separate them from the features in both the training and test datasets. Also, notice that income-level is a string datatype. For machine learning, it's better to convert this to a binary integer datatype. We do this in the next cell.


In [ ]:
raw_train_features = raw_train_data.drop('income-level', axis=1).values
raw_test_features = raw_test_data.drop('income-level', axis=1).values

# Create training and test label arrays. Note that in the downloaded test file
# the label strings carry a trailing period ('>50K.'), unlike the training file.
train_labels = (raw_train_data['income-level'] == '>50K').values.astype(int)
test_labels = (raw_test_data['income-level'] == '>50K.').values.astype(int)

Now you're ready to build and train your first model!

Build a First Model

The model we build closely follows a template for the census dataset found on AI Hub. For our model we use an XGBoost classifier. Before we train the model, however, we have to pre-process the data a little. We build a processing pipeline using Scikit-Learn's make_pipeline constructor and apply some custom transformations that are defined in custom_transforms.py. Open the file custom_transforms.py and inspect the code. Our features are either numerical or categorical. The numerical features are age and hours-per-week; these are processed with Scikit-Learn's StandardScaler. The categorical features are workclass, education, marital-status, and relationship; these are one-hot encoded.
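
To give a rough idea of what custom_transforms.py contains, a transformer like PositionalSelector might look something like the sketch below. This is a hypothetical simplification to illustrate the scikit-learn fit/transform pattern; the actual implementations in the lab may differ.

# Hypothetical sketch -- inspect custom_transforms.py for the real code.
from sklearn.base import BaseEstimator, TransformerMixin

class PositionalSelector(BaseEstimator, TransformerMixin):
    """Select columns of a 2-D array by position."""
    def __init__(self, positions):
        self.positions = positions

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array(X)[:, self.positions]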


In [ ]:
numerical_indices = [0, 12]  
categorical_indices = [1, 3, 5, 7]  

p1 = make_pipeline(
    custom_transforms.PositionalSelector(categorical_indices),
    custom_transforms.StripString(),
    custom_transforms.SimpleOneHotEncoder()
)
p2 = make_pipeline(
    custom_transforms.PositionalSelector(numerical_indices),
    StandardScaler()
)
p3 = FeatureUnion([
    ('categoricals', p1),
    ('numericals', p2),
])

To finalize the pipeline we attach an XGBoost classifier at the end. The complete pipeline object takes the raw data we loaded from csv files, processes the categorical features, processes the numerical features, concatenates the two, and then passes the result through the XGBoost classifier.


In [ ]:
pipeline = make_pipeline(
    p3,
    xgb.sklearn.XGBClassifier(max_depth=4)
)

We train our model with one function call using the fit() method. We pass the fit() method our training data.


In [ ]:
pipeline.fit(raw_train_features, train_labels)

Let's go ahead and save our model as a pickle file. Executing the command below will save the trained model in the file model.pkl in the same directory as this notebook.


In [ ]:
with open('model.pkl', 'wb') as model_file:
    pickle.dump(pipeline, model_file)

Save Trained Model to AI Platform

We've got our model working locally, but it would be nice if we could make predictions with it from anywhere (not just this notebook!). In this step we'll deploy it to the cloud. For detailed instructions on how to do this, visit the official documentation. Note that since we have custom components in our data pipeline, we need to go through a few extra steps.

Create a Cloud Storage bucket for the model

We first need to create a storage bucket to store our pickled model file. We'll point Cloud AI Platform at this file when we deploy. Run the gsutil command below to create a bucket named after your project ID, which ensures the Cloud Storage bucket name is globally unique.


In [ ]:
!gsutil mb gs://$QWIKLABS_PROJECT_ID

Package custom transform code

Since we're using custom transformation code, we need to package it up and point AI Platform at it when we ask for predictions. To package our custom code we create a source distribution. The following commands build this distribution and then copy the distribution and the model file to the bucket we created. Ignore any warnings about missing metadata.
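
For reference, the source distribution is built from the setup.py file in this directory. A hypothetical sketch of such a file is shown below; it would produce the custom_transforms-0.1.tar.gz archive referenced in the next cell, though the actual file may list its modules differently.

# Hypothetical sketch of setup.py -- check the real file in this directory.
from setuptools import setup

setup(
    name='custom_transforms',
    version='0.1',
    scripts=['custom_transforms.py', 'predictor.py']
)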


In [ ]:
%%bash

python setup.py sdist --formats=gztar

gsutil cp model.pkl gs://$QWIKLABS_PROJECT_ID/original/
gsutil cp dist/custom_transforms-0.1.tar.gz gs://$QWIKLABS_PROJECT_ID/

Create and Deploy Model

The following ai-platform gcloud command will create a new model in your project. We'll call this one census_income_classifier.


In [ ]:
!gcloud ai-platform models create census_income_classifier --regions us-central1

Now it's time to deploy the model. We can do that with this gcloud command:


In [ ]:
%%bash

MODEL_NAME="census_income_classifier"
VERSION_NAME="original"
MODEL_DIR="gs://$QWIKLABS_PROJECT_ID/original/"
CUSTOM_CODE_PATH="gs://$QWIKLABS_PROJECT_ID/custom_transforms-0.1.tar.gz"

gcloud beta ai-platform versions create $VERSION_NAME \
  --model $MODEL_NAME \
  --runtime-version 1.15 \
  --python-version 3.7 \
  --origin $MODEL_DIR \
  --package-uris $CUSTOM_CODE_PATH \
  --prediction-class predictor.MyPredictor

While this is running, check the models section of your AI Platform console. You should see your new version deploying there; click on the model name to see the loading spinner, which becomes a green check mark when the deploy completes successfully. The deploy should take 2-3 minutes.

In the command above, notice we specify prediction-class. We must specify a prediction class because, by default, AI Platform Prediction calls a Scikit-Learn model's predict method, which in this case returns either 0 or 1. The What-If Tool, however, requires output in line with a Scikit-Learn model's predict_proba method: it wants the probabilities of the negative and positive classes, not just the final class assignment, because probabilities allow more fine-grained exploration of the model. Consequently, we write a custom prediction routine that essentially renames predict_proba as predict. The custom prediction method can be found in the file predictor.py, which was packaged in the section Package custom transform code. By specifying prediction-class we tell AI Platform to call our custom prediction method (effectively predict_proba) instead of the default predict.
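
For reference, an AI Platform custom prediction routine is a class exposing a from_path() factory and a predict() method. The sketch below is a simplified, hypothetical version of what predictor.py does; the actual file in this lab may differ in detail.

# Hypothetical sketch of predictor.py -- inspect the real file for the exact code.
import os
import pickle

class MyPredictor(object):
    def __init__(self, model):
        self._model = model

    def predict(self, instances, **kwargs):
        # Return class probabilities rather than hard 0/1 labels,
        # which is the output format the What-If Tool needs.
        probabilities = self._model.predict_proba(instances)
        return probabilities.tolist()

    @classmethod
    def from_path(cls, model_dir):
        model_path = os.path.join(model_dir, 'model.pkl')
        with open(model_path, 'rb') as f:
            model = pickle.load(f)
        return cls(model)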

Test the deployed model

To make sure your deployed model is working, test it out using gcloud to make a prediction. First, save a JSON file with one test instance for prediction:


In [ ]:
%%writefile predictions.json
[25, "Private", 226802, "11th", 7, "Never-married", "Machine-op-inspct", "Own-child", "Black", "Male", 0, 0, 40, "United-States"]

Test your model by running this code:


In [ ]:
!gcloud ai-platform predict --model=census_income_classifier --json-instances=predictions.json --version=original

You should see your model's prediction in the output. The first entry is the model's probability that the individual makes $50K or less, while the second is the probability that the individual makes over $50K. The two entries sum to 1.
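
If you'd like to sanity-check this locally, you can call predict_proba on the in-notebook pipeline with the same instance. This is illustrative only and assumes the custom transforms accept raw lists of values, which they must for the deployed version to work:

# Illustration: the local equivalent of the deployed prediction.
instance = [[25, "Private", 226802, "11th", 7, "Never-married", "Machine-op-inspct",
             "Own-child", "Black", "Male", 0, 0, 40, "United-States"]]
print(pipeline.predict_proba(instance))  # two probabilities per row, summing to 1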

What-If Tool

To connect the What-if Tool to your AI Platform models, you need to pass it a subset of your test examples along with the ground truth values for those examples. Let's create a Numpy array of 2000 of our test examples.


In [ ]:
num_datapoints = 2000  

test_examples = np.hstack(
    (raw_test_features[:num_datapoints], 
     test_labels[:num_datapoints].reshape(-1,1)
    )
)

Instantiating the What-if Tool is as simple as creating a WitConfigBuilder object and passing it the AI Platform model we built. Note that it'll take a minute to load the visualization.


In [ ]:
config_builder = (
    WitConfigBuilder(test_examples.tolist(), COLUMNS)
    .set_ai_platform_model(os.environ['QWIKLABS_PROJECT_ID'], 'census_income_classifier', 'original')
    .set_target_feature('income-level')
    .set_model_type('classification')
    .set_label_vocab(['Under 50K', 'Over 50K'])
)

WitWidget(config_builder, height=800)

The default view in the What-If Tool is the Datapoint editor tab. Here, you can click on any individual data point to see its features and even change feature values. Next, navigate to the Performance & Fairness tab; by slicing on a feature you can view the model error for individual feature values. Finally, navigate to the Features tab, which shows the distribution of values for each feature in your dataset. You can use this tab to check whether your dataset is balanced. For example, if a dataset contained only Asian individuals, the model's predictions wouldn't necessarily reflect the real-world population. This tab gives us a good opportunity to see where our dataset falls short, so that we can go back and collect more data to make it balanced.

In the Features tab, we can look at the distribution of values for each feature in the dataset. We can see that of the 2,000 test datapoints, 1,346 are from men and 1,702 are from Caucasians. Women and minorities appear under-represented in this dataset. That may lead to the model not learning an accurate representation of the world in which it is trying to make predictions (of course, even if it does learn an accurate representation, is that what we want the model to perpetuate? This is a much deeper question, still falling under the ML fairness umbrella, and worthy of discussion outside of WIT). Predictions on those under-represented groups are more likely to be inaccurate than predictions on the over-represented groups.

The features in this visualization can be sorted by a number of different metrics, including non-uniformity. With this sorting, the features with the most non-uniform distributions are shown first. For numeric features, capital-gain is very non-uniform, with most datapoints having it set to 0, but a small number having non-zero capital gains, all the way up to a maximum of 100k. For categorical features, native-country is the most non-uniform, with most datapoints being from the USA, but there is a long tail of 40 other countries which are not well represented.

Back in the Performance & Fairness tab, we can set an input feature (or set of features) with which to slice the data. For example, setting this to sex allows us to see the breakdown of model performance on male datapoints versus female datapoints. We can see that the model is more accurate (has fewer false positives and false negatives) on females than males. We can also see that the model predicts high income for females much less often than it does for males (8.0% of the time for females vs 27.1% of the time for males). Note that your numbers will be slightly different due to the random elements of model training.

Imagine a scenario where this simple income classifier was used to approve or reject loan applications (not a realistic example, but it illustrates the point). In this case, 28% of men from the test dataset have their loans approved but only 10% of women have theirs approved. If we wished to ensure that men and women get their loans approved the same percentage of the time, we'd be appealing to a fairness concept called "demographic parity". One way to achieve demographic parity would be to have different classification thresholds for males and females in our model.

In this case, demographic parity can be reached, with both groups getting loans 16% of the time, by setting the male threshold at 0.67 and the female threshold at 0.31. Because of the vast difference in the properties of the male and female training data in this 1994 census dataset, we need quite different thresholds to achieve demographic parity. Notice how, with the high male threshold, there are many more false negatives than before, and with the low female threshold there are many more false positives than before. This is necessary to get the percentage of positive predictions to be equal between the two groups. WIT has buttons to optimize for other fairness constraints as well, such as "equal opportunity" and "equal accuracy". Note that your demographic parity numbers may differ from the ones quoted here, as each trained model comes out a bit different.
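
To make the thresholding concrete, here is a small illustrative sketch (outside of WIT) of applying group-specific thresholds to the model's predicted probabilities. The 0.67/0.31 values are the ones quoted above and will differ for your model:

# Illustration: group-specific classification thresholds for demographic parity.
probs = pipeline.predict_proba(raw_test_features)[:, 1]   # P(income > 50K)
is_female = (raw_test_data['sex'] == 'Female').values

male_threshold, female_threshold = 0.67, 0.31              # example values from above
approved = np.where(is_female, probs >= female_threshold, probs >= male_threshold)

for group, mask in [('Male', ~is_female), ('Female', is_female)]:
    print(group, 'positive rate:', approved[mask].mean())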

The use of these features can help shed light on subsets of your data on which your classifier is performing very differently. Understanding biases in your datasets and data slices on which your model has disparate performance are very important parts of analyzing a model for fairness. There are many approaches to improving fairness, including augmenting training data, building fairness-related loss functions into your model training procedure, and post-training inference adjustments like those seen in WIT. We think that WIT provides a great interface for furthering ML fairness learning, but of course there is no silver bullet to improving ML fairness.

Training on a more balanced dataset

Using the What-If Tool, we saw that the model we trained on the census dataset wouldn't treat all groups equitably in a production environment. What if we retrained the model on a dataset that was more balanced? Fortunately, we have such a dataset. Let's train a new model on this balanced dataset and compare it to our original model using the What-If Tool.

First, let's load the balanced dataset into a Pandas dataframe.


In [ ]:
bal_data_path = 'https://storage.googleapis.com/cloud-training/dei/balanced_census_data.csv' 
bal_data = pd.read_csv(bal_data_path, names=COLUMNS, skiprows=1)

In [ ]:
bal_data.head()

Execute the command below to see the distribution of gender in the data.


In [ ]:
bal_data['sex'].value_counts(normalize=True)

Unlike the original dataset, this dataset has an equal number of rows for males and females. Execute the command below to see the distribution of rows in the dataset by both sex and income-level.


In [ ]:
bal_data.groupby(['sex', 'income-level'])['sex'].count()

We see that not only is the dataset balanced across gender, it's also balanced across income. Let's train a model on this data. We'll use exactly the same model pipeline as in the previous section. Scikit-Learn has a convenient utility function for copying model pipelines, clone. The clone function copies a pipeline's architecture without copying its learned parameter values.


In [ ]:
bal_data['income-level'] = bal_data['income-level'].isin(['>50K', '>50K.']).values.astype(int)

raw_bal_features = bal_data.drop('income-level', axis=1).values
bal_labels = bal_data['income-level'].values

In [ ]:
pipeline_bal = clone(pipeline)

In [ ]:
pipeline_bal.fit(raw_bal_features, bal_labels)

As before, we save our trained model to a pickle file. Note that when we create the new model version in AI Platform, the pickled file must again be named model.pkl. It's OK to overwrite the existing model.pkl file since the original has already been copied to Cloud Storage.


In [ ]:
with open('model.pkl', 'wb') as model_file:
    pickle.dump(pipeline_bal, model_file)

Deploy the model to AI Platform using the following bash script:


In [ ]:
%%bash

gsutil cp model.pkl gs://$QWIKLABS_PROJECT_ID/balanced/
    
MODEL_NAME="census_income_classifier"
VERSION_NAME="balanced"
MODEL_DIR="gs://$QWIKLABS_PROJECT_ID/balanced/"
CUSTOM_CODE_PATH="gs://$QWIKLABS_PROJECT_ID/custom_transforms-0.1.tar.gz"

gcloud beta ai-platform versions create $VERSION_NAME \
  --model $MODEL_NAME \
  --runtime-version 1.15 \
  --python-version 3.7 \
  --origin $MODEL_DIR \
  --package-uris $CUSTOM_CODE_PATH \
  --prediction-class predictor.MyPredictor

Now let's instantiate the What-If Tool by configuring a WitConfigBuilder. Here, we want to compare the original model we built with the one trained on the balanced census dataset. To achieve this we use the set_compare_ai_platform_model method. We want to compare the models on a balanced test set, which is loaded below and then passed to WitConfigBuilder.


In [ ]:
bal_test_csv_path = 'https://storage.googleapis.com/cloud-training/dei/balanced_census_data_test.csv'
bal_test_data = pd.read_csv(bal_test_csv_path, names=COLUMNS, skipinitialspace=True)
bal_test_data['income-level'] = (bal_test_data['income-level'] == '>50K').values.astype(int)

In [ ]:
config_builder = (
    WitConfigBuilder(bal_test_data.to_numpy()[1:].tolist(), COLUMNS)
    .set_ai_platform_model(os.environ['QWIKLABS_PROJECT_ID'], 'census_income_classifier', 'original')
    .set_compare_ai_platform_model(os.environ['QWIKLABS_PROJECT_ID'], 'census_income_classifier', 'balanced')
    .set_target_feature('income-level')
    .set_model_type('classification')
    .set_label_vocab(['Under 50K', 'Over 50K'])
)

WitWidget(config_builder, height=800)

Once the WIT widget loads, click on the Performance & Fairness tab. In the Slice by field select sex and wait a minute for the graphics to load. For females, the model trained on the balanced dataset is over twice as likely to predict an income of over 50K as the model trained on the original dataset. While this results in a higher false positive rate, the false negative rate decreases by roughly a factor of three, and overall accuracy improves by some 10 percentage points.

How else does the model trained on balanced data perform differently when compared to the original model?