Hendrik Jacob van Veen
hendrik.vanveen@nubank.com.br • https://mlwave.com
Matheus Facure
matheus.facure@nubank.com.br • https://matheusfacure.github.io/
Variational inference (MacKay, 2003) gives a computationally tractable measure of uncertainty (confidence, variance) for machine learning models, including complex black-box models such as those used in gradient boosting (Chen et al, 2016) and deep learning (Schmidhuber, 2014).
The $MAPPER$ algorithm (Singh et al, 2007) from Topological Data Analysis (Carlsson, 2009) turns any data or function output into a graph (or simplicial complex), which is used for data exploration (Lum et al, 2013), error analysis (Carlsson et al, 2018), serving as input to higher-level machine learning algorithms (Hofer et al, 2017), and more.
Dropout (Srivastava et al, 2014) can be viewed as an ensemble of many different sub-networks inside a single neural network which, much like bootstrap aggregation of decision trees (Breiman, 1996), aims to combat overfitting. Viewed as such, dropout can serve as a Bayesian approximation (Rubin, 1984) in the variational inference framework (Gal, 2016).
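This ensemble view can be made concrete with a toy fully-connected network (all weights below are hypothetical random values, not the model trained later): keeping dropout active at prediction time, each forward pass samples a different sub-network, and the spread of the sampled outputs gives an uncertainty estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "network": one ReLU hidden layer with fixed random weights.
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 1))
x = rng.normal(size=(1, 4))

def forward(x, drop_p=0.5):
    """One stochastic forward pass: the dropout mask selects a sub-network."""
    h = np.maximum(x @ W1, 0)            # ReLU hidden layer
    mask = rng.random(h.shape) > drop_p  # random dropout mask
    h = h * mask / (1 - drop_p)          # inverted-dropout scaling
    return (h @ W2).item()

# Averaging many stochastic passes approximates the ensemble /
# Bayesian view: the mean is the prediction, the std its uncertainty.
samples = np.array([forward(x) for _ in range(1000)])
mean, std = samples.mean(), samples.std()
```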
Interpretability is useful for detecting bias in, and debugging errors of, machine learning models. Many methods exist, such as tree paths (Saabas, 2014), saliency maps, permutation feature importance (Altmann et al, 2010), and locally fitted white-box models (van Veen, 2015; Ribeiro et al, 2016). More recent efforts aim to combine a variety of methods (Korobov et al, 2016; Olah et al, 2018).
Error analysis surfaces the subsets or types of data on which a model makes fundamental errors. When building policies and making financial decisions based on the output of a model, it is useful to study not only the errors of the model, but also its confidence.
We will use the MNIST dataset (LeCun et al, 1999), Keras (Chollet et al, 2015) with TensorFlow (Abadi et al, 2016), NumPy (van der Walt et al., 2011), Pandas (McKinney, 2010), Scikit-Learn (Pedregosa et al, 2011), Matplotlib (Hunter, 2007), and KeplerMapper (Saul et al, 2017).
To classify between the digits 3 and 5, we train a Multi-Layer Perceptron (Ivakhnenko et al, 1965) with two hidden layers, backpropagation (LeCun et al, 1998), ReLU activations (Nair et al, 2010), the Adam optimizer (Kingma et al, 2014), dropout of 0.5, and a sigmoid output.
We perform 1000 forward passes to get the standard deviation and variation ratio of our predictions, as per (Gal, 2016, page 51).
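As a minimal sketch of these two statistics (using a hypothetical 3×5 matrix of stochastic predictions in place of the 1000 real forward passes), the standard deviation and the variation ratio, defined as the fraction of passes that disagree with the majority vote, can be computed as:

```python
import numpy as np

# Hypothetical stochastic predictions: 3 examples x 5 forward passes each.
preds = np.array([[0.9, 0.8, 0.95, 0.85, 0.9],   # confident positive
                  [0.6, 0.4, 0.7, 0.3, 0.55],    # uncertain
                  [0.1, 0.05, 0.2, 0.1, 0.15]])  # confident negative

std = preds.std(axis=1)                      # predictive standard deviation
hard = preds > 0.5                           # hard label per forward pass
mode = (hard.mean(axis=1) > 0.5)[:, None]    # majority vote per example
var_ratio = 1 - (hard == mode).mean(axis=1)  # share disagreeing with the mode
```

The uncertain middle row gets both a larger standard deviation and a non-zero variation ratio, while the two confident rows score zero variation ratio.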
Closely following the $FiFa$ method from (Carlsson et al, 2018, page 4), we then apply $MAPPER$ with the 2D filter function [uncertainty(x), error(x)] to project the data. We cover this projection with 10 intervals per dimension, each overlapping 80%. We cluster with agglomerative clustering (Ward, 1963) (n_clusters=2) and use the penultimate layer activations as the inverse $X$. To guide exploration, we color the graph nodes by absolute error.
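The cover step can be illustrated in one dimension. The sketch below is a simplification, not KeplerMapper's exact internals; `cover_1d` is a hypothetical helper, parameterized to match the `nr_cubes=10, overlap_perc=0.8` arguments passed to `mapper.map` later in the notebook.

```python
import numpy as np

def cover_1d(lo, hi, n_intervals=10, overlap=0.8):
    """Split [lo, hi] into n_intervals overlapping intervals.

    Each interval is widened by `overlap` relative to the base chunk,
    so neighbouring intervals share points -- a simplified version of
    the Mapper cover.
    """
    chunk = (hi - lo) / n_intervals
    radius = chunk * (1 + overlap) / 2
    centers = lo + chunk * (np.arange(n_intervals) + 0.5)
    return [(c - radius, c + radius) for c in centers]

intervals = cover_1d(0.0, 1.0, n_intervals=10, overlap=0.8)
# Points in the overlap regions fall into more than one interval,
# which is what creates edges between Mapper graph nodes.
```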
We also request predictions for the digit 4, which was never seen during training (Larochelle et al, 2008), to see how this influences the confidence of the network and to compare the graphs output by KeplerMapper.
For every graph node we show the original images. Binary classification on MNIST digits is easy enough that a simple interpretability method suffices to show what distinguishes a cluster from the rest of the data: we rank each feature by z-score and highlight the top 10% of features (Singh, 2016).
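A minimal sketch of this z-score highlighting, with hypothetical random data standing in for the MNIST pixels and predicted labels:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical flattened images: 100 samples x 784 pixels, two predicted classes.
X = rng.random((100, 784)).astype("float32")
hard_pred = rng.integers(0, 2, size=100)

# z-score per pixel: difference of the class means, scaled by the overall std.
std = X.std(axis=0)
z = (X[hard_pred == 0].mean(axis=0) - X[hard_pred == 1].mean(axis=0)) / std

# Highlight the ~10% most discriminative pixels in each direction.
k = int(0.1 * X.shape[1])
top_neg = np.argsort(z)[:k]   # pixels most typical of class 1
top_pos = np.argsort(z)[-k:]  # pixels most typical of class 0
```

On real data these two index sets are what gets painted over the digit images in the tooltips.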
In [1]:
%matplotlib inline
import keras
from keras import backend as K
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import Adam
import kmapper as km
import numpy as np
import pandas as pd
from sklearn import metrics, cluster, preprocessing
from matplotlib import pyplot as plt
plt.style.use("ggplot")
In [2]:
# get the data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_strange = X_train[y_train == 4]
y_strange = y_train[y_train == 4]
X_train = X_train[np.logical_or(y_train == 3, y_train == 5)]
y_train = y_train[np.logical_or(y_train == 3, y_train == 5)]
X_test = X_test[np.logical_or(y_test == 3, y_test == 5)]
y_test = y_test[np.logical_or(y_test == 3, y_test == 5)]
X_strange = X_strange[:X_test.shape[0]]
y_strange = y_strange[:X_test.shape[0]]
X_train = X_train.reshape(-1, 784)
X_test = X_test.reshape(-1, 784)
X_strange = X_strange.reshape(-1, 784)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_strange = X_strange.astype('float32')
X_train /= 255
X_test /= 255
X_strange /= 255
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')
print(X_strange.shape[0], 'strange samples')
# convert class vectors to binary class matrices
y_train = (y_train == 3).astype(int)
y_test = (y_test == 3).astype(int)
y_mean_test = y_test.mean()
print(y_mean_test, 'y test mean')
In [3]:
batch_size = 128
num_classes = 1
epochs = 10
model = Sequential()
model.add(Dropout(0.5, input_shape=(784,)))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='sigmoid'))
model.summary()
model.compile(loss='binary_crossentropy',
optimizer=Adam(),
metrics=['accuracy'])
In [4]:
history = model.fit(X_train, y_train,
batch_size=batch_size,
epochs=epochs,
verbose=1,
validation_data=(X_test, y_test))
score = model.evaluate(X_test, y_test, verbose=0)
score
Out[4]:
In [5]:
FP = 1000  # number of stochastic forward passes
predict_stochastic = K.function([model.layers[0].input, K.learning_phase()], [model.layers[-1].output])
y_pred_test = np.array([predict_stochastic([X_test, 1]) for _ in range(FP)])
y_pred_stochastic_test = y_pred_test.reshape(-1,y_test.shape[0]).T
y_pred_std_test = np.std(y_pred_stochastic_test, axis=1)
y_pred_mean_test = np.mean(y_pred_stochastic_test, axis=1)
y_pred_mode_test = (np.mean(y_pred_stochastic_test > .5, axis=1) > .5).astype(int).reshape(-1,1)
y_pred_var_ratio_test = 1 - np.mean((y_pred_stochastic_test > .5) == y_pred_mode_test, axis=1)
test_analysis = pd.DataFrame({
"y_true": y_test,
"y_pred": y_pred_mean_test,
"VR": y_pred_var_ratio_test,
"STD": y_pred_std_test
})
print(metrics.accuracy_score(y_true=y_test, y_pred=y_pred_mean_test > .5))
print(test_analysis.describe())
In [6]:
prediction_cut_off = (test_analysis.y_pred < .96) & (test_analysis.y_pred > .94)
std_diff = test_analysis.STD[prediction_cut_off].max() - test_analysis.STD[prediction_cut_off].min()
vr_diff = test_analysis.VR[prediction_cut_off].max() - test_analysis.VR[prediction_cut_off].min()
num_preds = test_analysis.STD[prediction_cut_off].shape[0]
# STD plot
plt.figure(figsize=(16,8))
plt.suptitle("Standard Deviation of Test Predictions", fontsize=18, weight="bold")
plt.title("For the %d predictions between 0.94 and 0.96, the STD varies by %f" % (num_preds, std_diff),
          style="italic")
plt.xlabel("Standard Deviation")
plt.ylabel("Predicted Probability")
plt.scatter(test_analysis.STD, test_analysis.y_pred, alpha=.3)
plt.scatter(test_analysis.STD[prediction_cut_off],
test_analysis.y_pred[prediction_cut_off])
plt.show()
# VR plot
plt.figure(figsize=(16,8))
plt.suptitle("Variation Ratio of Test Predictions", fontsize=18, weight="bold")
plt.title("For the %d predictions between 0.94 and 0.96, the variation ratio varies by %f" % (num_preds, vr_diff),
          style="italic")
plt.xlabel("Variation Ratio")
plt.ylabel("Predicted Probability")
plt.scatter(test_analysis.VR, test_analysis.y_pred, alpha=.3)
plt.scatter(test_analysis.VR[prediction_cut_off],
            test_analysis.y_pred[prediction_cut_off])
plt.show()
In [7]:
predict_penultimate_layer = K.function([model.layers[0].input, K.learning_phase()], [model.layers[-2].output])
X_inverse_test = np.array(predict_penultimate_layer([X_test, 1]))[0]
print((X_inverse_test.shape, "X_inverse_test shape"))
In [8]:
X_projected_test = np.c_[test_analysis.STD, test_analysis.y_true - test_analysis.y_pred]
print((X_projected_test.shape, "X_projected_test shape"))
In [9]:
mapper = km.KeplerMapper(verbose=2)
G = mapper.map(X_projected_test,
X_inverse_test,
clusterer=cluster.AgglomerativeClustering(n_clusters=2),
overlap_perc=0.8,
nr_cubes=10)
In [10]:
color_function_output = np.abs(y_test - test_analysis.y_pred)  # absolute error per test sample
In [11]:
import io
import base64
from scipy.misc import toimage, imresize
# Create z-scores
hard_predictions = (test_analysis.y_pred > 0.5).astype(int)
o = np.std(X_test, axis=0)
u = np.mean(X_test[hard_predictions == 0], axis=0)
v = np.mean(X_test[hard_predictions == 1], axis=0)
z_scores = (u-v)/o
scores_0 = sorted([(score,i) for i, score in enumerate(z_scores) if str(score) != "nan"],
reverse=False)
scores_1 = sorted([(score,i) for i, score in enumerate(z_scores) if str(score) != "nan"],
reverse=True)
# Fill RGBA image array with top 200 scores for positive and negative
img_array_0 = np.zeros((28,28,4))
img_array_1 = np.zeros((28,28,4))
for e, (score, i) in enumerate(scores_0[:200]):
    y = i % 28
    x = (i - (i % 28)) // 28
    img_array_0[x][y] = [255, 255, 0, 205 - e]
for e, (score, i) in enumerate(scores_1[:200]):
    y = i % 28
    x = (i - (i % 28)) // 28
    img_array_1[x][y] = [255, 0, 0, 205 - e]
img_array = (img_array_0 + img_array_1) / 2
# Get base64 encoded version of this
output = io.BytesIO()
img = imresize(img_array, (64,64))
img = toimage(img)
img.save(output, format="PNG")
contents = output.getvalue()
explanation_img_encoded = base64.b64encode(contents)
output.close()
# Create tooltips for each digit
tooltip_s = []
for ys, image_data in zip(y_test, X_test):
    output = io.BytesIO()
    img = toimage(imresize(image_data.reshape((28, 28)), (64, 64)))  # Data was a flat row of "pixels".
    img.save(output, format="PNG")
    contents = output.getvalue()
    img_encoded = base64.b64encode(contents)
    img_tag = """<div style="width:71px;
                             height:71px;
                             overflow:hidden;
                             float:left;
                             position: relative;">
    <img src="data:image/png;base64,%s" style="position:absolute; top:0; right:0" />
    <img src="data:image/png;base64,%s" style="position:absolute; top:0; right:0;
         opacity:0.5; width: 64px; height: 64px;" />
    <div style="position: relative; top: 0; left: 1px; font-size:9px">%s</div>
    </div>""" % (img_encoded.decode('utf-8'),
                 explanation_img_encoded.decode('utf-8'),
                 ys)
    tooltip_s.append(img_tag)
    output.close()
tooltip_s = np.array(tooltip_s)
In [12]:
_ = mapper.visualize(G,
lens=X_projected_test,
lens_names=["Uncertainty", "Error"],
custom_tooltips=tooltip_s,
color_function=color_function_output.values,
title="Confidence Graph for a MLP trained on MNIST",
path_html="confidence_graph_output.html")
In [13]:
from kmapper import jupyter
jupyter.display("confidence_graph_output.html")