Robust and calibrated estimators with Scikit-Learn



Gilles Louppe (@glouppe)

New York University

In [1]:
# Global imports and settings

# Matplotlib
%matplotlib inline
from matplotlib import pyplot as plt
plt.rcParams["figure.figsize"] = (8, 8)
plt.rcParams["figure.max_open_warning"] = -1

# Print options
import numpy as np
np.set_printoptions(precision=3)

# Slideshow
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {'width': 1440, 'height': 768, 'scroll': True, 'theme': 'simple'})

# Silence warnings
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.simplefilter(action="ignore", category=UserWarning)
warnings.simplefilter(action="ignore", category=RuntimeWarning)

# Utils
from robustness import plot_surface
from robustness import plot_outlier_detector

In [2]:
%%javascript
Reveal.addEventListener("slidechanged", function(event){ window.location.hash = "header"; });


Motivation

In theory,

  • Samples $x$ are drawn from a distribution $P$;
  • As the number of samples increases, convergence towards the optimal model is guaranteed.

In practice,

  • A few samples may lie far from the others:
    • either because they correspond to rare observations,
    • or because they are due to experimental errors;
  • Because data is finite, such outliers might strongly affect the resulting model.

Today's goal: build models that are robust to outliers!

Outline

  • Motivation
  • Novelty and anomaly detection
  • Ensembling for robustness
  • From least squares to least absolute deviations
  • Calibration

Novelty and anomaly detection

Novelty detection:

  • Training data is not polluted by outliers, and we are interested in detecting anomalies in new observations (sketched below).

Outlier detection:

  • Training data contains outliers, and we need to fit the central mode of the training data, ignoring the deviant observations.
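
For instance, a minimal sketch of the novelty-detection workflow (assuming an outlier-free training sample; the parameters are illustrative):

In [ ]:
# Novelty detection: train on clean data only, then flag unusual new observations
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_clean = rng.normal(size=(200, 2))           # training data, assumed outlier-free
X_new = np.vstack([rng.normal(size=(10, 2)),  # ordinary new observations
                   [[4, 4], [-5, 5]]])        # two obvious novelties

detector = OneClassSVM(nu=0.05, kernel="rbf", gamma=0.1).fit(X_clean)
print(detector.predict(X_new))  # +1 == inlier, -1 == novelty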

API


In [ ]:
# Unsupervised learning  
estimator.fit(X_train)  # no "y_train"

# Detecting novelty or outliers 
y_pred = estimator.predict(X_test)             # inliers == 1, outliers == -1
y_score = estimator.decision_function(X_test)  # the lower the score, the more abnormal

In [3]:
# Generate data
from sklearn.datasets import make_blobs

inliers, _ = make_blobs(n_samples=200, centers=2, random_state=1)

# Scatter uniform outliers over the bounding box of the inliers
np.random.seed(42)
outliers = np.random.rand(50, 2)
outliers = np.min(inliers, axis=0) + (np.max(inliers, axis=0) - np.min(inliers, axis=0)) * outliers

X = np.vstack((inliers, outliers))
ground_truth = np.ones(len(X), dtype=int)  # 1 == inlier, 0 == outlier
ground_truth[-len(outliers):] = 0

In [4]:
from sklearn.svm import OneClassSVM
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest

# Unsupervised learning
estimator = OneClassSVM(nu=0.4, kernel="rbf", gamma=0.1)
# estimator = EllipticEnvelope(contamination=0.1)
# estimator = IsolationForest(max_samples=100)
estimator.fit(X)

plot_outlier_detector(estimator, X, ground_truth)
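
The commented-out alternatives above can be swapped in directly, since all three detectors share the same interface; a quick sketch comparing how many points each one flags (the contamination values are illustrative):

In [ ]:
from sklearn.svm import OneClassSVM
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest

for est in (OneClassSVM(nu=0.4, kernel="rbf", gamma=0.1),
            EllipticEnvelope(contamination=0.2),
            IsolationForest(max_samples=100, contamination=0.2, random_state=0)):
    y_pred = est.fit(X).predict(X)  # +1 == inlier, -1 == outlier
    print(type(est).__name__, "flags", (y_pred == -1).sum(), "points")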


Ensembling for robustness

Bias-variance decomposition

Theorem. For the squared error loss, the bias-variance decomposition of the expected generalization error at $X=\mathbf{x}$ is

$$ \mathbb{E}_{\cal L} \{ Err(\varphi_{\cal L}(\mathbf{x})) \} = \text{noise}(\mathbf{x}) + \text{bias}^2(\mathbf{x}) + \text{var}(\mathbf{x}) $$

where the noise term is irreducible, the bias measures the systematic discrepancy between the model and the best possible model, and the variance measures how much the model fluctuates from one learning set ${\cal L}$ to another.

Variance and robustness

  • Low variance implies robustness to outliers
  • High variance implies sensitivity to data peculiarities

Ensembling reduces variance

Theorem. For the squared error loss, the bias-variance decomposition of the expected generalization error at $X=\mathbf{x}$ of an ensemble of $M$ randomized models $\varphi_{{\cal L},\theta_m}$ is

$$ \mathbb{E}_{\cal L} \{ Err(\psi_{{\cal L},\theta_1,\dots,\theta_M}(\mathbf{x})) \} = \text{noise}(\mathbf{x}) + \text{bias}^2(\mathbf{x}) + \text{var}(\mathbf{x}) $$

where \begin{align*} \text{noise}(\mathbf{x}) &= Err(\varphi_B(\mathbf{x})), \\ \text{bias}^2(\mathbf{x}) &= (\varphi_B(\mathbf{x}) - \mathbb{E}_{{\cal L},\theta} \{ \varphi_{{\cal L},\theta}(\mathbf{x}) \} )^2, \\ \text{var}(\mathbf{x}) &= \rho(\mathbf{x}) \sigma^2_{{\cal L},\theta}(\mathbf{x}) + \frac{1 - \rho(\mathbf{x})}{M} \sigma^2_{{\cal L},\theta}(\mathbf{x}). \end{align*}
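
A quick numerical sanity check of the variance term (a sketch: simulate $M$ equicorrelated predictions with correlation $\rho(\mathbf{x})$ and variance $\sigma^2$, then compare the empirical variance of their average with $\rho \sigma^2 + \frac{1 - \rho}{M} \sigma^2$):

In [ ]:
rng = np.random.RandomState(0)
M, rho, sigma2 = 10, 0.3, 1.0

# Covariance of M equicorrelated predictions at a fixed point x:
# sigma2 on the diagonal, rho * sigma2 off the diagonal
cov = sigma2 * (rho * np.ones((M, M)) + (1 - rho) * np.eye(M))
preds = rng.multivariate_normal(np.zeros(M), cov, size=50000)

print("empirical variance of the ensemble average:", preds.mean(axis=1).var())
print("theoretical rho*sigma2 + (1-rho)/M*sigma2:  ", rho * sigma2 + (1 - rho) / M * sigma2)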


In [5]:
# Load data
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data[:, [0, 1]]
y = iris.target

In [7]:
# A single decision tree: low bias, but high variance
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier().fit(X, y)
plot_surface(clf, X, y)

# Averaging 100 randomized trees reduces variance and smooths the decision surface
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100).fit(X, y)
plot_surface(clf, X, y)


From least squares to least absolute deviations

Robust learning

  • Most methods minimize the mean squared error $\frac{1}{N} \sum_i (y_i - \varphi(x_i))^2$.
  • By definition, squaring the residuals gives extra emphasis to large residuals.
  • Outliers are thus very likely to have a significant effect on the fit.
  • A robust alternative is to minimize instead the mean absolute deviation $\frac{1}{N} \sum_i |y_i - \varphi(x_i)|$.
  • Large residuals are then given much less emphasis, as the sketch below illustrates.
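
The effect is easiest to see on a constant model: the constant minimizing the mean squared error is the mean, while the constant minimizing the mean absolute deviation is the median (a minimal sketch with one gross outlier):

In [ ]:
y_demo = np.array([1.0, 2.0, 3.0, 4.0, 100.0])     # one gross outlier
print("least-squares fit (mean):    ", y_demo.mean())      # dragged towards the outlier
print("least-absolute fit (median): ", np.median(y_demo))  # barely affected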

In [8]:
# Generate data
from sklearn.datasets import make_regression

n_outliers = 3
X, y, coef = make_regression(n_samples=100, n_features=1, n_informative=1, noise=10,
                             coef=True, random_state=0)

np.random.seed(1)
X[-n_outliers:] = 1 + 0.25 * np.random.normal(size=(n_outliers, 1))
y[-n_outliers:] = -100 + 10 * np.random.normal(size=n_outliers)

plt.scatter(X[:-n_outliers], y[:-n_outliers], color="b")
plt.scatter(X[-n_outliers:], y[-n_outliers:], color="r")
plt.xlim(-3, 3)
plt.ylim(-150, 120)
plt.show()



In [9]:
# Fit with least squares vs. least absolute deviations
from sklearn.ensemble import GradientBoostingRegressor

clf_ls = GradientBoostingRegressor(loss="squared_error")    # formerly loss="ls"
clf_lad = GradientBoostingRegressor(loss="absolute_error")  # formerly loss="lad"
clf_ls.fit(X, y)
clf_lad.fit(X, y)

# Plot
X_test = np.linspace(-5, 5).reshape(-1, 1)
plt.scatter(X[:-n_outliers], y[:-n_outliers], color="b")
plt.scatter(X[-n_outliers:], y[-n_outliers:], color="r")
plt.plot(X_test, clf_ls.predict(X_test), "g", label="Least squares")
plt.plot(X_test, clf_lad.predict(X_test), "y", label="Least absolute deviations")
plt.xlim(-3, 3)
plt.ylim(-150, 120)
plt.legend()
plt.show()
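
Between these two extremes, GradientBoostingRegressor also supports the Huber loss, which is quadratic for small residuals and linear for large ones (a sketch; the alpha quantile controlling the switch is illustrative):

In [ ]:
# Huber loss: quadratic near zero, linear in the tails
clf_huber = GradientBoostingRegressor(loss="huber", alpha=0.9)
clf_huber.fit(X, y)

plt.scatter(X[:-n_outliers], y[:-n_outliers], color="b")
plt.scatter(X[-n_outliers:], y[-n_outliers:], color="r")
plt.plot(X_test, clf_huber.predict(X_test), "m", label="Huber")
plt.xlim(-3, 3)
plt.ylim(-150, 120)
plt.legend()
plt.show()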


Robust scaling

  • Standardizing the dataset is a common requirement for many machine learning estimators.
  • Typically, this is done by removing the mean and scaling to unit variance.
  • For the same reasons as before, outliers can badly distort the sample mean and variance.
  • In such cases, centering on the median and scaling by the interquartile range often give better results, as sketched below.
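
Concretely, RobustScaler centers each feature on its median and scales by the interquartile range; a minimal sketch checking this against a manual computation:

In [ ]:
from sklearn.preprocessing import RobustScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # one huge outlier
q1, med, q3 = np.percentile(x, [25, 50, 75])
manual = (x - med) / (q3 - q1)  # (x - median) / IQR
print(np.allclose(RobustScaler().fit_transform(x), manual))  # True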

In [10]:
# Generate data
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=100, centers=[(0, 0), (-1, 0)], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
X_train[0, 0] = -1000  # a fairly large outlier

In [11]:
# Scale data
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler

standard_scaler = StandardScaler()
Xtr_s = standard_scaler.fit_transform(X_train)
Xte_s = standard_scaler.transform(X_test)

robust_scaler = RobustScaler()
Xtr_r = robust_scaler.fit_transform(X_train)
Xte_r = robust_scaler.transform(X_test)

In [12]:
# Plot data
fig, ax = plt.subplots(1, 3, figsize=(12, 4))
ax[0].scatter(X_train[:, 0], X_train[:, 1], color=np.where(y_train == 0, 'r', 'b'))
ax[1].scatter(Xtr_s[:, 0], Xtr_s[:, 1], color=np.where(y_train == 0, 'r', 'b'))
ax[2].scatter(Xtr_r[:, 0], Xtr_r[:, 1], color=np.where(y_train == 0, 'r', 'b'))
ax[0].set_title("Unscaled data")
ax[1].set_title("After standard scaling (zoomed in)")
ax[2].set_title("After robust scaling (zoomed in)")

# for the scaled data, we zoom in to the data center (outlier can't be seen!)
for a in ax[1:]:
    a.set_xlim(-3, 3)
    a.set_ylim(-3, 3)
    
plt.show()



In [13]:
# Classify using kNN
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(Xtr_s, y_train)
acc_s = knn.score(Xte_s, y_test)
print("Test set accuracy using standard scaler: %.3f" % acc_s)

knn.fit(Xtr_r, y_train)
acc_r = knn.score(Xte_r, y_test)
print("Test set accuracy using robust scaler:   %.3f" % acc_r)


Test set accuracy using standard scaler: 0.460
Test set accuracy using robust scaler:   0.680

Calibration

  • In classification, you often want to predict not only the class label, but also the associated probability.
  • However, not all classifiers provide well-calibrated probabilities.
  • Thus, a separate calibration of the predicted probabilities is often desirable as a postprocessing step.

In [14]:
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Generate 3 blobs with 2 classes where the second blob contains
# half positive samples and half negative samples. Probability in this
# blob is therefore 0.5.
X, y = make_blobs(n_samples=10000, n_features=2, cluster_std=1.0, 
                  centers=[(-5, -5), (0, 0), (5, 5)], shuffle=False)
y[:len(X) // 2] = 0
y[len(X) // 2:] = 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

In [15]:
# Plot
for this_y, color in zip([0, 1], ["r", "b"]):
    this_X = X_train[y_train == this_y]
    plt.scatter(this_X[:, 0], this_X[:, 1], c=color, alpha=0.2, label="Class %s" % this_y)
plt.legend(loc="best")
plt.title("Data")
plt.show()



In [16]:
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV

# Without calibration
clf = GaussianNB()
clf.fit(X_train, y_train)
prob_pos_clf = clf.predict_proba(X_test)[:, 1]

# With isotonic calibration
clf_isotonic = CalibratedClassifierCV(clf, cv=2, method='isotonic')
clf_isotonic.fit(X_train, y_train)
prob_pos_isotonic = clf_isotonic.predict_proba(X_test)[:, 1]
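
A quick quantitative check, reusing the probabilities from the cell above (a sketch with the Brier score; lower is better):

In [ ]:
from sklearn.metrics import brier_score_loss

print("Brier score without calibration:       %.4f" % brier_score_loss(y_test, prob_pos_clf))
print("Brier score with isotonic calibration: %.4f" % brier_score_loss(y_test, prob_pos_isotonic))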

In [17]:
# Plot
order = np.argsort(prob_pos_clf)
plt.plot(prob_pos_clf[order], 'r', label='No calibration')
plt.plot(prob_pos_isotonic[order], 'b', label='Isotonic calibration')
plt.plot(np.linspace(0, y_test.size, 51)[1::2], y_test[order].reshape(25, -1).mean(1), 'k--', label=r'Empirical')

plt.xlabel("Instances sorted according to predicted probability "
           "(uncalibrated GNB)")
plt.ylabel("P(y=1)")
plt.legend(loc="upper left")
plt.title("Gaussian naive Bayes probabilities")
plt.ylim([-0.05, 1.05])
plt.show()


Summary

For robust and calibrated estimators:

  • remove outliers before training;
  • reduce variance by ensembling estimators;
  • drive your analysis with loss functions that are robust to outliers;
    • in particular, avoid the squared error loss when outliers may be present!
  • calibrate the output of your classifier if probabilities are important for your problem.

In [ ]:
questions?