Anomaly detection is the machine learning task of spotting so-called outliers.
“An outlier is an observation in a data set which appears to be inconsistent with the remainder of that set of data.” (Johnson, 1992)
“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.” (Hawkins, 1980)
In [ ]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
Let's first get familiar with different unsupervised anomaly detection approaches and algorithms. To visualise the output of the different algorithms, we consider a toy data set consisting of a two-dimensional Gaussian mixture.
In [ ]:
from sklearn.datasets import make_blobs
X, y = make_blobs(n_features=2, centers=3, n_samples=500,
                  random_state=42)
In [ ]:
X.shape
In [ ]:
plt.figure()
plt.scatter(X[:, 0], X[:, 1])
plt.show()
In [ ]:
from sklearn.neighbors import KernelDensity
# Estimate density with a Gaussian kernel density estimator
kde = KernelDensity(kernel='gaussian')
kde = kde.fit(X)
kde
In [ ]:
kde_X = kde.score_samples(X)
print(kde_X.shape)  # contains the log-likelihood of the data; the smaller the value, the rarer the sample
In [ ]:
from scipy.stats.mstats import mquantiles
alpha_set = 0.95
tau_kde = mquantiles(kde_X, 1. - alpha_set)
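With this threshold in hand, we can already flag the low-density samples ourselves: any sample whose log-density falls below tau_kde is declared an outlier. A minimal sketch (the name X_outliers_kde is our own, not from the original notebook):
In [ ]:
# Flag samples whose log-density falls below the (1 - alpha_set) quantile
X_outliers_kde = X[kde_X < tau_kde[0]]
print(X_outliers_kde.shape)  # roughly 5% of the 500 samples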
In [ ]:
n_samples, n_features = X.shape
X_range = np.zeros((n_features, 2))
X_range[:, 0] = np.min(X, axis=0) - 1.
X_range[:, 1] = np.max(X, axis=0) + 1.
h = 0.1 # step size of the mesh
x_min, x_max = X_range[0]
y_min, y_max = X_range[1]
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
grid = np.c_[xx.ravel(), yy.ravel()]
In [ ]:
Z_kde = kde.score_samples(grid)
Z_kde = Z_kde.reshape(xx.shape)
plt.figure()
c_0 = plt.contour(xx, yy, Z_kde, levels=tau_kde, colors='red', linewidths=3)
plt.clabel(c_0, inline=1, fontsize=15, fmt={tau_kde[0]: str(alpha_set)})
plt.scatter(X[:, 0], X[:, 1])
plt.show()
The problem with density-based estimation is that it tends to become inefficient when the dimensionality of the data increases. This is the so-called curse of dimensionality, which affects density estimation algorithms in particular. The one-class SVM algorithm can be used in such cases.
In [ ]:
from sklearn.svm import OneClassSVM
In [ ]:
nu = 0.05  # theory: nu upper-bounds the fraction of outliers and lower-bounds the fraction of support vectors
ocsvm = OneClassSVM(kernel='rbf', gamma=0.05, nu=nu)
ocsvm.fit(X)
In [ ]:
X_outliers = X[ocsvm.predict(X) == -1]
In [ ]:
Z_ocsvm = ocsvm.decision_function(grid)
Z_ocsvm = Z_ocsvm.reshape(xx.shape)
plt.figure()
c_0 = plt.contour(xx, yy, Z_ocsvm, levels=[0], colors='red', linewidths=3)
plt.clabel(c_0, inline=1, fontsize=15, fmt={0: str(alpha_set)})
plt.scatter(X[:, 0], X[:, 1], label='inliers')
plt.scatter(X_outliers[:, 0], X_outliers[:, 1], color='red', label='outliers')
plt.legend()
plt.show()
In [ ]:
X_SV = X[ocsvm.support_]
n_SV = len(X_SV)
n_outliers = len(X_outliers)
print('{0:.2f} <= {1:.2f} <= {2:.2f}?'.format(1./n_samples*n_outliers, nu, 1./n_samples*n_SV))
Only the support vectors are involved in the decision function of the One-Class SVM.
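We can verify this numerically: rebuilding the decision function from the support vectors alone should reproduce scikit-learn's output. A sketch assuming the usual SVM dual form, f(x) = sum_i dual_coef_[i] * k(sv_i, x) + intercept_:
In [ ]:
from sklearn.metrics.pairwise import rbf_kernel

# Recompute the decision function using only the support vectors
K = rbf_kernel(X, ocsvm.support_vectors_, gamma=ocsvm.gamma)
manual_decision = K @ ocsvm.dual_coef_.ravel() + ocsvm.intercept_
print(np.allclose(manual_decision, ocsvm.decision_function(X).ravel()))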
In [ ]:
plt.figure()
plt.contourf(xx, yy, Z_ocsvm, 10, cmap=plt.cm.Blues_r)
plt.scatter(X[:, 0], X[:, 1], s=1.)
plt.scatter(X_SV[:, 0], X_SV[:, 1], color='orange')
plt.show()
In [ ]:
# %load solutions/22_A-anomaly_ocsvm_gamma.py
Isolation Forest is a tree-based anomaly detection algorithm. It builds a number of random trees, the rationale being that an isolated sample should end up alone in a leaf after very few random splits. Isolation Forest derives an abnormality score from the depth at which samples end up isolated in the trees.
In [ ]:
from sklearn.ensemble import IsolationForest
In [ ]:
iforest = IsolationForest(n_estimators=300, contamination=0.10)
iforest = iforest.fit(X)
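As a quick sanity check (an addition of ours, not part of the exercise): a point far outside the blobs should be isolated after very few splits and therefore receive a lower score than typical training samples. make_blobs draws its centers in [-10, 10] by default, so [20, 20] is far away:
In [ ]:
# The far-away point should get a markedly lower (more abnormal) score
far_point = np.array([[20., 20.]])
print(iforest.decision_function(far_point))
print(iforest.decision_function(X[:5]))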
In [ ]:
Z_iforest = iforest.decision_function(grid)
Z_iforest = Z_iforest.reshape(xx.shape)
plt.figure()
# In recent scikit-learn versions decision_function is already shifted,
# so the contamination-based threshold sits at 0
c_0 = plt.contour(xx, yy, Z_iforest,
                  levels=[0.], colors='red', linewidths=3)
plt.clabel(c_0, inline=1, fontsize=15,
           fmt={0.: str(alpha_set)})
plt.scatter(X[:, 0], X[:, 1], s=1.)
plt.show()
In [ ]:
# %load solutions/22_B-anomaly_iforest_n_trees.py
In [ ]:
from sklearn.datasets import load_digits
digits = load_digits()
The digits data set consists of 8 x 8 images of handwritten digits.
In [ ]:
images = digits.images
labels = digits.target
images.shape
In [ ]:
i = 102
plt.figure(figsize=(2, 2))
plt.title('{0}'.format(labels[i]))
plt.axis('off')
plt.imshow(images[i], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
To use the images as a training set we need to flatten them into vectors of 64 features.
In [ ]:
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
In [ ]:
data.shape
In [ ]:
X = data
y = digits.target
In [ ]:
X.shape
Let's focus on digit 5.
In [ ]:
X_5 = X[y == 5]
In [ ]:
X_5.shape
In [ ]:
fig, axes = plt.subplots(1, 5, figsize=(10, 4))
for ax, x in zip(axes, X_5[:5]):
    img = x.reshape(8, 8)
    ax.imshow(img, cmap=plt.cm.gray_r, interpolation='nearest')
    ax.axis('off')
In [ ]:
from sklearn.ensemble import IsolationForest
iforest = IsolationForest(contamination=0.05)
iforest = iforest.fit(X_5)
Compute the level of "abnormality" with iforest.decision_function. The lower the score, the more abnormal the sample.
In [ ]:
iforest_X = iforest.decision_function(X_5)
plt.hist(iforest_X);
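In recent scikit-learn versions decision_function is shifted so that 0 is the decision threshold, hence the fraction of negative scores should roughly match the contamination parameter we set:
In [ ]:
# Should be close to contamination=0.05
print(np.mean(iforest_X < 0.))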
Let's plot the strongest inliers.
In [ ]:
X_strong_inliers = X_5[np.argsort(iforest_X)[-10:]]
fig, axes = plt.subplots(2, 5, figsize=(10, 5))
for i, ax in zip(range(len(X_strong_inliers)), axes.ravel()):
    ax.imshow(X_strong_inliers[i].reshape((8, 8)),
              cmap=plt.cm.gray_r, interpolation='nearest')
    ax.axis('off')
Let's plot the strongest outliers.
In [ ]:
fig, axes = plt.subplots(2, 5, figsize=(10, 5))
X_outliers = X_5[iforest.predict(X_5) == -1]
for i, ax in zip(range(len(X_outliers)), axes.ravel()):
    ax.imshow(X_outliers[i].reshape((8, 8)),
              cmap=plt.cm.gray_r, interpolation='nearest')
    ax.axis('off')
In [ ]:
# %load solutions/22_C-anomaly_digits.py