In [1]:
### First imports and default parameters
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
# Overwriting matplotlib default linestyle of negative contours
matplotlib.rcParams['contour.negative_linestyle'] = 'solid'
SEED = 42
The goal of this section is to get familiar with different unsupervised anomaly detection approaches and algorithms. In order to visualise the output of the different algorithms we consider a toy data set consisting of a two-dimensional Gaussian mixture.
In [2]:
from utils import GaussianMixture
n_samples = 500
n_features = 2
weight_1 = 0.5
weight_2 = 0.5
mean_1 = np.zeros(n_features)
mean_2 = -1 * np.ones(n_features)
cov_1 = np.array([[2., 2.,], [2., 4.]])
cov_2 = 2 * np.identity(n_features)
weights = np.array([weight_1, weight_2])
means = np.array([mean_1, mean_2])
covars = np.array([cov_1, cov_2])
gm = GaussianMixture(weights, means, covars, random_state=SEED)
X = gm.sample(n_samples)
In [3]:
X_range = np.zeros((n_features, 2))
X_range[:, 0] = np.min(X, axis=0)
X_range[:, 1] = np.max(X, axis=0)
h = 0.1 # step size of the mesh
x_min, x_max = X_range[0, 0] - 0.1, X_range[0, 1] + 0.1
y_min, y_max = X_range[1, 0] - 0.1, X_range[1, 1] + 0.1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
grid = np.c_[xx.ravel(), yy.ravel()]
Z = gm.density(grid)
Z = Z.reshape(xx.shape)
plt.figure()
plt.contour(xx, yy, Z, 10, cmap=plt.cm.Blues_r)
plt.scatter(X[:, 0], X[:, 1], s=1.)
plt.show()
The goal is to estimate a Minimum Volume set with mass at least 0.95. We know that, under regularity assumptions on the underlying distribution, a Minimum Volume set is a density level set of the density $f$: $$ \{x, f(x) \geq F_f^{-1}(1-\alpha) \} $$ where $F_f^{-1}(1-\alpha)$ is the quantile of order $1-\alpha$ of the distribution of the random variable $f(X)$.
Draw a sample $\mathcal{S}$ from the previous Gaussian mixture of size $n = 1,000,000$.
Using the sample $\mathcal{S}$ and the true density $f$ given by gm.density, compute the empirical quantile of order $1-\alpha$ for $\alpha = 0.95$ and $\alpha = 0.99$. You can use for instance the scipy.stats.mstats.mquantiles function.
Plot the corresponding Minimum Volume set in a figure like the previous one, using plt.contour and replacing the number of levels (10) by the threshold you computed: levels=threshold.
Emphasize the outliers in the previous plot (for instance, use a different color), where an outlier is a sample outside the Minimum Volume set.
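A minimal sketch of the steps above, assuming gm.sample and gm.density behave as in the previous cells, and reusing the grid xx, yy, grid and the true density Z already computed:
In [ ]:
from scipy.stats.mstats import mquantiles
# Large sample to estimate the quantiles of f(X)
S = gm.sample(1000000)
density_S = gm.density(S)
# Empirical quantiles of order 1 - alpha of the distribution of f(X)
tau_95, tau_99 = mquantiles(density_S, prob=[1 - 0.95, 1 - 0.99])
# Outliers are the samples lying outside the Minimum Volume set of mass 0.95
outliers = gm.density(X) < tau_95
plt.figure()
plt.contour(xx, yy, Z, levels=[tau_95], colors='blue')
plt.scatter(X[~outliers, 0], X[~outliers, 1], s=1., color='black')
plt.scatter(X[outliers, 0], X[outliers, 1], s=4., color='red')
plt.show()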
We are going to use the plug-in approach to estimate the Minimum Volume set with mass at least 0.95. The plug-in approach means that we replace the density $f$ by an estimate $\hat f$.
Write down the equation of the corresponding Minimum Volume set estimate.
Using a Gaussian kernel, compute a kernel density estimator of $f$. To stick with sklearn you can use sklearn.neighbors.kde.KernelDensity with the default bandwidth.
Compute the empirical quantile $\widehat F_{\hat f}^{-1}(1-\alpha)$ from the original sample for $\alpha = 0.95$ (still using the function mquantiles).
Add the corresponding Minimum Volume set estimate to the previous plot to visualize the difference between the true MV set and the estimated one.
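A possible sketch of the plug-in estimate, reusing tau_95, Z and the grid from the previous cells; KernelDensity is used with its default bandwidth:
In [ ]:
from sklearn.neighbors import KernelDensity
alpha = 0.95
# Kernel density estimator with a Gaussian kernel and the default bandwidth
kde = KernelDensity(kernel='gaussian').fit(X)
# score_samples returns log-densities
density_hat_X = np.exp(kde.score_samples(X))
tau_hat = mquantiles(density_hat_X, prob=1 - alpha)[0]
Z_kde = np.exp(kde.score_samples(grid)).reshape(xx.shape)
plt.figure()
plt.contour(xx, yy, Z, levels=[tau_95], colors='blue')        # true MV set
plt.contour(xx, yy, Z_kde, levels=[tau_hat], colors='orange') # plug-in estimate
plt.scatter(X[:, 0], X[:, 1], s=1.)
plt.show()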
We used the default bandwidth.
Use cross validation to learn the bandwidth from the data. You may use sklearn.model_selection.GridSearchCV to perform a 3-fold cross validation.
Plot the Minimum Volume set estimate obtained with the learnt bandwidth.
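A sketch of the cross-validation step; the grid of candidate bandwidths below is an arbitrary choice:
In [ ]:
from sklearn.model_selection import GridSearchCV
# 3-fold cross validation over a grid of bandwidths
param_grid = {'bandwidth': np.logspace(-1, 1, 20)}
grid_search = GridSearchCV(KernelDensity(kernel='gaussian'), param_grid, cv=3)
grid_search.fit(X)
kde_cv = grid_search.best_estimator_
print(grid_search.best_params_)
tau_cv = mquantiles(np.exp(kde_cv.score_samples(X)), prob=1 - alpha)[0]
Z_cv = np.exp(kde_cv.score_samples(grid)).reshape(xx.shape)
plt.figure()
plt.contour(xx, yy, Z, levels=[tau_95], colors='blue')     # true MV set
plt.contour(xx, yy, Z_cv, levels=[tau_cv], colors='green') # plug-in estimate, CV bandwidth
plt.scatter(X[:, 0], X[:, 1], s=1.)
plt.show()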
The One-Class SVM separates the data from the origin by solving the following optimization problem.
\begin{equation} \min_{\boldsymbol{w},\boldsymbol{\xi},\rho} \quad \frac{1}{2}\Vert \boldsymbol{w} \Vert^2 - \rho + \frac{1}{\nu n}\sum_{i=1}^n\xi_i\\ \text{s.t.} \quad \langle \boldsymbol{w}, \Phi(x_i) \rangle \geqslant \rho - \xi_i, \quad \xi_i \geqslant 0 \quad \quad 1 \leqslant i \leqslant n \end{equation}where the parameter $\nu \in (0,1)$ has to be set by the user and controls the proportion of outliers.
It is in general solved via the dual given by \begin{equation} \min_{\boldsymbol{\alpha}} \quad \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j k(x_i, x_j)\\ \text{s.t.} \quad 0 \leqslant \alpha_i \leqslant \frac{1}{\nu n},\quad \sum_{i}\alpha_i = 1 \quad \quad 1 \leqslant i \leqslant n \end{equation}
Under mild assumptions the following inequality holds for every sample size $n$:
$$ \frac{\text{Outliers}}{n} \leq \nu \leq \frac{\text{Support Vectors}}{n} $$.
Furthermore, the fractions on the left-hand and right-hand sides converge almost surely to $\nu$.
If we want to estimate a Minimum Volume set with mass at least $\alpha$, how can we set $\nu$?
Use sklearn.svm.OneClassSVM with the default parameters (except the parameter $\nu$, set according to the previous question) to learn a Minimum Volume set with mass at least $0.95$.
Plot the Minimum Volume set estimate.
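A sketch with nu = 1 - alpha (so that the expected proportion of outliers matches $1-\alpha$) and all other parameters left to their defaults, reusing Z and tau_95 from above:
In [ ]:
from sklearn.svm import OneClassSVM
alpha = 0.95
nu = 1 - alpha  # expected proportion of outliers
ocsvm = OneClassSVM(nu=nu).fit(X)  # default rbf kernel and default gamma
# decision_function is >= 0 inside the estimated region
Z_ocsvm = ocsvm.decision_function(grid).reshape(xx.shape)
plt.figure()
plt.contour(xx, yy, Z, levels=[tau_95], colors='blue')      # true MV set
plt.contour(xx, yy, Z_ocsvm, levels=[0.], colors='purple')  # One-Class SVM estimate
plt.scatter(X[:, 0], X[:, 1], s=1.)
plt.show()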
What do you notice? The kernel used by the One-Class SVM is given by $$ k(x, x') = \exp(-\gamma\Vert x - x' \Vert^2). $$
What is the default gamma used when we trained the One-Class SVM? Do we need to increase or decrease its value?
Check that we indeed have $$ \frac{\text{Outliers}}{n} \leq \nu \leq \frac{\text{Support Vectors}}{n} $$.
Use gamma=10. Is the inequality satisfied? Can you guess why?
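A sketch checking the two fractions for the default gamma and for gamma=10, reusing nu from the previous cell:
In [ ]:
for params in [{}, {'gamma': 10.}]:  # default gamma, then gamma=10
    clf = OneClassSVM(nu=nu, **params).fit(X)
    frac_outliers = np.mean(clf.predict(X) == -1)
    frac_SV = clf.support_vectors_.shape[0] / float(len(X))
    print('%s: outliers/n = %.3f, nu = %.2f, SV/n = %.3f'
          % (params, frac_outliers, nu, frac_SV))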
Only the support vectors are involved in the decision function of the One-Class SVM.
Concerning the support vectors, what is the main difference between the OneClass SVM and a Kernel Density estimator when using a Gaussian Kernel?
When is it particularly advantageous to use the OneClass SVM?
Support Vector Data Description consists in finding the ball of minimum volume containing an arbitrary proportion of the data. This is obtained by solving the following optimization problem: \begin{equation} \min_{R, \boldsymbol{c}, \boldsymbol{\xi}} \quad R^2 + \frac{1}{\nu n}\sum_{i=1}^n\xi_i\\ \text{s.t.} \quad \Vert \Phi(x_i) - \boldsymbol{c} \Vert^2 \leqslant R^2 + \xi_i, \quad \xi_i \geqslant 0 \quad \quad 1 \leqslant i \leqslant n \end{equation}
It is in general solved via the dual given by \begin{equation} \min_{\boldsymbol{\alpha}} \quad \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) - \sum_{i}\alpha_i k(x_i,x_i)\\ \text{s.t.} \quad 0 \leqslant \alpha_i \leqslant \frac{1}{\nu n},\quad \sum_{i}\alpha_i = 1 \quad \quad 1 \leqslant i \leqslant n \end{equation}
Run Isolation Forest with n_estimators=1, i.e., only one tree will be built.
Plot the contour of the set with the $95\%$ most normal observations. Hint: use the decision_function and the contamination parameter. Compare it with the true Minimum Volume set with mass at least $0.95$.
Increase n_estimators to 50, 100 and then 500.
What does the max_samples parameter change?
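A sketch for the Isolation Forest questions, reusing Z and tau_95 from above; contamination is set to 1 - 0.95 so that the zero level of the decision_function encloses the 95% most normal training observations (increase n_estimators and vary max_samples to answer the last two questions):
In [ ]:
from sklearn.ensemble import IsolationForest
alpha = 0.95
iforest = IsolationForest(n_estimators=1, contamination=1 - alpha,
                          random_state=SEED).fit(X)
Z_if = iforest.decision_function(grid).reshape(xx.shape)
plt.figure()
plt.contour(xx, yy, Z, levels=[tau_95], colors='blue')  # true MV set
plt.contour(xx, yy, Z_if, levels=[0.], colors='green')  # Isolation Forest estimate
plt.scatter(X[:, 0], X[:, 1], s=1.)
plt.show()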
In [1]:
from sklearn import datasets
digits = datasets.load_digits()
The digits data set consists of 8 x 8 images of digits.
In [2]:
images = digits.images
labels = digits.target
images.shape
Out[2]:
In [ ]:
### Showing some images
i = 0
plt.figure()
plt.title('{0}'.format(labels[i]))
plt.axis('off')
plt.imshow(images[i], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
To use the images as a training set we need to flatten the images.
In [4]:
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
data.shape
Out[4]:
We focus on the digit '5'.
In [5]:
X = data
y = digits.target
X_5 = X[y == 5]
k-NN based anomaly detection algorithm, available in the development version of Scikit-Learn.
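As an illustrative sketch only, assuming the k-NN based detector referred to is LocalOutlierFactor (its availability and API depend on the scikit-learn version):
In [ ]:
from sklearn.neighbors import LocalOutlierFactor
# Fit the k-NN based detector on the images of the digit '5'
lof = LocalOutlierFactor(n_neighbors=20)
y_pred = lof.fit_predict(X_5)          # +1 for inliers, -1 for outliers
scores = lof.negative_outlier_factor_  # the lower, the more abnormal
# Display the most abnormal '5' according to the scoring function
plt.figure()
plt.axis('off')
plt.imshow(X_5[np.argmin(scores)].reshape(8, 8), cmap=plt.cm.gray_r,
           interpolation='nearest')
plt.show()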
Let $P$ be a distribution with finite support $C$. Let $\mu$ be the uniform distribution over the support $C$: $\mu(A) = \lambda(A)/\lambda(C)$. We now consider the joint distribution $Q$ of $(X, Y) \in \mathbb{R}^d \times \{-1, +1\}$ given by the class-conditional distributions:
$$ X \vert Y= +1 \sim P \quad \text{and} \quad X \vert Y=-1 \sim \mu $$
and a priori class probability $p = \mathbb{P}(Y=1)$.
We are as before interested in finding a given density level set $\{ f \geq \tau \}$ where $f$ is the density of $P$ with respect to the Lebesgue measure $\lambda$.
The marginal distribution of $X$, $P_X$, is given by $$ P_X = p P + (1-p) \mu. $$
Assume that $P$ has a density $f$. It can be shown that
$$\eta(x) = P(Y=1\vert X=x) = \frac{pf(x)}{pf(x) + (1-p)/\lambda(C)}$$
The optimal classification set is given by $\{ \eta \geq 1/2 \}$.
Thus solving a binary classification problem assigning weights $p$ and $1-p$ to the samples can lead to a density level set.
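A sketch of this reduction on illustrative toy data; the 2D Gaussian sample, the random forest classifier and the value p = 0.5 are arbitrary choices, not prescribed by the text:
In [ ]:
from sklearn.ensemble import RandomForestClassifier
rng = np.random.RandomState(42)
# Toy sample from P (a 2D standard Gaussian) and from mu (uniform on a box C enclosing it)
X_P = rng.randn(500, 2)
low, high = X_P.min(axis=0), X_P.max(axis=0)
U = rng.uniform(low=low, high=high, size=(500, 2))
X_all = np.r_[X_P, U]
y_all = np.r_[np.ones(len(X_P)), -np.ones(len(U))]
# Weighted binary classification: weight p on the class +1, 1 - p on the class -1
p = 0.5
w = np.where(y_all == 1, p, 1 - p)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_all, y_all, sample_weight=w)
# Estimated density level set: {x, estimated eta(x) >= 1/2}
eta_hat = clf.predict_proba(X_all)[:, 1]
level_set = X_all[eta_hat >= 0.5]
By the expression of $\eta$ above, the recovered set estimates the level set $\{x, f(x) \geq (1-p)/(p\,\lambda(C))\}$, so varying $p$ changes the level.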
Usually unsupervised anomaly detection algorithms return a scoring function $s$ such that the smaller $s(x)$, the more abnormal $x$ is. To assess the performance of such a scoring function we can use the Mass Volume curve. The MV curve of a scoring function $s$ is given for all $\alpha \in (0,1)$ by $$ MV_s(\alpha) = \lambda\{x, s(x) \geq F_s^{-1}(1-\alpha) \} $$ where $F_s^{-1}(1-\alpha)$ is the quantile of order $1-\alpha$ of the distribution of the random variable $s(X)$.
Under regularity assumptions, the MV curve is also given by the parametric curve
$$ (\mathbb{P}(s(X) \geq t), \lambda\{x, s(x) \geq t\}) $$
Assume that the random variable $X$ with distribution $P$ has a density $f$ with respect to the Lebesgue measure. Furthermore assume that for all $\alpha \in (0,1)$ the Minimum Volume set optimization problem $$ \min_{\Omega \in \mathcal{M}} \lambda(\Omega) \quad \text{ s.t } \quad P(\Omega) \geq \alpha . $$ has a unique solution given by $$ \Omega^*_{\alpha} = \{x, f(x) \geq F_f^{-1}(1-\alpha) \}. $$
Show that for all $\alpha \in (0,1)$, $MV_s(\alpha) - MV_f(\alpha) \leq \lambda(\Omega^*_{\alpha} \Delta \{x, s(x) \geq F_s^{-1}(1-\alpha) \})$.
Show that for all $\alpha \in (0,1)$, $MV_f(\alpha) \leq MV_s(\alpha)$. Hint: use the following property: for all $\alpha \in (0,1)$, $\mathbb{P}(s(X) < F_s^{-1}(1-\alpha)) \leq 1-\alpha$.
In what sense is a scoring function minimizing the area under the Mass Volume curve optimal?
The lower $MV_s$ the better the scoring function $s$. The goal is to compare the performance of scoring functions given by the One-Class SVM for different values of $\gamma$ using the Mass Volume curve. To easily plot the Mass Volume curve we are going to use the ROC curve function given by sklearn. The idea is to use a uniform sample as the positive class and to consider the Gaussian mixture data set as the negative class. To plot the MV curve we want to plot the volume against the mass $\alpha$. This will be given by $1-TPR$ against $1-FPR$.
Using sklearn.model_selection.ShuffleSplit, split the data set into a training set and a test set.
Generate n_sim=100000 uniform random variables in the hypercube enclosing the data using numpy.random.uniform.
Build a training set consisting of the concatenation of the training set of the Gaussian mixture data set (obtained in 1.) and the uniform sample.
Similarly, build the test set by concatenating the test set of the Gaussian mixture data set (obtained in 1.) and the uniform sample.
Compute the Mass Volume curve, estimated on the training set, of the scoring function obtained with the One-Class SVM for $\gamma=0.05$ and $\gamma=10$. Compute the area under the Mass Volume curve for each value of $\gamma$.
Compute the Mass Volume curve estimated on the test set for each value of $\gamma$. Compute the area under the Mass Volume curve for each value of $\gamma$.
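A possible sketch of these steps, assuming X again holds the Gaussian mixture sample from the first section; the 50/50 split, the default nu and the sign convention (negated decision_function so that uniform points are the positive class) are choices:
In [ ]:
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import roc_curve, auc
from sklearn.svm import OneClassSVM
# 1. Train/test split of the Gaussian mixture sample
ss = ShuffleSplit(n_splits=1, test_size=0.5, random_state=SEED)
train_idx, test_idx = next(iter(ss.split(X)))
X_train, X_test = X[train_idx], X[test_idx]
# 2. Uniform sample in the hypercube enclosing the data
n_sim = 100000
rng = np.random.RandomState(SEED)
U = rng.uniform(low=X.min(axis=0), high=X.max(axis=0), size=(n_sim, X.shape[1]))
vol_C = np.prod(X.max(axis=0) - X.min(axis=0))  # volume of the hypercube
plt.figure()
for gamma in [0.05, 10.]:
    clf = OneClassSVM(gamma=gamma).fit(X_train)  # nu left at its default value
    for name, X_eval in [('train', X_train), ('test', X_test)]:
        # Uniform sample = positive class, data = negative class,
        # score = abnormality score (the larger, the more abnormal)
        y_true = np.r_[np.zeros(len(X_eval)), np.ones(n_sim)]
        scores = -np.r_[clf.decision_function(X_eval).ravel(),
                        clf.decision_function(U).ravel()]
        fpr, tpr, _ = roc_curve(y_true, scores)
        mass, volume = 1. - fpr, (1. - tpr) * vol_C
        print('%s, gamma=%s: area under the MV curve = %.3f'
              % (name, gamma, auc(mass, volume)))
        plt.plot(mass, volume, label='gamma=%s (%s)' % (gamma, name))
plt.xlabel('mass')
plt.ylabel('volume')
plt.legend()
plt.show()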
In [32]:
from sklearn.datasets import fetch_mldata
dataset = fetch_mldata('shuttle')
X = dataset.data
y = dataset.target
### Usual setup when using this data set for anomaly detection
# Instances with label 4 are removed
# Normal data: label 1
ind = (y != 4)
X = X[ind, :]
y = y[ind]
y = (y == 1).astype(int)
In [33]:
n_samples, n_features = X.shape
n_samples, n_features
Out[33]:
Use StratifiedShuffleSplit to preserve the proportion of each class in the training and test sets.
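A minimal sketch; the 50/50 split proportion is a choice:
In [ ]:
from sklearn.model_selection import StratifiedShuffleSplit
# Stratified split preserving the proportion of normal data and anomalies
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=SEED)
train_idx, test_idx = next(iter(sss.split(X, y)))
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]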
In [40]:
from sklearn.datasets import fetch_kddcup99
dataset = fetch_kddcup99(subset='http', shuffle=True,
                         percent10=True, random_state=SEED)
X = dataset['data']
X = X.astype('float')
y = dataset['target']
y = (y == 'normal.').astype(int)
In [52]:
n_samples, n_features = X.shape
Now we only train the model on normal data. For instance let's train the model on half the normal data.
Split the data set into a set with the normal data and a set with the anomalies.
Randomly split the normal data set into two data sets: a training set and a test set.
Build a test set from the normal test set and the anomalies.
Train the OneClass SVM and Isolation Forest on the normal training set and compute the ROC curve on the test set.
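A possible sketch of these steps; the 50/50 split of the normal data is a choice, and OneClassSVM can be slow on a large training set (subsampling the training data is an option):
In [ ]:
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_curve, auc
rng = np.random.RandomState(SEED)
# Split into normal data and anomalies
X_normal, X_anomaly = X[y == 1], X[y == 0]
# Randomly split the normal data: half for training, half for testing
perm = rng.permutation(len(X_normal))
n_train = len(X_normal) // 2
X_train = X_normal[perm[:n_train]]
X_test_normal = X_normal[perm[n_train:]]
# Test set = remaining normal data + all anomalies
X_test = np.r_[X_test_normal, X_anomaly]
y_test = np.r_[np.ones(len(X_test_normal)), np.zeros(len(X_anomaly))]
plt.figure()
for clf in [OneClassSVM(), IsolationForest(random_state=SEED)]:
    clf.fit(X_train)
    scores = clf.decision_function(X_test).ravel()  # the larger, the more normal
    fpr, tpr, _ = roc_curve(y_test, scores)
    plt.plot(fpr, tpr, label='%s (AUC = %.3f)'
             % (clf.__class__.__name__, auc(fpr, tpr)))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()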
Show that a density level set $\{f \geq \tau \}$ is a Minimum volume set with mass $\alpha = \mathbb{P}(f(X) \geq \tau)$.
Show that if the density has no flat parts, the Minimum Volume set optimization problem has a solution.
Let $\lambda$ denote the Lebesgue measure of $\mathbb{R}^d$ and let $\mathcal{M}$ be the class of all measurable sets. A Minimum Volume set with mass at least $\alpha$ is the solution of the following optimization problem: $$ \min_{\Omega \in \mathcal{M}} \lambda(\Omega) \quad \text{ s.t } \quad P(\Omega) \geq \alpha . $$ We do not know the distribution $P$ but only have a sample $X_1, \dots, X_n$ drawn from this distribution.