```
In [1]:
```### First imports and default parameters
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
# Overwritting matplotlib default linestyle of negative contours
matplotlib.rcParams['contour.negative_linestyle'] = 'solid'
SEED = 42

```
In [2]:
```from utils import GaussianMixture
n_samples = 500
n_features = 2
weight_1 = 0.5
weight_2 = 0.5
mean_1 = np.zeros(n_features)
mean_2 = -1 * np.ones(n_features)
cov_1 = np.array([[2., 2.,], [2., 4.]])
cov_2 = 2 * np.identity(n_features)
weights = np.array([weight_1, weight_2])
means = np.array([mean_1, mean_2])
covars = np.array([cov_1, cov_2])
gm = GaussianMixture(weights, means, covars, random_state=SEED)
X = gm.sample(n_samples)

```
In [3]:
```X_range = np.zeros((n_features, 2))
X_range[:, 0] = np.min(X, axis=0)
X_range[:, 1] = np.max(X, axis=0)
h = 0.1 # step size of the mesh
x_min, x_max = X_range[0, 0] - 0.1, X_range[0, 1] + 0.1
y_min, y_max = X_range[1, 0] - 0.1, X_range[1, 1] + 0.1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
grid = np.c_[xx.ravel(), yy.ravel()]
Z = gm.density(grid)
Z = Z.reshape(xx.shape)
plt.figure()
plt.contour(xx, yy, Z, 10, cmap=plt.cm.Blues_r)
plt.scatter(X[:, 0], X[:, 1], s=1.)
plt.show()

```
```

The goal is to estimate a Minimum volume set with mass at least 0.95. We know that under regularity assumption of the underlying distribution a Minimum volume set is a density level set of the density $f$. $$ \{x, f(x) \geq F_f^{-1}(1-\alpha) \} $$ where $F_f^{-1}(1-\alpha)$ is the quantile of order $1-\alpha$ of the distribution of the random variable $f(X)$.

Draw a sample $\mathcal{S}$ from the previous Gaussian mixture of size $n = 1,000,000$.

Using the sample $\mathcal{S}$ and the true density $f$ given by

`gm.density`

compute the empirical quantile of order $1-\alpha$ for $\alpha = 0.95$ and $\alpha = 0.99$. You can use for instance the`scipy.stats.mstats.mquantiles`

function.Plot the corresponding Minimum Volume set in a figure like the previous one, using

`plt.contour`

and replacing the number of levels (10) by the threshold you computed:`levels=threshold`

.Emphasize the outliers (for instance use a different color) in the previous plot where an outlier is a sample outside the Minimum Volume set.

We are going to use the plug-in approach to estimate the Minimum Volume set with mass at least 0.95. The plug-in approach means that we replace the density $f$ by an estimate $\hat f$.

Equation of the corresponding Minimum Volume set estimate.

Using a Gaussian kernel, compute a kernel density estimator of $f$. To stick with

`sklearn`

you can use`sklearn.neighbors.kde.KernelDensity`

with the default bandwidth.Compute the empirical quantile $\widehat F_{\hat f}^{-1}(1-\alpha)$ from the original sample for $\alpha = 0.95$ (still using the function

`mquantiles`

).Add the corresponding Minimum Volume set estimate to the previous plot to visualize the difference between the true MV set and the estimated one.

We used the default bandwidth.

Use cross validation to learn the bandwidth from the data. You may use

`sklearn.model_selection.GridSearchCV`

to perform a 3-fold cross validation.Plot the Minimum Volume set estimate obtained with the learnt bandwidth.

The One-Class SVM separates the data from the origin by solving the following optimization problem.

\begin{equation} \min_{\boldsymbol{w},\boldsymbol{\xi},\rho} \quad \frac{1}{2}\Vert \boldsymbol{w} \Vert^2 - \rho + \frac{1}{\nu n}\sum_{i=1}^n\xi_i\\ \text{s.t.} \quad \langle \boldsymbol{w}, \Phi(x_i) \rangle \geqslant \rho - \xi_i, \quad \xi_i \geqslant 0 \quad \quad 1 \leqslant i \leqslant n \end{equation}where the parameter $\nu \in (0,1)$ has to be set by the user and controls the proportion of outliers.

It is in general solved via the dual given by \begin{equation} \min_{\boldsymbol{\alpha}} \quad \frac{1}{2}\sum_{i,j} k(x_i, x_j)\\ \text{s.t.} \quad 0 \leqslant \alpha_i \leqslant \frac{1}{\nu n},\quad \sum_{i}\alpha_i = 1 \quad \quad 1 \leqslant i \leqslant n \end{equation}

- Show that when using a Gaussian kernel, all data in the feature space lies on the same hypersphere. Show that data are always linearly separable from the origin. Hint: what's the formula of $k(x, x')$ in terms of the feature map $\Phi$.

Under mild assumptions the following inequality holds true for all sample size $n$:

$$ \frac{\text{Outliers}}{n} \leq \nu \leq \frac{\text{Support Vectors}}{n} $$.

Futhermore the two fractions on the left-hand side and right-hand converge almost surely towards $\nu$.

If we want to estimate a Minimum Volume set with mass at least $\alpha$, how can we set $\nu$?

Use

`sklearn.svm.OneClassSVM`

with the default parameters (except the parameter $\nu$ set according to the previous question) to learn a Minimum Volume set with mass at least $0.95$.Plot the Minimum Volume set estimate.

What do you notice? The kernel used by the One-Class SVM is given by $$ k(x, x') = \exp(-\gamma\Vert x - x' \Vert^2). $$

What is the default gamma used when we trained the One-Class SVM? Do we need to increase or decrease its value?

Check that we indeed have $$ \frac{\text{Outliers}}{n} \leq \nu \leq \frac{\text{Support Vectors}}{n} $$.

Use

`gamma=10`

. Is the inequality satisfied? Can you guess why?

Only the support vectors are involved in the decision function of the One-Class SVM.

- Plot the level sets of the One-Class SVM decision function as we did for the true density.
- Emphasize the Support vectors.

Concerning the support vectors, what is the main difference between the OneClass SVM and a Kernel Density estimator when using a Gaussian Kernel?

When is it particularly advantageous to use the OneClass SVM?

- Set $\nu = 0.4$. How can we estimate a Minimum Volume set with mass at least $0.95$. Hint: Use the
`decision_function`

.

\begin{equation}
\min_{R, \boldsymbol{\xi},\boldsymbol{c}} \quad R^2 + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i\\
\text{s.t.} \quad \Vert \boldsymbol{x}_i - \boldsymbol{c}\Vert^2 \leqslant R^2 + \xi_i, \quad \xi_i \geqslant 0 \quad \quad 1\leqslant i \leqslant n\\
\end{equation}

It is in general solved via the dual given by \begin{equation} \min_{\boldsymbol{\alpha}} \quad \sum_{i,j} k(x_i, x_j) + \sum_{i}\alpha_i k(x_i,x_i)\\ \text{s.t.} \quad 0 \leqslant \alpha_i \leqslant \frac{1}{\nu n},\quad \sum_{i}\alpha_i = 1 \quad \quad 1 \leqslant i \leqslant n \end{equation}

- When are SVDD and the One-Class SVM equivalent?

Run Isolation with

`n_estimators=1`

, i.e., only one tree will be built.Plot the contour of the set with the $95\%$ most normal observations. Hint: use the

`decision_function`

and the`contamination`

parameter. Compare it with the true Minimum Volume set with mass at least $0.95$.Increase

`n_estimators`

to 50, 100 and then 500.What does the

`max_sample`

parameter change?

```
In [1]:
```from sklearn import datasets
digits = datasets.load_digits()

The digits data set consits in images (8 x 8) of digits.

```
In [2]:
```images = digits.images
labels = digits.target
images.shape

```
Out[2]:
```

```
In [ ]:
```### Showing some images
i = 0
plt.figure()
plt.title('{0}'.format(labels[i]))
plt.axis('off')
plt.imshow(images[i], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

To use the images as a training set we need to flatten the images.

```
In [4]:
```n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
data.shape

```
Out[4]:
```

We focus on the digit '5'.

```
In [5]:
```X = data
y = digits.target
X_5 = X[y == 5]

- Use IsolationForest to find the top 5% most abnormal images.
- Plot them using the code we used to plot some images of the data set

k-NN based anomaly detection algorithm. Available on the dev version of Scikit-Learn.

Let $P$ be a distribution with finite support $C$. Let $\mu$ be the uniform distribution over the support $C$: $\mu(A) = \lambda(A)/\lambda(C)$. We now consider the joint distribution $Q$ of $(X, Y) \in \mathbb{R}^d \times \{-1, +1\}$ given by the class conditionals distributions:

$$ X \vert Y= +1 \sim P \quad \text{and} \quad X \vert Y=-1 \sim \mu $$and a priori class probabilities $p = \mathbb{P}(Y=1)$.

We are as before interested in finding a given density level set $\{ f \geq \tau \}$ where $f$ is the density of $P$ with respect to the Lebesgue measure $\lambda$.

The marginal distribution of $X$, $P_X$, is given by $$ P_X = p P + (1-p) Q. $$

Assuming that $P$ has a density $f$. It can be shown that

$$\eta(x) = P(Y=1\vert X=x) = \frac{pf(x)}{pf(x) + (1-p)/\lambda(C)}$$The optimal classification set is given by $\{ \eta \geq 1/2 \}$

- Show that if $p=1/(1 + \lambda(C)\tau)$ then $\lambda(\{\eta \geq 1/2 \} \Delta \{f \geq \tau \} ) = 0$

Thus solving a binary classification problem assigning weights $p$ and $1-p$ to the samples can lead to a density level set.

Usually unsupervised anomaly detection algorithms returns a scoring function $s$ such that the smaller $s(x)$ the more abnormal $x$ is. To assess the performance of such a scoring function we can use the Mass Volume curve. The MV curve of a scoring function $s$ is given for all $\alpha \in (0,1)$ by $$ MV_s(\alpha) = \lambda\{x, s(x) \geq F_s^{-1}(1-\alpha) \} $$ where $F_s^{-1}(1-\alpha)$ is the quantile of order $1-\alpha$ of the distribution of the random variable $s(X)$.

Under regularity assumptions, the MV curve is also given by the parametric curve

$$ (\mathbb{P}(s(X) \geq t), \lambda(x, s(x) \geq t)) $$Assume that the random variable $X$ with distribution $P$ has a density $f$ with respect to the Lebesgue measure. Furthermore assume that for all $\alpha \in (0,1)$ the Minimum Volume set optimization problem $$ \min_{\Omega \in \mathcal{M}} \lambda(\Omega) \quad \text{ s.t } \quad P(\Omega) \geq \alpha . $$ has a unique solution given by $$ \Omega^*_{\alpha} = \{x, f(x) \geq F_f^{-1}(1-\alpha) \}. $$

Show that for all $\alpha \in (0,1)$, $MV_s(\alpha) - MV_f(\alpha) \leq \lambda(\Omega^*_{\alpha} \Delta \{x, s(x) \geq F_s^{-1}(1-\alpha) \})$.

Show that for all $\alpha \in (0,1)$, $MV_f(\alpha) \leq MV_s(\alpha)$. Hint: use the following property: for all $\alpha \in (0,1)$, $\mathbb{P}(s(X) < F_s^{-1}(1-\alpha)) \leq 1-\alpha$.

In what sense a scoring function minimizing the area under the Mass Volume curve is optimal?

The lower $MV_s$ the better the scoring function $s$. The goal is to compare the performance of scoring functions given by the One-Class SVM for different values of $\gamma$ using the Mass Volume curve. To easily plot the Mass Volume we are going to use the ROC curve function given by `sklearn`

. The idea is to use a uniform sample as the positive class and to consider the Gaussian mixture data set as the negative class. To plot the MV curve we want to plot the volume against the mass $\alpha$. This will be given by $1-TPR$ against $1-FPR$.

Using

`sklearn.model_selection.ShuffleSplit`

split the data set into a training set and a test set.Generate

`n_sim=100000`

uniform random variables in the hypercube enclosing the data using`numpy.random.uniform`

.Build a training set consisting in the concatenation of the training set of the Gaussian mixture data set (obtained in 1.) and the uniform sample.

Similarly, build the test set by concatenating the test set of the Gaussian mixture data set (obtained in 1.) and the uniform sample.

Compute the Mass Volume curve, estimated on the training set, of the scoring function obtained with the One-Class SVM for $\gamma=0.05$ and $\gamma=10$. Compute the area under the Mass Volume curve for each value of $\gamma$.

Compute the Mass Volume curve estimated on the test set for each value of $\gamma$. Compute the area under the Mass Volume curve for each value of $\gamma$.

```
In [32]:
```from sklearn.datasets import fetch_mldata
dataset = fetch_mldata('shuttle')
X = dataset.data
y = dataset.target
### Usual setup when using this data set for anomaly detection
# Instances with label 4 are removed
# Normal data: label 1
ind = (y != 4)
X = X[ind, :]
y = y[ind]
y = (y == 1).astype(int)

```
In [33]:
```n_samples, n_features = X.shape
n_samples, n_features

```
Out[33]:
```

- What's the proportion of anomalies in the data set (label 0)?
- Split the data set into a training a test set using
`StratifiedShuffleSplit`

to preserve the proportion of each class in the training and test sets. - Compare the performance of the OneClass SVM and Isolation Forest using the ROC curve (computed with the true labels) on the train set and the set set:
- train the algorithm on the training set (without the labels), compute the ROC curve on the training set using the labels.
- train the algorithm on the training set (without the labels), compute the ROC curve on the test set using the labels

- Compare the performance of the OneClass SVM and Isolation Forest using the Mass Volume curve on the training and test sets

- Same questions with the http data set

```
In [40]:
```from sklearn.datasets import fetch_kddcup99
dataset = fetch_kddcup99(subset='http', shuffle=True,
percent10=True, random_state=SEED)
X = dataset['data']
X = X.astype('float')
y = dataset['target']
y = (y == 'normal.').astype(int)

```
In [52]:
```n_samples, n_features = X.shape

Now we only train the model on normal data. For instance let's train the model on half the normal data.

Split the data set into a set with the normal data and a set with the anomalies.

Randomly split the normal data set into a two data sets: a training set and a test set.

Build a test set from the normal test set and the anomalies.

Train the OneClass SVM and Isolation Forest on the normal training set and compute the ROC curve on the test set.

- Plot the MV curve estimated on the test set for the OneClass SVM and Isolation Forest.

Show that a density level set $\{f \geq \tau \}$ is a Minimum volume set with mass $\alpha = \mathbb{P}(f(X) \geq \tau)$.

Show that if the density has not flat parts, the minimum volume set optimization problem has a solution.

Let $\lambda$ denotes the Lebesgue measure of $\mathbb{R}^d$ and let $\mathcal{M}$ be the class of all measurable sets. A Minimum Volume set with mass at least $\alpha$ is the solution of the following optimization problem: $$ \min_{\Omega \in \mathcal{M}} \lambda(\Omega) \quad \text{ s.t } \quad P(\Omega) \geq \alpha . $$ We do not know the distribution $P$ but only have a sample $X_1, \dots, X_n$ drawn from this distribution.

- What is the solution for all $\alpha \in (0,1)$ of the empirical optimization problem if we solve the following empirical version of the Minimum Volume set optimization problem: $$ \min_{\Omega \in \mathcal{M}} \lambda(\Omega) \quad \text{ s.t } \quad \widehat P(\Omega) \geq \alpha $$ where for all $\Omega$, $\widehat P(\Omega) = \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{X_i \in \Omega \}$.