Let us first explore an example that falls under novelty detection. Here, we train a model on data drawn from some distribution, with no outliers. The test data then contains a "novel" subset of points that does not follow that distribution.


In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
%matplotlib inline

Use the np.random module to generate a normal distribution of 1,000 data points in two dimensions (e.g. x, y) - choose whatever mean and sigma^2 you like. Generate another 1,000 data points with a normal distribution in two dimensions that are well separated from the first set. You now have two "clusters". Concatenate them so you have 2,000 data points in two dimensions. This will be the training set.


In [2]:
# Cluster 1: standard normal at the origin; cluster 2: sigma = 2, centered at (8, 8).
X_train_normal = np.concatenate((np.random.randn(1000, 2),
                                 2 * np.random.randn(1000, 2) + 8.))

Plot the points.


In [3]:
plt.scatter(X_train_normal[:,0],X_train_normal[:,1])


Out[3]:
<matplotlib.collections.PathCollection at 0x10f11ae50>

Generate 100 data points with the same distribution as your first random normal 2-d set, and 100 data points from a broader, shifted version of your second set (here: sigma of 3, centered at (10, 10)), so that some of them are genuinely novel. This will be the test set, labeled X_test_normal.


In [4]:
# The first 100 points match training cluster 1; the last 100 are broader (sigma = 3)
# and shifted (centered at (10, 10)) relative to training cluster 2.
X_test_normal = np.concatenate((np.random.randn(100, 2),
                                3 * np.random.randn(100, 2) + 10.))

Generate 100 data points with a random uniform distribution. This will be the test set labeled X_test_uniform.


In [5]:
X_test_uniform = np.random.rand(100, 2)   # uniform on [0, 1) x [0, 1), which overlaps training cluster 1

Define a model classifier with svm.OneClassSVM.


In [6]:
model = svm.OneClassSVM()   # defaults include kernel='rbf' and nu=0.5 (see Out[7])

Fit the model to the training data.


In [7]:
model.fit(X_train_normal)


Out[7]:
OneClassSVM(cache_size=200, coef0=0.0, degree=3, gamma='auto', kernel='rbf',
      max_iter=-1, nu=0.5, random_state=None, shrinking=True, tol=0.001,
      verbose=False)

Use the trained model to predict whether the X_test_normal data points come from the training distribution. Calculate the fraction of "false" predictions.


In [8]:
model.predict(X_test_normal)


Out[8]:
array([ 1., -1.,  1.,  1., -1.,  1.,  1.,  1., -1., -1.,  1., -1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1., -1.,  1.,  1., -1.,  1.,
        1.,  1.,  1., -1., -1.,  1.,  1.,  1.,  1.,  1.,  1., -1.,  1.,
        1.,  1.,  1.,  1., -1., -1., -1., -1.,  1.,  1., -1.,  1.,  1.,
        1.,  1., -1., -1.,  1.,  1.,  1.,  1.,  1., -1.,  1.,  1.,  1.,
        1., -1.,  1., -1.,  1.,  1., -1., -1.,  1.,  1.,  1.,  1.,  1.,
       -1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1., -1.,  1.,
        1.,  1., -1.,  1., -1., -1.,  1.,  1.,  1.,  1.,  1.,  1., -1.,
       -1., -1., -1., -1., -1., -1., -1.,  1., -1., -1., -1., -1., -1.,
       -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,  1., -1.,
       -1., -1.,  1., -1., -1.,  1., -1., -1., -1., -1., -1., -1., -1.,
       -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
       -1., -1., -1.,  1., -1.,  1., -1., -1., -1., -1., -1., -1., -1.,
       -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
       -1., -1., -1., -1.,  1., -1., -1., -1., -1., -1., -1., -1., -1.,
       -1., -1., -1., -1., -1.])
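
Out[8] only shows the raw labels; the requested fraction of "false" (-1) predictions still needs computing. A minimal sketch (the names pred_normal and frac_false_normal are ours):

pred_normal = model.predict(X_test_normal)
frac_false_normal = np.mean(pred_normal == -1)   # -1 marks points judged outside the training distribution
print(frac_false_normal)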

Use the trained model to predict whether the X_test_uniform points come from the training distribution. Calculate the fraction of "false" predictions.


In [9]:
from collections import Counter

counts = Counter(model.predict(X_test_uniform))
frac_false_uniform = counts[-1] / float(len(X_test_uniform))   # count of -1 predictions over the total
print(frac_false_uniform)


0.07

Use the trained model to see how well it recovers the training data. (Predict on the training data, and calculate the fraction of "false" predictions.) Since nu is an upper bound on the fraction of training errors, the default nu=0.5 should leave roughly half of the training points outside the learned frontier.


In [10]:
counts_train = Counter(model.predict(X_train_normal))
frac_false_train = counts_train[-1] / float(len(X_train_normal))
print(frac_false_train)


0.5005

Create another instance of the model classifier, but change the kwarg value for nu. Hint: Use help to figure out what the kwargs are.


In [11]:
model2 = svm.OneClassSVM(nu=0.2)
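
Per the hint, the available kwargs can be inspected interactively. Note that nu is an upper bound on the fraction of training errors (and a lower bound on the fraction of support vectors), so nu=0.2 should flag roughly 20% of the training points:

help(svm.OneClassSVM)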

Redo the prediction on the training set, the prediction on X_test_normal, and the prediction on X_test_uniform.


In [12]:
model2.fit(X_train_normal)
L3 = model2.predict(X_train_normal)
Counter(L3)                      # computed but not echoed: only a cell's last expression is displayed
model2.predict(X_test_normal)    # likewise computed and discarded
model2.predict(X_test_uniform)   # this is what appears in Out[12]


Out[12]:
array([ 1.,  1.,  1.,  1., -1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1., -1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
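
Only the X_test_uniform predictions are echoed above. A short sketch (the loop and labels are ours) prints the outlier fraction for each set under the nu=0.2 model:

for name, data in [("train", X_train_normal),
                   ("test_normal", X_test_normal),
                   ("test_uniform", X_test_uniform)]:
    print("%s: %.3f" % (name, np.mean(model2.predict(data) == -1)))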

Plot as scatter points the X_train_normal data in blue, X_test_normal in red, and X_test_uniform in black. Overplot the decision-function boundary (the level-0 contour) for the first instance of the model classifier.


In [13]:
plt.scatter(X_train_normal[:,0],X_train_normal[:,1],color='b')
plt.scatter(X_test_normal[:,0],X_test_normal[:,1],color='r')
plt.scatter(X_test_uniform[:,0],X_test_uniform[:,1],color='k')

xx1, yy1 = np.meshgrid(np.linspace(-5, 22, 1000), np.linspace(-5, 22, 1000))
Z1 = model.decision_function(np.c_[xx1.ravel(), yy1.ravel()])
Z1 = Z1.reshape(xx1.shape)
plt.contour(xx1, yy1, Z1, levels=[0], linewidths=2)   # level 0 is the learned frontier


Out[13]:
<matplotlib.contour.QuadContourSet at 0x10f2e05d0>

Do the same for the second instance of the model classifier.


In [14]:
plt.scatter(X_train_normal[:,0],X_train_normal[:,1],color='b')
plt.scatter(X_test_normal[:,0],X_test_normal[:,1],color='r')
plt.scatter(X_test_uniform[:,0],X_test_uniform[:,1],color='k')

xx1, yy1 = np.meshgrid(np.linspace(-5, 22, 1000), np.linspace(-5, 22, 1000))
Z1 = model2.decision_function(np.c_[xx1.ravel(), yy1.ravel()])
Z1 = Z1.reshape(xx1.shape)
plt.contour(xx1, yy1, Z1, levels=[0], linewidths=2)


Out[14]:
<matplotlib.contour.QuadContourSet at 0x10f0a3a50>

For this second example, we will explore what is known as "outlier" detection, where you do not have a clean training set, but must instead work out which points in the data are outliers. Note that the difference between novelty detection and outlier detection is analogous to the difference between supervised and unsupervised classification. We will use tools from sklearn.covariance to illustrate this.

In [15]:
from sklearn.covariance import EllipticEnvelope

Test how well EllipticEnvelope predicts the outliers when you concatenate the training data with the X_test_uniform data.


In [16]:
train_uniform = np.concatenate((X_train_normal, X_test_uniform))
envelope = EllipticEnvelope()     # robust covariance fit; default contamination is 0.1
envelope.fit(train_uniform)
envelope.predict(train_uniform)   # 1 = inlier, -1 = outlier


Out[16]:
array([1, 1, 1, ..., 1, 1, 1])
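
The truncated array hides the result of interest. A sketch (the slicing assumes the 100 uniform points are the last rows of train_uniform, as constructed above) quantifies how the envelope treats each subset:

pred = envelope.predict(train_uniform)
print(np.mean(pred[:2000] == -1))   # fraction of training points flagged as outliers
print(np.mean(pred[2000:] == -1))   # fraction of uniform points flagged as outliers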

Compute and plot the Mahalanobis distances of X_test_uniform, X_train_normal, and X_test_normal.


In [17]:
plt.scatter(range(2000), envelope.mahalanobis(X_train_normal), color='b')
plt.scatter(range(200), envelope.mahalanobis(X_test_normal), color='r')
plt.scatter(range(100), envelope.mahalanobis(X_test_uniform), color='k')   # drawn last so the red points no longer cover it


Out[17]:
<matplotlib.collections.PathCollection at 0x1157dd6d0>
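
As a numerical companion to the plot, a sketch (the median is just one convenient summary) comparing typical values of the squared Mahalanobis distances that sklearn's mahalanobis method returns:

print(np.median(envelope.mahalanobis(X_train_normal)))
print(np.median(envelope.mahalanobis(X_test_normal)))
print(np.median(envelope.mahalanobis(X_test_uniform)))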