Let us first explore an example that falls under novelty detection. Here, we train a model on data drawn from some distribution, with no outliers. The test data contain a "novel" subset of points that does not follow that distribution.
In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
%matplotlib inline
Use the np.random module to generate a normal distribution of 1,000 data points in two dimensions (e.g., x and y); choose whatever mean and sigma^2 you like. Generate another 1,000 data points from a two-dimensional normal distribution that is well separated from the first set. You now have two "clusters". Concatenate them so you have 2,000 data points in two dimensions. This will be the training set.
In [2]:
X_train_normal = np.concatenate((np.random.randn(1000, 2),            # cluster 1: mean (0, 0), sigma = 1
                                 2*np.random.randn(1000, 2) + 8.))    # cluster 2: mean (8, 8), sigma = 2
Plot the points.
In [3]:
plt.scatter(X_train_normal[:,0],X_train_normal[:,1])
Generate 100 data points with the same distribution as your first random normal 2-d set, and 100 data points with the same distribution as your second random normal 2-d set. This will be the test set labeled X_test_normal.
In [4]:
# Same two distributions as the training clusters, 100 points each.
X_test_normal = np.concatenate((np.random.randn(100, 2),
                                2*np.random.randn(100, 2) + 8.))
Generate 100 data points with a random uniform distribution. This will be the test set labeled X_test_uniform.
In [5]:
X_test_uniform = np.random.rand(100, 2)  # uniform on [0, 1) in each dimension
Define a model classifier with svm.OneClassSVM.
In [6]:
model = svm.OneClassSVM()
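Before fitting, it can be useful to inspect the defaults you are accepting. Any scikit-learn estimator exposes get_params; for OneClassSVM the defaults include kernel='rbf' and nu=0.5 (other defaults, such as gamma, vary by scikit-learn version).
print(model.get_params())  # e.g. kernel='rbf', nu=0.5, plus version-dependent defaults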
Fit the model to the training data.
In [7]:
model.fit(X_train_normal)
Use the trained model to predict whether the X_test_normal data points are drawn from the same distribution. Calculate the fraction of "false" (-1) predictions.
In [8]:
pred_normal = model.predict(X_test_normal)
print(np.mean(pred_normal == -1))  # fraction of "false" (-1) predictions
Use the trained model to predict whether X_test_uniform is in the same distribution. Calculate the fraction of "false" predictions.
In [9]:
from collections import Counter

L1 = model.predict(X_test_uniform)
print(Counter(L1))        # counts of +1 (inlier) and -1 (outlier) predictions
print(np.mean(L1 == -1))  # fraction of "false" (-1) predictions
Use the trained model to see how well it recovers the training data. (Predict on the training data, and calculate the fraction of "false" predictions.)
In [10]:
L2 = model.predict(X_train_normal)
print(np.mean(L2 == -1))  # fraction of training points flagged as -1
Create another instance of the model classifier, but change the value of the keyword argument nu. (nu is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors; use help(svm.OneClassSVM) to see the available keyword arguments.)
In [11]:
model2 = svm.OneClassSVM(nu=.2)
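Before redoing the exercise, a quick sweep shows what nu controls (a minimal sketch, not part of the original notebook): the fraction of training points flagged as outliers should roughly track nu.
for nu in (0.05, 0.2, 0.5):
    clf = svm.OneClassSVM(nu=nu).fit(X_train_normal)
    print(nu, np.mean(clf.predict(X_train_normal) == -1))  # flagged fraction vs. nu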
Redo the prediction on the training set, on X_test_normal, and on X_test_uniform.
In [12]:
model2.fit(X_train_normal)
L3 = model2.predict(X_train_normal)
print(Counter(L3))                                    # training-set counts
# Only the last bare expression displays in a notebook, so print each fraction.
print(np.mean(model2.predict(X_test_normal) == -1))   # fraction flagged on X_test_normal
print(np.mean(model2.predict(X_test_uniform) == -1))  # fraction flagged on X_test_uniform
Plot scatter points with X_train_normal in blue, X_test_normal in red, and X_test_uniform in black. Overplot the decision-function boundary of the first instance of the model classifier.
In [13]:
plt.scatter(X_train_normal[:,0], X_train_normal[:,1], color='b')
plt.scatter(X_test_normal[:,0], X_test_normal[:,1], color='r')
plt.scatter(X_test_uniform[:,0], X_test_uniform[:,1], color='k')
# Evaluate the decision function on a grid; its zero level set is the learned frontier.
xx1, yy1 = np.meshgrid(np.linspace(-5, 22, 1000), np.linspace(-5, 22, 1000))
Z1 = model.decision_function(np.c_[xx1.ravel(), yy1.ravel()])
Z1 = Z1.reshape(xx1.shape)
plt.contour(xx1, yy1, Z1, levels=[0], linewidths=2)
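As a quick aside (our sanity check, not part of the original notebook): decision_function is positive inside the learned frontier and negative outside, matching the +1/-1 labels from predict.
# A point near the first cluster's center should score positive (inlier),
# a far-away point negative (outlier).
print(model.decision_function([[0, 0], [20, -5]]))
print(model.predict([[0, 0], [20, -5]]))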
Do the same for the second instance of the model classifier.
In [14]:
plt.scatter(X_train_normal[:,0], X_train_normal[:,1], color='b')
plt.scatter(X_test_normal[:,0], X_test_normal[:,1], color='r')
plt.scatter(X_test_uniform[:,0], X_test_uniform[:,1], color='k')
xx1, yy1 = np.meshgrid(np.linspace(-5, 22, 1000), np.linspace(-5, 22, 1000))
Z1 = model2.decision_function(np.c_[xx1.ravel(), yy1.ravel()])
Z1 = Z1.reshape(xx1.shape)
plt.contour(xx1, yy1, Z1, levels=[0], linewidths=2)
Now let us turn to outlier detection, where the data used for fitting may itself contain outliers. EllipticEnvelope fits a robust covariance estimate and flags points whose Mahalanobis distance from the fitted center is large; it assumes roughly Gaussian inliers, which our bimodal training set only loosely satisfies.
In [15]:
from sklearn.covariance import EllipticEnvelope
Test how well EllipticEnvelope flags the outliers when you concatenate the training data with the X_test_uniform data.
In [16]:
train_uniform = np.concatenate((X_train_normal, X_test_uniform))
envelope = EllipticEnvelope()
envelope.fit(train_uniform)
pred_all = envelope.predict(train_uniform)
print(np.mean(pred_all == -1))         # overall fraction flagged as outliers
print(np.mean(pred_all[-100:] == -1))  # fraction of the appended uniform points flagged
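One caveat (our note, not part of the original exercise): EllipticEnvelope's default contamination of 0.1 assumes roughly 10% outliers, while the 100 uniform points are only about 4.8% of the 2,100 combined points. Matching contamination to that rate may give a fairer test:
# Illustrative variant with contamination matched to the true outlier fraction.
envelope_matched = EllipticEnvelope(contamination=100/2100.)
envelope_matched.fit(train_uniform)
print(np.mean(envelope_matched.predict(train_uniform) == -1))  # should be close to 100/2100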
Compute and plot the Mahalanobis distances of X_test_normal, X_train_normal, and X_test_uniform.
In [17]:
plt.scatter(range(2000), envelope.mahalanobis(X_train_normal), color='b')
plt.scatter(range(200), envelope.mahalanobis(X_test_normal), color='r')
# Plot the 100 uniform points last so the overlapping red points do not hide them.
plt.scatter(range(100), envelope.mahalanobis(X_test_uniform), color='k')
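As a rough check (an illustrative aside, assuming approximately Gaussian data, which our bimodal training set only loosely satisfies): the values returned by scikit-learn's mahalanobis are squared distances, and for 2-D Gaussian data squared Mahalanobis distances follow a chi-squared distribution with 2 degrees of freedom, so a quantile of that distribution makes a simple outlier threshold.
from scipy.stats import chi2

threshold = chi2.ppf(0.975, df=2)  # 97.5th percentile of chi2 with 2 dof
print(np.mean(envelope.mahalanobis(X_test_uniform) > threshold))  # fraction flagged as outliers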