In this example, we extend the multivariate example to the use of classes. ISC also makes it possible to compute the anomaly score for different classes, so that apples are compared to apples and not to oranges. In addition, it is also possible to use the anomaly detector to classify unknown examples.
In [2]:
import pyisc
import numpy as np
from scipy.stats import poisson, norm, multivariate_normal
%matplotlib inline
from pylab import plot, figure
In [16]:
n_classes = 3
normal_len = 10000
anomaly_len = 15
data = None
for i in range(n_classes):
    po_normal = poisson(10+i)
    po_normal2 = poisson(2+i)
    gs_normal = norm(1+i, 12)
    tmp = np.column_stack(
        [
            [1] * normal_len,
            po_normal.rvs(normal_len),
            po_normal2.rvs(normal_len),
            gs_normal.rvs(normal_len),
            [i] * normal_len,
        ]
    )
    if data is None:
        data = tmp
    else:
        data = np.r_[data, tmp]
# Add anomalies
for i in range(n_classes):
po_anomaly = poisson(25+i)
po_anomaly2 = poisson(3+i)
gs_anomaly = norm(2+i,30)
tmp = np.column_stack(
[
[1] * (anomaly_len),
list(po_anomaly.rvs(anomaly_len)),
list(po_anomaly2.rvs(anomaly_len)),
list(gs_anomaly.rvs(anomaly_len)),
[i] * (anomaly_len),
]
)
if data is None:
data = tmp
else:
data = np.r_[data,tmp]
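Each row now holds five columns: the number of days over which the frequencies were counted (always 1 here), the two frequencies, the Gaussian measure, and the class label. As a quick sanity check of the expected size (3 classes × 10000 normal rows plus 3 × 15 anomalous rows):
In [ ]:
print(data.shape)  # expected: (30045, 5)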
Create an anomaly detector, where the first argument lists the statistical models to use for each component. Given that we now have more than one variable, it is also necessary to provide a method for combining the outputs of the component models, which in this case is the maximum anomaly score over the component models:
In [17]:
anomaly_detector = pyisc.AnomalyDetector(
component_models=[
pyisc.P_PoissonOnesided(1,0), # columns 1 and 0
pyisc.P_Poisson(2,0), # columns 2 and 0
pyisc.P_Gaussian(3) # column 3
],
output_combination_rule=pyisc.cr_max
)
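To illustrate what the max combination rule computes, here is a minimal numpy sketch (not pyISC's internal implementation) using made-up per-component scores: the combined score of an instance is simply the maximum over its component scores.
In [ ]:
# Hypothetical anomaly scores from the three component models
# (columns: Freq1 model, Freq2 model, Measure model)
component_scores = np.array([[0.5, 1.2, 0.3],
                             [4.1, 0.9, 2.2],
                             [0.2, 0.4, 7.5]])
print(component_scores.max(axis=1))  # row-wise maximum: [ 1.2  4.1  7.5]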
Train the anomaly detector
In [18]:
anomaly_detector.fit(data, y=4); # y is the class column or an array with classes
Compute the anomaly scores for each data point
In [6]:
scores = anomaly_detector.anomaly_score(data, y=4)
In [7]:
from pandas import DataFrame
df = DataFrame(data[:15], columns=['#Days', 'Freq1', 'Freq2', 'Measure', 'Class'])
df['Anomaly Score'] = scores[:15]
print(df.to_string())
The anomalous values vs. anomaly scores for the 45 anomalous data points (15 per class):
In [8]:
df = DataFrame(data[-45:], columns=['#Days', 'Freq1', 'Freq2', 'Measure', 'Class'])
df['Anomaly Score'] = scores[-45:]
print(df.to_string())
As can be seen above, the anomalous data points also receive higher anomaly scores than the normal ones, as they should.
This becomes even more visible if we plot the anomaly scores (y-axis) against each data point (x-axis):
In [9]:
plot(scores, '.');
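To make the injected anomalies stand out, the last 45 points can be drawn separately; a small sketch using the same pylab interface as above:
In [ ]:
figure()
plot(scores, '.')  # all data points
plot(range(len(scores) - 45, len(scores)), scores[-45:], 'r.')  # the 45 injected anomalies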
We can also look at the details of each column in terms of their individual anomaly scores:
In [10]:
score_details = anomaly_detector.anomaly_score_details(data,y=4)
In [11]:
df = DataFrame(data[-45:], columns=['#Days', 'Freq1', 'Freq2', 'Measure', 'Class'])
df['Anomaly:Freq1'] = [detail[2][0] for detail in score_details[-45:]] # Anomaly Score of Freq1
df['Anomaly:Freq2'] = [detail[2][1] for detail in score_details[-45:]] # Anomaly Score of Freq2
df['Anomaly:Measure'] = [detail[2][2] for detail in score_details[-45:]] # Anomaly Score of Measure
df['Anomaly Score'] = [detail[0] for detail in score_details[-45:]] # Combined Anomaly Score
df
Out[11]:
Above, the last column contains the same anomaly score as before. We can see that it corresponds to the maximum of the individual anomaly scores to its left; thus, it is the result of the combination rule specified when creating the anomaly detector.
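A quick sanity check of this relationship (assuming the detail tuples are indexed as above, with the combined score at position 0 and the list of component scores at position 2):
In [ ]:
combined = np.array([detail[0] for detail in score_details])
per_component = np.array([detail[2] for detail in score_details])
print(np.allclose(combined, per_component.max(axis=1)))  # expected: True
Next, we create a new data set, similar to the normal data above but with unknown class labels: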
In [19]:
data2 = None
true_classes = []
length = 1000
for i in range(n_classes):
    po_normal = poisson(10+i)
    po_normal2 = poisson(2+i)
    gs_normal = norm(1+i, 12)
    tmp = np.column_stack(
        [
            [1] * length,
            po_normal.rvs(length),
            po_normal2.rvs(length),
            gs_normal.rvs(length),
            [None] * length,  # the class is unknown
        ]
    )
    true_classes += [i] * length
    if data2 is None:
        data2 = tmp
    else:
        data2 = np.r_[data2, tmp]
Then, we can also use the anomaly detector as a classifier to predict the class of each instance:
In [20]:
from pandas import DataFrame
from sklearn.metrics import accuracy_score
result = DataFrame(columns=['Algorithm','Accuracy'])
clf = pyisc.SklearnClassifier.clf(anomaly_detector)
predicted_classes = clf.predict(data2)
acc = accuracy_score(true_classes, predicted_classes)
result.loc[0] = ['pyISC classifier', acc]
We can also compare it to some available classifiers in Scikit-learn (http://scikit-learn.org/):
In [21]:
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
X = data[:, :-1]  # all columns except the class column
y = data[:, -1]   # the class column
count = 1
for name, clf in zip(['GaussianNB',
                      'KNeighborsClassifier',
                      'RandomForestClassifier'],
                     [GaussianNB(),
                      KNeighborsClassifier(n_neighbors=1000, weights='distance'),
                      RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1)]):
    clf.fit(X, y)
    predicted_classes_SK = clf.predict(data2[:, :-1])
    acc = accuracy_score(true_classes, predicted_classes_SK)
    result.loc[count] = [name, acc]
    count += 1
result
Out[21]:
The pyISC classifier performs best, most likely because its component models are closest to the true distributions used to generate the data.