In this example, we extend the multivariate example to the use of classes. ISC also makes it possible to compute the anomaly score for different classes, so that apples are compared to apples and not to oranges. In addition, it is also possible to use the anomaly detector to classify unknown examples.
In [2]:
import pyisc
import numpy as np
from scipy.stats import poisson, norm, multivariate_normal
%matplotlib inline
from pylab import plot, figure
In [16]:
n_classes = 3
normal_len = 10000
anomaly_len = 15
data = None
for i in range(n_classes):
    po_normal = poisson(10+i)
    po_normal2 = poisson(2+i)
    gs_normal = norm(1+i, 12)
    tmp = np.column_stack(
        [
            [1] * normal_len,
            po_normal.rvs(normal_len),
            po_normal2.rvs(normal_len),
            gs_normal.rvs(normal_len),
            [i] * normal_len,
        ]
    )
    if data is None:
        data = tmp
    else:
        data = np.r_[data, tmp]
# Add anomalies
for i in range(n_classes):
po_anomaly = poisson(25+i)
po_anomaly2 = poisson(3+i)
gs_anomaly = norm(2+i,30)
tmp = np.column_stack(
[
[1] * (anomaly_len),
list(po_anomaly.rvs(anomaly_len)),
list(po_anomaly2.rvs(anomaly_len)),
list(gs_anomaly.rvs(anomaly_len)),
[i] * (anomaly_len),
]
)
if data is None:
data = tmp
else:
data = np.r_[data,tmp]
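Each row now holds five columns: the number of days over which the frequencies were counted (always 1 here), the two frequencies, the Gaussian measure, and the class label. As a quick sanity check of the expected size (3 classes × 10000 normal rows plus 3 × 15 anomalous rows):
In [ ]:
print(data.shape)  # expected: (30045, 5)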
Create an anomaly detector, where the first argument lists the statistical models to use for each component. Given that we now have more than one variable, it is also necessary to provide a method for combining the outputs of the component models, which in this case is the maximum anomaly score over the component models:
In [17]:
anomaly_detector = pyisc.AnomalyDetector(
component_models=[
pyisc.P_PoissonOnesided(1,0), # columns 1 and 0
pyisc.P_Poisson(2,0), # columns 2 and 0
pyisc.P_Gaussian(3) # column 3
],
output_combination_rule=pyisc.cr_max
)
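To illustrate what the max combination rule computes, here is a minimal numpy sketch (not pyISC's internal implementation) using made-up per-component scores: the combined score of an instance is simply the maximum over its component scores.
In [ ]:
# Hypothetical anomaly scores from the three component models
# (columns: Freq1 model, Freq2 model, Measure model)
component_scores = np.array([[0.5, 1.2, 0.3],
                             [4.1, 0.9, 2.2],
                             [0.2, 0.4, 7.5]])
print(component_scores.max(axis=1))  # row-wise maximum: [ 1.2  4.1  7.5]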
Train the anomaly detector
In [18]:
anomaly_detector.fit(data, y=4); # y is the class column or an array with classes
Compute the anomaly scores for each data point
In [6]:
scores = anomaly_detector.anomaly_score(data, y=4)
In [7]:
from pandas import DataFrame
df = DataFrame(data[:15], columns=['#Days', 'Freq1', 'Freq2', 'Measure', 'Class'])
df['Anomaly Score'] = scores[:15]
print(df.to_string())
The anomalous values vs. anomaly scores for the 45 anomalous data points (15 per class):
In [8]:
df = DataFrame(data[-45:], columns=['#Days', 'Freq1', 'Freq2', 'Measure', 'Class'])
df['Anomaly Score'] = scores[-45:]
print(df.to_string())
As can be seen above, the anomalous data points also receive higher anomaly scores than the normal ones, as they should.
This becomes even more visible if we plot the anomaly scores (y-axis) against each data point (x-axis):
In [9]:
plot(scores, '.');
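To make the injected anomalies stand out, the last 45 points can be drawn separately; a small sketch using the same pylab interface as above:
In [ ]:
figure()
plot(scores, '.')  # all data points
plot(range(len(scores) - 45, len(scores)), scores[-45:], 'r.')  # the 45 injected anomalies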
We can also look at the details of each column in terms of their individual anomaly scores:
In [10]:
score_details = anomaly_detector.anomaly_score_details(data,y=4)
In [11]:
df = DataFrame(data[-45:], columns=['#Days', 'Freq1', 'Freq2', 'Measure', 'Class'])
df['Anomaly:Freq1'] = [detail[2][0] for detail in score_details[-45:]] # Anomaly Score of Freq1
df['Anomaly:Freq2'] = [detail[2][1] for detail in score_details[-45:]] # Anomaly Score of Freq2
df['Anomaly:Measure'] = [detail[2][2] for detail in score_details[-45:]] # Anomaly Score of Measure
df['Anomaly Score'] = [detail[0] for detail in score_details[-45:]] # Combined Anomaly Score
df
Out[11]:
Above, the last column contains the same anomaly score as before. We can see that it corresponds to the maximum of the individual anomaly scores to its left; thus, it is the result of the combination rule specified when creating the anomaly detector.
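A quick sanity check of this relationship (assuming the detail tuples are indexed as above, with the combined score at position 0 and the list of component scores at position 2):
In [ ]:
combined = np.array([detail[0] for detail in score_details])
per_component = np.array([detail[2] for detail in score_details])
print(np.allclose(combined, per_component.max(axis=1)))  # expected: True
Next, we create a new data set, similar to the normal data above but with unknown class labels: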
In [19]:
data2 = None
true_classes = []
length = 1000
for i in range(n_classes):
    po_normal = poisson(10+i)
    po_normal2 = poisson(2+i)
    gs_normal = norm(1+i, 12)
    tmp = np.column_stack(
        [
            [1] * length,
            po_normal.rvs(length),
            po_normal2.rvs(length),
            gs_normal.rvs(length),
            [None] * length,  # the class is unknown
        ]
    )
    true_classes += [i] * length
    if data2 is None:
        data2 = tmp
    else:
        data2 = np.r_[data2, tmp]
Then, we can also use the anomaly detector as a classifier to predict the class of each instance:
In [20]:
from pandas import DataFrame
from sklearn.metrics import accuracy_score
result = DataFrame(columns=['Algorithm','Accuracy'])
clf = pyisc.SklearnClassifier.clf(anomaly_detector)
predicted_classes = clf.predict(data2)
acc = accuracy_score(true_classes, predicted_classes)
result.loc[0] = ['pyISC classifier', acc]
We can also compare it to some available classifiers in Scikit-learn (http://scikit-learn.org/):
In [21]:
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
X = data[:, :-1]  # all columns except the class column
y = data[:, -1]   # the class column
count = 1
for name, clf in zip(['GaussianNB',
                      'KNeighborsClassifier',
                      'RandomForestClassifier'],
                     [GaussianNB(),
                      KNeighborsClassifier(n_neighbors=1000, weights='distance'),
                      RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1)]):
    clf.fit(X, y)
    predicted_classes_SK = clf.predict(data2[:, :-1])
    acc = accuracy_score(true_classes, predicted_classes_SK)
    result.loc[count] = [name, acc]
    count += 1
result
Out[21]:
The pyISC classifier performs best, most likely because its component models are closest to the true distributions used to generate the data.