Anomaly detection using the Iris dataset

The Iris dataset is perhaps not the most natural choice for anomaly detection, so at some point I will write a follow-up using a more appropriate dataset.

Regardless, the principles are the same: unsupervised learning of a model of 'normality', which is then used to score new data points. The log-likelihood of each point under the model identifies those that aren't 'normal'.
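
As a toy illustration of this idea (independent of bayesianpy and the network built below), a single multivariate Gaussian can stand in for the model of 'normality', with anything scoring below a chosen percentile of the training scores flagged as anomalous. The 1% cut-off here is arbitrary:

In [ ]:
import numpy as np
from scipy.stats import multivariate_normal

# fit a simple Gaussian 'normal' model to some well-behaved synthetic data
normal_data = np.random.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=500)
gaussian_model = multivariate_normal(mean=normal_data.mean(axis=0), cov=np.cov(normal_data.T))

# score new points; the second one sits far outside the training distribution
new_points = np.array([[0.1, -0.2], [4.0, -4.0]])
scores = gaussian_model.logpdf(new_points)

# flag anything below the 1st percentile of the training scores
threshold = np.percentile(gaussian_model.logpdf(normal_data), 1)
print(scores < threshold)  # roughly [False, True]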

In practical terms there are some hurdles. Ideally the training data should itself be 'normal': abnormal points in the system you're monitoring should be removed (the model really is quite sensitive to training data), whether manually or automatically. If removing them automatically, you can use a more constrained model rather than exactly the same structure. On the other hand, there are a host of benefits: it's possible to identify which variables most likely caused the abnormality, and also what the model expected those values to be. In that regard (unlike many other approaches) the model can essentially be debugged by identifying which variables are not behaving as expected. This article won't cover those two additional points, but I will return to them at a later date.
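
As a rough sketch of the automatic route (not part of bayesianpy; the drop_outliers helper below is hypothetical and a robust z-score filter stands in for the more constrained model mentioned above), suspect rows could be pruned before training:

In [ ]:
import pandas as pd

def drop_outliers(df: pd.DataFrame, z: float = 3.5) -> pd.DataFrame:
    # score each numeric column with a robust (median/MAD) z-score and keep
    # only rows that look 'normal' on every column; assumes no column has a
    # MAD of zero
    numeric = df.select_dtypes(include='number')
    median = numeric.median()
    mad = (numeric - median).abs().median()
    robust_z = 0.6745 * (numeric - median) / mad
    return df[(robust_z.abs() < z).all(axis=1)]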

First off, define all the imports.


In [6]:
%matplotlib notebook
import pandas as pd
import sys
sys.path.append("../../../bayesianpy")

import bayesianpy
from bayesianpy.network import Builder as builder

import logging
import os

import matplotlib.pyplot as plt
from IPython.display import display

In [7]:
logger = logging.getLogger()
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.INFO)

bayesianpy.jni.attach(logger)

db_folder = bayesianpy.utils.get_path_to_parent_dir("")
iris = pd.read_csv(os.path.join(db_folder, "data/iris.csv"), index_col=False)

Rather than using a template to build the network, it's fairly easy to define it by hand. The network looks something like the following:


In [8]:
network = bayesianpy.network.create_network()
cluster = builder.create_cluster_variable(network, 4)
petal_length = builder.create_continuous_variable(network, "petal_length")
petal_width = builder.create_continuous_variable(network, "petal_width")
sepal_length = builder.create_continuous_variable(network, "sepal_length")
sepal_width = builder.create_continuous_variable(network, "sepal_width")
nodes = [petal_length, petal_width, sepal_length, sepal_width]

# link the cluster (latent) variable to each continuous node, and link each
# pair of continuous nodes so the model captures their covariance
for i, node in enumerate(nodes):
    builder.create_link(network, cluster, node)
    for j in range(i+1, len(nodes)):
        builder.create_link(network, node, nodes[j])

layout = bayesianpy.visual.NetworkLayout(network)
graph = layout.build_graph()
pos = layout.fruchterman_reingold_layout(graph)
layout.visualise(graph, pos)


Using the above network, train the model on only two of the classes (for clarity, I chose the two that are least separated: versicolor and virginica).


In [9]:
# build the 'normal' model on two of the classes
model = bayesianpy.model.NetworkModel(network, logger)

with bayesianpy.data.DataSet(iris.drop('iris_class', axis=1), db_folder, logger) as dataset:

    subset = dataset.subset(
        iris[(iris.iris_class == "Iris-versicolor") | (iris.iris_class == "Iris-virginica")].index.tolist())

    model.train(subset)


INFO:root:Writing 150 rows to storage
INFO:root:Finished writing 150 rows to storage
INFO:root:Training model...
INFO:root:Finished training model

The network is then ready for anomaly detection; this entails scoring the entire dataset against the trained model and plotting the results. The log-likelihood is usually negative (although for continuous variables the density can exceed 1, so small positive values do occur); the higher the value, the more 'normal' the data sample is.


In [10]:
with bayesianpy.data.DataSet(iris.drop('iris_class', axis=1), db_folder, logger) as dataset:

    # get the loglikelihood value for the whole model on each individual sample,
    # the lower the loglikelihood value the less likely the data point has been
    # generated by the model.
    results = model.batch_query(dataset, [bayesianpy.model.QueryModelStatistics()])
    display(results)


INFO:root:Writing 150 rows to storage
INFO:root:Finished writing 150 rows to storage
INFO:root:Using 1 processes to query 150 rows
INFO:root:Queried case 0
sepal_length sepal_width petal_length petal_width loglikelihood
0 5.1 3.5 1.4 0.2 -97.335182
1 4.9 3.0 1.4 0.2 -68.149017
2 4.7 3.2 1.3 0.2 -79.445213
3 4.6 3.1 1.5 0.2 -69.911354
4 5.0 3.6 1.4 0.2 -102.695229
5 5.4 3.9 1.7 0.4 -91.974503
6 4.6 3.4 1.4 0.3 -80.899055
7 5.0 3.4 1.5 0.2 -88.482040
8 4.4 2.9 1.4 0.2 -61.011772
9 4.9 3.1 1.5 0.1 -79.247742
10 5.4 3.7 1.5 0.2 -95.258156
11 4.8 3.4 1.6 0.2 -85.956787
12 4.8 3.0 1.4 0.1 -75.082542
13 4.3 3.0 1.1 0.1 -78.560905
14 5.8 4.0 1.2 0.2 -114.067932
15 5.7 4.4 1.5 0.4 -102.963287
16 5.4 3.9 1.3 0.4 -109.151588
17 5.1 3.5 1.4 0.3 -89.338295
18 5.7 3.8 1.7 0.3 -88.886635
19 5.1 3.8 1.5 0.3 -101.818329
20 5.4 3.4 1.7 0.2 -85.641731
21 5.1 3.7 1.5 0.4 -91.647285
22 4.6 3.6 1.0 0.2 -110.777832
23 5.1 3.3 1.7 0.5 -58.709328
24 4.8 3.4 1.9 0.2 -82.125298
25 5.0 3.0 1.6 0.2 -64.907738
26 5.0 3.4 1.6 0.4 -71.292343
27 5.2 3.5 1.5 0.2 -95.008591
28 5.2 3.4 1.4 0.2 -91.972710
29 4.7 3.2 1.6 0.2 -74.016365
... ... ... ... ... ...
120 6.9 3.2 5.7 2.3 0.401173
121 5.6 2.8 4.9 2.0 -1.384017
122 7.7 2.8 6.7 2.0 -1.740525
123 6.3 2.7 4.9 1.8 -0.412584
124 6.7 3.3 5.7 2.1 -1.309598
125 7.2 3.2 6.0 1.8 -1.466413
126 6.2 2.8 4.8 1.8 -0.455969
127 6.1 3.0 4.9 1.8 -0.395859
128 6.4 2.8 5.6 2.1 -0.793193
129 7.2 3.0 5.8 1.6 -1.863071
130 7.4 2.8 6.1 1.9 -0.757230
131 7.9 3.8 6.4 2.0 -3.119908
132 6.4 2.8 5.6 2.2 -0.823412
133 6.3 2.8 5.1 1.5 -0.955212
134 6.1 2.6 5.6 1.4 -4.589409
135 7.7 3.0 6.1 2.3 -3.942888
136 6.3 3.4 5.6 2.4 -0.968925
137 6.4 3.1 5.5 1.8 -1.050825
138 6.0 3.0 4.8 1.8 -0.667510
139 6.9 3.1 5.4 2.1 -0.430080
140 6.7 3.1 5.6 2.4 0.001756
141 6.9 3.1 5.1 2.3 -2.418022
142 5.8 2.7 5.1 1.9 -0.482643
143 6.8 3.2 5.9 2.3 0.176468
144 6.7 3.3 5.7 2.5 -0.329838
145 6.7 3.0 5.2 2.3 -0.883600
146 6.3 2.5 5.0 1.9 -1.460659
147 6.5 3.0 5.2 2.0 -0.433640
148 6.2 3.4 5.4 2.3 -2.103189
149 5.9 3.0 5.1 1.8 -0.567526

150 rows × 5 columns
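
With the scores in hand, one simple way to flag anomalies is to threshold the log-likelihood. The sketch below uses the 10th percentile of the scores as an arbitrary cut-off and, assuming the results frame keeps the original row index, checks which class the flagged rows belong to (the setosa samples, which the model never saw, should dominate):

In [ ]:
# arbitrary cut-off: the lowest-scoring 10% of rows are treated as anomalous
threshold = results['loglikelihood'].quantile(0.10)
flagged = results[results['loglikelihood'] < threshold]
print("Flagged {} of {} rows as anomalous".format(len(flagged), len(results)))

# assumes results shares the original row index with the iris frame
print(iris.loc[flagged.index, 'iris_class'].value_counts())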


In [11]:
cmap = plt.cm.get_cmap('Blues_r')
fig = plt.figure(figsize=(10, 10))
k = 1
# scatter each pair of variables, coloured by the log-likelihood score
for i, v in enumerate(nodes):
    for j in range(i+1, len(nodes)):

        v_name = v.getName()
        v1_name = nodes[j].getName()
        ax = fig.add_subplot(3, 2, k)
        ax.set_title("{} vs {}".format(v_name, v1_name))
        h = ax.scatter(x=iris[v_name].tolist(), y=iris[v1_name].tolist(), c=results['loglikelihood'].tolist(),
                       vmin=results.loglikelihood.min(), vmax=results.loglikelihood.max(), cmap=cmap
                       )
        k += 1

# add a single shared colour bar for all six scatter plots, outside the loop
fig.subplots_adjust(right=0.8)
cbar_ax = fig.add_axes([0.85, 0.15, 0.05, 0.7])
fig.colorbar(h, cax=cbar_ax)
plt.show()
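
Another way to see the separation (again assuming the results index lines up with the iris frame) is to histogram the log-likelihood per class; the setosa samples, which were excluded from training, should sit well to the left of the two training classes:

In [ ]:
fig2, ax2 = plt.subplots(figsize=(8, 4))
for cls in iris['iris_class'].unique():
    # scores for the rows belonging to this class
    scores = results.loc[iris['iris_class'] == cls, 'loglikelihood']
    ax2.hist(scores, bins=20, alpha=0.5, label=cls)
ax2.set_xlabel('loglikelihood')
ax2.legend()
plt.show()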


