Anomaly detection using the Iris dataset

The Iris dataset is perhaps not the most natural choice for anomaly detection, so at some point I will write a follow-up using a more appropriate dataset.

Regardless, the principles are the same: unsupervised learning of a model of 'normality', which is then used to score new data points. The log-likelihood of each point under the model identifies those that aren't 'normal'.
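
As a toy illustration of this idea (independent of bayesianpy and the network built below), a single multivariate Gaussian can stand in for the model of 'normality', with anything scoring below a chosen percentile of the training scores flagged as anomalous. The 1% cut-off here is arbitrary:

In [ ]:
import numpy as np
from scipy.stats import multivariate_normal

# fit a simple Gaussian 'normal' model to some well-behaved synthetic data
normal_data = np.random.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=500)
gaussian_model = multivariate_normal(mean=normal_data.mean(axis=0), cov=np.cov(normal_data.T))

# score new points; the second one sits far outside the training distribution
new_points = np.array([[0.1, -0.2], [4.0, -4.0]])
scores = gaussian_model.logpdf(new_points)

# flag anything below the 1st percentile of the training scores
threshold = np.percentile(gaussian_model.logpdf(normal_data), 1)
print(scores < threshold)  # roughly [False, True]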

In practical terms there are some hurdles. Ideally the training data should itself be 'normal': abnormal points in the system you're monitoring should be removed (the model really is quite sensitive to training data), whether manually or automatically. If removing them automatically, you can use a more constrained model rather than exactly the same structure. On the other hand, there are a host of benefits: it's possible to identify which variables most likely caused the abnormality, and also what the model expected those values to be. In that regard (unlike many other approaches) the model can essentially be debugged by identifying which variables are not behaving as expected. This article won't cover those two additional points, but I will return to them at a later date.
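
As a rough sketch of the automatic route (not part of bayesianpy; the drop_outliers helper below is hypothetical and a robust z-score filter stands in for the more constrained model mentioned above), suspect rows could be pruned before training:

In [ ]:
import pandas as pd

def drop_outliers(df: pd.DataFrame, z: float = 3.5) -> pd.DataFrame:
    # score each numeric column with a robust (median/MAD) z-score and keep
    # only rows that look 'normal' on every column; assumes no column has a
    # MAD of zero
    numeric = df.select_dtypes(include='number')
    median = numeric.median()
    mad = (numeric - median).abs().median()
    robust_z = 0.6745 * (numeric - median) / mad
    return df[(robust_z.abs() < z).all(axis=1)]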

First off, define all the imports.


In [6]:
%matplotlib notebook
import pandas as pd
import sys
sys.path.append("../../../bayesianpy")

import bayesianpy
from bayesianpy.network import Builder as builder

import logging
import os

import matplotlib.pyplot as plt
from IPython.display import display

In [7]:
logger = logging.getLogger()
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.INFO)

bayesianpy.jni.attach(logger)

db_folder = bayesianpy.utils.get_path_to_parent_dir("")
iris = pd.read_csv(os.path.join(db_folder, "data/iris.csv"), index_col=False)

Rather than using a template to build the network, it's fairly easy to define it by hand. The network looks something like the following:


In [8]:
network = bayesianpy.network.create_network()
cluster = builder.create_cluster_variable(network, 4)
petal_length = builder.create_continuous_variable(network, "petal_length")
petal_width = builder.create_continuous_variable(network, "petal_width")
sepal_length = builder.create_continuous_variable(network, "sepal_length")
sepal_width = builder.create_continuous_variable(network, "sepal_width")
nodes = [petal_length, petal_width, sepal_length, sepal_width]

# link the cluster (latent) variable to each continuous node, and link each
# pair of continuous nodes so the model captures their covariance
for i, node in enumerate(nodes):
    builder.create_link(network, cluster, node)
    for j in range(i+1, len(nodes)):
        builder.create_link(network, node, nodes[j])

layout = bayesianpy.visual.NetworkLayout(network)
graph = layout.build_graph()
pos = layout.fruchterman_reingold_layout(graph)
layout.visualise(graph, pos)


Using the above network, train the model on only two of the classes (for clarity, I chose the two that are least separated: versicolor and virginica).


In [9]:
# build the 'normal' model on two of the classes
model = bayesianpy.model.NetworkModel(network, logger)

with bayesianpy.data.DataSet(iris.drop('iris_class', axis=1), db_folder, logger) as dataset:

    subset = dataset.subset(
        iris[(iris.iris_class == "Iris-versicolor") | (iris.iris_class == "Iris-virginica")].index.tolist())

    model.train(subset)


INFO:root:Writing 150 rows to storage
INFO:root:Finished writing 150 rows to storage
INFO:root:Training model...
INFO:root:Finished training model

The network is then ready for anomaly detection; this entails scoring the entire dataset against the trained model and plotting the results. The log-likelihood is usually negative (although for continuous variables the density can exceed 1, so small positive values do occur); the higher the value, the more 'normal' the data sample is.


In [10]:
with bayesianpy.data.DataSet(iris.drop('iris_class', axis=1), db_folder, logger) as dataset:

    # get the loglikelihood value for the whole model on each individual sample,
    # the lower the loglikelihood value the less likely the data point has been
    # generated by the model.
    results = model.batch_query(dataset, [bayesianpy.model.QueryModelStatistics()])
    display(results)


INFO:root:Writing 150 rows to storage
INFO:root:Finished writing 150 rows to storage
INFO:root:Using 1 processes to query 150 rows
INFO:root:Queried case 0
sepal_length sepal_width petal_length petal_width loglikelihood
0 5.1 3.5 1.4 0.2 -97.335182
1 4.9 3.0 1.4 0.2 -68.149017
2 4.7 3.2 1.3 0.2 -79.445213
3 4.6 3.1 1.5 0.2 -69.911354
4 5.0 3.6 1.4 0.2 -102.695229
5 5.4 3.9 1.7 0.4 -91.974503
6 4.6 3.4 1.4 0.3 -80.899055
7 5.0 3.4 1.5 0.2 -88.482040
8 4.4 2.9 1.4 0.2 -61.011772
9 4.9 3.1 1.5 0.1 -79.247742
10 5.4 3.7 1.5 0.2 -95.258156
11 4.8 3.4 1.6 0.2 -85.956787
12 4.8 3.0 1.4 0.1 -75.082542
13 4.3 3.0 1.1 0.1 -78.560905
14 5.8 4.0 1.2 0.2 -114.067932
15 5.7 4.4 1.5 0.4 -102.963287
16 5.4 3.9 1.3 0.4 -109.151588
17 5.1 3.5 1.4 0.3 -89.338295
18 5.7 3.8 1.7 0.3 -88.886635
19 5.1 3.8 1.5 0.3 -101.818329
20 5.4 3.4 1.7 0.2 -85.641731
21 5.1 3.7 1.5 0.4 -91.647285
22 4.6 3.6 1.0 0.2 -110.777832
23 5.1 3.3 1.7 0.5 -58.709328
24 4.8 3.4 1.9 0.2 -82.125298
25 5.0 3.0 1.6 0.2 -64.907738
26 5.0 3.4 1.6 0.4 -71.292343
27 5.2 3.5 1.5 0.2 -95.008591
28 5.2 3.4 1.4 0.2 -91.972710
29 4.7 3.2 1.6 0.2 -74.016365
... ... ... ... ... ...
120 6.9 3.2 5.7 2.3 0.401173
121 5.6 2.8 4.9 2.0 -1.384017
122 7.7 2.8 6.7 2.0 -1.740525
123 6.3 2.7 4.9 1.8 -0.412584
124 6.7 3.3 5.7 2.1 -1.309598
125 7.2 3.2 6.0 1.8 -1.466413
126 6.2 2.8 4.8 1.8 -0.455969
127 6.1 3.0 4.9 1.8 -0.395859
128 6.4 2.8 5.6 2.1 -0.793193
129 7.2 3.0 5.8 1.6 -1.863071
130 7.4 2.8 6.1 1.9 -0.757230
131 7.9 3.8 6.4 2.0 -3.119908
132 6.4 2.8 5.6 2.2 -0.823412
133 6.3 2.8 5.1 1.5 -0.955212
134 6.1 2.6 5.6 1.4 -4.589409
135 7.7 3.0 6.1 2.3 -3.942888
136 6.3 3.4 5.6 2.4 -0.968925
137 6.4 3.1 5.5 1.8 -1.050825
138 6.0 3.0 4.8 1.8 -0.667510
139 6.9 3.1 5.4 2.1 -0.430080
140 6.7 3.1 5.6 2.4 0.001756
141 6.9 3.1 5.1 2.3 -2.418022
142 5.8 2.7 5.1 1.9 -0.482643
143 6.8 3.2 5.9 2.3 0.176468
144 6.7 3.3 5.7 2.5 -0.329838
145 6.7 3.0 5.2 2.3 -0.883600
146 6.3 2.5 5.0 1.9 -1.460659
147 6.5 3.0 5.2 2.0 -0.433640
148 6.2 3.4 5.4 2.3 -2.103189
149 5.9 3.0 5.1 1.8 -0.567526

150 rows × 5 columns
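
With the scores in hand, one simple way to flag anomalies is to threshold the log-likelihood. The sketch below uses the 10th percentile of the scores as an arbitrary cut-off and, assuming the results frame keeps the original row index, checks which class the flagged rows belong to (the setosa samples, which the model never saw, should dominate):

In [ ]:
# arbitrary cut-off: the lowest-scoring 10% of rows are treated as anomalous
threshold = results['loglikelihood'].quantile(0.10)
flagged = results[results['loglikelihood'] < threshold]
print("Flagged {} of {} rows as anomalous".format(len(flagged), len(results)))

# assumes results shares the original row index with the iris frame
print(iris.loc[flagged.index, 'iris_class'].value_counts())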


In [11]:
cmap = plt.cm.get_cmap('Blues_r')
fig = plt.figure(figsize=(10, 10))
k = 1
# scatter each pair of variables, coloured by the log-likelihood score
for i, v in enumerate(nodes):
    for j in range(i+1, len(nodes)):

        v_name = v.getName()
        v1_name = nodes[j].getName()
        ax = fig.add_subplot(3, 2, k)
        ax.set_title("{} vs {}".format(v_name, v1_name))
        h = ax.scatter(x=iris[v_name].tolist(), y=iris[v1_name].tolist(), c=results['loglikelihood'].tolist(),
                       vmin=results.loglikelihood.min(), vmax=results.loglikelihood.max(), cmap=cmap
                       )
        k += 1

# add a single shared colour bar for all six scatter plots, outside the loop
fig.subplots_adjust(right=0.8)
cbar_ax = fig.add_axes([0.85, 0.15, 0.05, 0.7])
fig.colorbar(h, cax=cbar_ax)
plt.show()
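
Another way to see the separation (again assuming the results index lines up with the iris frame) is to histogram the log-likelihood per class; the setosa samples, which were excluded from training, should sit well to the left of the two training classes:

In [ ]:
fig2, ax2 = plt.subplots(figsize=(8, 4))
for cls in iris['iris_class'].unique():
    # scores for the rows belonging to this class
    scores = results.loc[iris['iris_class'] == cls, 'loglikelihood']
    ax2.hist(scores, bins=20, alpha=0.5, label=cls)
ax2.set_xlabel('loglikelihood')
ax2.legend()
plt.show()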


