Anomaly detection using the Iris dataset

The Iris dataset is perhaps not the most natural dataset for anomaly detection, so at some point I'll write a follow-up using a more appropriate one.

Regardless, the principles are the same: unsupervised learning of a 'normal' model, i.e. generating a model of normality and using it to identify anomalous data points. The log-likelihood score can then be used to identify points that aren't 'normal'.
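
To make the idea concrete before diving into the Bayesian network below, here's a minimal sketch of the same principle using scikit-learn's GaussianMixture as a stand-in for the model of normality (an illustration only; not the bayespy API used in the rest of this article):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
normal_data = rng.normal(loc=0.0, scale=1.0, size=(500, 2))  # data known to be 'normal'

# learn a model of normality, unsupervised
gmm = GaussianMixture(n_components=2, random_state=0).fit(normal_data)

new_points = np.array([[0.1, -0.2],   # close to the training distribution
                       [6.0, 6.0]])   # far from it
print(gmm.score_samples(new_points))  # per-sample log-likelihood; the second is far lower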

In practical terms there are some hurdles: the model of normality really is quite sensitive to its training data, so abnormal points from the system you're monitoring should be removed from the training set, whether manually or automatically. If performing automatic removal, you can screen the data with a more constrained model rather than using exactly the same structure. On the other hand, there are a host of benefits: it's possible to identify which variables likely caused an abnormality, and also what values the model expected them to take. In that regard (unlike many other approaches), the model can essentially be debugged by identifying which variables are not behaving as expected. This article won't cover those two additional points, but I will return to them at a later date.
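
A sketch of the automatic route (again with the hypothetical GaussianMixture stand-in, not the bayespy API): screen the data with a deliberately constrained model, drop the least likely rows, and then train the full model on what remains.

import numpy as np
from sklearn.mixture import GaussianMixture

def drop_unlikely_rows(X, keep_fraction=0.95):
    # a single-component model is more constrained than the final model,
    # so it is less prone to fitting the anomalies themselves
    screen = GaussianMixture(n_components=1, random_state=0).fit(X)
    scores = screen.score_samples(X)
    threshold = np.quantile(scores, 1.0 - keep_fraction)
    return X[scores >= threshold]

# cleaned = drop_unlikely_rows(raw_training_data)
# ... then train the full model on 'cleaned'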

First off, define all the imports.


In [1]:
%matplotlib notebook
import pandas as pd
import sys
sys.path.append("../../../bayespy")  # make the local bayespy package importable

import bayespy
from bayespy.network import Builder as builder

import logging
import os

import matplotlib.pyplot as plt
from IPython.display import display

In [2]:
logger = logging.getLogger()
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.INFO)

bayespy.jni.attach(logger)  # attach the logger to bayespy's Java (JNI) bridge

db_folder = bayespy.utils.get_path_to_parent_dir("")
iris = pd.read_csv(os.path.join(db_folder, "data/iris.csv"), index_col=False)

Rather than using a template to build the network, it's fairly easy to define it by hand. The network looks something like the following:


In [3]:
network = bayespy.network.create_network()

# latent 'cluster' variable: a discrete node with 4 states
cluster = builder.create_cluster_variable(network, 4)

# one continuous node per feature
petal_length = builder.create_continuous_variable(network, "petal_length")
petal_width = builder.create_continuous_variable(network, "petal_width")
sepal_length = builder.create_continuous_variable(network, "sepal_length")
sepal_width = builder.create_continuous_variable(network, "sepal_width")
nodes = [petal_length, petal_width, sepal_length, sepal_width]

# link the cluster node to every feature, and fully connect the features
for i, node in enumerate(nodes):
    builder.create_link(network, cluster, node)
    for j in range(i+1, len(nodes)):
        builder.create_link(network, node, nodes[j])

layout = bayespy.visual.NetworkLayout(network)
graph = layout.build_graph()
pos = layout.fruchterman_reingold_layout(graph)
layout.visualise(graph, pos)


Using the above network, train the model on just two of the three classes. For clarity, I chose the two that are least separated: versicolor and virginica.


In [4]:
# build the 'normal' model on two of the classes
model = bayespy.model.NetworkModel(network, logger)

with bayespy.data.DataSet(iris.drop('iris_class', axis=1), db_folder, logger) as dataset:

    # train only on versicolor and virginica; setosa is held out so that it
    # shows up as anomalous later
    subset = dataset.subset(
        iris[(iris.iris_class == "Iris-versicolor") | (iris.iris_class == "Iris-virginica")].index.tolist())

    model.train(subset)


INFO:root:Writing 150 rows to storage
INFO:root:Finished writing 150 rows to storage
INFO:root:Training model...
INFO:root:Finished training model

The network is then ready for anomaly detection. This entails applying the entire dataset to the trained model and plotting the results. The log-likelihood is usually negative (although, being the log of a density, it can be slightly positive, as in row 147 below); the closer to zero or above, the more 'normal' the applied data sample is.


In [5]:
with bayespy.data.DataSet(iris.drop('iris_class', axis=1), db_folder, logger) as dataset:

    # get the loglikelihood value for the whole model on each individual sample,
    # the lower the loglikelihood value the less likely the data point has been
    # generated by the model.
    results = model.batch_query(dataset, [bayespy.model.QueryModelStatistics()])
    display(results)


INFO:root:Writing 150 rows to storage
INFO:root:Finished writing 150 rows to storage
INFO:root:Using 1 processes to query 150 rows
INFO:root:Queried case 0
sepal_length sepal_width petal_length petal_width loglikelihood
0 5.1 3.5 1.4 0.2 -75.110527
1 4.9 3.0 1.4 0.2 -68.085716
2 4.7 3.2 1.3 0.2 -76.256363
3 4.6 3.1 1.5 0.2 -61.319077
4 5.0 3.6 1.4 0.2 -73.103058
5 5.4 3.9 1.7 0.4 -69.913757
6 4.6 3.4 1.4 0.3 -71.203186
7 5.0 3.4 1.5 0.2 -66.434944
8 4.4 2.9 1.4 0.2 -62.880653
9 4.9 3.1 1.5 0.1 -62.479546
10 5.4 3.7 1.5 0.2 -74.074585
11 4.8 3.4 1.6 0.2 -57.543560
12 4.8 3.0 1.4 0.1 -68.814430
13 4.3 3.0 1.1 0.1 -80.091278
14 5.8 4.0 1.2 0.2 -109.272400
15 5.7 4.4 1.5 0.4 -95.246857
16 5.4 3.9 1.3 0.4 -100.343658
17 5.1 3.5 1.4 0.3 -80.320305
18 5.7 3.8 1.7 0.3 -71.622185
19 5.1 3.8 1.5 0.3 -73.342346
20 5.4 3.4 1.7 0.2 -61.585949
21 5.1 3.7 1.5 0.4 -77.198380
22 4.6 3.6 1.0 0.2 -95.631012
23 5.1 3.3 1.7 0.5 -58.013519
24 4.8 3.4 1.9 0.2 -43.801022
25 5.0 3.0 1.6 0.2 -64.057549
26 5.0 3.4 1.6 0.4 -68.822227
27 5.2 3.5 1.5 0.2 -70.111351
28 5.2 3.4 1.4 0.2 -78.558098
29 4.7 3.2 1.6 0.2 -56.109459
... ... ... ... ... ...
120 6.9 3.2 5.7 2.3 -0.500364
121 5.6 2.8 4.9 2.0 -1.462422
122 7.7 2.8 6.7 2.0 -1.871525
123 6.3 2.7 4.9 1.8 -0.029397
124 6.7 3.3 5.7 2.1 -0.582762
125 7.2 3.2 6.0 1.8 -1.050245
126 6.2 2.8 4.8 1.8 -0.723141
127 6.1 3.0 4.9 1.8 -0.072401
128 6.4 2.8 5.6 2.1 -0.562761
129 7.2 3.0 5.8 1.6 -2.257030
130 7.4 2.8 6.1 1.9 -0.971486
131 7.9 3.8 6.4 2.0 -2.721593
132 6.4 2.8 5.6 2.2 -0.975166
133 6.3 2.8 5.1 1.5 -1.163955
134 6.1 2.6 5.6 1.4 -4.592608
135 7.7 3.0 6.1 2.3 -4.191530
136 6.3 3.4 5.6 2.4 -1.633703
137 6.4 3.1 5.5 1.8 -1.172908
138 6.0 3.0 4.8 1.8 -0.170704
139 6.9 3.1 5.4 2.1 -0.583957
140 6.7 3.1 5.6 2.4 -0.514083
141 6.9 3.1 5.1 2.3 -2.488790
142 5.8 2.7 5.1 1.9 -0.583678
143 6.8 3.2 5.9 2.3 -0.657730
144 6.7 3.3 5.7 2.5 -0.537323
145 6.7 3.0 5.2 2.3 -1.110151
146 6.3 2.5 5.0 1.9 -0.936661
147 6.5 3.0 5.2 2.0 0.271280
148 6.2 3.4 5.4 2.3 -2.550608
149 5.9 3.0 5.1 1.8 -0.886933

150 rows × 5 columns
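
To turn these scores into explicit anomaly flags, one simple (illustrative) option is a percentile threshold; the 5% cut-off below is an arbitrary assumption, and in practice would be calibrated on a held-out sample of known-normal data:

# flag the lowest-scoring 5% of rows as anomalous (arbitrary threshold)
threshold = results['loglikelihood'].quantile(0.05)
anomalies = results[results['loglikelihood'] < threshold]
print("{} rows flagged as anomalous".format(len(anomalies)))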


Finally, plot each pair of variables, colouring every point by its log-likelihood; the setosa samples, which were held out of training, stand out with much lower scores than the two training classes.


In [9]:
cmap = plt.cm.get_cmap('Blues_r')
fig = plt.figure(figsize=(10, 10))
k = 1
for i, v in enumerate(nodes):
    for j in range(i+1, len(nodes)):

        v_name = v.getName()
        v1_name = nodes[j].getName()
        ax = fig.add_subplot(3, 2, k)
        ax.set_title("{} vs {}".format(v_name, v1_name))
        # colour each point by its log-likelihood (darker = less likely)
        h = ax.scatter(x=iris[v_name].tolist(), y=iris[v1_name].tolist(),
                       c=results['loglikelihood'].tolist(),
                       vmin=results.loglikelihood.min(),
                       vmax=results.loglikelihood.max(), cmap=cmap)
        k += 1

# add a single shared colorbar once all six subplots have been drawn
fig.subplots_adjust(right=0.8)
cbar_ax = fig.add_axes([0.85, 0.15, 0.05, 0.7])
fig.colorbar(h, cax=cbar_ax)
plt.show()


