Anomaly detection using the Iris dataset

The Iris dataset is perhaps not the most natural dataset for anomaly detection, so at some point I'll write a follow-up using a more appropriate one.

Regardless, the principles are the same: unsupervised learning of a 'normal' model, i.e. generating a model of normality and using it to identify anomalous data points. The log-likelihood score can then be used to identify points that aren't 'normal'.
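
To make the idea concrete before diving into the Bayesian network below, here's a minimal sketch of the same principle using scikit-learn's GaussianMixture as a stand-in for the model of normality (an illustration only; not the bayespy API used in the rest of this article):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
normal_data = rng.normal(loc=0.0, scale=1.0, size=(500, 2))  # data known to be 'normal'

# learn a model of normality, unsupervised
gmm = GaussianMixture(n_components=2, random_state=0).fit(normal_data)

new_points = np.array([[0.1, -0.2],   # close to the training distribution
                       [6.0, 6.0]])   # far from it
print(gmm.score_samples(new_points))  # per-sample log-likelihood; the second is far lower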

In practical terms there are some hurdles: the model of normality really is quite sensitive to its training data, so abnormal points from the system you're monitoring should be removed from the training set, whether manually or automatically. If performing automatic removal, you can screen the data with a more constrained model rather than using exactly the same structure. On the other hand, there are a host of benefits: it's possible to identify which variables likely caused an abnormality, and also what values the model expected them to take. In that regard (unlike many other approaches), the model can essentially be debugged by identifying which variables are not behaving as expected. This article won't cover those two additional points, but I will return to them at a later date.
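
A sketch of the automatic route (again with the hypothetical GaussianMixture stand-in, not the bayespy API): screen the data with a deliberately constrained model, drop the least likely rows, and then train the full model on what remains.

import numpy as np
from sklearn.mixture import GaussianMixture

def drop_unlikely_rows(X, keep_fraction=0.95):
    # a single-component model is more constrained than the final model,
    # so it is less prone to fitting the anomalies themselves
    screen = GaussianMixture(n_components=1, random_state=0).fit(X)
    scores = screen.score_samples(X)
    threshold = np.quantile(scores, 1.0 - keep_fraction)
    return X[scores >= threshold]

# cleaned = drop_unlikely_rows(raw_training_data)
# ... then train the full model on 'cleaned'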

First off, define all the imports.


In [1]:
%matplotlib notebook
import pandas as pd
import sys
sys.path.append("../../../bayespy")  # make the local bayespy package importable

import bayespy
from bayespy.network import Builder as builder

import logging
import os

import matplotlib.pyplot as plt
from IPython.display import display

In [2]:
logger = logging.getLogger()
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.INFO)

bayespy.jni.attach(logger)  # attach the logger to bayespy's Java (JNI) bridge

db_folder = bayespy.utils.get_path_to_parent_dir("")
iris = pd.read_csv(os.path.join(db_folder, "data/iris.csv"), index_col=False)

Rather than using a template to build the network, it's fairly easy to define it by hand. The network looks something like the following:


In [3]:
network = bayespy.network.create_network()

# latent 'cluster' variable: a discrete node with 4 states
cluster = builder.create_cluster_variable(network, 4)

# one continuous node per feature
petal_length = builder.create_continuous_variable(network, "petal_length")
petal_width = builder.create_continuous_variable(network, "petal_width")
sepal_length = builder.create_continuous_variable(network, "sepal_length")
sepal_width = builder.create_continuous_variable(network, "sepal_width")
nodes = [petal_length, petal_width, sepal_length, sepal_width]

# link the cluster node to every feature, and fully connect the features
for i, node in enumerate(nodes):
    builder.create_link(network, cluster, node)
    for j in range(i+1, len(nodes)):
        builder.create_link(network, node, nodes[j])

layout = bayespy.visual.NetworkLayout(network)
graph = layout.build_graph()
pos = layout.fruchterman_reingold_layout(graph)
layout.visualise(graph, pos)


Using the above network, train the model on just two of the three classes. For clarity, I chose the two that are least separated: versicolor and virginica.


In [4]:
# build the 'normal' model on two of the classes
model = bayespy.model.NetworkModel(network, logger)

with bayespy.data.DataSet(iris.drop('iris_class', axis=1), db_folder, logger) as dataset:

    # train only on versicolor and virginica; setosa is held out so that it
    # shows up as anomalous later
    subset = dataset.subset(
        iris[(iris.iris_class == "Iris-versicolor") | (iris.iris_class == "Iris-virginica")].index.tolist())

    model.train(subset)


INFO:root:Writing 150 rows to storage
INFO:root:Finished writing 150 rows to storage
INFO:root:Training model...
INFO:root:Finished training model

The network is then ready for anomaly detection. This entails applying the entire dataset to the trained model and plotting the results. The log-likelihood is usually negative (although, being the log of a density, it can be slightly positive, as in row 147 below); the closer to zero or above, the more 'normal' the applied data sample is.


In [5]:
with bayespy.data.DataSet(iris.drop('iris_class', axis=1), db_folder, logger) as dataset:

    # get the loglikelihood value for the whole model on each individual sample,
    # the lower the loglikelihood value the less likely the data point has been
    # generated by the model.
    results = model.batch_query(dataset, [bayespy.model.QueryModelStatistics()])
    display(results)


INFO:root:Writing 150 rows to storage
INFO:root:Finished writing 150 rows to storage
INFO:root:Using 1 processes to query 150 rows
INFO:root:Queried case 0
sepal_length sepal_width petal_length petal_width loglikelihood
0 5.1 3.5 1.4 0.2 -75.110527
1 4.9 3.0 1.4 0.2 -68.085716
2 4.7 3.2 1.3 0.2 -76.256363
3 4.6 3.1 1.5 0.2 -61.319077
4 5.0 3.6 1.4 0.2 -73.103058
5 5.4 3.9 1.7 0.4 -69.913757
6 4.6 3.4 1.4 0.3 -71.203186
7 5.0 3.4 1.5 0.2 -66.434944
8 4.4 2.9 1.4 0.2 -62.880653
9 4.9 3.1 1.5 0.1 -62.479546
10 5.4 3.7 1.5 0.2 -74.074585
11 4.8 3.4 1.6 0.2 -57.543560
12 4.8 3.0 1.4 0.1 -68.814430
13 4.3 3.0 1.1 0.1 -80.091278
14 5.8 4.0 1.2 0.2 -109.272400
15 5.7 4.4 1.5 0.4 -95.246857
16 5.4 3.9 1.3 0.4 -100.343658
17 5.1 3.5 1.4 0.3 -80.320305
18 5.7 3.8 1.7 0.3 -71.622185
19 5.1 3.8 1.5 0.3 -73.342346
20 5.4 3.4 1.7 0.2 -61.585949
21 5.1 3.7 1.5 0.4 -77.198380
22 4.6 3.6 1.0 0.2 -95.631012
23 5.1 3.3 1.7 0.5 -58.013519
24 4.8 3.4 1.9 0.2 -43.801022
25 5.0 3.0 1.6 0.2 -64.057549
26 5.0 3.4 1.6 0.4 -68.822227
27 5.2 3.5 1.5 0.2 -70.111351
28 5.2 3.4 1.4 0.2 -78.558098
29 4.7 3.2 1.6 0.2 -56.109459
... ... ... ... ... ...
120 6.9 3.2 5.7 2.3 -0.500364
121 5.6 2.8 4.9 2.0 -1.462422
122 7.7 2.8 6.7 2.0 -1.871525
123 6.3 2.7 4.9 1.8 -0.029397
124 6.7 3.3 5.7 2.1 -0.582762
125 7.2 3.2 6.0 1.8 -1.050245
126 6.2 2.8 4.8 1.8 -0.723141
127 6.1 3.0 4.9 1.8 -0.072401
128 6.4 2.8 5.6 2.1 -0.562761
129 7.2 3.0 5.8 1.6 -2.257030
130 7.4 2.8 6.1 1.9 -0.971486
131 7.9 3.8 6.4 2.0 -2.721593
132 6.4 2.8 5.6 2.2 -0.975166
133 6.3 2.8 5.1 1.5 -1.163955
134 6.1 2.6 5.6 1.4 -4.592608
135 7.7 3.0 6.1 2.3 -4.191530
136 6.3 3.4 5.6 2.4 -1.633703
137 6.4 3.1 5.5 1.8 -1.172908
138 6.0 3.0 4.8 1.8 -0.170704
139 6.9 3.1 5.4 2.1 -0.583957
140 6.7 3.1 5.6 2.4 -0.514083
141 6.9 3.1 5.1 2.3 -2.488790
142 5.8 2.7 5.1 1.9 -0.583678
143 6.8 3.2 5.9 2.3 -0.657730
144 6.7 3.3 5.7 2.5 -0.537323
145 6.7 3.0 5.2 2.3 -1.110151
146 6.3 2.5 5.0 1.9 -0.936661
147 6.5 3.0 5.2 2.0 0.271280
148 6.2 3.4 5.4 2.3 -2.550608
149 5.9 3.0 5.1 1.8 -0.886933

150 rows × 5 columns
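
To turn these scores into explicit anomaly flags, one simple (illustrative) option is a percentile threshold; the 5% cut-off below is an arbitrary assumption, and in practice would be calibrated on a held-out sample of known-normal data:

# flag the lowest-scoring 5% of rows as anomalous (arbitrary threshold)
threshold = results['loglikelihood'].quantile(0.05)
anomalies = results[results['loglikelihood'] < threshold]
print("{} rows flagged as anomalous".format(len(anomalies)))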


Finally, plot each pair of variables, colouring every point by its log-likelihood; the setosa samples, which were held out of training, stand out with much lower scores than the two training classes.


In [9]:
cmap = plt.cm.get_cmap('Blues_r')
fig = plt.figure(figsize=(10, 10))
k = 1
for i, v in enumerate(nodes):
    for j in range(i+1, len(nodes)):

        v_name = v.getName()
        v1_name = nodes[j].getName()
        ax = fig.add_subplot(3, 2, k)
        ax.set_title("{} vs {}".format(v_name, v1_name))
        # colour each point by its log-likelihood (darker = less likely)
        h = ax.scatter(x=iris[v_name].tolist(), y=iris[v1_name].tolist(),
                       c=results['loglikelihood'].tolist(),
                       vmin=results.loglikelihood.min(),
                       vmax=results.loglikelihood.max(), cmap=cmap)
        k += 1

# add a single shared colorbar once all six subplots have been drawn
fig.subplots_adjust(right=0.8)
cbar_ax = fig.add_axes([0.85, 0.15, 0.05, 0.7])
fig.colorbar(h, cax=cbar_ax)
plt.show()


