Anomaly Detection in High Dimensions Using Auto-encoder

Anomaly detection identifies data points (outliers) that do not conform to an expected pattern or to the other items in a dataset. In statistics, such anomalies, also known as outliers, are observations that lie far from the other observations. In this notebook we demonstrate how to perform unsupervised anomaly detection in high dimensions using an auto-encoder.

We use one of the HiCS real-world data sets (link) for this demo. The data set contains 32 dimensions and 351 instances, 126 of which are outliers. Alternative data sets are available in the same directory; in general, data sets with higher dimensionality and a lower outlier proportion tend to give better detection performance. Data points with a higher reconstruction error are more likely to be outliers, and this notebook also shows in which dimensions those points are outlying.
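
Concretely, the outlier score used later in this notebook is the Euclidean distance between a data point and its auto-encoder reconstruction. A minimal sketch of that scoring idea (the name anomaly_score is illustrative and not part of the notebook's code):

import numpy as np

def anomaly_score(x, x_reconstructed):
    # Euclidean (L2) distance between a point and its reconstruction;
    # larger values indicate a higher likelihood of being an outlier
    return np.linalg.norm(x - x_reconstructed)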


Initialization

Initialize the NN context and load the data


In [1]:
from zoo.common.nncontext import *
sc = init_nncontext("Anomaly Detection HD Example")

from scipy.io import arff
import pandas as pd
import os

dataset = "ionosphere"           #real world dataset
data_dir = os.getenv("ANALYTICS_ZOO_HOME")+"/bin/data/HiCS/"+dataset+".arff"
rawdata, _ = arff.loadarff(data_dir)
data = pd.DataFrame(rawdata)

The dataset contains 32 dimensions and 351 instances, with 126 of them being outliers.


In [2]:
data.head(5)


Out[2]:
var_0000 var_0001 var_0002 var_0003 var_0004 var_0005 var_0006 var_0007 var_0008 var_0009 ... var_0023 var_0024 var_0025 var_0026 var_0027 var_0028 var_0029 var_0030 var_0031 class
0 0.997695 0.470555 0.926215 0.511530 0.91699 0.311460 1.000000 0.518800 0.926215 0.411225 ... 0.244145 0.705390 0.269160 0.606330 0.329550 0.711335 0.227565 0.593205 0.273500 0
1 1.000000 0.405855 0.965175 0.319220 0.44566 0.032015 1.000000 0.477255 0.754370 0.161285 ... 0.367155 0.397660 0.407995 0.404800 0.442035 0.416870 0.468560 0.431310 0.487765 1
2 1.000000 0.483175 1.000000 0.502425 1.00000 0.439690 0.944825 0.505990 0.865410 0.526730 ... 0.298900 0.794920 0.389275 0.715500 0.413175 0.802180 0.379100 0.780225 0.308810 0
3 1.000000 0.274195 1.000000 1.000000 0.85608 0.000000 0.500000 0.500000 0.500000 0.500000 ... 0.953475 0.758065 1.000000 1.000000 0.399505 0.628410 1.000000 0.338090 1.000000 1
4 1.000000 0.487995 0.970700 0.532655 0.96053 0.383725 0.885760 0.418005 0.763990 0.398625 ... 0.174210 0.566450 0.233970 0.512155 0.189015 0.471465 0.202135 0.476960 0.171515 0

5 rows × 33 columns

Data preprocessing

Generate labels and normalize the data between 0 and 1.

Generate labels


In [3]:
labels = data['class'].astype(int)
del data['class']

labels[labels != 0] = 1

MinMaxScaler is used for normalization because it scales each attribute to [0, 1] without distorting the relative positions of the points, so the outlying characteristics of the data are preserved.


In [4]:
from sklearn.preprocessing import MinMaxScaler
data_norm = MinMaxScaler().fit_transform(data).astype('float32')

In [5]:
print("Instances: %d \nOutliers: %d\nAttributes: %d" % (len(data), sum(labels), len(data_norm[0])))


Instances: 351 
Outliers: 126
Attributes: 32

Build the model


In [6]:
from zoo.pipeline.api.keras.layers import Input, Dense
from zoo.pipeline.api.keras.models import Model

compress_rate = 0.8                      # ratio of hidden units to input dimensions
origin_dim = len(data_norm[0])           # number of input attributes (32)

# A single-hidden-layer auto-encoder: encode to compress_rate * origin_dim units, then reconstruct
input = Input(shape=(origin_dim,))
encoded = Dense(int(compress_rate*origin_dim), activation='relu')(input)
decoded = Dense(origin_dim, activation='sigmoid')(encoded)
autoencoder = Model(input, decoded)

autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')


creating: createZooKerasInput
creating: createZooKerasDense
creating: createZooKerasDense
creating: createZooKerasModel
creating: createAdadelta
creating: createZooKerasBinaryCrossEntropy
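
For reference, the same architecture expressed in standard Keras would look roughly as follows. This is only an illustrative sketch for readers more familiar with tf.keras; it assumes a local TensorFlow installation and is not needed to run this notebook, which uses the Analytics Zoo Keras-style API.

# Illustrative only: the equivalent single-hidden-layer auto-encoder in tf.keras.
# Aliased imports are used so the zoo Input/Dense/Model names above are not shadowed.
from tensorflow.keras import layers as klayers
from tensorflow.keras import models as kmodels

inp = klayers.Input(shape=(origin_dim,))
hidden = klayers.Dense(int(compress_rate * origin_dim), activation='relu')(inp)
out = klayers.Dense(origin_dim, activation='sigmoid')(hidden)
keras_autoencoder = kmodels.Model(inp, out)
keras_autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')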

Training


In [7]:
autoencoder.fit(x=data_norm,
                y=data_norm,
                batch_size=100,
                nb_epoch=2500,
                validation_data=None)

Prediction

The data is encoded and then reconstructed; the reconstruction is stored as data_trans.


In [8]:
data_trans = autoencoder.predict(data_norm).collect()

Evaluation

Calculate the Euclidean distance between each original data point and its reconstruction. The larger the distance, the more likely the point is an outlier.


In [9]:
import numpy as np
# Reconstruction distance (outlier score) for each data point
dist = []
for i, x in enumerate(data_norm):
    dist.append(np.linalg.norm(data_norm[i] - data_trans[i]))
dist = np.array(dist)
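
The same scores can also be computed in a single vectorized call, assuming the per-row predictions in data_trans are stacked into one NumPy array:

# Equivalent vectorized computation of the reconstruction distance per point
data_trans_arr = np.array(data_trans)              # list of row predictions -> (n, d) array
dist_vec = np.linalg.norm(data_norm - data_trans_arr, axis=1)
assert np.allclose(dist_vec, dist)                 # matches the loop-based result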

Plot the ROC curve to assess the quality of the anomaly detection. Here we achieve an AUC of about 0.95, which is very good.


In [10]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
%matplotlib inline

fpr, tpr, threshold = roc_curve(labels, dist)
roc_auc = auc(fpr, tpr)
print('AUC = %f' % roc_auc)

plt.figure(figsize=(10, 7))
plt.plot(fpr, tpr, '--', lw=2, color='red',
         label='ROC (area = %0.2f)' % roc_auc)
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.plot([0, 1],
         [0, 1],
         linestyle='--',
         color=(0.6, 0.6, 0.6))
plt.xlabel('False Positive rate')
plt.ylabel('True Positive rate')
plt.title('ROC Autoencoder compress rate: %0.1f ' % compress_rate + "\nInstances: %d, Outliers: %d, Attributes: %d" % (len(data), sum(labels), len(data_norm[0])))
plt.legend(loc="lower right")
plt.tight_layout()
plt.show()


AUC = 0.947866
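
The ROC curve only ranks the points; to turn the scores into hard outlier/inlier predictions, a threshold has to be chosen. A minimal sketch, where the cut-off is set (for illustration only) so that the number of flagged points matches the known outlier count:

from sklearn.metrics import classification_report

# Flag the highest-scoring points as outliers; the percentile cut-off below uses the
# known outlier proportion of this data set and is purely illustrative
cutoff = np.percentile(dist, 100.0 * (1 - float(sum(labels)) / len(labels)))
predicted = (dist > cutoff).astype(int)
print(classification_report(labels, predicted))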

Plot the outlier score for each data point. Higher scores should indicate a higher likelihood of being an outlier. Comparing against the ground truth, where positive (outlier) points are shown in red and negative points in blue, the positive points indeed receive much higher outlier scores than the negative ones, as expected.


In [11]:
plt.figure(figsize=(15, 7))
# Colour the points by ground truth: red for outliers, blue for normal points
label_colors = list(map(lambda x: "r" if x == 1 else "b", labels))
plt.scatter(data.index, dist, c=label_colors, s=15)
plt.xlabel('Index')
plt.ylabel('Score')
plt.title("Outlier Score\nInstances: %d, Outliers: %d, Attributes: %d" % (len(data), sum(labels), len(data_norm[0])))
plt.tight_layout()
#plt.savefig("./fig/"+dataset+".score.png")
plt.show()


Show the 20 data points with the highest outlier scores, in descending order.


In [12]:
outlier_indices = np.argsort(-dist)[0:20]
print(outlier_indices)


[186  45 204 218 164 172 176  59 206 182  71 180 224   7 202 226 128 200
  37 122]
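
As a quick sanity check (not part of the original analysis), we can count how many of these 20 top-ranked points are labelled as outliers in the ground truth:

# Number of true outliers among the 20 highest-scoring points
print("True outliers in top 20: %d" % labels.values[outlier_indices].sum())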

By looking at the reconstruction error, we can get hints about the dimensions in which a particular data point is outlying. Here we plot the per-dimension reconstruction error for data point 204, which has the second highest outlier score.


In [13]:
def error_in_dim(index):
    # Absolute reconstruction error in each dimension for a single data point
    error = []
    for i, x in enumerate(data_norm[index]):
        error.append(abs(data_norm[index][i] - data_trans[index][i]))
    return np.array(error)

example = 204

plt.figure(figsize=(10,7))
plt.plot(error_in_dim(example))
plt.xlabel('Dimension')
plt.ylabel('Reconstruction error')
plt.title("Reconstruction error in each dimension of point %d" % example)
plt.tight_layout()
plt.show()


Show the three dimensions with the highest reconstruction error, in descending order. Data point 204 has high reconstruction error in the subspace [8, 23, 29].


In [14]:
print(np.argsort(-error_in_dim(example))[0:3])


[ 8 23 29]
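
The same lookup can be repeated for the other top-ranked points to get a compact picture of the subspaces in which each detected outlier lies. A small illustrative loop (the top_k variable is an assumption, not from the original notebook):

# For each of the five highest-scoring points, list the dimensions with the largest error
top_k = 3
for idx in outlier_indices[:5]:
    worst_dims = np.argsort(-error_in_dim(idx))[:top_k]
    print("point %3d -> dimensions %s" % (idx, worst_dims))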

Looking at the position of the point in the subspace [28, 29, 30], data point 204, indicated as a red dot, is far away from the other data points.


In [15]:
# Mark the example point (204) in red and all other points in blue
indicator = ['b'] * len(data)
indicator[example] = 'r'
indicator = pd.Series(indicator)

from mpl_toolkits.mplot3d import Axes3D
threedee = plt.figure(figsize=(20,14)).gca(projection='3d')
threedee.scatter(data['var_0028'], data['var_0029'], zs=data['var_0030'], 
                 c=indicator)
threedee.set_xlabel('28')
threedee.set_ylabel('29')
threedee.set_zlabel('30')
plt.show()


By looking at the reconstruction errors, you can recover the subspaces in which each object is an outlier, or at least drastically reduce the search space. Here, we plot the per-dimension reconstruction errors of several outliers in the subspace [8]: data points 21, 232, 212, 100, 122 and 19.


In [16]:
plt.figure(figsize=(10,7))
outliers = [21, 232, 212, 100, 122, 19]
for i in outliers:
    plt.plot(error_in_dim(i), label=i)
plt.legend(loc=1)
plt.xlabel('Dimension')
plt.ylabel('Reconstruction error')
plt.title("Reconstruction error in each dimension of outliers")
plt.tight_layout()
plt.show()
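
Averaging the per-dimension errors over this group of outliers gives a rough ranking of which dimensions matter most for them as a whole (an illustrative aggregation, not part of the original notebook):

# Mean reconstruction error per dimension, averaged over the selected outliers
mean_error = np.mean([error_in_dim(i) for i in outliers], axis=0)
print("Dimensions ranked by mean error:", np.argsort(-mean_error)[:5])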


More generally, looking at the subspace [7, 8, 9] for the full data set, with the selected outliers indicated as red dots, we can see that these data points are clearly outlying.


In [17]:
from mpl_toolkits.mplot3d import Axes3D
threedee = plt.figure(figsize=(20,14)).gca(projection='3d')

threedee.scatter(data['var_0007'], data['var_0008'], zs=data['var_0009'], c='b', alpha=0.1)

for i in outliers:
    threedee.scatter(data['var_0007'][i], data['var_0008'][i], zs=data['var_0009'][i], c="r", s=60)
    #print(data['var_0007'][i], data['var_0008'][i], data['var_0009'][i])

threedee.set_xlabel('7')
threedee.set_ylabel('8')
threedee.set_zlabel('9')
plt.show()


