High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Such a network is called an autoencoder.
Autoencoders are a nonlinear dimensionality reduction technique (Hinton & Salakhutdinov, 2006) used for unsupervised feature learning, and they can learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.
If enough training data resembling some underlying pattern is provided, we can train the network to learn those patterns. An anomalous test point is one that does not match the typical patterns, so the autoencoder will likely reconstruct it with a high error, revealing the anomaly.
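As a minimal, framework-agnostic sketch of this idea (here encode and decode are hypothetical stand-ins for the two halves of a trained network):
In [ ]:
import numpy as np

def reconstruction_mse(X, encode, decode):
    # Row-wise mean squared error between each input and its reconstruction.
    X_hat = decode(encode(X))
    return np.mean((X - X_hat) ** 2, axis=1)

# Points whose error exceeds a threshold chosen from normal data are flagged:
# is_anomaly = reconstruction_mse(X_test, encode, decode) > threshold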
This framework is used to develop an anomaly detection demonstration using a deep autoencoder. The dataset is an ECG time series of heartbeats, and the goal is to determine which heartbeats are outliers. The training data (20 “good” heartbeats) and the test data (the training data with 3 “bad” heartbeats appended, for simplicity) can be downloaded directly into the H2O cluster, as shown below. Each row represents a single heartbeat.
In [4]:
import h2o
from h2o.estimators.deeplearning import H2OAutoEncoderEstimator
h2o.init()
In [5]:
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import os.path
PATH = os.path.expanduser("~/h2o-3/")
In [6]:
train_ecg = h2o.import_file(PATH + "smalldata/anomaly/ecg_discord_train.csv")
test_ecg = h2o.import_file(PATH + "smalldata/anomaly/ecg_discord_test.csv")
Let's explore the dataset.
In [7]:
train_ecg.shape
Out[7]:
(20, 210)
In [8]:
# transpose the frame so each time series becomes a single column for plotting
train_ecg.as_data_frame().T.plot(legend=False, title="ECG Train Data", color='blue'); # don't display the legend
In the training data we have 20 time series, each with 210 data points. Notice that all the lines are compact and follow a similar shape. It is important to remember that autoencoders should be trained only on VALID data: all anomalies should be removed from the training set.
Now let's use the H2OAutoEncoderEstimator to train our neural network.
In [15]:
model = H2OAutoEncoderEstimator(
    activation="Tanh",    # bounded nonlinearity; a common choice for autoencoders
    hidden=[50],          # one hidden layer of 50 units: the low-dimensional code
    l1=1e-5,              # light L1 regularization on the weights
    score_interval=0,     # no minimum time between scoring events
    epochs=100            # passes over the training data
)
model.train(x=train_ecg.names, training_frame=train_ecg)
In [16]:
model
Out[16]:
Our neural network is now able to encode the time series.
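The learned 50-dimensional codes themselves can be inspected with the model's deepfeatures method, as in this sketch (layer indices are 0-based):
In [ ]:
# Extract the activations of the 50-unit middle layer (layer index 0)
# for each test heartbeat -- these are the learned low-dimensional codes.
codes = model.deepfeatures(test_ecg, 0)
codes.shape   # expected: 23 rows (heartbeats) x 50 columns (code dimensions)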
Now we compute the reconstruction error with the anomaly detection function. This is the mean squared error (MSE) between the input and output layers. A low error means the neural network encodes that input well, i.e., it is a "known" case; a high error means the network has not seen a similar example before, so the point is likely an anomaly.
In [17]:
reconstruction_error = model.anomaly(test_ecg)
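For intuition, a similar per-row score can be computed by hand: on an autoencoder model, predict() returns the reconstruction, and the MSE is taken row-wise against the input. A sketch (note that H2O computes Reconstruction.MSE on its internally standardized data, so this raw-space version will not match the numbers exactly):
In [ ]:
# Approximate the per-row reconstruction error by hand.
reconstruction = model.predict(test_ecg).as_data_frame()
original = test_ecg.as_data_frame()
# Row-wise mean squared error in the raw (unstandardized) space.
manual_mse = ((original.values - reconstruction.values) ** 2).mean(axis=1)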
Now the question is: which of the test_ecg time series are most likely anomalies?
We can rank the rows by reconstruction error and select the top N with the highest values.
In [18]:
df = reconstruction_error.as_data_frame()
In [19]:
df['Rank'] = df['Reconstruction.MSE'].rank(ascending=False)
In [20]:
df_sorted = df.sort_values('Rank')
df_sorted
Out[20]:
In [21]:
anomalies = df_sorted[ df_sorted['Reconstruction.MSE'] > 1.0 ]
anomalies
Out[21]:
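The 1.0 cutoff above was read off the sorted table by eye. One common alternative, sketched below as an assumption rather than part of the original demo, is to derive the threshold from the reconstruction error on the all-valid training data:
In [ ]:
# Heuristic: flag test points whose error exceeds a high quantile
# of the training reconstruction error (99th percentile assumed here).
train_error = model.anomaly(train_ecg).as_data_frame()
threshold = train_error['Reconstruction.MSE'].quantile(0.99)
anomalies_q = df_sorted[df_sorted['Reconstruction.MSE'] > threshold]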
In [22]:
data = test_ecg.as_data_frame()
In [23]:
data.T.plot(legend=False, title="ECG Test Data", color='blue')
Out[23]:
In [24]:
ax = data.T.plot(legend=False, color='blue')
data.T[anomalies.index].plot(legend=False, title="ECG Anomalies in the Data", color='red', ax=ax);
Sometimes, there is much more unlabeled data than labeled data. In this case, it might make sense to train an autoencoder model on the unlabeled data and then fine-tune the learned model with the available labels.
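H2O's deep learning supports this workflow via the pretrained_autoencoder argument of H2ODeepLearningEstimator. A rough sketch, where unlabeled_frame, labeled_frame, and the response column "y" are hypothetical placeholders:
In [ ]:
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

# 1. Pretrain an autoencoder on the plentiful unlabeled data
#    (unlabeled_frame is a hypothetical H2OFrame).
pretrain = H2OAutoEncoderEstimator(activation="Tanh", hidden=[50], epochs=100)
pretrain.train(x=unlabeled_frame.names, training_frame=unlabeled_frame)

# 2. Fine-tune a supervised model initialized from the autoencoder weights;
#    the hidden layout and activation must match the pretrained model.
clf = H2ODeepLearningEstimator(
    activation="Tanh",
    hidden=[50],
    pretrained_autoencoder=pretrain.model_id,
    epochs=20
)
clf.train(x=[c for c in labeled_frame.names if c != "y"], y="y",
          training_frame=labeled_frame)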
See also the H2O autoencoder test for a more complete example: https://github.com/h2oai/h2o-3/blob/master/h2o-py/tests/testdir_algos/deeplearning/pyunit_autoencoderDeepLearning_large.py