This notebook shows how a Deep Learning Auto-Encoder model can be used to find outliers in a dataset.
Consider the following three-layer neural network with one hidden layer and the same number of input neurons (features) as output neurons. The loss function is the MSE between the input and the output. Hence, the network is forced to learn the identity via a nonlinear, reduced representation of the original data. Such an algorithm is called a deep autoencoder; these models have been used extensively for unsupervised, layer-wise pretraining of supervised deep learning tasks, but here we consider the autoencoder's application for discovering anomalies in data.
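For reference, writing x for a 784-pixel input image and x̂ for its reconstruction, the per-sample loss minimized below is the mean squared error
$$\mathrm{MSE}(x, \hat{x}) = \frac{1}{784} \sum_{j=1}^{784} \left( x_j - \hat{x}_j \right)^2,$$
which is exactly the sample_loss expression defined in the code further down.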
We use the well-known MNIST dataset of hand-written digits, where each row contains the 28x28 = 784 gray-scale pixel values (stored in this pickle as floats scaled to the 0-1 range) of a digitized digit (0 to 9).
In [ ]:
import numpy as np
import theano
import lasagne
import matplotlib.pyplot as plt
%matplotlib inline
import gzip
import pickle
In [ ]:
# Seed for reproducibility
np.random.seed(42)
In [ ]:
# Download the MNIST digits dataset (actually, these are already downloaded locally)
# !wget -N --directory-prefix=./data/MNIST/ http://deeplearning.net/data/mnist/mnist.pkl.gz
# Load training and test splits as numpy arrays
train, val, test = pickle.load(gzip.open('data/MNIST/mnist.pkl.gz'), encoding='iso-8859-1')
X_train, y_train = train
# Omit the validation set...
X_test, y_test = test
#X_train[1000][:]
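As a quick sanity check (a sketch; the expected shapes assume the standard mnist.pkl.gz splits of 50,000 training and 10,000 test rows):
In [ ]:
# Each row should be a flattened 28x28 image, with pixel values already in the 0-1 range
print(X_train.shape, X_test.shape)                   # expected: (50000, 784) (10000, 784)
print(X_train.dtype, X_train.min(), X_train.max())   # float32, values between 0 and 1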
In [ ]:
# For training, we want to sample examples at random in small batches - don't care about the 'y_target'
def batch_gen(X, N):
    while True:
        idx = np.random.choice(len(X), N)
        yield X[idx].astype('float32')
We train a Deep Learning Auto-Encoder to learn a compressed (low-dimensional) non-linear representation of the dataset, and hence the intrinsic structure of the training data. The trained auto-encoder is then used to reconstruct every test set image by passing it through the narrow hidden layer. We then find outliers in the test set by comparing the reconstruction of each digit with its original pixel values. The idea is that a high reconstruction error indicates that a test point doesn't conform to the structure of the training data and can hence be called an outlier (a simple percentile-based cutoff is sketched below, once the per-digit errors have been computed).
We train an unsupervised Deep Learning autoencoder model on the training dataset. For simplicity, we use 1 hidden layer of 50 Tanh neurons to create 50 non-linear features with which to reconstruct the original dataset. For now, please accept that 50 hidden units is a reasonable choice...
For simplicity, we train the auto-encoder for only 5 epochs (five passes over the entire training dataset).
In [ ]:
# A very simple network, an autoencoder with a single hidden layer of 50 neurons
l_in = lasagne.layers.InputLayer(shape=(None, 784))
l_hidden = lasagne.layers.DenseLayer(l_in,
                                     num_units=50,
                                     nonlinearity=lasagne.nonlinearities.tanh)
l_out = lasagne.layers.DenseLayer(l_hidden,
                                  num_units=784,
                                  nonlinearity=lasagne.nonlinearities.sigmoid)
In [ ]:
# Symbolic variable for our input features
X_sym = theano.tensor.matrix()
In [ ]:
# Theano expressions for the output distribution and loss vs the original input
output = lasagne.layers.get_output(l_out, X_sym)
# The per-sample loss is the mean squared error over the 784 pixels; the minibatch loss averages this over the batch
sample_loss = theano.tensor.mean(lasagne.objectives.squared_error(output, X_sym), axis=1)
minibatch_loss = theano.tensor.mean(sample_loss)
In [ ]:
# We retrieve all the trainable parameters in our network
params = lasagne.layers.get_all_params(l_out, trainable=True)
# Compute Adam updates for training (scores on right show training speed variation)
updates = lasagne.updates.adam(minibatch_loss, params) # 0.065 ... 0.032
#updates = lasagne.updates.adagrad(minibatch_loss, params) # 0.056 ... 0.037
#updates = lasagne.updates.rmsprop(minibatch_loss, params) # 0.059 ... 0.041
#updates = lasagne.updates.adadelta(minibatch_loss, params) # 0.101 ... 0.065
print(params)
In [ ]:
# We define a training function that will compute the loss, and take a single optimization step
f_train = theano.function([X_sym], minibatch_loss, updates=updates)
# The prediction function takes no optimization step; it returns the reconstruction and the per-sample autoencoder loss
f_predict = theano.function([X_sym], [output, sample_loss])
print("Theano functions created")
In [ ]:
# We'll choose a batch size, and calculate the number of batches in an "epoch"
BATCH_SIZE = 64
N_BATCHES = len(X_train) // BATCH_SIZE
# Minibatch generator for the training set
train_batches = batch_gen(X_train, BATCH_SIZE)
NB: Each epoch should take 10-20 seconds, although the first one may take a little longer...
In [ ]:
for epoch in range(5):
    train_loss = 0
    for _ in range(N_BATCHES):
        X = next(train_batches)
        loss = f_train(X)
        train_loss += loss
    train_loss /= N_BATCHES
    print('Epoch {:2d}, Train loss {:.03f}'.format(epoch, train_loss))
print("DONE")
In [ ]:
test_reconstructed, test_loss = f_predict(X_test)
test_loss.shape
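One simple way to turn these per-digit errors into an explicit outlier flag is to threshold them; the 99th-percentile cutoff below is an arbitrary choice for illustration, not part of the original analysis.
In [ ]:
# Minimal sketch: flag test digits whose reconstruction error exceeds the
# 99th percentile of all test errors (the percentile is an arbitrary choice)
outlier_threshold = np.percentile(test_loss, 99)
outlier_indices = np.where(test_loss > outlier_threshold)[0]
print('Threshold {:.05f}: {} digits flagged as outliers'.format(float(outlier_threshold), len(outlier_indices)))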
In [ ]:
def plot_by_index(X, indices):
    plt.figure(figsize=(12, 3))
    for i in range(len(indices)):
        plt.subplot(1, 12, i+1)
        plt.imshow(X[indices[i]].reshape((28, 28)), cmap='gray', interpolation='nearest')
        plt.axis('off')
Let's look at the test set points with low, median, and high reconstruction errors, visualizing both the original digits and the reconstructions obtained by propagating them through the narrow neural net.
In [ ]:
# Sort the test set indices into reconstruction error order
test_loss_sorted_indices = np.argsort( test_loss )
# Here are the best ones
test_loss[ test_loss_sorted_indices[0:10] ]
In [ ]:
indices = test_loss_sorted_indices[0:12]
plot_by_index(X_test, indices)
plot_by_index(test_reconstructed, indices)
Clearly, well-written 1 digits appear in both the training and test sets, and are easy for the autoencoder to reconstruct with minimal error. Nothing is as easy as a straight line.
In [ ]:
mid = len(test_loss_sorted_indices)//2
indices = test_loss_sorted_indices[mid-6:mid+6]
plot_by_index(X_test, indices)
plot_by_index(test_reconstructed, indices)
These test set digits look "normal": they plausibly resemble digits from the training data to a large extent, but they have some particularities that cause a moderate reconstruction error.
In [ ]:
indices = test_loss_sorted_indices[-12:]
plot_by_index(X_test, indices)
plot_by_index(test_reconstructed, indices)
Now here are some pretty ugly digits that are plausibly not common in the training data - some are even hard for humans to classify.
See:
https://github.com/h2oai/h2o-training-book/blob/master/hands-on_training/anomaly_detection.md
https://github.com/mikesj-public/convolutional_autoencoder/blob/master/mnist_conv_autoencode.ipynb
https://cs.stanford.edu/people/karpathy/convnetjs/demo/autoencoder.html
- What is the test set error before network training? (a sketch follows this list)
- Check whether 20 hidden units or 100 hidden units would have been a better choice
- See how the learning progresses using adadelta updates
- Try adding your own example digits to ./images/mnist/ - there's a template .png file there to start - and have a look at the reconstruction errors
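A possible starting point for the first two exercises (a sketch, not part of the original notebook; it simply re-uses the lasagne/theano calls from the cells above, parameterised by the hidden-layer size):
In [ ]:
# Sketch: rebuild the autoencoder with a configurable hidden-layer size and
# recompile the training/prediction functions, re-using the calls from above
def build_autoencoder(num_hidden=50):
    l_in = lasagne.layers.InputLayer(shape=(None, 784))
    l_hid = lasagne.layers.DenseLayer(l_in, num_units=num_hidden,
                                      nonlinearity=lasagne.nonlinearities.tanh)
    return lasagne.layers.DenseLayer(l_hid, num_units=784,
                                     nonlinearity=lasagne.nonlinearities.sigmoid)

def compile_functions(l_out, update_fn=lasagne.updates.adam):
    X_sym = theano.tensor.matrix()
    output = lasagne.layers.get_output(l_out, X_sym)
    sample_loss = theano.tensor.mean(
        lasagne.objectives.squared_error(output, X_sym), axis=1)
    minibatch_loss = theano.tensor.mean(sample_loss)
    params = lasagne.layers.get_all_params(l_out, trainable=True)
    updates = update_fn(minibatch_loss, params)
    f_train = theano.function([X_sym], minibatch_loss, updates=updates)
    f_predict = theano.function([X_sym], [output, sample_loss])
    return f_train, f_predict

# Exercise 1: test set error before any training steps have been taken
f_train_20, f_predict_20 = compile_functions(build_autoencoder(num_hidden=20))
_, untrained_loss = f_predict_20(X_test)
print('Untrained (20 hidden units) mean test loss: {:.03f}'.format(float(untrained_loss.mean())))
# Exercises 2 and 3: train this network with the same loop as above (or swap in
# update_fn=lasagne.updates.adadelta) and compare the resulting test losses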
In [ ]:
im = plt.imread('./images/mnist/template_28x28.png')
plt.imshow(im, 'gray')
In [ ]:
import os
image_dir = './images/mnist/'
image_files = ['%s/%s' % (image_dir, f) for f in os.listdir(image_dir)
               if (f.lower().endswith('png') or f.lower().endswith('jpg'))]
# Read each image and flatten it to a 784-vector (this assumes grayscale 28x28
# images; plt.imread returns PNG pixel values as floats in the 0-1 range)
v = []
for f in image_files:
    im = plt.imread(f)
    #print("Image File:%s" % (f,))
    v.append(im.flatten())
# Stack into a float32 matrix, as expected by the compiled theano function
v = np.array(v, dtype='float32')
v_reconstructed, v_loss = f_predict(v)
v_all_indices = np.arange(len(v))
plot_by_index(v, v_all_indices)
plot_by_index(v_reconstructed, v_all_indices)
In [ ]: