Anomaly Detection Problems

Multivariate Approach

This data contains two metrics for a particular server. Anomalous behavior can indicate various types of problems. The goal is to create a system that monitors a live feed of this data and detects anomalies.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

1 - Check that the training data is suitable for a multivariate modeling approach (multivariate_tr.csv & multivariate_ts.csv)

  • i.e. check that the data appears to be from a multivariate Gaussian distribution
  • you don't need to get too fancy with tests here

In [2]:
Xtrain = pd.read_csv('./data/multivariate_tr.csv')
Xtrain.head()


Out[2]:
         X1        X2
0  4.403684  1.625136
1  4.187460  0.537484
2  3.481319  2.871732
3  4.610979  1.705542
4  7.742105  3.569791

In [3]:
Xtrain.describe()


Out[3]:
                X1           X2
count  5000.000000  5000.000000
mean      5.355991     2.245740
std       1.856961     1.058475
min      -1.232979    -1.003655
25%       4.125419     1.523453
50%       5.353521     2.150443
75%       6.598291     2.849264
max      13.317971     8.477483

In [4]:
Xtest = pd.read_csv('./data/multivariate_ts.csv')
Xtest.head()


Out[4]:
         X1        X2
0  3.333215  3.305589
1  5.494974  2.078626
2  7.447633  3.108583
3  4.853434  2.684147
4  3.295403  3.020483

In [5]:
Xtest.describe()


Out[5]:
                  X1            X2
count  25408.000000  25408.000000
mean       5.322931      2.461844
std        1.974247      1.054347
min       -1.643433     -1.467681
25%        3.992453      1.749390
50%        5.298202      2.373602
75%        6.649019      3.064187
max       14.298211      8.973606

In [6]:
from scipy.stats import multivariate_normal

def check_mult_norm(data, levels):
    cov = np.cov(data.values.T)
    mu = np.mean(data).values
    
    x = np.linspace(min(data.iloc[:, 0])-0.5, max(data.iloc[:, 0])+0.5, 1000)
    y = np.linspace(min(data.iloc[:, 1])-0.5, max(data.iloc[:, 1])+0.5, 1000)
    x, y = np.meshgrid(x, y)
    
    mult = multivariate_normal(mean=mu, cov=cov)
    z = mult.pdf(np.array([x.ravel(), y.ravel()]).T).reshape(x.shape)
    
    plt.figure(figsize=(16, 10))
    plt.scatter(data.iloc[:, 0], data.iloc[:, 1])
    plt.contour(x, y, z, levels=np.arange(0, 0.5, 0.005))

In [7]:
check_mult_norm(Xtrain, np.arange(0, 0.5, 0.005))



In [8]:
check_mult_norm(Xtest, np.arange(0, 0.5, 0.005))
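
Beyond eyeballing the contours, a quick quantitative complement (a sketch; this helper is not part of the exercise) uses the fact that squared Mahalanobis distances of points drawn from a bivariate Gaussian approximately follow a $\chi^2_2$ distribution, so a QQ-plot against chi-squared quantiles should hug the 45-degree line:

In [ ]:
from scipy.stats import chi2

def mahalanobis_qq(data):
    # Squared Mahalanobis distance of each point from the sample mean
    mu = data.mean().values
    cov_inv = np.linalg.inv(np.cov(data.values.T))
    diff = data.values - mu
    d2 = np.sort(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))

    # Theoretical chi-squared quantiles (df = number of columns)
    n, d = data.shape
    theo = chi2.ppf((np.arange(1, n + 1) - 0.5) / n, df=d)

    plt.figure(figsize=(8, 8))
    plt.scatter(theo, d2, s=5)
    plt.plot(theo, theo, 'r--')  # 45-degree reference line
    plt.xlabel('Chi-squared quantiles')
    plt.ylabel('Squared Mahalanobis distances')

mahalanobis_qq(Xtrain)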


2 - Check for any drift in the training data mean and std over time

  • if there is drift, determine what span of time (window) should be used for modeling incoming data

In [9]:
def calc_rolling_var(data, variable, variable_pos):
    # Expanding (cumulative) statistics: each point uses all observations so far
    rolling_mean = data[variable].cumsum() / np.arange(1, len(data)+1)
    rolling_var = []
    for i in np.arange(2, len(data)+1):
        # Population variance (divide by i) of the first i observations
        rolling_var.append(np.sum((data.iloc[:i, variable_pos] - rolling_mean[i-1])**2) / i)
    return rolling_mean, np.array(rolling_var)

rolling_mean_x1, rolling_var_x1 = calc_rolling_var(Xtrain, 'X1', 0)
rolling_mean_x2, rolling_var_x2 = calc_rolling_var(Xtrain, 'X2', 1)
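
For reference, pandas can compute the same expanding statistics directly; `ddof=0` matches the population variance used above (a sketch):

In [ ]:
# Equivalent expanding statistics via pandas
exp_mean_x1 = Xtrain['X1'].expanding().mean()
exp_var_x1 = Xtrain['X1'].expanding().var(ddof=0)  # population variance, as above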

In [10]:
def plot_rollings(data1, data2, window, title, xlab, ylab1, ylab2):
    fig, ax = plt.subplots(2, 1, figsize=(16, 15))
    ax[0].plot(data1[window])
    ax[0].set_xlabel(xlab)
    ax[0].set_ylabel(ylab1)
    ax[0].set_title(title, size=25)
    ax[1].plot(data2[window])
    ax[1].set_xlabel(xlab)
    ax[1].set_ylabel(ylab2)

plot_rollings(rolling_mean_x1, rolling_mean_x2, np.arange(len(Xtrain)), 'Rolling Mean over Time', 'Epochs', 'X1 Measurements', 'X2 Measurements');



In [11]:
plot_rollings(rolling_var_x1, rolling_var_x2, np.arange(len(Xtrain)-1), 'Rolling Variance over Time', 'Epochs', 'X1 Measurements', 'X2 Measurements');



In [12]:
plot_rollings(rolling_mean_x1, rolling_mean_x2, np.arange(3000, 3101), 'Rolling Mean over Time', 'Epochs', 'X1 Measurements', 'X2 Measurements')


I think $N=10$ is a good window, but the reference solution uses $N=1000$ and I'm not sure why. The next cell demonstrates a window of 10; the cells after it use $N=1000$ to match the solution.


In [13]:
# Manual rolling mean with a window of 10 for X1; max() keeps the first
# few windows from wrapping around via negative indices (those entries
# still average fewer than 10 values)
rolling_mean = []
for i in np.arange(1, len(Xtrain)+1):
    rolling_mean.append(np.sum(Xtrain.iloc[max(i-10, 0):i, 0]) / 10)
rolling_mean = np.array(rolling_mean)
rolling_mean[10:20]


Out[13]:
array([ 4.56820008,  4.53174258,  4.83127609,  5.20901821,  5.33377438,
        5.3729503 ,  5.62924027,  5.86830666,  6.31549763,  6.13364457])

In [14]:
def calc_rolling_var_windowed(data, variable_pos, window):
    rolling_mean = []
    for i in np.arange(1, len(data)+1):
        # max() keeps the first windows from wrapping around via negative indices
        rolling_mean.append(np.sum(data.iloc[max(i-window, 0):i, variable_pos]) / window)
    rolling_mean = np.array(rolling_mean)
    rolling_var = []
    for i in np.arange(2, len(data)+1):
        rolling_var.append(np.sum((data.iloc[max(i-window, 0):i, variable_pos] - rolling_mean[i-1])**2) / window)
    return rolling_mean, np.array(rolling_var)

rolling_mean_x1_1000, rolling_var_x1_1000 = calc_rolling_var_windowed(Xtrain, 0, 1000)
rolling_mean_x2_1000, rolling_var_x2_1000 = calc_rolling_var_windowed(Xtrain, 1, 1000)
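
Likewise, the built-in rolling windows give equivalent results (a sketch); note that pandas leaves the first window-1 entries as NaN instead of averaging a partial window:

In [ ]:
rolling_pd = Xtrain.rolling(window=1000)
rolling_mean_pd = rolling_pd.mean()
rolling_var_pd = rolling_pd.var(ddof=0)  # population variance, matching the above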

In [16]:
plot_rollings(rolling_mean_x1_1000, rolling_mean_x2_1000, np.arange(len(rolling_mean_x1_1000)), 'Rolling Mean over Time', 'Epochs', 'X1 Measurements', 'X2 Measurements');



In [15]:
plot_rollings(rolling_var_x1_1000, rolling_var_x2_1000, np.arange(len(rolling_var_x1_1000)), 'Rolling Variance over Time', 'Epochs', 'X1 Measurements', 'X2 Measurements');


3 - Create the following functions

  • one that returns an estimated multivariate normal distribution for a given window of data
  • another that determines if a single observation is an outlier and takes these params:
    • a single observation
    • the estimated distribution object
    • a probability threshold, below which the observation is an outlier

In [18]:
def estd_multi(data):
    # Estimate a multivariate normal from the sample mean and covariance
    cov = np.cov(data.values.T)
    mu = data.mean().values
    return multivariate_normal(mean=mu, cov=cov)

In [19]:
def is_outlier(obs, distr, threshold):
    # Note: this thresholds the pdf value (a density), not a true probability
    prob = distr.pdf(obs)
    if prob < threshold:
        return 1, prob
    else:
        return 0, prob
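
For points far out in the tails the density can underflow to zero; comparing in log space is a slightly more robust variant (a sketch, not required by the exercise):

In [ ]:
def is_outlier_log(obs, distr, threshold):
    # Compare log-densities to avoid floating-point underflow in the tails
    logprob = distr.logpdf(obs)
    return int(logprob < np.log(threshold)), logprob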

In [20]:
multi = estd_multi(Xtrain.iloc[:100, ])
print(is_outlier(Xtrain.iloc[152, ], multi, threshold=0.01))
print(is_outlier(Xtrain.iloc[1523, ], multi, threshold=0.01))


(0, 0.05019734538227362)
(1, 0.00025898842885697503)

4 - Create a function to simulate a live feed (a stream) of the test data

  • returns a boolean list indicating if each observation was an outlier
  • computes these one-by-one as they come in through the simulated feed
  • each decision should be based on the last N observations
    • N is the window size determined in (2)
  • use the last N observations in the test data as a starting point
  • use a threshold of 0.0001

In [21]:
def live_feed(stream_data, past_data, window, threshold=0.0001):
    is_out = []
    for i in np.arange(len(stream_data)):
        # Refit the distribution on the last `window` observations
        data = past_data.iloc[-window:, ]
        multi = estd_multi(data)
        flag, _ = is_outlier(stream_data.iloc[i, ], multi, threshold)
        is_out.append(flag)
        # DataFrame.append was removed in pandas 2.0; concatenate instead
        past_data = pd.concat([past_data, stream_data.iloc[[i]]])
    return is_out, past_data
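
Since `past_data` grows by one row per observation, memory use grows without bound; a bounded variant (a sketch, not part of the exercise) keeps only the last $N$ rows in a deque and makes the same decisions:

In [ ]:
from collections import deque

def live_feed_bounded(stream_data, past_data, window, threshold=0.0001):
    # Keep only the most recent `window` rows instead of growing past_data
    buf = deque(past_data.values[-window:], maxlen=window)
    flags = []
    for row in stream_data.values:
        arr = np.array(buf)
        multi = multivariate_normal(mean=arr.mean(axis=0), cov=np.cov(arr.T))
        flags.append(int(multi.pdf(row) < threshold))
        buf.append(row)  # deque drops the oldest row automatically
    return flags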

In [22]:
past = Xtest.iloc[:100, ]
is_out, past = live_feed(Xtest.iloc[100:200, ], past, 10)

In [23]:
past.iloc[np.append(np.repeat(False, 100), (np.array(is_out)))==1, ]


Out[23]:
            X1        X2
132  11.241290  2.801120
155   6.194804  6.152789
179   5.100967  3.841780

In [24]:
is_out, past = live_feed(Xtest.iloc[200:300, ], past, 10)
past.iloc[np.append(np.repeat(False, 200), (np.array(is_out)))==1, ]


Out[24]:
           X1        X2
233  1.098334  2.882767
286  4.870404  8.778668
296 -0.387701  3.217592

5 - Create a single plot that has

  • scatter plot of the test data highlighting detected outliers
  • contour plot overlaid showing how the distribution pdf varies

In [25]:
past = Xtest.iloc[:1000, ]
is_out, _ = live_feed(Xtest.iloc[1000:, ], past, 1000)

In [26]:
np.append(np.repeat(False, 1000), (np.array(is_out)))[1000:1100]


Out[26]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0])

In [30]:
# Build a boolean outlier mask aligned with the full test set (the first
# 1000 points were the warm-up window and are never flagged)
mask = np.append(np.repeat(False, 1000), np.array(is_out)) == 1

plt.figure(figsize=(16, 10))
plt.scatter(Xtest.iloc[~mask, 0], Xtest.iloc[~mask, 1], marker='o')
plt.scatter(Xtest.iloc[mask, 0], Xtest.iloc[mask, 1], marker='x')

# Plotting a contour for every window would be unreadable; the commented-out
# loop below sketches 10 sample windows, while the active code instead
# overlays the pdf of a single global fit to the test data.
x = np.linspace(min(Xtest.iloc[:, 0])-0.5, max(Xtest.iloc[:, 0])+0.5, 1000)
y = np.linspace(min(Xtest.iloc[:, 1])-0.5, max(Xtest.iloc[:, 1])+0.5, 1000)
x, y = np.meshgrid(x, y)

# for i in range(10):
#     cov = np.cov(Xtest.iloc[i*2550:i*2550+10, ].values.T)
#     mu = np.mean(Xtest.iloc[i*2550:i*2550+10, ]).values
#     mult = multivariate_normal(mean=mu, cov=cov)
#     z = mult.pdf(np.array([x.ravel(), y.ravel()]).T).reshape(x.shape)
#     plt.contour(x, y, z, levels=[0.0001]);

cov = np.cov(Xtest.values.T)
mu = Xtest.mean().values

mult = multivariate_normal(mean=mu, cov=cov)
z = mult.pdf(np.array([x.ravel(), y.ravel()]).T).reshape(x.shape)

plt.contour(x, y, z);


6 - How do you think the threshold could be optimized?

One option is to posit an expected anomaly rate for the data and search for the threshold that produces roughly that rate of detections, perhaps using the training data for the search.

You could also optimize both parameters (threshold and window size) jointly, aiming to avoid flagging points that are not extreme with respect to the global multivariate normal distribution fitted to the training set.
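
A rough sketch of the first idea (the 1% target rate is an assumption for illustration): sweep candidate thresholds on the training data and pick the one whose flagged fraction best matches the assumed anomaly rate.

In [ ]:
target_rate = 0.01  # assumed anomaly rate, for illustration only
global_fit = estd_multi(Xtrain)
densities = global_fit.pdf(Xtrain.values)

candidates = np.logspace(-6, -1, 50)
rates = np.array([(densities < t).mean() for t in candidates])
best_threshold = candidates[np.argmin(np.abs(rates - target_rate))]
print(best_threshold)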

Detect Anomalous Number Of Defects

This dataset contains chronological data for a computer system. Each observation simply indicates whether an error occurred at that time (1) or not (0). The system has a fixed expected rate of 5 errors per 1000 observations. The goal is to detect whether an abnormal number of errors occurred in any given time window.

1 - Given the description and goal, what distribution should be used to model this data? (errors.csv)

I would model the error count in each block of $n=1000$ observations with a binomial distribution, $\mathrm{Binomial}(n=1000,\ p=0.005)$, so the expected count per block is 5.
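
As a quick sanity check on this choice (a sketch): the binomial has mean $np = 5$, and $\mathrm{Poisson}(5)$ approximates it closely, so either could serve as the block model.

In [ ]:
from scipy.stats import binom, poisson

# pmf of a few illustrative counts under the binomial and its Poisson approximation
for k in [0, 5, 11]:
    print(k, binom.pmf(k, n=1000, p=0.005), poisson.pmf(k, mu=5))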


In [210]:
errors = pd.read_csv('./data/errors.csv')
errors.head()


Out[210]:
   0
0  0
1  0
2  0
3  0
4  0

In [211]:
errors.describe()


Out[211]:
                   0
count  457999.000000
mean        0.004884
std         0.069717
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         1.000000

2 - Create a script that will simulate this data as a feed

  • number of defects will only be analyzed every 1000 observations
  • each 1000 will count as one block
  • each of these blocks will be classified as anomalous or not (as a whole)
  • use a threshold of 0.01

In [246]:
threshold = 0.01
n = 1000
p = 5 / 1000
n_blocks = int(np.ceil(len(errors) / n))

from scipy.stats import binom
distr = binom(n=n, p=p)

blocks = []
is_out = []
for i in np.arange(n_blocks):
    # Count the errors in this block and flag the block when that count's
    # pmf is below the threshold (i.e. the count is unusually high or low)
    curr_count = errors.iloc[n*i:n*(i+1), 0].sum()
    blocks.append(curr_count)
    is_out.append(distr.pmf(curr_count) < threshold)

print(is_out[:10])
print(blocks[:10])

print(is_out[:10])
print(blocks[:10])


[False, False, False, False, False, False, False, False, False, False]
[7, 7, 5, 8, 7, 5, 6, 1, 2, 4]
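
The point-pmf test above is simple, but an alternative (a sketch of a design variation, not part of the exercise) is a two-sided tail test that flags a block when its count is extreme in either direction:

In [ ]:
def tail_outlier(count, distr, alpha=0.01):
    # Flag when P(X <= count) or P(X >= count) falls below alpha/2
    return min(distr.cdf(count), distr.sf(count - 1)) < alpha / 2

print([tail_outlier(b, distr) for b in blocks[:10]])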

3 - Create a plot with the following properties

  • x-axis: block number
  • y-axis: number of errors in the block
  • highlight blocks labeled as anomalies

In [271]:
fig, ax = plt.subplots(figsize=(16, 10))
ax.bar(np.arange(n_blocks), blocks, color=['red' if out else 'blue' for out in is_out])
ax.set_title('Errors per Block of 1000 Obs')
ax.set_xlabel('Block No.')
ax.set_ylabel('Errors')
ax.set_xlim(-1.5, n_blocks+0.5)
# Dashed line marks the upper anomaly cutoff: counts of 11 or more have
# pmf below 0.01 (a count of 0 is also flagged, at the low end)
ax.hlines(10.5, -1.5, n_blocks+0.5, color='r', alpha=0.7, linestyles='--')
ax.xaxis.set_major_locator(plt.MaxNLocator(20));


Anomaly Detection Using Cluster Analysis

This is a very common dataset (slightly modified) of three varieties of the iris flower; it will work well for clustering.


In [272]:
iris_train = pd.read_csv('./data/iris_train.csv')
iris_test = pd.read_csv('./data/iris_test.csv')

In [273]:
iris_train.head()


Out[273]:
     0    1    2    3  label
0  5.5  3.5  1.3  0.2      0
1  5.0  3.5  1.6  0.6      0
2  5.8  2.7  4.1  1.0      1
3  5.2  3.5  1.5  0.2      0
4  6.7  3.0  5.2  2.3      2

In [275]:
iris_train.describe()


Out[275]:
               0           1           2           3       label
count  100.00000  100.000000  100.000000  100.000000  100.000000
mean     5.73800    3.088000    3.525000    1.121000    0.920000
std      0.82887    0.430006    1.784239    0.778407    0.837022
min      4.30000    2.000000    1.000000    0.100000    0.000000
25%      5.00000    2.875000    1.500000    0.275000    0.000000
50%      5.70000    3.050000    4.000000    1.300000    1.000000
75%      6.40000    3.325000    5.100000    1.800000    2.000000
max      7.90000    4.400000    6.700000    2.500000    2.000000

1 - Fit the training data with k-means clustering (without labels), using a number of clusters determined by the label values


In [284]:
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3)  # note: cluster labels are arbitrary and may differ between runs
km.fit(iris_train.iloc[:, :4])
print(km.labels_)
print(km.cluster_centers_)


[1 1 0 1 2 0 1 2 2 1 2 2 0 1 2 2 0 0 2 2 0 0 1 2 0 0 1 2 0 1 1 2 1 2 2 2 1
 0 1 1 0 1 1 0 2 0 2 2 1 0 2 0 1 0 2 0 1 0 1 0 0 1 0 1 1 1 2 2 1 2 1 1 2 1
 0 1 2 1 2 2 0 1 2 1 0 2 2 2 1 0 1 2 2 1 1 1 2 1 0 1]
[[ 5.7         2.67777778  4.12222222  1.29259259]
 [ 4.97692308  3.37692308  1.44871795  0.25128205]
 [ 6.64117647  3.08235294  5.43235294  1.98235294]]

2 - Get the 98th percentile for distances from cluster centers for the training data

  • one value for each label/class

In [317]:
# Look up each point's assigned cluster center, then take Euclidean distances
centers = np.array([list(km.cluster_centers_[lab]) for lab in km.labels_])
distances = np.sqrt(np.sum((iris_train.iloc[:, :4] - centers)**2, axis=1))
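
As an aside, NumPy fancy indexing gives the same per-point centers and distances more compactly (a sketch):

In [ ]:
centers_alt = km.cluster_centers_[km.labels_]  # one center row per training point
distances_alt = np.linalg.norm(iris_train.iloc[:, :4].values - centers_alt, axis=1)
assert np.allclose(distances, distances_alt)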

In [323]:
percentiles = np.array([np.percentile(distances[km.labels_==i], 98) for i in range(3)])

In [324]:
percentiles


Out[324]:
array([ 1.25600049,  1.20603178,  1.76679386])

3 - Use these percentiles to determine outliers in the test data, then create a 3D plot of the first three attributes with the outliers highlighted


In [328]:
# A test point is an outlier when it is farther from its assigned center
# than that cluster's 98th-percentile training distance
pred = km.predict(iris_test.iloc[:, :4])
centers = np.array([list(km.cluster_centers_[lab]) for lab in pred])
distances = np.sqrt(np.sum((iris_test.iloc[:, :4] - centers)**2, axis=1))
is_out = [distances[i] > percentiles[lab] for i, lab in enumerate(pred)]

In [337]:
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(16, 10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(iris_test.iloc[:, 0],
           iris_test.iloc[:, 1],
           iris_test.iloc[:, 2],
           color=['red' if out else 'blue' for out in is_out]);