In [1]:
import pandas as pd
from pandas import DataFrame as DF, Series

import numpy as np

import scipy.stats as st

import matplotlib.pyplot as plt
%matplotlib inline

Multivariate Approach

This data contains two metrics for a particular server. Anomalous behavior can be an indication of various types of problems. The goal is to create a system that can handle a live feed of data to detect anomalies.

  1. Check that the training data is suitable for a multivariate modeling approach
    • i.e. check that the data appears to be from a multivariate gaussian distribution
    • you don't need to get too fancy with tests here
  2. Check for any drift in the training data mean and std over time
    • if there is drift, determine what span of time (window) should be used for modeling incoming data
  3. Create the following functions
    • one that returns an estimated multivariate normal distribution for a given window of data
    • another that determines if a single observation is an outlier and takes these params:
      • a single observation
      • the estimated distribution object
      • a probability threshold, below which an observation is an outlier
  4. Create a function to simulate a live feed (a stream) of the test data
    • returns a boolean list indicating if each observation was an outlier
    • computes these one-by-one as they come in through simulated feed
    • the decision for each observation should be based on the last N observations
      • N is the window size determined in (2)
    • use the last N training observations as a starting point for the stream
    • use a threshold of 0.0001
  5. Create a single plot that has
    • scatter plot of the test data highlighting detected outliers
    • contour plot overlaid showing how the distribution pdf varies
  6. How do you think the threshold could be optimized?

In [2]:
# read data
data_tr = pd.read_csv('multivariate_tr.csv')   # training data
data = pd.read_csv('multivariate_ts.csv')      # test data

1. Check that the training data is suitable for a multivariate modeling approach

  • i.e. check that the data appears to be from a multivariate gaussian distribution
  • you don't need to get too fancy with tests here

In [3]:
data_tr.describe()


Out[3]:
                X1           X2
count  5000.000000  5000.000000
mean      5.355991     2.245740
std       1.856961     1.058475
min      -1.232979    -1.003655
25%       4.125419     1.523453
50%       5.353521     2.150443
75%       6.598291     2.849264
max      13.317971     8.477483

In [4]:
from pandas.plotting import scatter_matrix

scatter_matrix(data_tr, hist_kwds={'bins':25}, figsize=(10,7));
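
As a quick, informal check beyond the scatter matrix (an addition; note that gaussian marginals are necessary but not sufficient for multivariate normality), each column can be run through the D'Agostino-Pearson test and a normal QQ-plot. With n = 5000, even small departures reach statistical significance, so treat the results qualitatively:

# marginal normality checks: test statistic plus a QQ-plot per column
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for axis, col in zip(axes, data_tr.columns):
    print(col, st.normaltest(data_tr[col]))
    st.probplot(data_tr[col], dist='norm', plot=axis)
    axis.set_title(col)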


2. Check for any drift in the training data mean and std over time

  • if there is drift, determine what span of time (window) should be used for modeling incoming data

In [5]:
# rolling mean over a 1000-observation window to check for drift over time
data_tr.rolling(window=1000).mean().dropna().plot(figsize=(10,5));
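
The task also asks about drift in the standard deviation; a matching rolling check (an addition, reusing the same 1000-observation window that the streaming code below adopts):

# rolling standard deviation over the same window, to check for variance drift
data_tr.rolling(window=1000).std().dropna().plot(figsize=(10,5));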


3. Create the following functions

  • one that returns an estimated multivariate normal distribution for a given window of data
  • another that determines if a single observation is an outlier and takes these params:
    • a single observation
    • the estimated distribution object
    • a probability threshold, below which an observation is an outlier

In [6]:
# estimate and return multivariate gaussian distribution
def multi_gaus_dist(window):
    means = window.mean()
    cov = window.cov()
    dist = st.multivariate_normal(means, cov)
    return dist

# check if an observation is an outlier: its density under the
# estimated distribution falls below the probability threshold
def isoutlier(obs, dist, thresh=0.0001):
    return dist.pdf(obs) < thresh

In [7]:
# fit on the full training set and count test-set outliers
# (itertuples yields (Index, X1, X2); point[1:] drops the index)
dist = multi_gaus_dist(data_tr)
sum(isoutlier(point[1:], dist) for point in data.itertuples())


Out[7]:
178

4. Create a function to simulate a live feed (a stream) of the test data

  • returns a boolean list indicating if each observation was an outlier
  • computes these one-by-one as they come in through simulated feed
  • the decision for each observation should be based on the last N observations
    • N is the window size determined in (2)
  • use the last N training observations as a starting point for the stream
  • use a threshold of 0.0001

In [8]:
# seed the stream with the last N = 1000 training observations, then append the test data
combined_df = pd.concat([data_tr.iloc[-1000:], data], axis=0)

In [9]:
# stream: for each incoming observation, fit a gaussian to the
# previous 1000 observations and test the new point against it
anomaly = []
for i in range(1000, len(combined_df)):
    window = combined_df.iloc[i-1000:i]   # last N = 1000 observations
    dist = multi_gaus_dist(window)
    obs = combined_df.iloc[i]
    # note: 0.00001 is stricter than the 0.0001 stated in the task
    anomaly.append(isoutlier(obs, dist, thresh=0.00001))

In [10]:
mask = np.array(anomaly)
outliers = data[mask]
inliers = data[~mask]

In [11]:
outliers.shape, inliers.shape


Out[11]:
((60, 2), (25348, 2))

5. Create a single plot that has

  • scatter plot of the test data highlighting detected outliers
  • contour plot overlaid showing how the distribution pdf varies

In [12]:
# estimate a distribution from the full test set for the contour overlay
mu = data.mean()
cov = data.cov()
mvd = st.multivariate_normal(mu, cov)

# evaluation grid spanning the data range, padded by 1 on each side
mesh = np.mgrid[data.X1.min()-1:data.X1.max()+1:.01,
                data.X2.min()-1:data.X2.max()+1:.01]
pos = np.dstack(mesh)

plt.figure(figsize=(14,10))
cs = plt.contour(mesh[0], mesh[1], mvd.pdf(pos), 10, cmap='bone')
plt.clabel(cs, inline=1, fontsize=12)

# scatter the test data, coloring detected outliers red
colors = pd.Series(anomaly).map({True: 'red', False: '#cccccc'})
plt.scatter(data['X1'], data['X2'], c=colors, alpha=0.5)
plt.xlabel('X1')
plt.ylabel('X2');


6. How do you think the threshold could be optimized?

One way to optimize the threshold is to use historical data to label observations that, in hindsight, qualify as anomalies. With those labels the threshold becomes a tunable hyperparameter: sweep candidate values (e.g. with cross-validation) and keep the one that maximizes a metric such as the F1-score, which balances precision and recall.
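
A minimal sketch of that idea, assuming a hypothetical boolean array y_true of hindsight labels and an array pdf_vals of per-observation density values (neither exists in this notebook):

from sklearn.metrics import f1_score

def best_threshold(y_true, pdf_vals, candidates=np.logspace(-8, -2, 25)):
    # flag observations whose density falls below each candidate threshold,
    # score against the hindsight labels, and keep the best threshold
    scores = [f1_score(y_true, pdf_vals < t) for t in candidates]
    return candidates[int(np.argmax(scores))]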

Detect Anomalous Number Of Defects

This dataset contains chronological data for a computer system. The data simply indicates whether an error occurred at a given time: a 1 if so, otherwise a 0. The system has a fixed expected rate of 5 errors per thousand observations. The goal is to detect whether any given time window contains an abnormal number of errors.

  1. Given the description and goal, what distribution should be used to model this data?
  2. Create a script that will simulate this data as a feed
    • number of defects will only be analyzed every 1000 observations
    • each 1000 will count as one block
    • each of these blocks will be classified as anomalous or not (as a whole)
    • use a threshold of 0.01
  3. Create a plot with the following properties
    • x-axis: block number
    • y-axis: number of errors in the block
    • highlight blocks labeled as anomalies

In [13]:
data = np.loadtxt('errors.csv')
data


Out[13]:
array([ 0.,  0.,  0., ...,  0.,  0.,  0.])

1. Given the description and goal, what distribution should be used to model this data?

The Poisson distribution. Each 1000-observation block is a sum of many independent Bernoulli trials with a small success probability (p ≈ 0.005), so block counts follow a Binomial(1000, p) distribution, which is well-approximated by a Poisson distribution with rate λ = np = 5.
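
A quick numerical check of that approximation (an addition; the printed value is illustrative):

# compare Binomial(1000, 0.005) against its Poisson(5) approximation
k = np.arange(16)
binom_pmf = st.binom(1000, 0.005).pmf(k)
pois_pmf = st.poisson(mu=5).pmf(k)
print(np.abs(binom_pmf - pois_pmf).max())  # small, on the order of 1e-3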

2. Create a script that will simulate this data as a feed and label anomalies

  • number of errors will only be analyzed every 1000 observations
  • each 1000 will count as one block
  • each of these blocks will be classified as anomalous or not (as a whole)
  • use a threshold of 0.01

In [14]:
# sanity check: the empirical rate is ~5 errors per 1000-observation block
1e3*data.mean(), data.reshape(-1, 1000).sum(axis=1).mean()


Out[14]:
(4.8842794759825328, 4.8842794759825328)

In [15]:
anomalies = []
fullset = []
thresh = 0.01
mu = 1e3 * data.mean()   # expected errors per 1000-observation block
rv = st.poisson(mu=mu)
n = 0                    # running error count for the current block
for i, v in enumerate(data):
    # at each block boundary, test the completed block's count
    # (i == 0 also triggers this, so an empty block 0 is recorded before any data)
    if i % 1000 == 0:
        p = rv.pmf(n)
        fullset.append((i // 1000, n))
        if p < thresh:
            anomalies.append((i // 1000, n, p))
        n = 0
    if v == 1:
        n += 1
print('Stream Complete')
print('Stream Complete')


Stream Complete
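
As a vectorized cross-check on the streaming loop (an addition; the loop above records each block's count under index i//1000, so its block numbers sit one past this zero-based split):

# split the feed into contiguous blocks of 1000 and test all counts at once
counts = data[:len(data) // 1000 * 1000].reshape(-1, 1000).sum(axis=1)
flagged = np.where(rv.pmf(counts) < thresh)[0]
print(flagged, counts[flagged])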

In [16]:
anomalies = DF(anomalies, columns=['block','n_errors','p-value']).astype({'block': int})
fullset = DF(fullset, columns=['block','n_errors']).astype({'block': int})

In [17]:
anomalies


Out[17]:
   block  n_errors   p-value
0      0         0  0.007565
1     59        13  0.001094
2    122        11  0.007152
3    179         0  0.007565
4    343         0  0.007565
5    373        11  0.007152
6    386        11  0.007152

In [18]:
fullset.head()


Out[18]:
   block  n_errors
0      0         0
1      1         7
2      2         7
3      3         5
4      4         8

3. Create a plot with the following properties

  • x-axis: block number
  • y-axis: number of errors in the block
  • highlight blocks labeled as anomalies

In [19]:
# split block counts into anomalous and normal for plotting
outliers = fullset[fullset.block.isin(anomalies.block)].n_errors
inliers = fullset[~fullset.block.isin(anomalies.block)].n_errors

outliers.plot(figsize=(16,8), style='ro')
inliers.plot(figsize=(16,8), style='x');


Anomaly Detection Using Cluster Analysis

This is a very common dataset (slightly modified) of three varieties of the iris flower; it will work well for clustering.

  1. Fit the training data using k-means clustering (without labels), with the number of clusters determined by the label values
  2. Get the 98th percentile for distances from cluster centers for the training data
    • one value for each label/class
  3. Use these percentiles to determine outliers in the test data and create a 3d plot using the first three attributes and highlight outliers

In [20]:
# read data
train = pd.read_csv('iris_train.csv')
test = pd.read_csv('iris_test.csv')
# import k-means
from sklearn.cluster import KMeans as KM

1. Fit the training data using k-means clustering (without labels), with the number of clusters determined by the label values


In [21]:
Xtr = train.iloc[:, :-1]
ytr = train.iloc[:, -1]

Xts = test.iloc[:, :-1]
yts = test.iloc[:, -1]

In [22]:
ytr.value_counts()


Out[22]:
0    39
2    31
1    30
Name: label, dtype: int64

In [23]:
km = KM(n_clusters=3)
km.fit(Xtr)


Out[23]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

2. Get the 98th percentile for distances from cluster centers for the training data

  • one value for each label/class

In [24]:
# get 98th percentile for distances from each center
# transform converts the observations to distance space
quants = DF(km.transform(Xtr)).quantile(0.98)
quants


Out[24]:
0    5.006726
1    5.537761
2    3.523649
Name: 0.98, dtype: float64
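
One subtlety: the quantile above is computed over all training points for each center's distance column, not only the points assigned to that cluster. A per-cluster alternative (a sketch, not what the notebook uses downstream):

# 98th percentile of distance-to-own-center, per assigned cluster
dists_tr = DF(km.transform(Xtr))
per_cluster = pd.Series({
    c: dists_tr.loc[km.labels_ == c, c].quantile(0.98)
    for c in range(km.n_clusters)
})
per_cluster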

In [25]:
# get the distances of the test data from centers
distances_ts = DF(km.transform(Xts))

3. Use these percentiles to determine outliers in the test data and create a 3d plot using the first three attributes and highlight outliers


In [26]:
# outliers: farther from every center than that center's 98th percentile
outlier = (distances_ts > quants).all(axis=1)
outliers = Xts[outlier]
inliers = Xts[~outlier]

In [27]:
outliers


Out[27]:
       0    1     2    3
1    9.7  7.5   9.5  5.5
25  12.5  8.1   9.8  3.7
38  10.4  7.9  11.1  5.3

In [28]:
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(1, figsize=(14, 8))
# create 3d axis object to use and set rotation for good view
ax = Axes3D(fig, elev=-150, azim=3)
ax.scatter(outliers['1'], outliers['2'], outliers['3'], c='r')
# map label numbers to color values
colors = yts[yts.index.isin(inliers.index)].map({0:'b', 1:'g', 2:'k'})
ax.scatter(inliers['1'], inliers['2'], inliers['3'], c=colors);