In [9]:
import pyisc
import numpy as np
from scipy.stats import poisson
%matplotlib inline
from pylab import hist, plot, figure
In [10]:
po_normal = poisson(10)
po_anomaly = poisson(25)
freq_normal = po_normal.rvs(10000)
freq_anomaly = po_anomaly.rvs(15)
Create a 2D array with two columns: the random frequencies and a time period that is always equal to 1.
In [11]:
data = np.column_stack([
    list(freq_normal) + list(freq_anomaly),
    [1.0] * (len(freq_normal) + len(freq_anomaly))
])
data[:5]
Out[11]:
If we plot a histogram of the frequency data, we can see that the right tail of the distribution is heavier because of the anomalous data points:
In [12]:
hist(data.T[0],100);
Create an anomaly detector that uses the P_PoissonOnesided statistical model; that is, we model the data with a Poisson distribution but only treat unusually large frequencies as anomalous. The model is told which columns of the data array to read: frequency_column=0 is the column index of the frequencies and period_column=1 is the column index of the period:
In [13]:
anomaly_detector = pyisc.AnomalyDetector(
    pyisc.P_PoissonOnesided(frequency_column=0, period_column=1)
)
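To build some intuition for what a one-sided Poisson model measures, here is a minimal sketch in plain Python. It is not pyisc's implementation and will not reproduce pyisc's scores; the function names, the maximum-likelihood rate estimate, and the choice of the negative log of the upper-tail probability as a score are all assumptions made for illustration.

```python
import math

def poisson_pmf(k, rate):
    """P(X = k) for X ~ Poisson(rate), computed in log space for stability."""
    return math.exp(k * math.log(rate) - rate - math.lgamma(k + 1))

def poisson_upper_tail(k, rate):
    """P(X >= k), summed term by term until further terms are negligible."""
    total = 0.0
    i = k
    while True:
        term = poisson_pmf(i, rate)
        total += term
        # Past the mode the terms shrink quickly, so we can stop early.
        if i > rate and term <= 1e-17 * total:
            return total
        i += 1

def toy_onesided_score(frequency, period, rate_per_period):
    """Illustrative score (an assumption, not pyisc's formula):
    the negative log of the probability of seeing a count this large or larger."""
    return -math.log(poisson_upper_tail(frequency, rate_per_period * period))

# Hypothetical training rows of (frequency, period), laid out like `data`.
rows = [(9, 1.0), (11, 1.0), (10, 1.0), (8, 1.0), (12, 1.0)]

# Maximum-likelihood rate: total events divided by total observed time.
rate = sum(f for f, _ in rows) / sum(p for _, p in rows)

typical = toy_onesided_score(10, 1.0, rate)
extreme = toy_onesided_score(25, 1.0, rate)
print(typical < extreme)
```

Only the upper tail enters the score, so an unusually small frequency is never flagged; this is the "one-sided" part of the idea.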
Train the anomaly detector:
In [14]:
%timeit anomaly_detector.fit(data);
Compute the anomaly scores for each data point:
In [17]:
scores = anomaly_detector.anomaly_score(data)
In [18]:
for s in zip(freq_normal[:15], scores[:15]):
    print(s)
The anomalous frequencies vs. anomaly scores:
In [19]:
for s in zip(freq_anomaly, scores[-15:]):
    print(s)
As can be seen above, the anomalous frequencies have higher anomaly scores than the normal frequencies, as expected.
This becomes even more visible if we plot the frequency (x-axis) against anomaly scores (y-axis):
In [20]:
plot(data.T[0], scores, '.');
So, depending on the level at which we want to consider a frequency anomalous, we can set a threshold on the anomaly score and flag every frequency that scores above it.
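As a rough sketch of how such a threshold could be applied, the snippet below flags scores either above a fixed cut-off or above a rank-based quantile of the observed scores. The helper names and the example scores are made up for illustration; in the notebook, the `scores` array returned by `anomaly_score` would take the place of `example_scores`.

```python
def quantile_threshold(scores, q):
    """Return the score found at fraction q (0 <= q <= 1) of the sorted scores."""
    ordered = sorted(scores)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]

def flag_anomalies(scores, threshold):
    """Return the indices of all scores strictly above the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]

# Made-up scores, for illustration only (not output from pyisc).
example_scores = [0.4, 0.9, 1.3, 0.2, 7.5, 0.6, 11.2, 0.8]

# Option 1: a fixed threshold chosen by hand.
fixed_flags = flag_anomalies(example_scores, 5.0)

# Option 2: a threshold derived from the scores themselves.
cutoff = quantile_threshold(example_scores, 0.75)
quantile_flags = flag_anomalies(example_scores, cutoff)

print(fixed_flags, cutoff, quantile_flags)
```

A fixed threshold is easier to reason about, while a quantile-based one adapts to the scale of the scores; which is preferable depends on how many false alarms are acceptable.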
We can also "confuse" the anomaly detector by adding more normal training data closer to the anomalous data:
In [21]:
data2 = np.column_stack([
    poisson(15).rvs(15),
    [1.0] * 15
])
anomaly_detector.fit_incrementally(data2);
In [22]:
scores_ = anomaly_detector.anomaly_score(data)
In [23]:
figure(1)
plot(data.T[0], scores, 'b.')
plot(data.T[0], scores_, 'gx');
Comparing this with the previous plot, we can see that the updated anomaly scores (green crosses) top out below 12, while the original anomaly scores (blue dots) top out below 20. Thus, the anomalous data became less anomalous once the detector had also observed the new data set (data2).