Creating an outlier detector works much like creating an anomaly detector. The difference is that the fraction of outliers (the contamination) is known beforehand, and the output is a prediction of whether each data point is an outlier or not. Consequently, the outlier detector can use the training data to select the threshold that separates outliers from inliers. Below, we use the same data set as in the previous section, but we now know that it contains one anomalous data point (an outlier) and five inliers.
In [4]:
import numpy as np
import pyisc
# Data with an outlier in element 1:
X = [[20, 4], [1200, 130], [12, 8], [27, 8], [-9, 13], [2, -6]]
# Create an outlier detector with the known fraction of outliers: 1 of 6:
outlier_detector = pyisc.SklearnOutlierDetector(
    contamination=1.0/len(X),
    component_models=pyisc.P_Gaussian([0,1])
)
# The outlier detector is trained
outlier_detector.fit(np.array(X))
# Then, the data is classified into being outliers or not:
outlier_detector.predict(np.array(X))
# The result is a classification into outliers (-1) and inliers (1),
# one label per row of X:
# array([ 1, -1, 1, 1, 1, 1])
Out[4]:
Thus, we are able to detect the second element as an outlier. The outlier detector follows the API that scikit-learn uses for outlier detection with known contamination (see http://scikit-learn.org/stable/modules/outlier_detection.html).
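To illustrate how a known contamination can be turned into a threshold, here is a minimal NumPy sketch. It is not pyisc's implementation: the helper names `gaussian_scores` and `fit_threshold` are hypothetical, and the anomaly score (a sum of squared z-scores under independent Gaussians) is an assumption chosen only for illustration. The threshold is simply the score percentile that leaves the stated fraction of training points above it.

```python
import numpy as np

def gaussian_scores(X, mean, std):
    """Hypothetical anomaly score: sum of squared z-scores over the features,
    assuming each feature is an independent Gaussian."""
    return (((X - mean) / std) ** 2).sum(axis=1)

def fit_threshold(scores, contamination):
    """Pick the score threshold so that roughly a `contamination` fraction
    of the training points lies above it."""
    return np.percentile(scores, 100.0 * (1.0 - contamination))

# Same data as above: the outlier is the element at index 1.
X = np.array([[20, 4], [1200, 130], [12, 8], [27, 8], [-9, 13], [2, -6]],
             dtype=float)

# "Training": estimate the Gaussian parameters and derive the threshold.
mean, std = X.mean(axis=0), X.std(axis=0)
scores = gaussian_scores(X, mean, std)
threshold = fit_threshold(scores, contamination=1.0 / len(X))

# "Prediction": -1 for outliers, 1 for inliers, as in scikit-learn.
labels = np.where(scores > threshold, -1, 1)
print(labels.tolist())  # [1, -1, 1, 1, 1, 1]
```

Because the threshold falls between the largest and second-largest training scores, exactly one of the six points is flagged, which matches the known contamination of 1/6.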