Anomaly (outlier) Detection

  • Why not using classificaito to detect anomalous points?

Applicaitons

Characteristics of anomaly points

Output of anomaly detection

  • continous output
  • binary output

Anomaly Detection Approaches

1. Graphical approaches

  • 1-D box plot

    • limited to only one dimension
  • 2-D scatter plot

2. Statistical approaches

  • outlier score:
    • 1-dimensional: $O(x) = \displaystyle \frac{|x - \mu|}{\sigma}$
    • multidimensional: use the mahalanobis distance $O(x) = (x-\mu)^T \Sigma^{-1} (x - \mu)$

3. Distance-based approaches

  • Compute the distance between every pair of data instances
  • An anomlay is an instance for which there are fewer than $k$ neighbors within distance $\delta$
    • User can define either distance threshold $\delta$ or the number of neighbors $k$
Example: deforestatoin example
  • time-series of vegitation index
  • find the anomalies

K-D tree

  • A data structure to support nearest-neighbor search (for numeric-valued attributes)

  • Properties of K-D tree *

  • Procedure

    1. choose the dimension with largest variance
    2. find the median along the dimension as split point
    3. parition the space into two parts based on the median
    4. choose the enxt dimension and repeat
  • Conventions: the median is always one of the points (not the average of the two middle points)

Alternate between the dimensions

  • at each level, only one dimension is used
  • moving tp the next level of tree, the dimension changes

Using K-D tree for nearest neighbor query

Evaluation

Confusion Matrix

  • Positive class: anomaly points
  • Negative class: normal points

**True Positive Rate (TPR)

False Positive Rate (FPR)

ROC Curve

  • Changing the threshold for outlier score

  • Random guessing

  • (0,0): declare everything to be negative class
  • (1,1): declare everything to be positive class
  • (1,0): ideal
  • Area Under the Curve (AUC)
Example
  • Compare the results of two anomaly detection algorithms

In [ ]: