Detecting outliers in time series



In [1]:

    
__author__ = "Ben Bernstein"

TL;DR: We describe a method of finding outliers in time series data by combining two distinct techniques, STL decomposition and sequential Grubbs' tests. In the end, we arrive at a method that is flexible and identifies points of interest very quickly. Additionally, we offer a new function, comparable to scikit-learn's dataset.make_*, for generating time series datasets with varying functional forms, noise, and number of outliers.

Introduction
STL Decomposition
Grubbs' test
Sequential Grubbs' tests
Data
Detection
Benchmarks
Conclusion

Introduction

Outliers are often nuisances that we remove to avoid skew in estimates and help ensure well-behaved models. However, in business and many other contexts, outliers convey unique information, so simply identifying them can yield powerful insights and offer guidance on which parts of the data need more delving into.

In general, there are two parts to outlier detection. The first is identifying a pattern in the data, and the second is locating the points that don't fit that pattern. In traditional cases, finding a pattern usually means inferring distributions that fit the data best, and identifying outliers means finding the points that have a low probability of belonging to any of the distributions.

This framework also makes sense for time series, but instead of estimating distributions to find patterns, we need to use methods that respect the sequence structure of the data. In this post, we'll describe a method of outlier detection that combines two steps: (1) using STL decomposition to define a time series pattern, and (2) applying sequential Grubbs' tests to spot points that don't fit that pattern.

The code supporting this post is available as roam_outliers. That code creates a Python bridge to our R package outlierDetection, which is also available. Our hope is that this facilitates further exploration of our proposed method.



In [2]:

    
from roam_outliers import *

STL Decomposition

Seasonal and Trend decomposition using Loess (STL) was introduced by Cleveland et al. (1990). This method performs an addative decomposition of a time series into trend, seasonal, and remainder series via an iterative process: (1) an inner loop to determine the seasonal and trend estimates and (2) an outer loop to update weights and discount points with outsized impacts on the seasonal and trend terms.

Often we are most interested in the first two terms, but for outlier detection we are concerned only with the remainder. In other words, we want to strip any part of the time series that can be explained by regular patterns and save the left over bit to investigate further with statistical tests. One such test is the Grubbs' test.

Grubbs' test

The Grubbs' test (Grubbs 1950) is a statistical test to detect one outlier in a single sequence that is approximately normally distributed. We use this test to do the second part of outlier detection. That is, once we've determined which part of the series is a pattern and which is noise, the Grubbs' test helps us determine which points are outliers.

The test follows the usual steps where we calculate a formal statistic and then compare it to a critical value. The formal statistic is calculated by finding the point ($Y_{i}$) in the series farthest away from the mean ($\bar{Y}$) and adjusting by the standard deviation ($s$) (source):

$$ G = \frac{\max_{i=1,\ldots,N} |Y_{i}-\bar{Y}|}{s} $$

Calculating this statistic is straightforward — the remaining question is what we will compare it to. A gut instinct might be to compare this to the normal distribution because one of our assumptions is that we have "approximately normal" data. However, we have to be careful. Since we calculate $s$ using all points in the series, including the outliers we expect to find, the normality assumption is unlikely to hold.

Thankfully, Grubbs showed us how to calculate the critical values we need. We reject the null hypothesis that there are no outliers with the following expression (source):

$$ G > \frac{N-1}{\sqrt{N}} \sqrt{ \frac{t^{2}_{\alpha/(2N), N-2}} {N-2 + t^{2}_{\alpha/(2N), N-2}} } $$

where $N$ is the number of points and $t$ is the t-distribution with an $\alpha/(2N)$ significance level and $N-2$ degrees of freedom.

Sequential Grubbs' tests

The Sequential Grubbs' test works exactly as the name suggests. We perform Grubbs' tests repeatedly up to a predetermined number, which is specified by the maximum percent of outliers allowed. In each iteration, we remove the last $Y_{i}$ from the series and test the new farthest-away-from-the-mean point by recalculating all relevant values — meaning we decrement $N$ and update $\bar{Y}$, $s$, and $t$.

The key point to stress is that we have to do this sequentially because, if we can reject the null hypothesis for any point, then every previously checked point is also considered an outlier.

Data

Now that we have the process described, let's create some data to explore. Instead of tying ourselves to a single dataset, we'll use a make_time_series function that works a lot like the dataset.make_* functions in scikit-learn. It allows us to specify a few parameters:

The start, end, and time_step
A functional form for the series
A noise parameter, so the series isn't too smooth
The outliers percentage (of course we're testing for this so let's build it into our data!)

We want to highlight STL with seasonal data, so we'll make a few seasonal series where our time step is in hours and our season is a day:

Series A: Stationary sine curve with little noise and 10% outliers.
Series B: Downward trending cosine curve with average noise and 10% outliers.
Series C: Upward trending sine curve with a lot of noise and 10% outliers.

The following functions define the core pattern for each of these series:



In [3]:

    
A = lambda x: np.sin(x / (24 / (2*np.pi)))
B = lambda x: 10*np.cos(x / (24 / (2*np.pi))) - np.power(x, 0.5)
C = lambda x: 100*np.sin(x / (24 / (2*np.pi))) + np.power(x, 0.75)

In the call to make_time_series, we specify a date range, a unit for the time step (here, h for hours), and a list giving each function along with its noise rate and its outlier percentage.



In [4]:

    
dt, ys = make_time_series(
    start_dt='2016-01-01T00:00:00', 
    end_dt='2016-01-10T00:00:00', 
    time_step="h",
    functions=[(A, 0.1, 0.10), 
               (B, 1, 0.10),
               (C, 50, 0.10)],
    random_state=0)

The function np_to_df is a helper functions for plotting and passing data between Python and R.



In [5]:

    
wide_df = np_to_df(dt, ys, cols=['A','B','C'])

Here's a look at all the series we're working with:

Detection

Now let's find outliers.



In [6]:

    
wide_outlier_df = find_outliers_for_examples(wide_df)

Series A

Here's a visualization of Series A with red dots marking the points identified as outliers. Series A only has a little noise, and that makes it easier for us to find outliers because they stand out more — almost every point above 1 and below –1 is considered an outlier in this series. More importantly, we're also able to find outliers that are solidly in the middle of the range we expect. Each of the red circles between –1 and 1 is an outlier because it breaks the seasonal pattern determined when using STL, even though it wouldn't raise any flags if taken out of its sequential context.

Series B

Series B is a little more complicated than Series A because it has a downward trend on top of seasonality. However, STL has no problem identifying the overall pattern. The story is similar to Series A because the noise term still isn't too strong, but we do start to see a few points that could have been considered outliers but were missed because of the increase in noise; the Potential outlier labeled in the plot is one such point.

Series C

For Series C, as with Series B, the upward trend doesn't cause any problems. However, unlike Series A and B, the increase in noise in C causes far fewer outliers to be found; only the largest, most obvious points are identified. Taking even a quick look shows some suspect points like the ones labeled Extreme outlier(s). These seem like they should be spotted as outliers, but the noise in the series makes the method more conservative.

Benchmarks

At the top of this post, we said our implementation performed outlier detection "very quickly" — now we'll quantify that.

To benchmark our work we compare outlierDetection to a similar package called AnomalyDetection, released by Twitter. This is an R package using similar (but not identical) techniques as our own, where the biggest differences are that AnomalyDetection has more features and outlierDetection does most of the hard work in C++ (through Rcpp).

We'll generate a new dataset using make_time_series, dropping from hours to minutes for our time step, and extending the time range.



In [7]:

    
dt, ys = make_time_series(
    start_dt='2016-01-01T00:00:00',
    end_dt='2016-03-01T00:00:00', time_step="m",
    functions=[(lambda x: np.sin(x / (1440 / (2*np.pi))), 0.1, 0.10)],
    random_state=5)

This creates a large time series of ~86K observations:



In [8]:

    
wide_df = np_to_df(dt, ys, ['A'])
print("(rows, cols): {}".format(wide_df.shape))









    



(rows, cols): (86400, 2)

Here's the run with outlierDetection:



In [9]:

    
%%timeit
find_outliers_for_benchmarks(wide_df)









    



1 loop, best of 3: 23 s per loop

Here's the run with AnomalyDetection:



In [10]:

    
%%timeit
find_anomalies_for_benchmarks(wide_df)









    



1 loop, best of 3: 3min 11s per loop

A ~8x speed-up!

Please do take the absolute numbers with a grain of salt. We're showing results in Python to fit with the rest of our pipeline, so there's some extra time added for moving data around, and both perform better in R. This shows the power of building on C++ code; for processing granular sensor data or large numbers of time series sequences, it might be the only feasible option.

Conclusion

We described a method of identifying outliers in time series using STL decomposition and sequential Grubbs' tests. The method is fast and robust to different trends and seasons. Also, by applying various levels of noise to our signal, we saw that this method gets more conservative (perhaps too conservative) as noise increases. Finally, we compared our C++ implementation to another open-sourced version and saw an ~8x speed-up.