Outlier detection

Detect outliers using linear regression model and statsmodels. Based on Stackoverflow question.



In [ ]:

    
import numpy as np

import statsmodels.api as sm # For some reason this import is necessary...
import statsmodels.formula.api as smapi
import statsmodels.graphics as smgraph

import matplotlib.pyplot as plt
%matplotlib inline

Test data

Here were just making some fake data. First of all, make a list of x values from 0 to 29. Next, use the x values to generate some y data that is salted with some randomness. Finally, change the 7th value to a value that is clearly an outlier from the rest.



In [ ]:

    
x = np.arange(30, dtype=float)
# Make some y data with random noise
y = 10 * x + 5.0*np.random.randn(30)
# Add outlier #
y[10] = 180.
y[20] = 130



In [ ]:

    
plt.plot(x, y, 'o')

Regression

Here we're just doing an ordinary least squares method to fit the data. The "data ~ x" is just saying that 'data' (which is the y values) are directly related to 'x' values. This formalism apparently implies that data = m*x + b.



In [ ]:

    
# Make fit #
regression = smapi.ols("data ~ x", data=dict(data=y, x=x)).fit()
regression.summary()

Test for Outliers

Here we're using our regression results to do a test for outliers. In this case, I guess the default is a Bonferroni outlier test. We're only printing off test results where the third column is less than 0.05.

Find outliers



In [ ]:

    
test = regression.outlier_test()
test



In [ ]:

    
print('Bad data points (bonf(p) < 0.05):')
test[test['bonf(p)'] < 0.05]



In [ ]:

    
outliers = test[test['bonf(p)'] < 0.05].index.values
outliers

Figure



In [ ]:

    
figure = smgraph.regressionplots.plot_fit(regression, 1)
line = smgraph.regressionplots.abline_plot(model_results=regression, ax=figure.axes[0])
plt.plot(outliers, y[outliers], 'xm', label='outliers', ms=14)
plt.legend(loc=0);

Create a function and test it



In [ ]:

    
import statsmodels.formula.api as smapi

def get_outliers(features, target):
    regression = smapi.ols("target ~ features", data=locals()).fit()
    test = regression.outlier_test()
    outliers = test[test['bonf(p)'] < 0.05]
    return list(outliers.index.values)



In [ ]:

    
def test_outliers():
    x = np.arange(30, dtype=float)
    # Make some y data with random noise
    y = 10 * x + 5.0*np.random.randn(30)
    # Add outlier
    y[10] = 180.
    y[20] = 130  
    outliers = [10, 20]
    prediction = get_outliers(features=x, target=y)
    assert outliers == prediction
    
test_outliers()



In [ ]: