Cook's Distance

See more at Cook's Distance. A very good description of influence and leverage can be found at Linear Regression in Python, Outliers / Leverage Detect (this post should make it into the Yellowbrick documentation in some form).

A good description of outlier detection that also includes Cook's Distance is How to Make Your Machine Learning Models Robust to Outliers. The original motivating post is How do you check the quality of your regression model in Python?; the source code for that post is at Visual analytics and diagnostics of model fit for linear regression.

The statsmodels implementation of Cook's Distance lives in statsmodels/stats/outliers_influence.py (see OLSInfluence.cooks_distance).
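For reference, the quantity computed throughout this notebook is the standard Cook's Distance for instance $i$, expressed in terms of the leverage and the internally studentized residual (my summary, not quoted from the posts above):

$$D_i = \frac{t_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}, \qquad t_i = \frac{e_i}{\sqrt{\hat{\sigma}^2 \, (1 - h_{ii})}}, \qquad h_{ii} = \left[X (X^\top X)^{-1} X^\top\right]_{ii}$$

where $p$ is the number of model parameters, $e_i$ is the residual, $\hat{\sigma}^2$ is the mean squared error, and $h_{ii}$ is the leverage of instance $i$ (the $i$-th diagonal element of the hat matrix). This is the quantity both the statsmodels property and the prototype visualizer below compute.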


In [1]:
%matplotlib notebook

import scipy as sp
import scipy.stats  # explicitly import scipy.stats so sp.stats is always available
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 

# Note: statsmodels requires scipy 1.2
import statsmodels.formula.api as sm

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from statsmodels.stats.outliers_influence import OLSInfluence as influence

from yellowbrick.base import Visualizer

Random Test Data

For the purpose of generating tests for the visualizer, I'm using random test data. The concrete compressive strength dataset would also be good for integration testing.
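For that integration test, something along these lines should work (a sketch, assuming the load_concrete helper in yellowbrick.datasets returns a feature frame and a target series; it is not run in this notebook):


In [ ]:
# Optional: real data for integration testing (assumes yellowbrick.datasets.load_concrete)
from yellowbrick.datasets import load_concrete

X_concrete, y_concrete = load_concrete()
print(X_concrete.shape, y_concrete.shape)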


In [2]:
# Make Test Dataset
X, y = make_regression(
    n_samples=100, n_features=14, n_informative=6, bias=1.2,
    noise=49.8, tail_strength=0.6, random_state=637
)

# Convert to a DataFrame for statsmodels
data = pd.DataFrame(X)
data.columns = [f"X{i}" for i in range(X.shape[1])]
data["y"] = y
data.head()


Out[2]:
X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 y
0 -0.079099 0.774174 -0.389897 -1.009309 -0.082365 0.637309 0.236902 1.618119 -0.520738 -0.210891 0.654572 -0.366485 -0.859211 0.145978 42.341339
1 -0.896311 1.928183 -0.538742 0.588117 0.724919 -0.856656 0.842047 -1.796532 -0.534544 2.421266 -0.115243 1.585129 -0.103442 0.169937 -64.162092
2 0.479350 -0.544794 -0.180732 0.899390 -0.874454 0.907707 1.058048 1.266299 0.033375 -0.241988 0.161981 -0.770880 0.785022 -2.043742 -63.616222
3 -0.793136 1.422011 -1.109729 -1.434248 1.392304 0.622966 -0.472442 0.075291 -0.077669 -1.725702 -0.075634 0.147481 -0.918953 -0.566963 -84.649787
4 0.670828 0.149658 0.116948 0.011745 -2.450315 -0.529813 -1.586794 -0.337791 -1.360943 -1.209053 -0.029231 0.909747 -0.142993 -1.622250 -166.153271

Statsmodels Results

As a reference for the custom computation, this section fits a statsmodels OLS model and computes its Cook's Distance values. In a later section we will compare this to scikit-learn linear regression and the Yellowbrick cooks function.


In [3]:
# Compute an OLS model 
cols = data.columns
model = sm.ols(formula=f"{cols[-1]} ~ {' + '.join(cols[:-1])}", data=data)
model = model.fit()

print(model.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.881
Model:                            OLS   Adj. R-squared:                  0.862
Method:                 Least Squares   F-statistic:                     45.14
Date:                Sun, 09 Jun 2019   Prob (F-statistic):           2.75e-33
Time:                        16:26:30   Log-Likelihood:                -513.54
No. Observations:                 100   AIC:                             1057.
Df Residuals:                      85   BIC:                             1096.
Df Model:                          14                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -9.9575      5.033     -1.978      0.051     -19.965       0.050
X0           101.3951      4.852     20.897      0.000      91.748     111.042
X1             7.3318      5.241      1.399      0.165      -3.089      17.752
X2            -3.2906      4.839     -0.680      0.498     -12.912       6.331
X3            -3.7461      5.237     -0.715      0.476     -14.158       6.666
X4            28.0759      4.432      6.335      0.000      19.265      36.887
X5             3.0478      5.184      0.588      0.558      -7.259      13.355
X6             3.3205      4.966      0.669      0.505      -6.552      13.193
X7            15.3175      4.014      3.816      0.000       7.336      23.299
X8            49.7685      5.584      8.912      0.000      38.666      60.872
X9            14.7939      4.486      3.298      0.001       5.874      23.713
X10           -0.6941      5.275     -0.132      0.896     -11.183       9.795
X11          -10.1851      4.921     -2.070      0.042     -19.969      -0.402
X12           -8.5387      5.249     -1.627      0.108     -18.976       1.898
X13           25.9747      4.656      5.579      0.000      16.717      35.232
==============================================================================
Omnibus:                        0.261   Durbin-Watson:                   2.027
Prob(Omnibus):                  0.878   Jarque-Bera (JB):                0.418
Skew:                           0.093   Prob(JB):                        0.811
Kurtosis:                       2.744   Cond. No.                         2.30
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [4]:
# Compute the influence to get Cook's distance
inf = influence(model)

# cooks_distance is a property of the OLSInfluence instance; it returns the
# distance for each instance (C) and the corresponding p-value (P)
C, P = inf.cooks_distance
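
As a sanity check on P (assuming it is the upper-tail probability of each distance under an F distribution with (k, n − k) degrees of freedom, where k counts the intercept), the values can be recomputed directly; if that assumption holds, this prints True:


In [ ]:
# Recompute the p-values from the F distribution and compare against P
k = int(model.df_model) + 1           # number of parameters, including the intercept
p_check = sp.stats.f.sf(C, k, model.df_resid)
print(np.allclose(P, p_check))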

In [5]:
def plot_cooks_distance(c):
    _, ax = plt.subplots(figsize=(9,6))
    ax.stem(c, markerfmt=",")
    ax.set_xlabel("instance")
    ax.set_ylabel("distance")
    ax.set_title("Cook's Distance Outlier Detection")
    return ax


plot_cooks_distance(C)


Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x12ab062b0>

Scikit-Learn Cook's Distance


In [6]:
class CooksDistance(Visualizer):
    
    def fit(self, X, y):
        # Leverage is computed as the diagonal of the projection matrix of X 
        # TODO: whiten X before computing leverage
        self.leverage_ = (X * np.linalg.pinv(X).T).sum(1)
        
        # Compute the MSE
        rank = np.linalg.matrix_rank(X)
        df = X.shape[0] - rank
        
        resid = y - LinearRegression().fit(X, y).predict(X)
        mse = np.dot(resid, resid) / df 
        
        resid_studentized_internal = resid / np.sqrt(mse) / np.sqrt(1-self.leverage_)
        
        self.distance_ = resid_studentized_internal**2 / X.shape[1]
        self.distance_ *= self.leverage_ / (1 - self.leverage_)

        self.p_values_ = sp.stats.f.sf(self.distance_, X.shape[1], df)
        
        self.draw()
        return self
    
    
    def draw(self):
        self.ax.stem(self.distance_, markerfmt=",", label="influence")
        self.ax.axhline(4/len(self.distance_), c='r', ls='--', lw=1, label=r"$\frac{4}{n}$")
    
    def finalize(self):
        self.ax.legend()
        self.ax.set_xlabel("instance")
        self.ax.set_ylabel("influence")
        self.ax.set_title("Cook's Distance Outlier Detection")
    
    
viz = CooksDistance().fit(X, y)
viz.finalize()
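
A minimal follow-up sketch (my addition, using only names defined above): list the instances that exceed the 4/n rule-of-thumb threshold drawn by draw(), and roughly compare the prototype's distances to the statsmodels values. The two sets of distances should be broadly similar but won't match exactly, because the leverage above is computed without an intercept column while the statsmodels model includes one (possibly what the whitening TODO in fit() is meant to address).


In [ ]:
# Instances flagged as influential by the 4/n rule of thumb
threshold = 4 / len(viz.distance_)
influential = np.where(viz.distance_ > threshold)[0]
print(f"{len(influential)} instances exceed 4/n: {influential}")

# Rough agreement with the statsmodels distances computed earlier
print(np.corrcoef(viz.distance_, C)[0, 1])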