The influence of individual data samples on a regression result can be examined with leverage analysis.
Leverage measures how strongly each original target value $y$ affects its own predicted target $\hat{y}$; it is also called self-influence or self-sensitivity.
Leverage values can be obtained with the get_influence method of the RegressionResults class.
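For reference, leverage is the diagonal of the hat (projection) matrix $H$, which maps the observed targets to the predictions:

$$ \hat{y} = Hy, \qquad H = X(X^TX)^{-1}X^T, \qquad h_{ii} = [H]_{ii} $$

Each leverage value satisfies $0 \le h_{ii} \le 1$, and the diagonal elements sum to the number of regression parameters.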
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.datasets import make_regression

X0, y, coef = make_regression(n_samples=100, n_features=1, noise=20, coef=True, random_state=1)
# add high-leverage points
X0 = np.vstack([X0, np.array([[4],[3]])])
X = sm.add_constant(X0)
y = np.hstack([y, [300, 150]])
plt.scatter(X0, y)
plt.show()
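The two appended samples lie far to the right of the rest of the data along the $x$-axis, so we expect them to show large leverage in the analysis below.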
In [2]:
model = sm.OLS(pd.DataFrame(y), pd.DataFrame(X))
result = model.fit()
print(result.summary())
In [3]:
influence = result.get_influence()
hat = influence.hat_matrix_diag   # leverage of each sample

plt.stem(hat)
plt.axis([-2, len(y) + 2, 0, 0.2])
plt.show()

print("hat.sum() =", hat.sum())   # equals the number of parameters (2)
In [4]:
plt.scatter(X0, y)
sm.graphics.abline_plot(model_results=result, ax=plt.gca())

# highlight the samples whose leverage exceeds 0.05
idx = hat > 0.05
plt.scatter(X0[idx], y[idx], s=300, c="r", alpha=0.5)
plt.axis([-3, 5, -300, 400])
plt.show()
In [5]:
# refit after dropping the last sample, which has high leverage
model2 = sm.OLS(y[:-1], X[:-1])
result2 = model2.fit()

plt.scatter(X0, y)
sm.graphics.abline_plot(model_results=result, c="r", linestyle="--", ax=plt.gca())
sm.graphics.abline_plot(model_results=result2, c="g", alpha=0.7, ax=plt.gca())
plt.plot(X0[-1], y[-1], marker="x", c="m", ms=20, mew=5)
plt.axis([-3, 5, -300, 400])
plt.legend(["before", "after"], loc="upper left")
plt.show()
In [6]:
# refit after dropping the first sample, which has low leverage
model3 = sm.OLS(y[1:], X[1:])
result3 = model3.fit()

plt.scatter(X0, y)
sm.graphics.abline_plot(model_results=result, c="r", linestyle="--", ax=plt.gca())
sm.graphics.abline_plot(model_results=result3, c="g", alpha=0.7, ax=plt.gca())
plt.plot(X0[0], y[0], marker="x", c="m", ms=20, mew=5)
plt.axis([-3, 5, -300, 400])
plt.legend(["before", "after"], loc="upper left")
plt.show()
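The two experiments can be summarized numerically; a small sketch comparing the fitted slopes shows that dropping the high-leverage sample moves the coefficient far more than dropping the low-leverage one:

print(result.params.iloc[1],   # full data
      result2.params[1],       # high-leverage sample removed
      result3.params[1])       # low-leverage sample removed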
In [7]:
# raw residuals of the full fit
plt.figure(figsize=(10, 2))
plt.stem(result.resid)
plt.xlim([-2, len(y) + 2])
plt.show()
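Raw residuals ignore leverage. A sketch (assuming the attribute names used by statsmodels' OLSInfluence) of internally studentized residuals, which rescale each residual by its leverage-adjusted standard error:

resid_std = result.resid / np.sqrt(result.mse_resid * (1 - hat))
assert np.allclose(resid_std, influence.resid_studentized_internal)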
In [8]:
# normalized squared residuals plotted against leverage
sm.graphics.plot_leverage_resid2(result)
plt.show()
Cook's Distance
Cook's distance combines each sample's residual and leverage into a single influence measure:
$$ D_i = \frac{e_i^2}{p \, s^2}\left[\frac{h_{ii}}{(1-h_{ii})^2}\right] $$
where $e_i$ is the residual, $h_{ii}$ the leverage, $p$ the number of regression parameters, and $s^2 = \text{RSS}/(n-p)$ the estimated error variance. Fox's rule of thumb flags a sample as influential when $D_i > 4/(N - K - 1)$, where $N$ is the number of samples and $K$ the number of independent variables; this threshold is used below.
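Under this definition, Cook's distance can be reproduced by hand and checked against the values statsmodels computes; a minimal sketch:

p = X.shape[1]          # number of parameters
s2 = result.mse_resid   # RSS / (n - p)
d = result.resid ** 2 / (p * s2) * hat / (1 - hat) ** 2
assert np.allclose(d, influence.cooks_distance[0])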
In [9]:
sm.graphics.influence_plot(result, plot_alpha=0.3)
plt.show()
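In the influence plot the horizontal axis is the leverage $h_{ii}$, the vertical axis the studentized residual, and the bubble size is proportional to Cook's distance, so the most influential samples appear as large bubbles far from the center.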
In [10]:
from statsmodels.graphics import utils

cooks_d2, pvals = influence.cooks_distance

# Fox's cut-off: 4 / (N - K - 1) with K = 1 independent variable
fox_cr = 4 / (len(y) - 2)
idx = np.where(cooks_d2 > fox_cr)[0]

plt.scatter(X0, y)
plt.scatter(X0[idx], y[idx], s=300, c="r", alpha=0.5)
plt.axis([-3, 5, -300, 400])
utils.annotate_axes(range(len(idx)), idx, list(zip(X0[idx], y[idx])),
                    [(-20, 15)] * len(idx), size="large", ax=plt.gca())
plt.show()
In [11]:
# flag samples whose Bonferroni-corrected outlier p-value is below 0.9
idx = np.nonzero((result.outlier_test().iloc[:, -1].abs() < 0.9).values)[0]

plt.scatter(X0, y)
plt.scatter(X0[idx], y[idx], s=300, c="r", alpha=0.5)
plt.axis([-3, 5, -300, 400])
utils.annotate_axes(range(len(idx)), idx, list(zip(X0[idx], y[idx])),
                    [(-10, 10)] * len(idx), size="large", ax=plt.gca())
plt.show()
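The positional indexing above depends on column order. outlier_test() returns the externally studentized residual, the raw p-value, and the Bonferroni-corrected p-value per sample, so the same mask can be written by column name (a sketch assuming the current statsmodels column label bonf(p)):

outl = result.outlier_test()
idx = np.nonzero((outl["bonf(p)"] < 0.9).values)[0]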
In [12]:
plt.figure(figsize=(10, 2))
plt.stem(result.outlier_test().iloc[:, -1])   # Bonferroni-corrected p-values
plt.show()