Multicollinearity

This notebook works through Chapter 8, "Multicollinearity," of 『Rによる計量経済学』 (Econometrics with R) in Python.
The datasets that accompany the textbook ("k0801.csv" and so on) can be downloaded from the publisher's website.
The explanations below summarize only part of the book; please refer to the book itself for the full discussion.

Overview

We estimate the following multiple regression model:
$Y_{i} = \alpha + \beta X_{i} + \gamma Z_{i} + u_{i} \quad (i = 1, 2, \ldots, n)$
The variance of the estimated coefficient $\hat{\beta}$, written $s_{\hat{\beta}}^{2}$, is made up of the following three parts:

$s_{\hat{\beta}}^{2} = \frac{s^{2}}{\sum (X_{i} - \bar{X})^{2}(1 - r_{XZ}^{2})} = \frac{(A)}{(B)[1-(C)]}$

Here (A) is the residual variance $s^{2}$, (B) is the sum of squared deviations $\sum (X_{i} - \bar{X})^{2}$, and (C) is the squared correlation $r_{XZ}^{2}$ between the explanatory variables. When the coefficient $\hat{\beta}$ fails to be significant because of (C), the regression is said to suffer from multicollinearity.

Steps for examining multicollinearity

① Check whether each explanatory variable has explanatory power on its own, with the other explanatory variables removed.
② Compute part (A).
③ Compute part (B).
④ Compute part (C).
⑤ From the results of ①–④, judge whether (C) is the cause (steps ②–④ are sketched in code right after this list).
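
As a concrete illustration of steps ② through ④, the three parts of the variance can be computed directly for the data of Example 8-1. This is a minimal, self-contained sketch (not from the textbook), assuming the same file path 'example/k0801.csv' used below:

    import pandas as pd
    import statsmodels.api as sm

    d = pd.read_csv('example/k0801.csv')
    res = sm.OLS(d['Y'], sm.add_constant(d[['X', 'Z']])).fit()

    A = res.mse_resid                          # (A): residual variance s^2
    B = ((d['X'] - d['X'].mean()) ** 2).sum()  # (B): sum of squared deviations of X
    C = d['X'].corr(d['Z']) ** 2               # (C): squared correlation r_XZ^2

    var_beta = A / (B * (1 - C))
    print(var_beta ** 0.5)  # should reproduce the std err of X in In [6] (about 0.168)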

Example 8-1

We analyze "k0801.csv".


In [1]:
%matplotlib inline

In [2]:
# -*- coding:utf-8 -*-
from __future__ import print_function
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [3]:
# Load the data
data = pd.read_csv('example/k0801.csv')
data


Out[3]:
   i  X  Z  Y
0  1  1  2  1
1  2  3  2  3
2  3  5  4  4
3  4  5  6  5
4  5  8  7  7
5  6  9  9  8

In [4]:
# Set the explanatory variables
X = data[['X', 'Z']]
X = sm.add_constant(X)
X


Out[4]:
   const  X  Z
0      1  1  2
1      1  3  2
2      1  5  4
3      1  5  6
4      1  8  7
5      1  9  9

In [5]:
# Set the dependent variable
Y = data['Y']
Y


Out[5]:
0    1
1    3
2    4
3    5
4    7
5    8
Name: Y, dtype: int64

In [6]:
# Run OLS (Ordinary Least Squares)
model = sm.OLS(Y,X)
results = model.fit()
print(results.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.988
Model:                            OLS   Adj. R-squared:                  0.980
Method:                 Least Squares   F-statistic:                     120.6
Date:                Sun, 19 Jul 2015   Prob (F-statistic):            0.00136
Time:                        04:05:31   Log-Likelihood:               -0.45977
No. Observations:                   6   AIC:                             6.920
Df Residuals:                       3   BIC:                             6.295
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          0.1767      0.330      0.536      0.629        -0.873     1.227
X              0.6897      0.168      4.104      0.026         0.155     1.224
Z              0.1853      0.178      1.042      0.374        -0.381     0.752
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   3.449
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.432
Skew:                           0.112   Prob(JB):                        0.806
Kurtosis:                       1.704   Cond. No.                         18.0
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

From this we can see that the P-value of the coefficient on X, $\hat{\beta}$, is small enough to be significant even at the 5% level, while the P-value of the coefficient on Z, $\hat{\gamma}$, is large and clearly not significant.
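
The p-values quoted above can also be read off the results object directly; a small aside, not in the textbook:

    print(results.pvalues)  # p-values for const, X, and Z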


In [7]:
# Set the explanatory variable
X = data['X']
X = sm.add_constant(X)
X


Out[7]:
   const  X
0      1  1
1      1  3
2      1  5
3      1  5
4      1  8
5      1  9

In [8]:
# Estimate with X as the only explanatory variable
# Run OLS (Ordinary Least Squares)
model = sm.OLS(Y,X)
results = model.fit()
print(results.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.983
Model:                            OLS   Adj. R-squared:                  0.979
Method:                 Least Squares   F-statistic:                     235.1
Date:                Sun, 19 Jul 2015   Prob (F-statistic):           0.000106
Time:                        04:05:31   Log-Likelihood:                -1.3861
No. Observations:                   6   AIC:                             6.772
Df Residuals:                       4   BIC:                             6.356
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          0.2491      0.326      0.764      0.487        -0.656     1.154
X              0.8550      0.056     15.333      0.000         0.700     1.010
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   3.459
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.104
Skew:                          -0.185   Prob(JB):                        0.949
Kurtosis:                       2.473   Cond. No.                         12.8
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [9]:
# Set the explanatory variable
X = data['Z']
X = sm.add_constant(X)
X


Out[9]:
   const  Z
0      1  2
1      1  2
2      1  4
3      1  6
4      1  7
5      1  9

In [10]:
# Estimate with Z as the only explanatory variable
# Run OLS (Ordinary Least Squares)
model = sm.OLS(Y,X)
results = model.fit()
print(results.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.919
Model:                            OLS   Adj. R-squared:                  0.898
Method:                 Least Squares   F-statistic:                     45.23
Date:                Sun, 19 Jul 2015   Prob (F-statistic):            0.00255
Time:                        04:05:31   Log-Likelihood:                -6.1274
No. Observations:                   6   AIC:                             16.25
Df Residuals:                       4   BIC:                             15.84
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          0.2917      0.732      0.398      0.711        -1.741     2.324
Z              0.8750      0.130      6.725      0.003         0.514     1.236
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.567
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.361
Skew:                          -0.113   Prob(JB):                        0.835
Kurtosis:                       1.820   Cond. No.                         12.6
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [11]:
# Standard deviations
data.std()


Out[11]:
i    1.870829
X    2.994439
Z    2.828427
Y    2.581989
dtype: float64

In [12]:
# Means
data.mean()


Out[12]:
i    3.500000
X    5.166667
Z    5.000000
Y    4.666667
dtype: float64

In [13]:
# Correlation matrix
data.corr()


Out[13]:
          i         X         Z         Y
i  1.000000  0.981778  0.982708  0.993694
X  0.981778  1.000000  0.944560  0.991600
Z  0.982708  0.944560  1.000000  0.958514
Y  0.993694  0.991600  0.958514  1.000000

From the above results we can infer that although $Z_{i}$ was not significant in the multiple regression, this was due to multicollinearity: Z on its own explains Y well, and the correlation between X and Z is very high (0.9446).
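
As a complementary diagnostic (not in the textbook), statsmodels can also compute the variance inflation factor, $VIF = 1/(1-r^{2})$, where values above 10 are a common rule of thumb for problematic collinearity. A short sketch, assuming data from In [3]:

    from statsmodels.stats.outliers_influence import variance_inflation_factor

    exog = sm.add_constant(data[['X', 'Z']])
    for i in range(1, exog.shape[1]):  # skip the constant term
        print(exog.columns[i], variance_inflation_factor(exog.values, i))

With $r_{XZ} = 0.9446$, the VIF for both X and Z comes out at roughly 9.3, just below the conventional threshold but high enough to inflate the standard errors substantially.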

Akaike's Information Criterion (AIC)

Like the adjusted coefficient of determination $\overline{R}^{2}$, the AIC measures the substantive explanatory power of a regression.
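
For reference, for a model estimated by maximum likelihood the AIC is defined as
$AIC = -2\ln L + 2k$
where $L$ is the maximized likelihood and $k$ is the number of estimated parameters. Unlike $R^{2}$, smaller values are better: the $2k$ term penalizes each additional explanatory variable, so the AIC does not automatically improve as variables are added.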

However, statsmodels does not seem to provide a method for stepwise variable selection, so we borrow forward_selected() from this page and use it to select variables. (Note that, as its docstring states, this particular function ranks candidate models by adjusted R-squared rather than by AIC.)


In [14]:
# Load the data
data2 = pd.read_csv('example/k0802.csv')
data2


Out[14]:
        Y     X1    X2   X3    X4   X5      X6      X7     X8
0   208.7  312.8   7.7  6.5  24.2  300   322.1   278.1   92.9
1   174.0  686.9  14.0  8.4  22.3  367   434.7   273.6   90.3
2   206.5  497.6  13.7  6.2  32.5  397   177.3   387.5   92.1
3   173.2  264.9   6.2  6.9  13.6  372   581.9   502.1   89.6
4   182.0  471.2  11.1  6.1  38.4  383   213.4   317.3   88.6
5   186.1  375.0  10.9  4.8  28.0  405   597.8   524.8   93.0
6   138.8  267.3   9.2  6.0  27.7  386   368.0   481.3   90.1
7   171.0  275.9   7.4  5.9  17.8  446   423.6  1253.7   90.4
8   126.6  303.1   6.8  5.4  22.8  411   630.2  1305.2   91.9
9   131.4  369.5   6.5  5.7  18.8  367   679.8  1725.9   89.3
10  194.8  212.8   2.2  5.7  20.8  234  1157.1  1134.6   94.2
11  160.5  211.1   3.7  5.6  21.1  265   930.8  1207.2   92.0
12  118.1  284.2   0.4  5.6  29.3  145  3396.3  1974.8  100.0
13  170.9  172.9   1.0  5.5   9.4  184  2410.9  1312.8   99.2
14  199.7  588.7   7.5  4.8  25.9  346   408.1   446.7   93.0
15  194.1  554.9   4.3  4.4  54.0  394   580.3   952.3   91.2
16  211.8  620.6   3.9  4.7  39.2  291   678.3   651.9   93.7
17  161.7  620.0   4.7  4.2  49.9  341   493.0  1314.9   91.2
18  195.0  517.6   8.5  5.3  55.4  347   673.6  1533.5   91.7
19  185.2  508.1  11.4  4.6  49.2  361   286.8  1583.0   90.9
20  179.7  374.3   3.7  4.8  28.9  297   478.7  1746.0   89.6
21  201.5  239.8   4.9  4.6  22.7  275  1130.8  1864.8   94.1
22  184.2  275.5   2.8  4.6  11.9  259  1231.8  2069.2   94.1

In [15]:
import statsmodels.formula.api as smf

def forward_selected(data, response):
    """Linear model designed by forward selection.

    Parameters:
    -----------
    data : pandas DataFrame with all possible predictors and response

    response: string, name of response column in data

    Returns:
    --------
    model: an "optimal" fitted statsmodels linear model
           with an intercept
           selected by forward selection
           evaluated by adjusted R-squared
    """
    remaining = set(data.columns)
    remaining.remove(response)
    selected = []
    current_score, best_new_score = 0.0, 0.0
    while remaining and current_score == best_new_score:
        scores_with_candidates = []
        for candidate in remaining:
            formula = "{} ~ {} + 1".format(response,
                                           ' + '.join(selected + [candidate]))
            score = smf.ols(formula, data).fit().rsquared_adj
            scores_with_candidates.append((score, candidate))
        scores_with_candidates.sort()
        best_new_score, best_candidate = scores_with_candidates.pop()
        if current_score < best_new_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
    formula = "{} ~ {} + 1".format(response,
                                   ' + '.join(selected))
    model = smf.ols(formula, data).fit()
    return model

In [16]:
# First argument: the data; second argument: the column name of the dependent variable
model = forward_selected(data2, 'Y')
print(model.model.formula)


Y ~ X6 + X8 + X5 + 1
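
Since forward_selected() returns an already-fitted model, its statistics can also be inspected directly, without refitting (In [17] below refits the same specification with sm.OLS only to print the full summary):

    print(model.rsquared_adj)  # adjusted R-squared of the selected model
    print(model.aic)           # AIC of the selected model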

In [17]:
# Run OLS with the selected explanatory variables
X = data2[['X5', 'X6', 'X8']]
X = sm.add_constant(X)
Y = data2['Y']
model = sm.OLS(Y,X)
results = model.fit()
print(results.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.604
Model:                            OLS   Adj. R-squared:                  0.542
Method:                 Least Squares   F-statistic:                     9.674
Date:                Sun, 19 Jul 2015   Prob (F-statistic):           0.000433
Time:                        04:05:32   Log-Likelihood:                -96.881
No. Observations:                  23   AIC:                             201.8
Df Residuals:                      19   BIC:                             206.3
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const       -620.9189    276.085     -2.249      0.037     -1198.772   -43.066
X5            -0.1958      0.093     -2.116      0.048        -0.390    -0.002
X6            -0.0638      0.012     -5.369      0.000        -0.089    -0.039
X8             9.8846      2.938      3.364      0.003         3.735    16.035
==============================================================================
Omnibus:                        2.671   Durbin-Watson:                   2.149
Prob(Omnibus):                  0.263   Jarque-Bera (JB):                1.672
Skew:                          -0.659   Prob(JB):                        0.433
Kurtosis:                       3.077   Cond. No.                     8.08e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.08e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
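
As a final check (not in the textbook), the AIC of the selected three-variable model can be compared with that of the full eight-variable model. A minimal sketch, assuming data2 and results from the cells above:

    full = sm.OLS(data2['Y'], sm.add_constant(data2.drop('Y', axis=1))).fit()
    print(full.aic)     # AIC with all eight explanatory variables
    print(results.aic)  # AIC of the selected model; the smaller value is preferred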