教育経済学：課題１

「大学進学率に対する回帰分析」

大学進学率に対して、可処分所得、初年度納付金、３年前中学校卒業者数を説明変数として回帰分析を行う。
（出典）「家計調査結果」(総務省統計局)、総務省統計局「日本の統計 2015」、文部科学省調査より算出



In [1]:

    
%matplotlib inline



In [2]:

    
# -*- coding:utf-8 -*-
from __future__ import print_function
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats



In [3]:

    
# データを読み込む
df = pd.read_csv("domestic.csv", index_col='year', dtype='float')
df['log_income'] = np.log(df['income'])
df['log_pay'] = np.log(df['pay'])
df['log_pop'] = np.log(df['pop'])



In [4]:

    
# 要約統計量
df[['enroll' ,'log_income', 'log_pay', 'log_pop']].describe()



In [5]:

    
# 相関を求める
df[['log_income', 'log_pay', 'log_pop']].corr()









    Out[5]:






  
    
      
      log_income
      log_pay
      log_pop
    
  
  
    
      log_income
      1.000000
      0.857604
      -0.401428
    
    
      log_pay
      0.857604
      1.000000
      -0.383214
    
    
      log_pop
      -0.401428
      -0.383214
      1.000000



In [6]:

    
# 単回帰分析（大学進学率と可処分所得）
# 説明変数設定
X = df[['log_income']]
X = sm.add_constant(X)
X.head()
# 被説明変数設定
Y = df['enroll']
Y.head()
# OLSの実行(Ordinary Least Squares: 最小二乗法)
model1 = sm.OLS(Y,X)
results1 = model1.fit()
print(results1.summary())
plt.scatter(df['log_income'], df['enroll'])
plt.plot(df['log_income'], results1.predict())









    



                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 enroll   R-squared:                       0.780
Model:                            OLS   Adj. R-squared:                  0.775
Method:                 Least Squares   F-statistic:                     159.3
Date:                Sun, 25 Oct 2015   Prob (F-statistic):           2.20e-16
Time:                        23:39:17   Log-Likelihood:                -143.69
No. Observations:                  47   AIC:                             291.4
Df Residuals:                      45   BIC:                             295.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const       -498.6344     42.515    -11.728      0.000      -584.264  -413.004
log_income    41.6975      3.304     12.620      0.000        35.042    48.353
==============================================================================
Omnibus:                        4.017   Durbin-Watson:                   0.128
Prob(Omnibus):                  0.134   Jarque-Bera (JB):                3.172
Skew:                           0.628   Prob(JB):                        0.205
Kurtosis:                       3.211   Cond. No.                         717.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.






    Out[6]:





[<matplotlib.lines.Line2D at 0x108e2dd10>]



In [7]:

    
# 単回帰分析（大学進学率と初年度納付金）
# 説明変数設定
X = df[['log_pay']]
X = sm.add_constant(X)
# 被説明変数設定
Y = df['enroll']
# OLSの実行(Ordinary Least Squares: 最小二乗法)
model2 = sm.OLS(Y,X)
results2 = model2.fit()
print(results2.summary())
plt.scatter(df['log_pay'], df['enroll'])
plt.plot(df['log_pay'], results2.predict())









    



                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 enroll   R-squared:                       0.722
Model:                            OLS   Adj. R-squared:                  0.716
Method:                 Least Squares   F-statistic:                     117.0
Date:                Sun, 25 Oct 2015   Prob (F-statistic):           4.18e-14
Time:                        23:39:17   Log-Likelihood:                -149.13
No. Observations:                  47   AIC:                             302.3
Df Residuals:                      45   BIC:                             306.0
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const       -337.2651     34.681     -9.725      0.000      -407.116  -267.414
log_pay       27.6025      2.552     10.818      0.000        22.463    32.742
==============================================================================
Omnibus:                        2.065   Durbin-Watson:                   0.180
Prob(Omnibus):                  0.356   Jarque-Bera (JB):                1.438
Skew:                           0.424   Prob(JB):                        0.487
Kurtosis:                       3.127   Cond. No.                         550.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.






    Out[7]:





[<matplotlib.lines.Line2D at 0x1094ee1d0>]



In [8]:

    
# 単回帰分析（大学進学率と初年度納付金）
# 説明変数設定
X = np.log(df[['log_pop']])
X = sm.add_constant(X)
# 被説明変数設定
Y = df['enroll']
# OLSの実行(Ordinary Least Squares: 最小二乗法)
model3 = sm.OLS(Y,X)
results3 = model3.fit()
print(results3.summary())
plt.scatter(np.log(df['log_pop']), df['enroll'])
plt.plot(np.log(df['log_pop']), results3.predict())









    



                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 enroll   R-squared:                       0.514
Model:                            OLS   Adj. R-squared:                  0.504
Method:                 Least Squares   F-statistic:                     47.68
Date:                Sun, 25 Oct 2015   Prob (F-statistic):           1.41e-08
Time:                        23:39:17   Log-Likelihood:                -162.26
No. Observations:                  47   AIC:                             328.5
Df Residuals:                      45   BIC:                             332.2
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const       1875.1078    266.083      7.047      0.000      1339.189  2411.027
log_pop     -689.9016     99.912     -6.905      0.000      -891.135  -488.668
==============================================================================
Omnibus:                       15.329   Durbin-Watson:                   0.150
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               18.906
Skew:                          -1.133   Prob(JB):                     7.85e-05
Kurtosis:                       5.126   Cond. No.                         710.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.






    Out[8]:





[<matplotlib.lines.Line2D at 0x10992af10>]



In [9]:

    
# 説明変数設定
X = df[['log_income', 'log_pay', 'log_pop']]
X = sm.add_constant(X)
X.head()
# 被説明変数設定
Y = df['enroll']
Y.head()
# OLSの実行(Ordinary Least Squares: 最小二乗法)
model4 = sm.OLS(Y,X)
results4 = model4.fit()
print(results4.summary())









    



                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 enroll   R-squared:                       0.958
Model:                            OLS   Adj. R-squared:                  0.956
Method:                 Least Squares   F-statistic:                     330.6
Date:                Sun, 25 Oct 2015   Prob (F-statistic):           1.04e-29
Time:                        23:39:18   Log-Likelihood:                -104.49
No. Observations:                  47   AIC:                             217.0
Df Residuals:                      43   BIC:                             224.4
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         29.1362     45.057      0.647      0.521       -61.731   120.003
log_income    22.0736      2.888      7.642      0.000        16.248    27.899
log_pay        9.3663      1.970      4.755      0.000         5.394    13.339
log_pop      -28.0695      2.281    -12.305      0.000       -32.670   -23.469
==============================================================================
Omnibus:                        1.450   Durbin-Watson:                   0.398
Prob(Omnibus):                  0.484   Jarque-Bera (JB):                1.362
Skew:                          -0.295   Prob(JB):                        0.506
Kurtosis:                       2.410   Cond. No.                     3.12e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.12e+03. This might indicate that there are
strong multicollinearity or other numerical problems.



In [10]:

    
plt.plot(np.asarray(df.index).astype(int), df['enroll'], label='actual')
plt.plot(np.asarray(df.index).astype(int), results1.predict(), label='predict1')
plt.xlabel('year')
plt.ylabel('Enrollment Rate')
plt.legend(loc=2)
plt.savefig('predict1.png')



In [11]:

    
plt.plot(np.asarray(df.index).astype(int), df['enroll'], label='actual')
plt.plot(np.asarray(df.index).astype(int), results2.predict(), label='predict2')
plt.xlabel('year')
plt.ylabel('Enrollment Rate')
plt.legend(loc=2)
plt.savefig('predict2.png')



In [12]:

    
plt.plot(np.asarray(df.index).astype(int), df['enroll'], label='actual')
plt.plot(np.asarray(df.index).astype(int), results3.predict(), label='predict3')
plt.xlabel('year')
plt.ylabel('Enrollment Rate')
plt.legend(loc=2)
plt.savefig('predict3.png')



In [13]:

    
plt.plot(np.asarray(df.index).astype(int), df['enroll'], label='actual')
plt.plot(np.asarray(df.index).astype(int), results4.predict(), label='predict4')
plt.xlabel('year')
plt.ylabel('Enrollment Rate')
plt.legend(loc=2)
plt.savefig('predict4.png')

	enroll	log_income	log_pay	log_pop
count	47.000000	47.000000	47.000000	47.000000
mean	37.798936	12.864880	13.588054	14.342272
std	11.082234	0.234680	0.341217	0.165470
min	15.660000	12.275375	12.826275	14.008883
25%	34.905000	12.824653	13.230646	14.249943
50%	37.610000	12.949692	13.685634	14.307435
75%	48.390000	13.037063	13.893989	14.448059
max	56.200000	13.089220	13.984339	14.728288

	log_income	log_pay	log_pop
log_income	1.000000	0.857604	-0.401428
log_pay	0.857604	1.000000	-0.383214
log_pop	-0.401428	-0.383214	1.000000