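Economics of Education: Assignment 1

"Regression Analysis of Tertiary Education Enrollment Rates"

We regress the tertiary education enrollment rate on GDP, population, the Gini coefficient, and the share of government expenditure on education, using multiple regression. GDP and population enter the regressions in natural logs.

Sources (figures computed from the following):
OECD (2015), Gross domestic product (GDP) (indicator). doi: 10.1787/dc2f7aec-en (Accessed on 10 October 2015)
UNESCO Institute for Statistics (2015), data extracted on 10 Oct 2015 09:13 UTC (GMT) from UIS/ISU
World Bank, Development Research Group (2015), Data from database: Poverty and Equity Database. (Last Updated: 07/08/2015)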


In [1]:
%matplotlib inline

In [2]:
# -*- coding:utf-8 -*-
from __future__ import print_function
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

In [3]:
# Load the data ('..' marks a missing value in the source files)
# Tertiary education enrollment rate
# http://data.uis.unesco.org/
data_enroll = pd.read_csv("tertiary.csv", index_col='Country', dtype='O')
data_enroll[data_enroll == '..'] = np.nan
# GDP 
# https://data.oecd.org/gdp/gross-domestic-product-gdp.htm
data_gdp = pd.read_csv("gdp.csv", index_col='Country', dtype='O')
data_gdp[data_gdp == '..'] = np.nan
# Population
# http://databank.worldbank.org/data/reports.aspx?Code=SP.POP.TOTL&id=af3ce82b&report_name=Popular_indicators&populartype=series&ispopular=y#
data_pop = pd.read_csv("population.csv", index_col='Country', dtype='O')
data_pop[data_pop == '..'] = np.nan
# Gini coefficient
# http://databank.worldbank.org/data/reports.aspx?Code=SI.POV.GINI&id=af3ce82b&report_name=Popular_indicators&populartype=series&ispopular=y#
data_gini = pd.read_csv("gini.csv", index_col='Country', dtype='O')
data_gini[data_gini == '..'] = np.nan
# Government spending on tertiary education
# https://data.oecd.org/eduresource/public-spending-on-education.htm
data_public = pd.read_csv("public_spending.csv", index_col='Country', dtype='O')
data_public[data_public == '..'] = np.nan
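
The cell above reads every column as a string and then overwrites the '..' placeholders with NaN. The same cleanup can be done while parsing. The following is a minimal sketch, assuming the CSV layout used in this notebook (a Country column followed by year columns); the inline CSV text and the country names in it are made-up sample data, not the real files. `na_values` is a standard `pd.read_csv` parameter.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same layout as the real CSV files.
sample_csv = io.StringIO(
    "Country,2011,2012\n"
    "Aland,..,10.5\n"
    "Borduria,20.25,..\n"
)

# na_values='..' turns the placeholder into NaN during parsing,
# so the year columns are read as float64 instead of object.
sample = pd.read_csv(sample_csv, index_col="Country", na_values="..")
print(sample.dtypes)
print(sample)
```

With this approach the later `.astype(float)` calls become redundant but remain harmless.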

In [4]:
data_enroll.head()


Out[4]:
1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
Country
Afghanistan NaN NaN NaN NaN 1.27942 1.29312 NaN NaN NaN NaN 3.9114 NaN 3.74394 NaN NaN NaN NaN
Albania 13.08853 13.76232 14.01383 14.73868 15.81263 20.00498 24.52325 28.78627 31.79338 32.32131 33.1062 43.56153 47.74272 55.50096 58.52989 NaN NaN
Algeria 13.60327 NaN 15.24848 16.74901 17.744 18.18056 19.76258 20.21485 22.28536 NaN 28.59671 28.75825 30.27547 31.46411 33.30463 NaN NaN
American Samoa NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Andorra NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

In [5]:
data_gdp.head()


Out[5]:
1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 ... 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
Country
Argentina NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 538035.5154 601184.0944 666237.9468 700439.5025 706116.0365 780193.8888 867545.4341 895008.2139 945659.3828 927164.1469
Australia 155037.382 178034.2597 181796.1002 196586.5927 211230.1811 228876.1743 240244.2573 261145.2252 283619.2842 299927.1378 ... 718812.8644 773857.9107 825382.9291 850582.9281 897942.1402 935668.927 984763.1119 999199.8604 1040376.497 1062956.442
Austria 79726.0876 87043.3274 94302.7153 100940.7418 104575.7382 110618.6571 115447.3111 119999.7106 128294.273 138463.2414 ... 285433.2141 311310.3909 325500.8904 342442.8873 339017.5061 350124.3423 369420.341 378088.387 382598.8984 394485.5275
Belgium 103557.1412 112908.7475 120627.0074 125781.0802 133456.1351 140001.5239 145429.1715 152579.9061 165380.5208 177771.0402 ... 346243.6361 370161.3572 388722.7568 405339.5371 406396.2064 427441.4922 451396.963 459785.3992 461907.8543 477949.3341
Brazil NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 1988025.804 2142004.309 2342708.493 2520650.653 2538040.151 2771866.298 2973856.149 NaN NaN NaN

5 rows × 35 columns


In [6]:
data_pop.head()


Out[6]:
                    2000      2001      2002      2003      2004      2005      2006      2007      2008      2009      2010      2011      2012
Country
Afghanistan     19701940  20531160  21487079  22507368  23499850  24399948  25183615  25877544  26528741  27207291  27962207  28809167  29726803
Albania          3089027   3060173   3051010   3039616   3026939   3011487   2992547   2970017   2947314   2927519   2913021   2904780   2900489
Algeria         31183658  31590320  31990387  32394886  32817225  33267887  33749328  34261971  34811059  35401790  36036159  36717132  37439427
American Samoa     57522     58176     58729     59117     59262     59117     58648     57904     57031     56226     55636     55316     55227
Andorra            65399     67770     71046     74783     78337     81223     83373     84878     85616     85474     84419     82326     79316

In [7]:
data_gini.head()


Out[7]:
                2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012
Country
Afghanistan      NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
Albania          NaN   NaN  32.5   NaN   NaN  30.6   NaN   NaN    30   NaN   NaN   NaN    29
Algeria          NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
American Samoa   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
Andorra          NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN

In [8]:
data_public.head()


Out[8]:
           2009
Country
Australia   0.7
Austria     1.4
Belgium     1.4
Brazil      0.8
Canada      1.5

In [9]:
# Find the countries that appear in the enrollment data and in all four explanatory indicators
other_indexes = [data_gdp.index, data_pop.index, data_gini.index, data_public.index]
country_list = [c for c in data_enroll.index if all(c in idx for idx in other_indexes)]
print(country_list)
print(len(country_list))


['Australia', 'Austria', 'Belgium', 'Brazil', 'Canada', 'Chile', 'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France', 'Germany', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Ireland', 'Israel', 'Italy', 'Japan', 'Mexico', 'Netherlands', 'New Zealand', 'Norway', 'Poland', 'Portugal', 'Slovenia', 'South Africa', 'Spain', 'Sweden', 'Switzerland']
31

In [10]:
# For each country, keep the most recent year in 2000-2012 for which every variable is available
# (public spending is only available for 2009, so the 2009 value is used for every year)
for i in reversed(range(2000, 2013)):
    d = {
        'tertiary': data_enroll.loc[country_list]["%s" % i].astype(float),
        'log_gdp': np.log(data_gdp.loc[country_list]["%s" % i].astype(float)),
        'log_pop': np.log(data_pop.loc[country_list]["%s" % i].astype(float)),
        'gini': data_gini.loc[country_list]["%s" % i].astype(float),
        'public': data_public.loc[country_list]['2009'].astype(float),
        'year': i
    }
    df_test = pd.DataFrame(d).dropna()
    if i == 2012:
        df = df_test.copy()
    # Add countries that have no complete row in any later year
    for j in df_test.index.values:
        if j not in df.index.values:
            df.loc[j] = df_test.loc[j]
print(len(df))
df[['tertiary', 'log_gdp', 'log_pop', 'gini', 'public']].describe()


27
Out[10]:
        tertiary    log_gdp    log_pop       gini     public
count  27.000000  27.000000  27.000000  27.000000  27.000000
mean   65.841442  13.063664  16.572529  32.774074   1.166667
std    18.098664   1.393257   1.688527   6.160457   0.335123
min    16.396570   9.471810  12.678311  25.600000   0.500000
25%    59.257405  12.360758  15.661797  27.800000   1.000000
50%    70.519350  12.942301  16.168299  32.400000   1.200000
75%    77.013830  13.995792  17.557788  35.050000   1.400000
max    93.721820  15.418777  20.917337  50.800000   1.800000
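
The loop in In [10] walks backwards from 2012 and, for each country, keeps the first year in which no variable is missing. The selection rule can be sketched in plain Python as follows; the function name `latest_complete` and the toy data are hypothetical and only illustrate the rule, they are not taken from the real data set.

```python
def latest_complete(rows_by_year):
    """Return {country: row} using each country's newest year with no missing value.

    rows_by_year maps year -> {country: {variable: value or None}}.
    """
    latest = {}
    for year in sorted(rows_by_year, reverse=True):  # newest year first
        for country, row in rows_by_year[year].items():
            complete = all(v is not None for v in row.values())
            if complete and country not in latest:
                latest[country] = dict(row, year=year)
    return latest


# Hypothetical toy data: country "A" is complete in 2012, "B" only in 2011.
toy = {
    2011: {"A": {"tertiary": 60.0, "gini": 30.0}, "B": {"tertiary": 50.0, "gini": 35.0}},
    2012: {"A": {"tertiary": 62.0, "gini": 29.0}, "B": {"tertiary": 52.0, "gini": None}},
}
result = latest_complete(toy)
print(result["A"]["year"], result["B"]["year"])
```

One consequence of this rule is that all variables for a country come from the same year, so a newer but incomplete observation is ignored in favor of an older complete one.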

In [11]:
# Drop outliers (countries whose tertiary enrollment rate is below 30)
df = df[df['tertiary'] >= 30]
print(len(df))
df[['tertiary', 'log_gdp', 'log_pop', 'gini', 'public']].describe()


24
Out[11]:
        tertiary    log_gdp    log_pop       gini     public
count  24.000000  24.000000  24.000000  24.000000  24.000000
mean   71.143427  12.845443  16.192417  31.970833   1.195833
std    10.134027   1.312599   1.321017   5.622585   0.323673
min    55.561900   9.471810  12.678311  25.600000   0.500000
25%    61.977452  12.311479  15.528702  27.525000   1.000000
50%    71.034790  12.802285  16.139006  31.850000   1.200000
75%    77.907077  13.692860  16.991507  34.000000   1.400000
max    93.721820  15.271679  18.668033  50.800000   1.800000

In [12]:
# Compute the correlation matrix of the explanatory variables
df[['log_gdp', 'log_pop', 'gini', 'public']].corr()


Out[12]:
          log_gdp   log_pop      gini    public
log_gdp  1.000000  0.974037  0.220537 -0.298593
log_pop  0.974037  1.000000  0.337710 -0.394632
gini     0.220537  0.337710  1.000000 -0.451126
public  -0.298593 -0.394632 -0.451126  1.000000
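
The correlation of 0.974 between log_gdp and log_pop in the matrix above points to multicollinearity when both enter the same regression. One common diagnostic is the variance inflation factor (VIF); with exactly two regressors it reduces to 1 / (1 - r^2), where r is their correlation. The sketch below plugs in the value from the matrix above. The function name `vif_two_regressors` is illustrative, and the usual rule of thumb that a VIF above about 10 deserves attention is a convention, not a result of this notebook.

```python
def vif_two_regressors(r):
    """VIF shared by both regressors in a two-regressor model with correlation r."""
    return 1.0 / (1.0 - r ** 2)


r_gdp_pop = 0.974037  # corr(log_gdp, log_pop) from the correlation matrix above
vif = vif_two_regressors(r_gdp_pop)
print(round(vif, 1))
```

This applies to a model with log_gdp and log_pop as the only regressors. With more regressors, each VIF comes from regressing one explanatory variable on all the others, for example with `variance_inflation_factor` from `statsmodels.stats.outliers_influence`.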

In [13]:
# Simple regression on log GDP
# Set the explanatory variable
X = df[['log_gdp']]
X = sm.add_constant(X)
X.head()
# Set the dependent variable
Y = df['tertiary']
Y.head()
# Run OLS (ordinary least squares)
model = sm.OLS(Y,X)
results = model.fit()
print(results.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:               tertiary   R-squared:                       0.178
Model:                            OLS   Adj. R-squared:                  0.141
Method:                 Least Squares   F-statistic:                     4.766
Date:                Sun, 25 Oct 2015   Prob (F-statistic):             0.0400
Time:                        23:36:47   Log-Likelihood:                -86.772
No. Observations:                  24   AIC:                             177.5
Df Residuals:                      22   BIC:                             179.9
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const        112.9931     19.265      5.865      0.000        73.040   152.946
log_gdp       -3.2579      1.492     -2.183      0.040        -6.353    -0.163
==============================================================================
Omnibus:                        1.395   Durbin-Watson:                   1.855
Prob(Omnibus):                  0.498   Jarque-Bera (JB):                0.950
Skew:                           0.481   Prob(JB):                        0.622
Kurtosis:                       2.842   Cond. No.                         130.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
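
Because the regressor is the natural log of GDP while tertiary is in levels, the slope is a level-log coefficient: a 1% increase in GDP is associated with a change of about coef × ln(1.01) in the enrollment rate. The sketch below plugs in the estimates printed in the summary above and also re-derives the t statistic as coef / std err. Variable names are illustrative, and the result is in percentage points only under the assumption that the enrollment rate is expressed in percent.

```python
import math

coef_log_gdp = -3.2579  # slope on log_gdp from the summary above
std_err = 1.492         # its standard error from the summary above

# Change in the enrollment rate associated with a 1% higher GDP.
effect_of_1pct = coef_log_gdp * math.log(1.01)

# t statistic for the null hypothesis that the slope is zero.
t_stat = coef_log_gdp / std_err

print(round(effect_of_1pct, 4), round(t_stat, 3))
```

The recomputed t statistic can differ from the printed one in the last digit because the inputs here are already rounded.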

In [14]:
# Simple regression on log population
# Set the explanatory variable
X = df[['log_pop']]
X = sm.add_constant(X)
X.head()
# Set the dependent variable
Y = df['tertiary']
Y.head()
# Run OLS (ordinary least squares)
model = sm.OLS(Y,X)
results = model.fit()
print(results.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:               tertiary   R-squared:                       0.204
Model:                            OLS   Adj. R-squared:                  0.168
Method:                 Least Squares   F-statistic:                     5.633
Date:                Sun, 25 Oct 2015   Prob (F-statistic):             0.0268
Time:                        23:36:47   Log-Likelihood:                -86.390
No. Observations:                  24   AIC:                             176.8
Df Residuals:                      22   BIC:                             179.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const        127.2287     23.706      5.367      0.000        78.066   176.391
log_pop       -3.4637      1.459     -2.373      0.027        -6.490    -0.437
==============================================================================
Omnibus:                        1.636   Durbin-Watson:                   1.844
Prob(Omnibus):                  0.441   Jarque-Bera (JB):                0.978
Skew:                           0.494   Prob(JB):                        0.613
Kurtosis:                       2.985   Cond. No.                         205.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [15]:
# Simple regression on the Gini coefficient
# Set the explanatory variable
X = df[['gini']]
X = sm.add_constant(X)
X.head()
# Set the dependent variable
Y = df['tertiary']
Y.head()
# Run OLS (ordinary least squares)
model = sm.OLS(Y,X)
results = model.fit()
print(results.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:               tertiary   R-squared:                       0.073
Model:                            OLS   Adj. R-squared:                  0.031
Method:                 Least Squares   F-statistic:                     1.743
Date:                Sun, 25 Oct 2015   Prob (F-statistic):              0.200
Time:                        23:36:47   Log-Likelihood:                -88.210
No. Observations:                  24   AIC:                             180.4
Df Residuals:                      22   BIC:                             182.8
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         86.7563     12.000      7.230      0.000        61.870   111.642
gini          -0.4883      0.370     -1.320      0.200        -1.255     0.279
==============================================================================
Omnibus:                        0.863   Durbin-Watson:                   1.900
Prob(Omnibus):                  0.650   Jarque-Bera (JB):                0.773
Skew:                           0.152   Prob(JB):                        0.679
Kurtosis:                       2.175   Cond. No.                         191.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [16]:
# Simple regression on government spending
# Set the explanatory variable
X = df[['public']]
X = sm.add_constant(X)
X.head()
# Set the dependent variable
Y = df['tertiary']
Y.head()
# Run OLS (ordinary least squares)
model = sm.OLS(Y,X)
results = model.fit()
print(results.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:               tertiary   R-squared:                       0.095
Model:                            OLS   Adj. R-squared:                  0.054
Method:                 Least Squares   F-statistic:                     2.311
Date:                Sun, 25 Oct 2015   Prob (F-statistic):              0.143
Time:                        23:36:47   Log-Likelihood:                -87.927
No. Observations:                  24   AIC:                             179.9
Df Residuals:                      22   BIC:                             182.2
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         59.6000      7.856      7.587      0.000        43.308    75.892
public         9.6530      6.350      1.520      0.143        -3.516    22.822
==============================================================================
Omnibus:                        0.449   Durbin-Watson:                   1.711
Prob(Omnibus):                  0.799   Jarque-Bera (JB):                0.560
Skew:                           0.081   Prob(JB):                        0.756
Kurtosis:                       2.269   Cond. No.                         7.86
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [17]:
# Multiple regression on all four explanatory variables
# Set the explanatory variables
X = df[['log_gdp', 'log_pop', 'gini', 'public']]
X = sm.add_constant(X)
X.head()
# Set the dependent variable
Y = df['tertiary']
Y.head()
# Run OLS (ordinary least squares)
model = sm.OLS(Y,X)
results = model.fit()
print(results.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:               tertiary   R-squared:                       0.230
Model:                            OLS   Adj. R-squared:                  0.068
Method:                 Least Squares   F-statistic:                     1.422
Date:                Sun, 25 Oct 2015   Prob (F-statistic):              0.265
Time:                        23:36:47   Log-Likelihood:                -85.983
No. Observations:                  24   AIC:                             182.0
Df Residuals:                      19   BIC:                             187.9
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const        116.5609     43.388      2.687      0.015        25.750   207.372
log_gdp       -0.5124      8.309     -0.062      0.951       -17.903    16.878
log_pop       -2.3396      8.698     -0.269      0.791       -20.545    15.866
gini          -0.1753      0.457     -0.384      0.706        -1.132     0.781
public         3.8907      7.685      0.506      0.618       -12.194    19.976
==============================================================================
Omnibus:                        0.942   Durbin-Watson:                   1.795
Prob(Omnibus):                  0.624   Jarque-Bera (JB):                0.659
Skew:                           0.393   Prob(JB):                        0.719
Kurtosis:                       2.794   Cond. No.                         857.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [18]:
# Set the explanatory variables (multiple regression without public)
X = df[['log_gdp', 'log_pop', 'gini']]
X = sm.add_constant(X)
X.head()
# Set the dependent variable
Y = df['tertiary']
Y.head()
# Run OLS (ordinary least squares)
model = sm.OLS(Y,X)
results = model.fit()
print(results.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:               tertiary   R-squared:                       0.220
Model:                            OLS   Adj. R-squared:                  0.103
Method:                 Least Squares   F-statistic:                     1.880
Date:                Sun, 25 Oct 2015   Prob (F-statistic):              0.165
Time:                        23:36:47   Log-Likelihood:                -86.144
No. Observations:                  24   AIC:                             180.3
Df Residuals:                      20   BIC:                             185.0
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const        131.0571     31.985      4.097      0.001        64.337   197.777
log_gdp        0.6721      7.823      0.086      0.932       -15.646    16.990
log_pop       -3.7953      8.055     -0.471      0.643       -20.598    13.007
gini          -0.2218      0.439     -0.505      0.619        -1.138     0.694
==============================================================================
Omnibus:                        1.502   Durbin-Watson:                   1.874
Prob(Omnibus):                  0.472   Jarque-Bera (JB):                0.924
Skew:                           0.480   Prob(JB):                        0.630
Kurtosis:                       2.942   Cond. No.                         647.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [19]:
# Set the explanatory variables (multiple regression on log GDP and log population only)
X = df[['log_gdp', 'log_pop']]
X = sm.add_constant(X)
X.head()
# Set the dependent variable
Y = df['tertiary']
Y.head()
# Run OLS (ordinary least squares)
model = sm.OLS(Y,X)
results = model.fit()
print(results.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:               tertiary   R-squared:                       0.210
Model:                            OLS   Adj. R-squared:                  0.135
Method:                 Least Squares   F-statistic:                     2.792
Date:                Sun, 25 Oct 2015   Prob (F-statistic):             0.0841
Time:                        23:36:47   Log-Likelihood:                -86.296
No. Observations:                  24   AIC:                             178.6
Df Residuals:                      21   BIC:                             182.1
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const        134.8078     30.554      4.412      0.000        71.267   198.349
log_gdp        2.6818      6.614      0.405      0.689       -11.074    16.437
log_pop       -6.0592      6.572     -0.922      0.367       -19.727     7.608
==============================================================================
Omnibus:                        1.602   Durbin-Watson:                   1.837
Prob(Omnibus):                  0.449   Jarque-Bera (JB):                0.875
Skew:                           0.467   Prob(JB):                        0.646
Kurtosis:                       3.057   Cond. No.                         338.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
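
The summaries above report both R-squared and adjusted R-squared. The adjusted version penalizes additional regressors: 1 - (1 - R^2)(n - 1)/(n - k - 1), with n observations and k regressors excluding the constant. The sketch below recomputes it from the printed R-squared values (n = 24) for the nested models that use log_gdp and log_pop. The model labels and the function name `adjusted_r2` are illustrative.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k regressors (constant excluded)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


n_obs = 24
# (label, R-squared, number of regressors), copied from the summaries above
models = [
    ("log_gdp", 0.178, 1),
    ("log_pop", 0.204, 1),
    ("log_gdp + log_pop", 0.210, 2),
    ("log_gdp + log_pop + gini", 0.220, 3),
    ("log_gdp + log_pop + gini + public", 0.230, 4),
]
adj = {label: adjusted_r2(r2, n_obs, k) for label, r2, k in models}
for label, value in adj.items():
    print("%-36s %.3f" % (label, value))
```

Plain R-squared rises as regressors are added, while the adjusted value falls from the single-regressor log_pop model to the four-regressor model, which matches the large standard errors in the multiple regressions.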