통계적 사고 (2판) 연습문제 (thinkstats2.com, think-stat.xwmooc.org)
Allen Downey / 이광춘(xwMOOC)

연습문제 11.1

동료중 한명이 출산이 예정되어 있고, 출생일을 예측하는 사무실 내기게임에 참여한다고 가정하자. 임신 30주차에 내기를 건다고 가정하자. 가장 최선의 예측을 하는데 어떤 변수를 사용하면 될까? 출생전에 알려진 변수로 한정해야하고, 내기에 참여한 사람들에게 이용가능해야할 것 같다.

연습문제 11.2

트리버스-윌라드(Trivers-Willard) 가설은 많은 포유류에 있어, 성비는 “어머니 상태(maternal condition)”에 달려있다고 제시한다; 즉, 산모 연령, 크기, 건강, 사회적 지위 같은 요인. https://en.wikipedia.org/wiki/Trivers-Willard_hypothesis 참조한다. 일부 연구는 사람사이에 이런 효과를 보여주고 있지만, 결과는 혼재한다. 이번 장에서, 이런 요인과 연관된 일부 변수를 검정하지만, 성비에 통계적으로 유의적인 효과를 갖는 어떤 변수도 발견하지 못했다.

연습으로, 데이터 마이닝 접근법을 사용해서 임신파일과 응답자파일에서 다른 변수를 검정하라. 실질적인 효과를 갖는 다른 요인을 발견할 수 있는가?

연습문제 11.3

예측하고자 하는 수량(quantity)이 갯수(count)라면, 포아송 회귀를 사용할 수 있다. StatModels에 poisson 함수로 구현되어 있다. ols 나 logit 과 동일한 방식으로 작동한다. 연습으로, 이것을 사용해서 한 여성에 대해서 얼마나 많은 아기가 태어나는지 예측한다; NSFG 데이터셋에서, 이 변수는 numbabes로 불린다.. 나이가 35세, 흑인, 연가구소득이 $75,000을 넘는 대학을 졸업한 여성을 가정해보자. 그녀가 얼마나 많은 자녀를 출산할 것으로 예상되는가?

연습문제 11.4

만약 예측하고자 하는 수량이 범주형이라면, 다항 로지스틱 회귀를 사용한다. StatModels에서 mnlogit 함수로 구현되어 있다. 연습으로, 이를 사용해서, 한 여성이 혼인상태, 동거상태, 과부상태, 이혼상태, 별거상태, 미혼인지 추측해본다; NSFG 데이터셋에서, 혼인상태는 변수명 rmarital로 부호화되어 있다. 연령이 25세, 백인, 연가구소득이 약 $45,000 달러인 고졸 여성을 만났다고 가정해보자. 그녀가 혼인, 동거, 등일 확률은 얼마나 될까?



In [9]:

    
import numpy as np
import first
live, firsts, others = first.MakeFrames()
df = live[live.prglngth>30]

아래에 저자가 발견한 임신기간에 통계적으로 유의적인 변수만 나와있다.



In [3]:

    
import statsmodels.formula.api as smf
model = smf.ols('prglngth ~ birthord==1 + race==2 + nbrnaliv>1', data=df)
results = model.fit()
results.summary()









    Out[3]:





OLS Regression Results

  Dep. Variable:         prglngth        R-squared:             0.011 


  Model:                    OLS          Adj. R-squared:        0.011 


  Method:              Least Squares     F-statistic:           34.28 


  Date:              Mon, 30 Nov 2015    Prob (F-statistic):  5.09e-22 


  Time:                  10:43:48        Log-Likelihood:      -18247. 


  No. Observations:         8884         AIC:                3.650e+04


  Df Residuals:             8880         BIC:                3.653e+04


  Df Model:                    3                                      


  Covariance Type:       nonrobust                                    




                           coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  Intercept                 38.7617      0.039   1006.410   0.000     38.686    38.837


  birthord == 1[T.True]      0.1015      0.040      2.528   0.011      0.023     0.180


  race == 2[T.True]          0.1390      0.042      3.311   0.001      0.057     0.221


  nbrnaliv > 1[T.True]      -1.4944      0.164     -9.086   0.000     -1.817    -1.172




  Omnibus:        1587.470    Durbin-Watson:         1.619


  Prob(Omnibus):    0.000     Jarque-Bera (JB):   6160.751


  Skew:            -0.852     Prob(JB):               0.00


  Kurtosis:         6.707     Cond. No.               10.9

성비를 예측하는 변수에 대해서 좀더 깊게 살펴보자.



In [4]:

    
import regression
join = regression.JoinFemResp(live)



In [10]:

    
def GoMining(df):
    """Searches for variables that predict birth weight.

    df: DataFrame of pregnancy records

    returns: list of (rsquared, variable name) pairs
    """
    df['boy'] = (df.babysex==1).astype(int)
    variables = []
    for name in df.columns:
        try:
            if df[name].var() < 1e-7:
                continue

            formula='boy ~ agepreg + ' + name
            model = smf.logit(formula, data=df)
            nobs = len(model.endog)
            if nobs < len(df)/2:
                continue

            results = model.fit()
        except:
            continue

        variables.append((results.prsquared, name))

    return variables

variables = GoMining(join)









    



Optimization terminated successfully.
         Current function value: 0.687464
         Iterations 4

가장 높은 유사-$R^2$(pseudo-$R^2$) 값을 이끌어 내는 변수가 30개 있다.



In [11]:

    
regression.MiningReport(variables)









    



totalwgt_lb 0.00804663447667

임신기간 동안 알려지지 않은 변수와 다양한 사유로 수상한 냄새가 나는 변수를 제거하고 나서, 저자가 찾은 가장 최적의 모형이 다음에 나와있다:



In [12]:

    
formula='boy ~ agepreg + fmarout5==5 + infever==1'
model = smf.logit(formula, data=join)
results = model.fit()
results.summary()









    



Optimization terminated successfully.
         Current function value: 0.691983
         Iterations 4






    Out[12]:





Logit Regression Results

  Dep. Variable:         boy          No. Observations:      9148  


  Model:                Logit         Df Residuals:          9144  


  Method:                MLE          Df Model:                 3  


  Date:           Mon, 30 Nov 2015    Pseudo R-squ.:      0.001525 


  Time:               10:45:58        Log-Likelihood:      -6330.3 


  converged:            True          LL-Null:             -6339.9 


                                    LLR p-value:        0.0002326




                           coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept                 -0.1863      0.116     -1.606   0.108     -0.414     0.041


  fmarout5 == 5[T.True]      0.1493      0.048      3.087   0.002      0.054     0.244


  infever == 1[T.True]       0.2096      0.063      3.304   0.001      0.085     0.334


  agepreg                    0.0053      0.004      1.252   0.211     -0.003     0.014

응답자 자녀숫자를 예측하는 모형을 만들어보자. 연령에 대해 비선형모형을 사용했다. age3 항이 지나칠 수 있지만, 저자 생각에는 엉뚱한 선택은 아니라고 본다. 예측에 그다지 효과는 없다.



In [13]:

    
join.numbabes.replace([97], np.nan, inplace=True)
join['age2'] = join.age_r**2
join['age3'] = join.age_r**3



In [14]:

    
formula='numbabes ~ age_r + age2 + age3 + C(race) + totincr + educat'
model = smf.poisson(formula, data=join)
results = model.fit()
results.summary()









    



Optimization terminated successfully.
         Current function value: 1.677637
         Iterations 7






    Out[14]:





Poisson Regression Results

  Dep. Variable:      numbabes        No. Observations:       9148  


  Model:               Poisson        Df Residuals:           9140  


  Method:                MLE          Df Model:                  7  


  Date:           Mon, 30 Nov 2015    Pseudo R-squ.:        0.03665 


  Time:               10:46:04        Log-Likelihood:       -15347. 


  converged:            True          LL-Null:              -15931. 


                                    LLR p-value:        6.721e-248




                  coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept        -3.2731      0.719     -4.551   0.000     -4.683    -1.863


  C(race)[T.2]     -0.1403      0.014     -9.683   0.000     -0.169    -0.112


  C(race)[T.3]     -0.0959      0.024     -3.957   0.000     -0.143    -0.048


  age_r             0.3744      0.069      5.433   0.000      0.239     0.510


  age2             -0.0090      0.002     -4.158   0.000     -0.013    -0.005


  age3            7.12e-05    2.2e-05      3.237   0.001   2.81e-05     0.000


  totincr          -0.0185      0.002     -9.858   0.000     -0.022    -0.015


  educat           -0.0463      0.003    -15.988   0.000     -0.052    -0.041

이제, 35세, 흑인, 대학졸업하고 연간 가구소등이 $75,000을 넘는 여성에 대한 자녀수를 예측할 수 있다.



In [15]:

    
import pandas
columns = ['age_r', 'age2', 'age3', 'race', 'totincr', 'educat']
new = pandas.DataFrame([[35, 35**2, 35**3, 1, 14, 16]], columns=columns)
results.predict(new)









    Out[15]:





array([ 2.47393019])

혼인상태를 예측하기 위하는데, 저자가 찾은 가장 최적의 모형이 다음에 나와 있다:



In [17]:

    
formula='rmarital ~ age_r + age2 + C(race) + totincr + educat'
model = smf.mnlogit(formula, data=join)
results = model.fit()
results.summary()









    



Optimization terminated successfully.
         Current function value: 1.092083
         Iterations 8






    Out[17]:





MNLogit Regression Results

  Dep. Variable:      rmarital        No. Observations:      9148 


  Model:               MNLogit        Df Residuals:          9113 


  Method:                MLE          Df Model:                30 


  Date:           Mon, 30 Nov 2015    Pseudo R-squ.:       0.1661 


  Time:               10:46:25        Log-Likelihood:      -9990.4


  converged:            True          LL-Null:             -11981.


                                    LLR p-value:          0.000 




   rmarital=2      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept         8.9153      0.792     11.251   0.000      7.362    10.468


  C(race)[T.2]     -0.9260      0.087    -10.705   0.000     -1.096    -0.756


  C(race)[T.3]     -0.6335      0.133     -4.747   0.000     -0.895    -0.372


  age_r            -0.3567      0.050     -7.132   0.000     -0.455    -0.259


  age2              0.0047      0.001      6.054   0.000      0.003     0.006


  totincr          -0.1301      0.011    -11.475   0.000     -0.152    -0.108


  educat           -0.1940      0.018    -10.534   0.000     -0.230    -0.158


   rmarital=3      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept         2.9927      2.970      1.007   0.314     -2.829     8.815


  C(race)[T.2]     -0.3963      0.235     -1.685   0.092     -0.857     0.065


  C(race)[T.3]      0.0650      0.336      0.194   0.846     -0.593     0.723


  age_r            -0.3141      0.174     -1.806   0.071     -0.655     0.027


  age2              0.0064      0.003      2.532   0.011      0.001     0.011


  totincr          -0.3217      0.032    -10.135   0.000     -0.384    -0.259


  educat           -0.1093      0.048     -2.266   0.023     -0.204    -0.015


   rmarital=4      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept        -3.6475      1.193     -3.059   0.002     -5.985    -1.310


  C(race)[T.2]     -0.3303      0.091     -3.641   0.000     -0.508    -0.153


  C(race)[T.3]     -0.8227      0.170     -4.853   0.000     -1.155    -0.490


  age_r             0.1238      0.070      1.763   0.078     -0.014     0.261


  age2             -0.0008      0.001     -0.814   0.416     -0.003     0.001


  totincr          -0.2288      0.011    -20.058   0.000     -0.251    -0.206


  educat            0.0661      0.016      4.015   0.000      0.034     0.098


   rmarital=5      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept        -2.3978      1.269     -1.889   0.059     -4.886     0.090


  C(race)[T.2]     -1.0493      0.101    -10.366   0.000     -1.248    -0.851


  C(race)[T.3]     -0.6065      0.154     -3.937   0.000     -0.908    -0.305


  age_r             0.2084      0.077      2.699   0.007      0.057     0.360


  age2             -0.0030      0.001     -2.619   0.009     -0.005    -0.001


  totincr          -0.2900      0.014    -20.314   0.000     -0.318    -0.262


  educat           -0.0176      0.021     -0.835   0.404     -0.059     0.024


   rmarital=6      coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept         8.1722      0.797     10.252   0.000      6.610     9.734


  C(race)[T.2]     -2.1628      0.079    -27.532   0.000     -2.317    -2.009


  C(race)[T.3]     -1.9701      0.136    -14.472   0.000     -2.237    -1.703


  age_r            -0.2188      0.050     -4.337   0.000     -0.318    -0.120


  age2              0.0020      0.001      2.505   0.012      0.000     0.004


  totincr          -0.2888      0.011    -25.418   0.000     -0.311    -0.267


  educat           -0.0802      0.018     -4.582   0.000     -0.115    -0.046

25세, 백인, 고등학교 졸업, 연간 가구소득이 $45,000인 여성에 대한 예측이 다음에 나와있다.



In [18]:

    
columns = ['age_r', 'age2', 'race', 'totincr', 'educat']
new = pandas.DataFrame([[25, 25**2, 2, 11, 12]], columns=columns)
results.predict(new)









    Out[18]:





array([[ 0.74568869,  0.12829132,  0.0016241 ,  0.03280099,  0.02188668,
         0.06970823]])

그래서, 이 분의 경우 현재 결혼가능성이 75%이고, "결혼하지 않고 다른 성과 동거" 가능성이 13% 등이다.



In [ ]:

Dep. Variable:	prglngth	R-squared:	0.011
Model:	OLS	Adj. R-squared:	0.011
Method:	Least Squares	F-statistic:	34.28
Date:	Mon, 30 Nov 2015	Prob (F-statistic):	5.09e-22
Time:	10:43:48	Log-Likelihood:	-18247.
No. Observations:	8884	AIC:	3.650e+04
Df Residuals:	8880	BIC:	3.653e+04
Df Model:	3
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[95.0% Conf. Int.]
Intercept	38.7617	0.039	1006.410	0.000	38.686 38.837
birthord == 1[T.True]	0.1015	0.040	2.528	0.011	0.023 0.180
race == 2[T.True]	0.1390	0.042	3.311	0.001	0.057 0.221
nbrnaliv > 1[T.True]	-1.4944	0.164	-9.086	0.000	-1.817 -1.172

Omnibus:	1587.470	Durbin-Watson:	1.619
Prob(Omnibus):	0.000	Jarque-Bera (JB):	6160.751
Skew:	-0.852	Prob(JB):	0.00
Kurtosis:	6.707	Cond. No.	10.9

Dep. Variable:	boy	No. Observations:	9148
Model:	Logit	Df Residuals:	9144
Method:	MLE	Df Model:	3
Date:	Mon, 30 Nov 2015	Pseudo R-squ.:	0.001525
Time:	10:45:58	Log-Likelihood:	-6330.3
converged:	True	LL-Null:	-6339.9
		LLR p-value:	0.0002326

	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	-0.1863	0.116	-1.606	0.108	-0.414 0.041
fmarout5 == 5[T.True]	0.1493	0.048	3.087	0.002	0.054 0.244
infever == 1[T.True]	0.2096	0.063	3.304	0.001	0.085 0.334
agepreg	0.0053	0.004	1.252	0.211	-0.003 0.014

Dep. Variable:	numbabes	No. Observations:	9148
Model:	Poisson	Df Residuals:	9140
Method:	MLE	Df Model:	7
Date:	Mon, 30 Nov 2015	Pseudo R-squ.:	0.03665
Time:	10:46:04	Log-Likelihood:	-15347.
converged:	True	LL-Null:	-15931.
		LLR p-value:	6.721e-248

Dep. Variable:	rmarital	No. Observations:	9148
Model:	MNLogit	Df Residuals:	9113
Method:	MLE	Df Model:	30
Date:	Mon, 30 Nov 2015	Pseudo R-squ.:	0.1661
Time:	10:46:25	Log-Likelihood:	-9990.4
converged:	True	LL-Null:	-11981.
		LLR p-value:	0.000

rmarital=2	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	8.9153	0.792	11.251	0.000	7.362 10.468
C(race)[T.2]	-0.9260	0.087	-10.705	0.000	-1.096 -0.756
C(race)[T.3]	-0.6335	0.133	-4.747	0.000	-0.895 -0.372
age_r	-0.3567	0.050	-7.132	0.000	-0.455 -0.259
age2	0.0047	0.001	6.054	0.000	0.003 0.006
totincr	-0.1301	0.011	-11.475	0.000	-0.152 -0.108
educat	-0.1940	0.018	-10.534	0.000	-0.230 -0.158
rmarital=3	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	2.9927	2.970	1.007	0.314	-2.829 8.815
C(race)[T.2]	-0.3963	0.235	-1.685	0.092	-0.857 0.065
C(race)[T.3]	0.0650	0.336	0.194	0.846	-0.593 0.723
age_r	-0.3141	0.174	-1.806	0.071	-0.655 0.027
age2	0.0064	0.003	2.532	0.011	0.001 0.011
totincr	-0.3217	0.032	-10.135	0.000	-0.384 -0.259
educat	-0.1093	0.048	-2.266	0.023	-0.204 -0.015
rmarital=4	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	-3.6475	1.193	-3.059	0.002	-5.985 -1.310
C(race)[T.2]	-0.3303	0.091	-3.641	0.000	-0.508 -0.153
C(race)[T.3]	-0.8227	0.170	-4.853	0.000	-1.155 -0.490
age_r	0.1238	0.070	1.763	0.078	-0.014 0.261
age2	-0.0008	0.001	-0.814	0.416	-0.003 0.001
totincr	-0.2288	0.011	-20.058	0.000	-0.251 -0.206
educat	0.0661	0.016	4.015	0.000	0.034 0.098
rmarital=5	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	-2.3978	1.269	-1.889	0.059	-4.886 0.090
C(race)[T.2]	-1.0493	0.101	-10.366	0.000	-1.248 -0.851
C(race)[T.3]	-0.6065	0.154	-3.937	0.000	-0.908 -0.305
age_r	0.2084	0.077	2.699	0.007	0.057 0.360
age2	-0.0030	0.001	-2.619	0.009	-0.005 -0.001
totincr	-0.2900	0.014	-20.314	0.000	-0.318 -0.262
educat	-0.0176	0.021	-0.835	0.404	-0.059 0.024
rmarital=6	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	8.1722	0.797	10.252	0.000	6.610 9.734
C(race)[T.2]	-2.1628	0.079	-27.532	0.000	-2.317 -2.009
C(race)[T.3]	-1.9701	0.136	-14.472	0.000	-2.237 -1.703
age_r	-0.2188	0.050	-4.337	0.000	-0.318 -0.120
age2	0.0020	0.001	2.505	0.012	0.000 0.004
totincr	-0.2888	0.011	-25.418	0.000	-0.311 -0.267
educat	-0.0802	0.018	-4.582	0.000	-0.115 -0.046