# Examples and Exercises from Think Stats, 2nd Edition

http://thinkstats2.com



In [1]:

%matplotlib inline

import numpy as np
import pandas as pd

import thinkstats2
import thinkplot



## Multiple regression

Let's load up the NSFG data again.



In [2]:

import first

live, firsts, others = first.MakeFrames()



Here's birth weight as a function of mother's age (which we saw in the previous chapter).



In [3]:

import statsmodels.formula.api as smf

formula = 'totalwgt_lb ~ agepreg'
model = smf.ols(formula, data=live)
results = model.fit()
results.summary()




Out[3]:

OLS Regression Results

Dep. Variable:       totalwgt_lb     R-squared:             0.005

Method:             Least Squares    F-statistic:           43.02

Date:             Thu, 28 Feb 2019   Prob (F-statistic): 5.72e-11

Time:                 09:58:53       Log-Likelihood:      -15897.

No. Observations:        9038        AIC:                3.180e+04

Df Residuals:            9036        BIC:                3.181e+04

Df Model:                   1

Covariance Type:      nonrobust

coef     std err      t      P>|t|  [0.025    0.975]

Intercept     6.8304     0.068   100.470  0.000     6.697     6.964

agepreg       0.0175     0.003     6.559  0.000     0.012     0.023

Omnibus:       1024.052   Durbin-Watson:         1.618

Prob(Omnibus):   0.000    Jarque-Bera (JB):   3081.833

Skew:           -0.601    Prob(JB):               0.00

Kurtosis:        5.596    Cond. No.               118.

Warnings:[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.



We can extract the parameters.



In [4]:

inter = results.params['Intercept']
slope = results.params['agepreg']
inter, slope




Out[4]:

(6.830396973311047, 0.017453851471802877)
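
As a quick illustration of how these two parameters are used, the fitted line gives a predicted mean birth weight for any mother's age. The sketch below plugs the values from Out[4] into that line; the helper name `predict_weight` and the ages in the loop are hypothetical choices made for illustration.

```python
# Parameters copied from Out[4] above.
inter = 6.830396973311047
slope = 0.017453851471802877

def predict_weight(age):
    """Predicted mean birth weight (lb) for a given mother's age (years)."""
    return inter + slope * age

# Hypothetical ages, chosen only to show the trend along the fitted line.
for age in (20, 25, 30, 35):
    print(age, round(predict_weight(age), 3))
```

Because the slope is small, the predicted weights differ by only a fraction of a pound across this age range.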



And the p-value of the slope estimate.



In [5]:

slope_pvalue = results.pvalues['agepreg']
slope_pvalue




Out[5]:

5.722947107312786e-11



And the coefficient of determination.



In [6]:

results.rsquared




Out[6]:

0.004738115474710369
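
To make the coefficient of determination more concrete, here is a minimal sketch that computes $R^2$ by hand for a simple linear fit. The data are made up for illustration (they are not NSFG values), and the code uses plain Python rather than StatsModels.

```python
# Toy data (hypothetical): mother's age in years, birth weight in pounds.
xs = [20.0, 25.0, 30.0, 35.0, 40.0]
ys = [7.0, 7.4, 7.1, 7.6, 7.5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least squares slope and intercept for a single explanatory variable.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
slope = sxy / sxx
inter = mean_y - slope * mean_x

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = sum((y - (inter + slope * x)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - ss_res / syy

# With one explanatory variable, R^2 equals the squared Pearson correlation.
r = sxy / (sxx * syy) ** 0.5
print(r_squared, r ** 2)
```

By the same identity, the value in Out[6] corresponds to a correlation of roughly 0.069 between mother's age and birth weight.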



Next, compute the difference in mean birth weight between first babies and others.



In [7]:

diff_weight = firsts.totalwgt_lb.mean() - others.totalwgt_lb.mean()
diff_weight




Out[7]:

-0.12476118453549034



And the difference in mean age between mothers of first babies and others.



In [8]:

diff_age = firsts.agepreg.mean() - others.agepreg.mean()
diff_age




Out[8]:

-3.5864347661500275



The age difference plausibly explains about half of the difference in weight.



In [9]:

slope * diff_age




Out[9]:

-0.06259709972169267
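
The "about half" claim can be checked with one more division. The sketch below copies the values from Out[4], Out[7], and Out[8]; the variable names `explained` and `fraction` are introduced here for illustration.

```python
slope = 0.017453851471802877        # lb per year of age, from Out[4]
diff_weight = -0.12476118453549034  # lb, from Out[7]
diff_age = -3.5864347661500275      # years, from Out[8]

# Weight difference we would expect from the age difference alone.
explained = slope * diff_age

# Share of the observed weight difference that this accounts for.
fraction = explained / diff_weight
print(round(explained, 4), round(fraction, 3))
```

The ratio comes out very close to one half, which is where the statement above comes from.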



Running a single regression with a categorical variable, isfirst:



In [10]:

live['isfirst'] = live.birthord == 1
formula = 'totalwgt_lb ~ isfirst'
results = smf.ols(formula, data=live).fit()
results.summary()




Out[10]:

OLS Regression Results

Dep. Variable:       totalwgt_lb     R-squared:             0.002

Method:             Least Squares    F-statistic:           17.74

Date:             Thu, 28 Feb 2019   Prob (F-statistic): 2.55e-05

Time:                 09:58:53       Log-Likelihood:      -15909.

No. Observations:        9038        AIC:                3.182e+04

Df Residuals:            9036        BIC:                3.184e+04

Df Model:                   1

Covariance Type:      nonrobust

coef     std err      t      P>|t|  [0.025    0.975]

Intercept           7.3259     0.021   356.007  0.000     7.286     7.366

isfirst[T.True]    -0.1248     0.030    -4.212  0.000    -0.183    -0.067

Omnibus:       988.919   Durbin-Watson:         1.613

Prob(Omnibus):  0.000    Jarque-Bera (JB):   2897.107

Skew:          -0.589    Prob(JB):               0.00

Kurtosis:       5.511    Cond. No.               2.58

Warnings:[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
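
Note that the `isfirst[T.True]` coefficient matches the difference in means computed earlier, and the intercept is the mean weight for the other group. This holds whenever the only explanatory variable is a 0/1 indicator. The sketch below checks it on made-up data (not NSFG values) with a hand-written least squares fit.

```python
# Toy data (hypothetical): 1 marks a first baby, 0 marks any other baby.
isfirst = [1, 1, 1, 0, 0, 0, 0]
weights = [7.0, 7.2, 7.1, 7.4, 7.3, 7.5, 7.2]

def mean(values):
    return sum(values) / len(values)

mean_first = mean([w for f, w in zip(isfirst, weights) if f == 1])
mean_other = mean([w for f, w in zip(isfirst, weights) if f == 0])

# Least squares fit of weight on the 0/1 indicator.
mx, my = mean(isfirst), mean(weights)
sxy = sum((x - mx) * (y - my) for x, y in zip(isfirst, weights))
sxx = sum((x - mx) ** 2 for x in isfirst)
slope = sxy / sxx
inter = my - slope * mx

print(inter, mean_other)               # intercept vs. mean of the 0 group
print(slope, mean_first - mean_other)  # slope vs. difference in means
```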



Finally, we can run a multiple regression that includes both variables:



In [11]:

formula = 'totalwgt_lb ~ isfirst + agepreg'
results = smf.ols(formula, data=live).fit()
results.summary()




Out[11]:

OLS Regression Results

Dep. Variable:       totalwgt_lb     R-squared:             0.005

Method:             Least Squares    F-statistic:           24.02

Date:             Thu, 28 Feb 2019   Prob (F-statistic): 3.95e-11

Time:                 09:58:53       Log-Likelihood:      -15894.

No. Observations:        9038        AIC:                3.179e+04

Df Residuals:            9035        BIC:                3.182e+04

Df Model:                   2

Covariance Type:      nonrobust

coef     std err      t      P>|t|  [0.025    0.975]

Intercept           6.9142     0.078    89.073  0.000     6.762     7.066

isfirst[T.True]    -0.0698     0.031    -2.236  0.025    -0.131    -0.009

agepreg             0.0154     0.003     5.499  0.000     0.010     0.021

Omnibus:       1019.945   Durbin-Watson:         1.618

Prob(Omnibus):   0.000    Jarque-Bera (JB):   3063.682

Skew:           -0.599    Prob(JB):               0.00

Kurtosis:        5.588    Cond. No.               137.

Warnings:[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.



As expected, when we control for mother's age, the apparent difference due to isfirst is cut in half.

If we add age squared, we can control for a quadratic relationship between age and weight.



In [12]:

live['agepreg2'] = live.agepreg**2
formula = 'totalwgt_lb ~ isfirst + agepreg + agepreg2'
results = smf.ols(formula, data=live).fit()
results.summary()




Out[12]:

OLS Regression Results

Dep. Variable:       totalwgt_lb     R-squared:             0.007

Method:             Least Squares    F-statistic:           22.64

Date:             Thu, 28 Feb 2019   Prob (F-statistic): 1.35e-14

Time:                 09:58:54       Log-Likelihood:      -15884.

No. Observations:        9038        AIC:                3.178e+04

Df Residuals:            9034        BIC:                3.181e+04

Df Model:                   3

Covariance Type:      nonrobust

coef     std err      t      P>|t|  [0.025    0.975]

Intercept           5.6923     0.286    19.937  0.000     5.133     6.252

isfirst[T.True]    -0.0504     0.031    -1.602  0.109    -0.112     0.011

agepreg             0.1124     0.022     5.113  0.000     0.069     0.155

agepreg2           -0.0018     0.000    -4.447  0.000    -0.003    -0.001

Omnibus:       1007.149   Durbin-Watson:         1.616

Prob(Omnibus):   0.000    Jarque-Bera (JB):   3003.343

Skew:           -0.594    Prob(JB):               0.00

Kurtosis:        5.562    Cond. No.           1.39e+04

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.39e+04. This might indicate that there are strong multicollinearity or other numerical problems.



When we do that, the apparent effect of isfirst gets even smaller, and is no longer statistically significant.

These results suggest that the apparent difference in weight between first babies and others might be explained by difference in mothers' ages, at least in part.
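
As a side note on the quadratic model, the positive `agepreg` coefficient and negative `agepreg2` coefficient describe a curve that rises and then falls. The sketch below locates the peak using the rounded coefficients printed in Out[12], so the result is only approximate (around 31 years).

```python
# Rounded coefficients copied from the Out[12] summary table.
b_age = 0.1124    # agepreg
b_age2 = -0.0018  # agepreg2

# The curve b_age * x + b_age2 * x**2 peaks where its derivative is zero:
# b_age + 2 * b_age2 * x = 0
peak_age = -b_age / (2 * b_age2)
print(round(peak_age, 1))
```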

## Data Mining

We can use join to combine variables from the pregnancy and respondent tables.



In [13]:

import nsfg

live = live[live.prglngth>30]
resp = nsfg.ReadFemResp()
resp.index = resp.caseid
join = live.join(resp, on='caseid', rsuffix='_r')



And we can search for variables with explanatory power.

Because we don't clean most of the variables, we are probably missing some good ones.



In [14]:

import patsy

def GoMining(df):
    """Searches for variables that predict birth weight.

    df: DataFrame of pregnancy records

    returns: list of (rsquared, variable name) pairs
    """
    variables = []
    for name in df.columns:
        try:
            # skip variables that are (nearly) constant
            if df[name].var() < 1e-7:
                continue

            formula = 'totalwgt_lb ~ agepreg + ' + name

            # skip variables that are missing for more than half the rows
            model = smf.ols(formula, data=df)
            if model.nobs < len(df)/2:
                continue

            results = model.fit()
        except (ValueError, TypeError, patsy.PatsyError):
            continue

        variables.append((results.rsquared, name))

    return variables




In [15]:

variables = GoMining(join)



The following functions report the variables with the highest values of $R^2$.



In [16]:

import re

def ReadVariables():
    """Reads Stata dictionary files for NSFG data.

    returns: DataFrame that maps variable names to descriptions
    """
    vars1 = thinkstats2.ReadStataDct('2002FemPreg.dct').variables
    vars2 = thinkstats2.ReadStataDct('2002FemResp.dct').variables

    all_vars = pd.concat([vars1, vars2])
    all_vars.index = all_vars.name
    return all_vars

def MiningReport(variables, n=30):
    """Prints variables with the highest R^2.

    variables: list of (R^2, variable name) pairs
    n: number of pairs to print
    """
    all_vars = ReadVariables()

    variables.sort(reverse=True)
    for r2, name in variables[:n]:
        key = re.sub('_r$', '', name)
        try:
            desc = all_vars.loc[key].desc
            if isinstance(desc, pd.Series):
                desc = desc.iloc[0]
            print(name, r2, desc)
        except (KeyError, IndexError):
            print(name, r2)



Some of the variables that do well are not useful for prediction because they are not known ahead of time.



In [17]:

MiningReport(variables)



Combining the variables that seem to have the most explanatory power.



In [18]:

formula = ('totalwgt_lb ~ agepreg + C(race) + babysex==1 + '
           'nbrnaliv>1 + paydu==1 + totincr')
results = smf.ols(formula, data=join).fit()
results.summary()




Out[18]:

OLS Regression Results

Dep. Variable:       totalwgt_lb     R-squared:             0.060

Model:                   OLS         Adj. R-squared:        0.059

Method:             Least Squares    F-statistic:           79.98

Date:             Thu, 28 Feb 2019   Prob (F-statistic): 4.86e-113

Time:                 09:59:13       Log-Likelihood:      -14295.

No. Observations:        8781        AIC:                2.861e+04

Df Residuals:            8773        BIC:                2.866e+04

Df Model:                   7

Covariance Type:      nonrobust

coef     std err      t      P>|t|  [0.025    0.975]

Intercept                6.6303     0.065   102.223  0.000     6.503     6.757

C(race)[T.2]             0.3570     0.032    11.215  0.000     0.295     0.419

C(race)[T.3]             0.2665     0.051     5.175  0.000     0.166     0.367

babysex == 1[T.True]     0.2952     0.026    11.216  0.000     0.244     0.347

nbrnaliv > 1[T.True]    -1.3783     0.108   -12.771  0.000    -1.590    -1.167

paydu == 1[T.True]       0.1196     0.031     3.861  0.000     0.059     0.180

agepreg                  0.0074     0.003     2.921  0.004     0.002     0.012

totincr                  0.0122     0.004     3.110  0.002     0.005     0.020

Omnibus:        398.813   Durbin-Watson:         1.604

Prob(Omnibus):    0.000   Jarque-Bera (JB):   1388.362

Skew:            -0.037   Prob(JB):          3.32e-302

Kurtosis:         4.947   Cond. No.               221.

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.



## Logistic regression

Example: suppose we are trying to predict y using explanatory variables x1 and x2.

In [19]:

y = np.array([0, 1, 0, 1])
x1 = np.array([0, 0, 0, 1])
x2 = np.array([0, 1, 1, 1])



According to the logit model the log odds for the $i$th element of $y$ is

$\log o = \beta_0 + \beta_1 x_1 + \beta_2 x_2$

So let's start with an arbitrary guess about the elements of $\beta$:



In [20]:

beta = [-1.5, 2.8, 1.1]



Plugging in the model, we get log odds.



In [21]:

log_o = beta[0] + beta[1] * x1 + beta[2] * x2
log_o




Out[21]:

array([-1.5, -0.4, -0.4,  2.4])



Which we can convert to odds.



In [22]:

o = np.exp(log_o)
o




Out[22]:

array([ 0.22313016,  0.67032005,  0.67032005, 11.02317638])



And then convert to probabilities.



In [23]:

p = o / (o+1)
p




Out[23]:

array([0.18242552, 0.40131234, 0.40131234, 0.9168273 ])



The likelihoods of the actual outcomes are $p$ where $y$ is 1 and $1-p$ where $y$ is 0.



In [24]:

likes = np.where(y, p, 1-p)
likes




Out[24]:

array([0.81757448, 0.40131234, 0.59868766, 0.9168273 ])



The likelihood of $y$ given $\beta$ is the product of likes:



In [25]:

like = np.prod(likes)
like




Out[25]:

0.1800933529673034



Logistic regression works by searching for the values in $\beta$ that maximize like.

Here's an example using variables in the NSFG respondent file to predict whether a baby will be a boy or a girl.



In [26]:

import first
live, firsts, others = first.MakeFrames()
live = live[live.prglngth>30]
live['boy'] = (live.babysex==1).astype(int)



The mother's age seems to have a small effect.



In [27]:

model = smf.logit('boy ~ agepreg', data=live)
results = model.fit()
results.summary()




Optimization terminated successfully.
Current function value: 0.693015
Iterations 3

Out[27]:

Logit Regression Results

Dep. Variable:        boy         No. Observations:       8884

Model:               Logit        Df Residuals:           8882

Method:               MLE         Df Model:                  1

Date:          Thu, 28 Feb 2019   Pseudo R-squ.:      6.144e-06

Time:              09:59:16       Log-Likelihood:       -6156.7

converged:           True         LL-Null:              -6156.8

LLR p-value:         0.7833

coef     std err      z      P>|z|  [0.025    0.975]

Intercept     0.0058     0.098     0.059  0.953    -0.185     0.197

agepreg       0.0010     0.004     0.275  0.783    -0.006     0.009



Here are the variables that seemed most promising.



In [28]:

formula = 'boy ~ agepreg + hpagelb + birthord + C(race)'
model = smf.logit(formula, data=live)
results = model.fit()
results.summary()




Optimization terminated successfully.
Current function value: 0.692944
Iterations 3

Out[28]:

Logit Regression Results

Dep. Variable:        boy         No. Observations:       8782

Model:               Logit        Df Residuals:           8776

Method:               MLE         Df Model:                  5

Date:          Thu, 28 Feb 2019   Pseudo R-squ.:      0.0001440

Time:              09:59:16       Log-Likelihood:       -6085.4

converged:           True         LL-Null:              -6086.3

LLR p-value:         0.8822

coef     std err      z      P>|z|  [0.025    0.975]

Intercept       -0.0301     0.104    -0.290  0.772    -0.234     0.173

C(race)[T.2]    -0.0224     0.051    -0.439  0.660    -0.122     0.077

C(race)[T.3]    -0.0005     0.083    -0.005  0.996    -0.163     0.162

agepreg         -0.0027     0.006    -0.484  0.629    -0.014     0.008

hpagelb          0.0047     0.004     1.112  0.266    -0.004     0.013

birthord         0.0050     0.022     0.227  0.821    -0.038     0.048



To make a prediction, we have to extract the exogenous and endogenous variables.



In [29]:

endog = pd.DataFrame(model.endog, columns=[model.endog_names])
exog = pd.DataFrame(model.exog, columns=model.exog_names)



The baseline prediction strategy is to guess "boy". In that case, we're right almost 51% of the time.



In [30]:

actual = endog['boy']
baseline = actual.mean()
baseline




Out[30]:

0.507173764518333



If we use the previous model, we can compute the number of predictions we get right.



In [31]:

predict = (results.predict() >= 0.5)
true_pos = predict * actual
true_neg = (1 - predict) * (1 - actual)
sum(true_pos), sum(true_neg)




Out[31]:

(3944.0, 548.0)



And the accuracy, which is slightly higher than the baseline.

In [32]:

acc = (sum(true_pos) + sum(true_neg)) / len(actual)
acc




Out[32]:

0.5115007970849464



To make a prediction for an individual, we have to get their information into a DataFrame.



In [33]:

columns = ['agepreg', 'hpagelb', 'birthord', 'race']
new = pd.DataFrame([[35, 39, 3, 2]], columns=columns)
y = results.predict(new)
y




Out[33]:

0    0.513091
dtype: float64



This person has a 51% chance of having a boy (according to the model).

## Exercises

Exercise: Suppose one of your co-workers is expecting a baby and you are participating in an office pool to predict the date of birth. Assuming that bets are placed during the 30th week of pregnancy, what variables could you use to make the best prediction? You should limit yourself to variables that are known before the birth, and likely to be available to the people in the pool.



In [34]:

import first
live, firsts, others = first.MakeFrames()
live = live[live.prglngth>30]



The following are the only variables I found that have a statistically significant effect on pregnancy length.



In [35]:

import statsmodels.formula.api as smf
model = smf.ols('prglngth ~ birthord==1 + race==2 + nbrnaliv>1', data=live)
results = model.fit()
results.summary()




Out[35]:

OLS Regression Results

Dep. Variable:        prglngth       R-squared:             0.011

Model:                   OLS         Adj. R-squared:        0.011

Method:             Least Squares    F-statistic:           34.28

Date:             Thu, 28 Feb 2019   Prob (F-statistic): 5.09e-22

Time:                 09:59:18       Log-Likelihood:      -18247.

No. Observations:        8884        AIC:                3.650e+04

Df Residuals:            8880        BIC:                3.653e+04

Df Model:                   3

Covariance Type:      nonrobust

coef     std err      t      P>|t|  [0.025    0.975]

Intercept                 38.7617     0.039  1006.410  0.000    38.686    38.837

birthord == 1[T.True]      0.1015     0.040     2.528  0.011     0.023     0.180

race == 2[T.True]          0.1390     0.042     3.311  0.001     0.057     0.221

nbrnaliv > 1[T.True]      -1.4944     0.164    -9.086  0.000    -1.817    -1.172

Omnibus:       1587.470   Durbin-Watson:         1.619

Prob(Omnibus):   0.000    Jarque-Bera (JB):   6160.751

Skew:           -0.852    Prob(JB):               0.00

Kurtosis:        6.707    Cond. No.               10.9

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.



Exercise: The Trivers-Willard hypothesis suggests that for many mammals the sex ratio depends on “maternal condition”; that is, factors like the mother’s age, size, health, and social status. See https://en.wikipedia.org/wiki/Trivers-Willard_hypothesis

Some studies have shown this effect among humans, but results are mixed. In this chapter we tested some variables related to these factors, but didn’t find any with a statistically significant effect on sex ratio.

As an exercise, use a data mining approach to test the other variables in the pregnancy and respondent files. Can you find any factors with a substantial effect?



Current function value: 0.691874
Iterations 4

Out[38]:

Logit Regression Results

Dep. Variable:        boy         No. Observations:       8884

Model:               Logit        Df Residuals:           8880

Method:               MLE         Df Model:                  3

Date:          Thu, 28 Feb 2019   Pseudo R-squ.:       0.001653

Time:              09:59:57       Log-Likelihood:       -6146.6

converged:           True         LL-Null:              -6156.8

LLR p-value:        0.0001432

coef     std err      z      P>|z|  [0.025    0.975]

Intercept                -0.1805     0.118    -1.534  0.125    -0.411     0.050

fmarout5 == 5[T.True]     0.1582     0.049     3.217  0.001     0.062     0.255

infever == 1[T.True]      0.2194     0.065     3.374  0.001     0.092     0.347

agepreg                   0.0050     0.004     1.172  0.241    -0.003     0.013



Exercise: If the quantity you want to predict is a count, you can use Poisson regression, which is implemented in StatsModels with a function called poisson. It works the same way as ols and logit. As an exercise, let’s use it to predict how many children a woman has borne; in the NSFG dataset, this variable is called numbabes.

Suppose you meet a woman who is 35 years old, black, and a college graduate whose annual household income exceeds \$75,000. How many children would you predict she has borne?



In [39]:

# Solution

# I used a nonlinear model of age.

join['numbabes'] = join.numbabes.replace(97, np.nan)
join['age2'] = join.age_r**2




In [40]:

# Solution

formula='numbabes ~ age_r + age2 + C(race) + totincr + educat'
model = smf.poisson(formula, data=join)
results = model.fit()
results.summary()




Optimization terminated successfully.
Current function value: 1.677002
Iterations 7

Out[40]:

Poisson Regression Results

Dep. Variable:     numbabes       No. Observations:       8884

Model:              Poisson       Df Residuals:           8877

Method:               MLE         Df Model:                  6

Date:          Thu, 28 Feb 2019   Pseudo R-squ.:        0.03686

Time:              09:59:57       Log-Likelihood:       -14898.

converged:           True         LL-Null:              -15469.

LLR p-value:        3.681e-243

coef     std err      z      P>|z|  [0.025    0.975]

Intercept       -1.0324     0.169    -6.098  0.000    -1.364    -0.701

C(race)[T.2]    -0.1401     0.015    -9.479  0.000    -0.169    -0.111

C(race)[T.3]    -0.0991     0.025    -4.029  0.000    -0.147    -0.051

age_r            0.1556     0.010    15.006  0.000     0.135     0.176

age2            -0.0020     0.000   -13.102  0.000    -0.002    -0.002

totincr         -0.0187     0.002    -9.830  0.000    -0.022    -0.015

educat          -0.0471     0.003   -16.076  0.000    -0.053    -0.041
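
To see what a Poisson regression prediction means, it can help to compute one by hand: the model predicts the exponential of the linear combination of the coefficients and the inputs. The sketch below uses the rounded coefficients from Out[40] and the same encoded inputs as the notebook's solution cell (race 1, totincr 14, educat 16). The helper name `predict_numbabes` is hypothetical, and the assumption that race 1 is the reference category is inferred from the table, which lists only `T.2` and `T.3`.

```python
import math

# Rounded coefficients copied from the Out[40] summary table.
coef = {
    'Intercept': -1.0324,
    'C(race)[T.2]': -0.1401,
    'C(race)[T.3]': -0.0991,
    'age_r': 0.1556,
    'age2': -0.0020,
    'totincr': -0.0187,
    'educat': -0.0471,
}

def predict_numbabes(age, race, totincr, educat):
    """Expected count under the Poisson model: exp(linear predictor)."""
    log_mu = (coef['Intercept']
              + coef['age_r'] * age
              + coef['age2'] * age**2
              + coef['totincr'] * totincr
              + coef['educat'] * educat)
    # Assumption: race 1 is the reference category, so it adds nothing.
    if race in (2, 3):
        log_mu += coef['C(race)[T.%d]' % race]
    return math.exp(log_mu)

estimate = predict_numbabes(age=35, race=1, totincr=14, educat=16)
print(estimate)
```

Because the table rounds the coefficients (the `age2` term in particular is multiplied by 1225), the hand computation lands near, but not exactly on, the value that `results.predict` returns.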



Now we can predict the number of children for a woman who is 35 years old, black, and a college graduate whose annual household income exceeds \$75,000.



In [41]:

# Solution

columns = ['age_r', 'age2', 'age3', 'race', 'totincr', 'educat']
new = pd.DataFrame([[35, 35**2, 35**3, 1, 14, 16]], columns=columns)
results.predict(new)




Out[41]:

0    2.496802
dtype: float64



Exercise: If the quantity you want to predict is categorical, you can use multinomial logistic regression, which is implemented in StatsModels with a function called mnlogit. As an exercise, let’s use it to guess whether a woman is married, cohabitating, widowed, divorced, separated, or never married; in the NSFG dataset, marital status is encoded in a variable called rmarital.

Suppose you meet a woman who is 25 years old, white, and a high school graduate whose annual household income is about \$45,000. What is the probability that she is married, cohabitating, etc?



In [42]:

# Solution

# Here's the best model I could find.

formula='rmarital ~ age_r + age2 + C(race) + totincr + educat'
model = smf.mnlogit(formula, data=join)
results = model.fit()
results.summary()




Optimization terminated successfully.
Current function value: 1.084053
Iterations 8

Out[42]:

MNLogit Regression Results

Dep. Variable:     rmarital       No. Observations:      8884

Model:              MNLogit       Df Residuals:          8849

Method:               MLE         Df Model:                30

Date:          Thu, 28 Feb 2019   Pseudo R-squ.:       0.1682

Time:              09:59:58       Log-Likelihood:      -9630.7

converged:           True         LL-Null:             -11579.

LLR p-value:          0.000

rmarital=2     coef     std err      z      P>|z|  [0.025    0.975]

Intercept        9.0156     0.805    11.199  0.000     7.438    10.593

C(race)[T.2]    -0.9237     0.089   -10.418  0.000    -1.097    -0.750

C(race)[T.3]    -0.6179     0.136    -4.536  0.000    -0.885    -0.351

age_r           -0.3635     0.051    -7.150  0.000    -0.463    -0.264

age2             0.0048     0.001     6.103  0.000     0.003     0.006

totincr         -0.1310     0.012   -11.337  0.000    -0.154    -0.108

educat          -0.1953     0.019   -10.424  0.000    -0.232    -0.159

rmarital=3     coef     std err      z      P>|z|  [0.025    0.975]

Intercept        2.9570     3.020     0.979  0.328    -2.963     8.877

C(race)[T.2]    -0.4411     0.237    -1.863  0.062    -0.905     0.023

C(race)[T.3]     0.0591     0.336     0.176  0.860    -0.600     0.718

age_r           -0.3177     0.177    -1.798  0.072    -0.664     0.029

age2             0.0064     0.003     2.528  0.011     0.001     0.011

totincr         -0.3258     0.032   -10.175  0.000    -0.389    -0.263

educat          -0.0991     0.048    -2.050  0.040    -0.194    -0.004

rmarital=4     coef     std err      z      P>|z|  [0.025    0.975]

Intercept       -3.5238     1.205    -2.924  0.003    -5.886    -1.162

C(race)[T.2]    -0.3213     0.093    -3.445  0.001    -0.504    -0.139

C(race)[T.3]    -0.7706     0.171    -4.509  0.000    -1.106    -0.436

age_r            0.1155     0.071     1.626  0.104    -0.024     0.255

age2            -0.0007     0.001    -0.701  0.483    -0.003     0.001

totincr         -0.2276     0.012   -19.621  0.000    -0.250    -0.205

educat           0.0667     0.017     3.995  0.000     0.034     0.099

rmarital=5     coef     std err      z      P>|z|  [0.025    0.975]

Intercept       -2.8963     1.305    -2.220  0.026    -5.453    -0.339

C(race)[T.2]    -1.0407     0.104   -10.038  0.000    -1.244    -0.837

C(race)[T.3]    -0.5661     0.156    -3.635  0.000    -0.871    -0.261

age_r            0.2411     0.079     3.038  0.002     0.086     0.397

age2            -0.0035     0.001    -2.977  0.003    -0.006    -0.001

totincr         -0.2932     0.015   -20.159  0.000    -0.322    -0.265

educat          -0.0174     0.021    -0.813  0.416    -0.059     0.025

rmarital=6     coef     std err      z      P>|z|  [0.025    0.975]

Intercept        8.0533     0.814     9.890  0.000     6.457     9.649

C(race)[T.2]    -2.1871     0.080   -27.211  0.000    -2.345    -2.030

C(race)[T.3]    -1.9611     0.138   -14.188  0.000    -2.232    -1.690

age_r           -0.2127     0.052    -4.122  0.000    -0.314    -0.112

age2             0.0019     0.001     2.321  0.020     0.000     0.003

totincr         -0.2945     0.012   -25.320  0.000    -0.317    -0.272

educat          -0.0742     0.018    -4.169  0.000    -0.109    -0.039



Make a prediction for a woman who is 25 years old, white, and a high school graduate whose annual household income is about \$45,000.



In [43]:

# Solution

# This person has a 75% chance of being currently married,
# a 13% chance of being "not married but living with opposite
# sex partner", etc.

columns = ['age_r', 'age2', 'race', 'totincr', 'educat']
new = pd.DataFrame([[25, 25**2, 2, 11, 12]], columns=columns)
results.predict(new)




Out[43]:

          0         1         2         3         4         5

0  0.750028  0.126397  0.001564  0.033403  0.021485  0.067122
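
The probabilities in Out[43] can be roughly reproduced by hand, which shows what the multinomial logit model is doing. Each block of coefficients in Out[42] gives the log odds of one outcome relative to the reference outcome (assumed here to be rmarital=1, since it has no block of its own), and a softmax turns those log odds into probabilities. The sketch uses the rounded coefficients from the table, so the results agree only to about two decimal places.

```python
import math

# Rounded coefficients copied from the Out[42] summary table, one row per
# non-reference outcome, in the order:
# (Intercept, C(race)[T.2], age_r, age2, totincr, educat)
coefs = {
    2: (9.0156, -0.9237, -0.3635, 0.0048, -0.1310, -0.1953),
    3: (2.9570, -0.4411, -0.3177, 0.0064, -0.3258, -0.0991),
    4: (-3.5238, -0.3213, 0.1155, -0.0007, -0.2276, 0.0667),
    5: (-2.8963, -1.0407, 0.2411, -0.0035, -0.2932, -0.0174),
    6: (8.0533, -2.1871, -0.2127, 0.0019, -0.2945, -0.0742),
}

# Inputs from In [43]: age 25, race 2, totincr 11, educat 12.
# The leading 1 multiplies the intercept; the second 1 switches on race 2.
x = (1, 1, 25, 25**2, 11, 12)

# Log odds of each outcome relative to the reference outcome rmarital=1.
log_odds = {1: 0.0}
for k, row in coefs.items():
    log_odds[k] = sum(b * v for b, v in zip(row, x))

# Softmax: exponentiate and normalize so the probabilities sum to 1.
total = sum(math.exp(v) for v in log_odds.values())
probs = {k: math.exp(v) / total for k, v in log_odds.items()}

for k in sorted(probs):
    print(k, round(probs[k], 3))
```

Column 0 of Out[43] corresponds to rmarital=1, column 1 to rmarital=2, and so on.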




