Title: Total and Partial Relationships Date: 2013-10-02 23:43 Author: cfarmer Email: carson.farmer@gmail.com Category: Statistical Modeling for Python Tags: Helpful tips, Python, Statistical Modeling, Teaching Slug: statistical-modeling-python-total-partial Latex: yes Status: draft

Total and Partial Relationships

Adjustment

There are two basic approaches to adjusting for covariates. Conceptually, the simplest one is to hold the covariates constant at some level when collecting data, or to extract a subset of the data in which those covariates are constant. The other approach is to include the covariates in your models.

For example, suppose you want to study the differences in the wages of males and females. The very simple model wage ~ sex might give some insight, but it attributes to sex effects that might actually be due to level of education, age, or the sector of the economy in which the person works. Here's the result from the simple model:



In [17]:
import pandas as pd
import statsmodels.formula.api as sm

cps = pd.read_csv("http://www.mosaic-web.org/go/datasets/cps.csv")
fit0 = sm.ols("wage ~ sex", data=cps).fit()
fit0.summary()


Out[17]:
OLS Regression Results
Dep. Variable: wage R-squared: 0.042
Model: OLS Adj. R-squared: 0.040
Method: Least Squares F-statistic: 23.43
Date: Wed, 02 Oct 2013 Prob (F-statistic): 1.70e-06
Time: 23:55:26 Log-Likelihood: -1619.8
No. Observations: 534 AIC: 3244.
Df Residuals: 532 BIC: 3252.
Df Model: 1
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 7.8789 0.322 24.497 0.000 7.247 8.511
sex[T.M] 2.1161 0.437 4.840 0.000 1.257 2.975
Omnibus: 214.494 Durbin-Watson: 1.800
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1045.357
Skew: 1.737 Prob(JB): 1.01e-227
Kurtosis: 8.909 Cond. No. 2.73

The coefficients indicate that a typical male makes \$2.12 more per hour than a typical female (notice that $R^2 = 0.0422$ is very small: sex explains hardly any of the person-to-person variability in wage).
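As an aside, the coefficient and $R^2$ can be pulled from the fitted model object programmatically rather than read off the printed summary; `params` and `rsquared` are standard attributes of a statsmodels fit. A minimal sketch with made-up stand-in data (the names `demo` and `fit_demo` are illustrative, not part of the cps example):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up stand-in data: wages for two groups (not the cps data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "sex": ["F"] * 50 + ["M"] * 50,
    "wage": np.concatenate([rng.normal(7.9, 4.0, 50),
                            rng.normal(10.0, 4.0, 50)]),
})

fit_demo = smf.ols("wage ~ sex", data=demo).fit()
print(fit_demo.params["sex[T.M]"])  # estimated M-vs-F difference
print(fit_demo.rsquared)            # fraction of variance explained
```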

By including the variables age, educ, and sector in the model, you can adjust for them:


In [18]:
fit1 = sm.ols("wage ~ age + sex + educ + sector", data=cps).fit()
fit1.summary()


Out[18]:
OLS Regression Results
Dep. Variable: wage R-squared: 0.302
Model: OLS Adj. R-squared: 0.289
Method: Least Squares F-statistic: 22.65
Date: Wed, 02 Oct 2013 Prob (F-statistic): 2.41e-35
Time: 23:57:23 Log-Likelihood: -1535.2
No. Observations: 534 AIC: 3092.
Df Residuals: 523 BIC: 3140.
Df Model: 10
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -4.6941 1.538 -3.053 0.002 -7.715 -1.673
sex[T.M] 1.9417 0.423 4.592 0.000 1.111 2.772
sector[T.const] 1.4355 1.131 1.269 0.205 -0.787 3.658
sector[T.manag] 3.2711 0.767 4.266 0.000 1.765 4.778
sector[T.manuf] 0.8063 0.731 1.103 0.271 -0.630 2.243
sector[T.other] 0.7584 0.759 0.999 0.318 -0.733 2.250
sector[T.prof] 2.2478 0.670 3.356 0.001 0.932 3.564
sector[T.sales] -0.7671 0.842 -0.911 0.363 -2.421 0.887
sector[T.service] -0.5687 0.666 -0.854 0.394 -1.877 0.740
age 0.1022 0.017 6.167 0.000 0.070 0.135
educ 0.6156 0.094 6.521 0.000 0.430 0.801
Omnibus: 230.507 Durbin-Watson: 1.836
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2032.042
Skew: 1.659 Prob(JB): 0.00
Kurtosis: 11.962 Cond. No. 370.

The adjusted difference between the sexes is \$1.94 per hour (the $R^2 = 0.30$ from this model is considerably larger than for fit0, but a lot of the person-to-person variation in wages still has not been captured).

It would be wrong to claim that simply including a covariate in a model guarantees that an appropriate adjustment has been made. The effectiveness of the adjustment depends on whether the model design is appropriate, for instance whether appropriate interaction terms have been included. However, it's certainly the case that if you don't include the covariate in the model, you have not adjusted for it.
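In the formula mini-language, interaction terms are written with `*` (main effects plus interaction) or `:` (interaction only). A sketch with invented data showing how the education effect can be allowed to differ between the sexes; the variable names mirror the cps example but the numbers are made up:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented data with the same variable names as the cps example
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "sex": rng.choice(["F", "M"], size=200),
    "educ": rng.integers(8, 18, size=200),
})
demo["wage"] = 2.0 + 0.5 * demo.educ + rng.normal(0.0, 2.0, size=200)

# sex * educ expands to sex + educ + sex:educ, so each sex
# gets its own education slope
fit_int = smf.ols("wage ~ sex * educ", data=demo).fit()
print(fit_int.params)  # includes a sex[T.M]:educ interaction term
```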

The other approach is to subsample the data so that the levels of the covariates are approximately constant. For example, here is a subset that considers workers between the ages of 30 and 35, with between 10 and 12 years of education, working in the sales sector of the economy:


In [21]:
small = cps[(cps.age<=35) & (cps.age>=30) & (cps.educ>=10) & (cps.educ<=12) & (cps.sector=="sales")]

The choice of these particular levels of age, educ, and sector is arbitrary, but you need to choose some level if you want to hold the covariates approximately constant.
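The same subset can also be written with `DataFrame.query`, which accepts chained comparisons and reads closer to the prose; a sketch with a small invented frame (the real subset would use the cps data itself):

```python
import pandas as pd

# Small invented frame with the same columns used in the subset above
demo = pd.DataFrame({
    "age":    [32, 40, 31, 34],
    "educ":   [11, 12, 16, 10],
    "sector": ["sales", "sales", "prof", "sales"],
})

# Equivalent to chaining boolean masks with &
subset = demo.query("30 <= age <= 35 and 10 <= educ <= 12 and sector == 'sales'")
print(subset)  # rows 0 and 3 qualify
```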

The subset of the data can be used to fit a simple model:


In [22]:
fit2 = sm.ols("wage ~ sex", data=small).fit()
fit2.summary()


Out[22]:
OLS Regression Results
Dep. Variable: wage R-squared: 0.964
Model: OLS Adj. R-squared: 0.929
Method: Least Squares F-statistic: 27.00
Date: Thu, 03 Oct 2013 Prob (F-statistic): 0.121
Time: 00:26:40 Log-Likelihood: -1.5692
No. Observations: 3 AIC: 7.138
Df Residuals: 1 BIC: 5.336
Df Model: 1
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 4.5000 0.500 9.000 0.070 -1.853 10.853
sex[T.M] 4.5000 0.866 5.196 0.121 -6.504 15.504
Omnibus: nan Durbin-Watson: 1.000
Prob(Omnibus): nan Jarque-Bera (JB): 0.281
Skew: 0.000 Prob(JB): 0.869
Kurtosis: 1.500 Cond. No. 2.41

At first glance, there might seem to be nothing wrong with this approach and, indeed, for very large datasets it can be effective. In this case, however, there are only 3 cases that satisfy the various criteria: two women and one man.


In [23]:
small.sex.value_counts()


Out[23]:
F    2
M    1
dtype: int64

So, the \$4.50 difference in wages between the sexes depends entirely on the data from a single male! **Note**: The 'Confidence in Models' tutorial (Chapter 12) describes how to assess the precision of model coefficients. This one works out to be $4.50 \pm 11.00$; not at all precise.
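That interval can be computed directly with `conf_int()` on the fitted model. A sketch using made-up wages chosen to be consistent with the summary above (two women averaging \$4.50 per hour, one man at \$9.00); the names `tiny` and `tiny_fit` are illustrative:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Invented wages consistent with the reported fit: two women
# averaging $4.50 per hour and a single man at $9.00
tiny = pd.DataFrame({"sex": ["F", "F", "M"],
                     "wage": [4.0, 5.0, 9.0]})
tiny_fit = smf.ols("wage ~ sex", data=tiny).fit()

print(tiny_fit.params["sex[T.M]"])  # 4.5
# With only one residual degree of freedom, the 95% interval
# is enormous: roughly 4.5 +/- 11
print(tiny_fit.conf_int().loc["sex[T.M]"])
```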

Next time on Statistical Modeling: A Fresh Approach for Python...

  • Modeling Randomness

Reference

As with all 'Statistical Modeling: A Fresh Approach for Python' tutorials, this tutorial is based directly on material from 'Statistical Modeling: A Fresh Approach (2nd Edition)' by Daniel Kaplan.

I have made an effort to keep the text and explanations consistent between the original (R-based) version and the Python tutorials, in order to keep things comparable. With that in mind, any errors, omissions, and/or differences between the two versions are mine, and any questions, comments, and/or concerns should be directed to me.