There are two basic approaches to adjusting for covariates. Conceptually, the simpler one is to hold the covariates constant at some level when collecting the data, or to extract a subset of the data that holds those covariates constant. The other approach is to include the covariates in your models.
For example, suppose you want to study the differences in the wages of males and females. The very simple model `wage ~ sex` might give some insight, but it attributes to `sex` effects that might actually be due to level of education, age, or the sector of the economy in which the person works. Here's the result from the simple model:
In [17]:
import pandas as pd
import statsmodels.formula.api as sm
cps = pd.read_csv("http://www.mosaic-web.org/go/datasets/cps.csv")
fit0 = sm.ols("wage ~ sex", data=cps).fit()
fit0.summary()
Out[17]:
The coefficients indicate that a typical male makes \$2.12 more per hour than a typical female (notice that $R^2 = 0.0422$ is very small: `sex` explains hardly any of the person-to-person variability in wage).
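If you want to work with these numbers directly rather than reading them off the summary table, the fitted-model object exposes them as attributes. This is just a sketch: the exact label of the `sex` coefficient depends on how the levels are coded in `cps.csv`, so inspect `fit0.params` to see the actual names.

fit0.params      # intercept plus the sex coefficient (about 2.12)
fit0.rsquared    # about 0.0422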
By including the variables `age`, `educ`, and `sector` in the model, you can adjust for these variables:
In [18]:
fit1 = sm.ols("wage ~ age + sex + educ + sector", data=cps).fit()
fit1.summary()
Out[18]:
The adjusted difference between the sexes is \$1.94 per hour (the $R^2 = 0.30$ from this model is considerably larger than for `fit0`, but still a lot of the person-to-person variation in wages has not been captured).
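To see the adjustment at work, you can put the `sex` coefficient from the two models side by side. The label `sex[T.M]` below is an assumption about how `sex` is coded in `cps.csv`; check `fit1.params` for the actual name if it differs.

print(fit0.params["sex[T.M]"])   # unadjusted difference, about 2.12 (assumes sex coded "F"/"M")
print(fit1.params["sex[T.M]"])   # adjusted difference, about 1.94
print(fit1.rsquared)             # about 0.30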
It would be wrong to claim that simply including a covariate in a model guarantees that an appropriate adjustment has been made. The effectiveness of the adjustment depends on whether the model design is appropriate, for instance whether appropriate interaction terms have been included. However, it's certainly the case that if you don't include the covariate in the model, you have not adjusted for it.
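For instance, if you suspect that the payoff from education differs between the sexes, you might add a sex-by-education interaction term. This is only a sketch of how such a term is written in the model formula, not a claim that this particular interaction is the right one to include:

# A sketch: adding an interaction between sex and education to the adjusted model
fit1a = sm.ols("wage ~ age + sex + educ + sector + sex:educ", data=cps).fit()
fit1a.summary()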
The other approach is to subsample the data so that the levels of the covariates are approximately constant. For example, here is a subset that considers workers between the ages of 30 and 35, with between 10 and 12 years of education, working in the sales sector of the economy:
In [21]:
small = cps[(cps.age >= 30) & (cps.age <= 35) &
            (cps.educ >= 10) & (cps.educ <= 12) &
            (cps.sector == "sales")]
The choice of these particular levels of `age`, `educ`, and `sector` is arbitrary, but you need to choose some levels if you want to hold the covariates approximately constant.
The subset of the data can be used to fit a simple model:
In [22]:
fit2 = sm.ols("wage ~ sex", data=small).fit()
fit2.summary()
Out[22]:
At first glance, there might seem to be nothing wrong with this approach and, indeed, for very large datasets it can be effective. In this case, however, there are only three cases that satisfy the various criteria: two women and one man.
In [23]:
small.sex.value_counts()
Out[23]:
So, the \$4.50 difference in wages between the sexes depends entirely on the data from a single male! **Note**: The 'Confidence in Models' tutorial (Chapter 12) describes how to assess the precision of model coefficients. This one works out to be $4.50 \pm 11.00$; not at all precise.
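As a preview of that calculation, statsmodels can report a confidence interval for each coefficient; the interval on the `sex` coefficient from the subset model is roughly consistent with the $\pm\,\$11$ figure quoted above.

# A sketch: 95% confidence intervals for the coefficients of the small-subset model
fit2.conf_int()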
As with all 'Statistical Modeling: A Fresh Approach for Python' tutorials, this tutorial is based directly on material from 'Statistical Modeling: A Fresh Approach (2nd Edition)' by Daniel Kaplan.
I have made an effort to keep the text and explanations consistent between the original (R-based) version and the Python tutorials, in order to keep things comparable. With that in mind, any errors, omissions, and/or differences between the two versions are mine, and any questions, comments, and/or concerns should be directed to me.