There are two basic approaches to adjusting for covariates. Conceptually, the simpler one is to hold the covariates constant at some level when collecting the data, or to extract a subset of the data that holds those covariates constant. The other approach is to include the covariates in your models.
For example, suppose you want to study the differences in the wages of males and females. The very simple model `wage ~ sex` might give some insight, but it attributes to `sex` effects that might actually be due to level of education, age, or the sector of the economy in which the person works. Here's the result from the simple model:
In [17]:
import pandas as pd
import statsmodels.formula.api as sm
cps = pd.read_csv("http://www.mosaic-web.org/go/datasets/cps.csv")
fit0 = sm.ols("wage ~ sex", data=cps).fit()
fit0.summary()
Out[17]:
The coefficients indicate that a typical male makes \$2.12 more per hour than a typical female (notice that $R^2 = 0.0422$ is very small: `sex` explains hardly any of the person-to-person variability in wage).
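If you want to work with these numbers directly rather than reading them off the summary table, the fitted-model object exposes them as attributes. This is just a sketch: the exact label of the `sex` coefficient depends on how the levels are coded in `cps.csv`, so inspect `fit0.params` to see the actual names.

fit0.params      # intercept plus the sex coefficient (about 2.12)
fit0.rsquared    # about 0.0422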
By including the variables `age`, `educ`, and `sector` in the model, you can adjust for these variables:
In [18]:
fit1 = sm.ols("wage ~ age + sex + educ + sector", data=cps).fit()
fit1.summary()
Out[18]:
The adjusted difference between the sexes is \$1.94 per hour (the $R^2 = 0.30$ from this model is considerably larger than for `fit0`, but still a lot of the person-to-person variation in wages has not been captured).
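To see the adjustment at work, you can put the `sex` coefficient from the two models side by side. The label `sex[T.M]` below is an assumption about how `sex` is coded in `cps.csv`; check `fit1.params` for the actual name if it differs.

print(fit0.params["sex[T.M]"])   # unadjusted difference, about 2.12 (assumes sex coded "F"/"M")
print(fit1.params["sex[T.M]"])   # adjusted difference, about 1.94
print(fit1.rsquared)             # about 0.30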
It would be wrong to claim that simply including a covariate in a model guarantees that an appropriate adjustment has been made. The effectiveness of the adjustment depends on whether the model design is appropriate, for instance whether appropriate interaction terms have been included. However, it's certainly the case that if you don't include the covariate in the model, you have not adjusted for it.
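For instance, if you suspect that the payoff from education differs between the sexes, you might add a sex-by-education interaction term. This is only a sketch of how such a term is written in the model formula, not a claim that this particular interaction is the right one to include:

# A sketch: adding an interaction between sex and education to the adjusted model
fit1a = sm.ols("wage ~ age + sex + educ + sector + sex:educ", data=cps).fit()
fit1a.summary()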
The other approach is to subsample the data so that the levels of the covariates are approximately constant. For example, here is a subset that considers workers between the ages of 30 and 35, with between 10 and 12 years of education, working in the sales sector of the economy:
In [21]:
small = cps[(cps.age >= 30) & (cps.age <= 35) &
            (cps.educ >= 10) & (cps.educ <= 12) &
            (cps.sector == "sales")]
The choice of these particular levels of `age`, `educ`, and `sector` is arbitrary, but you need to choose some levels if you want to hold the covariates approximately constant.
The subset of the data can be used to fit a simple model:
In [22]:
fit2 = sm.ols("wage ~ sex", data=small).fit()
fit2.summary()
Out[22]:
At first glance, there might seem to be nothing wrong with this approach and, indeed, for very large datasets it can be effective. In this case, however, there are only three cases that satisfy the various criteria: two women and one man.
In [23]:
small.sex.value_counts()
Out[23]:
So, the \$4.50 difference in wages between the sexes depends entirely on the data from a single male! **Note**: The 'Confidence in Models' tutorial (Chapter 12) describes how to assess the precision of model coefficients. This one works out to be $4.50 \pm 11.00$; not at all precise.
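As a preview of that calculation, statsmodels can report a confidence interval for each coefficient; the interval on the `sex` coefficient from the subset model is roughly consistent with the $\pm\,\$11$ figure quoted above.

# A sketch: 95% confidence intervals for the coefficients of the small-subset model
fit2.conf_int()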
As with all 'Statistical Modeling: A Fresh Approach for Python' tutorials, this tutorial is based directly on material from 'Statistical Modeling: A Fresh Approach (2nd Edition)' by Daniel Kaplan.
I have made an effort to keep the text and explanations consistent between the original (R-based) version and the Python tutorials, in order to keep things comparable. With that in mind, any errors, omissions, and/or differences between the two versions are mine, and any questions, comments, and/or concerns should be directed to me.