Regression with Categorical Variables


In [43]:
%pylab inline
pylab.style.use('ggplot')
import numpy as np
import pandas as pd


Populating the interactive namespace from numpy and matplotlib

The BirthSmokers Data

Researchers interested in answering the above research question collected the following data (birthsmokers.txt) on a random sample of n = 32 births:

Response (y): birth weight (Weight) in grams of baby
Potential predictor (x1): Smoking status of mother (yes or no)
Potential predictor (x2): length of gestation (Gest) in weeks

The distinguishing feature of this data set is that one of the predictor variables — Smoking — is a qualitative predictor. To be more precise, smoking is a "binary variable" with only two possible values (yes or no). The other predictor variable (Gest) is, of course, quantitative.

In [44]:
smoking_txt = """Wgt	Gest	Smoke
2940	38	yes
3130	38	no
2420	36	yes
2450	34	no
2760	39	yes
2440	35	yes
3226	40	no
3301	42	yes
2729	37	no
3410	40	no
2715	36	yes
3095	39	no
3130	39	yes
3244	39	no
2520	35	no
2928	39	yes
3523	41	no
3446	42	yes
2920	38	no
2957	39	yes
3530	42	no
2580	38	yes
3040	37	no
3500	42	yes
3200	41	yes
3322	39	no
3459	40	no
3346	42	yes
2619	35	no
3175	41	yes
2740	38	yes
2841	36	no"""

In [45]:
from io import StringIO
smoking_df = pd.read_csv(StringIO(smoking_txt), sep='\t')
smoking_df.head()


Out[45]:
Wgt Gest Smoke
0 2940 38 yes
1 3130 38 no
2 2420 36 yes
3 2450 34 no
4 2760 39 yes

Let's plot how the birth weight is related to the gestation period for the two groups: smokers and non-smokers.


In [46]:
ax = smoking_df[smoking_df.Smoke == 'yes'].plot(kind='scatter', x='Gest', y='Wgt')
smoking_df[smoking_df.Smoke == 'no'].plot(kind='scatter', color='green', ax=ax, x='Gest', y='Wgt')
ax.legend(['Smokers', 'Non-smokers'], loc='upper left')


Out[46]:
<matplotlib.legend.Legend at 0x1e8169314e0>

Let's also find out the min, max and average birth weights for smokers and non-smokers.


In [47]:
smoking_df.Wgt.groupby(smoking_df.Smoke).agg({'min': np.min, 'max': np.max, 'mean': np.mean})


Out[47]:
min max mean
Smoke
no 2450 3530 3066.125
yes 2420 3500 2973.625

In [48]:
import statsmodels.formula.api as sm
result = sm.ols(formula='Wgt ~ C(Smoke) + Gest', data=smoking_df).fit()
result.summary()


Out[48]:
OLS Regression Results
Dep. Variable: Wgt R-squared: 0.896
Model: OLS Adj. R-squared: 0.889
Method: Least Squares F-statistic: 125.4
Date: Sat, 01 Apr 2017 Prob (F-statistic): 5.29e-15
Time: 23:15:49 Log-Likelihood: -195.82
No. Observations: 32 AIC: 397.6
Df Residuals: 29 BIC: 402.0
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -2389.5729 349.206 -6.843 0.000 -3103.779 -1675.366
C(Smoke)[T.yes] -244.5440 41.982 -5.825 0.000 -330.406 -158.682
Gest 143.1003 9.128 15.677 0.000 124.431 161.769
Omnibus: 1.946 Durbin-Watson: 2.016
Prob(Omnibus): 0.378 Jarque-Bera (JB): 1.162
Skew: -0.090 Prob(JB): 0.559
Kurtosis: 2.084 Cond. No. 663.

And the obligatory residual histogram:


In [49]:
result.resid.plot(kind='hist', bins=20)


Out[49]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e8169a8780>

In [66]:
#smoking_df['Smoke'] = smoking_df['Smoke'].astype('category', values=['no', 'yes']).cat.codes
predictions = result.predict(smoking_df)

In [67]:
smoking_df['fit_values'] = predictions.astype(np.int)

In [84]:
ax = smoking_df[smoking_df.Smoke == 'yes'].plot(kind='scatter', color='blue', x='Gest', y='Wgt')
ax = smoking_df[smoking_df.Smoke == 'no'].plot(kind='scatter', color='green', ax=ax, x='Gest', y='Wgt')

# Also the regression line by group
# smoking_df[smoking_df.Smoke == 'yes'].plot(kind='line', color='blue', x='Gest', y='fit_values')
ax = smoking_df[smoking_df.Smoke == 'yes'].sort_values(by='Gest').plot(
    kind='line', ax=ax, color='blue', x='Gest', y='fit_values')

ax = smoking_df[smoking_df.Smoke == 'no'].sort_values(by='Gest').plot(
    kind='line', ax=ax, color='green', x='Gest', y='fit_values')

ax.legend(['Smokers', 'Non-smokers'], loc='upper left')


Out[84]:
<matplotlib.legend.Legend at 0x1e81809c320>
  • Does the effect of the gestation length on mean birth weight depend on whether or not the mother is a smoker? The answer is no! Regardless of whether or not the mother is a smoker, for each additional one-week of gestation, the mean birth weight is predicted to increase by 143 grams. This lack of interaction between the two predictors is exhibted by the parallelness of the two lines.
  • Does the effect of smoking on mean birth weight depend on the length of gestation? The answer is no! For a fixed length of gestation, the mean birth weight of babies born to smoking mothers is predicted to be 245 grams lower than the mean birth weight of babies born to non-smoking mothers. Again, this lack of interaction between the two predictors is exhibted by the parallelness of the two lines.

When two predictors do not interact, we say that each predictor has an "additive effect" on the response. More formally, a regression model contains additive effects if the response function can be written as a sum of functions of the predictor variables:

$$y = f_1(x_1) + f_2(x_2) + ... + f_{p−1}(x_{p−1})$$

Treatment for Depressions Data

Some researchers were interested in comparing the effectiveness of three treatments for severe depression. For the sake of simplicity, we denote the three treatments A, B, and C. The researchers collected the following data (depression.txt) on a random sample of n = 36 severely depressed individuals.

y = measure of the effectiveness of the treatment for individual i
x1 = age (in years) of individual i
x2 = 1 if individual i received treatment A and 0, if not
x3 = 1 if individual i received treatment B and 0, if not

In [86]:
depression_txt = """y	age	x2	x3	TRT
56	21	1	0	A
41	23	0	1	B
40	30	0	1	B
28	19	0	0	rrrr
55	28	1	0	A
25	23	0	0	C
46	33	0	1	B
71	67	0	0	C
48	42	0	1	B
63	33	1	0	A
52	33	1	0	A
62	56	0	0	C
50	45	0	0	C
45	43	0	1	B
58	38	1	0	A
46	37	0	0	C
58	43	0	1	B
34	27	0	0	C
65	43	1	0	A
55	45	0	1	B
57	48	0	1	B
59	47	0	0	C
64	48	1	0	A
61	53	1	0	A
62	58	0	1	B
36	29	0	0	C
69	53	1	0	A
47	29	0	1	B
73	58	1	0	A
64	66	0	1	B
60	67	0	1	B
62	63	1	0	A
71	59	0	0	C
62	51	0	0	C
70	67	1	0	A
71	63	0	0	C"""

In [88]:
depression_df = pd.read_csv(StringIO(depression_txt), sep='\t')
depression_df.head()


Out[88]:
y age x2 x3 TRT
0 56 21 1 0 A
1 41 23 0 1 B
2 40 30 0 1 B
3 28 19 0 0 C
4 55 28 1 0 A

In [ ]:


In [ ]: