Regression with Categorical Variables



In [43]:

    
%pylab inline
pylab.style.use('ggplot')
import numpy as np
import pandas as pd









    



Populating the interactive namespace from numpy and matplotlib

The BirthSmokers Data

Researchers interested in answering the above research question collected the following data (birthsmokers.txt) on a random sample of n = 32 births:

Response (y): birth weight (Weight) in grams of baby
Potential predictor (x1): Smoking status of mother (yes or no)
Potential predictor (x2): length of gestation (Gest) in weeks

The distinguishing feature of this data set is that one of the predictor variables — Smoking — is a qualitative predictor. To be more precise, smoking is a "binary variable" with only two possible values (yes or no). The other predictor variable (Gest) is, of course, quantitative.



In [44]:

    
smoking_txt = """Wgt	Gest	Smoke
2940	38	yes
3130	38	no
2420	36	yes
2450	34	no
2760	39	yes
2440	35	yes
3226	40	no
3301	42	yes
2729	37	no
3410	40	no
2715	36	yes
3095	39	no
3130	39	yes
3244	39	no
2520	35	no
2928	39	yes
3523	41	no
3446	42	yes
2920	38	no
2957	39	yes
3530	42	no
2580	38	yes
3040	37	no
3500	42	yes
3200	41	yes
3322	39	no
3459	40	no
3346	42	yes
2619	35	no
3175	41	yes
2740	38	yes
2841	36	no"""



In [45]:

    
from io import StringIO
smoking_df = pd.read_csv(StringIO(smoking_txt), sep='\t')
smoking_df.head()

Let's plot how the birth weight is related to the gestation period for the two groups: smokers and non-smokers.



In [46]:

    
ax = smoking_df[smoking_df.Smoke == 'yes'].plot(kind='scatter', x='Gest', y='Wgt')
smoking_df[smoking_df.Smoke == 'no'].plot(kind='scatter', color='green', ax=ax, x='Gest', y='Wgt')
ax.legend(['Smokers', 'Non-smokers'], loc='upper left')









    Out[46]:





<matplotlib.legend.Legend at 0x1e8169314e0>

Let's also find out the min, max and average birth weights for smokers and non-smokers.



In [47]:

    
smoking_df.Wgt.groupby(smoking_df.Smoke).agg({'min': np.min, 'max': np.max, 'mean': np.mean})



In [48]:

    
import statsmodels.formula.api as sm
result = sm.ols(formula='Wgt ~ C(Smoke) + Gest', data=smoking_df).fit()
result.summary()









    Out[48]:





OLS Regression Results

  Dep. Variable:            Wgt          R-squared:             0.896


  Model:                    OLS          Adj. R-squared:        0.889


  Method:              Least Squares     F-statistic:           125.4


  Date:              Sat, 01 Apr 2017    Prob (F-statistic):  5.29e-15


  Time:                  23:15:49        Log-Likelihood:      -195.82


  No. Observations:           32         AIC:                   397.6


  Df Residuals:               29         BIC:                   402.0


  Df Model:                    2                                     


  Covariance Type:       nonrobust                                   




                     coef      std err       t       P>|t|   [0.025     0.975]  


  Intercept        -2389.5729    349.206     -6.843   0.000  -3103.779  -1675.366


  C(Smoke)[T.yes]   -244.5440     41.982     -5.825   0.000   -330.406   -158.682


  Gest               143.1003      9.128     15.677   0.000    124.431    161.769




  Omnibus:         1.946    Durbin-Watson:         2.016


  Prob(Omnibus):   0.378    Jarque-Bera (JB):      1.162


  Skew:           -0.090    Prob(JB):              0.559


  Kurtosis:        2.084    Cond. No.               663.

And the obligatory residual histogram:



In [49]:

    
result.resid.plot(kind='hist', bins=20)









    Out[49]:





<matplotlib.axes._subplots.AxesSubplot at 0x1e8169a8780>



In [66]:

    
#smoking_df['Smoke'] = smoking_df['Smoke'].astype('category', values=['no', 'yes']).cat.codes
predictions = result.predict(smoking_df)



In [67]:

    
smoking_df['fit_values'] = predictions.astype(np.int)



In [84]:

    
ax = smoking_df[smoking_df.Smoke == 'yes'].plot(kind='scatter', color='blue', x='Gest', y='Wgt')
ax = smoking_df[smoking_df.Smoke == 'no'].plot(kind='scatter', color='green', ax=ax, x='Gest', y='Wgt')

# Also the regression line by group
# smoking_df[smoking_df.Smoke == 'yes'].plot(kind='line', color='blue', x='Gest', y='fit_values')
ax = smoking_df[smoking_df.Smoke == 'yes'].sort_values(by='Gest').plot(
    kind='line', ax=ax, color='blue', x='Gest', y='fit_values')

ax = smoking_df[smoking_df.Smoke == 'no'].sort_values(by='Gest').plot(
    kind='line', ax=ax, color='green', x='Gest', y='fit_values')

ax.legend(['Smokers', 'Non-smokers'], loc='upper left')









    Out[84]:





<matplotlib.legend.Legend at 0x1e81809c320>

Does the effect of the gestation length on mean birth weight depend on whether or not the mother is a smoker? The answer is no! Regardless of whether or not the mother is a smoker, for each additional one-week of gestation, the mean birth weight is predicted to increase by 143 grams. This lack of interaction between the two predictors is exhibted by the parallelness of the two lines.

Does the effect of smoking on mean birth weight depend on the length of gestation? The answer is no! For a fixed length of gestation, the mean birth weight of babies born to smoking mothers is predicted to be 245 grams lower than the mean birth weight of babies born to non-smoking mothers. Again, this lack of interaction between the two predictors is exhibted by the parallelness of the two lines.

When two predictors do not interact, we say that each predictor has an "additive effect" on the response. More formally, a regression model contains additive effects if the response function can be written as a sum of functions of the predictor variables:

$$y = f_1(x_1) + f_2(x_2) + ... + f_{p−1}(x_{p−1})$$

Treatment for Depressions Data

Some researchers were interested in comparing the effectiveness of three treatments for severe depression. For the sake of simplicity, we denote the three treatments A, B, and C. The researchers collected the following data (depression.txt) on a random sample of n = 36 severely depressed individuals.

y = measure of the effectiveness of the treatment for individual i
x1 = age (in years) of individual i
x2 = 1 if individual i received treatment A and 0, if not
x3 = 1 if individual i received treatment B and 0, if not



In [86]:

    
depression_txt = """y	age	x2	x3	TRT
56	21	1	0	A
41	23	0	1	B
40	30	0	1	B
28	19	0	0	rrrr
55	28	1	0	A
25	23	0	0	C
46	33	0	1	B
71	67	0	0	C
48	42	0	1	B
63	33	1	0	A
52	33	1	0	A
62	56	0	0	C
50	45	0	0	C
45	43	0	1	B
58	38	1	0	A
46	37	0	0	C
58	43	0	1	B
34	27	0	0	C
65	43	1	0	A
55	45	0	1	B
57	48	0	1	B
59	47	0	0	C
64	48	1	0	A
61	53	1	0	A
62	58	0	1	B
36	29	0	0	C
69	53	1	0	A
47	29	0	1	B
73	58	1	0	A
64	66	0	1	B
60	67	0	1	B
62	63	1	0	A
71	59	0	0	C
62	51	0	0	C
70	67	1	0	A
71	63	0	0	C"""



In [88]:

    
depression_df = pd.read_csv(StringIO(depression_txt), sep='\t')
depression_df.head()



In [ ]:



In [ ]:

Dep. Variable:	Wgt	R-squared:	0.896
Model:	OLS	Adj. R-squared:	0.889
Method:	Least Squares	F-statistic:	125.4
Date:	Sat, 01 Apr 2017	Prob (F-statistic):	5.29e-15
Time:	23:15:49	Log-Likelihood:	-195.82
No. Observations:	32	AIC:	397.6
Df Residuals:	29	BIC:	402.0
Df Model:	2
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	-2389.5729	349.206	-6.843	0.000	-3103.779	-1675.366
C(Smoke)[T.yes]	-244.5440	41.982	-5.825	0.000	-330.406	-158.682
Gest	143.1003	9.128	15.677	0.000	124.431	161.769

Omnibus:	1.946	Durbin-Watson:	2.016
Prob(Omnibus):	0.378	Jarque-Bera (JB):	1.162
Skew:	-0.090	Prob(JB):	0.559
Kurtosis:	2.084	Cond. No.	663.