# Regression with Categorical Variables

``````

In [43]:

%pylab inline
pylab.style.use('ggplot')
import numpy as np
import pandas as pd

``````
``````

Populating the interactive namespace from numpy and matplotlib

``````

## The BirthSmokers Data

``````Researchers interested in answering the above research question collected the following data (birthsmokers.txt) on a random sample of n = 32 births:

Response (y): birth weight (Weight) in grams of baby
Potential predictor (x1): Smoking status of mother (yes or no)
Potential predictor (x2): length of gestation (Gest) in weeks

The distinguishing feature of this data set is that one of the predictor variables — Smoking — is a qualitative predictor. To be more precise, smoking is a "binary variable" with only two possible values (yes or no). The other predictor variable (Gest) is, of course, quantitative.``````
``````

In [44]:

smoking_txt = """Wgt	Gest	Smoke
2940	38	yes
3130	38	no
2420	36	yes
2450	34	no
2760	39	yes
2440	35	yes
3226	40	no
3301	42	yes
2729	37	no
3410	40	no
2715	36	yes
3095	39	no
3130	39	yes
3244	39	no
2520	35	no
2928	39	yes
3523	41	no
3446	42	yes
2920	38	no
2957	39	yes
3530	42	no
2580	38	yes
3040	37	no
3500	42	yes
3200	41	yes
3322	39	no
3459	40	no
3346	42	yes
2619	35	no
3175	41	yes
2740	38	yes
2841	36	no"""

``````
``````

In [45]:

from io import StringIO

``````
``````

Out[45]:

Wgt
Gest
Smoke

0
2940
38
yes

1
3130
38
no

2
2420
36
yes

3
2450
34
no

4
2760
39
yes

``````

Let's plot how the birth weight is related to the gestation period for the two groups: smokers and non-smokers.

``````

In [46]:

ax = smoking_df[smoking_df.Smoke == 'yes'].plot(kind='scatter', x='Gest', y='Wgt')
smoking_df[smoking_df.Smoke == 'no'].plot(kind='scatter', color='green', ax=ax, x='Gest', y='Wgt')
ax.legend(['Smokers', 'Non-smokers'], loc='upper left')

``````
``````

Out[46]:

<matplotlib.legend.Legend at 0x1e8169314e0>

``````

Let's also find out the min, max and average birth weights for smokers and non-smokers.

``````

In [47]:

smoking_df.Wgt.groupby(smoking_df.Smoke).agg({'min': np.min, 'max': np.max, 'mean': np.mean})

``````
``````

Out[47]:

min
max
mean

Smoke

no
2450
3530
3066.125

yes
2420
3500
2973.625

``````
``````

In [48]:

import statsmodels.formula.api as sm
result = sm.ols(formula='Wgt ~ C(Smoke) + Gest', data=smoking_df).fit()
result.summary()

``````
``````

Out[48]:

OLS Regression Results

Dep. Variable:           Wgt         R-squared:             0.896

Method:             Least Squares    F-statistic:           125.4

Date:             Sat, 01 Apr 2017   Prob (F-statistic): 5.29e-15

Time:                 23:15:49       Log-Likelihood:      -195.82

No. Observations:          32        AIC:                   397.6

Df Residuals:              29        BIC:                   402.0

Df Model:                   2

Covariance Type:      nonrobust

coef     std err      t      P>|t|  [0.025    0.975]

Intercept       -2389.5729   349.206    -6.843  0.000 -3103.779 -1675.366

C(Smoke)[T.yes]  -244.5440    41.982    -5.825  0.000  -330.406  -158.682

Gest              143.1003     9.128    15.677  0.000   124.431   161.769

Omnibus:        1.946   Durbin-Watson:         2.016

Prob(Omnibus):  0.378   Jarque-Bera (JB):      1.162

Skew:          -0.090   Prob(JB):              0.559

Kurtosis:       2.084   Cond. No.               663.

``````

And the obligatory residual histogram:

``````

In [49]:

result.resid.plot(kind='hist', bins=20)

``````
``````

Out[49]:

<matplotlib.axes._subplots.AxesSubplot at 0x1e8169a8780>

``````
``````

In [66]:

#smoking_df['Smoke'] = smoking_df['Smoke'].astype('category', values=['no', 'yes']).cat.codes
predictions = result.predict(smoking_df)

``````
``````

In [67]:

smoking_df['fit_values'] = predictions.astype(np.int)

``````
``````

In [84]:

ax = smoking_df[smoking_df.Smoke == 'yes'].plot(kind='scatter', color='blue', x='Gest', y='Wgt')
ax = smoking_df[smoking_df.Smoke == 'no'].plot(kind='scatter', color='green', ax=ax, x='Gest', y='Wgt')

# Also the regression line by group
# smoking_df[smoking_df.Smoke == 'yes'].plot(kind='line', color='blue', x='Gest', y='fit_values')
ax = smoking_df[smoking_df.Smoke == 'yes'].sort_values(by='Gest').plot(
kind='line', ax=ax, color='blue', x='Gest', y='fit_values')

ax = smoking_df[smoking_df.Smoke == 'no'].sort_values(by='Gest').plot(
kind='line', ax=ax, color='green', x='Gest', y='fit_values')

ax.legend(['Smokers', 'Non-smokers'], loc='upper left')

``````
``````

Out[84]:

<matplotlib.legend.Legend at 0x1e81809c320>

``````
• Does the effect of the gestation length on mean birth weight depend on whether or not the mother is a smoker? The answer is no! Regardless of whether or not the mother is a smoker, for each additional one-week of gestation, the mean birth weight is predicted to increase by 143 grams. This lack of interaction between the two predictors is exhibted by the parallelness of the two lines.
• Does the effect of smoking on mean birth weight depend on the length of gestation? The answer is no! For a fixed length of gestation, the mean birth weight of babies born to smoking mothers is predicted to be 245 grams lower than the mean birth weight of babies born to non-smoking mothers. Again, this lack of interaction between the two predictors is exhibted by the parallelness of the two lines.

When two predictors do not interact, we say that each predictor has an "additive effect" on the response. More formally, a regression model contains additive effects if the response function can be written as a sum of functions of the predictor variables:

\$\$y = f_1(x_1) + f_2(x_2) + ... + f_{p−1}(x_{p−1})\$\$

## Treatment for Depressions Data

Some researchers were interested in comparing the effectiveness of three treatments for severe depression. For the sake of simplicity, we denote the three treatments A, B, and C. The researchers collected the following data (depression.txt) on a random sample of n = 36 severely depressed individuals.

``````y = measure of the effectiveness of the treatment for individual i
x1 = age (in years) of individual i
x2 = 1 if individual i received treatment A and 0, if not
x3 = 1 if individual i received treatment B and 0, if not``````
``````

In [86]:

depression_txt = """y	age	x2	x3	TRT
56	21	1	0	A
41	23	0	1	B
40	30	0	1	B
28	19	0	0	rrrr
55	28	1	0	A
25	23	0	0	C
46	33	0	1	B
71	67	0	0	C
48	42	0	1	B
63	33	1	0	A
52	33	1	0	A
62	56	0	0	C
50	45	0	0	C
45	43	0	1	B
58	38	1	0	A
46	37	0	0	C
58	43	0	1	B
34	27	0	0	C
65	43	1	0	A
55	45	0	1	B
57	48	0	1	B
59	47	0	0	C
64	48	1	0	A
61	53	1	0	A
62	58	0	1	B
36	29	0	0	C
69	53	1	0	A
47	29	0	1	B
73	58	1	0	A
64	66	0	1	B
60	67	0	1	B
62	63	1	0	A
71	59	0	0	C
62	51	0	0	C
70	67	1	0	A
71	63	0	0	C"""

``````
``````

In [88]:

``````
``````

Out[88]:

y
age
x2
x3
TRT

0
56
21
1
0
A

1
41
23
0
1
B

2
40
30
0
1
B

3
28
19
0
0
C

4
55
28
1
0
A

``````
``````

In [ ]:

``````
``````

In [ ]:

``````