In [43]:
%pylab inline
pylab.style.use('ggplot')
import numpy as np
import pandas as pd
Researchers interested in answering the above research question collected the following data (birthsmokers.txt) on a random sample of n = 32 births:
Response (y): birth weight (Weight) in grams of baby
Potential predictor (x1): Smoking status of mother (yes or no)
Potential predictor (x2): length of gestation (Gest) in weeks
The distinguishing feature of this data set is that one of the predictor variables — Smoking — is a qualitative predictor. To be more precise, smoking is a "binary variable" with only two possible values (yes or no). The other predictor variable (Gest) is, of course, quantitative.
In [44]:
smoking_txt = """Wgt Gest Smoke
2940 38 yes
3130 38 no
2420 36 yes
2450 34 no
2760 39 yes
2440 35 yes
3226 40 no
3301 42 yes
2729 37 no
3410 40 no
2715 36 yes
3095 39 no
3130 39 yes
3244 39 no
2520 35 no
2928 39 yes
3523 41 no
3446 42 yes
2920 38 no
2957 39 yes
3530 42 no
2580 38 yes
3040 37 no
3500 42 yes
3200 41 yes
3322 39 no
3459 40 no
3346 42 yes
2619 35 no
3175 41 yes
2740 38 yes
2841 36 no"""
In [45]:
from io import StringIO
smoking_df = pd.read_csv(StringIO(smoking_txt), sep='\t')
smoking_df.head()
Out[45]:
Let's plot how the birth weight is related to the gestation period for the two groups: smokers and non-smokers.
In [46]:
ax = smoking_df[smoking_df.Smoke == 'yes'].plot(kind='scatter', x='Gest', y='Wgt')
smoking_df[smoking_df.Smoke == 'no'].plot(kind='scatter', color='green', ax=ax, x='Gest', y='Wgt')
ax.legend(['Smokers', 'Non-smokers'], loc='upper left')
Out[46]:
Let's also find out the min, max and average birth weights for smokers and non-smokers.
In [47]:
smoking_df.Wgt.groupby(smoking_df.Smoke).agg({'min': np.min, 'max': np.max, 'mean': np.mean})
Out[47]:
In [48]:
import statsmodels.formula.api as sm
result = sm.ols(formula='Wgt ~ C(Smoke) + Gest', data=smoking_df).fit()
result.summary()
Out[48]:
And the obligatory residual histogram:
In [49]:
result.resid.plot(kind='hist', bins=20)
Out[49]:
In [66]:
#smoking_df['Smoke'] = smoking_df['Smoke'].astype('category', values=['no', 'yes']).cat.codes
predictions = result.predict(smoking_df)
In [67]:
smoking_df['fit_values'] = predictions.astype(np.int)
In [84]:
ax = smoking_df[smoking_df.Smoke == 'yes'].plot(kind='scatter', color='blue', x='Gest', y='Wgt')
ax = smoking_df[smoking_df.Smoke == 'no'].plot(kind='scatter', color='green', ax=ax, x='Gest', y='Wgt')
# Also the regression line by group
# smoking_df[smoking_df.Smoke == 'yes'].plot(kind='line', color='blue', x='Gest', y='fit_values')
ax = smoking_df[smoking_df.Smoke == 'yes'].sort_values(by='Gest').plot(
kind='line', ax=ax, color='blue', x='Gest', y='fit_values')
ax = smoking_df[smoking_df.Smoke == 'no'].sort_values(by='Gest').plot(
kind='line', ax=ax, color='green', x='Gest', y='fit_values')
ax.legend(['Smokers', 'Non-smokers'], loc='upper left')
Out[84]:
When two predictors do not interact, we say that each predictor has an "additive effect" on the response. More formally, a regression model contains additive effects if the response function can be written as a sum of functions of the predictor variables:
$$y = f_1(x_1) + f_2(x_2) + ... + f_{p−1}(x_{p−1})$$Some researchers were interested in comparing the effectiveness of three treatments for severe depression. For the sake of simplicity, we denote the three treatments A, B, and C. The researchers collected the following data (depression.txt) on a random sample of n = 36 severely depressed individuals.
y = measure of the effectiveness of the treatment for individual i
x1 = age (in years) of individual i
x2 = 1 if individual i received treatment A and 0, if not
x3 = 1 if individual i received treatment B and 0, if not
In [86]:
depression_txt = """y age x2 x3 TRT
56 21 1 0 A
41 23 0 1 B
40 30 0 1 B
28 19 0 0 rrrr
55 28 1 0 A
25 23 0 0 C
46 33 0 1 B
71 67 0 0 C
48 42 0 1 B
63 33 1 0 A
52 33 1 0 A
62 56 0 0 C
50 45 0 0 C
45 43 0 1 B
58 38 1 0 A
46 37 0 0 C
58 43 0 1 B
34 27 0 0 C
65 43 1 0 A
55 45 0 1 B
57 48 0 1 B
59 47 0 0 C
64 48 1 0 A
61 53 1 0 A
62 58 0 1 B
36 29 0 0 C
69 53 1 0 A
47 29 0 1 B
73 58 1 0 A
64 66 0 1 B
60 67 0 1 B
62 63 1 0 A
71 59 0 0 C
62 51 0 0 C
70 67 1 0 A
71 63 0 0 C"""
In [88]:
depression_df = pd.read_csv(StringIO(depression_txt), sep='\t')
depression_df.head()
Out[88]:
In [ ]:
In [ ]: