In [6]:
import first, regression
import thinkstats2, thinkplot
import numpy as np
import statsmodels.formula.api as smf

regression - fitting a model to data. The goal is to describe the relationship between the dependent variables and the explanatory variables.

multiple regression - regression with more than one explanatory variable.

linear regression - when the relationship between the dependent and explanatory variables is linear: $$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon $$ where $\beta_0$ is the intercept, $\beta_1$ is the parameter associated with $x_1$, $\beta_2$ is the parameter associated with $x_2$, and $\epsilon$ is the residual due to random variation or unknown factors.

ordinary least squares - given a sequence of values for $y$ and sequences for $x_1$ and $x_2$, we can find the parameters $\beta_0$, $\beta_1$ and $\beta_2$ that minimize the sum of the squared residuals, $\sum \epsilon^2$.
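To make the least squares idea concrete, here is a minimal sketch for the simpler case of a single explanatory variable, where the minimizing parameters have a closed form. The data and the helper names `least_squares` and `sum_squared_residuals` are made up for illustration; they are not part of `thinkstats2` or `statsmodels`.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = Cov(x, y) / Var(x); the 1/n factors cancel
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    inter = mean_y - slope * mean_x
    return inter, slope


def sum_squared_residuals(xs, ys, inter, slope):
    """Sum of squared differences between ys and the fitted line."""
    return sum((y - (inter + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data, roughly following y = 2 + 3x
xs = [0, 1, 2, 3, 4]
ys = [2.1, 4.9, 8.2, 10.8, 14.0]

inter, slope = least_squares(xs, ys)
best = sum_squared_residuals(xs, ys, inter, slope)
print(inter, slope, best)
```

Nudging either parameter away from the fitted value should only increase the sum of squared residuals, which is a quick way to check the fit.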


In [7]:
live, firsts, others = first.MakeFrames()
formula = 'totalwgt_lb ~ agepreg'
model = smf.ols(formula, data=live)
results = model.fit()
regression.SummarizeResults(results)


Intercept   6.83   (0)
agepreg   0.0175   (5.72e-11)
R^2 0.004738
Std(ys) 1.408
Std(res) 1.405
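
The summary reports $R^2$ alongside the standard deviations of the dependent variable and of the residuals. For a least squares fit with an intercept, these are linked by $R^2 = 1 - Var(res) / Var(ys)$. The sketch below checks that relationship on made-up data (not the NSFG data); the rounded values printed above will only approximately reproduce the reported $R^2$.

```python
import statistics

# Hypothetical toy data, roughly following y = 2 + 3x
xs = [0, 1, 2, 3, 4]
ys = [2.1, 4.9, 8.2, 10.8, 14.0]

# Closed-form least squares fit for one explanatory variable
mean_x = statistics.mean(xs)
mean_y = statistics.mean(ys)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
inter = mean_y - slope * mean_x

# Residuals: actual values minus fitted values
res = [y - (inter + slope * x) for x, y in zip(xs, ys)]

# Fraction of the variance in ys that the model accounts for
r_squared = 1 - statistics.pvariance(res) / statistics.pvariance(ys)
print(r_squared)
```

When Std(res) is nearly as large as Std(ys), as in the output above, the model explains very little of the variation and $R^2$ is close to 0.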

In [9]:
## Results are also available as attributes:
inter = results.params['Intercept']
slope = results.params['agepreg']
slope_pvalue = results.pvalues['agepreg']
results.rsquared


Out[9]:
0.0047381154747098142

In [11]:
##this gives the p-value associated with the model as a whole
results.f_pvalue


Out[11]:
5.7229471072677547e-11

In [13]:
residuals = results.resid

## this returns a sequence of fitted values,
## one for each value of agepreg.
fitted_values = results.fittedvalues

# results.summary() provides a lot of info
#the following is easier:
regression.SummarizeResults(results)


Intercept   6.83   (0)
agepreg   0.0175   (5.72e-11)
R^2 0.004738
Std(ys) 1.408
Std(res) 1.405

spurious - describes an apparent relationship for which there is no obvious mechanism that would explain it. For example, why would first babies be lighter than others? Perhaps because mothers of first babies are younger...


In [14]:
diff_weight = firsts.totalwgt_lb.mean() - others.totalwgt_lb.mean()
diff_weight


Out[14]:
-0.12476118453549034

In [17]:
diff_age = firsts.agepreg.mean() - others.agepreg.mean()
diff_age


Out[17]:
-3.586434766150152
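
One way to gauge how much of the weight difference the age difference might account for is to multiply the fitted slope by the age difference. This is an illustrative check using the numbers printed above; the slope is the rounded value from the summary, so the result is approximate.

```python
# Values copied from the outputs above
slope = 0.0175                       # lb per year of mother's age (rounded)
diff_age = -3.586434766150152        # years, firsts minus others
diff_weight = -0.12476118453549034   # lb, firsts minus others

# Weight difference predicted from the age difference alone
explained = slope * diff_age

# Share of the observed weight difference that this accounts for
fraction = explained / diff_weight
print(explained, fraction)
```

Under these assumptions the age difference accounts for roughly half of the observed difference in birth weight, which leaves the rest unexplained by this simple model.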