Using the `ols` function from the 'statsmodels' library is largely a matter of familiarity with the model design language described in Chapter 6 of 'Statistical Modeling: A Fresh Approach (2nd Edition)'. Computing the fitted model values and the residuals is done with the `fittedvalues` and `resid` attributes of the fitted model object. To illustrate:
Data file: `swim100m.csv`
In [1]:
import pandas as pd
import statsmodels.formula.api as sm
swim = pd.read_csv("http://www.mosaic-web.org/go/datasets/swim100m.csv")
mod = sm.ols("time ~ year + sex", data=swim)
fit = mod.fit()
fit.params
Out[1]:
Once you have constructed and fitted the model, you can use the `fittedvalues` and `resid` attributes of the fitted model - here named `fit`:
In [3]:
fit.fittedvalues.head() # There are 62 cases altogether
Out[3]:
In [5]:
fit.resid.head() # There are 62 cases altogether
Out[5]:
In [7]:
fit.fittedvalues.mean()
Out[7]:
In [12]:
fit.resid.var()
Out[12]:
In [11]:
fit.resid.std()
Out[11]:
In [13]:
fit.resid.describe()
Out[13]:
Sums of squares are very important in statistics. Here's how to calculate them for the response values, the fitted model values, and the residuals:
In [15]:
(swim.time**2.).sum() # Could also use np.sum(swim.time**2) after `import numpy as np`
Out[15]:
In [16]:
(fit.fittedvalues**2.).sum()
Out[16]:
In [31]:
(fit.resid**2).sum()
Out[31]:
The partitioning of variation by models is seen in the way the sums of squares of the fitted values and of the residuals add up to the sum of squares of the response:
In [19]:
227699.188 + 935.829
Out[19]:
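Rather than copying the numbers by hand, you can also verify the partition directly from the fitted model. The following sketch (not part of the original notebook) assumes the `swim` and `fit` objects from the cells above and uses NumPy's `isclose` for the floating-point comparison:

```python
import numpy as np

# Sums of squares of the response, the fitted values, and the residuals
ss_response = (swim.time ** 2).sum()
ss_fitted = (fit.fittedvalues ** 2).sum()
ss_resid = (fit.resid ** 2).sum()

# The fitted and residual sums of squares should add up to the response sum of squares
print(np.isclose(ss_fitted + ss_resid, ss_response))
```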
Don't forget the squaring stage of the operation! The sum of the residuals (without squaring) is very different from the sum of squares of the residuals:
In [20]:
fit.resid.sum()
Out[20]:
In [21]:
(fit.resid**2).sum()
Out[21]:
Take care in reading numbers formatted like `1.163e-08`. The notation stands for $1.163 \times 10^{-8}$. That number, 0.00000001163, is effectively zero compared to the residuals themselves!
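As a quick sanity check (not in the original text), you can confirm what the notation means and compare the raw residual sum with the typical size of an individual residual; the snippet assumes the `fit` object from the cells above.

```python
# 1.163e-08 is just scientific notation for 1.163 * 10**-8
print(1.163e-08 == 1.163 * 10 ** -8)    # True
print("{:.11f}".format(1.163e-08))      # 0.00000001163

# The raw residual sum is tiny compared with a typical residual,
# so it is effectively zero (a consequence of least-squares fitting).
print(abs(fit.resid.sum()), fit.resid.abs().mean())
```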
The `ols` function and the other 'statsmodels' model functions will not automatically detect redundancy introduced by the modeler. In other words, if you enter two different variables x1 and x2 but set them to be numerically equal, then statsmodels (via the 'Patsy' library) will indeed produce a design matrix that isn't full rank (i.e., one that contains redundancies). Avoiding this is up to the modeler, and is a simple matter of careful model design. However, "structural redundancies" - those which occur inevitably, no matter what the particular values in your data set are - are handled automatically by the software.
For example, whenever you use a categorical variable and an intercept term in a model, there is a redundancy. This is not shown explicitly. To illustrate, here is a model with no intercept term, and both levels of the categorical variable `sex` show up with parameters:
In [29]:
mod = sm.ols("time ~ sex - 1", data=swim)
fit = mod.fit()
fit.params
Out[29]:
If the intercept term is included (as it is by default unless `-1` is used in the model formula), one of the levels of the categorical variable is simply dropped from the model:
In [30]:
mod = sm.ols("time ~ sex", data=swim)
fit = mod.fit()
fit.params
Out[30]:
Remember that this coefficient report implicitly involves a redundancy (this is called a "structural redundancy"). If the software had been designed differently, the report might have returned a value of `nan` for `sex[T.F]`.
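If you want to see exactly which columns are built for each formula, you can construct the design matrices directly with `patsy.dmatrix`, which is what statsmodels uses behind the scenes. This is a sketch, assuming the `swim` data frame from above; the printed column names are indicative of what treatment coding typically produces.

```python
import patsy

# Design matrix with an intercept: one level of 'sex' is dropped
with_intercept = patsy.dmatrix("sex", swim)
print(with_intercept.design_info.column_names)    # e.g. ['Intercept', 'sex[T.M]']

# Design matrix without an intercept: both levels appear as columns
without_intercept = patsy.dmatrix("sex - 1", swim)
print(without_intercept.design_info.column_names) # e.g. ['sex[F]', 'sex[M]']
```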
As with all 'Statistical Modeling: A Fresh Approach for Python' tutorials, this tutorial is based directly on material from 'Statistical Modeling: A Fresh Approach (2nd Edition)' by Daniel Kaplan.
I have made an effort to keep the text and explanations consistent between the original (R-based) version and the Python tutorials, in order to keep things comparable. With that in mind, any errors, omissions, and/or differences between the two versions are mine, and any questions, comments, and/or concerns should be directed to me.