Title: Fitting Models to Data Date: 2013-10-01 12:00 Author: cfarmer Email: carson.farmer@gmail.com Category: Statistical Modeling for Python Tags: Helpful tips, Python, Statistical Modeling, Teaching Slug: statistical-modeling-python-fitting-models Latex: yes Status: draft

Fitting Models to Data

Using the ols function from the 'statsmodels' library is largely a matter of familiarity with the model design language described in Chapter 6 of 'Statistical Modeling: A Fresh Approach (2nd Edition)'. Computing the fitted model values and the residuals is done with the fittedvalues and resid attributes of the fitted model object. To illustrate:

Data file: `swim100m.csv`


In [1]:
import pandas as pd
import statsmodels.formula.api as sm

swim = pd.read_csv("http://www.mosaic-web.org/go/datasets/swim100m.csv")
mod = sm.ols("time ~ year + sex", data=swim)
fit = mod.fit()
fit.params


Out[1]:
Intercept    555.716783
sex[T.M]      -9.797961
year          -0.251464
dtype: float64

Once you have constructed and fitted the model, you can use the fittedvalues and resid attributes of the fitted model - here named fit:


In [3]:
fit.fittedvalues.head() # There are 62 cases altogether


Out[3]:
0    66.880509
1    66.126118
2    65.623190
3    65.120263
4    63.611481
dtype: float64

In [5]:
fit.resid.head() # There are 62 cases altogether


Out[5]:
0   -1.080509
1   -0.526118
2   -2.823190
3   -3.520263
4   -2.211481
dtype: float64

Sums of Squares

Computations can be performed on the fitted model values and the residuals, just like any other quantity:


In [7]:
fit.fittedvalues.mean()


Out[7]:
59.924193548199447

In [12]:
fit.resid.var()


Out[12]:
15.341465142410016

In [11]:
fit.resid.std()


Out[11]:
3.9168182421973596

In [13]:
fit.resid.describe()


Out[13]:
count    6.200000e+01
mean     1.876532e-10
std      3.916818e+00
min     -4.702699e+00
25%     -2.702699e+00
50%     -5.968440e-01
75%      1.279569e+00
max      1.907592e+01
dtype: float64

Sums of squares are very important in statistics. Here's how to calculate them for the response values, the fitted model values, and the residuals:


In [15]:
(swim.time**2.).sum() # Could also use: np.sum(swim.time**2)


Out[15]:
228635.01799999995

In [16]:
(fit.fittedvalues**2.).sum()


Out[16]:
227699.18862492123

In [31]:
(fit.resid**2).sum()


Out[31]:
935.82937368701096

Models partition the variation in the response: the sum of squares of the fitted values and the sum of squares of the residuals add up to the sum of squares of the response:


In [19]:
227699.188 + 935.829


Out[19]:
228635.017
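This partition is a consequence of the residuals being orthogonal to the fitted values, so it holds for any least-squares fit, not just this one. Here is a minimal sketch of the identity using only numpy and a small synthetic data set (the values are hypothetical stand-ins, not the swim data):

```python
import numpy as np

# Hypothetical stand-in for the swim data: a response y and a design
# matrix X with an intercept column and one predictor.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 62)
y = 60.0 - 0.25 * x + rng.normal(scale=2.0, size=x.size)

X = np.column_stack([np.ones_like(x), x])     # intercept + predictor
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares coefficients
fitted = X @ beta
resid = y - fitted

ss_response = (y**2).sum()
ss_fitted = (fitted**2).sum()
ss_resid = (resid**2).sum()

# The partition holds to floating-point precision:
print(np.isclose(ss_fitted + ss_resid, ss_response))  # True
```

Geometrically, the fitted values are the projection of the response onto the column space of the design matrix, so the Pythagorean theorem gives the sum-of-squares partition.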

Don't forget the squaring stage of the operation! The sum of the residuals (without squaring) is very different from the sum of squares of the residuals:


In [20]:
fit.resid.sum()


Out[20]:
1.1634497809609456e-08

In [21]:
(fit.resid**2).sum()


Out[21]:
935.82937368701096

Take care in reading numbers formatted like 1.163e-08. The notation stands for $1.163 \times 10^{-8}$. That number, 0.00000001163, is effectively zero compared to the residuals themselves!
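When checking such values programmatically, avoid comparing floating-point results to zero exactly; a tolerance-based comparison is safer. A small sketch, using the residual sum printed above:

```python
import numpy as np

residual_sum = 1.1634497809609456e-08  # the value printed above

# An exact comparison to zero fails, but a tolerance-based one succeeds:
print(residual_sum == 0.0)                          # False
print(np.isclose(residual_sum, 0.0, atol=1e-6))     # True
```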

Redundancy

The ols function and the other 'statsmodels' modeling functions will not automatically detect redundancy introduced by the modeler. If you enter two different variables x1 and x2 that happen to be numerically equal, then statsmodels (via the 'Patsy' library) will produce a design matrix that is not full rank (i.e., one that contains redundancies). Avoiding this is up to the modeler, and is simply a matter of careful model design. However, "structural redundancies" - those that occur inevitably, no matter what the particular values in your data set are - are handled automatically by the software.
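You can see this kind of rank deficiency directly with numpy. This sketch uses made-up values rather than the swim data: a duplicated predictor column leaves the design matrix with fewer independent columns than it has columns:

```python
import numpy as np

# Hypothetical data where x2 is numerically identical to x1.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = x1.copy()  # a perfectly redundant predictor

# Design matrix: intercept, x1, x2 -- three columns...
X = np.column_stack([np.ones_like(x1), x1, x2])

# ...but only two of them are linearly independent.
print(np.linalg.matrix_rank(X))  # 2, not 3
```

With such a matrix, the least-squares coefficients are not uniquely determined, which is why redundancy of this kind is something the modeler has to avoid.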

For example, whenever you use a categorical variable and an intercept term in a model, there is a redundancy. This is not shown explicitly. To illustrate, here is a model with no intercept term, and both levels of the categorical variable sex show up with parameters:


In [29]:
mod = sm.ols("time ~ sex - 1", data=swim)
fit = mod.fit()
fit.params


Out[29]:
sex[F]    65.192258
sex[M]    54.656129
dtype: float64

If the intercept term is included (as it is by default unless -1 is used in the model formula), one of the levels of the categorical variable is simply dropped from the model:


In [30]:
mod = sm.ols("time ~ sex", data=swim)
fit = mod.fit()
fit.params


Out[30]:
Intercept    65.192258
sex[T.M]    -10.536129
dtype: float64

Remember that this coefficient report implicitly involves a redundancy (this is called a "structural redundancy"). If the software had been designed differently, the report might have returned a value of nan for sex[T.F].
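The two parameterizations describe the same fitted values, so their coefficients are directly related: the intercept is the mean for the reference level (F), and adding the sex[T.M] coefficient recovers the sex[M] group mean from the no-intercept model. A quick arithmetic check on the values reported above:

```python
# Coefficients copied from the two model fits above:
intercept = 65.192258   # mean time for the reference level, sex == F
sex_T_M = -10.536129    # difference between the M and F group means
sex_M = 54.656129       # sex[M] coefficient from the no-intercept model

# The treatment-coded model reproduces the group mean:
print(abs((intercept + sex_T_M) - sex_M) < 1e-6)  # True
```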

Next time on Statistical Modeling: A Fresh Approach for Python...

  • Correlation and Partitioning of Variation

Reference

As with all 'Statistical Modeling: A Fresh Approach for Python' tutorials, this tutorial is based directly on material from 'Statistical Modeling: A Fresh Approach (2nd Edition)' by Daniel Kaplan.

I have made an effort to keep the text and explanations consistent between the original (R-based) version and the Python tutorials, in order to keep things comparable. With that in mind, any errors, omissions, and/or differences between the two versions are mine, and any questions, comments, and/or concerns should be directed to me.