Title: Model Formulas and Coefficients
Date: 2013-10-01 12:00
Author: cfarmer
Email: carson.farmer@gmail.com
Category: Statistical Modeling for Python
Tags: Helpful tips, Python, Statistical Modeling, Teaching
Slug: statistical-modeling-python-formulas-coefs
Latex: yes
Status: draft

Model Formulas and Coefficients

The ols function from the 'statsmodels' library finds model coefficients using Ordinary Least Squares (OLS) regression. To illustrate, here's a series of statements that first load the necessary libraries ('pandas' and 'statsmodels'), read in a data frame (swim100m.csv), and then build and fit a model to the data:

`swim100m.csv`


In [18]:
import pandas as pd
import statsmodels.formula.api as sm

swim = pd.read_csv("http://www.mosaic-web.org/go/datasets/swim100m.csv")
mod = sm.ols("time ~ year + sex", data=swim)
fit = mod.fit()

The first argument to sm.ols is a model design (a formula); the second, data, is the data frame containing the variables.

The object created by sm.ols - here given the name mod - is an OLS model object which contains a variety of information about the model. To fit the model (i.e., estimate a set of model parameters) we call mod.fit(), returning a fitted model object - here called fit. Model parameters (or coefficients) can be extracted from the fitted model object using the params attribute:


In [19]:
fit.params


Out[19]:
Intercept    555.716783
sex[T.M]      -9.797961
year          -0.251464
dtype: float64

A more detailed report can be extracted from the fitted model object with the summary method. This gives additional statistical information that will be used in later tutorials:


In [24]:
fit.summary()


Out[24]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                   time   R-squared:                       0.844
Model:                            OLS   Adj. R-squared:                  0.839
Method:                 Least Squares   F-statistic:                     159.6
Date:                Tue, 01 Oct 2013   Prob (F-statistic):           1.58e-24
Time:                        22:50:30   Log-Likelihood:                -172.12
No. Observations:                  62   AIC:                             350.2
Df Residuals:                      59   BIC:                             356.6
Df Model:                           2
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    555.7168     33.800     16.441      0.000       488.083   623.350
sex[T.M]      -9.7980      1.013     -9.673      0.000       -11.825    -7.771
year          -0.2515      0.017    -14.516      0.000        -0.286    -0.217
==============================================================================
Omnibus:                       52.546   Durbin-Watson:                   0.375
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              241.626
Skew:                           2.430   Prob(JB):                     3.40e-53
Kurtosis:                      11.362   Cond. No.                     1.30e+05
==============================================================================

From time to time in the tutorials, you will be asked to calculate model values "by hand". This is accomplished by multiplying each parameter by the appropriate value and adding the results up. For example, the model value for a male swimmer in 2010 would be:


In [27]:
555.7 - 0.2515*2010 - 9.798


Out[27]:
40.38700000000006

Notice that the "value" used to multiply the intercept is always 1, and the "value" used for a categorical level is either 0 or 1 depending on whether there is a match with the level. In this example, since the swimmer in question was male, the value of sex[T.M] is 1. If the swimmer had been female, the value for sex[T.M] would have been 0.
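
Rather than copying the rounded coefficients by hand, you can pull the exact values out of fit.params and let Python do the arithmetic. A minimal sketch (not part of the original notebook), using the parameter names shown above:

p = fit.params
# The intercept is multiplied by 1, year by 2010, and sex[T.M] by 1 (male)
p["Intercept"]*1 + p["year"]*2010 + p["sex[T.M]"]*1

The result will differ slightly from the hand-computed 40.387, because the hand calculation uses coefficients rounded to a few decimal places.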

When a model includes interaction terms, the interaction coefficients need to be multiplied by all the values involved in the interaction. For example, here is a model with an interaction between year and sex:


In [31]:
mod2 = sm.ols("time ~ year*sex", data=swim)
fit2 = mod2.fit()
fit2.params


Out[31]:
Intercept        697.301216
sex[T.M]        -302.463839
year              -0.324046
year:sex[T.M]      0.149917
dtype: float64

In [35]:
697.3 - 0.3240*2010 - 302.464 + 0.1499*2010


Out[35]:
44.89499999999998

The year:sex[T.M] parameter is being multiplied by the year (2010) and the value of sex[T.M], which is 1 for this male swimmer.
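
You can also let the fitted model do this bookkeeping for you: a formula-based fitted model can compute model values for new data supplied as a data frame with the original variable names. A quick sketch (exact behaviour may depend on your version of 'statsmodels'):

# predict() applies the model formula to new data, so the 0/1 coding of
# sex[T.M] and the year:sex[T.M] product are handled automatically
new_data = pd.DataFrame({"year": [2010, 2010], "sex": ["M", "F"]})
fit2.predict(new_data)  # model values for a male and a female swimmer in 2010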

Other Useful Operations

Combining categorical variables

It may be useful to combine two (or more) categorical variables into a single variable. For example, in the Current Population Survey data, the variable sex has levels F and M, while the variable race has levels W and NW. Combining the two variables creates a new variable with four levels: FNW, MNW, FW, MW:

`cps.csv`


In [43]:
cps = pd.read_csv("http://www.mosaic-web.org/go/datasets/cps.csv")
(cps.sex + cps.race).head() # There are 534 cases in total


Out[43]:
0    MW
1    MW
2    FW
3    FW
4    MW
dtype: object

This allows us to compute group-wise statistics on the combined groups relatively easily:


In [44]:
cps.groupby(cps.sex+cps.race).wage.mean()


Out[44]:
FNW     7.494286
FW      7.928479
MNW     8.463333
MW     10.233840
Name: wage, dtype: float64

When you combine categorical variables inside a model function (such as sm.ols), you don't need to prefix the variables with the data frame (that is supplied through the data argument), but you do need to wrap the combination inside the identity function I(). For more information on the identity function and other formula specifics, see the documentation for 'patsy', the library that statsmodels uses to handle formulas.


In [47]:
sm.ols("wage ~ I(sex+race)", df=cps).fit().params


Out[47]:
Intercept               7.494286
I(sex + race)[T.FW]     0.434194
I(sex + race)[T.MNW]    0.969048
I(sex + race)[T.MW]     2.739554
dtype: float64
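
These coefficients are just another presentation of the group means computed earlier: the intercept is the mean wage of the first (alphabetical) group, FNW, and each remaining coefficient is the difference between that group's mean and the FNW mean. A short sketch to check this against the groupby output above:

# Intercept + the MW coefficient recovers the MW group mean (about 10.23)
params = sm.ols("wage ~ I(sex+race)", data=cps).fit().params
params["Intercept"] + params["I(sex + race)[T.MW]"]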

Converting variable types

To convert between different data types, we can use the .astype() method of a column (or Series). This might be useful for converting a quantitative variable to a categorical variable. For example, if a quantity like month has been coded as a number (say, 1 for January, 2 for February, and so on) but you do not want models to treat it as a numeric value, you can convert it to a string with utils.month.astype(str).
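
As a quick sketch of the conversion itself (using the utilities data that also appears in the next example):

# month is read in as a numeric column; astype(str) produces an object
# (string) column, which model formulas will treat as categorical
utils = pd.read_csv("http://www.mosaic-web.org/go/datasets/utilities.csv")
print(utils.month.dtype)               # a numeric dtype, e.g. int64
print(utils.month.astype(str).dtype)   # object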

To illustrate, consider two different models of temperature versus month in the utilities data:

`utilities.csv`


In [122]:
utils = pd.read_csv("http://www.mosaic-web.org/go/datasets/utilities.csv")
fit1 = sm.ols("temp ~ month", data=utils).fit()
fit2 = sm.ols("temp ~ month.astype(str)", data=utils).fit()

Here are the graphs of those models:


In [129]:
import numpy as np
import matplotlib.pyplot as plt

argx = np.argsort(utils.month) # Use this to sort by month
plt.scatter(utils.month[argx], utils.temp[argx], label='observed')
plt.plot(utils.month[argx], fit1.fittedvalues[argx], '-', color='r', label='fitted')
plt.legend(loc='best', numpoints=1)
plt.title("fit1 model")
plt.xlabel("month")
lab = plt.ylabel("temp")



In [130]:
argx = np.argsort(utils.month) # Use this to sort by month
plt.scatter(utils.month[argx], utils.temp[argx], label='observed')
plt.plot(utils.month[argx], fit2.fittedvalues[argx], '-', color='r', label='fitted')
plt.legend(loc='best', numpoints=1)
plt.title("fit2 model")
plt.xlabel("month")
lab = plt.ylabel("temp")


In the first model, month is treated quantitatively, so the model term month produces a straight-line relationship that does not correspond well to the data. In the second model, month is treated categorically, allowing a more complicated model relationship.
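
One way to see the difference in flexibility between the two models is to count their parameters. A quick sketch (assuming all twelve months appear in utilities.csv, so that fit2 has an intercept plus one coefficient for each month after the first):

# fit1: intercept + slope (2 parameters);
# fit2: intercept + 11 month coefficients (12 parameters)
print(len(fit1.params), len(fit2.params))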

Next time on Statistical Modeling: A Fresh Approach for Python...

  • Fitting Models to Data

Reference

As with all 'Statistical Modeling: A Fresh Approach for Python' tutorials, this tutorial is based directly on material from 'Statistical Modeling: A Fresh Approach (2nd Edition)' by Daniel Kaplan.

I have made an effort to keep the text and explanations consistent between the original (R-based) version and the Python tutorials, in order to keep things comparable. With that in mind, any errors, omissions, and/or differences between the two versions are mine, and any questions, comments, and/or concerns should be directed to me.