The `ols` function from the `statsmodels` library finds model coefficients using Ordinary Least Squares (OLS) regression. To illustrate, here is a series of statements that first loads the necessary libraries (`pandas` and `statsmodels`), reads in a data frame (`swim100m.csv`), and builds and fits a model to the data:
`swim100m.csv`
In [18]:
import pandas as pd
import statsmodels.formula.api as sm
swim = pd.read_csv("http://www.mosaic-web.org/go/datasets/swim100m.csv")
mod = sm.ols("time ~ year + sex", data=swim)
fit = mod.fit()
The first argument to `sm.ols` is a model design; the second is the data frame. The object created by `sm.ols` (here given the name `mod`) is an OLS model object, which contains a variety of information about the model. To fit the model (i.e., estimate a set of model parameters), we call `mod.fit()`, which returns a fitted model object (here called `fit`). Model parameters (or coefficients) can be extracted from the fitted model object using the `params` attribute:
In [19]:
fit.params
Out[19]:
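For intuition about what "Ordinary Least Squares" means, the fitted coefficients are those that minimize the sum of squared residuals. Here is a minimal sketch of that computation on made-up toy data (not the swim data set), using only `numpy`:

```python
import numpy as np

# Toy data generated exactly from y = 1 + 2*x, so OLS should
# recover an intercept of 1 and a slope of 2
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Design matrix: a column of ones (the intercept) plus the x values
X = np.column_stack([np.ones_like(x), x])

# The least-squares solution minimizes ||X @ coef - y||^2
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
```

This is the same calculation `sm.ols(...).fit()` performs, with the design matrix built for you from the formula.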
A more detailed report can be extracted from the fitted model object with the `summary` method. This gives additional statistical information that will be used in later tutorials:
In [24]:
fit.summary()
Out[24]:
From time to time in the tutorials, you will be asked to calculate model values "by hand". This is accomplished by multiplying the parameters by the appropriate values and adding them up. For example, the model value for a male swimmer in 2010 would be:
In [27]:
555.7 - 0.2515*2010 - 9.798
Out[27]:
Notice that the "value" used to multiply the intercept is always 1, and the "value" used for a categorical level is either 0 or 1, depending on whether there is a match with the level. In this example, since the swimmer in question was male, the value of `sex[T.M]` is 1. If the swimmer had been female, the value for `sex[T.M]` would have been 0.
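The same hand calculation can be written out in Python, keeping each coefficient next to the value it multiplies (the coefficient values below are copied from the arithmetic above):

```python
# Coefficients from the fitted model: Intercept, year, sex[T.M]
intercept, b_year, b_sexM = 555.7, -0.2515, -9.798

# Values for a male swimmer in 2010: the intercept is multiplied by 1,
# year by 2010, and sex[T.M] by 1 (the swimmer matches level M)
model_value = intercept * 1 + b_year * 2010 + b_sexM * 1
```

For a female swimmer, the last term would be multiplied by 0 instead of 1.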
When a model includes interaction terms, the interaction coefficients need to be multiplied by all the values involved in the interaction. For example, here is a model with an interaction between `year` and `sex`:
In [31]:
mod2 = sm.ols("time ~ year*sex", data=swim)
fit2 = mod2.fit()
fit2.params
Out[31]:
In [35]:
697.3 - 0.3240*2010 - 302.464 + 0.1499*2010
Out[35]:
The `year:sex[T.M]` parameter is multiplied by both the year (2010) and the value of `sex[T.M]`, which is 1 for this male swimmer.
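Written out in Python, with the interaction coefficient multiplied by both values involved (coefficients copied from the arithmetic above):

```python
# Coefficients: Intercept, year, sex[T.M], year:sex[T.M]
intercept, b_year, b_sexM, b_inter = 697.3, -0.3240, -302.464, 0.1499

year, sexM = 2010, 1  # a male swimmer in 2010

# The interaction term is multiplied by year AND sex[T.M]
model_value = intercept + b_year * year + b_sexM * sexM + b_inter * year * sexM
```

For a female swimmer, both the `sex[T.M]` term and the interaction term would drop out, since `sexM` would be 0.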
It may be useful to combine two (or more) categorical variables into a single variable. For example, in the Current Population Survey data, the variable `sex` has levels `F` and `M`, while the variable `race` has levels `W` and `NW`. Combining the two variables creates a new variable with four levels: `FNW`, `MNW`, `FW`, `MW`:
`cps.csv`
In [43]:
cps = pd.read_csv("http://www.mosaic-web.org/go/datasets/cps.csv")
(cps.sex + cps.race).head() # There are 534 cases in total
Out[43]:
This allows us to compute group-wise statistics on the combined groups relatively easily:
In [44]:
cps.groupby(cps.sex+cps.race).wage.mean()
Out[44]:
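If you don't want to download the CPS file, the same combine-then-group pattern can be sketched on a tiny made-up data frame (the wage values here are invented purely for illustration):

```python
import pandas as pd

# Invented data using the same sex/race coding as the CPS example
toy = pd.DataFrame({
    'sex':  ['F', 'M', 'F', 'M'],
    'race': ['W', 'W', 'NW', 'NW'],
    'wage': [8.0, 10.0, 7.0, 9.0],
})

# String concatenation builds the four combined levels: FW, MW, FNW, MNW
combined = toy.sex + toy.race

# Group-wise means on the combined variable
means = toy.groupby(combined).wage.mean()
```

Because both columns hold strings, `+` concatenates them element-wise, which is all the "combining" amounts to.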
When you combine categorical variables inside a model function (such as `sm.ols`), you won't need to specify the data frame that the variables come from, but you will need to wrap the combination inside the identity function, `I()`. For more information on the identity function and other formula specifics, see the documentation here.
In [47]:
sm.ols("wage ~ I(sex+race)", data=cps).fit().params
Out[47]:
To convert between different data types, we can use the `.astype()` method of a column (or Series). This can be useful for converting a quantitative variable to a categorical one. For example, if a quantity like `month` has been coded as a number (say, 1 for January, 2 for February, etc.) but you do not want models to treat it as a numeric value, you can change the variable's type to a string with `utils.month.astype(str)`.
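As a quick illustration of the conversion itself, on a made-up Series independent of the utilities data:

```python
import pandas as pd

# A numerically coded "month" variable
months = pd.Series([1, 2, 12])

# astype(str) turns each number into its string form
as_str = months.astype(str)
```

After the conversion, the values compare as strings rather than numbers, so a model formula will treat them as category labels.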
To illustrate, consider two different models of temperature versus month:
`utilities.csv`
In [122]:
utils = pd.read_csv("http://www.mosaic-web.org/go/datasets/utilities.csv")
fit1 = sm.ols("temp ~ month", data=utils).fit()
fit2 = sm.ols("temp ~ month.astype(str)", data=utils).fit()
Here are the graphs of those models:
In [129]:
import numpy as np
import matplotlib.pyplot as plt
argx = np.argsort(utils.month)  # Use this to sort by month
plt.scatter(utils.month[argx], utils.temp[argx], label='observed')
plt.plot(utils.month[argx], fit1.fittedvalues[argx], '-', color='r', label='fitted')
plt.legend(loc='best', numpoints=1)
plt.title("fit1 model")
plt.xlabel("month")
lab = plt.ylabel("temp")
In [130]:
import numpy as np
import matplotlib.pyplot as plt
argx = np.argsort(utils.month)  # Use this to sort by month
plt.scatter(utils.month[argx], utils.temp[argx], label='observed')
plt.plot(utils.month[argx], fit2.fittedvalues[argx], '-', color='r', label='fitted')
plt.legend(loc='best', numpoints=1)
plt.title("fit2 model")
plt.xlabel("month")
lab = plt.ylabel("temp")
In the first model, month is treated quantitatively, so the model term `month` produces a straight-line relationship that does not correspond well to the data. In the second model, month is treated categorically, allowing a more complicated model relationship.
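Under the hood, treating month categorically expands it into indicator (0/1) columns, one per level, each of which gets its own coefficient (the formula interface also drops one level as the baseline). pandas' `get_dummies` shows the same kind of expansion on a made-up Series:

```python
import pandas as pd

# A small, invented month variable, converted to strings as in fit2
months = pd.Series([1, 2, 3, 1]).astype(str)

# One indicator column per distinct level
indicators = pd.get_dummies(months)
```

Each row has exactly one "on" indicator, which is why the categorical model can fit a separate level for each month rather than a single slope.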
As with all 'Statistical Modeling: A Fresh Approach for Python' tutorials, this tutorial is based directly on material from 'Statistical Modeling: A Fresh Approach (2nd Edition)' by Daniel Kaplan.
I have made an effort to keep the text and explanations consistent between the original (R-based) version and the Python tutorials, in order to keep things comparable. With that in mind, any errors, omissions, and/or differences between the two versions are mine, and any questions, comments, and/or concerns should be directed to me.