In [4]:
import statsmodels.formula.api as sm
import pandas as pd
swim = pd.read_csv("http://www.mosaic-web.org/go/datasets/swim100m.csv")
mod = sm.ols("time ~ year + sex", data=swim)
fit = mod.fit()
fit.summary()
Out[4]:
In [5]:
from scipy.stats import t
t.ppf(1 - 0.05 / 2, fit.df_resid)
Out[5]:
Given the coefficient estimate and the standard error from the regression report, the confidence interval is easily generated (it is also included in the summary() report by default). For a 95% confidence interval, you just multiply the standard error by about 2 to get the margin of error. For example, in the above, the margin of error on sex[T.M] is approximately $2.0 \times 1.013 = 2.026$, or, in computer notation:
In [19]:
2.*1.013
Out[19]:
If you want the two endpoints of the confidence interval, rather than just the margin of error, do these simple calculations: 1) subtract the margin of error from the estimate; 2) add the margin of error to the estimate. So,
In [20]:
-9.798 - (2.*1.013)
Out[20]:
In [21]:
-9.798 + (2.*1.013)
Out[21]:
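Rather than typing the numbers in by hand, you can pull the estimate and its standard error straight from the fitted model object. Here is a quick sketch using the fit object from above (params and bse are the statsmodels attributes holding the coefficient estimates and their standard errors):
In [ ]:
est = fit.params["sex[T.M]"]   # coefficient estimate (about -9.798)
se = fit.bse["sex[T.M]"]       # its standard error (about 1.013)
(est - 2.*se, est + 2.*se)     # approximate 95% confidence interval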
The key thing to remember is the multiplier that is applied to the standard error. A multiplier of approximately 2 is used for a 95% confidence level. The Conf. Int. field in the regression report provides a convenient way to obtain the confidence intervals directly. It calculates the exact multiplier (which depends somewhat on the sample size) and applies it to the standard error to produce the confidence intervals. You can also retrieve the confidence intervals directly from the fitted model object:
In [7]:
fit = sm.ols("time ~ year + sex", data=swim).fit()
fit.conf_int()
Out[7]:
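To see what conf_int() is doing behind the scenes, here is a sketch that should reproduce its output: apply the exact t multiplier (computed with the percent point function, as earlier) to each coefficient's standard error:
In [ ]:
from scipy.stats import t
mult = t.ppf(1 - 0.05/2., fit.df_resid)  # exact multiplier for a 95% level
pd.DataFrame({"Lower": fit.params - mult*fit.bse,
              "Upper": fit.params + mult*fit.bse})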
Confidence intervals on model coefficients can be computed using the same bootstrapping technique introduced in Tutorial 5 (Confidence Intervals).
Start with your fitted model. To illustrate, here is the same model as above, looking at swimming time over the years, taking into account sex:
In [8]:
fit = sm.ols("time ~ year + sex", data=swim).fit()
fit.params
Out[8]:
These parameters reflect one hypothetical draw from the population-based sampling distribution. It's impossible to get another draw from the "population" here: the actual records are all you've got. But to approximate sampling variation, you can treat the sample as your population and re-sample, using the resample function we defined in Tutorial 5 (Confidence Intervals):
In [10]:
import numpy as np
def resample(df, n=None, replace=True):
    '''
    A simple function to 'resample' a data frame

    Arguments
    ---------
    df : Input data frame (required)
    n : The number of sampled cases to draw (optional)
    replace : Sample with (True) or without (False) replacement (optional)
    '''
    if n is None:        # if we don't supply a valid n, use same size as input
        n = df.shape[0]  # returned data frame has shape == df.shape
    # choose n cases at random, sample with replacement (replace == True)
    choices = np.random.choice(df.index, n, replace=replace)
    return df.loc[choices]  # select rows by label (.loc replaces the deprecated .ix)
fit = sm.ols("time ~ year + sex", data=resample(swim)).fit()
fit.params
Out[10]:
Construct many such re-sampling trials and collect the results:
In [12]:
s = pd.DataFrame([sm.ols("time ~ year + sex", data=resample(swim)).fit().params for i in range(500)], index=range(500))
In [13]:
s.head()
Out[13]:
To get the standard errors of the coefficients, just take the standard deviation across the re-sampling trials:
In [14]:
s.std()
Out[14]:
Multiplying the standard error by 2 gives the approximate 95% margin of error. Alternatively, you can compute the confidence interval directly:
In [15]:
from scipy.stats import t # Use the Student's t distribution to compute the multiplier
mult = t.ppf(1.-(1.-0.95)/2., s.shape[0]-1.) # Should be close to 2
confs = [s.mean() + s.std() * i * mult for i in (-1,1)] # Mean +/- Std Dev * Multiplier
And print it nicely as a Pandas data frame:
In [16]:
pd.DataFrame(confs, index=["Lower", "Upper"]).T
Out[16]:
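A common alternative, not used in the original text, is the bootstrap percentile method: skip the t multiplier entirely and read the interval endpoints straight off the re-sampling trials. A minimal sketch using the trials collected in s:
In [ ]:
s.quantile([0.025, 0.975]).T  # 2.5% and 97.5% quantiles across the 500 trials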
When a model is used to make a prediction, it's helpful to be able to describe how precise the prediction is. For instance, suppose you want to use the kidsfeet.csv
data set to make a prediction of the foot width of a girl whose foot length is 25 cm.
First, fit your model:
In [17]:
feet = pd.read_csv("http://www.mosaic-web.org/go/datasets/kidsfeet.csv")
feet.columns
Out[17]:
In [18]:
feet.sex.unique()
Out[18]:
In [19]:
mod = sm.ols("width ~ length + sex", data=feet)
fit = mod.fit()
The fitted model object (fit) contains a lot of information about the model (mod) and the data used for fitting the parameters. For the purposes of using the fitted model to make predictions, you want a function that takes the values for the explanatory variables as inputs and returns the corresponding model value as the output. The predict method of the fitted model object fit performs just such a task. With it, you can convert inputs into outputs. Take care to use the right coding for categorical variables.
In [68]:
pred = fit.predict({"length":25, "sex":"G"})
pred
Out[68]:
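To see the effect of the categorical coding, here is a quick sketch comparing predictions for a girl and a boy with the same foot length (assuming the two levels are "B" and "G", as returned by feet.sex.unique() above); the two model values differ by exactly the sex coefficient:
In [ ]:
fit.predict(pd.DataFrame({"length": [25, 25], "sex": ["G", "B"]}))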
In order to generate a confidence interval on our prediction, we need to decide what type of interval we want. There are two types of prediction confidence intervals that we can calculate:
1. Interval on the model value, which reflects the sampling distributions of the coefficients themselves. To calculate this, we first need the prediction mean standard error and the exact multiplier from the Student's t inverse survival function (the complement of the percent point function (ppf) introduced earlier: isf(q) equals ppf(1 - q)).
Unfortunately, this isn't (yet) as simple as it could be using the "statsmodels" library. Future versions of this package may allow users to compute this directly from the fitted model object.
In [86]:
from scipy.stats import t
# exog row follows the design matrix order: (Intercept, sex[T.G], length) = (1, 1, 25)
pmse = fit.t_test(np.array([1., 1., 25.0])).sd[0][0]  # prediction mean standard error
tppf = t.isf(0.05/2., fit.df_resid)  # exact 95% multiplier from the inverse survival function
confs = [pred + i * tppf * pmse for i in (-1,1)] # Pred +/- MSE * TPPF
confs.append(pred) # Add on the prediction value
pd.DataFrame(confs, index=["Lower", "Upper", "Prediction"]).T
Out[86]:
2. Interval on the prediction, which includes the variation due to the uncertainty in the coefficients as well as the size of a typical residual. To find this interval, use the wls_prediction_std function from the statsmodels.sandbox.regression.predstd module.
Again, future versions of "statsmodels" may make accessing these prediction confidence intervals easier...
In [87]:
from statsmodels.sandbox.regression.predstd import wls_prediction_std
# same exog row as above: (Intercept, sex[T.G], length) = (1, 1, 25)
prstd, iv_l, iv_u = wls_prediction_std(fit, exog=np.array([1., 1., 25.0]))
confs = [iv_l[0], iv_u[0], pred[0]]
pd.DataFrame(confs, index=["Lower", "Upper", "Prediction"]).T
Out[87]:
The prediction interval is wider than the model-value confidence interval because the size of a typical residual adds uncertainty on top of the uncertainty in the model value itself.
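As a quick sanity check (a sketch, assuming the pmse and prstd values computed in the cells above), the prediction standard deviation combines the model-value standard error with the residual variance:
In [ ]:
np.sqrt(pmse**2 + fit.mse_resid)  # should match prstd from wls_prediction_std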
As with all 'Statistical Modeling: A Fresh Approach for Python' tutorials, this tutorial is based directly on material from 'Statistical Modeling: A Fresh Approach (2nd Edition)' by Daniel Kaplan.
I have made an effort to keep the text and explanations consistent between the original (R-based) version and the Python tutorials, in order to keep things comparable. With that in mind, any errors, omissions, and/or differences between the two versions are mine, and any questions, comments, and/or concerns should be directed to me.