Title: Confidence in Models Date: 2013-10-05 17:05 Author: cfarmer Email: carson.farmer@gmail.com Category: Statistical Modeling for Python Tags: Helpful tips, Python, Statistical Modeling, Teaching Slug: statistical-modeling-python-confidence-models Latex: yes Status: draft

Confidence in Models

Regression reports are generated using software you have already encountered: the ols function to specify a model, the .fit() method to fit it, and the summary() method to construct the report from the fitted model. To illustrate:

`swim100m.csv`


In [4]:
import statsmodels.formula.api as sm
import pandas as pd

swim = pd.read_csv("http://www.mosaic-web.org/go/datasets/swim100m.csv")
mod = sm.ols("time ~ year + sex", data=swim)
fit = mod.fit()
fit.summary()


Out[4]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                   time   R-squared:                       0.844
Model:                            OLS   Adj. R-squared:                  0.839
Method:                 Least Squares   F-statistic:                     159.6
Date:                Sun, 06 Oct 2013   Prob (F-statistic):           1.58e-24
Time:                        22:24:52   Log-Likelihood:                -172.12
No. Observations:                  62   AIC:                             350.2
Df Residuals:                      59   BIC:                             356.6
Df Model:                           2
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    555.7168     33.800     16.441      0.000       488.083   623.350
sex[T.M]      -9.7980      1.013     -9.673      0.000       -11.825    -7.771
year          -0.2515      0.017    -14.516      0.000        -0.286    -0.217
==============================================================================
Omnibus:                       52.546   Durbin-Watson:                   0.375
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              241.626
Skew:                           2.430   Prob(JB):                     3.40e-53
Kurtosis:                      11.362   Cond. No.                     1.30e+05
==============================================================================

The exact multiplier used for these 95% intervals comes from the Student's t distribution with the model's residual degrees of freedom; for this model it is very close to 2:


In [5]:
from scipy.stats import t
t.ppf(1 - 0.05 / 2, fit.df_resid)


Out[5]:
2.0009953770482101

Confidence Intervals from Standard Errors

Given the coefficient estimate and the standard error from the regression report, the confidence interval is easily generated (it is also included in the summary() report by default). For a 95% confidence interval, you just multiply the standard error by about 2 to get the margin of error. For example, in the above, the margin of error on sex[T.M] is approximately $2.0 \times 1.013 = 2.026$, or, in computer notation:


In [19]:
2.*1.013


Out[19]:
2.026

If you want the two endpoints of the confidence interval, rather than just the margin of error, do two simple calculations: 1) subtract the margin of error from the estimate to get the lower endpoint, and 2) add the margin of error to the estimate to get the upper endpoint. So,


In [20]:
-9.798 - (2.*1.013)


Out[20]:
-11.824

In [21]:
-9.798 + (2.*1.013)


Out[21]:
-7.772
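
Rather than re-typing numbers from the report, the same endpoints can be pulled straight from the fitted model; here is a minimal sketch using the fit.params (coefficient estimates) and fit.bse (standard errors) attributes:

est = fit.params["sex[T.M]"]  # coefficient estimate for sex[T.M]
se = fit.bse["sex[T.M]"]      # its standard error
(est - 2.*se, est + 2.*se)    # rough 95% interval, about (-11.8, -7.8)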

The key thing to remember is the multiplier that is applied to the standard error. A multiplier of approximately 2 is used for a 95% confidence level. The Conf. Int. field in the regression report provides a convenient way to obtain the confidence intervals directly. It actually calculates the exact multiplier (which depends somewhat on the sample size) and applies it to the standard error to produce the confidence intervals. You can also retrieve the confidence intervals directly from the fitted model object:


In [7]:
fit = sm.ols("time ~ year + sex", data=swim).fit()
fit.conf_int()


Out[7]:
                    0           1
Intercept  488.083311  623.350256
sex[T.M]   -11.824713   -7.771209
year        -0.286128   -0.216800
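
To see where those interval endpoints come from, they can be reconstructed by hand from the coefficient estimates, the standard errors, and the exact t multiplier; a short sketch:

from scipy.stats import t
mult = t.ppf(1. - 0.05/2., fit.df_resid)  # exact 95% multiplier, about 2.001 here
pd.DataFrame({"Lower": fit.params - mult*fit.bse, "Upper": fit.params + mult*fit.bse})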

Bootstrapping Confidence Intervals

Confidence intervals on model coefficients can be computed using the same bootstrapping technique introduced in Tutorial 5 (Confidence Intervals).

Start with your fitted model. To illustrate, here is the same model as above, looking at swimming time over the years, taking into account sex:


In [8]:
fit = sm.ols("time ~ year + sex", data=swim).fit()
fit.params


Out[8]:
Intercept    555.716783
sex[T.M]      -9.797961
year          -0.251464
dtype: float64

These parameters reflect one hypothetical draw from the population-based sampling distribution. It's impossible to get another draw from the "population" here - the actual records are all you've got. But to approximate sampling variation, you can treat the sample as your population and re-sample, using the resample function we defined in Tutorial 5 (Confidence Intervals):


In [10]:
import numpy as np
def resample(df, n=None, replace=True):
    '''
    A simple function to 'resample' a data frame
    
    Arguments
    ----------
    df : Input data frame (required)
    n : The number of sampled cases to draw (optional)
    replace : Sample with (True) or without (False) replacement (optional)
    '''
    if n is None: # if we don't supply a valid n, use same size as input
        n = df.shape[0] # return data frame has shape == df.shape
    # choose n cases at random, sample with replacement (replace == True)
    choices = np.random.choice(df.index, n, replace=replace)
    return df.ix[choices] # select subset and return new data frame

fit = sm.ols("time ~ year + sex", data=resample(swim)).fit()
fit.params


Out[10]:
Intercept    593.800992
sex[T.M]     -10.707462
year          -0.270630
dtype: float64

Construct many such re-sampling trials and collect the results:


In [12]:
s = pd.DataFrame([sm.ols("time ~ year + sex", data=resample(swim)).fit().params for i in range(500)], index=range(500))

In [13]:
s.head()


Out[13]:
    Intercept   sex[T.M]      year
0  530.219915  -9.257418 -0.238826
1  616.791588 -11.004144 -0.282477
2  601.240175  -9.098913 -0.274934
3  609.460845 -11.378371 -0.278428
4  581.716203  -9.944458 -0.264425

To get the standard errors of the coefficients, just take the standard deviation across the re-sampling trials:


In [14]:
s.std()


Out[14]:
Intercept    43.598751
sex[T.M]      0.965063
year          0.022016
dtype: float64
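
These bootstrap standard errors should be roughly comparable to the ones in the regression report. A quick way to check (a small sketch; it refits the model on the original data, since fit currently holds a fit to a re-sampled data frame):

orig_fit = sm.ols("time ~ year + sex", data=swim).fit()  # fit to the original data
pd.DataFrame({"bootstrap": s.std(), "report": orig_fit.bse})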

Multiplying the standard error by 2 gives the approximate 95% margin of error. Alternatively, you can compute the confidence interval directly:


In [15]:
from scipy.stats import t # Use the Student's t distribution to compute the multiplier
mult = t.ppf(1.-(1.-0.95)/2., s.shape[0]-1.) # Should be close to 2
confs = [s.mean() + s.std() * i * mult for i in (-1,1)] # Mean +/- Std Dev * Multiplier

And print it nicely as a Pandas data frame:


In [16]:
pd.DataFrame(confs, index=["Lower", "Upper"]).T


Out[16]:
                Lower       Upper
Intercept  471.161195  642.480689
sex[T.M]   -11.722037   -7.929862
year        -0.295287   -0.208775
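
An alternative (not used above) is the percentile method: rather than assuming a t-shaped sampling distribution, read the interval straight off the quantiles of the re-sampling trials. A quick sketch:

# Percentile bootstrap: 2.5% and 97.5% quantiles of the 500 trials
pd.DataFrame({"Lower": s.quantile(0.025), "Upper": s.quantile(0.975)})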

Prediction Confidence Intervals

When a model is used to make a prediction, it's helpful to be able to describe how precise the prediction is. For instance, suppose you want to use the kidsfeet.csv data set to make a prediction of the foot width of a girl whose foot length is 25 cm.

`kidsfeet.csv`

First, fit your model:


In [17]:
feet = pd.read_csv("http://www.mosaic-web.org/go/datasets/kidsfeet.csv")
feet.columns


Out[17]:
Index([u'name', u'birthmonth', u'birthyear', u'length', u'width', u'sex', u'biggerfoot', u'domhand'], dtype=object)

In [18]:
feet.sex.unique()


Out[18]:
array(['B', 'G'], dtype=object)

In [19]:
mod = sm.ols("width ~ length + sex", data=feet)
fit = mod.fit()

The fitted model object (fit) contains a lot of information about the model (mod) and the data used for fitting the parameters. For the purposes of making predictions, what you want is a function that takes values for the explanatory variables as inputs and returns the corresponding model value as the output. The predict method of the fitted model object performs just such a task: it converts inputs into outputs. Take care to use the right coding for categorical variables (here, sex must be given as 'B' or 'G').


In [68]:
pred = fit.predict({"length":25, "sex":"G"})
pred


Out[68]:
array([ 8.93427484])
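
To see the effect of the categorical coding, you can ask for several predictions at once; a small sketch comparing a boy and a girl with the same foot length:

# Hypothetical inputs: a boy ('B') and a girl ('G'), both with 25 cm feet
fit.predict(pd.DataFrame({"length": [25, 25], "sex": ["B", "G"]}))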

In order to generate a confidence interval on our prediction, we need to decide what type of interval we want. There are two types of prediction confidence intervals that we can calculate:

1. Interval on the model value, which reflects the sampling distributions of the coefficients themselves. To calculate this, we first need the prediction mean standard error and the inverse survival function for the Student's t distribution (basically the opposite of the percent point function (ppf) introduced earlier).

Unfortunately, this isn't (yet) as simple as it could be using the "statsmodels" library. Future versions of this package may allow users to compute this directly from the fitted model object.


In [86]:
from scipy.stats import t
pmse = fit.t_test(np.array([1., 1., 25.0])).sd[0][0] # std error of the model value; exog row is [intercept=1, sex dummy=1 ('G'), length=25]
tppf = t.isf(0.05/2., fit.df_resid) # exact 95% multiplier from the Student's t distribution

confs = [pred + i * tppf * pmse for i in (-1,1)] # Pred +/- MSE * TPPF
confs.append(pred) # Add on the prediction value
pd.DataFrame(confs, index=["Lower", "Upper", "Prediction"]).T


Out[86]:
      Lower     Upper  Prediction
0  8.742565  9.125985    8.934275

2. Interval on the prediction, which includes the variation due to the uncertainty in the coefficients as well as the size of a typical residual. To find this interval, use the wls_prediction_std function from the statsmodels.sandbox.regression.predstd module.

Again, future versions of "statsmodels" may make accessing these prediction confidence intervals easier...


In [87]:
from statsmodels.sandbox.regression.predstd import wls_prediction_std
prstd, iv_l, iv_u = wls_prediction_std(fit, exog=np.array([1., 1., 25.0])) # same exog row as above: [intercept, sex dummy ('G'), length]
confs = [iv_l[0], iv_u[0], pred[0]]
pd.DataFrame(confs, index=["Lower", "Upper", "Prediction"]).T


Out[87]:
      Lower     Upper  Prediction
0  8.130484  9.738066    8.934275

The prediction interval is wider than the model-value confidence interval because the typical size of a residual adds uncertainty on top of the uncertainty in the model value itself.
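
You can check this directly by comparing the widths of the two intervals computed above:

(2.*tppf*pmse, iv_u[0] - iv_l[0])  # model-value width (~0.38) vs. prediction width (~1.61)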

Next time on Statistical Modeling: A Fresh Approach for Python...

  • Hypothesis Testing on Whole Models

Reference

As with all 'Statistical Modeling: A Fresh Approach for Python' tutorials, this tutorial is based directly on material from 'Statistical Modeling: A Fresh Approach (2nd Edition)' by Daniel Kaplan.

I have made an effort to keep the text and explanations consistent between the original (R-based) version and the Python tutorials, in order to keep things comparable. With that in mind, any errors, omissions, and/or differences between the two versions are mine, and any questions, comments, and/or concerns should be directed to me.