In [4]:
import statsmodels.formula.api as sm
import pandas as pd
swim = pd.read_csv("http://www.mosaic-web.org/go/datasets/swim100m.csv")
mod = sm.ols("time ~ year + sex", data=swim)
fit = mod.fit()
fit.summary()
Out[4]:
In [5]:
from scipy.stats import t
t.ppf(1 - 0.05 / 2, fit.df_resid)
Out[5]:
Given the coefficient estimate and the standard error from the regression report, the confidence interval is easily generated (it is also included in the summary() report by default). For a 95% confidence interval, you just multiply the standard error by about 2 to get the margin of error. For example, in the above, the margin of error on sex[T.M] is approximately $2.0 \times 1.013 = 2.026$, or, in computer notation:
In [19]:
2.*1.013
Out[19]:
If you want the two endpoints of the confidence interval, rather than just the margin of error, do these simple calculations: 1) subtract the margin of error from the estimate; 2) add the margin of error to the estimate. So,
In [20]:
-9.798 - (2.*1.013)
Out[20]:
In [21]:
-9.798 + (2.*1.013)
Out[21]:
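Rather than typing the numbers in by hand, you can pull the estimate and its standard error straight from the fitted model object. Here is a quick sketch using the fit object from above (params and bse are the statsmodels attributes holding the coefficient estimates and their standard errors):
In [ ]:
est = fit.params["sex[T.M]"]   # coefficient estimate (about -9.798)
se = fit.bse["sex[T.M]"]       # its standard error (about 1.013)
(est - 2.*se, est + 2.*se)     # approximate 95% confidence interval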
The key thing to remember is the multiplier that is applied to the standard error. A multiplier of approximately 2 is used for a 95% confidence level. The Conf. Int. field in the regression report provides a convenient way to obtain the confidence intervals directly. It calculates the exact multiplier (which depends somewhat on the sample size) and applies it to the standard error to produce the confidence intervals. You can also retrieve the confidence intervals directly from the fitted model object:
In [7]:
fit = sm.ols("time ~ year + sex", data=swim).fit()
fit.conf_int()
Out[7]:
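To see what conf_int() is doing behind the scenes, here is a sketch that should reproduce its output: apply the exact t multiplier (computed with the percent point function, as earlier) to each coefficient's standard error:
In [ ]:
from scipy.stats import t
mult = t.ppf(1 - 0.05/2., fit.df_resid)  # exact multiplier for a 95% level
pd.DataFrame({"Lower": fit.params - mult*fit.bse,
              "Upper": fit.params + mult*fit.bse})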
Confidence intervals on model coefficients can be computed using the same bootstrapping technique introduced in Tutorial 5 (Confidence Intervals).
Start with your fitted model. To illustrate, here is the same model as above, looking at swimming time over the years, taking into account sex:
In [8]:
fit = sm.ols("time ~ year + sex", data=swim).fit()
fit.params
Out[8]:
These parameters reflect one hypothetical draw from the population-based sampling distribution. It's impossible to get another draw from the "population" here: the actual records are all you've got. But to approximate sampling variation, you can treat the sample as your population and re-sample, using the resample function we defined in Tutorial 5 (Confidence Intervals):
In [10]:
import numpy as np
def resample(df, n=None, replace=True):
    '''
    A simple function to 'resample' a data frame

    Arguments
    ---------
    df : Input data frame (required)
    n : The number of sampled cases to draw (optional)
    replace : Sample with (True) or without (False) replacement (optional)
    '''
    if n is None:        # if we don't supply a valid n, use same size as input
        n = df.shape[0]  # returned data frame has shape == df.shape
    # choose n cases at random, sample with replacement (replace == True)
    choices = np.random.choice(df.index, n, replace=replace)
    return df.loc[choices]  # select rows by label (.loc replaces the deprecated .ix)
fit = sm.ols("time ~ year + sex", data=resample(swim)).fit()
fit.params
Out[10]:
Construct many such re-sampling trials and collect the results:
In [12]:
s = pd.DataFrame([sm.ols("time ~ year + sex", data=resample(swim)).fit().params for i in range(500)], index=range(500))
In [13]:
s.head()
Out[13]:
To get the standard errors of the coefficients, just take the standard deviation across the re-sampling trials:
In [14]:
s.std()
Out[14]:
Multiplying the standard error by 2 gives the approximate 95% margin of error. Alternatively, you can compute the confidence interval directly:
In [15]:
from scipy.stats import t # Use the Student's t distribution to compute the multiplier
mult = t.ppf(1.-(1.-0.95)/2., s.shape[0]-1.) # Should be close to 2
confs = [s.mean() + s.std() * i * mult for i in (-1,1)] # Mean +/- Std Dev * Multiplier
And print it nicely as a Pandas data frame:
In [16]:
pd.DataFrame(confs, index=["Lower", "Upper"]).T
Out[16]:
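A common alternative, not used in the original text, is the bootstrap percentile method: skip the t multiplier entirely and read the interval endpoints straight off the re-sampling trials. A minimal sketch using the trials collected in s:
In [ ]:
s.quantile([0.025, 0.975]).T  # 2.5% and 97.5% quantiles across the 500 trials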
When a model is used to make a prediction, it's helpful to be able to describe how precise the prediction is. For instance, suppose you want to use the kidsfeet.csv
data set to make a prediction of the foot width of a girl whose foot length is 25 cm.
First, fit your model:
In [17]:
feet = pd.read_csv("http://www.mosaic-web.org/go/datasets/kidsfeet.csv")
feet.columns
Out[17]:
In [18]:
feet.sex.unique()
Out[18]:
In [19]:
mod = sm.ols("width ~ length + sex", data=feet)
fit = mod.fit()
The fitted model object (fit) contains a lot of information about the model (mod) and the data used for fitting the parameters. For the purposes of using the fitted model to make predictions, you want a function that takes the values for the explanatory variables as inputs and returns the corresponding model value as the output. The predict method of the fitted model object fit performs just such a task. With it, you can convert inputs into outputs. Take care to use the right coding for categorical variables.
In [68]:
pred = fit.predict({"length":25, "sex":"G"})
pred
Out[68]:
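To see the effect of the categorical coding, here is a quick sketch comparing predictions for a girl and a boy with the same foot length (assuming the two levels are "B" and "G", as returned by feet.sex.unique() above); the two model values differ by exactly the sex coefficient:
In [ ]:
fit.predict(pd.DataFrame({"length": [25, 25], "sex": ["G", "B"]}))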
In order to generate a confidence interval on our prediction, we need to decide what type of interval we want. There are two types of prediction confidence intervals that we can calculate:
1. Interval on the model value, which reflects the sampling distributions of the coefficients themselves. To calculate this, we first need the prediction mean standard error and the exact multiplier from the Student's t inverse survival function (the complement of the percent point function (ppf) introduced earlier: isf(q) equals ppf(1 - q)).
Unfortunately, this isn't (yet) as simple as it could be using the "statsmodels" library. Future versions of this package may allow users to compute this directly from the fitted model object.
In [86]:
from scipy.stats import t
# exog row follows the design matrix order: (Intercept, sex[T.G], length) = (1, 1, 25)
pmse = fit.t_test(np.array([1., 1., 25.0])).sd[0][0]  # prediction mean standard error
tppf = t.isf(0.05/2., fit.df_resid)  # exact 95% multiplier from the inverse survival function
confs = [pred + i * tppf * pmse for i in (-1,1)] # Pred +/- MSE * TPPF
confs.append(pred) # Add on the prediction value
pd.DataFrame(confs, index=["Lower", "Upper", "Prediction"]).T
Out[86]:
2. Interval on the prediction, which includes the variation due to the uncertainty in the coefficients as well as the size of a typical residual. To find this interval, use the wls_prediction_std function from the statsmodels.sandbox.regression.predstd module.
Again, future versions of "statsmodels" may make accessing these prediction confidence intervals easier...
In [87]:
from statsmodels.sandbox.regression.predstd import wls_prediction_std
# same exog row as above: (Intercept, sex[T.G], length) = (1, 1, 25)
prstd, iv_l, iv_u = wls_prediction_std(fit, exog=np.array([1., 1., 25.0]))
confs = [iv_l[0], iv_u[0], pred[0]]
pd.DataFrame(confs, index=["Lower", "Upper", "Prediction"]).T
Out[87]:
The prediction interval is wider than the model-value confidence interval because the size of a typical residual adds uncertainty on top of the uncertainty in the model value itself.
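As a quick sanity check (a sketch, assuming the pmse and prstd values computed in the cells above), the prediction standard deviation combines the model-value standard error with the residual variance:
In [ ]:
np.sqrt(pmse**2 + fit.mse_resid)  # should match prstd from wls_prediction_std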
As with all 'Statistical Modeling: A Fresh Approach for Python' tutorials, this tutorial is based directly on material from 'Statistical Modeling: A Fresh Approach (2nd Edition)' by Daniel Kaplan.
I have made an effort to keep the text and explanations consistent between the original (R-based) version and the Python tutorials, in order to keep things comparable. With that in mind, any errors, omissions, and/or differences between the two versions are mine, and any questions, comments, and/or concerns should be directed to me.