It's easy enough to calculate the mean or median within groups. Often the point of doing this is to support a claim that the groups are different, for instance that men are on average taller than women.
Less obvious to those starting out in statistics is the idea that a quantity such as the mean or median itself has limited precision. This limited precision arises from the sampling process used to collect the data: the exact mean or median you compute depends on the particular sample you happened to collect. One purpose of statistical inference is to characterize sampling variability, that is, to give quantitative meaning to this limited precision.
As usual, load pandas and any other libraries that we'll need at the start of our session:
In [41]:
import numpy as np
import pandas as pd
import scipy.stats as st
The examples will be based on the Cherry-Blossom 2008 data described earlier:
`Cherry-Blossom-2008.csv`
In [3]:
run = pd.read_csv("http://www.mosaic-web.org/go/datasets/Cherry-Blossom-2008.csv")
run.columns
Out[3]:
Your data are typically a sample from a population. Collecting the sample is usually hard work. Just to illustrate the context of a sampling distribution, here is a simulation of selecting a sample of size n = 100 from the population of runners. It’s essential to keep in mind that you do not usually pull out your sample using the computer in this way. Instead, you go into the field or laboratory and collect your data.
In [4]:
from numpy.random import choice
mysamp = run.loc[choice(run.index, 100, replace=False)]  # select 100 cases at random, without replacement
Now that you have a sample, you can calculate the sample statistic that's of interest to you. For instance:
In [63]:
mysamp.groupby("sex").gun.mean()
Out[63]:
Note that the results are slightly different from those found above using the whole population. That's to be expected, since the sample is just a random part of the population. But ordinarily, you will not know what the population values are; all you have to work with is your sample.
Theoretically, the sampling distribution reflects the variation from one randomly dealt sample to another, where each sample is taken from the population. In practice, your only ready access to the population is through your sample. So, to simulate the process of random sampling, re-sampling is used and the re-sampling distribution is used as a convenient approximation to the sampling distribution.
Re-sampling involves drawing from the set of cases in your sample with replacement. To illustrate, consider this example of a very small, simple set: the numbers 1 to 5:
In [13]:
nums = list(range(1, 6))
nums
Out[13]:
Each resample of size n consists of n members from the set, but in any one resample each member might appear more than once and some might not appear at all. Here are three different resamples from nums:
In [14]:
choice(nums, 5)
Out[14]:
In [15]:
choice(nums, 5)
Out[15]:
In [16]:
choice(nums, 5)
Out[16]:
To use resampling to estimate the sampling distribution of a statistic, you apply the calculation of the statistic to a resampled version of your sample. For instance, here are the group-wise means from one resample of the run sample we created above:
In [64]:
newsamp = mysamp.loc[choice(mysamp.index, mysamp.shape[0])]  # resample with replacement, same size as mysamp
newsamp.groupby("sex").gun.mean()
Out[64]:
This is quite verbose, so it makes sense to wrap it in a small helper function (which could also be used in our earlier (re)sampling examples):
In [5]:
def resample(df, n=None, replace=True):
    '''
    A simple function to 'resample' a data frame

    Arguments
    ---------
    df : Input data frame (required)
    n : The number of sampled cases to draw (optional)
    replace : Sample with (True) or without (False) replacement (optional)
    '''
    if n is None:        # if we don't supply a valid n, use same size as input
        n = df.shape[0]  # returned data frame has shape == df.shape
    # choose n cases at random, sampling with replacement by default
    choices = np.random.choice(df.index, n, replace=replace)
    return df.loc[choices]  # select the chosen cases and return a new data frame
In [65]:
resample(mysamp).groupby("sex").gun.mean()
Out[65]:
And here is another:
In [66]:
resample(mysamp).groupby("sex").gun.mean()
Out[66]:
The bootstrap procedure involves conducting many such trials and examining the variation from one trial to the next. Using Python's list comprehensions (see the previous Computational Techniques chapter, 'Describing Variation'), it is possible to automate the collection of multiple trials. For instance, here are five trials carried out using the above resampling technique:
In [68]:
[resample(mysamp).groupby("sex").gun.mean().values for i in range(5)]
Out[68]:
We can also wrap the above inside a call to pd.DataFrame to make it print nicely:
In [38]:
pd.DataFrame([resample(mysamp).groupby("sex").gun.mean() for i in range(5)], index=range(5))
Out[38]:
Typically, you will use several hundred trials for bootstrapping. The most common way to summarize the variation in bootstrap trials is to calculate a 95% coverage interval (when applied to a sampling distribution, the coverage interval is called a confidence interval).
To do the computation, give a name to the results of the repeated bootstrap trials; here it’s called trials:
In [69]:
trials = pd.DataFrame([resample(mysamp).groupby("sex").gun.mean() for i in range(500)], index=range(500))
trials.head()
Out[69]:
Computing the coverage intervals can be done using Series.quantile() for the 2.5th and 97.5th percentiles. You could also use scipy.stats.scoreatpercentile if you wanted (see the previous Computational Techniques chapter, 'Describing Variation'); a short sketch using it follows the cell below.
In [104]:
pd.DataFrame([trials.quantile(q) for q in [.025, .975]], index=("Lower", "Upper")).T
Out[104]:
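As a cross-check, here is a minimal sketch of the same percentile calculation done with scipy.stats.scoreatpercentile. Note that it expects percentiles on a 0-100 scale rather than 0-1; the column names of trials come from the values of the sex variable.

# same coverage interval, computed with scipy.stats.scoreatpercentile
# (percentiles are given on a 0-100 scale rather than 0-1)
{col: st.scoreatpercentile(trials[col], [2.5, 97.5]) for col in trials.columns}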
The idea of sampling distributions is based on drawing at random from the population, not resampling. Ordinarily, you can’t do this calculation since you don’t have the population at hand. But in this example, we happen to have the data for all runners. Here’s the population-based confidence interval for the mean running time, with sample size n = 100, broken down by sex:
In [103]:
trials = pd.DataFrame([resample(run, 100, replace=False).groupby("sex").gun.mean() for i in range(500)], index=range(500))  # draw from the population without replacement
pd.DataFrame([trials.quantile(q) for q in [.025, .975]], index=("Lower", "Upper")).T
Out[103]:
The above output shows the lower and upper limits of the confidence interval for each of the groups. By convention, the interval is at the 95% level, and so it runs from the 2.5th percentile to the 97.5th percentile (the [.025, .975] in the code above).
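The same recipe works for other confidence levels. For a 90% interval, say, you would use the 5th and 95th percentiles instead; a minimal sketch, reusing the trials data frame from above:

# a 90% coverage interval: quantiles .05 and .95 instead of .025 and .975
pd.DataFrame([trials.quantile(q) for q in [.05, .95]], index=("Lower", "Upper")).T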
Historically, statisticians have been concerned with very small samples: say n = 2 or n = 3. Even in this era of huge data sets, such small sample sizes are often encountered in laboratory experiments, etc. Bootstrapping does not work well with such small samples, and other techniques are needed to simulate sampling variability. Many of these techniques are based on algebra and probability theory, and give somewhat complex formulas for calculating confidence intervals from data. The formulas are often found in textbooks, but for most of the modeling techniques you will use in later tutorials, appropriate formulas for confidence intervals have been implemented in software. For other modeling techniques, bootstrapping is used to find the confidence intervals. But keep in mind that bootstrapping is only effective when the sample size n is one or two dozen or larger.
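To give a flavor of what such formula-based intervals look like, here is a minimal sketch of the textbook t-based 95% confidence interval for a mean, computed with scipy.stats; the tiny three-value sample is made up purely for illustration:

small = np.array([3.1, 2.8, 3.4])    # a made-up sample with n = 3
n = len(small)
tstar = st.t.ppf(0.975, df=n - 1)    # critical t value for a 95% interval
se = small.std(ddof=1) / np.sqrt(n)  # standard error of the mean
(small.mean() - tstar * se, small.mean() + tstar * se)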
The grade-point average is a kind of group-wise mean, where the group is an individual student. This is not the usual way of looking at things for a student, who sees only his or her own grades. But institutions have data on many students.
The data files grades.csv and courses.csv are drawn from an institutional database at a college. They give the grades for more than 400 students who graduated in 2005. Another file, grade-to-number.csv, gives the rules used by the institution in converting letter grades to numbers.
`grades.csv`, `courses.csv`, `grade-to-number.csv`
The data files are part of a relational database, a very important way of managing large amounts of data used by private and public institutions, corporations, and governments; it is the basis for a multi-billion-dollar segment of the economy. Ordinarily, relational databases are queried using special-purpose computer languages that sort, extract, and combine the data. For convenience, we can load the data from the CSV files using the following Python commands. We can also convert the letter grades to numbers and extract the data for one student:
In [93]:
grades = pd.read_csv("http://www.mosaic-web.org/go/datasets/grades.csv")
gp = pd.read_csv("http://www.mosaic-web.org/go/datasets/grade-to-number.csv")
all_students = grades.merge(gp, on="grade")
one_student = all_students[all_students.sid == "S31200"]
In [94]:
one_student.head() # There are 14 cases in total
Out[94]:
Calculating the mean grade-point for the one student is a simple matter:
In [95]:
one_student.gradepoint.mean()
Out[95]:
It’s equally straightforward to calculate the grade-point averages for all students as individuals:
In [98]:
grade_points = all_students.groupby("sid").gradepoint.mean()  # mean grade point for each student
grade_points.head() # There are 443 cases
Out[98]:
Bootstrapping can be used to find the confidence interval on the grade-point average for each student:
In [105]:
trials = pd.DataFrame([resample(all_students).groupby("sid").gradepoint.mean() for i in range(100)], index=range(100))
boot_grades = pd.DataFrame([trials.quantile(q) for q in [.025, .975]], index=("Lower", "Upper")).T
boot_grades.head() # Again, there are 443 cases
Out[105]:
It’s important to point out that there are other methods for calculating confidence intervals that are based on the standard deviation of the data. Formulas and procedures for such methods are given in just about every standard introductory statistics book and would certainly be used instead of bootstrapping in a simple calculation of the sort illustrated here.
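For instance, here is a minimal sketch of such a standard-deviation-based (t) interval for the single student examined earlier, using scipy.stats.t.interval; the result can be compared with the corresponding row of the bootstrap intervals above:

# classic t-based 95% confidence interval for one student's grade-point average
gpa = one_student.gradepoint
st.t.interval(0.95, df=len(gpa) - 1, loc=gpa.mean(), scale=st.sem(gpa))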
However, such formulas don’t go to the heart of the problem: accounting for variation in the grades and the contribution from different sources of that variation. For example, some of the variation in this student’s grades might be due systematically to improvement over time or to differences between instructors’ practices. The modeling techniques introduced in the following tutorials provide a means to examine and quantify the different sources of variation.
As with all 'Statistical Modeling: A Fresh Approach for Python' tutorials, this tutorial is based directly on material from 'Statistical Modeling: A Fresh Approach (2nd Edition)' by Daniel Kaplan.
I have made an effort to keep the text and explanations consistent between the original (R-based) version and the Python tutorials, in order to keep things comparable. With that in mind, any errors, omissions, and/or differences between the two versions are mine, and any questions, comments, and/or concerns should be directed to me.