Title: Confidence Intervals Date: 2013-06-28 12:00 Author: cfarmer Email: carson.farmer@gmail.com Category: Statistical Modeling for Python Tags: Helpful tips, Python, Statistical Modeling, Teaching Slug: statistical-modeling-python-conf-int Latex: yes Status: draft

Confidence Intervals

It's easy enough to calculate the mean or median within groups. Often the point of doing this is to support a claim that the groups are different, for instance that men are on average taller than women.

Less obvious to those starting out in statistics is the idea that a quantity such as the mean or median itself has limited precision. This limited precision arises from the sampling process that was used to collect the data. The exact mean or median relates to the particular sample you have collected. One purpose of statistical inference is to characterize sampling variability, that is, to provide a quantitative meaning for the limited precision due to sampling variability.

As usual, load pandas and any other libraries that we'll need at the start of our session:


In [41]:
import numpy as np
import pandas as pd
import scipy.stats as st

The examples will be based on the Cherry-Blossom 2008 data described earlier:

`Cherry-Blossom-2008.csv`


In [3]:
run = pd.read_csv("http://www.mosaic-web.org/go/datasets/Cherry-Blossom-2008.csv")
run.columns


Out[3]:
Index([u'position', u'division', u'total', u'name', u'age', u'place', u'net', u'gun', u'sex'], dtype=object)

Finding a Sampling Distribution through Bootstrapping

Your data are typically a sample from a population. Collecting the sample is usually hard work. Just to illustrate the context of a sampling distribution, here is a simulation of selecting a sample of size n = 100 from the population of runners. It’s essential to keep in mind that you do not usually pull out your sample using the computer in this way. Instead, you go into the field or laboratory and collect your data.


In [4]:
from numpy.random import choice
mysamp = run.loc[choice(run.index, 100, replace=False)]

Now that you have a sample, you can calculate the sample statistic that's of interest to you. For instance:


In [63]:
mysamp.groupby("sex").gun.mean()


Out[63]:
sex
F      98.507197
M      85.622024
Name: gun, dtype: float64
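
Since, for this illustration, we happen to have data for the entire population of runners, we can compute the corresponding group-wise means over all of them for comparison (ordinarily, of course, you could not do this):


In [ ]:
run.groupby("sex").gun.mean()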

Note that the results from the sample differ slightly from those for the whole population (computed above). That's to be expected, since the sample is just a random part of the population. But ordinarily, you will not know the population values; all you have to work with is your sample.

Theoretically, the sampling distribution reflects the variation from one randomly dealt sample to another, where each sample is taken from the population. In practice, your only ready access to the population is through your sample. So, to simulate the process of random sampling, re-sampling is used and the re-sampling distribution is used as a convenient approximation to the sampling distribution.

Re-sampling involves drawing from the set of cases in your sample with replacement. To illustrate, consider this example of a very small, simple set: the numbers 1 to 5:


In [13]:
nums = range(1,6)
nums


Out[13]:
[1, 2, 3, 4, 5]

Each resample of size n consists of n members from the set, but in any one resample each member might appear more than once and some might not appear at all. Here are three different resamples from nums:


In [14]:
choice(nums, 5)


Out[14]:
array([1, 3, 4, 2, 5])

In [15]:
choice(nums, 5)


Out[15]:
array([3, 4, 3, 3, 5])

In [16]:
choice(nums, 5)


Out[16]:
array([1, 4, 3, 5, 1])

To use resampling to estimate the sampling distribution of a statistic, you apply the calculation of the statistic to a resampled version of your sample. For instance, here are the group-wise means from one resample of the run sample we created above:


In [64]:
newsamp = mysamp.loc[choice(mysamp.index, mysamp.shape[0])]
newsamp.groupby("sex").gun.mean()


Out[64]:
sex
F      97.468794
M      87.433019
Name: gun, dtype: float64

This is quite verbose, so it makes sense to wrap these steps in a small helper function (which could also be used in our earlier (re)sampling examples):


In [5]:
def resample(df, n=None, replace=True):
    '''
    A simple function to 'resample' a data frame

    Arguments
    ---------
    df : Input data frame (required)
    n : The number of sampled cases to draw (optional)
    replace : Sample with (True) or without (False) replacement (optional)
    '''
    if n is None:  # if n isn't supplied, use the same size as the input
        n = df.shape[0]  # returned data frame has shape == df.shape
    # choose n row labels at random, with replacement by default
    choices = np.random.choice(df.index, n, replace=replace)
    return df.loc[choices]  # select the subset and return a new data frame

In [65]:
resample(mysamp).groupby("sex").gun.mean()


Out[65]:
sex
F      97.911852
M      85.591515
Name: gun, dtype: float64

And here is another:


In [66]:
resample(mysamp).groupby("sex").gun.mean()


Out[66]:
sex
F      97.758000
M      84.017333
Name: gun, dtype: float64

The bootstrap procedure involves conducting many such trials and examining the variation from one trial to the next. Using Python's list comprehensions (see the previous Computational Techniques chapter, 'Describing Variation'), it is possible to automate the collection of multiple trials. For instance, here are five trials carried out using the above resampling technique:


In [68]:
[resample(mysamp).groupby("sex").gun.mean().values for i in range(5)]


Out[68]:
[array([ 101.99254386,   88.3983871 ]),
 array([ 99.37689394,  84.9860119 ]),
 array([ 99.275     ,  83.12633333]),
 array([ 99.45614035,  83.68817204]),
 array([ 100.03819444,   80.99487179])]

We can also wrap a similar computation inside a call to pd.DataFrame to make it print nicely (here, for simplicity, computing the overall mean rather than the group-wise means):


In [38]:
pd.DataFrame([resample(mysamp).gun.mean() for i in range(5)], index=range(5))


Out[38]:
           0
0  91.064000
1  92.605833
2  88.969500
3  90.524667
4  90.143000

Typically, you will use several hundred trials for bootstrapping. The most common way to summarize the variation in bootstrap trials is to calculate a 95% coverage interval (when applied to a sampling distribution, the coverage interval is called a confidence interval).

To do the computation, give a name to the results of the repeated bootstrap trials; here it's called trials:


In [69]:
trials = pd.DataFrame([resample(mysamp).groupby("sex").gun.mean() for i in range(500)], index=range(500))
trials.head()


Out[69]:
sex           F          M
0    101.501984  87.042529
1     99.967045  87.508631
2     97.368478  85.311420
3     99.560606  83.304464
4     98.628889  87.559091

Computing the coverage intervals can be done using Series.quantile() for the 2.5th and 97.5th percentiles. You could also use scipy.stats.scoreatpercentile if you wanted (see the previous Computational Techniques chapter, 'Describing Variation').


In [104]:
pd.DataFrame([trials.quantile(q) for q in [.025, .975]], index=("Lower", "Upper")).T


Out[104]:
         Lower       Upper
sex
F    95.337556  102.473628
M    84.724423   92.391599
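
The same limits can also be read off one column at a time with scipy.stats (imported above as st); note that scoreatpercentile expects percentiles on a 0 to 100 scale. For instance, for the F group:


In [ ]:
st.scoreatpercentile(trials["F"], 2.5), st.scoreatpercentile(trials["F"], 97.5)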

The idea of sampling distributions is based on drawing at random from the population, not resampling. Ordinarily, you can’t do this calculation since you don’t have the population at hand. But in this example, we happen to have the data for all runners. Here’s the population-based confidence interval for the mean running time, with sample size n = 100, broken down by sex:


In [103]:
trials = pd.DataFrame([resample(run, 100).groupby("sex").gun.mean() for i in range(500)], index=range(500))
pd.DataFrame([trials.quantile(q) for q in [.025, .975]], index=("Lower", "Upper")).T


Out[103]:
         Lower       Upper
sex
F    95.337556  102.473628
M    84.724423   92.391599

The above output shows the lower and upper limits of the confidence interval for each of the groups. By convention, the interval is at the 95% level, and so it runs from the 2.5th percentile to the 97.5th percentile of the trials ([.025, .975]).
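
To use a different confidence level, simply change the quantiles. For example, a 90% confidence interval runs from the 5th percentile to the 95th percentile:


In [ ]:
pd.DataFrame([trials.quantile(q) for q in [.05, .95]], index=("Lower", "Upper")).T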

Historically, statisticians have been concerned with very small samples: say n = 2 or n = 3. Even in this era of huge data sets, such small sample sizes are often encountered, for instance in laboratory experiments. Bootstrapping cannot work well with such small samples, and other techniques are needed to simulate sampling variability. Many of these techniques are based on algebra and probability theory, and give somewhat complex formulas for calculating confidence intervals from data. The formulas are often found in textbooks, but for most of the modeling techniques you will use in later tutorials, appropriate formulas for confidence intervals have been implemented in software. For other modeling techniques, bootstrapping is used to find the confidence intervals. But keep in mind that bootstrapping can only be effective when the sample size n is one or two dozen or larger.

Computing Grade-Point Averages

The grade-point average is a kind of group-wise mean, where the group is an individual student. This is not the usual way of looking at things for a student, who sees only his or her own grades. But institutions have data on many students.

The data files grades.csv and courses.csv are drawn from an institutional database at a college. They give the grades for more than 400 students who graduated in 2005. Another file, grade-to-number.csv, gives the rules used by the institution in converting letter grades to numbers.

`grades.csv`, `courses.csv`, `grade-to-number.csv`

The data files are part of a relational database, a very important way of managing large amounts of data used by private and public institutions, corporations, and governments; it's the basis for a multi-billion-dollar segment of the economy. Ordinarily, relational databases are queried using special-purpose computer languages that sort, extract, and combine the data. For convenience, we can load the data from CSV files using the following Python commands. We can also convert the letter grades to numbers and extract the data for one student:


In [93]:
grades = pd.read_csv("http://www.mosaic-web.org/go/datasets/grades.csv")
gp = pd.read_csv("http://www.mosaic-web.org/go/datasets/grade-to-number.csv")
all_students = grades.merge(gp, on="grade")
one_student = all_students[all_students.sid == "S31200"]

In [94]:
one_student.head() # There are 14 cases in total


Out[94]:
         sid grade    sessionID  gradepoint
1     S31200    D+  session2331        1.33
1036  S31200    A-  session3010        3.66
2429  S31200    B-  session1835        2.66
2430  S31200    B-  session1839        2.66
2431  S31200    B-  session3809        2.66

Calculating the mean grade-point for the one student is a simple matter:


In [95]:
one_student.gradepoint.mean()


Out[95]:
2.1864285714285718

It’s equally straightforward to calculate the grade-point averages for all students as individuals:


In [98]:
grade_points = all_students.groupby("sid").mean()
grade_points.head() # There are 443 cases


Out[98]:
        gradepoint
sid
S31185    2.412500
S31188    3.018125
S31191    3.212143
S31194    3.359167
S31197    3.356154

Bootstrapping can be used to find the confidence interval on the grade-point average for each student:


In [105]:
trials = pd.DataFrame([resample(all_students).groupby("sid").gradepoint.mean() for i in range(100)], index=range(100))
boot_grades = pd.DataFrame([trials.quantile(q) for q in [.025, .975]], index=("Lower", "Upper")).T
boot_grades.head() # Again, there are 443 cases


Out[105]:
           Lower     Upper
S31185  1.184750  3.098750
S31188  2.536667  3.482661
S31191  3.066850  3.390382
S31194  3.031157  3.652067
S31197  3.200412  3.509527
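
Since grade_points and boot_grades are both indexed by student ID, it's easy to place the point estimates alongside the bootstrapped intervals with a join:


In [ ]:
grade_points.join(boot_grades).head()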

It’s important to point out that there are other methods for calculating confidence intervals that are based on the standard deviation of the data. Formulas and procedures for such methods are given in just about every standard introductory statistics book and would certainly be used instead of bootstrapping in a simple calculation of the sort illustrated here.
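
For instance, here is a minimal sketch of a conventional t-based 95% confidence interval for the one student's grade-point average, using scipy.stats (st.t.interval and st.sem are standard scipy functions):


In [ ]:
x = one_student.gradepoint
# 95% interval from the t distribution with len(x) - 1 degrees of freedom,
# centered on the sample mean and scaled by the standard error of the mean
st.t.interval(0.95, len(x) - 1, loc=x.mean(), scale=st.sem(x))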

However, such formulas don't go to the heart of the problem: accounting for variation in the grades and the contribution from different sources of that variation. For example, some of the variation in a student's grades might be due systematically to improvement over time, or to differences between instructors' practices. The modeling techniques introduced in the following tutorials provide a means to examine and quantify the different sources of variation.

Next time on Statistical Modeling: A Fresh Approach for Python...

  • The Language of Models

Reference

As with all 'Statistical Modeling: A Fresh Approach for Python' tutorials, this tutorial is based directly on material from 'Statistical Modeling: A Fresh Approach (2nd Edition)' by Daniel Kaplan.

I have made an effort to keep the text and explanations consistent between the original (R-based) version and the Python tutorials, in order to keep things comparable. With that in mind, any errors, omissions, and/or differences between the two versions are mine, and any questions, comments, and/or concerns should be directed to me.