It's easy enough to calculate the mean or median within groups. Often the point of doing this is to support a claim that the groups are different, for instance that men are on average taller than women.
Less obvious to those starting out in statistics is the idea that a quantity such as the mean or median itself has limited precision. This limited precision arises from the sampling process used to collect the data: the exact mean or median you compute depends on the particular sample you happened to collect. One purpose of statistical inference is to characterize sampling variability, that is, to give quantitative meaning to this limited precision.
As usual, load pandas and any other libraries that we'll need at the start of our session:
In [41]:
import numpy as np
import pandas as pd
import scipy.stats as st
The examples will be based on the Cherry-Blossom 2008 data described earlier:
`Cherry-Blossom-2008.csv`
In [3]:
run = pd.read_csv("http://www.mosaic-web.org/go/datasets/Cherry-Blossom-2008.csv")
run.columns
Out[3]:
Your data are typically a sample from a population. Collecting the sample is usually hard work. Just to illustrate the context of a sampling distribution, here is a simulation of selecting a sample of size n = 100 from the population of runners. It’s essential to keep in mind that you do not usually pull out your sample using the computer in this way. Instead, you go into the field or laboratory and collect your data.
In [4]:
from numpy.random import choice
mysamp = run.loc[choice(run.index, 100, replace=False)]  # select 100 cases at random, without replacement
Now that you have a sample, you can calculate the sample statistic that's of interest to you. For instance:
In [63]:
mysamp.groupby("sex").gun.mean()
Out[63]:
Note that the results are slightly different from those found above using the whole population. That's to be expected, since the sample is just a random part of the population. But ordinarily, you will not know what the population values are; all you have to work with is your sample.
Theoretically, the sampling distribution reflects the variation from one randomly dealt sample to another, where each sample is taken from the population. In practice, your only ready access to the population is through your sample. So, to simulate the process of random sampling, re-sampling is used and the re-sampling distribution is used as a convenient approximation to the sampling distribution.
Re-sampling involves drawing from the set of cases in your sample with replacement. To illustrate, consider this example of a very small, simple set: the numbers 1 to 5:
In [13]:
nums = list(range(1, 6))
nums
Out[13]:
Each resample of size n consists of n members from the set, but in any one resample each member might appear more than once and some might not appear at all. Here are three different resamples from nums:
In [14]:
choice(nums, 5)
Out[14]:
In [15]:
choice(nums, 5)
Out[15]:
In [16]:
choice(nums, 5)
Out[16]:
To use resampling to estimate the sampling distribution of a statistic, you apply the calculation of the statistic to a resampled version of your sample. For instance, here are the group-wise means from one resample of the run sample we created above:
In [64]:
newsamp = mysamp.loc[choice(mysamp.index, mysamp.shape[0])]  # resample with replacement, same size as mysamp
newsamp.groupby("sex").gun.mean()
Out[64]:
This is quite verbose, so it makes sense to wrap it in a small helper function (which could also be used in our earlier (re)sampling examples):
In [5]:
def resample(df, n=None, replace=True):
    '''
    A simple function to 'resample' a data frame

    Arguments
    ---------
    df : Input data frame (required)
    n : The number of sampled cases to draw (optional)
    replace : Sample with (True) or without (False) replacement (optional)
    '''
    if n is None:        # if we don't supply a valid n, use same size as input
        n = df.shape[0]  # returned data frame has shape == df.shape
    # choose n cases at random, sampling with replacement by default
    choices = np.random.choice(df.index, n, replace=replace)
    return df.loc[choices]  # select the chosen cases and return a new data frame
In [65]:
resample(mysamp).groupby("sex").gun.mean()
Out[65]:
And here is another:
In [66]:
resample(mysamp).groupby("sex").gun.mean()
Out[66]:
The bootstrap procedure involves conducting many such trials and examining the variation from one trial to the next. Using Python's list comprehensions (see the previous Computational Techniques chapter, 'Describing Variation'), it is possible to automate the collection of multiple trials. For instance, here are five trials carried out using the above resampling technique:
In [68]:
[resample(mysamp).groupby("sex").gun.mean().values for i in range(5)]
Out[68]:
We can also wrap the above inside a call to pd.DataFrame to make it print nicely:
In [38]:
pd.DataFrame([resample(mysamp).groupby("sex").gun.mean() for i in range(5)], index=range(5))
Out[38]:
Typically, you will use several hundred trials for bootstrapping. The most common way to summarize the variation in bootstrap trials is to calculate a 95% coverage interval (when applied to a sampling distribution, the coverage interval is called a confidence interval).
To do the computation, give a name to the results of the repeated bootstrap trials; here it’s called trials:
In [69]:
trials = pd.DataFrame([resample(mysamp).groupby("sex").gun.mean() for i in range(500)], index=range(500))
trials.head()
Out[69]:
Computing the coverage intervals can be done using Series.quantile() for the 2.5th and 97.5th percentiles. You could also use scipy.stats.scoreatpercentile if you wanted (see the previous Computational Techniques chapter, 'Describing Variation'); a short sketch using it follows the cell below.
In [104]:
pd.DataFrame([trials.quantile(q) for q in [.025, .975]], index=("Lower", "Upper")).T
Out[104]:
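As a cross-check, here is a minimal sketch of the same percentile calculation done with scipy.stats.scoreatpercentile. Note that it expects percentiles on a 0-100 scale rather than 0-1; the column names of trials come from the values of the sex variable.

# same coverage interval, computed with scipy.stats.scoreatpercentile
# (percentiles are given on a 0-100 scale rather than 0-1)
{col: st.scoreatpercentile(trials[col], [2.5, 97.5]) for col in trials.columns}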
The idea of sampling distributions is based on drawing at random from the population, not resampling. Ordinarily, you can’t do this calculation since you don’t have the population at hand. But in this example, we happen to have the data for all runners. Here’s the population-based confidence interval for the mean running time, with sample size n = 100, broken down by sex:
In [103]:
trials = pd.DataFrame([resample(run, 100, replace=False).groupby("sex").gun.mean() for i in range(500)], index=range(500))  # draw from the population without replacement
pd.DataFrame([trials.quantile(q) for q in [.025, .975]], index=("Lower", "Upper")).T
Out[103]:
The above output shows the lower and upper limits of the confidence interval for each of the groups. By convention, the interval is at the 95% level, and so it runs from the 2.5th percentile to the 97.5th percentile (the [.025, .975] in the code above).
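The same recipe works for other confidence levels. For a 90% interval, say, you would use the 5th and 95th percentiles instead; a minimal sketch, reusing the trials data frame from above:

# a 90% coverage interval: quantiles .05 and .95 instead of .025 and .975
pd.DataFrame([trials.quantile(q) for q in [.05, .95]], index=("Lower", "Upper")).T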
Historically, statisticians have been concerned with very small samples: say n = 2 or n = 3. Even in this era of huge data sets, such small sample sizes are often encountered in laboratory experiments, etc. Bootstrapping does not work well with such small samples, and other techniques are needed to simulate sampling variability. Many of these techniques are based on algebra and probability theory, and give somewhat complex formulas for calculating confidence intervals from data. The formulas are often found in textbooks, but for most of the modeling techniques you will use in later tutorials, appropriate formulas for confidence intervals have been implemented in software. For other modeling techniques, bootstrapping is used to find the confidence intervals. But keep in mind that bootstrapping is only effective when the sample size n is one or two dozen or larger.
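To give a flavor of what such formula-based intervals look like, here is a minimal sketch of the textbook t-based 95% confidence interval for a mean, computed with scipy.stats; the tiny three-value sample is made up purely for illustration:

small = np.array([3.1, 2.8, 3.4])    # a made-up sample with n = 3
n = len(small)
tstar = st.t.ppf(0.975, df=n - 1)    # critical t value for a 95% interval
se = small.std(ddof=1) / np.sqrt(n)  # standard error of the mean
(small.mean() - tstar * se, small.mean() + tstar * se)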
The grade-point average is a kind of group-wise mean, where the group is an individual student. This is not the usual way of looking at things for a student, who sees only his or her own grades. But institutions have data on many students.
The data files grades.csv and courses.csv are drawn from an institutional database at a college. They give the grades for more than 400 students who graduated in 2005. Another file, grade-to-number.csv, gives the rules used by the institution in converting letter grades to numbers.
`grades.csv`, `courses.csv`, `grade-to-number.csv`
The data files are part of a relational database, a very important way of managing large amounts of data used by private and public institutions, corporations, and governments; it is the basis for a multi-billion-dollar segment of the economy. Ordinarily, relational databases are queried using special-purpose computer languages that sort, extract, and combine the data. For convenience, we can load the data from the CSV files using the following Python commands. We can also convert the letter grades to numbers and extract the data for one student:
In [93]:
grades = pd.read_csv("http://www.mosaic-web.org/go/datasets/grades.csv")
gp = pd.read_csv("http://www.mosaic-web.org/go/datasets/grade-to-number.csv")
all_students = grades.merge(gp, on="grade")
one_student = all_students[all_students.sid == "S31200"]
In [94]:
one_student.head() # There are 14 cases in total
Out[94]:
Calculating the mean grade-point for the one student is a simple matter:
In [95]:
one_student.gradepoint.mean()
Out[95]:
It’s equally straightforward to calculate the grade-point averages for all students as individuals:
In [98]:
grade_points = all_students.groupby("sid").gradepoint.mean()  # mean grade point for each student
grade_points.head() # There are 443 cases
Out[98]:
Bootstrapping can be used to find the confidence interval on the grade-point average for each student:
In [105]:
trials = pd.DataFrame([resample(all_students).groupby("sid").gradepoint.mean() for i in range(100)], index=range(100))
boot_grades = pd.DataFrame([trials.quantile(q) for q in [.025, .975]], index=("Lower", "Upper")).T
boot_grades.head() # Again, there are 443 cases
Out[105]:
It’s important to point out that there are other methods for calculating confidence intervals that are based on the standard deviation of the data. Formulas and procedures for such methods are given in just about every standard introductory statistics book and would certainly be used instead of bootstrapping in a simple calculation of the sort illustrated here.
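For instance, here is a minimal sketch of such a standard-deviation-based (t) interval for the single student examined earlier, using scipy.stats.t.interval; the result can be compared with the corresponding row of the bootstrap intervals above:

# classic t-based 95% confidence interval for one student's grade-point average
gpa = one_student.gradepoint
st.t.interval(0.95, df=len(gpa) - 1, loc=gpa.mean(), scale=st.sem(gpa))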
However, such formulas don’t go to the heart of the problem: accounting for variation in the grades and the contribution from different sources of that variation. For example, some of the variation in this student’s grades might be due systematically to improvement over time or to differences between instructors’ practices. The modeling techniques introduced in the following tutorials provide a means to examine and quantify the different sources of variation.
As with all 'Statistical Modeling: A Fresh Approach for Python' tutorials, this tutorial is based directly on material from 'Statistical Modeling: A Fresh Approach (2nd Edition)' by Daniel Kaplan.
I have made an effort to keep the text and explanations consistent between the original (R-based) version and the Python tutorials, in order to keep things comparable. With that in mind, any errors, omissions, and/or differences between the two versions are mine, and any questions, comments, and/or concerns should be directed to me.