Let's return to a topic we first discussed up in our introduction to simulation -- the standard error of the mean.
Here was the scenario we explored:
You want to learn about a variable $X$ in a population of interest.
Assume $X \sim N(\mu,\sigma)$.
You take a random sample of size $n$ from the population and estimate the sample mean $\overline{x}$
You repeat step 3 a large number of times, calculating a new sample mean each time.
We call the distribution of sample means the sampling distribution of the mean (note that you can also estimate the sampling distribution for any statistic of interest).
You examine the spread of your sample means. You will find that the sampling distribution of the mean is approximately normally distributed with mean $\sim\mu$, and with a standard deviation $\sim\frac{\sigma}{\sqrt{n}}$. $$ \overline{x} \sim N \left( \mu, \frac{\sigma}{\sqrt{n}}\ \right) $$
We refer to the standard deviation of a sampling distibution of a statistic as the standard error of that statistic. When the statistic of interest is the mean, this is the standard error of the mean (standard errors of the mean are often just referred to as "standard errors" as this is the most common standard error one usually calculates)
In [1]:
%matplotlib inline
import numpy as np
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use("bmh")
In [2]:
np.random.seed(20160222) # setting seed insures reproducability
mu, sigma = 10, 2
popn = stats.norm(loc=mu, scale=sigma)
ssizes = [25, 50, 100, 200, 400]
samples = [popn.rvs(size=(sz,100)) for sz in ssizes]
means = [np.mean(sample, axis=0) for sample in samples]
se = [np.std(mean) for mean in means]
The code above contains three list comprehensions for very compactly simulating sampling distribution of the man
ssizes
)sz
), generate random 100 samples, and store those samples in a matrix of size sz
$\times$ 100 (i.e. each column is a sample)
In [3]:
# make a pair of plots
ssmin, ssmax = min(ssizes), max(ssizes)
theoryss = np.linspace(ssmin, ssmax, 250)
fig, (ax1, ax2) = plt.subplots(1,2) # 1 x 2 grid of plots
fig.set_size_inches(12,4)
# plot histograms of sampling distributions
for (ss,mean) in zip(ssizes, means):
ax1.hist(mean, normed=True, histtype='stepfilled', alpha=0.75, label="n = %d" % ss)
ax1.set_xlabel("X")
ax1.set_ylabel("Density")
ax1.legend()
ax1.set_title("Sampling Distributions of Mean\nFor Different Sample Sizes")
# plot simulation SE of mean vs theory SE of mean
ax2.plot(ssizes, se, 'ko', label='simulation')
ax2.plot(theoryss, sigma/np.sqrt(theoryss), color='red', label="theory")
ax2.set_xlim(0, ssmax*1.1)
ax2.set_ylim(0, max(se)*1.1)
ax2.set_xlabel("sample size ($n$)")
ax2.set_ylabel("SE of mean")
ax2.legend()
ax2.set_title("Standard Error of Mean\nTheoretical Expectation vs. Simulation")
pass
In real-life life, we don't have access to the sampling distribution of the mean or the true population parameter $\sigma$ from which can calculate the standard error of the mean. However, we can still use our unbiased sample estimator of the standard deviation, $s$, to estimate the standard error of the mean.
$$ {SE}_{\overline{x}} = \frac{s}{\sqrt{n}} $$For the sampling distribution of the mean to be nearly normal with ${SE}_\overline{x}$ accurate, the following conditions should hold:
We know that given a random sample from a population of interest, the mean of $X$ in our random sample is unlikely to be the true population mean of $X$. However, our simulations have taught us a number of things:
We can use this knowledge to calculate plausible ranges of values for the mean. We call such ranges confidence intervals for the mean (the idea of confidence intervals can apply to other statistics as well). We're going to express our confidence intervals in terms of multiples of the standard error.
Let's start by using simulation to explore how often our confidence intervals capture the true mean when we base our confidence intervals on different multiples, $z$, of the SE.
$$ {CI}_\overline{x} = \overline{x} \pm (z \times {SE}_\overline{x}) $$For the purposes of this simulation, let's consider samples of size 50, drawn from the same population of interest as before (popn
above). We're going to generate a large number of such samples, and for each sample we will calculate the CI of the mean using the formula above. We will then ask, "for what fraction of the samples did our CI overlap the true population mean"? This will give us a sense of how well different confidence intervals do in providing a plausible range for the true mean.
In [4]:
N = 1000
samples50 = popn.rvs(size=(50, N)) # N samples of size 50
means50 = np.mean(samples50, axis=0) # sample means
std50 = np.std(samples50, axis=0, ddof=1) # sample std devs
se50 = std50/np.sqrt(50) # sample standard errors
frac_overlap_mu = []
zs = np.arange(1,3,step=0.05)
for z in zs:
lowCI = means50 - z*se50
highCI = means50 + z*se50
overlap_mu = np.logical_and(lowCI <= mu, highCI >= mu)
frac = np.count_nonzero(overlap_mu)/N
frac_overlap_mu.append(frac)
frac_overlap_mu = np.array(frac_overlap_mu)
plt.plot(zs, frac_overlap_mu * 100, 'k-', label="simulation")
plt.ylim(60, 104)
plt.xlim(1, 3)
plt.xlabel("z in CI = sample mean ± z × SE")
plt.ylabel(u"% of CIs that include popn mean")
# plot theoretical expectation
stdnorm = stats.norm(loc=0, scale=1)
plt.plot(zs, (1 - (2* stdnorm.sf(zs)))*100, 'r-', alpha=0.5, label="theory")
plt.legend(loc='lower right')
pass
How should we interpret the results above? We found as we increased the scaling of our confidence intervals (larger $z$), the true mean was within sample confidence intervals a greater proportion of the time. For example, when $z = 1$ we found that the true mean was within our CIs roughly 67% of the time, while at $z = 2$ the true mean was within our confidence intervals approximately 95% of the time.
We call $x \pm 2 \times {SE}_\overline{x}$ the approximate 95% confidence interval of the mean (see below for exact values of z). Given such a CI calculated from a random sample we can say we are "95% confident" that we have captured the true mean within the bounds of the CI (subject to the caveats about the sampling distribution above). By this we mean if we took many samples and built a confidence interval from each sample using the equation above, then about 95% of those intervals would contain the actual mean, μ. Note that this is exactly what we did in our simulation!
In [7]:
ndraw = 100
x = means50[:ndraw]
y = range(0,ndraw)
plt.errorbar(x, y, xerr=1.96*se50[:ndraw], fmt='o')
plt.vlines(mu, 0, ndraw, linestyle='dashed', color='#D55E00', linewidth=3, zorder=5)
plt.ylim(-1,101)
plt.yticks([])
plt.title("95% CI: mean ± 1.96×SE\nfor 100 samples of size 50")
fig = plt.gcf()
fig.set_size_inches(4,8)
In [6]:
perc = np.array([.80, .90, .95, .99, .997])
zval = stdnorm.ppf(1 - (1 - perc)/2) # account for the two tails of the sampling distn
print("% CI \tz × SE")
print("-----\t------")
for (i,j) in zip(perc, zval):
print("{:5.1f}\t{:6.2f}".format(i*100, j))
# see the string docs (https://docs.python.org/3.4/library/string.html)
# for information on how formatting works
In [ ]: