In [1]:
%matplotlib inline
import numpy as np
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
As you carry out your own analyses, you'll probably build up a library of code that you want to reuse. Rather than copying and pasting the same text over and over again into your Jupyter notebooks, you can create your own custom Python modules. To make such modules available to work with, you'll need to place them in the same directory as your Jupyter notebooks (if you're working with Python installed locally on your computer, there are other ways to organize your custom modules, but we won't discuss those for the purposes of this class).
Download the statplots.py module from http://roybatty.org/statplots.py and upload it (via the "Upload" button on the Jupyter home page) to the same directory as your notebooks.
In [2]:
import statplots
I want to emphasize the difference between sample distributions and the sampling distribution of the sample mean, and re-emphasize how these distributions relate to confidence intervals for the mean.
To do so, we'll illustrate these different concepts with the "butterfat" data set we've used in previous lectures.
In [3]:
fat = pd.read_csv("http://roybatty.org/butterfat.csv")
print(fat.describe())
mean_fat = fat.butterfat.mean()  # sample mean
std_fat = fat.butterfat.std()    # sample standard deviation
SE_fat = std_fat / np.sqrt(fat.butterfat.count())  # standard error of the mean
In [4]:
plt.hist(fat.butterfat, bins=13, density=True,
         color='steelblue', alpha=0.75, label="observed data")
dist1, ax1 = statplots.norm_plot(mean_fat, std_fat, color='k',
label="normal with parameters\nof observed")
dist2, ax2 = statplots.norm_plot(mean_fat, SE_fat, color='k', linestyle='dashed',
label="distn of sample means (estimated)")
plt.xlim(2.8,5.5)
maxy = ax2.get_ylim()[1]
plt.errorbar(mean_fat, 0.95 * maxy, xerr=1.96 * SE_fat, fmt='ro', label='95% CI of mean')
plt.legend(numpoints=1,loc='center left', bbox_to_anchor=(1, 0.5))
pass
You should be careful in interpreting confidence intervals. The correct interpretation uses wording like:
We are XX% confident that the population parameter is between...
A random sample of 50 college students was asked how many exclusive relationships they have been in so far. This sample yielded a mean of 3.2 and a standard deviation of 1.74. Estimate the true average number of exclusive relationships using this sample.
The 95% confidence interval for the mean is defined as: $$ \overline{x} \pm 1.96 \times SE $$
We calculate the sample standard error as: $$ {SE} = \frac{s}{\sqrt{n}} = \frac{1.74}{\sqrt{50}} = 0.246 $$
So our approximate 95% CI for the mean is: $$3.2 \pm 1.96 \times 0.246 = (2.72, 3.68) \approx (2.7, 3.7)$$.
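We can verify this arithmetic directly in Python (a minimal sketch; the values 3.2, 1.74, and 50 are taken from the problem statement above):
In [ ]:
# check the CI arithmetic from the worked example above
xbar, s, n = 3.2, 1.74, 50   # sample mean, sample std dev, sample size
SE = s / np.sqrt(n)          # standard error of the mean, ~0.246
lower, upper = xbar - 1.96 * SE, xbar + 1.96 * SE
print("SE: {:.3f}".format(SE))
print("95% CI: ({:.2f}, {:.2f})".format(lower, upper))  # ~(2.72, 3.68)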
Which of the following is the correct interpretation of this CI?
Up to now, in our discussions of sampling distributions, standard errors, and confidence intervals, our simulations have involved drawing samples from populations that are normally distributed. We found that the sampling distribution of the sample mean is itself normally distributed.
What would happen if we changed our underlying distribution to something that was not normally distributed?
Let's explore this issue by simulating sampling distributions drawn from two decidedly non-normal distributions: the uniform and the exponential.
In a uniform distribution, all possible outcomes are equally likely. A continuous uniform distribution is usually defined in terms of the interval $(a,b)$ giving the lower and upper limits of the possible outcomes, with mean $\mu = \frac{1}{2}(a+b)$ and standard deviation $\sigma = \sqrt{\frac{1}{12}(b-a)^2}$.
When specifying a uniform distribution using the scipy.stats module, the loc argument is the lower limit ($a$) and the scale argument is the width of the distribution, such that the upper limit $b$ is equal to loc + scale.
In [5]:
# U(2,6): loc gives a = 2, scale gives b - a = 4
uni = stats.uniform(loc=2, scale=4)
true_mean, true_var = uni.stats(moments='mv')  # theoretical mean and variance
true_std = np.sqrt(true_var)
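Before moving on, here's a quick sanity check (a minimal sketch using only quantities defined above) that scipy's moments agree with the closed-form formulas for $U(a,b)$ given earlier:
In [ ]:
# compare scipy's moments to the closed-form formulas for U(a, b)
a, b = 2, 6
print(true_mean, 0.5 * (a + b))              # mean: both 4.0
print(true_std, np.sqrt((b - a)**2 / 12.0))  # std dev: both ~1.1547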
In [6]:
# plot uniform distribution
x = np.linspace(0,8, 500)
density = uni.pdf(x)
plt.plot(x, density, color='black')
plt.ylim(0, max(density)*1.5)
plt.fill_between(x, density, color='gray', alpha=0.75)
plt.ylabel("Density")
plt.title("Probability Distribution $U(\mu=2,\sigma=6)$")
pass
In [7]:
# draw 1000 samples each of size n = 50 and n = 200 from the uniform and calculate sample means
n1, n2 = 50, 200
usamples1 = uni.rvs(size = (n1,1000))
usamples2 = uni.rvs(size = (n2,1000))
umeans1 = usamples1.mean(axis = 0)
umeans2 = usamples2.mean(axis = 0)
In [8]:
# plot resulting sampling distribution
plt.hist(umeans1, bins=21, histtype='stepfilled',
         color='steelblue', alpha=0.75, density=True)
plt.hist(umeans2, bins=21, histtype='stepfilled',
         color='sandybrown', alpha=0.75, density=True)
# draw vertical line at mean of underlying distn
ymax = plt.gca().get_ylim()[1] # find upper y limit of current plot axis
plt.vlines(true_mean, 0, ymax,
color='black', linestyle='dashed', linewidth=2)
plt.ylabel("Density")
plt.title("Simulating Sampling Distribution of Mean\nfor n = 50 and n = 200 from $U(2,6)$")
pass
In [9]:
SE_u1 = np.std(umeans1, ddof=1)
SE_u2 = np.std(umeans2, ddof=1)
print("Simulated standard error of sample mean for n = 50:", SE_u1)
print("Expected standard error of sample mean for n = 50:", true_std/np.sqrt(n1))
print()
print("Simulated standard error of sample mean for n = 200:", SE_u2)
print("Expected standard error of sample mean for n = 200:", true_std/np.sqrt(n2))
In [10]:
# define exponential distribution
expondistn = stats.expon(loc=0,scale=1)
etrue_mean, etrue_var = expondistn.stats(moments='mv')
etrue_std = np.sqrt(etrue_var)
# plot exponential distribution
x = np.linspace(0,8, 500)
density = expondistn.pdf(x)
plt.plot(x, density, color='black')
plt.ylim(0, max(density)*1.1)
plt.xlim(-1,10)
plt.fill_between(x, density, color='gray', alpha=0.75)
plt.vlines(etrue_mean, 0, expondistn.pdf(etrue_mean),
color='black', linestyle='dashed', linewidth=2)
plt.annotate("$\mu$", xy=(etrue_mean, expondistn.pdf(etrue_mean)),
xytext=(2,40),
textcoords='offset points',
horizontalalignment="center",
verticalalignment="bottom",
fontsize=18,
arrowprops=dict(arrowstyle="->",color='black'))
plt.ylabel("Density")
plt.title("Probability Distribution for Exponential")
pass
In [11]:
# draw 1000 samples each of size n = 10 and n = 50 from the exponential and calculate sample means
n1, n2 = 10, 50
esamples1 = expondistn.rvs(size = (n1,1000))
esamples2 = expondistn.rvs(size = (n2,1000))
emeans1 = esamples1.mean(axis = 0)
emeans2 = esamples2.mean(axis = 0)
# plot resulting sampling distribution
plt.hist(emeans1, bins=21, histtype='stepfilled',
         color='steelblue', alpha=0.75, density=True)
plt.hist(emeans2, bins=21, histtype='stepfilled',
         color='sandybrown', alpha=0.75, density=True)
# draw vertical line at mean of underlying distn
ymax = plt.gca().get_ylim()[1] # find upper y limit of current plot axis
plt.vlines(etrue_mean, 0, ymax,
color='black', linestyle='dashed', linewidth=2)
plt.ylabel("Density")
plt.title("Simulating Sampling Distribution of Mean\nfor n = 50 and n = 200 from Exponential")
pass
In [12]:
SE_e1 = np.std(emeans1, ddof=1)
SE_e2 = np.std(emeans2, ddof=1)
print("Simulated standard error of sample mean for n = 10:", SE_e1)
print("Expected standard error of sample mean for n = 10:", true_std/np.sqrt(n1))
print()
print("Simulated standard error of sample mean for n = 50:", SE_e2)
print("Expected standard error of sample mean for n = 50:", true_std/np.sqrt(n2))
You can keep doing simulation experiments like the ones above with arbitrary distributions (discrete or continuous) that have well-defined means ($\mu$) and finite standard deviations ($\sigma$). What you will find is that, regardless of the shape of the underlying distribution, the sampling distribution of the mean will, as $n$ gets large, converge to a normal distribution that has mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$.
This surprising observation is a manifestation of the Central Limit Theorem (CLT).
Here's a concise definition of the CLT from a nice web resource on probability from the University of Utah:
Roughly, the central limit theorem states that the distribution of the average (or sum) of a large number of independent, identically distributed variables will be approximately normal, regardless of the underlying distribution.
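To make this concrete, we can overlay the normal density the CLT predicts (mean $\mu$, standard deviation $\frac{\sigma}{\sqrt{n}}$) on the simulated sampling distribution of means from the exponential with n = 50. This is a minimal sketch reusing etrue_mean, etrue_std, emeans2, and n2 from the cells above:
In [ ]:
# overlay the CLT-predicted normal on the simulated sampling distribution
plt.hist(emeans2, bins=21, histtype='stepfilled',
         color='sandybrown', alpha=0.75, density=True)
x = np.linspace(emeans2.min(), emeans2.max(), 200)
clt_norm = stats.norm(loc=etrue_mean, scale=etrue_std / np.sqrt(n2))
plt.plot(x, clt_norm.pdf(x), color='black', linewidth=2)
plt.ylabel("Density")
plt.title("Sampling distribution of the mean (n = 50)\nvs. CLT-predicted normal")
pass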
The CLT is one reason that so many real-world random processes have probability distributions that are approximately normal. We'll revisit the CLT in various forms in later lectures.