In [22]:
# HIDDEN
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import math
import numpy as np
from scipy import stats
import ipywidgets as widgets
import nbinteract as nbi
The Central Limit Theorem says that the probability distribution of the sum or average of a large random sample drawn with replacement will be roughly normal, regardless of the distribution of the population from which the sample is drawn.
As we noted when we were studying Chebychev's bounds, results that can be applied to random samples regardless of the distribution of the population are very powerful, because in data science we rarely know the distribution of the population.
The Central Limit Theorem makes it possible to make inferences with very little knowledge about the population, provided we have a large random sample. That is why it is central to the field of statistical inference.
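To see the theorem in action before returning to Mendel's model, here is a minimal sketch using NumPy only (the distribution, sample size, and repetition count are illustrative choices, not from the text). The population is heavily skewed, yet the averages of large samples cluster symmetrically around the population mean, with spread shrinking like $1/\sqrt{n}$.

```python
import numpy as np

# A decidedly non-normal population: Exponential(1), which is heavily skewed.
# Its mean is 1 and its SD is 1.
rng = np.random.default_rng(0)

sample_size = 500
repetitions = 10_000

# Average of each large random sample
sample_means = rng.exponential(scale=1.0, size=(repetitions, sample_size)).mean(axis=1)

# The sample means are centered at the population mean (1.0), and their
# SD is close to the population SD divided by sqrt(sample_size).
print(sample_means.mean())   # close to 1.0
print(sample_means.std())    # close to 1 / sqrt(500), about 0.045
```

A histogram of `sample_means` would show the familiar bell shape even though the population itself is nothing like normal.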
Recall Mendel's probability model for the colors of the flowers of a species of pea plant. The model says that the flower colors of the plants are like draws made at random with replacement from {Purple, Purple, Purple, White}.
In a large sample of plants, about what proportion will have purple flowers? We would expect the answer to be about 0.75, the proportion purple in the model. And, because proportions are means, the Central Limit Theorem says that the distribution of the sample proportion of purple plants is roughly normal.
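The claim that proportions are means can be checked directly: code each purple flower as 1 and each white flower as 0, and the mean of those indicators equals the proportion of purples. A small sketch (the five-element array is a made-up example, not data from the text):

```python
import numpy as np

# Code Purple as 1 and White as 0; the mean of the 0/1 indicators
# is exactly the proportion of purples.
colors = np.array(['Purple', 'Purple', 'White', 'Purple', 'White'])
indicators = (colors == 'Purple').astype(int)   # array([1, 1, 0, 1, 0])

proportion = np.count_nonzero(colors == 'Purple') / len(colors)
print(proportion, indicators.mean())   # both are 0.6
```

Because a proportion is a mean of 0/1 draws, everything the Central Limit Theorem says about sample means applies to sample proportions as well.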
We can confirm this by simulation. Let's simulate the proportion of purple-flowered plants in a sample of 200 plants.
In [2]:
colors = make_array('Purple', 'Purple', 'Purple', 'White')
model = Table().with_column('Color', colors)
model
Out[2]:
In [5]:
props = make_array()
num_plants = 200
repetitions = 1000
for i in np.arange(repetitions):
    sample = model.sample(num_plants)
    new_prop = np.count_nonzero(sample.column('Color') == 'Purple') / num_plants
    props = np.append(props, new_prop)
props[:5]
Out[5]:
In [19]:
opts = {
    'title': 'Distribution of sample proportions',
    'xlabel': 'Sample Proportion',
    'ylabel': 'Percent per unit',
    'xlim': (0.64, 0.84),
    'ylim': (0, 25),
    'bins': 20,
}
nbi.hist(props, options=opts)
There's that normal curve again, as predicted by the Central Limit Theorem, centered around 0.75 just as you would expect.
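The theorem predicts not just the shape but also the center and spread: the sample proportion is roughly normal with center $p$ and SD $\sqrt{p(1-p)/n}$. A quick check of this prediction, re-simulating the proportions with NumPy's binomial sampler rather than the `model.sample` code above (the seed and repetition count are illustrative):

```python
import numpy as np

# CLT prediction for the sample proportion: center p, SD sqrt(p(1-p)/n).
p, n = 0.75, 200
predicted_sd = np.sqrt(p * (1 - p) / n)

# Each binomial count is the number of purples in a sample of n plants,
# so dividing by n gives a simulated sample proportion.
rng = np.random.default_rng(0)
props = rng.binomial(n, p, size=1000) / n

print(props.mean())               # close to 0.75
print(props.std(), predicted_sd)  # both close to 0.031
```

The empirical center and spread of the simulated proportions line up with the theorem's predictions.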
How would this distribution change if we increased the sample size? We can copy our sampling code into a function and then use interaction to see how the distribution changes as the sample size increases.
We will keep the number of repetitions the same as before so that the distributions are comparable across sample sizes.
In [21]:
def empirical_props(num_plants):
    props = make_array()
    for i in np.arange(repetitions):
        sample = model.sample(num_plants)
        new_prop = np.count_nonzero(sample.column('Color') == 'Purple') / num_plants
        props = np.append(props, new_prop)
    return props
In [24]:
nbi.hist(empirical_props, options=opts,
         num_plants=widgets.ToggleButtons(options=[100, 200, 400, 800]))
All of the above distributions are approximately normal but become narrower as the sample size increases. For example, the proportions based on a sample size of 800 are more tightly clustered around 0.75 than those from a sample size of 200. Increasing the sample size has decreased the variability of the sample proportion.
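The rate at which the variability shrinks can be made precise. Since the SD of the sample proportion is $\sqrt{p(1-p)/n}$, quadrupling the sample size should roughly halve the SD. A sketch that checks this against simulation (using NumPy's binomial sampler as a stand-in for the `model.sample` code above; the seed is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
p, repetitions = 0.75, 1000

def sd_of_props(num_plants):
    # Empirical SD of 1000 simulated sample proportions of purples
    return (rng.binomial(num_plants, p, size=repetitions) / num_plants).std()

sd_200 = sd_of_props(200)
sd_800 = sd_of_props(800)
print(sd_200 / sd_800)   # close to 2, since sqrt(800 / 200) = 2
```

This square root relationship explains why the histograms for 100, 200, 400, and 800 plants narrow steadily but by diminishing amounts.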