In [22]:
# HIDDEN
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import math
import numpy as np
from scipy import stats
import ipywidgets as widgets
import nbinteract as nbi

The Central Limit Theorem

Very few of the data histograms that we have seen in this course have been bell-shaped. When we have come across a bell-shaped distribution, it has almost invariably been the empirical histogram of a statistic based on a random sample.

The Central Limit Theorem says that the probability distribution of the sum or average of a large random sample drawn with replacement will be roughly normal, regardless of the distribution of the population from which the sample is drawn.
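For example, here is a minimal sketch of that claim, using only the imports from the setup cell above. The population below is deliberately skewed and looks nothing like a bell, yet the empirical histogram of the sample sums does; the population values and the sample size are our own illustrative choices.

In [ ]:
# An illustrative, deliberately skewed population: mostly 1's, a few 10's
skewed_pop = make_array(1, 1, 1, 1, 10)

sums = make_array()
for i in np.arange(1000):
    # Sum of a large random sample drawn with replacement
    sums = np.append(sums, np.sum(np.random.choice(skewed_pop, 500)))

Table().with_column('Sample Sum', sums).hist()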

As we noted when we were studying Chebychev's bounds, results that can be applied to random samples regardless of the distribution of the population are very powerful, because in data science we rarely know the distribution of the population.

The Central Limit Theorem makes it possible to make inferences with very little knowledge about the population, provided we have a large random sample. That is why it is central to the field of statistical inference.

Proportion of Purple Flowers

Recall Mendel's probability model for the colors of the flowers of a species of pea plant. The model says that the flower colors of the plants are like draws made at random with replacement from {Purple, Purple, Purple, White}.

In a large sample of plants, about what proportion will have purple flowers? We would expect the answer to be about 0.75, the proportion purple in the model. And, because proportions are means, the Central Limit Theorem says that the distribution of the sample proportion of purple plants is roughly normal.
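To see why a proportion is a mean: if each Purple is coded as 1 and each White as 0, the proportion of purples is just the average of the 0/1 codes. A quick check:

In [ ]:
# Code Purple as 1 and White as 0; the mean of the codes
# is the proportion of purples.
indicators = make_array(1, 1, 1, 0)   # Purple, Purple, Purple, White
np.mean(indicators)                   # 0.75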

We can confirm this by simulation. Let's simulate the proportion of purple-flowered plants in a sample of 200 plants.


In [2]:
colors = make_array('Purple', 'Purple', 'Purple', 'White')

model = Table().with_column('Color', colors)

model


Out[2]:
Color
Purple
Purple
Purple
White

In [5]:
props = make_array()

num_plants = 200
repetitions = 1000

for i in np.arange(repetitions):
    # Draw num_plants plants at random with replacement from the model
    sample = model.sample(num_plants)
    # Proportion of purple-flowered plants in this sample
    new_prop = np.count_nonzero(sample.column('Color') == 'Purple') / num_plants
    props = np.append(props, new_prop)
props[:5]


Out[5]:
array([0.715, 0.725, 0.695, 0.79 , 0.765])

In [19]:
opts = {
    'title': 'Distribution of sample proportions',
    'xlabel': 'Sample Proportion',
    'ylabel': 'Percent per unit',
    'xlim': (0.64, 0.84),
    'ylim': (0, 25),
    'bins': 20,
}
nbi.hist(props, options=opts)


There's that normal curve again, as predicted by the Central Limit Theorem, centered at around 0.75, just as you would expect.
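As a quick numerical check (the exact values will vary from run to run), we can compute the center and spread of the simulated proportions; the mean should be close to 0.75.

In [ ]:
# Center and spread of the 1,000 simulated sample proportions
np.mean(props), np.std(props)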

How would this distribution change if we increased the sample size? We can copy our sampling code into a function and then use interaction to see how the distribution changes as the sample size increases.

We will keep the number of repetitions the same as before, so that each distribution is based on 1,000 simulated sample proportions and the results are comparable across sample sizes.


In [21]:
def empirical_props(num_plants):
    """Simulate `repetitions` sample proportions for samples of size num_plants."""
    props = make_array()
    for i in np.arange(repetitions):
        sample = model.sample(num_plants)
        new_prop = np.count_nonzero(sample.column('Color') == 'Purple') / num_plants
        props = np.append(props, new_prop)
    return props

In [24]:
nbi.hist(empirical_props, options=opts,
         num_plants=widgets.ToggleButtons(options=[100, 200, 400, 800]))
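
If the interactive buttons above don't respond (for example, in a static rendering of this page), here is a sketch of a non-interactive comparison; overlaying two sample sizes in a single histogram is our own choice, not part of the widget above.

In [ ]:
# Static alternative to the widget: overlay the distributions
# of sample proportions for two sample sizes.
Table().with_columns(
    'n = 200', empirical_props(200),
    'n = 800', empirical_props(800)
).hist(bins=20)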


All of the above distributions are approximately normal, but they become narrower as the sample size increases. For example, the proportions based on a sample size of 800 are more tightly clustered around 0.75 than those from a sample size of 200. Increasing the sample size has decreased the variability of the sample proportion.
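
The narrowing can be quantified. A standard result, stated here without proof, is that the SD of the sample proportion is sqrt(p(1 - p)/n), where p is the population proportion and n is the sample size; so quadrupling the sample size halves the SD.

In [ ]:
# Theoretical SD of the sample proportion, sqrt(p*(1-p)/n),
# for p = 0.75 and each of the sample sizes above
sample_sizes = make_array(100, 200, 400, 800)
np.sqrt(0.75 * 0.25 / sample_sizes)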