The Central Limit Theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger, no matter what the shape of the population distribution (provided the population has a finite variance). A common rule of thumb is that the approximation is already good for sample sizes over 30. In practice, this means that if you draw many samples and plot their means, the graph looks more like a normal distribution the larger each sample is.
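As a quick illustration of the statement above, here is a minimal sketch that draws samples from a strongly skewed population and measures how the skewness of the sample means shrinks as the sample size grows. The exponential population, the seed, the sample sizes and the helper names `skewness` and `skew_of_sample_means` are all assumptions made for this illustration; they are not part of the notebook that follows.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

def skewness(values):
    """Third standardized moment; 0 for a perfectly symmetric distribution."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return float(np.mean(centered ** 3) / np.std(values) ** 3)

def skew_of_sample_means(sample_size, n_samples=10_000):
    # Each row is one sample from an exponential population (skewness 2)
    draws = rng.exponential(scale=1.0, size=(n_samples, sample_size))
    return skewness(draws.mean(axis=1))

skew_by_size = {n: skew_of_sample_means(n) for n in (2, 30, 200)}
for n, s in skew_by_size.items():
    print(f"sample size {n:>3}: skewness of sample means = {s:.3f}")
```

A normal distribution has skewness 0, so values moving toward 0 as the sample size increases are consistent with the theorem.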
In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [2]:
rand_1k = np.random.randint(0,100,1000)
In [3]:
rand_1k.size
Out[3]:
In [12]:
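# distplot is deprecated since seaborn 0.11; histplot with kde=True is the closest replacement
sns.histplot(rand_1k, kde=True)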
Out[12]:
The plot shows that the population follows a roughly uniform distribution, not a normal distribution. Still, we will see that the distribution of our sample means follows an approximately normal distribution.
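For reference, the theoretical mean and standard deviation of this population can be computed directly. The sketch below assumes the same bounds as the `np.random.randint(0, 100, 1000)` call, where the upper bound is exclusive, so the possible values are the integers 0 to 99. The names `theoretical_mean` and `theoretical_sd` are introduced here for illustration only.

```python
import numpy as np

low, high = 0, 100               # high is exclusive, as in np.random.randint
values = np.arange(low, high)    # the 100 equally likely integers 0..99
k = values.size

theoretical_mean = values.mean()            # equals (low + high - 1) / 2
theoretical_sd = values.std()               # population SD (ddof=0)
closed_form_sd = np.sqrt((k ** 2 - 1) / 12) # discrete uniform formula

print(theoretical_mean, round(theoretical_sd, 3), round(closed_form_sd, 3))
```

The mean is 49.5 and the standard deviation is about 28.87, so the mean of `rand_1k` should land near 49.5 without matching it exactly.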
In [4]:
np.mean(rand_1k)
Out[4]:
In [5]:
subset_100 = np.random.choice(rand_1k, size=100, replace=False)
subset_100.size
Out[5]:
In [6]:
np.mean(subset_100)
Out[6]:
The mean of this subset of 100 integers is 43.2, which is not close enough to the population mean.
In [7]:
# generate 50 random samples of size 100 each
subset_means = []
for i in range(0,50):
    current_subset = np.random.choice(rand_1k, size=100, replace=False)
    subset_means.append(np.mean(current_subset))
Calculate the mean of means (it's meta :))
In [33]:
clt_mean = np.mean(subset_means)
clt_mean
Out[33]:
Calculate the SD of the means
In [34]:
subset_sd = np.std(subset_means)
subset_sd
Out[34]:
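The standard deviation of the sample means can be compared with the standard error that theory predicts. The sketch below is self-contained and builds its own stand-in population rather than reusing `rand_1k`. Because each sample is drawn without replacement from only 1000 values, it applies the finite population correction. The seed, the 5000 repetitions and the variable names are assumptions for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)                 # arbitrary seed
population = rng.integers(0, 100, size=1000)   # stand-in for rand_1k
N, n = population.size, 100

sigma = population.std()                # population SD (ddof=0)
fpc = np.sqrt((N - n) / (N - 1))        # finite population correction
expected_se = sigma / np.sqrt(n) * fpc  # predicted SD of the sample means

# Use many more than 50 samples so the empirical estimate is stable
means = [rng.choice(population, size=n, replace=False).mean()
         for _ in range(5000)]
observed_se = np.std(means)

print(f"expected SE: {expected_se:.3f}, observed SE: {observed_se:.3f}")
```

With only 50 samples, as in the cell above, the observed value will scatter more widely around the predicted one.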
In [37]:
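# distplot is deprecated since seaborn 0.11; histplot with kde=True is the closest replacement
ax = sns.histplot(subset_means, bins=10, kde=True)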
# draw mean in black
ax.axvline(clt_mean, color='black', linestyle='dashed')
# draw mean +- 1 SD
ax.axvline(clt_mean + subset_sd, color='red', linestyle='dotted')
ax.axvline(clt_mean - subset_sd, color='red', linestyle='dotted')
Out[37]:
Difference between mean of means and the population mean
In [38]:
np.mean(rand_1k) - clt_mean
Out[38]:
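How large a difference to expect depends on how many sample means are averaged. As a rough sketch, the typical gap is the standard error of one sample mean divided by the square root of the number of samples. The code below checks this on a stand-in population; the seed, the sample counts and the helper name `gap` are assumptions for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)                 # arbitrary seed
population = rng.integers(0, 100, size=1000)   # stand-in for rand_1k
N, n = population.size, 100

# Standard error of a single sample mean (without replacement)
se_one_mean = population.std() / np.sqrt(n) * np.sqrt((N - n) / (N - 1))

def gap(k):
    """Population mean minus the mean of k sample means."""
    means = [rng.choice(population, size=n, replace=False).mean()
             for _ in range(k)]
    return population.mean() - np.mean(means)

results = {}
for k in (50, 500, 5000):
    results[k] = (gap(k), se_one_mean / np.sqrt(k))
    print(f"{k:>4} samples: gap = {results[k][0]:+.3f}, "
          f"typical size = {results[k][1]:.3f}")
```

The sign of the gap is random; only its typical magnitude shrinks as more samples are averaged.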