Introduction

In this notebook, we illustrate the central limit theorem empirically using two of Python's built-in random number generators: one for the uniform distribution and one for the exponential distribution.

Uniformly distributed variable

Let's start with the uniform distribution on the interval from 100 to 200 (arbitrarily chosen). We generate 10000 samples, each with 100 observations, and then compute the sample average of each of the 10000 samples.


In [1]:
from __future__ import division

# Imports of the libraries we need.
from random import uniform, expovariate
import matplotlib.pyplot as plt
import matplotlib as mpl
import prettyplotlib as ppl
import brewer2mpl
from scipy.stats import norm
import numpy as np

colors = brewer2mpl.get_map('Set2', 'qualitative', 3).mpl_colors

mpl.rcParams['lines.linewidth'] = 2
mpl.rcParams['lines.color'] = 'r'
mpl.rcParams['axes.titlesize'] = 32
mpl.rcParams['axes.labelsize'] = 24
mpl.rcParams['xtick.labelsize'] = 24
mpl.rcParams['ytick.labelsize'] = 24

%matplotlib inline

In [2]:
# Generate 10000 samples of 100 observations each, uniform between a and b.
a = 100
b = 200
samples = [[uniform(a, b) for _ in xrange(100)] for _ in xrange(10000)]

Let's plot every observation, just to check that the values fall between 100 and 200 as expected.


In [3]:
flatarray = np.array(samples).flatten()
plt.figure(figsize=(12, 8))
ppl.plot(xrange(len(flatarray)), flatarray, 'x')

plt.ylim([0, 210])
plt.xlabel("Observation #")
plt.ylabel("Value")


Out[3]:
<matplotlib.text.Text at 0x1078fdc10>

Let's plot a normalized histogram of the sample means (so that it approximates a PDF) and compare it with the PDF of the normal distribution that has the expected mean and standard deviation.


In [4]:
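# Theoretical mean and variance of the uniform distribution on [a, b].
mean = (a + b) / 2
variance = (b - a) ** 2 / 12
# The variance of the sample mean is the population variance divided by the
# number of observations per sample (n = 100), not by the number of samples.
variance_of_sample_mean = variance / 100
sigma = np.sqrt(variance_of_sample_mean)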
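
Before plotting, it can help to check the expected values numerically. The sketch below is not part of the original notebook: it assumes Python 3 and uses only the standard library, and it draws fewer samples (the `num_samples` name and value are chosen here for speed). It compares the empirical mean and standard deviation of the sample means with (a + b) / 2 and (b - a) / sqrt(12 n), where n is the number of observations per sample.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the check is repeatable

a, b = 100, 200
n = 100             # observations per sample
num_samples = 2000  # assumption: fewer samples than the notebook, for speed

# One sample mean per sample of n uniform draws.
sample_means = [
    sum(random.uniform(a, b) for _ in range(n)) / n
    for _ in range(num_samples)
]

# Theoretical values for the mean of n uniform observations on [a, b].
expected_mean = (a + b) / 2
expected_sigma = (b - a) / math.sqrt(12 * n)

empirical_mean = statistics.mean(sample_means)
empirical_sigma = statistics.stdev(sample_means)

print("mean : expected %.3f, empirical %.3f" % (expected_mean, empirical_mean))
print("sigma: expected %.3f, empirical %.3f" % (expected_sigma, empirical_sigma))
```

The two pairs of numbers should agree closely, up to the random fluctuation expected from a finite number of samples.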

In [5]:
sample_means = [sum(s) / 100 for s in samples]
plt.figure(figsize=(12, 8))
ppl.hist(sample_means, bins=50, normed=True, color=colors[0])
x = np.arange(140, 160, 0.01)
ppl.plot(x, norm.pdf(x, mean, sigma), color=colors[1], linestyle='--', linewidth=5)
xl = plt.xlabel("$s$")
yl = plt.ylabel("$f_S(s)$")


Exponential distribution

As a second example, let us repeat the experiment with the exponential distribution, which, unlike the uniform distribution, is strongly skewed.


In [6]:
# Rate parameter of the exponential distribution (its mean is 1 / lambd).
lambd = 10
samples = [[expovariate(lambd) for _ in xrange(100)] for _ in xrange(10000)]

In [7]:
# Theoretical mean and variance of the exponential distribution with rate lambd.
mean = 1 / lambd
variance = 1 / lambd ** 2
# Divide by the number of observations per sample (n = 100).
variance_of_sample_mean = variance / 100
sigma = np.sqrt(variance_of_sample_mean)

In [8]:
sample_means = [sum(s) / 100 for s in samples]
plt.figure(figsize=(12, 8))
ppl.hist(sample_means, bins=50, normed=True, color=colors[0])
x = np.arange(0.05, 0.15, 0.001)
ppl.plot(x, norm.pdf(x, mean, sigma), color=colors[1], linestyle='--', linewidth=5)
xl = plt.xlabel("$s$")
yl = plt.ylabel("$f_S(s)$")
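
The visual comparison can be complemented with a simple number. The sketch below is an addition, not part of the original notebook, and assumes Python 3 with the standard library only; `num_samples` and `within_one_sigma` are names introduced here. It measures the fraction of exponential sample means that fall within one standard deviation of the expected mean and compares it with the corresponding probability for a normal distribution, about 0.683.

```python
import math
import random

random.seed(1)  # fixed seed so the check is repeatable

lambd = 10
n = 100             # observations per sample
num_samples = 5000  # assumption: fewer samples than the notebook, for speed

# Expected mean and standard deviation of the sample mean.
mean = 1 / lambd
sigma = (1 / lambd) / math.sqrt(n)

sample_means = [
    sum(random.expovariate(lambd) for _ in range(n)) / n
    for _ in range(num_samples)
]

# Fraction of sample means within one sigma of the expected mean.
within_one_sigma = sum(
    1 for s in sample_means if abs(s - mean) <= sigma
) / num_samples

# P(|Z| <= 1) for a standard normal variable Z.
normal_reference = math.erf(1 / math.sqrt(2))

print("within one sigma: %.3f (normal: %.3f)" % (within_one_sigma, normal_reference))
```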


Conclusion

For any distribution with finite variance, the central limit theorem states that the sample average of n independent and identically distributed variables is approximately normally distributed when n is large, with mean equal to the population mean and variance equal to the population variance divided by n.
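
The normal shape emerges gradually as the number of observations per sample grows. As a rough illustration (an added sketch, assuming Python 3 and the standard library only; the `skewness` helper is defined here and is not a library function), the code below estimates the skewness of the sample mean of exponential variables for n = 1, 10 and 100. The theoretical value is 2 / sqrt(n), which shrinks toward 0, the skewness of a normal distribution.

```python
import random


def skewness(values):
    """Sample skewness: third central moment over the variance to the power 1.5."""
    m = sum(values) / len(values)
    m2 = sum((v - m) ** 2 for v in values) / len(values)
    m3 = sum((v - m) ** 3 for v in values) / len(values)
    return m3 / m2 ** 1.5


random.seed(2)  # fixed seed so the check is repeatable

lambd = 10
num_samples = 5000  # assumption: fewer samples than the notebook, for speed

skews = {}
for n in (1, 10, 100):
    # Sample means of n exponential observations each.
    means = [
        sum(random.expovariate(lambd) for _ in range(n)) / n
        for _ in range(num_samples)
    ]
    skews[n] = skewness(means)
    print("n = %3d: skewness %.3f (theory %.3f)" % (n, skews[n], 2 / n ** 0.5))
```

With n = 1 the sample means are just exponential observations and remain clearly skewed; by n = 100 the skewness is small, which is consistent with the histogram above.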