In [1]:
%matplotlib inline
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

np.random.seed(20160302)

Population distribution

Here is a population of size 100,000.


In [2]:
mu, sigma = 64, 8
popn = np.random.normal(loc=mu,scale=sigma, size=100000)
truemu, truesigma = np.mean(popn), np.std(popn)

In [3]:
s = \
"""For the population of interest, the true mean is {} 
and the true standard deviation is {} """
print(s.format(truemu,truesigma))


For the population of interest, the true mean is 64.03841411536553 
and the true standard deviation is 8.008087008137426 

This is what the population distribution looks like when represented as a frequency histogram.


In [21]:
plt.hist(popn, bins=50, color='gray', alpha=0.75, histtype='stepfilled')
plt.xlabel("X")
plt.ylabel("Frequency")
pass


Sample distribution

Here is sample of size 60 drawn without replacement from this population:


In [5]:
sample = np.random.choice(popn, size=60, replace=False)

In [6]:
s = \
"""For the population of interest, the point estimates of the 
mean and standard deviation are {} and {}, respectively"""

print(s.format(np.mean(sample),np.std(sample,ddof=1)))


For the population of interest, the point estimates of the 
mean and standard deviation are 63.03932509191508 and 7.240176981809635, respectively

Here is what the frequency histogram of the sample looks like.


In [31]:
plt.hist(sample, color='steelblue', alpha=0.75, 
         histtype='stepfilled',label='sample')
plt.xlabel("X")
plt.ylabel("Frequency")
pass


Here's density histograms of the population (gray) and the sample (blue) drawn together.


In [22]:
plt.hist(popn, normed=True, bins=50, label='population',
         color='gray', alpha=0.75, histtype='stepfilled')
plt.hist(sample, normed=True, label='sample',
         color='steelblue', alpha=0.75, histtype='stepfilled')
plt.xlabel("X")
plt.ylabel("Density")
plt.legend(loc="best")
pass


Sampling distribution of the mean

To estimate the sampling distribution of the mean, we draw 1000 samples of size 60 and calculate the mean for each such sample.


In [9]:
smeans = []
for i in range(1000):
    rsample = np.random.choice(popn, size=60, replace=False)
    smeans.append(np.mean(rsample))

Here is the frequency histogram of the sampling distribution of the mean.


In [36]:
plt.hist(smeans, normed=True, bins=30, label='simulated\ndistn of\n means',
         color='firebrick', alpha=0.75, histtype='stepfilled')
plt.xlabel("mean(X)")
plt.ylabel("Frequency")
pass


Here are the density histograms of the population (grey), our first sample (blue), and the sampling distribution of the mean(red), all drawn in the same plot.

IMPORTANT NOTE: to facilitate visual comparison of the distributions I've truncated the y-axis. Comment out the ylim line below to see the complete density histogram of the sampling distribution of sample means.


In [37]:
plt.hist(popn, normed=True, bins=50, label='population',
         color='gray', alpha=0.75, histtype='stepfilled')
plt.hist(sample, normed=True, label='sample',
         color='steelblue', alpha=0.75, histtype='stepfilled')
plt.hist(smeans, normed=True, bins=50, label='simulated\ndistn of\n means',
         color='firebrick', alpha=0.75, histtype='stepfilled')
plt.xlabel("X")
plt.ylabel("Density")
plt.legend(loc="best")
plt.ylim(0,0.06)  # comment out this line to remove truncation
pass



In [ ]: