Lesson 07 - Sampling Distributions

  • If we know the mean and standard deviation of a normally distributed population, we can locate any value in the distribution: we can find the percentage of values below it and the percentage above it
  • Measures of center can be used to compare distributions. In particular, we can compare a sample with other samples or distributions by comparing their means

Gambling in Vegas

  • We roll a tetrahedral die (4 sides)
  • We can get 1, 2, 3, 4
  • Expected value $\mu = (1 + 2 + 3 + 4) / 4 = 2.5 $
  • We win the gamble if the average of 2 rolls is at least 3

Note that the population here consists of the 4 possible outcomes (1 to 4), and we are taking samples of size 2 from this population.

Two rolls give 4 × 4 = 16 equally likely ordered cases, each with its own sample mean.

If we take the mean of each of these samples, the resulting sample means form the sampling distribution, and we can then take the mean of that distribution.
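The 16 cases can be listed with a short sketch. It assumes the two rolls are independent and ordered (first roll, second roll); the names `faces`, `cases`, and `p_win` are illustrative and not from the lesson.

```python
from fractions import Fraction
from itertools import product

faces = [1, 2, 3, 4]  # tetrahedral die

# All ordered pairs (first roll, second roll): 4 * 4 = 16 equally likely cases.
cases = list(product(faces, repeat=2))

# Sample mean of each case; together these make up the sampling distribution.
sample_means = [(a + b) / 2 for a, b in cases]

# Mean of the sample means (M).
mean_of_means = sum(sample_means) / len(sample_means)

# Chance of winning: the average of the two rolls is at least 3.
wins = sum(1 for m in sample_means if m >= 3)
p_win = Fraction(wins, len(cases))

print(len(cases))      # 16
print(mean_of_means)   # 2.5
print(p_win)           # 3/8
```

Six of the 16 cases have a mean of at least 3, so under these assumptions the gamble is won 3 times out of 8.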

  • Mean of the sample means (i.e. the mean of the 16 cases above): M = 2.5
  • The distribution of sample means is called the sampling distribution
  • For our case it can be seen at http://www.wolframalpha.com/input/?i=1,+1.5,+2,+2.5,+1.5,+2,+2.5,+3,+2,+2.5,+3,+3.5,+2.5,+3,+3.5,+4
  • Taking a lot of samples and finding the mean of the sample means, we see that $M = \mu$
  • To compare a sample with other samples we also need the standard deviation of the population, $\sigma$
  • Let's call the standard deviation of the distribution of sample means, for samples of size n, the standard error (SE)
  • $\sigma / SE = \sqrt{n}$, i.e. $SE = \sigma / \sqrt{n}$
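Both relations can be checked for the tetrahedral die by brute force. This is a minimal sketch that enumerates every ordered sample of size 2 and uses `statistics.pstdev` (the population standard deviation, which divides by N) for both $\sigma$ and SE; the variable names are illustrative.

```python
import math
import statistics
from itertools import product

population = [1, 2, 3, 4]  # faces of the tetrahedral die
n = 2                      # sample size (two rolls)

mu = statistics.mean(population)
sigma = statistics.pstdev(population)  # population SD (divides by N)

# Every possible ordered sample of size n, reduced to its mean.
sample_means = [sum(s) / n for s in product(population, repeat=n)]

M = statistics.mean(sample_means)     # mean of the sample means
SE = statistics.pstdev(sample_means)  # SD of the sample means

print(M == mu)                                 # True
print(math.isclose(sigma / SE, math.sqrt(n)))  # True
```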

Central limit theorem

For any population distribution with a finite standard deviation, if we draw many samples and plot the distribution of their means, that sampling distribution will be approximately normal, provided the sample size is large enough.
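A rough simulation can illustrate the theorem. The sketch below is an assumption-laden example, not part of the lesson: it picks an exponential population (mean 1, SD 1, strongly right-skewed), a sample size of 30, and 10,000 samples, then checks how the sample means behave.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 30
NUM_SAMPLES = 10_000

def sample_mean():
    # Mean of one sample drawn from an exponential population with rate 1.
    return sum(random.expovariate(1.0) for _ in range(SAMPLE_SIZE)) / SAMPLE_SIZE

means = [sample_mean() for _ in range(NUM_SAMPLES)]

# Center and spread of the simulated sampling distribution.
center = sum(means) / NUM_SAMPLES
spread = math.sqrt(sum((m - center) ** 2 for m in means) / NUM_SAMPLES)

# For a normal curve, about 68% of values lie within one SD of the center.
within_one_sd = sum(abs(m - center) <= spread for m in means) / NUM_SAMPLES

print(round(center, 2), round(spread, 2), round(within_one_sd, 2))
```

If the approximation holds, `center` should land near the population mean of 1, `spread` near $1 / \sqrt{30}$, and `within_one_sd` near 0.68, even though the population itself is far from normal.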

Finding Standard error

Let's find the standard deviation of the sampling distribution created by rolling a six-sided die several times and taking the mean of the rolls, for samples of 2 rolls and of 5 rolls.

The population standard deviation can also be found using the STDEVP function in Google Sheets.


In [9]:
import math

def mean_standard_deviation(population):
    """Return the mean and the population standard deviation (divides by N)."""
    mean = sum(population) / len(population)
    differences = [element - mean for element in population]
    squared_differences = [diff ** 2 for diff in differences]
    # Variance: the mean of the squared differences from the mean.
    mean_squared_differences = sum(squared_differences) / len(squared_differences)
    SD = math.sqrt(mean_squared_differences)
    return mean, SD

def standard_deviation_sample(population, sample_size):
    """Return the standard error: population SD divided by sqrt(sample_size)."""
    _, sd_population = mean_standard_deviation(population)
    return sd_population / math.sqrt(sample_size)

population = list(range(1, 7))
print(standard_deviation_sample(population, 2))
print(standard_deviation_sample(population, 5))


1.2076147288491197
0.7637626158259733

As the sample size increases, the SD of the sampling distribution decreases, so the distribution becomes narrower.
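To see the narrowing over a wider range, here is a small sketch that tabulates $SE = \sigma / \sqrt{n}$ for the six-sided die. The list of sample sizes is an arbitrary choice for illustration.

```python
import math
import statistics

population = list(range(1, 7))         # faces of a six-sided die
sigma = statistics.pstdev(population)  # population SD (divides by N)

sample_sizes = [1, 2, 5, 10, 30, 100]
standard_errors = [sigma / math.sqrt(n) for n in sample_sizes]

# Each row shows a sample size and the SD of the corresponding sampling distribution.
for n, se in zip(sample_sizes, standard_errors):
    print(f"n = {n:3d}  SE = {se:.4f}")
```

Because SE shrinks with the square root of n, quadrupling the sample size only halves the spread.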

A sampling applet is available at http://onlinestatbook.com/stat_sim/sampling_dist/index.html

Importance of Sampling distribution

It helps us find where a sample that we have lies on the sampling distribution. That helps us understand whether our sample is typical or whether something special is going on compared to other possible samples.
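One way to place a sample on the sampling distribution is to express its mean as a number of standard errors from $\mu$. The sketch below uses a made-up sample of 5 die rolls (hypothetical data, not from the lesson) and treats the sampling distribution as normal, which is only a rough approximation for such a small sample.

```python
import math
import statistics

population = list(range(1, 7))  # faces of a six-sided die
sample = [6, 5, 4, 6, 4]        # hypothetical sample of 5 rolls

mu = statistics.mean(population)
sigma = statistics.pstdev(population)
se = sigma / math.sqrt(len(sample))  # standard error for this sample size

sample_mean = statistics.mean(sample)
z = (sample_mean - mu) / se  # distance from mu, measured in standard errors

# Share of sample means expected to fall below ours, under the normal approximation.
fraction_below = statistics.NormalDist().cdf(z)

print(round(z, 2), round(fraction_below, 3))
```

A sample mean that sits roughly two standard errors above $\mu$, as this one does, would be unusual for a fair die and might be worth a closer look.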