CLT and Sampling

sampling variability & CLT

sampling variability

Screenshot taken from Coursera 2:27

  • Sample distribution: the distribution of the data within a single sample
  • Sampling distribution: the distribution of sample means. Each sample has its own average value, and the distribution of these averages is called the "sampling distribution of the sample mean"
  • We call this distribution the sampling distribution. The mean of the sample means will probably be around the true population mean, roughly 65 inches as well. The standard deviation of these sample means will probably be much lower than the population standard deviation, since we would expect the average heights for the states to be pretty close to one another. For example, we wouldn't expect to find a state where the average height of a random sample of a thousand women is as low as 4 feet or as high as 7 feet. We call the standard deviation of the sample means the standard error.
  • In fact, as the sample size n increases, the standard error will decrease. The fewer women we sample from each state, the more variable we would expect the sample means to be.

Screenshot taken from Coursera 7:20

  • Let's start with the default case of a normal distribution for the population, with mean 0 and standard deviation 20. Let's take samples of, say, size 45 from this population. Each one of these dot plots shows us one sample of 45 observations from the normal population. We can see that the center of each of these samples is close to 0, though not exactly 0, and that the sample mean varies from one sample to another. Since these are random samples from the population, each time we reach out to the population and grab 45 observations we will not be getting the same sample, and therefore the means of the samples are slightly different.
  • The standard deviation of each of these samples should be roughly equal to the population standard deviation, because after all each sample is simply a subset of our population. We have illustrated the first 8 samples here, but we are actually taking 200 samples from the population.
  • We can make this a very large number, say 1,000 samples from the population. What we have at the very bottom is our sampling distribution: each sample mean, once calculated, gets dropped into the lower plot, and what we're seeing there is a distribution of sample means. Since the sample means had some variability among them, the sampling distribution illustrates what this variability looks like.
  • The sampling distribution, as we expected, looks just like the population distribution, so nearly normal. And the center of the sampling distribution, that is, the mean of the means, is close to the true population mean of 0.
  • However, one big difference between the population distribution up top and the sampling distribution at the bottom is the spread of these distributions. The sampling distribution at the bottom is much skinnier than the population distribution up top: while the standard deviation of the population distribution is 20, the standard error, that is, the standard deviation of the sample means, is only 2.93. The reason is that while individual observations can be very variable, it is unlikely that sample means will be very variable.
  • So if we want to decrease the variability of the sample means, meaning we take samples that have more consistent means, we would want to increase our sample size. Let's say we increase the sample size all the way to 500. What we have here is again our same population distribution.
  • Here we're seeing the first 8 of the 1,000 samples taken from the population. The distributions look much denser here because we simply have more observations: each of these samples is a sample of 500 observations from the population. We can also see that the means are again variable, but let's check whether they're as variable as before.
  • The curve is indeed skinnier: the higher the sample size of each sample we take from the population, the less variable the means of those samples. We can see this graphically by looking at the curve, and numerically by looking at the value of the standard error. (A quick simulation sketch follows below.)
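
Here's a minimal sketch of that demo in Python (not the Coursera applet itself; numpy, the seed, and taking 1,000 samples are assumptions on my part):

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed for reproducibility

for n in (45, 500):
    # 1,000 random samples of size n from a Normal(0, 20) population, one per row
    samples = rng.normal(loc=0, scale=20, size=(1000, n))
    sample_means = samples.mean(axis=1)  # the sampling distribution of the mean
    print(f"n = {n:3d}: mean of means = {sample_means.mean():6.3f}, "
          f"observed SE = {sample_means.std(ddof=1):.3f}")

# Expect an SE near 2.98 for n = 45 and near 0.89 for n = 500: the skinnier curve.
```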

Central Limit Theorem

Screenshot taken from Coursera 8:32

  • So the central limit theorem tells us about the shape of the sampling distribution, which it says is going to be nearly normal; the center, which it says is going to be at the population mean; and the spread, which we measure using the standard error.
  • If sigma is unknown, which is often the case (remember, sigma is the population standard deviation, and oftentimes we don't have access to the entire population to calculate it), we use s, the sample standard deviation, to estimate the standard error. That is the standard deviation of the one sample we happen to have at hand. In the earlier demo, the simulation, we talked about taking many samples, but if you're running a study, as you can imagine, you would only take one sample. So it's the standard deviation of that sample that we use as our best guess for the population standard deviation.
  • So it wasn't a coincidence that the sampling distribution we saw earlier was symmetric and centered at the true population mean, and that as n, the sample size, increased, the standard error decreased. We won't go through a detailed proof of why the standard error is equal to sigma over the square root of n, but understanding the inverse relationship between them is very important: as the sample size increases, we expect samples to yield more consistent sample means, hence the variability among the sample means is lower, which results in a lower standard error. (A quick numeric check follows below.)
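
As a quick sanity check on SE = sigma / sqrt(n), here's the formula applied to the demo's numbers (sigma = 20); this is just arithmetic, not part of the lecture:

```python
import math

sigma = 20  # population standard deviation from the demo
for n in (45, 500):
    # SE = sigma / sqrt(n): larger samples -> smaller standard error
    print(f"n = {n:3d}: SE = {sigma / math.sqrt(n):.3f}")

# n =  45: SE = 2.981  (close to the 2.93 observed in the simulation)
# n = 500: SE = 0.894
```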

Screenshot taken from Coursera 12:40

10% condition

Screenshot taken from Coursera 15:38

  • First, let's focus on the 10% condition. We stated earlier that if sampling without replacement, n needs to be less than 10% of the population. Why is this the case? Let's think about this for a moment. Say you live in a very small town whose population is only 1,000 people, and your family lives there as well, including your extended family. Say I'm a researcher doing research on some genetic application, and I want to randomly sample some individuals from your town.
  • Say I take a random sample of size just 10.
  • If we're randomly sampling 10 people out of 1,000, and let's say you are included in our sample, it's going to be quite unlikely that your parents are also included in that sample, because remember, we're only grabbing 10 out of a population of 1,000.
  • But say, on the other hand, I sampled 500 people from the 1,000 that live in your town. If you live in this town with your parents and all of your extended family, and I've already grabbed you to be in my sample, then with 499 other people left to grab, chances are I might get somebody from your family in my sample as well.
  • You and a family member of yours are not genetically independent, and more generally, observations in the population itself are often not independent of each other. Therefore, if we grab a very big portion of the population to be in our sample, it's going to be very difficult to make sure that the sampled individuals are independent of each other. That's why, while we like large samples, we also want to keep the size of our sample somewhat proportional to our population. A good rule of thumb, if we're sampling without replacement, is that we don't grab more than 10% of the population to be in our sample.
  • When you're sampling with replacement, which is not something we often do in survey settings (I've already sampled you once, given you a survey, and gotten your responses; I don't want to be able to sample you again), the probability of sampling you versus somebody from your family would stay consistent throughout all of the trials. That's why we wouldn't need to worry about the 10% condition there. But again, in realistic survey sampling situations we sample without replacement, and while we like large samples, we also do not want our samples to be any more than 10% of our population. (A small numeric illustration follows below.)
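
To put rough numbers on the town story, here's a hedged sketch: the family size of 5 is made up, and the calculation (a hypergeometric "none of my relatives got picked" probability) is mine, not from the lecture:

```python
from math import comb

town, family = 1000, 5  # 1,000 residents, 5 of whom are your relatives

for n in (10, 500):
    # Given you're already in the sample, the other n - 1 slots are filled
    # from the remaining 999 residents, 5 of whom are your family members.
    p_none = comb(town - 1 - family, n - 1) / comb(town - 1, n - 1)
    print(f"n = {n:3d}: P(at least one relative also sampled) = {1 - p_none:.3f}")

# n =  10: about 0.044 -- unlikely, so observations are close to independent
# n = 500: about 0.969 -- almost certain, so independence clearly breaks down
```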

sample size/skew condition

Screenshot taken from Coursera 16:50

  • And what about the sample size/skew condition? Say we have a skewed population distribution; here we have a population distribution that's extremely right skewed. When the sample size is small (here we're looking at a sampling distribution created from samples of n = 10), the sample means will be quite variable.
  • With quite large samples (here each of the individual samples from which the sample means were calculated had size 200), we can actually overcome the effect of the parent distribution.
  • The central limit theorem kicks in, and the sampling distribution starts to resemble a nearly normal distribution. Why are we somewhat obsessed with having nearly normal sampling distributions? Because we've learned earlier that once you have a normal distribution, calculating probabilities, which will later serve as our p-values in hypothesis tests, is relatively simple.
  • So having a nearly normal sampling distribution, which relies on the central limit theorem, is going to open up a bunch of doors for doing statistical inference using confidence intervals and hypothesis tests based on normal distribution theory. Let's do another demo real quick. We looked earlier at what a sampling distribution looks like when we have a nearly normal population distribution; let's take a look at what happens if the population distribution is not nearly normal. (A simulation sketch follows below.)
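
A minimal sketch of that demo, assuming an exponential population as the right-skewed example (the actual distribution used in the applet isn't specified here) and numpy/scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)  # arbitrary seed for reproducibility
draws = rng.exponential(scale=1.0, size=(1000, 200))  # 1,000 samples of size 200

for n in (10, 200):
    means = draws[:, :n].mean(axis=1)  # use the first n observations of each sample
    print(f"n = {n:3d}: skewness of the 1,000 sample means = {stats.skew(means):.2f}")

# Typical result: noticeable right skew at n = 10, skewness near 0 at n = 200,
# i.e. the CLT has "kicked in" and the sampling distribution is nearly normal.
```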

Screenshot taken from Coursera 20:58

CLT (for the mean) examples

Example 1

Screenshot taken from Coursera 2:22

  • Suppose my iPod has 3,000 songs. The histogram below shows a distribution of the lengths of these songs. We also know that, for this iPod, the mean length is 3.45 minutes and the standard deviation is 1.63 minutes. Calculate the probability that a randomly selected song lasts more than five minutes. Here we're looking for the probability of one randomly selected song lasting more than five minutes. This is the same thing as saying among all the population of songs on this iPod, what percentage of them last more than five minutes.
  • This should be a pretty simple probability to calculate. And lately, what we've been doing is calculating Z-scores and using those to find probabilities. If that's your instinct here, though, you should not follow it.
  • Because remember that we can use Z-scores and the associated normal probabilities only if the distribution we're working with is nearly normal. And taking a look at the distribution of song lengths here, it certainly is not.
  • The distribution of the lengths of all of these songs on the iPod is indeed right skewed. Does this make sense? Well, a song can't be less than zero minutes, so we have a natural boundary at the lower end, and there's really no upper end to how long a song can be. However, as you can imagine, there are going to be fewer and fewer songs as the length in minutes increases. That's what gives us the right-skewed distribution here.
  • So, we've confirmed that the population distribution makes sense, but we've also said that the methods that we've learned most recently for calculating these probabilities don't apply here. Does this mean we can't answer this question, though? No. We can actually use the histogram and the heights of the bars to estimate what percentage of songs fall between, let's say four and five minutes, five and six minutes, six and seven, so on and so forth, and use those to calculate this probability.
  • So here we're interested in everything above five minutes. This will require eyeballing the heights of these bars. It looks like there are roughly 350 songs that last between five and six minutes, 100 between six and seven minutes, 25 between seven and eight minutes. I'm kind of making these numbers up, but I'm making an educated guess here, so your estimates might be slightly off, but should be within this range. 20 songs maybe between eight and nine minutes, and five songs maybe between nine and ten minutes. It seems like there are no songs on this iPod that last more than ten minutes. Let's let X equal the length of one song. We're using additional notation here that actually isn't absolutely necessary for this one question, but having some sort of notation will come in handy in a little bit.
  • Then the probability that X is greater than 5 is 350 plus 100 plus 25 plus 20 plus 5, divided by 3,000, which comes out to 500 over 3,000, or approximately 0.17. So the probability that a randomly selected song on my iPod lasts more than five minutes is 0.17. Another way of thinking about this is that 17% of the songs on my iPod last more than five minutes. (The same calculation appears in the sketch below.)
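
The same arithmetic as a one-liner, using the eyeballed bar heights from the histogram (remember, those counts are rough estimates, not exact values):

```python
counts_over_5 = [350, 100, 25, 20, 5]  # songs in the 5-6, 6-7, 7-8, 8-9, 9-10 minute bins
total_songs = 3000

print(f"P(X > 5) = {sum(counts_over_5) / total_songs:.2f}")  # 500 / 3000 -> 0.17
```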

Example 2

Screenshot taken from Coursera 4:05

  • Now let's take a look at another question based on the same iPod. I'm about to take a trip to visit my parents and the drive is six hours. I make a random playlist of 100 songs. What is the probability that my playlist lasts the entire drive? So, we know that six hours is 360 minutes.
  • So I'm going to need 360 minutes worth of songs. We could write this as (remember, we were calling X the length of one song) the probability that X1 plus X2 all the way up to X100 is greater than 360 minutes.
  • How do we calculate this probability?
  • We haven't really worked with sums of random variables, but we know how to deal with averages. This is equivalent to the average length of 100 songs being greater than 360 divided by 100, or 3.6 minutes.
  • So we want the average length to be greater than 3.6 minutes. Remember, this is not the same thing as every single song being more than 3.6 minutes, because that would give me a very, very long playlist: it would tell me that the minimum length of that playlist is 360 minutes. I just want the total to be greater than 360 minutes to last me the entire drive.
  • Now that we have introduced the X bar, the sample mean, that should remind us that the central limit theorem might be helpful. Because using the central limit theorem, we can find the distribution of the sample mean pretty easily. The central limit theorem says that X bar will be distributed nearly normally, with mean equal to the population mean, which is 3.45 minutes. We were given this information in the previous slide. And with standard error equal to the population standard deviation, sigma, divided by the square root of n, the sample size. So, that is 1.63 divided by square root of 100, which comes out to 0.163. Now, we have a random variable, X bar, our sample mean.
  • We know its distribution, it's normal. We know its mean, the center, 3.45. And we know something about its variability: the standard error, which is basically the standard deviation of X bar, is 0.163. And we're interested in some probability. This combination, a normal distribution whose parameters we know and a probability we're looking for, should prompt us to first draw a curve before we proceed. So I'm drawing my curve and setting the center at 3.45. The observation of interest is 3.6 minutes, and I'm looking for everything above that. Remember, drawing the curve is always your friend: if you do this first, it's much less likely that you will do something wrong in the following steps. So next, we calculate the Z-score.
  • The Z-score is equal to 3.6, the observation, minus 3.45, the mean, divided by 0.163, the standard error, and it comes out to be 0.92. Note that we divide by the standard error, and not sigma, the standard deviation of the population, because the observation of interest, the 3.6, is a sample mean and not an individual song, so not an individual observation. We measure the variability of individual observations with standard deviations; we measure the variability of sample means with standard errors. So whatever the observation is that you plug into the numerator of your Z-score, its variability belongs in the denominator. In other words, our observation is an X bar, and not an X. This is where the notation from earlier comes in handy. We can now easily find the area using many of the methods we've learned so far: a table, R, or the applet.
  • If I wanted to find this probability using the applet, I would choose the normal distribution with mean 0 and standard deviation 1, because remember, that is indeed the standard normal, the distribution of Z-scores. And I'm looking for an upper tail, and my Z-score is 0.92. So I just need to slide my slider over to 0.92, and it turns out that the probability of the Z-score being greater than 0.92 is 0.179. So there's almost an 18% chance that my playlist lasts the entire drive. (A worked version of this calculation follows below.)
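
Here's the whole Example 2 calculation in a few lines; scipy's normal survival function stands in for the table/applet lookup:

```python
import math
from scipy import stats

mu, sigma, n = 3.45, 1.63, 100  # population mean/sd of song lengths; playlist size
target_mean = 360 / 100         # the playlist must average > 3.6 minutes per song

se = sigma / math.sqrt(n)       # standard error of the sample mean
z = (target_mean - mu) / se     # (3.6 - 3.45) / 0.163
p = stats.norm.sf(z)            # upper-tail probability P(Z > z)

print(f"SE = {se:.3f}, Z = {z:.2f}, P(playlist lasts the drive) = {p:.3f}")
# SE = 0.163, Z = 0.92, P = 0.179
```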

Example 3

Screenshot taken from Coursera 9:26

  • To answer this question, we need to use what we know about sampling distributions based on the central limit theorem, especially the interplay between the shape and the sample size.
  • The distribution in plot C most closely resembles the normal. Therefore, this must be the distribution of 100 sample means from random samples of size 49. Remember, the central limit theorem tells us that sampling distributions will be nearly normal when n is large, so the largest n among the options should yield the most normal-looking distribution.
  • We can choose between the remaining two based on their shapes and spreads. The plot that most resembles the parent distribution, that is, the population distribution, is the single random sample of 100 observations from this population, because, remember, both in the population and in a single sample the observations are still individual observations, not sample means. This appears to be plot B: it's right skewed just like the parent population, and it also has the largest spread. Its range goes from 0 to 35 while the ranges of the other plots are much narrower.
  • Then plot A must be the distribution of 100 sample means from random samples of size 7. So what we've done here is use what we know about the central limit theorem, and how sample size affects the shapes and spreads of sampling distributions, to make this assignment. (A simulation sketch follows below.)
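
A hedged re-creation of Example 3's logic: the actual population from the question isn't given here, so a right-skewed gamma population stands in, and the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(seed=3)
population = rng.gamma(shape=2.0, scale=3.0, size=100_000)  # right-skewed stand-in

one_sample = rng.choice(population, size=100)  # like plot B: raw observations
means_n7 = np.array([rng.choice(population, size=7).mean() for _ in range(100)])
means_n49 = np.array([rng.choice(population, size=49).mean() for _ in range(100)])

print(f"single sample, n = 100: sd = {one_sample.std(ddof=1):.2f}")  # largest spread
print(f"100 means of n = 7:     sd = {means_n7.std(ddof=1):.2f}")
print(f"100 means of n = 49:    sd = {means_n49.std(ddof=1):.2f}")   # smallest spread
```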

Confidence Intervals

Confidence Interval (for a mean)

Screenshot taken from Coursera 0:38

Screenshot taken from Coursera 6:15

Accuracy vs. Precision

Screenshot taken from Coursera 0:52

Screenshot taken from Coursera 4:40

Screenshot taken from Coursera 6:55

Required Sample Size for ME

Screenshot taken from Coursera 1:00

Screenshot taken from Coursera 4:00

CI (for the mean) examples

Screenshot taken from Coursera 2:00

Screenshot taken from Coursera 2:30