Relevant lectures: 8
In today's discussion, you'll get practice with inference concepts and dive deeper into the work we did in lecture 8.
This discussion will not be turned in. In fact, you won't need to write any code in this discussion; all your answers will be written in the text cells below.
The purpose of this exercise is to think about and communicate your point of view, so please work through these problems together in groups of 2 or 3.
Recall from lecture: J drives a daily commute to UC Berkeley from Beaumont Ave. in Oakland.
He wants to know what lane is best to take.
Specifically, he wants to know: is Lane 4 (the rightmost lane) better than Lane 1 (the leftmost lane)?
Our dataset contains all the workday flows over 60-minute intervals (7-8am) near Beaumont Ave.
Here's a plot of the flows from 7-8am over the time period in our data:
And here are the distributions of the flows:
Solution: We turned his question into:
Do Lane 4 and Lane 1 have different mean flows?
Solution: Null hypothesis: There is no difference in mean flows between Lane 1 and Lane 4. Any difference we observe is due to chance.
Alternative: There is a difference.
Solution: There is a difference, but we don't know whether that difference could happen by chance or not.
For example, if we flip a coin 1000 times and get 550 heads, it'd be difficult to tell just by looking whether the coin is biased or whether a fair coin could produce that result by chance. This is a similar case.
In order to tell whether our difference is significant, we bootstrapped the mean difference between Lane 1 and 4.
This is the distribution we got:
According to this distribution, estimate the probability that we get a flow difference of 0 or something more extreme if the lane flows fluctuated by chance.
Why can we look at this distribution and find a probability?
Is our probability a p-value?
Finally, why did we look at the probability of getting 0 or more extreme rather than getting 20 (our previously computed mean difference) or more extreme?
Solution: It looks like roughly 40% of the curve is below 0, so the probability that we get is 0.40.
We are using this distribution to estimate the sampling distribution of mean differences. Thus, this distribution lets us estimate the probability of getting a particular mean difference or something more extreme.
Our probability is a p-value. This is the definition of a p-value.
We looked at the probability of 0 because we want to know whether the two lanes have different mean flows. This bootstrap distribution is always centered at approximately our sample mean difference (20 in this case, although we made up that number, so it doesn't show up on the plot), so it doesn't really help to look at that number or something more extreme.
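To make the estimate above concrete, here is a minimal sketch of how such a bootstrap distribution and the "share at or below 0" could be computed. The flow values are made up for illustration (they are not the lecture data), the difference is assumed to be Lane 1 minus Lane 4, and the helper name `bootstrap_mean_diffs` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-in data: one 7-8am flow per workday for each lane.
n_days = 60
lane1 = rng.normal(loc=1500, scale=150, size=n_days)
lane4 = rng.normal(loc=1480, scale=150, size=n_days)

def bootstrap_mean_diffs(a, b, n_boot=5000, rng=rng):
    """Resample days with replacement; return the mean of (a - b) per resample."""
    diffs = a - b  # assumed convention: Lane 1 minus Lane 4
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    return diffs[idx].mean(axis=1)

boot = bootstrap_mean_diffs(lane1, lane4)

# Share of the bootstrap distribution at or below 0,
# i.e. the area we eyeballed from the plot.
prop_at_or_below_zero = np.mean(boot <= 0)

print(f"observed mean difference: {np.mean(lane1 - lane4):.1f}")
print(f"share of bootstrap means at or below 0: {prop_at_or_below_zero:.2f}")
```

With the real data, you would replace the two made-up arrays with the observed Lane 1 and Lane 4 flows.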
Solution: From the plot, we can say the interval should be in the ballpark of: [-30, 40].
Note that in lecture, Bin switched to a one-sided interval. She realized this mistake too late as she had already given her lecture by then.
Our question statement led us to create a two-sided interval.
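Reading the interval off the plot corresponds to taking percentiles of the bootstrap distribution. A rough sketch of a two-sided 95% percentile interval, assuming made-up daily differences (Lane 1 minus Lane 4) in place of the real data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up daily flow differences (Lane 1 minus Lane 4), for illustration only.
diffs = rng.normal(loc=20, scale=140, size=60)

# Bootstrap the mean difference: resample the days with replacement.
boot_means = np.array([
    rng.choice(diffs, size=len(diffs), replace=True).mean()
    for _ in range(5000)
])

# Two-sided 95% interval: cut 2.5% off each tail.
lower, upper = np.percentile(boot_means, [2.5, 97.5])

# If 0 lies inside the interval, we cannot reject the null of no difference.
contains_zero = lower <= 0 <= upper

print(f"95% interval: [{lower:.1f}, {upper:.1f}]")
print(f"interval contains 0: {contains_zero}")
```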
Solution: Our confidence interval doesn't allow us to reject the null.
However, in our plots it looks like there actually was a difference between Lane 1 and Lane 4. It's possible that the mean difference was affected by the atypical Oct-Nov period. This is why you want to perform EDA before a more formal analysis!
One good way to check whether you understand something is to tweak the problem and see if you can still figure it out. Let's do that!
Let's suppose we didn't bootstrap the differences. Instead, we bootstrap the mean flow for Lane 1 and Lane 4 separately. Can we still answer our original question? If so, how? If not, explain why not.
Solution: Yes. We can use the bootstrapped means to construct confidence intervals for the mean flows of Lane 1 and Lane 4. If the confidence intervals don't overlap, we can conclude that the mean flows differ. (If they do overlap, this check alone doesn't settle the question.)
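One way to sketch this approach is below. The flows are made up for illustration, and `percentile_ci` is a hypothetical helper name; the 95% level is an assumption.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up stand-in flows for each lane (not the lecture data).
lane1 = rng.normal(loc=1500, scale=150, size=60)
lane4 = rng.normal(loc=1480, scale=150, size=60)

def percentile_ci(x, n_boot=5000, level=95, rng=rng):
    """Bootstrap the mean of x and return a two-sided percentile interval."""
    means = np.array([
        rng.choice(x, size=len(x), replace=True).mean()
        for _ in range(n_boot)
    ])
    tail = (100 - level) / 2
    lo, hi = np.percentile(means, [tail, 100 - tail])
    return lo, hi

ci1 = percentile_ci(lane1)
ci4 = percentile_ci(lane4)

# Two intervals overlap when each one starts before the other ends.
overlap = ci1[0] <= ci4[1] and ci4[0] <= ci1[1]

print(f"Lane 1 interval: [{ci1[0]:.1f}, {ci1[1]:.1f}]")
print(f"Lane 4 interval: [{ci4[0]:.1f}, {ci4[1]:.1f}]")
print(f"intervals overlap: {overlap}")
```

Note that this check is conservative: non-overlapping intervals point to a difference, but overlapping intervals do not by themselves show that the mean flows are equal.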
Rephrase the question, null, and alternative hypotheses so that you would construct a one-sided confidence interval instead of the two-sided one above.
Then, use the plot above to estimate your one-sided confidence interval. How do the bounds of this interval compare with your previous bounds?
Solution:
Question: Does Lane 1 have a higher mean flow than Lane 4?
Null: Lane 1 does not have a higher mean flow than Lane 4.
Alternative: Lane 1 has a higher mean flow than Lane 4.
Confidence interval: In the ballpark of [-25, ∞).
The left endpoint should be larger (less negative) than the left endpoint of the two-sided interval, since you leave 5% of the distribution on the left side as opposed to 2.5% for the two-sided interval.
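The comparison of the two lower bounds can be checked with a small sketch. It assumes a 95% confidence level and uses made-up daily differences (Lane 1 minus Lane 4) rather than the real data.

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up daily flow differences (Lane 1 minus Lane 4), for illustration only.
diffs = rng.normal(loc=20, scale=140, size=60)

# Bootstrap distribution of the mean difference.
idx = rng.integers(0, len(diffs), size=(5000, len(diffs)))
boot_means = diffs[idx].mean(axis=1)

# Two-sided 95% interval leaves 2.5% in the left tail.
two_sided_lower = np.percentile(boot_means, 2.5)

# One-sided 95% interval [lower, infinity) leaves the full 5% in the left tail.
one_sided_lower = np.percentile(boot_means, 5)

print(f"two-sided lower bound: {two_sided_lower:.1f}")
print(f"one-sided lower bound: {one_sided_lower:.1f}")
```

Because the 5th percentile can never sit below the 2.5th percentile, the one-sided lower bound is always at least as large as the two-sided one.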
Solution: We assumed that the data in that time span were atypical, and that the rest of the data were representative of flows in the future.
Solution: Easy answer: Go and get 1000 more samples of flows, then construct the distribution of mean differences from those.
More complicated answer: Assume that the sampling distribution of the mean difference is bell-shaped. Then, use statistical methods to estimate the mean and variance of that distribution. Finally, use that curve to make the same estimate as above.
We don't teach that second method in Data 8 or in this class because it imposes an additional assumption on your data and involves complicated equations. We have computers now, so we use the bootstrap.
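For comparison, here is a rough sketch of one common version of the bell-curve approach. It assumes the mean difference is approximately normally distributed, and it uses made-up daily differences (Lane 1 minus Lane 4) in place of the real data. The value 1.96 is the usual normal cutoff for a two-sided 95% interval.

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up daily flow differences (Lane 1 minus Lane 4), for illustration only.
diffs = rng.normal(loc=20, scale=140, size=60)

n = len(diffs)
mean_diff = diffs.mean()

# Standard error of the mean: sample SD divided by sqrt(n).
std_err = diffs.std(ddof=1) / np.sqrt(n)

# Assumption: the sampling distribution of the mean is bell-shaped,
# so about 95% of it lies within 1.96 standard errors of its center.
z = 1.96
lower = mean_diff - z * std_err
upper = mean_diff + z * std_err

print(f"normal-approximation 95% interval: [{lower:.1f}, {upper:.1f}]")
```

No resampling is needed here, but the interval is only as trustworthy as the bell-shape assumption.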