Discussion 5: Prediction and Inference

Relevant lectures: 8

In today's discussion, you'll get practice with inference concepts and dive deeper into the work we did in lecture 8.

This discussion will not be turned in. In fact, there is no code in this discussion; all your answers will be written in the text cells below.

The purpose of this exercise is to think about and communicate your point of view, so please work through these problems together in groups of 2 or 3.

Traffic Data Problem

Recall from lecture: J drives a daily commute to UC Berkeley from Beaumont Ave. in Oakland.

He wants to know what lane is best to take.

Specifically, he wants to know: is Lane 4 (the rightmost lane) better than Lane 1 (the leftmost lane)?

Dataset

Our dataset contains all the workday flows over the 60-minute interval from 7 to 8am near Beaumont Ave.

Here's a plot of the flows from 7-8am over the time period in our data:

And here are the distributions of the flows:


Question 1: Recap

First, let's walk through the steps we took during lecture to create our confidence interval.

Question 1a:

How did we change J's question into a more precise statistical question?

Solution: We turned his question into:

Do Lane 4 and Lane 1 have different mean flows?

Question 1b:

What were our null and alternative hypotheses for this question?

Solution: Null hypothesis: There is no difference in mean flows between Lane 1 and Lane 4. Any difference we observe is due to chance.

Alternative: There is a difference.

Question 1c:

Let's suppose that we took our data and found the mean flow for Lane 1 was 1000 and the mean flow for Lane 4 was 980.

This results in a (Lane 1 - Lane 4) flow difference of 20.

At this point, why can't we conclude that Lane 1 has a different mean flow than Lane 4?

Solution: There is a difference, but we don't know whether that difference could happen by chance or not.

For example, if we flip a coin 1000 times and get 550 heads, the count alone doesn't tell us whether the coin is biased. We first need to know how much the number of heads varies by chance. This is a similar case.

Question 1d:

In order to tell whether our difference is significant, we bootstrapped the mean difference between Lane 1 and 4.

This is the distribution we got:

According to this distribution, estimate the probability that we get a flow difference of 0 or something more extreme if the lane flows fluctuated only by chance.

Why can we look at this distribution and find a probability?

Is our probability a p-value?

Finally, why did we look at the probability of getting 0 or more extreme rather than getting 20 (our previously computed mean difference) or more extreme?

Solution: It looks like roughly 40% of the curve is below 0, so the probability that we get is 0.40.

We are using this distribution to estimate the sampling distribution of mean differences. Thus, this distribution lets us estimate the probability of getting a particular sample's mean difference or something more extreme.

Our probability is a p-value. This is the definition of a p-value.

We looked at the probability of 0 because we want to know whether the two lanes have a mean difference or not. This bootstrap distribution is always centered at our sample mean difference (20 in this case, although we made up that number, so it doesn't show up on the plot). The probability of getting 20 or something more extreme would therefore always be about one half, so it doesn't tell us anything.
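No code is needed to answer this question, but the computation behind the plot can be sketched as follows. This is a minimal sketch on synthetic data, not the lecture dataset: the arrays `lane1` and `lane4`, their means and spread, and the helper `bootstrap_mean_diffs` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (an assumption, not the lecture dataset):
# one 7-8am flow per workday for each lane, measured on the same days.
n_days = 200
lane1 = rng.normal(loc=1000, scale=150, size=n_days)
lane4 = rng.normal(loc=980, scale=150, size=n_days)

def bootstrap_mean_diffs(a, b, n_boot=5000, rng=rng):
    """Resample days with replacement; return the (a - b) mean difference of each resample."""
    n = len(a)
    # Each row of idx is one bootstrap resample of day indices.
    idx = rng.integers(0, n, size=(n_boot, n))
    return a[idx].mean(axis=1) - b[idx].mean(axis=1)

boot_diffs = bootstrap_mean_diffs(lane1, lane4)
observed_diff = lane1.mean() - lane4.mean()

# Fraction of bootstrapped differences at or below 0,
# i.e. the area we eyeballed from the plot.
prop_at_or_below_zero = np.mean(boot_diffs <= 0)

print("observed difference:", observed_diff)
print("center of bootstrap distribution:", boot_diffs.mean())
print("fraction at or below 0:", prop_at_or_below_zero)
```

The printed center of the bootstrap distribution should sit very close to the observed difference, which is why looking at "20 or more extreme" is uninformative.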

Question 1e:

Use the distribution above to roughly estimate the bounds of a 95% confidence interval for this problem. (Remember to construct the correct type of interval for this problem, not just what was on the lecture slides.)

Solution: From the plot, we can say the interval should be in the ballpark of: [-30, 40].

Note that in lecture, Bin switched to a one-sided interval. She realized this mistake too late as she had already given her lecture by then.

Our question statement led us to create a two-sided interval.
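For reference, a two-sided 95% interval can be read off a bootstrap distribution by taking its 2.5th and 97.5th percentiles. The sketch below assumes a made-up array `boot_diffs` standing in for the bootstrapped mean differences; its center and spread are chosen only so the result lands in the same ballpark as the plot.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the bootstrapped mean differences (assumed values, not the lecture's).
boot_diffs = rng.normal(loc=5, scale=18, size=5000)

# Two-sided 95% percentile interval: cut 2.5% off each tail.
lower, upper = np.percentile(boot_diffs, [2.5, 97.5])

# If 0 lies inside the interval, we cannot reject "no difference in mean flows".
contains_zero = lower <= 0 <= upper

print("95% interval:", (lower, upper))
print("contains 0:", contains_zero)
```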

Question 1f:

Does our confidence interval suggest that J should prefer one lane over the other?

Why did we say that the confidence interval probably wasn't the right tool for the job?

Solution: Our confidence interval doesn't allow us to reject the null.

However, in our plots it looks like there actually was a difference between Lane 1 and Lane 4. It's possible that the mean difference was affected by the atypical Oct-Nov period. This is the reason you want to perform EDA before more formal analysis!

Question 2

One good way to check whether you understand something is to tweak the problem and see if you can still figure it out. Let's do that!

Question 2a:

Let's suppose we didn't bootstrap the differences. Instead, we bootstrap the mean flow for Lane 1 and Lane 4 separately. Can we still answer our original question? If so, how? If not, explain why not.

Solution: Yes. We can use the bootstrapped means to construct a confidence interval for the mean flow of Lane 1 and another for the mean flow of Lane 4. If the confidence intervals don't overlap, we can conclude that there is a mean difference. (If they do overlap, this check alone is inconclusive.)
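The overlap check can be sketched as below. This is an illustration on synthetic data; `lane1`, `lane4`, `bootstrap_mean_ci`, and `intervals_overlap` are hypothetical names, and the flows are randomly generated rather than taken from the lecture dataset.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic flows (assumed values, not the lecture dataset).
lane1 = rng.normal(1000, 150, size=200)
lane4 = rng.normal(980, 150, size=200)

def bootstrap_mean_ci(x, n_boot=5000, level=95, rng=rng):
    """Percentile bootstrap confidence interval for the mean of x."""
    n = len(x)
    idx = rng.integers(0, n, size=(n_boot, n))  # one resample per row
    means = x[idx].mean(axis=1)
    tail = (100 - level) / 2
    lo, hi = np.percentile(means, [tail, 100 - tail])
    return lo, hi

def intervals_overlap(a, b):
    """True if closed intervals a = (lo, hi) and b = (lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

ci1 = bootstrap_mean_ci(lane1)
ci4 = bootstrap_mean_ci(lane4)
overlap = intervals_overlap(ci1, ci4)

print("Lane 1 interval:", ci1)
print("Lane 4 interval:", ci4)
print("intervals overlap:", overlap)
```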

Question 2b:

Rephrase the question, null, and alternative hypotheses so that you would construct a one-sided confidence interval instead of the two-sided one above.

Then, use the plot above to estimate your one-sided confidence interval. How do the bounds of this interval compare with your previous bounds?

Solution:

Question: Does Lane 1 have a higher mean flow than Lane 4?

Null: Lane 1 does not have a higher mean flow than Lane 4.

Alternative: Lane 1 has a higher mean flow than Lane 4.

Confidence interval: In the ballpark of [-25, inf).

The left bound of this interval should be larger (closer to 0) than the left bound of the two-sided interval, since you leave 5% of the distribution on the left side as opposed to 2.5% for the two-sided interval.
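To see the effect of moving from 2.5% to 5% in the left tail, the two lower bounds can be compared directly. As a sketch, `boot_diffs` below is a made-up stand-in for the bootstrapped mean differences, with center and spread chosen only to resemble the plot.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for the bootstrapped mean differences (assumed values, not the lecture's).
boot_diffs = rng.normal(loc=5, scale=18, size=5000)

# Two-sided 95% interval leaves 2.5% in the left tail.
two_sided_lower = np.percentile(boot_diffs, 2.5)

# One-sided 95% interval [lower, inf) leaves the full 5% in the left tail.
one_sided_lower = np.percentile(boot_diffs, 5)

print("two-sided lower bound:", two_sided_lower)
print("one-sided lower bound:", one_sided_lower)
```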

Question 2c:

Let's suppose we constructed the interval, then looked at our EDA and decided to cut out the data from Oct to Nov 2016, then recreated the confidence interval.

What new assumption did we implicitly make in this process?

Solution: We assumed that the data in that time span were atypical, and that the rest of the data were representative of flows in the future.

Question 2d:

Let's suppose we didn't have the bootstrap. How else could we estimate the sampling distribution of mean differences?

There is an answer that is easy to state. There is also an answer that you might have learned if you've taken other Stats classes.

Solution: Easy answer: Go and get 1000 more samples of flows, then construct the distribution of mean differences from those samples.

More complicated answer: Assume that the sampling distribution of mean differences is bell-shaped. Then, use statistical methods to estimate the mean and variance of that sampling distribution. Then, use that curve to make the same estimate as above.

The reason we don't teach that second method in Data 8 and in this class is that it imposes an additional assumption on your data and involves complicated equations. We have computers now, so we use the bootstrap.
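For the curious, the "more complicated answer" can be sketched in a few lines. This is one possible version under stated assumptions, not the method from lecture: it assumes the two lanes are measured on the same days (so we can work with per-day differences), that days are roughly independent, and that the sampling distribution of the mean difference is approximately normal. The data are synthetic.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(4)

# Synthetic flows measured on the same days (assumed values, not the lecture dataset).
lane1 = rng.normal(1000, 150, size=200)
lane4 = rng.normal(980, 150, size=200)

# Per-day differences; pairing by day is an assumption of this sketch.
diffs = lane1 - lane4
n = len(diffs)

# Estimated center and spread of the sampling distribution of the mean difference.
mean_diff = diffs.mean()
se = diffs.std(ddof=1) / np.sqrt(n)  # standard error of the mean

# Two-sided 95% interval from the normal curve instead of the bootstrap.
z = NormalDist().inv_cdf(0.975)
ci = (mean_diff - z * se, mean_diff + z * se)

print("mean difference:", mean_diff)
print("standard error:", se)
print("normal-approximation 95% interval:", ci)
```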

