Welcome to Lab 7!
Today we will get some hands-on practice with linear regression. You can find more information about this topic in section 13.2.
Lab submissions are due by Friday, November 4 at 7:00 PM. Remember to submit your lab by running all the tests and then running the final cell in the lab.
In [2]:
# Run this cell, but please don't change it.
# These lines import the Numpy and Datascience modules.
import numpy as np
from datascience import *
# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
# These lines load the tests.
from client.api.assignment import load_assignment
tests = load_assignment('lab07.ok')
(Note: clever title comes from here.)
Old Faithful is a geyser in Yellowstone National Park in the central United States. It's famous for erupting on a fairly regular schedule. You can see a video below.
In [3]:
# For the curious: this is how to display a YouTube video in a
# Jupyter notebook. The argument to YouTubeVideo is the part
# of the URL (called a "query parameter") that identifies the
# video. For example, the full URL for this video is:
# https://www.youtube.com/watch?v=wE8NDuzt8eg
from IPython.display import YouTubeVideo
YouTubeVideo("wE8NDuzt8eg")
Some of Old Faithful's eruptions last longer than others. When it has a long eruption, there's generally a longer wait until the next eruption.
If you visit Yellowstone, you might want to predict when the next eruption will happen, so you can see the rest of the park and come to see the geyser when it happens. Today, we will use a dataset on eruption durations and waiting times to see if we can make such predictions accurately with linear regression.
The dataset has one row for each observed eruption. It includes the following columns:
Run the next cell to load the dataset.
In [4]:
faithful = Table.read_table("faithful.csv")
faithful
We would like to use linear regression to make predictions, but that won't work well if the data aren't roughly linearly related. To check that, we should look at the data.
In [5]:
...
Look at the scatter plot. Are eruption duration and waiting time roughly linearly related? Is the relationship positive, as we claimed earlier? You may want to consult the textbook chapter 13 for the definition of "linearly related."
Write your answer here, replacing this text.
We're going to continue with the provisional assumption that they are linearly related, so it's reasonable to use linear regression to analyze this data.
We'd next like to plot the data in standard units. Recall that, if nums
is an array of numbers, then
(nums - np.mean(nums)) / np.std(nums)
...is an array of those numbers in standard units.
In [5]:
duration_mean = ...
duration_std = ...
wait_mean = ...
wait_std = ...
faithful_standard = Table().with_columns(
"duration (standard units)", ...,
"wait (standard units)", ...)
faithful_standard
In [7]:
_ = tests.grade('q1_3')
In [7]:
...
You'll notice that this plot looks exactly the same as the last one! The data really are different, but the axes are scaled differently. (The method scatter
scales the axes so the data fill up the available space.) So it's important to read the ticks on the axes.
Write your answer here, replacing this text.
Compute the correlation r
. Hint: Use faithful_standard
. Section 13.1 explains how to do this.
In [8]:
r = ...
r
In [10]:
_ = tests.grade('q1_6')
Recall that the correlation is the slope of the regression line when the data are put in standard units.
The next cell plots the regression line in standard units:
$$\text{waiting time (standard units)} = r \times \text{eruption duration (standard units)}.$$Then, it plots the original data again, for comparison.
In [9]:
def plot_data_and_line(dataset, x, y, point_0, point_1):
"""Makes a scatter plot of the dataset, along with a line passing through two points."""
dataset.scatter(x, y, label="data")
plt.plot(make_array(point_0.item(0), point_1.item(0)), make_array(point_0.item(1), point_1.item(1)), label="regression line")
plt.legend(bbox_to_anchor=(1.5,.8))
plot_data_and_line(faithful_standard, "duration (standard units)", "wait (standard units)", make_array(-2, -2*r), make_array(2, 2*r))
How would you take a point in standard units and convert it back to original units? We'd have to "stretch" its horizontal position by duration_std
and its vertical position by wait_std
.
That means the same thing would happen to the slope of the line.
Stretching a line horizontally makes it less steep, so we divide the slope by the stretching factor. Stretching a line vertically makes it more steep, so we multiply the slope by the stretching factor.
What is the slope of the regression line in original units?
(If the "stretching" explanation is unintuitive, consult section 13.2 in the textbook.)
In [10]:
slope = ...
slope
We know that the regression line passes through the point (duration_mean, wait_mean)
. You might recall from high-school algebra that the equation for the line is therefore:
After rearranging that equation slightly, the intercept turns out to be:
In [12]:
intercept = slope*(-duration_mean) + wait_mean
intercept
In [14]:
_ = tests.grade('q2_1')
The slope and intercept tell you exactly what the regression line looks like. To predict the waiting time for an eruption, multiply the eruption's duration by slope
and then add intercept
.
Compute the predicted waiting time for an eruption that lasts 2 minutes, and for an eruption that lasts 5 minutes.
In [13]:
two_minute_predicted_waiting_time = ...
five_minute_predicted_waiting_time = ...
# Here is a helper function to print out your predictions
# (you don't need to modify it):
def print_prediction(duration, predicted_waiting_time):
print("After an eruption lasting", duration,
"minutes, we predict you'll wait", predicted_waiting_time,
"minutes until the next eruption.")
print_prediction(2, two_minute_predicted_waiting_time)
print_prediction(5, five_minute_predicted_waiting_time)
The next cell plots the line that goes between those two points, which is (a segment of) the regression line.
In [16]:
plot_data_and_line(faithful, "duration", "wait", make_array(2, two_minute_predicted_waiting_time), make_array(5, five_minute_predicted_waiting_time))
Make predictions for the waiting time after each eruption in the faithful
table. (Of course, we know exactly what the waiting times were! We are doing this so we can see how accurate our predictions are.) Put these numbers into a column in a new table called faithful_predictions
. Its first row should look like this:
duration | wait | predicted wait |
---|---|---|
3.6 | 79 | 72.1011 |
Hint: Your answer can be just one line. There is no need for a for
loop; use array arithmetic instead.
In [17]:
faithful_predictions = ...
faithful_predictions
In [18]:
_ = tests.grade("q3_2")
How close were we? Compute the residual for each eruption in the dataset. The residual is the difference (not the absolute difference) between the actual waiting time and the predicted waiting time. Add the residuals to faithful_predictions
as a new column called "residual"
, naming the resulting table faithful_residuals
.
Hint: Again, your code will be much simpler if you don't use a for
loop.
In [1]:
faithful_residuals = ...
faithful_residuals
In [20]:
_ = tests.grade("q3_3")
Here is a plot of the residuals you computed. Each point corresponds to one eruption. It shows how much our prediction over- or under-estimated the waiting time.
In [21]:
faithful_residuals.scatter("duration", "residual", color="r")
There isn't really a pattern in the residuals, which confirms that it was reasonable to try linear regression. It's true that there are two separate clouds; the eruption durations seemed to fall into two distinct clusters. But that's just a pattern in the eruption durations, not a pattern in the relationship between eruption durations and waiting times.
Earlier, you should have found that the correlation is fairly close to 1, so the line fits fairly well on the training data. That means the residuals are overall small (close to 0) in comparison to the waiting times.
We can see that visually by plotting the waiting times and residuals together:
In [22]:
faithful_residuals.scatter("duration", "wait", label="actual waiting time", color="blue")
plt.scatter(faithful_residuals.column("duration"), faithful_residuals.column("residual"), label="residual", color="r")
plt.plot(make_array(2, 5), make_array(two_minute_predicted_waiting_time, five_minute_predicted_waiting_time), label="regression line")
plt.legend(bbox_to_anchor=(1.7,.8));
However, unless you have a strong reason to believe that the linear regression model is true, you should be wary of applying your prediction model to data that are very different from the training data.
In [23]:
zero_minute_predicted_waiting_time = ...
two_point_five_minute_predicted_waiting_time = ...
hour_predicted_waiting_time = ...
print_prediction(0, zero_minute_predicted_waiting_time)
print_prediction(2.5, two_point_five_minute_predicted_waiting_time)
print_prediction(60, hour_predicted_waiting_time)
In [24]:
_ = tests.grade('q4_1')
Question 2. Do you believe any of these values are reliable predictions? If you don't believe some of them, say why.
Write your answer here, replacing this text.
In [25]:
# For your convenience, you can run this cell to run all the tests at once!
import os
print("Running all tests...")
_ = [tests.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]
print("Finished running all tests.")
In [ ]:
# Run this cell to submit your work *after* you have passed all of the test cells.
# It's ok to run this cell multiple times. Only your final submission will be scored.
!TZ=America/Los_Angeles jupyter nbconvert --output=".lab07_$(date +%m%d_%H%M)_submission.html" lab07.ipynb && echo "Submitted successfully."