Adapted from Chapter 3 of An Introduction to Statistical Learning
| | continuous | categorical |
|---|---|---|
| supervised | regression | classification |
| unsupervised | dimension reduction | clustering |
Why are we learning linear regression?
We will be using Statsmodels for teaching purposes since it has some nice characteristics for linear modeling. However, we recommend that you spend most of your energy on scikit-learn since it provides significantly more useful functionality for machine learning in general.
In [1]:
# imports
import pandas as pd
import matplotlib.pyplot as plt
# this allows plots to appear directly in the notebook
%matplotlib inline
In [2]:
# read data into a DataFrame
data = pd.read_csv('', index_col=0)
What are the features?
What is the response?
In [3]:
# print the shape of the DataFrame
data.shape
There are 200 observations, and thus 200 markets in the dataset.
In [4]:
# visualize the relationship between the features and the response using scatterplots
fig, axs = plt.subplots(1, 3, sharey=True)
data.plot(kind='scatter', x='TV', y='Sales', ax=axs[0], figsize=(16, 8))
data.plot(kind='scatter', x='Radio', y='Sales', ax=axs[1])
data.plot(kind='scatter', x='Newspaper', y='Sales', ax=axs[2])
Let's pretend you work for the company that manufactures and markets this widget. The company might ask you the following: On the basis of this data, how should we spend our advertising money in the future?
This general question might lead you to more specific questions:
We will explore these questions below!
Simple linear regression is an approach for predicting a quantitative response using a single feature (or "predictor" or "input variable"). It takes the following form:
$y = \beta_0 + \beta_1x$
What does each term represent?
Together, $\beta_0$ and $\beta_1$ are called the model coefficients. To create your model, you must "learn" the values of these coefficients. And once we've learned these coefficients, we can use the model to predict Sales!
What elements are present in the diagram?
How do the model coefficients relate to the least squares line?
Here is a graphical depiction of those calculations:
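The same calculations can be sketched numerically. This is a minimal illustration on a made-up five-point dataset (not the advertising data), using the standard closed-form least squares formulas; the variable names are chosen here for illustration only.

```python
# hypothetical toy data, not the advertising dataset
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# slope: co-variation of x and y divided by the variation of x
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
sxx = sum((xi - x_mean) ** 2 for xi in x)
beta_1 = sxy / sxx

# intercept: forces the line through the point (x_mean, y_mean)
beta_0 = y_mean - beta_1 * x_mean

# residuals are the vertical distances between the observations and the line;
# least squares picks the coefficients that minimize their sum of squares
residuals = [yi - (beta_0 + beta_1 * xi) for xi, yi in zip(x, y)]
ssr = sum(r ** 2 for r in residuals)

print(beta_0, beta_1, ssr)
```

Any other choice of intercept and slope would give a larger sum of squared residuals on this data.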
Let's use Statsmodels to estimate the model coefficients for the advertising data:
In [5]:
# this is the standard import if you're using "formula notation" (similar to R)
import statsmodels.formula.api as smf
# create a fitted model in one line
lm = smf.ols(formula='Sales ~ TV', data=data).fit()
# print the coefficients
lm.params
How do we interpret the TV coefficient ($\beta_1$)?
Note that if an increase in TV ad spending was associated with a decrease in sales, $\beta_1$ would be negative.
In [6]:
# manually calculate the prediction for TV ad spending of 50 (in thousands of dollars):
# intercept + TV coefficient * TV spending
7.032594 + 0.047537*50
Thus, we would predict Sales of 9,409 widgets in that market.
Of course, we can also use Statsmodels to make the prediction:
In [7]:
# you have to create a DataFrame since the Statsmodels formula interface expects it
X_new = pd.DataFrame({'TV': [50]})
In [8]:
# use the model to make predictions on a new value
lm.predict(X_new)
In [9]:
# create a DataFrame with the minimum and maximum values of TV
X_new = pd.DataFrame({'TV': [data.TV.min(), data.TV.max()]})
In [10]:
# make predictions for those x values and store them
preds = lm.predict(X_new)
In [11]:
# first, plot the observed data
data.plot(kind='scatter', x='TV', y='Sales')
# then, plot the least squares line
plt.plot(X_new, preds, c='red', linewidth=2)
Question: Is linear regression a high bias/low variance model, or a low bias/high variance model?
Answer: High bias/low variance. Under repeated sampling, the line will stay roughly in the same place (low variance), but the average of those models won't do a great job capturing the true relationship (high bias). Note that low variance is a useful characteristic when you don't have a lot of training data!
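To make "repeated sampling" concrete, here is a small simulation sketch. It assumes a made-up nonlinear truth ($y = x^2$ plus noise), draws many samples, and fits a least squares line to each one. The function `true_f`, the sample sizes, and the noise level are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)

def true_f(x):
    # hypothetical nonlinear "true" relationship, chosen only for illustration
    return x ** 2

def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    x_mean = statistics.fmean(xs)
    y_mean = statistics.fmean(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope

intercepts, slopes = [], []
for _ in range(500):
    xs = [random.uniform(0, 3) for _ in range(30)]
    ys = [true_f(x) + random.gauss(0, 0.5) for x in xs]
    b0, b1 = fit_line(xs, ys)
    intercepts.append(b0)
    slopes.append(b1)

avg_b0 = statistics.fmean(intercepts)
avg_b1 = statistics.fmean(slopes)

# low variance: the fitted slope moves very little from sample to sample
print("slope mean:", avg_b1, "slope std:", statistics.stdev(slopes))

# high bias: even the average line misses the true curve at x = 0
bias_at_zero = avg_b0 - true_f(0)
print("average prediction at x=0 minus truth:", bias_at_zero)
```

The spread of the slopes is small relative to their mean, but the average line is still systematically wrong at the edges of the range, because no straight line can follow a curve.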
A closely related concept is confidence intervals. Statsmodels calculates 95% confidence intervals for our model coefficients, which are interpreted as follows: If the population from which this sample was drawn were sampled 100 times, and a confidence interval were computed from each sample, approximately 95 of those confidence intervals would contain the "true" coefficient.
In [12]:
# print the confidence intervals for the model coefficients
lm.conf_int()
Keep in mind that we only have a single sample of data, and not the entire population of data. The "true" coefficient is either within this interval or it isn't, but there's no way to actually know. We estimate the coefficient with the data we do have, and we show uncertainty about that estimate by giving a range that the coefficient is probably within.
Note that using 95% confidence intervals is just a convention. You can create 90% confidence intervals (which will be narrower), 99% confidence intervals (which will be wider), or whatever intervals you like.
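Both points (interval width and the "95 out of 100" interpretation) can be checked with a simulation. The sketch below assumes a made-up population with known coefficients (`TRUE_B0`, `TRUE_B1`) and uses a normal approximation for the critical value, whereas Statsmodels uses the t distribution; with 200 observations the difference is small.

```python
import random
from statistics import NormalDist, fmean

random.seed(1)
TRUE_B0, TRUE_B1 = 7.0, 0.05  # hypothetical "true" coefficients

def slope_and_se(xs, ys):
    """Least squares slope and its standard error."""
    n = len(xs)
    x_mean, y_mean = fmean(xs), fmean(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
    intercept = y_mean - slope * x_mean
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = (rss / (n - 2) / sxx) ** 0.5
    return slope, se

def conf_int(slope, se, level):
    # normal approximation to the t critical value (reasonable for large n)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return slope - z * se, slope + z * se

def draw_sample(n=200):
    xs = [random.uniform(0, 300) for _ in range(n)]
    ys = [TRUE_B0 + TRUE_B1 * x + random.gauss(0, 3) for x in xs]
    return xs, ys

# one sample: higher confidence levels give wider intervals
slope, se = slope_and_se(*draw_sample())
widths = {}
for level in (0.90, 0.95, 0.99):
    low, high = conf_int(slope, se, level)
    widths[level] = high - low
print(widths)

# repeated sampling: count how many 95% intervals contain the true slope
n_repeats = 1000
hits = 0
for _ in range(n_repeats):
    b1, b1_se = slope_and_se(*draw_sample())
    low, high = conf_int(b1, b1_se, 0.95)
    hits += low <= TRUE_B1 <= high
coverage = hits / n_repeats
print(coverage)
```

In a real analysis we only ever see one of these samples, which is why we can never tell whether our particular interval is one of the "hits".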
Closely related to confidence intervals is hypothesis testing. Generally speaking, you start with a null hypothesis and an alternative hypothesis (that is opposite the null). Then, you check whether the data supports rejecting the null hypothesis or failing to reject the null hypothesis.
(Note that "failing to reject" the null is not the same as "accepting" the null hypothesis. The alternative hypothesis may indeed be true, except that you just don't have enough data to show that.)
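As it relates to model coefficients, here is the conventional hypothesis test:
null hypothesis: There is no relationship between TV ads and Sales (and thus $\beta_1$ equals zero)
alternative hypothesis: There is a relationship between TV ads and Sales (and thus $\beta_1$ is not equal to zero)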
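How do we test this hypothesis? Intuitively, we reject the null (and thus believe the alternative) if the 95% confidence interval does not include zero. Relatedly, the p-value represents the probability of observing an estimated coefficient at least this far from zero if the true coefficient were actually zero (it is not the probability that the coefficient is zero):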
In [13]:
# print the p-values for the model coefficients
lm.pvalues
If the 95% confidence interval includes zero, the p-value for that coefficient will be greater than 0.05. If the 95% confidence interval does not include zero, the p-value will be less than 0.05. Thus, a p-value less than 0.05 is one way to decide whether there is likely a relationship between the feature and the response. (Again, using 0.05 as the cutoff is just a convention.)
In this case, the p-value for TV is far less than 0.05, and so we believe that there is a relationship between TV ads and Sales.
Note that we generally ignore the p-value for the intercept.
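The link between the p-value and the confidence interval can be sketched in a few lines. This example uses synthetic data (one feature that drives Sales and one that does not) and a normal approximation to the t distribution that Statsmodels uses; `slope_test` is a helper written for this illustration, not a library function.

```python
import random
from statistics import NormalDist, fmean

random.seed(2)

def slope_test(xs, ys):
    """Return the slope, its two-sided p-value, and its 95% confidence interval."""
    n = len(xs)
    x_mean, y_mean = fmean(xs), fmean(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
    intercept = y_mean - slope * x_mean
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = (rss / (n - 2) / sxx) ** 0.5

    # test statistic: how many standard errors the estimate is from zero
    z_stat = slope / se
    # probability of a statistic at least this extreme if the true slope were zero
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    z_crit = NormalDist().inv_cdf(0.975)
    return slope, p_value, (slope - z_crit * se, slope + z_crit * se)

# synthetic data: sales depends on the first feature but not on the second
n = 200
related = [random.uniform(0, 300) for _ in range(n)]
unrelated = [random.uniform(0, 300) for _ in range(n)]
sales = [7.0 + 0.05 * x + random.gauss(0, 3) for x in related]

for name, feature in [("related", related), ("unrelated", unrelated)]:
    slope, p_value, (low, high) = slope_test(feature, sales)
    print(name, round(slope, 4), p_value, (low, high))
```

For either feature, the p-value falls below 0.05 exactly when the 95% interval excludes zero, because both are computed from the same estimate and standard error.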
The most common way to evaluate the overall fit of a linear model is by the R-squared value. R-squared is the proportion of variance explained, meaning the proportion of variance in the observed data that is explained by the model, or the reduction in error over the null model. (The null model just predicts the mean of the observed response, and thus it has an intercept and no slope.)
R-squared is between 0 and 1, and higher is better because it means that more variance is explained by the model. Here's an example of what R-squared "looks like":
You can see that the blue line explains some of the variance in the data (R-squared=0.54), the green line explains more of the variance (R-squared=0.64), and the red line fits the training data even further (R-squared=0.66). (Does the red line look like it's overfitting?)
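The definition of R-squared as "reduction in error over the null model" can be written out directly. This is a minimal sketch on a made-up five-point dataset (not the advertising data):

```python
# hypothetical toy data, not the advertising dataset
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# fit the least squares line
slope = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
         / sum((xi - x_mean) ** 2 for xi in x))
intercept = y_mean - slope * x_mean

# error of the null model, which always predicts the mean of y
ss_tot = sum((yi - y_mean) ** 2 for yi in y)
# error of the linear model
ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# proportion of the null model's error that the linear model removes
r_squared = 1 - ss_res / ss_tot
print(r_squared)
```

Here the null model has a squared error of 6.0 and the line has 2.4, so the line explains 60% of the variance.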
Let's calculate the R-squared value for our simple linear model:
In [14]:
# print the R-squared value for the model
lm.rsquared
Is that a "good" R-squared value? It's hard to say. The threshold for a good R-squared value varies widely by domain. Therefore, it's most useful as a tool for comparing different models.
Simple linear regression can easily be extended to include multiple features. This is called multiple linear regression:
$y = \beta_0 + \beta_1x_1 + ... + \beta_nx_n$
Each $x$ represents a different feature, and each feature has its own coefficient. In this case:
$y = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times Newspaper$
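Under the hood, estimating these coefficients is a single least squares problem over a design matrix with one column per feature plus a column of ones for $\beta_0$. The sketch below assumes synthetic data with made-up "true" coefficients (it is not the advertising data) and uses NumPy's `np.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# hypothetical ad budgets; the coefficients below are made up for this sketch
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
newspaper = rng.uniform(0, 100, n)
true_beta = np.array([3.0, 0.045, 0.19, 0.0])  # intercept, TV, Radio, Newspaper

# design matrix: a column of ones for beta_0, then one column per feature
X = np.column_stack([np.ones(n), tv, radio, newspaper])
y = X @ true_beta + rng.normal(0, 1.5, n)

# least squares estimate of all four coefficients at once
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

for name, value in zip(["Intercept", "TV", "Radio", "Newspaper"], beta_hat):
    print(f"{name}: {value:.4f}")
```

The estimates should land close to the made-up true values, with some noise because the sample is finite.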
Let's use Statsmodels to estimate these coefficients:
In [15]:
# create a fitted model with all three features
lm = smf.ols(formula='Sales ~ TV + Radio + Newspaper', data=data).fit()
# print the coefficients
lm.params
How do we interpret these coefficients? For a given amount of Radio and Newspaper ad spending, an increase of $1000 in TV ad spending is associated with an increase in Sales of 45.765 widgets.
A lot of the information we have been reviewing piece-by-piece is available in the model summary output:
In [16]:
# print a summary of the fitted model
lm.summary()
What are a few key things we learn from this output?
How do I decide which features to include in a linear model? Here's one idea:
What are the drawbacks to this approach?
In [17]:
# only include TV and Radio in the model, then check its R-squared
lm = smf.ols(formula='Sales ~ TV + Radio', data=data).fit()
lm.rsquared
In [18]:
# add Newspaper to the model (which we believe has no association with Sales), then check R-squared again
lm = smf.ols(formula='Sales ~ TV + Radio + Newspaper', data=data).fit()
lm.rsquared
R-squared will never decrease as you add more features to the model (and in practice it almost always increases), even if they are unrelated to the response. Thus, selecting the model with the highest R-squared is not a reliable approach for choosing the best linear model.
There is an alternative to R-squared called adjusted R-squared that penalizes model complexity (to control for overfitting), but it generally under-penalizes complexity.
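Here is a small sketch of both ideas on synthetic data: adding a pure-noise feature cannot lower R-squared, and adjusted R-squared applies a penalty for the extra feature. The helper functions and the data-generating coefficients are assumptions made for this illustration; the adjusted R-squared formula is the standard one, $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
noise_feature = rng.normal(size=n)  # unrelated to the response by construction
y = 3.0 + 0.045 * tv + 0.19 * radio + rng.normal(0, 1.5, n)

def r_squared(features, y):
    """R-squared of a least squares fit that includes an intercept."""
    X = np.column_stack([np.ones(len(y))] + features)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Penalize R-squared for the number of features p."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

r2_small = r_squared([tv, radio], y)
r2_big = r_squared([tv, radio, noise_feature], y)

print("R-squared:         ", r2_small, r2_big)
print("adjusted R-squared:", adjusted_r_squared(r2_small, n, 2),
      adjusted_r_squared(r2_big, n, 3))
```

The plain R-squared of the larger model is at least as high as that of the smaller one, even though the extra feature carries no information about the response.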
So is there a better approach to feature selection? Cross-validation. It provides a more reliable estimate of out-of-sample error, and thus is a better way to choose which of your models will best generalize to out-of-sample data. There is extensive functionality for cross-validation in scikit-learn, including automated methods for searching different sets of parameters and different models. Importantly, cross-validation can be applied to any model, whereas the methods described above only apply to linear models.
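To show what cross-validation does mechanically, here is a hand-rolled 5-fold sketch using only NumPy on synthetic data. In practice you would more likely use scikit-learn's built-in tools (for example `cross_val_score`); the helper `cv_rmse` below is written for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
newspaper = rng.uniform(0, 100, n)  # has no effect on y in this synthetic data
y = 3.0 + 0.045 * tv + 0.19 * radio + rng.normal(0, 1.5, n)

# shuffle the row indices once and split them into 5 folds,
# so that every candidate model is scored on the same splits
folds = np.array_split(rng.permutation(n), 5)

def cv_rmse(features, y, folds):
    """Average out-of-sample RMSE of a linear model across the folds."""
    X = np.column_stack([np.ones(len(y))] + features)
    errors = []
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        # fit on the training rows only
        beta, *_ = np.linalg.lstsq(X[train_mask], y[train_mask], rcond=None)
        # score on the held-out rows
        residuals = y[test_idx] - X[test_idx] @ beta
        errors.append(np.sqrt(np.mean(residuals ** 2)))
    return float(np.mean(errors))

rmse_two = cv_rmse([tv, radio], y, folds)
rmse_three = cv_rmse([tv, radio, newspaper], y, folds)
print("TV + Radio:            ", rmse_two)
print("TV + Radio + Newspaper:", rmse_three)
```

Unlike R-squared on the training data, the held-out error is not guaranteed to improve when a feature is added, which is what makes it usable for comparing models.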
In [19]:
# create X and y
feature_cols = ['TV', 'Radio', 'Newspaper']
X = data[feature_cols]
y = data.Sales
# follow the usual sklearn pattern: import, instantiate, fit
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X, y)
# print intercept and coefficients
print(lm.intercept_)
print(lm.coef_)
In [20]:
# pair the feature names with the coefficients
list(zip(feature_cols, lm.coef_))
In [21]:
# predict for a new observation (scikit-learn expects 2D input: one row per observation)
lm.predict(pd.DataFrame([[100, 25, 25]], columns=feature_cols))
In [22]:
# calculate the R-squared
lm.score(X, y)
Note that p-values and confidence intervals are not (easily) accessible through scikit-learn.
In [23]:
import numpy as np
# set a seed for reproducibility
# create a Series of booleans in which roughly half are True
nums = np.random.rand(len(data))
mask_large = nums > 0.5
# initially set Size to small, then change roughly half to be large
data['Size'] = 'small'
data.loc[mask_large, 'Size'] = 'large'
For scikit-learn, we need to represent all data numerically. If the feature only has two categories, we can simply create a dummy variable that represents the categories as a binary value:
In [24]:
# create a new Series called IsLarge
data['IsLarge'] = data.Size.map({'small': 0, 'large': 1})
Let's redo the multiple linear regression and include the IsLarge predictor:
In [25]:
# create X and y
feature_cols = ['TV', 'Radio', 'Newspaper', 'IsLarge']
X = data[feature_cols]
y = data.Sales
# instantiate, fit
lm = LinearRegression()
lm.fit(X, y)
# print coefficients
list(zip(feature_cols, lm.coef_))
How do we interpret the IsLarge coefficient? For a given amount of TV/Radio/Newspaper ad spending, being a large market is associated with an average increase in Sales of 57.42 widgets (as compared to a Small market, which is called the baseline level).
What if we had reversed the 0/1 coding and created the feature 'IsSmall' instead? The coefficient would be the same, except it would be negative instead of positive. As such, your choice of category for the baseline does not matter; all that changes is your interpretation of the coefficient.
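This claim is easy to verify numerically. The sketch below uses synthetic data and NumPy least squares rather than the advertising data; `fit` is a helper defined here for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
tv = rng.uniform(0, 300, n)
is_large = (rng.random(n) > 0.5).astype(float)
is_small = 1 - is_large  # the reversed 0/1 coding
y = 3.0 + 0.045 * tv + 0.06 * is_large + rng.normal(0, 1.5, n)

def fit(*features):
    """Least squares coefficients: intercept first, then one per feature."""
    X = np.column_stack([np.ones(n), *features])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_large = fit(tv, is_large)
b_small = fit(tv, is_small)

print("with IsLarge:", b_large)
print("with IsSmall:", b_small)
```

The dummy coefficient flips sign and the TV coefficient is unchanged. The intercept absorbs the difference (it shifts by the size of the dummy coefficient), so both models make identical predictions.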
In [26]:
# set a seed for reproducibility
# assign roughly one third of observations to each group
nums = np.random.rand(len(data))
mask_suburban = (nums > 0.33) & (nums < 0.66)
mask_urban = nums > 0.66
data['Area'] = 'rural'
data.loc[mask_suburban, 'Area'] = 'suburban'
data.loc[mask_urban, 'Area'] = 'urban'
We have to represent Area numerically, but we can't simply code it as 0=rural, 1=suburban, 2=urban because that would imply an ordered relationship between suburban and urban (and thus urban is somehow "twice" the suburban category).
Instead, we create another dummy variable:
In [27]:
# create three dummy variables using get_dummies, then exclude the first dummy column
area_dummies = pd.get_dummies(data.Area, prefix='Area').iloc[:, 1:]
# concatenate the dummy variable columns onto the original DataFrame (axis=0 means rows, axis=1 means columns)
data = pd.concat([data, area_dummies], axis=1)
Here is how we interpret the coding:
Why do we only need two dummy variables, not three? Because two dummies capture all of the information about the Area feature, and implicitly define rural as the baseline level. (In general, if you have a categorical feature with k levels, you create k-1 dummy variables.)
If this is confusing, think about why we only needed one dummy variable for Size (IsLarge), not two dummy variables (IsSmall and IsLarge).
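One way to see why the third dummy is redundant: the three dummy columns always sum to 1, which duplicates the intercept column. The sketch below builds the dummies by hand with NumPy for six made-up observations and checks the rank of each design matrix.

```python
import numpy as np

# hypothetical Area values for six markets
areas = np.array(["rural", "suburban", "urban", "rural", "urban", "suburban"])
levels = ["rural", "suburban", "urban"]

# one 0/1 column per level (three dummies in total)
all_dummies = np.column_stack([(areas == level).astype(int) for level in levels])

intercept = np.ones((len(areas), 1), dtype=int)
X_three = np.hstack([intercept, all_dummies])         # intercept + 3 dummies
X_two = np.hstack([intercept, all_dummies[:, 1:]])    # intercept + 2 dummies (rural dropped)

# 4 columns but only rank 3: one column carries no new information
rank_three = np.linalg.matrix_rank(X_three)
# 3 columns and rank 3: every column is needed
rank_two = np.linalg.matrix_rank(X_two)
print(rank_three, rank_two)

# a rural market is identified by both remaining dummies being 0
print(X_two[areas == "rural"])
```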
Let's include the two new dummy variables in the model:
In [28]:
# create X and y
feature_cols = ['TV', 'Radio', 'Newspaper', 'IsLarge', 'Area_suburban', 'Area_urban']
X = data[feature_cols]
y = data.Sales
# instantiate, fit
lm = LinearRegression()
lm.fit(X, y)
# print coefficients
list(zip(feature_cols, lm.coef_))
How do we interpret the coefficients?
A final note about dummy encoding: If you have categories that can be ranked (e.g., strongly disagree, disagree, neutral, agree, strongly agree), you can potentially use a single numeric feature instead of dummy variables and represent the categories numerically (such as 1, 2, 3, 4, 5).
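A minimal sketch of that ordinal coding, using made-up survey responses. Note the assumption this encoding builds in: a linear model will treat the gap between each pair of adjacent levels as equal.

```python
# hypothetical survey responses
responses = ["agree", "neutral", "strongly disagree", "strongly agree", "disagree"]

# the order of this list defines the numeric code for each category
ordered_levels = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
level_to_code = {level: code for code, level in enumerate(ordered_levels, start=1)}

# replace each response with its rank
encoded = [level_to_code[r] for r in responses]
print(encoded)  # [4, 3, 1, 5, 2]
```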
You could certainly go very deep into linear regression, and learn how to apply it really, really well. It's an excellent way to start your modeling process when working on a regression problem. However, it is limited by the fact that it can only make good predictions if there is a linear relationship between the features and the response, which is why more complex methods (with higher variance and lower bias) will often outperform linear regression.
Therefore, we want you to understand linear regression conceptually, understand its strengths and weaknesses, be familiar with the terminology, and know how to apply it. However, we also want to spend time on many other machine learning models, which is why we aren't going deeper here.