Concepts and data from "An Introduction to Statistical Learning, with applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani " available at www.StatLearning.com.
For Tables reference see http://data8.org/datascience/tables.html
In [1]:
# HIDDEN
# This useful nonsense should just go at the top of your notebook.
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plots
import numpy as np
from sklearn import linear_model
plots.style.use('fivethirtyeight')
plots.rc('lines', linewidth=1, color='r')
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
# datascience version number of last run of this notebook
version.__version__
import sys
sys.path.append("..")
from ml_table import ML_Table
import locale
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
Out[1]:
In [2]:
# Getting the data
advertising = ML_Table.read_table("./data/Advertising.csv")
advertising = advertising.drop(0)  # drop the unnamed row-index column from the CSV
advertising
Out[2]:
The multiple linear regression model takes the form
$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon,$
where $X_j$ represents the $j$th predictor and $\beta_j$ quantifies the association between that variable and the response. We interpret $\beta_j$ as the average effect on $Y$ of a one-unit increase in $X_j$, holding all other predictors fixed.
In the advertising example, this becomes $\mathrm{sales} = \beta_0 + \beta_1 \times \mathrm{TV} + \beta_2 \times \mathrm{radio} + \beta_3 \times \mathrm{newspaper} + \epsilon$.
In [3]:
advertising.linear_regression('Sales').params
Out[3]:
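As a cross-check, the same coefficients can be recovered with scikit-learn, which is already imported in the setup cell. A minimal sketch, assuming the predictor columns are named TV, Radio and Newspaper as in the CSV:
# Cross-check of the coefficients with scikit-learn (sketch)
X_sk = np.array([list(row) for row in advertising.select(['TV', 'Radio', 'Newspaper']).rows])
lm = linear_model.LinearRegression()
lm.fit(X_sk, advertising['Sales'])
lm.intercept_, lm.coef_
The intercept and slopes should agree with the params output above.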
In [4]:
adver_model = advertising.linear_regression('Sales').model
In [5]:
# Predicted sales with zero spending on all three media, i.e. the intercept
adver_model(0, 0, 0)
Out[5]:
In [6]:
ad2 = advertising.drop('Newspaper')
ad2.linear_regression('Sales').summary()
Out[6]:
In [9]:
# A linear model with two input variables is a plane
ad2.plot_fit('Sales', ad2.linear_regression('Sales').model, width=8)
Out[9]:
At this point "ISL" skips over how to compute the standard error of the multiple regression parameters - relying on R to just produce the answer. It requires some matrix notation and a numerical computation of the matrix inverse, but involves a bunch of standard terminology that is specific to the inference aspect, as opposed to the general notion in linear algebra of approximating a function over a basis.
A nice treatment can be found at this reference
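In matrix form the computation is short; the following is a standard result, stated here for reference. Stack the observations into the design matrix $X$ (one row per observation, with a leading 1 for the intercept) and let $\hat{\beta}$ be the fitted coefficient vector. The estimated variance-covariance matrix of the coefficients is
$\widehat{\mathrm{Var}}(\hat{\beta}) = \hat{\sigma}^2 (X^T X)^{-1}$, with $\hat{\sigma}^2 = RSS/(n-p-1)$,
and the standard errors are the square roots of its diagonal entries. The cells below carry out exactly this computation.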
In [10]:
# response vector
Y = advertising['Sales']
labels = [lbl for lbl in advertising.labels if lbl != 'Sales']
p = len(labels) # number of predictors
n = len(Y) # number of observations
labels
Out[10]:
In [11]:
# Transform the table into a matrix
advertising.select(labels).rows
Out[11]:
In [12]:
# Design matrix: prepend a 1 to each row for the intercept term
X = np.array([np.append([1], row) for row in advertising.select(labels).rows])
In [14]:
# Coefficient vector: intercept followed by the slopes
b0, slopes = advertising.linear_regression('Sales').params
b = np.append([b0], slopes)
In [15]:
np.shape(X), np.shape(b)
Out[15]:
In [16]:
# Residual vector: fitted values minus observed response
res = np.dot(X, b) - advertising['Sales']
In [17]:
# Unbiased estimate of the residual variance, RSS/(n-p-1)
sigma2 = sum(res**2)/(n-p-1)
sigma2
Out[17]:
In [18]:
Xt = np.transpose(X)
In [19]:
# The matrix that needs to be inverted is only (p+1) x (p+1)
np.dot(Xt, X)
Out[19]:
In [20]:
np.shape(np.dot(Xt, X))
Out[20]:
In [21]:
# Variance-covariance matrix of the coefficient estimates
SEM = sigma2*np.linalg.inv(np.dot(Xt, X))
SEM
Out[21]:
In [22]:
# The variances of the coefficients are the diagonal elements
variances = [SEM[i,i] for i in range(len(SEM))]
variances
Out[22]:
In [23]:
# Standard errors of the coefficients
SE = [np.sqrt(v) for v in variances]
SE
Out[23]:
In [24]:
# t-statistics
b/SE
Out[24]:
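Each t-statistic can be compared against a $t_{n-p-1}$ distribution to obtain a two-sided p-value. A sketch using scipy.stats (an extra import; scipy availability is assumed):
from scipy import stats
# Two-sided p-values for each coefficient's t-statistic
t_stats = b / SE
[2 * stats.t.sf(abs(t), n - p - 1) for t in t_stats]
These should match the p-values in the summary() output in the next cell.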
In [25]:
advertising.linear_regression('Sales').summary()
Out[25]:
In [26]:
advertising.RSS_model('Sales', adver_model)
Out[26]:
In [27]:
advertising.R2_model('Sales', adver_model)
Out[27]:
The summary above shows that spending on newspaper advertising appears to have no effect on sales. The apparent effect when looking at newspaper versus sales in isolation reflects the tendency to spend more on newspaper in markets where radio spending is also higher.
In [28]:
advertising.Cor()
Out[28]:
In [29]:
advertising.F_model('Sales', adver_model)
Out[29]:
In [30]:
advertising.lm_fit('Sales', adver_model)
Out[30]:
In [31]:
# The same tool applied to the one-variable (TV-only) model
advertising.lm_fit('Sales', advertising.regression_1d('Sales', 'TV'), 'TV')
Out[31]:
Sometimes we want to test that a particular subset of $q$ of the coefficients is zero. This corresponds to the null hypothesis
$H_0: \beta_{p-q+1} = \beta_{p-q+2} = \cdots = \beta_p = 0,$
where for convenience we have put the variables chosen for omission at the end of the list. In this case we fit a second model that uses all the variables except those last $q$. Suppose that the residual sum of squares for that model is $RSS_0$. Then the appropriate F-statistic is
$F = \frac{(RSS_0 - RSS)/q}{RSS/(n-p-1)}.$
In [32]:
ad2_model = ad2.linear_regression('Sales').model
ad2.lm_fit('Sales', ad2_model)
Out[32]:
In [33]:
RSS0 = ad2.RSS_model('Sales', ad2_model)
RSS = advertising.RSS_model('Sales', adver_model)
# F-statistic for dropping Newspaper: q = 1, with p = 3 predictors in the full model
((RSS0 - RSS)/1) / (RSS/(advertising.num_rows - 3 - 1))
Out[33]:
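Under $H_0$ this statistic follows an $F_{q,\,n-p-1}$ distribution, so a p-value can be read off its tail. A sketch with scipy.stats (scipy availability is assumed):
from scipy import stats
# F-statistic for dropping Newspaper (q = 1) and its tail probability
q = 1
F = ((RSS0 - RSS) / q) / (RSS / (advertising.num_rows - 3 - 1))
F, stats.f.sf(F, q, advertising.num_rows - 3 - 1)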
In [35]:
# First step of forward selection: RSS of each one-predictor model
input_labels = [lbl for lbl in advertising.labels if lbl != 'Sales']
fwd_labels = ['Sales']
for lbl in input_labels:
    fwd = advertising.select(fwd_labels + [lbl])
    model = fwd.linear_regression('Sales').model
    print(lbl, fwd.RSS_model('Sales', model))
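The loop above is only the first round of forward selection. Below is a sketch of the full greedy procedure, which repeatedly adds whichever remaining predictor most reduces RSS; the helper rss_with is a hypothetical name, built only from methods already used in this notebook:
def rss_with(predictors):
    # RSS of the linear model fit on the given predictor columns
    tbl = advertising.select(['Sales'] + predictors)
    return tbl.RSS_model('Sales', tbl.linear_regression('Sales').model)

chosen = []
remaining = list(input_labels)
while remaining:
    # Greedily add the predictor that yields the lowest RSS
    best = min(remaining, key=lambda lbl: rss_with(chosen + [lbl]))
    chosen.append(best)
    remaining.remove(best)
    print(chosen, rss_with(chosen))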