Concepts and data from "An Introduction to Statistical Learning, with applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani, available at www.StatLearning.com.

For Tables reference see http://data8.org/datascience/tables.html

For matplotlib 2D and 3D plotting in IPython, see http://jeffskinnerbox.me/notebooks/matplotlib-2d-and-3d-plotting-in-ipython.html


In [49]:
# HIDDEN
# For Tables reference see http://data8.org/datascience/tables.html
# This useful nonsense should just go at the top of your notebook.
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plots
import numpy as np
from sklearn import linear_model
plots.style.use('fivethirtyeight')
plots.rc('lines', linewidth=1, color='r')
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
# datascience version number of last run of this notebook
version.__version__


import sys
sys.path.append("..")
from ml_table import ML_Table

import locale
locale.setlocale( locale.LC_ALL, 'en_US.UTF-8' )


Out[49]:
'en_US.UTF-8'

Acquiring and seeing trends in multidimensional data


In [50]:
# Getting the data
advertising = ML_Table.read_table("./data/Advertising.csv")
advertising.relabel(0, "id")


Out[50]:
id TV Radio Newspaper Sales
1 230.1 37.8 69.2 22.1
2 44.5 39.3 45.1 10.4
3 17.2 45.9 69.3 9.3
4 151.5 41.3 58.5 18.5
5 180.8 10.8 58.4 12.9
6 8.7 48.9 75 7.2
7 57.5 32.8 23.5 11.8
8 120.2 19.6 11.6 13.2
9 8.6 2.1 1 4.8
10 199.8 2.6 21.2 10.6

... (190 rows omitted)

FIGURE 3.1. For the Advertising data, the least squares fit for the regression of sales onto TV is shown. The fit is found by minimizing the sum of squared errors.

Each line segment represents an error, and the fit makes a compromise by averaging their squares. In this case a linear fit captures the essence of the relationship, although it is somewhat deficient in the left of the plot.
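Since sklearn's linear_model is imported at the top of the notebook, the same line can be fit with scikit-learn as an independent cross-check. A rough sketch (not an executed cell), assuming the advertising table loaded above:

# Sketch: fit Sales ~ TV with scikit-learn and compare to the summary below.
X = advertising['TV'].reshape(-1, 1)      # sklearn expects a 2-D feature matrix
y = advertising['Sales']
sk_fit = linear_model.LinearRegression().fit(X, y)
sk_fit.intercept_, sk_fit.coef_           # should be close to (7.03, 0.0475)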


In [51]:
ax = advertising.plot_fit_1d('Sales', 'TV', advertising.regression_1d('Sales', 'TV'))
_ = ax.set_xlim(-20,300)



In [52]:
lr = advertising.linear_regression('Sales', 'TV')

In [5]:
advertising.plot_fit_1d('Sales', 'TV', lr.model)


Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x1151b8e48>

In [48]:
advertising.lm_summary_1d('Sales', 'TV')


Out[48]:
Param Coefficient Std Error t-statistic 95% CI 99% CI
intercept 7.03259 0.457843 15.3603 [ 6.11690767 7.94827943] [ 5.65906473 8.40612237]
TV 0.0475366 0.00269061 17.6676 [ 0.04215543 0.05291785] [ 0.03946482 0.05560846]

Let $\hat{y}_i = \hat{\beta_0} + \hat{\beta_1}x_i$ be the prediction for $Y$ based on the $i$-th value of $X$.


In [7]:
# Get the actual parameters that are captured within the model
advertising.regression_1d_params('Sales', 'TV')


Out[7]:
(7.0325935491276965, 0.047536640433019729)

In [8]:
# Regression yields a model.  The computational representation of a model is a function
# that can be applied to an input ('TV') to get an estimate of an output ('Sales').
advertise_model_tv = advertising.regression_1d('Sales', 'TV')

In [9]:
# Sales with no TV advertising
advertise_model_tv(0)


Out[9]:
7.0325935491276965

In [10]:
# Sales with 100 units TV advertising
advertise_model_tv(100)


Out[10]:
11.786257592429671

In [11]:
# Here's the output of the model applied to the input data
advertise_model_tv(advertising['TV'])


Out[11]:
array([ 17.97077451,   9.14797405,   7.85022376,  14.23439457,
        15.62721814,   7.44616232,   9.76595037,  12.74649773,
         7.44140866,  16.53041431,  10.17476548,  17.23871025,
         8.16396559,  11.66741599,  16.73482186,  16.32125309,
        10.25557777,  20.40940417,  10.32212907,  14.03474068,
        17.41459582,  18.31779199,   7.6600772 ,  17.88520856,
         9.99412625,  19.52997632,  13.82557947,  18.44614092,
        18.85970969,  10.38868036,  20.95607553,  12.39948025,
        11.653155  ,  19.65832525,  11.58185004,  20.85149492,
        19.72012288,  10.58358059,   9.08142275,  17.87094757,
        16.65876324,  15.44657891,  20.98935118,  16.86792445,
         8.22576322,  15.35625929,  11.2966302 ,  18.43663359,
        17.83291826,  10.21279479,  16.53041431,  11.80527225,
        17.31952254,  15.71278409,  19.52046899,  16.48763133,
         7.37961102,  13.50708398,  17.05331735,  17.04856369,
         9.57580381,  19.45391769,  18.4081116 ,  11.91460652,
        13.26464711,  10.31262174,   8.52999772,  13.65444756,
        18.31779199,  17.3385372 ,  16.49713866,  12.25211667,
         8.30657551,  13.18383482,  17.17691262,   7.83596277,
         8.33985116,  12.76075872,   7.28929141,  12.54684384,
        10.66439288,  18.43187992,  10.61210257,  10.28409975,
        17.18166628,  16.21667248,  10.65963921,  12.29489965,
        11.2300789 ,  12.25211667,  13.41676436,   8.39214147,
        17.38132017,  18.95953663,  12.13802873,  14.79532693,
        16.4258337 ,  15.82211837,  20.80395828,  13.45954734,
        17.60474238,  21.12245377,  20.3523602 ,  15.96472829,
        18.3558213 ,  13.58789626,   8.22100956,  11.32990584,
         7.65532354,  19.17345152,  17.76636696,  18.52219954,
        15.38478127,  16.99627338,  10.74995883,  10.60259525,
        13.6496939 ,  10.66439288,  13.00794925,   7.95480437,
        13.74952084,   7.92628239,  17.68080101,  12.88435399,
        17.94225253,  11.17778859,   7.40337934,  10.84503211,
        17.50491544,   9.86577732,   7.0658692 ,  19.63931059,
         7.43190133,  17.48114712,   8.78669558,   9.32861328,
         8.24953154,  20.04337204,   9.07666909,  15.82211837,
        10.52178296,  16.2404408 ,  17.51442276,  12.00492614,
        11.60561836,  13.7019842 ,  18.44614092,  18.5935045 ,
         8.83898589,   9.15748138,  20.37612852,  12.78452704,
        16.4258337 ,  15.17562006,  15.95997462,   7.22749377,
        11.49628409,  14.15358229,   7.58877224,  13.29316909,
        15.23266402,  11.10648363,  15.98849661,  14.80483426,
        12.60388781,  18.17993573,   7.88349941,  16.86317079,
        17.2719859 ,  20.54726042,   9.40942557,  14.8523709 ,
         7.9643117 ,  15.0377638 ,  17.60474238,  20.19548929,
        18.84069503,  15.12332975,  20.18598196,  14.9046612 ,
        14.47683144,  17.41934948,   9.70415274,  20.70413134,
        19.09739289,  16.77760484,  13.66395489,  16.11684554,
        20.62807271,   7.92152873,   8.91029085,  10.6216099 ,
         7.85022376,  14.96170517,  14.14882862,   8.84849321,
        11.51054508,  15.44657891,  20.51398478,  18.06584779])

The residual is the difference between the observed output and the model's prediction.


In [12]:
residual = advertising['Sales'] - advertise_model_tv(advertising['TV'])
residual


Out[12]:
array([ 4.12922549,  1.25202595,  1.44977624,  4.26560543, -2.72721814,
       -0.24616232,  2.03404963,  0.45350227, -2.64140866, -5.93041431,
       -1.57476548,  0.16128975,  1.03603441, -1.96741599,  2.26517814,
        6.07874691,  2.24442223,  3.99059583,  0.97787093,  0.56525932,
        0.58540418, -5.81779199, -2.0600772 , -2.38520856, -0.29412625,
       -7.52997632,  1.17442053, -2.54614092,  0.04029031,  0.11131964,
        0.44392447, -0.49948025, -2.053155  , -2.25832525, -2.08185004,
       -8.05149492,  5.67987712,  4.11641941,  1.01857725,  3.62905243,
       -0.05876324,  1.65342109, -0.28935118, -3.96792445,  0.27423678,
       -0.45625929, -0.6966302 ,  4.76336641, -3.03291826, -0.51279479,
       -5.13041431, -1.10527225,  5.28047746,  5.48721591,  0.67953101,
        7.21236867, -1.87961102, -0.30708398,  6.74668265,  1.35143631,
       -1.47580381,  4.74608231, -2.7081116 ,  2.08539348,  4.73535289,
       -1.01262174,  0.97000228, -0.25444756,  0.58220801,  4.9614628 ,
        1.80286134,  0.14788333,  0.49342449, -2.18383482, -0.17691262,
        0.86403723, -1.43985116,  1.43924128, -1.98929141, -1.54684384,
        1.13560712, -6.13187992,  0.68789743,  3.31590025,  4.51833372,
       -1.01667248,  1.34036079,  3.70510035,  1.6699211 ,  4.44788333,
       -2.21676436, -1.09214147,  2.01867983,  3.24046337, -0.63802873,
        2.10467307, -4.7258337 , -0.32211837,  4.59604172,  3.74045266,
       -5.90474238,  2.67754623, -5.5523602 , -1.26472829,  2.3441787 ,
        5.61210374, -1.02100956, -2.62990584, -2.35532354,  0.62654848,
       -4.36636696,  3.27780046, -1.28478127, -1.09627338,  3.85004117,
        1.99740475, -1.4496939 , -1.26439288,  2.89205075, -1.35480437,
        1.75047916, -0.92628239, -6.08080101,  2.31564601,  1.75774747,
       -0.57778859, -0.80337934, -2.04503211,  7.19508456, -0.16577732,
       -5.4658692 , -6.93931059, -1.73190133,  2.11885288,  2.01330442,
        2.27138672,  1.25046846,  0.75662796,  0.52333091,  4.87788163,
        0.37821704,  2.9595592 ,  2.58557724, -1.60492614, -0.20561836,
       -3.4019842 , -5.24614092,  6.8064955 ,  2.06101411,  0.94251862,
       -4.27612852, -1.18452704,  0.1741663 ,  3.82437994, -0.35997462,
       -4.02749377,  3.80371591, -4.05358229, -0.28877224, -0.39316909,
       -0.83266402,  2.19351637, -1.08849661,  3.19516574, -0.70388781,
       -6.27993573,  0.11650059, -4.66317079, -0.1719859 , -5.54726042,
       -1.00942557, -0.3523709 , -0.3643117 , -3.3377638 , -6.10474238,
        6.80451071,  1.35930497, -3.42332975, -8.38598196, -2.3046612 ,
       -3.97683144, -5.21934948, -1.00415274,  5.49586866, -1.49739289,
        5.82239516, -3.36395489,  1.18315446, -4.72807271, -1.22152873,
        1.88970915, -0.7216099 , -1.95022376,  4.63829483,  3.15117138,
       -1.24849321, -1.81054508, -2.64657891,  4.98601522, -4.66584779])

The residual is not very useful directly because it balances over-estimates and under-estimates to produce the best overall fit. We can assess the overall goodness of fit with the residual sum of squares (RSS).


In [13]:
# Residual Sum of Squares
RSS = sum(residual*residual)
RSS


Out[13]:
2102.5305831313512

In [14]:
# This is common enough that we have it provided as a method
advertising.RSS_model('Sales', advertising.regression_1d('Sales', 'TV'), 'TV')


Out[14]:
2102.5305831313508

In [15]:
# And we should move toward a general regression framework
advertising.RSS_model('Sales', advertising.linear_regression('Sales', 'TV').model, 'TV')


Out[15]:
2102.5305831313512

With this, we can build an independent model for each of the inputs and look at the associated RSS.


In [16]:
advertising_models = Table().with_column('Input', ['TV', 'Radio', 'Newspaper'])
advertising_models['Model'] = advertising_models.apply(lambda i: advertising.regression_1d('Sales', i), 'Input')
advertising_models['RSS'] = advertising_models.apply(lambda i, m: advertising.RSS_model('Sales', m, i), ['Input', 'Model'])
advertising_models


Out[16]:
Input      Model                      RSS
TV         <function ... at 0x...>    2102.53
Radio      <function ... at 0x...>    3618.48
Newspaper  <function ... at 0x...>    5134.8

In [17]:
advertising_models = Table().with_column('Input', ['TV', 'Radio', 'Newspaper'])
advertising_models['Model'] = advertising_models.apply(lambda i: advertising.linear_regression('Sales', i).model, 'Input')
advertising_models['RSS'] = advertising_models.apply(lambda i, m: advertising.RSS_model('Sales', m, i), ['Input', 'Model'])
advertising_models


Out[17]:
Input      Model                      RSS
TV         <function ... at 0x...>    2102.53
Radio      <function ... at 0x...>    3618.48
Newspaper  <function ... at 0x...>    5134.8

We can look at how well each of these inputs predicts the output by visualizing the residuals. The magnitude of the RSS gives a sense of the error.
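Beyond the fitted-line plots below, a plain matplotlib residual plot is another way to see this. A rough sketch (not an executed cell), assuming the advertising table and the advertise_model_tv model from earlier cells:

# Sketch: residuals of the TV model plotted against the TV input.
residual_tv = advertising['Sales'] - advertise_model_tv(advertising['TV'])
fig, ax = plots.subplots()
ax.scatter(advertising['TV'], residual_tv, s=10)
ax.axhline(0, color='k', linewidth=1)
ax.set_xlabel('TV')
ax.set_ylabel('Residual')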


In [18]:
# Plot the fit of Sales against each input using its corresponding model
for col, mdl in zip(advertising_models['Input'], advertising_models['Model']):
    advertising.plot_fit_1d('Sales', col, mdl)



In [19]:
# RSS at an arbitrary point (b0, b1)
res = lambda b0, b1: advertising.RSS('Sales', b0 + b1*advertising['TV'])

In [20]:
res(7.0325935491276965, 0.047536640433019729)


Out[20]:
2102.5305831313508

Figure 3.2 - RSS and least squares regression

Regression using least squares finds the parameters $b_0$ and $b_1$ that minimize the RSS. The least squares regression line estimates the population regression line.

Figure 3.2 shows (what is claimed to be) the RSS contour and surface around the regression point. The computed analog of Figure 3.2 is shown below, with the roles of $b_0$ and $b_1$ reversed to match the tuple returned from the regression method. The plot in the text is incorrect in some important ways. The RSS is not radially symmetric around ($b_0$, $b_1$): lines with a larger intercept and smaller slope, or vice versa, come very close to the minimum, i.e., the surface is nearly flat along the upper-left to lower-right diagonal. This is especially true where the fit is not very good, since the output depends on more than this one input.

Just because a process minimizes the error does not mean that the minimum is sharply defined or that the error surface has no structure. Below we go a bit beyond the text to illustrate this more fully.
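To see the trough directly, the RSS can also be evaluated by hand over a small grid of ($b_0$, $b_1$) values with plain numpy. A rough sketch (not an executed cell), assuming the advertising table from above:

# Sketch: RSS over a small grid of intercept (b0) and slope (b1) values.
x, y = advertising['TV'], advertising['Sales']
for b0 in np.linspace(5.5, 8.5, 5):
    row = [np.sum((y - (b0 + b1*x))**2) for b1 in np.linspace(0.040, 0.055, 5)]
    print(round(b0, 2), np.round(row, 1))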


In [21]:
ax = advertising.RSS_contour('Sales', 'TV', sensitivity=0.2)



In [22]:
ax = advertising.RSS_wireframe('Sales', 'TV', sensitivity=0.2)



In [23]:
# The minimum point
advertising.linear_regression('Sales', 'TV').params


Out[23]:
(7.032593549127693, array([ 0.04753664]))

In [24]:
# Some other points along the trough
points = [(0.042, 8.0), (0.044, 7.6), (0.050, 6.6), (0.054, 6.0)]
ax = advertising.RSS_contour('Sales', 'TV', sensitivity=0.2)
ax.plot([b0 for b1, b0 in points], [b1 for b1, b0 in points], 'ro')
[advertising.RSS('Sales', b0 + b1*advertising['TV']) for b1,b0 in points]


Out[24]:
[2152.1942799600001,
 2121.3260830399995,
 2112.4219749999997,
 2165.1585052400001]

In [25]:
# Models as lines corresponding to points along the near-minimal trough.
ax = advertising.plot_fit_1d('Sales', 'TV', advertising.linear_regression('Sales', 'TV').model)
for b1, b0 in points:
    fit = lambda x: b0 + b1*x
    ax.plot([0, 300], [fit(0), fit(300)])
_ = ax.set_xlim(-20,300)


3.1.2 Assessing the Accuracy of the Coefficient Estimates

The particular minimum found through least squares regression is affected by the particular sample of the population that is observed and used to estimate the coefficients of the underlying population model.

To see this we can generate a synthetic population based on an ideal model plus noise. Here we can peek at the entire population (which in most settings cannot be observed). We then take samples of this population and fit a regression to those. We can then see how these regression lines differ from the population regression line, which is not exactly the ideal model.


In [26]:
# The ideal (population) model
def model(x):
    return 3*x + 2

# A synthetic population: uniform x in [-2, 2] with normal noise added to model(x)
def population(n, noise_scale=1):
    sample = ML_Table.runiform('x', n, -2, 2)
    noise = ML_Table.rnorm('e', n, sd=noise_scale)
    sample['Y'] = sample.apply(model, 'x') + noise['e']
    return sample

In [27]:
data = population(100, 2)

In [28]:
data.scatter('x')



In [29]:
ax = data.plot_fit('Y', data.linear_regression('Y').model)



In [30]:
data.linear_regression('Y').params


Out[30]:
(2.1918918645804979, array([ 3.12848859]))

In [31]:
# A random sample of the population
sample = data.sample(10)
sample.plot_fit('Y', sample.linear_regression('Y').model)


Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x1117dada0>

In [32]:
nsamples = 5
ax = data.plot_fit('Y', data.linear_regression('Y').model, linewidth=3)
for s in range(nsamples):
    fit = data.sample(10).linear_regression('Y').model
    ax.plot([-2, 2], [fit(-2), fit(2)], linewidth=1)


"The property of unbiasedness holds for the least squares coefficient estimates given by (3.4) as well: if we estimate β0 and β1 on the basis of a particular data set, then our estimates won’t be exactly equal to β0 and β1. But if we could average the estimates obtained over a huge number of data sets, then the average of these estimates would be spot on!"

To compute the standard errors associated with $\hat{\beta}_0$ and $\hat{\beta}_1$, we use the following formulas.

The slope, $b_1$:

$SE(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}$

The intercept, $b_0$:

$SE(\hat{\beta}_0)^2 = \sigma^2 \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right]$,

where $\sigma^2 = \mathrm{Var}(\epsilon)$.

In general, $\sigma^2$ is not known, but can be estimated from the data. The estimate of $\sigma$ is known as the residual standard error, and is given by the formula $RSE = \sqrt{RSS/(n-2)}$.
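These formulas can be checked by hand against the SE_1d_params method used below. A rough sketch (not an executed cell), assuming the advertising table and the advertise_model_tv model from earlier cells:

# Sketch: residual standard error and coefficient standard errors for Sales ~ TV.
x, y = advertising['TV'], advertising['Sales']
n = len(x)
rss = np.sum((y - advertise_model_tv(x))**2)
rse = np.sqrt(rss / (n - 2))                      # estimate of sigma
sxx = np.sum((x - np.mean(x))**2)
se_b1 = np.sqrt(rse**2 / sxx)
se_b0 = np.sqrt(rse**2 * (1/n + np.mean(x)**2 / sxx))
rse, se_b0, se_b1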


In [33]:
adv_sigma = advertising.RSE_model('Sales', advertising.linear_regression('Sales', 'TV').model, 'TV')

In [34]:
adv_sigma


Out[34]:
3.2586563686504624

In [35]:
b0, b1 = advertising.linear_regression('Sales', 'TV').params
b0, b1


Out[35]:
(7.032593549127693, array([ 0.04753664]))

In [36]:
advertising.RSS_model('Sales', advertising.linear_regression('Sales', 'TV').model, 'TV')


Out[36]:
2102.5305831313512

In [ ]:


In [37]:
SE_b0, SE_b1 = advertising.SE_1d_params('Sales', 'TV')
SE_b0, SE_b1


Out[37]:
(0.45784294027347844, 0.0026906071877968703)

In [38]:
# b0 95% confidence interval
(b0-2*SE_b0, b0+2*SE_b0)


Out[38]:
(6.116907668580736, 7.94827942967465)

In [39]:
# b1 95% confidence interval
(b1-2*SE_b1, b1+2*SE_b1)


Out[39]:
(array([ 0.04215543]), array([ 0.05291785]))

In [40]:
# t-statistic of the intercept (b0)
b0/SE_b0


Out[40]:
15.360275174117545

In [41]:
# t-statistic of the slope (b1)
b1/SE_b1


Out[41]:
array([ 17.6676256])

In [42]:
# Similar to summary of a linear model in R
advertising.lm_summary_1d('Sales', 'TV')


Out[42]:
Param Coefficient Std Error t-statistic 95% CI 99% CI
intercept 7.03259 0.457843 15.3603 [ 6.11690767 7.94827943] [ 5.65906473 8.40612237]
TV 0.0475366 0.00269061 17.6676 [ 0.04215543 0.05291785] [ 0.03946482 0.05560846]

In [43]:
# We can just barely reject the null hypothesis that Newspaper
# advertising has no effect on sales
advertising.lm_summary_1d('Sales', 'Newspaper')


Out[43]:
Param Coefficient Std Error t-statistic 95% CI 99% CI
intercept 12.3514 0.62142 19.8761 [ 11.10856669 13.59424744] [ 10.48714651 14.21566763]
Newspaper 0.0546931 0.0165757 3.29959 [ 0.02154165 0.08784454] [ 0.00496593 0.10442026]

3.1.3 Assessing the Accuracy of the Model

Once we have rejected the null hypothesis in favor of the alternative hypothesis, it is natural to want to quantify the extent to which the model fits the data. The quality of a linear regression fit is typically assessed using two related quantities: the residual standard error (RSE) and the $R^2$ statistic.

The RSE provides an absolute measure of lack of fit of the model to the data.


In [44]:
adver_model = advertising.regression_1d('Sales', 'TV')
advertising.RSE_model('Sales', adver_model, 'TV')


Out[44]:
3.2586563686504619

The $R^2$ statistic provides an alternative measure of fit. It takes the form of a proportion—the proportion of variance explained—and so it always takes on a value between 0 and 1, and is independent of the scale of Y.

To calculate $R^2$, we use the formula

$R^2 = \frac{TSS−RSS}{TSS} = 1 − \frac{RSS}{TSS}$

where $TSS = \sum (y_i - \bar{y})^2$ is the total sum of squares.
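A rough sketch (not an executed cell) of computing $R^2$ directly from this formula, assuming the advertising table and the advertise_model_tv model from earlier cells:

# Sketch: R^2 = 1 - RSS/TSS for the TV model.
y = advertising['Sales']
rss = np.sum((y - advertise_model_tv(advertising['TV']))**2)
tss = np.sum((y - np.mean(y))**2)
1 - rss/tss                                       # should be about 0.61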


In [45]:
advertising.R2_model('Sales', adver_model, 'TV')


Out[45]:
0.61187505085007121

In [46]:
# Applying the TV model to the other inputs gives negative R^2 values;
# there is more going on than any single input explains
advertising.R2_model('Sales', adver_model, 'Radio')


Out[46]:
-1.1406293912097016

In [47]:
advertising.R2_model('Sales', adver_model, 'Newspaper')


Out[47]:
-1.0808654511451161