Intro to Statistical Machine Learning

Stats 208

Lecture 1

Prof. Sharpnack

Lecture slides at course github page

Some content in these slides is from STA 251 notes and STA 141B lectures.

The life satisfaction data example and some of the code are based on notebook 01 from Aurélien Geron's github page

Machine Learning

A computer program is said to learn from experience E, with respect to a class of tasks T and a performance measure P, if its performance on tasks in T, as measured by P, improves with experience E.

Examples of these categories are,

  • E: data (training)
  • P: loss (test), reward
  • T: classification, regression, expert selection, etc.

Inference vs. Prediction

  • statistical inference: is this effect significant? is the model correct? etc.
  • prediction: does this algorithm predict the response variable well?

Terms

  • supervised learning: predicting one variable from many others
  • predictor variables: X variables
  • response variable: Y variable
  • X: $n \times p$ design matrix / features
  • Y: $n$ label vector

In [38]:
## I will be using Python 3, for install instructions see 
## http://anson.ucdavis.edu/~jsharpna/DSBook/unit1/intro.html#installation-and-workflow

## The following packages are numpy (linear algebra), pandas (data munging), 
## sklearn (machine learning), matplotlib (graphics), statsmodels (statistical models)

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

In [39]:
## Lines that start with ! run a bash command

!ls -l ../../data/winequality-red.csv


-rw-r--r-- 1 jsharpna jsharpna 84199 Apr  1 09:30 ../../data/winequality-red.csv

In [40]:
!head ../../data/winequality-red.csv


"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.4;0.66;0;1.8;0.075;13;40;0.9978;3.51;0.56;9.4;5
7.9;0.6;0.06;1.6;0.069;15;59;0.9964;3.3;0.46;9.4;5
7.3;0.65;0;1.2;0.065;15;21;0.9946;3.39;0.47;10;7
7.8;0.58;0.02;2;0.073;9;18;0.9968;3.36;0.57;9.5;7

Wine dataset description

  • 84199 bytes (not large, feel free to load into memory)
  • header line with column names in quotation marks (")
  • each subsequent line contains unquoted floats
  • values separated by ;

Some Python basics:

  • file input/output
  • [f(a) for a in L] list comprehensions
  • iterables, basic types, built-in functions

In [41]:
datapath = "../../data/"
with open(datapath + 'winequality-red.csv','r') as winefile:
    header = winefile.readline()
    wine_list = [line.strip().split(';') for line in winefile]

In [42]:
wine_ar = np.array(wine_list,dtype=np.float64)

In [43]:
names = [name.strip('"') for name in header.strip().split(';')]
print(names)


['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality']

In [44]:
#Subselect the predictor X and response y
y = wine_ar[:,-1]
X = wine_ar[:,:-1]
n,p = X.shape

In [45]:
y.shape, X.shape #just checking


Out[45]:
((1599,), (1599, 11))

In [46]:
import statsmodels.api as sm

X = np.hstack((np.ones((n,1)),X)) #add intercept
wine_ols = sm.OLS(y,X) #Initialize the OLS 
wine_res = wine_ols.fit()

In [47]:
wine_res.summary()


Out[47]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.361
Model:                            OLS   Adj. R-squared:                  0.356
Method:                 Least Squares   F-statistic:                     81.35
Date:                Mon, 01 Apr 2019   Prob (F-statistic):          1.79e-145
Time:                        12:12:18   Log-Likelihood:                -1569.1
No. Observations:                1599   AIC:                             3162.
Df Residuals:                    1587   BIC:                             3227.
Df Model:                          11
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         21.9652     21.195      1.036      0.300     -19.607      63.538
x1             0.0250      0.026      0.963      0.336      -0.026       0.076
x2            -1.0836      0.121     -8.948      0.000      -1.321      -0.846
x3            -0.1826      0.147     -1.240      0.215      -0.471       0.106
x4             0.0163      0.015      1.089      0.276      -0.013       0.046
x5            -1.8742      0.419     -4.470      0.000      -2.697      -1.052
x6             0.0044      0.002      2.009      0.045       0.000       0.009
x7            -0.0033      0.001     -4.480      0.000      -0.005      -0.002
x8           -17.8812     21.633     -0.827      0.409     -60.314      24.551
x9            -0.4137      0.192     -2.159      0.031      -0.789      -0.038
x10            0.9163      0.114      8.014      0.000       0.692       1.141
x11            0.2762      0.026     10.429      0.000       0.224       0.328
==============================================================================
Omnibus:                       27.376   Durbin-Watson:                   1.757
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               40.965
Skew:                          -0.168   Prob(JB):                     1.27e-09
Kurtosis:                       3.708   Cond. No.                     1.13e+05
==============================================================================


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.13e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Linear model

$$f_\beta(x_i) = \beta_0 + \sum_{j=1}^p \beta_j x_{i,j}$$

Inference in linear models

  • statistically test for the significance of effects
  • requires normality, homoscedasticity, and that the linear model is correct
  • hard to obtain significance for an individual effect under collinearity (see the sketch below)
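
The sketch below (not from the original slides; the simulated variables are made up for illustration) shows the last point: two nearly collinear predictors each get a large standard error from a statsmodels OLS fit, even though together they predict the response well.

import numpy as np
import statsmodels.api as sm

np.random.seed(0)
x1 = np.random.randn(200)
x2 = x1 + 0.01 * np.random.randn(200)       # nearly identical to x1
y_sim = 1.0 + x1 + x2 + np.random.randn(200)

X_sim = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y_sim, X_sim).fit()
print(res.bse)        # standard errors for x1 and x2 are inflated by collinearity
print(res.rsquared)   # yet the overall fit is good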

Prediction perspective

  • think of OLS as a black-box model for predicting $Y | X$
  • how do we evaluate the performance of predictions?
  • how do we choose between multiple OLS models?

Supervised learning

Learning machine that takes $p$-dimensional data $x_i = (x_{i,1}, \ldots, x_{i,p})$ and predicts $y_i \in \mathcal Y$.

  • Task: Predict $y$ given $x$ as $f_\beta(x)$
  • Performance Metric: Loss measured with some function $\ell(\beta; x,y)$
  • Experience: Fit the model with training data $\{x_i,y_i\}_{i=1}^{n}$

Linear Regression

  • Fit: Compute $\hat \beta$ from OLS with training data $\{x_i,y_i\}_{i=1}^{n}$
  • Predict: For a new predictor $x_{n+1}$ predict $$\hat y = f_{\hat \beta}(x_{n+1}) = \hat \beta_0 + \sum_{j=1}^p \hat \beta_j x_{n+1,j}$$
  • Loss: Observe a new response $y_{n+1}$ and incur the loss $$\ell(\hat \beta; x_{n+1},y_{n+1}) = (f_{\hat \beta}(x_{n+1}) - y_{n+1})^2$$ (a toy worked example follows below)
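
For a toy worked example (numbers made up): with $p = 1$, $\hat \beta_0 = 1$, $\hat \beta_1 = 2$, and a new predictor $x_{n+1} = 3$, the prediction is $\hat y = 1 + 2 \cdot 3 = 7$; if we then observe $y_{n+1} = 6$, the loss is $(7 - 6)^2 = 1$.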

In [48]:
## The following uses pandas!

datapath = "../../data/"
oecd_bli = pd.read_csv(datapath + "oecd_bli_2015.csv", thousands=',')
oecd_bli = oecd_bli[oecd_bli["INEQUALITY"]=="TOT"]
oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")

# Load and prepare GDP per capita data

# Download data from http://goo.gl/j1MSKe (=> imf.org)
gdp_per_capita = pd.read_csv(datapath+"gdp_per_capita.csv", thousands=',', delimiter='\t',
                             encoding='latin1', na_values="n/a")
gdp_per_capita = gdp_per_capita.rename(columns={"2015": "GDP per capita"})
gdp_per_capita = gdp_per_capita.set_index("Country")

full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita, left_index=True, right_index=True)
full_country_stats = full_country_stats.sort_values(by="GDP per capita")

In [49]:
full_country_stats.head()


Out[49]:
Air pollution Assault rate Consultation on rule-making Dwellings without basic facilities Educational attainment Employees working very long hours Employment rate Homicide rate Household net adjusted disposable income Household net financial wealth ... Time devoted to leisure and personal care Voter turnout Water quality Years in education Subject Descriptor Units Scale Country/Series-specific Notes GDP per capita Estimates Start After
Country
Brazil 18.0 7.9 4.0 6.7 45.0 10.41 67.0 25.5 11664.0 6844.0 ... 14.97 79.0 72.0 16.3 Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 8669.998 2014.0
Mexico 30.0 12.8 9.0 4.2 37.0 28.83 61.0 23.4 13085.0 9056.0 ... 13.89 63.0 67.0 14.4 Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 9009.280 2015.0
Russia 15.0 3.8 2.5 15.1 94.0 0.16 69.0 12.8 19292.0 3412.0 ... 14.97 65.0 56.0 16.0 Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 9054.914 2015.0
Turkey 35.0 5.0 5.5 12.7 34.0 40.86 50.0 1.2 14095.0 3251.0 ... 13.42 88.0 62.0 16.4 Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 9437.372 2013.0
Hungary 15.0 3.6 7.9 4.8 82.0 3.19 58.0 1.3 15442.0 13277.0 ... 15.04 62.0 77.0 17.6 Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 12239.894 2015.0

5 rows × 30 columns


In [50]:
_ = full_country_stats.plot("GDP per capita",'Life satisfaction',kind='scatter')
plt.title('Life Satisfaction Index')


Out[50]:
Text(0.5, 1.0, 'Life Satisfaction Index')

In [51]:
keepvars = full_country_stats.dtypes[full_country_stats.dtypes == float].index.values
keepvars = keepvars[:-1]
country = full_country_stats[keepvars]

In [52]:
Y = np.array(country['Life satisfaction'])
del country['Life satisfaction']
X_vars = country.columns.values
X = np.array(country)

In [53]:
def loss(yhat,y):
    """sqr error loss"""
    return (yhat - y)**2

def fit(X,Y):
    """fit the OLS from training w/ intercept"""
    lin1 = LinearRegression(fit_intercept=True) # OLS from sklearn
    lin1.fit(X,Y) # fit OLS
    return np.append(lin1.intercept_,lin1.coef_) # return betahat

def predict(x, betahat):
    """predict for point x"""
    return betahat[0] + x @ betahat[1:]

Summary

  • Supervised learning task is to predict $Y$ given $X$
  • Fit is using training data to fit parameters
  • Predict uses the fitted parameters to do prediction
  • Loss is a function that says how poorly you did on datum $x_i,y_i$

Intermission

Risk and Empirical Risk

Given a loss $\ell(\theta; X,Y)$, for parameters $\theta$, the risk is $$ R(\theta) = \mathbb E \ell(\theta; X,Y). $$

And given training data $\{x_i,y_i\}_{i=1}^{n}$ (drawn iid with the same distribution as $X,Y$), the empirical risk is $$ R_n(\theta) = \frac 1n \sum_{i=1}^n \ell(\theta; x_i, y_i). $$ Notice that $\mathbb E R_n(\theta) = R(\theta)$ for fixed $\theta$.

For a class of parameters $\Theta$, the empirical risk minimizer (ERM) is $$ \hat \theta = \arg \min_{\theta \in \Theta} R_n(\theta), $$ which may not be unique.
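
As a tiny concrete example (a sketch, not from the slides): for squared error loss $\ell(\theta; y) = (y - \theta)^2$ with a constant predictor and $\Theta = \mathbb R$, the ERM is the sample mean, which a grid search over $\theta$ confirms.

import numpy as np

np.random.seed(0)
y_toy = np.random.randn(100) + 3            # toy sample

def emp_risk(theta, y):
    """empirical risk for square loss with constant predictor theta"""
    return np.mean((y - theta)**2)

thetas = np.linspace(0, 6, 601)
risks = [emp_risk(t, y_toy) for t in thetas]
theta_hat = thetas[np.argmin(risks)]        # ERM over the grid
print(theta_hat, y_toy.mean())              # approximately equal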

OLS is the ERM

OLS minimizes the following objective, $$ R_n(\beta) = \frac 1n \sum_{i=1}^n \left(y_i - x_i^\top \beta - \beta_0 \right)^2, $$ with respect to $\beta, \beta_0$. This is the ERM for squared error loss with a linear predictor.
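
As a sanity check (a sketch; it assumes wine_ar from the earlier cell), the minimizer of $R_n(\beta)$ can be computed directly by least squares and compared against the coefficients in the statsmodels summary above.

import numpy as np

## assumes wine_ar from the earlier cell; add an intercept column
Xw = np.hstack((np.ones((wine_ar.shape[0], 1)), wine_ar[:, :-1]))
yw = wine_ar[:, -1]

## solve the least squares problem min_beta R_n(beta) directly
betahat_ls, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
print(betahat_ls[:3])   # should match const, x1, x2 in the summary above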

Why is ERM a good idea?

For a fixed $\theta$ we know by the Law of Large Numbers (as long as expectations exist and data is iid), $$ R_n(\theta) = \frac 1n \sum_{i=1}^n \ell(\theta; x_i, y_i) \rightarrow \mathbb E \ell(\theta; X,Y) = R(\theta), $$ where convergence is in probability (or almost surely). We want to minimize $R(\theta)$ so $R_n(\theta)$ is a pretty good surrogate.
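
A quick simulation of this convergence for a fixed $\theta$ (a sketch with a made-up distribution): take $Y = 2X + \varepsilon$ with standard normal $X$ and $\varepsilon$, use the linear predictor $\theta x$ with squared error loss, and fix $\theta = 1$, so that $R(\theta) = (2 - \theta)^2 + 1 = 2$.

import numpy as np

np.random.seed(0)
theta = 1.0                                  # a fixed (non-optimal) parameter

def emp_risk(theta, n):
    """empirical risk R_n(theta) for square loss on a fresh sample of size n"""
    x = np.random.randn(n)
    y = 2 * x + np.random.randn(n)           # true model: Y = 2X + noise
    return np.mean((y - theta * x)**2)

true_risk = (2 - theta)**2 + 1               # R(theta)
for n in [10, 100, 10000, 1000000]:
    print(n, emp_risk(theta, n), true_risk)  # R_n(theta) approaches R(theta) = 2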

Example: Binary classification

A mortgage insurer pays the mortgage company if the insured defaults on the loan. To determine how much to charge, the insurer wants to predict whether a customer will default (1) or not (0).

An actuary (from the 19th century) says that young people (younger than 30) are irresponsible and will not insure them. Let $x$ be the age in years, $y = 1$ if they default, and $\theta = 30$.

$$ g_\theta(x) = \left\{ \begin{array}{ll} 1, &x < \theta\\ 0, &x \ge \theta \end{array} \right. $$

0-1 loss is $$ \ell_{0-1}(\theta; X,Y) = \mathbf 1 \{g_\theta(X)\ne Y\}. $$ The risk is $$ R(\theta) = \mathbb E \mathbf 1 \{g_\theta(X)\ne Y\} = \mathbb P \{ g_\theta(X) \ne Y \}. $$ How well will he do?
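
One way to answer is to estimate the risk of his rule on data. The sketch below is purely illustrative: the ages and default labels are simulated (with younger borrowers defaulting somewhat more often), not real insurance data.

import numpy as np

np.random.seed(0)
age = np.random.uniform(20, 70, size=1000)                  # simulated ages
default = (np.random.rand(1000) <
           np.where(age < 30, 0.3, 0.1)).astype(int)        # simulated defaults

def g(x, theta=30):
    """the actuary's classifier: predict default iff age < theta"""
    return (x < theta).astype(int)

emp_risk_01 = np.mean(g(age) != default)    # empirical 0-1 risk of the rule
print(emp_risk_01)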

Unsupervised learning

Want to summarize/compress/learn the distribution of $X$ (there is no response $Y$ to predict). Clustering is one example.

Image from https://rpubs.com/cyobero/k-means

Clustering is the problem of assigning each datum to a cluster in the index set $[C] = \{1,\ldots,C\}$, with cluster centers $z_k$: $$ \theta = \left\{ \textrm{cluster centers, } \{ z_k \}_{k=1}^C \subset \mathbb R^p, \textrm{ cluster assignments, } \sigma:[n] \to [C] \right\}. $$ The loss is $$ \ell(\theta;x_i) = \| x_i - z_{\sigma(i)} \|^2 = \sum_{j=1}^p (x_{i,j} - z_{\sigma(i),j})^2. $$ Loss, risk, and empirical risk can still be defined, but many concepts (such as the bias-variance tradeoff) do not carry over in the same way.
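
A sketch of this clustering loss for given centers and assignments (the toy data and centers are made up; this computes the objective only, it is not a clustering algorithm such as k-means).

import numpy as np

np.random.seed(0)
X_clust = np.vstack([np.random.randn(50, 2),                # two toy blobs
                     np.random.randn(50, 2) + 5])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])                # C = 2 centers z_k

## assign each datum to its nearest center (sigma: [n] -> [C])
dists = ((X_clust[:, None, :] - centers[None, :, :])**2).sum(axis=2)
sigma = dists.argmin(axis=1)

## empirical risk: average squared distance to the assigned center
emp_risk_clust = np.mean(((X_clust - centers[sigma])**2).sum(axis=1))
print(emp_risk_clust)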

Issue with training error in supervised learning

Let $\hat \theta$ be the ERM. The training error is $$ R_n(\hat \theta) = \min_{\theta \in \Theta} R_n(\theta), $$ which is NOT an unbiased estimate of the risk, because in general $$ \mathbb E \min_\theta R_n(\theta) \le \min_{\theta} \mathbb E R_n(\theta) = \min_\theta R(\theta). $$ The training error is therefore optimistic.

Solution

Split the data randomly into training and test sets:

  • train $\hat \theta$ with the training data
  • test $\hat \theta$ with the test data

Because the test data are independent of $\hat \theta$, we can treat the training process (and hence $\hat \theta$) as fixed, and the test error is an unbiased estimate of the risk of $\hat \theta$.


In [54]:
## randomly shuffle data and split
n,p = X.shape
Ind = np.arange(n) 
np.random.shuffle(Ind) 
train_size = 2 * n // 3 +1 # set training set size
X_tr, X_te = X[Ind[:train_size],:], X[Ind[train_size:],:]
Y_tr, Y_te = Y[Ind[:train_size]], Y[Ind[train_size:]]

In [55]:
## compute losses on test set
betahat = fit(X_tr,Y_tr)
Y_hat_te = [predict(x,betahat) for x in X_te]
test_losses = [loss(yhat,y) for yhat,y in zip(Y_hat_te,Y_te)]

## compute losses on train set
Y_hat_tr = [predict(x,betahat) for x in X_tr]
train_losses = [loss(yhat,y) for yhat,y in zip(Y_hat_tr,Y_tr)]

In [56]:
train_losses


Out[56]:
[7.444625512977249e-22,
 4.789408058424708e-22,
 3.4382433844272955e-22,
 6.386933748575762e-23,
 7.351858499749134e-22,
 1.3409297804094917e-21,
 2.9197810842621935e-21,
 1.039531398291255e-21,
 5.607714332892798e-22,
 2.3122614381851545e-23,
 2.1335721653214664e-21,
 5.6447928149221026e-24,
 3.3087224502121107e-22,
 1.1408544600062694e-21,
 8.496586931809782e-21,
 1.2751479190787854e-21,
 6.064590983206163e-23,
 1.91731144178247e-22,
 1.7474912689673094e-21,
 4.864338989694648e-22,
 1.550858667716521e-21,
 3.408013204286884e-21,
 2.363191630612292e-21,
 2.8989893905610626e-21,
 8.333312348132911e-23]

In [57]:
test_losses


Out[57]:
[1111.340064728651,
 0.6255327739667467,
 182.73909226856128,
 1662.5641512445711,
 1187.18292683525,
 1909.680414765608,
 1171.813096477374,
 17.875942381202396,
 488.6096378479109,
 5445.915952121844,
 522.1742807860538]

In [58]:
print("train avg loss: {}\ntest avg loss: {}".format(np.mean(train_losses), np.mean(test_losses)))
print("n p :",n,p)


train avg loss: 1.376951437853411e-21
test avg loss: 1245.5019174755448
n p : 36 24

In [59]:
def train_test_split(X,Y,split_pr = 0.5):
    """train-test split"""
    n,p = X.shape
    Ind = np.arange(n) 
    np.random.shuffle(Ind) 
    train_size = int(split_pr * n) # set training set size
    X_tr, X_te = X[Ind[:train_size],:], X[Ind[train_size:],:]
    Y_tr, Y_te = Y[Ind[:train_size]], Y[Ind[train_size:]]
    return (X_tr,Y_tr), (X_te, Y_te)

In [60]:
Y = wine_ar[:,-1]
X = wine_ar[:,:-1]
(X_tr,Y_tr), (X_te, Y_te) = train_test_split(X,Y)

In [61]:
## compute losses on test set
betahat = fit(X_tr,Y_tr)
Y_hat_te = [predict(x,betahat) for x in X_te]
test_losses = [loss(yhat,y) for yhat,y in zip(Y_hat_te,Y_te)]

## compute losses on train set
Y_hat_tr = [predict(x,betahat) for x in X_tr]
train_losses = [loss(yhat,y) for yhat,y in zip(Y_hat_tr,Y_tr)]

In [62]:
print("train avg loss: {}\ntest avg loss: {}".format(np.mean(train_losses), np.mean(test_losses)))


train avg loss: 0.4120109503777993
test avg loss: 0.42469372283837814

Summary

  • Want to minimize true risk (expected loss)
  • Instead we minimize empirical risk (training error)
  • Training error is biased (optimistic), so we estimate risk with a train/test split