Intro to Statistical Machine Learning

Stats 208

Lecture 1

Prof. Sharpnack

Lecture slides at course github page

Some content in these slides is from STA 251 notes and STA 141B lectures.

The life satisfaction data example and some of the code are based on notebook 01 from Aurélien Geron's github page

Machine Learning

A computer program is said to learn from experience E, with respect to a class of tasks T and a performance measure P, if its performance on tasks in T, as measured by P, improves with experience E.

Examples of these categories are,

  • E: data (training)
  • P: loss (test), reward
  • T: classification, regression, expert selection, etc.

Inference vs. Prediction

  • statistical inference: is this effect significant? is the model correct? etc.
  • prediction: does this algorithm predict the response variable well?

Terms

  • supervised learning: predicting one variable from many others
  • predictor variables: X variables
  • response variable: Y variable
  • X: $n \times p$ design matrix / features
  • Y: $n$ label vector

In [38]:
## I will be using Python 3, for install instructions see 
## http://anson.ucdavis.edu/~jsharpna/DSBook/unit1/intro.html#installation-and-workflow

## The following packages are numpy (linear algebra), pandas (data munging), 
## sklearn (machine learning), matplotlib (graphics), statsmodels (statistical models)

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

In [39]:
## Lines that start with ! run a bash command

!ls -l ../../data/winequality-red.csv


-rw-r--r-- 1 jsharpna jsharpna 84199 Apr  1 09:30 ../../data/winequality-red.csv

In [40]:
!head ../../data/winequality-red.csv


"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.4;0.66;0;1.8;0.075;13;40;0.9978;3.51;0.56;9.4;5
7.9;0.6;0.06;1.6;0.069;15;59;0.9964;3.3;0.46;9.4;5
7.3;0.65;0;1.2;0.065;15;21;0.9946;3.39;0.47;10;7
7.8;0.58;0.02;2;0.073;9;18;0.9968;3.36;0.57;9.5;7

Wine dataset description

  • 84199 bytes (not large, feel free to load into memory)
  • header line with column names in quotation marks (")
  • each subsequent line contains unquoted floats
  • values separated by ;

Some Python basics:

  • file input/output
  • [f(a) for a in L] list comprehensions
  • iterables, basic types, built-in functions

In [41]:
datapath = "../../data/"
with open(datapath + 'winequality-red.csv','r') as winefile:
    header = winefile.readline()
    wine_list = [line.strip().split(';') for line in winefile]

In [42]:
wine_ar = np.array(wine_list,dtype=np.float64)

In [43]:
names = [name.strip('"') for name in header.strip().split(';')]
print(names)


['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality']

In [44]:
#Subselect the predictor X and response y
y = wine_ar[:,-1]
X = wine_ar[:,:-1]
n,p = X.shape

In [45]:
y.shape, X.shape #just checking


Out[45]:
((1599,), (1599, 11))

In [46]:
import statsmodels.api as sm

X = np.hstack((np.ones((n,1)),X)) #add intercept
wine_ols = sm.OLS(y,X) #Initialize the OLS 
wine_res = wine_ols.fit()

In [47]:
wine_res.summary()


Out[47]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.361
Model:                            OLS   Adj. R-squared:                  0.356
Method:                 Least Squares   F-statistic:                     81.35
Date:                Mon, 01 Apr 2019   Prob (F-statistic):          1.79e-145
Time:                        12:12:18   Log-Likelihood:                -1569.1
No. Observations:                1599   AIC:                             3162.
Df Residuals:                    1587   BIC:                             3227.
Df Model:                          11
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         21.9652     21.195      1.036      0.300     -19.607      63.538
x1             0.0250      0.026      0.963      0.336      -0.026       0.076
x2            -1.0836      0.121     -8.948      0.000      -1.321      -0.846
x3            -0.1826      0.147     -1.240      0.215      -0.471       0.106
x4             0.0163      0.015      1.089      0.276      -0.013       0.046
x5            -1.8742      0.419     -4.470      0.000      -2.697      -1.052
x6             0.0044      0.002      2.009      0.045       0.000       0.009
x7            -0.0033      0.001     -4.480      0.000      -0.005      -0.002
x8           -17.8812     21.633     -0.827      0.409     -60.314      24.551
x9            -0.4137      0.192     -2.159      0.031      -0.789      -0.038
x10            0.9163      0.114      8.014      0.000       0.692       1.141
x11            0.2762      0.026     10.429      0.000       0.224       0.328
==============================================================================
Omnibus:                       27.376   Durbin-Watson:                   1.757
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               40.965
Skew:                          -0.168   Prob(JB):                     1.27e-09
Kurtosis:                       3.708   Cond. No.                     1.13e+05
==============================================================================


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.13e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Linear model

$$f_\beta(x_i) = \beta_0 + \sum_{j=1}^p \beta_j x_{i,j}$$

Inference in linear models

  • statistically test for the significance of effects
  • requires normality, homoscedasticity, and that the linear model is correct
  • hard to obtain significance for an individual effect under collinearity (see the sketch below)
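
The sketch below (not from the original slides; the simulated variables are made up for illustration) shows the last point: two nearly collinear predictors each get a large standard error from a statsmodels OLS fit, even though together they predict the response well.

import numpy as np
import statsmodels.api as sm

np.random.seed(0)
x1 = np.random.randn(200)
x2 = x1 + 0.01 * np.random.randn(200)       # nearly identical to x1
y_sim = 1.0 + x1 + x2 + np.random.randn(200)

X_sim = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y_sim, X_sim).fit()
print(res.bse)        # standard errors for x1 and x2 are inflated by collinearity
print(res.rsquared)   # yet the overall fit is good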

Prediction perspective

  • think of OLS as a black-box model for predicting $Y | X$
  • how do we evaluate the performance of predictions?
  • how do we choose between multiple OLS models?

Supervised learning

Learning machine that takes $p$-dimensional data $x_i = (x_{i,1}, \ldots, x_{i,p})$ and predicts $y_i \in \mathcal Y$.

  • Task: Predict $y$ given $x$ as $f_\beta(x)$
  • Performance Metric: Loss measured with some function $\ell(\beta; x,y)$
  • Experience: Fit the model with training data $\{x_i,y_i\}_{i=1}^{n}$

Linear Regression

  • Fit: Compute $\hat \beta$ from OLS with training data $\{x_i,y_i\}_{i=1}^{n}$
  • Predict: For a new predictor $x_{n+1}$ predict $$\hat y = f_{\hat \beta}(x_{n+1}) = \hat \beta_0 + \sum_{j=1}^p \hat \beta_j x_{n+1,j}$$
  • Loss: Observe a new response $y_{n+1}$ and incur the loss $$\ell(\hat \beta; x_{n+1},y_{n+1}) = (f_{\hat \beta}(x_{n+1}) - y_{n+1})^2$$ (a toy worked example follows below)
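
For a toy worked example (numbers made up): with $p = 1$, $\hat \beta_0 = 1$, $\hat \beta_1 = 2$, and a new predictor $x_{n+1} = 3$, the prediction is $\hat y = 1 + 2 \cdot 3 = 7$; if we then observe $y_{n+1} = 6$, the loss is $(7 - 6)^2 = 1$.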

In [48]:
## The following uses pandas!

datapath = "../../data/"
oecd_bli = pd.read_csv(datapath + "oecd_bli_2015.csv", thousands=',')
oecd_bli = oecd_bli[oecd_bli["INEQUALITY"]=="TOT"]
oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")

# Load and prepare GDP per capita data

# Download data from http://goo.gl/j1MSKe (=> imf.org)
gdp_per_capita = pd.read_csv(datapath+"gdp_per_capita.csv", thousands=',', delimiter='\t',
                             encoding='latin1', na_values="n/a")
gdp_per_capita = gdp_per_capita.rename(columns={"2015": "GDP per capita"})
gdp_per_capita = gdp_per_capita.set_index("Country")

full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita, left_index=True, right_index=True)
full_country_stats = full_country_stats.sort_values(by="GDP per capita")

In [49]:
full_country_stats.head()


Out[49]:
Air pollution Assault rate Consultation on rule-making Dwellings without basic facilities Educational attainment Employees working very long hours Employment rate Homicide rate Household net adjusted disposable income Household net financial wealth ... Time devoted to leisure and personal care Voter turnout Water quality Years in education Subject Descriptor Units Scale Country/Series-specific Notes GDP per capita Estimates Start After
Country
Brazil 18.0 7.9 4.0 6.7 45.0 10.41 67.0 25.5 11664.0 6844.0 ... 14.97 79.0 72.0 16.3 Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 8669.998 2014.0
Mexico 30.0 12.8 9.0 4.2 37.0 28.83 61.0 23.4 13085.0 9056.0 ... 13.89 63.0 67.0 14.4 Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 9009.280 2015.0
Russia 15.0 3.8 2.5 15.1 94.0 0.16 69.0 12.8 19292.0 3412.0 ... 14.97 65.0 56.0 16.0 Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 9054.914 2015.0
Turkey 35.0 5.0 5.5 12.7 34.0 40.86 50.0 1.2 14095.0 3251.0 ... 13.42 88.0 62.0 16.4 Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 9437.372 2013.0
Hungary 15.0 3.6 7.9 4.8 82.0 3.19 58.0 1.3 15442.0 13277.0 ... 15.04 62.0 77.0 17.6 Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 12239.894 2015.0

5 rows × 30 columns


In [50]:
_ = full_country_stats.plot("GDP per capita",'Life satisfaction',kind='scatter')
plt.title('Life Satisfaction Index')


Out[50]:
Text(0.5, 1.0, 'Life Satisfaction Index')

In [51]:
keepvars = full_country_stats.dtypes[full_country_stats.dtypes == float].index.values
keepvars = keepvars[:-1]
country = full_country_stats[keepvars]

In [52]:
Y = np.array(country['Life satisfaction'])
del country['Life satisfaction']
X_vars = country.columns.values
X = np.array(country)

In [53]:
def loss(yhat,y):
    """sqr error loss"""
    return (yhat - y)**2

def fit(X,Y):
    """fit the OLS from training w/ intercept"""
    lin1 = LinearRegression(fit_intercept=True) # OLS from sklearn
    lin1.fit(X,Y) # fit OLS
    return np.append(lin1.intercept_,lin1.coef_) # return betahat

def predict(x, betahat):
    """predict for point x"""
    return betahat[0] + x @ betahat[1:]

Summary

  • Supervised learning task is to predict $Y$ given $X$
  • Fit is using training data to fit parameters
  • Predict uses the fitted parameters to do prediction
  • Loss is a function that says how poorly you did on datum $x_i,y_i$

Intermission

Risk and Empirical Risk

Given a loss $\ell(\theta; X,Y)$, for parameters $\theta$, the risk is $$ R(\theta) = \mathbb E \ell(\theta; X,Y). $$

And given training data $\{x_i,y_i\}_{i=1}^{n}$ (drawn iid with the same distribution as $X,Y$), the empirical risk is $$ R_n(\theta) = \frac 1n \sum_{i=1}^n \ell(\theta; x_i, y_i). $$ Notice that $\mathbb E R_n(\theta) = R(\theta)$ for fixed $\theta$.

For a class of parameters $\Theta$, the empirical risk minimizer (ERM) is $$ \hat \theta = \arg \min_{\theta \in \Theta} R_n(\theta), $$ which may not be unique.
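
As a tiny concrete example (a sketch, not from the slides): for squared error loss $\ell(\theta; y) = (y - \theta)^2$ with a constant predictor and $\Theta = \mathbb R$, the ERM is the sample mean, which a grid search over $\theta$ confirms.

import numpy as np

np.random.seed(0)
y_toy = np.random.randn(100) + 3            # toy sample

def emp_risk(theta, y):
    """empirical risk for square loss with constant predictor theta"""
    return np.mean((y - theta)**2)

thetas = np.linspace(0, 6, 601)
risks = [emp_risk(t, y_toy) for t in thetas]
theta_hat = thetas[np.argmin(risks)]        # ERM over the grid
print(theta_hat, y_toy.mean())              # approximately equal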

OLS is the ERM

OLS minimizes the following objective, $$ R_n(\beta) = \frac 1n \sum_{i=1}^n \left(y_i - x_i^\top \beta - \beta_0 \right)^2, $$ with respect to $\beta, \beta_0$. This is the ERM for squared error loss with a linear predictor.
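
As a sanity check (a sketch; it assumes wine_ar from the earlier cell), the minimizer of $R_n(\beta)$ can be computed directly by least squares and compared against the coefficients in the statsmodels summary above.

import numpy as np

## assumes wine_ar from the earlier cell; add an intercept column
Xw = np.hstack((np.ones((wine_ar.shape[0], 1)), wine_ar[:, :-1]))
yw = wine_ar[:, -1]

## solve the least squares problem min_beta R_n(beta) directly
betahat_ls, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
print(betahat_ls[:3])   # should match const, x1, x2 in the summary above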

Why is ERM a good idea?

For a fixed $\theta$ we know by the Law of Large Numbers (as long as expectations exist and data is iid), $$ R_n(\theta) = \frac 1n \sum_{i=1}^n \ell(\theta; x_i, y_i) \rightarrow \mathbb E \ell(\theta; X,Y) = R(\theta), $$ where convergence is in probability (or almost surely). We want to minimize $R(\theta)$ so $R_n(\theta)$ is a pretty good surrogate.
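
A quick simulation of this convergence for a fixed $\theta$ (a sketch with a made-up distribution): take $Y = 2X + \varepsilon$ with standard normal $X$ and $\varepsilon$, use the linear predictor $\theta x$ with squared error loss, and fix $\theta = 1$, so that $R(\theta) = (2 - \theta)^2 + 1 = 2$.

import numpy as np

np.random.seed(0)
theta = 1.0                                  # a fixed (non-optimal) parameter

def emp_risk(theta, n):
    """empirical risk R_n(theta) for square loss on a fresh sample of size n"""
    x = np.random.randn(n)
    y = 2 * x + np.random.randn(n)           # true model: Y = 2X + noise
    return np.mean((y - theta * x)**2)

true_risk = (2 - theta)**2 + 1               # R(theta)
for n in [10, 100, 10000, 1000000]:
    print(n, emp_risk(theta, n), true_risk)  # R_n(theta) approaches R(theta) = 2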

Example: Binary classification

A mortgage insurer pays the mortgage company if the insured defaults on the loan. To determine how much to charge, the insurer wants to predict whether a customer will default (1) or not (0).

An actuary (from the 19th century) says that young people (younger than 30) are irresponsible and will not insure them. Let $x$ be the age in years, $y = 1$ if they default, and $\theta = 30$.

$$ g_\theta(x) = \left\{ \begin{array}{ll} 1, &x < \theta\\ 0, &x \ge \theta \end{array} \right. $$

0-1 loss is $$ \ell_{0-1}(\theta; X,Y) = \mathbf 1 \{g_\theta(X)\ne Y\}. $$ The risk is $$ R(\theta) = \mathbb E \mathbf 1 \{g_\theta(X)\ne Y\} = \mathbb P \{ g_\theta(X) \ne Y \}. $$ How well will he do?
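
One way to answer is to estimate the risk of his rule on data. The sketch below is purely illustrative: the ages and default labels are simulated (with younger borrowers defaulting somewhat more often), not real insurance data.

import numpy as np

np.random.seed(0)
age = np.random.uniform(20, 70, size=1000)                  # simulated ages
default = (np.random.rand(1000) <
           np.where(age < 30, 0.3, 0.1)).astype(int)        # simulated defaults

def g(x, theta=30):
    """the actuary's classifier: predict default iff age < theta"""
    return (x < theta).astype(int)

emp_risk_01 = np.mean(g(age) != default)    # empirical 0-1 risk of the rule
print(emp_risk_01)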

Unsupervised learning

Want to summarize/compress/learn the distribution of $X$ (there is no response $Y$ to predict). Clustering is one example.

Image from https://rpubs.com/cyobero/k-means

Clustering is the problem of assigning each datum to a cluster in the index set $[C] = \{1,\ldots,C\}$, with cluster centers $z_k$: $$ \theta = \left\{ \textrm{cluster centers, } \{ z_k \}_{k=1}^C \subset \mathbb R^p, \textrm{ cluster assignments, } \sigma:[n] \to [C] \right\}. $$ The loss is $$ \ell(\theta;x_i) = \| x_i - z_{\sigma(i)} \|^2 = \sum_{j=1}^p (x_{i,j} - z_{\sigma(i),j})^2. $$ Loss, risk, and empirical risk can still be defined, but many concepts (such as the bias-variance tradeoff) do not carry over in the same way.
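
A sketch of this clustering loss for given centers and assignments (the toy data and centers are made up; this computes the objective only, it is not a clustering algorithm such as k-means).

import numpy as np

np.random.seed(0)
X_clust = np.vstack([np.random.randn(50, 2),                # two toy blobs
                     np.random.randn(50, 2) + 5])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])                # C = 2 centers z_k

## assign each datum to its nearest center (sigma: [n] -> [C])
dists = ((X_clust[:, None, :] - centers[None, :, :])**2).sum(axis=2)
sigma = dists.argmin(axis=1)

## empirical risk: average squared distance to the assigned center
emp_risk_clust = np.mean(((X_clust - centers[sigma])**2).sum(axis=1))
print(emp_risk_clust)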

Issue with training error in supervised learning

Let $\hat \theta$ be the ERM. The training error is $$ R_n(\hat \theta) = \min_{\theta \in \Theta} R_n(\theta), $$ which is NOT an unbiased estimate of the risk, because in general $$ \mathbb E \min_\theta R_n(\theta) \le \min_{\theta} \mathbb E R_n(\theta) = \min_\theta R(\theta). $$ The training error is therefore optimistic.

Solution

Split the data randomly into training and test sets:

  • train $\hat \theta$ with the training data
  • test $\hat \theta$ with the test data

Because the test data are independent of $\hat \theta$, we can treat the training process (and hence $\hat \theta$) as fixed, and the test error is an unbiased estimate of the risk of $\hat \theta$.


In [54]:
## randomly shuffle data and split
n,p = X.shape
Ind = np.arange(n) 
np.random.shuffle(Ind) 
train_size = 2 * n // 3 +1 # set training set size
X_tr, X_te = X[Ind[:train_size],:], X[Ind[train_size:],:]
Y_tr, Y_te = Y[Ind[:train_size]], Y[Ind[train_size:]]

In [55]:
## compute losses on test set
betahat = fit(X_tr,Y_tr)
Y_hat_te = [predict(x,betahat) for x in X_te]
test_losses = [loss(yhat,y) for yhat,y in zip(Y_hat_te,Y_te)]

## compute losses on train set
Y_hat_tr = [predict(x,betahat) for x in X_tr]
train_losses = [loss(yhat,y) for yhat,y in zip(Y_hat_tr,Y_tr)]

In [56]:
train_losses


Out[56]:
[7.444625512977249e-22,
 4.789408058424708e-22,
 3.4382433844272955e-22,
 6.386933748575762e-23,
 7.351858499749134e-22,
 1.3409297804094917e-21,
 2.9197810842621935e-21,
 1.039531398291255e-21,
 5.607714332892798e-22,
 2.3122614381851545e-23,
 2.1335721653214664e-21,
 5.6447928149221026e-24,
 3.3087224502121107e-22,
 1.1408544600062694e-21,
 8.496586931809782e-21,
 1.2751479190787854e-21,
 6.064590983206163e-23,
 1.91731144178247e-22,
 1.7474912689673094e-21,
 4.864338989694648e-22,
 1.550858667716521e-21,
 3.408013204286884e-21,
 2.363191630612292e-21,
 2.8989893905610626e-21,
 8.333312348132911e-23]

In [57]:
test_losses


Out[57]:
[1111.340064728651,
 0.6255327739667467,
 182.73909226856128,
 1662.5641512445711,
 1187.18292683525,
 1909.680414765608,
 1171.813096477374,
 17.875942381202396,
 488.6096378479109,
 5445.915952121844,
 522.1742807860538]

In [58]:
print("train avg loss: {}\ntest avg loss: {}".format(np.mean(train_losses), np.mean(test_losses)))
print("n p :",n,p)


train avg loss: 1.376951437853411e-21
test avg loss: 1245.5019174755448
n p : 36 24

In [59]:
def train_test_split(X,Y,split_pr = 0.5):
    """train-test split"""
    n,p = X.shape
    Ind = np.arange(n) 
    np.random.shuffle(Ind) 
    train_size = int(split_pr * n) # set training set size
    X_tr, X_te = X[Ind[:train_size],:], X[Ind[train_size:],:]
    Y_tr, Y_te = Y[Ind[:train_size]], Y[Ind[train_size:]]
    return (X_tr,Y_tr), (X_te, Y_te)

In [60]:
Y = wine_ar[:,-1]
X = wine_ar[:,:-1]
(X_tr,Y_tr), (X_te, Y_te) = train_test_split(X,Y)

In [61]:
## compute losses on test set
betahat = fit(X_tr,Y_tr)
Y_hat_te = [predict(x,betahat) for x in X_te]
test_losses = [loss(yhat,y) for yhat,y in zip(Y_hat_te,Y_te)]

## compute losses on train set
Y_hat_tr = [predict(x,betahat) for x in X_tr]
train_losses = [loss(yhat,y) for yhat,y in zip(Y_hat_tr,Y_tr)]

In [62]:
print("train avg loss: {}\ntest avg loss: {}".format(np.mean(train_losses), np.mean(test_losses)))


train avg loss: 0.4120109503777993
test avg loss: 0.42469372283837814

Summary

  • Want to minimize true risk (expected loss)
  • Instead we minimize empirical risk (training error)
  • Training error is biased (optimistic), so we estimate risk with a train/test split