We use a script that extracts your answers by looking for cells in between the cells containing the exercise statements (beginning with Exercise X.X). So you should put your answers in cells between the exercise statements and leave the exercise statement cells themselves unchanged.
To write markdown, please switch the cell type to Markdown (from Code) - you can hit 'm' when you are in command mode - and use markdown syntax. For a brief tutorial see: https://daringfireball.net/projects/markdown/syntax
In the following exercises you should provide an explanation, with math when necessary, for any answers. When answering with math you should use basic LaTeX, as in $$E(Y|X=x) = \int_{\mathcal{Y}} f_{Y|X}(y|x) dy = \int_{\mathcal{Y}} \frac{f_{Y,X}(y,x)}{f_{X}(x)} dy$$ for displayed equations, and $R_{i,j} = 2^{-|i-j|}$ for inline equations. (To see the contents of this cell in markdown, double click on it or hit Enter in escape mode.) To see a list of latex math symbols see here: http://web.ift.uib.no/Teori/KURS/WRK/TeX/symALL.html
Exercise 1.1. (5 pts) Recall that the Hamming loss for Binary classification ($y \in \{0,1\}$) is $$l(y,\hat y) = 1\{y \ne \hat y\} = (y - \hat y)^2$$ as long as $\hat y \in \{0,1\}$. This loss can be extended to multiclass classification where there are $K$ possible values that $y$ can take (for example 'dog','cat','squirrel' or 1-5 stars). Explain how you can re-encode $y$ and $\hat y$ to be a $K-1$ dimensional vector that generalizes binary classification, and rewrite the loss using vector operations.
If we encode $y$ and $\hat{y}$ as $K$ dimensional vectors, then $\hat{y} = e_i$, where $e_i$ is the vector with $K-1$ zeros and a 1 at the $i$th index. The corresponding loss is $l(y,\hat{y}) = \frac{1}{2}\|y - \hat{y}\|_2^2$, which equals $1\{y \ne \hat y\}$ because two distinct one-hot vectors differ in exactly two coordinates.
It is also possible to encode $\hat{y}$ with a $K-1$ dimensional vector and still use the quadratic loss: for the first $K-1$ classes we keep $\hat{y} = e_i$ (now in $\mathbb{R}^{K-1}$), and for class $K$ we set $\hat{y} = [\alpha, \alpha, \ldots, \alpha]$, the vector whose entries all equal $\alpha$, where $\alpha$ solves $(1-\alpha)^2 + (K-2)\alpha^2 = 2$. This makes every pair of distinct classes lie at the same squared distance 2, so $\frac{1}{2}\|y - \hat{y}\|_2^2$ is again the Hamming loss.
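As a quick numerical sanity check, here is a minimal numpy sketch of the $K$-dimensional encoding and the vectorized loss (the helper names `encode_onehot` and `hamming_loss` are my own, not part of the exercise):
In [ ]:
import numpy as np

def encode_onehot(labels, K):
    """Map integer class labels 0,...,K-1 to K-dimensional one-hot vectors."""
    return np.eye(K)[labels]

def hamming_loss(y_enc, yhat_enc):
    """Half the squared distance between encodings: 0 if the classes agree, 1 otherwise."""
    return 0.5 * np.sum((y_enc - yhat_enc)**2, axis=-1)

labels = np.array([0, 2, 1])
preds  = np.array([0, 1, 1])
print(hamming_loss(encode_onehot(labels, 3), encode_onehot(preds, 3)))   # [0. 1. 0.]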
Exercise 1.2 (5 pts) Ex. 2.7 in ESL
(a)
For convenience, we write $x$ for the augmented vector $[1, x]$, so the intercept is absorbed into $\beta$.
For linear regression, $\hat{f}(x_0) = x_0^T \hat{\beta}$, where $\hat{\beta} = (X^TX)^{-1}X^TY$,
so $\hat{f}(x_0) = x_0^T (X^TX)^{-1}X^TY = \sum_{i=1}^n x_0^T (X^TX)^{-1} x_i\, y_i$, giving
$l_i(x_0; X) = x_0^T (X^TX)^{-1} x_i$
For $k$-nearest-neighbour regression, $\hat{f}(x_0) = \frac{1}{k} \sum_{i \in N_k(x_0)}y_i = \sum_{i=1}^n \frac{1}{k}I(x_i \in N_k(x_0))\, y_i$, giving
$l_i(x_0; X) = \frac{1}{k} I(x_i \in N_k(x_0))$
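A small numerical sketch of the two weight vectors (the simulated design and the variable names are illustrative only), checking that $\hat f(x_0) = \sum_i l_i(x_0)\, y_i$ in both cases:
In [ ]:
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 50, 3, 5
Xa = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # augmented design [1, x]
ya = rng.normal(size=n)
x0 = np.concatenate([[1.0], rng.normal(size=p)])

# OLS weights: l_i(x0) = x0^T (Xa^T Xa)^{-1} x_i
w_ols = Xa @ np.linalg.solve(Xa.T @ Xa, x0)
print(np.allclose(w_ols @ ya, x0 @ np.linalg.solve(Xa.T @ Xa, Xa.T @ ya)))   # True

# kNN weights: l_i(x0) = (1/k) * 1{x_i in N_k(x0)}
dist = np.linalg.norm(Xa - x0, axis=1)
nbrs = np.argsort(dist)[:k]
w_knn = np.zeros(n)
w_knn[nbrs] = 1.0 / k
print(np.allclose(w_knn @ ya, ya[nbrs].mean()))                              # True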
(b)
$\mathbb{E}_{Y|X}(f(x_0)-\hat{f}(x_0))^2 = \mathbb{E}_{Y|X}\big(f(x_0) - \mathbb{E}_{Y|X}\hat{f}(x_0) + \mathbb{E}_{Y|X}\hat{f}(x_0) - \hat{f}(x_0)\big)^2 = [f(x_0)-\mathbb{E}_{Y|X}\hat{f}(x_0)]^2 + \mathbb{E}_{Y|X}[\mathbb{E}_{Y|X}\hat{f}(x_0) - \hat{f}(x_0)]^2 \\ + 2[f(x_0)-\mathbb{E}_{Y|X}\hat{f}(x_0)]\,\mathbb{E}_{Y|X}[\mathbb{E}_{Y|X}\hat{f}(x_0) - \hat{f}(x_0)] = [f(x_0)-\mathbb{E}_{Y|X}\hat{f}(x_0)]^2 + \mathbb{E}_{Y|X}[\mathbb{E}_{Y|X}\hat{f}(x_0) - \hat{f}(x_0)]^2 = bias_{Y|X}^2 + var_{Y|X}(\hat{f}(x_0))$, since the cross term vanishes: $\mathbb{E}_{Y|X}[\mathbb{E}_{Y|X}\hat{f}(x_0) - \hat{f}(x_0)] = 0$.
(c)
$\mathbb{E}_{Y,X}(f(x_0)-\hat{f}(x_0))^2 = \mathbb{E}_{Y,X}\big(f(x_0) - \mathbb{E}_{Y,X}\hat{f}(x_0) + \mathbb{E}_{Y,X}\hat{f}(x_0) - \hat{f}(x_0)\big)^2 = [f(x_0)-\mathbb{E}_{Y,X}\hat{f}(x_0)]^2 + \mathbb{E}_{Y,X}[\mathbb{E}_{Y,X}\hat{f}(x_0) - \hat{f}(x_0)]^2 \\ + 2[f(x_0)-\mathbb{E}_{Y,X}\hat{f}(x_0)]\,\mathbb{E}_{Y,X}[\mathbb{E}_{Y,X}\hat{f}(x_0) - \hat{f}(x_0)] = [f(x_0)-\mathbb{E}_{Y,X}\hat{f}(x_0)]^2 + \mathbb{E}_{Y,X}[\mathbb{E}_{Y,X}\hat{f}(x_0) - \hat{f}(x_0)]^2 = bias_{Y,X}^2 + var_{Y,X}(\hat{f}(x_0))$, since again the cross term vanishes: $\mathbb{E}_{Y,X}[\mathbb{E}_{Y,X}\hat{f}(x_0) - \hat{f}(x_0)] = 0$.
(d)
By Adam's law (the law of total expectation), $\mathbb{E}_{X}\mathbb{E}_{Y|X}\hat{f}(x_0) = \mathbb{E}_{Y,X}\hat{f}(x_0)$, so the cross term vanishes in
$\mathbb{E}_{X}[ bias_{Y|X}^2 ] = \mathbb{E}_{X} [f(x_0) - \mathbb{E}_{Y,X}\hat{f}(x_0) + \mathbb{E}_{Y,X}\hat{f}(x_0) - \mathbb{E}_{Y|X}\hat{f}(x_0)]^2 = bias_{Y,X}^2 + var_X( \mathbb{E}_{Y|X}\hat{f}(x_0) )$
By Eve's law (the law of total variance),
$var_{Y,X}(\hat{f}(x_0)) = \mathbb{E}_{X}[ var_{Y|X}(\hat{f}(x_0)) ] + var_{X} (\mathbb{E}_{Y|X}\hat{f}(x_0))$
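Combining the two laws, the conditional decomposition in (b), averaged over $X$, recovers the unconditional decomposition in (c): $$\mathbb{E}_{X}\big[ bias_{Y|X}^2 \big] + \mathbb{E}_{X}\big[ var_{Y|X}(\hat{f}(x_0)) \big] = bias_{Y,X}^2 + var_X\big( \mathbb{E}_{Y|X}\hat{f}(x_0) \big) + \mathbb{E}_{X}\big[ var_{Y|X}(\hat{f}(x_0)) \big] = bias_{Y,X}^2 + var_{Y,X}(\hat{f}(x_0)).$$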
Exercise 1.3 (5 pts, 1 for each part) Recall that the true risk for a prediction function, $f$, a loss function, $\ell$, and a joint distribution for $Y,X$ is $$R(f) = E \ell(y,f(x))$$ For a training set $\{x_i,y_i\}_{i=1}^n$, the empirical risk is $$R_n = \frac{1}{n} \sum_{i=1}^n \ell(y_i,f(x_i)).$$ Let $y = x^\top \beta + \epsilon$ be a linear model for $Y|X$, where $x,\beta$ are $p$-dimensional and $\epsilon$ is Gaussian with mean 0 and variance $\sigma^2$ (independent of X). Let $\ell(y,\hat y) = (y - \hat y)^2$ be square error loss.
(1)
We know that the minimizer of $\mathbb{E}(Y-g(X))^2$ over functions $g$ is $g(X) = \mathbb{E}(Y|X)$.
$f^*(x)$ is the minimizer of the true risk $\mathbb{E}(Y-f(X))^2$,
so $f^*(x) = \mathbb{E}(Y|X=x) = \mathbb{E}(x^T \beta + \epsilon \mid X = x) = x^T\beta + \mathbb{E}(\epsilon) = x^T\beta$.
(2)
In practice we do not know $\beta$, so $f^\star(x) = x^T\beta$ cannot be used directly; it has to be estimated from the training data.
(3) Solution 1 Since $\hat f$ minimizes the empirical risk and $f^\star$ is a linear function (by (1)), $R_n(\hat f) \le R_n(f^\star)$, and $$E R_n(f^\star) = E \left( \frac 1n \sum_{i=1}^n \ell(y_i,f^\star(x_i)) \right) = \frac 1n \sum_{i=1}^n R(f^\star) = R(f^\star).$$ Hence, $E R_n(\hat f) \le R(f^\star)$.
Solution 2
(Note that $R(f^*) = \mathbb{E}(y - x^T\beta)^2 = \mathbb{E}(\epsilon^2) = \sigma^2$.)
$\mathbb{E}[l(y_i, \hat{f}(x_i))] = \mathbb{E}\big[\, \mathbb{E}[l(y_i, \hat{f}(x_i)) \mid X] \,\big]$
We calculate the conditional expectation first (treating the design matrix $X$ as fixed):
$\mathbb{E}[l(y_i, \hat{f}(x_i)) \mid X] = \mathbb{E}\big(x_i^T\beta + \epsilon_i - x_i^T (X^TX)^{-1}X^TY\big)^2 = \mathbb{E}\big(x_i^T\beta + \epsilon_i - x_i^T (X^TX)^{-1}X^T(X\beta + \mathbf{\epsilon})\big)^2\\ = \mathbb{E}\big(\epsilon_i - x_i^T(X^TX)^{-1}X^T\mathbf{\epsilon}\big)^2 = \mathbb{E}\big((e_i - X(X^TX)^{-1}x_i)^T \mathbf{\epsilon}\big)^2 = (e_i - X(X^TX)^{-1}x_i)^T\, \sigma^2 I\, (e_i - X(X^TX)^{-1}x_i)\\ = \sigma^2\big( 1 - 2 e_i^T X(X^TX)^{-1}x_i + x_i^T(X^TX)^{-1}X^TX(X^TX)^{-1}x_i \big) = \sigma^2\big(1-x_i^T(X^TX)^{-1}x_i\big) \leq \sigma^2$
$\qquad$ (using $e_i^T X = x_i^T$ and the fact that $(X^TX)^{-1}$ is positive definite)
We have shown that $\mathbb{E}[l(y_i, \hat{f}(x_i)) \mid X] \leq \sigma^2$ for every $X$ such that $X^TX$ has full rank. So $\mathbb{E}[l(y_i, \hat{f}(x_i))] = \mathbb{E}\big[\, \mathbb{E}[l(y_i, \hat{f}(x_i)) \mid X] \,\big] \leq \mathbb{E}(\sigma^2) = \sigma^2$
So $\mathbb{E}(R_n(\hat{f})) = E(\frac{1}{n}\sum_{i=1}^nl(y_i,\hat{f}(x_i))) = \frac{1}{n}\sum_{i=1}^n \mathbb{E}(l(y_i, \hat{f}(x_i))) \leq \sigma^2 = R(f^*)$
(4) Solution 1 From (1), $f^\star$ minimizes the true risk, so $R(f^\star) \le R(\hat f)$. Combining this with (3), $$E R_n (\hat f) \le R(f^\star) \le E R(\hat f).$$ Therefore, the expected test error is greater than or equal to the expected training error.
Solution 2
For a newly observed pair $(x_0, y_0)$, write $\mathbf{\epsilon}_t$ for the noise vector of the training set; $\epsilon_0$ and $\mathbf{\epsilon}_t$ are independent.
$R(\hat{f}) = \mathbb{E}\big(y_0 - x_0^T(X^TX)^{-1}X^TY\big)^2 = \mathbb{E}\big(x_0^T\beta + \epsilon_0 - x_0^T(X^TX)^{-1}X^T(X\beta + \mathbf{\epsilon}_t)\big)^2 = \mathbb{E}\big(\epsilon_0 - x_0^T(X^TX)^{-1}X^T\mathbf{\epsilon}_t\big)^2 \\= \sigma^2 + \mathbb{E}\big(x_0^T(X^TX)^{-1}X^T\mathbf{\epsilon}_t\big)^2 - 2\,\mathbb{E}(\epsilon_0)\, \mathbb{E}\big(x_0^T(X^TX)^{-1}X^T\mathbf{\epsilon}_t\big) = \sigma^2 + \mathbb{E}\big(x_0^T(X^TX)^{-1}X^T\mathbf{\epsilon}_t\big)^2 \geq \sigma^2 = R(f^*)$
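A small Monte Carlo sketch (the dimensions, seed, and number of replications are arbitrary choices, not part of the exercise) illustrating $E R_n(\hat f) \le \sigma^2 = R(f^\star) \le E R(\hat f)$ for least squares under this model:
In [ ]:
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma, reps = 30, 5, 1.0, 2000
beta = rng.normal(size=p)

train_risks, test_risks = [], []
for _ in range(reps):
    Xt = rng.normal(size=(n, p))
    yt = Xt @ beta + sigma * rng.normal(size=n)
    beta_hat = np.linalg.lstsq(Xt, yt, rcond=None)[0]        # empirical risk minimizer (OLS)
    train_risks.append(np.mean((yt - Xt @ beta_hat)**2))     # R_n(fhat)
    Xnew = rng.normal(size=(n, p))                           # fresh data from the same model
    ynew = Xnew @ beta + sigma * rng.normal(size=n)
    test_risks.append(np.mean((ynew - Xnew @ beta_hat)**2))  # estimate of R(fhat)

# expected ordering: E R_n(fhat) <= sigma^2 <= E R(fhat)
print(np.mean(train_risks), sigma**2, np.mean(test_risks))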
(5) If we refer to Solution 1 for (3) and (4), we see that the only place where the Gaussian linear model was used was in (1), to show that $f^\star$ is itself a linear function. So the most general statement is:
Let $f^\star$ be the minimizer of the true risk $R(f)$, and let $\hat f$ be the minimizer of $R_n$ over a class of functions that contains $f^\star$. Then $E R_n(\hat f) \le R(f^\star) \le E R(\hat f)$.
Exercise 1.4 Ex. 3.5 in ESL
Substituting $x_{ij} = (x_{ij}-\bar{x}_j) + \bar{x}_j$ into the ridge criterion $\sum_{i=1}^n\big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\big)^2 + \lambda\sum_{j=1}^p\beta_j^2$ shows that it equals the centered criterion $\sum_{i=1}^n\big(y_i - \beta_0^c - \sum_{j=1}^p (x_{ij}-\bar{x}_j)\beta_j^c\big)^2 + \lambda\sum_{j=1}^p(\beta_j^c)^2$, where $\beta_j = \beta_j^c$ for $1 \leq j \leq p$ and $\beta_0 = \beta_0^c - \sum_{j=1}^p\bar{x}_j\beta_j^c$. Since the penalty does not involve the intercept, the minimizing slopes are unchanged by centering.
For the lasso the argument is identical, because the $\ell_1$ penalty also leaves the intercept unpenalized.
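As a numerical check (the block linear system below is my own way of solving the unpenalized-intercept ridge problem, not something from ESL), the two parametrizations give the same fit:
In [ ]:
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 40, 3, 2.5
Xr = rng.normal(size=(n, p)) + 5.0                 # deliberately uncentered inputs
yr = Xr @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

# Original problem: minimize ||y - b0 - X b||^2 + lam ||b||^2 (intercept unpenalized),
# solved via its normal equations in block form.
A = np.zeros((p + 1, p + 1))
A[0, 0] = n
A[0, 1:] = Xr.sum(axis=0)
A[1:, 0] = Xr.sum(axis=0)
A[1:, 1:] = Xr.T @ Xr + lam * np.eye(p)
sol = np.linalg.solve(A, np.concatenate([[yr.sum()], Xr.T @ yr]))
b0, b = sol[0], sol[1:]

# Centered problem: beta^c = (Xc^T Xc + lam I)^{-1} Xc^T y and beta_0^c = ybar.
Xc = Xr - Xr.mean(axis=0)
b_c = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yr)
b0_c = yr.mean()

print(np.allclose(b, b_c), np.allclose(b0, b0_c - Xr.mean(axis=0) @ b_c))   # True True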
Exercise 1.5 Ex 3.9 in ESL
Write $X = QR$, where $Q$ has orthonormal columns and $R$ is upper triangular; then $\hat{y} = QQ^T y$ and the current residual is $r = (I - QQ^T)y$.
When we add a new feature, let $q$ be the new column after it has been orthogonalized against $Q$ and normalized, and write $Q_{new} = [Q, q]$. Then
$RSS = y^T(I - Q_{new} Q_{new}^T) y = y^T(I - QQ^T - qq^T)y = \|r\|_2^2 - (y^Tq)^2$
So we want to find the candidate whose $q$ maximizes $(y^Tq)^2$.
To make the algorithm efficient, we do not recompute the QR decomposition for each candidate; we only orthogonalize each remaining column against the existing $Q$.
The detailed algorithm:
For each remaining candidate column $x_i$ (say $i = p+1, p+2, \ldots$):
$\qquad$ Compute $q_i$ by Gram-Schmidt against the columns $q_1, \ldots, q_p$ of $Q$: $\tilde{q}_i = x_i - \sum_{j=1}^p (q_j^T x_i)\, q_j$, then normalize $q_i = \tilde{q}_i/\|\tilde{q}_i\|_2$
$\qquad$ Compute $(y^Tq_i)^2$
Output the $i$ that maximizes $(y^Tq_i)^2$; including this variable reduces the residual sum of squares the most.
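A minimal numpy sketch of this procedure (the function `best_next_variable` and the toy data are my own), assuming the current model columns are in `X1` and the candidate columns in `X2`:
In [ ]:
import numpy as np

def best_next_variable(X1, X2, y):
    """Return the index of the column of X2 that most reduces the RSS of the
    least-squares fit currently spanned by the columns of X1."""
    Q, _ = np.linalg.qr(X1)                      # one QR decomposition of the current design
    scores = []
    for j in range(X2.shape[1]):
        z = X2[:, j] - Q @ (Q.T @ X2[:, j])      # Gram-Schmidt against the current columns
        nz = np.linalg.norm(z)
        if nz < 1e-12:                           # candidate already lies in the current span
            scores.append(0.0)
            continue
        q = z / nz
        scores.append((q @ y)**2)                # the RSS drops by exactly this amount
    return int(np.argmax(scores))

# toy usage
rng = np.random.default_rng(3)
X1 = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
X2 = rng.normal(size=(100, 4))
y_toy = X1 @ np.array([1.0, 2.0, -1.0]) + 3.0 * X2[:, 2] + rng.normal(size=100)
print(best_next_variable(X1, X2, y_toy))         # expected: 2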
You will be graded on several criteria, each on a 5 point scale (5 is excellent - A; 1 is poor - C; 0 is not answered - D/F). You should strive to 'impress us' if you want a 5: excellent code, well explained conclusions, well annotated plots, correct answers, etc.
Exercise 2 You should run the following code cells to import the code and reduce the variable set. Address the questions after the code.
In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import LeaveOneOut
from sklearn import linear_model, neighbors
%matplotlib inline
plt.style.use('ggplot')
# dataset path
data_dir = "."
In [4]:
sample_data = pd.read_csv(data_dir+"/hw1.csv", delimiter=',')
In [5]:
sample_data.head()
Out[5]: (first five rows of sample_data)
The response variable is quality.
In [6]:
X = np.array(sample_data.iloc[:,range(1,5)])
y = np.array(sample_data.iloc[:,0])
In [7]:
def loo_risk(X,y,regmod):
    """
    Construct the leave-one-out square error risk for a regression model
    Input: design matrix, X, response vector, y, a regression model, regmod
    Output: scalar LOO risk
    """
    loo = LeaveOneOut()
    loo_losses = []
    for train_index, test_index in loo.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        regmod.fit(X_train,y_train)
        y_hat = regmod.predict(X_test)
        loss = np.sum((y_hat - y_test)**2)
        loo_losses.append(loss)
    return np.mean(loo_losses)

def emp_risk(X,y,regmod):
    """
    Return the empirical risk for square error loss
    Input: design matrix, X, response vector, y, a regression model, regmod
    Output: scalar empirical risk
    """
    regmod.fit(X,y)
    y_hat = regmod.predict(X)
    return np.mean((y_hat - y)**2)
Exercise 2.1 (5 pts) Compare the leave-one-out risk with the empirical risk for linear regression, on this dataset.
In [12]:
lin1 = linear_model.LinearRegression(fit_intercept=True)
print('LOO Risk: '+ str(loo_risk(X,y,lin1)))
print('Emp Risk: ' + str(emp_risk(X,y,lin1)))
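As a cross-check (not required by the exercise), the LOO risk of least squares can also be computed from a single fit via the PRESS identity $e_i^{(-i)} = e_i/(1-h_{ii})$, where $h_{ii}$ is the $i$th diagonal entry of the hat matrix; the snippet below assumes `X` and `y` as loaded above:
In [ ]:
X1 = np.column_stack([np.ones(len(y)), X])        # add the intercept column explicitly
H = X1 @ np.linalg.solve(X1.T @ X1, X1.T)         # hat matrix
resid = y - H @ y                                 # in-sample residuals
press = np.mean((resid / (1 - np.diag(H)))**2)
print('PRESS LOO Risk: ' + str(press))            # should agree with loo_risk(X,y,lin1)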
Exercise 2.2 (10 pts) Perform kNN regression and compare the leave-one-out risk with the empirical risk for k from 1 to 50. Remark on the tradeoff between bias and variance for this dataset and compare against linear regression.
In [ ]:
LOOs = []
MSEs = []
K = 60
Ks = range(1, K+1)
for k in Ks:
    knn = neighbors.KNeighborsRegressor(n_neighbors=k)
    LOOs.append(loo_risk(X,y,knn))
    MSEs.append(emp_risk(X,y,knn))
In [ ]:
plt.plot(Ks,LOOs,'r',label="LOO risk")
plt.title("Risks for kNN Regression")
plt.plot(Ks,MSEs,'b',label="Emp risk")
plt.legend()
_ = plt.xlabel('k')
In [ ]:
min(LOOs)
In [ ]:
print('optimal k: ' + str(Ks[LOOs.index(min(LOOs))]))  # index into Ks since k starts at 1
Conclusion For small k the kNN fit has low bias but high variance, so the empirical risk is optimistic; as k grows the bias increases and the variance decreases, and the LOO risk is minimized at an intermediate k. Comparing the performance of kNN and linear regression, we see that 16-nearest neighbors achieves a LOO risk of 233.2, which is lower than that of linear regression (243.5).
Exercise 2.3 (10 pts) Implement forward stepwise regression (ESL section 3.3.2) for the linear model and compare the LOO risk for each stage. Recall that at each step forward stepwise regression will select a new variable that most improves the empirical risk and include that in the model (starting with the intercept).
In [29]:
n,p = X.shape
supp = []    # indices of the selected variables, in the order they were added
LOOs = []    # LOO risk of the model after each addition
while len(supp) < p:
    rem = list(set(range(p)) - set(supp))
    # pick the remaining variable that most reduces the empirical risk
    ERMs = [emp_risk(X[:,supp+[j]],y,linear_model.LinearRegression(fit_intercept=True)) for j in rem]
    jmin = rem[np.argmin(ERMs)]
    supp.append(jmin)
    LOOs.append(loo_risk(X[:,supp],y,linear_model.LinearRegression(fit_intercept=True)))
In [32]:
for i,s,loo in zip(range(p),supp,LOOs):
    print("Step {} added variable {} with LOO: {}".format(i,s,loo))
Conclusion The model with the smallest LOO risk uses variables 1, 2, and 3 and excludes variable 0, but its LOO risk is not much different from that of the full model. We gain more from using k-nearest neighbors than from variable selection with forward stepwise regression.
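To make the comparison in the conclusion explicit, one could print the two LOO risks side by side; this sketch assumes `supp` and `LOOs` from the stepwise loop are still in scope and recomputes the kNN sweep, since `LOOs` was reused above:
In [ ]:
best_steps = int(np.argmin(LOOs)) + 1     # number of variables in the best stepwise model
step_loo = loo_risk(X[:,supp[:best_steps]], y, linear_model.LinearRegression(fit_intercept=True))
knn_loos = [loo_risk(X, y, neighbors.KNeighborsRegressor(n_neighbors=k)) for k in range(1,51)]
print('Best forward stepwise LOO: ' + str(step_loo))
print('Best kNN LOO (k in 1..50): ' + str(min(knn_loos)))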