We use a script that extracts your answers by looking for cells in between the cells containing the exercise statements (beginning with Exercise X.X). So you should put your answers in cells between the exercise statements and leave the exercise statement cells themselves unchanged.
To write markdown, please switch the cell type to Markdown (from Code) - you can hit 'm' when you are in command mode - and use markdown syntax. For a brief tutorial see: https://daringfireball.net/projects/markdown/syntax
In the following exercises you should provide an explanation, with math when necessary, for any answers. When answering with math you should use basic LaTeX, as in $$E(Y|X=x) = \int_{\mathcal{Y}} f_{Y|X}(y|x) dy = \int_{\mathcal{Y}} \frac{f_{Y,X}(y,x)}{f_{X}(x)} dy$$ for displayed equations, and $R_{i,j} = 2^{-|i-j|}$ for inline equations. (To see the contents of this cell in markdown, double click on it or hit Enter in escape mode.) To see a list of latex math symbols see here: http://web.ift.uib.no/Teori/KURS/WRK/TeX/symALL.html
Exercise 1.1. (5 pts) Recall that the Hamming loss for Binary classification ($y \in \{0,1\}$) is $$l(y,\hat y) = 1\{y \ne \hat y\} = (y - \hat y)^2$$ as long as $\hat y \in \{0,1\}$. This loss can be extended to multiclass classification where there are $K$ possible values that $y$ can take (for example 'dog','cat','squirrel' or 1-5 stars). Explain how you can re-encode $y$ and $\hat y$ to be a $K-1$ dimensional vector that generalizes binary classification, and rewrite the loss using vector operations.
If we encode $y$ and $\hat{y}$ as $K$ dimensional vectors, then $\hat{y} = e_i$, where $e_i$ is the vector with $K-1$ zeros and a 1 at the $i$th index. The corresponding loss is $l(y,\hat{y}) = \frac{1}{2}\|y - \hat{y}\|_2^2$, which equals $1\{y \ne \hat y\}$ because two distinct one-hot vectors differ in exactly two coordinates.
It is also possible to encode $\hat{y}$ with a $K-1$ dimensional vector and still use the quadratic loss: for the first $K-1$ classes we keep $\hat{y} = e_i$ (now in $\mathbb{R}^{K-1}$), and for class $K$ we set $\hat{y} = [\alpha, \alpha, \ldots, \alpha]$, the vector whose entries all equal $\alpha$, where $\alpha$ solves $(1-\alpha)^2 + (K-2)\alpha^2 = 2$. This makes every pair of distinct classes lie at the same squared distance 2, so $\frac{1}{2}\|y - \hat{y}\|_2^2$ is again the Hamming loss.
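As a quick numerical sanity check, here is a minimal numpy sketch of the $K$-dimensional encoding and the vectorized loss (the helper names `encode_onehot` and `hamming_loss` are my own, not part of the exercise):
In [ ]:
import numpy as np

def encode_onehot(labels, K):
    """Map integer class labels 0,...,K-1 to K-dimensional one-hot vectors."""
    return np.eye(K)[labels]

def hamming_loss(y_enc, yhat_enc):
    """Half the squared distance between encodings: 0 if the classes agree, 1 otherwise."""
    return 0.5 * np.sum((y_enc - yhat_enc)**2, axis=-1)

labels = np.array([0, 2, 1])
preds  = np.array([0, 1, 1])
print(hamming_loss(encode_onehot(labels, 3), encode_onehot(preds, 3)))   # [0. 1. 0.]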
Exercise 1.2 (5 pts) Ex. 2.7 in ESL
(a)
For convenience, we write $x$ for the augmented vector $[1, x]$, so the intercept is absorbed into $\beta$.
For linear regression, $\hat{f}(x_0) = x_0^T \hat{\beta}$, where $\hat{\beta} = (X^TX)^{-1}X^TY$,
so $\hat{f}(x_0) = x_0^T (X^TX)^{-1}X^TY = \sum_{i=1}^n x_0^T (X^TX)^{-1} x_i\, y_i$, giving
$l_i(x_0; X) = x_0^T (X^TX)^{-1} x_i$
For $k$-nearest-neighbour regression, $\hat{f}(x_0) = \frac{1}{k} \sum_{i \in N_k(x_0)}y_i = \sum_{i=1}^n \frac{1}{k}I(x_i \in N_k(x_0))\, y_i$, giving
$l_i(x_0; X) = \frac{1}{k} I(x_i \in N_k(x_0))$
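A small numerical sketch of the two weight vectors (the simulated design and the variable names are illustrative only), checking that $\hat f(x_0) = \sum_i l_i(x_0)\, y_i$ in both cases:
In [ ]:
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 50, 3, 5
Xa = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # augmented design [1, x]
ya = rng.normal(size=n)
x0 = np.concatenate([[1.0], rng.normal(size=p)])

# OLS weights: l_i(x0) = x0^T (Xa^T Xa)^{-1} x_i
w_ols = Xa @ np.linalg.solve(Xa.T @ Xa, x0)
print(np.allclose(w_ols @ ya, x0 @ np.linalg.solve(Xa.T @ Xa, Xa.T @ ya)))   # True

# kNN weights: l_i(x0) = (1/k) * 1{x_i in N_k(x0)}
dist = np.linalg.norm(Xa - x0, axis=1)
nbrs = np.argsort(dist)[:k]
w_knn = np.zeros(n)
w_knn[nbrs] = 1.0 / k
print(np.allclose(w_knn @ ya, ya[nbrs].mean()))                              # True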
(b)
$\mathbb{E}_{Y|X}(f(x_0)-\hat{f}(x_0))^2 = \mathbb{E}_{Y|X}\big(f(x_0) - \mathbb{E}_{Y|X}\hat{f}(x_0) + \mathbb{E}_{Y|X}\hat{f}(x_0) - \hat{f}(x_0)\big)^2 = [f(x_0)-\mathbb{E}_{Y|X}\hat{f}(x_0)]^2 + \mathbb{E}_{Y|X}[\mathbb{E}_{Y|X}\hat{f}(x_0) - \hat{f}(x_0)]^2 \\ + 2[f(x_0)-\mathbb{E}_{Y|X}\hat{f}(x_0)]\,\mathbb{E}_{Y|X}[\mathbb{E}_{Y|X}\hat{f}(x_0) - \hat{f}(x_0)] = [f(x_0)-\mathbb{E}_{Y|X}\hat{f}(x_0)]^2 + \mathbb{E}_{Y|X}[\mathbb{E}_{Y|X}\hat{f}(x_0) - \hat{f}(x_0)]^2 = bias_{Y|X}^2 + var_{Y|X}(\hat{f}(x_0))$, since the cross term vanishes: $\mathbb{E}_{Y|X}[\mathbb{E}_{Y|X}\hat{f}(x_0) - \hat{f}(x_0)] = 0$.
(c)
$\mathbb{E}_{Y,X}(f(x_0)-\hat{f}(x_0))^2 = \mathbb{E}_{Y,X}\big(f(x_0) - \mathbb{E}_{Y,X}\hat{f}(x_0) + \mathbb{E}_{Y,X}\hat{f}(x_0) - \hat{f}(x_0)\big)^2 = [f(x_0)-\mathbb{E}_{Y,X}\hat{f}(x_0)]^2 + \mathbb{E}_{Y,X}[\mathbb{E}_{Y,X}\hat{f}(x_0) - \hat{f}(x_0)]^2 \\ + 2[f(x_0)-\mathbb{E}_{Y,X}\hat{f}(x_0)]\,\mathbb{E}_{Y,X}[\mathbb{E}_{Y,X}\hat{f}(x_0) - \hat{f}(x_0)] = [f(x_0)-\mathbb{E}_{Y,X}\hat{f}(x_0)]^2 + \mathbb{E}_{Y,X}[\mathbb{E}_{Y,X}\hat{f}(x_0) - \hat{f}(x_0)]^2 = bias_{Y,X}^2 + var_{Y,X}(\hat{f}(x_0))$, since again the cross term vanishes: $\mathbb{E}_{Y,X}[\mathbb{E}_{Y,X}\hat{f}(x_0) - \hat{f}(x_0)] = 0$.
(d)
By Adam's law (the law of total expectation), $\mathbb{E}_{X}\mathbb{E}_{Y|X}\hat{f}(x_0) = \mathbb{E}_{Y,X}\hat{f}(x_0)$, so the cross term vanishes in
$\mathbb{E}_{X}[ bias_{Y|X}^2 ] = \mathbb{E}_{X} [f(x_0) - \mathbb{E}_{Y,X}\hat{f}(x_0) + \mathbb{E}_{Y,X}\hat{f}(x_0) - \mathbb{E}_{Y|X}\hat{f}(x_0)]^2 = bias_{Y,X}^2 + var_X( \mathbb{E}_{Y|X}\hat{f}(x_0) )$
By Eve's law (the law of total variance),
$var_{Y,X}(\hat{f}(x_0)) = \mathbb{E}_{X}[ var_{Y|X}(\hat{f}(x_0)) ] + var_{X} (\mathbb{E}_{Y|X}\hat{f}(x_0))$
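Combining the two laws, the conditional decomposition in (b), averaged over $X$, recovers the unconditional decomposition in (c): $$\mathbb{E}_{X}\big[ bias_{Y|X}^2 \big] + \mathbb{E}_{X}\big[ var_{Y|X}(\hat{f}(x_0)) \big] = bias_{Y,X}^2 + var_X\big( \mathbb{E}_{Y|X}\hat{f}(x_0) \big) + \mathbb{E}_{X}\big[ var_{Y|X}(\hat{f}(x_0)) \big] = bias_{Y,X}^2 + var_{Y,X}(\hat{f}(x_0)).$$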
Exercise 1.3 (5 pts, 1 for each part) Recall that the true risk for a prediction function, $f$, a loss function, $\ell$, and a joint distribution for $Y,X$ is $$R(f) = E \ell(y,f(x))$$ For a training set $\{x_i,y_i\}_{i=1}^n$, the empirical risk is $$R_n = \frac{1}{n} \sum_{i=1}^n \ell(y_i,f(x_i)).$$ Let $y = x^\top \beta + \epsilon$ be a linear model for $Y|X$, where $x,\beta$ are $p$-dimensional and $\epsilon$ is Gaussian with mean 0 and variance $\sigma^2$ (independent of X). Let $\ell(y,\hat y) = (y - \hat y)^2$ be square error loss.
(1)
We know that the minimizer of $\mathbb{E}(Y-g(X))^2$ over functions $g$ is $g(X) = \mathbb{E}(Y|X)$.
$f^*(x)$ is the minimizer of the true risk $\mathbb{E}(Y-f(X))^2$,
so $f^*(x) = \mathbb{E}(Y|X=x) = \mathbb{E}(x^T \beta + \epsilon \mid X = x) = x^T\beta + \mathbb{E}(\epsilon) = x^T\beta$.
(2)
In practice we do not know $\beta$, so $f^\star(x) = x^T\beta$ cannot be used directly; it has to be estimated from the training data.
(3) Solution 1 Since $\hat f$ minimizes the empirical risk and $f^\star$ is a linear function (by (1)), $R_n(\hat f) \le R_n(f^\star)$, and $$E R_n(f^\star) = E \left( \frac 1n \sum_{i=1}^n \ell(y_i,f^\star(x_i)) \right) = \frac 1n \sum_{i=1}^n R(f^\star) = R(f^\star).$$ Hence, $E R_n(\hat f) \le R(f^\star)$.
Solution 2
(Note that $R(f^*) = \mathbb{E}(y - x^T\beta)^2 = \mathbb{E}(\epsilon^2) = \sigma^2$.)
$\mathbb{E}[l(y_i, \hat{f}(x_i))] = \mathbb{E}\big[\, \mathbb{E}[l(y_i, \hat{f}(x_i)) \mid X] \,\big]$
We calculate the conditional expectation first (treating the design matrix $X$ as fixed):
$\mathbb{E}[l(y_i, \hat{f}(x_i)) \mid X] = \mathbb{E}\big(x_i^T\beta + \epsilon_i - x_i^T (X^TX)^{-1}X^TY\big)^2 = \mathbb{E}\big(x_i^T\beta + \epsilon_i - x_i^T (X^TX)^{-1}X^T(X\beta + \mathbf{\epsilon})\big)^2\\ = \mathbb{E}\big(\epsilon_i - x_i^T(X^TX)^{-1}X^T\mathbf{\epsilon}\big)^2 = \mathbb{E}\big((e_i - X(X^TX)^{-1}x_i)^T \mathbf{\epsilon}\big)^2 = (e_i - X(X^TX)^{-1}x_i)^T\, \sigma^2 I\, (e_i - X(X^TX)^{-1}x_i)\\ = \sigma^2\big( 1 - 2 e_i^T X(X^TX)^{-1}x_i + x_i^T(X^TX)^{-1}X^TX(X^TX)^{-1}x_i \big) = \sigma^2\big(1-x_i^T(X^TX)^{-1}x_i\big) \leq \sigma^2$
$\qquad$ (using $e_i^T X = x_i^T$ and the fact that $(X^TX)^{-1}$ is positive definite)
We have shown that $\mathbb{E}[l(y_i, \hat{f}(x_i)) \mid X] \leq \sigma^2$ for every $X$ such that $X^TX$ has full rank. So $\mathbb{E}[l(y_i, \hat{f}(x_i))] = \mathbb{E}\big[\, \mathbb{E}[l(y_i, \hat{f}(x_i)) \mid X] \,\big] \leq \mathbb{E}(\sigma^2) = \sigma^2$
So $\mathbb{E}(R_n(\hat{f})) = E(\frac{1}{n}\sum_{i=1}^nl(y_i,\hat{f}(x_i))) = \frac{1}{n}\sum_{i=1}^n \mathbb{E}(l(y_i, \hat{f}(x_i))) \leq \sigma^2 = R(f^*)$
(4) Solution 1 From (1), $f^\star$ minimizes the true risk, so $R(f^\star) \le R(\hat f)$. Combining this with (3), $$E R_n (\hat f) \le R(f^\star) \le E R(\hat f).$$ Therefore, the expected test error is greater than or equal to the expected training error.
Solution 2
For a newly observed pair $(x_0, y_0)$, write $\mathbf{\epsilon}_t$ for the noise vector of the training set; $\epsilon_0$ and $\mathbf{\epsilon}_t$ are independent.
$R(\hat{f}) = \mathbb{E}\big(y_0 - x_0^T(X^TX)^{-1}X^TY\big)^2 = \mathbb{E}\big(x_0^T\beta + \epsilon_0 - x_0^T(X^TX)^{-1}X^T(X\beta + \mathbf{\epsilon}_t)\big)^2 = \mathbb{E}\big(\epsilon_0 - x_0^T(X^TX)^{-1}X^T\mathbf{\epsilon}_t\big)^2 \\= \sigma^2 + \mathbb{E}\big(x_0^T(X^TX)^{-1}X^T\mathbf{\epsilon}_t\big)^2 - 2\,\mathbb{E}(\epsilon_0)\, \mathbb{E}\big(x_0^T(X^TX)^{-1}X^T\mathbf{\epsilon}_t\big) = \sigma^2 + \mathbb{E}\big(x_0^T(X^TX)^{-1}X^T\mathbf{\epsilon}_t\big)^2 \geq \sigma^2 = R(f^*)$
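A small Monte Carlo sketch (the dimensions, seed, and number of replications are arbitrary choices, not part of the exercise) illustrating $E R_n(\hat f) \le \sigma^2 = R(f^\star) \le E R(\hat f)$ for least squares under this model:
In [ ]:
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma, reps = 30, 5, 1.0, 2000
beta = rng.normal(size=p)

train_risks, test_risks = [], []
for _ in range(reps):
    Xt = rng.normal(size=(n, p))
    yt = Xt @ beta + sigma * rng.normal(size=n)
    beta_hat = np.linalg.lstsq(Xt, yt, rcond=None)[0]        # empirical risk minimizer (OLS)
    train_risks.append(np.mean((yt - Xt @ beta_hat)**2))     # R_n(fhat)
    Xnew = rng.normal(size=(n, p))                           # fresh data from the same model
    ynew = Xnew @ beta + sigma * rng.normal(size=n)
    test_risks.append(np.mean((ynew - Xnew @ beta_hat)**2))  # estimate of R(fhat)

# expected ordering: E R_n(fhat) <= sigma^2 <= E R(fhat)
print(np.mean(train_risks), sigma**2, np.mean(test_risks))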
(5) If we refer to Solution 1 for (3) and (4), we see that the only place where the Gaussian linear model was used was in (1), to show that $f^\star$ is itself a linear function. So the most general statement is:
Let $f^\star$ be the minimizer of the true risk $R(f)$, and let $\hat f$ be the minimizer of $R_n$ over a class of functions that contains $f^\star$. Then $E R_n(\hat f) \le R(f^\star) \le E R(\hat f)$.
Exercise 1.4 Ex. 3.5 in ESL
Substituting $x_{ij} = (x_{ij}-\bar{x}_j) + \bar{x}_j$ into the ridge criterion $\sum_{i=1}^n\big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\big)^2 + \lambda\sum_{j=1}^p\beta_j^2$ shows that it equals the centered criterion $\sum_{i=1}^n\big(y_i - \beta_0^c - \sum_{j=1}^p (x_{ij}-\bar{x}_j)\beta_j^c\big)^2 + \lambda\sum_{j=1}^p(\beta_j^c)^2$, where $\beta_j = \beta_j^c$ for $1 \leq j \leq p$ and $\beta_0 = \beta_0^c - \sum_{j=1}^p\bar{x}_j\beta_j^c$. Since the penalty does not involve the intercept, the minimizing slopes are unchanged by centering.
For the lasso the argument is identical, because the $\ell_1$ penalty also leaves the intercept unpenalized.
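As a numerical check (the block linear system below is my own way of solving the unpenalized-intercept ridge problem, not something from ESL), the two parametrizations give the same fit:
In [ ]:
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 40, 3, 2.5
Xr = rng.normal(size=(n, p)) + 5.0                 # deliberately uncentered inputs
yr = Xr @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

# Original problem: minimize ||y - b0 - X b||^2 + lam ||b||^2 (intercept unpenalized),
# solved via its normal equations in block form.
A = np.zeros((p + 1, p + 1))
A[0, 0] = n
A[0, 1:] = Xr.sum(axis=0)
A[1:, 0] = Xr.sum(axis=0)
A[1:, 1:] = Xr.T @ Xr + lam * np.eye(p)
sol = np.linalg.solve(A, np.concatenate([[yr.sum()], Xr.T @ yr]))
b0, b = sol[0], sol[1:]

# Centered problem: beta^c = (Xc^T Xc + lam I)^{-1} Xc^T y and beta_0^c = ybar.
Xc = Xr - Xr.mean(axis=0)
b_c = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yr)
b0_c = yr.mean()

print(np.allclose(b, b_c), np.allclose(b0, b0_c - Xr.mean(axis=0) @ b_c))   # True True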
Exercise 1.5 Ex 3.9 in ESL
Write $X = QR$, where $Q$ has orthonormal columns and $R$ is upper triangular; then $\hat{y} = QQ^T y$ and the current residual is $r = (I - QQ^T)y$.
When we add a new feature, let $q$ be the new column after it has been orthogonalized against $Q$ and normalized, and write $Q_{new} = [Q, q]$. Then
$RSS = y^T(I - Q_{new} Q_{new}^T) y = y^T(I - QQ^T - qq^T)y = \|r\|_2^2 - (y^Tq)^2$
So we want to find the candidate whose $q$ maximizes $(y^Tq)^2$.
To make the algorithm efficient, we do not recompute the QR decomposition for each candidate; we only orthogonalize each remaining column against the existing $Q$.
The detailed algorithm:
For each remaining candidate column $x_i$ (say $i = p+1, p+2, \ldots$):
$\qquad$ Compute $q_i$ by Gram-Schmidt against the columns $q_1, \ldots, q_p$ of $Q$: $\tilde{q}_i = x_i - \sum_{j=1}^p (q_j^T x_i)\, q_j$, then normalize $q_i = \tilde{q}_i/\|\tilde{q}_i\|_2$
$\qquad$ Compute $(y^Tq_i)^2$
Output the $i$ that maximizes $(y^Tq_i)^2$; including this variable reduces the residual sum of squares the most.
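A minimal numpy sketch of this procedure (the function `best_next_variable` and the toy data are my own), assuming the current model columns are in `X1` and the candidate columns in `X2`:
In [ ]:
import numpy as np

def best_next_variable(X1, X2, y):
    """Return the index of the column of X2 that most reduces the RSS of the
    least-squares fit currently spanned by the columns of X1."""
    Q, _ = np.linalg.qr(X1)                      # one QR decomposition of the current design
    scores = []
    for j in range(X2.shape[1]):
        z = X2[:, j] - Q @ (Q.T @ X2[:, j])      # Gram-Schmidt against the current columns
        nz = np.linalg.norm(z)
        if nz < 1e-12:                           # candidate already lies in the current span
            scores.append(0.0)
            continue
        q = z / nz
        scores.append((q @ y)**2)                # the RSS drops by exactly this amount
    return int(np.argmax(scores))

# toy usage
rng = np.random.default_rng(3)
X1 = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
X2 = rng.normal(size=(100, 4))
y_toy = X1 @ np.array([1.0, 2.0, -1.0]) + 3.0 * X2[:, 2] + rng.normal(size=100)
print(best_next_variable(X1, X2, y_toy))         # expected: 2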
You will be graded on several criteria, each on a 5 point scale (5 is excellent - A; 1 is poor - C; 0 is not answered - D/F). You should strive to 'impress us' if you want a 5: excellent code, well explained conclusions, well annotated plots, correct answers, etc.
Exercise 2 You should run the following code cells to import the code and reduce the variable set. Address the questions after the code.
In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import LeaveOneOut
from sklearn import linear_model, neighbors
%matplotlib inline
plt.style.use('ggplot')
# dataset path
data_dir = "."
In [4]:
sample_data = pd.read_csv(data_dir+"/hw1.csv", delimiter=',')
In [5]:
sample_data.head()
Out[5]: (first five rows of sample_data)
The response variable is quality.
In [6]:
X = np.array(sample_data.iloc[:,range(1,5)])
y = np.array(sample_data.iloc[:,0])
In [7]:
def loo_risk(X,y,regmod):
    """
    Construct the leave-one-out square error risk for a regression model
    Input: design matrix, X, response vector, y, a regression model, regmod
    Output: scalar LOO risk
    """
    loo = LeaveOneOut()
    loo_losses = []
    for train_index, test_index in loo.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        regmod.fit(X_train,y_train)
        y_hat = regmod.predict(X_test)
        loss = np.sum((y_hat - y_test)**2)
        loo_losses.append(loss)
    return np.mean(loo_losses)

def emp_risk(X,y,regmod):
    """
    Return the empirical risk for square error loss
    Input: design matrix, X, response vector, y, a regression model, regmod
    Output: scalar empirical risk
    """
    regmod.fit(X,y)
    y_hat = regmod.predict(X)
    return np.mean((y_hat - y)**2)
Exercise 2.1 (5 pts) Compare the leave-one-out risk with the empirical risk for linear regression, on this dataset.
In [12]:
lin1 = linear_model.LinearRegression(fit_intercept=True)
print('LOO Risk: '+ str(loo_risk(X,y,lin1)))
print('Emp Risk: ' + str(emp_risk(X,y,lin1)))
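As a cross-check (not required by the exercise), the LOO risk of least squares can also be computed from a single fit via the PRESS identity $e_i^{(-i)} = e_i/(1-h_{ii})$, where $h_{ii}$ is the $i$th diagonal entry of the hat matrix; the snippet below assumes `X` and `y` as loaded above:
In [ ]:
X1 = np.column_stack([np.ones(len(y)), X])        # add the intercept column explicitly
H = X1 @ np.linalg.solve(X1.T @ X1, X1.T)         # hat matrix
resid = y - H @ y                                 # in-sample residuals
press = np.mean((resid / (1 - np.diag(H)))**2)
print('PRESS LOO Risk: ' + str(press))            # should agree with loo_risk(X,y,lin1)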
Exercise 2.2 (10 pts) Perform kNN regression and compare the leave-one-out risk with the empirical risk for k from 1 to 50. Remark on the tradeoff between bias and variance for this dataset and compare against linear regression.
In [ ]:
LOOs = []
MSEs = []
K = 60
Ks = range(1, K+1)
for k in Ks:
    knn = neighbors.KNeighborsRegressor(n_neighbors=k)
    LOOs.append(loo_risk(X,y,knn))
    MSEs.append(emp_risk(X,y,knn))
In [ ]:
plt.plot(Ks,LOOs,'r',label="LOO risk")
plt.title("Risks for kNN Regression")
plt.plot(Ks,MSEs,'b',label="Emp risk")
plt.legend()
_ = plt.xlabel('k')
In [ ]:
min(LOOs)
In [ ]:
print('optimal k: ' + str(Ks[LOOs.index(min(LOOs))]))  # index into Ks since k starts at 1
Conclusion For small k the kNN fit has low bias but high variance, so the empirical risk is optimistic; as k grows the bias increases and the variance decreases, and the LOO risk is minimized at an intermediate k. Comparing the performance of kNN and linear regression, we see that 16-nearest neighbors achieves a LOO risk of 233.2, which is lower than that of linear regression (243.5).
Exercise 2.3 (10 pts) Implement forward stepwise regression (ESL section 3.3.2) for the linear model and compare the LOO risk for each stage. Recall that at each step forward stepwise regression will select a new variable that most improves the empirical risk and include that in the model (starting with the intercept).
In [29]:
n,p = X.shape
supp = []    # indices of the selected variables, in the order they were added
LOOs = []    # LOO risk of the model after each addition
while len(supp) < p:
    rem = list(set(range(p)) - set(supp))
    # pick the remaining variable that most reduces the empirical risk
    ERMs = [emp_risk(X[:,supp+[j]],y,linear_model.LinearRegression(fit_intercept=True)) for j in rem]
    jmin = rem[np.argmin(ERMs)]
    supp.append(jmin)
    LOOs.append(loo_risk(X[:,supp],y,linear_model.LinearRegression(fit_intercept=True)))
In [32]:
for i,s,loo in zip(range(p),supp,LOOs):
    print("Step {} added variable {} with LOO: {}".format(i,s,loo))
Conclusion The model with the smallest LOO risk uses variables 1, 2, and 3 and excludes variable 0, but its LOO risk is not much different from that of the full model. We gain more from using k-nearest neighbors than from variable selection with forward stepwise regression.
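To make the comparison in the conclusion explicit, one could print the two LOO risks side by side; this sketch assumes `supp` and `LOOs` from the stepwise loop are still in scope and recomputes the kNN sweep, since `LOOs` was reused above:
In [ ]:
best_steps = int(np.argmin(LOOs)) + 1     # number of variables in the best stepwise model
step_loo = loo_risk(X[:,supp[:best_steps]], y, linear_model.LinearRegression(fit_intercept=True))
knn_loos = [loo_risk(X, y, neighbors.KNeighborsRegressor(n_neighbors=k)) for k in range(1,51)]
print('Best forward stepwise LOO: ' + str(step_loo))
print('Best kNN LOO (k in 1..50): ' + str(min(knn_loos)))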