STA 208: Homework 2

This is based on the material in Chapters 3 and 4.4 of 'Elements of Statistical Learning' (ESL), in addition to lectures 4-6. Chunzhe Zhang came up with the dataset and the analysis in the second section.

Instructions

We use a script that extracts your answers by looking for cells in between the cells containing the exercise statements (beginning with Exercise X.X). So you

  • MUST add cells in between the exercise statements and add answers within them and
  • MUST NOT modify the existing cells, particularly not the problem statement

To write markdown, switch the cell type from code to markdown - you can press 'm' while in command mode - and use markdown syntax. For a brief tutorial see: https://daringfireball.net/projects/markdown/syntax

In the conceptual exercises you should provide an explanation, with math when necessary, for any answers. When answering with math you should use basic LaTeX, as in $$E(Y|X=x) = \int_{\mathcal{Y}} f_{Y|X}(y|x) dy = \int_{\mathcal{Y}} \frac{f_{Y,X}(y,x)}{f_{X}(x)} dy$$ for displayed equations, and $R_{i,j} = 2^{-|i-j|}$ for inline equations. (To see the contents of this cell in markdown, double click on it or hit Enter in escape mode.) To see a list of latex math symbols see here: http://web.ift.uib.no/Teori/KURS/WRK/TeX/symALL.html

1. Conceptual Exercises

Exercise 1.1. (5 pts) Ex. 3.29 in ESL

Exercise 1.2 (5 pts) Ex. 3.30 in ESL

Exercise 1.3 (5 pts) $Y \in \{0,1\}$ follows an exponential family model with natural parameter $\eta$ if $$P(Y=y) = \exp\left( y \eta - \psi(\eta) \right).$$ Show that when $\eta = x^\top \beta$ then $Y$ follows a logistic regression model.
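As a hint for the structure of the argument (a sketch only, using nothing beyond the normalization of the exponential family): because $Y \in \{0,1\}$, the two probabilities must sum to one, which pins down $\psi(\eta)$:

$$1 = P(Y=0) + P(Y=1) = e^{-\psi(\eta)}\left(1 + e^{\eta}\right) \quad\Longrightarrow\quad \psi(\eta) = \log\left(1 + e^{\eta}\right),$$

so that

$$P(Y=1) = e^{\eta - \psi(\eta)} = \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{-x^\top \beta}}$$

when $\eta = x^\top \beta$, which is exactly the logistic regression model.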

2. Data Analysis


In [14]:
import numpy as np
import pandas as pd

# dataset path
data_dir = "."

Load the following medical dataset with 750 patients. The response variable is survival time (Y); the predictors are 104 measurements taken at a specific time (the numerical variables have been standardized).


In [3]:
sample_data = pd.read_csv(data_dir+"/hw2.csv", delimiter=',')

In [4]:
sample_data.head()


Out[4]:
Y V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V95 V96 V97 V98 V99 V100 V101 V102 V103 V104
0 1498 No 0.171838 -0.081764 -1.448868 -1.302547 -0.143061 -0.339784 -1.206475 0.444493 ... -1.379066 0.420436 -0.827446 0.318695 -0.787409 0.351406 -0.836107 0.015502 0.435444 -0.879906
1 334 Yes -0.605767 -0.584360 -0.485169 -0.848111 -0.493546 -0.392332 -0.239788 0.421697 ... 0.398840 -0.434789 -0.698862 1.387219 0.948456 0.191397 1.451699 -1.243616 -0.699072 1.751434
2 845 Yes -0.266330 -0.126965 0.138401 0.262732 -0.202438 0.397194 0.137790 0.047847 ... -0.450999 -0.627830 0.677158 -0.140255 -0.798641 -0.972419 -0.852035 0.080914 -1.906252 0.705509
3 1484 No 0.113498 0.893293 -0.825298 -0.444168 0.756242 0.179122 -1.145078 -1.471261 ... 0.316312 0.131010 0.878134 -0.306249 -1.263270 1.316120 -0.999717 1.104161 -0.234038 -0.083488
4 301 Yes -0.620454 -0.608036 -0.088352 0.111253 -0.598898 -0.513191 0.753000 1.055418 ... 0.364884 0.251667 0.373787 -0.354599 0.085019 1.207509 -0.762206 -0.067318 0.158247 0.592638

5 rows × 105 columns


In [5]:
# Encode the categorical variable V1 ('Yes'/'No') as 1/0
sample_data.V1 = sample_data.V1.eq('Yes').mul(1)

The response variable is Y for 2.1-2.3 and Z for 2.4.


In [9]:
X = np.array(sample_data.iloc[:, 2:])   # numeric predictors V2-V104
y = np.array(sample_data.iloc[:, 0])    # survival time Y
z = np.array(sample_data.iloc[:, 1])    # categorical V1 (0/1)

Exercise 2.1 (10 pts) Perform ridge regression on the data and cross-validate to find the best ridge parameter.
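A minimal sketch of one approach, using scikit-learn's `RidgeCV` (the penalty grid and fold count are choices, not part of the assignment; in the notebook you would fit on the `X` and `y` defined above rather than the synthetic stand-ins used here to keep the example self-contained):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic stand-in for the notebook's (X, y); replace with the real arrays.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((750, 103))
y_demo = 3.0 * X_demo[:, 0] + rng.standard_normal(750)

# Cross-validate over a log-spaced grid of ridge penalties.
alphas = np.logspace(-3, 3, 50)
ridge = RidgeCV(alphas=alphas, cv=5).fit(X_demo, y_demo)
best_alpha = ridge.alpha_  # penalty selected by 5-fold CV
```

With `cv=5`, `RidgeCV` selects the penalty by 5-fold cross-validation over the supplied grid; leaving `cv=None` would instead use efficient leave-one-out CV.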

Exercise 2.2 (10 pts) Plot the lasso and LARS paths for the coefficients. All coefficients for a given method should be on the same plot, so you should get 2 plots. What are the major differences, if any? Are there any 'leaving' events in the lasso path?
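A sketch of how the two paths can be computed and plotted (assuming scikit-learn's `lasso_path` and `lars_path`; each row of the returned coefficient matrix is one coefficient's trajectory, so plotting the transpose puts all coefficients on one figure per method):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; unnecessary inside a notebook
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path, lars_path

# Synthetic stand-in for the notebook's (X, y).
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 20))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.standard_normal(200)

# Coefficient paths: one column of each `coefs` matrix per penalty value.
alphas_lasso, coefs_lasso, _ = lasso_path(X_demo, y_demo)
alphas_lars, _, coefs_lars = lars_path(X_demo, y_demo, method="lar")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].plot(-np.log(alphas_lasso), coefs_lasso.T)
axes[0].set_title("Lasso path")
axes[1].plot(alphas_lars, coefs_lars.T)
axes[1].set_title("LARS path")
```

A 'leaving' event shows up as a coefficient path returning to exactly zero after having been nonzero; pure LARS (`method="lar"`) never drops variables, while `method="lasso"` in `lars_path` does.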

Exercise 2.3 (10 pts) Cross-validate the Lasso and compare the results to the answer to 2.1.
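One way to cross-validate the lasso is scikit-learn's `LassoCV` (a sketch with synthetic stand-in data; in the notebook you would fit on `X` and `y` and compare the selected penalty and CV error against the ridge result from 2.1):

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic stand-in for the notebook's (X, y).
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 20))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.standard_normal(200)

# 5-fold cross-validated lasso; alpha_ is the selected penalty.
lasso = LassoCV(cv=5, random_state=0).fit(X_demo, y_demo)
active_set = np.flatnonzero(lasso.coef_)  # indices of nonzero coefficients
```

Unlike ridge, the lasso sets some coefficients exactly to zero, so along with the CV error you can report the size of the selected active set.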

Exercise 2.4 (15 pts) Obtain the 'best' active set from 2.3, and create a new design matrix with only these variables. Use this to predict the categorical variable $z$ with logistic regression.
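A sketch of the two-stage procedure (assuming scikit-learn; the active set comes from the cross-validated lasso as in 2.3, and the synthetic `z_demo` here merely stands in for the notebook's binary variable `z`):

```python
import numpy as np
from sklearn.linear_model import LassoCV, LogisticRegression

# Synthetic stand-ins for the notebook's X, y, z.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 20))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.standard_normal(200)
z_demo = (y_demo > np.median(y_demo)).astype(int)  # binary response

# Active set from the cross-validated lasso on (X, y).
lasso = LassoCV(cv=5, random_state=0).fit(X_demo, y_demo)
active_set = np.flatnonzero(lasso.coef_)

# Logistic regression on the reduced design matrix.
X_active = X_demo[:, active_set]
clf = LogisticRegression().fit(X_active, z_demo)
accuracy = clf.score(X_active, z_demo)  # in-sample accuracy
```

Restricting the design matrix to the lasso's active set before fitting the logistic model is the selection step the exercise asks for; note that in-sample accuracy is optimistic, and a held-out or cross-validated estimate would be fairer.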