STA 208: Homework 2

This is based on the material in Chapter 3 and Section 4.4 of 'The Elements of Statistical Learning' (ESL), in addition to lectures 4-6. Chunzhe Zhang came up with the dataset and the analysis in the second section.

Instructions

We use a script that extracts your answers by looking for cells between the cells containing the exercise statements (which begin with Exercise X.X). So you

- MUST add cells between the exercise statements and put your answers in them, and
- MUST NOT modify the existing cells, particularly the problem statements.

To write markdown, switch the cell type from code to markdown (you can hit 'm' when you are in command mode) and use the markdown language. For a brief tutorial see: https://daringfireball.net/projects/markdown/syntax

In the conceptual exercises you should provide an explanation, with math when necessary, for any answers. When answering with math you should use basic LaTeX, as in $$E(Y|X=x) = \int_{\mathcal{Y}} y \, f_{Y|X}(y|x) \, dy = \int_{\mathcal{Y}} y \, \frac{f_{Y,X}(y,x)}{f_{X}(x)} \, dy$$ for displayed equations, and $R_{i,j} = 2^{-|i-j|}$ for inline equations. (To see the source of this cell, double click on it or hit Enter in command mode.) For a list of LaTeX math symbols see: http://web.ift.uib.no/Teori/KURS/WRK/TeX/symALL.html

1. Conceptual Exercises

Exercise 1.1 (5 pts) Ex. 3.29 in ESL

Exercise 1.2 (5 pts) Ex. 3.30 in ESL

Exercise 1.3 (5 pts) $Y \in \{0,1\}$ follows an exponential family model with natural parameter $\eta$ if $$P(Y=y) = \exp\left( y \eta - \psi(\eta) \right).$$ Show that when $\eta = x^\top \beta$ then $Y$ follows a logistic regression model.

2. Data Analysis



In [14]:

    
import numpy as np
import pandas as pd

# dataset path
data_dir = "."

Load the following medical dataset of 750 patients. The response variable is the survival time (Y); the predictors are 104 measurements (V1 to V104) taken at a single time point. V1 is a Yes/No variable, and the numerical variables have been standardized.



In [3]:

    
sample_data = pd.read_csv(data_dir+"/hw2.csv", delimiter=',')



In [4]:

    
sample_data.head()









    Out[4]:

        Y   V1         V2         V3         V4         V5         V6         V7         V8         V9  ...        V95        V96        V97        V98        V99       V100       V101       V102       V103       V104
0    1498   No   0.171838  -0.081764  -1.448868  -1.302547  -0.143061  -0.339784  -1.206475   0.444493  ...  -1.379066   0.420436  -0.827446   0.318695  -0.787409   0.351406  -0.836107   0.015502   0.435444  -0.879906
1     334  Yes  -0.605767  -0.584360  -0.485169  -0.848111  -0.493546  -0.392332  -0.239788   0.421697  ...   0.398840  -0.434789  -0.698862   1.387219   0.948456   0.191397   1.451699  -1.243616  -0.699072   1.751434
2     845  Yes  -0.266330  -0.126965   0.138401   0.262732  -0.202438   0.397194   0.137790   0.047847  ...  -0.450999  -0.627830   0.677158  -0.140255  -0.798641  -0.972419  -0.852035   0.080914  -1.906252   0.705509
3    1484   No   0.113498   0.893293  -0.825298  -0.444168   0.756242   0.179122  -1.145078  -1.471261  ...   0.316312   0.131010   0.878134  -0.306249  -1.263270   1.316120  -0.999717   1.104161  -0.234038  -0.083488
4     301  Yes  -0.620454  -0.608036  -0.088352   0.111253  -0.598898  -0.513191   0.753000   1.055418  ...   0.364884   0.251667   0.373787  -0.354599   0.085019   1.207509  -0.762206  -0.067318   0.158247   0.592638

5 rows × 105 columns



In [5]:

    
sample_data.V1 = sample_data.V1.eq('Yes').mul(1)
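
The recoding above maps 'Yes' to 1 and every other value to 0. A minimal sketch of the same idea on a toy Series (the values below are made up for illustration and are not taken from hw2.csv):

```python
import pandas as pd

# Toy stand-in for the V1 column; these values are illustrative only.
v1 = pd.Series(["No", "Yes", "Yes", "No", "Yes"])

# eq("Yes") gives a boolean Series; mul(1) turns True/False into 1/0.
encoded = v1.eq("Yes").mul(1)

# An equivalent, arguably more explicit spelling of the same recoding.
encoded_alt = (v1 == "Yes").astype(int)

print(encoded.tolist())
```

Note that any value other than the exact string 'Yes' (including missing values or 'yes') would be coded as 0 by either spelling.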

The response variable is Y for Exercises 2.1-2.3 and z (the binary variable V1, recoded above) for Exercise 2.4.



In [9]:

    
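# Predictors V2..V104: every column after Y and V1 (range(2,104) dropped V104)
X = sample_data.iloc[:, 2:].to_numpy()
# Continuous response Y (first column)
y = sample_data.iloc[:, 0].to_numpy()
# Binary response z: the recoded V1 column
z = sample_data.iloc[:, 1].to_numpy()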
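
Positional column selection with `iloc` is easy to get off by one, because `range(a, b)` and the slice `a:b` both stop before position `b`. The sketch below uses a small made-up frame (the name `toy` and its values are hypothetical) with the same column layout as the homework data to show the difference:

```python
import numpy as np
import pandas as pd

# Toy frame with the same layout as the homework data (Y, V1, V2, ...);
# the numbers are made up.
toy = pd.DataFrame({
    "Y": [10, 20, 30],
    "V1": [0, 1, 1],
    "V2": [0.1, 0.2, 0.3],
    "V3": [1.0, 2.0, 3.0],
    "V4": [-1.0, -2.0, -3.0],
})

y_toy = toy.iloc[:, 0].to_numpy()            # first column: response Y
z_toy = toy.iloc[:, 1].to_numpy()            # second column: binary V1
X_all = toy.iloc[:, 2:].to_numpy()           # every column from position 2 onward
X_cut = toy.iloc[:, range(2, 4)].to_numpy()  # range(2, 4) stops before position 4

print(X_all.shape, X_cut.shape)
```

Checking `X.shape` against the number of columns reported by `sample_data.head()` is a quick way to confirm that no predictor was left out.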

Exercise 2.1 (10 pts) Perform ridge regression on the dataset and cross-validate to find the best ridge parameter.
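
As an illustration of what cross-validating a ridge parameter involves, here is a minimal NumPy-only sketch on synthetic data. It is not the homework dataset and not a reference solution: the names `X_sim`, `y_sim`, `ridge_fit`, and `cv_mse`, the 5-fold split, and the grid of penalties are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (not the homework dataset).
n, p = 120, 10
X_sim = rng.standard_normal((n, p))
beta_true = np.concatenate([np.array([3.0, -2.0, 1.5]), np.zeros(p - 3)])
y_sim = X_sim @ beta_true + rng.standard_normal(n)

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return beta, y_mean - x_mean @ beta

def cv_mse(X, y, lam, n_folds=5, seed=0):
    """Average held-out squared error over n_folds random folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, n_folds)
    errs = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        beta, b0 = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((y[test] - (X[test] @ beta + b0)) ** 2))
    return float(np.mean(errs))

# Hypothetical grid of ridge penalties on a log scale.
lambdas = np.logspace(-3, 3, 25)
scores = np.array([cv_mse(X_sim, y_sim, lam) for lam in lambdas])
best_lambda = lambdas[np.argmin(scores)]
print(best_lambda, scores.min())
```

The same fold assignment is reused for every penalty so that the scores are comparable across the grid. scikit-learn's `RidgeCV` offers a packaged alternative to this hand-rolled loop.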

Exercise 2.2 (10 pts) Plot the lasso path and the LARS path for each of the coefficients. All coefficients for a given method should be on the same plot, so you should get 2 plots. What are the major differences, if any? Are there any 'leaving' events in the lasso path?
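
One way to compute both paths is scikit-learn's `lars_path`, which traces plain least angle regression with `method="lar"` and the lasso path with `method="lasso"`. The sketch below runs it on synthetic data (not the homework dataset); the helper `count_leaving_events` and its tolerance are hypothetical, added only to illustrate what a 'leaving' event looks like numerically.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(1)

# Synthetic, correlated design (not the homework data).
n, p = 100, 8
base = rng.standard_normal((n, 1))
X_sim = 0.7 * base + rng.standard_normal((n, p))
X_sim = (X_sim - X_sim.mean(axis=0)) / X_sim.std(axis=0)
y_sim = X_sim[:, 0] - 2.0 * X_sim[:, 1] + 0.5 * X_sim[:, 2] + rng.standard_normal(n)
y_sim = y_sim - y_sim.mean()

# method="lar" is plain least angle regression; method="lasso" is the LARS
# modification that traces the lasso path and lets variables drop out.
alphas_lar, _, coefs_lar = lars_path(X_sim, y_sim, method="lar")
alphas_lasso, _, coefs_lasso = lars_path(X_sim, y_sim, method="lasso")

def count_leaving_events(coefs, tol=1e-10):
    """Count times a coefficient goes from nonzero back to (numerically) zero."""
    nonzero = np.abs(coefs) > tol
    return int(np.sum(nonzero[:, :-1] & ~nonzero[:, 1:]))

# Each coefficient array has one row per feature and one column per knot.
print(coefs_lar.shape, coefs_lasso.shape)
print("leaving events on the lasso path:", count_leaving_events(coefs_lasso))
```

Each row of the returned coefficient array is the trajectory of one coefficient, so plotting the transposed array against the matching `alphas` vector puts all coefficients for one method on a single plot.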

Exercise 2.3 (10 pts) Cross-validate the Lasso and compare the results to the answer to 2.1.
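
A sketch of cross-validating the lasso with scikit-learn's `LassoCV`, again on synthetic data rather than the homework dataset. The 5-fold setting, the simulated coefficients, and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)

# Synthetic sparse regression problem (not the homework data).
n, p = 150, 20
X_sim = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [4.0, -3.0, 2.0]
y_sim = X_sim @ beta_true + rng.standard_normal(n)

# LassoCV builds its own grid of penalties and refits at the best one.
lasso_cv = LassoCV(cv=5, random_state=0).fit(X_sim, y_sim)

# mse_path_ has shape (n_alphas, n_folds); average over folds per penalty.
mean_cv_mse = lasso_cv.mse_path_.mean(axis=1)
best_idx = int(np.argmin(mean_cv_mse))
print(lasso_cv.alpha_, mean_cv_mse[best_idx], np.count_nonzero(lasso_cv.coef_))
```

The minimum of `mean_cv_mse` is an estimate of prediction error at the selected penalty, which is the kind of quantity that can be set against the cross-validated error of a ridge fit.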

Exercise 2.4 (15 pts) Obtain the 'best' active set from 2.3, and create a new design matrix with only these variables. Use this to predict the categorical variable $z$ with logistic regression.
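
The active set of a fitted lasso is the set of predictors with nonzero coefficients. The sketch below shows, on synthetic data, one way to extract it and reuse the reduced design matrix in a logistic regression. The simulated binary response `z_sim`, the variable names, and the use of `LassoCV` to pick the penalty are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LogisticRegression

rng = np.random.default_rng(3)

# Synthetic data (not the homework dataset).
n, p = 200, 15
X_sim = rng.standard_normal((n, p))
y_sim = 3.0 * X_sim[:, 0] - 2.0 * X_sim[:, 1] + rng.standard_normal(n)
# Hypothetical binary response driven by the same first two features.
z_sim = (X_sim[:, 0] - X_sim[:, 1] + 0.5 * rng.standard_normal(n) > 0).astype(int)

# Fit the lasso at the cross-validated penalty and read off the active set.
lasso_cv = LassoCV(cv=5, random_state=0).fit(X_sim, y_sim)
active = np.flatnonzero(lasso_cv.coef_)

# New design matrix containing only the active columns.
X_active = X_sim[:, active]

# Note: LogisticRegression applies an L2 penalty (C=1.0) by default.
clf = LogisticRegression(max_iter=1000).fit(X_active, z_sim)
train_acc = clf.score(X_active, z_sim)
print(active, train_acc)
```

The accuracy printed here is computed on the training data, so it is optimistic; a held-out split or cross-validation gives a fairer estimate.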
