STA 208: Homework 3

This is based on the material in Chapter 4 of 'Elements of Statistical Learning' (ESL), in addition to lectures 7-8. Chunzhe Zhang came up with the dataset and the analysis in the second section.

Instructions

We use a script that extracts your answers by looking for cells in between the cells containing the exercise statements (beginning with Exercise X.X). So you

  • MUST add cells in between the exercise statements and add answers within them and
  • MUST NOT modify the existing cells, particularly not the problem statement

To write markdown, please switch the cell type to Markdown (from Code) - you can hit 'm' when you are in command mode - and use Markdown syntax. For a brief tutorial see: https://daringfireball.net/projects/markdown/syntax

In the conceptual exercises you should provide an explanation, with math when necessary, for any answers. When answering with math you should use basic LaTeX, as in $$E(Y|X=x) = \int_{\mathcal{Y}} f_{Y|X}(y|x) dy = \int_{\mathcal{Y}} \frac{f_{Y,X}(y,x)}{f_{X}(x)} dy$$ for displayed equations, and $R_{i,j} = 2^{-|i-j|}$ for inline equations. (To see the contents of this cell in markdown, double click on it or hit Enter in escape mode.) To see a list of latex math symbols see here: http://web.ift.uib.no/Teori/KURS/WRK/TeX/symALL.html

When writing pseudocode, you should use enumerated lists, such as

Algorithm: Ordinary Least Squares Fit (Input: X, y; Output: $\beta$)

  1. Initialize the $p \times p$ Gram matrix, $G \gets 0$, and the vector $b \gets 0$.
  2. For each sample, $x_i$:
    1. $G \gets G + x_i x_i^\top$.
    2. $b \gets b + y_i x_i$
  3. Solve the linear system $G \beta = b$ and return $\beta$
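
For example, a minimal Python sketch of this pseudocode might look like the following (the function name ols_fit is illustrative, not part of the assignment):


In [ ]:
import numpy as np

def ols_fit(X, y):
    """Ordinary least squares via Gram-matrix accumulation (one pass over the samples)."""
    n, p = X.shape
    G = np.zeros((p, p))  # p x p Gram matrix accumulator
    b = np.zeros(p)       # accumulator for X^T y
    for i in range(n):
        x_i = X[i]
        G += np.outer(x_i, x_i)  # G <- G + x_i x_i^T
        b += y[i] * x_i          # b <- b + y_i x_i
    return np.linalg.solve(G, b)  # solve G beta = b for beta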

Exercise 1.1 (10 pts - 2 each)

Recall that surrogate losses for large margin classification take the form, $\phi(y_i x_i^\top \beta)$ where $y_i \in \{-1,1\}$ and $\beta, x_i \in \mathbb R^p$.

The following functions are used as surrogate losses for large margin classification. Demonstrate whether or not each is convex, and follow the instructions.

  1. exponential loss: $\phi(x) = e^{-x}$
  2. truncated quadratic loss: $\phi(x) = (\max\{1-x,0\})^2$
  3. hinge loss: $\phi(x) = \max\{1-x,0\}$
  4. sigmoid loss: $\phi(x) = 1 - \tanh(\kappa x)$, for fixed $\kappa > 0$
  5. Plot these as functions of $x$ (a plotting sketch is provided after this problem as a starting point).

(This problem is due to notes of Larry Wasserman.)
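
As a starting point for part 5, here is a minimal plotting sketch; the range of $x$ values and the choice $\kappa = 1$ are arbitrary, illustrative assumptions.


In [ ]:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 400)  # illustrative range of margin values
kappa = 1.0                  # arbitrary choice for the sigmoid loss

losses = {
    "exponential": np.exp(-x),
    "truncated quadratic": np.maximum(1 - x, 0) ** 2,
    "hinge": np.maximum(1 - x, 0),
    "sigmoid": 1 - np.tanh(kappa * x),
}

for name, phi in losses.items():
    plt.plot(x, phi, label=name)
plt.xlabel("$x$")
plt.ylabel(r"$\phi(x)$")
plt.legend()
plt.show()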

Exercise 1.2 (10 pts)

Consider the truncated quadratic loss from (1.1.2). For brevity, let $a_+ = \max\{a,0\}$ denote the positive part of $a$.

$$\ell(y_i,x_i,\beta) = \phi(y_i x_i^\top \beta) = (1-y_i x_i^\top \beta)_+^2$$
  1. Consider the empirical risk, $R_n$ (the average loss over a training set), for the truncated quadratic loss. What is the gradient of $R_n$ in $\beta$? Does it always exist?
  2. Demonstrate that the gradient does not have a continuous derivative everywhere.
  3. Recall that support vector machines use the hinge loss $(1 - y_i x_i^\top \beta)_+$ with ridge regularization. Write the regularized optimization problem for the truncated quadratic loss, and derive the gradient of the regularized empirical risk.
  4. Because the loss does not have a continuous Hessian, instead of Newton's method we will use a quasi-Newton method that replaces the Hessian with a quasi-Hessian (another matrix that is meant to approximate the Hessian). Take the quasi-Hessian of the regularized objective to be $$G(\beta) = \frac 1n \sum_i 2 \left( x_i x_i^\top 1\{ y_i x_i^\top \beta < 1 \} \right) + 2 \lambda I.$$ Demonstrate that the quasi-Hessian is positive definite, and write pseudo-code for the quasi-Newton optimization; a sketch is provided after this exercise as a starting point. (There was a correction in the lectures: when minimizing a function you should subtract the step, $\beta \gets \beta - H^{-1} g$.)
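
As a starting point for part 4, here is a minimal sketch of the quasi-Newton iteration; the function name quasi_newton_tqr, the fixed iteration count, and the zero initialization are illustrative assumptions.


In [ ]:
import numpy as np

def quasi_newton_tqr(X, y, lam, n_iter=50):
    """Quasi-Newton minimization of the ridge-regularized truncated quadratic risk.

    gradient:      g(beta) = -(2/n) sum_i y_i x_i (1 - y_i x_i^T beta)_+ + 2 lam beta
    quasi-Hessian: G(beta) = (2/n) sum_i x_i x_i^T 1{y_i x_i^T beta < 1} + 2 lam I
    """
    n, p = X.shape
    beta = np.zeros(p)  # illustrative initialization
    for _ in range(n_iter):
        margins = y * (X @ beta)            # y_i x_i^T beta
        slack = np.maximum(1 - margins, 0)  # (1 - y_i x_i^T beta)_+
        g = -(2 / n) * X.T @ (y * slack) + 2 * lam * beta
        active = (margins < 1).astype(float)
        G = (2 / n) * (X * active[:, None]).T @ X + 2 * lam * np.eye(p)
        beta = beta - np.linalg.solve(G, g)  # beta <- beta - G^{-1} g
    return beta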

HW3 Logistic, LDA, SVM


In [1]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

import sklearn.linear_model as skl_lm
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, classification_report, precision_recall_curve, roc_curve
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
# dataset path
data_dir = "."

The following code reads the data, selects the $y$ and $X$ variables, and makes a training/test split. This is the Abalone dataset, and we will be predicting age: V9 is the age label, where 1 represents old and 0 represents young.


In [2]:
# read the data and encode the categorical V1 (sex) column as integers
sample_data = pd.read_csv(data_dir + "/hw3.csv", delimiter=',')
sample_data.V1 = sample_data.V1.factorize()[0]

X = np.array(sample_data.iloc[:, range(0, 8)])  # predictors V1-V8
y = np.array(sample_data.iloc[:, 8])            # response V9 (1 = old, 0 = young)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)

Exercise 2.1 (10 pts) Perform logistic regression using Newton conjugate gradient. You should save the predicted probabilities, and save the ROC and PR curves (using roc_curve and precision_recall_curve) computed on the test set.
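
One possible sketch, using scikit-learn's LogisticRegression with the 'newton-cg' solver; the variable names and the default regularization strength are assumptions.


In [ ]:
# logistic regression fit with the Newton conjugate-gradient solver
logit = skl_lm.LogisticRegression(solver='newton-cg')
logit.fit(X_train, y_train)

# predicted probability of class 1 on the test set
prob_logit = logit.predict_proba(X_test)[:, 1]

# ROC and precision-recall curves computed on the test set
fpr_logit, tpr_logit, _ = roc_curve(y_test, prob_logit)
prec_logit, rec_logit, _ = precision_recall_curve(y_test, prob_logit)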

Exercise 2.2 (10 pts) Do the same for linear discriminant analysis.
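
A minimal sketch following the same pattern with LinearDiscriminantAnalysis; variable names are again illustrative.


In [ ]:
# linear discriminant analysis, same pattern as the logistic regression cell
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

prob_lda = lda.predict_proba(X_test)[:, 1]
fpr_lda, tpr_lda, _ = roc_curve(y_test, prob_lda)
prec_lda, rec_lda, _ = precision_recall_curve(y_test, prob_lda)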

Exercise 2.3 (10 pts) Do the same for support vector machines.
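
A minimal sketch with SVC; the linear kernel is an assumption (the exercise does not prescribe one), and the decision_function scores are used in place of probabilities when computing the curves.


In [ ]:
# support vector machine; the linear kernel here is an illustrative choice
svm = SVC(kernel='linear')
svm.fit(X_train, y_train)

# signed margins from decision_function serve as scores for the ROC/PR curves
score_svm = svm.decision_function(X_test)
fpr_svm, tpr_svm, _ = roc_curve(y_test, score_svm)
prec_svm, rec_svm, _ = precision_recall_curve(y_test, score_svm)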

Exercise 2.4 (10 pts) Plot and compare the ROC and PR curves for the above methods.
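
A minimal plotting sketch that overlays the curves saved above; it assumes the variables defined in the sketches for Exercises 2.1-2.3.


In [ ]:
# overlay the ROC and PR curves from the cells above
fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(12, 5))

curves = {
    'logistic': (fpr_logit, tpr_logit, prec_logit, rec_logit),
    'LDA': (fpr_lda, tpr_lda, prec_lda, rec_lda),
    'SVM': (fpr_svm, tpr_svm, prec_svm, rec_svm),
}
for name, (fpr, tpr, prec, rec) in curves.items():
    ax_roc.plot(fpr, tpr, label=name)
    ax_pr.plot(rec, prec, label=name)

ax_roc.plot([0, 1], [0, 1], 'k--', lw=1)  # chance line
ax_roc.set_xlabel('False positive rate')
ax_roc.set_ylabel('True positive rate')
ax_roc.set_title('ROC curves (test set)')
ax_roc.legend()

ax_pr.set_xlabel('Recall')
ax_pr.set_ylabel('Precision')
ax_pr.set_title('Precision-recall curves (test set)')
ax_pr.legend()
plt.show()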