Most exercises below are adapted from An Introduction to Statistical Learning.
The following questions test your ability to implement SVM classifiers and reason about their effectiveness.
(a) Generate a simulated two-class data set with 100 observations and two features in which there is a visible but non-linear separation between the two classes.
In [34]:
# Your code here
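One possible sketch (the quadratic class-assignment rule is an illustrative choice; any visibly non-linear separation works):

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))            # 100 observations, 2 features
y = (X[:, 1] > X[:, 0] ** 2 - 0.3).astype(int)   # classes separated by a quadratic curve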
(b) Show that in this setting, a support vector machine with a polynomial kernel (with degree greater than 1) or a radial kernel will outperform a support vector classifier on the training data.
In [35]:
# Your code here
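A sketch of one way to compare the three fits on the training data (assumes X and y from the sketch in part (a)):

from sklearn.svm import SVC

models = {
    "support vector classifier": SVC(kernel="linear"),
    "polynomial SVM (degree 3)": SVC(kernel="poly", degree=3),
    "radial SVM": SVC(kernel="rbf"),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, "training error:", 1 - model.score(X, y))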
(c) Generate 1000 test observations using the same method that you used in (a).
In [36]:
# Your code here
(d) Which technique performs best on the test data? Make plots and report training and test error rates in order to back up your assertions.
In [37]:
# Your code here
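Test errors follow the same pattern (a sketch, assuming X_test and y_test from part (c) and the fitted models dictionary from part (b)):

for name, model in models.items():
    print(name, "test error:", 1 - model.score(X_test, y_test))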
In [38]:
# Your thoughts here
We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features.
(a) Generate a data set with $n = 500$ and $p = 2$, such that the observations belong to two classes with a quadratic decision boundary between them.
In [39]:
# Your code here
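One illustrative way to generate such data (a sketch; the particular quadratic rule is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(1)
X1 = rng.uniform(-1, 1, 500)
X2 = rng.uniform(-1, 1, 500)
y = (X2 > X1 ** 2 - 0.5).astype(int)   # quadratic decision boundary between the classes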
(b) Plot the observations, colored according to their class labels. Your plot should display $X_1$ on the x-axis and $X_2$ on the y-axis.
In [40]:
# Your code here
(c) Fit a logistic regression model to the data, using $X_1$ and $X_2$ as predictors.
In [41]:
# Your code here
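A minimal sketch with scikit-learn (assumes X1, X2, and y from part (a)):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.column_stack([X1, X2])
logreg = LogisticRegression()
logreg.fit(X, y)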
(d) Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be linear.
In [42]:
# Your code here
(e) Now fit a logistic regression model to the data using non-linear functions of $X_1$ and $X_2$ as predictors (e.g. $X_1^2$, $X_1 \times X_2$, $\log(X_2)$, and so forth).
In [43]:
# Your code here
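One way to build the non-linear features by hand (a sketch; assumes X1, X2, and y from part (a), and any set of transformations would do):

import numpy as np
from sklearn.linear_model import LogisticRegression

X_nl = np.column_stack([X1, X2, X1 ** 2, X2 ** 2, X1 * X2])
logreg_nl = LogisticRegression(max_iter=1000)   # extra iterations in case of slow convergence
logreg_nl.fit(X_nl, y)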
(f) Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be obviously non-linear. If it is not, then repeat (a)–(e) until you come up with an example in which the decision boundary implied by the predicted class labels is obviously non-linear.
In [44]:
# Your code here
(g) Fit a support vector classifier to the data with $X_1$ and $X_2$ as predictors. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.
In [45]:
# Your code here
(h) Fit an SVM using a non-linear kernel to the data. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.
In [46]:
# Your code here
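For parts (g) and (h), a sketch of the two SVM fits and one way to plot the predicted labels (assumes X, X1, X2, and y from above):

import matplotlib.pyplot as plt
from sklearn.svm import SVC

svc_linear = SVC(kernel="linear").fit(X, y)   # part (g)
svc_rbf = SVC(kernel="rbf").fit(X, y)         # part (h)
plt.scatter(X1, X2, c=svc_rbf.predict(X))
plt.xlabel("X1")
plt.ylabel("X2")
plt.show()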
(i) Comment on your results.
In [47]:
# Your thoughts here
Prove algebraically that the logistic and logit representations of the logistic regression model are equivalent. More specifically, prove that:
$$ p(X) = \frac{1}{1 + e^{-z}} \quad \Leftrightarrow \quad \log\left(\frac{p(X)}{1-p(X)}\right) = z $$
In [48]:
# Your proof here
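A sketch of the algebra: starting from the logistic form,

$$ p(X) = \frac{1}{1 + e^{-z}} \quad \Rightarrow \quad 1 - p(X) = \frac{e^{-z}}{1 + e^{-z}} \quad \Rightarrow \quad \frac{p(X)}{1 - p(X)} = \frac{1}{e^{-z}} = e^{z}, $$

and taking the logarithm of both sides gives $\log\left(\frac{p(X)}{1 - p(X)}\right) = z$. Every step is reversible, so the two representations are equivalent.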
Comprehension questions about odds:
(a) On average, what fraction of people with an odds of 0.37 of defaulting on their credit card payment will in fact default?
In [49]:
# Your answer here
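A worked sketch: since odds $= p/(1 - p) = 0.37$, we have $p = 0.37/1.37 \approx 0.27$, so on average roughly 27% of such people will default.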
(b) Suppose that an individual has a 16% chance of defaulting on their credit card payment. What are the odds that they will default?
In [50]:
# Your answer here
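Worked the other way: $p = 0.16$ gives odds $= 0.16/0.84 \approx 0.19$.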
Suppose we collect data for a group of students in a statistics class with variables $x_{1}$ = hours studied, $x_{2}$ = undergrad GPA, and $y$ = receive an A. We fit a logistic regression and produce estimated coefficients $w_{0} = -6$, $w_{1} = 0.05$, and $w_{2} = 1$.
(a) Estimate the probability that a student who studies for 40 hours and has an undergrad GPA of 3.5 gets an A in the class.
In [51]:
# Your answer here
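A worked sketch: $z = w_0 + w_1 x_1 + w_2 x_2 = -6 + 0.05 \cdot 40 + 1 \cdot 3.5 = -0.5$, so $p = 1/(1 + e^{0.5}) \approx 0.38$.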
(b) How many hours would the student in part (a) need to study to have a 50% chance of getting an A in the class?
In [52]:
# Your answer here
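A worked sketch: a 50% chance corresponds to $z = 0$, i.e. $-6 + 0.05\,x_1 + 3.5 = 0$, which gives $x_1 = 2.5/0.05 = 50$ hours.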
An exercise to help investigate the curse of dimensionality in nearest-neighbor algorithms:
(a) Suppose that we have a set of observations, each with measurements on $p = 1$ feature, $X$. We assume that $X$ is uniformly distributed on $[0,1]$. Each observation $X^{(i)}$ is associated with a response value $y^{(i)}$. Suppose that we wish to predict a test observation’s response using only observations that are within 10% of the range of $X$ closest to that test observation. For instance, in order to predict the response for a test observation with $X = 0.6$, we will use observations in the range $[0.55, 0.65]$. On average, what fraction of the available observations will we use to make the prediction?
In [53]:
# Your answer here
(b) Now suppose that we have a set of observations, each with measurements on $p = 2$ features, $X_{1}$ and $X_{2}$. We assume that ($X_{1}$, $X_{2}$) are uniformly distributed on $[0,1] \times [0,1]$. We wish to predict a test observation’s response using only observations that are within 10% of the range of $X_{1}$ and within 10% of the range of $X_{2}$ closest to that test observation. For instance, in order to predict the response for a test observation with $X_{1} = 0.6$ and $X_{2} = 0.35$, we will use observations in the range $[0.55, 0.65]$ for $X_{1}$ and in the range $[0.3, 0.4]$ for $X_{2}$. On average, what fraction of the available observations will we use to make the prediction?
In [54]:
# Your answer here
(c) Now suppose that we have a set of observations on $p = 100$ features. Again the observations are uniformly distributed on each feature, and again each feature ranges in value from 0 to 1. We wish to predict a test observation’s response using observations within the 10% of each feature’s range that is closest to that test observation. What fraction of the available observations will we use to make the prediction?
In [55]:
# Your answer here
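A quick simulation sketch for parts (a)–(c) (edge effects near 0 and 1 are ignored here, so the expected fraction is simply $0.1^p$):

import numpy as np

rng = np.random.default_rng(2)
for p in (1, 2, 100):
    X = rng.uniform(size=(100_000, p))                 # uniform observations on [0, 1]^p
    test = np.full(p, 0.5)                             # a test point away from the edges
    near = np.all(np.abs(X - test) <= 0.05, axis=1)    # within 10% of the range on every feature
    print(p, near.mean(), 0.1 ** p)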
(d) Using your answers to parts (a)–(c), argue that a drawback of KNN when $p$ is large is that there are very few training observations “near” any given test observation.
In [56]:
# Your answer here
(e) Now suppose that we wish to make a prediction for a test observation by creating a $p$-dimensional hypercube centered around the test observation that contains, on average, 10% of the training observations. For $p = 1$, $p = 2$, and $p = 100$, what is the length of each side of the hypercube? Comment on your answer.
Note: A hypercube is a generalization of a cube to an arbitrary number of dimensions. When $p = 1$, a hypercube is simply a line segment; when $p = 2$ it is a square; and when $p = 100$ it is a 100-dimensional cube.
In [57]:
# Your answer here
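A worked sketch: a hypercube with side length $\ell$ captures, on average, a fraction $\ell^p$ of the observations, so $\ell = 0.1^{1/p}$. That gives $\ell = 0.1$ for $p = 1$, $\ell \approx 0.32$ for $p = 2$, and $\ell \approx 0.98$ for $p = 100$: to capture just 10% of the data, the cube must span nearly the entire range of every feature.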
Consider the Gini index, classification error, and entropy measures of impurity in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of $p(i \mid t)$. The x-axis should display $p(i \mid t)$, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.
In [58]:
# Your code here
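A sketch of the plot. In the two-class case, writing $p$ for $p(i \mid t)$, the three measures are Gini $= 2p(1 - p)$, entropy $= -p\log_2 p - (1 - p)\log_2(1 - p)$, and classification error $= 1 - \max(p, 1 - p)$:

import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 200)   # avoid log(0) at the endpoints
gini = 2 * p * (1 - p)
entropy = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
error = 1 - np.maximum(p, 1 - p)
plt.plot(p, gini, label="Gini index")
plt.plot(p, entropy, label="entropy")
plt.plot(p, error, label="classification error")
plt.xlabel(r"$p(i \mid t)$")
plt.legend()
plt.show()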
This problem tests your ability to train decision trees and reason about their effectiveness. It uses the built-in breast cancer dataset that ships with scikit-learn. You can import this dataset with the function sklearn.datasets.load_breast_cancer.
(a) Import the breast cancer dataset from scikit-learn (it contains 569 observations). Then, create a training set containing a random sample of 400 observations, and a test set containing the remaining 169 observations.
In [59]:
# Your code here
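A minimal sketch (train_size and random_state are illustrative choices):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, train_size=400, random_state=0)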
(b) Fit a decision tree to the training data, with the diagnosis (the target variable) as the response and all of the other variables as predictors. Produce summary statistics about the tree and describe the results obtained. What is the training error rate?
In [60]:
# Your code here
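A sketch of the fit and training error (assumes the split from part (a)):

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
print("training error:", 1 - tree.score(X_train, y_train))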
(c) Produce a detailed text summary of the fitted tree (for example, with sklearn.tree.export_text).
In [61]:
# Your code here
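One way to get the text output (a sketch, assuming the tree and data objects from above):

from sklearn.tree import export_text

print(export_text(tree, feature_names=list(data.feature_names)))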
(d) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
In [62]:
# Your code here
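A sketch (assumes the fitted tree and test split from above):

from sklearn.metrics import confusion_matrix

y_pred = tree.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("test error:", 1 - tree.score(X_test, y_test))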
(e) Apply a cross-validation function to the training set in order to determine the optimal tree size.
In [63]:
# Your code here
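One sketch using max_leaf_nodes as the scikit-learn analogue of tree size (cost-complexity pruning via ccp_alpha would also work):

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

sizes = range(2, 21)
cv_errors = []
for size in sizes:
    model = DecisionTreeClassifier(max_leaf_nodes=size, random_state=0)
    scores = cross_val_score(model, X_train, y_train, cv=5)
    cv_errors.append(1 - scores.mean())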
(f) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
In [64]:
# Your code here
(g) Which tree size corresponds to the lowest cross-validated classification error rate?
In [65]:
# Your answer here
(h) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
In [66]:
# Your code here
(i) Compare the training error rates between the pruned and un-pruned trees. Which is higher?
In [67]:
# Your code here
(j) Compare the test error rates between the pruned and unpruned trees. Which is higher?
In [68]:
# Your code here
(k) Comment on your results.
In [69]:
# Your thoughts here