Resampling Methods

Timothy Helton

The goal of predictive modeling is to create models that make good predictions on new data. We don't have access to this new data at the time of training, so we must use statistical methods to estimate the performance of a model on new data. These methods are called resampling methods, because they repeatedly resample the available training data.

NOTE:
This notebook uses code found in the k2datascience.preprocessing module. To execute all the cells, do one of the following:

Install the k2datascience package to the active Python interpreter.
Add k2datascience/k2datascience to the PYTHONPATH environment variable.
Create a link to the preprocessing.py file in the same directory as this notebook.

Imports



In [ ]:

    
import pandas as pd
import numpy as np
import scipy as sp
import scipy.stats  # load the stats submodule explicitly so sp.stats resolves

from k2datascience import plotting
from k2datascience import preprocessing

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline

Theory

Exercise 1

We will now derive the probability that a given observation is part of a bootstrap sample. Suppose that we obtain a bootstrap sample from a set of n observations.

(a) What is the probability that the first bootstrap observation is not the jth observation from the original sample? Justify your answer.

(b) What is the probability that the second bootstrap observation is not the jth observation from the original sample?

(c) Argue that the probability that the jth observation is not in the bootstrap sample is $(1 - 1/n)^n$.

(d) When n = 5, what is the probability that the jth observation is in the bootstrap sample?

(e) When n = 100, what is the probability that the jth observation is in the bootstrap sample?

(f) When n = 10,000, what is the probability that the jth observation is in the bootstrap sample?

(a) What is the probability that the first bootstrap observation is not the jth observation from the original sample? Justify your answer.

$$P = \frac{n-1}{n}$$

(b) What is the probability that the second bootstrap observation is not the jth observation from the original sample?

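The bootstrap samples with replacement, so an observation may be selected multiple times.
The probability is the same as in (a), because drawing an observation does not remove it from the pool and each draw is independent of the others.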

$$P = \frac{n-1}{n}$$

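(c) Argue that the probability that the jth observation is not in the bootstrap sample is $(1 - 1/n)^n$.

The n draws are independent, so the probability that the $j^{th}$ observation is missed by every draw is the product of the n per-draw probabilities from (a) and (b).

$$P = \left(\frac{n-1}{n}\right)^n = \left(1 - \frac{1}{n}\right)^n$$

The probability that the $j^{th}$ observation is in the bootstrap sample is therefore $1 - (1 - 1/n)^n$.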
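
Because each of the n independent draws misses observation j with probability $(n-1)/n$, the chance that it appears at least once is $1 - (1 - 1/n)^n$. The sketch below evaluates that expression directly. The function name `prob_in_bootstrap` is a hypothetical stand-in for illustration and is not claimed to match the k2datascience implementation of `prob_bootstrap`.

```python
import math


def prob_in_bootstrap(n):
    """Probability that a given observation appears in a bootstrap sample of size n."""
    # Each of the n draws misses the observation with probability (n - 1) / n.
    return 1.0 - (1.0 - 1.0 / n) ** n


for n in (5, 100, 10_000):
    print(f'n = {n:>6}: {prob_in_bootstrap(n):.3f}')
# n =      5: 0.672
# n =    100: 0.634
# n =  10000: 0.632

# As n grows, the probability approaches 1 - 1/e.
print(f'limit     : {1 - math.exp(-1):.3f}')
# limit     : 0.632
```

Roughly two thirds of the original observations end up in any one bootstrap sample, almost independently of n once n is moderately large.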

(d) When n = 5, what is the probability that the jth observation is in the bootstrap sample?



In [ ]:

    
print(f'{preprocessing.prob_bootstrap(5):.3f}')

(e) When n = 100, what is the probability that the jth observation is in the bootstrap sample?



In [ ]:

    
print(f'{preprocessing.prob_bootstrap(100):.3f}')

(f) When n = 10,000, what is the probability that the jth observation is in the bootstrap sample?



In [ ]:

    
print(f'{preprocessing.prob_bootstrap(1e4):.3f}')

Exercise 2

We now review k-fold cross-validation.

(a) Explain how k-fold cross-validation is implemented.

(b) What are the advantages and disadvantages of k-fold cross-validation relative to:

The validation set approach?
LOOCV?

(a) Explain how k-fold cross-validation is implemented.

The data set is randomly partitioned into k folds of approximately equal size.
A model is fit to k - 1 folds.
The error is calculated between the model's predictions and the observed values in the remaining held-out fold.
The previous two steps are repeated k times, so each fold is used once as the test sample.
The k error estimates are averaged to give the cross-validation estimate of the test error.
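
The steps above can be sketched with NumPy alone. This is a minimal illustration on synthetic data with a straight-line model; the function name `k_fold_mse`, the data, and the choice of mean squared error as the metric are assumptions made for the example.

```python
import numpy as np


def k_fold_mse(x, y, k=5, seed=0):
    """Estimate the test MSE of a straight-line fit with k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    # Shuffle the row indices, then split them into k roughly equal folds.
    folds = np.array_split(rng.permutation(len(x)), k)
    fold_errors = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th one.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
        predictions = slope * x[test_idx] + intercept
        fold_errors.append(np.mean((y[test_idx] - predictions) ** 2))
    # Average the k held-out errors.
    return float(np.mean(fold_errors))


# Synthetic data: a linear signal plus unit-variance noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=100)

cv_mse = k_fold_mse(x, y, k=5)
print(f'5-fold CV estimate of test MSE: {cv_mse:.3f}')
```

Since the noise variance is 1, the estimate should land near 1. Setting k equal to the number of observations turns the same function into LOOCV.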

(b) What are the advantages and disadvantages of k-fold cross-validation relative to:

The validation set approach?
LOOCV?

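K-Fold vs Validation set CV
1. K-Fold uses every observation for both fitting and validation, while the validation set approach fits on only part of the data and so tends to overestimate the test error.
2. The K-Fold estimate averages over k splits, so it depends less on one particular random split than a single validation set estimate does.
3. K-Fold is slower, since it fits k models instead of one.
K-Fold vs LOOCV
1. K-Fold is faster, since it fits k models instead of n.
2. LOOCV has less bias, since each model is trained on n - 1 observations.
3. K-Fold has less variance.
4. The LOOCV training sets overlap almost entirely, so the fitted models are highly correlated, and averaging correlated results gives a higher-variance estimate.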

Exercise 3

Suppose that we use some statistical learning method to make a prediction for the response Y for a particular value of the predictor X. Carefully describe how we might estimate the standard deviation of our prediction.

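Use the bootstrap: repeatedly sample the training data with replacement, refit the model on each bootstrap sample, and predict Y at the given value of X. The standard deviation of those predictions estimates the standard deviation of our prediction.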
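
One common way to estimate the variability of a prediction is the bootstrap: refit the model on many resampled copies of the training data and look at the spread of the predictions at the value of X of interest. The sketch below assumes a straight-line model and synthetic data; `bootstrap_prediction_sd` is a hypothetical helper written for this illustration.

```python
import numpy as np


def bootstrap_prediction_sd(x, y, x_new, n_boot=500, seed=0):
    """Bootstrap estimate of the standard deviation of a prediction at x_new."""
    rng = np.random.default_rng(seed)
    n = len(x)
    predictions = np.empty(n_boot)
    for b in range(n_boot):
        # Draw n row indices with replacement and refit on the resampled data.
        idx = rng.integers(0, n, size=n)
        slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
        predictions[b] = slope * x_new + intercept
    # The spread of the refitted predictions estimates the prediction's SD.
    return float(predictions.std(ddof=1))


# Synthetic training data (an assumption made for this example).
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 1.5 * x + 4.0 + rng.normal(scale=2.0, size=200)

sd_at_5 = bootstrap_prediction_sd(x, y, x_new=5.0)
print(f'Bootstrap SD of the prediction at X = 5: {sd_at_5:.3f}')
```

The same loop works for any learning method: only the two lines that fit the model and compute the prediction need to change.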

Practical

Exercise 4 - Credit Card Default Data Set

We previously used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach.

Task - Fit a logistic regression model that uses income and balance to predict default. Compare the error of the scikit-learn and statsmodels implementations without the validation set.



In [ ]:

    
loan = preprocessing.LoanDefault()
loan.data.info()
loan.data.head()



In [ ]:

    
data = loan.data
title = 'Loan'
plotting.correlation_heatmap_plot(data, title=title)
plotting.correlation_pair_plot(data, title=title)



In [ ]:

    
loan.validation_split()
loan.logistic_summary()

Task - Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:

Split the sample set into a training set and a validation set.
Fit a multiple logistic regression model using only the training observations.
Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.
Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.
Repeat the steps above three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.
Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.
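
The validation set procedure described above can be sketched without the k2datascience helpers. The example below is a rough illustration, not the notebook's implementation: it uses synthetic data in place of the Default data set and a small NumPy logistic regression fitted by Newton's method in place of scikit-learn or statsmodels. All function names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, n_iter=25):
    """Fit logistic regression by Newton's method (X includes an intercept column)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        weights = p * (1.0 - p)
        # Newton step: solve (X' W X) step = X' (y - p).
        hessian = X.T @ (X * weights[:, None])
        beta = beta + np.linalg.solve(hessian, X.T @ (y - p))
    return beta


def validation_error(X, y, rng, train_fraction=0.5):
    """Misclassification rate on a random hold-out set."""
    order = rng.permutation(len(y))
    n_train = int(train_fraction * len(y))
    train, valid = order[:n_train], order[n_train:]
    beta = fit_logistic(X[train], y[train])
    # Classify as 1 when the posterior probability exceeds 0.5.
    predicted = (sigmoid(X[valid] @ beta) > 0.5).astype(int)
    return float(np.mean(predicted != y[valid]))


# Synthetic stand-in for the Default data: two standardized predictors.
rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = (rng.uniform(size=n) < sigmoid(X @ np.array([-1.0, 2.0, -1.0]))).astype(int)

# Three different random splits give three slightly different estimates.
errors = [validation_error(X, y, rng) for _ in range(3)]
print([round(e, 3) for e in errors])
```

The spread among the three numbers shows how much a validation set estimate depends on the particular split.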



In [ ]:

    
loan = preprocessing.LoanDefault()
loan.logistic_bootstrap(3)



In [ ]:

    
loan.features = (pd.concat([loan.data.loc[:, ['balance', 'income']],
                            loan.data.student.cat.codes],
                           axis=1)
                 .rename(columns={0: 'student'}))
loan.validation_split()
loan.logistic_summary()

FINDINGS

The Logistic Regression models have error rates consistently below 3%.
Adding the student variable did not reduce the error rate.

Task - Compute estimates for the standard errors of the income and balance logistic regression coefficients by using the bootstrap and logistic regression functions.

Use the summary() method on the logistic regression statsmodels instance.
Implement your own bootstrap method and run the model 100 times.
Comment on the estimated standard errors obtained using statsmodels and your bootstrap.



In [ ]:

    
loan = preprocessing.LoanDefault()
loan.logistic_bootstrap(100)

Exercise 5 - Stock Market Data

Task - We will compute the LOOCV error for a simple logistic regression model on the SMarket data set.

Read in the stock market data set
Fit a logistic regression model that predicts Direction using Lag1 and Lag2.
Fit a logistic regression model that predicts Direction using Lag1 and Lag2 using all but the first observation.
Use the model from (3) to predict the direction of the first observation. You can do this by predicting that the first observation will go up if $P(\mbox{direction = Up} | Lag1,Lag2 ) > 0.5$. Was this observation correctly classified?
Write a loop from i=1 to i=n, where n is the number of observations in the data set, that performs each of the following steps:
- Fit a logistic regression model using all but the ith observation to predict Direction using Lag1 and Lag2.
- Compute the posterior probability of the market moving up for the ith observation.
- Use the posterior probability for the ith observation in order to predict whether or not the market moves up.
- Determine whether or not an error was made in predicting the direction for the ith observation. If an error was made, then indicate this as a 1, and otherwise indicate it as a 0.
Take the average of the n numbers obtained in (5) in order to obtain the LOOCV estimate for the test error. Comment on the results.

1. Read in the stock market data set.



In [ ]:

    
sm = preprocessing.StockMarket()
sm.data.info()
sm.data.head()



In [ ]:

    
data = sm.data
title = 'Stock Market'
plotting.correlation_heatmap_plot(data, title=title)
plotting.correlation_pair_plot(data, title=title)

2. Fit a logistic regression model that predicts Direction using Lag1 and Lag2.



In [ ]:

    
sm.logistic_summary()

3. Fit a logistic regression model that predicts Direction using Lag1 and Lag2 using all but the first observation.



In [ ]:

    
sm.data = sm.data.iloc[1:]
sm.logistic_summary()

4. Use the model from (3) to predict the direction of the first observation. You can do this by predicting that the first observation will go up if $P(\mbox{direction = Up} | Lag1,Lag2 ) > 0.5$. Was this observation correctly classified?



In [ ]:

    
sm.data.iloc[0]
sm.data.direction.cat.categories
sm.predict[0]

FINDINGS

The model correctly predicted that the market would go up.

5. Write a loop from i=1 to i=n, where n is the number of observations in the data set, that performs each of the following steps:

Fit a logistic regression model using all but the ith observation to predict Direction using Lag1 and Lag2.
Compute the posterior probability of the market moving up for the ith observation.
Use the posterior probability for the ith observation in order to predict whether or not the market moves up.
Determine whether or not an error was made in predicting the direction for the ith observation. If an error was made, then indicate this as a 1, and otherwise indicate it as a 0.
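
The loop described in these steps might look like the sketch below. It is an illustration under stated assumptions: the data are synthetic (the columns `lag1`, `lag2`, and `up` only mimic the roles of Lag1, Lag2, and Direction), and the logistic regression is a small NumPy Newton's-method fit rather than the notebook's `logistic_leave_one_out` method.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, n_iter=25):
    """Fit logistic regression by Newton's method (X includes an intercept column)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, X.T @ (y - p))
    return beta


# Synthetic stand-in for the stock market data.
rng = np.random.default_rng(3)
n = 150
lag1 = rng.normal(size=n)
lag2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), lag1, lag2])
up = (rng.uniform(size=n) < sigmoid(0.2 + 2.0 * lag1 - 1.0 * lag2)).astype(int)

errors = np.zeros(n, dtype=int)
for i in range(n):
    # Fit on all observations except the i-th.
    keep = np.arange(n) != i
    beta = fit_logistic(X[keep], up[keep])
    # Posterior probability that the held-out observation moves up.
    prob_up = sigmoid(X[i] @ beta)
    predicted_up = int(prob_up > 0.5)
    # Record 1 for a misclassification and 0 otherwise.
    errors[i] = int(predicted_up != up[i])

# The LOOCV estimate of the test error is the mean of the n indicators.
loocv_error = errors.mean()
print(f'LOOCV error estimate: {loocv_error:.3f}')
```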



In [ ]:

    
sm = preprocessing.StockMarket()
sm.logistic_leave_one_out()



In [ ]:

    
sm.logistic_leave_one_out()

6. Take the average of the n numbers obtained in (5) in order to obtain the LOOCV estimate for the test error. Comment on the results.

FINDINGS

LOOCV estimates the test error rather than changing the model. For this data set, the LOOCV error estimate was no lower than the error rate computed without cross-validation.

Exercise 6 - Simulated Data

Task - We will now perform cross-validation on a simulated data set.

Create a scatterplot of X against Y. Comment on what you find.
Compute the LOOCV errors that result from fitting the following four models using least squares: Linear, Quadratic, Cubic and Quartic.
Repeat (2) using another random seed, and report your results. Are your results the same as what you got in (2)? Why?
Which of the models in (3) had the smallest LOOCV error? Is this what you expected? Explain your answer.
Comment on the statistical significance of the coefficient estimates that result from fitting each of the models in (2) using least squares. Do these results agree with the conclusions drawn based on the cross-validation results?

1. Create a scatterplot of X against Y. Comment on what you find.



In [ ]:

    
sim = preprocessing.Simulated()
sim.data.info()
sim.data.head()



In [ ]:

    
sim.scatter_plot()

2. Compute the LOOCV errors that result from fitting the following four models using least squares: Linear, Quadratic, Cubic and Quartic.



In [ ]:

    
for deg in range(1, 5):
    print('{}\nPolynomial Model Degree: {}\n'.format('*' * 80, deg))
    sim.linear_leave_one_out(degree=deg)

3. Repeat (2) using another random seed, and report your results. Are your results the same as what you got in (2)? Why?



In [ ]:

    
sim.random_seed = 2
sim.load_data()
sim.validation_split()
sim.single_feature()

for deg in range(1, 5):
    print('{}\nPolynomial Model Degree: {}\n'.format('*' * 80, deg))
    sim.linear_leave_one_out(degree=deg)

FINDINGS

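The answers are identical.
- This is expected when the underlying data do not change: LOOCV holds out each observation exactly once, so no random splitting is involved and the result does not depend on the seed.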

4. Which of the models in (3) had the smallest LOOCV error? Is this what you expected? Explain your answer.

The Quadratic model has the best fit.
This is reasonable, since the data take a quadratic form.
The two higher order models probably suffer from overfitting.
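
To make the comparison concrete, the sketch below computes LOOCV errors for polynomial fits of degree 1 through 4 with NumPy. The data are generated here with an assumed quadratic signal and are not the notebook's `Simulated` data; `loocv_mse` is a hypothetical helper, not the `linear_leave_one_out` method.

```python
import numpy as np


def loocv_mse(x, y, degree):
    """Leave-one-out CV estimate of the test MSE for a polynomial fit."""
    n = len(x)
    squared_errors = np.empty(n)
    for i in range(n):
        # Fit on every observation except the i-th, then predict the i-th.
        keep = np.arange(n) != i
        coefficients = np.polyfit(x[keep], y[keep], deg=degree)
        prediction = np.polyval(coefficients, x[i])
        squared_errors[i] = (y[i] - prediction) ** 2
    return float(squared_errors.mean())


# Synthetic data with a quadratic signal plus unit-variance noise.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = x - 2.0 * x ** 2 + rng.normal(size=100)

errors = {degree: loocv_mse(x, y, degree) for degree in range(1, 5)}
for degree, mse in errors.items():
    print(f'degree {degree}: LOOCV MSE = {mse:.3f}')

# LOOCV involves no random splitting, so rerunning on the same data
# reproduces the same numbers.
rerun = {degree: loocv_mse(x, y, degree) for degree in range(1, 5)}
print(errors == rerun)
```

With data of this form, the linear model should have a much larger error than the quadratic one, while the cubic and quartic fits offer little or no further improvement.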

5. Comment on the statistical significance of the coefficient estimates that result from fitting each of the models in (2) using least squares. Do these results agree with the conclusions drawn based on the cross-validation results?

Exercise 7 - Boston Housing Data

Task - We will now consider the Boston housing data set that we have used previously.

Based on this data set, provide an estimate for the population mean of medv. Call this estimate $\hat{\mu}$.
Provide an estimate of the standard error of $\hat{\mu}$. Interpret this result.
Now estimate the standard error of $\hat{\mu}$ using the bootstrap. How does this compare to your answer from (2)?
Based on your bootstrap estimate from (3), provide a 95% confidence interval for the mean of medv. Compare it to the results obtained from a t.test on medv.
Based on this data set, provide an estimate, $\hat{\mu}_{med}$, for the median value of medv in the population.
We now would like to estimate the standard error of $\hat{\mu}_{med}$. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.
Based on this data set, provide an estimate for the tenth percentile of medv in Boston suburbs. Call this quantity $\hat{\mu}_{0.1}$.
Use the bootstrap to estimate the standard error of $\hat{\mu}_{0.1}$. Comment on your findings.
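
All of the bootstrap tasks above follow one pattern: resample the data with replacement, recompute the statistic, and take the standard deviation of the recomputed values. The sketch below shows that pattern using only the standard library. The values are synthetic and are not the medv column; `bootstrap_se` and `tenth_percentile` are hypothetical helpers written for this illustration.

```python
import random
import statistics


def bootstrap_se(data, statistic, n_boot=1000, seed=0):
    """Bootstrap standard error of an arbitrary statistic."""
    rng = random.Random(seed)
    n = len(data)
    # Each bootstrap sample draws n values with replacement.
    estimates = [statistic(rng.choices(data, k=n)) for _ in range(n_boot)]
    return statistics.stdev(estimates)


def tenth_percentile(values):
    # With n=10, statistics.quantiles returns the nine decile cut points;
    # the first one is the 10th percentile.
    return statistics.quantiles(values, n=10)[0]


# Synthetic stand-in data (not the Boston housing values).
rng = random.Random(123)
values = [rng.gauss(20.0, 10.0) for _ in range(500)]

# For the mean there is a closed-form standard error to compare against.
se_formula = statistics.stdev(values) / len(values) ** 0.5
se_mean = bootstrap_se(values, statistics.mean)
se_median = bootstrap_se(values, statistics.median)
se_p10 = bootstrap_se(values, tenth_percentile)

print(f'SE of mean (formula):   {se_formula:.3f}')
print(f'SE of mean (bootstrap): {se_mean:.3f}')
print(f'SE of median:           {se_median:.3f}')
print(f'SE of 10th percentile:  {se_p10:.3f}')
```

The bootstrap and formula values for the mean should agree closely, which gives some confidence in using the same procedure for the median and the tenth percentile, where no simple formula is available.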

1. Based on this data set, provide an estimate for the population mean of medv. Call this estimate $\hat{\mu}$.



In [ ]:

    
bh = preprocessing.BostonHousing()
mu = bh.data.medv.mean()
mu

2. Provide an estimate of the standard error of $\hat{\mu}$. Interpret this result.



In [ ]:

    
mu_se = sp.stats.sem(bh.data.medv)
mu_se

3. Now estimate the standard error of $\hat{\mu}$ using the bootstrap. How does this compare to your answer from (2)?



In [ ]:

    
# Bootstrap: resample all n observations with replacement and record each mean.
boot_means = [(bh.data.medv
               .sample(n=bh.data.shape[0], replace=True)
               .mean())
              for _ in range(1000)]
# The standard deviation of the bootstrap means estimates the standard error.
se_bootstrap = np.std(boot_means, ddof=1)
se_bootstrap

4. Based on your bootstrap estimate from (3), provide a 95% confidence interval for the mean of medv. Compare it to the results obtained from a t.test on medv.



In [ ]:

    
offset = 2 * se_bootstrap
bh.data.medv.mean() - offset, bh.data.medv.mean() + offset



In [ ]:

    
sp.stats.t.interval(0.95, bh.data.shape[0] - 1,
                    loc=np.mean(bh.data.medv),
                    scale=sp.stats.sem(bh.data.medv))

5. Based on this data set, provide an estimate, $\hat{\mu}_{med}$, for the median value of medv in the population.



In [ ]:

    
bh.data.medv.median()

6. We now would like to estimate the standard error of $\hat{\mu}_{med}$. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.



In [ ]:

    
medians = [(bh.data.medv
            .sample(n=bh.data.shape[0], replace=True)
            .median())
           for _ in range(1000)]
print(f'Average Median: {np.mean(medians)}')
print(f'Standard Error: {np.std(medians)}')

7. Based on this data set, provide an estimate for the tenth percentile of medv in Boston suburbs. Call this quantity $\hat{\mu}_{0.1}$.



In [ ]:

    
bh.data.medv.quantile(0.1)

8. Use the bootstrap to estimate the standard error of $\hat{\mu}_{0.1}$. Comment on your findings.



In [ ]:

    
quantiles = [(bh.data.medv
              .sample(bh.data.shape[0], replace=True)
              .quantile(0.1))
             for _ in range(1000)]
print(f'Average 10th Percentile: {np.mean(quantiles):.3f}')
print(f'Standard Error: {np.std(quantiles):.3f}')