Timothy Helton
The goal of predictive modeling is to create models that make good predictions on new data. We don't have access to this new data at the time of training, so we must use statistical methods to estimate the performance of a model on new data. This class of methods is called resampling methods, because they resample the available training data.
NOTE:
This notebook uses code found in the k2datascience.preprocessing module.
To execute all the cells do one of the following items:
In [ ]:
import pandas as pd
import numpy as np
import scipy as sp
import scipy.stats  # make sure the stats submodule is loaded for sp.stats calls
from k2datascience import plotting
from k2datascience import preprocessing
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
We will now derive the probability that a given observation is part of a bootstrap sample. Suppose that we obtain a bootstrap sample from a set of n observations.
(a) What is the probability that the first bootstrap observation is not the jth observation from the original sample? Justify your answer.
(b) What is the probability that the second bootstrap observation is not the jth observation from the original sample?
(c) Argue that the probability that the jth observation is not in the bootstrap sample is $(1 - 1/n)^n$.
(d) When n = 5, what is the probability that the jth observation is in the bootstrap sample?
In [ ]:
print(f'{preprocessing.prob_bootstrap(5):.3f}')
(e) When n = 100, what is the probability that the jth observation is in the bootstrap sample?
In [ ]:
print(f'{preprocessing.prob_bootstrap(100):.3f}')
(f) When n = 10,000, what is the probability that the jth observation is in the bootstrap sample?
In [ ]:
print(f'{preprocessing.prob_bootstrap(1e4):.3f}')
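The argument behind parts (a) to (f) can be checked numerically. Each of the n draws misses observation j with probability 1 - 1/n, and the draws are independent, so all n draws miss it with probability $(1 - 1/n)^n$. The sketch below evaluates the complement directly; `inclusion_probability` is a hypothetical helper written for this illustration and is not the `prob_bootstrap` function from k2datascience.

```python
import math


def inclusion_probability(n):
    """Probability that a given observation appears in a bootstrap sample of size n."""
    # Each draw misses observation j with probability 1 - 1/n, and the n draws
    # are independent, so the chance that every draw misses it is (1 - 1/n)**n.
    return 1 - (1 - 1 / n) ** n


for n in (5, 100, 10_000):
    print(f'n = {n:>6}: {inclusion_probability(n):.3f}')

# As n grows, (1 - 1/n)**n tends to 1/e, so the probability tends to 1 - 1/e.
print(f'limit     : {1 - math.exp(-1):.3f}')
```

The values settle quickly near 0.632, so roughly two thirds of the original observations are expected to appear in any one bootstrap sample.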
(a) Explain how k-fold cross-validation is implemented.
(b) What are the advantages and disadvantages of k-fold cross-validation relative to:
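As a rough illustration for part (a), k-fold cross-validation can be sketched with NumPy alone: shuffle the observations, split them into k folds, fit on k - 1 folds, score on the held-out fold, and average the k errors. The function names, the straight-line model, and the synthetic data are all assumptions made for this example.

```python
import numpy as np


def k_fold_indices(n_obs, k, seed=0):
    """Shuffle the row indices and split them into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_obs), k)


def k_fold_mse(x, y, k=5, seed=0):
    """Estimate the test MSE of a straight-line fit with k-fold cross-validation."""
    folds = k_fold_indices(len(x), k, seed)
    fold_errors = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th, then score on the held-out fold.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
        predictions = slope * x[test_idx] + intercept
        fold_errors.append(np.mean((y[test_idx] - predictions) ** 2))
    # The cross-validation estimate is the average of the k fold errors.
    return np.mean(fold_errors)


# Synthetic data for illustration only.
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 3 * x + rng.normal(size=200)
cv_mse = k_fold_mse(x, y, k=5)
print(f'5-fold CV estimate of test MSE: {cv_mse:.3f}')
```

Setting k equal to the number of observations turns this into leave-one-out cross-validation, while a single split corresponds to the validation set approach.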
Task - Fit a logistic regression model that uses income and balance to predict default. Compare the error of the scikit-learn and statsmodels implementations without the validation set.
In [ ]:
loan = preprocessing.LoanDefault()
loan.data.info()
loan.data.head()
In [ ]:
data = loan.data
title = 'Loan'
plotting.correlation_heatmap_plot(data, title=title)
plotting.correlation_pair_plot(data, title=title)
In [ ]:
loan.validation_split()
loan.logistic_summary()
Task - Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:
Classify each individual to the default category if the posterior probability is greater than 0.5.
Fit a logistic regression model that predicts default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.
In [ ]:
loan = preprocessing.LoanDefault()
loan.logistic_bootstrap(3)
In [ ]:
loan.features = (pd.concat([loan.data.loc[:, ['balance', 'income']],
                            loan.data.student.cat.codes],
                           axis=1)
                 .rename(columns={0: 'student'}))
loan.validation_split()
loan.logistic_summary()
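The validation set approach used above can be sketched independently of the `LoanDefault` class. The example below is only an illustration under stated assumptions: the data are synthetic, and `fit_logistic` and `predict_proba` are hypothetical helpers that fit logistic regression with Newton's method rather than the scikit-learn or statsmodels routines the notebook relies on.

```python
import numpy as np


def fit_logistic(features, labels, n_iter=25):
    """Fit logistic regression by Newton's method; returns [intercept, coefs...]."""
    design = np.column_stack([np.ones(len(features)), features])
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        prob = 1 / (1 + np.exp(-design @ beta))
        gradient = design.T @ (labels - prob)
        hessian = design.T @ (design * (prob * (1 - prob))[:, None])
        beta += np.linalg.solve(hessian, gradient)
    return beta


def predict_proba(beta, features):
    """Posterior probability of the positive class for each row."""
    design = np.column_stack([np.ones(len(features)), features])
    return 1 / (1 + np.exp(-design @ beta))


# Synthetic two-feature data set (an assumption, not the loan data).
rng = np.random.default_rng(0)
n_obs = 1000
features = rng.normal(size=(n_obs, 2))
true_beta = np.array([-1.0, 2.0, 0.5])
labels = (rng.random(n_obs) < predict_proba(true_beta, features)).astype(float)

# Split the observations into a training half and a validation half.
shuffled = rng.permutation(n_obs)
train_idx, valid_idx = shuffled[:n_obs // 2], shuffled[n_obs // 2:]

# Fit the model using only the training observations.
beta_hat = fit_logistic(features[train_idx], labels[train_idx])

# Classify a validation observation as positive if its posterior exceeds 0.5.
predicted = predict_proba(beta_hat, features[valid_idx]) > 0.5

# The validation set error is the fraction of misclassified observations.
validation_error = np.mean(predicted != labels[valid_idx])
print(f'Validation set error: {validation_error:.3f}')
```

Re-running with a different seed changes the split and therefore the estimate, which is the variability the repeated runs in this task are meant to expose.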
Task - Compute estimates for the standard errors of the income and balance logistic regression coefficients by using the bootstrap and logistic regression functions.
In [ ]:
loan = preprocessing.LoanDefault()
loan.logistic_bootstrap(100)
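What a bootstrap estimate of coefficient standard errors involves can be sketched as follows: refit the model on many resamples drawn with replacement and take the standard deviation of each coefficient across the refits. This is an illustration only; it uses synthetic data and hypothetical helpers (`fit_logistic`, `bootstrap_coef_se`), and it makes no claim about how `logistic_bootstrap` is implemented.

```python
import numpy as np


def fit_logistic(features, labels, n_iter=25):
    """Fit logistic regression by Newton's method; returns [intercept, coefs...]."""
    design = np.column_stack([np.ones(len(features)), features])
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        prob = 1 / (1 + np.exp(-design @ beta))
        gradient = design.T @ (labels - prob)
        hessian = design.T @ (design * (prob * (1 - prob))[:, None])
        beta += np.linalg.solve(hessian, gradient)
    return beta


def bootstrap_coef_se(features, labels, n_boot=200, seed=0):
    """Standard deviation of each coefficient across bootstrap refits."""
    rng = np.random.default_rng(seed)
    n_obs = len(labels)
    estimates = np.empty((n_boot, features.shape[1] + 1))
    for b in range(n_boot):
        # Draw n_obs row indices with replacement and refit on that resample.
        idx = rng.integers(0, n_obs, size=n_obs)
        estimates[b] = fit_logistic(features[idx], labels[idx])
    return estimates.std(axis=0, ddof=1)


# Synthetic data (an assumption, not the loan data).
rng = np.random.default_rng(1)
features = rng.normal(size=(500, 2))
logits = -0.5 + 1.0 * features[:, 0] + 0.5 * features[:, 1]
labels = (rng.random(500) < 1 / (1 + np.exp(-logits))).astype(float)

coef_se = bootstrap_coef_se(features, labels)
print('Bootstrap standard errors (intercept, x1, x2):', np.round(coef_se, 3))
```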
Task - We will compute the LOOCV error for a simple logistic regression model on the SMarket data set.
In [ ]:
sm = preprocessing.StockMarket()
sm.data.info()
sm.data.head()
In [ ]:
data = sm.data
title = 'Stock Market'
plotting.correlation_heatmap_plot(data, title=title)
plotting.correlation_pair_plot(data, title=title)
2. Fit a logistic regression model that predicts Direction using Lag1 and Lag2.
In [ ]:
sm.logistic_summary()
3. Fit a logistic regression model that predicts Direction using Lag1 and Lag2 using all but the first observation.
In [ ]:
sm.data = sm.data.iloc[1:]
sm.logistic_summary()
4. Use the model from (3) to predict the direction of the first observation. You can do this by predicting that the first observation will go up if $P(\text{Direction = Up} \mid \text{Lag1}, \text{Lag2}) > 0.5$. Was this observation correctly classified?
In [ ]:
sm.data.iloc[0]
sm.data.direction.cat.categories
sm.predict[0]
5. Write a loop from i=1 to i=n, where n is the number of observations in the data set, that performs each of the following steps:
Fit a logistic regression model using all but the ith observation to predict Direction using Lag1 and Lag2.
In [ ]:
sm = preprocessing.StockMarket()
sm.logistic_leave_one_out()
In [ ]:
sm.logistic_leave_one_out()
6. Take the average of the n numbers obtained in (5) in order to obtain the LOOCV estimate for the test error. Comment on the results.
Task - We will now perform cross-validation on a simulated data set.
1. Create a scatterplot of X against Y. Comment on what you find.
In [ ]:
sim = preprocessing.Simulated()
sim.data.info()
sim.data.head()
In [ ]:
sim.scatter_plot()
2. Compute the LOOCV errors that result from fitting the following four models using least squares: Linear, Quadratic, Cubic and Quartic.
In [ ]:
for deg in range(1, 5):
    print('{}\nPolynomial Model Degree: {}\n'.format('*' * 80, deg))
    sim.linear_leave_one_out(degree=deg)
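For readers without the `Simulated` class, the LOOCV computation for the four polynomial models can be sketched with NumPy. The data-generating process below (a quadratic signal plus Gaussian noise) and the helper name `loocv_mse` are assumptions for illustration; the notebook's own simulated data may differ.

```python
import numpy as np


def loocv_mse(x, y, degree):
    """Leave-one-out CV estimate of the test MSE for a polynomial least squares fit."""
    n_obs = len(x)
    squared_errors = np.empty(n_obs)
    for i in range(n_obs):
        # Fit on every observation except the i-th, then predict the i-th.
        keep = np.arange(n_obs) != i
        coefs = np.polyfit(x[keep], y[keep], deg=degree)
        squared_errors[i] = (y[i] - np.polyval(coefs, x[i])) ** 2
    return squared_errors.mean()


# Assumed simulation: quadratic relationship with unit-variance noise.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = x - 2 * x ** 2 + rng.normal(size=100)

errors = {degree: loocv_mse(x, y, degree) for degree in range(1, 5)}
for degree, error in errors.items():
    print(f'Degree {degree}: LOOCV MSE = {error:.3f}')
```

Because LOOCV leaves out each observation exactly once, it involves no random splitting: for a fixed data set the result is the same on every run, and only regenerating the data changes it.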
3. Repeat (2) using another random seed, and report your results. Are your results the same as what you got in (2)? Why?
In [ ]:
sim.random_seed = 2
sim.load_data()
sim.validation_split()
sim.single_feature()
for deg in range(1, 5):
    print('{}\nPolynomial Model Degree: {}\n'.format('*' * 80, deg))
    sim.linear_leave_one_out(degree=deg)
4. Which of the models in (3) had the smallest LOOCV error? Is this what you expected? Explain your answer.
5. Comment on the statistical significance of the coefficient estimates that result from fitting each of the models in (2) using least squares. Do these results agree with the conclusions drawn based on the cross-validation results?
Task - We will now consider the Boston housing data set that we have used previously.
1. Based on this data set, provide an estimate for the population mean of medv. Call this estimate $\hat{\mu}$.
In [ ]:
bh = preprocessing.BostonHousing()
mu = bh.data.medv.mean()
mu
2. Provide an estimate of the standard error of $\hat{\mu}$. Interpret this result.
In [ ]:
mu_se = sp.stats.sem(bh.data.medv)
mu_se
3. Now estimate the standard error of $\hat{\mu}$ using the bootstrap. How does this compare to your answer from (2)?
In [ ]:
# Bootstrap: resample the full data set with replacement, record the mean of
# each resample, and use the spread of those means as the standard error.
boot_means = [(bh.data.medv
               .sample(n=bh.data.shape[0], replace=True)
               .mean())
              for _ in range(1000)]
se_bootstrap = np.std(boot_means, ddof=1)
se_bootstrap
4. Based on your bootstrap estimate from (3), provide a 95% confidence interval for the mean of medv. Compare it to the results obtained from a t.test on medv.
In [ ]:
offset = 2 * se_bootstrap
bh.data.medv.mean() - offset, bh.data.medv.mean() + offset
In [ ]:
sp.stats.t.interval(0.95, bh.data.shape[0] - 1,
                    loc=np.mean(bh.data.medv),
                    scale=sp.stats.sem(bh.data.medv))
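The interval built from the bootstrap standard error (mean plus or minus two standard errors) can be compared with a percentile interval read directly from the bootstrap distribution. The percentile interval is a different technique from the one the exercise asks for and is shown only as a cross-check. The data here are a synthetic stand-in, not medv, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a housing-value column (an assumption).
values = rng.normal(loc=20.0, scale=9.0, size=500)

# Bootstrap distribution of the sample mean.
boot_means = np.array([
    rng.choice(values, size=values.size, replace=True).mean()
    for _ in range(2000)
])

# Interval from the bootstrap standard error: mean +/- 2 * SE.
se_boot = boot_means.std(ddof=1)
normal_ci = (values.mean() - 2 * se_boot, values.mean() + 2 * se_boot)

# Percentile interval: the middle 95% of the bootstrap means.
percentile_ci = tuple(np.percentile(boot_means, [2.5, 97.5]))

print(f'Mean +/- 2 SE : ({normal_ci[0]:.3f}, {normal_ci[1]:.3f})')
print(f'Percentile    : ({percentile_ci[0]:.3f}, {percentile_ci[1]:.3f})')
```

For a statistic such as the mean, whose sampling distribution is close to normal at this sample size, the two intervals should nearly coincide.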
5. Based on this data set, provide an estimate, $\hat{\mu}_{med}$, for the median value of medv in the population.
In [ ]:
bh.data.medv.median()
6. We now would like to estimate the standard error of $\hat{\mu}_{med}$. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.
In [ ]:
medians = [(bh.data.medv
            .sample(n=bh.data.shape[0], replace=True)
            .median())
           for _ in range(1000)]
print(f'Average Median: {np.mean(medians)}')
print(f'Standard Error: {np.std(medians)}')
7. Based on this data set, provide an estimate for the tenth percentile of medv in Boston suburbs. Call this quantity $\hat{\mu}_{0.1}$.
In [ ]:
bh.data.medv.quantile(0.1)
8. Use the bootstrap to estimate the standard error of $\hat{\mu}_{0.1}$. Comment on your findings.
In [ ]:
quantiles = [(bh.data.medv
              .sample(bh.data.shape[0], replace=True)
              .quantile(0.1))
             for _ in range(1000)]
print(f'Average 10th Percentile: {np.mean(quantiles):.3f}')
print(f'Standard Error: {np.std(quantiles):.3f}')
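The median and tenth-percentile calculations follow the same pattern, so they can be folded into one helper that accepts any statistic. The sketch below assumes a hypothetical function `bootstrap_se` and a synthetic right-skewed sample in place of medv.

```python
import numpy as np


def bootstrap_se(values, statistic, n_boot=1000, seed=0):
    """Bootstrap standard error of an arbitrary statistic of a 1-D sample."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    # Recompute the statistic on n_boot resamples drawn with replacement.
    replicates = [
        statistic(rng.choice(values, size=values.size, replace=True))
        for _ in range(n_boot)
    ]
    return np.std(replicates, ddof=1)


# Synthetic right-skewed sample (an assumption, not the Boston data).
rng = np.random.default_rng(3)
sample = rng.lognormal(mean=3.0, sigma=0.4, size=500)

median_se = bootstrap_se(sample, np.median)
p10_se = bootstrap_se(sample, lambda v: np.quantile(v, 0.1))
print(f'Median SE          : {median_se:.3f}')
print(f'10th percentile SE : {p10_se:.3f}')
```

Passing the statistic as a function keeps the resampling logic in one place, which is convenient for quantities such as the median that have no simple standard error formula.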