Timothy Helton
NOTE: This notebook uses code found in the k2datascience.classification module.
To execute all the cells do one of the following items:
In [ ]:
from k2datascience import classification
from k2datascience import plotting
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
This question should be answered using the Weekly data set. This data is similar in nature to the Smarket data from earlier, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
1. Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
2. Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
3. Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
4. Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
5. Repeat (4) using LDA.
6. Repeat (4) using QDA.
7. Repeat (4) using KNN with K = 1.
8. Which of these methods appears to provide the best results on this data?
9. Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should
1. Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
In [ ]:
weekly = classification.Weekly()
weekly.data.info()
weekly.data.describe()
weekly.data.head()
In [ ]:
plotting.correlation_heatmap_plot(
    data=weekly.data, title='Weekly Stockmarket')
In [ ]:
plotting.correlation_pair_plot(
    weekly.data, title='Weekly Stockmarket')
2. Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
In [ ]:
weekly.logistic_regression(data=weekly.data)
weekly.logistic_model.summary()
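The fitting itself happens inside the k2datascience module, whose internals are not shown in this notebook. As a rough illustration of what fitting a logistic regression means, the sketch below estimates a one-predictor model by plain gradient ascent on the log-likelihood. The function names, learning rate, iteration count, and data are all assumptions made for illustration; the sketch does not reproduce the standard errors or p-values that a summary table reports.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, learning_rate=0.1, n_iter=5000):
    """Fit P(y = 1 | x) = sigmoid(b0 + b1 * x) by gradient ascent.

    Hypothetical helper: the step size and iteration count are
    illustrative choices, not tuned values.
    """
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        # Residuals y - p are the per-observation gradient terms.
        errors = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += learning_rate * sum(errors) / n
        b1 += learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
    return b0, b1


# Made-up data (not the Weekly data): larger x tends to go with y = 1.
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 1, 0, 0, 1, 1, 1, 1]

b0, b1 = fit_logistic(xs, ys)
p_at_two = sigmoid(b0 + b1 * 2.0)  # estimated P(y = 1) at x = 2
```

A positive b1 here means the estimated probability of the positive class rises with x; whether a coefficient is statistically significant still has to be read from the standard errors in the model summary.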
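3. Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.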
In [ ]:
weekly.confusion
print(weekly.classification)
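The confusion and classification attributes come from the k2datascience module, whose internals are not shown here. As a library-free sketch of what a confusion matrix and the overall fraction of correct predictions measure, assuming hypothetical label lists named actual and predicted:

```python
from collections import Counter


def confusion_counts(actual, predicted, labels=("Down", "Up")):
    """Return {(actual_label, predicted_label): count} for every label pair."""
    pairs = Counter(zip(actual, predicted))
    return {(a, p): pairs.get((a, p), 0) for a in labels for p in labels}


def fraction_correct(actual, predicted):
    """Overall fraction of predictions that match the actual labels."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


# Made-up labels for illustration only; not results from the Weekly data.
actual = ["Up", "Up", "Down", "Down", "Up", "Down"]
predicted = ["Up", "Up", "Up", "Up", "Up", "Down"]

matrix = confusion_counts(actual, predicted)
accuracy = fraction_correct(actual, predicted)
```

The off-diagonal entries separate the two kinds of mistake: the ("Down", "Up") cell counts weeks predicted Up that actually went Down, and the ("Up", "Down") cell counts the reverse.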
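4. Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).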
In [ ]:
weekly.logistic_regression(data=weekly.x_train)
weekly.logistic_model.summary()
In [ ]:
weekly.confusion
print(weekly.classification)
In [ ]:
weekly.categorize(weekly.x_test)
weekly.calc_prediction(weekly.y_test, weekly.prediction_nom)
weekly.confusion
print(weekly.classification)
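The x_train, x_test, and y_test attributes are prepared inside the k2datascience module. The sketch below shows one way such a time-based split could be done, assuming rows that carry the Year, Lag2, and Direction columns of the Weekly data. The helper name split_by_year and the row values are made up for illustration.

```python
# Hypothetical rows mimicking the Weekly columns used here (values are made up).
rows = [
    {"Year": 1990, "Lag2": 1.57, "Direction": "Down"},
    {"Year": 2008, "Lag2": -0.31, "Direction": "Up"},
    {"Year": 2009, "Lag2": 0.82, "Direction": "Up"},
    {"Year": 2010, "Lag2": -1.10, "Direction": "Down"},
]


def split_by_year(rows, last_train_year=2008):
    """Put rows up to and including last_train_year in train, later rows in test."""
    train = [r for r in rows if r["Year"] <= last_train_year]
    test = [r for r in rows if r["Year"] > last_train_year]
    return train, test


train, test = split_by_year(rows)

# Lag2 is the only predictor; Direction is the response.
x_train = [[r["Lag2"]] for r in train]
y_train = [r["Direction"] for r in train]
x_test = [[r["Lag2"]] for r in test]
y_test = [r["Direction"] for r in test]
```

Splitting on the year rather than at random keeps every held-out week later than every training week, which is what the exercise asks for.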
5. Repeat (4) using LDA.
In [ ]:
weekly.lda()
weekly.confusion
print(weekly.classification)
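The lda method is part of the k2datascience module and its implementation is not shown. For a single predictor such as Lag2, linear discriminant analysis reduces to estimating a mean and prior for each class plus one pooled variance, then choosing the class with the largest discriminant score. A minimal sketch under those assumptions, with hypothetical function names and made-up data:

```python
import math
from statistics import mean


def fit_lda_1d(xs, ys):
    """Estimate per-class prior and mean, plus the pooled variance."""
    classes = sorted(set(ys))
    n = len(xs)
    params = {}
    pooled_ss = 0.0
    for k in classes:
        xk = [x for x, y in zip(xs, ys) if y == k]
        mu = mean(xk)
        params[k] = {"prior": len(xk) / n, "mean": mu}
        pooled_ss += sum((x - mu) ** 2 for x in xk)
    # Pooled variance: within-class sum of squares over (n - number of classes).
    pooled_variance = pooled_ss / (n - len(classes))
    return params, pooled_variance


def predict_lda_1d(x, params, pooled_variance):
    """Return the class with the largest linear discriminant score."""
    def score(p):
        return (x * p["mean"] / pooled_variance
                - p["mean"] ** 2 / (2 * pooled_variance)
                + math.log(p["prior"]))
    return max(params, key=lambda k: score(params[k]))


# Made-up one-predictor data, not the Weekly data.
xs = [-1.2, -0.8, -0.5, 0.1, 0.4, 0.9, 1.3, 1.6]
ys = ["Down", "Down", "Down", "Down", "Up", "Up", "Up", "Up"]

params, pooled_variance = fit_lda_1d(xs, ys)
low = predict_lda_1d(-1.0, params, pooled_variance)
high = predict_lda_1d(1.0, params, pooled_variance)
```

With equal priors, as in this toy data, the decision boundary sits halfway between the two class means.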
6. Repeat (4) using QDA.
In [ ]:
weekly.qda()
weekly.confusion
print(weekly.classification)
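The qda method also lives in the k2datascience module. Quadratic discriminant analysis differs from LDA in that each class keeps its own variance, which makes the decision boundary quadratic in the predictor. The one-predictor sketch below uses hypothetical function names and made-up data chosen so that both classes share a mean and differ only in spread:

```python
import math
from statistics import mean, variance


def fit_qda_1d(xs, ys):
    """Estimate prior, mean, and sample variance separately for each class."""
    n = len(xs)
    params = {}
    for k in sorted(set(ys)):
        xk = [x for x, y in zip(xs, ys) if y == k]
        params[k] = {"prior": len(xk) / n, "mean": mean(xk), "var": variance(xk)}
    return params


def predict_qda_1d(x, params):
    """Return the class with the largest quadratic discriminant score."""
    def score(p):
        return (-0.5 * math.log(p["var"])
                - (x - p["mean"]) ** 2 / (2 * p["var"])
                + math.log(p["prior"]))
    return max(params, key=lambda k: score(params[k]))


# Made-up data: both classes are centred on 0, but "Up" is far more spread out.
xs = [-0.3, -0.1, 0.1, 0.3, -3.0, -1.0, 1.0, 3.0]
ys = ["Down", "Down", "Down", "Down", "Up", "Up", "Up", "Up"]

params = fit_qda_1d(xs, ys)
centre = predict_qda_1d(0.0, params)
right_tail = predict_qda_1d(2.5, params)
left_tail = predict_qda_1d(-2.5, params)
```

Here the tight class wins near the centre and the wide class wins in both tails, a pattern that a single linear boundary cannot produce.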
7. Repeat (4) using KNN with K = 1.
In [ ]:
weekly.knn()
weekly.confusion
print(weekly.classification)
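The knn method belongs to the k2datascience module. With K = 1 the classifier simply copies the label of the closest training observation, which the following sketch illustrates for a single predictor. The function name and data are made up for illustration.

```python
def predict_1nn(x_new, xs, ys):
    """Return the label of the training point closest to x_new."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x_new))
    return ys[nearest]


# Made-up training data with one predictor; not the Weekly data.
train_x = [-1.5, -0.2, 0.3, 1.8]
train_y = ["Down", "Up", "Down", "Up"]

new_points = [-1.0, 0.0, 0.4, 2.0]
preds = [predict_1nn(x, train_x, train_y) for x in new_points]

# Each distinct training point is its own nearest neighbour,
# so predictions on the training set are always correct when K = 1.
train_preds = [predict_1nn(x, train_x, train_y) for x in train_x]
```

Because training accuracy is perfect by construction when the training points are distinct, only the held-out accuracy says anything about how well K = 1 generalizes.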
8. Which of these methods appears to provide the best results on this data?
The model accuracy in descending order is the following:
9. Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should
In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
1. Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median.
2. Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
3. Split the data into a training set and a test set.
4. Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (2). What is the test error of the model obtained?
5. Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (2). What is the test error of the model obtained?
6. Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (2). What is the test error of the model obtained?
7. Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (2). What test errors do you obtain? Which value of K seems to perform the best on this data set?
1. Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median.
In [ ]:
auto = classification.Auto()
auto.data.info()
auto.data.describe()
auto.data.head()
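The cell above only inspects the data, so the mpg01 column is presumably created inside the Auto class. A minimal sketch of the median split itself, using made-up mpg values rather than the Auto data:

```python
from statistics import median

# Made-up mpg values for illustration; not taken from the Auto data set.
mpg = [18.0, 15.0, 36.0, 27.0, 22.5, 31.0]

mpg_median = median(mpg)

# 1 when mpg is above the median, 0 otherwise. Values exactly equal to the
# median get 0 here; the exercise statement leaves that case unspecified.
mpg01 = [1 if value > mpg_median else 0 for value in mpg]
```

Splitting at the median gives two classes of roughly equal size, so a classifier that always predicts one class would be right only about half the time.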
2. Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
In [ ]:
plotting.correlation_heatmap_plot(
    data=auto.data, title='Auto')
In [ ]:
plotting.correlation_pair_plot(
    data=auto.data, title='Auto')
In [ ]:
auto.box_plots()
3. Split the data into a training set and a test set.
In [ ]:
auto.x_train.info()
auto.y_train.head()
auto.x_test.info()
auto.y_test.head()
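The x_train, y_train, x_test, and y_test attributes are produced by the k2datascience module, and how it splits the data is not shown here. One common approach is a seeded random split, sketched below. The helper name, the 25% test fraction, and the seed are assumptions made for illustration.

```python
import random


def train_test_split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows with a fixed seed and cut off a test portion."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the data rows: 20 row identifiers.
rows = list(range(20))

train, test = train_test_split_rows(rows)
train_again, test_again = train_test_split_rows(rows)
```

Fixing the seed makes the split reproducible, so test errors from different models are compared on the same held-out rows.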
4. Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (2). What is the test error of the model obtained?
In [ ]:
auto.classify_data(model='LDA')
auto.confusion
print(auto.classification)
5. Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (2). What is the test error of the model obtained?
In [ ]:
auto.classify_data(model='QDA')
auto.confusion
print(auto.classification)
6. Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (2). What is the test error of the model obtained?
In [ ]:
auto.classify_data(model='LR')
auto.confusion
print(auto.classification)
7. Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (2). What test errors do you obtain? Which value of K seems to perform the best on this data set?
In [ ]:
auto.accuracy_vs_k()
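The accuracy_vs_k method is defined in the k2datascience module. As a sketch of the underlying idea, the code below computes the test error of a one-predictor KNN classifier for several odd values of K and picks the smallest error. The function names and data are made up for illustration, and only odd K values are used so that two-class votes cannot tie.

```python
from collections import Counter


def predict_knn(x_new, xs, ys, k):
    """Majority vote among the k training points closest to x_new."""
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x_new))
    votes = Counter(ys[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def knn_test_error(train_x, train_y, test_x, test_y, k):
    """Fraction of test points whose predicted label is wrong."""
    wrong = sum(
        predict_knn(x, train_x, train_y, k) != y
        for x, y in zip(test_x, test_y)
    )
    return wrong / len(test_x)


# Made-up data: the training point at 2.1 is a noisy label inside class 1.
train_x = [1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 2.1]
train_y = [1, 1, 1, 0, 0, 0, 0]
test_x = [1.2, 2.2, 3.8]
test_y = [1, 1, 0]

errors = {k: knn_test_error(train_x, train_y, test_x, test_y, k) for k in (1, 3, 5)}
best_k = min(errors, key=errors.get)  # first K with the lowest test error
```

In this toy data K = 1 is misled by the single noisy training point, while larger K values outvote it.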
In [ ]:
auto.classify_data(model='KNN', n=13)
auto.confusion
print(auto.classification)