Classification

Timothy Helton



NOTE:
This notebook uses code found in the k2datascience.classification module. To execute all the cells, do one of the following:

  • Install the k2datascience package to the active Python interpreter.
  • Add k2datascience/k2datascience to the PYTHONPATH environment variable.
  • Create a link to the classification.py file in the same directory as this notebook.
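The second option can also be approximated at runtime by appending the package location to `sys.path` before importing. A minimal sketch; the path below is a placeholder for your local checkout:

```python
# Make the k2datascience package importable without installing it.
# '/path/to/k2datascience' is a placeholder; point it at your local copy.
import sys
sys.path.append('/path/to/k2datascience')
```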


Imports


In [ ]:
from k2datascience import classification
from k2datascience import plotting

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline

Exercise 1

This question should be answered using the Weekly data set. This data is similar in nature to the Smarket data from earlier, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

  1. Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

  2. Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

  3. Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

  4. Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

  5. Repeat (4) using LDA.

  6. Repeat (4) using QDA.
  7. Repeat (4) using KNN with K = 1.
  8. Which of these methods appears to provide the best results on this data?

  9. Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

1. Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?


In [ ]:
weekly = classification.Weekly()
weekly.data.info()
weekly.data.describe()
weekly.data.head()

In [ ]:
plotting.correlation_heatmap_plot(
    data=weekly.data, title='Weekly Stockmarket')

In [ ]:
plotting.correlation_pair_plot(
    weekly.data, title='Weekly Stockmarket')
FINDINGS
  • There do not appear to be any noticeable patterns in the dataset.
  • All variables except Volume appear to follow a Gaussian distribution.

2. Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?


In [ ]:
weekly.logistic_regression(data=weekly.data)
weekly.logistic_model.summary()
FINDINGS
  • The intercept and Lag2 features have p-values below the 0.05 threshold and appear statistically significant.

3. Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.


In [ ]:
weekly.confusion
print(weekly.classification)
FINDINGS
  • The model is not well suited to the data.
    • The Precision measures the accuracy of the Positive predictions. $$\frac{T_p}{T_p + F_p}$$
    • The Recall measures the fraction of actual positives the model correctly identified. $$\frac{T_p}{T_p + F_n}$$
    • The F1-score is the harmonic mean of the precision and recall.
      • Harmonic Mean is used when the average of rates is desired. $$\frac{2 \times Precision \times Recall}{Precision + Recall}$$
    • The Support is the total number of samples in each class.
      • It is the sum of each row of the confusion matrix.
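These metrics can be computed directly from a confusion matrix. A minimal sketch with made-up counts (rows are actual classes, columns are predicted classes):

```python
# Hypothetical 2x2 confusion matrix: rows = actual (Down, Up),
# columns = predicted (Down, Up). The counts are invented for illustration.
confusion = [[54, 48],
             [30, 90]]

def scores(cm, positive):
    """Precision, recall, F1, and support for one class of a confusion matrix."""
    tp = cm[positive][positive]
    fp = sum(row[positive] for row in cm) - tp   # predicted positive, actually negative
    fn = sum(cm[positive]) - tp                  # actually positive, predicted negative
    precision = tp / (tp + fp)                   # Tp / (Tp + Fp)
    recall = tp / (tp + fn)                      # Tp / (Tp + Fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    support = sum(cm[positive])                  # row total for the class
    return precision, recall, f1, support

precision, recall, f1, support = scores(confusion, positive=1)
```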

4. Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).


In [ ]:
weekly.logistic_regression(data=weekly.x_train)
weekly.logistic_model.summary()

In [ ]:
weekly.confusion
print(weekly.classification)
FINDINGS
  • Using 80% of the data as a training set did not improve the model's accuracy.

In [ ]:
weekly.categorize(weekly.x_test)
weekly.calc_prediction(weekly.y_test, weekly.prediction_nom)
weekly.confusion
print(weekly.classification)
FINDINGS
  • Testing the model on the remaining 20% of the data yields results worse than random guessing.
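The `classification` module hides the split; the same year-based split and single-predictor fit can be sketched with scikit-learn. The column names follow the Weekly data set, but the values below are synthetic stand-ins, so the numbers will not match the notebook's:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic stand-in for the Weekly data: Year, Lag2, Direction (1 = Up).
rng = np.random.default_rng(0)
n = 500
frame = pd.DataFrame({'Year': rng.integers(1990, 2011, size=n),
                      'Lag2': rng.normal(size=n)})
frame['Direction'] = (frame.Lag2 + rng.normal(size=n) > 0).astype(int)

# Train on 1990-2008; hold out 2009 and 2010.
train = frame[frame.Year <= 2008]
test = frame[frame.Year >= 2009]

model = LogisticRegression().fit(train[['Lag2']], train.Direction)
pred = model.predict(test[['Lag2']])
print(confusion_matrix(test.Direction, pred))
print(f'held-out accuracy: {accuracy_score(test.Direction, pred):.3f}')
```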

5. Repeat (4) using LDA.


In [ ]:
weekly.lda()
weekly.confusion
print(weekly.classification)
FINDINGS
  • This model is extremely accurate.

6. Repeat (4) using QDA.


In [ ]:
weekly.qda()
weekly.confusion
print(weekly.classification)
FINDINGS
  • This model is better than the logistic regression, but not as good as the LDA model.

7. Repeat (4) using KNN with K = 1.


In [ ]:
weekly.knn()
weekly.confusion
print(weekly.classification)
FINDINGS
  • This model is better than the logistic regression, but not as good as the QDA.

8. Which of these methods appears to provide the best results on this data?

The model accuracy in descending order is the following:

  1. Linear Discriminant Analysis
  2. Quadratic Discriminant Analysis
  3. K-Nearest Neighbors
  4. Logistic Regression
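The comparison can be reproduced in spirit with scikit-learn's stock implementations, fitting each classifier on the same split and sorting by held-out accuracy. The data below is synthetic, so the resulting order will not match the Weekly results:

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for the Weekly predictors.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {'Logistic Regression': LogisticRegression(),
          'LDA': LinearDiscriminantAnalysis(),
          'QDA': QuadraticDiscriminantAnalysis(),
          'KNN (k=1)': KNeighborsClassifier(n_neighbors=1)}

# Held-out accuracy for each classifier, highest first.
accuracies = {name: m.fit(X_train, y_train).score(X_test, y_test)
              for name, m in models.items()}
for name, acc in sorted(accuracies.items(), key=lambda kv: -kv[1]):
    print(f'{name}: {acc:.3f}')
```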

9. Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.


In [ ]:

Exercise 2

In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

  1. Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median.

  2. Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

  3. Split the data into a training set and a test set.

  4. Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (2). What is the test error of the model obtained?

  5. Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (2). What is the test error of the model obtained?

  6. Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (2). What is the test error of the model obtained?

  7. Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (2). What test errors do you obtain? Which value of K seems to perform the best on this data set?

1. Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median.


In [ ]:
auto = classification.Auto()
auto.data.info()
auto.data.describe()
auto.data.head()
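The mpg01 column itself is a one-liner in pandas; a minimal sketch on a stand-in frame (the mpg values are made up):

```python
import pandas as pd

# Stand-in for the Auto data's mpg column; values are invented.
auto = pd.DataFrame({'mpg': [18.0, 15.0, 36.0, 26.0, 9.0, 44.0]})

# mpg01 is 1 if mpg is above its median, 0 otherwise.
auto['mpg01'] = (auto.mpg > auto.mpg.median()).astype(int)
print(auto)
```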

2. Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.


In [ ]:
plotting.correlation_heatmap_plot(
    data=auto.data, title='Auto')

In [ ]:
plotting.correlation_pair_plot(
    data=auto.data, title='Auto')

In [ ]:
auto.box_plots()
FINDINGS
  • The following features appear to have a direct impact on the vehicle's gas mileage.
    • Displacement
      • Cylinders are related to Displacement and will not be included.
    • Horsepower
    • Weight
    • Origin
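The feature choice above can also be checked numerically by correlating each column with mpg01 and keeping the strongest. A sketch using the Auto column names but synthetic values and a toy target:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with a few of the Auto column names.
rng = np.random.default_rng(2)
n = 200
auto = pd.DataFrame({
    'displacement': rng.uniform(70, 450, size=n),
    'horsepower': rng.uniform(45, 230, size=n),
    'weight': rng.uniform(1600, 5100, size=n),
    'acceleration': rng.uniform(8, 25, size=n),
})
# Toy target: heavier, larger-displacement cars get low mileage.
size = auto.weight + 5 * auto.displacement
auto['mpg01'] = (size < size.median()).astype(int)

# Rank features by absolute correlation with mpg01.
assoc = auto.corr()['mpg01'].drop('mpg01').abs().sort_values(ascending=False)
print(assoc)
```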

3. Split the data into a training set and a test set.


In [ ]:
auto.x_train.info()
auto.y_train.head()
auto.x_test.info()
auto.y_test.head()

4. Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (2). What is the test error of the model obtained?


In [ ]:
auto.classify_data(model='LDA')
auto.confusion
print(auto.classification)

5. Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (2). What is the test error of the model obtained?


In [ ]:
auto.classify_data(model='QDA')
auto.confusion
print(auto.classification)

6. Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (2). What is the test error of the model obtained?


In [ ]:
auto.classify_data(model='LR')
auto.confusion
print(auto.classification)

7. Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (2). What test errors do you obtain? Which value of K seems to perform the best on this data set?


In [ ]:
auto.accuracy_vs_k()

In [ ]:
auto.classify_data(model='KNN', n=13)
auto.confusion
print(auto.classification)
FINDINGS
  • The most accurate model for this dataset is Quadratic Discriminant Analysis.
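The accuracy-vs-K sweep behind `auto.accuracy_vs_k` can be sketched as a loop over `n_neighbors`, again with synthetic stand-in data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for the selected Auto features.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Held-out accuracy for a range of odd K values.
accuracy = {k: KNeighborsClassifier(n_neighbors=k)
               .fit(X_train, y_train)
               .score(X_test, y_test)
            for k in range(1, 21, 2)}
best_k = max(accuracy, key=accuracy.get)
print(f'best K: {best_k} (accuracy {accuracy[best_k]:.3f})')
```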

Exercise 3

Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.


In [ ]:
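One way to start this exercise: build the binary response from the median crime rate, then fit any of the classifiers above. The frame below uses a crim column like the Boston data's, but the values and the two predictors are synthetic placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Boston data: crim plus two predictors.
rng = np.random.default_rng(4)
n = 506
boston = pd.DataFrame({'nox': rng.uniform(0.4, 0.9, size=n),
                       'dis': rng.uniform(1, 12, size=n)})
boston['crim'] = np.exp(3 * boston.nox - 0.2 * boston.dis
                        + rng.normal(scale=0.5, size=n))

# Binary response: crime rate above the median.
boston['crim01'] = (boston.crim > boston.crim.median()).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    boston[['nox', 'dis']], boston.crim01, test_size=0.2, random_state=0)
acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)
print(f'held-out accuracy: {acc:.3f}')
```

The same split feeds LDA, QDA, and KNN by swapping the estimator, which keeps the subset-of-predictors comparison fair.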