Classifying spam vs. email

The goal of this assignment is to get some practice building a classification model from a dataset with more than a handful of predictor variables. Download the spam dataset from https://archive.ics.uci.edu/ml/datasets/Spambase. Split the full dataset into a training set and a test set. Using the training data ONLY, build the best model you can for predicting whether or not an email is spam. Validate your model by making predictions on the test set and considering the precision, recall, and F1-score of your predictions. Discuss your results.
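One way to make the train/test split reproducible is to fix a random seed and split within each class, so both sets keep roughly the same spam fraction. The sketch below is only an illustration under stated assumptions: the helper name `stratified_split`, the 25% test fraction, and the toy labels are hypothetical, and the labels stand in for the dataset's class column. scikit-learn's `train_test_split` with its `stratify` and `random_state` arguments serves the same purpose.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), splitting each class separately.

    Hypothetical helper for illustration; not part of any library.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (assumption): 1 = spam, 0 = not spam.
labels = [1] * 40 + [0] * 60
train_idx, test_idx = stratified_split(labels, test_fraction=0.25, seed=0)
print(len(train_idx), len(test_idx))
```

After the split, the test indices should be set aside and not consulted again until the final evaluation.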

Submit a writeup describing your analysis. Your writeup must contain enough detail to allow reproduction of your results, along with justification for the decisions you made while constructing the model. Scores (precision, recall, and F1) for each category must be clearly displayed in a table. Include a brief discussion of your assessment of the quality of the model.
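As a rough sketch of what the per-category score table could look like, the snippet below computes precision, recall, and F1 for each class directly from true and predicted labels. The function name `per_class_scores` and the toy label lists are hypothetical and are not real results. scikit-learn's `classification_report` produces a similar table.

```python
def per_class_scores(y_true, y_pred, classes=(0, 1)):
    """Return a list of (class, precision, recall, f1) tuples.

    Hypothetical helper for illustration; each class is treated in turn
    as the "positive" class.
    """
    rows = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append((c, precision, recall, f1))
    return rows


# Toy labels (assumption): 1 = spam, 0 = not spam.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

rows = per_class_scores(y_true, y_pred)
print(f"{'class':>5} {'precision':>9} {'recall':>6} {'f1':>6}")
for c, precision, recall, f1 in rows:
    print(f"{c:>5} {precision:>9.3f} {recall:>6.3f} {f1:>6.3f}")
```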

If you use Python for this assignment, you may do your work and your writeup in a Jupyter notebook and submit the single notebook file. This is the preferred (but not required) method.

Rubric:

  • correct cross-validation: 3 points
  • reasonable/appropriate choices made during model construction: 8 points
  • reproducible work: 4 points
  • clear, concise writeup with results tabulated: 3 points
  • reasonable assessment of model quality: 2 points
  • Bonus! Include a plot of the ROC curve, report the AUC value, and include a plot of the Precision-Recall curve: 2 points
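
For the bonus item, the ROC curve comes from sweeping a decision threshold over the model's predicted scores and recording the false positive rate and true positive rate at each step. The AUC is the area under those points. The sketch below is a minimal illustration: the function names and toy scores are hypothetical, and it assumes both classes are present in `y_true`. scikit-learn's `roc_curve`, `roc_auc_score`, and `precision_recall_curve` cover the same ground.

```python
def roc_points(y_true, scores):
    """Return ROC points as (false positive rate, true positive rate).

    Hypothetical helper; assumes labels are 0/1 and both classes occur.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data (assumption): scores are predicted spam probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
auc = area_under_curve(points)
print(round(auc, 4))
```

The list of points can be passed to any plotting library to draw the curve. A Precision-Recall curve can be built with the same threshold sweep by recording precision and recall instead of the two rates.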
