This lab focuses on SMS message spam detection using $k$ nearest neighbours classification. It's a direct counterpart to the rule-based spam detection from Lab 05 and the decision tree models from Lab 07a. At the end of the lab, you should be able to use scikit-learn to build, tune and evaluate $k$ nearest neighbours classification models.
Let's start by importing the packages we'll need. This week, we're going to use the neighbors subpackage from scikit-learn to build $k$ nearest neighbours models. We'll also use the dummy subpackage to build a baseline model against which we can gauge how good our final model is.
In [ ]:
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
Next, let's load the data. Write the path to your sms.csv file in the cell below:
In [ ]:
data_file = 'data/sms.csv'
Execute the cell below to load the CSV data into a pandas data frame with the columns label and message.
Note: This week, the CSV file is not comma separated, but instead tab separated. We can tell pandas about the different format using the sep argument, as shown in the cell below. For more information, see the read_csv documentation.
In [ ]:
sms = pd.read_csv(data_file, sep='\t', header=None, names=['label', 'message'])
sms.head()
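Spam datasets are often imbalanced, so it can be worth checking the class distribution before modelling. A minimal sketch on made-up rows (in the notebook, you would call value_counts on the loaded sms frame instead):

```python
import pandas as pd

# Hypothetical stand-in for the loaded SMS data
sms = pd.DataFrame({
    'label': ['ham', 'ham', 'spam', 'ham'],
    'message': ['Hi there', 'See you soon', 'WIN a prize now!', 'Lunch?']
})

# value_counts shows how many messages fall into each class
counts = sms['label'].value_counts()
print(counts)
```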
Next, let's select our feature ($X$) and target ($y$) variables from the data. Usually, we would use all of the available data but, for speed ($k$ nearest neighbours can be CPU intensive), let's just select a random sample. We can do this using the sample method in pandas, as follows:
In [ ]:
sample = sms.sample(frac=0.25, random_state=0) # Randomly subsample a quarter of the available data
X = sample['message']
y = sample['label']
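To see what sample is doing, here is a tiny sketch on a made-up frame; frac controls the fraction of rows kept, and random_state makes the draw repeatable:

```python
import pandas as pd

df = pd.DataFrame({'n': range(8)})

# Keep a quarter of the rows; the fixed seed makes the subsample reproducible
subset = df.sample(frac=0.25, random_state=0)
print(len(subset))  # -> 2

# The same seed always yields the same rows
assert df.sample(frac=0.25, random_state=0).equals(subset)
```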
Let's build a nearest neighbours classification model of the SMS message data. scikit-learn supports nearest neighbours functionality via the neighbors subpackage. This subpackage supports both nearest neighbours regression and classification. We can use the KNeighborsClassifier class to build our model.
KNeighborsClassifier accepts a number of different hyperparameters and the model we build may be more or less accurate depending on their values. We can get a list of these modelling parameters using the get_params method of the estimator (this works on any scikit-learn estimator), like this:
In [ ]:
KNeighborsClassifier().get_params()
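As a quick illustration of the fit/predict interface (on made-up numeric points, not the SMS data), a $k = 3$ classifier labels each query point with the majority class among its three nearest neighbours:

```python
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated one-dimensional clusters (hypothetical data)
X_toy = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y_toy = ['a', 'a', 'a', 'b', 'b', 'b']

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_toy, y_toy)

# Each query is labelled by majority vote among its 3 nearest neighbours
print(knn.predict([[0.05], [1.05]]))  # -> ['a' 'b']
```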
You can find a more detailed description of each parameter in the scikit-learn documentation.
Let's use a grid search to select the optimal nearest neighbours classification model from a set of candidates. First, we need to build a pipeline, just as we did last week. Next, we define the parameter grid. Finally, we use a grid search to select the best model via an inner cross validation and an outer cross validation to measure the accuracy of the selected model.
Note: When using grid search with pipelines, we have to adjust the names of our hyperparameters, prepending the name of the class they apply to (in lowercase). This is so that scikit-learn can distinguish which hyperparameters apply to which classes. Below, we prepend the string 'kneighborsclassifier__' to each hyperparameter name because they all apply to the KNeighborsClassifier class.
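If you're ever unsure what a step is called, the pipeline itself will tell you: make_pipeline names each step after its class, lowercased. A quick check:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(TfidfVectorizer(), KNeighborsClassifier())

# make_pipeline derives step names from the class names, lowercased
print(list(pipeline.named_steps))  # -> ['tfidfvectorizer', 'kneighborsclassifier']
```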
In [ ]:
pipeline = make_pipeline(
TfidfVectorizer(stop_words='english'),
KNeighborsClassifier()
)
# Build models for different values of n_neighbors (k), distance metric and weight scheme
parameters = {
'kneighborsclassifier__n_neighbors': [2, 5, 10],
'kneighborsclassifier__metric': ['manhattan', 'euclidean'],
'kneighborsclassifier__weights': ['uniform', 'distance']
}
# Use inner CV to select the best model
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0) # K = 5
clf = GridSearchCV(pipeline, parameters, cv=inner_cv, n_jobs=-1) # n_jobs=-1 uses all available CPUs = faster
clf.fit(X, y)
# Use outer CV to evaluate the error of the best model
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0) # K = 10, doesn't have to be the same
y_pred = cross_val_predict(clf, X, y, cv=outer_cv)
print(classification_report(y, y_pred)) # Print the classification report
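As mentioned at the start, the dummy subpackage gives us a baseline against which to gauge these figures. A minimal sketch with made-up labels (a most-frequent baseline simply predicts the majority class for everything):

```python
from sklearn.dummy import DummyClassifier

# Hypothetical imbalanced labels, similar in spirit to the SMS data
X_demo = [[0]] * 10                  # features are ignored by this strategy
y_demo = ['ham'] * 8 + ['spam'] * 2

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)

# Always predicting 'ham' scores 80% accuracy here without learning anything
print(baseline.score(X_demo, y_demo))  # -> 0.8
```

A real model is only useful to the extent that it beats this kind of baseline.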
The model is much more accurate than the rule-based model from Lab 05, but not as accurate as the random forest model from Lab 07a. Specifically, we can say that:
While no ham was misclassified as spam, we only managed to filter 44% of spam messages (fewer than one in every two).
As before, we can check the parameters of the selected model using the best_params_ attribute of the fitted grid search:
In [ ]:
clf.best_params_
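Because the vectoriser is part of the pipeline, the fitted grid search accepts raw text directly, e.g. clf.predict(['some message']). A self-contained sketch of the same idea on a tiny made-up corpus (standing in for the fitted clf from the lab):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus; the real lab uses the SMS data instead
messages = ['win cash prize now', 'free prize winner', 'see you at lunch', 'call me later']
labels = ['spam', 'spam', 'ham', 'ham']

model = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))
model.fit(messages, labels)

# The pipeline vectorises the raw string before classifying it
print(model.predict(['free cash prize']))
```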