In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

Case Study - Text classification for SMS spam detection

We first load the text data from the datasets directory, which should be located inside your notebooks directory. We created it by running the fetch_data.py script from the top level of the GitHub repository.

Furthermore, we perform some simple preprocessing and split the data into two parts:

  1. text: a list of strings, where each string holds the contents of one SMS message
  2. y: our SPAM vs HAM labels stored in binary; a 1 represents a spam message, and a 0 represents a ham (non-spam) message.
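
To make the two parts concrete, here is a minimal sketch of the same parsing step applied to a few made-up lines. The sample messages and the names sample_lines, sample_text and sample_y are hypothetical; only the tab-separated "label, then message" layout is taken from the loading code in this notebook.

```python
# Hypothetical sample lines in the same "label<TAB>message" layout as the dataset file
sample_lines = [
    "ham\tOk, see you at lunch\n",
    "spam\tWINNER! Claim your free prize now\n",
    "ham\tCan you call me later?\n",
]

# Split each line into its label and its message
parsed = [line.strip().split("\t") for line in sample_lines]

sample_text = [fields[1] for fields in parsed]              # list of message strings
sample_y = [int(fields[0] == "spam") for fields in parsed]  # 1 = spam, 0 = ham

print(sample_text)
print(sample_y)
```

Each entry of sample_text is a plain string, and sample_y holds one integer label per message.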

In [ ]:
import os

with open(os.path.join("datasets", "smsspam", "SMSSpamCollection")) as f:
    lines = [line.strip().split("\t") for line in f.readlines()]

text = [x[1] for x in lines]
y = [int(x[0] == "spam") for x in lines]

In [ ]:
text[:10]

In [ ]:
y[:10]

In [ ]:
print('Number of ham and spam messages:', np.bincount(y))

In [ ]:
type(text)

In [ ]:
type(y)

Next, we split our dataset into two parts, a training set and a test set. Passing stratify=y keeps the proportion of spam and ham messages the same in both parts:


In [ ]:
from sklearn.model_selection import train_test_split

text_train, text_test, y_train, y_test = train_test_split(text, y, 
                                                          random_state=42,
                                                          test_size=0.25,
                                                          stratify=y)
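
The effect of a stratified split is easiest to see on a small, deliberately imbalanced label array. The sketch below uses made-up labels (toy_y) and placeholder features (toy_X), not the SMS data, and the same split parameters as above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 ham (0) and 20 spam (1)
toy_y = np.array([0] * 80 + [1] * 20)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)  # placeholder features

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, random_state=42, test_size=0.25, stratify=toy_y)

# Both parts keep the 4:1 ham-to-spam ratio of the full label array
print("train counts:", np.bincount(toy_y_train))
print("test counts: ", np.bincount(toy_y_test))
```

Without stratify, the class ratio in a small test set can drift away from the ratio in the full dataset purely by chance.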

Now, we use the CountVectorizer to parse the text data into a bag-of-words model.
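
As a small illustration of what a bag-of-words representation looks like, the sketch below vectorizes three made-up messages. The corpus and the names toy_corpus, toy_vectorizer and toy_counts are hypothetical; each row of the resulting matrix is one message, each column one vocabulary word, and each entry a word count.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-message corpus
toy_corpus = [
    "free prize call now",
    "call me now",
    "free free prize",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_corpus)  # sparse count matrix

# vocabulary_ maps each token to its column index (tokens are sorted alphabetically)
print(sorted(toy_vectorizer.vocabulary_.items(), key=lambda kv: kv[1]))
print(toy_counts.toarray())
```

Word order is discarded: only how often each word occurs in a message is kept.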


In [ ]:
from sklearn.feature_extraction.text import CountVectorizer

print('CountVectorizer defaults')
CountVectorizer()

In [ ]:
vectorizer = CountVectorizer()
vectorizer.fit(text_train)

X_train = vectorizer.transform(text_train)
X_test = vectorizer.transform(text_test)

In [ ]:
print(len(vectorizer.vocabulary_))

In [ ]:
X_train.shape

In [ ]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out()[:20])

In [ ]:
print(vectorizer.get_feature_names_out()[2000:2020])

In [ ]:
print(X_train.shape)
print(X_test.shape)

Training a Classifier on Text Features

We can now train a classifier, for instance a logistic regression classifier, which is a fast baseline for text classification tasks:


In [ ]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf

In [ ]:
clf.fit(X_train, y_train)
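
To see what a fitted classifier returns for unseen text, here is a minimal end-to-end sketch on a made-up four-message corpus. All names (toy_messages, toy_labels, toy_vec, toy_clf) are hypothetical, and a corpus this small only illustrates the mechanics, not realistic performance.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training messages: 1 = spam, 0 = ham
toy_messages = ["free prize now", "win free cash", "see you at lunch", "call me later"]
toy_labels = [1, 1, 0, 0]

toy_vec = CountVectorizer()
toy_clf = LogisticRegression()
toy_clf.fit(toy_vec.fit_transform(toy_messages), toy_labels)

# New messages must go through the same fitted vectorizer before prediction
new_counts = toy_vec.transform(["free cash prize", "lunch later"])
toy_proba = toy_clf.predict_proba(new_counts)  # columns follow toy_clf.classes_

print(toy_clf.classes_)
print(toy_proba)
```

The second column of toy_proba is the estimated spam probability; the first new message, built from words seen only in spam, gets a higher value than the second.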

We can now evaluate the classifier on the test set. Let's first use the built-in score method, which returns the accuracy, that is, the fraction of test messages that are classified correctly:


In [ ]:
clf.score(X_test, y_test)
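
For a classifier, score is the mean accuracy, which can be reproduced by hand. The sketch below uses made-up label arrays (toy_true, toy_pred) rather than the predictions of clf.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions for six messages
toy_true = np.array([0, 0, 1, 1, 0, 1])
toy_pred = np.array([0, 1, 1, 1, 0, 0])

# Accuracy is the fraction of positions where prediction and truth agree
accuracy = np.mean(toy_true == toy_pred)

print(accuracy)
print(accuracy_score(toy_true, toy_pred))  # same value via scikit-learn
```

Because ham messages outnumber spam messages, accuracy alone can look good even for a classifier that rarely flags spam, so it is worth reading together with the class counts printed earlier.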

We can also compute the score on the training set to see how well we do there:


In [ ]:
clf.score(X_train, y_train)

Visualizing important features


In [ ]:
def visualize_coefficients(classifier, feature_names, n_top_features=25):
    """Plot the most negative and most positive coefficients of a linear classifier."""
    # indices of the n_top_features smallest and largest coefficients
    coef = classifier.coef_.ravel()
    order = np.argsort(coef)
    negative_coefficients = order[:n_top_features]
    positive_coefficients = order[-n_top_features:]
    interesting_coefficients = np.hstack([negative_coefficients, positive_coefficients])
    # plot them: red bars push towards ham (0), blue bars push towards spam (1)
    plt.figure(figsize=(15, 5))
    colors = ["red" if c < 0 else "blue" for c in coef[interesting_coefficients]]
    plt.bar(np.arange(2 * n_top_features), coef[interesting_coefficients], color=colors)
    feature_names = np.array(feature_names)
    # one tick per bar, at the bar's position (bars are centered by default in matplotlib >= 2.0)
    plt.xticks(np.arange(2 * n_top_features), feature_names[interesting_coefficients], rotation=60, ha="right")
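
The selection step inside visualize_coefficients relies on np.argsort, and it can be checked on a tiny coefficient vector. The values and word names below are made up for illustration and are not coefficients of the fitted model.

```python
import numpy as np

# Hypothetical coefficients and the words they belong to
toy_coef = np.array([0.5, -2.0, 1.5, -0.1, 3.0])
toy_names = np.array(["call", "lunch", "free", "ok", "prize"])

order = np.argsort(toy_coef)   # indices from smallest to largest coefficient
most_negative = order[:2]      # strongest evidence for class 0
most_positive = order[-2:]     # strongest evidence for class 1

print(toy_names[most_negative])
print(toy_names[most_positive])
```

With labels encoded as 1 for spam and 0 for ham, large positive coefficients mark words that push a message towards spam, and large negative ones mark words that push it towards ham.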

In [ ]:
visualize_coefficients(clf, vectorizer.get_feature_names_out())

Next, we refit the vectorizer with min_df=2, which drops every word that appears in fewer than two training messages, and retrain the classifier on the smaller vocabulary:

In [ ]:
vectorizer = CountVectorizer(min_df=2)
vectorizer.fit(text_train)

X_train = vectorizer.transform(text_train)
X_test = vectorizer.transform(text_test)

clf = LogisticRegression()
clf.fit(X_train, y_train)

print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))

In [ ]:
len(vectorizer.get_feature_names_out())

In [ ]:
print(vectorizer.get_feature_names_out()[:20])

In [ ]:
visualize_coefficients(clf, vectorizer.get_feature_names_out())

EXERCISE:
  • Use TfidfVectorizer instead of CountVectorizer. Are the results better? How are the coefficients different?
  • Change the parameters min_df and ngram_range of the TfidfVectorizer and CountVectorizer. How does that change the important features?
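
As a hint for what the two parameters do (not a solution to the exercise), the sketch below compares vocabularies on a made-up three-message corpus. The corpus and the variable names are hypothetical; min_df and ngram_range are the CountVectorizer parameters named above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-message corpus
toy_docs = ["free prize now", "free call now", "call me later"]

default_vec = CountVectorizer().fit(toy_docs)
min_df_vec = CountVectorizer(min_df=2).fit(toy_docs)            # keep words found in >= 2 messages
bigram_vec = CountVectorizer(ngram_range=(1, 2)).fit(toy_docs)  # single words plus word pairs

print(sorted(default_vec.vocabulary_))
print(sorted(min_df_vec.vocabulary_))
print(sorted(bigram_vec.vocabulary_))
```

Raising min_df shrinks the vocabulary, while widening ngram_range grows it by adding multi-word features such as "free prize".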

In [ ]:
# %load solutions/12A_tfidf.py

In [ ]:
# %load solutions/12B_vectorizer_params.py