This Jupyter Notebook explores the SMS Spam Collection dataset from the UCI Machine Learning Repository and compares the performance of various machine learning techniques on text classification.
To begin with, let's load the dataset into a pandas DataFrame.
In [1]:
import csv
import pandas as pd
sms_spam_df = pd.read_csv('sms-spam.tsv', quoting=csv.QUOTE_NONE, sep='\t', names=['label', 'message'])
sms_spam_df.head()
Out[1]:
Missing values skew the dataset and should be dealt with. Let's check whether the dataset has any missing values.
In [2]:
sms_spam_df.isnull().values.any()
Out[2]:
Now that we are sure there are no missing values, let's have some fun by checking stats about the spam and ham (non-spam) messages in the dataset.
In [3]:
sms_spam = sms_spam_df.groupby('label')['message']
sms_spam.describe()
Out[3]:
For messages to be understood by machine learning algorithms, they have to be converted into vectors. To do that, we first split each message into tokens (a list of words). This is called the Bag of Words model, since in the end we are left with a collection (bag) of word vectors. The following methods can be used to tokenize messages:
In [4]:
from textblob import TextBlob

def tokenize(message):
    # pandas already gives us str, so no explicit decoding is needed
    return TextBlob(message).words
Let's try applying this to some of our messages. Here are the original messages we are going to tokenize.
In [5]:
sms_spam_df['message'].head()
Out[5]:
Now, here are those messages tokenized.
In [6]:
sms_spam_df['message'].head().apply(tokenize)
Out[6]:
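If TextBlob is not available, a rough dependency-free tokenizer can be sketched with the standard library's re module. This is only an approximation for illustration, not what TextBlob does internally:

```python
import re

def simple_tokenize(message):
    # Lowercase, then pull out runs of letters/apostrophes; a crude
    # stand-in for TextBlob's tokenizer
    return re.findall(r"[a-z']+", message.lower())

print(simple_tokenize('Go until jurong point, crazy..'))
```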
In [7]:
from textblob import TextBlob

def lemmatize(message):
    # lowercase first, then take each word's lemma (its base form)
    return [word.lemma for word in TextBlob(message.lower()).words]
Alright, here are first few of our original messages.
In [8]:
sms_spam_df['message'].head()
Out[8]:
And, here are our messages lemmatized.
In [9]:
sms_spam_df['message'].head().apply(lemmatize)
Out[9]:
As you can see, lemmatization converts words into their base form; for example, goes becomes go, as you may notice in the last message.
As already mentioned, machine learning algorithms understand vectors, not text. Converting a list of words (obtained after tokenization or lemmatization) into vectors involves the following steps:
1. Count how many times each word occurs in each message (term frequency).
2. Weigh the counts so that words common across the whole dataset get lower weight (inverse document frequency).
3. Normalize the vectors to unit length, to abstract from the original message length.
In [10]:
from sklearn.feature_extraction.text import CountVectorizer
"""Bag of Words Transformer using lemmatization"""
bow_transformer = CountVectorizer(analyzer=lemmatize)
bow_transformer.fit(sms_spam_df['message'])
Out[10]:
Now, let's try out the Bag of Words transformer on a dummy message.
In [11]:
dummy_vectorized = bow_transformer.transform(['Hey you... you of the you... This message is to you.'])
print(dummy_vectorized)
So, the message Hey you... you of the you... This message is to you. contains 8 unique words, of which you is repeated 4 times. Each line of the output above gives a word's index in the vocabulary along with its count in the message; hopefully you can guess which entry corresponds to you. Hint: you is repeated 4 times.
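The same claim can be checked with a self-contained toy, using CountVectorizer's default analyzer (which, unlike our lemmatize function, lowercases and tokenizes with a simple regex):

```python
from sklearn.feature_extraction.text import CountVectorizer

msg = 'Hey you... you of the you... This message is to you.'
vec = CountVectorizer().fit([msg])       # vocabulary built from the toy message
row = vec.transform([msg]).toarray()[0]  # dense count vector for the message
print(len(vec.vocabulary_))              # 8 unique words
print(row[vec.vocabulary_['you']])       # 'you' occurs 4 times
```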
In [12]:
bow_transformer.get_feature_names()[8737]
Out[12]:
Now, let's transform the entire set of messages in our dataset.
In [13]:
msgs_vectorized = bow_transformer.transform(sms_spam_df['message'])
msgs_vectorized.shape
Out[13]:
In [14]:
from sklearn.feature_extraction.text import TfidfTransformer
"""TFIDF Transformer using vectorized messages"""
tfidf_transformer = TfidfTransformer().fit(msgs_vectorized)
Let's use this transformer to weigh the previous message: Hey you... you of the you... This message is to you.
In [15]:
dummy_transformed = tfidf_transformer.transform(dummy_vectorized)
print(dummy_transformed)
Now, let's check the IDF of you, the most frequent word in the message, against hey, which appears only once.
In [16]:
print('{}: {}'.format('you', tfidf_transformer.idf_[bow_transformer.vocabulary_['you']]))
print('{}: {}'.format('hey', tfidf_transformer.idf_[bow_transformer.vocabulary_['hey']]))
As you can see, words that appear in fewer messages are weighted higher than words that appear throughout the dataset.
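With sklearn's defaults (smooth_idf=True), the weight is idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. A toy check:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy counts: 4 documents, 2 terms. Term 0 appears in every document,
# term 1 in only one document.
counts = np.array([[1, 0],
                   [1, 0],
                   [1, 1],
                   [2, 0]])
tfidf = TfidfTransformer().fit(counts)
print(tfidf.idf_)  # term 0: ln(5/5) + 1 = 1.0; term 1: ln(5/2) + 1 ≈ 1.916
```

The rarer term gets the higher weight, matching what we saw for hey versus you.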
Now, to weigh and normalize all messages in our dataset.
In [17]:
msgs_tfidf = tfidf_transformer.transform(msgs_vectorized)
msgs_tfidf.shape
Out[17]:
In [18]:
from sklearn.naive_bayes import MultinomialNB
"""Naive Bayes classifier trained with vectorized messages and its corresponding labels"""
nb_clf = MultinomialNB(alpha=0.25)
nb_clf.fit(msgs_tfidf, sms_spam_df['label'])
Out[18]:
In [19]:
msgs_pred = nb_clf.predict(msgs_tfidf)
In [20]:
from sklearn.metrics import accuracy_score
print('Accuracy Score: {}'.format(accuracy_score(sms_spam_df['label'], msgs_pred)))
Now, let's improve our procedure. This time, we'll do machine learning the way it's meant to be done.
For the demonstration above, we trained a Naive Bayes classifier on the entire dataset and then tested it on that same dataset. Evaluating on the training data overstates performance and hides overfitting.
A better approach is to split the dataset into two partitions: one for training the classifier and another for testing it. The sklearn library provides just what we need.
In [21]:
from sklearn.model_selection import train_test_split
msgs_train, msgs_test, lbls_train, lbls_test = train_test_split(
    sms_spam_df['message'], sms_spam_df['label'], test_size=0.2)
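On toy data, the 80/20 split looks like this (random_state pins the shuffle so the example is reproducible):

```python
from sklearn.model_selection import train_test_split

data = list(range(10))
labels = ['ham'] * 5 + ['spam'] * 5
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.2, random_state=0)
print(len(X_train), len(X_test))  # 8 2
```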
As mentioned in the demonstration, we cannot feed raw text messages to the machine learning algorithm; they have to be vectorized. If you remember, vectorization involved two processes:
1. Counting word occurrences (CountVectorizer).
2. Weighting and normalizing the counts (TfidfTransformer).
Once the preprocessing is complete, we can construct the classifier. These operations can be chained using the Pipeline class from the sklearn library.
In [22]:
from sklearn.pipeline import Pipeline
"""Pipeline CountVectorizer, TfidfTransformer and Naive Bayes Classifier"""
pipeline = Pipeline([
('bow', CountVectorizer(analyzer=lemmatize)),
('tfidf', TfidfTransformer()),
('clf', MultinomialNB(alpha=0.25))
])
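To see the pipeline end to end without the TextBlob dependency, here is a toy run using CountVectorizer's default analyzer (the analyzer choice is the only deviation from the pipeline above):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

toy_pipeline = Pipeline([
    ('bow', CountVectorizer()),       # default analyzer instead of lemmatize
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB(alpha=0.25))
])
toy_pipeline.fit(
    ['free prize now', 'hi how are you', 'win free cash', 'see you at lunch'],
    ['spam', 'ham', 'spam', 'ham'])
print(toy_pipeline.predict(['free cash prize'])[0])  # spam
```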
Cross validation (K-Fold cross validation) involves splitting the training set into k partitions, using 1 partition for testing and the remaining k-1 partitions for training. The process is repeated k times, once per partition, and the average score obtained is taken as the score of the machine learning model.
The cross_val_score
function of sklearn library can be used to determine the cross validation score of a model.
In [23]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(
    pipeline,
    msgs_train,
    lbls_train,
    cv=10,
    scoring='accuracy',
    n_jobs=-1
)
print(scores)
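Under the hood, cross_val_score does roughly the following; here is a manual sketch on toy count data, using KFold directly:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(20, 4))   # toy non-negative count features
y = np.array(['ham', 'spam'] * 10)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    clf = MultinomialNB().fit(X[train_idx], y[train_idx])  # train on k-1 folds
    scores.append(clf.score(X[test_idx], y[test_idx]))     # score the held-out fold
print(sum(scores) / len(scores))  # average accuracy across the 5 folds
```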
Using a pipeline, we were able to construct a model that parses text messages and classifies them. This model is not yet tuned for optimal performance: each of its components (namely CountVectorizer, TfidfTransformer and MultinomialNB) has its own set of hyperparameters that can be adjusted.
One way to tune a model is Grid Search, which lets us define a set of candidate hyperparameters for each component of the model, and then exhaustively searches for the combination that yields the best cross-validation score.
In [24]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
params = {
    'bow__analyzer': (lemmatize, tokenize),
    'tfidf__use_idf': (True, False),
}
model = GridSearchCV(
    pipeline,
    params,
    refit=True,
    n_jobs=-1,
    scoring='accuracy',
    cv=StratifiedKFold(n_splits=5)
)
Our model is almost ready; all we have to do is train it. We will also time the training operation.
In [25]:
%time model = model.fit(msgs_train, lbls_train)
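After fitting, GridSearchCV exposes the winning combination via best_params_ and best_score_, and (because refit=True) predict is delegated to a model retrained on the full training set with those parameters. A self-contained toy example:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(30, 4))   # toy non-negative count features
y = np.array(['ham', 'spam'] * 15)

search = GridSearchCV(MultinomialNB(), {'alpha': [0.25, 1.0]},
                      cv=3, scoring='accuracy')
search.fit(X, y)
print(search.best_params_)   # the winning alpha
print(search.best_score_)    # its mean cross-validation accuracy
```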
Now that our model is trained, let's try it out.
In [26]:
print(model.predict(['Hi! How are you?'])[0])
print(model.predict(['Congratulations! You won free credits!'])[0])
For more fun, here is the classification report of our model.
In [27]:
from sklearn.metrics import classification_report

msgs_pred = model.predict(msgs_test)
print('Accuracy Score: {}'.format(accuracy_score(lbls_test, msgs_pred)))
print(classification_report(lbls_test, msgs_pred))
The scores are a bit lower than the results we obtained earlier when testing on the unsplit data, but they are more reliable.
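For reference, here is what classification_report produces on a toy set of labels: precision, recall and F1 per class, which tells us more than accuracy alone.

```python
from sklearn.metrics import classification_report

y_true = ['ham', 'ham', 'ham', 'spam', 'spam']
y_pred = ['ham', 'ham', 'spam', 'spam', 'spam']
print(classification_report(y_true, y_pred))
```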