SMS Spam Dataset Exploration

Introduction

This Jupyter Notebook explores the SMS Spam Collection dataset from the UCI Machine Learning Repository and compares the performance of machine learning techniques for classifying text messages.

Data Wrangling

To begin with, let's load the dataset into a pandas DataFrame.


In [1]:
import csv
import pandas as pd

sms_spam_df = pd.read_csv('sms-spam.tsv', quoting=csv.QUOTE_NONE, sep='\t', names=['label', 'message'])
sms_spam_df.head()


Out[1]:
label message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...

Missing values can skew our analysis and should be dealt with up front. Let's see if the dataset has any missing values.


In [2]:
sms_spam_df.isnull().values.any()


Out[2]:
False

Now that we are sure there are no missing values, let's have some fun by checking stats about the spam and ham (non-spam) messages in the dataset.


In [3]:
sms_spam = sms_spam_df.groupby('label')['message']
sms_spam.describe()


Out[3]:
count unique top freq
label
ham 4827 4518 Sorry, I'll call later 30
spam 747 653 Please call our customer service representativ... 4

Data Preprocessing

For messages to be understood by machine learning algorithms, they have to be converted into vectors. To do that, we first have to split our messages into tokens (lists of words). This approach is called the Bag of Words model because in the end we are left with a collection (bag) of word counts, with word order discarded. The following methods can be used to split messages into tokens:

  1. Tokenization: splitting messages into individual words.
  2. Lemmatization: splitting messages into individual words and converting them into their base form (lemma).

Tokenization

Tokenization simply splits the message into individual tokens.


In [4]:
from textblob import TextBlob

def tokenize(message):
    # decode the raw bytes to unicode so TextBlob handles the text correctly
    message = unicode(message, 'utf8')
    # TextBlob's words property splits the text into individual tokens
    return TextBlob(message).words

Let's try applying this to some of our messages. Here are the original messages we are going to tokenize.


In [5]:
sms_spam_df['message'].head()


Out[5]:
0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: message, dtype: object

Now, here are those messages tokenized.


In [6]:
sms_spam_df['message'].head().apply(tokenize)


Out[6]:
0    [Go, until, jurong, point, crazy, Available, o...
1                       [Ok, lar, Joking, wif, u, oni]
2    [Free, entry, in, 2, a, wkly, comp, to, win, F...
3    [U, dun, say, so, early, hor, U, c, already, t...
4    [Nah, I, do, n't, think, he, goes, to, usf, he...
Name: message, dtype: object

As you can see, tokenization simply splits each message into individual tokens (note that TextBlob splits don't into do and n't).

Lemmatization

The textblob library provides tools that can convert each word in a message to its base form (lemma).


In [7]:
from textblob import TextBlob

def lemmatize(message):
    # decode to unicode and lowercase so that, e.g., 'Go' and 'go' share one lemma
    message = unicode(message, 'utf8').lower()
    # reduce each token to its base form (lemma)
    return [word.lemma for word in TextBlob(message).words]

Alright, here are the first few of our original messages.


In [8]:
sms_spam_df['message'].head()


Out[8]:
0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: message, dtype: object

And here are those messages lemmatized.


In [9]:
sms_spam_df['message'].head().apply(lemmatize)


Out[9]:
0    [go, until, jurong, point, crazy, available, o...
1                       [ok, lar, joking, wif, u, oni]
2    [free, entry, in, 2, a, wkly, comp, to, win, f...
3    [u, dun, say, so, early, hor, u, c, already, t...
4    [nah, i, do, n't, think, he, go, to, usf, he, ...
Name: message, dtype: object

As you can see, lemmatization converts words into their base form; for example, goes becomes go, as you may notice in the last message.

Vectorization

As already mentioned, machine learning algorithms understand vectors, not text. Converting the lists of words (obtained after tokenization or lemmatization) into vectors involves the following steps:

  1. Term Frequency (TF): determine the frequency of each word in the message.
  2. Inverse Document Frequency (IDF): weigh each word such that words appearing in many messages get lower weights (see the formula after this list).
  3. Normalization: normalize message vectors to unit length.
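
For reference, with its defaults (smooth_idf=True, norm='l2') sklearn's TfidfTransformer computes the inverse document frequency as

$$\mathrm{idf}(t) = \ln\left(\frac{1 + n}{1 + \mathrm{df}(t)}\right) + 1$$

where n is the total number of messages and df(t) is the number of messages containing term t. Each message vector is then scaled by these weights and normalized to unit L2 length.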

Count Vectorization

Count Vectorization obtains the frequency of each unique word in each tokenized message.


In [10]:
from sklearn.feature_extraction.text import CountVectorizer

"""Bag of Words Transformer using lemmatization"""

bow_transformer = CountVectorizer(analyzer=lemmatize)
bow_transformer.fit(sms_spam_df['message'])


Out[10]:
CountVectorizer(analyzer=<function lemmatize at 0x7f9079c31cf8>, binary=False,
        decode_error=u'strict', dtype=<type 'numpy.int64'>,
        encoding=u'utf-8', input=u'content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None,
        stop_words=None, strip_accents=None,
        token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None,
        vocabulary=None)

Now, let's try out the Bag of Words transformer on a dummy message.


In [11]:
dummy_vectorized = bow_transformer.transform(['Hey you... you of the you... This message is to you.'])
print dummy_vectorized


  (0, 3925)	1
  (0, 4297)	1
  (0, 5083)	1
  (0, 5589)	1
  (0, 7673)	1
  (0, 7717)	1
  (0, 7801)	1
  (0, 8737)	4

So, the message Hey you... you of the you... This message is to you. contains 8 unique words, of which you is repeated 4 times. Can you guess which index represents you? Hint: it is the entry with count 4.


In [12]:
bow_transformer.get_feature_names()[8737]


Out[12]:
u'you'

Now, let's transform the entire set of messages in our dataset.


In [13]:
msgs_vectorized = bow_transformer.transform(sms_spam_df['message'])
msgs_vectorized.shape


Out[13]:
(5574, 8859)
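
Most entries in this matrix are zeros. As a quick aside, we can measure just how sparse it is; this is a small sketch using the nnz attribute that scipy sparse matrices expose (exact figure not shown here):

# fraction of cells in the message-term matrix that are non-zero
sparsity = 100.0 * msgs_vectorized.nnz / (msgs_vectorized.shape[0] * msgs_vectorized.shape[1])
print 'Sparsity: {:.2f}%'.format(sparsity)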

TF-IDF Transformation

Now that we have obtained a vectorized representation of the messages in our dataset, we can weigh each word so that words appearing across many messages get a lower weight (Inverse Document Frequency). This step also normalizes each message vector to unit length.


In [14]:
from sklearn.feature_extraction.text import TfidfTransformer

"""TFIDF Transformer using vectorized messages"""

tfidf_transformer = TfidfTransformer().fit(msgs_vectorized)

Let's use this transformer to weigh the previous message: Hey you... you of the you... This message is to you.


In [15]:
dummy_transformed = tfidf_transformer.transform(dummy_vectorized)
print dummy_transformed


  (0, 8737)	0.676815614927
  (0, 7801)	0.164667697974
  (0, 7717)	0.290066163457
  (0, 7673)	0.201312794894
  (0, 5589)	0.248872120698
  (0, 5083)	0.377358904206
  (0, 4297)	0.224280949576
  (0, 3925)	0.368104513252

Now, let's check the IDF of you, the most frequently repeated word in our dummy message, against hey, which appears only once; remember that IDF reflects how common a word is across the entire dataset.


In [16]:
print '{}: {}'.format('you', tfidf_transformer.idf_[bow_transformer.vocabulary_['you']])
print '{}: {}'.format('hey', tfidf_transformer.idf_[bow_transformer.vocabulary_['hey']])


you: 2.25581695452
hey: 4.90754872503

As you can see, words that appear in fewer messages across the dataset are weighed higher than words that appear in many.
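
As a sanity check, we can reproduce the IDF of you by hand; this is a minimal sketch assuming sklearn's default smoothed formula (smooth_idf=True) shown earlier:

import numpy as np

# document frequency: number of messages containing 'you' at least once
df_you = (msgs_vectorized[:, bow_transformer.vocabulary_['you']] > 0).sum()
n_msgs = msgs_vectorized.shape[0]

# smoothed IDF: ln((1 + n) / (1 + df)) + 1 -- should match the value printed above
print np.log((1.0 + n_msgs) / (1.0 + df_you)) + 1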

Now, let's weigh and normalize all messages in our dataset.


In [17]:
msgs_tfidf = tfidf_transformer.transform(msgs_vectorized)
msgs_tfidf.shape


Out[17]:
(5574, 8859)

Naive Bayes Classifier

Having converted the text messages into vectors, we can feed them to machine learning algorithms. Naive Bayes is a classification algorithm commonly used in text processing.
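
Concretely, a multinomial Naive Bayes classifier scores each class c (spam or ham) for a message d by combining the class prior with per-word likelihoods, under the "naive" assumption that words occur independently given the class:

$$P(c \mid d) \propto P(c) \prod_{i} P(w_i \mid c)^{f_i}$$

Here f_i is the weight of word w_i in the message vector, and the alpha parameter below applies additive smoothing to the word likelihoods so that unseen words do not zero out the product.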


In [18]:
from sklearn.naive_bayes import MultinomialNB

"""Naive Bayes classifier trained with vectorized messages and its corresponding labels"""

nb_clf = MultinomialNB(alpha=0.25)
nb_clf.fit(msgs_tfidf, sms_spam_df['label'])


Out[18]:
MultinomialNB(alpha=0.25, class_prior=None, fit_prior=True)

Predictions

Now that we have a trained classifier, it can be used for prediction.


In [19]:
msgs_pred = nb_clf.predict(msgs_tfidf)

Accuracy Score

Let's check the accuracy of our classifier.


In [20]:
from sklearn.metrics import accuracy_score

print 'Accuracy Score: {}'.format(accuracy_score(sms_spam_df['label'], msgs_pred))


Accuracy Score: 0.993362038034

Conclusion?

Whoa! 99% accuracy! Do you really believe that is right? Think again...

Take Two

Now, let's improve our procedure. This time, we will do machine learning the way it's meant to be done.

Splitting Dataset

In the demonstration above, we trained a Naive Bayes classifier on the entire dataset and then tested it on that very same dataset. Evaluating a model on the data it was trained on rewards overfitting and yields misleadingly high scores.

A better approach is to split our dataset into two partitions: one for training the classifier and another for testing it. The sklearn library provides just what we need.


In [21]:
from sklearn.model_selection import train_test_split

msgs_train, msgs_test, lbls_train, lbls_test = \
    train_test_split(sms_spam_df['message'], sms_spam_df['label'], test_size=0.2)

Pipeline

As mentioned in the demonstration, we cannot feed raw text messages directly to the machine learning algorithm; they have to be vectorized first. If you remember, vectorization involved two processes:

  1. Counting words in each message and converting the dataset into one large matrix (Count Vectorization).
  2. Weighing words based on their document frequency (TF-IDF Transformation) and normalizing.

Once the preprocessing is complete, we can construct the classifier.

These operations can be chained together using the Pipeline class from the sklearn library.


In [22]:
from sklearn.pipeline import Pipeline

"""Pipeline CountVectorizer, TfidfTransformer and Naive Bayes Classifier"""

pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=lemmatize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB(alpha=0.25))
])
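
The pipeline now behaves like a single estimator: fit runs each stage in order on raw text, and predict pushes new messages through the same transformations. A usage sketch (not executed here):

# train every stage on raw training messages in one call
pipeline.fit(msgs_train, lbls_train)
print pipeline.predict(['Free entry! Text WIN to claim your prize.'])[0]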

Cross Validation

Cross validation (K-Folds cross validation) involves splitting the training set again into k partitions such that one partition is used for testing and the remaining k-1 partitions are used for training. The process is repeated k times, and the average of the scores obtained is taken as the score of the machine learning model.

The cross_val_score function from the sklearn library can be used to determine the cross validation score of a model.


In [23]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    pipeline, 
    msgs_train, 
    lbls_train,
    cv=10,
    scoring='accuracy',
    n_jobs=-1
)

print scores


[ 0.98210291  0.9753915   0.98657718  0.98657718  0.97533632  0.98651685
  0.97977528  0.97078652  0.97752809  0.98202247]
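
As noted above, the average of these fold scores is what we treat as the model's score; computing it takes one line (the exact values vary from run to run since the train/test split was random):

print 'Mean accuracy: {:.4f} (+/- {:.4f})'.format(scores.mean(), scores.std())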

Tuning the Model

Using a pipeline, we were able to construct a model that parses text messages and classifies them. This model is not yet tuned for optimal performance: each of its components (namely, CountVectorizer, TfidfTransformer and MultinomialNB) has its own set of hyperparameters that can be adjusted.

One method to tune a model is Grid Search, which lets us define a set of candidate hyperparameters for each component of the model and then exhaustively searches for the combination that yields the best cross validation score.


In [24]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

params = {
    'bow__analyzer': (lemmatize, tokenize),
    'tfidf__use_idf': (True, False),
}


model = GridSearchCV(
    pipeline, 
    params,
    refit=True,
    n_jobs=-1,
    scoring='accuracy',
    cv=StratifiedKFold(n_splits=5)
)

Our model is almost ready. All we have to do is train it. We will also time the training operation.


In [25]:
%time model = model.fit(msgs_train, lbls_train)


CPU times: user 5.09 s, sys: 124 ms, total: 5.22 s
Wall time: 1min 9s
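
Since refit=True, model is now the pipeline retrained on the full training set with the winning parameters. GridSearchCV also records which combination won; a quick peek (output omitted, as it depends on the split):

print model.best_params_
print 'Best cross validation accuracy: {}'.format(model.best_score_)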

Now that our model is trained, let's try it out.


In [26]:
print model.predict(['Hi! How are you?'])[0]
print model.predict(['Congratulations! You won free credits!'])[0]


ham
spam

Finally, let's evaluate the model on the held-out test set.


In [27]:
msgs_pred = model.predict(msgs_test)
print 'Accuracy Score: {}'.format(accuracy_score(lbls_test, msgs_pred))


Accuracy Score: 0.977578475336
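
Accuracy alone can be misleading on an imbalanced dataset like this one (spam makes up roughly 13% of messages), so it is worth also printing per-class precision, recall and F1 via sklearn's classification_report (output not reproduced here):

from sklearn.metrics import classification_report

print classification_report(lbls_test, msgs_pred)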

This accuracy is a bit lower than the result we obtained when testing on the unsplit data, but it is a far more reliable estimate of real-world performance.