A simple Text Classifier

Author: Christin Seifert, licensed under the Creative Commons Attribution 3.0 Unported License https://creativecommons.org/licenses/by/3.0/. It is based on a tutorial by Nils Witt (https://github.com/n-witt/MachineLearningWithText_SS2017).

This is a tutorial for training and evaluating a simple Naive Bayes classifier on a simple text classification problem. In this tutorial you will:

  • inspect the data you will be using to train the classifier
  • convert the texts into feature vectors
  • train a Naive Bayes classifier
  • evaluate how well the classifier does
  • examine which words the classifier considers particularly "spammy"

It is assumed that you have some general knowledge of

  • document-term matrices
  • what a Naive Bayes classifier does

Converting texts to features

We will start with a small example of three SMS messages. The texts of the messages are: "call you tonight", "Call me a cab", and "please call me... PLEASE!". In order to do text classification we need to convert each text into a feature vector. We will follow a very simple approach here:

  1. Find out which different words (or tokens) are used in the texts. These make up the vocabulary.
  2. The length of the vector for each document is then the size of the vocabulary, and each entry in the vector corresponds to one word. That is, the first entry in the vector corresponds to the first word in the vocabulary, the second entry to the second word, and so on.
  3. For each document we simply count how often each word occurs and write that count at the index in the vector that corresponds to this word.

All those things can easily be done with the CountVectorizer from the sklearn library.
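
To make these three steps concrete, here is a minimal hand-rolled sketch that simply lowercases the texts and splits on whitespace (the CountVectorizer used below does the tokenization more carefully):

# step 1: collect all distinct (lowercased) tokens -- the vocabulary
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
vocabulary = sorted({word for text in simple_train for word in text.lower().split()})

# steps 2 and 3: one count vector per document, one entry per vocabulary word
vectors = [[text.lower().split().count(word) for word in vocabulary]
           for text in simple_train]
print(vocabulary)
print(vectors)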


In [1]:
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

In [2]:
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [3]:
# learn the 'vocabulary' of the training data 
vect.fit(simple_train)


Out[3]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [4]:
# examine the fitted vocabulary
vect.get_feature_names()


Out[4]:
['cab', 'call', 'me', 'please', 'tonight', 'you']

Have you noticed that all words are lowercase now, and that punctuation was ignored? Whether this is a good idea depends on the application. For example, for detecting emotions in texts, smileys (i.e. punctuation) might be a helpful feature. But for now, let's keep it simple.
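
If you do want to keep the original casing, CountVectorizer lets you switch the lowercasing off (a minimal sketch; keeping punctuation would additionally require a custom token_pattern or tokenizer):

# treat 'Call' and 'call' as different features
vect_cased = CountVectorizer(lowercase=False)
vect_cased.fit(simple_train)
vect_cased.get_feature_names()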

Now we generate a document-term matrix. In this matrix each row corresponds to one document, each column to one feature. Entry (i,j) tells us how often word j occurs in document i.

Note: The "how often" is only true if we use the count vectorizer. Instead of word count there are many other possible features.

From the scikit-learn documentation:

In this scheme, features and samples are defined as follows:

  • Each individual token occurrence frequency (normalized or not) is treated as a feature.
  • The vector of all the token frequencies for a given document is considered a sample.

A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.

We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.


In [5]:
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm


Out[5]:
<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>

In [6]:
# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()


Out[6]:
array([[0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 0, 0],
       [0, 1, 1, 2, 0, 0]])

We can use a pandas data frame to store the vector and the feature names together.


In [7]:
# examine the vocabulary and document-term matrix together
import pandas as pd
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())


Out[7]:
cab call me please tonight you
0 0 1 0 0 1 1
1 1 1 1 0 0 0
2 0 1 1 2 0 0

Since in general this means an awful lot of zeros (think of how few of all English words are present in a single SMS), the more efficient way to store the information is as a sparse matrix. For humans this is a bit harder to read.

From the scikit-learn documentation:

As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).

For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.

In order to be able to store such a matrix in memory but also to speed up operations, implementations will typically use a sparse representation such as the implementations available in the scipy.sparse package.
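
To get a feeling for the difference, we can compare the memory used by the dense and the sparse representation of our toy matrix (a small sketch; for this tiny example the savings are negligible, but they become substantial for real corpora):

# dense: every entry is stored, including all the zeros
dense_bytes = simple_train_dtm.toarray().nbytes

# sparse CSR: only the non-zero values plus two index arrays are stored
sparse_bytes = (simple_train_dtm.data.nbytes
                + simple_train_dtm.indices.nbytes
                + simple_train_dtm.indptr.nbytes)

print(dense_bytes, sparse_bytes)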


In [8]:
# check the type of the document-term matrix
type(simple_train_dtm)


Out[8]:
scipy.sparse.csr.csr_matrix

In [9]:
# examine the sparse matrix contents
print(simple_train_dtm)


  (0, 1)	1
  (0, 4)	1
  (0, 5)	1
  (1, 0)	1
  (1, 1)	1
  (1, 2)	1
  (2, 1)	1
  (2, 2)	1
  (2, 3)	2

Generate the feature vector for a previously unseen text

In order to make predictions for unseen data, the new observation must have the same features as the training observations, both in number and meaning.


In [10]:
# example text for model testing
simple_test = ["please don't call me, I don't like you"]

In [11]:
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()


Out[11]:
array([[0, 1, 1, 1, 0, 1]])

In [12]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())


Out[12]:
cab call me please tonight you
0 0 1 1 1 0 1

A simple spam filter

Now we are going to implement a simple spam filter for SMS messages. We are given a data set of SMS messages that are already annotated as either spam or ham (= not spam). We first load the data set and have a look at the data.


In [13]:
path = 'material/sms.tsv'
sms = pd.read_table(path, header=None, names=['label', 'message'])

In [14]:
sms.shape


Out[14]:
(5572, 2)

In [15]:
# examine the first 10 rows
sms.head(10)


Out[15]:
label message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
5 spam FreeMsg Hey there darling it's been 3 week's n...
6 ham Even my brother is not like to speak with me. ...
7 ham As per your request 'Melle Melle (Oru Minnamin...
8 spam WINNER!! As a valued network customer you have...
9 spam Had your mobile 11 months or more? U R entitle...

We first examine the class distribution and then convert the label to a numerical value.


In [16]:
# examine the class distribution
sms.label.value_counts()


Out[16]:
ham     4825
spam     747
Name: label, dtype: int64

In [17]:
# convert label to a numerical variable
sms['label_num'] = sms.label.map({'ham':0, 'spam':1})

In [18]:
# check that the conversion worked
sms.head(10)


Out[18]:
label message label_num
0 ham Go until jurong point, crazy.. Available only ... 0
1 ham Ok lar... Joking wif u oni... 0
2 spam Free entry in 2 a wkly comp to win FA Cup fina... 1
3 ham U dun say so early hor... U c already then say... 0
4 ham Nah I don't think he goes to usf, he lives aro... 0
5 spam FreeMsg Hey there darling it's been 3 week's n... 1
6 ham Even my brother is not like to speak with me. ... 0
7 ham As per your request 'Melle Melle (Oru Minnamin... 0
8 spam WINNER!! As a valued network customer you have... 1
9 spam Had your mobile 11 months or more? U R entitle... 1

Now we have our text in the column message and our label in the column label_num. Let's have a look at the sizes.


In [19]:
# how to define X and y (from the SMS data) for use with COUNTVECTORIZER
X = sms.message
y = sms.label_num
print(X.shape)
print(y.shape)


(5572,)
(5572,)

And at the text of the first 5 messages.


In [20]:
sms.message.head()


Out[20]:
0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: message, dtype: object

We now prepare the data for the classifier. First we split it into a training and a test set. There is a convenient function train_test_split that helps us with that. We use a fixed random state (random_state=42) to split randomly, but at the same time get the same results each time we run the code.


In [21]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)


(4179,)
(1393,)
(4179,)
(1393,)

Now we use the data preprocessing knowledge from above and generate the vocabulary. We will do this ONLY on the training data set, because we presume to have no knowledge whatsoever about the test data set. So we don't know the test data's vocabulary.


In [22]:
# learn training data vocabulary, then use it to create a document-term matrix
vect = CountVectorizer()
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)

In [23]:
# examine the document-term matrix
X_train_dtm


Out[23]:
<4179x7490 sparse matrix of type '<class 'numpy.int64'>'
	with 55879 stored elements in Compressed Sparse Row format>
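
As a side note, fitting the vocabulary and building the document-term matrix can also be done in a single call; the following one-liner is equivalent to the fit/transform pair above:

# fit the vocabulary and create the document-term matrix in one step
X_train_dtm = vect.fit_transform(X_train)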

Next we transform the test data set using the same vocabulary (that is using the same vect object that internally knows the vocabulary).


In [24]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm


Out[24]:
<1393x7490 sparse matrix of type '<class 'numpy.int64'>'
	with 16940 stored elements in Compressed Sparse Row format>

Building and evaluating a model

Now we are at the stage where we have a matrix of features and the corresponding labels. We can now train a classifier for spam detection on SMS messages. We will use multinomial Naive Bayes:

The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.


In [25]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [26]:
nb.fit(X_train_dtm, y_train)
y_test_pred = nb.predict(X_test_dtm)

In [27]:
from sklearn import metrics
metrics.accuracy_score(y_test, y_test_pred)


Out[27]:
0.9885139985642498

In [29]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_test_pred)


Out[29]:
array([[1203,    4],
       [  12,  174]])
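
In the confusion matrix above, rows correspond to the true classes (first row ham, second row spam) and columns to the predicted classes, so the off-diagonal entries are the misclassified messages. Because the classes are imbalanced, it can also be instructive to look at per-class precision and recall, e.g. with scikit-learn's built-in report (a minimal sketch):

# per-class precision, recall and f1-score (0 = ham, 1 = spam)
print(metrics.classification_report(y_test, y_test_pred, target_names=['ham', 'spam']))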

"Spaminess" of words

Before we start: the vectorizer and the estimator have several fields that allow us to examine their internal state:


In [30]:
vect.vocabulary_


Out[30]:
{'winner': 7277,
 'as': 1046,
 'valued': 7003,
 'network': 4582,
 'customer': 2051,
 'you': 7453,
 'have': 3230,
 'been': 1244,
 'selected': 5803,
 'to': 6690,
 'receivea': 5446,
 '900': 698,
 'prize': 5243,
 'reward': 5586,
 'claim': 1757,
 'call': 1549,
 '09061701461': 193,
 'code': 1814,
 'kl341': 3804,
 'valid': 6999,
 '12': 268,
 'hours': 3376,
 'only': 4771,
 'so': 6079,
 'how': 3380,
 'scotland': 5756,
 'hope': 3352,
 'are': 1012,
 'not': 4659,
 'over': 4851,
 'showing': 5934,
 'your': 7458,
 'jjc': 3682,
 'tendencies': 6530,
 'take': 6459,
 'care': 1596,
 'live': 4001,
 'the': 6578,
 'dream': 2353,
 'when': 7227,
 'and': 926,
 'derek': 2181,
 'done': 2318,
 'with': 7295,
 'class': 1764,
 'aight': 850,
 'lemme': 3929,
 'know': 3811,
 'what': 7221,
 'up': 6941,
 'yo': 7448,
 'we': 7169,
 'watching': 7155,
 'movie': 4446,
 'on': 4759,
 'netflix': 4579,
 'why': 7249,
 'don': 2316,
 'go': 3044,
 'tell': 6519,
 'friend': 2889,
 're': 5409,
 'sure': 6398,
 'want': 7130,
 'him': 3300,
 'because': 1231,
 'he': 3239,
 'smokes': 6059,
 'too': 6730,
 'much': 4474,
 'then': 6588,
 'spend': 6174,
 'begging': 1254,
 'come': 1841,
 'smoke': 6057,
 'gosh': 3082,
 'that': 6575,
 'pain': 4878,
 'spose': 6204,
 'better': 1282,
 'shall': 5866,
 'send': 5816,
 'exe': 2602,
 'mail': 4156,
 'id': 3445,
 'ok': 4740,
 'just': 3728,
 'arrived': 1037,
 'see': 5791,
 'in': 3492,
 'couple': 1966,
 'days': 2100,
 'lt': 4096,
 'awake': 1123,
 'is': 3600,
 'there': 6594,
 'snow': 6076,
 'sounds': 6137,
 'like': 3963,
 'plan': 5055,
 'cardiff': 1593,
 'still': 6270,
 'here': 3284,
 'cold': 1820,
 'sitting': 5995,
 'radiator': 5362,
 'thanks': 6567,
 'for': 2824,
 've': 7016,
 'lovely': 4077,
 'wisheds': 7288,
 'rock': 5617,
 'no': 4625,
 'cancer': 1579,
 'moms': 4407,
 'making': 4168,
 'big': 1298,
 'deal': 2108,
 'out': 4836,
 'of': 4715,
 'regular': 5489,
 'checkup': 1696,
 'aka': 861,
 'pap': 4891,
 'smear': 6046,
 'boltblue': 1369,
 'tones': 6720,
 '150p': 293,
 'reply': 5531,
 'poly': 5115,
 'or': 4802,
 'mono': 4416,
 'eg': 2451,
 'poly3': 5116,
 'cha': 1653,
 'slide': 6026,
 'yeah': 7426,
 'slow': 6035,
 'jamz': 3642,
 'toxic': 6762,
 'me': 4244,
 'stop': 6280,
 'more': 4426,
 'txt': 6849,
 'do': 2287,
 'thought': 6626,
 'put': 5330,
 'it': 3612,
 'back': 1149,
 'box': 1401,
 'hey': 3288,
 'will': 7264,
 'be': 1220,
 'late': 3878,
 'at': 1076,
 'amk': 909,
 'need': 4563,
 'drink': 2359,
 'tea': 6498,
 'coffee': 1815,
 'yes': 7437,
 'place': 5051,
 'town': 6761,
 'meet': 4263,
 'exciting': 2598,
 'adult': 807,
 'singles': 5980,
 'now': 4670,
 'uk': 6875,
 'chat': 1680,
 '86688': 661,
 'msg': 4456,
 'tired': 6667,
 'haven': 3232,
 'slept': 6024,
 'well': 7204,
 'past': 4930,
 'few': 2713,
 'nights': 4612,
 'unni': 6929,
 'thank': 6566,
 'dear': 2112,
 'recharge': 5452,
 'rakhesh': 5379,
 'business': 1513,
 'but': 1516,
 'knackered': 3805,
 'came': 1567,
 'home': 3336,
 'went': 7208,
 'sleep': 6017,
 'good': 3068,
 'this': 6615,
 'full': 2920,
 'time': 6661,
 'work': 7341,
 'lark': 3874,
 'urgent': 6963,
 'mobile': 4384,
 'number': 4682,
 'has': 3223,
 'awarded': 1125,
 '2000': 336,
 'guaranteed': 3145,
 '09058094454': 172,
 'from': 2903,
 'land': 3856,
 'line': 3975,
 '3030': 416,
 '12hrs': 277,
 'pple': 5167,
 '700': 588,
 'excellent': 2594,
 'location': 4018,
 'wif': 7255,
 'breakfast': 1434,
 'hamper': 3188,
 'noe': 4629,
 'wat': 7151,
 'dat': 2092,
 'sells': 5810,
 '4d': 496,
 'closes': 1788,
 'free': 2870,
 'entry': 2516,
 'gr8prizes': 3097,
 'wkly': 7309,
 'comp': 1853,
 'chance': 1658,
 'win': 7267,
 'latest': 3882,
 'nokia': 4634,
 '8800': 673,
 'psp': 5298,
 '250': 359,
 'cash': 1615,
 'every': 2565,
 'wk': 7305,
 'great': 3116,
 '80878': 621,
 'http': 3393,
 'www': 7388,
 'com': 1835,
 '08715705022': 128,
 'should': 5924,
 'made': 4142,
 'an': 923,
 'appointment': 994,
 'camera': 1568,
 'sipix': 5983,
 'digital': 2236,
 '09061221066': 191,
 'fromm': 2904,
 'landline': 3858,
 'delivery': 2158,
 'within': 7298,
 '28': 366,
 'was': 7144,
 'she': 5880,
 'looking': 4047,
 'yup': 7475,
 'leh': 3926,
 'probably': 5248,
 'gotta': 3090,
 'check': 1690,
 'leo': 3933,
 'nope': 4645,
 'juz': 3732,
 'off': 4717,
 'oh': 4736,
 'fuck': 2912,
 'juswoke': 3731,
 'bed': 1237,
 'boatin': 1362,
 'docks': 2292,
 'wid': 7253,
 '25': 358,
 'year': 7427,
 'old': 4752,
 'spinout': 6182,
 'giv': 3026,
 'da': 2065,
 'gossip': 3084,
 'l8r': 3834,
 'xxx': 7405,
 'usual': 6986,
 'iam': 3432,
 'fine': 2746,
 'happy': 3213,
 'amp': 917,
 'doing': 2308,
 'got': 3085,
 'shitload': 5903,
 'diamonds': 2215,
 'though': 6625,
 'cutest': 2057,
 'girl': 3021,
 'world': 7347,
 'gud': 3147,
 'mrng': 4454,
 'hav': 3228,
 'nice': 4603,
 'day': 2099,
 'did': 2220,
 'chechi': 1689,
 'drug': 2374,
 'anymore': 962,
 'if': 3456,
 'wasn': 7146,
 'paying': 4945,
 'attention': 1088,
 'morning': 4430,
 'my': 4505,
 'love': 4074,
 'wish': 7287,
 'feeling': 2700,
 'opportunity': 4793,
 'last': 3876,
 'babe': 1143,
 'kiss': 3799,
 'please': 5072,
 'haha': 3173,
 'awesome': 1127,
 'might': 4313,
 'doin': 2306,
 'tonight': 6725,
 'talk': 6466,
 'pa': 4867,
 'am': 901,
 'able': 734,
 'dont': 2320,
 'can': 1572,
 'any': 959,
 'major': 4163,
 'roles': 5624,
 'community': 1852,
 'outreach': 4843,
 'mel': 4273,
 'money': 4411,
 'steve': 6265,
 'mate': 4216,
 'keep': 3762,
 'posted': 5151,
 'anyways': 972,
 'gym': 3165,
 'whatever': 7222,
 'smiles': 6053,
 'having': 3235,
 'miss': 4350,
 'already': 892,
 'get': 2999,
 'mystery': 4512,
 'solved': 6091,
 'opened': 4783,
 'email': 2475,
 'sent': 5824,
 'another': 945,
 'batch': 1194,
 'isn': 3607,
 'sweetie': 6427,
 'hello': 3269,
 'lover': 4079,
 'goes': 3050,
 'new': 4590,
 'job': 3685,
 'think': 6606,
 'wake': 7112,
 'slave': 6016,
 'teasing': 6508,
 'across': 770,
 'sea': 5773,
 'someonone': 6097,
 'trying': 6816,
 'contact': 1908,
 'via': 7032,
 'our': 4835,
 'dating': 2096,
 'service': 5835,
 'find': 2743,
 'who': 7241,
 'could': 1957,
 '09064015307': 208,
 'box334sk38ch': 1407,
 'll': 4008,
 'rcv': 5405,
 'msgs': 4462,
 'svc': 6414,
 'hardcore': 3215,
 'services': 5836,
 'text': 6550,
 '69988': 575,
 'nothing': 4662,
 'must': 4496,
 'age': 833,
 'verify': 7025,
 'yr': 7467,
 'try': 6814,
 'again': 830,
 'ya': 7412,
 'cant': 1583,
 'display': 2273,
 'internal': 3562,
 'subs': 6335,
 'extract': 2634,
 'them': 6585,
 'todays': 6697,
 'voda': 7069,
 'numbers': 4683,
 'ending': 2490,
 '1225': 272,
 'receive': 5445,
 '50award': 519,
 'match': 4214,
 '08712300220': 98,
 'quoting': 5359,
 '3100': 424,
 'standard': 6231,
 'rates': 5394,
 'app': 983,
 'driving': 2366,
 'raining': 5370,
 'caught': 1628,
 'mrt': 4455,
 'station': 6248,
 'lor': 4053,
 'before': 1250,
 'midnight': 4311,
 'ready': 5420,
 'moan': 4381,
 'scream': 5761,
 'senthil': 5826,
 'hsbc': 3391,
 'upgrdcentre': 6950,
 'orange': 4806,
 'may': 4235,
 'phone': 5008,
 'upgrade': 6948,
 'loyalty': 4090,
 '0207': 9,
 '153': 299,
 '9153': 701,
 'offer': 4722,
 'ends': 2492,
 '26th': 365,
 'july': 3722,
 'apply': 991,
 'opt': 4797,
 'available': 1109,
 'ur': 6959,
 '150': 291,
 'worth': 7354,
 'discount': 2266,
 'vouchers': 7082,
 'today': 6696,
 'shop': 5911,
 '85023': 652,
 'savamob': 5726,
 'offers': 4725,
 'cs': 2019,
 'pobox84': 5099,
 'm263uz': 4125,
 '00': 0,
 'sub': 6330,
 '16': 302,
 'tmr': 6684,
 'bugis': 1492,
 '930': 704,
 'captain': 1589,
 'vijaykanth': 7046,
 'comedy': 1842,
 'tv': 6838,
 'drunken': 2379,
 'grand': 3104,
 'prix': 5241,
 'later': 3881,
 '10': 245,
 'min': 4323,
 '09061221061': 190,
 '28days': 367,
 'box177': 1404,
 'm221bp': 4122,
 '2yr': 409,
 'warranty': 7142,
 '150ppm': 297,
 '99': 711,
 'gr8': 3095,
 'message': 4295,
 'leaving': 3917,
 'congrats': 1894,
 'school': 5746,
 'plans': 5060,
 'friday': 2886,
 'wait': 7108,
 'dunno': 2398,
 'wot': 7356,
 'hell': 3267,
 'im': 3469,
 'gonna': 3066,
 'weeks': 7194,
 'become': 1233,
 'slob': 6032,
 'bring': 1450,
 'some': 6092,
 'food': 2816,
 'hear': 3248,
 'philosophy': 5005,
 'say': 5732,
 'happen': 3205,
 'asked': 1056,
 'anna': 935,
 'nagar': 4516,
 'afternoon': 827,
 'round': 5636,
 'til': 6659,
 'gt': 3142,
 'ish': 3603,
 'dun': 2396,
 'pick': 5022,
 'gf': 3009,
 'looked': 4045,
 'addie': 789,
 'monday': 4409,
 'sucks': 6351,
 'her': 3283,
 'hanks': 3201,
 'lotsly': 4064,
 'pls': 5079,
 'play': 5063,
 'others': 4829,
 'life': 3953,
 'sir': 5985,
 'waiting': 7111,
 'once': 4763,
 'depends': 2174,
 'individual': 3517,
 'hair': 3177,
 'dresser': 2358,
 'pretty': 5220,
 'parents': 4905,
 'look': 4044,
 'gong': 3065,
 'kaypoh': 3758,
 'also': 895,
 'collecting': 1827,
 'coming': 1846,
 'saying': 5735,
 'order': 4810,
 'slippers': 6030,
 'cos': 1948,
 'had': 3170,
 'pay': 4940,
 'returning': 5577,
 'hungry': 3412,
 'gay': 2975,
 'guys': 3162,
 '08718730555': 145,
 '10p': 255,
 'texts': 6560,
 '08712460324': 110,
 'accidentally': 756,
 'brought': 1464,
 'em': 2474,
 'bus': 1509,
 'aft': 824,
 'lect': 3918,
 'lar': 3871,
 'car': 1591,
 'tot': 6751,
 'group': 3134,
 'lucky': 4103,
 'havent': 3233,
 'leave': 3915,
 'nobody': 4628,
 'names': 4526,
 'their': 6583,
 'penis': 4963,
 'girls': 3024,
 'name': 4522,
 'story': 6291,
 'doesn': 2298,
 'add': 785,
 'all': 879,
 'needs': 4566,
 'slowly': 6038,
 'vomit': 7075,
 'texting': 6557,
 'right': 5596,
 'ticket': 6649,
 'sorry': 6128,
 'joined': 3694,
 'league': 3910,
 'people': 4965,
 'touch': 6755,
 'mean': 4247,
 'times': 6662,
 'even': 2558,
 'personal': 4985,
 'cost': 1950,
 'week': 7190,
 'open': 4782,
 'click': 1778,
 'lists': 3996,
 'make': 4164,
 'list': 3989,
 'easy': 2423,
 'pie': 5029,
 'cool': 1933,
 'little': 4000,
 'while': 7236,
 'getting': 3007,
 'soon': 6118,
 'oops': 4781,
 'thk': 6616,
 'haf': 3172,
 'enuff': 2519,
 'speak': 6159,
 'minutes': 4340,
 'anyway': 971,
 'darren': 2090,
 'meeting': 4265,
 'ge': 2983,
 'den': 2163,
 'dinner': 2247,
 'xy': 7411,
 'feel': 2698,
 'awkward': 1128,
 'lunch': 4108,
 'buying': 1523,
 'meh': 4270,
 'hi': 3291,
 'about': 736,
 '15pm': 301,
 'taunton': 6487,
 'church': 1749,
 'holla': 3332,
 'many': 4187,
 'things': 6605,
 'its': 3620,
 'antibiotic': 958,
 'used': 6978,
 'chest': 1710,
 'abdomen': 728,
 'gynae': 3166,
 'infections': 3521,
 'bone': 1371,
 'birthdate': 1315,
 'certificate': 1652,
 'april': 1004,
 'real': 5421,
 'date': 2094,
 'publish': 5307,
 'give': 3027,
 'special': 6161,
 'treat': 6788,
 'secret': 5782,
 'way': 7165,
 'wishes': 7289,
 'cmon': 1800,
 'horny': 3361,
 'turn': 6834,
 'fantasy': 2669,
 'hot': 3368,
 'sticky': 6267,
 'replies': 5530,
 '50': 515,
 'cancel': 1576,
 'tel': 6515,
 'software': 6085,
 'than': 6563,
 'bb': 1205,
 'wont': 7332,
 'use': 6977,
 'his': 3305,
 'wife': 7256,
 'doctor': 2294,
 'madam': 4141,
 'happened': 3207,
 'interview': 3564,
 'imma': 3477,
 'cause': 1629,
 'jay': 3654,
 'wants': 7134,
 'drugs': 2376,
 'rp176781': 5642,
 'further': 2934,
 'messages': 4297,
 'regalportfolio': 5480,
 'co': 1805,
 '08717205546': 130,
 'ask': 1054,
 'macho': 4136,
 'budget': 1487,
 'bold': 1367,
 'saw': 5731,
 'one': 4765,
 'dollars': 2313,
 'said': 5685,
 'mr': 4452,
 'foley': 2804,
 'won': 7325,
 'ipod': 3589,
 'prizes': 5245,
 'eye': 2635,
 'visit': 7060,
 '82050': 626,
 'jesus': 3672,
 'christ': 1745,
 'bitch': 1319,
 'answer': 948,
 'fucking': 2915,
 'stayin': 6252,
 'trouble': 6803,
 'stranger': 6296,
 'dave': 2097,
 'other': 4828,
 'sorted': 6130,
 'bloke': 1345,
 'gona': 3063,
 'mum': 4481,
 'thinks': 6610,
 '2getha': 378,
 'tessy': 6544,
 'favor': 2686,
 'convey': 1926,
 'birthday': 1316,
 'nimya': 4616,
 'dnt': 2286,
 'forget': 2832,
 'shijas': 5892,
 'unique': 6917,
 'enough': 2508,
 '30th': 422,
 'august': 1096,
 'areyouunique': 1018,
 'either': 2461,
 'works': 7346,
 'years': 7428,
 'doesnt': 2299,
 'bother': 1394,
 'would': 7358,
 'ip': 3584,
 'address': 791,
 'test': 6545,
 'considering': 1903,
 'computer': 1874,
 'minecraft': 4329,
 'server': 5834,
 'thts': 6642,
 'god': 3049,
 'gift': 3014,
 'birds': 1311,
 'humans': 3407,
 'natural': 4544,
 'frm': 2894,
 'reverse': 5583,
 'cheating': 1688,
 'mathematics': 4219,
 'marry': 4203,
 'lovers': 4081,
 'becz': 1236,
 'they': 6601,
 'undrstndng': 6904,
 'avoids': 1120,
 'problems': 5251,
 'dis': 2259,
 'wil': 7261,
 'news': 4595,
 'by': 1532,
 'person': 4984,
 'tomorrow': 6717,
 'best': 1277,
 'break': 1432,
 'chain': 1655,
 'suffer': 6355,
 'frnds': 2897,
 'mins': 4336,
 'whn': 7240,
 'read': 5416,
 'difficult': 2232,
 'simple': 5971,
 'enter': 2510,
 'same': 5700,
 'elaine': 2466,
 'confirmed': 1890,
 'onum': 4775,
 'ela': 2463,
 'normal': 4651,
 'two': 6847,
 'cartons': 1612,
 'very': 7029,
 'pleased': 5073,
 'shelves': 5886,
 'means': 4251,
 'february': 2695,
 'stay': 6250,
 'down': 2340,
 'hustle': 3422,
 'forth': 2845,
 'during': 2402,
 'audition': 1093,
 'season': 5776,
 'since': 5975,
 'sister': 5988,
 'moved': 4444,
 'away': 1126,
 'harlem': 3220,
 'theory': 6591,
 'going': 3055,
 'book': 1375,
 '21': 345,
 'coz': 1974,
 'wanna': 7128,
 'jiayin': 3679,
 'isnt': 3608,
 'head': 3240,
 'usf': 6982,
 'fifteen': 2722,
 'texted': 6556,
 'finished': 2751,
 'long': 4040,
 'ago': 839,
 'showered': 5932,
 'er': 2527,
 'ything': 7469,
 'freemsg': 2875,
 'baby': 1146,
 'wow': 7362,
 'cam': 1565,
 'moby': 4393,
 'pic': 5021,
 'fancy': 2665,
 'w8in': 7096,
 '4utxt': 511,
 'rply': 5644,
 '82242': 628,
 'hlp': 3313,
 '08712317606': 100,
 'msg150p': 4457,
 '2rcv': 400,
 'practice': 5175,
 'smart': 6043,
 '200': 335,
 'weekly': 7193,
 'quiz': 5356,
 '85222': 654,
 'winnersclub': 7278,
 'po': 5089,
 '84': 644,
 'm26': 4124,
 '3uz': 458,
 'gbp1': 2979,
 'anthony': 956,
 'bringing': 1451,
 'fees': 2703,
 'rent': 5519,
 'stuff': 6320,
 'thats': 6577,
 'help': 3272,
 'points': 5107,
 'cultures': 2035,
 'module': 4397,
 'missing': 4354,
 'plenty': 5076,
 'seem': 5795,
 'pub': 5305,
 'tone': 6719,
 'mob': 4382,
 'nok': 4633,
 '87021': 663,
 '1st': 324,
 'txtin': 6853,
 'friends': 2890,
 'hl': 3311,
 '4info': 502,
 'died': 2226,
 'didn': 2222,
 'family': 2661,
 'str': 6292,
 'orchard': 4809,
 'per': 4967,
 'request': 5537,
 'maangalyam': 4131,
 'alaipayuthe': 865,
 'set': 5839,
 'callertune': 1557,
 'callers': 1556,
 'press': 5215,
 'copy': 1940,
 '0776xxxxxxx': 31,
 'invited': 3578,
 'xchat': 7395,
 'final': 2739,
 'attempt': 1085,
 'msgrcvdhg': 4461,
 'suite342': 6366,
 '2lands': 385,
 'row': 5639,
 'w1j6hl': 7092,
 'ldn': 3902,
 '18yrs': 311,
 'audrie': 1095,
 'lousy': 4071,
 'autocorrect': 1106,
 'after': 825,
 'quit': 5353,
 'lei': 3927,
 'shd': 5879,
 'sch': 5744,
 'hr': 3388,
 'oni': 4767,
 'ah': 841,
 'confuses': 1893,
 'maybe': 4237,
 'wrong': 7376,
 'thing': 6604,
 'sort': 6129,
 'tho': 6621,
 'called': 1554,
 'dad': 2068,
 'oredi': 4813,
 'boy': 1414,
 'father': 2679,
 'power': 5164,
 'frndship': 2898,
 'were': 7210,
 'otherwise': 4830,
 'nalla': 4520,
 'adi': 793,
 'entey': 2513,
 'nattil': 4543,
 'kittum': 3802,
 'online': 4769,
 'yep': 7435,
 'house': 3377,
 'sunday': 6376,
 'studying': 6319,
 'next': 4599,
 'weekend': 7191,
 'cine': 1751,
 'plaza': 5070,
 'mah': 4151,
 'threats': 6631,
 'sales': 5691,
 'executive': 2603,
 'shifad': 5891,
 'raised': 5372,
 'complaint': 1863,
 'against': 831,
 'official': 4728,
 'str8': 6293,
 'each': 2408,
 '8007': 613,
 'classic': 1766,
 'hit': 3307,
 'polys': 5121,
 '200p': 341,
 'pity': 5046,
 'mood': 4423,
 'suggestions': 6364,
 'space': 6148,
 'gives': 3029,
 'everything': 2574,
 'remember': 5505,
 'furniture': 2933,
 'yours': 7462,
 'around': 1031,
 'move': 4443,
 'lock': 4021,
 'locks': 4022,
 'key': 3772,
 'jenne': 3666,
 'running': 5664,
 'admit': 799,
 'mad': 4138,
 'where': 7231,
 'correction': 1945,
 'let': 3940,
 'lets': 3941,
 'run': 5663,
 'fighting': 2726,
 'lose': 4055,
 'bt': 1478,
 'fightng': 2727,
 'some1': 6093,
 'close': 1784,
 'dificult': 2234,
 'whats': 7223,
 'ay': 1134,
 'wana': 7127,
 'sat': 5717,
 'wkg': 7308,
 'mmmmm': 4374,
 'sooooo': 6123,
 'words': 7340,
 'mmmm': 4373,
 'lion': 3982,
 'devouring': 2205,
 'mom': 4404,
 'ugh': 6869,
 'apologize': 981,
 ...}

In [31]:
X_train_tokens = vect.get_feature_names()
print(X_train_tokens[:50])


['00', '000', '000pes', '008704050406', '0089', '0121', '01223585236', '01223585334', '02', '0207', '02072069400', '02073162414', '02085076972', '021', '03', '04', '0430', '05', '050703', '0578', '06', '07', '07008009200', '07046744435', '07090298926', '07099833605', '07123456789', '0721072', '07732584351', '07734396839', '07753741225', '0776xxxxxxx', '07781482378', '07786200117', '077xxx', '07801543489', '07808', '07808247860', '07815296484', '07821230901', '07880867867', '07946746291', '0796xxxxxx', '07973788240', '07xxxxxxxxx', '08', '0800', '08000407165', '08000776320', '08000839402']

In [32]:
print(X_train_tokens[-50:])


['yet', 'yetty', 'yetunde', 'yhl', 'yifeng', 'yijue', 'ym', 'ymca', 'yo', 'yoga', 'yogasana', 'yor', 'yorge', 'you', 'youdoing', 'youi', 'young', 'younger', 'your', 'youre', 'yourinclusive', 'yourjob', 'yours', 'yourself', 'youuuuu', 'yoville', 'yoyyooo', 'yr', 'yrs', 'ything', 'yummy', 'yun', 'yunny', 'yuo', 'yuou', 'yup', 'yupz', 'zac', 'zealand', 'zebra', 'zed', 'zeros', 'zhong', 'zindgi', 'zoe', 'zogtorius', 'zoom', 'zouk', 'zyada', 'èn']

In [33]:
# feature count per class
nb.feature_count_


Out[33]:
array([[ 0.,  0.,  1., ...,  0.,  1.,  1.],
       [ 7., 22.,  0., ...,  1.,  0.,  0.]])

In [34]:
# number of times each token appears across all HAM messages
ham_token_count = nb.feature_count_[0, :]

# number of times each token appears across all SPAM messages
spam_token_count = nb.feature_count_[1, :]

In [35]:
# create a table of tokens with their separate ham and spam counts
tokens = pd.DataFrame({'token':X_train_tokens, 'ham':ham_token_count, 'spam':spam_token_count}).set_index('token')
tokens.head()


Out[35]:
ham spam
token
00 0.0 7.0
000 0.0 22.0
000pes 1.0 0.0
008704050406 0.0 2.0
0089 0.0 1.0

In [36]:
tokens.sample(5, random_state=6)


Out[36]:
ham spam
token
toughest 2.0 0.0
fucking 14.0 0.0
baaaaabe 1.0 0.0
reckon 1.0 0.0
cochin 2.0 0.0

Naive Bayes counts the number of observations in each class


In [37]:
nb.class_count_


Out[37]:
array([3618.,  561.])

Add 1 to the ham and spam counts to avoid dividing by 0 (this is the same Laplace smoothing idea that MultinomialNB applies internally via its alpha parameter).


In [38]:
tokens['ham'] = tokens.ham + 1
tokens['spam'] = tokens.spam + 1
tokens.sample(5, random_state=6)


Out[38]:
ham spam
token
toughest 3.0 1.0
fucking 15.0 1.0
baaaaabe 2.0 1.0
reckon 2.0 1.0
cochin 3.0 1.0

In [39]:
# convert the ham and spam counts into frequencies
tokens['ham'] = tokens.ham / nb.class_count_[0]
tokens['spam'] = tokens.spam / nb.class_count_[1]
tokens.sample(5, random_state=6)


Out[39]:
ham spam
token
toughest 0.000829 0.001783
fucking 0.004146 0.001783
baaaaabe 0.000553 0.001783
reckon 0.000553 0.001783
cochin 0.000829 0.001783

Calculate the ratio of spam-to-ham for each token


In [40]:
tokens['spam_ratio'] = tokens.spam / tokens.ham
tokens.sample(5, random_state=6)


Out[40]:
ham spam spam_ratio
token
toughest 0.000829 0.001783 2.149733
fucking 0.004146 0.001783 0.429947
baaaaabe 0.000553 0.001783 3.224599
reckon 0.000553 0.001783 3.224599
cochin 0.000829 0.001783 2.149733

Examine the DataFrame sorted by spam_ratio


In [41]:
tokens.sort_values('spam_ratio', ascending=False)


Out[41]:
ham spam spam_ratio
token
claim 0.000276 0.140820 509.486631
prize 0.000276 0.105169 380.502674
150p 0.000276 0.096257 348.256684
tone 0.000276 0.083779 303.112299
18 0.000276 0.071301 257.967914
guaranteed 0.000276 0.067736 245.069519
cs 0.000276 0.062389 225.721925
500 0.000276 0.060606 219.272727
1000 0.000276 0.060606 219.272727
100 0.000276 0.057041 206.374332
awarded 0.000276 0.055258 199.925134
uk 0.000553 0.099822 180.577540
ringtone 0.000276 0.048128 174.128342
www 0.000829 0.142602 171.978610
rate 0.000276 0.044563 161.229947
tones 0.000276 0.044563 161.229947
150ppm 0.000276 0.044563 161.229947
000 0.000276 0.040998 148.331551
weekly 0.000276 0.039216 141.882353
entry 0.000276 0.039216 141.882353
mob 0.000276 0.035651 128.983957
16 0.000553 0.069519 125.759358
8007 0.000276 0.033868 122.534759
collection 0.000276 0.033868 122.534759
10p 0.000276 0.033868 122.534759
valid 0.000276 0.033868 122.534759
poly 0.000276 0.030303 109.636364
800 0.000276 0.030303 109.636364
750 0.000276 0.030303 109.636364
5000 0.000276 0.030303 109.636364
... ... ... ...
where 0.025428 0.003565 0.140200
but 0.092040 0.012478 0.135569
did 0.026534 0.003565 0.134358
went 0.013543 0.001783 0.131616
ll 0.055832 0.007130 0.127707
wait 0.014096 0.001783 0.126455
lol 0.014096 0.001783 0.126455
my 0.155611 0.019608 0.126006
feel 0.014373 0.001783 0.124023
nice 0.014373 0.001783 0.124023
anything 0.014925 0.001783 0.119430
sure 0.014925 0.001783 0.119430
something 0.014925 0.001783 0.119430
cos 0.014925 0.001783 0.119430
come 0.047264 0.005348 0.113144
morning 0.016307 0.001783 0.109308
sorry 0.032891 0.003565 0.108390
ask 0.017413 0.001783 0.102368
already 0.017689 0.001783 0.100769
said 0.017966 0.001783 0.099218
amp 0.019071 0.001783 0.093467
doing 0.019900 0.001783 0.089572
oh 0.023217 0.001783 0.076776
later 0.030956 0.001783 0.057582
lor 0.031509 0.001783 0.056572
da 0.032338 0.001783 0.055121
she 0.037313 0.001783 0.047772
he 0.045329 0.001783 0.039324
lt 0.065782 0.001783 0.027097
gt 0.066059 0.001783 0.026984

7490 rows × 3 columns


In [42]:
tokens.loc['00', 'spam_ratio']


Out[42]:
51.593582887700535

Tuning the vectorizer

Do you see any potential to enhance the vectorizer? Think about the following questions:

  • Are all words equally important?
  • Do you think there are "noise words" which negatively influence the results?
  • How can we account for the order of words?

Stopwords

Stopwords are the most common words in a language. Examples are 'is', 'which' and 'the'. Usually it is beneficial to exclude these words in text processing tasks.
The CountVectorizer has a stop_words parameter:

  • stop_words: string {'english'}, list, or None (default)
    • If 'english', a built-in stop word list for English is used.
    • If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
    • If None, no stop words will be used.

In [43]:
vect = CountVectorizer(stop_words='english')
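
To see the effect, we can refit this vectorizer on our toy messages and inspect the remaining vocabulary (a small sketch; words on the built-in English stop word list, such as 'me' and 'you', should disappear):

# refit on the toy corpus with English stop words removed
vect.fit(simple_train)
vect.get_feature_names()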

n-grams

n-grams concatenate n consecutive words into a single token. The following accounts for 1-grams and 2-grams:


In [44]:
vect = CountVectorizer(ngram_range=(1, 2))
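
A quick way to see the consequences is to look at how much the feature space grows on the SMS training data once bigrams are included (a small sketch, output not shown):

# each feature is now either a single word or a pair of adjacent words
X_train_ngrams = vect.fit_transform(X_train)
X_train_ngrams.shape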

Document frequencies

Often it's beneficial to exclude words that appear in the majority of the documents or in only a couple of documents, that is, very frequent or very infrequent words. This can be achieved by using the max_df and min_df parameters of the vectorizer.


In [45]:
# ignore terms that appear in more than 50% of the documents
vect = CountVectorizer(max_df=0.5)

# only keep terms that appear in at least 2 documents
vect = CountVectorizer(min_df=2)
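
These options can of course be combined. One possible configuration is sketched below; whether it actually improves the spam filter has to be checked by re-running the evaluation from above:

# drop stop words, add bigrams, and ignore very rare as well as very frequent terms
vect = CountVectorizer(stop_words='english', ngram_range=(1, 2),
                       min_df=2, max_df=0.5)
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
nb = MultinomialNB().fit(X_train_dtm, y_train)
metrics.accuracy_score(y_test, nb.predict(X_test_dtm))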

A note on Stemming

  • 'went' and 'go'
  • 'kids' and 'kid'
  • 'negative' and 'negatively'

What is the pattern?

The process of reducing a word to its stem, base or root form is called stemming. Scikit-Learn has no powerful stemmer, but other libraries such as NLTK do.
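
As a sketch of how stemming could be plugged into the vectorizer (assuming NLTK is installed; note that a rule-based stemmer handles regular suffixes such as 'kids'/'kid', whereas irregular forms like 'went'/'go' would need a lemmatizer):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
analyzer = CountVectorizer().build_analyzer()   # default tokenization and lowercasing

def stemmed_words(text):
    # tokenize as usual, then reduce each token to its stem
    return [stemmer.stem(word) for word in analyzer(text)]

vect = CountVectorizer(analyzer=stemmed_words)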

Tf-idf

  • Tf-idf can be understood as a modification of the raw term frequencies (tf)
  • The concept behind tf-idf is to downweight terms proportionally to the number of documents in which they occur.
  • The idea is that terms that occur in many different documents are likely unimportant or don't contain any useful information for Natural Language Processing tasks such as document classification.

Explanation by example

Let's consider a dataset containing 3 documents:


In [46]:
import numpy as np
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining and the weather is sweet'])

First, we will compute the term frequency (alternatively: Bag-of-Words) $\text{tf}(t, d)$, i.e. the number of times a term $t$ occurs in a document $d$. Using Scikit-Learn we can quickly get those numbers:


In [47]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
tf = cv.fit_transform(docs).toarray()
tf


Out[47]:
array([[0, 1, 1, 1, 0, 1, 0],
       [0, 1, 0, 0, 1, 1, 1],
       [1, 2, 1, 1, 1, 2, 1]], dtype=int64)

In [48]:
cv.vocabulary_


Out[48]:
{'the': 5, 'sun': 3, 'is': 1, 'shining': 2, 'weather': 6, 'sweet': 4, 'and': 0}

Secondly, we introduce the inverse document frequency ($\text{idf}$) via the document frequency $\text{df}(d,t)$, which is simply the number of documents $d$ that contain the term $t$. We can then define the idf as follows:

$$\text{idf}(t) = log{\frac{n_d}{1+\text{df}(d,t)}},$$

where
$n_d$: The total number of documents
$\text{df}(d,t)$: The number of documents that contain term $t$.

Note that the constant 1 is added to the denominator to avoid a zero-division error if a term is not contained in any document in the test dataset.

Now, let us calculate the idfs of the words "and", "is", and "shining":


In [49]:
n_docs = len(docs)

df_and = 1
idf_and = np.log(n_docs / (1 + df_and))
print('idf "and": %s' % idf_and)

df_is = 3
idf_is = np.log(n_docs / (1 + df_is))
print('idf "is": %s' % idf_is)

df_shining = 2
idf_shining = np.log(n_docs / (1 + df_shining))
print('idf "shining": %s' % idf_shining)


idf "and": 0.4054651081081644
idf "is": -0.2876820724517809
idf "shining": 0.0

Using those idfs, we can eventually calculate the tf-idfs for the 3rd document:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t),$$

In [50]:
print('Tf-idfs in document 3:\n')
print('tf-idf "and": %s' % (1 * idf_and))
print('tf-idf "is": %s' % (2 * idf_is))
print('tf-idf "shining": %s' % (1 * idf_shining))


Tf-idfs in document 3:

tf-idf "and": 0.4054651081081644
tf-idf "is": -0.5753641449035618
tf-idf "shining": 0.0

Tf-idf in Scikit-Learn


In [51]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(smooth_idf=False, norm=None)
tfidf.fit_transform(tf).toarray()[-1][:3]


Out[51]:
array([2.09861229, 2.        , 1.40546511])

Wait! Those numbers aren't the same!

Tf-idf in Scikit-Learn is calculated a little bit differently: here, the +1 is added to the idf itself instead of to the df in the denominator:

$$\text{idf}(t) = log{\frac{n_d}{\text{df}(d,t)}} + 1$$


In [52]:
tf_and = 1
df_and = 1 
tf_and * (np.log(n_docs / df_and) + 1)


Out[52]:
2.09861228866811

In [53]:
tf_is = 2
df_is = 3 
tf_is * (np.log(n_docs / df_is) + 1)


Out[53]:
2.0

In [54]:
tf_shining = 1
df_shining = 2 
tf_shining * (np.log(n_docs / df_shining) + 1)


Out[54]:
1.4054651081081644

Normalization

By default, Scikit-Learn performs a normalization. The most common way to normalize the raw term frequency is l2-normalization, i.e., dividing the raw term frequency vector $v$ by its length $||v||_2$ (L2- or Euclidean norm).

$$v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v{_1}^2 + v{_2}^2 + \dots + v{_n}^2}}$$

Why is that useful? It makes documents of different lengths comparable: after normalization the feature vector reflects the relative weight of each term in the document rather than the absolute number of occurrences.

For example, we would normalize our 3rd document 'The sun is shining and the weather is sweet' as follows:


In [55]:
tfidf = TfidfTransformer(use_idf=True, smooth_idf=False, norm='l2')
tfidf.fit_transform(tf).toarray()[-1][:3]


Out[55]:
array([0.46572049, 0.44383662, 0.31189844])
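
We can verify this by hand: take the unnormalized tf-idf vector of the third document from above and divide it by its Euclidean length (a small sketch; the first three entries should match the output above):

# unnormalized tf-idfs of document 3 (smooth_idf=False, norm=None), then l2-normalize
raw_tfidf = TfidfTransformer(smooth_idf=False, norm=None).fit_transform(tf).toarray()[-1]
(raw_tfidf / np.sqrt(np.sum(raw_tfidf ** 2)))[:3]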

Smooth idf

We are not quite there. Scikit-Learn also applies smoothing by default, which changes the original formula as follows:

$$\text{idf}(t) = log{\frac{1 + n_d}{1+\text{df}(d,t)}} + 1$$


In [56]:
tfidf = TfidfTransformer(use_idf=True, smooth_idf=True, norm='l2')
tfidf.fit_transform(tf).toarray()[-1][:3]


Out[56]:
array([0.40474829, 0.47810172, 0.30782151])
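
We can reproduce these numbers by hand as well, using the smoothed idf formula (a small sketch; the first three entries should match the output above):

# smoothed idf for every term: idf(t) = log((1 + n_d) / (1 + df)) + 1
df = np.sum(tf > 0, axis=0)                      # document frequency of each term
smoothed_idf = np.log((1 + n_docs) / (1 + df)) + 1

# raw tf-idf of document 3, followed by the same l2-normalization
raw_tfidf = tf[-1] * smoothed_idf
(raw_tfidf / np.sqrt(np.sum(raw_tfidf ** 2)))[:3]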