Author: Christin Seifert, licensed under the Creative Commons Attribution 3.0 Unported License https://creativecommons.org/licenses/by/3.0/ It is based on a tutorial by Nils Witt (https://github.com/n-witt/MachineLearningWithText_SS2017)
This is a tutorial on training and evaluating a simple Naive Bayes classifier for a simple text classification problem. In this tutorial you will:
It is assumed that you have some general knowledge on
We will start with a small example of three SMS messages. The texts of the messages are: "call you tonight", "Call me a cab", "please call me... PLEASE!". In order to do text classification we need to convert the texts into feature vectors. We will follow a very simple approach here: split each text into words, lowercase the words, ignore punctuation, and count how often each word occurs.
All of this can easily be done with the CountVectorizer from the sklearn library.
In [1]:
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
In [2]:
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
In [3]:
# learn the 'vocabulary' of the training data
vect.fit(simple_train)
Out[3]:
In [4]:
# examine the fitted vocabulary
vect.get_feature_names()
Out[4]:
Have you noticed that all words are lower case now? And that we ignored punctuation? Whether this is a good idea depends on the application. E.g. for detecting emotions in texts, smileys (punctuation) might be a helpful feature. But for now, let's keep it simple.
Now we generate a document-term matrix. In this matrix each row corresponds to one document and each column to one feature. Entry $(i,j)$ tells us how often word $j$ occurs in document $i$.
Note: The "how often" is only true if we use the count vectorizer. Instead of word count there are many other possible features.
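For instance (a small illustration that is not part of the original notebook), setting binary=True makes the CountVectorizer record only whether a word occurs at all instead of how often:
# illustration: binary occurrence features instead of raw counts
vect_binary = CountVectorizer(binary=True)
vect_binary.fit_transform(simple_train).toarray()  # entries are 0/1 instead of counts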
From the scikit-learn documentation:
In this scheme, features and samples are defined as follows:
- Each individual token occurrence frequency (normalized or not) is treated as a feature.
- The vector of all the token frequencies for a given document is considered a sample.
A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.
We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
In [5]:
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm
Out[5]:
In [6]:
# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()
Out[6]:
We can use a pandas data frame to store the vector and the feature names together.
In [7]:
# examine the vocabulary and document-term matrix together
import pandas as pd
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())
Out[7]:
Since in general this matrix contains an awful lot of zeros (think of how few of all English words appear in a single SMS), the more efficient way to store the information is as a sparse matrix. For humans this is a bit harder to read.
From the scikit-learn documentation:
As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).
For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.
In order to be able to store such a matrix in memory but also to speed up operations, implementations will typically use a sparse representation such as the implementations available in the scipy.sparse package.
In [8]:
# check the type of the document-term matrix
type(simple_train_dtm)
Out[8]:
In [9]:
# examine the sparse matrix contents
print(simple_train_dtm)
In [10]:
# example text for model testing
simple_test = ["please don't call me, I don't like you"]
In [11]:
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()
Out[11]:
In [12]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())
Out[12]:
In [13]:
path = 'material/sms.tsv'
sms = pd.read_table(path, header=None, names=['label', 'message'])
In [14]:
sms.shape
Out[14]:
In [15]:
# examine the first 10 rows
sms.head(10)
Out[15]:
Let's first examine the class distribution and then convert the label to a numerical value.
In [16]:
# examine the class distribution
sms.label.value_counts()
Out[16]:
In [17]:
# convert label to a numerical variable
sms['label_num'] = sms.label.map({'ham':0, 'spam':1})
In [18]:
# check that the conversion worked
sms.head(10)
Out[18]:
Now we have our text in the column message and our label in the column label_num. Let's have a look at the sizes.
In [19]:
# how to define X and y (from the SMS data) for use with COUNTVECTORIZER
X = sms.message
y = sms.label_num
print(X.shape)
print(y.shape)
And at the text of the first 5 messages.
In [20]:
sms.message.head()
Out[20]:
We now prepare the data for the classifier. First we split it into a training and a test set. There is a convenient function train_test_split available that helps us with that. We use a fixed random state (random_state=42) to split randomly, but at the same time get the same results each time we run the code.
In [21]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
Now we use the data preprocessing knowledge from above and generate the vocabulary. We will do this ONLY on the training data set, because we presume to have no knowledge whatsoever about the test data set. So we don't know the test data's vocabulary.
In [22]:
# learn training data vocabulary, then use it to create a document-term matrix
vect = CountVectorizer()
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
In [23]:
# examine the document-term matrix
X_train_dtm
Out[23]:
Next we transform the test data set using the same vocabulary (that is, using the same vect object that internally knows the vocabulary).
In [24]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm
Out[24]:
Now we are at the stage where we have a matrix of features and the corresponding labels. We can now train a classifier for spam detection on SMS messages. We will use multinomial Naive Bayes:
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
In [25]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
In [26]:
nb.fit(X_train_dtm, y_train)
y_test_pred = nb.predict(X_test_dtm)
In [27]:
from sklearn import metrics
metrics.accuracy_score(y_test, y_test_pred)
Out[27]:
In [29]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_test_pred)
Out[29]:
Before we dig deeper: the vectorizer and the estimator have several fields that allow us to examine their internal state:
In [30]:
vect.vocabulary_
Out[30]:
In [31]:
X_train_tokens = vect.get_feature_names()
print(X_train_tokens[:50])
In [32]:
print(X_train_tokens[-50:])
In [33]:
# feature count per class
nb.feature_count_
Out[33]:
In [34]:
# number of times each token appears across all HAM messages
ham_token_count = nb.feature_count_[0, :]
# number of times each token appears across all SPAM messages
spam_token_count = nb.feature_count_[1, :]
In [35]:
# create a table of tokens with their separate ham and spam counts
tokens = pd.DataFrame({'token':X_train_tokens, 'ham':ham_token_count, 'spam':spam_token_count}).set_index('token')
tokens.head()
Out[35]:
In [36]:
tokens.sample(5, random_state=6)
Out[36]:
Naive Bayes counts the number of observations in each class
In [37]:
nb.class_count_
Out[37]:
Add 1 to ham and spam counts to avoid dividing by 0
In [38]:
tokens['ham'] = tokens.ham + 1
tokens['spam'] = tokens.spam + 1
tokens.sample(5, random_state=6)
Out[38]:
In [39]:
# convert the ham and spam counts into frequencies
tokens['ham'] = tokens.ham / nb.class_count_[0]
tokens['spam'] = tokens.spam / nb.class_count_[1]
tokens.sample(5, random_state=6)
Out[39]:
Calculate the ratio of spam-to-ham for each token
In [40]:
tokens['spam_ratio'] = tokens.spam / tokens.ham
tokens.sample(5, random_state=6)
Out[40]:
Examine the DataFrame sorted by spam_ratio
In [41]:
tokens.sort_values('spam_ratio', ascending=False)
Out[41]:
In [42]:
tokens.loc['00', 'spam_ratio']
Out[42]:
Stopwords are the most common words in a language. Examples are 'is', 'which' and 'the'. Usually it is beneficial to exclude these words in text processing tasks.
The CountVectorizer has a stop_words parameter:
In [43]:
vect = CountVectorizer(stop_words='english')
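For example (a quick illustrative check, not part of the original notebook), we could fit this vectorizer on the toy messages from the beginning and inspect the vocabulary:
vect.fit(simple_train)
print(vect.get_feature_names())  # common words such as 'me' and 'you' no longer appear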
n-grams concatenate n consecutive words to form a single token. The following configuration accounts for 1-grams and 2-grams:
In [44]:
vect = CountVectorizer(ngram_range=(1, 2))
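As a small illustration (again not in the original notebook), fitting on the toy messages now yields both single words and word pairs as tokens:
vect.fit(simple_train)
print(vect.get_feature_names())  # contains unigrams such as 'call' and bigrams such as 'call me'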
Often it is beneficial to exclude words that appear in the majority of documents or in just a couple of documents, that is, very frequent or very infrequent words. This can be achieved by using the max_df and min_df parameters of the vectorizer.
In [45]:
# ignore terms that appear in more than 50% of the documents
vect = CountVectorizer(max_df=0.5)
# only keep terms that appear in at least 2 documents
vect = CountVectorizer(min_df=2)
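As a quick sanity check (not in the original notebook), applying the min_df=2 vectorizer to the toy messages keeps only the words that occur in at least two of the three texts:
vect.fit(simple_train)
print(vect.get_feature_names())  # only 'call' and 'me' occur in two or more messages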
The process of reducing a word to its word stem, base or root form is called stemming. Scikit-Learn has no powerful stemmer, but other libraries such as NLTK do.
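As a minimal sketch (assuming the nltk package is installed; this is not part of the original notebook), a stemmer from NLTK can be plugged into the CountVectorizer via a custom tokenizer:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
tokenize = CountVectorizer().build_tokenizer()  # reuse the vectorizer's default tokenization
def stemmed_tokenizer(text):
    # tokenize as usual, then reduce every token to its stem
    return [stemmer.stem(token) for token in tokenize(text)]
vect_stemmed = CountVectorizer(tokenizer=stemmed_tokenizer)
vect_stemmed.fit(['calling called calls'])
print(vect_stemmed.get_feature_names())  # all three word forms collapse to 'call'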
In [46]:
import numpy as np
docs = np.array([
'The sun is shining',
'The weather is sweet',
'The sun is shining and the weather is sweet'])
First, we will compute the term frequency (alternatively: bag-of-words) $tf(t, d)$, i.e. the number of times a term $t$ occurs in a document $d$. Using Scikit-Learn we can quickly get those numbers:
In [47]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
tf = cv.fit_transform(docs).toarray()
tf
Out[47]:
In [48]:
cv.vocabulary_
Out[48]:
Secondly, we introduce the inverse document frequency ($idf$) by defining the document frequency $\text{df}(d,t)$, which is simply the number of documents $d$ that contain the term $t$. We can then define the idf as follows:
$$\text{idf}(t) = \log{\frac{n_d}{1+\text{df}(d,t)}}$$
where
$n_d$: The total number of documents
$\text{df}(d,t)$: The number of documents that contain term $t$.
Note that the constant 1 is added to the denominator to avoid a zero-division error if a term is not contained in any document in the test dataset.
Now, let us calculate the idfs of the words "and", "is", and "shining":
In [49]:
n_docs = len(docs)
df_and = 1
idf_and = np.log(n_docs / (1 + df_and))
print('idf "and": %s' % idf_and)
df_is = 3
idf_is = np.log(n_docs / (1 + df_is))
print('idf "is": %s' % idf_is)
df_shining = 2
idf_shining = np.log(n_docs / (1 + df_shining))
print('idf "shining": %s' % idf_shining)
Using those idfs, we can eventually calculate the tf-idfs for the 3rd document:
In [50]:
print('Tf-idfs in document 3:\n')
print('tf-idf "and": %s' % (1 * idf_and))
print('tf-idf "is": %s' % (2 * idf_is))
print('tf-idf "shining": %s' % (1 * idf_shining))
In [51]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(smooth_idf=False, norm=None)
tfidf.fit_transform(tf).toarray()[-1][:3]
Out[51]:
Wait! Those numbers aren't the same!
Tf-idf in Scikit-Learn is calculated a little bit differently. Here, the constant 1 is added to the idf value itself instead of to the denominator of the df, i.e. $\text{idf}(t) = \log{\frac{n_d}{\text{df}(d,t)}} + 1$:
In [52]:
tf_and = 1
df_and = 1
tf_and * (np.log(n_docs / df_and) + 1)
Out[52]:
In [53]:
tf_is = 2
df_is = 3
tf_is * (np.log(n_docs / df_is) + 1)
Out[53]:
In [54]:
tf_shining = 1
df_shining = 2
tf_shining * (np.log(n_docs / df_shining) + 1)
Out[54]:
By default, Scikit-Learn performs a normalization. The most common way to normalize the raw term frequency is l2-normalization, i.e., dividing the raw term frequency vector $v$ by its length $||v||_2$ (L2- or Euclidean norm).
$$v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}}$$
Why is that useful? The normalization removes the influence of document length, so that documents of different lengths become comparable.
For example, we would normalize our 3rd document 'The sun is shining and the weather is sweet' as follows:
In [55]:
tfidf = TfidfTransformer(use_idf=True, smooth_idf=False, norm='l2')
tfidf.fit_transform(tf).toarray()[-1][:3]
Out[55]:
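To double-check this (a small verification sketch, not part of the original notebook), we can take the raw tf-idf vector of the 3rd document from above and l2-normalize it by hand:
# raw (unnormalized, unsmoothed) tf-idf vector of the 3rd document
raw_tfidf = TfidfTransformer(smooth_idf=False, norm=None).fit_transform(tf).toarray()[-1]
# dividing by the Euclidean length should reproduce the values above
raw_tfidf / np.sqrt(np.sum(raw_tfidf ** 2))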
We are not quite there. Scikit-Learn also applies smoothing by default, which changes the original formula as follows:
$$\text{idf}(t) = \log{\frac{1+n_d}{1+\text{df}(d,t)}} + 1$$
In [56]:
tfidf = TfidfTransformer(use_idf=True, smooth_idf=True, norm='l2')
tfidf.fit_transform(tf).toarray()[-1][:3]
Out[56]:
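As a final sanity check (a sketch based on the smoothed formula above, not part of the original notebook), we can reproduce these values manually:
# document frequencies, taken directly from the count matrix
df = np.sum(tf > 0, axis=0)
# smoothed idf: log((1 + n_d) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + df)) + 1
# raw tf-idf for the 3rd document, then l2-normalization
raw = tf[-1] * idf
raw / np.sqrt(np.sum(raw ** 2))  # the first three entries should match the output above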