Working with Text Data and Naive Bayes in scikit-learn

Agenda

Working with text data

  • Representing text as data
  • Reading SMS data
  • Vectorizing SMS data
  • Examining the tokens and their counts
  • Bonus: Calculating the "spamminess" of each token

Naive Bayes classification

  • Building a Naive Bayes model
  • Comparing Naive Bayes with logistic regression

Part 1: Representing text as data

From the scikit-learn documentation:

Text Analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents with variable length.

We will use CountVectorizer to "convert text into a matrix of token counts":


In [5]:
from sklearn.feature_extraction.text import CountVectorizer

In [6]:
# start with a simple example
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!', 'help']

In [10]:
# learn the 'vocabulary' of the training data
vect = CountVectorizer()
vect.fit(simple_train)
# vect.get_feature_names()
vect.vocabulary_


Out[10]:
{'cab': 0, 'call': 1, 'help': 2, 'me': 3, 'please': 4, 'tonight': 5, 'you': 6}
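
Note that one-character words like 'a' (from 'Call me a cab') are missing from the vocabulary: CountVectorizer lowercases the text, and its default token_pattern, (?u)\b\w\w+\b, keeps only tokens of two or more characters. A minimal sketch of how to keep single-character tokens (vect_all is a name introduced here for illustration):

In [ ]:
# a sketch: loosen the default token_pattern to keep single-character tokens
vect_all = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
vect_all.fit(simple_train)
vect_all.get_feature_names()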

In [11]:
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm


Out[11]:
<4x7 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [6]:
# print the sparse matrix
print(simple_train_dtm)


  (0, 1)	1
  (0, 5)	1
  (0, 6)	1
  (1, 0)	1
  (1, 1)	1
  (1, 3)	1
  (2, 1)	1
  (2, 3)	1
  (2, 4)	2
  (3, 2)	1

In [14]:
# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()


Out[14]:
array([[0, 1, 0, 0, 0, 1, 1],
       [1, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 1, 2, 0, 0],
       [0, 0, 1, 0, 0, 0, 0]])

In [16]:
# examine the vocabulary and document-term matrix together
import pandas as pd
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())


Out[16]:
cab call help me please tonight you
0 0 1 0 0 0 1 1
1 1 1 0 1 0 0 0
2 0 1 0 1 2 0 0
3 0 0 1 0 0 0 0

In [9]:
# create a document-term matrix on your own
simple_train2 = ["call call Sorry, Ill later", 
                 "K Did you me call ah just now", 
                 "I call you later, don't have network. If urgnt, sms me"]

In [10]:
# complete your work below
# instantiate vectorizer
# fit
# transform
# convert to dense matrix

vec2 = CountVectorizer(binary=True)  # binary=True records presence/absence (0/1) rather than raw counts
vec2.fit(simple_train2)
my_dtm2 = vec2.transform(simple_train2)

pd.DataFrame(my_dtm2.toarray(), columns=vec2.get_feature_names())


Out[10]:
ah call did don have if ill just later me network now sms sorry urgnt you
0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0
1 1 1 1 0 0 0 0 1 0 1 0 1 0 0 0 1
2 0 1 0 1 1 1 0 0 1 1 1 0 1 0 1 1

From the scikit-learn documentation:

In this scheme, features and samples are defined as follows:

  • Each individual token occurrence frequency (normalized or not) is treated as a feature.
  • The vector of all the token frequencies for a given document is considered a multivariate sample.

A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.

We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
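
The "Bag of n-grams" mentioned above extends this idea by also counting short word sequences. A minimal sketch on the simple example (illustration only; the rest of this notebook uses the default unigrams):

In [ ]:
# a sketch: count bigrams as well as single words via ngram_range
vect_ng = CountVectorizer(ngram_range=(1, 2))
vect_ng.fit(simple_train)
vect_ng.get_feature_names()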


In [10]:
vect.get_feature_names()


Out[10]:
['cab', 'call', 'help', 'me', 'please', 'tonight', 'you']

In [11]:
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test = ["please don't call me devon"]
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()


Out[11]:
array([[0, 1, 0, 1, 1, 0, 0]])

In [12]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())


Out[12]:
cab call help me please tonight you
0 0 1 0 1 1 0 0

Summary:

  • vect.fit(train) learns the vocabulary of the training data
  • vect.transform(train) uses the fitted vocabulary to build a document-term matrix from the training data
  • vect.transform(test) uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn't seen before; see the sketch below)
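
To see exactly which test tokens get ignored, here is a minimal sketch (assuming the cells above have been run; test_tokens is a name introduced here for illustration):

In [ ]:
# a sketch: list the test tokens that are missing from the fitted vocabulary
test_tokens = CountVectorizer().fit(simple_test).get_feature_names()
[token for token in test_tokens if token not in vect.vocabulary_]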

Part 2: Reading SMS data


In [13]:
# read tab-separated file
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/sms.tsv'
col_names = ['label', 'message']
sms = pd.read_table(url, sep='\t', header=None, names=col_names)
print(sms.shape)


(5572, 2)

In [14]:
sms.head(5)


Out[14]:
label message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...

In [15]:
sms.label.value_counts()


Out[15]:
ham     4825
spam     747
Name: label, dtype: int64

In [16]:
# convert label to a numeric variable
sms['label'] = sms.label.map({'ham':0, 'spam':1})

In [17]:
# define X and y
X = sms.message
y = sms.label

In [21]:
# split into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)


(4179,) (4179,)
(1393,) (1393,)

Part 3: Vectorizing SMS data


In [27]:
# instantiate the vectorizer
vect = CountVectorizer()

In [28]:
# learn training data vocabulary, then create document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_train_dtm


Out[28]:
<4179x7373 sparse matrix of type '<class 'numpy.int64'>'
	with 55328 stored elements in Compressed Sparse Row format>

In [29]:
# alternative: combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm


Out[29]:
<4179x7373 sparse matrix of type '<class 'numpy.int64'>'
	with 55328 stored elements in Compressed Sparse Row format>

In [30]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm


Out[30]:
<1393x7373 sparse matrix of type '<class 'numpy.int64'>'
	with 17393 stored elements in Compressed Sparse Row format>
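
Storing this matrix densely would take 4179 × 7373 cells, almost all of them zero, which is why scikit-learn returns a sparse matrix. A quick sketch of how sparse it actually is:

In [ ]:
# a sketch: fraction of non-zero entries in the document-term matrix
X_train_dtm.nnz / (X_train_dtm.shape[0] * X_train_dtm.shape[1])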

Part 4: Examining the tokens and their counts


In [31]:
# store token names
X_train_tokens = vect.get_feature_names()

In [32]:
# first 50 tokens
print(X_train_tokens[:50])


['00', '000', '000pes', '008704050406', '0089', '01223585236', '01223585334', '0125698789', '02', '0207', '02073162414', '02085076972', '021', '03', '04', '0430', '05', '050703', '0578', '06', '07', '07046744435', '07090298926', '07099833605', '07123456789', '0721072', '07732584351', '07734396839', '07742676969', '07753741225', '0776xxxxxxx', '07781482378', '07786200117', '077xxx', '07808', '07808247860', '07821230901', '07880867867', '0789xxxxxxx', '07946746291', '07973788240', '07xxxxxxxxx', '08', '0800', '08000407165', '08000776320', '08000839402', '08000930705', '08000938767', '08001950382']

In [33]:
# last 50 tokens
print(X_train_tokens[-50:])


['yet', 'yetty', 'yetunde', 'yhl', 'yifeng', 'yijue', 'ym', 'ymca', 'yo', 'yoga', 'yogasana', 'yor', 'yorge', 'you', 'youdoing', 'youi', 'young', 'younger', 'youphone', 'your', 'youre', 'yourinclusive', 'yourjob', 'yours', 'yourself', 'youwanna', 'yowifes', 'yr', 'yrs', 'ything', 'yummmm', 'yummy', 'yun', 'yunny', 'yuo', 'yuou', 'yup', 'yupz', 'zaher', 'zebra', 'zed', 'zeros', 'zhong', 'zindgi', 'zoe', 'zoom', 'zouk', 'zyada', 'ú1', '〨ud']

In [34]:
# view X_train_dtm as a dense matrix
X_train_dtm.toarray()


Out[34]:
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [35]:
# count how many times EACH token appears across ALL messages in X_train_dtm
import numpy as np
X_train_counts = np.sum(X_train_dtm.toarray(), axis=0)
X_train_counts


Out[35]:
array([ 7, 20,  1, ...,  1,  1,  1], dtype=int64)
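
An aside: the toarray() step above works fine at this scale, but for larger corpora you can sum the sparse matrix directly without densifying it. A minimal sketch:

In [ ]:
# a sketch: sum the sparse matrix without converting it to a dense array first;
# .sum(axis=0) returns a 1xN matrix, so convert and flatten it
X_train_counts = np.asarray(X_train_dtm.sum(axis=0)).flatten()
X_train_counts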

In [36]:
# create a DataFrame of tokens with their counts
pd.DataFrame({'token':X_train_tokens, 'count':X_train_counts}).sort_values(by='count', ascending=True)


Out[36]:
count token
3686 1 juan
4123 1 mailbox
4120 1 mahfuuz
4119 1 mahal
4117 1 magicalsongs
4115 1 maggi
4114 1 magazine
4111 1 madstini
4110 1 madoke
4109 1 madodu
4106 1 mad2
4105 1 mad1
4103 1 macs
4102 1 macleran
4100 1 machines
4098 1 macha
4124 1 mailed
4097 1 macedonia
4125 1 mails
4132 1 makiing
4162 1 marking
4159 1 marine
4156 1 marandratha
4155 1 maps
4152 1 manual
4151 1 manky
4150 1 maniac
4149 1 mango
4147 1 mandara
4146 1 mandan
... ... ...
1228 286 be
3445 291 if
3002 296 get
1080 296 at
7159 296 will
4724 303 or
2296 312 do
7060 318 we
5974 326 so
1525 339 but
4599 348 not
1579 355 can
1014 362 are
4691 385 on
4610 396 now
6479 442 that
1555 443 call
3222 451 have
4654 471 of
7342 534 your
2826 537 for
3587 549 it
4450 587 my
4200 624 me
3578 660 is
3482 682 in
929 741 and
6482 979 the
6588 1680 to
7336 1707 you

7373 rows × 2 columns

Bonus: Calculating the "spamminess" of each token


In [29]:
# create separate DataFrames for ham and spam
sms_ham = sms[sms.label==0] # ham
sms_spam = sms[sms.label==1] # spam

In [30]:
# learn the vocabulary of ALL messages and save it
# (note: this refits vect, replacing the training-set vocabulary fitted in Part 3)
vect.fit(sms.message)
all_tokens = vect.get_feature_names()

In [31]:
# create document-term matrices for ham and spam
ham_dtm = vect.transform(sms_ham.message)
spam_dtm = vect.transform(sms_spam.message)

In [32]:
ham_dtm.shape, spam_dtm.shape


Out[32]:
((4825, 8713), (747, 8713))

In [33]:
# count how many times EACH token appears across ALL ham messages
ham_counts = np.sum(ham_dtm.toarray(), axis=0)

In [34]:
ham_counts


Out[34]:
array([0, 0, 1, ..., 1, 0, 1])

In [35]:
# count how many times EACH token appears across ALL spam messages
spam_counts = np.sum(spam_dtm.toarray(), axis=0)

In [36]:
spam_counts


Out[36]:
array([10, 29,  0, ...,  0,  1,  0])

In [37]:
all_tokens[0:5]


Out[37]:
['00', '000', '000pes', '008704050406', '0089']

In [38]:
# create a DataFrame of tokens with their separate ham and spam counts
token_counts = pd.DataFrame({'token':all_tokens, 'ham':ham_counts, 'spam':spam_counts})

In [39]:
token_counts


Out[39]:
ham spam token
0 0 10 00
1 0 29 000
2 1 0 000pes
3 0 2 008704050406
4 0 1 0089
5 0 1 0121
6 0 1 01223585236
7 0 2 01223585334
8 1 0 0125698789
9 0 8 02
10 0 3 0207
11 0 1 02072069400
12 0 2 02073162414
13 0 1 02085076972
14 0 2 021
15 0 13 03
16 0 12 04
17 0 1 0430
18 0 5 05
19 0 2 050703
20 0 2 0578
21 0 8 06
22 0 2 07
23 0 1 07008009200
24 0 1 07046744435
25 0 1 07090201529
26 0 1 07090298926
27 0 1 07099833605
28 0 2 07123456789
29 0 1 0721072
... ... ... ...
8683 1 0 yowifes
8684 1 0 yoyyooo
8685 3 11 yr
8686 5 3 yrs
8687 1 0 ystrday
8688 1 0 ything
8689 1 0 yummmm
8690 3 0 yummy
8691 5 0 yun
8692 2 0 yunny
8693 4 0 yuo
8694 1 0 yuou
8695 43 0 yup
8696 1 0 yupz
8697 1 0 zac
8698 1 0 zaher
8699 1 0 zealand
8700 0 1 zebra
8701 0 6 zed
8702 1 0 zeros
8703 1 0 zhong
8704 2 0 zindgi
8705 1 1 zoe
8706 1 0 zogtorius
8707 1 0 zoom
8708 0 1 zouk
8709 1 0 zyada
8710 1 0 èn
8711 0 1 ú1
8712 1 0 〨ud

8713 rows × 3 columns


In [40]:
# add one to the ham and spam counts to avoid dividing by zero in the next step
# (the same add-one idea as Laplace smoothing)
token_counts['ham'] = token_counts.ham + 1
token_counts['spam'] = token_counts.spam + 1

In [41]:
# calculate ratio of spam-to-ham for each token
token_counts['spam_ratio'] = token_counts.spam / token_counts.ham
token_counts.sort_values(by='spam_ratio', ascending=False)


Out[41]:
ham spam token spam_ratio
2067 1 114 claim 114.000000
6113 1 94 prize 94.000000
352 1 72 150p 72.000000
7837 1 61 tone 61.000000
369 1 52 18 52.000000
3688 1 51 guaranteed 51.000000
617 1 45 500 45.000000
2371 1 45 cs 45.000000
299 1 42 1000 42.000000
1333 1 39 awarded 39.000000
8016 2 75 uk 37.500000
356 1 35 150ppm 35.000000
6525 1 33 ringtone 33.000000
8596 3 99 www 33.000000
1 1 30 000 30.000000
2150 1 27 collection 27.000000
2963 1 27 entry 27.000000
364 2 54 16 27.000000
7838 1 27 tones 27.000000
618 1 26 5000 26.000000
5117 1 26 mob 26.000000
8375 1 25 weekly 25.000000
309 1 25 10p 25.000000
8153 1 25 valid 25.000000
732 1 23 800 23.000000
5297 1 23 national 23.000000
1623 1 22 bonus 22.000000
735 1 22 8007 22.000000
6619 1 22 sae 22.000000
8248 1 22 vouchers 22.000000
... ... ... ... ...
3925 166 3 home 0.018072
2815 56 1 dun 0.017857
5533 115 2 oh 0.017391
5217 116 2 much 0.017241
5254 755 13 my 0.017219
1064 59 1 always 0.016949
7001 59 1 sleep 0.016949
3595 59 1 gonna 0.016949
3171 63 1 feel 0.015873
8394 63 1 went 0.015873
5371 63 1 nice 0.015873
3690 68 1 gud 0.014706
7099 70 1 something 0.014286
7463 72 1 sure 0.013889
4724 75 1 lol 0.013333
1142 77 1 anything 0.012987
2289 77 1 cos 0.012987
2163 231 3 come 0.012987
5167 80 1 morning 0.012500
2714 89 1 doing 0.011236
1084 89 1 amp 0.011236
1247 90 1 ask 0.011111
6626 90 1 said 0.011111
4550 136 1 later 0.007353
2428 151 1 da 0.006623
4747 163 1 lor 0.006135
6843 168 1 she 0.005952
3805 232 1 he 0.004310
4793 317 1 lt 0.003155
3684 319 1 gt 0.003135

8713 rows × 4 columns
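
Individual tokens can also be looked up directly. A minimal sketch for 'claim', the token at the top of the list:

In [ ]:
# a sketch: look up the spam ratio for a single token
token_counts.loc[token_counts.token == 'claim', 'spam_ratio']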


In [43]:
# observe messages that contain the word 'claim'
claim_messages = sms.message[sms.message.str.contains('claim')]

for message in claim_messages[0:5]:
    print(message, '\n')


WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only. 

Urgent UR awarded a complimentary trip to EuroDisinc Trav, Aco&Entry41 Or £1000. To claim txt DIS to 87121 18+6*£1.50(moreFrmMob. ShrAcomOrSglSuplt)10, LS1 3AJ 

You are a winner U have been specially selected 2 receive £1000 or a 4* holiday (flights inc) speak to a live operator 2 claim 0871277810910p/min (18+)  

PRIVATE! Your 2004 Account Statement for 07742676969 shows 786 unredeemed Bonus Points. To claim call 08719180248 Identifier Code: 45239 Expires 

Todays Voda numbers ending 7548 are selected to receive a $350 award. If you have a match please call 08712300220 quoting claim code 4041 standard rates app 

Part 5: Building a Naive Bayes model

We will use Multinomial Naive Bayes:

The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
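
As the quote notes, fractional counts such as tf-idf may also work in practice. A minimal sketch of that variation (an optional aside; the rest of this section sticks with raw counts):

In [ ]:
# a sketch: the tf-idf variation mentioned above (not used below)
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_train_tfidf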


In [37]:
# train a Naive Bayes model using X_train_dtm
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)


Out[37]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
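
As an aside, the fitted model exposes the smoothed log-probabilities it learned for each token in each class. A minimal sketch pairing them with X_train_tokens, saved in Part 4 (note that vect itself was refit on all messages in the bonus section, so its current vocabulary no longer matches X_train_dtm):

In [ ]:
# a sketch: inspect the per-class log-probabilities the model learned
# feature_log_prob_ has one row per class (row 0 = ham, row 1 = spam)
ham_log_prob, spam_log_prob = nb.feature_log_prob_
token_probs = pd.DataFrame({'token': X_train_tokens,
                            'spam_minus_ham': spam_log_prob - ham_log_prob})
token_probs.sort_values(by='spam_minus_ham', ascending=False).head()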

In [38]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

In [39]:
# calculate accuracy of class predictions
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))


0.988513998564

In [41]:
print(metrics.classification_report(y_test, y_pred_class))


             precision    recall  f1-score   support

          0       0.99      1.00      0.99      1206
          1       0.98      0.94      0.96       187

avg / total       0.99      0.99      0.99      1393


In [43]:
metrics.confusion_matrix(y_test, y_pred_class)


Out[43]:
array([[1202,    4],
       [  12,  175]])

In [47]:
# view the documentation for confusion_matrix
?metrics.confusion_matrix

In [48]:
# confusion matrix: rows are actual classes (0=ham, 1=spam), columns are predicted classes
print(metrics.confusion_matrix(y_test, y_pred_class))


[[1202    4]
 [  12  175]]

In [49]:
# predict probabilities (note: Naive Bayes probabilities are poorly calibrated,
# pushed toward 0 and 1, because of the conditional independence assumption)
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob


Out[49]:
array([  7.82887542e-08,   3.02868734e-08,   1.38606514e-11, ...,
         1.00000000e+00,   1.00000000e+00,   2.62417931e-06])

In [50]:
# calculate AUC
print(metrics.roc_auc_score(y_test, y_pred_prob))


0.987471288832
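
If better-calibrated probabilities are needed, one option (an optional aside, not part of the original flow) is scikit-learn's CalibratedClassifierCV. A minimal sketch:

In [ ]:
# a sketch: wrap the model in cross-validated probability calibration
from sklearn.calibration import CalibratedClassifierCV
nb_calibrated = CalibratedClassifierCV(MultinomialNB(), method='isotonic', cv=3)
nb_calibrated.fit(X_train_dtm, y_train)
y_pred_prob_cal = nb_calibrated.predict_proba(X_test_dtm)[:, 1]
# log loss penalizes overconfident predictions, so it reflects calibration
print(metrics.log_loss(y_test, y_pred_prob), metrics.log_loss(y_test, y_pred_prob_cal))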

In [51]:
# print message text for the false positives
# (y_test < y_pred_class is True only where actual=0 but predicted=1)
X_test[y_test < y_pred_class]


Out[51]:
5475    Dhoni have luck to win some big title.so we wi...
2173     Yavnt tried yet and never played original either
4557                              Gettin rdy to ship comp
4382               Mathews or tait or edwards or anderson
Name: message, dtype: object

In [52]:
# print message text for the false negatives
# (y_test > y_pred_class is True only where actual=1 but predicted=0)
X_test[y_test > y_pred_class]


Out[52]:
4213    Missed call alert. These numbers called but le...
3360    Sorry I missed your call let's talk when you h...
2575    Your next amazing xxx PICSFREE1 video will be ...
788     Ever thought about living a good life with a p...
5370    dating:i have had two of these. Only started a...
3530    Xmas & New Years Eve tickets are now on sale f...
2352    Download as many ringtones as u like no restri...
3742                                        2/2 146tf150p
2558    This message is brought to you by GMW Ltd. and...
4144    In The Simpsons Movie released in July 2007 na...
955             Filthy stories and GIRLS waiting for your
1638    0A$NETWORKS allow companies to bill for SMS, s...
Name: message, dtype: object

In [ ]:
# what do you notice about the false negatives?
# X_test[3132]

Part 6: Comparing Naive Bayes with logistic regression


In [ ]:
# create a logistic regression model
# import/instantiate/fit

In [ ]:
# class predictions and predicted probabilities

In [ ]:
# calculate accuracy and AUC
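
One possible solution sketch for the three cells above (assuming the cells from Parts 3-5 have been run):

In [ ]:
# one possible solution (a sketch)
# import/instantiate/fit
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)

# class predictions and predicted probabilities
y_pred_class_lr = logreg.predict(X_test_dtm)
y_pred_prob_lr = logreg.predict_proba(X_test_dtm)[:, 1]

# calculate accuracy and AUC
print(metrics.accuracy_score(y_test, y_pred_class_lr))
print(metrics.roc_auc_score(y_test, y_pred_prob_lr))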