Working with Text Data and Naive Bayes in scikit-learn

Agenda

Working with text data

  • Representing text as data
  • Reading SMS data
  • Vectorizing SMS data
  • Examining the tokens and their counts
  • Bonus: Calculating the "spamminess" of each token

Naive Bayes classification

  • Building a Naive Bayes model
  • Comparing Naive Bayes with logistic regression

Part 1: Representing text as data

From the scikit-learn documentation:

Text Analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents with variable length.

We will use CountVectorizer to "convert text into a matrix of token counts":


In [5]:
from sklearn.feature_extraction.text import CountVectorizer

In [6]:
# start with a simple example
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!', 'help']

In [10]:
# learn the 'vocabulary' of the training data
vect = CountVectorizer()
vect.fit(simple_train)
# vect.get_feature_names()
vect.vocabulary_


Out[10]:
{'cab': 0, 'call': 1, 'help': 2, 'me': 3, 'please': 4, 'tonight': 5, 'you': 6}
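
Note that one-character words like 'a' (from 'Call me a cab') are missing from the vocabulary: CountVectorizer lowercases the text, and its default token_pattern, (?u)\b\w\w+\b, keeps only tokens of two or more characters. A minimal sketch of how to keep single-character tokens (vect_all is a name introduced here for illustration):

In [ ]:
# a sketch: loosen the default token_pattern to keep single-character tokens
vect_all = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
vect_all.fit(simple_train)
vect_all.get_feature_names()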

In [11]:
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm


Out[11]:
<4x7 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [6]:
# print the sparse matrix
print(simple_train_dtm)


  (0, 1)	1
  (0, 5)	1
  (0, 6)	1
  (1, 0)	1
  (1, 1)	1
  (1, 3)	1
  (2, 1)	1
  (2, 3)	1
  (2, 4)	2
  (3, 2)	1

In [14]:
# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()


Out[14]:
array([[0, 1, 0, 0, 0, 1, 1],
       [1, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 1, 2, 0, 0],
       [0, 0, 1, 0, 0, 0, 0]])

In [16]:
# examine the vocabulary and document-term matrix together
import pandas as pd
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())


Out[16]:
cab call help me please tonight you
0 0 1 0 0 0 1 1
1 1 1 0 1 0 0 0
2 0 1 0 1 2 0 0
3 0 0 1 0 0 0 0

In [9]:
# create a document-term matrix on your own
simple_train2 = ["call call Sorry, Ill later", 
                 "K Did you me call ah just now", 
                 "I call you later, don't have network. If urgnt, sms me"]

In [10]:
# complete your work below
# instantiate vectorizer
# fit
# transform
# convert to dense matrix

vec2 = CountVectorizer(binary=True)  # binary=True records presence/absence (0/1) rather than raw counts
vec2.fit(simple_train2)
my_dtm2 = vec2.transform(simple_train2)

pd.DataFrame(my_dtm2.toarray(), columns=vec2.get_feature_names())


Out[10]:
ah call did don have if ill just later me network now sms sorry urgnt you
0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0
1 1 1 1 0 0 0 0 1 0 1 0 1 0 0 0 1
2 0 1 0 1 1 1 0 0 1 1 1 0 1 0 1 1

From the scikit-learn documentation:

In this scheme, features and samples are defined as follows:

  • Each individual token occurrence frequency (normalized or not) is treated as a feature.
  • The vector of all the token frequencies for a given document is considered a multivariate sample.

A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.

We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
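
The "Bag of n-grams" mentioned above extends this idea by also counting short word sequences. A minimal sketch on the simple example (illustration only; the rest of this notebook uses the default unigrams):

In [ ]:
# a sketch: count bigrams as well as single words via ngram_range
vect_ng = CountVectorizer(ngram_range=(1, 2))
vect_ng.fit(simple_train)
vect_ng.get_feature_names()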


In [10]:
vect.get_feature_names()


Out[10]:
['cab', 'call', 'help', 'me', 'please', 'tonight', 'you']

In [11]:
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test = ["please don't call me devon"]
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()


Out[11]:
array([[0, 1, 0, 1, 1, 0, 0]])

In [12]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())


Out[12]:
cab call help me please tonight you
0 0 1 0 1 1 0 0

Summary:

  • vect.fit(train) learns the vocabulary of the training data
  • vect.transform(train) uses the fitted vocabulary to build a document-term matrix from the training data
  • vect.transform(test) uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn't seen before; see the sketch below)
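
To see exactly which test tokens get ignored, here is a minimal sketch (assuming the cells above have been run; test_tokens is a name introduced here for illustration):

In [ ]:
# a sketch: list the test tokens that are missing from the fitted vocabulary
test_tokens = CountVectorizer().fit(simple_test).get_feature_names()
[token for token in test_tokens if token not in vect.vocabulary_]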

Part 2: Reading SMS data


In [13]:
# read tab-separated file
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/sms.tsv'
col_names = ['label', 'message']
sms = pd.read_table(url, sep='\t', header=None, names=col_names)
print(sms.shape)


(5572, 2)

In [14]:
sms.head(5)


Out[14]:
label message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...

In [15]:
sms.label.value_counts()


Out[15]:
ham     4825
spam     747
Name: label, dtype: int64

In [16]:
# convert label to a numeric variable
sms['label'] = sms.label.map({'ham':0, 'spam':1})

In [17]:
# define X and y
X = sms.message
y = sms.label

In [21]:
# split into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)


(4179,) (4179,)
(1393,) (1393,)

Part 3: Vectorizing SMS data


In [27]:
# instantiate the vectorizer
vect = CountVectorizer()

In [28]:
# learn training data vocabulary, then create document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_train_dtm


Out[28]:
<4179x7373 sparse matrix of type '<class 'numpy.int64'>'
	with 55328 stored elements in Compressed Sparse Row format>

In [29]:
# alternative: combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm


Out[29]:
<4179x7373 sparse matrix of type '<class 'numpy.int64'>'
	with 55328 stored elements in Compressed Sparse Row format>

In [30]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm


Out[30]:
<1393x7373 sparse matrix of type '<class 'numpy.int64'>'
	with 17393 stored elements in Compressed Sparse Row format>
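
Storing this matrix densely would take 4179 × 7373 cells, almost all of them zero, which is why scikit-learn returns a sparse matrix. A quick sketch of how sparse it actually is:

In [ ]:
# a sketch: fraction of non-zero entries in the document-term matrix
X_train_dtm.nnz / (X_train_dtm.shape[0] * X_train_dtm.shape[1])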

Part 4: Examining the tokens and their counts


In [31]:
# store token names
X_train_tokens = vect.get_feature_names()

In [32]:
# first 50 tokens
print(X_train_tokens[:50])


['00', '000', '000pes', '008704050406', '0089', '01223585236', '01223585334', '0125698789', '02', '0207', '02073162414', '02085076972', '021', '03', '04', '0430', '05', '050703', '0578', '06', '07', '07046744435', '07090298926', '07099833605', '07123456789', '0721072', '07732584351', '07734396839', '07742676969', '07753741225', '0776xxxxxxx', '07781482378', '07786200117', '077xxx', '07808', '07808247860', '07821230901', '07880867867', '0789xxxxxxx', '07946746291', '07973788240', '07xxxxxxxxx', '08', '0800', '08000407165', '08000776320', '08000839402', '08000930705', '08000938767', '08001950382']

In [33]:
# last 50 tokens
print(X_train_tokens[-50:])


['yet', 'yetty', 'yetunde', 'yhl', 'yifeng', 'yijue', 'ym', 'ymca', 'yo', 'yoga', 'yogasana', 'yor', 'yorge', 'you', 'youdoing', 'youi', 'young', 'younger', 'youphone', 'your', 'youre', 'yourinclusive', 'yourjob', 'yours', 'yourself', 'youwanna', 'yowifes', 'yr', 'yrs', 'ything', 'yummmm', 'yummy', 'yun', 'yunny', 'yuo', 'yuou', 'yup', 'yupz', 'zaher', 'zebra', 'zed', 'zeros', 'zhong', 'zindgi', 'zoe', 'zoom', 'zouk', 'zyada', 'ú1', '〨ud']

In [34]:
# view X_train_dtm as a dense matrix
X_train_dtm.toarray()


Out[34]:
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [35]:
# count how many times EACH token appears across ALL messages in X_train_dtm
import numpy as np
X_train_counts = np.sum(X_train_dtm.toarray(), axis=0)
X_train_counts


Out[35]:
array([ 7, 20,  1, ...,  1,  1,  1], dtype=int64)
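
An aside: the toarray() step above works fine at this scale, but for larger corpora you can sum the sparse matrix directly without densifying it. A minimal sketch:

In [ ]:
# a sketch: sum the sparse matrix without converting it to a dense array first;
# .sum(axis=0) returns a 1xN matrix, so convert and flatten it
X_train_counts = np.asarray(X_train_dtm.sum(axis=0)).flatten()
X_train_counts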

In [36]:
# create a DataFrame of tokens with their counts
pd.DataFrame({'token':X_train_tokens, 'count':X_train_counts}).sort_values(by='count', ascending=True)


Out[36]:
count token
3686 1 juan
4123 1 mailbox
4120 1 mahfuuz
4119 1 mahal
4117 1 magicalsongs
4115 1 maggi
4114 1 magazine
4111 1 madstini
4110 1 madoke
4109 1 madodu
4106 1 mad2
4105 1 mad1
4103 1 macs
4102 1 macleran
4100 1 machines
4098 1 macha
4124 1 mailed
4097 1 macedonia
4125 1 mails
4132 1 makiing
4162 1 marking
4159 1 marine
4156 1 marandratha
4155 1 maps
4152 1 manual
4151 1 manky
4150 1 maniac
4149 1 mango
4147 1 mandara
4146 1 mandan
... ... ...
1228 286 be
3445 291 if
3002 296 get
1080 296 at
7159 296 will
4724 303 or
2296 312 do
7060 318 we
5974 326 so
1525 339 but
4599 348 not
1579 355 can
1014 362 are
4691 385 on
4610 396 now
6479 442 that
1555 443 call
3222 451 have
4654 471 of
7342 534 your
2826 537 for
3587 549 it
4450 587 my
4200 624 me
3578 660 is
3482 682 in
929 741 and
6482 979 the
6588 1680 to
7336 1707 you

7373 rows × 2 columns

Bonus: Calculating the "spamminess" of each token


In [29]:
# create separate DataFrames for ham and spam
sms_ham = sms[sms.label==0] # ham
sms_spam = sms[sms.label==1] # spam

In [30]:
# learn the vocabulary of ALL messages and save it
# (note: this refits vect, replacing the training-set vocabulary fitted in Part 3)
vect.fit(sms.message)
all_tokens = vect.get_feature_names()

In [31]:
# create document-term matrices for ham and spam
ham_dtm = vect.transform(sms_ham.message)
spam_dtm = vect.transform(sms_spam.message)

In [32]:
ham_dtm.shape, spam_dtm.shape


Out[32]:
((4825, 8713), (747, 8713))

In [33]:
# count how many times EACH token appears across ALL ham messages
ham_counts = np.sum(ham_dtm.toarray(), axis=0)

In [34]:
ham_counts


Out[34]:
array([0, 0, 1, ..., 1, 0, 1])

In [35]:
# count how many times EACH token appears across ALL spam messages
spam_counts = np.sum(spam_dtm.toarray(), axis=0)

In [36]:
spam_counts


Out[36]:
array([10, 29,  0, ...,  0,  1,  0])

In [37]:
all_tokens[0:5]


Out[37]:
['00', '000', '000pes', '008704050406', '0089']

In [38]:
# create a DataFrame of tokens with their separate ham and spam counts
token_counts = pd.DataFrame({'token':all_tokens, 'ham':ham_counts, 'spam':spam_counts})

In [39]:
token_counts


Out[39]:
ham spam token
0 0 10 00
1 0 29 000
2 1 0 000pes
3 0 2 008704050406
4 0 1 0089
5 0 1 0121
6 0 1 01223585236
7 0 2 01223585334
8 1 0 0125698789
9 0 8 02
10 0 3 0207
11 0 1 02072069400
12 0 2 02073162414
13 0 1 02085076972
14 0 2 021
15 0 13 03
16 0 12 04
17 0 1 0430
18 0 5 05
19 0 2 050703
20 0 2 0578
21 0 8 06
22 0 2 07
23 0 1 07008009200
24 0 1 07046744435
25 0 1 07090201529
26 0 1 07090298926
27 0 1 07099833605
28 0 2 07123456789
29 0 1 0721072
... ... ... ...
8683 1 0 yowifes
8684 1 0 yoyyooo
8685 3 11 yr
8686 5 3 yrs
8687 1 0 ystrday
8688 1 0 ything
8689 1 0 yummmm
8690 3 0 yummy
8691 5 0 yun
8692 2 0 yunny
8693 4 0 yuo
8694 1 0 yuou
8695 43 0 yup
8696 1 0 yupz
8697 1 0 zac
8698 1 0 zaher
8699 1 0 zealand
8700 0 1 zebra
8701 0 6 zed
8702 1 0 zeros
8703 1 0 zhong
8704 2 0 zindgi
8705 1 1 zoe
8706 1 0 zogtorius
8707 1 0 zoom
8708 0 1 zouk
8709 1 0 zyada
8710 1 0 èn
8711 0 1 ú1
8712 1 0 〨ud

8713 rows × 3 columns


In [40]:
# add one to the ham and spam counts to avoid dividing by zero in the next step
# (the same add-one idea as Laplace smoothing)
token_counts['ham'] = token_counts.ham + 1
token_counts['spam'] = token_counts.spam + 1

In [41]:
# calculate ratio of spam-to-ham for each token
token_counts['spam_ratio'] = token_counts.spam / token_counts.ham
token_counts.sort_values(by='spam_ratio', ascending=False)


Out[41]:
ham spam token spam_ratio
2067 1 114 claim 114.000000
6113 1 94 prize 94.000000
352 1 72 150p 72.000000
7837 1 61 tone 61.000000
369 1 52 18 52.000000
3688 1 51 guaranteed 51.000000
617 1 45 500 45.000000
2371 1 45 cs 45.000000
299 1 42 1000 42.000000
1333 1 39 awarded 39.000000
8016 2 75 uk 37.500000
356 1 35 150ppm 35.000000
6525 1 33 ringtone 33.000000
8596 3 99 www 33.000000
1 1 30 000 30.000000
2150 1 27 collection 27.000000
2963 1 27 entry 27.000000
364 2 54 16 27.000000
7838 1 27 tones 27.000000
618 1 26 5000 26.000000
5117 1 26 mob 26.000000
8375 1 25 weekly 25.000000
309 1 25 10p 25.000000
8153 1 25 valid 25.000000
732 1 23 800 23.000000
5297 1 23 national 23.000000
1623 1 22 bonus 22.000000
735 1 22 8007 22.000000
6619 1 22 sae 22.000000
8248 1 22 vouchers 22.000000
... ... ... ... ...
3925 166 3 home 0.018072
2815 56 1 dun 0.017857
5533 115 2 oh 0.017391
5217 116 2 much 0.017241
5254 755 13 my 0.017219
1064 59 1 always 0.016949
7001 59 1 sleep 0.016949
3595 59 1 gonna 0.016949
3171 63 1 feel 0.015873
8394 63 1 went 0.015873
5371 63 1 nice 0.015873
3690 68 1 gud 0.014706
7099 70 1 something 0.014286
7463 72 1 sure 0.013889
4724 75 1 lol 0.013333
1142 77 1 anything 0.012987
2289 77 1 cos 0.012987
2163 231 3 come 0.012987
5167 80 1 morning 0.012500
2714 89 1 doing 0.011236
1084 89 1 amp 0.011236
1247 90 1 ask 0.011111
6626 90 1 said 0.011111
4550 136 1 later 0.007353
2428 151 1 da 0.006623
4747 163 1 lor 0.006135
6843 168 1 she 0.005952
3805 232 1 he 0.004310
4793 317 1 lt 0.003155
3684 319 1 gt 0.003135

8713 rows × 4 columns
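
Individual tokens can also be looked up directly. A minimal sketch for 'claim', the token at the top of the list:

In [ ]:
# a sketch: look up the spam ratio for a single token
token_counts.loc[token_counts.token == 'claim', 'spam_ratio']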


In [43]:
# observe messages that contain the word 'claim'
claim_messages = sms.message[sms.message.str.contains('claim')]

for message in claim_messages[0:5]:
    print(message, '\n')


WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only. 

Urgent UR awarded a complimentary trip to EuroDisinc Trav, Aco&Entry41 Or £1000. To claim txt DIS to 87121 18+6*£1.50(moreFrmMob. ShrAcomOrSglSuplt)10, LS1 3AJ 

You are a winner U have been specially selected 2 receive £1000 or a 4* holiday (flights inc) speak to a live operator 2 claim 0871277810910p/min (18+)  

PRIVATE! Your 2004 Account Statement for 07742676969 shows 786 unredeemed Bonus Points. To claim call 08719180248 Identifier Code: 45239 Expires 

Todays Voda numbers ending 7548 are selected to receive a $350 award. If you have a match please call 08712300220 quoting claim code 4041 standard rates app 

Part 5: Building a Naive Bayes model

We will use Multinomial Naive Bayes:

The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
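
As the quote notes, fractional counts such as tf-idf may also work in practice. A minimal sketch of that variation (an optional aside; the rest of this section sticks with raw counts):

In [ ]:
# a sketch: the tf-idf variation mentioned above (not used below)
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_train_tfidf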


In [37]:
# train a Naive Bayes model using X_train_dtm
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)


Out[37]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
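
As an aside, the fitted model exposes the smoothed log-probabilities it learned for each token in each class. A minimal sketch pairing them with X_train_tokens, saved in Part 4 (note that vect itself was refit on all messages in the bonus section, so its current vocabulary no longer matches X_train_dtm):

In [ ]:
# a sketch: inspect the per-class log-probabilities the model learned
# feature_log_prob_ has one row per class (row 0 = ham, row 1 = spam)
ham_log_prob, spam_log_prob = nb.feature_log_prob_
token_probs = pd.DataFrame({'token': X_train_tokens,
                            'spam_minus_ham': spam_log_prob - ham_log_prob})
token_probs.sort_values(by='spam_minus_ham', ascending=False).head()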

In [38]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

In [39]:
# calculate accuracy of class predictions
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))


0.988513998564

In [41]:
print(metrics.classification_report(y_test, y_pred_class))


             precision    recall  f1-score   support

          0       0.99      1.00      0.99      1206
          1       0.98      0.94      0.96       187

avg / total       0.99      0.99      0.99      1393


In [43]:
metrics.confusion_matrix(y_test, y_pred_class)


Out[43]:
array([[1202,    4],
       [  12,  175]])

In [47]:
# view the documentation for confusion_matrix
?metrics.confusion_matrix

In [48]:
# confusion matrix: rows are actual classes (0=ham, 1=spam), columns are predicted classes
print(metrics.confusion_matrix(y_test, y_pred_class))


[[1202    4]
 [  12  175]]

In [49]:
# predict probabilities (note: Naive Bayes probabilities are poorly calibrated,
# pushed toward 0 and 1, because of the conditional independence assumption)
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob


Out[49]:
array([  7.82887542e-08,   3.02868734e-08,   1.38606514e-11, ...,
         1.00000000e+00,   1.00000000e+00,   2.62417931e-06])

In [50]:
# calculate AUC
print(metrics.roc_auc_score(y_test, y_pred_prob))


0.987471288832
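
If better-calibrated probabilities are needed, one option (an optional aside, not part of the original flow) is scikit-learn's CalibratedClassifierCV. A minimal sketch:

In [ ]:
# a sketch: wrap the model in cross-validated probability calibration
from sklearn.calibration import CalibratedClassifierCV
nb_calibrated = CalibratedClassifierCV(MultinomialNB(), method='isotonic', cv=3)
nb_calibrated.fit(X_train_dtm, y_train)
y_pred_prob_cal = nb_calibrated.predict_proba(X_test_dtm)[:, 1]
# log loss penalizes overconfident predictions, so it reflects calibration
print(metrics.log_loss(y_test, y_pred_prob), metrics.log_loss(y_test, y_pred_prob_cal))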

In [51]:
# print message text for the false positives
# (y_test < y_pred_class is True only where actual=0 but predicted=1)
X_test[y_test < y_pred_class]


Out[51]:
5475    Dhoni have luck to win some big title.so we wi...
2173     Yavnt tried yet and never played original either
4557                              Gettin rdy to ship comp
4382               Mathews or tait or edwards or anderson
Name: message, dtype: object

In [52]:
# print message text for the false negatives
# (y_test > y_pred_class is True only where actual=1 but predicted=0)
X_test[y_test > y_pred_class]


Out[52]:
4213    Missed call alert. These numbers called but le...
3360    Sorry I missed your call let's talk when you h...
2575    Your next amazing xxx PICSFREE1 video will be ...
788     Ever thought about living a good life with a p...
5370    dating:i have had two of these. Only started a...
3530    Xmas & New Years Eve tickets are now on sale f...
2352    Download as many ringtones as u like no restri...
3742                                        2/2 146tf150p
2558    This message is brought to you by GMW Ltd. and...
4144    In The Simpsons Movie released in July 2007 na...
955             Filthy stories and GIRLS waiting for your
1638    0A$NETWORKS allow companies to bill for SMS, s...
Name: message, dtype: object

In [ ]:
# what do you notice about the false negatives?
# X_test[3132]

Part 6: Comparing Naive Bayes with logistic regression


In [ ]:
# create a logistic regression model
# import/instantiate/fit

In [ ]:
# class predictions and predicted probabilities

In [ ]:
# calculate accuracy and AUC
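
One possible solution sketch for the three cells above (assuming the cells from Parts 3-5 have been run):

In [ ]:
# one possible solution (a sketch)
# import/instantiate/fit
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)

# class predictions and predicted probabilities
y_pred_class_lr = logreg.predict(X_test_dtm)
y_pred_prob_lr = logreg.predict_proba(X_test_dtm)[:, 1]

# calculate accuracy and AUC
print(metrics.accuracy_score(y_test, y_pred_class_lr))
print(metrics.roc_auc_score(y_test, y_pred_prob_lr))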