# Working with Text Data and Naive Bayes in scikit-learn

## Agenda

Working with text data

• Representing text as data
• Reading SMS data
• Vectorizing SMS data
• Examining the tokens and their counts
• Bonus: Calculating the "spamminess" of each token

Naive Bayes classification

• Building a Naive Bayes model
• Comparing Naive Bayes with logistic regression

## Part 1: Representing text as data

From the scikit-learn documentation:

Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length.

We will use CountVectorizer to "convert text into a matrix of token counts":

``````

In [5]:

from sklearn.feature_extraction.text import CountVectorizer

``````
``````

In [6]:

# start with a simple example
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!', 'help']

``````
``````

In [10]:

# learn the 'vocabulary' of the training data
vect = CountVectorizer()
vect.fit(simple_train)
# vect.get_feature_names()
vect.vocabulary_

``````
``````

Out[10]:

{'cab': 0, 'call': 1, 'help': 2, 'me': 3, 'please': 4, 'tonight': 5, 'you': 6}

``````
``````

In [11]:

# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm

``````
``````

Out[11]:

<4x7 sparse matrix of type '<class 'numpy.int64'>'
with 10 stored elements in Compressed Sparse Row format>

``````
``````

In [6]:

# print the sparse matrix
print(simple_train_dtm)

``````
``````

(0, 1)	1
(0, 5)	1
(0, 6)	1
(1, 0)	1
(1, 1)	1
(1, 3)	1
(2, 1)	1
(2, 3)	1
(2, 4)	2
(3, 2)	1

``````
``````

In [14]:

# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()

``````
``````

Out[14]:

array([[0, 1, 0, 0, 0, 1, 1],
       [1, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 1, 2, 0, 0],
       [0, 0, 1, 0, 0, 0, 0]])

``````
``````

In [16]:

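# examine the vocabulary and document-term matrix together
# note: scikit-learn 1.0+ provides get_feature_names_out();
# get_feature_names() was removed in 1.2
import pandas as pd
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())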

``````
``````

Out[16]:

   cab  call  help  me  please  tonight  you
0    0     1     0   0       0        1    1
1    1     1     0   1       0        0    0
2    0     1     0   1       2        0    0
3    0     0     1   0       0        0    0

``````
``````

In [9]:

# create a document-term matrix on your own
simple_train = ["call call Sorry, Ill later",
"K Did you me call ah just now",
"I call you later, don't have network. If urgnt, sms me"]

``````
``````

In [10]:

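# complete your work below
# instantiate vectorizer (binary=True records presence/absence instead of counts,
# so the repeated 'call' in the first message is stored as 1)
vec2 = CountVectorizer(binary=True)
# fit
vec2.fit(simple_train)
# transform
my_dtm2 = vec2.transform(simple_train)

# convert to dense matrix and label the columns with the vocabulary
pd.DataFrame(my_dtm2.toarray(), columns=vec2.get_feature_names())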

``````
``````

Out[10]:

   ah  call  did  don  have  if  ill  just  later  me  network  now  sms  sorry  urgnt  you
0   0     1    0    0     0   0    1     0      1   0        0    0    0      1      0    0
1   1     1    1    0     0   0    0     1      0   1        0    1    0      0      0    1
2   0     1    0    1     1   1    0     0      1   1        1    0    1      0      1    1

``````

From the scikit-learn documentation:

In this scheme, features and samples are defined as follows:

• Each individual token occurrence frequency (normalized or not) is treated as a feature.
• The vector of all the token frequencies for a given document is considered a multivariate sample.

A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.

We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
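To make the "ignoring relative position" point concrete, here is a minimal sketch in plain Python. It is not scikit-learn's implementation: the helper name `bag_of_words` and the two example messages are made up for illustration. The regular expression is `CountVectorizer`'s documented default `token_pattern`, which keeps tokens of two or more word characters.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text and count tokens of two or more word characters."""
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    return Counter(tokens)

# two hypothetical messages with the same words in a different order
first = bag_of_words("call me tonight please")
second = bag_of_words("please tonight call me")

# word order is discarded, so both messages map to the same bag
print(first == second)
print(sorted(first.items()))
```

Because only the counts survive, any model trained on this representation cannot tell the two messages apart.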

``````

In [10]:

vect.get_feature_names()

``````
``````

Out[10]:

['cab', 'call', 'help', 'me', 'please', 'tonight', 'you']

``````
``````

In [11]:

# transform testing data into a document-term matrix (using existing vocabulary)
simple_test = ["please don't call me devon"]
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()

``````
``````

Out[11]:

array([[0, 1, 0, 1, 1, 0, 0]])

``````
``````

In [12]:

# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())

``````
``````

Out[12]:

   cab  call  help  me  please  tonight  you
0    0     1     0   1       1        0    0

``````

Summary:

• `vect.fit(train)` learns the vocabulary of the training data
• `vect.transform(train)` uses the fitted vocabulary to build a document-term matrix from the training data
• `vect.transform(test)` uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn't seen before)
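The three steps in the summary can be sketched in plain Python. This is an illustrative approximation of the default behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary), not scikit-learn's actual code. The function names `tokenize`, `fit`, and `transform` are hypothetical stand-ins.

```python
import re

# CountVectorizer's documented default token pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

def fit(docs):
    """Learn the vocabulary: map each distinct token to a column index."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Build one row of counts per document; unseen tokens are skipped."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in tokenize(doc):
            if tok in vocabulary:
                row[vocabulary[tok]] += 1
        rows.append(row)
    return rows

train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!', 'help']
vocab = fit(train)
train_rows = transform(train, vocab)
test_rows = transform(["please don't call me devon"], vocab)

print(vocab)
print(test_rows)  # 'don' and 'devon' are not in the vocabulary, so they are dropped
```

On this small example the sketch reproduces the vocabulary and the test row shown in the outputs above.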

## Part 2: Reading SMS data

``````

In [13]:

# read tab-separated file
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/sms.tsv'
col_names = ['label', 'message']
sms = pd.read_table(url, sep='\t', header=None, names=col_names)
print(sms.shape)

``````
``````

(5572, 2)

``````
``````

In [14]:

sms.head(5)

``````
``````

Out[14]:

   label  message
0    ham  Go until jurong point, crazy.. Available only ...
1    ham  Ok lar... Joking wif u oni...
2   spam  Free entry in 2 a wkly comp to win FA Cup fina...
3    ham  U dun say so early hor... U c already then say...
4    ham  Nah I don't think he goes to usf, he lives aro...

``````
``````

In [15]:

sms.label.value_counts()

``````
``````

Out[15]:

ham     4825
spam     747
Name: label, dtype: int64

``````
``````

In [16]:

# convert label to a numeric variable
sms['label'] = sms.label.map({'ham':0, 'spam':1})

``````
``````

In [17]:

# define X and y
X = sms.message
y = sms.label

``````
``````

In [21]:

# split into training and testing sets
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in 0.20)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

``````
``````

(4179,) (4179,)
(1393,) (1393,)

``````

## Part 3: Vectorizing SMS data

``````

In [27]:

# instantiate the vectorizer
vect = CountVectorizer()

``````
``````

In [28]:

# learn training data vocabulary, then create document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_train_dtm

``````
``````

Out[28]:

<4179x7373 sparse matrix of type '<class 'numpy.int64'>'
with 55328 stored elements in Compressed Sparse Row format>

``````
``````

In [29]:

# alternative: combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm

``````
``````

Out[29]:

<4179x7373 sparse matrix of type '<class 'numpy.int64'>'
with 55328 stored elements in Compressed Sparse Row format>

``````
``````

In [30]:

# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm

``````
``````

Out[30]:

<1393x7373 sparse matrix of type '<class 'numpy.int64'>'
with 17393 stored elements in Compressed Sparse Row format>

``````
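The outputs above report the matrix shape and the number of stored elements, which is enough to see why a sparse format is used. The short calculation below uses only those reported figures. The 8 bytes per cell assumes a dense `int64` array, matching the dtype shown in the output.

```python
# shape and stored-element count reported for X_train_dtm above
n_rows, n_cols = 4179, 7373
n_stored = 55328

n_cells = n_rows * n_cols
density = n_stored / n_cells

# a dense int64 array needs 8 bytes per cell
dense_bytes = n_cells * 8

print(f"cells: {n_cells:,}")
print(f"non-zero share: {density:.4%}")
print(f"dense int64 size: {dense_bytes / 1e6:.0f} MB")
```

Fewer than 0.2% of the cells are non-zero, so calling `.toarray()` on a matrix of this size is affordable here but would scale poorly on a larger corpus.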

## Part 4: Examining the tokens and their counts

``````

In [31]:

# store token names
X_train_tokens = vect.get_feature_names()

``````
``````

In [32]:

# first 50 tokens
print(X_train_tokens[:50])

``````
``````

['00', '000', '000pes', '008704050406', '0089', '01223585236', '01223585334', '0125698789', '02', '0207', '02073162414', '02085076972', '021', '03', '04', '0430', '05', '050703', '0578', '06', '07', '07046744435', '07090298926', '07099833605', '07123456789', '0721072', '07732584351', '07734396839', '07742676969', '07753741225', '0776xxxxxxx', '07781482378', '07786200117', '077xxx', '07808', '07808247860', '07821230901', '07880867867', '0789xxxxxxx', '07946746291', '07973788240', '07xxxxxxxxx', '08', '0800', '08000407165', '08000776320', '08000839402', '08000930705', '08000938767', '08001950382']

``````
``````

In [33]:

# last 50 tokens
print(X_train_tokens[-50:])

``````
``````

['yet', 'yetty', 'yetunde', 'yhl', 'yifeng', 'yijue', 'ym', 'ymca', 'yo', 'yoga', 'yogasana', 'yor', 'yorge', 'you', 'youdoing', 'youi', 'young', 'younger', 'youphone', 'your', 'youre', 'yourinclusive', 'yourjob', 'yours', 'yourself', 'youwanna', 'yowifes', 'yr', 'yrs', 'ything', 'yummmm', 'yummy', 'yun', 'yunny', 'yuo', 'yuou', 'yup', 'yupz', 'zaher', 'zebra', 'zed', 'zeros', 'zhong', 'zindgi', 'zoe', 'zoom', 'zouk', 'zyada', 'ú1', '〨ud']

``````
``````

In [34]:

# view X_train_dtm as a dense matrix
X_train_dtm.toarray()

``````
``````

Out[34]:

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

``````
``````

In [35]:

# count how many times EACH token appears across ALL messages in X_train_dtm
import numpy as np
X_train_counts = np.sum(X_train_dtm.toarray(), axis=0)
X_train_counts

``````
``````

Out[35]:

array([ 7, 20,  1, ...,  1,  1,  1], dtype=int64)

``````
``````

In [36]:

# create a DataFrame of tokens with their counts
pd.DataFrame({'token':X_train_tokens, 'count':X_train_counts}).sort_values(by='count', ascending=True)

``````
``````

Out[36]:

      count  token
3686      1  juan
4123      1  mailbox
4120      1  mahfuuz
4119      1  mahal
4117      1  magicalsongs
4115      1  maggi
4114      1  magazine
4111      1  madstini
4110      1  madoke
4109      1  madodu
4106      1  mad2
4105      1  mad1
4103      1  macs
4102      1  macleran
4100      1  machines
4098      1  macha
4124      1  mailed
4097      1  macedonia
4125      1  mails
4132      1  makiing
4162      1  marking
4159      1  marine
4156      1  marandratha
4155      1  maps
4152      1  manual
4151      1  manky
4150      1  maniac
4149      1  mango
4147      1  mandara
4146      1  mandan
...     ...  ...
1228    286  be
3445    291  if
3002    296  get
1080    296  at
7159    296  will
4724    303  or
2296    312  do
7060    318  we
5974    326  so
1525    339  but
4599    348  not
1579    355  can
1014    362  are
4691    385  on
4610    396  now
6479    442  that
1555    443  call
3222    451  have
4654    471  of
7342    534  your
2826    537  for
3587    549  it
4450    587  my
4200    624  me
3578    660  is
3482    682  in
929     741  and
6482    979  the
6588   1680  to
7336   1707  you

7373 rows × 2 columns

``````

## Bonus: Calculating the "spamminess" of each token

``````

In [29]:

# create separate DataFrames for ham and spam
sms_ham = sms[sms.label==0] # ham
sms_spam = sms[sms.label==1] # spam

``````
``````

In [30]:

# learn the vocabulary of ALL messages and save it
vect.fit(sms.message)
all_tokens = vect.get_feature_names()

``````
``````

In [31]:

# create document-term matrices for ham and spam
ham_dtm = vect.transform(sms_ham.message)
spam_dtm = vect.transform(sms_spam.message)

``````
``````

In [32]:

ham_dtm.shape, spam_dtm.shape

``````
``````

Out[32]:

((4825, 8713), (747, 8713))

``````
``````

In [33]:

# count how many times EACH token appears across ALL ham messages
ham_counts = np.sum(ham_dtm.toarray(), axis=0)

``````
``````

In [34]:

ham_counts

``````
``````

Out[34]:

array([0, 0, 1, ..., 1, 0, 1])

``````
``````

In [35]:

# count how many times EACH token appears across ALL spam messages
spam_counts = np.sum(spam_dtm.toarray(), axis=0)

``````
``````

In [36]:

spam_counts

``````
``````

Out[36]:

array([10, 29,  0, ...,  0,  1,  0])

``````
``````

In [37]:

all_tokens[0:5]

``````
``````

Out[37]:

['00', '000', '000pes', '008704050406', '0089']

``````
``````

In [38]:

# create a DataFrame of tokens with their separate ham and spam counts
token_counts = pd.DataFrame({'token':all_tokens, 'ham':ham_counts, 'spam':spam_counts})

``````
``````

In [39]:

token_counts

``````
``````

Out[39]:

      ham  spam  token
0       0    10  00
1       0    29  000
2       1     0  000pes
3       0     2  008704050406
4       0     1  0089
5       0     1  0121
6       0     1  01223585236
7       0     2  01223585334
8       1     0  0125698789
9       0     8  02
10      0     3  0207
11      0     1  02072069400
12      0     2  02073162414
13      0     1  02085076972
14      0     2  021
15      0    13  03
16      0    12  04
17      0     1  0430
18      0     5  05
19      0     2  050703
20      0     2  0578
21      0     8  06
22      0     2  07
23      0     1  07008009200
24      0     1  07046744435
25      0     1  07090201529
26      0     1  07090298926
27      0     1  07099833605
28      0     2  07123456789
29      0     1  0721072
...   ...   ...  ...
8683    1     0  yowifes
8684    1     0  yoyyooo
8685    3    11  yr
8686    5     3  yrs
8687    1     0  ystrday
8688    1     0  ything
8689    1     0  yummmm
8690    3     0  yummy
8691    5     0  yun
8692    2     0  yunny
8693    4     0  yuo
8694    1     0  yuou
8695   43     0  yup
8696    1     0  yupz
8697    1     0  zac
8698    1     0  zaher
8699    1     0  zealand
8700    0     1  zebra
8701    0     6  zed
8702    1     0  zeros
8703    1     0  zhong
8704    2     0  zindgi
8705    1     1  zoe
8706    1     0  zogtorius
8707    1     0  zoom
8708    0     1  zouk
8709    1     0  zyada
8710    1     0  èn
8711    0     1  ú1
8712    1     0  〨ud

8713 rows × 3 columns

``````
``````

In [40]:

# add one to ham and spam counts to avoid dividing by zero (in the step that follows)
token_counts['ham'] = token_counts.ham + 1
token_counts['spam'] = token_counts.spam + 1

``````
``````

In [41]:

# calculate ratio of spam-to-ham for each token
token_counts['spam_ratio'] = token_counts.spam / token_counts.ham
token_counts.sort_values(by='spam_ratio', ascending=False)

``````
``````

Out[41]:

      ham  spam  token       spam_ratio
2067    1   114  claim       114.000000
6113    1    94  prize        94.000000
352     1    72  150p         72.000000
7837    1    61  tone         61.000000
369     1    52  18           52.000000
3688    1    51  guaranteed   51.000000
617     1    45  500          45.000000
2371    1    45  cs           45.000000
299     1    42  1000         42.000000
1333    1    39  awarded      39.000000
8016    2    75  uk           37.500000
356     1    35  150ppm       35.000000
6525    1    33  ringtone     33.000000
8596    3    99  www          33.000000
1       1    30  000          30.000000
2150    1    27  collection   27.000000
2963    1    27  entry        27.000000
364     2    54  16           27.000000
7838    1    27  tones        27.000000
618     1    26  5000         26.000000
5117    1    26  mob          26.000000
8375    1    25  weekly       25.000000
309     1    25  10p          25.000000
8153    1    25  valid        25.000000
732     1    23  800          23.000000
5297    1    23  national     23.000000
1623    1    22  bonus        22.000000
735     1    22  8007         22.000000
6619    1    22  sae          22.000000
8248    1    22  vouchers     22.000000
...   ...   ...  ...                ...
3925  166     3  home          0.018072
2815   56     1  dun           0.017857
5533  115     2  oh            0.017391
5217  116     2  much          0.017241
5254  755    13  my            0.017219
1064   59     1  always        0.016949
7001   59     1  sleep         0.016949
3595   59     1  gonna         0.016949
3171   63     1  feel          0.015873
8394   63     1  went          0.015873
5371   63     1  nice          0.015873
3690   68     1  gud           0.014706
7099   70     1  something     0.014286
7463   72     1  sure          0.013889
4724   75     1  lol           0.013333
1142   77     1  anything      0.012987
2289   77     1  cos           0.012987
2163  231     3  come          0.012987
5167   80     1  morning       0.012500
2714   89     1  doing         0.011236
1084   89     1  amp           0.011236
1247   90     1  ask           0.011111
6626   90     1  said          0.011111
4550  136     1  later         0.007353
2428  151     1  da            0.006623
4747  163     1  lor           0.006135
6843  168     1  she           0.005952
3805  232     1  he            0.004310
4793  317     1  lt            0.003155
3684  319     1  gt            0.003135

8713 rows × 4 columns

``````
``````

In [43]:

# observe messages that contain the string 'claim' (no filtering by label)
claim_messages = sms.message[sms.message.str.contains('claim')]

for message in claim_messages[0:5]:
    print(message, '\n')

``````
``````

WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.

Urgent UR awarded a complimentary trip to EuroDisinc Trav, Aco&Entry41 Or £1000. To claim txt DIS to 87121 18+6*£1.50(moreFrmMob. ShrAcomOrSglSuplt)10, LS1 3AJ

You are a winner U have been specially selected 2 receive £1000 or a 4* holiday (flights inc) speak to a live operator 2 claim 0871277810910p/min (18+)

PRIVATE! Your 2004 Account Statement for 07742676969 shows 786 unredeemed Bonus Points. To claim call 08719180248 Identifier Code: 45239 Expires

Todays Voda numbers ending 7548 are selected to receive a $350 award. If you have a match please call 08712300220 quoting claim code 4041 standard rates app

``````
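The add-one and ratio steps in this section can be checked by hand on a toy corpus. The sketch below uses made-up messages and plain Python counters rather than `CountVectorizer`. The `normalize` option is an addition not present in the cells above: it divides each count by the number of messages in its class, one possible way to account for ham outnumbering spam.

```python
from collections import Counter

# hypothetical toy corpus: (label, message) pairs, 1 = spam, 0 = ham
messages = [
    (1, "claim your prize now"),
    (1, "claim free prize"),
    (0, "call me later"),
    (0, "see you later"),
    (0, "claim the seat later"),
]

ham_counts, spam_counts = Counter(), Counter()
for label, text in messages:
    target = spam_counts if label == 1 else ham_counts
    target.update(text.split())

n_ham = sum(1 for label, _ in messages if label == 0)
n_spam = sum(1 for label, _ in messages if label == 1)

def spam_ratio(token, normalize=False):
    """Add-one smoothed spam-to-ham ratio for a token."""
    ham = ham_counts[token] + 1
    spam = spam_counts[token] + 1
    if normalize:
        # assumption: scale by class size so the larger class does not dominate
        return (spam / n_spam) / (ham / n_ham)
    return spam / ham

for token in ["prize", "claim", "later"]:
    print(token, spam_ratio(token), spam_ratio(token, normalize=True))
```

A token seen only in spam ("prize") gets a ratio above 1, and a token seen only in ham ("later") gets a ratio below 1. The added one keeps both well defined.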

## Part 5: Building a Naive Bayes model

We will use Multinomial Naive Bayes:

The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
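As a rough illustration of what a multinomial Naive Bayes classifier computes, here is a from-scratch sketch in plain Python. It is a simplified stand-in, not scikit-learn's implementation: the function names and the toy messages are hypothetical, tokens are split on whitespace, and `alpha=1.0` plays the role of the add-one (Laplace) smoothing.

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log token likelihoods per class."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    classes = sorted(set(labels))
    log_prior, log_likelihood = {}, {}
    for c in classes:
        class_docs = [doc for doc, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(tok for doc in class_docs for tok in doc.split())
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((counts[tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior (up to a constant)."""
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for tok in doc.split():
            if tok in log_likelihood[c]:  # ignore tokens unseen in training
                score += log_likelihood[c][tok]
        scores[c] = score
    return max(scores, key=scores.get)

# hypothetical toy data: 1 = spam, 0 = ham
docs = ["win prize now", "claim prize", "call me later", "see you later"]
labels = [1, 1, 0, 0]
model = train_multinomial_nb(docs, labels)

print(predict("claim your prize", *model))
print(predict("call you later", *model))
```

Working in log space avoids underflow when many small probabilities are multiplied. The "naive" part is the assumption that tokens are independent given the class.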

``````

In [37]:

# train a Naive Bayes model using X_train_dtm
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)

``````
``````

Out[37]:

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

``````
``````

In [38]:

# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

``````
``````

In [39]:

# calculate accuracy of class predictions
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

``````
``````

0.988513998564

``````
``````

In [41]:

print(metrics.classification_report(y_test, y_pred_class))

``````
``````

             precision    recall  f1-score   support

          0       0.99      1.00      0.99      1206
          1       0.98      0.94      0.96       187

avg / total       0.99      0.99      0.99      1393

``````
``````

In [43]:

metrics.confusion_matrix(y_test, y_pred_class)

``````
``````

Out[43]:

array([[1202,    4],
       [  12,  175]])

``````
``````

In [47]:

?metrics.confusion_matrix

``````
``````

In [48]:

# confusion matrix
print(metrics.confusion_matrix(y_test, y_pred_class))

``````
``````

[[1202    4]
 [  12  175]]

``````
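The accuracy, precision, and recall reported earlier can be recovered from the four cells of the confusion matrix. The calculation below relies on scikit-learn's convention that rows are actual classes and columns are predicted classes. The variable names are chosen here for illustration.

```python
# confusion matrix reported above: rows = actual, columns = predicted
# (class 0 = ham, class 1 = spam)
tn, fp = 1202, 4
fn, tp = 12, 175

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # of messages flagged as spam, the share that is spam
recall = tp / (tp + fn)     # of actual spam, the share that was caught

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.2f}")
print(f"recall:    {recall:.2f}")
```

The 4 false positives and 12 false negatives are the messages listed in the cells that follow.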
``````

In [49]:

# predict (poorly calibrated) probabilities
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

``````
``````

Out[49]:

array([  7.82887542e-08,   3.02868734e-08,   1.38606514e-11, ...,
         1.00000000e+00,   1.00000000e+00,   2.62417931e-06])

``````
``````

In [50]:

# calculate AUC
print(metrics.roc_auc_score(y_test, y_pred_prob))

``````
``````

0.987471288832

``````
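AUC can be read as the probability that a randomly chosen spam message receives a higher score than a randomly chosen ham message. That is why poorly calibrated probabilities are still usable here: only the ranking matters. The sketch below checks this interpretation by brute force on made-up labels and scores. `auc_by_pairs` is a hypothetical helper, not a scikit-learn function.

```python
def auc_by_pairs(y_true, y_score):
    """Share of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win. The cost is O(n_pos * n_neg), fine for a toy example.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# hypothetical labels and predicted spam probabilities
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.05, 0.90]

toy_auc = auc_by_pairs(y_true, y_score)
print(toy_auc)
```

Here 8 of the 9 spam/ham pairs are ordered correctly. The one miss is the spam message scored 0.35 against the ham message scored 0.40.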
``````

In [51]:

# print message text for the false positives
X_test[y_test < y_pred_class]

``````
``````

Out[51]:

5475    Dhoni have luck to win some big title.so we wi...
2173     Yavnt tried yet and never played original either
4557                              Gettin rdy to ship comp
4382               Mathews or tait or edwards or anderson
Name: message, dtype: object

``````
``````

In [52]:

# print message text for the false negatives
X_test[y_test > y_pred_class]

``````
``````

Out[52]:

4213    Missed call alert. These numbers called but le...
3360    Sorry I missed your call let's talk when you h...
2575    Your next amazing xxx PICSFREE1 video will be ...
788     Ever thought about living a good life with a p...
5370    dating:i have had two of these. Only started a...
3530    Xmas & New Years Eve tickets are now on sale f...
2352    Download as many ringtones as u like no restri...
3742                                        2/2 146tf150p
2558    This message is brought to you by GMW Ltd. and...
4144    In The Simpsons Movie released in July 2007 na...
955             Filthy stories and GIRLS waiting for your
1638    0A$NETWORKS allow companies to bill for SMS, s...
Name: message, dtype: object

``````
``````

In [ ]:

# what do you notice about the false negatives?
# X_test[3132]

``````

## Part 6: Comparing Naive Bayes with logistic regression

``````

In [ ]:

# create a logistic regression model
# import/instantiate/fit

``````
``````

In [ ]:

# class predictions and predicted probabilities

``````
``````

In [ ]:

# calculate accuracy and AUC

``````
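One possible shape for the comparison is sketched below on a tiny made-up dataset, so it does not give away the results on the SMS data. It assumes scikit-learn is installed and uses `LogisticRegression` with its default settings. The toy messages and variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# hypothetical toy data standing in for the SMS split (1 = spam, 0 = ham)
train_text = ["win a prize now", "claim your free prize", "free entry claim now",
              "call me later", "see you tonight", "are you home yet"]
train_y = [1, 1, 1, 0, 0, 0]
test_text = ["claim a free prize", "call you later tonight"]
test_y = [1, 0]

# same fit-on-train, transform-on-test pattern as in Part 3
toy_vect = CountVectorizer()
train_dtm = toy_vect.fit_transform(train_text)
test_dtm = toy_vect.transform(test_text)

results = {}
for name, model in [("naive bayes", MultinomialNB()),
                    ("logistic regression", LogisticRegression())]:
    model.fit(train_dtm, train_y)
    pred_class = model.predict(test_dtm)
    pred_prob = model.predict_proba(test_dtm)[:, 1]
    results[name] = (metrics.accuracy_score(test_y, pred_class),
                     metrics.roc_auc_score(test_y, pred_prob))

for name, (acc, auc) in results.items():
    print(name, acc, auc)
```

For the exercise itself, the same loop body would be applied to `X_train_dtm`, `y_train`, `X_test_dtm`, and `y_test` from the earlier parts.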