Naive Bayes classification example from the book "Principles of Data Science"

P(spam|<sentence>) = P(<sentence>|spam) * P(spam) / P(<sentence>)
P(non-spam|<sentence>) = P(<sentence>|non-spam) * P(non-spam) / P(<sentence>)
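
Both posteriors share the same denominator, so deciding between the classes only requires comparing the numerators. The "naive" part is the assumption that the words of the sentence are conditionally independent given the class, which lets the likelihood factor into one term per word. A sketch of that reasoning, where w_1, ..., w_n denote the words of the sentence (notation introduced here for illustration):

```latex
\begin{aligned}
P(\text{spam} \mid w_1, \dots, w_n) &\propto P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam}) \\
P(\text{non-spam} \mid w_1, \dots, w_n) &\propto P(\text{non-spam}) \prod_{i=1}^{n} P(w_i \mid \text{non-spam})
\end{aligned}
```

The sentence is assigned to whichever class gives the larger right-hand side.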

In [1]:
import pandas as pd
import sklearn

In [3]:
df = pd.read_table('https://raw.githubusercontent.com/sinanuozdemir/sfdat22/master/data/sms.tsv', sep='\t', header=None, names=['label', 'msg'])
df


Out[3]:
label msg
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
5 spam FreeMsg Hey there darling it's been 3 week's n...
6 ham Even my brother is not like to speak with me. ...
7 ham As per your request 'Melle Melle (Oru Minnamin...
8 spam WINNER!! As a valued network customer you have...
9 spam Had your mobile 11 months or more? U R entitle...
10 ham I'm gonna be home soon and i don't want to tal...
11 spam SIX chances to win CASH! From 100 to 20,000 po...
12 spam URGENT! You have won a 1 week FREE membership ...
13 ham I've been searching for the right words to tha...
14 ham I HAVE A DATE ON SUNDAY WITH WILL!!
15 spam XXXMobileMovieClub: To use your credit, click ...
16 ham Oh k...i'm watching here:)
17 ham Eh u remember how 2 spell his name... Yes i di...
18 ham Fine if that’s the way u feel. That’s the way ...
19 spam England v Macedonia - dont miss the goals/team...
20 ham Is that seriously how you spell his name?
21 ham I‘m going to try for 2 months ha ha only joking
22 ham So ü pay first lar... Then when is da stock co...
23 ham Aft i finish my lunch then i go str down lor. ...
24 ham Ffffffffff. Alright no way I can meet up with ...
25 ham Just forced myself to eat a slice. I'm really ...
26 ham Lol your always so convincing.
27 ham Did you catch the bus ? Are you frying an egg ...
28 ham I'm back &amp; we're packing the car now, I'll...
29 ham Ahhh. Work. I vaguely remember that! What does...
... ... ...
5542 ham Armand says get your ass over to epsilon
5543 ham U still havent got urself a jacket ah?
5544 ham I'm taking derek &amp; taylor to walmart, if I...
5545 ham Hi its in durban are you still on this number
5546 ham Ic. There are a lotta childporn cars then.
5547 spam Had your contract mobile 11 Mnths? Latest Moto...
5548 ham No, I was trying it all weekend ;V
5549 ham You know, wot people wear. T shirts, jumpers, ...
5550 ham Cool, what time you think you can get here?
5551 ham Wen did you get so spiritual and deep. That's ...
5552 ham Have a safe trip to Nigeria. Wish you happines...
5553 ham Hahaha..use your brain dear
5554 ham Well keep in mind I've only got enough gas for...
5555 ham Yeh. Indians was nice. Tho it did kane me off ...
5556 ham Yes i have. So that's why u texted. Pshew...mi...
5557 ham No. I meant the calculation is the same. That ...
5558 ham Sorry, I'll call later
5559 ham if you aren't here in the next &lt;#&gt; hou...
5560 ham Anything lor. Juz both of us lor.
5561 ham Get me out of this dump heap. My mom decided t...
5562 ham Ok lor... Sony ericsson salesman... I ask shuh...
5563 ham Ard 6 like dat lor.
5564 ham Why don't you wait 'til at least wednesday to ...
5565 ham Huh y lei...
5566 spam REMINDER FROM O2: To get 2.50 pounds free call...
5567 spam This is the 2nd time we have tried 2 contact u...
5568 ham Will ü b going to esplanade fr home?
5569 ham Pity, * was in mood for that. So...any other s...
5570 ham The guy did some bitching but I acted like i'd...
5571 ham Rofl. Its true to its name

5572 rows × 2 columns


In [4]:
df.label.value_counts()


Out[4]:
ham     4825
spam     747
Name: label, dtype: int64

In [13]:
value_probability = df.label.value_counts()/len(df)
spam_probability = value_probability.spam
ham_probability = value_probability.ham
print('spam probability: {}, ham probability: {}'.format(spam_probability, ham_probability))


spam probability: 0.13406317300789664, ham probability: 0.8659368269921034
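
The same class priors can also be obtained in one step with `value_counts(normalize=True)`. A minimal sketch on a small made-up label column (the `labels` series is an illustrative stand-in, not the SMS data):

```python
import pandas as pd

# Hypothetical toy labels standing in for df.label
labels = pd.Series(['ham', 'ham', 'ham', 'spam'])

# normalize=True returns relative frequencies instead of raw counts
priors = labels.value_counts(normalize=True)
print('spam probability: {}, ham probability: {}'.format(priors['spam'], priors['ham']))
```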

In [17]:
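spams = df[df.label == 'spam']
sentence = 'send cash now'
# Start from 1 and multiply in P(word|spam) for each word (naive independence assumption)
spam_words_probability = 1
for word in sentence.split():
    # Fraction of spam messages containing the word; str.contains is a case-sensitive substring (regex) match
    word_probability = spams[spams.msg.str.contains(word)].shape[0]/float(spams.shape[0])
    print("word {} probability: {}".format(word, word_probability))
    spam_words_probability *= word_probability
# Multiply by the prior P(spam) to get the numerator of Bayes' rule
spam_words_probability *= spam_probability
print('spam words probability: {}'.format(spam_words_probability))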


word send probability: 0.06693440428380187
word cash probability: 0.06827309236947791
word now probability: 0.1994645247657296
spam words probability: 0.00012220082487226017

In [19]:
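hams = df[df.label == 'ham']
sentence = 'send cash now'
# Start from 1 and multiply in P(word|ham) for each word (naive independence assumption)
ham_words_probability = 1
for word in sentence.split():
    # Fraction of ham messages containing the word; str.contains is a case-sensitive substring (regex) match
    word_probability = hams[hams.msg.str.contains(word)].shape[0]/float(hams.shape[0])
    print("word {} probability: {}".format(word, word_probability))
    ham_words_probability *= word_probability
# Multiply by the prior P(ham) to get the numerator of Bayes' rule
ham_words_probability *= ham_probability
print('ham words probability: {}'.format(ham_words_probability))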


word send probability: 0.023626943005181346
word cash probability: 0.002694300518134715
word now probability: 0.10051813471502591
ham words probability: 5.540949590575691e-06
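
Two properties of the counting approach above are worth keeping in mind. `str.contains` matches substrings, so 'now' also counts messages containing 'known' or 'snow', and a word that never occurs in one class gives a probability of 0 that wipes out the whole product. The sketch below shows one possible way to handle both, assuming whole-word, case-insensitive matching and add-one (Laplace) smoothing. The helper `word_likelihood` and the toy messages are hypothetical and not part of the book's example.

```python
import re

def word_likelihood(messages, word, alpha=1.0):
    """Smoothed fraction of messages that contain `word` as a whole word."""
    pattern = re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
    hits = sum(1 for msg in messages if pattern.search(msg))
    # Add-alpha smoothing over the two outcomes (word present / word absent)
    return (hits + alpha) / (len(messages) + 2 * alpha)

# Made-up messages for illustration only
toy_spam = ['Send CASH now', 'win cash prizes', 'you are known to us']

print(word_likelihood(toy_spam, 'now'))      # 'known' is not counted as 'now'
print(word_likelihood(toy_spam, 'cash'))     # matches 'CASH' and 'cash'
print(word_likelihood(toy_spam, 'bitcoin'))  # unseen word, still non-zero
```

Applied to the SMS data, this would give different numbers from the ones printed above, since both the matching rule and the estimate change.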

In [20]:
# Both scores omit the same denominator P(sentence), so comparing them directly is enough
if spam_words_probability > ham_words_probability:
    print('{} is more likely spam'.format(sentence))
else:
    print('{} is more likely not spam'.format(sentence))


send cash now is more likely spam
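
The two scores compared above are unnormalized: each is P(sentence|class) * P(class) with the shared denominator left out. Under the model, that denominator is the sum of the two scores, so dividing by it recovers the posterior probabilities. A small sketch using the values printed by the two cells above:

```python
# Unnormalized scores copied from the printed outputs above
spam_score = 0.00012220082487226017
ham_score = 5.540949590575691e-06

# With only two classes, P(sentence) is the sum of both numerators
evidence = spam_score + ham_score
spam_posterior = spam_score / evidence
ham_posterior = ham_score / evidence

print('P(spam|sentence) = {:.3f}, P(ham|sentence) = {:.3f}'.format(spam_posterior, ham_posterior))
```

With these inputs the spam posterior comes out at roughly 96%, which is only as trustworthy as the independence assumption and the substring-based word counts behind it.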

Use scikit-learn's built-in methods: CountVectorizer to turn messages into word counts and MultinomialNB to classify them


In [31]:
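from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
# Hold out part of the data for testing (train_test_split keeps 25% by default)
x_train, x_test, y_train, y_test = train_test_split(df.msg, df.label, random_state=1)
# Learn the vocabulary from the training messages only, then build document-term matrices
vect = CountVectorizer()
train_dtm = vect.fit_transform(x_train)
test_dtm = vect.transform(x_test)
# Fit multinomial Naive Bayes on the word counts and predict the held-out labels
nb = MultinomialNB()
nb.fit(train_dtm, y_train)
predicts = nb.predict(test_dtm)
predicts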


Out[31]:
array(['ham', 'ham', 'ham', ..., 'ham', 'spam', 'ham'], 
      dtype='<U4')
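
Once fitted, the same vectorizer and classifier can score unseen text such as the 'send cash now' sentence used earlier. The sketch below is self-contained and therefore trains on a tiny made-up corpus (the `toy_msgs` and `toy_labels` lists are illustrative assumptions). With the notebook's own objects, the equivalent calls would be `vect.transform` followed by `nb.predict` or `nb.predict_proba`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data for illustration only
toy_msgs = ['win cash now', 'send cash to claim prize',
            'see you at lunch', 'call me when you are home']
toy_labels = ['spam', 'spam', 'ham', 'ham']

toy_vect = CountVectorizer()
toy_nb = MultinomialNB()
toy_nb.fit(toy_vect.fit_transform(toy_msgs), toy_labels)

# New text must go through the already fitted vectorizer (transform, not fit_transform)
new_dtm = toy_vect.transform(['send cash now'])
prediction = toy_nb.predict(new_dtm)[0]
probabilities = dict(zip(toy_nb.classes_, toy_nb.predict_proba(new_dtm)[0]))
print(prediction, probabilities)
```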

In [36]:
from sklearn import metrics
print('accuracy: {}, confusion matrix: {}'
      .format(metrics.accuracy_score(y_test, predicts), metrics.confusion_matrix(y_test, predicts)))


accuracy: 0.9885139985642498, confusion matrix: [[1203    5]
 [  11  174]]
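
In scikit-learn's `confusion_matrix`, rows are true labels and columns are predicted labels, ordered by sorted label (here ham, then spam). Because ham dominates the data, accuracy alone can look flattering, so the sketch below takes the four counts printed above and derives precision and recall for the spam class, along with the accuracy of always predicting ham as a baseline. The variable names are illustrative.

```python
# Counts copied from the confusion matrix above: rows = true label, columns = predicted label
true_ham_pred_ham, true_ham_pred_spam = 1203, 5
true_spam_pred_ham, true_spam_pred_spam = 11, 174

total = true_ham_pred_ham + true_ham_pred_spam + true_spam_pred_ham + true_spam_pred_spam

accuracy = (true_ham_pred_ham + true_spam_pred_spam) / total
# Of the messages flagged as spam, how many really were spam
spam_precision = true_spam_pred_spam / (true_spam_pred_spam + true_ham_pred_spam)
# Of the real spam messages, how many were caught
spam_recall = true_spam_pred_spam / (true_spam_pred_spam + true_spam_pred_ham)
# Accuracy of a classifier that labels every test message as ham
baseline_accuracy = (true_ham_pred_ham + true_ham_pred_spam) / total

print('accuracy: {:.4f}, baseline: {:.4f}'.format(accuracy, baseline_accuracy))
print('spam precision: {:.4f}, spam recall: {:.4f}'.format(spam_precision, spam_recall))
```

Read this way, 11 of the 185 spam messages in the test set were missed and 5 of the 1208 legitimate messages were flagged as spam.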
