(C) 2017-2019 by Damir Cavar
Download: This and various other Jupyter notebooks are available from my GitHub repo.
This is a tutorial related to the discussion of a Bayesian classifier in the textbook Machine Learning: The Art and Science of Algorithms that Make Sense of Data by Peter Flach.
This tutorial was developed as part of my course material for the course Machine Learning for Computational Linguistics in the Computational Linguistics Program of the Department of Linguistics at Indiana University.
Assume that we have a set of e-mails that are annotated as spam or ham, as described in the textbook.
There are $4$ e-mails labeled ham and $1$ e-mail labeled spam, that is, we have a total of $5$ texts in our corpus.
If we randomly picked an e-mail from the collection, the probability of picking a spam e-mail would be $1 / 5$.
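That is, $P(spam) = 1/5 = 0.2$ and $P(ham) = 4/5 = 0.8$.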
Spam e-mails might differ from ham e-mails in just a few words. Here is a sample e-mail constructed from typical spam keywords:
In [1]:
spam = [ """Our medicine cures baldness. No diagnostics needed.
We guarantee Fast Viagra delivery.
We can provide Human growth hormone. The cheapest Life
Insurance with us. You can Lose weight with this treatment.
Our Medicine now and No medical exams necessary.
Our Online pharmacy is the best. This cream Removes
wrinkles and Reverses aging.
One treatment and you will Stop snoring. We sell Valium
and Viagra.
Our Vicodin will help with Weight loss. Cheap Xanax.""" ]
The data structure above is a list of strings that contains only one string. The triple double quotes delimit a multi-line string. We can print the number of elements in the list spam this way:
In [2]:
print(len(spam))
We can create a list of ham mails in a similar way:
In [3]:
ham = [ """Hi Hans, hope to see you soon at our family party.
When will you arrive.
All the best to the family.
Sue""",
"""Dear Ata,
did you receive my last email related to the car insurance
offer? I would be happy to discuss the details with you.
Please give me a call, if you have any questions.
John Smith
Super Car Insurance""",
"""Hi everyone:
This is just a gentle reminder of today's first 2017 SLS
Colloquium, from 2.30 to 4.00 pm, in Ballantine 103.
Rodica Frimu will present a job talk entitled "What is
so tricky in subject-verb agreement?". The text of the
abstract is below.
If you would like to present something during the Spring,
please let me know.
The current online schedule with updated title
information and abstracts is available under:
http://www.iub.edu/~psyling/SLSColloquium/Spring2017.html
See you soon,
Peter""",
"""Dear Friends,
As our first event of 2017, the Polish Studies Center
presents an evening with artist and filmmaker Wojtek Sawa.
Please join us on JANUARY 26, 2017 from 5:30 p.m. to
7:30 p.m. in the Global and International Studies
Building room 1100 for a presentation by Wojtek Sawa
on his interactive installation art piece The Wall
Speaks–Voices of the Unheard. A reception will follow
the event where you will have a chance to meet the artist
and discuss his work.
Best,"""]
The ham-mail list contains $4$ e-mails:
In [4]:
print(len(ham))
We can access a particular e-mail via index from either spam or ham:
In [7]:
print(spam[0])
In [8]:
print(ham[3])
We can lower-case an e-mail using the string method lower():
In [9]:
print(ham[3].lower())
We can loop over all e-mails in spam or ham and lower-case the content:
In [10]:
for text in ham:
    print(text.lower())
We can use the tokenizer from NLTK to tokenize the lower-cased text into single tokens (words and punctuation marks):
In [11]:
from nltk import word_tokenize
print(word_tokenize(ham[0].lower()))
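If the call to word_tokenize raises a LookupError, the NLTK tokenizer models are probably not installed yet; depending on your setup, they can be downloaded once like this:

import nltk
nltk.download('punkt')  # tokenizer models used by word_tokenize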
We can count the number of tokens and types in lower-cased text:
In [12]:
from collections import Counter
myCounts = Counter(word_tokenize("This is a test. Will this test teach us how to count tokens?".lower()))
print(myCounts)
print("number of types:", len(myCounts))
print("number of tokens:", sum(myCounts.values()))
Now we can create a frequency profile of ham and spam words given the two text collections:
In [24]:
hamFP = Counter()
spamFP = Counter()
for text in spam:
    spamFP.update(word_tokenize(text.lower()))
for text in ham:
    hamFP.update(word_tokenize(text.lower()))
print("Ham:\n", hamFP)
print("-" * 30)
print("Spam:\n", spamFP)
In [33]:
from math import log
tokenlist = []
frqprofiles = []
# collect a frequency profile and a token set for each of the five e-mails
for x in spam:
    frqprofiles.append( Counter(word_tokenize(x.lower())) )
    tokenlist.append( set(word_tokenize(x.lower())) )
for x in ham:
    frqprofiles.append( Counter(word_tokenize(x.lower())) )
    tokenlist.append( set(word_tokenize(x.lower())) )
#print(tokenlist)
# weight each token of the spam e-mail by its frequency times the log2 of
# (number of e-mails / number of e-mails containing the token), a TF-IDF-style score
for x in frqprofiles[0]:
    frq = frqprofiles[0][x]
    counter = 0
    for y in tokenlist:
        if x in y:
            counter += 1
    print(x, frq * log(len(tokenlist)/counter, 2))
The probability of randomly picking a spam or a ham e-mail can be computed as the number of e-mails in each class divided by the total number of e-mails:
In [15]:
total = len(spam) + len(ham)
spamP = len(spam) / total
hamP = len(ham) / total
print("probability to pick spam:", spamP)
print("probability to pick ham:", hamP)
We will need the total token count to calculate the relative frequencies of the tokens, that is, to generate likelihood estimates. As a crude smoothing step, we add one to each total count to reserve probability mass for unknown tokens.
In [16]:
totalSpam = sum(spamFP.values()) + 1
totalHam = sum(hamFP.values()) + 1
print("total spam counts + 1:", totalSpam)
print("total ham counts + 1:", totalHam)
We can relativize the counts in the frequency profiles now:
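Each relative frequency then serves as the likelihood estimate of a token $w$ given a class, for example $P(w \mid ham) = count_{ham}(w) / totalHam$, and analogously for spam.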
In [18]:
hamFP = Counter( dict([ (token, frequency/totalHam) for token, frequency in hamFP.items() ]) )
spamFP = Counter( dict([ (token, frequency/totalSpam) for token, frequency in spamFP.items() ]) )
print(hamFP)
print("-" * 30)
print(spamFP)
We can now compute the default probability that we want to assign to unknown words as $1 / totalSpam$ or $1 / totalHam$ respectively. Whenever we encounter an unknown token that is not in our frequency profile, we will assign the default probability to it.
In [19]:
defaultSpam = 1 / totalSpam
defaultHam = 1 / totalHam
print("default spam probability:", defaultSpam)
print("default ham probability:", defaultHam)
We can classify an unknown document by calculating how likely it is to have been generated by the hamFP distribution or by the spamFP distribution. We tokenize the lower-cased unknown document and compute the product of the likelihoods of all its tokens, scaled by the prior probability of randomly picking a ham or a spam e-mail. Let us first calculate the spam score:
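For an unknown document $d$ with tokens $w_1, \dots, w_n$ this amounts to the Naive Bayes score $P(spam) \cdot \prod_{i=1}^{n} P(w_i \mid spam)$, and analogously for ham.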
In [20]:
unknownEmail = """Dear ,
we sell the cheapest and best Viagra on the planet. Our delivery is guaranteed confident and cheap.
"""
tokens = word_tokenize(unknownEmail.lower())
result = 1.0
for token in tokens:
    result *= spamFP.get(token, defaultSpam)
print(result * spamP)
Since this number is very small, and products of many small probabilities quickly underflow, a better strategy is to sum the log-likelihoods instead:
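In log space the score becomes $\log P(spam) + \sum_{i=1}^{n} \log P(w_i \mid spam)$, which preserves the ranking of the two classes while avoiding numerical underflow.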
In [21]:
from math import log
resultSpam = 0.0
for token in tokens:
    resultSpam += log(spamFP.get(token, defaultSpam), 2)
resultSpam += log(spamP, 2)  # use base 2 for the prior as well, to stay consistent
print(resultSpam)
In [22]:
resultHam = 0.0
for token in tokens:
    resultHam += log(hamFP.get(token, defaultHam), 2)
resultHam += log(hamP, 2)  # base 2 for the prior as well
print(resultHam)
The log-likelihood for spam is larger than for ham. Our simple classifier would have guessed that this e-mail is spam.
In [23]:
if max(resultHam, resultSpam) == resultHam:
    print("e-mail is ham")
else:
    print("e-mail is spam")
There are numerous ways to improve the algorithm and this tutorial. Please send me your suggestions.
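As one possible starting point, here is a minimal sketch that wraps the steps above into two reusable functions; the names train and score are my own and not part of the tutorial:

from collections import Counter
from math import log
from nltk import word_tokenize

def train(texts):
    # relative-frequency profile plus a default probability for unknown tokens
    fp = Counter()
    for text in texts:
        fp.update(word_tokenize(text.lower()))
    total = sum(fp.values()) + 1
    return Counter({token: frq / total for token, frq in fp.items()}), 1 / total

def score(text, fp, default, prior):
    # log prior plus the sum of the token log-likelihoods (base 2 throughout)
    return log(prior, 2) + sum(log(fp.get(token, default), 2)
                               for token in word_tokenize(text.lower()))

spamFP, defaultSpam = train(spam)
hamFP, defaultHam = train(ham)
total = len(spam) + len(ham)
if score(unknownEmail, spamFP, defaultSpam, len(spam) / total) > \
   score(unknownEmail, hamFP, defaultHam, len(ham) / total):
    print("e-mail is spam")
else:
    print("e-mail is ham")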
(C) 2017-2019 by Damir Cavar - Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)