In [1]:
#%load ex2_spamclassification.py

In [2]:
%matplotlib inline
import matplotlib
# savefig.dpi may be the string 'figure' on newer matplotlib, so double figure.dpi instead.
matplotlib.rcParams['savefig.dpi'] = 2 * matplotlib.rcParams['figure.dpi']
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import BernoulliNB
from utils import download
import numpy as np
import zipfile
import os

This example is adapted from an excellent tutorial by Radim Rehurek, the author of gensim: http://radimrehurek.com/data_science_python/

The dataset we will be using is the SMS Spam Collection from the UCI repository: a set of text messages, each labeled spam or ham (not spam). See https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection for more details.

Gomez Hidalgo, J.M., Cajigas Bringas, G., Puertas Sanz, E., Carrero Garcia, F. Content Based SMS Spam Filtering. Proceedings of the 2006 ACM Symposium on Document Engineering (ACM DOCENG'06), Amsterdam, The Netherlands, 10-13, 2006.


In [3]:
dataset_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"

Download the archive unless it is already present in the working directory. The download helper comes from the local utils module imported above.


In [4]:
dataset_fname = dataset_url.split("/")[-1]
if not os.path.exists(dataset_fname):
    download(dataset_url, server_fname=dataset_fname)

Read every line of the first file in the zip archive into a list, then split each line into its label and its message text.


In [5]:
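# Each line is "<label>TAB<message>". Decode the bytes (assumed UTF-8) so str.split works on Python 3.
with zipfile.ZipFile(dataset_fname, 'r') as archive:
    raw = [l.decode("utf-8") for l in archive.open(archive.infolist()[0])]
labels = [l.split("\t")[0] for l in raw]
data = [l.split("\t")[1].rstrip() for l in raw]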
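
The parsing step can be tried without downloading anything. The following is a minimal sketch, assuming each line has the form label, tab, message and is UTF-8 encoded. The in-memory archive, its member name sample.txt, and the helper parse_sms_lines are hypothetical stand-ins, not part of the original notebook.

```python
import io
import zipfile


def parse_sms_lines(raw_lines):
    """Split byte lines of the form label, TAB, message into two lists."""
    labels, messages = [], []
    for line in raw_lines:
        # Split only on the first tab so tabs inside a message are preserved.
        label, message = line.decode("utf-8").split("\t", 1)
        labels.append(label)
        messages.append(message.rstrip())
    return labels, messages


# Build a tiny zip archive in memory that mimics the dataset layout.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "sample.txt",
        "ham\tOk lar... Joking wif u oni...\nspam\tWINNER!! Claim now\n",
    )

# Read the first member back, exactly as the notebook does with the real file.
with zipfile.ZipFile(buffer, "r") as archive:
    with archive.open(archive.infolist()[0]) as member:
        sample_labels, sample_messages = parse_sms_lines(member.readlines())

print(sample_labels)
print(sample_messages)
```

Splitting with a maximum of one split is a defensive choice: it keeps the whole message intact even if the text itself happens to contain a tab.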

Let's see some examples from the dataset!


In [6]:
for l, d in list(zip(labels, data))[:10]:
    print("%s %s" % (l, d))

labels = np.array(labels)
n_spam = np.sum(labels == "spam")
n_ham = np.sum(labels == "ham")
print("Percentage spam %f" % (float(n_spam) / len(labels)))
print("Percentage ham %f" % (float(n_ham) / len(labels)))


ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham Ok lar... Joking wif u oni...
spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham U dun say so early hor... U c already then say...
ham Nah I don't think he goes to usf, he lives around here though
spam FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
ham Even my brother is not like to speak with me. They treat me like aids patent.
ham As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
spam WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.
spam Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030
Percentage spam 0.134015
Percentage ham 0.865985
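
Because the classes are imbalanced, accuracy needs a baseline to compare against. The sketch below uses a small made-up label list (not the real dataset) to show how class fractions translate into the accuracy of a classifier that always predicts the majority class. The names class_fractions and majority_baseline_accuracy are illustrative. Applied to the fractions printed above, always predicting ham would already score about 0.866.

```python
from collections import Counter


def class_fractions(labels):
    """Return a dict mapping each label to its share of the list."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most common label."""
    return max(class_fractions(labels).values())


# Toy labels with roughly the same imbalance as the SMS data.
toy_labels = ["ham"] * 13 + ["spam"] * 2
fractions = class_fractions(toy_labels)
baseline = majority_baseline_accuracy(toy_labels)

print(fractions)
print(baseline)
```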

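We train on the first 80% of the data and hold out the last 20% for testing. The split is by position, with no shuffling. A scikit-learn pipeline chains the TF-IDF vectorizer and the Bernoulli naive Bayes classifier, so both steps are fit and applied together.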


In [7]:
train_boundary = int(.8 * len(data))
train_X = np.array(data[:train_boundary])
train_y = np.array(labels[:train_boundary])
test_X = np.array(data[train_boundary:])
test_y = np.array(labels[train_boundary:])

text_cleaner = TfidfVectorizer()
classifier = BernoulliNB()
p = make_pipeline(text_cleaner, classifier)
p.fit(train_X, train_y)


Out[7]:
Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer(analyzer=u'word', binary=False, charset=None,
        charset_error=None, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(...ary=None)), ('bernoullinb', BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True))])
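
One detail of this pipeline is worth checking. BernoulliNB binarizes its input (binarize=0.0 in the output above), so only the presence or absence of a term reaches the classifier, not its TF-IDF weight. The sketch below, on a few made-up messages, compares the notebook's pipeline with one built on CountVectorizer(binary=True). Since both vectorizers use the same default tokenization, the two models should make the same predictions. The toy messages and variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Made-up training messages, not taken from the dataset.
toy_X = [
    "free prize call now",
    "win free entry text now",
    "are you coming home tonight",
    "see you at lunch tomorrow",
    "call me when you are home",
    "claim your free prize today",
]
toy_y = ["spam", "spam", "ham", "ham", "ham", "spam"]

# Same classifier, two different vectorizers in front of it.
tfidf_model = make_pipeline(TfidfVectorizer(), BernoulliNB()).fit(toy_X, toy_y)
binary_model = make_pipeline(CountVectorizer(binary=True), BernoulliNB()).fit(toy_X, toy_y)

new_messages = ["free entry call now", "see you at home", "prize tonight"]
tfidf_pred = tfidf_model.predict(new_messages)
binary_pred = binary_model.predict(new_messages)

# Every nonzero TF-IDF weight becomes 1 after binarization, so the inputs match.
print(list(tfidf_pred) == list(binary_pred))
```

If the TF-IDF weights are meant to influence the result, a classifier that uses feature magnitudes (for example MultinomialNB) would be the place to look. That is a different model from the one trained in this notebook.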

Check accuracy on the training and test sets, then print every test message the classifier got wrong.


In [8]:
pred_train_y = p.predict(train_X)
pred_test_y = p.predict(test_X)
print("Training accuracy %f" % accuracy_score(train_y, pred_train_y))
print("Testing accuracy %f" % accuracy_score(test_y, pred_test_y))
print(" ")
print("Test classification report")
print("==========================")
print(classification_report(test_y, pred_test_y))

misses = np.where(pred_test_y != test_y)[0]
for n in misses:
    i = n + train_boundary
    lt = labels[i]
    lp = pred_test_y[n]
    d = data[i]
    print("true:%s predicted:%s %s" % (lt, lp, d))


Training accuracy 0.987890
Testing accuracy 0.978475
 
Test classification report
==========================
             precision    recall  f1-score   support

        ham       0.98      1.00      0.99       970
       spam       1.00      0.83      0.91       145

avg / total       0.98      0.98      0.98      1115

true:spam predicted:ham 3. You have received your mobile content. Enjoy
true:spam predicted:ham Want explicit SEX in 30 secs? Ring 02073162414 now! Costs 20p/min
true:spam predicted:ham Mobile Club: Choose any of the top quality items for your mobile. 7cfca1a
true:spam predicted:ham Money i have won wining number 946 wot do i do next
true:spam predicted:ham I want some cock! My hubby's away, I need a real man 2 satisfy me. Txt WIFE to 89938 for no strings action. (Txt STOP 2 end, txt rec £1.50ea. OTBox 731 LA1 7WS. )
true:spam predicted:ham Hi babe its Chloe, how r u? I was smashed on saturday night, it was great! How was your weekend? U been missing me? SP visionsms.com Text stop to stop 150p/text
true:spam predicted:ham Santa calling! Would your little ones like a call from Santa Xmas Eve? Call 09077818151 to book you time. Calls1.50ppm last 3mins 30s T&C www.santacalling.com
true:spam predicted:ham Check Out Choose Your Babe Videos @ sms.shsex.netUN fgkslpoPW fgkslpo
true:spam predicted:ham Goal! Arsenal 4 (Henry, 7 v Liverpool 2 Henry scores with a simple shot from 6 yards from a pass by Bergkamp to give Arsenal a 2 goal margin after 78 mins.
true:spam predicted:ham Hi, the SEXYCHAT girls are waiting for you to text them. Text now for a great night chatting. send STOP to stop this service
true:spam predicted:ham Hi this is Amy, we will be sending you a free phone number in a couple of days, which will give you an access to all the adult parties...
true:spam predicted:ham You can donate £2.50 to UNICEF's Asian Tsunami disaster support fund by texting DONATE to 864233. £2.50 will be added to your next bill
true:spam predicted:ham You have 1 new message. Please call 08715205273
true:spam predicted:ham PRIVATE! Your 2003 Account Statement for 078
true:spam predicted:ham dating:i have had two of these. Only started after i sent a text to talk sport radio last week. Any connection do you think or coincidence?
true:spam predicted:ham The current leading bid is 151. To pause this auction send OUT. Customer Care: 08718726270
true:spam predicted:ham You have 1 new message. Call 0207-083-6089
true:spam predicted:ham Santa Calling! Would your little ones like a call from Santa Xmas eve? Call 09058094583 to book your time.
true:spam predicted:ham Latest News! Police station toilet stolen, cops have nothing to go on!
true:spam predicted:ham "For the most sparkling shopping breaks from 45 per person; call 0121 2025050 or visit www.shortbreaks.org.uk"
true:spam predicted:ham http//tms. widelive.com/index. wml?id=820554ad0a1705572711&first=true¡C C Ringtone¡
true:spam predicted:ham Marvel Mobile Play the official Ultimate Spider-man game (£4.50) on ur mobile right now. Text SPIDER to 83338 for the game & we ll send u a FREE 8Ball wallpaper
true:spam predicted:ham Want explicit SEX in 30 secs? Ring 02073162414 now! Costs 20p/min Gsex POBOX 2667 WC1N 3XX
true:spam predicted:ham ASKED 3MOBILE IF 0870 CHATLINES INCLU IN FREE MINS. INDIA CUST SERVs SED YES. L8ER GOT MEGA BILL. 3 DONT GIV A SHIT. BAILIFF DUE IN DAYS. I O £250 3 WANT £800
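
The figures in the report can be recomputed by hand from the error list. This is a minimal sketch, assuming the 24 messages printed above are the only test errors and that all of them are spam predicted as ham. The support values (970 ham, 145 spam) are taken from the report. The function name precision_recall_f1 is illustrative.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall and F1 for one class from its error counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


n_ham, n_spam = 970, 145   # support column of the report
spam_as_ham = 24           # the misclassified messages listed above
ham_as_spam = 0            # no "true:ham predicted:spam" lines were printed

# For spam, a miss is a false negative. For ham, the same miss is a false positive.
spam_p, spam_r, spam_f1 = precision_recall_f1(n_spam - spam_as_ham, ham_as_spam, spam_as_ham)
ham_p, ham_r, ham_f1 = precision_recall_f1(n_ham - ham_as_spam, spam_as_ham, ham_as_spam)
accuracy = (n_ham + n_spam - spam_as_ham - ham_as_spam) / (n_ham + n_spam)

print("spam precision %.2f recall %.2f f1 %.2f" % (spam_p, spam_r, spam_f1))
print("ham  precision %.2f recall %.2f f1 %.2f" % (ham_p, ham_r, ham_f1))
print("accuracy %f" % accuracy)
```

Read this way, the high accuracy hides the weak spot: the classifier never flags a legitimate message as spam, but it lets roughly one spam message in six through.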
