Identifying Spam from SMS Text Messages

This analysis attempts to identify spam messages in a corpus of 5,574 SMS text messages. Each message is labeled as either spam or ham (a legitimate message), with 4,827 ham and 747 spam. We use scikit-learn's Multinomial Naive Bayes model to classify messages as spam or ham.

We will look at various options for tuning the model to see if we can reach zero false positives, i.e., cases where legitimate messages are labeled as spam. The assumption is that a small percentage of spam making it through the filter is preferable to legitimate messages being blocked.

Sources:

https://archive.ics.uci.edu/ml/datasets/sms+spam+collection

https://radimrehurek.com/data_science_python/

http://adataanalyst.com/scikit-learn/countvectorizer-sklearn-example/


In [479]:
%matplotlib inline

import os
import json
import time
import pickle
import requests
from io import BytesIO
from zipfile import ZipFile

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text

import seaborn as sns
sns.set(font_scale=1.5)

Loading the data from the UCI Machine Learning Repository


In [517]:
URL = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
SMS_PATH = os.path.join('datasets', 'sms')

response = requests.get(URL)
zipfile = ZipFile(BytesIO(response.content))
zip_names = zipfile.namelist()

def fetch_data(file_name='SMSSpamCollection'):
    if not os.path.isdir(SMS_PATH):
        os.makedirs(SMS_PATH)
    # Extract every file in the archive (the data file plus a readme),
    # then return the path to the data file itself.
    for name in zip_names:
        outpath = os.path.join(SMS_PATH, name)
        with open(outpath, 'wb') as f:
            f.write(zipfile.read(name))
    return os.path.join(SMS_PATH, file_name)

DATA = fetch_data()

In [518]:
df = pd.read_csv(DATA, sep='\t', header=None)

In [519]:
df.columns = ['Label', 'Text']

Data Exploration


In [520]:
pd.set_option('max_colwidth', 220)
df.head(20)


Out[520]:
Label Text
0 ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives around here though
5 spam FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
6 ham Even my brother is not like to speak with me. They treat me like aids patent.
7 ham As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
8 spam WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.
9 spam Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030
10 ham I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.
11 spam SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info
12 spam URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18
13 ham I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.
14 ham I HAVE A DATE ON SUNDAY WITH WILL!!
15 spam XXXMobileMovieClub: To use your credit, click the WAP link in the next txt message or click here>> http://wap. xxxmobilemovieclub.com?n=QJKGIGHJJGCBL
16 ham Oh k...i'm watching here:)
17 ham Eh u remember how 2 spell his name... Yes i did. He v naughty make until i v wet.
18 ham Fine if that’s the way u feel. That’s the way its gota b
19 spam England v Macedonia - dont miss the goals/team news. Txt ur national team to 87077 eg ENGLAND to 87077 Try:WALES, SCOTLAND 4txt/ú1.20 POBOXox36504W45WQ 16+

In [521]:
df.describe()


Out[521]:
Label Text
count 5572 5572
unique 2 5169
top ham Sorry, I'll call later
freq 4825 30

In [522]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
Label    5572 non-null object
Text     5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB

Since the data is labeled for us, we can do further data exploration by taking a look at how spam and ham differ.
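A quick way to quantify the class imbalance before modeling is `value_counts`. A minimal sketch on a hypothetical four-row frame with the same columns as `df` (on the full frame, this reports 4825 ham and 747 spam):

```python
import pandas as pd

# Hypothetical stand-in frame with the same columns as df.
sample = pd.DataFrame({
    'Label': ['ham', 'ham', 'spam', 'ham'],
    'Text': ['hi', 'ok', 'WIN cash now', 'see you'],
})

# Count messages per class and compute the spam fraction.
counts = sample['Label'].value_counts()
spam_rate = counts.get('spam', 0) / len(sample)
print(counts.to_dict())   # {'ham': 3, 'spam': 1}
print(spam_rate)          # 0.25
```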


In [523]:
# Add a field to our dataframe with the length of each message.

df['Length'] = df['Text'].apply(len)

In [524]:
df.head()


Out[524]:
Label Text Length
0 ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... 111
1 ham Ok lar... Joking wif u oni... 29
2 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's 155
3 ham U dun say so early hor... U c already then say... 49
4 ham Nah I don't think he goes to usf, he lives around here though 61

In [525]:
df.groupby('Label').describe()


Out[525]:
Length
count mean std min 25% 50% 75% max
Label
ham 4825.0 71.482487 58.440652 2.0 33.0 52.0 93.0 910.0
spam 747.0 138.670683 28.873603 13.0 133.0 149.0 157.0 223.0

In addition to the class imbalance between ham and spam, it appears that spam messages are generally longer than ham messages, and their lengths are more normally distributed.


In [526]:
df.Length.plot(bins=100, kind='hist')


Out[526]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4c7d0459e8>

In [527]:
df.hist(column='Length', by='Label', bins=50, figsize=(10,4))


Out[527]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x7f4c7cdd5860>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x7f4c7cfb7748>], dtype=object)

Define the feature set through vectorization.
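To make the vectorization step concrete, here is a minimal sketch on a hypothetical two-message corpus: `CountVectorizer` learns a vocabulary from the text and turns each message into a row of token counts (note that its default token pattern drops single-character tokens).

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['free prize now', 'call me now']  # hypothetical mini-corpus

# Learn the vocabulary and build the document-term matrix in one step.
vec = CountVectorizer(analyzer='word')
dtm = vec.fit_transform(docs)

# Columns are ordered by sorted vocabulary: call, free, me, now, prize.
vocab = sorted(vec.vocabulary_)
counts = dtm.toarray()
print(vocab)             # ['call', 'free', 'me', 'now', 'prize']
print(counts.tolist())   # [[0, 1, 0, 1, 1], [1, 0, 1, 1, 0]]
```

Applied to `text_data`, the same two steps produce the 5572 × 8713 matrix built below.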


In [528]:
text_data = df['Text']
text_data.shape


Out[528]:
(5572,)

In [529]:
# Give our target labels numbers.

df['Label_'] = df['Label'].map({'ham': 0, 'spam': 1})

In [530]:
#stop_words = text.ENGLISH_STOP_WORDS

#Adding stop words did not significantly improve the model.

In [531]:
#textWithoutNums = text_data.replace('\d+', 'NUM_', regex=True)

#Removing all of the numbers in the messages and replacing with a text string did not improve the model either.

In [532]:
vectorizer = CountVectorizer(analyzer='word') #, stop_words=stop_words)
#vectorizer.fit(textWithoutNums)
vectorizer.fit(text_data)


Out[532]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [533]:
vectorizer.get_feature_names()


Out[533]:
['00',
 '000',
 '000pes',
 '008704050406',
 '0089',
 '0121',
 '01223585236',
 '01223585334',
 ...
 'aha',
 'ahead',
 'ahhh',
 ...]

In [534]:
# vocabulary_ maps each term to its column index in the matrix;
# sorting by index (descending) shows the last terms in the vocabulary.
pd.DataFrame.from_dict(vectorizer.vocabulary_, orient='index').sort_values(by=0, ascending=False).head()


Out[534]:
0
〨ud 8712
ú1 8711
èn 8710
zyada 8709
zouk 8708

In [535]:
dtm = vectorizer.transform(text_data)
features = pd.DataFrame(dtm.toarray(), columns=vectorizer.get_feature_names())
features.shape


Out[535]:
(5572, 8713)

In [536]:
features.head()


Out[536]:
00 000 000pes 008704050406 0089 0121 01223585236 01223585334 0125698789 02 ... zhong zindgi zoe zogtorius zoom zouk zyada èn ú1 〨ud
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 8713 columns


In [537]:
X = features
y = np.array(df['Label_'].tolist())

In [538]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print(X_train.shape, y_train.shape)


(4457, 8713) (4457,)

In [539]:
model = MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
model.fit(X_train, y_train)


Out[539]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [540]:
y_pred_class = model.predict(X_test)

In [541]:
print(metrics.classification_report(y_test, y_pred_class))
print('Accuracy Score: ', metrics.accuracy_score(y_test, y_pred_class))


             precision    recall  f1-score   support

          0       0.99      0.99      0.99       968
          1       0.95      0.97      0.96       147

avg / total       0.99      0.99      0.99      1115

Accuracy Score:  0.989237668161

Using Yellowbrick


In [542]:
from yellowbrick.classifier import ClassificationReport
bayes = MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
visualizer = ClassificationReport(bayes, classes=['ham', 'spam'])
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
g = visualizer.poof()



In [543]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred_class)
sns.set(font_scale=1.5)

ax = plt.subplot()
sns.heatmap(cm, annot=True, ax=ax, fmt='g', cbar=False)


ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(['Ham', 'Spam'])
ax.yaxis.set_ticklabels(['Ham', 'Spam'])
plt.show()


Using the default settings, the model does a pretty good job predicting spam and ham, although it is not perfect. The confusion matrix shows 12 misclassified messages: 7 false positives (ham messages predicted as spam) and 5 false negatives (spam messages predicted as ham).
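Since the two error types are easy to mix up, it helps to unpack `confusion_matrix` explicitly. A minimal sketch on hypothetical labels (scikit-learn orders rows and columns by sorted label, so with 0 = ham and 1 = spam the layout is `[[tn, fp], [fn, tp]]`):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = ham, 1 = spam.
y_true = np.array([0, 0, 0, 1, 1, 0])
y_hat  = np.array([0, 1, 0, 1, 0, 0])

# fp = ham flagged as spam (the costly error here);
# fn = spam let through as ham.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)   # 3 1 1 1
```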

For most users, it is more important to receive 100% of their legitimate messages than to block every spam message. So let's see if we can tune the model to eliminate the false positives: messages tagged as spam that are really ham.

We can use grid search with cross-validation to find the optimal alpha value.


In [544]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

# Split the dataset in two equal parts
X_train_, X_test_, y_train_, y_test_ = train_test_split(
    X, y, test_size=0.5, random_state=1)

# Set the parameters by cross-validation
tuned_parameters = [{'alpha': [0.5, 1.0, 1.5, 2.0, 2.5, 3.0], 'class_prior':[None], 'fit_prior': [True, False]}]

scores = ['precision', 'recall']

for score in scores:
    print("### Tuning hyper-parameters for %s ###" % score)
    print()

    clf = GridSearchCV(MultinomialNB(), tuned_parameters, cv=5,
                       scoring='%s_macro' % score)
    clf.fit(X_train_, y_train_)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    # Note: this evaluates on the earlier 80/20 test split (X_test, y_test),
    # which partially overlaps X_train_, so these scores are optimistic.
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()
    print('Accuracy Score: ', metrics.accuracy_score(y_test, y_pred))
    print()


### Tuning hyper-parameters for precision ###

Best parameters set found on development set:

{'alpha': 3.0, 'class_prior': None, 'fit_prior': True}

Grid scores on development set:

0.950 (+/-0.028) for {'alpha': 0.5, 'class_prior': None, 'fit_prior': True}
0.897 (+/-0.039) for {'alpha': 0.5, 'class_prior': None, 'fit_prior': False}
0.966 (+/-0.023) for {'alpha': 1.0, 'class_prior': None, 'fit_prior': True}
0.913 (+/-0.041) for {'alpha': 1.0, 'class_prior': None, 'fit_prior': False}
0.972 (+/-0.015) for {'alpha': 1.5, 'class_prior': None, 'fit_prior': True}
0.925 (+/-0.020) for {'alpha': 1.5, 'class_prior': None, 'fit_prior': False}
0.973 (+/-0.019) for {'alpha': 2.0, 'class_prior': None, 'fit_prior': True}
0.927 (+/-0.027) for {'alpha': 2.0, 'class_prior': None, 'fit_prior': False}
0.973 (+/-0.017) for {'alpha': 2.5, 'class_prior': None, 'fit_prior': True}
0.931 (+/-0.029) for {'alpha': 2.5, 'class_prior': None, 'fit_prior': False}
0.974 (+/-0.017) for {'alpha': 3.0, 'class_prior': None, 'fit_prior': True}
0.934 (+/-0.032) for {'alpha': 3.0, 'class_prior': None, 'fit_prior': False}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       0.97      1.00      0.99       968
          1       1.00      0.83      0.91       147

avg / total       0.98      0.98      0.98      1115


Accuracy Score:  0.977578475336

### Tuning hyper-parameters for recall ###

Best parameters set found on development set:

{'alpha': 0.5, 'class_prior': None, 'fit_prior': True}

Grid scores on development set:

0.968 (+/-0.024) for {'alpha': 0.5, 'class_prior': None, 'fit_prior': True}
0.960 (+/-0.028) for {'alpha': 0.5, 'class_prior': None, 'fit_prior': False}
0.957 (+/-0.014) for {'alpha': 1.0, 'class_prior': None, 'fit_prior': True}
0.962 (+/-0.028) for {'alpha': 1.0, 'class_prior': None, 'fit_prior': False}
0.937 (+/-0.022) for {'alpha': 1.5, 'class_prior': None, 'fit_prior': True}
0.953 (+/-0.021) for {'alpha': 1.5, 'class_prior': None, 'fit_prior': False}
0.918 (+/-0.021) for {'alpha': 2.0, 'class_prior': None, 'fit_prior': True}
0.938 (+/-0.021) for {'alpha': 2.0, 'class_prior': None, 'fit_prior': False}
0.909 (+/-0.025) for {'alpha': 2.5, 'class_prior': None, 'fit_prior': True}
0.921 (+/-0.021) for {'alpha': 2.5, 'class_prior': None, 'fit_prior': False}
0.888 (+/-0.040) for {'alpha': 3.0, 'class_prior': None, 'fit_prior': True}
0.913 (+/-0.029) for {'alpha': 3.0, 'class_prior': None, 'fit_prior': False}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       0.99      0.99      0.99       968
          1       0.94      0.96      0.95       147

avg / total       0.99      0.99      0.99      1115


Accuracy Score:  0.986547085202

Since we are most concerned with minimizing false positives (ham classified as spam), we will use the precision-optimal settings: an alpha value of 3.0 with fit_prior=True.
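For intuition on why a larger alpha helps precision: alpha is the Laplace/Lidstone smoothing term in Multinomial Naive Bayes, which inflates each word count so that P(w | class) = (count_w + alpha) / (total + alpha · V), where V is the vocabulary size. A minimal sketch with made-up counts shows that larger alpha flattens the estimate for rare spam-indicative words, making the spam class harder to trigger:

```python
# Hypothetical counts: a word seen 3 times among 10 spam tokens,
# with a vocabulary of 5 words.
count_w, total, V = 3, 10, 5

def smoothed(alpha):
    # Lidstone-smoothed estimate of P(word | class).
    return (count_w + alpha) / (total + alpha * V)

p_small = smoothed(0.5)   # 3.5 / 12.5 = 0.28
p_large = smoothed(3.0)   # 6.0 / 25.0 = 0.24
print(p_small, p_large)
```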


In [545]:
from yellowbrick.classifier import ClassificationReport
bayes = MultinomialNB(alpha=3.0, class_prior=None, fit_prior=True)
visualizer = ClassificationReport(bayes, classes=['ham', 'spam'])
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
g = visualizer.poof()



In [546]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print(X_train.shape, y_train.shape)


(4457, 8713) (4457,)

In [547]:
model = MultinomialNB(alpha=3.0, class_prior=None, fit_prior=True)
model.fit(X_train, y_train)


Out[547]:
MultinomialNB(alpha=3.0, class_prior=None, fit_prior=True)

In [548]:
y_pred_class = model.predict(X_test)

In [549]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred_class)
sns.set(font_scale=1.5)

ax = plt.subplot()
sns.heatmap(cm, annot=True, ax=ax, fmt='g', cbar=False)


ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(['Ham', 'Spam'])
ax.yaxis.set_ticklabels(['Ham', 'Spam'])
plt.show()


By using the alpha value that is optimal for precision, we eliminate the false positives (messages predicted to be spam that are actually ham), at the expense of more spam messages being incorrectly let through as ham.


In [ ]: