Identifying Spam from SMS Text Messages

This analysis attempts to identify spam in a corpus of 5,574 SMS text messages. Each message is labeled as either spam or ham (a legitimate message): 4,827 are ham and 747 are spam. (As loaded below with pandas, the file yields 5,572 rows: 4,825 ham and 747 spam.) We use scikit-learn and its Multinomial Naive Bayes model to classify messages as spam or ham.

We will look at various options for tuning the model to see whether we can reach zero false positives, that is, legitimate messages labeled as spam. We assume that letting a small percentage of spam through the filter is preferable to excluding legitimate messages.

Sources:

https://archive.ics.uci.edu/ml/datasets/sms+spam+collection

https://radimrehurek.com/data_science_python/

http://adataanalyst.com/scikit-learn/countvectorizer-sklearn-example/



In [479]:

    
%matplotlib inline

import os
import json
import time
import pickle
import requests
from io import BytesIO
from zipfile import ZipFile

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text

import seaborn as sns
sns.set(font_scale=1.5)

Loading the data from the UCI Machine Learning Repository



In [517]:

    
URL = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
SMS_PATH = os.path.join('datasets', 'sms')

response = requests.get(URL)
zipfile = ZipFile(BytesIO(response.content))
zip_names = zipfile.namelist()

def fetch_data(file='SMSSpamCollection'):
    # Extract the named member of the downloaded archive into SMS_PATH
    # and return the path of the extracted file.
    if not os.path.isdir(SMS_PATH):
        os.makedirs(SMS_PATH)
    outpath = os.path.join(SMS_PATH, file)
    with open(outpath, 'wb') as f:
        f.write(zipfile.read(file))
    return outpath
               
DATA = fetch_data()
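
The extraction step can be tried without network access. The sketch below builds a tiny zip archive in memory as a stand-in for the downloaded bytes and writes one named member to disk. The helper name `extract_member`, the toy member contents, and the temporary directory are assumptions made for illustration only.

```python
import os
import tempfile
from io import BytesIO
from zipfile import ZipFile


def extract_member(archive, member, out_dir):
    """Write one named member of an open ZipFile into out_dir and return its path."""
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, member)
    with open(out_path, 'wb') as f:
        f.write(archive.read(member))
    return out_path


# Toy archive built in memory; the notebook gets these bytes from the HTTP response instead.
buffer = BytesIO()
with ZipFile(buffer, 'w') as zf:
    zf.writestr('SMSSpamCollection', 'ham\tOk lar...\nspam\tWINNER!!\n')
    zf.writestr('readme', 'toy readme')

archive = ZipFile(BytesIO(buffer.getvalue()))
out_dir = tempfile.mkdtemp()
path = extract_member(archive, 'SMSSpamCollection', out_dir)

# Read the tab-separated file back as (label, text) rows.
with open(path, encoding='utf-8') as f:
    rows = [line.rstrip('\n').split('\t') for line in f]
print(rows)
```

Selecting the member by name keeps the result independent of the order in which the archive lists its files.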



In [518]:

    
df = pd.read_csv(DATA, sep='\t', header=None)



In [519]:

    
df.columns = ['Label', 'Text']

Data Exploration



In [520]:

    
pd.set_option('max_colwidth', 220)
df.head(20)









    Out[520]:

    Label  Text
0   ham    Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
1   ham    Ok lar... Joking wif u oni...
2   spam   Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
3   ham    U dun say so early hor... U c already then say...
4   ham    Nah I don't think he goes to usf, he lives around here though
5   spam   FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
6   ham    Even my brother is not like to speak with me. They treat me like aids patent.
7   ham    As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
8   spam   WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.
9   spam   Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030
10  ham    I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.
11  spam   SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info
12  spam   URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18
13  ham    I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.
14  ham    I HAVE A DATE ON SUNDAY WITH WILL!!
15  spam   XXXMobileMovieClub: To use your credit, click the WAP link in the next txt message or click here>> http://wap. xxxmobilemovieclub.com?n=QJKGIGHJJGCBL
16  ham    Oh k...i'm watching here:)
17  ham    Eh u remember how 2 spell his name... Yes i did. He v naughty make until i v wet.
18  ham    Fine if thats the way u feel. Thats the way its gota b
19  spam   England v Macedonia - dont miss the goals/team news. Txt ur national team to 87077 eg ENGLAND to 87077 Try:WALES, SCOTLAND 4txt/ú1.20 POBOXox36504W45WQ 16+



In [521]:

    
df.describe()









    Out[521]:

        Label  Text
count   5572   5572
unique  2      5169
top     ham    Sorry, I'll call later
freq    4825   30



In [522]:

    
df.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
Label    5572 non-null object
Text     5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB

Since the data is labeled for us, we can do further data exploration by taking a look at how spam and ham differ.



In [523]:

    
# Add a field to our dataframe with the length of each message.

df['Length'] = df['Text'].apply(len)



In [524]:

    
df.head()









    Out[524]:

   Label  Text  Length
0  ham    Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...  111
1  ham    Ok lar... Joking wif u oni...  29
2  spam   Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's  155
3  ham    U dun say so early hor... U c already then say...  49
4  ham    Nah I don't think he goes to usf, he lives around here though  61



In [525]:

    
df.groupby('Label').describe()

In addition to the difference in the number of ham vs. spam messages, it appears that spam messages are generally longer than ham messages (a mean length of about 139 characters vs. 71) and that their lengths are more normally distributed than those of ham messages.
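
The per-label length comparison can be sketched without pandas. The snippet below uses a handful of messages copied from the `df.head(20)` output above, so it illustrates the grouping step only; the averages it prints are for this tiny sample, not for the full corpus.

```python
from collections import defaultdict
from statistics import mean

# A few (label, text) pairs copied from the first rows of the dataset shown above.
messages = [
    ('ham', 'Ok lar... Joking wif u oni...'),
    ('ham', 'U dun say so early hor... U c already then say...'),
    ('ham', "Nah I don't think he goes to usf, he lives around here though"),
    ('spam', 'Had your mobile 11 months or more? U R entitled to Update to the latest '
             'colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030'),
    ('spam', 'SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. '
             'Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info'),
]

# Group message lengths (in characters) by label.
lengths = defaultdict(list)
for label, text in messages:
    lengths[label].append(len(text))

# Average length per label for this small sample.
mean_length = {label: mean(values) for label, values in lengths.items()}
for label in sorted(mean_length):
    print(label, lengths[label], round(mean_length[label], 1))
```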



In [526]:

    
df.Length.plot(bins=100, kind='hist')









    Out[526]:





[Figure: histogram of message length across all messages, 100 bins]



In [527]:

    
df.hist(column='Length', by='Label', bins=50, figsize=(10,4))









    Out[527]:





[Figure: histograms of message length by label (ham and spam), 50 bins each]

Define the feature set through vectorization.
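
As a rough illustration of what vectorization produces, the sketch below builds a small document-term matrix by hand. It reuses the token pattern and lowercasing that `CountVectorizer` reports in its settings in this notebook (`token_pattern='(?u)\\b\\w\\w+\\b'`, `lowercase=True`); the three toy documents are made up, and the code is a simplified stand-in, not scikit-learn's implementation.

```python
import re
from collections import Counter

# Words of two or more characters, matching the token pattern shown in the vectorizer settings.
TOKEN_PATTERN = re.compile(r'(?u)\b\w\w+\b')


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


# Made-up example documents.
docs = ['Free entry to win', 'Ok lar, joking', 'Win a FREE prize, free!']

# The vocabulary is the sorted set of all tokens; each term becomes one column.
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})

# Each row holds the count of every vocabulary term in one document.
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Single-character tokens such as "a" are dropped by the pattern, which is why they never appear as columns.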



In [528]:

    
text_data = df['Text']
text_data.shape









    Out[528]:





(5572,)



In [529]:

    
# Give our target labels numbers.

df['Label_'] = df['Label'].map({'ham': 0, 'spam': 1})



In [530]:

    
#stop_words = text.ENGLISH_STOP_WORDS

#Adding stop words did not significantly improve the model.



In [531]:

    
#textWithoutNums = text_data.replace('\d+', 'NUM_', regex=True)

#Removing all of the numbers in the messages and replacing with a text string did not improve the model either.
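
For reference, the number-replacement idea from the commented-out line can be sketched in plain Python with `re.sub`. The function name `replace_numbers` is hypothetical, and the sample sentence is taken from one of the spam messages shown earlier.

```python
import re


def replace_numbers(text, placeholder='NUM_'):
    """Replace every run of digits with a fixed placeholder string."""
    return re.sub(r'\d+', placeholder, text)


sample = 'To claim call 09061701461. Claim code KL341. Valid 12 hours only.'
print(replace_numbers(sample))
```

Digits embedded in a longer token are replaced as well, so `KL341` becomes `KLNUM_`. That collapses many distinct phone numbers and codes into a few shared tokens.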



In [532]:

    
vectorizer = CountVectorizer(analyzer='word') #, stop_words=stop_words)
#vectorizer.fit(textWithoutNums)
vectorizer.fit(text_data)









    Out[532]:





CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)



In [533]:

    
vectorizer.get_feature_names()









    Out[533]:





['00',
 '000',
 '000pes',
 '008704050406',
 '0089',
 '0121',
 '01223585236',
 '01223585334',
 '0125698789',
 '02',
 '0207',
 '02072069400',
 '02073162414',
 '02085076972',
 '021',
 '03',
 '04',
 '0430',
 '05',
 '050703',
 '0578',
 '06',
 '07',
 '07008009200',
 '07046744435',
 '07090201529',
 '07090298926',
 '07099833605',
 '07123456789',
 '0721072',
 '07732584351',
 '07734396839',
 '07742676969',
 '07753741225',
 '0776xxxxxxx',
 '07781482378',
 '07786200117',
 '077xxx',
 '078',
 '07801543489',
 '07808',
 '07808247860',
 '07808726822',
 '07815296484',
 '07821230901',
 '078498',
 '07880867867',
 '0789xxxxxxx',
 '07946746291',
 '0796xxxxxx',
 '07973788240',
 '07xxxxxxxxx',
 '08',
 '0800',
 '08000407165',
 '08000776320',
 '08000839402',
 '08000930705',
 '08000938767',
 '08001950382',
 '08002888812',
 '08002986030',
 '08002986906',
 '08002988890',
 '08006344447',
 '0808',
 '08081263000',
 '08081560665',
 '0825',
 '083',
 '0844',
 '08448350055',
 '08448714184',
 '0845',
 '08450542832',
 '08452810071',
 '08452810073',
 '08452810075over18',
 '0870',
 '08700435505150p',
 '08700469649',
 '08700621170150p',
 '08701213186',
 '08701237397',
 '08701417012',
 '08701417012150p',
 '0870141701216',
 '087016248',
 '08701752560',
 '087018728737',
 '0870241182716',
 '08702490080',
 '08702840625',
 '08704050406',
 '08704439680',
 '08704439680ts',
 '08706091795',
 '0870737910216yrs',
 '08707500020',
 '08707509020',
 '0870753331018',
 '08707808226',
 '08708034412',
 '08708800282',
 '08709222922',
 '08709501522',
 '0871',
 '087104711148',
 '08712101358',
 '08712103738',
 '0871212025016',
 '08712300220',
 '087123002209am',
 '08712317606',
 '08712400200',
 '08712400602450p',
 '08712400603',
 '08712402050',
 '08712402578',
 '08712402779',
 '08712402902',
 '08712402972',
 '08712404000',
 '08712405020',
 '08712405022',
 '08712460324',
 '08712466669',
 '0871277810710p',
 '0871277810810',
 '0871277810910p',
 '08714342399',
 '087147123779am',
 '08714712379',
 '08714712388',
 '08714712394',
 '08714712412',
 '08714714011',
 '08715203028',
 '08715203649',
 '08715203652',
 '08715203656',
 '08715203677',
 '08715203685',
 '08715203694',
 '08715205273',
 '08715500022',
 '08715705022',
 '08717111821',
 '08717168528',
 '08717205546',
 '0871750',
 '08717507382',
 '08717509990',
 '08717890890',
 '08717895698',
 '08717898035',
 '08718711108',
 '08718720201',
 '08718723815',
 '08718725756',
 '08718726270',
 '087187262701',
 '08718726970',
 '08718726971',
 '08718726978',
 '087187272008',
 '08718727868',
 '08718727870',
 '08718727870150ppm',
 '08718730555',
 '08718730666',
 '08718738001',
 '08718738002',
 '08718738034',
 '08719180219',
 '08719180248',
 '08719181259',
 '08719181503',
 '08719181513',
 '08719839835',
 '08719899217',
 '08719899229',
 '08719899230',
 '09',
 '09041940223',
 '09050000301',
 '09050000332',
 '09050000460',
 '09050000555',
 '09050000878',
 '09050000928',
 '09050001295',
 '09050001808',
 '09050002311',
 '09050003091',
 '09050005321',
 '09050090044',
 '09050280520',
 '09053750005',
 '09056242159',
 '09057039994',
 '09058091854',
 '09058091870',
 '09058094454',
 '09058094455',
 '09058094507',
 '09058094565',
 '09058094583',
 '09058094594',
 '09058094597',
 '09058094599',
 '09058095107',
 '09058095201',
 '09058097189',
 '09058097218',
 '09058098002',
 '09058099801',
 '09061104276',
 '09061104283',
 '09061209465',
 '09061213237',
 '09061221061',
 '09061221066',
 '09061701444',
 '09061701461',
 '09061701851',
 '09061701939',
 '09061702893',
 '09061743386',
 '09061743806',
 '09061743810',
 '09061743811',
 '09061744553',
 '09061749602',
 '09061790121',
 '09061790125',
 '09061790126',
 '09063440451',
 '09063442151',
 '09063458130',
 '0906346330',
 '09064011000',
 '09064012103',
 '09064012160',
 '09064015307',
 '09064017295',
 '09064017305',
 '09064018838',
 '09064019014',
 '09064019788',
 '09065069120',
 '09065069154',
 '09065171142',
 '09065174042',
 '09065394514',
 '09065394973',
 '09065989180',
 '09065989182',
 '09066350750',
 '09066358152',
 '09066358361',
 '09066361921',
 '09066362206',
 '09066362220',
 '09066362231',
 '09066364311',
 '09066364349',
 '09066364589',
 '09066368327',
 '09066368470',
 '09066368753',
 '09066380611',
 '09066382422',
 '09066612661',
 '09066649731from',
 '09066660100',
 '09071512432',
 '09071512433',
 '09071517866',
 '09077818151',
 '09090204448',
 '09090900040',
 '09094100151',
 '09094646631',
 '09094646899',
 '09095350301',
 '09096102316',
 '09099725823',
 '09099726395',
 '09099726429',
 '09099726481',
 '09099726553',
 '09111030116',
 '09111032124',
 '09701213186',
 '0a',
 '0quit',
 '10',
 '100',
 '1000',
 '1000call',
 '1000s',
 '100p',
 '100percent',
 '100txt',
 '1013',
 '1030',
 '10am',
 '10k',
 '10p',
 '10ppm',
 '10th',
 '11',
 '1120',
 '113',
 '1131',
 '114',
 '1146',
 '116',
 '1172',
 '118p',
 '11mths',
 '11pm',
 '12',
 '1205',
 '120p',
 '121',
 '1225',
 '123',
 '125',
 '1250',
 '125gift',
 '128',
 '12hours',
 '12hrs',
 '12mths',
 '13',
 '130',
 '1327',
 '139',
 '14',
 '140',
 '1405',
 '140ppm',
 '145',
 '1450',
 '146tf150p',
 '14tcr',
 '14thmarch',
 '15',
 '150',
 '1500',
 '150p',
 '150p16',
 '150pm',
 '150ppermesssubscription',
 '150ppm',
 '150ppmpobox10183bhamb64xe',
 '150ppmsg',
 '150pw',
 '151',
 '153',
 '15541',
 '15pm',
 '16',
 '165',
 '1680',
 '169',
 '177',
 '18',
 '180',
 '1843',
 '18p',
 '18yrs',
 '195',
 '1956669',
 '1apple',
 '1b6a5ecef91ff9',
 '1cup',
 '1da',
 '1er',
 '1hr',
 '1im',
 '1lemon',
 '1mega',
 '1million',
 '1pm',
 '1st',
 '1st4terms',
 '1stchoice',
 '1stone',
 '1thing',
 '1tulsi',
 '1win150ppmx3',
 '1winaweek',
 '1winawk',
 '1x150p',
 '1yf',
 '20',
 '200',
 '2000',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '200p',
 '2025050',
 '20m12aq',
 '20p',
 '21',
 '21870000',
 '21st',
 '22',
 '220',
 '220cm2',
 '2309',
 '23f',
 '23g',
 '24',
 '24hrs',
 '24m',
 '24th',
 '25',
 '250',
 '250k',
 '255',
 '25p',
 '26',
 '2667',
 '26th',
 '27',
 '28',
 '2814032',
 '28days',
 '28th',
 '28thfeb',
 '29',
 '2b',
 '2bold',
 '2c',
 '2channel',
 '2day',
 '2docd',
 '2end',
 '2exit',
 '2ez',
 '2find',
 '2getha',
 '2geva',
 '2go',
 '2gthr',
 '2hook',
 '2hrs',
 '2i',
 '2kbsubject',
 '2lands',
 '2marrow',
 '2moro',
 '2morow',
 '2morro',
 '2morrow',
 '2morrowxxxx',
 '2mro',
 '2mrw',
 '2mwen',
 '2nd',
 '2nhite',
 '2nights',
 '2nite',
 '2optout',
 '2p',
 '2price',
 '2px',
 '2rcv',
 '2stop',
 '2stoptx',
 '2stoptxt',
 '2u',
 '2u2',
 '2watershd',
 '2waxsto',
 '2wks',
 '2wt',
 '2wu',
 '2years',
 '2yr',
 '2yrs',
 '30',
 '300',
 '3000',
 '300603',
 '300603t',
 '300p',
 '3030',
 '30apr',
 '30ish',
 '30pm',
 '30pp',
 '30s',
 '30th',
 '31',
 '3100',
 '310303',
 '31p',
 '32',
 '32000',
 '3230',
 '32323',
 '326',
 '33',
 '330',
 '350',
 '3510i',
 '35p',
 '3650',
 '36504',
 '3680',
 '373',
 '3750',
 '37819',
 '38',
 '382',
 '391784',
 '3aj',
 '3d',
 '3days',
 '3g',
 '3gbp',
 '3hrs',
 '3lions',
 '3lp',
 '3miles',
 '3mins',
 '3mobile',
 '3optical',
 '3pound',
 '3qxj9',
 '3rd',
 '3ss',
 '3uz',
 '3wks',
 '3x',
 '3xx',
 '40',
 '400',
 '400mins',
 '400thousad',
 '402',
 '4041',
 '40411',
 '40533',
 '40gb',
 '40mph',
 '41685',
 '41782',
 '420',
 '42049',
 '4217',
 '42478',
 '42810',
 '430',
 '434',
 '44',
 '440',
 '4403ldnw1a7rw18',
 '44345',
 '447797706009',
 '447801259231',
 '448712404000',
 '449050000301',
 '449071512431',
 '45',
 '450',
 '450p',
 '450ppw',
 '450pw',
 '45239',
 '45pm',
 '47',
 '4719',
 '4742',
 '47per',
 '48',
 '4882',
 '48922',
 '49',
 '49557',
 '4a',
 '4brekkie',
 '4d',
 '4eva',
 '4few',
 '4fil',
 '4get',
 '4give',
 '4got',
 '4goten',
 '4info',
 '4jx',
 '4msgs',
 '4mths',
 '4my',
 '4qf2',
 '4t',
 '4th',
 '4the',
 '4thnov',
 '4txt',
 '4u',
 '4utxt',
 '4w',
 '4ward',
 '4wrd',
 '4xx26',
 '4years',
 '50',
 '500',
 '5000',
 '505060',
 '50award',
 '50ea',
 '50gbp',
 '50p',
 '50perweeksub',
 '50perwksub',
 '50pm',
 '50pmmorefrommobile2bremoved',
 '50ppm',
 '50rcvd',
 '50s',
 '515',
 '5226',
 '523',
 '5249',
 '526',
 '528',
 '530',
 '54',
 '542',
 '545',
 '5digital',
 '5free',
 '5ish',
 '5k',
 '5min',
 '5mls',
 '5p',
 '5pm',
 '5th',
 '5wb',
 '5we',
 '5wkg',
 '5wq',
 '5years',
 '60',
 '600',
 '6031',
 '6089',
 '60p',
 '61',
 '61200',
 '61610',
 '62220cncl',
 '6230',
 '62468',
 '62735',
 '630',
 '63miles',
 '645',
 '65',
 '650',
 '66',
 '6669',
 '674',
 '67441233',
 '68866',
 '69101',
 '69200',
 '69669',
 '69696',
 '69698',
 '69855',
 '69866',
 '69876',
 '69888',
 '69888nyt',
 '69911',
 '69969',
 '69988',
 '6days',
 '6hl',
 '6hrs',
 '6ish',
 '6missed',
 '6months',
 '6ph',
 '6pm',
 '6th',
 '6times',
 '6wu',
 '6zf',
 '700',
 '71',
 '7250',
 '7250i',
 '730',
 '731',
 '74355',
 '75',
 '750',
 '7548',
 '75max',
 '762',
 '7634',
 '7684',
 '77',
 '7732584351',
 '78',
 '786',
 '7876150ppm',
 '79',
 '7am',
 '7cfca1a',
 '7ish',
 '7mp',
 '7oz',
 '7pm',
 '7th',
 '7ws',
 '7zs',
 '80',
 '800',
 '8000930705',
 '80062',
 '8007',
 '80082',
 '80086',
 '80122300p',
 '80155',
 '80160',
 '80182',
 '8027',
 '80488',
 '80608',
 '8077',
 '80878',
 '81010',
 '81151',
 '81303',
 '81618',
 '82050',
 '820554ad0a1705572711',
 '82242',
 '82277',
 '82324',
 '82468',
 '83021',
 '83039',
 '83049',
 '83110',
 '83118',
 '83222',
 '83332',
 '83338',
 '83355',
 '83370',
 '83383',
 '83435',
 '83600',
 '83738',
 '84',
 '84025',
 '84122',
 '84128',
 '84199',
 '84484',
 '85',
 '850',
 '85023',
 '85069',
 '85222',
 '85233',
 '8552',
 '85555',
 '86021',
 '861',
 '864233',
 '86688',
 '86888',
 '87021',
 '87066',
 '87070',
 '87077',
 '87121',
 '87131',
 '8714714',
 '872',
 '87239',
 '87575',
 '8800',
 '88039',
 '88066',
 '88088',
 '88222',
 '88600',
 '88800',
 '8883',
 '88877',
 '88888',
 '89034',
 '89070',
 '89080',
 '89105',
 '89123',
 '89545',
 '89555',
 '89693',
 '89938',
 '8am',
 '8ball',
 '8lb',
 '8p',
 '8pm',
 '8th',
 '8wp',
 '900',
 '9061100010',
 '910',
 '9153',
 '9280114',
 '92h',
 '930',
 '9307622',
 '945',
 '946',
 '95',
 '9755',
 '9758',
 '97n7qp',
 '98321561',
 '99',
 '9996',
 '9ae',
 '9am',
 '9ja',
 '9pm',
 '9t',
 '9th',
 '9yt',
 '____',
 'a21',
 'a30',
 'aa',
 'aah',
 'aaniye',
 'aaooooright',
 'aathi',
 'ab',
 'abbey',
 'abdomen',
 'abeg',
 'abel',
 'aberdeen',
 'abi',
 'ability',
 'abiola',
 'abj',
 'able',
 'abnormally',
 'about',
 'aboutas',
 'above',
 'abroad',
 'absence',
 'absolutely',
 'absolutly',
 'abstract',
 'abt',
 'abta',
 'aburo',
 'abuse',
 'abusers',
 'ac',
 'academic',
 'acc',
 'accent',
 'accenture',
 'accept',
 'access',
 'accessible',
 'accidant',
 'accident',
 'accidentally',
 'accommodation',
 'accommodationvouchers',
 'accomodate',
 'accomodations',
 'accordin',
 'accordingly',
 'account',
 'accounting',
 'accounts',
 'accumulation',
 'achan',
 'ache',
 'achieve',
 'acid',
 'acknowledgement',
 'acl03530150pm',
 'acnt',
 'aco',
 'across',
 'act',
 'acted',
 'actin',
 'acting',
 'action',
 'activ8',
 'activate',
 'active',
 'activities',
 'actor',
 'actual',
 'actually',
 'ad',
 'adam',
 'add',
 'addamsfa',
 'added',
 'addicted',
 'addie',
 'adding',
 'address',
 'adds',
 'adewale',
 'adi',
 'adjustable',
 'admin',
 'administrator',
 'admirer',
 'admission',
 'admit',
 'adore',
 'adoring',
 'adp',
 'adress',
 'adrian',
 'adrink',
 'ads',
 'adsense',
 'adult',
 'adults',
 'advance',
 'adventure',
 'adventuring',
 'advice',
 'advise',
 'advising',
 'advisors',
 'aeronautics',
 'aeroplane',
 'afew',
 'affair',
 'affairs',
 'affection',
 'affectionate',
 'affections',
 'affidavit',
 'afford',
 'afghanistan',
 'afraid',
 'africa',
 'african',
 'aft',
 'after',
 'afternon',
 'afternoon',
 'afternoons',
 'afterwards',
 'aftr',
 'ag',
 'again',
 'against',
 'agalla',
 'age',
 'age16',
 'age23',
 'agency',
 'agent',
 'agents',
 'ages',
 'agidhane',
 'aging',
 'ago',
 'agree',
 'ah',
 'aha',
 'ahead',
 'ahhh',
 ...]



In [534]:

    
pd.DataFrame.from_dict(vectorizer.vocabulary_, orient='index').sort_values(by=0, ascending=False).head()



In [535]:

    
dtm = vectorizer.transform(text_data)
features = pd.DataFrame(dtm.toarray(), columns=vectorizer.get_feature_names())
features.shape









    Out[535]:





(5572, 8713)



In [536]:

    
features.head()









    Out[536]:

   00  000  000pes  008704050406  0089  0121  01223585236  01223585334  0125698789  02  ...  zhong  zindgi  zoe  zogtorius  zoom  zouk  zyada  èn  ú1  〨ud
0   0    0       0             0     0     0            0            0           0   0  ...      0       0    0          0     0     0      0   0   0    0
1   0    0       0             0     0     0            0            0           0   0  ...      0       0    0          0     0     0      0   0   0    0
2   0    0       0             0     0     0            0            0           0   0  ...      0       0    0          0     0     0      0   0   0    0
3   0    0       0             0     0     0            0            0           0   0  ...      0       0    0          0     0     0      0   0   0    0
4   0    0       0             0     0     0            0            0           0   0  ...      0       0    0          0     0     0      0   0   0    0

5 rows × 8713 columns



In [537]:

    
X = features
y = np.array(df['Label_'].tolist())



In [538]:

    
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print(X_train.shape, y_train.shape)









    



(4457, 8713) (4457,)



In [539]:

    
model = MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
model.fit(X_train, y_train)









    Out[539]:





MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)



In [540]:

    
y_pred_class = model.predict(X_test)



In [541]:

    
print(metrics.classification_report(y_test, y_pred_class))
print('Accuracy Score: ', metrics.accuracy_score(y_test, y_pred_class))









    



             precision    recall  f1-score   support

          0       0.99      0.99      0.99       968
          1       0.95      0.97      0.96       147

avg / total       0.99      0.99      0.99      1115

Accuracy Score:  0.989237668161

Using Yellowbrick



In [542]:

    
from yellowbrick.classifier import ClassificationReport
bayes = MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
visualizer = ClassificationReport(bayes, classes=['ham', 'spam'])
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
g = visualizer.poof()



In [543]:

    
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred_class)
sns.set(font_scale=1.5)

ax = plt.subplot()
sns.heatmap(cm, annot=True, ax=ax, fmt='g', cbar=False)


ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(['Ham', 'Spam'])
ax.yaxis.set_ticklabels(['Ham', 'Spam'])
plt.show()

Using the default settings, our model does a pretty good job of predicting spam and ham, although it is not perfect. The confusion matrix shows 12 misclassified messages: 5 false negatives (actual spam predicted to be ham) and 7 false positives (actual ham predicted to be spam).
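
The figures in the classification report can be checked by hand from these counts. The sketch below takes the test-set sizes from the report (968 ham, 147 spam) and the 5 and 7 misclassifications from the confusion matrix, then recomputes precision, recall, and accuracy for the spam class.

```python
# Test-set class sizes from the classification report above.
actual_spam, actual_ham = 147, 968

# Misclassification counts read from the confusion matrix.
false_negatives = 5   # spam predicted as ham
false_positives = 7   # ham predicted as spam

# Correct predictions follow from the class sizes.
true_positives = actual_spam - false_negatives   # spam predicted as spam
true_negatives = actual_ham - false_positives    # ham predicted as ham

# Precision: of the messages flagged as spam, how many really are spam.
precision = true_positives / (true_positives + false_positives)
# Recall: of the actual spam messages, how many were flagged.
recall = true_positives / (true_positives + false_negatives)
accuracy = (true_positives + true_negatives) / (actual_spam + actual_ham)

print(round(precision, 2), round(recall, 2), round(accuracy, 4))
```

Reaching zero false positives means driving spam precision to 1.00, which is why precision is the metric to tune for here.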

I think it is more important to a user to receive 100% of their real messages while tolerating a few spam messages. So let's see if we can tune the model to eliminate the false positives that are tagged as spam but are really ham.

We can use grid search with cross-validation to find the optimal alpha value.
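
As a minimal sketch of the cross-validation idea, the helper below splits sample indices into k folds, each of which serves once as the validation set while the rest are used for training. The function name `k_fold_indices` is hypothetical, and the split is plain and unshuffled; scikit-learn uses a stratified variant for classifiers when `cv` is an integer, which this sketch does not reproduce.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=12, k=5))
for train, validation in folds:
    print(len(train), validation)
```

A grid search repeats this procedure for every parameter combination and keeps the combination with the best average validation score.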



In [544]:

    
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

# Split the dataset in two equal parts
X_train_, X_test_, y_train_, y_test_ = train_test_split(
    X, y, test_size=0.5, random_state=1)

# Set the parameters by cross-validation
tuned_parameters = [{'alpha': [0.5, 1.0, 1.5, 2.0, 2.5, 3.0], 'class_prior':[None], 'fit_prior': [True, False]}]

scores = ['precision', 'recall']

for score in scores:
    print("### Tuning hyper-parameters for %s ###" % score)
    print()

    clf = GridSearchCV(MultinomialNB(), tuned_parameters, cv=5,
                       scoring='%s_macro' % score)
    clf.fit(X_train_, y_train_)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
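    # Note: this evaluates on X_test / y_test from the earlier 80/20 split,
    # not on the X_test_ / y_test_ half created above.
    y_true, y_pred = y_test, clf.predict(X_test)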
    print(classification_report(y_true, y_pred))
    print()
    print('Accuracy Score: ', metrics.accuracy_score(y_test, y_pred))
    print()









    



### Tuning hyper-parameters for precision ###

Best parameters set found on development set:

{'alpha': 3.0, 'class_prior': None, 'fit_prior': True}

Grid scores on development set:

0.950 (+/-0.028) for {'alpha': 0.5, 'class_prior': None, 'fit_prior': True}
0.897 (+/-0.039) for {'alpha': 0.5, 'class_prior': None, 'fit_prior': False}
0.966 (+/-0.023) for {'alpha': 1.0, 'class_prior': None, 'fit_prior': True}
0.913 (+/-0.041) for {'alpha': 1.0, 'class_prior': None, 'fit_prior': False}
0.972 (+/-0.015) for {'alpha': 1.5, 'class_prior': None, 'fit_prior': True}
0.925 (+/-0.020) for {'alpha': 1.5, 'class_prior': None, 'fit_prior': False}
0.973 (+/-0.019) for {'alpha': 2.0, 'class_prior': None, 'fit_prior': True}
0.927 (+/-0.027) for {'alpha': 2.0, 'class_prior': None, 'fit_prior': False}
0.973 (+/-0.017) for {'alpha': 2.5, 'class_prior': None, 'fit_prior': True}
0.931 (+/-0.029) for {'alpha': 2.5, 'class_prior': None, 'fit_prior': False}
0.974 (+/-0.017) for {'alpha': 3.0, 'class_prior': None, 'fit_prior': True}
0.934 (+/-0.032) for {'alpha': 3.0, 'class_prior': None, 'fit_prior': False}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       0.97      1.00      0.99       968
          1       1.00      0.83      0.91       147

avg / total       0.98      0.98      0.98      1115


Accuracy Score:  0.977578475336

### Tuning hyper-parameters for recall ###

Best parameters set found on development set:

{'alpha': 0.5, 'class_prior': None, 'fit_prior': True}

Grid scores on development set:

0.968 (+/-0.024) for {'alpha': 0.5, 'class_prior': None, 'fit_prior': True}
0.960 (+/-0.028) for {'alpha': 0.5, 'class_prior': None, 'fit_prior': False}
0.957 (+/-0.014) for {'alpha': 1.0, 'class_prior': None, 'fit_prior': True}
0.962 (+/-0.028) for {'alpha': 1.0, 'class_prior': None, 'fit_prior': False}
0.937 (+/-0.022) for {'alpha': 1.5, 'class_prior': None, 'fit_prior': True}
0.953 (+/-0.021) for {'alpha': 1.5, 'class_prior': None, 'fit_prior': False}
0.918 (+/-0.021) for {'alpha': 2.0, 'class_prior': None, 'fit_prior': True}
0.938 (+/-0.021) for {'alpha': 2.0, 'class_prior': None, 'fit_prior': False}
0.909 (+/-0.025) for {'alpha': 2.5, 'class_prior': None, 'fit_prior': True}
0.921 (+/-0.021) for {'alpha': 2.5, 'class_prior': None, 'fit_prior': False}
0.888 (+/-0.040) for {'alpha': 3.0, 'class_prior': None, 'fit_prior': True}
0.913 (+/-0.029) for {'alpha': 3.0, 'class_prior': None, 'fit_prior': False}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       0.99      0.99      0.99       968
          1       0.94      0.96      0.95       147

avg / total       0.99      0.99      0.99      1115


Accuracy Score:  0.986547085202

Since we are most concerned with minimizing false positives (ham classified as spam), we will use an alpha value of 3.0 with fit_prior=True, the parameters that grid search selected for precision.
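
To make the two parameters more concrete, the sketch below shows the additive smoothing formula that alpha controls, (count + alpha) / (total + alpha * vocabulary size), and the difference between a learned and a uniform class prior. The word counts and vocabulary size are hypothetical toy values, and the class counts of the full dataset stand in for the training-set counts.

```python
import math


def smoothed_likelihood(word_count, total_count, vocab_size, alpha):
    """Additive smoothing: (count + alpha) / (total + alpha * vocab_size)."""
    return (word_count + alpha) / (total_count + alpha * vocab_size)


# Toy class with 1000 tokens and a 100-term vocabulary (hypothetical numbers).
results = {}
for alpha in (0.5, 1.0, 3.0):
    unseen = smoothed_likelihood(0, 1000, 100, alpha)     # word never seen in this class
    frequent = smoothed_likelihood(20, 1000, 100, alpha)  # word seen 20 times
    results[alpha] = (unseen, frequent)
    print(alpha, round(unseen, 5), round(frequent, 5))

# fit_prior=True learns the class prior from class frequencies;
# fit_prior=False uses a uniform prior instead.
ham, spam = 4825, 747
learned_prior_spam = spam / (ham + spam)
uniform_prior_spam = 0.5
print(round(math.log(learned_prior_spam), 3), round(math.log(uniform_prior_spam), 3))
```

A larger alpha moves probability mass from frequent words to unseen ones, so individual words carry less evidence. With a learned prior, the rarer spam class starts from a lower score than it would under a uniform prior.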



In [545]:

    
from yellowbrick.classifier import ClassificationReport
bayes = MultinomialNB(alpha=3.0, class_prior=None, fit_prior=True)
visualizer = ClassificationReport(bayes, classes=['ham', 'spam'])
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
g = visualizer.poof()



In [546]:

    
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print(X_train.shape, y_train.shape)









    



(4457, 8713) (4457,)



In [547]:

    
model = MultinomialNB(alpha=3.0, class_prior=None, fit_prior=True)
model.fit(X_train, y_train)









    Out[547]:





MultinomialNB(alpha=3.0, class_prior=None, fit_prior=True)



In [548]:

    
y_pred_class = model.predict(X_test)



In [549]:

    
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred_class)
sns.set(font_scale=1.5)

ax = plt.subplot()
sns.heatmap(cm, annot=True, ax=ax, fmt='g', cbar=False)


ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(['Ham', 'Spam'])
ax.yaxis.set_ticklabels(['Ham', 'Spam'])
plt.show()

By using the alpha value that grid search selected for precision, we are able to eliminate the false positives (messages predicted to be spam that are actually ham), at the expense of more false negatives: spam messages that are incorrectly labeled as ham.



Output of df.groupby('Label').describe() (In [525]), message length by label:

Label  count   mean        std        min   25%    50%    75%    max
ham    4825.0  71.482487   58.440652  2.0   33.0   52.0   93.0   910.0
spam   747.0   138.670683  28.873603  13.0  133.0  149.0  157.0  223.0
