In [1]:
%autosave 0
from IPython.display import HTML, display
display(HTML('<style>.container { width:100%; } </style>'))
The process of creating a spam detector using a Support Vector Machine is split up into five steps.
In [2]:
import os
import re
import numpy as np
import math
In [3]:
from collections import Counter
The directory https://github.com/karlstroetmann/Artificial-Intelligence/tree/master/Python/EmailData contains 960 emails that are divided into four subdirectories:
spam-train contains 350 spam emails for training,
ham-train contains 350 non-spam emails for training,
spam-test contains 130 spam emails for testing,
ham-test contains 130 non-spam emails for testing.
I have found this data on the page http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex6/ex6.html provided by Andrew Ng.
We declare some variables so that this notebook can be adapted to other data sets.
In [4]:
spam_dir_train = 'EmailData/spam-train/'
ham__dir_train = 'EmailData/ham-train/'
spam_dir_test = 'EmailData/spam-test/'
ham__dir_test = 'EmailData/ham-test/'
Directories = [spam_dir_train, ham__dir_train, spam_dir_test, ham__dir_test]
The function $\texttt{get_words_set}(\texttt{fn})$ takes a filename $\texttt{fn}$ as its argument. It reads the file and returns the set
of all words that are found in this file. The words are transformed to lower case.
In [5]:
def get_words_set(fn):
    with open(fn) as file:
        text = file.read()
    text = text.lower()
    return set(re.findall(r"[\w']+", text))
The function read_all_files reads all files contained in the directories that are stored in the list Directories.
It returns a Counter. For every word $w$, this counter contains the number of files that contain $w$.
In [6]:
def read_all_files():
    Words = Counter()
    for directory in Directories:
        for file_name in os.listdir(directory):
            Words.update(get_words_set(directory + file_name))
    return Words
Common_Words is a numpy array of the 2500 most common words found in all of our emails.
In [11]:
M = 2500 # number of the most common words to use
Word_Counter = read_all_files()
Common_Words = np.array(list({ w for w, _ in Word_Counter.most_common(M) }))
In [12]:
Common_Words
Out[12]:
Index_Dict is a dictionary that maps the most common words to their index in the array Common_Words.
In [13]:
Index_Dict = { w: i for i, w in enumerate(Common_Words) }
Index_Dict
Out[13]:
The function $\texttt{transform_to_vector}(L)$ takes a list of words $L$ and transforms this list into a vector $\mathbf{v}$. If $\texttt{Common_Words}[i] = w$, then $\mathbf{v}[i]$ specifies the number of times that $w$ occurs in $L$.
In [14]:
def transform_to_vector(L):
    Result = np.zeros(len(Common_Words))
    for w in L:
        if w in Index_Dict:
            Result[Index_Dict[w]] += 1
    return Result
The function $\texttt{get_word_vector}(\texttt{fn})$ takes a filename $\texttt{fn}$, reads the specified file, and transforms it into a feature vector.
In [15]:
def get_word_vector(fn):
    with open(fn) as file:
        text = file.read()
    text = text.lower()
    return transform_to_vector(re.findall(r"[\w']+", text))
In natural language processing, the notion term is used as a synonym for word. Given a term $t$ and a document $d$, the term frequency $\texttt{tf}(t, d)$ is defined as $$ \texttt{tf}(t, d) = \frac{d.\texttt{count}(t)}{\texttt{len}(d)}, $$ where $d.\texttt{count}(t)$ counts the number of times $t$ appears in $d$ and $\texttt{len}(d)$ is the length of the list representing $d$.
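For illustration, this definition can be written down directly in Python; the function tf and the sample document below are only made up for this example and are not used anywhere else in the notebook.

def tf(t, d):
    """Return the term frequency of the term t in the document d (a list of words)."""
    return d.count(t) / len(d)

tf('free', ['free', 'money', 'now', 'free'])   # yields 2 / 4 = 0.5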
A corpus is a set of documents. Given a term $t$ and a corpus $\mathcal{C}$, the inverse document frequency
$\texttt{idf}(t,\mathcal{C})$ is defined as
$$ \texttt{idf}(t,\mathcal{C}) = \ln\left(\frac{\texttt{card}(\mathcal{C}) + 1}{\texttt{card}\bigl(\{ d \in \mathcal{C} \mid t \in d \}\bigr) + 1}\right). $$
The addition of $1$ in both numerator and denominator is called Laplace smoothing. It is necessary to prevent a division by zero for terms $t$ that do not occur in any document of the corpus.
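Likewise, a small sketch of the smoothed inverse document frequency; the function idf and the toy corpus are made up for demonstration only.

def idf(t, Corpus):
    """Return the smoothed inverse document frequency of the term t,
       where Corpus is a list of documents, each represented as a set of words.
    """
    return math.log((len(Corpus) + 1) / (sum(t in d for d in Corpus) + 1))

Corpus = [{'free', 'money'}, {'meeting', 'tomorrow'}, {'free', 'offer'}]
idf('free',   Corpus)   # ln(4 / 3) ≈ 0.288
idf('unseen', Corpus)   # ln(4 / 1) ≈ 1.386, no division by zero thanks to the smoothing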
The function $\texttt{feature_matrix}(\texttt{spam_dir}, \texttt{ham_dir})$ takes two directories that contain spam and ham, respectively. It computes a matrix $X$ and a vector $Y$: $X$ is the feature matrix, and for every row $r$ of $X$, $Y[r]$ is $1$ if the corresponding mail is ham and $0$ if it is spam.
The way $X$ is computed is quite inefficient; it would have been better to initialize $X$ as a matrix of shape $(N, M)$, where $N$ is the number of mails and $M$ is the number of common words. A sketch of this more efficient variant is shown after the cell below.
In [16]:
def feature_matrix(spam_dir, ham_dir):
    X = []
    Y = []
    for fn in os.listdir(spam_dir):
        X.append(get_word_vector(spam_dir + fn))
        Y.append(0)
    for fn in os.listdir(ham_dir):
        X.append(get_word_vector(ham_dir + fn))
        Y.append(1)
    X = np.array(X)
    Y = np.array(Y)
    return X, Y
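As announced above, a sketch of the more efficient variant that preallocates $X$ could look as follows; the name feature_matrix_fast is made up, and this function is not used in the remainder of the notebook.

def feature_matrix_fast(spam_dir, ham_dir):
    """Preallocate the feature matrix instead of appending row by row."""
    spam_files = [spam_dir + fn for fn in os.listdir(spam_dir)]
    ham_files  = [ham_dir  + fn for fn in os.listdir(ham_dir)]
    N = len(spam_files) + len(ham_files)
    X = np.zeros((N, len(Common_Words)))
    Y = np.zeros(N)
    for row, fn in enumerate(spam_files + ham_files):
        X[row, :] = get_word_vector(fn)
    Y[len(spam_files):] = 1   # ham mails are labelled 1, spam mails 0
    return X, Y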
We convert the training set into a feature matrix.
In [17]:
%%time
X_train, Y_train = feature_matrix(spam_dir_train, ham__dir_train)
Up to now, the feature matrix contains only the term frequencies. Next, we multiply them by the inverse document frequencies.
In [18]:
N, _ = X_train.shape
IDF = {}
for w, i in Index_Dict.items():
    IDF[w] = np.log((N + 1) / (Word_Counter[w] + 1))
    X_train[:, i] = X_train[:, i] * IDF[w]
We build the feature matrix for the test set.
In [19]:
X_test, Y_test = feature_matrix(spam_dir_test, ham__dir_test)
In [20]:
for w, i in Index_Dict.items():
    X_test[:, i] = X_test[:, i] * IDF[w]
In [21]:
import sklearn.svm as svm
Train an SVM and compute the accuracy on the training data.
In [22]:
M = svm.SVC(kernel='linear', C=100000)
M.fit(X_train, Y_train)
M.score(X_train, Y_train)
Out[22]:
Compute the accuracy for the test data.
In [23]:
M.score(X_test, Y_test)
Out[23]:
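Finally, to classify a single new email, the trained model needs the same preprocessing that was applied to the training data. The following is a minimal sketch; the name classify is made up, and the call below just picks the first file of the spam test directory.

def classify(fn):
    """Classify the email stored in the file fn as 'ham' or 'spam'."""
    v = get_word_vector(fn)          # term counts for the 2500 common words
    for w, i in Index_Dict.items():
        v[i] *= IDF[w]               # apply the same tf-idf weighting as for training
    return 'ham' if M.predict([v])[0] == 1 else 'spam'

classify(spam_dir_test + os.listdir(spam_dir_test)[0])   # classify the first spam mail of the test set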