In [1]:

%autosave 0
from IPython.display import HTML, display
display(HTML('<style>.container { width:100%; } </style>'))




Spam Detection Using the Naive Bayes Algorithm

The process of creating a spam detector using the naive Bayes algorithm is split into four steps.

• Create a set of the most common words occurring in spam and ham (i.e. non-spam) emails.
• For every word in this set, compute the conditional probability that this word occurs in a spam email and the conditional probability that it occurs in a ham email.
• Create a function that takes an email and the conditional probabilities computed before and then computes the probability that the given email is spam.
• Evaluate the precision and the recall of the spam classifier.

Step 1: Create Word Dictionary

We need the module os for reading directories and the module re for regular expressions.

In [2]:

import os
import re
import numpy as np
import math

An object of class Counter is a special form of dictionary that is used for counting. We need a counter to figure out what the most common words are.

In [3]:

from collections import Counter

The directory https://github.com/karlstroetmann/Artificial-Intelligence/tree/master/Python/EmailData contains 960 emails that are divided into four subdirectories:

• spam-train contains 350 spam emails for training,
• ham-train contains 350 non-spam emails for training,
• spam-test contains 130 spam emails for testing,
• ham-test contains 130 non-spam emails for testing.

This data was originally collected by Ion Androutsopoulos. I found it on the page http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex6/ex6.html provided by Andrew Ng.

We declare some variables so this notebook can be adapted to other data sets.
In [4]:

spam_dir_train = 'EmailData/spam-train/'
ham__dir_train = 'EmailData/ham-train/'
spam_dir_test = 'EmailData/spam-test/'
ham__dir_test = 'EmailData/ham-test/'
Directories = [spam_dir_train, ham__dir_train, spam_dir_test, ham__dir_test]

In order to compute the prior probability that an email is ham or spam, we need to count the number of spam and ham emails.

In [5]:

no_spam = len(os.listdir(spam_dir_train))
no_ham = len(os.listdir(ham__dir_train))
spam_prior = no_spam / (no_spam + no_ham)
ham__prior = no_ham / (no_spam + no_ham)
spam_prior, ham__prior

Out[5]: (0.5, 0.5)

I have checked that the proportion of spam and ham emails in the test directory is also $1:1$. If the proportion of spam and ham emails in real life is different from $1:1$, then we would have to use that proportion in the spam filter to be developed.

The function $\texttt{get_words}(\texttt{fn})$ takes a filename $\texttt{fn}$ as its argument. It reads the file and returns the set of all words that are found in this file. The words are transformed to lower case.

In [6]:

def get_words(fn):
    # The with statement closes the file even if reading fails.
    with open(fn) as file:
        text = file.read()
    text = text.lower()
    return set(re.findall(r"[\w']+", text))

Let us test this function with a small example mail.

In [7]:

get_words('EmailData/ham-train/3-380msg4.txt')

Out[7]: {'anyone', 'article', 'berkeley', 'book', 'consonant', 'edu', 'english', 'garnet', 'hard', 'helpful', 'hi', 'interest', 'irish', 'laurel', 'm', 'modern', 'palatal', 'phonetics', 'posting', 'project', 'recommend', 'slender', 'source', 'specifically', 'sutton', 'thank', 'too', 'work'}

The function read_all_files reads all files contained in the directories that are stored in the list Directories. It returns a Counter. For every word $w$, this counter contains the number of files that contain $w$.
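The counting scheme described above can be illustrated on a toy example. The following is only a sketch: the three strings are made up, and the helper words_of is a hypothetical stand-in for get_words that works on a string instead of a file. It shows that updating a Counter with a set of words adds at most one per email, so the counter records the number of emails containing a word rather than the total number of occurrences.

```python
import re
from collections import Counter

def words_of(text):
    """Hypothetical helper: like get_words, but takes a string instead of a filename."""
    return set(re.findall(r"[\w']+", text.lower()))

# Three made-up "emails" used only for illustration.
toy_emails = [
    "Free money! Free offer!",
    "Money for the conference",
    "Conference paper deadline",
]

doc_freq = Counter()
for text in toy_emails:
    # Updating with a set adds at most 1 per email, so the repeated
    # word "free" in the first email is counted only once.
    doc_freq.update(words_of(text))

print(doc_freq['free'])        # 1
print(doc_freq['money'])       # 2
print(doc_freq['conference'])  # 2
```

Counting emails instead of occurrences matters for the later steps, because the conditional probabilities are meant to describe how likely a word is to appear in an email at all.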
In [8]:

def read_all_files():
    Words = Counter()
    for directory in Directories:
        for file_name in os.listdir(directory):
            Words.update(get_words(directory + file_name))
    return Words

Common_Words is a set of the 2500 most common words found in all of our emails.

In [9]:

N = 2500  # number of the most common words to use
Word_Counter = read_all_files()
Word_Counter

Out[9]: Counter({'eminent': 9, 'earn': 69, 'experience': 123, 'through': 155, 'phd': 22, 'prestige': 9, 'increase': 69, 'grant': 23, 'effort': 75, 'mba': 8, 'choice': 51, 'here': 259, 'short': 86, 'field': 117, 'part': 131, 'personal': 102, 'programs': 21, 'base': 134, 'ba': 13, 'phone': 202, 'power': 52, 'necessary': 55, 'degree': 41, 'further': 154, 'detail': 143, 'call': 347, 'advance': 81, 'require': 131, 'nonaccredit': 8, 'award': 20, 'present': 142, 'knowledge': 72, 'money': 187, 'university': 307, 'diploma': 10, 'ma': 37, 'cost': 147, 'entire': 45, 'conference': 138, 'grab': 9, 'week': 173, 'receive': 283, 'start': 173, 'leverage': 5, 'offence': 4, 'our': 365, 'delete': 59, 'po': 53, 'old': 83, 'mailer': 20, 'financial': 70, 'member': 104, 'problem': 128, 'believe': 103, 'ago': 65, 'throw': 20, 'customer': 69, 'hello': 54, 'letter': 106, 'inexpensive': 24, 'guarantee': 100, 'ignore': 42, 'complete': 119, 'control': 53, 'outside': 43, 'cash': 91, 'name': 289, 'usa': 122, 'state': 220, 'pardon': 9, 'texa': 35, 'cst': 5, 'reside': 3, 'send': 360, 'lifeline': 1, 'later': 81, 'without': 122, 'print': 107, 'program': 226, 'honestly': 6, 'best': 206, 'nobrainer': 1, 'one': 404, 'note': 148, 'free': 302, 'show': 161, 'computer': 152, 'credit': 103, 'registration': 86, 'must': 181, 'grapevine': 1, 'process': 161, 'center': 60, 'today': 179, 'weekly': 35, 'mind': 62, 'zip': 75, 'interest': 283, 'compound': 12, 'few': 128, 'address': 379, 'simple': 111, 'telephone': 91, 'educational': 22, 'main': 72, 'worth': 48, 'entitle': 13, 'convert': 12, 'plan': 88, 's': 560, 'message': 189, 'join': 95, 'number': 248,
'respond': 45, 'box': 124, 'achieve': 42, 'card': 112, 'life': 99, 'solution': 28, 'mortgage': 18, 'please': 445, 'city': 120, 'information': 448, 'especially': 74, 'net': 100, 'id': 34, 'participate': 63, 'us': 308, 'pull': 8, 'independence': 14, 'tuesday': 21, 'enable': 26, 'company': 139, 'over': 250, 'simply': 123, 'night': 39, 'pm': 42, 'finances': 2, 'intrusion': 18, 'return': 103, 'solid': 15, 'establish': 35, 'mean': 81, 'freedom': 47, 'peace': 7, 'form': 210, 'begin': 69, 'system': 171, 'debt': 40, 'obtain': 41, 'secure': 32, 'per': 141, 'pack': 15, 'cozy': 1, 'oct': 6, 'vacation': 37, 'west': 26, 'archery': 1, 'felton': 1, 'pay': 149, 'e': 294, 'home': 161, 'accomodation': 10, 'virginium': 9, 'turkey': 3, 'deer': 1, 'loader': 6, 'wonderful': 12, 'sesson': 1, 'cook': 3, 'economical': 3, 'meal': 6, 'buck': 12, 'room': 55, 'mail': 350, 'reserve': 28, 'stay': 32, 'noon': 5, 'nov': 3, 'muzzel': 1, 'hunt': 3, 'season': 10, 'announce': 71, 'want': 231, 'follow': 320, 'space': 44, 'wood': 3, 'com': 257, 'compuserve': 26, 'day': 244, 'dec': 7, 'wild': 7, 'lunch': 44, 'book': 145, 'camp': 6, 'three': 103, 'doe': 29, 'additional': 110, 'million': 111, 'wi': 1, 'reach': 72, 'commercial': 37, 'info': 71, 'future': 116, 'success': 82, 'nettool': 1, 'fingertip': 9, 'internet': 188, 'software': 119, 'network': 43, 'search': 88, 'permanently': 12, 'area': 161, 'evaluation': 39, 'proper': 23, 'requirement': 43, 'presence': 26, 'section': 74, 'stop': 75, 'regard': 60, 'propose': 46, 'web': 211, 'advantage': 67, 'sender': 27, 'certain': 47, 'help': 164, 'remove': 203, 'storefront': 2, 'target': 36, 'product': 137, 'fellow': 22, 'promote': 38, 'luck': 34, 'basis': 64, 'request': 157, 'loc': 2, 'comply': 24, 'recent': 65, 'lead': 63, 'mailing': 71, 'bill': 84, 'selection': 38, 'c': 174, 'ooo': 1, 'waterford': 1, 'reply': 131, 'ten': 35, 'paragraph': 13, 'post': 113, 'unite': 61, 'transmission': 13, 'gov': 30, 'http': 399, 'entrepreneur': 14, 'subject': 192, 'tool': 70, 
'service': 171, 'dear': 70, 'business': 164, 'assist': 24, 'level': 107, 'need': 250, 'sale': 74, 'thoma': 16, 'item': 40, 'unbelievable': 9, 'much': 190, 'try': 125, 'set': 105, 'wish': 142, 'thank': 182, 'market': 156, 'email': 429, 'vast': 5, 'online': 126, 'venture': 7, 'federal': 36, 'audience': 16, 'unwise': 1, 'check': 210, 'greatest': 35, 'unmissable': 1, 're': 198, 'titanictesco': 1, 'park': 32, 'fame': 1, 'onto': 10, 'release': 45, 'include': 354, 'player': 13, 'visit': 138, 'ultimate': 15, 'refreshment': 1, 'stack': 7, 'gossip': 6, 'shop': 35, 'while': 116, 'chart': 10, 'cd': 63, 'never': 120, 'unlikely': 4, 'package': 73, 'alway': 91, 'www': 296, 'undead': 1, 'band': 12, 'why': 118, 'billy': 2, 'event': 50, 'full': 143, 'right': 144, 'digital': 33, 'delay': 16, 'yourself': 90, 'late': 30, 'friend': 90, 'easy': 125, 'available': 254, 'beautiful': 32, 'placebo': 3, 'chance': 63, 'fantastic': 35, 'top': 77, 'pick': 44, 'mtv': 1, 'glamour': 2, 'run': 81, 'access': 98, 'john': 96, 'competition': 40, 'click': 135, 'offer': 229, 'compaq': 5, 'n': 78, 'pop': 22, 'roll': 24, 'scoop': 6, 'dizzy': 1, 'premiere': 4, 'big': 63, 'sound': 76, 'bathtub': 2, 'reporter': 6, 'crash': 7, 'witch': 7, 'radio': 28, 'tesco': 1, 'portrait': 1, 'drink': 6, 'milan': 4, 'down': 96, 'atmosphere': 4, 'play': 64, 'provide': 203, 'london': 40, 'thing': 109, 'aqua': 3, 'crazy': 6, 'fun': 66, 'tale': 3, 'site': 218, 'record': 65, 'spellbind': 1, 'prepare': 44, 'nt': 222, 'true': 78, 'leicester': 3, 'unsubscribe': 23, 'glitz': 2, 'b': 118, 'technology': 84, 'xpack': 1, 'robbie': 7, 'emma': 1, 'fizzy': 1, 'rem': 4, 'icon': 6, 'miss': 66, 'exclusive': 39, 'capitalfm': 23, 'hit': 54, 'spook': 1, 'thursday': 26, 'save': 108, 'straight': 13, 'choose': 75, 'question': 221, 'rock': 11, 'star': 27, 'music': 29, 'europe': 41, 'halloween': 1, 'bumper': 1, 'hesitate': 39, 'accelerate': 5, 'graphic': 29, 'storm': 10, 'horror': 1, 'instant': 16, 'supply': 23, 'special': 162, 'spin': 6, 'prizewin': 1, 
'll': 147, 'regular': 41, 'hurry': 12, 'many': 244, 'even': 194, 'colors': 1, 'reveal': 18, 'celine': 3, 'ghost': 1, 'too': 85, 'attend': 30, 've': 124, 'website': 81, 'starstud': 1, 'travolta': 1, 'foyer': 2, 'adulterous': 1, 'list': 329, 'classic': 12, 'absolutely': 50, 'south': 45, 'enter': 74, 'latest': 57, 'doorstep': 5, 'pc': 37, 'prize': 25, 'label': 21, 'roundup': 3, 'connolly': 1, 'dion': 3, 'tell': 130, 'megastar': 1, 'fill': 88, 'desktop': 7, 'presario': 4, 'dolby': 4, 'nail': 3, 'win': 120, 'paradise': 14, 'stock': 31, 'thompson': 10, 'scary': 7, 'titanic': 1, 'couple': 38, 'guess': 17, 'discount': 30, 'flick': 1, 'u': 123, 'entirely': 11, 'amaze': 52, 'link': 96, 'advertisement': 54, 'better': 106, 'william': 34, 'feel': 62, 'become': 90, 'spooky': 1, 'album': 16, 'game': 46, 'still': 99, 'manufacturer': 13, 'buy': 111, 'primary': 23, 'bring': 102, 'screen': 25, 'president': 18, 'biz': 10, 'coolest': 3, 'surround': 12, 'poster': 24, 'everythe': 21, 'fm': 15, 'focus': 78, 'talk': 92, 'team': 32, 'jimmy': 2, 'mailbox': 27, 'cdparadise': 4, 'next': 121, 'catch': 18, 'favourite': 13, 'world': 183, 'saint': 5, 'laugh': 19, 'up': 1, 'whether': 85, 'performance': 22, 'bunch': 6, 'hot': 45, 'bath': 3, 'head': 44, 'fantasy': 7, 'square': 6, 'capital': 58, 'movie': 32, 'major': 123, 'submission': 95, 'hrs': 3, 'resubmit': 2, 'meta': 3, 'automatically': 39, 'report': 129, 'calle': 4, 'notice': 40, 'engine': 50, 'compose': 6, 'fees': 8, 'within': 176, 'advertiser': 16, 'bulk': 79, 'after': 163, 'each': 201, 'etc': 150, 'every': 171, 'appropriate': 33, 'page': 181, 'toll': 50, 'monthly': 39, 'pro': 19, 'hr': 18, 'extractor': 14, 'block': 30, 'month': 131, 'review': 87, 'trie': 4, 'submit': 104, 'media': 16, 'tag': 11, 'thousands': 12, 'solve': 12, 'helps': 1, 'reg': 15, 'dollar': 98, 'something': 67, 'gotta': 5, 'wasus': 2, 'spam': 31, 'safeaddress': 1, 'idc': 2, 'discreet': 5, 'powerful': 39, 'quickly': 37, 'exceptions': 4, 'community': 45, 't': 146, 'high': 88, 
'literally': 13, 'general': 107, 'along': 78, 'travel': 71, 'ask': 137, 'benefit': 38, 'oversea': 14, 'paper': 176, 'finance': 17, 'soundest': 4, 'promise': 37, 'legally': 14, 'amount': 101, 'extract': 29, 'clearly': 36, 'confirm': 27, 'certainly': 25, 'espouse': 5, 'upon': 51, 'contract': 19, 'beverly': 2, 'word': 191, 'extra': 64, 'nbc': 4, 'thousand': 99, 'means': 50, 'curency': 2, 'work': 304, 'soon': 95, 'monitor': 13, 'before': 190, 'ending': 2, 'themselve': 51, 'transact': 4, 'vary': 24, 'tran': 7, 'march': 62, 'transaction': 9, 'move': 71, 'ca': 111, 'under': 109, 'exactly': 65, 'kid': 23, 'public': 41, 'bl': 4, 'nightly': 7, 'view': 66, 'greatly': 16, 'earlier': 29, 'contact': 203, 'likewise': 13, 'currency': 17, 'minute': 106, 'wall': 17, 'create': 89, 'reason': 77, 'daily': 40, 'yet': 50, 'effect': 46, 'editorial': 15, 'santa': 14, 'optional': 26, 'conversion': 7, 'flaw': 6, 'back': 139, 'completely': 65, 'end': 105, 'amass': 4, 'individual': 75, 'operate': 37, 'organization': 50, 'however': 97, 'watch': 64, 'someone': 86, 'rate': 101, 'iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiius': 2, 'wealth': 16, 'fortune': 26, 'own': 170, 'wealthiest': 6, 'cartel': 4, 'explosive': 8, 'political': 23, 'membership': 30, 'corner': 14, 'national': 53, 'change': 137, 'hemisphere': 4, 'payable': 59, 'mllionaire': 2, 'attache': 2, 'dollars': 21, 'write': 160, 'o': 120, 'overnight': 34, 'anniversarry': 2, 'let': 121, 'group': 104, 'first': 274, 'assure': 16, 'rumble': 4, 'profile': 14, 'same': 154, 'attention': 60, 'iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiius': 2, 'publication': 87, 'continue': 64, 'postage': 23, 'else': 87, 'gold': 28, 'instruction': 102, 'nor': 42, 'cold': 9, 'm': 209, 'int': 6, 'fee': 94, 'most': 232, 'date': 129, 'different': 161, 'announcement': 47, 'concern': 64, 'glad': 15, 'unlike': 16, 'earth': 33, 'guise': 6, 'able': 99, 'parent': 10, 'easily': 63, 'anyone': 130, 'add': 122, 'york': 68, 'depend': 39, 'long': 83, 'ourself': 2, 'allow': 112, 
'action': 61, 'pertinent': 5, 'below': 213, 'street': 65, 'exist': 63, 'operation': 23, 'legal': 67, 'advice': 18, 'monica': 4, 'extremely': 47, 'disclose': 6, 'leave': 103, 'cancel': 13, 'important': 110, 'californium': 52, 'lessly': 2, 'refund': 34, 'american': 88, 'uniform': 7, 'document': 38, 'confidential': 16, 'supporter': 6, 'hand': 89, 'read': 167, 'conclude': 22, 'reiterate': 4, 'keep': 120, 'grow': 52, 'until': 86, 'surely': 28, 'hi': 34, 'secret': 55, 'global': 33, 'unlimit': 34, 'administrative': 8, 'profit': 63, 'enquiry': 13, 'divulge': 4, 'don': 59, 'great': 141, 'line': 156, 'learn': 124, 'ship': 60, 'immediately': 84, 'those': 183, 'instruct': 21, 'limited': 23, 'ourselve': 14, 'worldwide': 44, 'iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiius': 2, 'excerpt': 8, 'purpose': 40, 'source': 67, 'plus': 109, 'again': 128, 'office': 95, 'left': 3, 'school': 65, 'low': 51, 'hundred': 66, 'envy': 4, 'total': 79, 'hills': 2, 'd': 212, 'recently': 56, 'second': 117, 'suite': 58, 'exchange': 43, 'share': 84, 'method': 99, 'fluctuate': 5, 'differential': 6, 'around': 67, 'britney': 4, 'tomorrow': 9, 'tip': 29, 'noisy': 2, 'listen': 29, 'loca': 1, 'sneak': 5, 'excite': 61, 'peek': 6, 'everybody': 16, 'teacher': 25, 'ride': 4, 'beer': 11, 'smash': 3, 'calm': 2, 'vonda': 1, 'answer': 95, 'gerus': 2, 'century': 23, 'ever': 120, 'stereo': 1, 'chat': 21, 'californication': 1, 'universal': 33, 'channel': 19, 'globe': 10, 'zone': 11, 'hottest': 30, 'chilus': 1, 'uk': 97, 'celluloid': 1, 'red': 11, 'tvchannel': 3, 'lines': 8, 'song': 15, 'wait': 71, 'singles': 7, 'fave': 1, 'halliwell': 2, 'past': 80, 'terminator': 1, 'whispering': 1, 'tv': 26, 'compzone': 10, 'preacher': 3, 'june': 60, 'eurovision': 2, 'man': 46, 'forthcome': 12, 'rd': 59, 'piece': 42, 'centrepiece': 1, 'goss': 1, 'playstation': 5, 'break': 89, 'newradioworld': 3, 'bag': 10, 'ricky': 2, 'stress': 20, 'fabulous': 21, 'manic': 3, 'itch': 1, 'diary': 2, 'backstreet': 2, 'live': 
114, 'highlight': 13, 'examiner': 1, 'ad': 74, 'recognise': 3, 'entertain': 7, 'martin': 21, 'angele': 14, 'studio': 8, 'beverage': 10, 'gear': 4, 'st': 90, 'saturday': 44, 'dreamcast': 1, 'somethe': 4, 'webchat': 2, 'co': 53, 'delivery': 52, 'summer': 30, 'both': 176, 'weekend': 20, 'ring': 8, 'th': 156, 'despatch': 4, 'foursome': 1, 'preparation': 12, 'madonna': 3, 'panic': 2, 'where': 198, 'taylor': 8, 'till': 4, 'la': 42, 'girls': 5, 'professor': 46, 'baz': 3, 'livin': 1, 'winning': 6, 'luhrmann': 3, 'revisionline': 1, 'holiday': 25, 'meet': 98, 'lyric': 10, 'shepard': 1, 'boyzone': 6, 'revision': 3, 'video': 73, 'size': 37, 'nd': 59, 'bargain': 16, 'goodie': 2, 'precious': 3, 'vote': 15, 'braless': 1, 'prof': 34, 'ticket': 37, 'feature': 90, 'prior': 37, 'rubber': 1, 'carefully': 29, 'really': 106, 'order': 271, 'vida': 1, 'pepper': 1, 'ball': 7, 'film': 19, 'musical': 11, 'wednesday': 23, 'expensive': 22, 'millennium': 8, 'boy': 35, 'schizophonic': 2, 'winner': 25, 'lot': 85, 'cinema': 14, 'margherita': 2, 'gadget': 2, 'title': 132, 'sega': 1, 'comp': 9, 'lo': 19, 'everyone': 74, 'kit': 12, 'mark': 66, 'character': 21, 'preorder': 7, 'price': 139, 'love': 55, 'hits': 6, 'jeni': 1, 'girl': 36, 'xxx': 38, 'teen': 19, 'trial': 33, 'tempt': 4, 'index': 56, 'adult': 70, 'tantalize': 3, 'forbidden': 2, 'html': 129, 'mci': 17, 'z': 13, 'ones': 4, 'shortest': 1, 'range': 57, 'kevin': 6, 'cyberpromo': 3, 'several': 109, 'blank': 17, 'familiar': 20, 'numerous': 22, 'duplicate': 31, 'circle': 10, 'finish': 24, 'opportunity': 116, 'extension': 23, 'international': 141, 'user': 53, 'gigabyte': 1, 'contain': 98, 'newer': 2, 'responsive': 7, 'broad': 18, 'possible': 131, 'profanity': 9, 'fund': 42, 'bank': 81, 'randomly': 4, 'seal': 5, 'offers': 6, 'almost': 50, 'off': 107, 'ours': 17, 'canada': 55, 'cause': 28, 'released': 15, 'teaser': 1, 'newsgroup': 15, 'risk': 50, 'cleanest': 16, 'postings': 1, 'vulgarity': 10, 'cut': 43, 'mine': 30, 'highly': 37, 'alberta': 2, 'sort': 
36, 'fax': 247, 'filter': 31, 'produce': 63, 'seeker': 3, 'wrap': 23, 'ups': 2, 'dure': 13, 'download': 43, 'dupe': 12, 'kick': 7, 'undeliverable': 25, 'sell': 108, 'finally': 51, 'fedex': 5, 'unique': 42, 'real': 90, 'anon': 10, 'nobody': 12, 'fold': 8, 'generate': 67, 'private': 30, 'nospam': 1, 'mil': 15, 'bonus': 49, 'enclose': 43, 'monrose': 1, 'mlmer': 1, 'type': 183, 'nondeliverable': 1, 'key': 40, 'actually': 54, 'unless': 27, 'adam': 8, ...})

In [10]:

Common_Words = { w for w, _ in Word_Counter.most_common(N) }
Common_Words

Out[10]: {'load', 'comprise', 'familiar', 'teen', 'massive', 'gamble', 'none', 'implementation', 'majority', 'cgibin', 'rejection', 'smaller', 'launch', 'lee', 'database', 'food', 'window', 'transfer', 'candidate', 'delay', 'frank', 'multus', 'late', 'engage', 'work', 'government', 'newest', 'call', 'vulgarity', 'material', 'organisation', 'affect', 'hundreds', 'propose', 'john', 'campus', 'competition', 'view', 'penny', 'currency', 'gender', 'class', 'santa', 'andrew', 'bonus', 'refinance', 'organization', 'eric', 'site', 'ongo', 'italy', 'sprachwissenschaft', 'operator', 'little', 'appear', 'ms', 'actually', 'perform', 'monthly', 'opposite', 'latex', 'job', 'forum', 'correct', 'install', 'miss', 'local', 'remain', 'chain', 'music', 'ready', 'hundr', 'dinner', 'bill', 'singapore', 'option', 'multimedium', 'dialect', 'translation', 'most', 'different', 'literature', 'unite', 'sit', 'sun', 'desirous', 'bear', 'scientist', 'income', 'urge', 'life', 'extensive', 'label', 'city', 'july', 'mouse', 'win', 'continent', 'die', 'tom', 'diploma', 'edit', 'fulfill', 'sequence', 'lucky', 'less', 'manufacturer', 'implication', 'colingacl', 'global', 'referral', 'western', 'using', 'map', 'great', 'response', 'wrong', 'bind', 'analyse', 'busy', 'pb', 'commerce', 'ext', 'polish', 'worldwide', 'previously', 'peter', 'studies', 'appropriate', 'mailbox', 'again', 'partner', 'truly', 'catch', 'develop', 'meeting', 'title', 'cfp', 'parttime', 'begin',
'dozen', 'addition', 'artificial', 'experiment', 'mci', 'around', 'alexis', 'april', 'dictionary', 'receive', 'internet', 'exercise', 'edinburgh', 'oversea', 'paper', 'charge', 'j', 'store', 'nice', 'raleigh', 'six', 'style', 'www', 'member', 'activity', 'y', 'onetime', 'comment', 'status', 'means', 'reap', 'chat', 'utility', 'usage', 'beautiful', 'extractor', 'hotmail', 'phrase', 'doctor', 'generally', 'highly', 'helpful', 'access', 'red', 'ac', 'usa', 'structural', 'likewise', 'major', 'wherea', 'acl', 'canadian', 'cds', 'mastercard', 'programme', 'living', 'ps', 'goe', 'design', 'end', 'foot', 'postscript', 'empirical', 'color', 'corner', 'unsubscribe', 'change', 'deal', 'substantial', 'planet', 'michigan', 'dollars', 'simon', 'trip', 'award', 'credit', 'distinguish', 'quantifier', 'registration', 'grateful', 'doe', 'hesitate', 'psycholinguistic', 'robert', 'publication', 'typical', 'demo', 'log', 'e', 'interest', 'm', 'tremendous', 'simple', 'phonological', 'excellent', 'enjoy', 'genie', 'preview', 'december', 'impossible', 'america', 'webmaster', 'dori', 'le', 'de', 'please', 'reality', 'heart', 'weeks', 'parse', 'meet', 'russian', 'dear', 'netherland', 'speak', 'floor', 'blvd', 'entirely', 'clearance', 'practical', 'contents', 'medium', 'lay', 'somewhat', 'surely', 'accompany', 'buy', 'judgment', 'orders', 'fairchild', 'zero', 'focus', 'spout', 'wish', 'translate', 'vendor', 'faith', 'sincerely', 'mixe', 'draw', 'typology', 'joan', 'participation', 'quote', 'integrate', 'recently', 'vocabulary', 'interaction', 'kit', 'cycle', 'session', 'august', 'sometime', 'non', 'volumes', 'anderson', 'anna', 'discussion', 'diskette', 'finding', 'entire', 'fl', 'tip', 'player', 'traditional', 'lie', 'opportunity', 'ltd', 'shop', 'front', 'reread', 'alway', 'condition', 'band', 'sex', 'eye', 'proper', 'century', 'avenue', 'buck', 'motivation', 'postfach', 'macintosh', 'ignore', 'inquiry', 'vary', 'md', 'poor', 'move', 'top', 'capture', 'european', 'harri', 'israel', 
'modify', 'eventually', 'conversation', 'assessment', 'produce', 'coverage', 'media', 'click', 'indo', 'ma', 'female', 'genuine', 'typological', 'interdisciplinary', 'predicate', 'banner', 'provide', 'back', 'generate', 'independent', 'june', 'monday', 'individual', 'speed', 'dan', 'thing', 'demand', 'wealth', 'value', 'programs', 'texa', 'nt', 'national', 'introduction', 'amazing', 'hr', 'intelligent', 'request', 'surface', 'classified', 'policy', 'mediumsize', 'plain', 'hit', 'im', 'commonly', 'star', 'guy', 'symposium', 'w', 'september', 'forever', 'notify', 'rest', 'repeat', 'martin', 'description', 'undoubtedly', 'myself', 'zip', 'kong', 'happen', 'direct', 'percentage', 'grammar', 'delivery', 'reply', 'du', 'weekend', 'although', 'french', 'rich', 'http', 'stay', 'scheme', 'conceptual', 'deep', 'subject', 'abuse', 'consult', 'below', 'fastest', 'organiser', 'comparable', 'currently', 'influence', 'suppose', 'htm', 'enhance', 'video', 'jone', 'document', 'mb', 'consideration', 'apology', 'iro', 'michael', 'result', 'bid', 'until', 'excess', 'put', 'exclude', 'hi', 'unlimit', 'bring', 'explore', 'trend', 'try', 'numbers', 'expect', 'organise', 'ed', 'history', 'favorite', 'due', 'nc', 'promptly', 'yours', 'san', 'reports', 'documentation', 'initially', 'near', 'birth', 'park', 'trash', 'firm', 'confident', 'virtually', 'syntactic', 'evergrow', 'quit', 'register', 'general', 'stun', 'benefit', 'quality', 'spain', 'teacher', 'research', 'pic', 'participant', 'evaluation', 'responsible', 'perceive', 'institution', 'bernard', 'off', 'cooperation', 'marketing', 'north', 'illustrate', 'hardcore', 'yield', 'newsgroup', 'fantastic', 'science', 'beach', 'innovative', 'treat', 'signal', 'run', 'exactly', 'brand', 'pennsylvanium', 'wait', 'contact', 'minute', 'totally', 'statistics', 'state', 'amateur', 'researcher', 'clean', 'cluster', 'open', 'reconstruction', 'chri', 'perfectly', 'help', 'completely', 'operate', 'loss', 'watch', 'approve', 'someone', 'arise', 'scope', 
'development', 'unless', 'version', 'necessarily', 'edition', 'dutch', 'novel', 'trial', 'juno', 'fast', 'millions', 'twenty', 'dramatically', 'anywhere', 'original', 'acceptance', 'downsize', 'today', 'exact', 'weekly', 'keynote', 'forget', 'characteristic', 'txt', 'even', 'increase', 'clear', 'advertiser', 'purchase', 'date', 'integration', 'conversational', 'adult', 'classic', 'plan', 'earth', 'bottom', 'associate', 'sales', 'south', 'comprehensive', 'making', 'transcription', 'easily', 'finger', 'la', 'mortgage', 'add', 'conceal', 'verbal', 'underlie', 'long', 'import', 'analysis', 'tool', 'application', 'perhap', 'snail', 'review', 'easiest', 'extremely', 'though', 'verify', 'leave', 'virtual', 'dynamic', 'recipient', 'couple', 'least', 'germany', 'useful', 'client', 'television', 'prof', 'scott', 'phd', 'importance', 'gb', 'korean', 'order', 'variation', 'aim', 'trust', 'thoma', 'corporations', 'musical', 'much', 'lifetime', 'intrusion', 'set', 'iii', 'potential', 'concept', 'country', 'hotel', 'school', 'master', 'acquire', 'laugh', 'mo', 'psychological', 'club', 'obviously', 'debt', 'extract', 'vision', 'million', 'acquisition', 'object', 'human', 'sake', 'include', 't', 'success', 'theoretical', 'community', 'vacation', 'sample', 'wh', 'five', 'finance', 'search', 'datum', 'engine', 'while', 'everybody', 'february', 'author', 'editor', 'summarize', 'fundamental', 'publish', 'blackwell', 'yourself', 'hello', 'investigation', 'universal', 'karen', 'jump', 'umontreal', 'follow', 'professional', 'emerge', 'once', 'day', 'jame', 'slip', 'medical', 'borrow', 'song', 'idea', 'hour', 'argument', 'obvious', 'listing', 'november', 'subscription', 'shift', 'pleasure', 'london', 'however', 'fun', 'tree', 'man', 'classroom', 'dates', 'mt', 'accommodation', 'cm', 'alternative', 'send', 'diversity', 'framework', 'bag', 'cognition', 'relationship', 'reasonable', 'att', 'exceed', 'von', 'during', 'region', 'accessible', 'linguistics', 'refer', 'observe', 'britain', 
'creditor', 'specify', 'moneymake', 'relevance', 'honor', 'connection', 'news', 'parameter', 'twelve', 'cv', 'mike', 'parallel', 'apply', 'is', 'here', 'short', 'password', 'mellon', 'textbook', 'telephone', 'morn', 'paragraph', 'educational', 'worth', 'effective', 'reflect', 'normal', 'compute', 'ba', 'pretty', 'comparative', 'contribute', 'latest', 'minimum', 'interpretation', 'australium', 'drop', 'tell', 'fill', 'richer', 'compile', 'residual', 'convention', 'belgium', 'martha', 'functional', 'variety', 'white', 'spell', 'interview', 'isp', 'cognitive', 'slavic', 'u', 'possibility', 'speech', 'hire', 'spokane', 'japanese', 'education', 'unify', 'distribute', 'perception', 'enquiry', 'president', 'poster', 'african', 'town', 'overload', 'everythe', 'testimonial', 'death', 'christian', 'purpose', 'return', 'team', 'duration', 'cinema', 'lot', 'seem', 'linguistic', 'freedom', 'v', 'want', 'cheap', 'total', 'dept', 'literary', 'second', 'honest', 'wife', 'head', 'agency', 'final', 'assume', 'berlin', 'alone', 'compete', 'three', 'attach', 'montreal', 'movie', 'consist', 're', 'incredible', 'lack', 'mistake', 'satisfy', 'fine', 'middle', 'ask', 'capability', 'guest', 'discover', 'resort', 'amount', 'define', 'certainly', 'problem', 'organize', 'believe', 'resell', 'survey', 'whatsoever', 'lisa', 'cent', 'largest', 'requirement', 'property', 'msn', 'mail', 'loan', 'domain', 'released', 'pour', 'sum', 'complete', 'syntax', 'symbol', 'command', 'sheffield', 'interpret', 'reviewer', 'merciless', 'nijmegen', 'dupe', 'industrial', 'phonology', 'create', 'pittsburgh', 'radio', 'together', 'sender', 'privacy', 'nature', 'length', 'contributor', 'forthcome', 'bet', 'genre', 'product', 'expiration', 'indiana', 'nl', 'obligation', 'newsletter', 'ii', 'emailer', 'exclusive', 'note', 'maintain', 'assure', 'rock', 'quick', 'retail', 'copy', 'x', 'ad', 'light', 'serve', 'toy', 'press', 'file', 'days', 'gold', 'strength', 'few', 'eastern', 'both', 'dependency', 'chomsky', 'course', 
'morpheme', 'financially', 'device', 'behind', 'situation', 'parent', 'foundation', 'bit', 'orient', 'stealth', 'remember', 'vium', 'personally', 'difficult', 'money', 'jp', 'institute', 'disc', 'lyric', 'correspondence', 'cancel', 'probably', 'asset', 'prompt', 'equipment', 'feature', 'id', 'conclude', 'need', 'sale', 'truth', 'automatically', 'mix', 'spend', 'sorry', 'index', 'art', 'fact', 'department', 'common', 'estate', 'publisher', 'basically', 'conjunction', 'ram', 'arrange', 'whether', 'example', 'sure', 'plenary', 'love', 'preparation', 'actual', 'cameraready', 'cost', 'point', 'experience', 'quickly', 'thousands', 'ultimate', 'browser', 'started', 'reg', 'coordinate', 'rush', 'network', 'indefinite', 'delete', 'mailer', 'package', 'instead', 'side', 'already', 'essential', 'ever', 'responsibility', 'mit', 'brief', 'desire', 'discovery', 'royal', 'update', 'intelligence', 'control', 'present', 'round', 'sponsor', 'occur', 'philosophy', 'current', 'web', 'text', 'broadcast', 'spot', 'effect', 'tradition', 'tv', 'germanic', ...}

Computing the Conditional Probabilities

Having computed the most common words, we are now ready to compute the conditional probability that a given word occurs in a spam email.

The function $\texttt{get_common_words}(\texttt{fn})$ takes a filename $\texttt{fn}$ as its argument. It reads the file and returns the set of all words in Common_Words that are found in the given file.

In [11]:

def get_common_words(fn):
    return get_words(fn) & Common_Words

We test this function on a small email.

In [12]:

get_common_words('EmailData/ham-train/3-380msg4.txt')

Out[12]: {'anyone', 'article', 'berkeley', 'book', 'consonant', 'edu', 'english', 'hard', 'helpful', 'hi', 'interest', 'm', 'modern', 'phonetics', 'project', 'recommend', 'source', 'specifically', 'thank', 'too', 'work'}

The function count_commmon_words (note the spelling used in the code below) takes a string specifying a directory.
It returns a Counter that stores, for every word in Common_Words, the number of files in directory that contain this word.

In [13]:

def count_commmon_words(directory):
    Words = Counter()
    for file_name in os.listdir(directory):
        Words.update(get_common_words(directory + file_name))
    return Words

Next, we compute counters that store, for every common word, the number of training emails that contain it.

In [14]:

Spam_Counter = count_commmon_words(spam_dir_train)
Spam_Counter

Out[14]: Counter({'earn': 51, 'experience': 63, 'through': 75, 'phd': 6, 'increase': 39, 'grant': 12, 'effort': 42, 'choice': 23, 'here': 146, 'short': 38, 'field': 33, 'part': 50, 'personal': 67, 'programs': 15, 'base': 42, 'ba': 8, 'phone': 93, 'power': 30, 'necessary': 25, 'degree': 9, 'further': 51, 'detail': 55, 'call': 132, 'advance': 20, 'require': 64, 'award': 13, 'present': 27, 'knowledge': 30, 'money': 140, 'university': 15, 'diploma': 7, 'ma': 13, 'cost': 99, 'entire': 29, 'conference': 6, 'week': 104, 'receive': 157, 'start': 106, 'our': 223, 'delete': 39, 'po': 27, 'old': 40, 'mailer': 15, 'financial': 55, 'member': 54, 'problem': 47, 'believe': 56, 'ago': 29, 'throw': 13, 'customer': 52, 'hello': 36, 'letter': 67, 'inexpensive': 16, 'guarantee': 73, 'ignore': 22, 'complete': 54, 'control': 30, 'outside': 20, 'cash': 69, 'name': 133, 'usa': 31, 'state': 103, 'texa': 5, 'send': 154, 'later': 35, 'without': 66, 'print': 56, 'program': 99, 'best': 123, 'one': 168, 'note': 59, 'free': 198, 'show': 76, 'computer': 69, 'credit': 71, 'registration': 9, 'must': 80, 'process': 54, 'center': 16, 'today': 116, 'weekly': 26, 'mind': 33, 'zip': 52, 'interest': 98, 'compound': 3, 'few': 68, 'address': 166, 'simple': 72, 'telephone': 31, 'educational': 3, 'main': 22, 'worth': 32, 'entitle': 4, 'convert': 8, 'plan': 41, 's': 219, 'message': 106, 'join': 54, 'number': 106, 'respond': 24, 'box': 56, 'achieve': 21, 'card': 70, 'life': 65, 'solution': 13, 'mortgage': 17, 'please': 188, 'city': 69, 'information': 153,
'especially': 23, 'net': 68, 'id': 21, 'participate': 31, 'us': 156, 'independence': 12, 'tuesday': 2, 'enable': 9, 'company': 102, 'over': 146, 'simply': 79, 'night': 18, 'pm': 26, 'intrusion': 15, 'return': 65, 'solid': 14, 'establish': 15, 'mean': 17, 'freedom': 36, 'form': 63, 'begin': 32, 'system': 65, 'debt': 30, 'obtain': 19, 'secure': 25, 'per': 79, 'pack': 13, 'vacation': 31, 'west': 6, 'pay': 92, 'e': 87, 'home': 101, 'accomodation': 1, 'wonderful': 10, 'three': 25, 'buck': 7, 'room': 12, 'mail': 179, 'reserve': 16, 'stay': 21, 'season': 7, 'announce': 13, 'want': 145, 'follow': 118, 'space': 16, 'com': 160, 'compuserve': 17, 'day': 154, 'lunch': 8, 'book': 33, 'doe': 9, 'additional': 51, 'million': 85, 'reach': 43, 'commercial': 19, 'info': 47, 'future': 74, 'success': 60, 'internet': 124, 'software': 67, 'network': 22, 'search': 60, 'permanently': 8, 'area': 43, 'evaluation': 4, 'proper': 7, 'requirement': 13, 'presence': 9, 'section': 31, 'stop': 38, 'regard': 20, 'propose': 9, 'web': 93, 'advantage': 45, 'sender': 23, 'certain': 16, 'help': 89, 'remove': 150, 'target': 23, 'product': 98, 'fellow': 16, 'promote': 24, 'luck': 23, 'basis': 18, 'request': 80, 'comply': 17, 'recent': 7, 'lead': 19, 'mailing': 55, 'bill': 54, 'selection': 9, 'c': 49, 'reply': 89, 'ten': 17, 'paragraph': 9, 'post': 30, 'unite': 31, 'transmission': 8, 'gov': 20, 'http': 157, 'entrepreneur': 10, 'subject': 102, 'tool': 36, 'service': 108, 'dear': 37, 'business': 114, 'assist': 9, 'level': 44, 'need': 145, 'sale': 57, 'thoma': 5, 'item': 11, 'much': 102, 'try': 75, 'set': 43, 'wish': 85, 'thank': 89, 'market': 102, 'email': 185, 'online': 85, 'federal': 27, 'audience': 7, 'park': 16, 'check': 126, 'greatest': 27, 're': 104, 'onto': 6, 'release': 30, 'include': 129, 'player': 10, 'visit': 80, 'ultimate': 10, 'shop': 26, 'while': 46, 'chart': 8, 'cd': 41, 'never': 79, 'package': 48, 'alway': 57, 'www': 110, 'band': 8, 'why': 65, 'event': 15, 'full': 65, 'digital': 16, 'right': 
96, 'delay': 9, 'yourself': 72, 'late': 13, 'friend': 56, 'easy': 89, 'available': 93, 'beautiful': 20, 'chance': 43, 'fantastic': 27, 'top': 50, 'pick': 32, 'run': 48, 'access': 45, 'john': 14, 'competition': 25, 'click': 100, 'offer': 143, 'pop': 14, 'n': 27, 'roll': 18, 'big': 46, 'sound': 29, 'radio': 17, 'down': 67, 'play': 29, 'provide': 71, 'london': 11, 'thing': 57, 'fun': 44, 'site': 121, 'record': 32, 'prepare': 22, 'nt': 127, 'true': 44, 'unsubscribe': 20, 'b': 40, 'technology': 29, 'miss': 42, 'exclusive': 23, 'capitalfm': 17, 'hit': 42, 'thursday': 3, 'save': 81, 'straight': 6, 'choose': 52, 'question': 76, 'rock': 7, 'star': 16, 'music': 14, 'europe': 12, 'hesitate': 25, 'graphic': 12, 'storm': 7, 'instant': 15, 'supply': 13, 'special': 85, 'll': 107, 'regular': 16, 'hurry': 8, 'many': 117, 'even': 99, 'reveal': 12, 'too': 46, 'attend': 7, 've': 81, 'website': 39, 'list': 166, 'classic': 3, 'absolutely': 39, 'south': 15, 'enter': 50, 'latest': 36, 'pc': 18, 'prize': 16, 'label': 8, 'tell': 71, 'fill': 55, 'win': 87, 'paradise': 7, 'stock': 23, 'thompson': 5, 'discount': 18, 'couple': 21, 'guess': 8, 'u': 48, 'entirely': 4, 'amaze': 37, 'link': 53, 'advertisement': 42, 'better': 57, 'william': 6, 'feel': 33, 'become': 42, 'manufacturer': 9, 'album': 12, 'game': 28, 'still': 52, 'buy': 79, 'primary': 8, 'bring': 35, 'screen': 12, 'president': 7, 'biz': 9, 'surround': 5, 'poster': 1, 'everythe': 15, 'fm': 10, 'focus': 5, 'talk': 29, 'team': 20, 'mailbox': 24, 'next': 69, 'catch': 16, 'favourite': 9, 'world': 80, 'laugh': 12, 'whether': 20, 'performance': 8, 'hot': 27, 'head': 12, 'capital': 43, 'movie': 19, 'major': 57, 'submission': 11, 'automatically': 31, 'report': 70, 'notice': 7, 'engine': 34, 'advertiser': 14, 'within': 89, 'bulk': 58, 'after': 80, 'each': 96, 'etc': 53, 'every': 114, 'appropriate': 5, 'page': 57, 'toll': 36, 'monthly': 28, 'pro': 10, 'hr': 15, 'extractor': 11, 'block': 17, 'month': 94, 'review': 23, 'submit': 19, 'media': 3, 
'tag': 3, 'thousands': 10, 'solve': 4, 'something': 32, 'spam': 26, 'reg': 11, 'dollar': 73, 'powerful': 27, 'quickly': 29, 'community': 7, 't': 73, 'high': 47, 'literally': 9, 'general': 13, 'along': 30, 'travel': 31, 'ask': 67, 'benefit': 27, 'oversea': 7, 'paper': 34, 'finance': 14, 'promise': 23, 'legally': 10, 'amount': 61, 'clearly': 7, 'confirm': 9, 'certainly': 11, 'upon': 23, 'contract': 10, 'word': 38, 'extra': 42, 'thousand': 71, 'means': 18, 'work': 123, 'soon': 50, 'monitor': 8, 'before': 89, 'themselve': 21, 'vary': 8, 'march': 10, 'move': 38, 'ca': 43, 'under': 55, 'exactly': 38, 'kid': 14, 'public': 21, 'view': 17, 'greatly': 9, 'earlier': 4, 'contact': 67, 'likewise': 4, 'currency': 10, 'minute': 41, 'wall': 13, 'create': 56, 'daily': 22, 'yet': 20, 'reason': 42, 'effect': 5, 'editorial': 3, 'santa': 4, 'optional': 13, 'back': 85, 'completely': 36, 'end': 47, 'individual': 25, 'operate': 22, 'organization': 23, 'however': 27, 'watch': 36, 'someone': 48, 'rate': 61, 'wealth': 10, 'fortune': 22, 'own': 96, 'political': 3, 'membership': 19, 'corner': 8, 'national': 13, 'change': 59, 'payable': 35, 'dollars': 19, 'write': 55, 'o': 42, 'overnight': 25, 'let': 65, 'group': 34, 'first': 110, 'assure': 9, 'profile': 10, 'same': 69, 'attention': 22, 'publication': 12, 'continue': 33, 'postage': 18, 'else': 51, 'gold': 20, 'instruction': 64, 'nor': 23, 'm': 78, 'fee': 36, 'most': 112, 'date': 53, 'different': 62, 'announcement': 6, 'concern': 12, 'glad': 7, 'unlike': 10, 'earth': 21, 'able': 46, 'parent': 8, 'easily': 40, 'anyone': 61, 'add': 79, 'york': 18, 'depend': 15, 'long': 39, 'below': 99, 'allow': 60, 'action': 37, 'street': 37, 'operation': 10, 'exist': 23, 'legal': 42, 'advice': 6, 'extremely': 25, 'leave': 57, 'cancel': 8, 'important': 45, 'californium': 9, 'refund': 29, 'american': 35, 'document': 11, 'confidential': 13, 'hand': 45, 'read': 85, 'conclude': 12, 'keep': 80, 'grow': 22, 'until': 44, 'surely': 16, 'hi': 24, 'secret': 38, 'global': 
17, 'unlimit': 20, 'profit': 48, 'enquiry': 1, 'don': 42, 'great': 81, 'line': 93, 'learn': 53, 'ship': 42, 'immediately': 51, 'those': 77, 'instruct': 15, 'limited': 17, 'ourselve': 9, 'worldwide': 26, 'purpose': 13, 'source': 21, 'plus': 60, 'again': 78, 'office': 48, 'school': 16, 'low': 33, 'hundred': 48, 'total': 47, 'd': 54, 'recently': 21, 'second': 23, 'suite': 36, 'exchange': 19, 'share': 42, 'method': 39, 'extract': 15, 'around': 33, 'tip': 19, 'listen': 13, 'excite': 42, 'teacher': 1, 'everybody': 7, 'beer': 5, 'answer': 50, 'century': 10, 'ever': 78, 'love': 38, 'chat': 16, 'universal': 3, 'channel': 9, 'globe': 5, 'zone': 8, 'hottest': 18, 'uk': 21, 'red': 8, 'song': 6, 'wait': 51, 'past': 42, 'tv': 18, 'compzone': 7, 'june': 10, 'man': 17, 'forthcome': 4, 'rd': 20, 'piece': 32, 'break': 45, 'bag': 7, 'stress': 6, 'fabulous': 17, 'live': 69, 'highlight': 6, 'ad': 49, 'martin': 3, 'angele': 7, 'beverage': 6, 'st': 38, 'saturday': 13, 'co': 18, 'delivery': 33, 'summer': 8, 'both': 45, 'weekend': 12, 'th': 45, 'where': 90, 'la': 8, 'professor': 3, 'holiday': 17, 'meet': 31, 'lyric': 5, 'video': 39, 'size': 22, 'nd': 26, 'bargain': 12, 'vote': 9, 'prof': 2, 'ticket': 24, 'feature': 22, 'prior': 23, 'carefully': 20, 'really': 65, 'order': 130, 'film': 10, 'musical': 5, 'wednesday': 4, 'expensive': 17, 'boy': 24, 'winner': 16, 'cinema': 9, 'lot': 52, 'title': 33, 'lo': 7, 'everyone': 48, 'kit': 10, 'mark': 14, 'character': 3, 'price': 84, 'preparation': 4, 'girl': 20, 'xxx': 21, 'teen': 15, 'trial': 23, 'index': 19, 'adult': 42, 'html': 42, 'mci': 12, 'z': 5, 'range': 14, 'familiar': 7, 'several': 49, 'blank': 12, 'numerous': 10, 'duplicate': 22, 'circle': 6, 'finish': 12, 'opportunity': 73, 'extension': 6, 'international': 38, 'user': 22, 'contain': 43, 'possible': 40, 'broad': 1, 'bank': 51, 'fund': 18, 'almost': 32, 'off': 70, 'ours': 12, 'canada': 12, 'cause': 10, 'released': 10, 'risk': 36, 'newsgroup': 10, 'cleanest': 11, 'vulgarity': 7, 'cut': 26, 
'mine': 13, 'highly': 19, 'sort': 17, 'fax': 51, 'filter': 19, 'produce': 33, 'wrap': 15, 'dure': 3, 'download': 30, 'dupe': 8, 'undeliverable': 18, 'sell': 84, 'finally': 34, 'unique': 27, 'real': 52, 'anon': 6, 'nobody': 9, 'private': 23, 'generate': 43, 'mil': 9, 'bonus': 41, 'enclose': 28, 'type': 84, 'key': 20, 'actually': 24, 'unless': 18, 'fast': 27, 'place': 81, 'yes': 30, 'remain': 11, 'valid': 14, 'close': 22, 'specifically': 6, 'since': 37, 'w': 18, 'file': 55, 'huge': 37, 'is': 83, 'tremendous': 11, 'small': 41, 'password': 8, 'purchase': 73, 'are': 56, 'against': 33, 'anything': 43, 'course': 40, 'edu': 9, 'average': 18, 'directory': 23, 'eliminate': 25, 'replace': 16, 'super': 24, 'production': 10, 'bottom': 23, 'clock': 7, 'server': 27, 'account': 46, 'rich': 24, 'gather': 7, 'webmaster': 12, 'marketer': 14, 'envelope': 27, 'postmaster': 6, 'abuse': 10, 'stealth': 22, 'whole': 24, 'inside': 15, 'ensure': 7, 'org': 14, 'vium': 45, 'faster': 23, 'removal': 9, 'investment': 36, 'longer': 23, 'classify': 10, 'cdrom': 12, 'pure': 12, 'isp': 17, 'road': 25, 'less': 57, 'client': 18, 'result': 56, 'bid': 18, 'excess': 18, 'put': 77, 'reduce': 25, 'fresh': 42, 'otherwise': 14, 'using': 27, 'response': 44, 'combine': 14, 'fact': 43, 'tout': 6, 'addresses': 19, 'numbers': 10, 'collect': 24, 'country': 41, 'due': 26, 'seem': 18, 'flame': 10, 'prodigy': 7, 'sign': 41, 'dozen': 5, 'test': 35, 'example': 35, 'near': 15, 'sure': 65, 'lists': 20, 'consist': 3, 'actual': 9, 'diskette': 8, 'fine': 10, 'act': 21, 'doubt': 27, 'magazine': 18, 'window': 33, 'compress': 7, 'stimulate': 4, 'activity': 12, 'whatsoever': 13, 'comment': 11, 'position': 35, 'multiple': 16, 'macintosh': 5, 'utility': 12, 'everything': 50, 'meg': 7, 'treat': 22, 'intelligence': 15, 'command': 6, 'once': 68, 'conversation': 6, 'compatible': 6, 'disk': 13, 'girlfriend': 4, 'design': 34, 'rom': 11, 'above': 62, 'mac': 9, 'differently': 4, 'woman': 15, 'king': 6, 'protection': 13, 'install': 9, 
'celebrity': 7, 'correct': 14, 'copy': 64, 'guy': 15, 'code': 58, 'personality': 4, 'x': 46, 'either': 33, 'toy': 9, 'existence': 6, 'voice': 14, 'likes': 6, 'hear': 42, 'hard': 44, 'unmark': 6, 'ibm': 7, 'boyfriend': 5, 'sexual': 12, 'reality': 8, 'turn': 41, 'model': 8, 'remember': 45, 'deat': 28, 'higher': 15, 'continent': 5, 'interactive': 11, 'realistic': 9, 'guide': 29, 'drive': 23, 'relate': 17, 'virtual': 10, 'blvd': 12, 'least': 44, 'upset': 7, 'obey': 4, 'beg': 5, 'sexually': 8, 'attitude': 5, 'ram': 7, 'inform': 13, 'partner': 27, 'v': 28, 'blast': 7, 'club': 18, 'artificial': 6, 'clothe': 10, 'imagine': 33, 'porn': 8, 'handle': 26, 'sex': 22, 'story': 20, 'picture': 16, 'birth': 6, 'none': 6, 'rejection': 1, 'charge': 48, 'responsible': 10, 'north': 7, 'qualify': 23, 'law': 33, 'perform': 11, 'job': 48, 'annual': 12, 'conduct': 4, 'creditor': 15, 'bankruptcy': 21, 'regardless': 10, 'match': 9, 'apply': 18, 'bad': 12, 'excellent': 24, 'income': 69, 'payment': 38, 'express': 28, 'application': 15, 'seek': 13, 'security': 41, 'made': 9, 'nj': 8, 'student': 13, 'prompt': 16, 'deposit': 21, 'resource': 22, 'history': 13, 'guaranteed': 21, 'signature': 29, 'savings': 12, 'final': 13, 'datum': 9, 'recieve': 11, 'text': 31, 'clean': 19, 'open': 45, 'together': 20, 'cheque': 5, 'value': 30, 'unsolicit': 14, 'england': 6, 'clear': 15, 'direct': 36, 'minimum': 10, 'import': 8, 'disc': 5, 'recipient': 13, 'fully': 22, 'quote': 10, 'pound': 7, 'normally': 8, 'resident': 11, 'virtually': 14, 'collection': 15, 'select': 48, 'resell': 22, 'cent': 19, 'msn': 9, 'marketing': 29, 'ability': 23, 'management': 12, 'compare': 11, 'class': 26, 'hour': 93, 'mastercard': 35, 'nothing': 48, 'copyright': 17, 'speed': 20, 'accept': 50, 'tree': 7, 'mass': 16, 'expiration': 21, 'deal': 34, 'visa': 40, 'anywhere': 49, 'aol': 39, 'ready': 40, 'dream': 46, 'reward': 10, 'smith': 4, 'sales': 22, 'person': 47, 'function': 6, 'step': 55, 'setup': 9, 'currently': 24, 'hours': 14, 'stepby': 
15, 'tax': 28, 'touch': 10, 'thesis': 2, 'kind': 25, 'yours': 54, 'provider': 17, 'rights': 22, 'volume': 15, 'trash': 13, 'satisfy': 19, 'period': 23, 'thereafter': 9, 'sample': 18, 'separate': 11, 'quality': 25, 'services': 15, ...})

In [15]:
Ham__Counter = count_commmon_words(ham__dir_train)
Ham__Counter

Out[15]: Counter({'range': 29, 'comprise': 4, 'through': 33, 'future': 20, 'lab': 9, 'practice': 11, 'coordinate': 7, 'language': 241, 'international': 76, 'research': 116, 'promise': 5, 'area': 72, 'broad': 10, 'www': 116, 'fund': 12, 'identify': 30, 'pari': 15, 'canada': 28, 'work': 99, 'sunday': 9, 'call': 119, 'umontreal': 7, 'follow': 130, 'assess': 9, 'therefore': 16, 'syntax': 65, 'israel': 8, 'modify': 8, 'present': 79, 'ca': 38, 'outside': 10, 'tag': 7, 'view': 33, 'usa': 53, 'current': 32, 'state': 57, 'researcher': 40, 'face': 13, 'together': 43, 'programme': 35, 'morphology': 36, 'provide': 86, 'html': 56, 'examine': 16, 'individual': 27, 'accept': 47, 'arabic': 10, 'own': 32, 'target': 8, 'mt': 7, 'pre': 9, 'little': 13, 'computational': 38, 'national': 30, 'forum': 27, 'coordinator': 9, 'specifically': 8, 'bell': 9, 'europe': 20, 'registration': 52, 'bar': 4, 'france': 21, 'either': 40, 'description': 39, 'c': 79, 'mike': 7, 'direct': 27, 'committee': 50, 'short': 25, 'consequence': 15, 'workshop': 71, 'hebrew': 8, 'date': 35, 'concern': 36, 'theme': 27, 'although': 30, 'edu': 105, 'centre': 29, 'xerox': 12, 'http': 137, 'support': 37, 'subject': 44, 'generation': 19, 'where': 57, 'exist': 25, 'university': 201, 'parse': 14, 'papers': 99, 'possibility': 18, 'iro': 7, 'michael': 32, 'result': 41, 'colingacl': 7, 'aim': 41, 'approach': 71, 'art': 24, 'much': 37, 'each': 53, 'common': 23, 'george': 15, 'potential': 16, 'collect': 8, 'develop': 32, 'body': 8, 'august': 42, 'session': 47, 'final': 33, 'montreal': 14, 'challenge': 17, 'contact': 89, 'process': 69, 're': 52, 'homepage': 13, 'susan': 14, 'text': 77, 'web': 65, 'robert': 28, 'benjamin':
19, 'connection': 11, 'speech': 57, 'read': 32, 'visit': 28, 'editorial': 8, 'william': 18, 'chri': 2, 'order': 73, 'h': 34, 'm': 79, 'l': 51, 'j': 44, 'et': 17, 'grammar': 52, 'relation': 26, 'site': 47, 'development': 56, 'bank': 10, 'pattern': 22, 'resource': 24, 'christian': 8, 'word': 114, 'ed': 31, 'g': 61, 'nl': 42, 'function': 25, 'locate': 9, 'linguistic': 170, 'mean': 43, 'semantics': 53, 'english': 125, 'life': 12, 'le': 21, 'sign': 17, 'de': 87, 'social': 30, 'paul': 28, 'please': 133, 'verbal': 9, 'total': 9, 'note': 47, 'harri': 9, 'lexical': 44, 'elizabeth': 9, 'matter': 14, 'verb': 39, 'john': 60, 'linguistics': 103, 'theory': 71, 'natural': 52, 'philosophy': 14, 'ac': 54, 'information': 174, 'k': 35, 'thompson': 4, 'dynamic': 12, 'industry': 3, 'million': 5, 'experience': 33, 'include': 130, 'conference': 99, 'expense': 10, 'implementation': 10, 'effort': 15, 'benefit': 4, 'software': 23, 'database': 15, 'window': 3, 'strong': 9, 'theart': 3, 'candidate': 11, 'position': 43, 'fax': 130, 'science': 61, 'phonetics': 18, 'complete': 30, 'signal': 9, 'prefer': 19, 'n': 28, 'phonology': 45, 'prosodic': 11, 'jean': 13, 'advantage': 8, 'two': 79, 'design': 23, 'length': 27, 'enclose': 5, 'between': 79, 'mac': 6, 'send': 109, 'statistical': 18, 'break': 24, 'substantial': 12, 'job': 21, 'inc': 18, 'skill': 10, 'scientific': 17, 'house': 15, 'knowledge': 23, 'engineer': 15, 'computer': 44, 'salary': 7, 'x': 43, 'graphic': 8, 'center': 31, 'acoustic': 6, 'singapore': 10, 'publication': 53, 'tel': 61, 'e': 131, 'successful': 19, 'apply': 50, 'mr': 11, 'both': 83, 'personal': 11, 'telephone': 33, 'post': 49, 's': 189, 'join': 13, 'project': 33, 'sun': 8, 'number': 80, 'scientist': 10, 'desirable': 5, 'model': 45, 'analysis': 70, 'tool': 22, 'institute': 52, 'relevant': 24, 'californium': 32, 'technical': 24, 'least': 27, 'less': 13, 'phd': 9, 'us': 72, 'need': 54, 'encourage': 23, 'preferably': 20, 'degree': 18, 'stateof': 3, 'email': 136, 'require': 34, 
'interaction': 31, 'system': 59, 'chinese': 16, 'content': 33, 'end': 31, 'official': 11, 'fuer': 12, 'later': 25, 'application': 52, 'yet': 15, 'begin': 21, 'mid': 2, 'inform': 8, 'period': 10, 'six': 13, 'sincerely': 4, 'keep': 8, 'january': 26, 'week': 22, 'sprachwissenschaft': 9, 'expect': 18, 'r': 40, 'student': 65, 'cognitive': 43, 'issue': 77, 'october': 26, 'oxford': 13, 'press': 21, 'upto': 5, 'paper': 100, 'f': 30, 'most': 52, 'pp': 38, 'learn': 34, 'key': 13, 'study': 85, 'wide': 33, 'history': 23, 'concept': 20, 'introduction': 24, 'brief': 20, 'title': 61, 'overview': 12, 'org': 18, 'second': 60, 'accessible': 11, 'cloth': 16, 'first': 100, 'cover': 32, 'book': 79, 'third': 19, 'act': 13, 'secretary': 7, 'theoretical': 41, 'patrick': 10, 'general': 61, 'ask': 34, 'p': 68, 'po': 17, 'representation': 35, 'package': 4, 'author': 72, 'dialogue': 18, 'publish': 57, 'page': 89, 'available': 85, 'preliminary': 8, 'jame': 18, 'interface': 23, 'jan': 18, 'formal': 30, 'prove': 6, 'november': 19, 'inference': 11, 'postscript': 20, 'dates': 13, 'prepare': 14, 'version': 24, 'further': 63, 'b': 44, 'latex': 11, 'notification': 34, 'place': 54, 'o': 39, 'limit': 44, 'van': 26, 'steve': 7, 'original': 28, 'tilburg': 11, 'acceptance': 32, 'proceedings': 20, 'host': 15, 'september': 31, 'involve': 28, 'selection': 15, 'aspect': 54, 'interest': 112, 'invite': 74, 'chair': 29, 'initial': 18, 'box': 39, 'room': 21, 'interpretation': 23, 'professor': 25, 'htm': 11, 'martha': 7, 'netherland': 30, 'important': 44, 'submission': 64, 'bring': 42, 'index': 22, 'topics': 10, 'semantic': 51, 'phone': 54, 'department': 74, 'focus': 55, 'context': 40, 'due': 35, 'office': 19, 'anne': 17, 'guideline': 13, 'form': 83, 'mark': 36, 'topic': 74, 'faculty': 20, 'submit': 63, 'preparation': 6, 'technique': 16, 'lead': 31, 'discussion': 81, 'cluster': 9, 'parameter': 10, 'real': 12, 'recognition': 23, 'linguist': 71, 'principle': 29, 'help': 41, 'background': 19, 'andrew': 14, 
'decision': 12, 'algorithm': 9, 'datum': 47, 'our': 44, 'enable': 8, 'hide': 3, 'tree': 3, 'statement': 21, 'affiliation': 47, 'clearly': 19, 'select': 30, 'editor': 32, 'suitable': 11, 'reflect': 20, 'list': 75, 'distribution': 14, 'message': 36, 'criterion': 13, 'set': 38, 'goal': 14, 'goodness': 2, 'mit': 24, 'series': 19, 'foundation': 14, 'valuable': 6, 'request': 32, 'communication': 46, 'maximum': 17, 'below': 52, 'underlie': 12, 'isbn': 18, 'method': 34, 'show': 44, 'review': 44, 'reader': 15, 'decade': 8, 'reviewer': 9, 'document': 21, 'abstracts': 9, 'choice': 10, 'website': 19, 'address': 113, 'field': 59, 'style': 27, 'educational': 9, 'organizer': 26, 'discourse': 51, 'market': 15, 'december': 27, 'methodology': 17, 'contribution': 24, 'structure': 59, 'case': 57, 'variable': 7, 'psychology': 18, 'documentation': 5, 'affect': 8, 'informative': 4, 'corpus': 29, 'audience': 8, 'deadline': 55, 'april': 42, 'italian': 15, 'ignore': 12, 'interpret': 14, 'brian': 11, 'texa': 19, 'operator': 3, 'category': 23, 'ii': 27, 'url': 23, 'china': 6, 'must': 58, 'austin': 10, 'cv': 12, 'translation': 35, 'directory': 6, 'electronic': 29, 'amsterdam': 20, 'long': 24, 'il': 14, 'meet': 40, 'u': 41, 'david': 34, 'un': 11, 'simply': 16, 'iii': 15, 'translate': 9, 'world': 48, 'school': 39, 'v': 45, 'd': 94, 'approximately': 13, 'price': 23, 'ad': 6, 'idea': 24, 'recent': 37, 'thanks': 7, 'commercial': 9, 'equipment': 8, 'file': 21, 'even': 36, 'middle': 8, 'along': 22, 'walk': 6, 'opportunity': 15, 'someone': 12, 'eastern': 11, 'expression': 17, 'man': 13, 'wonder': 9, 'different': 56, 'old': 17, 'confirm': 9, 'instead': 8, 'glad': 3, 'those': 58, 'french': 38, 'talk': 34, 'mass': 7, 'sit': 1, 'foreign': 14, 'thank': 38, 'print': 22, 'ibm': 4, 'lose': 10, 'country': 26, 'discuss': 38, 'respond': 8, 'mary': 8, 'write': 68, 'anyone': 37, 'want': 25, 'one': 125, 'input': 8, 'service': 18, 'question': 92, 'assume': 19, 'though': 18, 'grateful': 7, 'early': 21, 'three': 48, 
'speak': 44, 'attention': 26, 'probably': 10, 'bite': 5, 'demonstrate': 12, 'useful': 14, 'net': 7, 'option': 7, 'feature': 42, 'keyword': 9, 'sale': 2, 'else': 15, 'still': 21, 'search': 6, 'engine': 6, 'teacher': 20, 'query': 25, 'conclusion': 13, 'spanish': 20, 'similar': 16, 'response': 18, 'quite': 21, 'try': 16, 'au': 13, 'etc': 57, 'fact': 26, 'nt': 34, 'excellent': 4, 'digital': 5, 'colleague': 16, 'true': 13, 'polish': 10, 'return': 14, 'puzzle': 7, 'seem': 33, 'next': 15, 'waste': 2, 'door': 1, 'decide': 8, 'again': 15, 'favourite': 1, 'indeed': 17, 'yes': 5, 'tell': 22, 'mine': 10, 'fail': 6, 'turn': 12, 'com': 33, 'build': 27, 'surprise': 9, 'perhap': 22, 'returns': 2, 'large': 20, 'head': 24, 'experiment': 10, 'extremely': 8, 'offer': 34, 'relate': 57, 'notion': 21, 'far': 19, 'guy': 7, 'size': 10, 'volume': 42, 'vol': 14, 'syntactic': 30, 'll': 6, 'june': 31, 'edinburgh': 10, 'quality': 18, 'assistant': 9, 'vowel': 12, 'dutch': 17, 'king': 10, 'stress': 6, 'uk': 46, 'ling': 20, 'additional': 27, 'point': 44, 'intend': 27, 'receive': 53, 'register': 24, 'home': 24, 'separate': 18, 'ascii': 13, 'february': 27, 'inch': 5, 'anybody': 11, 'inquiry': 13, 'march': 38, 'name': 84, 'minute': 37, 'compare': 14, 'speaker': 75, 'type': 56, 'margin': 8, 'charle': 16, 'announce': 37, 'during': 19, 'copy': 64, 'perspective': 41, 'seminar': 10, 'universitaet': 13, 'st': 28, 'anonymous': 18, 'announcement': 26, 'th': 67, 'abstract': 81, 'card': 17, 'vium': 41, 'format': 32, 'pure': 3, 'correspondence': 13, 'nd': 17, 'reference': 72, 'slavic': 10, 'germany': 45, 'negation': 9, 'participate': 16, 'speakers': 10, 'accompany': 10, 'maria': 14, 'acceptable': 7, 'publisher': 14, 'arrange': 5, 'standard': 30, 'detail': 47, 'participation': 28, 'famous': 7, 'several': 32, 'lie': 7, 'rejection': 8, 'onepage': 13, 'teach': 35, 'late': 12, 'before': 49, 'day': 30, 'run': 11, 'under': 28, 'campus': 15, 'comparison': 18, 'compatible': 2, 'structural': 15, 'major': 35, 'cultural': 
13, 'santa': 8, 'west': 12, 'nature': 22, 'organization': 15, 'scope': 15, 'part': 48, 'foot': 3, 'understand': 37, 'society': 34, 'cognition': 13, 'basis': 28, 'relationship': 17, 'program': 67, 'mexico': 14, 'notify': 14, 'special': 37, 'characteristic': 9, 'lecture': 17, 'hardcopy': 13, 'emphasis': 6, 'summer': 16, 'within': 45, 'course': 41, 'crosslinguistic': 8, 'plan': 22, 'enjoy': 5, 'america': 25, 'conceptual': 11, 'july': 32, 'city': 19, 'functional': 25, 'realistic': 2, 'american': 29, 'consideration': 17, 'over': 32, 'four': 22, 'night': 9, 'pragmatic': 39, 'native': 24, 'joan': 9, 'psychological': 9, 'addition': 24, 'direction': 12, 'share': 24, 'der': 14, 'germanic': 9, 'modal': 7, 'romance': 9, 'belgium': 7, 'around': 16, 'fine': 6, 'traditional': 16, 'previous': 12, 'pay': 21, 'attract': 8, 'possible': 58, 'problem': 54, 'organize': 41, 'property': 17, 'soon': 16, 'sum': 11, 'european': 39, 'highly': 8, 'propose': 27, 'interdisciplinary': 14, 'avoid': 8, 'italy': 20, 'logic': 17, 'framework': 25, 'utrecht': 9, 'ph': 23, 'elsewhere': 12, 'let': 22, 'serve': 12, 'thus': 24, 'attend': 14, 'logical': 12, 'fee': 32, 'term': 38, 'main': 31, 'entitle': 5, 'introductory': 8, 'solution': 11, 'advance': 43, 'heart': 4, 'variety': 36, 'become': 27, 'reduce': 8, 'dus': 8, 'explore': 16, 'purpose': 19, 'association': 33, 'per': 27, 'advertise': 2, 'community': 20, 'german': 36, 'und': 9, 'integration': 12, 'im': 10, 'near': 8, 'sheffield': 10, 'die': 12, 'sense': 20, 'college': 29, 'east': 9, 'explanation': 12, 'ltd': 6, 'trade': 2, 'side': 12, 'answer': 21, 'essential': 4, 'mail': 81, 'innovative': 3, 'material': 34, 'sery': 10, 'daily': 10, 'canadian': 9, 'moment': 2, 'forthcome': 7, 'record': 13, 'classroom': 12, 'genre': 9, 'contrast': 15, 'local': 20, 'welcome': 28, 'ave': 4, 'increase': 12, 'gold': 1, 'co': 18, 'textbook': 8, 'every': 12, 'unite': 17, 'dr': 35, 'contribute': 17, 'latest': 8, 'making': 1, 'australium': 6, 'describe': 21, 'whole': 21, 
'grammatical': 32, 'especially': 31, 'discount': 1, 'blvd': 2, 'client': 2, 'prof': 20, 'assist': 8, 'level': 37, 'distribute': 10, 'pb': 10, 'peter': 30, 'plus': 17, 'kind': 29, 'master': 7, 'seller': 1, 'beyond': 13, 'mode': 9, 'directly': 25, 'difference': 18, 'proposal': 23, 'pl': 6, 'across': 17, 'charge': 10, 'base': 57, 'se': 9, 'member': 22, 'usage': 17, 'pour': 6, 'organisation': 6, 'tutorial': 13, 'universite': 11, 'hour': 13, 'respect': 13, 'dan': 13, 'forward': 15, 'fr': 14, 'edition': 7, 'relevance': 15, 'parallel': 13, 'du': 9, 'preference': 14, 'marie': 13, 'la': 19, 'organiser': 16, 'consider': 47, 'leave': 18, 'half': 9, 'summary': 35, 'hold': 63, 'mailto': 1, 'en': 10, 'president': 6, 'organise': 12, 'team': 3, 'institut': 13, 'exact': 4, 'dictionary': 20, 'many': 56, 'japanese': 26, 'behalf': 6, 'reply': 12, 'graduate': 30, 'enough': 11, 'after': 40, 'hbe': 6, 'lot': 14, 'former': 5, 'phrase': 26, 'unfortunately': 3, 'integrate': 21, 'japan': 20, 'jp': 14, 'edit': 10, 'dear': 14, 'link': 22, 'here': 40, 'recently': 25, 't': 33, 'doubt': 5, 'while': 33, 'participant': 38, 'alway': 16, 'upon': 12, 'almost': 4, 'off': 7, 'thousand': 2, 'means': 20, 'themselve': 15, 'count': 8, 'access': 22, 'occur': 11, 'sentence': 25, 'indo': 10, 'borrow': 6, 'suggest': 22, 'effect': 30, 'bilingual': 10, 'back': 15, 'necessarily': 10, 'surface': 17, 'lexicon': 22, 'maintain': 6, 'refer': 20, 'w': 36, 'same': 41, 'multilingual': 19, 'dialect': 20, 'voice': 8, 'another': 28, 'morpheme': 10, 'anthropology': 11, 'comparative': 27, 'account': 41, 'noun': 20, 'urge': 4, 'york': 35, 'fill': 13, 'recognize': 5, 'influence': 14, 'easiest': 1, 'boundary': 6, 'bibliography': 15, 'feel': 11, 'article': 32, 'item': 18, 'previously': 14, 'draw': 11, 'component': 8, 'accord': 17, 'hardly': 9, 'whether': 38, 'literary': 9, 'example': 61, 'identical': 11, 'central': 18, 'constituent': 9, 'attach': 7, 'generative': 15, 'subscribe': 7, 'subscription': 7, 'dissertation': 15, 
'unpublish': 16, 'notice': 19, 'report': 30, 'line': 21, 'appear': 33, 'newsletter': 3, 'spring': 4, 'max': 14, 'volumes': 5, 'actual': 6, 'al': 10, 'journal': 27, 'onto': 2, 'collection': 12, 'cd': 4, 'user': 16, 'amount': 15, 'institution': 13, 'believe': 15, 'ago': 17, 'full': 41, 'deliver': 3, 'multus': 8, 'move': 18, 'image': 12, 'produce': 15, 'media': 9, 'medical': 5, 'among': 28, 'greater': 6, 'open': 41, 'outstand': 4, 'down': 5, 'play': 16, 'thing': 20, ...})

For every common word $w$ we compute the probability that $w$ occurs in a spam or ham email. The formula for spam is:
$$P(w \in\texttt{Spam}) = \frac{\mbox{number of spam emails containing } w}{\mbox{number of all spam emails}}$$
The formula for ham is similar:
$$P(w \in\texttt{Ham}) = \frac{\mbox{number of ham emails containing } w}{\mbox{number of all ham emails}}$$
However, if we used these formulas, then a common word $w$ that, for some reason, has not yet occurred in any spam email would have a probability of $0$ of occurring in a spam email. Since naive Bayes multiplies the probabilities of the words occurring in an email, our classifier would never classify an email containing the word $w$ as spam. As this cannot be right, we assume that there is one further spam email that contains every common word.
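The effect of this assumption can be sketched on a single word. The helper `containment_probability` below is a hypothetical name that is not part of this notebook. The counts are those of the word 'sprachwissenschaft', which occurs in 9 of the 350 ham training emails and, judging from the outputs of this notebook, in none of the 350 spam training emails.

```python
def containment_probability(count, total, smooth=True):
    """Estimate the probability that a word occurs in an email of one class.

    count: number of training emails of the class that contain the word
    total: number of all training emails of the class
    smooth: if True, assume one further email that contains every common word
    """
    extra = 1 if smooth else 0
    return (count + extra) / (total + extra)

no_spam    = 350  # number of spam training emails
spam_count = 0    # assumed: the word occurs in no spam training email

p_raw    = containment_probability(spam_count, no_spam, smooth=False)
p_smooth = containment_probability(spam_count, no_spam)

print(p_raw)     # 0.0
print(p_smooth)  # 0.002849002849002849
```

Without the additional email the estimate is exactly 0, so a single occurrence of this word would rule out the classification as spam. With the additional email the estimate is small but positive.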
This Laplace smoothing assumption changes the formula for $P(w \in\texttt{Spam})$ as follows:
$$P(w \in\texttt{Spam}) = \frac{(\mbox{number of spam emails containing } w) + 1}{(\mbox{number of all spam emails}) + 1}$$
The formula for $P(w \in\texttt{Ham})$ is changed in the same way.

In [16]:
Spam_Probability = {}
Ham__Probability = {}
for w in Common_Words:
    # a Counter returns 0 for a word that it has never seen
    Spam_Probability[w] = (Spam_Counter[w] + 1) / (no_spam + 1)
    Ham__Probability[w] = (Ham__Counter[w] + 1) / (no_ham + 1)
Spam_Probability

Out[16]: {'load': 0.037037037037037035, 'comprise': 0.011396011396011397, 'familiar': 0.022792022792022793, 'teen': 0.045584045584045586, 'massive': 0.02564102564102564, 'gamble': 0.04843304843304843, 'none': 0.019943019943019943, 'implementation': 0.002849002849002849, 'majority': 0.05128205128205128, 'cgibin': 0.02564102564102564, 'rejection': 0.005698005698005698, 'smaller': 0.002849002849002849, 'launch': 0.03418803418803419, 'lee': 0.011396011396011397, 'database': 0.07977207977207977, 'food': 0.022792022792022793, 'window': 0.09686609686609686, 'transfer': 0.02564102564102564, 'candidate': 0.017094017094017096, 'delay': 0.02849002849002849, 'frank': 0.03418803418803419, 'multus': 0.05128205128205128, 'late': 0.039886039886039885, 'engage': 0.011396011396011397, 'work': 0.35327635327635326, 'government': 0.039886039886039885, 'newest': 0.03133903133903134, 'call': 0.3789173789173789, 'vulgarity': 0.022792022792022793, 'material': 0.07122507122507123, 'organisation': 0.008547008547008548, 'affect': 0.008547008547008548, 'hundreds': 0.045584045584045586, 'propose': 0.02849002849002849, 'john': 0.042735042735042736, 'campus': 0.002849002849002849, 'competition': 0.07407407407407407, 'view': 0.05128205128205128, 'penny': 0.05128205128205128, 'currency': 0.03133903133903134, 'gender': 0.002849002849002849, 'class': 0.07692307692307693, 'santa': 0.014245014245014245, 'andrew': 0.005698005698005698, 'bonus': 0.11965811965811966, 'refinance': 0.037037037037037035, 'organization': 0.06837606837606838, 'eric': 0.008547008547008548, 'site':
0.3475783475783476, 'ongo': 0.014245014245014245, 'italy': 0.022792022792022793, 'sprachwissenschaft': 0.002849002849002849, 'operator': 0.017094017094017096, 'little': 0.16809116809116809, 'appear': 0.05698005698005698, 'ms': 0.02849002849002849, 'actually': 0.07122507122507123, 'perform': 0.03418803418803419, 'monthly': 0.08262108262108261, 'opposite': 0.008547008547008548, 'latex': 0.005698005698005698, 'job': 0.1396011396011396, 'forum': 0.017094017094017096, 'correct': 0.042735042735042736, 'install': 0.02849002849002849, 'miss': 0.1225071225071225, 'local': 0.07407407407407407, 'remain': 0.03418803418803419, 'chain': 0.042735042735042736, 'music': 0.042735042735042736, 'ready': 0.1168091168091168, 'hundr': 0.02849002849002849, 'dinner': 0.008547008547008548, 'bill': 0.15669515669515668, 'singapore': 0.002849002849002849, 'option': 0.05698005698005698, 'multimedium': 0.008547008547008548, 'dialect': 0.002849002849002849, 'translation': 0.002849002849002849, 'most': 0.32193732193732194, 'different': 0.1794871794871795, 'literature': 0.005698005698005698, 'unite': 0.09116809116809117, 'sit': 0.06552706552706553, 'sun': 0.02564102564102564, 'desirous': 0.03133903133903134, 'bear': 0.017094017094017096, 'scientist': 0.005698005698005698, 'income': 0.19943019943019943, 'urge': 0.019943019943019943, 'life': 0.18803418803418803, 'extensive': 0.02564102564102564, 'label': 0.02564102564102564, 'city': 0.19943019943019943, 'july': 0.019943019943019943, 'mouse': 0.02849002849002849, 'win': 0.25071225071225073, 'continent': 0.017094017094017096, 'die': 0.017094017094017096, 'tom': 0.011396011396011397, 'diploma': 0.022792022792022793, 'edit': 0.022792022792022793, 'fulfill': 0.03418803418803419, 'sequence': 0.05128205128205128, 'lucky': 0.05982905982905983, 'less': 0.16524216524216523, 'manufacturer': 0.02849002849002849, 'implication': 0.002849002849002849, 'colingacl': 0.002849002849002849, 'global': 0.05128205128205128, 'referral': 0.02564102564102564, 'western': 
0.011396011396011397, 'using': 0.07977207977207977, 'map': 0.005698005698005698, 'great': 0.2336182336182336, 'response': 0.1282051282051282, 'wrong': 0.039886039886039885, 'bind': 0.017094017094017096, 'analyse': 0.005698005698005698, 'busy': 0.017094017094017096, 'pb': 0.002849002849002849, 'commerce': 0.022792022792022793, 'ext': 0.045584045584045586, 'polish': 0.005698005698005698, 'worldwide': 0.07692307692307693, 'previously': 0.04843304843304843, 'peter': 0.014245014245014245, 'studies': 0.002849002849002849, 'appropriate': 0.017094017094017096, 'mailbox': 0.07122507122507123, 'again': 0.22507122507122507, 'partner': 0.07977207977207977, 'truly': 0.08831908831908832, 'catch': 0.04843304843304843, 'develop': 0.042735042735042736, 'meeting': 0.019943019943019943, 'title': 0.09686609686609686, 'cfp': 0.002849002849002849, 'parttime': 0.02849002849002849, 'begin': 0.09401709401709402, 'dozen': 0.017094017094017096, 'addition': 0.037037037037037035, 'artificial': 0.019943019943019943, 'experiment': 0.008547008547008548, 'mci': 0.037037037037037035, 'around': 0.09686609686609686, 'alexis': 0.002849002849002849, 'april': 0.022792022792022793, 'dictionary': 0.002849002849002849, 'receive': 0.45014245014245013, 'internet': 0.3561253561253561, 'exercise': 0.014245014245014245, 'edinburgh': 0.002849002849002849, 'oversea': 0.022792022792022793, 'paper': 0.09971509971509972, 'charge': 0.1396011396011396, 'j': 0.03418803418803419, 'store': 0.07977207977207977, 'nice': 0.05982905982905983, 'raleigh': 0.037037037037037035, 'six': 0.07977207977207977, 'style': 0.014245014245014245, 'www': 0.3162393162393162, 'member': 0.15669515669515668, 'activity': 0.037037037037037035, 'y': 0.014245014245014245, 'onetime': 0.03133903133903134, 'comment': 0.03418803418803419, 'status': 0.014245014245014245, 'means': 0.05413105413105413, 'reap': 0.03418803418803419, 'chat': 0.04843304843304843, 'utility': 0.037037037037037035, 'usage': 0.011396011396011397, 'beautiful': 
0.05982905982905983, 'extractor': 0.03418803418803419, 'hotmail': 0.037037037037037035, 'phrase': 0.011396011396011397, 'doctor': 0.02849002849002849, 'generally': 0.005698005698005698, 'highly': 0.05698005698005698, 'helpful': 0.014245014245014245, 'access': 0.13105413105413105, 'red': 0.02564102564102564, 'ac': 0.019943019943019943, 'usa': 0.09116809116809117, 'structural': 0.002849002849002849, 'likewise': 0.014245014245014245, 'major': 0.16524216524216523, 'wherea': 0.002849002849002849, 'acl': 0.002849002849002849, 'canadian': 0.019943019943019943, 'cds': 0.03133903133903134, 'mastercard': 0.10256410256410256, 'programme': 0.002849002849002849, 'living': 0.02849002849002849, 'ps': 0.02564102564102564, 'goe': 0.019943019943019943, 'design': 0.09971509971509972, 'end': 0.13675213675213677, 'foot': 0.03133903133903134, 'postscript': 0.002849002849002849, 'empirical': 0.002849002849002849, 'color': 0.042735042735042736, 'corner': 0.02564102564102564, 'unsubscribe': 0.05982905982905983, 'change': 0.17094017094017094, 'deal': 0.09971509971509972, 'substantial': 0.05698005698005698, 'planet': 0.02849002849002849, 'michigan': 0.002849002849002849, 'dollars': 0.05698005698005698, 'simon': 0.005698005698005698, 'trip': 0.042735042735042736, 'award': 0.039886039886039885, 'credit': 0.20512820512820512, 'distinguish': 0.005698005698005698, 'quantifier': 0.002849002849002849, 'registration': 0.02849002849002849, 'grateful': 0.005698005698005698, 'doe': 0.02849002849002849, 'hesitate': 0.07407407407407407, 'psycholinguistic': 0.002849002849002849, 'robert': 0.014245014245014245, 'publication': 0.037037037037037035, 'typical': 0.022792022792022793, 'demo': 0.042735042735042736, 'log': 0.03418803418803419, 'e': 0.25071225071225073, 'interest': 0.28205128205128205, 'm': 0.22507122507122507, 'tremendous': 0.03418803418803419, 'simple': 0.20797720797720798, 'phonological': 0.002849002849002849, 'excellent': 0.07122507122507123, 'enjoy': 0.10541310541310542, 'genie': 
0.02849002849002849, 'preview': 0.037037037037037035, 'december': 0.037037037037037035, 'impossible': 0.011396011396011397, 'america': 0.08547008547008547, 'webmaster': 0.037037037037037035, 'dori': 0.02849002849002849, 'le': 0.011396011396011397, 'de': 0.014245014245014245, 'please': 0.5384615384615384, 'reality': 0.02564102564102564, 'heart': 0.02849002849002849, 'weeks': 0.03418803418803419, 'parse': 0.002849002849002849, 'meet': 0.09116809116809117, 'russian': 0.005698005698005698, 'dear': 0.10826210826210826, 'netherland': 0.008547008547008548, 'speak': 0.019943019943019943, 'floor': 0.011396011396011397, 'blvd': 0.037037037037037035, 'entirely': 0.014245014245014245, 'clearance': 0.03133903133903134, 'practical': 0.011396011396011397, 'contents': 0.019943019943019943, 'medium': 0.02564102564102564, 'lay': 0.045584045584045586, 'somewhat': 0.019943019943019943, 'surely': 0.04843304843304843, 'accompany': 0.002849002849002849, 'buy': 0.22792022792022792, 'judgment': 0.03133903133903134, 'orders': 0.06837606837606838, 'fairchild': 0.03133903133903134, 'zero': 0.019943019943019943, 'focus': 0.017094017094017096, 'spout': 0.03133903133903134, 'wish': 0.245014245014245, 'translate': 0.008547008547008548, 'vendor': 0.03133903133903134, 'faith': 0.042735042735042736, 'sincerely': 0.10256410256410256, 'mixe': 0.011396011396011397, 'draw': 0.019943019943019943, 'typology': 0.002849002849002849, 'joan': 0.002849002849002849, 'participation': 0.02849002849002849, 'quote': 0.03133903133903134, 'integrate': 0.005698005698005698, 'recently': 0.06267806267806268, 'vocabulary': 0.002849002849002849, 'interaction': 0.002849002849002849, 'kit': 0.03133903133903134, 'cycle': 0.02564102564102564, 'session': 0.011396011396011397, 'august': 0.005698005698005698, 'sometime': 0.05128205128205128, 'non': 0.019943019943019943, 'volumes': 0.017094017094017096, 'anderson': 0.008547008547008548, 'anna': 0.002849002849002849, 'discussion': 0.011396011396011397, 'diskette': 
0.02564102564102564, 'finding': 0.014245014245014245, 'entire': 0.08547008547008547, 'fl': 0.037037037037037035, 'tip': 0.05698005698005698, 'player': 0.03133903133903134, 'traditional': 0.05413105413105413, 'lie': 0.017094017094017096, 'opportunity': 0.21082621082621084, 'ltd': 0.008547008547008548, 'shop': 0.07692307692307693, 'front': 0.05698005698005698, 'reread': 0.02849002849002849, 'alway': 0.16524216524216523, 'condition': 0.019943019943019943, 'band': 0.02564102564102564, 'sex': 0.06552706552706553, 'eye': 0.04843304843304843, 'proper': 0.022792022792022793, 'century': 0.03133903133903134, 'avenue': 0.03418803418803419, 'buck': 0.022792022792022793, 'motivation': 0.002849002849002849, 'postfach': 0.002849002849002849, 'macintosh': 0.017094017094017096, 'ignore': 0.06552706552706553, 'inquiry': 0.04843304843304843, 'vary': 0.02564102564102564, 'md': 0.04843304843304843, 'poor': 0.03418803418803419, 'move': 0.1111111111111111, 'top': 0.1452991452991453, 'capture': 0.017094017094017096, 'european': 0.008547008547008548, 'harri': 0.002849002849002849, 'israel': 0.008547008547008548, 'modify': 0.03133903133903134, 'eventually': 0.03418803418803419, 'conversation': 0.019943019943019943, 'assessment': 0.008547008547008548, 'produce': 0.09686609686609686, 'coverage': 0.011396011396011397, 'media': 0.011396011396011397, 'click': 0.28774928774928776, 'indo': 0.002849002849002849, 'ma': 0.039886039886039885, 'female': 0.017094017094017096, 'genuine': 0.022792022792022793, 'typological': 0.002849002849002849, 'interdisciplinary': 0.005698005698005698, 'predicate': 0.002849002849002849, 'banner': 0.02564102564102564, 'provide': 0.20512820512820512, 'back': 0.245014245014245, 'generate': 0.12535612535612536, 'independent': 0.06267806267806268, 'june': 0.03133903133903134, 'monday': 0.02849002849002849, 'individual': 0.07407407407407407, 'speed': 0.05982905982905983, 'dan': 0.005698005698005698, 'thing': 0.16524216524216523, 'demand': 0.02564102564102564, 'wealth': 
0.03133903133903134, 'value': 0.08831908831908832, 'programs': 0.045584045584045586, 'texa': 0.017094017094017096, 'nt': 0.3646723646723647, 'national': 0.039886039886039885, 'introduction': 0.014245014245014245, 'amazing': 0.05698005698005698, 'hr': 0.045584045584045586, 'intelligent': 0.005698005698005698, 'request': 0.23076923076923078, 'surface': 0.008547008547008548, 'classified': 0.022792022792022793, 'policy': 0.017094017094017096, 'mediumsize': 0.02564102564102564, 'plain': 0.042735042735042736, 'hit': 0.1225071225071225, 'im': 0.008547008547008548, 'commonly': 0.017094017094017096, 'star': 0.04843304843304843, 'guy': 0.045584045584045586, 'symposium': 0.002849002849002849, 'w': 0.05413105413105413, 'september': 0.017094017094017096, 'forever': 0.05413105413105413, 'notify': 0.02564102564102564, 'rest': 0.08262108262108261, 'repeat': 0.019943019943019943, 'martin': 0.011396011396011397, 'description': 0.014245014245014245, 'undoubtedly': 0.022792022792022793, 'myself': 0.06552706552706553, 'zip': 0.150997150997151, 'kong': 0.005698005698005698, 'happen': 0.10256410256410256, 'direct': 0.10541310541310542, 'percentage': 0.03418803418803419, 'grammar': 0.002849002849002849, 'delivery': 0.09686609686609686, 'reply': 0.2564102564102564, 'du': 0.002849002849002849, 'weekend': 0.037037037037037035, 'although': 0.039886039886039885, 'french': 0.005698005698005698, 'rich': 0.07122507122507123, 'http': 0.45014245014245013, 'stay': 0.06267806267806268, 'scheme': 0.03133903133903134, 'conceptual': 0.005698005698005698, 'deep': 0.02564102564102564, 'subject': 0.2934472934472934, 'abuse': 0.03133903133903134, 'consult': 0.008547008547008548, 'below': 0.2849002849002849, 'fastest': 0.037037037037037035, 'organiser': 0.002849002849002849, 'comparable': 0.011396011396011397, 'currently': 0.07122507122507123, 'influence': 0.005698005698005698, 'suppose': 0.03418803418803419, 'htm': 0.07122507122507123, 'enhance': 0.014245014245014245, 'video': 0.11396011396011396, 'jone': 
0.008547008547008548, 'document': 0.03418803418803419, 'mb': 0.02564102564102564, 'consideration': 0.02564102564102564, 'apology': 0.02849002849002849, 'iro': 0.002849002849002849, 'michael': 0.019943019943019943, 'result': 0.1623931623931624, 'bid': 0.05413105413105413, 'until': 0.1282051282051282, 'excess': 0.05413105413105413, 'put': 0.2222222222222222, 'exclude': 0.014245014245014245, 'hi': 0.07122507122507123, 'unlimit': 0.05982905982905983, 'bring': 0.10256410256410256, 'explore': 0.008547008547008548, 'trend': 0.011396011396011397, 'try': 0.21652421652421652, 'numbers': 0.03133903133903134, 'expect': 0.07122507122507123, 'organise': 0.002849002849002849, 'ed': 0.017094017094017096, 'history': 0.039886039886039885, 'favorite': 0.03133903133903134, 'due': 0.07692307692307693, 'nc': 0.042735042735042736, 'promptly': 0.02849002849002849, 'yours': 0.15669515669515668, 'san': 0.022792022792022793, 'reports': 0.06267806267806268, 'documentation': 0.014245014245014245, 'initially': 0.042735042735042736, 'near': 0.045584045584045586, 'birth': 0.019943019943019943, 'park': 0.04843304843304843, 'trash': 0.039886039886039885, 'firm': 0.02564102564102564, 'confident': 0.037037037037037035, 'virtually': 0.042735042735042736, 'syntactic': 0.002849002849002849, 'evergrow': 0.02849002849002849, 'quit': 0.05413105413105413, 'register': 0.07977207977207977, 'general': 0.039886039886039885, 'stun': 0.03133903133903134, 'benefit': 0.07977207977207977, 'quality': 0.07407407407407407, 'spain': 0.002849002849002849, 'teacher': 0.005698005698005698, 'research': 0.07692307692307693, 'pic': 0.017094017094017096, 'participant': 0.039886039886039885, 'evaluation': 0.014245014245014245, 'responsible': 0.03133903133903134, 'perceive': 0.002849002849002849, 'institution': 0.008547008547008548, 'bernard': 0.002849002849002849, 'off': 0.2022792022792023, 'cooperation': 0.005698005698005698, 'marketing': 0.08547008547008547, 'north': 0.022792022792022793, 'illustrate': 0.002849002849002849, 
'hardcore': 0.02849002849002849, 'yield': 0.019943019943019943, 'newsgroup': 0.03133903133903134, 'fantastic': 0.07977207977207977, 'science': 0.008547008547008548, 'beach': 0.05982905982905983, 'innovative': 0.014245014245014245, 'treat': 0.06552706552706553, 'signal': 0.011396011396011397, 'run': 0.1396011396011396, 'exactly': 0.1111111111111111, 'brand': 0.04843304843304843, 'pennsylvanium': 0.002849002849002849, 'wait': 0.14814814814814814, 'contact': 0.19373219373219372, 'minute': 0.11965811965811966, 'totally': 0.05982905982905983, 'statistics': 0.042735042735042736, 'state': 0.2962962962962963, 'amateur': 0.042735042735042736, 'researcher': 0.005698005698005698, 'clean': 0.05698005698005698, 'cluster': 0.002849002849002849, 'open': 0.13105413105413105, 'reconstruction': 0.002849002849002849, 'chri': 0.014245014245014245, 'perfectly': 0.05982905982905983, 'help': 0.2564102564102564, 'completely': 0.10541310541310542, 'operate': 0.06552706552706553, 'loss': 0.039886039886039885, 'watch': 0.10541310541310542, 'approve': 0.03133903133903134, 'someone': 0.1396011396011396, 'arise': 0.005698005698005698, 'scope': 0.005698005698005698, 'development': 0.008547008547008548, 'unless': 0.05413105413105413, 'version': 0.045584045584045586, 'necessarily': 0.002849002849002849, 'edition': 0.019943019943019943, 'dutch': 0.002849002849002849, 'novel': 0.005698005698005698, 'trial': 0.06837606837606838, 'juno': 0.03418803418803419, 'fast': 0.07977207977207977, 'millions': 0.045584045584045586, 'twenty': 0.019943019943019943, 'dramatically': 0.03418803418803419, 'anywhere': 0.14245014245014245, 'original': 0.05698005698005698, 'acceptance': 0.017094017094017096, 'downsize': 0.03133903133903134, 'today': 0.3333333333333333, 'exact': 0.05128205128205128, 'weekly': 0.07692307692307693, 'keynote': 0.002849002849002849, 'forget': 0.07977207977207977, 'characteristic': 0.005698005698005698, 'txt': 0.03418803418803419, 'even': 0.2849002849002849, 'increase': 0.11396011396011396, 
'clear': 0.045584045584045586, 'advertiser': 0.042735042735042736, 'purchase': 0.21082621082621084, 'date': 0.15384615384615385, 'integration': 0.005698005698005698, 'conversational': 0.002849002849002849, 'adult': 0.1225071225071225, 'classic': 0.011396011396011397, 'plan': 0.11965811965811966, 'earth': 0.06267806267806268, 'bottom': 0.06837606837606838, 'associate': 0.06552706552706553, 'sales': 0.06552706552706553, 'south': 0.045584045584045586, 'comprehensive': 0.019943019943019943, 'making': 0.03418803418803419, 'transcription': 0.002849002849002849, 'easily': 0.1168091168091168, 'finger': 0.03418803418803419, 'la': 0.02564102564102564, 'mortgage': 0.05128205128205128, 'add': 0.22792022792022792, 'conceal': 0.037037037037037035, 'verbal': 0.005698005698005698, 'underlie': 0.002849002849002849, 'long': 0.11396011396011396, 'import': 0.02564102564102564, 'analysis': 0.014245014245014245, 'tool': 0.10541310541310542, 'application': 0.045584045584045586, 'perhap': 0.014245014245014245, 'snail': 0.03133903133903134, 'review': 0.06837606837606838, 'easiest': 0.04843304843304843, 'extremely': 0.07407407407407407, 'though': 0.05698005698005698, 'verify': 0.045584045584045586, 'leave': 0.16524216524216523, 'virtual': 0.03133903133903134, 'dynamic': 0.011396011396011397, 'recipient': 0.039886039886039885, 'couple': 0.06267806267806268, 'least': 0.1282051282051282, 'germany': 0.019943019943019943, 'useful': 0.03133903133903134, 'client': 0.05413105413105413, 'television': 0.042735042735042736, 'prof': 0.008547008547008548, 'scott': 0.005698005698005698, 'phd': 0.019943019943019943, 'importance': 0.008547008547008548, 'gb': 0.011396011396011397, 'korean': 0.002849002849002849, 'order': 0.3732193732193732, 'variation': 0.002849002849002849, 'aim': 0.008547008547008548, 'trust': 0.039886039886039885, 'thoma': 0.017094017094017096, 'corporations': 0.04843304843304843, 'musical': 0.017094017094017096, 'much': 0.2934472934472934, 'lifetime': 0.04843304843304843, 'intrusion': 
0.045584045584045586, 'set': 0.12535612535612536, 'iii': 0.014245014245014245, 'potential': 0.13105413105413105, 'concept': 0.017094017094017096, 'country': 0.11965811965811966, 'hotel': 0.022792022792022793, 'school': 0.04843304843304843, 'master': 0.03418803418803419, 'acquire': 0.06552706552706553, 'laugh': 0.037037037037037035, 'mo': 0.02564102564102564, 'psychological': 0.002849002849002849, 'club': 0.05413105413105413, 'obviously': 0.06837606837606838, 'debt': 0.08831908831908832, 'extract': 0.045584045584045586, 'vision': 0.011396011396011397, 'million': 0.245014245014245, 'acquisition': 0.008547008547008548, 'object': 0.008547008547008548, 'human': 0.022792022792022793, 'sake': 0.037037037037037035, 'include': 0.37037037037037035, 't': 0.21082621082621084, 'success': 0.1737891737891738, 'theoretical': 0.002849002849002849, 'community': 0.022792022792022793, 'vacation': 0.09116809116809117, 'sample': 0.05413105413105413, 'wh': 0.005698005698005698, 'five': 0.04843304843304843, 'finance': 0.042735042735042736, 'search': 0.1737891737891738, 'datum': 0.02849002849002849, 'engine': 0.09971509971509972, 'while': 0.1339031339031339, 'everybody': 0.022792022792022793, 'february': 0.011396011396011397, 'author': 0.022792022792022793, 'editor': 0.005698005698005698, 'summarize': 0.008547008547008548, 'fundamental': 0.002849002849002849, 'publish': 0.06837606837606838, 'blackwell': 0.002849002849002849, 'yourself': 0.20797720797720798, 'hello': 0.10541310541310542, 'investigation': 0.017094017094017096, 'universal': 0.011396011396011397, 'karen': 0.022792022792022793, 'jump': 0.039886039886039885, 'umontreal': 0.002849002849002849, 'follow': 0.33903133903133903, 'professional': 0.09686609686609686, 'emerge': 0.005698005698005698, 'once': 0.19658119658119658, 'day': 0.4415954415954416, 'jame': 0.008547008547008548, 'slip': 0.037037037037037035, 'medical': 0.02849002849002849, 'borrow': 0.037037037037037035, 'song': 0.019943019943019943, 'idea': 0.08547008547008547, 
'hour': 0.2678062678062678, 'argument': 0.002849002849002849, 'obvious': 0.008547008547008548, 'listing': 0.014245014245014245, 'november': 0.008547008547008548, 'subscription': 0.017094017094017096, 'shift': 0.011396011396011397, 'pleasure': 0.02564102564102564, 'london': 0.03418803418803419, 'however': 0.07977207977207977, 'fun': 0.1282051282051282, 'tree': 0.022792022792022793, 'man': 0.05128205128205128, 'classroom': 0.002849002849002849, 'dates': 0.002849002849002849, 'mt': 0.008547008547008548, 'accommodation': 0.008547008547008548, 'cm': 0.005698005698005698, 'alternative': 0.011396011396011397, 'send': 0.4415954415954416, 'diversity': 0.002849002849002849, 'framework': 0.002849002849002849, 'bag': 0.022792022792022793, 'cognition': 0.002849002849002849, 'relationship': 0.008547008547008548, 'reasonable': 0.022792022792022793, 'att': 0.017094017094017096, 'exceed': 0.019943019943019943, 'von': 0.002849002849002849, 'during': 0.02564102564102564, 'region': 0.005698005698005698, 'accessible': 0.037037037037037035, 'linguistics': 0.005698005698005698, 'refer': 0.02849002849002849, 'observe': 0.005698005698005698, 'britain': 0.002849002849002849, 'creditor': 0.045584045584045586, 'specify': 0.02564102564102564, 'moneymake': 0.05982905982905983, 'relevance': 0.002849002849002849, 'honor': 0.017094017094017096, 'connection': 0.07977207977207977, 'news': 0.08831908831908832, 'parameter': 0.005698005698005698, 'twelve': 0.03418803418803419, 'cv': 0.005698005698005698, 'mike': 0.014245014245014245, 'parallel': 0.005698005698005698, 'apply': 0.05413105413105413, 'is': 0.23931623931623933, 'here': 0.4188034188034188, 'short': 0.1111111111111111, 'password': 0.02564102564102564, 'mellon': 0.002849002849002849, 'textbook': 0.002849002849002849, 'telephone': 0.09116809116809117, 'morn': 0.019943019943019943, 'paragraph': 0.02849002849002849, 'educational': 0.011396011396011397, 'worth': 0.09401709401709402, 'effective': 0.10826210826210826, 'reflect': 
0.005698005698005698, 'normal': 0.011396011396011397, 'compute': 0.011396011396011397, 'ba': 0.02564102564102564, 'pretty': 0.039886039886039885, 'comparative': 0.002849002849002849, 'contribute': 0.011396011396011397, 'latest': 0.10541310541310542, 'minimum': 0.03133903133903134, 'interpretation': 0.002849002849002849, 'australium': 0.011396011396011397, 'drop': 0.07122507122507123, 'tell': 0.20512820512820512, 'fill': 0.15954415954415954, 'richer': 0.042735042735042736, 'compile': 0.02564102564102564, 'residual': 0.03418803418803419, 'convention': 0.008547008547008548, 'belgium': 0.002849002849002849, 'martha': 0.002849002849002849, 'functional': 0.014245014245014245, 'variety': 0.014245014245014245, 'white': 0.008547008547008548, 'spell': 0.005698005698005698, 'interview': 0.05413105413105413, 'isp': 0.05128205128205128, 'cognitive': 0.002849002849002849, 'slavic': 0.002849002849002849, 'u': 0.1396011396011396, 'possibility': 0.019943019943019943, 'speech': 0.005698005698005698, 'hire': 0.017094017094017096, 'spokane': 0.03133903133903134, 'japanese': 0.008547008547008548, 'education': 0.03418803418803419, 'unify': 0.002849002849002849, 'distribute': 0.045584045584045586, 'perception': 0.002849002849002849, 'enquiry': 0.005698005698005698, 'president': 0.022792022792022793, 'poster': 0.005698005698005698, 'african': 0.014245014245014245, 'town': 0.03133903133903134, 'overload': 0.039886039886039885, 'everythe': 0.045584045584045586, 'testimonial': 0.04843304843304843, 'death': 0.03418803418803419, 'christian': 0.005698005698005698, 'purpose': 0.039886039886039885, 'return': 0.18803418803418803, 'team': 0.05982905982905983, 'duration': 0.002849002849002849, 'cinema': 0.02849002849002849, 'lot': 0.150997150997151, 'seem': 0.05413105413105413, 'linguistic': 0.002849002849002849, 'freedom': 0.10541310541310542, 'v': 0.08262108262108261, 'want': 0.41595441595441596, 'cheap': 0.03418803418803419, 'total': 0.13675213675213677, 'dept': 0.022792022792022793, 'literary': 
0.002849002849002849, 'second': 0.06837606837606838, 'honest': 0.05128205128205128, 'wife': 0.037037037037037035, 'head': 0.037037037037037035, 'agency': 0.05413105413105413, 'final': 0.039886039886039885, 'assume': 0.06837606837606838, 'berlin': 0.002849002849002849, 'alone': 0.07407407407407407, 'compete': 0.019943019943019943, 'three': 0.07407407407407407, 'attach': 0.039886039886039885, 'montreal': 0.002849002849002849, 'movie': 0.05698005698005698, 'consist': 0.011396011396011397, 're': 0.29914529914529914, 'incredible': 0.05413105413105413, 'lack': 0.022792022792022793, 'mistake': 0.022792022792022793, 'satisfy': 0.05698005698005698, 'fine': 0.03133903133903134, 'middle': 0.037037037037037035, 'ask': 0.19373219373219372, 'capability': 0.017094017094017096, 'guest': 0.011396011396011397, 'discover': 0.09116809116809117, 'resort': 0.02849002849002849, 'amount': 0.17663817663817663, 'define': 0.002849002849002849, 'certainly': 0.03418803418803419, 'problem': 0.13675213675213677, 'organize': 0.008547008547008548, 'believe': 0.1623931623931624, 'resell': 0.06552706552706553, 'survey': 0.017094017094017096, 'whatsoever': 0.039886039886039885, 'lisa': 0.008547008547008548, 'cent': 0.05698005698005698, 'largest': 0.05698005698005698, 'requirement': 0.039886039886039885, 'property': 0.03133903133903134, 'msn': 0.02849002849002849, 'mail': 0.5128205128205128, 'loan': 0.05982905982905983, 'domain': 0.06267806267806268, 'released': 0.03133903133903134, 'pour': 0.008547008547008548, 'sum': 0.008547008547008548, 'complete': 0.15669515669515668, 'syntax': 0.002849002849002849, 'symbol': 0.019943019943019943, 'command': 0.019943019943019943, 'sheffield': 0.002849002849002849, 'interpret': 0.002849002849002849, 'reviewer': 0.008547008547008548, 'merciless': 0.03133903133903134, 'nijmegen': 0.002849002849002849, 'dupe': 0.02564102564102564, 'industrial': 0.008547008547008548, 'phonology': 0.002849002849002849, 'create': 0.1623931623931624, 'pittsburgh': 0.008547008547008548, 
'radio': 0.05128205128205128, 'together': 0.05982905982905983, 'sender': 0.06837606837606838, 'privacy': 0.037037037037037035, 'nature': 0.008547008547008548, 'length': 0.011396011396011397, 'contributor': 0.002849002849002849, 'forthcome': 0.014245014245014245, 'bet': 0.042735042735042736, 'genre': 0.002849002849002849, 'product': 0.28205128205128205, 'expiration': 0.06267806267806268, 'indiana': 0.002849002849002849, 'nl': 0.002849002849002849, 'obligation': 0.037037037037037035, 'newsletter': 0.045584045584045586, 'ii': 0.011396011396011397, 'emailer': 0.03418803418803419, 'exclusive': 0.06837606837606838, 'note': 0.17094017094017094, 'maintain': 0.03418803418803419, 'assure': 0.02849002849002849, 'rock': 0.022792022792022793, 'quick': 0.08262108262108261, 'retail': 0.03133903133903134, 'copy': 0.18518518518518517, 'x': 0.1339031339031339, 'ad': 0.14245014245014245, 'light': 0.014245014245014245, 'serve': 0.02564102564102564, 'toy': 0.02849002849002849, 'press': 0.037037037037037035, 'file': 0.15954415954415954, 'days': 0.042735042735042736, 'gold': 0.05982905982905983, 'strength': 0.014245014245014245, 'few': 0.19658119658119658, 'eastern': 0.011396011396011397, 'both': 0.13105413105413105, 'dependency': 0.002849002849002849, 'chomsky': 0.002849002849002849, 'course': 0.1168091168091168, 'morpheme': 0.002849002849002849, 'financially': 0.04843304843304843, 'device': 0.017094017094017096, 'behind': 0.02849002849002849, 'situation': 0.03133903133903134, 'parent': 0.02564102564102564, 'foundation': 0.017094017094017096, 'bit': 0.05698005698005698, 'orient': 0.008547008547008548, 'stealth': 0.06552706552706553, 'remember': 0.13105413105413105, 'vium': 0.13105413105413105, 'personally': 0.017094017094017096, 'difficult': 0.037037037037037035, 'money': 0.4017094017094017, 'jp': 0.005698005698005698, 'institute': 0.011396011396011397, 'disc': 0.017094017094017096, 'lyric': 0.017094017094017096, 'correspondence': 0.008547008547008548, 'cancel': 0.02564102564102564, 
'probably': 0.07407407407407407, 'asset': 0.042735042735042736, 'prompt': 0.04843304843304843, 'equipment': 0.017094017094017096, 'feature': 0.06552706552706553, 'id': 0.06267806267806268, 'conclude': 0.037037037037037035, 'need': 0.41595441595441596, 'sale': 0.16524216524216523, 'truth': 0.02564102564102564, 'automatically': 0.09116809116809117, 'mix': 0.037037037037037035, 'spend': 0.1396011396011396, 'sorry': 0.06837606837606838, 'index': 0.05698005698005698, 'art': 0.03133903133903134, 'fact': 0.12535612535612536, 'department': 0.02564102564102564, 'common': 0.042735042735042736, 'estate': 0.039886039886039885, 'publisher': 0.019943019943019943, 'basically': 0.04843304843304843, 'conjunction': 0.002849002849002849, 'ram': 0.022792022792022793, 'arrange': 0.014245014245014245, 'whether': 0.05982905982905983, 'example': 0.10256410256410256, 'sure': 0.18803418803418803, 'plenary': 0.002849002849002849, 'love': 0.1111111111111111, 'preparation': 0.014245014245014245, 'actual': 0.02849002849002849, 'cameraready': 0.002849002849002849, 'cost': 0.2849002849002849, 'point': 0.08831908831908832, 'experience': 0.18233618233618235, 'quickly': 0.08547008547008547, 'thousands': 0.03133903133903134, 'ultimate': 0.03133903133903134, 'browser': 0.02564102564102564, 'started': 0.03418803418803419, 'reg': 0.03418803418803419, 'coordinate': 0.002849002849002849, 'rush': 0.042735042735042736, 'network': 0.06552706552706553, 'indefinite': 0.002849002849002849, 'delete': 0.11396011396011396, 'mailer': 0.045584045584045586, 'package': 0.1396011396011396, 'instead': 0.05128205128205128, 'side': 0.011396011396011397, 'already': 0.14814814814814814, 'essential': 0.03133903133903134, 'ever': 0.22507122507122507, 'responsibility': 0.019943019943019943, 'mit': 0.002849002849002849, 'brief': 0.02564102564102564, 'desire': 0.06267806267806268, 'discovery': 0.019943019943019943, 'royal': 0.011396011396011397, 'update': 0.08262108262108261, 'intelligence': 0.045584045584045586, 'control': 
0.08831908831908832, 'present': 0.07977207977207977, 'round': 0.011396011396011397, 'sponsor': 0.022792022792022793, 'occur': 0.017094017094017096, 'philosophy': 0.005698005698005698, 'current': 0.06267806267806268, 'web': 0.2678062678062678, 'text': 0.09116809116809117, 'broadcast': 0.017094017094017096, 'spot': 0.022792022792022793, 'effect': 0.017094017094017096, 'tradition': 0.002849002849002849, 'tv': 0.05413105413105413, 'germanic': 0.002849002849002849, ...}

According to our computation, the probability that a spam email contains the word 'consonant' is about $0.28\%$, while the probability that this word occurs in a ham email is about $2.56\%$.

In [17]:

Spam_Probability['consonant'], Ham__Probability['consonant']

Out[17]:

(0.002849002849002849, 0.02564102564102564)

For the word 'dollar', the probability that a spam email contains this word is about $21.1\%$, while the probability that this word occurs in a ham email is about $1.99\%$.

In [18]:

Spam_Probability['dollar'], Ham__Probability['dollar']

Out[18]:

(0.21082621082621084, 0.019943019943019943)

Deciding whether an Email is Spam

Given a file name fn, the function spam_probability returns the probability that the message contained in the given file is spam. When implementing the formula
$$\arg\max\limits_{C \in \mathcal{C}} \left(\prod\limits_{i=1}^m P(f_i \;|\; C)\right) \cdot P(C)$$
we have to be careful, because a naive implementation will evaluate the product
$$\prod\limits_{i=1}^m P(f_i \;|\; C)$$
as the number $0$ due to numerical underflow. The trick to compute this product is to remember that
$$\ln(a \cdot b) = \ln(a) + \ln(b)$$
and therefore to transform the product into a sum of logarithms:
$$\prod\limits_{i=1}^m P(f_i \;|\; C) = \exp\left(\alpha + \sum\limits_{i=1}^m \ln\bigl(P(f_i \;|\; C)\bigr) \right) \cdot \exp(-\alpha)$$
Here, the constant $\alpha$ has to be chosen such that the application of the function exp to the value
$$\alpha + \sum\limits_{i=1}^m \ln\bigl(P(f_i \;|\; C)\bigr)$$
does not lead to an underflow error.
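To see the underflow problem and the remedy in isolation, here is a minimal sketch that is not part of the original notebook. The lists probs_a and probs_b are hypothetical toy likelihoods for two classes, and equal priors are assumed so that the priors cancel in the final quotient.

```python
import math

# Hypothetical toy data: 2000 conditional probabilities per class.
probs_a = [0.01] * 2000
probs_b = [0.01] * 1999 + [0.03]

def naive_product(ps):
    """Multiply the probabilities directly; this underflows for long lists."""
    result = 1.0
    for p in ps:
        result *= p
    return result

def log_score(ps):
    """Sum of logarithms, i.e. the logarithm of the product."""
    return sum(math.log(p) for p in ps)

naive_a = naive_product(probs_a)   # underflows to 0.0
naive_b = naive_product(probs_b)   # underflows to 0.0

log_a = log_score(probs_a)
log_b = log_score(probs_b)

# Shift both log scores so that the larger one becomes 0 before calling exp.
alpha    = -max(log_a, log_b)
scaled_a = math.exp(log_a + alpha)
scaled_b = math.exp(log_b + alpha)

# With equal priors (an assumption of this sketch) the shared factor
# exp(-alpha) cancels when the two scaled terms are normalized.
p_a = scaled_a / (scaled_a + scaled_b)

print(naive_a, naive_b)
print(log_a, log_b)
print(p_a)
```

The naive products carry no information any more, whereas the shifted log scores still allow the two classes to be compared.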
As we want to compute a probability, we have to be aware that the term
$$\left(\prod\limits_{i=1}^m P(f_i \;|\; C)\right) \cdot P(C)$$
is not the probability that the object is of class $C$, but is only proportional to this probability. Since the probability that an email is spam and the probability that it is ham must add up to $1$, we obtain the actual probability by dividing the term for spam by the sum of the terms for spam and ham. The common factor $\exp(-\alpha)$ cancels in this quotient, so it never has to be computed.

In [19]:

def spam_probability(fn):
    log_p_spam = 0.0
    log_p_ham  = 0.0
    words = get_common_words(fn)
    for w in Common_Words:
        if w in words:
            log_p_spam += math.log(Spam_Probability[w])
            log_p_ham  += math.log(Ham__Probability[w])
        else:
            log_p_spam += math.log(1.0 - Spam_Probability[w])
            log_p_ham  += math.log(1.0 - Ham__Probability[w])
    # both sums are negative, hence alpha shifts the larger one to 0
    alpha  = abs(max(log_p_spam, log_p_ham))
    p_spam = math.exp(log_p_spam + alpha) * spam_prior
    p_ham  = math.exp(log_p_ham  + alpha) * ham__prior
    return p_spam / (p_spam + p_ham)

Let us test this with a ham email.

In [20]:

spam_probability('EmailData/ham-train/3-430msg1.txt')

Out[20]:

6.289803980920058e-29

OK, we got this one right. Let us check the general performance.

Evaluate Precision and Recall

In order to evaluate the performance of this algorithm, we need to define two new concepts: precision and recall. Let us call the ham emails the positives, while the spam emails are called the negatives. Then we define
• true positives: ham emails that are classified as ham,
• false positives: spam emails that are classified as ham,
• true negatives: spam emails that are classified as spam,
• false negatives: ham emails that are classified as spam.

The precision of the spam classifier is then defined as
$$\texttt{precision} = \frac{\mbox{number of true positives}}{\mbox{number of true positives} + \mbox{number of false positives}}$$
Therefore, the precision measures the percentage of genuine ham emails among all emails that are classified as ham.

The recall of the spam classifier is defined as
$$\texttt{recall} = \frac{\mbox{number of true positives}}{\mbox{number of true positives} + \mbox{number of false negatives}}$$
Therefore, the recall measures the percentage of all ham emails that are indeed classified as ham. Usually, it is very important that the recall is high, as we don't want to lose a ham email because our classifier has incorrectly classified it as spam. On the other hand, a high precision is less important. After all, if $10\%$ of the emails offered to us as ham are, in fact, spam, we might tolerate this. However, we would certainly not tolerate losing $10\%$ of our ham emails because they are incorrectly classified as spam.

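The function precission_recall takes two directories as arguments: spam_dir is supposed to contain spam emails, while ham_dir is supposed to contain ham emails. It returns the precision, the recall, and the accuracy of our spam classifier with respect to these data, where the accuracy is the fraction of all emails that are classified correctly. An email is classified as spam if its spam probability is greater than $0.5$.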



In [21]:

def precission_recall(spam_dir, ham_dir):
    TN = 0 # true negatives
    FP = 0 # false positives
    for email in os.listdir(spam_dir):
        if spam_probability(spam_dir + email) > 0.5:
            TN += 1
        else:
            FP += 1
    FN = 0 # false negatives
    TP = 0 # true positives
    for email in os.listdir(ham_dir):
        if spam_probability(ham_dir + email) > 0.5:
            FN += 1
        else:
            TP += 1
    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    accuracy  = (TN + TP) / (TN + TP + FN + FP)
    return precision, recall, accuracy
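
Since the recall matters more to us than the precision, one might consider moving the decision threshold away from $0.5$. The following is a minimal sketch of that idea and is not part of the original notebook: the helper precision_recall_at and both score lists are hypothetical, and the scores stand in for values that spam_probability would return on emails whose true class is known.

```python
def precision_recall_at(spam_scores, ham_scores, threshold=0.5):
    """Precision and recall (ham = positive) for a given spam threshold.

    spam_scores: spam probabilities assigned to emails known to be spam.
    ham_scores:  spam probabilities assigned to emails known to be ham.
    """
    TN = sum(1 for p in spam_scores if p > threshold)  # spam flagged as spam
    FP = len(spam_scores) - TN                         # spam let through as ham
    FN = sum(1 for p in ham_scores if p > threshold)   # ham flagged as spam
    TP = len(ham_scores) - FN                          # ham kept as ham
    precision = TP / (TP + FP) if TP + FP > 0 else float('nan')
    recall    = TP / (TP + FN) if TP + FN > 0 else float('nan')
    return precision, recall

# Hypothetical scores, chosen only for illustration.
spam_scores = [0.99, 0.97, 0.80, 0.40, 0.10]
ham_scores  = [0.01, 0.02, 0.30, 0.60, 0.05]

at_050 = precision_recall_at(spam_scores, ham_scores, threshold=0.5)
at_090 = precision_recall_at(spam_scores, ham_scores, threshold=0.9)
print(at_050)
print(at_090)
```

On these toy scores, raising the threshold from 0.5 to 0.9 lifts the recall from 4/5 to 1 while the precision drops from 4/6 to 5/8. Whether a similar trade-off is worthwhile on the real data would have to be checked on the test directories.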




In [22]:

precission_recall(spam_dir_train, ham__dir_train)




Out[22]:

(0.8495145631067961, 1.0, 0.9114285714285715)




In [23]:

precission_recall(spam_dir_test, ham__dir_test)




Out[23]:

(0.7791411042944786, 0.9769230769230769, 0.85)
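
The reported figures can be cross-checked against the formulas. The sketch below is not part of the original notebook, and the counts in it are inferred rather than reported: they assume 350 emails per class in the training directories and 130 per class in the test directories, as stated in the data description. The helper metrics is a hypothetical name.

```python
def metrics(TP, FP, TN, FN):
    """Precision, recall and accuracy with ham as the positive class."""
    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    accuracy  = (TN + TP) / (TN + TP + FN + FP)
    return precision, recall, accuracy

# Counts inferred from the outputs above (an assumption, not reported data).
train = metrics(TP=350, FP=62, TN=288, FN=0)
test  = metrics(TP=127, FP=36, TN=94,  FN=3)
print(train)
print(test)
```

If this reconstruction is right, the classifier loses 3 of the 130 ham emails in the test set, while 36 of the 130 spam emails slip through as ham.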



