Algorithms Exercise 1

Imports


In [2]:
%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np

Word counting

Write a function tokenize that takes a string of English text and returns a list of words. It should also remove stop words, which are common short words that are often removed before natural language processing. Your function should have the following logic:

  • Split the string into lines using splitlines.
  • Split each line into a list of words and merge the lists for each line.
  • Use Python's builtin filter function to remove all punctuation.
  • If stop_words is a list, remove all occurrences of the words in the list.
  • If stop_words is a space-delimited string of words, split them and remove them.
  • Remove any remaining empty words.
  • Make all words lowercase.

In [3]:
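# Read the chapter; the with block closes the file afterwards.
with open("mobydick_chapter1.txt") as file:
    mobydick = file.read()

# Merge all lines into one space-separated string.
mobydick = mobydick.splitlines()
mobydick = " ".join(mobydick)

# Remove a small trial set of punctuation characters.
punctuation = ["-", ",", "."]

mobydick = list(mobydick)
mobydick_f = list(filter(lambda c: c not in punctuation, mobydick))

mobydick_f = "".join(mobydick_f)

# Remove a small trial set of stop words.
stop_words = ["of", "or", "in"]

mobydick_fs = mobydick_f.split()
mobydick_fs = list(filter(lambda w: w not in stop_words, mobydick_fs))

print(mobydick_fs)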


['Call', 'me', 'Ishmael', 'Some', 'years', 'agonever', 'mind', 'how', 'long', 'preciselyhaving', 'little', 'no', 'money', 'my', 'purse', 'and', 'nothing', 'particular', 'to', 'interest', 'me', 'on', 'shore', 'I', 'thought', 'I', 'would', 'sail', 'about', 'a', 'little', 'and', 'see', 'the', 'watery', 'part', 'the', 'world', 'It', 'is', 'a', 'way', 'I', 'have', 'driving', 'off', 'the', 'spleen', 'and', 'regulating', 'the', 'circulation', 'Whenever', 'I', 'find', 'myself', 'growing', 'grim', 'about', 'the', 'mouth;', 'whenever', 'it', 'is', 'a', 'damp', 'drizzly', 'November', 'my', 'soul;', 'whenever', 'I', 'find', 'myself', 'involuntarily', 'pausing', 'before', 'coffin', 'warehouses', 'and', 'bringing', 'up', 'the', 'rear', 'every', 'funeral', 'I', 'meet;', 'and', 'especially', 'whenever', 'my', 'hypos', 'get', 'such', 'an', 'upper', 'hand', 'me', 'that', 'it', 'requires', 'a', 'strong', 'moral', 'principle', 'to', 'prevent', 'me', 'from', 'deliberately', 'stepping', 'into', 'the', 'street', 'and', 'methodically', 'knocking', "people's", 'hats', 'offthen', 'I', 'account', 'it', 'high', 'time', 'to', 'get', 'to', 'sea', 'as', 'soon', 'as', 'I', 'can', 'This', 'is', 'my', 'substitute', 'for', 'pistol', 'and', 'ball', 'With', 'a', 'philosophical', 'flourish', 'Cato', 'throws', 'himself', 'upon', 'his', 'sword;', 'I', 'quietly', 'take', 'to', 'the', 'ship', 'There', 'is', 'nothing', 'surprising', 'this', 'If', 'they', 'but', 'knew', 'it', 'almost', 'all', 'men', 'their', 'degree', 'some', 'time', 'other', 'cherish', 'very', 'nearly', 'the', 'same', 'feelings', 'towards', 'the', 'ocean', 'with', 'me', 'There', 'now', 'is', 'your', 'insular', 'city', 'the', 'Manhattoes', 'belted', 'round', 'by', 'wharves', 'as', 'Indian', 'isles', 'by', 'coral', 'reefscommerce', 'surrounds', 'it', 'with', 'her', 'surf', 'Right', 'and', 'left', 'the', 'streets', 'take', 'you', 'waterward', 'Its', 'extreme', 'downtown', 'is', 'the', 'battery', 'where', 'that', 'noble', 'mole', 'is', 
'washed', 'by', 'waves', 'and', 'cooled', 'by', 'breezes', 'which', 'a', 'few', 'hours', 'previous', 'were', 'out', 'sight', 'land', 'Look', 'at', 'the', 'crowds', 'watergazers', 'there', 'Circumambulate', 'the', 'city', 'a', 'dreamy', 'Sabbath', 'afternoon', 'Go', 'from', 'Corlears', 'Hook', 'to', 'Coenties', 'Slip', 'and', 'from', 'thence', 'by', 'Whitehall', 'northward', 'What', 'do', 'you', 'see?Posted', 'like', 'silent', 'sentinels', 'all', 'around', 'the', 'town', 'stand', 'thousands', 'upon', 'thousands', 'mortal', 'men', 'fixed', 'ocean', 'reveries', 'Some', 'leaning', 'against', 'the', 'spiles;', 'some', 'seated', 'upon', 'the', 'pierheads;', 'some', 'looking', 'over', 'the', 'bulwarks', 'ships', 'from', 'China;', 'some', 'high', 'aloft', 'the', 'rigging', 'as', 'if', 'striving', 'to', 'get', 'a', 'still', 'better', 'seaward', 'peep', 'But', 'these', 'are', 'all', 'landsmen;', 'week', 'days', 'pent', 'up', 'lath', 'and', 'plastertied', 'to', 'counters', 'nailed', 'to', 'benches', 'clinched', 'to', 'desks', 'How', 'then', 'is', 'this?', 'Are', 'the', 'green', 'fields', 'gone?', 'What', 'do', 'they', 'here?', 'But', 'look!', 'here', 'come', 'more', 'crowds', 'pacing', 'straight', 'for', 'the', 'water', 'and', 'seemingly', 'bound', 'for', 'a', 'dive', 'Strange!', 'Nothing', 'will', 'content', 'them', 'but', 'the', 'extremest', 'limit', 'the', 'land;', 'loitering', 'under', 'the', 'shady', 'lee', 'yonder', 'warehouses', 'will', 'not', 'suffice', 'No', 'They', 'must', 'get', 'just', 'as', 'nigh', 'the', 'water', 'as', 'they', 'possibly', 'can', 'without', 'falling', 'And', 'there', 'they', 'standmiles', 'themleagues', 'Inlanders', 'all', 'they', 'come', 'from', 'lanes', 'and', 'alleys', 'streets', 'and', 'avenuesnorth', 'east', 'south', 'and', 'west', 'Yet', 'here', 'they', 'all', 'unite', 'Tell', 'me', 'does', 'the', 'magnetic', 'virtue', 'the', 'needles', 'the', 'compasses', 'all', 'those', 'ships', 'attract', 'them', 'thither?', 'Once', 'more', 'Say', 'you', 
'are', 'the', 'country;', 'some', 'high', 'land', 'lakes', 'Take', 'almost', 'any', 'path', 'you', 'please', 'and', 'ten', 'to', 'one', 'it', 'carries', 'you', 'down', 'a', 'dale', 'and', 'leaves', 'you', 'there', 'by', 'a', 'pool', 'the', 'stream', 'There', 'is', 'magic', 'it', 'Let', 'the', 'most', 'absentminded', 'men', 'be', 'plunged', 'his', 'deepest', 'reveriesstand', 'that', 'man', 'on', 'his', 'legs', 'set', 'his', 'feet', 'agoing', 'and', 'he', 'will', 'infallibly', 'lead', 'you', 'to', 'water', 'if', 'water', 'there', 'be', 'all', 'that', 'region', 'Should', 'you', 'ever', 'be', 'athirst', 'the', 'great', 'American', 'desert', 'try', 'this', 'experiment', 'if', 'your', 'caravan', 'happen', 'to', 'be', 'supplied', 'with', 'a', 'metaphysical', 'professor', 'Yes', 'as', 'every', 'one', 'knows', 'meditation', 'and', 'water', 'are', 'wedded', 'for', 'ever', 'But', 'here', 'is', 'an', 'artist', 'He', 'desires', 'to', 'paint', 'you', 'the', 'dreamiest', 'shadiest', 'quietest', 'most', 'enchanting', 'bit', 'romantic', 'landscape', 'all', 'the', 'valley', 'the', 'Saco', 'What', 'is', 'the', 'chief', 'element', 'he', 'employs?', 'There', 'stand', 'his', 'trees', 'each', 'with', 'a', 'hollow', 'trunk', 'as', 'if', 'a', 'hermit', 'and', 'a', 'crucifix', 'were', 'within;', 'and', 'here', 'sleeps', 'his', 'meadow', 'and', 'there', 'sleep', 'his', 'cattle;', 'and', 'up', 'from', 'yonder', 'cottage', 'goes', 'a', 'sleepy', 'smoke', 'Deep', 'into', 'distant', 'woodlands', 'winds', 'a', 'mazy', 'way', 'reaching', 'to', 'overlapping', 'spurs', 'mountains', 'bathed', 'their', 'hillside', 'blue', 'But', 'though', 'the', 'picture', 'lies', 'thus', 'tranced', 'and', 'though', 'this', 'pinetree', 'shakes', 'down', 'its', 'sighs', 'like', 'leaves', 'upon', 'this', "shepherd's", 'head', 'yet', 'all', 'were', 'vain', 'unless', 'the', "shepherd's", 'eye', 'were', 'fixed', 'upon', 'the', 'magic', 'stream', 'before', 'him', 'Go', 'visit', 'the', 'Prairies', 'June', 'when', 'for', 
'scores', 'on', 'scores', 'miles', 'you', 'wade', 'kneedeep', 'among', 'Tigerlilieswhat', 'is', 'the', 'one', 'charm', 'wanting?Waterthere', 'is', 'not', 'a', 'drop', 'water', 'there!', 'Were', 'Niagara', 'but', 'a', 'cataract', 'sand', 'would', 'you', 'travel', 'your', 'thousand', 'miles', 'to', 'see', 'it?', 'Why', 'did', 'the', 'poor', 'poet', 'Tennessee', 'upon', 'suddenly', 'receiving', 'two', 'handfuls', 'silver', 'deliberate', 'whether', 'to', 'buy', 'him', 'a', 'coat', 'which', 'he', 'sadly', 'needed', 'invest', 'his', 'money', 'a', 'pedestrian', 'trip', 'to', 'Rockaway', 'Beach?', 'Why', 'is', 'almost', 'every', 'robust', 'healthy', 'boy', 'with', 'a', 'robust', 'healthy', 'soul', 'him', 'at', 'some', 'time', 'other', 'crazy', 'to', 'go', 'to', 'sea?', 'Why', 'upon', 'your', 'first', 'voyage', 'as', 'a', 'passenger', 'did', 'you', 'yourself', 'feel', 'such', 'a', 'mystical', 'vibration', 'when', 'first', 'told', 'that', 'you', 'and', 'your', 'ship', 'were', 'now', 'out', 'sight', 'land?', 'Why', 'did', 'the', 'old', 'Persians', 'hold', 'the', 'sea', 'holy?', 'Why', 'did', 'the', 'Greeks', 'give', 'it', 'a', 'separate', 'deity', 'and', 'own', 'brother', 'Jove?', 'Surely', 'all', 'this', 'is', 'not', 'without', 'meaning', 'And', 'still', 'deeper', 'the', 'meaning', 'that', 'story', 'Narcissus', 'who', 'because', 'he', 'could', 'not', 'grasp', 'the', 'tormenting', 'mild', 'image', 'he', 'saw', 'the', 'fountain', 'plunged', 'into', 'it', 'and', 'was', 'drowned', 'But', 'that', 'same', 'image', 'we', 'ourselves', 'see', 'all', 'rivers', 'and', 'oceans', 'It', 'is', 'the', 'image', 'the', 'ungraspable', 'phantom', 'life;', 'and', 'this', 'is', 'the', 'key', 'to', 'it', 'all', 'Now', 'when', 'I', 'say', 'that', 'I', 'am', 'the', 'habit', 'going', 'to', 'sea', 'whenever', 'I', 'begin', 'to', 'grow', 'hazy', 'about', 'the', 'eyes', 'and', 'begin', 'to', 'be', 'over', 'conscious', 'my', 'lungs', 'I', 'do', 'not', 'mean', 'to', 'have', 'it', 'inferred', 'that', 'I', 
'ever', 'go', 'to', 'sea', 'as', 'a', 'passenger', 'For', 'to', 'go', 'as', 'a', 'passenger', 'you', 'must', 'needs', 'have', 'a', 'purse', 'and', 'a', 'purse', 'is', 'but', 'a', 'rag', 'unless', 'you', 'have', 'something', 'it', 'Besides', 'passengers', 'get', 'seasickgrow', "quarrelsomedon't", 'sleep', 'nightsdo', 'not', 'enjoy', 'themselves', 'much', 'as', 'a', 'general', 'thing;no', 'I', 'never', 'go', 'as', 'a', 'passenger;', 'nor', 'though', 'I', 'am', 'something', 'a', 'salt', 'do', 'I', 'ever', 'go', 'to', 'sea', 'as', 'a', 'Commodore', 'a', 'Captain', 'a', 'Cook', 'I', 'abandon', 'the', 'glory', 'and', 'distinction', 'such', 'offices', 'to', 'those', 'who', 'like', 'them', 'For', 'my', 'part', 'I', 'abominate', 'all', 'honourable', 'respectable', 'toils', 'trials', 'and', 'tribulations', 'every', 'kind', 'whatsoever', 'It', 'is', 'quite', 'as', 'much', 'as', 'I', 'can', 'do', 'to', 'take', 'care', 'myself', 'without', 'taking', 'care', 'ships', 'barques', 'brigs', 'schooners', 'and', 'what', 'not', 'And', 'as', 'for', 'going', 'as', 'cookthough', 'I', 'confess', 'there', 'is', 'considerable', 'glory', 'that', 'a', 'cook', 'being', 'a', 'sort', 'officer', 'on', 'shipboardyet', 'somehow', 'I', 'never', 'fancied', 'broiling', 'fowls;though', 'once', 'broiled', 'judiciously', 'buttered', 'and', 'judgmatically', 'salted', 'and', 'peppered', 'there', 'is', 'no', 'one', 'who', 'will', 'speak', 'more', 'respectfully', 'not', 'to', 'say', 'reverentially', 'a', 'broiled', 'fowl', 'than', 'I', 'will', 'It', 'is', 'out', 'the', 'idolatrous', 'dotings', 'the', 'old', 'Egyptians', 'upon', 'broiled', 'ibis', 'and', 'roasted', 'river', 'horse', 'that', 'you', 'see', 'the', 'mummies', 'those', 'creatures', 'their', 'huge', 'bakehouses', 'the', 'pyramids', 'No', 'when', 'I', 'go', 'to', 'sea', 'I', 'go', 'as', 'a', 'simple', 'sailor', 'right', 'before', 'the', 'mast', 'plumb', 'down', 'into', 'the', 'forecastle', 'aloft', 'there', 'to', 'the', 'royal', 'masthead', 'True', 
'they', 'rather', 'order', 'me', 'about', 'some', 'and', 'make', 'me', 'jump', 'from', 'spar', 'to', 'spar', 'like', 'a', 'grasshopper', 'a', 'May', 'meadow', 'And', 'at', 'first', 'this', 'sort', 'thing', 'is', 'unpleasant', 'enough', 'It', 'touches', "one's", 'sense', 'honour', 'particularly', 'if', 'you', 'come', 'an', 'old', 'established', 'family', 'the', 'land', 'the', 'Van', 'Rensselaers', 'Randolphs', 'Hardicanutes', 'And', 'more', 'than', 'all', 'if', 'just', 'previous', 'to', 'putting', 'your', 'hand', 'into', 'the', 'tarpot', 'you', 'have', 'been', 'lording', 'it', 'as', 'a', 'country', 'schoolmaster', 'making', 'the', 'tallest', 'boys', 'stand', 'awe', 'you', 'The', 'transition', 'is', 'a', 'keen', 'one', 'I', 'assure', 'you', 'from', 'a', 'schoolmaster', 'to', 'a', 'sailor', 'and', 'requires', 'a', 'strong', 'decoction', 'Seneca', 'and', 'the', 'Stoics', 'to', 'enable', 'you', 'to', 'grin', 'and', 'bear', 'it', 'But', 'even', 'this', 'wears', 'off', 'time', 'What', 'it', 'if', 'some', 'old', 'hunks', 'a', 'seacaptain', 'orders', 'me', 'to', 'get', 'a', 'broom', 'and', 'sweep', 'down', 'the', 'decks?', 'What', 'does', 'that', 'indignity', 'amount', 'to', 'weighed', 'I', 'mean', 'the', 'scales', 'the', 'New', 'Testament?', 'Do', 'you', 'think', 'the', 'archangel', 'Gabriel', 'thinks', 'anything', 'the', 'less', 'me', 'because', 'I', 'promptly', 'and', 'respectfully', 'obey', 'that', 'old', 'hunks', 'that', 'particular', 'instance?', 'Who', "ain't", 'a', 'slave?', 'Tell', 'me', 'that', 'Well', 'then', 'however', 'the', 'old', 'seacaptains', 'may', 'order', 'me', 'abouthowever', 'they', 'may', 'thump', 'and', 'punch', 'me', 'about', 'I', 'have', 'the', 'satisfaction', 'knowing', 'that', 'it', 'is', 'all', 'right;', 'that', 'everybody', 'else', 'is', 'one', 'way', 'other', 'served', 'much', 'the', 'same', 'wayeither', 'a', 'physical', 'metaphysical', 'point', 'view', 'that', 'is;', 'and', 'so', 'the', 'universal', 'thump', 'is', 'passed', 'round', 'and', 
'all', 'hands', 'should', 'rub', 'each', "other's", 'shoulderblades', 'and', 'be', 'content', 'Again', 'I', 'always', 'go', 'to', 'sea', 'as', 'a', 'sailor', 'because', 'they', 'make', 'a', 'point', 'paying', 'me', 'for', 'my', 'trouble', 'whereas', 'they', 'never', 'pay', 'passengers', 'a', 'single', 'penny', 'that', 'I', 'ever', 'heard', 'On', 'the', 'contrary', 'passengers', 'themselves', 'must', 'pay', 'And', 'there', 'is', 'all', 'the', 'difference', 'the', 'world', 'between', 'paying', 'and', 'being', 'paid', 'The', 'act', 'paying', 'is', 'perhaps', 'the', 'most', 'uncomfortable', 'infliction', 'that', 'the', 'two', 'orchard', 'thieves', 'entailed', 'upon', 'us', 'But', 'BEING', 'PAIDwhat', 'will', 'compare', 'with', 'it?', 'The', 'urbane', 'activity', 'with', 'which', 'a', 'man', 'receives', 'money', 'is', 'really', 'marvellous', 'considering', 'that', 'we', 'so', 'earnestly', 'believe', 'money', 'to', 'be', 'the', 'root', 'all', 'earthly', 'ills', 'and', 'that', 'on', 'no', 'account', 'can', 'a', 'monied', 'man', 'enter', 'heaven', 'Ah!', 'how', 'cheerfully', 'we', 'consign', 'ourselves', 'to', 'perdition!', 'Finally', 'I', 'always', 'go', 'to', 'sea', 'as', 'a', 'sailor', 'because', 'the', 'wholesome', 'exercise', 'and', 'pure', 'air', 'the', 'forecastle', 'deck', 'For', 'as', 'this', 'world', 'head', 'winds', 'are', 'far', 'more', 'prevalent', 'than', 'winds', 'from', 'astern', '(that', 'is', 'if', 'you', 'never', 'violate', 'the', 'Pythagorean', 'maxim)', 'so', 'for', 'the', 'most', 'part', 'the', 'Commodore', 'on', 'the', 'quarterdeck', 'gets', 'his', 'atmosphere', 'at', 'second', 'hand', 'from', 'the', 'sailors', 'on', 'the', 'forecastle', 'He', 'thinks', 'he', 'breathes', 'it', 'first;', 'but', 'not', 'so', 'In', 'much', 'the', 'same', 'way', 'do', 'the', 'commonalty', 'lead', 'their', 'leaders', 'many', 'other', 'things', 'at', 'the', 'same', 'time', 'that', 'the', 'leaders', 'little', 'suspect', 'it', 'But', 'wherefore', 'it', 'was', 'that', 
'after', 'having', 'repeatedly', 'smelt', 'the', 'sea', 'as', 'a', 'merchant', 'sailor', 'I', 'should', 'now', 'take', 'it', 'into', 'my', 'head', 'to', 'go', 'on', 'a', 'whaling', 'voyage;', 'this', 'the', 'invisible', 'police', 'officer', 'the', 'Fates', 'who', 'has', 'the', 'constant', 'surveillance', 'me', 'and', 'secretly', 'dogs', 'me', 'and', 'influences', 'me', 'some', 'unaccountable', 'wayhe', 'can', 'better', 'answer', 'than', 'any', 'one', 'else', 'And', 'doubtless', 'my', 'going', 'on', 'this', 'whaling', 'voyage', 'formed', 'part', 'the', 'grand', 'programme', 'Providence', 'that', 'was', 'drawn', 'up', 'a', 'long', 'time', 'ago', 'It', 'came', 'as', 'a', 'sort', 'brief', 'interlude', 'and', 'solo', 'between', 'more', 'extensive', 'performances', 'I', 'take', 'it', 'that', 'this', 'part', 'the', 'bill', 'must', 'have', 'run', 'something', 'like', 'this:', '"GRAND', 'CONTESTED', 'ELECTION', 'FOR', 'THE', 'PRESIDENCY', 'OF', 'THE', 'UNITED', 'STATES', '"WHALING', 'VOYAGE', 'BY', 'ONE', 'ISHMAEL', '"BLOODY', 'BATTLE', 'IN', 'AFFGHANISTAN"', 'Though', 'I', 'cannot', 'tell', 'why', 'it', 'was', 'exactly', 'that', 'those', 'stage', 'managers', 'the', 'Fates', 'put', 'me', 'down', 'for', 'this', 'shabby', 'part', 'a', 'whaling', 'voyage', 'when', 'others', 'were', 'set', 'down', 'for', 'magnificent', 'parts', 'high', 'tragedies', 'and', 'short', 'and', 'easy', 'parts', 'genteel', 'comedies', 'and', 'jolly', 'parts', 'farcesthough', 'I', 'cannot', 'tell', 'why', 'this', 'was', 'exactly;', 'yet', 'now', 'that', 'I', 'recall', 'all', 'the', 'circumstances', 'I', 'think', 'I', 'can', 'see', 'a', 'little', 'into', 'the', 'springs', 'and', 'motives', 'which', 'being', 'cunningly', 'presented', 'to', 'me', 'under', 'various', 'disguises', 'induced', 'me', 'to', 'set', 'about', 'performing', 'the', 'part', 'I', 'did', 'besides', 'cajoling', 'me', 'into', 'the', 'delusion', 'that', 'it', 'was', 'a', 'choice', 'resulting', 'from', 'my', 'own', 'unbiased', 'freewill', 
'and', 'discriminating', 'judgment', 'Chief', 'among', 'these', 'motives', 'was', 'the', 'overwhelming', 'idea', 'the', 'great', 'whale', 'himself', 'Such', 'a', 'portentous', 'and', 'mysterious', 'monster', 'roused', 'all', 'my', 'curiosity', 'Then', 'the', 'wild', 'and', 'distant', 'seas', 'where', 'he', 'rolled', 'his', 'island', 'bulk;', 'the', 'undeliverable', 'nameless', 'perils', 'the', 'whale;', 'these', 'with', 'all', 'the', 'attending', 'marvels', 'a', 'thousand', 'Patagonian', 'sights', 'and', 'sounds', 'helped', 'to', 'sway', 'me', 'to', 'my', 'wish', 'With', 'other', 'men', 'perhaps', 'such', 'things', 'would', 'not', 'have', 'been', 'inducements;', 'but', 'as', 'for', 'me', 'I', 'am', 'tormented', 'with', 'an', 'everlasting', 'itch', 'for', 'things', 'remote', 'I', 'love', 'to', 'sail', 'forbidden', 'seas', 'and', 'land', 'on', 'barbarous', 'coasts', 'Not', 'ignoring', 'what', 'is', 'good', 'I', 'am', 'quick', 'to', 'perceive', 'a', 'horror', 'and', 'could', 'still', 'be', 'social', 'with', 'itwould', 'they', 'let', 'mesince', 'it', 'is', 'but', 'well', 'to', 'be', 'on', 'friendly', 'terms', 'with', 'all', 'the', 'inmates', 'the', 'place', 'one', 'lodges', 'By', 'reason', 'these', 'things', 'then', 'the', 'whaling', 'voyage', 'was', 'welcome;', 'the', 'great', 'floodgates', 'the', 'wonderworld', 'swung', 'open', 'and', 'the', 'wild', 'conceits', 'that', 'swayed', 'me', 'to', 'my', 'purpose', 'two', 'and', 'two', 'there', 'floated', 'into', 'my', 'inmost', 'soul', 'endless', 'processions', 'the', 'whale', 'and', 'mid', 'most', 'them', 'all', 'one', 'grand', 'hooded', 'phantom', 'like', 'a', 'snow', 'hill', 'the', 'air']
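
The list above contains fused tokens such as 'agonever' and 'preciselyhaving', which appear when a dash joining two words is deleted outright. A minimal sketch of one alternative, assuming it is acceptable to replace each punctuation character with a space instead of dropping it (this departs from the filter-based approach the exercise asks for). The sample string and the helper name strip_punctuation are illustrative only and not part of the exercise:

```python
def strip_punctuation(text, punctuation="-,."):
    """Replace each punctuation character with a space (hypothetical helper)."""
    # str.maketrans maps every punctuation character to a single space.
    table = str.maketrans(punctuation, " " * len(punctuation))
    return text.translate(table)

# Illustrative sample; words joined by a double dash stay separate.
sample = "Some years ago--never mind how long precisely--having little money"
words = strip_punctuation(sample).split()
print(words)
```

Because split() with no argument collapses runs of whitespace, the extra spaces introduced by the replacement do not produce empty words.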

In [4]:
phrase = ['the cat', 'ran away']
' '.join(phrase).split(' ')


Out[4]:
['the', 'cat', 'ran', 'away']

In [5]:
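def tokenize(s, stop_words=None, punctuation='`~!@#$%^&*()_-+={[}]|\\:;"<,>.?/}\t'):
    """Split a string into lowercase words, removing punctuation and stop words."""
    # Merge all lines into a single space-separated string.
    s = " ".join(s.splitlines())

    # Drop every character that appears in the punctuation string.
    s_f = "".join(filter(lambda c: c not in punctuation, s))

    # Accept stop words as a space-delimited string or as a list.
    #http://stackoverflow.com/questions/402504/how-to-determine-the-variable-type-in-python
    if isinstance(stop_words, str):
        stop_words_l = stop_words.split()
    elif isinstance(stop_words, list):
        stop_words_l = stop_words
    else:
        stop_words_l = []

    # split() with no argument also discards empty words.
    s_fs = s_f.split()
    # Stop words are matched before lowercasing, so capitalized ones (e.g. "The") are kept.
    s_fs = list(filter(lambda w: w not in stop_words_l, s_fs))
    s_fs = [w.lower() for w in s_fs]

    return s_fs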

In [7]:
assert tokenize("This, is the way; that things will end", stop_words=['the', 'is']) == \
    ['this', 'way', 'that', 'things', 'will', 'end']
wasteland = """
APRIL is the cruellest month, breeding
Lilacs out of the dead land, mixing
Memory and desire, stirring
Dull roots with spring rain.
"""

assert tokenize(wasteland, stop_words='is the of and') == \
    ['april','cruellest','month','breeding','lilacs','out','dead','land',
     'mixing','memory','desire','stirring','dull','roots','with','spring',
     'rain']

Write a function count_words that takes a list of words and returns a dictionary where the keys in the dictionary are the unique words in the list and the values are the word counts.


In [8]:
def count_words(data):
    """Return a dictionary mapping each unique word in data to its count."""
    count = {}
    for w in data:
        if w in count:
            count[w] += 1
        else:
            count[w] = 1

    # The dictionary is not ordered by count; sorting is done separately on its items.
    return count

In [9]:
assert count_words(tokenize('this and the this from and a a a')) == \
    {'a': 3, 'and': 2, 'from': 1, 'the': 1, 'this': 2}
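
As a cross-check on the counting logic, the standard library's collections.Counter builds the same mapping. This is a sketch of an alternative, not the approach the exercise requires; the word list is typed out by hand so the block runs on its own:

```python
from collections import Counter

words = ['this', 'and', 'the', 'this', 'from', 'and', 'a', 'a', 'a']

# Counter is a dict subclass that tallies hashable items.
counts = Counter(words)

# Converting to a plain dict gives the mapping the assertion above expects.
print(dict(counts) == {'a': 3, 'and': 2, 'from': 1, 'the': 1, 'this': 2})
```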

Write a function sort_word_counts that returns a list of sorted word counts:

  • Each element of the list should be a (word, count) tuple.
  • The list should be sorted by the word counts, with the highest counts coming first.
  • To perform this sort, look at using the sorted function with a custom key and reverse argument.

In [10]:
def sort_word_counts(wc):
    """Return a list of 2-tuples of (word, count), sorted by count descending."""
    # Convert the dictionary into a list of (word, count) tuples.
    wordlist = list(wc.items())

    #http://stackoverflow.com/questions/3121979/how-to-sort-list-tuple-of-lists-tuples
    wordlist_s = sorted(wordlist, key=lambda tup: tup[1], reverse=True)
    # Printing here is what produces the output shown under the cells that call this function.
    print(wordlist_s)
    return wordlist_s

In [11]:
assert sort_word_counts(count_words(tokenize('this and a the this this and a a a'))) == \
    [('a', 4), ('this', 3), ('and', 2), ('the', 1)]


[('a', 4), ('this', 3), ('and', 2), ('the', 1)]
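
Python's sorted is stable, so words with equal counts keep whatever order they had in the input. A sketch of a variant with a deterministic tie-break, assuming alphabetical order is wanted among equal counts; the name sort_word_counts_tiebreak is hypothetical and not part of the exercise:

```python
def sort_word_counts_tiebreak(wc):
    """Sort (word, count) pairs by count descending, then by word ascending."""
    # Negating the count puts high counts first without reverse=True,
    # which lets the word itself act as an ascending tie-breaker.
    return sorted(wc.items(), key=lambda item: (-item[1], item[0]))

example = {'the': 1, 'this': 3, 'and': 2, 'a': 4, 'from': 1}
print(sort_word_counts_tiebreak(example))
# [('a', 4), ('this', 3), ('and', 2), ('from', 1), ('the', 1)]
```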

Perform a word count analysis on Chapter 1 of Moby Dick, whose text can be found in the file mobydick_chapter1.txt:

  • Read the file into a string.
  • Tokenize with stop words of 'the of and a to in is it that as'.
  • Perform a word count, then sort and save the result in a variable named swc.

In [12]:
# Read the chapter into a single string; the with block closes the file.
with open("mobydick_chapter1.txt") as file:
    mobydick = file.read()

mobydick_t = tokenize(mobydick, stop_words="the of and a to in is it that as")

mobydick_wc = count_words(mobydick_t)

swc = sort_word_counts(mobydick_wc)


[('i', 43), ('me', 24), ('you', 23), ('all', 23), ('this', 17), ('for', 16), ('but', 15), ('there', 15), ('my', 14), ('with', 13), ('on', 12), ('they', 12), ('go', 12), ('some', 11), ('from', 11), ('not', 11), ('or', 10), ('sea', 10), ('his', 10), ('one', 10), ('he', 9), ('be', 9), ('upon', 9), ('if', 9), ('into', 9), ('have', 8), ('was', 8), ('by', 8), ('what', 7), ('part', 7), ('do', 7), ('and', 7), ('were', 7), ('why', 7), ('land', 6), ('down', 6), ('your', 6), ('more', 6), ('time', 6), ('can', 6), ('about', 6), ('will', 6), ('take', 6), ('old', 6), ('like', 6), ('water', 6), ('voyage', 6), ('it', 6), ('get', 6), ('here', 5), ('now', 5), ('when', 5), ('no', 5), ('most', 5), ('sailor', 5), ('whaling', 5), ('other', 5), ('ever', 5), ('the', 5), ('same', 5), ('such', 5), ('whenever', 5), ('at', 5), ('did', 5), ('see', 5), ('are', 5), ('who', 5), ('these', 4), ('passenger', 4), ('first', 4), ('though', 4), ('their', 4), ('every', 4), ('way', 4), ('so', 4), ('money', 4), ('high', 4), ('being', 4), ('little', 4), ('than', 4), ('up', 4), ('much', 4), ('never', 4), ('two', 4), ('because', 4), ('then', 4), ('an', 4), ('must', 4), ('tell', 4), ('them', 4), ('which', 4), ('things', 4), ('am', 4), ('those', 4), ('men', 4), ('would', 3), ('right', 3), ('him', 3), ('stand', 3), ('paying', 3), ('out', 3), ('still', 3), ('whale', 3), ('soul', 3), ('come', 3), ('we', 3), ('before', 3), ('passengers', 3), ('going', 3), ('set', 3), ('yet', 3), ('head', 3), ('sort', 3), ('parts', 3), ('myself', 3), ('grand', 3), ('say', 3), ('may', 3), ('without', 3), ('forecastle', 3), ('man', 3), ('purse', 3), ('winds', 3), ('great', 3), ('hand', 3), ('almost', 3), ('image', 3), ('world', 3), ('nothing', 3), ('ships', 3), ('something', 3), ('should', 3), ('how', 3), ('broiled', 3), ('healthy', 2), ('long', 2), ('lead', 2), ('meadow', 2), ('account', 2), ('thousand', 2), ('city', 2), ('commodore', 2), ('once', 2), ('ocean', 2), ('glory', 2), ('mean', 2), ('between', 2), ('fixed', 2), ('fates', 2), 
('thousands', 2), ('besides', 2), ('miles', 2), ('where', 2), ('point', 2), ('stream', 2), ('order', 2), ('round', 2), ('care', 2), ('exactly', 2), ('under', 2), ('begin', 2), ('strong', 2), ('scores', 2), ('metaphysical', 2), ('ship', 2), ('look', 2), ('been', 2), ('leaves', 2), ('cannot', 2), ('better', 2), ('streets', 2), ('schoolmaster', 2), ('meaning', 2), ('cook', 2), ('aloft', 2), ('himself', 2), ('any', 2), ('unless', 2), ('let', 2), ('wild', 2), ('particular', 2), ('find', 2), ('requires', 2), ('yonder', 2), ('thinks', 2), ('pay', 2), ('chief', 2), ('robust', 2), ('among', 2), ('over', 2), ('just', 2), ('perhaps', 2), ('does', 2), ('plunged', 2), ('sight', 2), ('spar', 2), ('each', 2), ('ishmael', 2), ('could', 2), ('off', 2), ('thump', 2), ("shepherd's", 2), ('content', 2), ('magic', 2), ('country', 2), ('air', 2), ('well', 2), ('motives', 2), ('in', 2), ('distant', 2), ('officer', 2), ('make', 2), ('think', 2), ('phantom', 2), ('crowds', 2), ('previous', 2), ('leaders', 2), ('hunks', 2), ('themselves', 2), ('own', 2), ('its', 2), ('warehouses', 2), ('ourselves', 2), ('respectfully', 2), ('sleep', 2), ('seas', 2), ('sail', 2), ('always', 2), ('else', 2), ('needed', 1), ('finally', 1), ('clinched', 1), ('enough', 1), ('trees', 1), ('corlears', 1), ('hillside', 1), ('poor', 1), ('performances', 1), ('waterward', 1), ('monied', 1), ('egyptians', 1), ('believe', 1), ('carries', 1), ('solo', 1), ('bulwarks', 1), ('scales', 1), ('prairies', 1), ('fowl', 1), ('activity', 1), ('downtown', 1), ("people's", 1), ('others', 1), ('valley', 1), ('lording', 1), ('waves', 1), ('circumambulate', 1), ('transition', 1), ('attract', 1), ('masthead', 1), ('lee', 1), ('thingno', 1), ('limit', 1), ('wharves', 1), ('penny', 1), ('physical', 1), ('wedded', 1), ('slip', 1), ('meet', 1), ('town', 1), ('interlude', 1), ('attending', 1), ('wholesome', 1), ('handfuls', 1), ('hermit', 1), ('social', 1), ('urbane', 1), ('ungraspable', 1), ('striving', 1), ('crucifix', 1), ('virtue', 1), 
('tormenting', 1), ('tragedies', 1), ('barques', 1), ('single', 1), ('commonalty', 1), ('true', 1), ('contrary', 1), ('themleagues', 1), ('idolatrous', 1), ('bloody', 1), ('kneedeep', 1), ('hats', 1), ('has', 1), ('circulation', 1), ('constant', 1), ('island', 1), ('abandon', 1), ('quietly', 1), ('wayhe', 1), ('inlanders', 1), ('rather', 1), ('ah', 1), ('van', 1), ('paint', 1), ('tranced', 1), ('grasshopper', 1), ('infallibly', 1), ('perils', 1), ('quarterdeck', 1), ('states', 1), ('grow', 1), ('alleys', 1), ('roused', 1), ('astern', 1), ('whatsoever', 1), ('easy', 1), ('bear', 1), ('receives', 1), ('open', 1), ('west', 1), ('holy', 1), ('cherish', 1), ('bit', 1), ('formed', 1), ('cunningly', 1), ('shadiest', 1), ('fields', 1), ('rensselaers', 1), ('after', 1), ('honour', 1), ('uncomfortable', 1), ('stepping', 1), ('nailed', 1), ('growing', 1), ('surveillance', 1), ('friendly', 1), ('root', 1), ('putting', 1), ('seeposted', 1), ('throws', 1), ('boy', 1), ('possibly', 1), ('landscape', 1), ('damp', 1), ('stoics', 1), ('breathes', 1), ('buy', 1), ('drop', 1), ('comedies', 1), ('leaning', 1), ('springs', 1), ('habit', 1), ('yourself', 1), ('counters', 1), ('pure', 1), ('prevalent', 1), ('straight', 1), ('remote', 1), ('cooled', 1), ('regulating', 1), ('shabby', 1), ('saw', 1), ('funeral', 1), ('entailed', 1), ('mysterious', 1), ('satisfaction', 1), ('rigging', 1), ('feelings', 1), ('drizzly', 1), ('itwould', 1), ('inmates', 1), ('reveries', 1), ('suspect', 1), ('endless', 1), ('mystical', 1), ('enchanting', 1), ('agoing', 1), ('general', 1), ('seneca', 1), ('bulk', 1), ('hook', 1), ('feel', 1), ('coenties', 1), ('choice', 1), ('oceans', 1), ('tallest', 1), ('plastertied', 1), ('indignity', 1), ('battery', 1), ('violate', 1), ('trouble', 1), ('travel', 1), ('presidency', 1), ('surrounds', 1), ('conscious', 1), ('athirst', 1), ('atmosphere', 1), ('conceits', 1), ('coffin', 1), ('seasickgrow', 1), ('within', 1), ('pinetree', 1), ('huge', 1), ('sailors', 1), ('somehow', 
1), ('terms', 1), ('weighed', 1), ('broom', 1), ('quietest', 1), ('supplied', 1), ('her', 1), ('cheerfully', 1), ('degree', 1), ('falling', 1), ('standmiles', 1), ('knows', 1), ('confess', 1), ('needles', 1), ('answer', 1), ("quarrelsomedon't", 1), ('captain', 1), ('providence', 1), ('judgmatically', 1), ('influences', 1), ('whitehall', 1), ('infliction', 1), ('considerable', 1), ('place', 1), ('thieves', 1), ('new', 1), ("one's", 1), ('various', 1), ('desires', 1), ('delusion', 1), ('act', 1), ('left', 1), ('royal', 1), ('wayeither', 1), ('suddenly', 1), ('offices', 1), ('induced', 1), ('presented', 1), ('stage', 1), ('pistol', 1), ('isles', 1), ('extreme', 1), ('hollow', 1), ('horror', 1), ('vain', 1), ('ten', 1), ('good', 1), ('schooners', 1), ('avenuesnorth', 1), ('view', 1), ('maxim', 1), ('employs', 1), ('established', 1), ('mind', 1), ('river', 1), ('moral', 1), ('east', 1), ('suffice', 1), ('kind', 1), ('rockaway', 1), ('jump', 1), ('unbiased', 1), ('landsmen', 1), ('testament', 1), ('overwhelming', 1), ('peep', 1), ('dotings', 1), ('deliberately', 1), ('processions', 1), ('around', 1), ('mild', 1), ('silver', 1), ('upper', 1), ('plumb', 1), ('brief', 1), ('farcesthough', 1), ('watergazers', 1), ('everlasting', 1), ('disguises', 1), ('perceive', 1), ('experiment', 1), ('fancied', 1), ('awe', 1), ('pedestrian', 1), ('salted', 1), ('undeliverable', 1), ('merchant', 1), ('universal', 1), ('lakes', 1), ('obey', 1), ('interest', 1), ('battle', 1), ('curiosity', 1), ('really', 1), ('pyramids', 1), ('second', 1), ('tigerlilieswhat', 1), ('deepest', 1), ('fountain', 1), ('heaven', 1), ('hazy', 1), ('affghanistan', 1), ('shipboardyet', 1), ('family', 1), ('particularly', 1), ('dogs', 1), ('far', 1), ('ills', 1), ('week', 1), ('gabriel', 1), ('picture', 1), ('brother', 1), ('strange', 1), ('sleeps', 1), ('few', 1), ('magnetic', 1), ('freewill', 1), ('prevent', 1), ('paid', 1), ('gone', 1), ('honourable', 1), ('floated', 1), ('hypos', 1), ('northward', 1), 
('overlapping', 1), ('amount', 1), ('idea', 1), ('again', 1), ('trials', 1), ('coat', 1), ('hands', 1), ('goes', 1), ('desks', 1), ('call', 1), ('surprising', 1), ('mesince', 1), ('election', 1), ('smelt', 1), ('abominate', 1), ('methodically', 1), ('unpleasant', 1), ('assure', 1), ('less', 1), ('magnificent', 1), ('lies', 1), ('poet', 1), ('indian', 1), ('swayed', 1), ('romantic', 1), ('snow', 1), ('bathed', 1), ('deck', 1), ('coral', 1), ('cato', 1), ('invisible', 1), ('key', 1), ('circumstances', 1), ('manhattoes', 1), ('touches', 1), ('orchard', 1), ('dreamiest', 1), ('tormented', 1), ('mortal', 1), ('ball', 1), ('involuntarily', 1), ('earthly', 1), ('randolphs', 1), ('knocking', 1), ('noble', 1), ('grin', 1), ('beach', 1), ('give', 1), ('cajoling', 1), ('hardicanutes', 1), ('seated', 1), ('instance', 1), ('reveriesstand', 1), ('ago', 1), ('distinction', 1), ('salt', 1), ('wantingwaterthere', 1), ('insular', 1), ('needs', 1), ('pool', 1), ('happen', 1), ('mouth', 1), ('broiling', 1), ('lanes', 1), ('sword', 1), ('sighs', 1), ('pausing', 1), ('wade', 1), ('extensive', 1), ('vibration', 1), ('mole', 1), ('hooded', 1), ('against', 1), ('shoulderblades', 1), ('soon', 1), ('separate', 1), ('benches', 1), ('legs', 1), ('simple', 1), ('seacaptain', 1), ('toils', 1), ('shore', 1), ('try', 1), ('abouthowever', 1), ('seemingly', 1), ('patagonian', 1), ('life', 1), ('contested', 1), ('shady', 1), ('bakehouses', 1), ('path', 1), ('extremest', 1), ('niagara', 1), ('hold', 1), ('saco', 1), ('washed', 1), ('nor', 1), ('came', 1), ('caravan', 1), ('helped', 1), ('doubtless', 1), ('american', 1), ('inducements', 1), ('sentinels', 1), ('smoke', 1), ('however', 1), ('china', 1), ('south', 1), ('belted', 1), ('short', 1), ('reaching', 1), ('invest', 1), ('years', 1), ('desert', 1), ('wonderworld', 1), ('coasts', 1), ('lodges', 1), ('bringing', 1), ('june', 1), ('paidwhat', 1), ('persians', 1), ('everybody', 1), ('spleen', 1), ('ignoring', 1), ('rear', 1), ('portentous', 1), 
('many', 1), ('swung', 1), ('dive', 1), ("ain't", 1), ('seacaptains', 1), ('wherefore', 1), ('whether', 1), ('served', 1), ('crazy', 1), ('nameless', 1), ('rub', 1), ('thing', 1), ('rolled', 1), ('roasted', 1), ('thought', 1), ('watery', 1), ('resulting', 1), ('forbidden', 1), ('bill', 1), ('heard', 1), ('fowlsthough', 1), ('respectable', 1), ('tarpot', 1), ('dale', 1), ('passed', 1), ('thither', 1), ('pent', 1), ('deliberate', 1), ('grim', 1), ('floodgates', 1), ('taking', 1), ('cookthough', 1), ('wish', 1), ('enter', 1), ('professor', 1), ('itch', 1), ('drowned', 1), ('sway', 1), ('archangel', 1), ('us', 1), ('genteel', 1), ('decks', 1), ('creatures', 1), ('boys', 1), ('nightsdo', 1), ('sabbath', 1), ('jolly', 1), ('towards', 1), ('monster', 1), ('exercise', 1), ('judgment', 1), ('earnestly', 1), ('considering', 1), ('thence', 1), ('preciselyhaving', 1), ('managers', 1), ('spiles', 1), ('united', 1), ('mid', 1), ('surely', 1), ('silent', 1), ('mummies', 1), ('even', 1), ('story', 1), ('repeatedly', 1), ('slave', 1), ("other's", 1), ('sleepy', 1), ('police', 1), ('secretly', 1), ('speak', 1), ('rag', 1), ('green', 1), ('consign', 1), ('judiciously', 1), ('bound', 1), ('deity', 1), ('deeper', 1), ('knowing', 1), ('knew', 1), ('yes', 1), ('spurs', 1), ('cottage', 1), ('looking', 1), ('quite', 1), ('offthen', 1), ('cataract', 1), ('run', 1), ('put', 1), ('pierheads', 1), ('mazy', 1), ('buttered', 1), ('hill', 1), ('punch', 1), ('purpose', 1), ('driving', 1), ('trunk', 1), ('nearly', 1), ('visit', 1), ('eyes', 1), ('keen', 1), ('dreamy', 1), ('difference', 1), ('enable', 1), ('wears', 1), ('quick', 1), ('performing', 1), ('barbarous', 1), ('reverentially', 1), ('principle', 1), ('deep', 1), ('very', 1), ('inferred', 1), ('unite', 1), ('receiving', 1), ('absentminded', 1), ('agonever', 1), ('especially', 1), ('compare', 1), ('greeks', 1), ('feet', 1), ('enjoy', 1), ('region', 1), ('whereas', 1), ('tennessee', 1), ('days', 1), ('reefscommerce', 1), ('pythagorean', 1), 
('shakes', 1), ('horse', 1), ('peppered', 1), ('told', 1), ('sense', 1), ('woodlands', 1), ('marvellous', 1), ('rivers', 1), ('having', 1), ('gets', 1), ('seaward', 1), ('programme', 1), ('mast', 1), ('grasp', 1), ('perdition', 1), ('inmost', 1), ('eye', 1), ('surf', 1), ('ibis', 1), ('making', 1), ('afternoon', 1), ('philosophical', 1), ('sand', 1), ('discriminating', 1), ('cattle', 1), ('welcome', 1), ('marvels', 1), ('november', 1), ('love', 1), ('artist', 1), ('sounds', 1), ('unaccountable', 1), ('brigs', 1), ('breezes', 1), ('sweep', 1), ('hours', 1), ('street', 1), ('sights', 1), ('charm', 1), ('loitering', 1), ('element', 1), ('recall', 1), ('tribulations', 1), ('drawn', 1), ('of', 1), ('narcissus', 1), ('decoction', 1), ('promptly', 1), ('pacing', 1), ('sadly', 1), ('thus', 1), ('please', 1), ('flourish', 1), ('blue', 1), ('lungs', 1), ('compasses', 1), ('nigh', 1), ('orders', 1), ('meditation', 1), ('anything', 1), ('reason', 1), ('mountains', 1), ('lath', 1), ('trip', 1), ('substitute', 1), ('jove', 1)]

In [13]:
assert swc[0]==('i',43)
assert len(swc)==848
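
The two assertions above only pin down the shape of `swc`: a list of `(word, count)` tuples with the most frequent word first. As a minimal sketch of how such a list could be built, assuming a hypothetical helper name `sorted_counts` and a toy token list (neither comes from the exercise):

```python
from collections import Counter

def sorted_counts(tokens):
    """Return (word, count) tuples ordered from most to least frequent.

    Hypothetical helper for illustration; not the exercise's own function.
    """
    counts = Counter(tokens)
    # Sort on the count (second tuple element), largest first
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

# Toy tokens, purely for illustration
toy_tokens = ["sea", "ship", "sea", "whale", "sea", "ship"]
toy_swc = sorted_counts(toy_tokens)
print(toy_swc)
```

With a list in this form, `toy_swc[0]` is the most frequent pair and `len(toy_swc)` is the number of distinct words, which is what the two checks on `swc` rely on.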

Create a "Cleveland Style" dotplot of the counts of the top 50 words using Matplotlib. If you don't know what a dotplot is, you will have to do some research...
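
For reference, a Cleveland dot plot puts one category on each horizontal row and marks its value with a single dot, so values are compared by position along a shared axis rather than by bar length. A minimal sketch with made-up counts, assuming hypothetical helper names `dotplot_rows` and `draw_dotplot` (not part of the exercise):

```python
def dotplot_rows(pairs, n=50):
    """Return (labels, values, positions) for the first n (word, count) pairs.

    Positions count down to 0 so the first pair is drawn on the top row.
    """
    top = pairs[:n]
    labels = [word for word, _ in top]
    values = [count for _, count in top]
    positions = list(range(len(top) - 1, -1, -1))
    return labels, values, positions

def draw_dotplot(pairs, n=50):
    """Draw one dot per word on a dotted horizontal guide line."""
    import matplotlib.pyplot as plt  # imported here so dotplot_rows works without it
    labels, values, positions = dotplot_rows(pairs, n)
    fig, ax = plt.subplots(figsize=(6, max(2, 0.25 * len(labels))))
    # Dotted guide from the axis to each dot, then the dot itself
    ax.hlines(positions, 0, values, linestyles="dotted", colors="gray")
    ax.plot(values, positions, "o")
    ax.set_yticks(positions)
    ax.set_yticklabels(labels)
    ax.set_xlabel("Frequency")
    ax.set_ylabel("Word")
    return ax

# Made-up counts, purely for illustration
toy_pairs = [("sea", 13), ("ship", 9), ("whale", 4)]
labels, values, positions = dotplot_rows(toy_pairs)
print(labels, values, positions)
```

Calling `draw_dotplot(swc)` in a notebook cell would then draw the first 50 entries, on the assumption that `swc` is already sorted by count.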


In [14]:
# Bar chart of the 50 most frequent words (swc is sorted by count, descending)
words, freq = zip(*swc[:50])
xpos = np.arange(len(words))
plt.bar(xpos, freq)

# Label each bar with its word, rotated so the labels do not overlap
plt.xticks(xpos, words, rotation=90)

plt.title("Word Frequency")
plt.xlabel("Word")
plt.ylabel("Frequency")


Out[14]:
<matplotlib.text.Text at 0x7f0eb3579e80>

In [28]:
# Cleveland-style dot plot: one dot per word, frequency on the x-axis
words, freq = zip(*swc[:50])
# One row per word, reversed so the most frequent word sits at the top
ypos = np.arange(len(words))[::-1]

plt.figure(figsize=(6, 10))
plt.scatter(freq, ypos)
plt.yticks(ypos, words)
plt.grid(axis="y", linestyle="dotted")

plt.title("Word Frequency")
plt.xlabel("Frequency")
plt.ylabel("Word")

In [22]:
assert True # use this for grading the dotplot
