Algorithms Exercise 1

Imports


In [199]:
%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np

Word counting

Write a function tokenize that takes a string of English text returns a list of words. It should also remove stop words, which are common short words that are often removed before natural language processing. Your function should have the following logic:

  • Split the string into lines using splitlines.
  • Split each line into a list of words and merge the lists for each line.
  • Use Python's builtin filter function to remove all punctuation.
  • If stop_words is a list, remove all occurences of the words in the list.
  • If stop_words is a space delimeted string of words, split them and remove them.
  • Remove any remaining empty words.
  • Make all words lowercase.

In [311]:
things = "hello!"
def ispuct(char, punctuation='`~!@#$%^&*()_-+={[}]|\:;"<,>.?/}\t'):
    return (not (char in punctuation))

#x = list(filter(ispuct, things))
#a = ''
#a.join(x)
#print(new_things)

In [326]:
def tokenize(s, stop_words = '', punctuation='`~!@#$%^&*()_+={[}]|\:;"<,>.?/}\t'):
    m = []
    s = s.replace("-", " ")
    stop = stop_words
    def is_stop(word, stop_words = stop):
        return not (word in stop_words)
    def is_space(word, space = ['']):
        return not (word in space)
    for line in s.splitlines():
        raw = line.lower().split(' ' or '.')  
        y = list()
        for w in raw:
                x = list(filter(ispuct, w))
                y.append(a.join(x))
        words = list(filter(is_space, y))
        words = list(filter(is_stop, words))
        m += words
    return m
    
tokenize("This, is the way; that things will hi--end", stop_words = 'is the')
#ispuct('!')


Out[326]:
['this', 'way', 'that', 'things', 'will', 'hi', 'end']

In [279]:
wasteland = """
APRIL is the cruellest month, breeding
Lilacs out of the dead land, mixing
Memory and desire, stirring
Dull roots with spring rain.
"""
tokenize(wasteland, stop_words='is the of and')


Out[279]:
['april',
 'cruellest',
 'month,',
 'breeding',
 'lilacs',
 'out',
 'dead',
 'land,',
 'mixing',
 'memory',
 'desire,',
 'stirring',
 'dull',
 'roots',
 'with',
 'spring',
 'rain.']

In [280]:
assert tokenize("This, is the way; that things will end", stop_words=['the', 'is']) == \
    ['this', 'way', 'that', 'things', 'will', 'end']
wasteland = """
APRIL is the cruellest month, breeding
Lilacs out of the dead land, mixing
Memory and desire, stirring
Dull roots with spring rain.
"""
#tokenize(wasteland, stop_words='is the of and')
assert tokenize(wasteland, stop_words='is the of and') == \
    ['april','cruellest','month','breeding','lilacs','out','dead','land',
     'mixing','memory','desire','stirring','dull','roots','with','spring',
     'rain']


---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-280-02800c8173c3> in <module>()
----> 1 assert tokenize("This, is the way; that things will end", stop_words=['the', 'is']) ==     ['this', 'way', 'that', 'things', 'will', 'end']
      2 wasteland = """
      3 APRIL is the cruellest month, breeding
      4 Lilacs out of the dead land, mixing
      5 Memory and desire, stirring

AssertionError: 

Write a function count_words that takes a list of words and returns a dictionary where the keys in the dictionary are the unique words in the list and the values are the word counts.


In [281]:
def count_words(data):
    """Return a word count dictionary from the list of words in data."""
    dictionary = {}
    for n in data:
        dictionary[n]= data.count(n)
    return dictionary

In [282]:
assert count_words(tokenize('this and the this from and a a a')) == \
    {'a': 3, 'and': 2, 'from': 1, 'the': 1, 'this': 2}

Write a function sort_word_counts that return a list of sorted word counts:

  • Each element of the list should be a (word, count) tuple.
  • The list should be sorted by the word counts, with the higest counts coming first.
  • To perform this sort, look at using the sorted function with a custom key and reverse argument.

In [283]:
def sort_word_counts(wc):
    """Return a list of 2-tuples of (word, count), sorted by count descending."""
    l = [(i,wc[i]) for i in wc]
    return sorted(l, key = lambda x:x[1], reverse = True)

In [284]:
print(sort_word_counts(count_words(tokenize('this and a the this this and a a a'))))


[('a', 4), ('this', 3), ('and', 2), ('the', 1)]

In [285]:
assert sort_word_counts(count_words(tokenize('this and a the this this and a a a'))) == \
    [('a', 4), ('this', 3), ('and', 2), ('the', 1)]

Perform a word count analysis on Chapter 1 of Moby Dick, whose text can be found in the file mobydick_chapter1.txt:

  • Read the file into a string.
  • Tokenize with stop words of 'the of and a to in is it that as'.
  • Perform a word count, the sort and save the result in a variable named swc.

In [338]:
txt = open('mobydick_chapter1.txt', 'r')
x = txt.read()
swc = sort_word_counts(count_words(tokenize(s = x, stop_words = ['the', 'of', 'and', 'to', 'in', 'is', 'it', 'that', 'as', 'a'])))
string = ''
x = (tokenize(s = x, stop_words = ['the', 'of', 'and', 'to', 'in', 'is', 'it', 'that', 'as', 'a']))
for things in x:
    string = string + things + " "
print(len(swc))
#print(swc)
print(string)
punchfactor = 4


844
call me ishmael some years ago never mind how long precisely having little or no money my purse nothing particular interest me on shore i thought i would sail about little see watery part world way i have driving off spleen regulating circulation whenever i find myself growing grim about mouth whenever damp drizzly november my soul whenever i find myself involuntarily pausing before coffin warehouses bringing up rear every funeral i meet especially whenever my hypos get such an upper hand me requires strong moral principle prevent me from deliberately stepping into street methodically knocking people's hats off then i account high time get sea soon i can this my substitute for pistol ball with philosophical flourish cato throws himself upon his sword i quietly take ship there nothing surprising this if they but knew almost all men their degree some time or other cherish very nearly same feelings towards ocean with me there now your insular city manhattoes belted round by wharves indian isles by coral reefs commerce surrounds with her surf right left streets take you waterward its extreme downtown battery where noble mole washed by waves cooled by breezes which few hours previous were out sight land look at crowds water gazers there circumambulate city dreamy sabbath afternoon go from corlears hook coenties slip from thence by whitehall northward what do you see posted like silent sentinels all around town stand thousands upon thousands mortal men fixed ocean reveries some leaning against spiles some seated upon pier heads some looking over bulwarks ships from china some high aloft rigging if striving get still better seaward peep but these are all landsmen week days pent up lath plaster tied counters nailed benches clinched desks how then this are green fields gone what do they here but look here come more crowds pacing straight for water seemingly bound for dive strange nothing will content them but extremest limit land loitering under shady lee yonder warehouses will not suffice no they must get just nigh water they possibly can without falling there they stand miles them leagues inlanders all they come from lanes alleys streets avenues north east south west yet here they all unite tell me does magnetic virtue needles compasses all those ships attract them thither once more say you are country some high land lakes take almost any path you please ten one carries you down dale leaves you there by pool stream there magic let most absent minded men be plunged his deepest reveries stand man on his legs set his feet going he will infallibly lead you water if water there be all region should you ever be athirst great american desert try this experiment if your caravan happen be supplied with metaphysical professor yes every one knows meditation water are wedded for ever but here an artist he desires paint you dreamiest shadiest quietest most enchanting bit romantic landscape all valley saco what chief element he employs there stand his trees each with hollow trunk if hermit crucifix were within here sleeps his meadow there sleep his cattle up from yonder cottage goes sleepy smoke deep into distant woodlands winds mazy way reaching overlapping spurs mountains bathed their hill side blue but though picture lies thus tranced though this pine tree shakes down its sighs like leaves upon this shepherd's head yet all were vain unless shepherd's eye were fixed upon magic stream before him go visit prairies june when for scores on scores miles you wade knee deep among tiger lilies what one charm wanting water there not drop water there were niagara but cataract sand would you travel your thousand miles see why did poor poet tennessee upon suddenly receiving two handfuls silver deliberate whether buy him coat which he sadly needed or invest his money pedestrian trip rockaway beach why almost every robust healthy boy with robust healthy soul him at some time or other crazy go sea why upon your first voyage passenger did you yourself feel such mystical vibration when first told you your ship were now out sight land why did old persians hold sea holy why did greeks give separate deity own brother jove surely all this not without meaning still deeper meaning story narcissus who because he could not grasp tormenting mild image he saw fountain plunged into was drowned but same image we ourselves see all rivers oceans image ungraspable phantom life this key all now when i say i am habit going sea whenever i begin grow hazy about eyes begin be over conscious my lungs i do not mean have inferred i ever go sea passenger for go passenger you must needs have purse purse but rag unless you have something besides passengers get sea sick grow quarrelsome don't sleep nights do not enjoy themselves much general thing no i never go passenger nor though i am something salt do i ever go sea commodore or captain or cook i abandon glory distinction such offices those who like them for my part i abominate all honourable respectable toils trials tribulations every kind whatsoever quite much i can do take care myself without taking care ships barques brigs schooners what not for going cook though i confess there considerable glory cook being sort officer on ship board yet somehow i never fancied broiling fowls though once broiled judiciously buttered judgmatically salted peppered there no one who will speak more respectfully not say reverentially broiled fowl than i will out idolatrous dotings old egyptians upon broiled ibis roasted river horse you see mummies those creatures their huge bake houses pyramids no when i go sea i go simple sailor right before mast plumb down into forecastle aloft there royal mast head true they rather order me about some make me jump from spar spar like grasshopper may meadow at first this sort thing unpleasant enough touches one's sense honour particularly if you come an old established family land van rensselaers or randolphs or hardicanutes more than all if just previous putting your hand into tar pot you have been lording country schoolmaster making tallest boys stand awe you transition keen one i assure you from schoolmaster sailor requires strong decoction seneca stoics enable you grin bear but even this wears off time what if some old hunks sea captain orders me get broom sweep down decks what does indignity amount weighed i mean scales new testament do you think archangel gabriel thinks anything less me because i promptly respectfully obey old hunks particular instance who ain't slave tell me well then however old sea captains may order me about however they may thump punch me about i have satisfaction knowing all right everybody else one way or other served much same way either physical or metaphysical point view so universal thump passed round all hands should rub each other's shoulder blades be content again i always go sea sailor because they make point paying me for my trouble whereas they never pay passengers single penny i ever heard on contrary passengers themselves must pay there all difference world between paying being paid act paying perhaps most uncomfortable infliction two orchard thieves entailed upon us but being paid what will compare with urbane activity with which man receives money really marvellous considering we so earnestly believe money be root all earthly ills on no account can monied man enter heaven ah how cheerfully we consign ourselves perdition finally i always go sea sailor because wholesome exercise pure air fore castle deck for this world head winds are far more prevalent than winds from astern if you never violate pythagorean maxim so for most part commodore on quarter deck gets his atmosphere at second hand from sailors on forecastle he thinks he breathes first but not so much same way do commonalty lead their leaders many other things at same time leaders little suspect but wherefore was after having repeatedly smelt sea merchant sailor i should now take into my head go on whaling voyage this invisible police officer fates who has constant surveillance me secretly dogs me influences me some unaccountable way he can better answer than any one else doubtless my going on this whaling voyage formed part grand programme providence was drawn up long time ago came sort brief interlude solo between more extensive performances i take this part bill must have run something like this grand contested election for presidency united states whaling voyage by one ishmael bloody battle affghanistan though i cannot tell why was exactly those stage managers fates put me down for this shabby part whaling voyage when others were set down for magnificent parts high tragedies short easy parts genteel comedies jolly parts farces though i cannot tell why this was exactly yet now i recall all circumstances i think i can see little into springs motives which being cunningly presented me under various disguises induced me set about performing part i did besides cajoling me into delusion was choice resulting from my own unbiased freewill discriminating judgment chief among these motives was overwhelming idea great whale himself such portentous mysterious monster roused all my curiosity then wild distant seas where he rolled his island bulk undeliverable nameless perils whale these with all attending marvels thousand patagonian sights sounds helped sway me my wish with other men perhaps such things would not have been inducements but for me i am tormented with an everlasting itch for things remote i love sail forbidden seas land on barbarous coasts not ignoring what good i am quick perceive horror could still be social with would they let me since but well be on friendly terms with all inmates place one lodges by reason these things then whaling voyage was welcome great flood gates wonder world swung open wild conceits swayed me my purpose two two there floated into my inmost soul endless processions whale mid most them all one grand hooded phantom like snow hill air 

In [ ]:
assert swc[0]==('i',43)
assert len(swc)==848 - punchfactor #4 is the punchfactor, ranked out of 4

Create a "Cleveland Style" dotplot of the counts of the top 50 words using Matplotlib. If you don't know what a dotplot is, you will have to do some research...


In [344]:
x = np.array(swc)
plt.plot(x[0:50,1], range(50),'o')


Out[344]:
[<matplotlib.lines.Line2D at 0x7f0c8a957d68>]

In [ ]:
assert True # use this for grading the dotplot