A short primer on natural language processing (NLP)

This notebook is part of the material for the talk "Beyond PageRank" on Studienstiftung Winterakademie 2017.

Dependencies

To run the notebook, install the following dependencies:

  • NLTK
  • scikit-learn
  • Numpy Toolchain (numpy, scipy, pylab, pandas, ...)

Explained Main Concepts

The toolchain presented here includes:

  • Tokenization
  • Stemming/Lemmatization
  • tf-idf feature extraction
  • embedding queries and documents in feature space

Some imports


In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from pylab import *
%matplotlib inline

import seaborn as sns

sns.set_style("white")

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
import numpy as np

from nltk.stem.porter import *

import pandas as pd

In [2]:
vect = TfidfVectorizer(min_df=1)
tfidf = vect.fit_transform([ "I'd like an apple",
                             "An apple a day keeps the doctor away",
                             "Never compare an apple to an orange",
                             "I prefer scikit-learn to Orange"])
sns.heatmap((tfidf * tfidf.T).A)
show()


Load example data from the NLTK gutenberg dataset. All credits to the Gutenberg project for providing the content.


In [3]:
def get_corpus():
    corpus = {}
    for key in nltk.corpus.gutenberg.fileids():
        text = nltk.corpus.gutenberg.raw(key)
        key_ = key.replace(".txt", "")
        corpus[key_] = text

        print("Loaded {} containing {} characters".format(key_, len(text)))
    return corpus

corpus = get_corpus()


Loaded austen-emma containing 887071 characters
Loaded austen-persuasion containing 466292 characters
Loaded austen-sense containing 673022 characters
Loaded bible-kjv containing 4332554 characters
Loaded blake-poems containing 38153 characters
Loaded bryant-stories containing 249439 characters
Loaded burgess-busterbrown containing 84663 characters
Loaded carroll-alice containing 144395 characters
Loaded chesterton-ball containing 457450 characters
Loaded chesterton-brown containing 406629 characters
Loaded chesterton-thursday containing 320525 characters
Loaded edgeworth-parents containing 935158 characters
Loaded melville-moby_dick containing 1242990 characters
Loaded milton-paradise containing 468220 characters
Loaded shakespeare-caesar containing 112310 characters
Loaded shakespeare-hamlet containing 162881 characters
Loaded shakespeare-macbeth containing 100351 characters
Loaded whitman-leaves containing 711215 characters

Preprocessing: Stemming, Tokenization, Bag-of-Words

Use the WordNet stemmer. After preprocessing, a Bag of Words is generated excluding stop words for all documents in the corpus, returned as a matrix.


In [4]:
stemmer = nltk.WordNetLemmatizer()
tokens = {k : word_tokenize(corpus[k]) for k in corpus.keys()}

stemmed_stopwords = [stemmer.lemmatize(t.lower()) for t in stopwords.words('english')]
stemmed_tokens = {k : [stemmer.lemmatize(t.lower()) for t in tokens[k]] for k in corpus.keys()}

In [50]:
index = set()
for k in corpus.keys():
    for token in stemmed_tokens[k]:
        index.add(token)

counts = pd.DataFrame(index=index, columns=corpus.keys(), data=0)

In [51]:
for key in corpus.keys():
    print(key)
    for t in stemmed_tokens[key]:
        if t in stemmed_stopwords:
            continue
        counts[key][t] += 1
counts


shakespeare-hamlet
austen-persuasion
chesterton-thursday
austen-emma
shakespeare-caesar
whitman-leaves
austen-sense
melville-moby_dick
chesterton-brown
bible-kjv
milton-paradise
carroll-alice
burgess-busterbrown
edgeworth-parents
chesterton-ball
blake-poems
shakespeare-macbeth
bryant-stories
Out[51]:
shakespeare-hamlet austen-persuasion chesterton-thursday austen-emma shakespeare-caesar whitman-leaves austen-sense melville-moby_dick chesterton-brown bible-kjv milton-paradise carroll-alice burgess-busterbrown edgeworth-parents chesterton-ball blake-poems shakespeare-macbeth bryant-stories
jaffa 0 0 0 0 0 1 0 2 0 0 0 0 0 0 0 0 0 0
krusenstern 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0
germanic 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0
latched 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
pugilism 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
humpback 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
lemuel 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0
fortress 0 0 0 0 0 1 0 4 5 17 0 0 0 0 1 0 0 0
curvicues 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
sanguinary 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
cannon 5 0 0 0 0 20 0 4 0 0 0 0 0 0 0 0 1 0
barley-corn 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
allurings 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
excited 0 7 2 6 0 1 6 8 0 0 0 0 6 6 3 0 0 2
armhole 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0
petrifick 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
cough 0 0 0 2 0 0 1 3 0 0 0 0 0 0 2 0 0 0
gleeful 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
incites 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
overhear 0 0 0 1 0 0 0 0 0 0 0 0 2 0 0 0 0 0
brain-trucks 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
jesui 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
90:3 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
stickes 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
undertakes 0 2 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
confronted 0 0 1 0 0 1 0 1 2 0 0 0 0 0 1 0 1 0
frisking 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
19:9 0 0 0 0 0 0 0 0 0 24 0 0 0 0 0 0 0 0
shirley's 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
119:45 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
self-rolled 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
card-racks 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
harrington 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
bafflers 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
undeservedly 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
i'd 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0
69:2 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
virtue 0 2 7 5 0 3 3 22 2 7 42 0 0 5 12 1 0 1
29:17 0 0 0 0 0 0 0 0 0 11 0 0 0 0 0 0 0 0
cherethims 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
multiplying 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 1 0
spurted 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
ninth 0 0 0 2 1 0 0 0 0 34 0 0 0 0 0 0 0 0
hatchet-faced 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
seduction 0 0 0 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0
binnacle-watch 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
91:10 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
'god 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
39:14 0 0 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 0
fiddler 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0
cascade 0 0 0 0 0 1 0 2 0 0 0 0 0 0 0 0 0 0
trice 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 1
strawberry-leaves 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
`rather 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
elisheba 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
ungainly 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0
sojourner 0 0 0 0 0 0 0 0 0 11 2 0 0 0 0 0 0 0
_idea_ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
cat. 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
worried 0 0 1 0 0 0 2 1 1 0 0 1 0 0 0 0 0 0

48148 rows × 18 columns


In [52]:
# store as sparse dataframe
scounts = counts.to_sparse(fill_value=0)
print("Density", scounts.density)


Density 0.137591961821421

In [53]:
scounts.to_pickle("corpus_wordcounts.npy")

In [140]:
scounts = pd.read_pickle("corpus_wordcounts.npy")
cols = list(scounts.keys())
scounts.sort_values(by="shakespeare-hamlet", ascending=False)
keywords = (scounts/scounts.sum(axis=0)).prod(axis=1).sort_values(ascending=False)

print(", ".join(list(keywords.index[100:300])))


among, fear, brought, ground, turn, water, wish, arm, dead, sound, lost, none, lay, saying, bed, run, meet, doubt, fall, need, truth, ready, free, close, making, sit, wood, please, secret, fast, red, past, dare, age, met, lie, laugh, noise, tongue, rise, breath, besides, wise, fly, angry, write, shake, [, ], forgot, gently, lettered, care., caring, self-same, 7:57, tumble, flexion, 43:25, interweaving, 119:36, 43:3, no, stomach's, japonica, money-making, elihoenai, erst, vibration, feebleness, 115:14, adze, 2,800, national, ekronites, military, begun, thalia, bezer, reckons, hoof, adiew, magic, 64:12, dreamed, giddy, cavern, evan's, airless, ideality, gold-bound-brow, 3:15, every, unpack, pub-frequenting, syringa, imploring, himselves, merry-mad, ranck, verdigris, richest, ever-returning, snap-shotted, seventy-seven, meekly, mug, etta, afar, moonshine, mac, passport, 23:21, 27:59, confines, gritted, zaccur, ulloa, onset, youthfull, compliment, petrified, dotes, `do, unfixed, 23:48, arnholds, excavating, loyalty, exploded, civitas, clank, anim, 17:9, 1:71, demonstrate, work-basket, contrasting, winded, hampstead, brandy, _times_, twinge, 'just, woollen-draper, wedding-cake, wrestling, jeopardy, aiath, ijeabarim, construe, striker, evincing, significance, 'advance, 3:60, solicited, contracted, sourse, infringing, nursing, highly, spungie, undiscoverable, intimately, what-you-call-him, 81:7, kells, gripped, seyward, benign, putteth, 19:33, aid, cherith, stimulate, chemarims, 6:53, 68:2, 71:11, thrusteth, absurdity, succeeding, sand-hills, 46:25, hesitate, charles's, tattersall, meddleth, breez, rid, across, carmelite, sea-crashing, ogre, dragon, 56:5, moony, unfrequent, 115:4

Create a Bag-of-Words (BOW) and display the words that occur most often


In [55]:
#max_keys = []

#keys = []
#values = []
#for k,v in counts.items():
#    keys.append(k)
#    values.append(int(v))

#bag = pd.DataFrame({"words" : keys, "count" : values}).set_index("words")
#bag = bag.sort_values(by="count",ascending=False)[0:40]
#bag

In [5]:
import scipy

X = scipy.sparse.csr_matrix(scounts.values.T)
X.shape


Out[5]:
(18, 48148)

In [8]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()
tfidf.fit(X)
scores = tfidf.transform(X)

In [90]:
id_bad,  = np.where(scounts.index == "bad")[0]
id_nice, = np.where(scounts.index == "nice")[0]

print(id_bad, id_nice)


9591 13054

In [78]:
print(scores.shape)

for k in range(1):
    print(scores[k,:])


(18, 48148)
  (0, 48117)	0.00081466999824
  (0, 48115)	0.000856338515666
  (0, 48109)	0.00307501059797
  (0, 48102)	0.000856338515666
  (0, 48101)	0.000673774579697
  (0, 48100)	0.00134754915939
  (0, 48098)	0.00074954545914
  (0, 48081)	0.00224863637742
  (0, 48054)	0.00269509831879
  (0, 48044)	0.00171267703133
  (0, 48033)	0.00040733499912
  (0, 47985)	0.00184368859548
  (0, 47977)	0.000526380687402
  (0, 47970)	0.000856338515666
  (0, 47964)	0.00074954545914
  (0, 47955)	0.000856338515666
  (0, 47934)	0.000615002119595
  (0, 47924)	0.00277624530675
  (0, 47913)	0.000856338515666
  (0, 47894)	0.00808529495636
  (0, 47878)	0.000856338515666
  (0, 47852)	0.000460188466644
  (0, 47850)	0.000325645127099
  (0, 47834)	0.00527486608282
  (0, 47829)	0.00074954545914
  :	:
  (0, 253)	0.000673774579697
  (0, 251)	0.00583011514417
  (0, 242)	0.00040733499912
  (0, 219)	0.00299818183656
  (0, 212)	0.000615002119595
  (0, 207)	0.0184368859548
  (0, 202)	0.00040733499912
  (0, 201)	0.000526380687402
  (0, 192)	0.00074954545914
  (0, 191)	0.000976935381298
  (0, 156)	0.000384417587201
  (0, 145)	0.00269092311041
  (0, 136)	0.000673774579697
  (0, 128)	0.002569015547
  (0, 111)	0.00269509831879
  (0, 78)	0.00074954545914
  (0, 72)	0.00074954545914
  (0, 67)	0.00074954545914
  (0, 61)	0.00149909091828
  (0, 55)	0.00171267703133
  (0, 51)	0.000343816751432
  (0, 39)	0.000673774579697
  (0, 30)	0.000308646707758
  (0, 23)	0.00074954545914
  (0, 10)	0.00307501059797

In [144]:
import seaborn as sns
sns.set_style("white")

def plot_embedding(word1, word2):
    id_bad,  = np.where(scounts.index == word1)[0]
    id_nice, = np.where(scounts.index == word2)[0]
    y = np.zeros((18, 2))

    for k in range(18):

        vec = scores[k,:].toarray()[0]

        y[k,0] = vec[id_bad]
        y[k,1] = vec[id_nice]

    y = y / (y**2).sum(axis=-1, keepdims=True)**0.5

    figure(figsize=(5,5))
    scatter(y[:,0], y[:,1], c=range(18), cmap="Set1")
    for i, txt in enumerate(corpus.keys()):
        if "shake" in txt or True:
            annotate(txt, (y[i,0],y[i,1]))
    
    plot(np.linspace(0,1/2**0.5,10), np.linspace(0,1/2**0.5,10))
        
    xlim([0, 1])
    ylim([0, 1])
    xlabel(word1)
    ylabel(word2)

    
def query(query):
    pass

plot_embedding("love", "lie")    
show()

# macbeth —> macht, orakel, schicksaal, könig
# hamlet —>  macht, könig, rache, liebe


"""
among, fear, brought, ground, turn, water, wish, arm, dead, sound, lost, none, lay, saying, bed,
run, meet, doubt, fall, need, truth, ready, free, close, making, sit, wood, please, secret, fast,
red, past, dare, age, met, lie, laugh, noise, tongue, rise, breath, besides, wise, fly, angry,
write, shake, [, ], forgot, gently, lettered, care., caring, self-same, 7:57, tumble, flexion,
43:25, interweaving, 119:36, 43:3, no, stomach's, japonica, money-making, elihoenai, erst, vibration,
feebleness, 115:14, adze, 2,800, national, ekronites, military, begun, thalia, bezer, reckons, hoof,
adiew, magic, 64:12, dreamed, giddy, cavern, evan's, airless, ideality, gold-bound-brow, 3:15, every,
unpack, pub-frequenting, syringa, imploring, himselves, merry-mad, ranck, verdigris, richest,
ever-returning, snap-shotted, seventy-seven, meekly, mug, etta, afar, moonshine, mac, passport,
23:21, 27:59, confines, gritted, zaccur, ulloa, onset, youthfull, compliment, petrified, dotes,
`do, unfixed, 23:48, arnholds, excavating, loyalty, exploded, civitas, clank, anim, 17:9, 1:71,
demonstrate, work-basket, contrasting, winded, hampstead, brandy, _times_, twinge, 'just, woollen-draper,
wedding-cake, wrestling, jeopardy, aiath, ijeabarim, construe, striker, evincing, significance,
'advance, 3:60, solicited, contracted, sourse, infringing, nursing, highly, spungie, undiscoverable,
intimately, what-you-call-him, 81:7, kells, gripped, seyward, benign, putteth, 19:33, aid, cherith,
stimulate, chemarims, 6:53, 68:2, 71:11, thrusteth, absurdity, succeeding, sand-hills, 46:25, hesitate,
charles's, tattersall, meddleth, breez, rid, across, carmelite, sea-crashing, ogre, dragon, 56:5, moony,
unfrequent
"""


Out[144]:
"\namong, fear, brought, ground, turn, water, wish, arm, dead, sound, lost, none, lay, saying, bed,\nrun, meet, doubt, fall, need, truth, ready, free, close, making, sit, wood, please, secret, fast,\nred, past, dare, age, met, lie, laugh, noise, tongue, rise, breath, besides, wise, fly, angry,\nwrite, shake, [, ], forgot, gently, lettered, care., caring, self-same, 7:57, tumble, flexion,\n43:25, interweaving, 119:36, 43:3, no, stomach's, japonica, money-making, elihoenai, erst, vibration,\nfeebleness, 115:14, adze, 2,800, national, ekronites, military, begun, thalia, bezer, reckons, hoof,\nadiew, magic, 64:12, dreamed, giddy, cavern, evan's, airless, ideality, gold-bound-brow, 3:15, every,\nunpack, pub-frequenting, syringa, imploring, himselves, merry-mad, ranck, verdigris, richest,\never-returning, snap-shotted, seventy-seven, meekly, mug, etta, afar, moonshine, mac, passport,\n23:21, 27:59, confines, gritted, zaccur, ulloa, onset, youthfull, compliment, petrified, dotes,\n`do, unfixed, 23:48, arnholds, excavating, loyalty, exploded, civitas, clank, anim, 17:9, 1:71,\ndemonstrate, work-basket, contrasting, winded, hampstead, brandy, _times_, twinge, 'just, woollen-draper,\nwedding-cake, wrestling, jeopardy, aiath, ijeabarim, construe, striker, evincing, significance,\n'advance, 3:60, solicited, contracted, sourse, infringing, nursing, highly, spungie, undiscoverable,\nintimately, what-you-call-him, 81:7, kells, gripped, seyward, benign, putteth, 19:33, aid, cherith,\nstimulate, chemarims, 6:53, 68:2, 71:11, thrusteth, absurdity, succeeding, sand-hills, 46:25, hesitate,\ncharles's, tattersall, meddleth, breez, rid, across, carmelite, sea-crashing, ogre, dragon, 56:5, moony,\nunfrequent\n"

In [58]:
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

pca = PCA(2)
embedding = pca.fit_transform(X.toarray())

#tsne = TSNE()
#embedding = tsne.fit_transform(x_)

scatter(embedding[:,0], embedding[:,1])
for i, txt in enumerate(corpus.keys()):
    if "shake" in txt or True:
        annotate(txt, (embedding[i,0],embedding[i,1]))
show()



In [69]:
from nltk.collocations import *
trigram_measures = nltk.collocations.TrigramAssocMeasures()

for key in corpus.keys():
    if "shakespeare" in key:
        finder = TrigramCollocationFinder.from_words(
             nltk.corpus.gutenberg.words(key+".txt"))
        print(key)
        print(finder.nbest(trigram_measures.pmi, 30))


shakespeare-macbeth
[('Assassination', 'Could', 'trammell'), ('Lifes', 'fitfull', 'Feuer'), ('Mothers', 'womb', 'Vntimely'), ('Obliuious', 'Antidote', 'Cleanse'), ('Saint', 'Colmes', 'ynch'), ('THE', 'TRAGEDIE', 'OF'), ('TRAGEDIE', 'OF', 'MACBETH'), ('William', 'Shakespeare', '1603'), ('Witchcraft', 'celebrates', 'Pale'), ('choppie', 'finger', 'laying'), ('forge', 'Quarrels', 'vniust'), ('grim', 'Alarme', 'Excite'), ('lated', 'Traueller', 'apace'), ('minutely', 'Reuolts', 'vpbraid'), ('multitudinous', 'Seas', 'incarnardine'), ('sad', 'bosomes', 'empty'), ('womb', 'Vntimely', 'ript'), ('yesty', 'Waues', 'Confound'), ('Accounted', 'dangerous', 'folly'), ('After', 'Lifes', 'fitfull'), ('Auarice', 'stickes', 'deeper'), ('Interdiction', 'stands', 'accust'), ('Iourney', 'Soundly', 'inuite'), ('Neptunes', 'Ocean', 'wash'), ('Pale', 'Heccats', 'Offrings'), ('Ruines', 'wastfull', 'entrance'), ('Winters', 'fire', 'Authoriz'), ('celebrates', 'Pale', 'Heccats'), ('doubly', 'redoubled', 'stroakes'), ('eternall', 'Iewell', 'Giuen')]
shakespeare-hamlet
[('Coronet', 'weeds', 'Clambring'), ('Fruite', 'vnripe', 'stickes'), ('Harlots', 'Cheeke', 'beautied'), ('Herald', 'Mercurie', 'New'), ('Mercurie', 'New', 'lighted'), ('Midnight', 'Weeds', 'collected'), ('Nemian', 'Lions', 'nerue'), ('Phoebus', 'Cart', 'gon'), ('Russet', 'mantle', 'clad'), ('William', 'Shakespeare', '1599'), ('dilated', 'Articles', 'allow'), ('exception', 'Roughly', 'awake'), ('feares', 'forgetting', 'manners'), ('hideous', 'crash', 'Takes'), ('high', 'Easterne', 'Hill'), ('primall', 'eldest', 'curse'), ('recklesse', 'Libertine', 'Himselfe'), ('swaggering', 'vpspring', 'reeles'), ('veyled', 'lids', 'Seeke'), ('yon', 'high', 'Easterne'), ('Boy', 'thirty', 'yeares'), ('Goodman', 'Deluer', 'Clown'), ('Guard', 'carrying', 'Torches'), ('Ingenious', 'sence', 'Depriu'), ('Platforme', 'twixt', 'eleuen'), ('Respeaking', 'earthly', 'Thunder'), ('Rich', 'gifts', 'wax'), ('Scale', 'weighing', 'Delight'), ('awe', 'Payes', 'homage'), ('crash', 'Takes', 'Prisoner')]
shakespeare-caesar
[('Et', 'Tu', 'Brute'), ('OF', 'IVLIVS', 'CaeSAR'), ('THE', 'TRAGEDIE', 'OF'), ('TRAGEDIE', 'OF', 'IVLIVS'), ('William', 'Shakespeare', '1599'), ('deceitfull', 'Iades', 'Sinke'), ('lowly', 'courtesies', 'Might'), ('twentie', 'Torches', 'ioyn'), ('yon', 'grey', 'Lines'), ('Fierce', 'fiery', 'Warriours'), ('buy', 'mens', 'voyces'), ('craues', 'warie', 'walking'), ('losses', 'shold', 'indure'), ('open', 'Perils', 'surest'), ('poor', 'dum', 'mouths'), ('wide', 'Walkes', 'incompast'), ('Perils', 'surest', 'answered'), ('Shakespeare', '1599', ']'), ('ambitious', 'Ocean', 'swell'), ('hundred', 'gastly', 'Women'), ('owe', 'mo', 'teares'), ('busie', 'care', 'drawes'), ('fierce', 'Ciuill', 'strife'), ('former', 'Ensigne', 'Two'), ('honest', 'Neighbors', 'showted'), ('poor', 'poor', 'dum'), ('seuenty', 'fiue', 'Drachmaes'), ('young', 'Ambitions', 'Ladder'), ('Souldier', 'ordered', 'Honourably'), ('base', 'Spaniell', 'fawning')]