Exercise 1)


In [1]:
#EYE DROPS OFF SHELF
# Correct: Eye(Noun) Drops(Noun Plural) off(Preposition) shelf(Noun)
# Misinterpretation: Eye(Noun) Drops(Verb) off(Preposition) shelf(Noun)

Exercise 2)


In [2]:
import nltk
from nltk.corpus import brown
brown_tagged = brown.tagged_words(tagset='universal')
# word: contest, guess: noun
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_tagged if word == 'contest')
print tag_fd.keys()
print tag_fd['VERB']
print tag_fd['NOUN']


[u'VERB', u'NOUN']
2
22

In [3]:
# word: play, guess: verb
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_tagged if word == 'play')
print tag_fd.keys()
print tag_fd['VERB']
print tag_fd['NOUN']


[u'VERB', u'NOUN']
109
88
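
The two cells above repeat the same counting pattern for different words. A minimal sketch of that pattern as a reusable helper is given here; it uses collections.Counter from the standard library as a stand-in for nltk.FreqDist, and both the helper name tag_counts and the toy word list are made up for illustration (they are not Brown corpus data). The code is written to run under Python 2 and Python 3.

```python
from collections import Counter

# Toy stand-in for brown.tagged_words(tagset='universal'); these pairs are
# invented for illustration and are not taken from the Brown corpus.
toy_tagged = [
    ('the', 'DET'), ('play', 'NOUN'), ('was', 'VERB'), ('good', 'ADJ'),
    ('children', 'NOUN'), ('play', 'VERB'), ('outside', 'ADV'),
    ('they', 'PRON'), ('play', 'VERB'), ('chess', 'NOUN'),
]

def tag_counts(tagged_words, target):
    """Count how often each tag is assigned to the word `target`."""
    return Counter(tag for (word, tag) in tagged_words if word == target)

play_counts = tag_counts(toy_tagged, 'play')
# most_common(1) returns a list holding the most frequent (tag, count) pair
most_frequent_tag = play_counts.most_common(1)[0][0]
```

With the real corpus one would pass brown_tagged instead of toy_tagged; the counts would then match the outputs shown above.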

Exercise 3)


In [4]:
# (They, PRON) (wind, VERB) (back, PRT) (the, DET) (clock, NOUN)
# (while, CONJ) (we, PRON) (chase, VERB) (after, ADP) (the, DET) (wind, NOUN)
# => (wind, VERB) vs. (wind, NOUN)

Exercise 4)


In [5]:
# Pronunciation dictionary, maps from written form to phonetic transcriptions of possible pronunciations
# Syntax tree analyzer, maps from surface form of sentence to analyzed tree structure

Exercise 5)


In [6]:
d = {}
d['foo'] = 'bar'
d['baz'] = 123
d


Out[6]:
{'baz': 123, 'foo': 'bar'}

In [7]:
# d['xyz']  # left commented out: looking up a key that is not in the dictionary raises KeyError: 'xyz'
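
A short sketch of three common ways to deal with a missing key without stopping the notebook. The dictionary mirrors the one built above; the variable names and the 'UNKNOWN' default are chosen only for this illustration.

```python
d = {'foo': 'bar', 'baz': 123}

# 1) Test for membership before indexing
has_xyz = 'xyz' in d

# 2) get() returns a default (None unless one is given) instead of raising
value_or_none = d.get('xyz')
value_or_default = d.get('xyz', 'UNKNOWN')

# 3) Catch the exception explicitly
try:
    d['xyz']
    raised = False
except KeyError:
    raised = True
```

None of these three variants adds the key 'xyz' to the dictionary.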

Exercise 6)


In [8]:
del d['foo']
d


Out[8]:
{'baz': 123}

Exercise 7)


In [9]:
d1 = {'foo': 123, 'bar': 456}
d2 = {1: 'some entry', 'foo': 'something new'}
d1.update(d2)
print d1
print d2


{1: 'some entry', 'foo': 'something new', 'bar': 456}
{1: 'some entry', 'foo': 'something new'}

In [10]:
# d1.update(d2) adds the entries of d2 to d1 and overwrites the value of any key that exists in both (here 'foo'); d2 itself is unchanged
# useful e.g. for a default-tagger fallback: start from a dictionary of default tags, then update it with the entries of a more sophisticated tagger
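
The fallback idea from the comment above could look roughly like this. Both dictionaries are hypothetical (the tags are illustrative and not derived from a corpus), and copying before the update is a design choice of this sketch so that the defaults stay available.

```python
# Hypothetical coarse default tags (illustrative, not derived from a corpus)
default_tags = {'play': 'NOUN', 'wind': 'NOUN', 'the': 'DET'}

# Hypothetical entries from a more specific source; they should take precedence
better_tags = {'play': 'VERB', 'contest': 'NOUN'}

# Copy first so that the defaults stay intact; update() modifies in place
combined = dict(default_tags)
combined.update(better_tags)
```

After the update, combined['play'] holds the more specific tag, while default_tags still holds the original one.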

Exercise 8)


In [11]:
e = {'headword': 'piano', 'part-of-speech': 'NOUN', 'sense': 'a musical instrument with keys', 'example': 'She likes to play pieces by Mozart on her piano.'}

Exercise 9)


In [12]:
from nltk.text import Text
Text(brown.words()).concordance('go')


Displaying 25 of 626 matches:
struction bonds . The bond issue will go to the state courts for a friendly te
ress text still had `` quite a way to go '' toward completion . Decisions are 
 replied , `` I would say it's got to go thru several more drafts '' . Salinge
ause the levy is already scheduled to go up by 1 per cent on that date to pay 
irst year , 1963 . Both figures would go higher in later years . Other parts o
lion dollars the first year and would go up to 21 millions by 1966 . The Presi
said yesterday he would be willing to go before the city council `` or anyone 
red would know what to do or where to go in the event of an enemy attack . The
e another . Riverside residents would go to the Seekonk assembly point . Mr. H
nk E. Smith as the one most likely to go , the redistricting battle will put t
e battle . Then he could tell them to go home , while the administration conti
pinion as to how far the board should go , and whose advice it should follow .
n to where the parents wanted them to go . Dr. Melvin W. Barnes , superintende
ay night in Salem . On Friday he will go to Portland for the swearing in of De
`` Then we'd really have someplace to go '' . Bowie , Md. , March 17 -- Gainin
`` Spahnie doesn't know how to merely go through the motions '' , remarked Eno
Slocum Memorial Award . To Spahn will go the Sid Mercer Memorial Award as the 
ds . The writers' Gold Tee Award will go to John McAuliffe of Plainfield , N. 
ever , Mr. Parichy and his bride will go to Vero Beach on their wedding trip ,
ncy '' . No matter how many Americans go abroad in summer , probably a hundred
, service and comfort stations ( they go together like Scots and heather ) , d
Werner said , was let manual laborers go home Tuesday night for some rest . Wo
t destroyers . It could , reputedly , go 70,000 miles without refueling and st
dinator of audio-visual education may go to the state Supreme Court , it appea
 therefore knew them . It was time to go up myself '' . Fiedler was then techn

In [13]:
Text(brown.words()).concordance('went')


Displaying 25 of 507 matches:
ade good his promise . `` Everything went real smooth '' , the sheriff said . 
axation . Under committee rules , it went automatically to a subcommittee for 
e . And after several correspondents went into Pathet Lao territory and expose
the Kansas City scoring in the sixth went like this : Lumpe worked a walk as t
his season to 13 straight before one went astray last Saturday night in the 41
nd caught one pass for 13 yards . He went into the Army in March , 1957 , and 
m Monday , ran for 30 minutes , then went in , while the reserves scrimmaged f
tchie of the Ogden Standard Examiner went to his compartment to talk with him 
pped into his second shot . The ball went off in a majestic arc , an out-of-bo
 rare sense of humor . Everywhere he went in town , people sidled up , gave hi
e is where a man was born , reared , went to school and , most particularly , 
 off for good behavior . He promptly went to communist East Germany . The magi
in the Skipjack . With the machinery went a complete design for the hull . The
issile submarines '' , the statement went on . The atom reactor , water cooled
e this account : Thomas early Sunday went to the home of his uncle and aunt , 
sheep showman contest . Blue ribbons went to Stephanie Shaw of Hillsboro , Lar
nks . Swine showmanship championship went to Bob Day , with Tom Day and Hutchi
d on the jury . Then , when the case went to the jury , the judge excused one 
streamer of dust it landed . Fiedler went on to make several other test flight
inals . Second grand prize of $5,000 went to Mrs. Clara L. Oliver for her Hawa
Last two to be added before the book went to press were the marriages of Mered
 next year's television shows . So I went to see `` La Dolce Vita '' . It has 
ou have language problems . The week went along briskly enough . I bought a ne
eaded up one of the four groups that went on simultaneous tours after the Gall
tings and sculptures to each group , went one of the Gallery's blue-uniformed 

In [14]:
# 'go' appears after 'to', 'would', 'will', 'may' where 'went' cannot appear
# both can appear after pronouns/nouns
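
The observation above comes from reading the concordance lines by eye. A possible way to check it more systematically is to count the word immediately to the left of each target. The sketch below does this on a handful of invented, pre-tokenized sentences (not Brown data); the helper name left_contexts is hypothetical.

```python
from collections import Counter

# Invented toy sentences (already tokenized); not taken from the Brown corpus
toy_sents = [
    'he will go home'.split(),
    'she wanted to go abroad'.split(),
    'they would go up'.split(),
    'he went home'.split(),
    'the ball went off'.split(),
    'they went on'.split(),
]

def left_contexts(sents, target):
    """Count the words that occur immediately before `target`."""
    counts = Counter()
    for sent in sents:
        # Pair every word with its successor and keep pairs ending in target
        for prev, word in zip(sent, sent[1:]):
            if word == target:
                counts[prev] += 1
    return counts

go_left = left_contexts(toy_sents, 'go')
went_left = left_contexts(toy_sents, 'went')
# Left contexts seen only before 'go' in this toy sample
only_go = sorted(set(go_left) - set(went_left))
```

On the real corpus, brown.sents() could be passed in place of toy_sents.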

Exercise 10)


In [15]:
brown_tagged_sents = brown.tagged_sents()
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.tag(nltk.word_tokenize('Villa Park is an association football stadium in Aston, Birmingham, England, with a seating capacity of 42,682. Formerly a sports ground in a Victorian amusement park, it has been the home of Aston Villa Football Club since 17 April 1897.'))


Out[15]:
[('Villa', u'NP'),
 ('Park', u'NN-TL'),
 ('is', u'BEZ'),
 ('an', u'AT'),
 ('association', u'NN'),
 ('football', u'NN'),
 ('stadium', u'NN'),
 ('in', u'IN'),
 ('Aston', None),
 (',', u','),
 ('Birmingham', u'NP'),
 (',', u','),
 ('England', u'NP'),
 (',', u','),
 ('with', u'IN'),
 ('a', u'AT'),
 ('seating', u'VBG'),
 ('capacity', u'NN'),
 ('of', u'IN'),
 ('42,682', None),
 ('.', u'.'),
 ('Formerly', u'RB'),
 ('a', u'AT'),
 ('sports', u'NNS'),
 ('ground', u'NN'),
 ('in', u'IN'),
 ('a', u'AT'),
 ('Victorian', u'JJ'),
 ('amusement', u'NN'),
 ('park', u'NN'),
 (',', u','),
 ('it', u'PPS'),
 ('has', u'HVZ'),
 ('been', u'BEN'),
 ('the', u'AT'),
 ('home', u'NR'),
 ('of', u'IN'),
 ('Aston', None),
 ('Villa', u'NP'),
 ('Football', u'NN-TL'),
 ('Club', u'NN-TL'),
 ('since', u'CS'),
 ('17', u'CD'),
 ('April', u'NP'),
 ('1897', u'CD'),
 ('.', u'.')]

In [16]:
# the unigram tagger assigns None to every word it has not seen during training (here 'Aston' and '42,682')
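
To make the behaviour above concrete, here is a minimal pure-Python sketch of a unigram lookup tagger with an optional default tag. It is not NLTK's implementation; the training sentences are invented, and the function names train_unigram and tag are assumptions of this sketch. In NLTK itself the same idea is available through the backoff argument of the taggers, for example with an nltk.DefaultTagger.

```python
from collections import Counter, defaultdict

# Invented toy training data: a list of tagged sentences
train_sents = [
    [('the', 'AT'), ('park', 'NN'), ('is', 'BEZ'), ('a', 'AT'), ('park', 'NN')],
    [('a', 'AT'), ('stadium', 'NN'), ('is', 'BEZ'), ('home', 'NR')],
]

def train_unigram(tagged_sents):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return dict((w, c.most_common(1)[0][0]) for w, c in counts.items())

def tag(model, tokens, default=None):
    """Look up each token; unseen tokens get `default` (None mimics Out[15])."""
    return [(tok, model.get(tok, default)) for tok in tokens]

model = train_unigram(train_sents)
tokens = ['the', 'stadium', 'in', 'Aston']
without_fallback = tag(model, tokens)
with_fallback = tag(model, tokens, default='NN')
```

Without a default, the unseen tokens 'in' and 'Aston' come back as None; with default='NN' they are tagged as nouns, which is a crude but common fallback.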

Exercise 11)


In [17]:
help(nltk.AffixTagger)


Help on class AffixTagger in module nltk.tag.sequential:

class AffixTagger(ContextTagger)
 |  A tagger that chooses a token's tag based on a leading or trailing
 |  substring of its word string.  (It is important to note that these
 |  substrings are not necessarily "true" morphological affixes).  In
 |  particular, a fixed-length substring of the word is looked up in a
 |  table, and the corresponding tag is returned.  Affix taggers are
 |  typically constructed by training them on a tagged corpus.
 |  
 |  Construct a new affix tagger.
 |  
 |  :param affix_length: The length of the affixes that should be
 |      considered during training and tagging.  Use negative
 |      numbers for suffixes.
 |  :param min_stem_length: Any words whose length is less than
 |      min_stem_length+abs(affix_length) will be assigned a
 |      tag of None by this tagger.
 |  
 |  Method resolution order:
 |      AffixTagger
 |      ContextTagger
 |      SequentialBackoffTagger
 |      nltk.tag.api.TaggerI
 |      __builtin__.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, train=None, model=None, affix_length=-3, min_stem_length=2, backoff=None, cutoff=0, verbose=False)
 |  
 |  context(self, tokens, index, history)
 |  
 |  encode_json_obj(self)
 |  
 |  ----------------------------------------------------------------------
 |  Class methods defined here:
 |  
 |  decode_json_obj(cls, obj) from __builtin__.type
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  json_tag = u'nltk.tag.sequential.AffixTagger'
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from ContextTagger:
 |  
 |  __repr__(self)
 |  
 |  __str__(self)
 |      x.__str__() <==> str(x)
 |  
 |  __unicode__ = __str__(...)
 |      x.__str__() <==> str(x)
 |  
 |  choose_tag(self, tokens, index, history)
 |  
 |  size(self)
 |      :return: The number of entries in the table used by this
 |          tagger to map from contexts to tags.
 |  
 |  unicode_repr = __repr__(self)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from SequentialBackoffTagger:
 |  
 |  tag(self, tokens)
 |  
 |  tag_one(self, tokens, index, history)
 |      Determine an appropriate tag for the specified token, and
 |      return that tag.  If this tagger is unable to determine a tag
 |      for the specified token, then its backoff tagger is consulted.
 |      
 |      :rtype: str
 |      :type tokens: list
 |      :param tokens: The list of words that are being tagged.
 |      :type index: int
 |      :param index: The index of the word whose tag should be
 |          returned.
 |      :type history: list(str)
 |      :param history: A list of the tags for all words before *index*.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from SequentialBackoffTagger:
 |  
 |  backoff
 |      The backoff tagger for this tagger.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from nltk.tag.api.TaggerI:
 |  
 |  evaluate(self, gold)
 |      Score the accuracy of the tagger against the gold standard.
 |      Strip the tags from the gold standard text, retag it using
 |      the tagger, then compute the accuracy score.
 |      
 |      :type gold: list(list(tuple(str, str)))
 |      :param gold: The list of tagged sentences to score the tagger on.
 |      :rtype: float
 |  
 |  tag_sents(self, sentences)
 |      Apply ``self.tag()`` to each element of *sentences*.  I.e.:
 |      
 |          return [self.tag(sent) for sent in sentences]
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from nltk.tag.api.TaggerI:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)


In [18]:
affix_tagger = nltk.AffixTagger(brown_tagged_sents, affix_length=3, min_stem_length=1)
affix_tagger.tag(nltk.word_tokenize('The Road to Emmaus appearance is one of the early resurrection appearances of Jesus after his crucifixion and the discovery of the empty tomb. Both the Meeting on the road to Emmaus and the subsequent Supper at Emmaus, depicting the meal that Jesus had with two disciples after the encounter on the road, have been popular subjects in art.'))


Out[18]:
[('The', None),
 ('Road', u'NN-TL'),
 ('to', None),
 ('Emmaus', u'NP'),
 ('appearance', u'NN'),
 ('is', None),
 ('one', None),
 ('of', None),
 ('the', None),
 ('early', u'JJ'),
 ('resurrection', u'NN'),
 ('appearances', u'NN'),
 ('of', None),
 ('Jesus', u'NP'),
 ('after', u'IN'),
 ('his', None),
 ('crucifixion', u'JJ'),
 ('and', None),
 ('the', None),
 ('discovery', u'NN'),
 ('of', None),
 ('the', None),
 ('empty', u'NN'),
 ('tomb', u'NR'),
 ('.', None),
 ('Both', u'ABX'),
 ('the', None),
 ('Meeting', u'NP'),
 ('on', None),
 ('the', None),
 ('road', u'NN'),
 ('to', None),
 ('Emmaus', u'NP'),
 ('and', None),
 ('the', None),
 ('subsequent', u'NN'),
 ('Supper', u'JJ-TL'),
 ('at', None),
 ('Emmaus', u'NP'),
 (',', None),
 ('depicting', u'NN'),
 ('the', None),
 ('meal', u'NN'),
 ('that', u'CS'),
 ('Jesus', u'NP'),
 ('had', None),
 ('with', u'IN'),
 ('two', None),
 ('disciples', u'NN'),
 ('after', u'IN'),
 ('the', None),
 ('encounter', u'VB'),
 ('on', None),
 ('the', None),
 ('road', u'NN'),
 (',', None),
 ('have', u'HV'),
 ('been', u'BEN'),
 ('popular', u'NN'),
 ('subjects', u'NN'),
 ('in', None),
 ('art', None),
 ('.', None)]

In [19]:
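# Many nouns are tagged correctly, but note that a positive affix_length makes the tagger look at
# prefixes (above: the first 3 characters), not endings; the help text says to use negative numbers
# for suffixes (e.g. affix_length=-3).
# Words shorter than min_stem_length + abs(affix_length) = 4 characters ('The', 'to', 'art', ...) get None.
# Second run: 2-character prefixes with min_stem_length=0, so only tokens shorter than 2 characters
# are excluded by the length check.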
affix_tagger = nltk.AffixTagger(brown_tagged_sents, affix_length=2, min_stem_length=0)
affix_tagger.tag(nltk.word_tokenize('The Road to Emmaus appearance is one of the early resurrection appearances of Jesus after his crucifixion and the discovery of the empty tomb. Both the Meeting on the road to Emmaus and the subsequent Supper at Emmaus, depicting the meal that Jesus had with two disciples after the encounter on the road, have been popular subjects in art.'))


Out[19]:
[('The', u'AT'),
 ('Road', u'NP'),
 ('to', u'TO'),
 ('Emmaus', u'NN-TL'),
 ('appearance', u'NN'),
 ('is', u'BEZ'),
 ('one', u'IN'),
 ('of', u'IN'),
 ('the', u'AT'),
 ('early', u'DT'),
 ('resurrection', u'NN'),
 ('appearances', u'NN'),
 ('of', u'IN'),
 ('Jesus', u'NP'),
 ('after', u'IN'),
 ('his', u'PP$'),
 ('crucifixion', u'NN'),
 ('and', u'CC'),
 ('the', u'AT'),
 ('discovery', u'NN'),
 ('of', u'IN'),
 ('the', u'AT'),
 ('empty', u'NN'),
 ('tomb', u'TO'),
 ('.', None),
 ('Both', u'NP'),
 ('the', u'AT'),
 ('Meeting', u'NP'),
 ('on', u'IN'),
 ('the', u'AT'),
 ('road', u'NN'),
 ('to', u'TO'),
 ('Emmaus', u'NN-TL'),
 ('and', u'CC'),
 ('the', u'AT'),
 ('subsequent', u'JJ'),
 ('Supper', u'NP'),
 ('at', u'IN'),
 ('Emmaus', u'NN-TL'),
 (',', None),
 ('depicting', u'NN'),
 ('the', u'AT'),
 ('meal', u'NNS'),
 ('that', u'AT'),
 ('Jesus', u'NP'),
 ('had', u'HVD'),
 ('with', u'IN'),
 ('two', u'CD'),
 ('disciples', u'NN'),
 ('after', u'IN'),
 ('the', u'AT'),
 ('encounter', u'NN'),
 ('on', u'IN'),
 ('the', u'AT'),
 ('road', u'NN'),
 (',', None),
 ('have', u'HVD'),
 ('been', u'BE'),
 ('popular', u'NN'),
 ('subjects', u'JJ'),
 ('in', u'IN'),
 ('art', u'BER'),
 ('.', None)]
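
The sign convention of affix_length is easy to get wrong, so here is a small sketch of the lookup key it implies, following the description in the help text above (positive length: prefix, negative length: suffix, too-short words: None). The function name affix is hypothetical, and the sketch assumes a nonzero affix_length; it illustrates the convention and is not NLTK's code.

```python
def affix(word, affix_length, min_stem_length):
    """Return the prefix or suffix a fixed-length affix lookup would use.

    Positive lengths select a prefix, negative lengths a suffix. Words
    shorter than min_stem_length + abs(affix_length) yield None.
    Assumes affix_length is nonzero.
    """
    if len(word) < min_stem_length + abs(affix_length):
        return None
    if affix_length > 0:
        return word[:affix_length]
    return word[affix_length:]

prefix_of_resurrection = affix('resurrection', 3, 1)
suffix_of_resurrection = affix('resurrection', -3, 1)
too_short = affix('The', 3, 1)  # len('The') == 3 is less than 1 + 3
```

Under this convention, a tagger meant to exploit typical noun endings such as '-ion' would be trained with a negative affix_length.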

Exercise 12)


In [20]:
bigram_tagger = nltk.BigramTagger(brown_tagged_sents)

In [21]:
bigram_tagger.tag(brown.sents()[0])


Out[21]:
[(u'The', u'AT'),
 (u'Fulton', u'NP-TL'),
 (u'County', u'NN-TL'),
 (u'Grand', u'JJ-TL'),
 (u'Jury', u'NN-TL'),
 (u'said', u'VBD'),
 (u'Friday', u'NR'),
 (u'an', u'AT'),
 (u'investigation', u'NN'),
 (u'of', u'IN'),
 (u"Atlanta's", u'NP$'),
 (u'recent', u'JJ'),
 (u'primary', u'NN'),
 (u'election', u'NN'),
 (u'produced', u'VBN'),
 (u'``', u'``'),
 (u'no', u'AT'),
 (u'evidence', u'NN'),
 (u"''", u"''"),
 (u'that', u'WPS'),
 (u'any', None),
 (u'irregularities', None),
 (u'took', None),
 (u'place', None),
 (u'.', None)]

In [22]:
bigram_tagger.tag(nltk.word_tokenize('The Road to Emmaus appearance is one of the early resurrection appearances of Jesus after his crucifixion and the discovery of the empty tomb. Both the Meeting on the road to Emmaus and the subsequent Supper at Emmaus, depicting the meal that Jesus had with two disciples after the encounter on the road, have been popular subjects in art.'))


Out[22]:
[('The', u'AT'),
 ('Road', None),
 ('to', None),
 ('Emmaus', None),
 ('appearance', None),
 ('is', None),
 ('one', None),
 ('of', None),
 ('the', None),
 ('early', None),
 ('resurrection', None),
 ('appearances', None),
 ('of', None),
 ('Jesus', None),
 ('after', None),
 ('his', None),
 ('crucifixion', None),
 ('and', None),
 ('the', None),
 ('discovery', None),
 ('of', None),
 ('the', None),
 ('empty', None),
 ('tomb', None),
 ('.', None),
 ('Both', None),
 ('the', None),
 ('Meeting', None),
 ('on', None),
 ('the', None),
 ('road', None),
 ('to', None),
 ('Emmaus', None),
 ('and', None),
 ('the', None),
 ('subsequent', None),
 ('Supper', None),
 ('at', None),
 ('Emmaus', None),
 (',', None),
 ('depicting', None),
 ('the', None),
 ('meal', None),
 ('that', None),
 ('Jesus', None),
 ('had', None),
 ('with', None),
 ('two', None),
 ('disciples', None),
 ('after', None),
 ('the', None),
 ('encounter', None),
 ('on', None),
 ('the', None),
 ('road', None),
 (',', None),
 ('have', None),
 ('been', None),
 ('popular', None),
 ('subjects', None),
 ('in', None),
 ('art', None),
 ('.', None)]

In [23]:
# performs very badly on unseen data: as soon as one token is tagged None (here 'Road'), every following token is tagged None as well,
# because the training data contains no bigram context in which the previous tag is None
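
The cascade of None tags can be reproduced with a toy bigram tagger in a few lines. The sketch below is not NLTK's implementation: the training sentence is invented, the names train_bigram and tag_bigram and the START marker are assumptions, and the word-to-tag fallback dictionary plays the role that a backoff tagger plays in NLTK.

```python
from collections import Counter, defaultdict

START = '<s>'  # marker for "no previous tag" at the start of a sentence

# Invented toy training data
train_sents = [
    [('the', 'AT'), ('road', 'NN'), ('is', 'BEZ'), ('long', 'JJ')],
    [('the', 'AT'), ('Road', 'NN'), ('is', 'BEZ'), ('long', 'JJ')],
]

def train_bigram(tagged_sents):
    """Map (previous tag, word) to the most frequent tag for that context."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = START
        for word, tag in sent:
            counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return dict((ctx, c.most_common(1)[0][0]) for ctx, c in counts.items())

def tag_bigram(model, tokens, fallback=None):
    """Tag left to right; `fallback` maps word -> tag for unseen contexts."""
    fallback = fallback or {}
    tagged, prev_tag = [], START
    for word in tokens:
        t = model.get((prev_tag, word))
        if t is None:
            t = fallback.get(word)
        tagged.append((word, t))
        prev_tag = t  # a None here makes every later context unseen
    return tagged

bigram_model = train_bigram(train_sents[:1])  # train on the first sentence only
tokens = ['the', 'Road', 'is', 'long']
plain = tag_bigram(bigram_model, tokens)
backed = tag_bigram(bigram_model, tokens, fallback={'Road': 'NN'})
```

In the plain run only 'the' gets a tag, because the unseen word 'Road' produces None and poisons every later context. With a single fallback entry for 'Road', the following tokens are tagged again, since their contexts were seen in training.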

Exercise 13)


In [24]:
date_dict = {'day': '22', 'month': 'April', 'month_num': '04', 'year': '2017'}
print '{day} of {month}, {year}.'.format(**date_dict)
print '{month_num}/{day}/{year}'.format(**date_dict)


22 of April, 2017.
04/22/2017

Exercise 14)


In [25]:
tags = []
for sent in brown_tagged_sents:
    for (word, tag) in sent:
        tags.append(tag)
        
sorted(set(tags))


Out[25]:
[u"'",
 u"''",
 u'(',
 u'(-HL',
 u')',
 u')-HL',
 u'*',
 u'*-HL',
 u'*-NC',
 u'*-TL',
 u',',
 u',-HL',
 u',-NC',
 u',-TL',
 u'--',
 u'---HL',
 u'.',
 u'.-HL',
 u'.-NC',
 u'.-TL',
 u':',
 u':-HL',
 u':-TL',
 u'ABL',
 u'ABN',
 u'ABN-HL',
 u'ABN-NC',
 u'ABN-TL',
 u'ABX',
 u'AP',
 u'AP$',
 u'AP+AP-NC',
 u'AP-HL',
 u'AP-NC',
 u'AP-TL',
 u'AT',
 u'AT-HL',
 u'AT-NC',
 u'AT-TL',
 u'AT-TL-HL',
 u'BE',
 u'BE-HL',
 u'BE-TL',
 u'BED',
 u'BED*',
 u'BED-NC',
 u'BEDZ',
 u'BEDZ*',
 u'BEDZ-HL',
 u'BEDZ-NC',
 u'BEG',
 u'BEM',
 u'BEM*',
 u'BEM-NC',
 u'BEN',
 u'BEN-TL',
 u'BER',
 u'BER*',
 u'BER*-NC',
 u'BER-HL',
 u'BER-NC',
 u'BER-TL',
 u'BEZ',
 u'BEZ*',
 u'BEZ-HL',
 u'BEZ-NC',
 u'BEZ-TL',
 u'CC',
 u'CC-HL',
 u'CC-NC',
 u'CC-TL',
 u'CC-TL-HL',
 u'CD',
 u'CD$',
 u'CD-HL',
 u'CD-NC',
 u'CD-TL',
 u'CD-TL-HL',
 u'CS',
 u'CS-HL',
 u'CS-NC',
 u'CS-TL',
 u'DO',
 u'DO*',
 u'DO*-HL',
 u'DO+PPSS',
 u'DO-HL',
 u'DO-NC',
 u'DO-TL',
 u'DOD',
 u'DOD*',
 u'DOD*-TL',
 u'DOD-NC',
 u'DOZ',
 u'DOZ*',
 u'DOZ*-TL',
 u'DOZ-HL',
 u'DOZ-TL',
 u'DT',
 u'DT$',
 u'DT+BEZ',
 u'DT+BEZ-NC',
 u'DT+MD',
 u'DT-HL',
 u'DT-NC',
 u'DT-TL',
 u'DTI',
 u'DTI-HL',
 u'DTI-TL',
 u'DTS',
 u'DTS+BEZ',
 u'DTS-HL',
 u'DTX',
 u'EX',
 u'EX+BEZ',
 u'EX+HVD',
 u'EX+HVZ',
 u'EX+MD',
 u'EX-HL',
 u'EX-NC',
 u'FW-*',
 u'FW-*-TL',
 u'FW-AT',
 u'FW-AT+NN-TL',
 u'FW-AT+NP-TL',
 u'FW-AT-HL',
 u'FW-AT-TL',
 u'FW-BE',
 u'FW-BER',
 u'FW-BEZ',
 u'FW-CC',
 u'FW-CC-TL',
 u'FW-CD',
 u'FW-CD-TL',
 u'FW-CS',
 u'FW-DT',
 u'FW-DT+BEZ',
 u'FW-DTS',
 u'FW-HV',
 u'FW-IN',
 u'FW-IN+AT',
 u'FW-IN+AT-T',
 u'FW-IN+AT-TL',
 u'FW-IN+NN',
 u'FW-IN+NN-TL',
 u'FW-IN+NP-TL',
 u'FW-IN-TL',
 u'FW-JJ',
 u'FW-JJ-NC',
 u'FW-JJ-TL',
 u'FW-JJR',
 u'FW-JJT',
 u'FW-NN',
 u'FW-NN$',
 u'FW-NN$-TL',
 u'FW-NN-NC',
 u'FW-NN-TL',
 u'FW-NN-TL-NC',
 u'FW-NNS',
 u'FW-NNS-NC',
 u'FW-NNS-TL',
 u'FW-NP',
 u'FW-NP-TL',
 u'FW-NPS',
 u'FW-NPS-TL',
 u'FW-NR',
 u'FW-NR-TL',
 u'FW-OD-NC',
 u'FW-OD-TL',
 u'FW-PN',
 u'FW-PP$',
 u'FW-PP$-NC',
 u'FW-PP$-TL',
 u'FW-PPL',
 u'FW-PPL+VBZ',
 u'FW-PPO',
 u'FW-PPO+IN',
 u'FW-PPS',
 u'FW-PPSS',
 u'FW-PPSS+HV',
 u'FW-QL',
 u'FW-RB',
 u'FW-RB+CC',
 u'FW-RB-TL',
 u'FW-TO+VB',
 u'FW-UH',
 u'FW-UH-NC',
 u'FW-UH-TL',
 u'FW-VB',
 u'FW-VB-NC',
 u'FW-VB-TL',
 u'FW-VBD',
 u'FW-VBD-TL',
 u'FW-VBG',
 u'FW-VBG-TL',
 u'FW-VBN',
 u'FW-VBZ',
 u'FW-WDT',
 u'FW-WPO',
 u'FW-WPS',
 u'HV',
 u'HV*',
 u'HV+TO',
 u'HV-HL',
 u'HV-NC',
 u'HV-TL',
 u'HVD',
 u'HVD*',
 u'HVD-HL',
 u'HVG',
 u'HVG-HL',
 u'HVN',
 u'HVZ',
 u'HVZ*',
 u'HVZ-NC',
 u'HVZ-TL',
 u'IN',
 u'IN+IN',
 u'IN+PPO',
 u'IN-HL',
 u'IN-NC',
 u'IN-TL',
 u'IN-TL-HL',
 u'JJ',
 u'JJ$-TL',
 u'JJ+JJ-NC',
 u'JJ-HL',
 u'JJ-NC',
 u'JJ-TL',
 u'JJ-TL-HL',
 u'JJ-TL-NC',
 u'JJR',
 u'JJR+CS',
 u'JJR-HL',
 u'JJR-NC',
 u'JJR-TL',
 u'JJS',
 u'JJS-HL',
 u'JJS-TL',
 u'JJT',
 u'JJT-HL',
 u'JJT-NC',
 u'JJT-TL',
 u'MD',
 u'MD*',
 u'MD*-HL',
 u'MD+HV',
 u'MD+PPSS',
 u'MD+TO',
 u'MD-HL',
 u'MD-NC',
 u'MD-TL',
 u'NIL',
 u'NN',
 u'NN$',
 u'NN$-HL',
 u'NN$-TL',
 u'NN+BEZ',
 u'NN+BEZ-TL',
 u'NN+HVD-TL',
 u'NN+HVZ',
 u'NN+HVZ-TL',
 u'NN+IN',
 u'NN+MD',
 u'NN+NN-NC',
 u'NN-HL',
 u'NN-NC',
 u'NN-TL',
 u'NN-TL-HL',
 u'NN-TL-NC',
 u'NNS',
 u'NNS$',
 u'NNS$-HL',
 u'NNS$-NC',
 u'NNS$-TL',
 u'NNS$-TL-HL',
 u'NNS+MD',
 u'NNS-HL',
 u'NNS-NC',
 u'NNS-TL',
 u'NNS-TL-HL',
 u'NNS-TL-NC',
 u'NP',
 u'NP$',
 u'NP$-HL',
 u'NP$-TL',
 u'NP+BEZ',
 u'NP+BEZ-NC',
 u'NP+HVZ',
 u'NP+HVZ-NC',
 u'NP+MD',
 u'NP-HL',
 u'NP-NC',
 u'NP-TL',
 u'NP-TL-HL',
 u'NPS',
 u'NPS$',
 u'NPS$-HL',
 u'NPS$-TL',
 u'NPS-HL',
 u'NPS-NC',
 u'NPS-TL',
 u'NR',
 u'NR$',
 u'NR$-TL',
 u'NR+MD',
 u'NR-HL',
 u'NR-NC',
 u'NR-TL',
 u'NR-TL-HL',
 u'NRS',
 u'NRS-TL',
 u'OD',
 u'OD-HL',
 u'OD-NC',
 u'OD-TL',
 u'PN',
 u'PN$',
 u'PN+BEZ',
 u'PN+HVD',
 u'PN+HVZ',
 u'PN+MD',
 u'PN-HL',
 u'PN-NC',
 u'PN-TL',
 u'PP$',
 u'PP$$',
 u'PP$-HL',
 u'PP$-NC',
 u'PP$-TL',
 u'PPL',
 u'PPL-HL',
 u'PPL-NC',
 u'PPL-TL',
 u'PPLS',
 u'PPO',
 u'PPO-HL',
 u'PPO-NC',
 u'PPO-TL',
 u'PPS',
 u'PPS+BEZ',
 u'PPS+BEZ-HL',
 u'PPS+BEZ-NC',
 u'PPS+HVD',
 u'PPS+HVZ',
 u'PPS+MD',
 u'PPS-HL',
 u'PPS-NC',
 u'PPS-TL',
 u'PPSS',
 u'PPSS+BEM',
 u'PPSS+BER',
 u'PPSS+BER-N',
 u'PPSS+BER-NC',
 u'PPSS+BER-TL',
 u'PPSS+BEZ',
 u'PPSS+BEZ*',
 u'PPSS+HV',
 u'PPSS+HV-TL',
 u'PPSS+HVD',
 u'PPSS+MD',
 u'PPSS+MD-NC',
 u'PPSS+VB',
 u'PPSS-HL',
 u'PPSS-NC',
 u'PPSS-TL',
 u'QL',
 u'QL-HL',
 u'QL-NC',
 u'QL-TL',
 u'QLP',
 u'RB',
 u'RB$',
 u'RB+BEZ',
 u'RB+BEZ-HL',
 u'RB+BEZ-NC',
 u'RB+CS',
 u'RB-HL',
 u'RB-NC',
 u'RB-TL',
 u'RBR',
 u'RBR+CS',
 u'RBR-NC',
 u'RBT',
 u'RN',
 u'RP',
 u'RP+IN',
 u'RP-HL',
 u'RP-NC',
 u'RP-TL',
 u'TO',
 u'TO+VB',
 u'TO-HL',
 u'TO-NC',
 u'TO-TL',
 u'UH',
 u'UH-HL',
 u'UH-NC',
 u'UH-TL',
 u'VB',
 u'VB+AT',
 u'VB+IN',
 u'VB+JJ-NC',
 u'VB+PPO',
 u'VB+RP',
 u'VB+TO',
 u'VB+VB-NC',
 u'VB-HL',
 u'VB-NC',
 u'VB-TL',
 u'VBD',
 u'VBD-HL',
 u'VBD-NC',
 u'VBD-TL',
 u'VBG',
 u'VBG+TO',
 u'VBG-HL',
 u'VBG-NC',
 u'VBG-TL',
 u'VBN',
 u'VBN+TO',
 u'VBN-HL',
 u'VBN-NC',
 u'VBN-TL',
 u'VBN-TL-HL',
 u'VBN-TL-NC',
 u'VBZ',
 u'VBZ-HL',
 u'VBZ-NC',
 u'VBZ-TL',
 u'WDT',
 u'WDT+BER',
 u'WDT+BER+PP',
 u'WDT+BEZ',
 u'WDT+BEZ-HL',
 u'WDT+BEZ-NC',
 u'WDT+BEZ-TL',
 u'WDT+DO+PPS',
 u'WDT+DOD',
 u'WDT+HVZ',
 u'WDT-HL',
 u'WDT-NC',
 u'WP$',
 u'WPO',
 u'WPO-NC',
 u'WPO-TL',
 u'WPS',
 u'WPS+BEZ',
 u'WPS+BEZ-NC',
 u'WPS+BEZ-TL',
 u'WPS+HVD',
 u'WPS+HVZ',
 u'WPS+MD',
 u'WPS-HL',
 u'WPS-NC',
 u'WPS-TL',
 u'WQL',
 u'WQL-TL',
 u'WRB',
 u'WRB+BER',
 u'WRB+BEZ',
 u'WRB+BEZ-TL',
 u'WRB+DO',
 u'WRB+DOD',
 u'WRB+DOD*',
 u'WRB+DOZ',
 u'WRB+IN',
 u'WRB+MD',
 u'WRB-HL',
 u'WRB-NC',
 u'WRB-TL',
 u'``']

In [26]:
# shorter, with a generator expression; produces the same sorted tag list as Out[25]:
sorted(set(tag for sent in brown_tagged_sents for (word, tag) in sent))



Exercise 15)


In [27]:
# a)
# Only considers common nouns whose plural is formed by appending 's' (NN vs. NNS);
# restricted to the 'news' category for a shorter run time.
brown_tagged_words = brown.tagged_words(categories='news')
relevant_pairs = [pair for pair in brown_tagged_words if pair[1] in ['NN', 'NNS']]
fdist = nltk.FreqDist(relevant_pairs)
seen_words = set()  # set membership tests are fast, unlike scanning a list
for pair in relevant_pairs:
    plural_pair = (pair[0] + 's', 'NNS')
    # fdist[...] is 0 for unseen pairs, so the comparison also requires that the plural occurs
    if pair[0] not in seen_words and pair[1] == 'NN' and fdist[pair] < fdist[plural_pair]:
        print pair[0], ': singular ', fdist[pair], ', plural ', fdist[plural_pair]
        seen_words.add(pair[0])


cost : singular  13 , plural  17
official : singular  6 , plural  12
bond : singular  11 , plural  28
item : singular  1 , plural  4
fund : singular  11 , plural  27
legislator : singular  2 , plural  8
estimate : singular  2 , plural  3
fee : singular  7 , plural  12
lieutenant : singular  1 , plural  2
senator : singular  3 , plural  4
loan : singular  5 , plural  6
person : singular  9 , plural  24
recommendation : singular  1 , plural  6
completion : singular  2 , plural  3
minute : singular  1 , plural  25
procedure : singular  2 , plural  5
abuse : singular  1 , plural  2
worker : singular  2 , plural  15
employer : singular  2 , plural  4
payment : singular  6 , plural  8
student : singular  17 , plural  21
scholarship : singular  3 , plural  4
grant : singular  5 , plural  13
dollar : singular  5 , plural  15
finance : singular  1 , plural  2
hour : singular  17 , plural  22
element : singular  2 , plural  3
step : singular  7 , plural  8
affair : singular  6 , plural  8
application : singular  2 , plural  4
method : singular  3 , plural  10
correspondent : singular  1 , plural  2
observer : singular  1 , plural  2
reform : singular  3 , plural  4
rate : singular  7 , plural  9
letter : singular  7 , plural  12
revenue : singular  3 , plural  13
voter : singular  2 , plural  12
member : singular  35 , plural  69
product : singular  10 , plural  16
wage : singular  2 , plural  11
intention : singular  4 , plural  6
passenger : singular  2 , plural  3
good : singular  3 , plural  5
visitor : singular  1 , plural  2
month : singular  32 , plural  42
owner : singular  3 , plural  10
banker : singular  1 , plural  8
officer : singular  7 , plural  18
artist : singular  2 , plural  5
street : singular  5 , plural  7
seat : singular  3 , plural  4
prospect : singular  5 , plural  7
signature : singular  1 , plural  5
acre : singular  2 , plural  8
builder : singular  1 , plural  7
Attorney : singular  1 , plural  2
defendant : singular  3 , plural  9
run : singular  20 , plural  30
appearance : singular  2 , plural  3
error : singular  1 , plural  8
hit : singular  4 , plural  11
spectator : singular  2 , plural  5
allowance : singular  2 , plural  3
second : singular  1 , plural  2
string : singular  1 , plural  4
yard : singular  1 , plural  22
arm : singular  2 , plural  8
organ : singular  1 , plural  2
dancer : singular  2 , plural  3
writer : singular  2 , plural  4
grip : singular  1 , plural  2
player : singular  6 , plural  12
fan : singular  2 , plural  3
arrangement : singular  2 , plural  4
raise : singular  1 , plural  3
duffer : singular  1 , plural  2
sport : singular  2 , plural  6
golfer : singular  2 , plural  3
spot : singular  1 , plural  2
force : singular  4 , plural  16
stroke : singular  6 , plural  8
change : singular  4 , plural  10
talk : singular  3 , plural  7
obstacle : singular  1 , plural  2
representative : singular  4 , plural  5
folk : singular  1 , plural  2
pain : singular  1 , plural  2
nerve : singular  1 , plural  2
contribution : singular  3 , plural  8
bridge : singular  1 , plural  2
sale : singular  8 , plural  51
sculpture : singular  1 , plural  3
cut : singular  3 , plural  4
store : singular  3 , plural  4
mile : singular  1 , plural  10
machine : singular  5 , plural  7
rifle : singular  1 , plural  2
call : singular  4 , plural  7
resident : singular  2 , plural  6
share : singular  10 , plural  19
friend : singular  2 , plural  11
adviser : singular  3 , plural  7
design : singular  5 , plural  9
detail : singular  2 , plural  3
narcotic : singular  1 , plural  6
relationship : singular  1 , plural  2
neighbor : singular  4 , plural  6
motion : singular  2 , plural  4
motorist : singular  2 , plural  3
rebel : singular  2 , plural  3
minister : singular  1 , plural  2
appeal : singular  2 , plural  6
institution : singular  8 , plural  10
newspaper : singular  1 , plural  2
commitment : singular  2 , plural  3
obligation : singular  1 , plural  4
mail : singular  2 , plural  3
guest : singular  5 , plural  14
path : singular  1 , plural  2
conversion : singular  1 , plural  2
flower : singular  1 , plural  3
bird : singular  2 , plural  4
quota : singular  1 , plural  3
decoration : singular  1 , plural  2
destroyer : singular  1 , plural  2
farmer : singular  3 , plural  5
hoodlum : singular  1 , plural  2
janitor : singular  1 , plural  2
teamster : singular  1 , plural  4
price : singular  13 , plural  19
underwriter : singular  1 , plural  2
manufacturer : singular  5 , plural  10
shareholder : singular  1 , plural  2
mistake : singular  1 , plural  2
dealer : singular  4 , plural  10
chair : singular  1 , plural  4
competitor : singular  1 , plural  2
picker : singular  1 , plural  3
material : singular  2 , plural  3
taxpayer : singular  3 , plural  5
parent : singular  1 , plural  4
adjustment : singular  1 , plural  2
standard : singular  2 , plural  4
piece : singular  3 , plural  7
white : singular  1 , plural  2
tribe : singular  1 , plural  4
snack : singular  1 , plural  2
poster : singular  1 , plural  2
wood : singular  1 , plural  4
pool : singular  1 , plural  3
cookie : singular  2 , plural  3
feature : singular  1 , plural  2
puppet : singular  4 , plural  5
height : singular  1 , plural  2
drum : singular  1 , plural  4
shade : singular  1 , plural  2
decorator : singular  1 , plural  2
draft : singular  1 , plural  2
individual : singular  4 , plural  5
major : singular  2 , plural  3
Guest : singular  1 , plural  3
toy : singular  1 , plural  4
language : singular  2 , plural  3
tank : singular  1 , plural  2
chamber : singular  1 , plural  2
painting : singular  1 , plural  5
aide : singular  1 , plural  3
soldier : singular  1 , plural  2
musician : singular  1 , plural  3
negotiation : singular  1 , plural  11
hat : singular  1 , plural  3
tone : singular  1 , plural  2
employee : singular  1 , plural  11
outsider : singular  1 , plural  2
technique : singular  1 , plural  2
implement : singular  1 , plural  3
restriction : singular  1 , plural  3
mechanic : singular  1 , plural  2
chore : singular  1 , plural  4
reporter : singular  2 , plural  4
vow : singular  1 , plural  2
mine : singular  2 , plural  3
weapon : singular  1 , plural  6
endowment : singular  1 , plural  2
pocket : singular  1 , plural  2
investor : singular  1 , plural  7
turnpike : singular  5 , plural  10
truck : singular  1 , plural  6
resource : singular  1 , plural  6
citizen : singular  1 , plural  6

In [28]:
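# b)
from collections import Counter
# deduplicate (word, tag) pairs so that each word is counted once per distinct tag
tagged_set = set(brown_tagged_words)
data = Counter(word for (word, tag) in tagged_set)
# word with the largest number of distinct tags
word_most_tags = data.most_common(1)[0][0]
for (word, tag) in tagged_set:
    if word == word_most_tags:
        print tag
        # brown_tagset() prints the description itself and returns None,
        # so the surrounding print also emits a "None" line
        print nltk.help.brown_tagset(tag)
        print '\n'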


TO
TO: infinitival to
    to t'
None


TO-HL
No matching tags found.
None


IN-HL
No matching tags found.
None


IN-TL
No matching tags found.
None


NPS
NPS: noun, plural, proper
    Chases Aderholds Chapelles Armisteads Lockies Carbones French Marskmen
    Toppers Franciscans Romans Cadillacs Masons Blacks Catholics British
    Dixiecrats Mississippians Congresses ...
None


IN
IN: preposition
    of in for by considering to on among at through with under into
    regarding than since despite according per before toward against as
    after during including between without except upon out over ...
None
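
The "No matching tags found." lines above appear for tags such as TO-HL and IN-TL. The Brown corpus appends -HL to words in headlines, -TL to words in titles and -NC to cited words, and the tagset help only documents the base tags. A minimal sketch of stripping those markers before the lookup, assuming only these three suffixes need handling (`base_tag` is a hypothetical helper, not part of NLTK):

```python
import re

# Suffixes the Brown corpus appends to a base tag:
# -HL (headline), -TL (title), -NC (cited word).
# Assumption: only these three markers are stripped; combined tags
# such as PPS+BEZ are left untouched.
_SUFFIX = re.compile(r'(-(HL|TL|NC))+$')


def base_tag(tag):
    """Return the tag with any trailing -HL/-TL/-NC markers removed."""
    return _SUFFIX.sub('', tag)


for tag in ['TO', 'TO-HL', 'IN-TL', 'NN-TL-HL', 'PP$']:
    print(tag + ' -> ' + base_tag(tag))
```

In the loop above, calling `nltk.help.brown_tagset(base_tag(tag))` without the surrounding `print` should then show a description for the suffixed tags and avoid the extra "None" lines.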



In [29]:
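# c)
tags = [tag for (word, tag) in brown_tagged_words]
fdist = nltk.FreqDist(tags)
for (tag, count) in fdist.most_common(20):
    print tag, ': ', count
    print nltk.help.brown_tagset(tag)
    print '\n'
# the top 20 mix open-class tags (nouns, verbs, adjectives, adverbs) with
# function-word tags (prepositions, articles, conjunctions, pronouns, infinitival to),
# numerals and punctuation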


NN :  13162
NN: noun, singular, common
    failure burden court fire appointment awarding compensation Mayor
    interim committee fact effect airport management surveillance jail
    doctor intern extern night weekend duty legislation Tax Office ...
None


IN :  10616
IN: preposition
    of in for by considering to on among at through with under into
    regarding than since despite according per before toward against as
    after during including between without except upon out over ...
None


AT :  8893
AT: article
    the an no a every th' ever' ye
None


NP :  6866
NP: noun, singular, proper
    Fulton Atlanta September-October Durwood Pye Ivan Allen Jr. Jan.
    Alpharetta Grady William B. Hartsfield Pearl Williams Aug. Berry J. M.
    Cheshire Griffin Opelika Ala. E. Pelham Snodgrass ...
None


, :  5133
,: comma
    ,
None


NNS :  5066
NNS: noun, plural, common
    irregularities presentments thanks reports voters laws legislators
    years areas adjustments chambers $100 bonds courts sales details raises
    sessions members congressmen votes polls calls ...
None


. :  4452
.: sentence terminator
    . ? ; ! :
None


JJ :  4392
JJ: adjective
    ecent over-all possible hard-fought favorable hard meager fit such
    widespread outmoded inadequate ambiguous grand clerical effective
    orderly federal foster general proportionate ...
None


CC :  2664
CC: conjunction, coordinating
    and or but plus & either neither nor yet 'n' and/or minus an'
None


VBD :  2524
VBD: verb, past tense
    said produced took recommended commented urged found added praised
    charged listed became announced brought attended wanted voted defeated
    received got stood shot scheduled feared promised made ...
None


NN-TL :  2486
No matching tags found.
None


VB :  2440
VB: verb, base: uninflected present, imperative or infinitive
    investigate find act follow inure achieve reduce take remedy re-set
    distribute realize disable feel receive continue place protect
    eliminate elaborate work permit run enter force ...
None


VBN :  2269
VBN: verb, past participle
    conducted charged won received studied revised operated accepted
    combined experienced recommended effected granted seen protected
    adopted retarded notarized selected composed gotten printed ...
None


RB :  2166
RB: adverb
    only often generally also nevertheless upon together back newly no
    likely meanwhile near then heavily there apparently yet outright fully
    aside consistently specifically formally ever just ...
None


CD :  2020
CD: numeral, cardinal
    two one 1 four 2 1913 71 74 637 1937 8 five three million 87-31 29-5
    seven 1,119 fifty-three 7.5 billion hundred 125,000 1,700 60 100 six
    ...
None


CS :  1509
CS: conjunction, subordinating
    that as after whether before while like because if since for than altho
    until so unless though providing once lest s'posin' till whereas
    whereupon supposing tho' albeit then so's 'fore
None


VBG :  1398
VBG: verb, present participle or gerund
    modernizing improving purchasing Purchasing lacking enabling pricing
    keeping getting picking entering voting warning making strengthening
    setting neighboring attending participating moving ...
None


TO :  1237
TO: infinitival to
    to t'
None


PPS :  1056
PPS: pronoun, personal, nominative, 3rd person singular
    it he she thee
None


PP$ :  1051
PP$: determiner, possessive
    our its his their my your her out thy mine thine
None
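
To put rough numbers on what the 20 most frequent tags represent, the counts printed above can be grouped into coarse classes. This is only a sketch: the counts are copied from the output, and the grouping into open-class, closed-class, numeral and punctuation is an assumption made here, not something the Brown tagset defines.

```python
# Counts copied from the output above (20 most frequent tags).
top20 = [('NN', 13162), ('IN', 10616), ('AT', 8893), ('NP', 6866),
         (',', 5133), ('NNS', 5066), ('.', 4452), ('JJ', 4392),
         ('CC', 2664), ('VBD', 2524), ('NN-TL', 2486), ('VB', 2440),
         ('VBN', 2269), ('RB', 2166), ('CD', 2020), ('CS', 1509),
         ('VBG', 1398), ('TO', 1237), ('PPS', 1056), ('PP$', 1051)]

# Assumed coarse grouping (hypothetical, chosen for illustration).
OPEN = {'NN', 'NP', 'NNS', 'NN-TL', 'JJ', 'VBD', 'VB', 'VBN', 'VBG', 'RB'}
PUNCT = {',', '.'}
NUMERAL = {'CD'}


def coarse_class(tag):
    """Map a tag from the top-20 list to one of four coarse classes."""
    if tag in OPEN:
        return 'open'
    if tag in PUNCT:
        return 'punct'
    if tag in NUMERAL:
        return 'numeral'
    return 'closed'  # prepositions, articles, conjunctions, pronouns, TO


totals = {}
for tag, count in top20:
    key = coarse_class(tag)
    totals[key] = totals.get(key, 0) + count

for key in sorted(totals):
    print(key + ': ' + str(totals[key]))
```

Under this grouping, open-class tags cover a little over half of the tokens accounted for by the top 20 tags; the remainder is function words, numerals and punctuation.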



In [34]:
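# d)
# pair every token with its successor and keep the tag of any token followed by a noun tag (N...)
all_bigrams = list(nltk.bigrams(brown_tagged_words))
tags_before_nouns = [tag1 for ((word1, tag1), (word2, tag2)) in all_bigrams if tag2.startswith('N')]
fdist = nltk.FreqDist(tags_before_nouns)
for (tag, count) in fdist.most_common(20):
    print tag, ': ', count
    print nltk.help.brown_tagset(tag)
    print '\n'
# nouns most often follow articles, prepositions, adjectives and other nouns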


AT :  5969
AT: article
    the an no a every th' ever' ye
None


IN :  3328
IN: preposition
    of in for by considering to on among at through with under into
    regarding than since despite according per before toward against as
    after during including between without except upon out over ...
None


JJ :  3128
JJ: adjective
    ecent over-all possible hard-fought favorable hard meager fit such
    widespread outmoded inadequate ambiguous grand clerical effective
    orderly federal foster general proportionate ...
None


NP :  2977
NP: noun, singular, proper
    Fulton Atlanta September-October Durwood Pye Ivan Allen Jr. Jan.
    Alpharetta Grady William B. Hartsfield Pearl Williams Aug. Berry J. M.
    Cheshire Griffin Opelika Ala. E. Pelham Snodgrass ...
None


NN :  2367
NN: noun, singular, common
    failure burden court fire appointment awarding compensation Mayor
    interim committee fact effect airport management surveillance jail
    doctor intern extern night weekend duty legislation Tax Office ...
None


, :  1231
,: comma
    ,
None


NN-TL :  1172
No matching tags found.
None


. :  1106
.: sentence terminator
    . ? ; ! :
None


CC :  894
CC: conjunction, coordinating
    and or but plus & either neither nor yet 'n' and/or minus an'
None


CD :  841
CD: numeral, cardinal
    two one 1 four 2 1913 71 74 637 1937 8 five three million 87-31 29-5
    seven 1,119 fifty-three 7.5 billion hundred 125,000 1,700 60 100 six
    ...
None


PP$ :  739
PP$: determiner, possessive
    our its his their my your her out thy mine thine
None


JJ-TL :  626
No matching tags found.
None


AP :  570
AP: determiner/pronoun, post-determiner
    many other next more last former little several enough most least only
    very few fewer past same Last latter less single plenty 'nough lesser
    certain various manye next-to-last particular final previous present
    nuf
None


NP-TL :  523
No matching tags found.
None


VBG :  422
VBG: verb, present participle or gerund
    modernizing improving purchasing Purchasing lacking enabling pricing
    keeping getting picking entering voting warning making strengthening
    setting neighboring attending participating moving ...
None


VBN :  357
VBN: verb, past participle
    conducted charged won received studied revised operated accepted
    combined experienced recommended effected granted seen protected
    adopted retarded notarized selected composed gotten printed ...
None


DT :  335
DT: determiner/pronoun, singular
    this each another that 'nother
None


VB :  332
VB: verb, base: uninflected present, imperative or infinitive
    investigate find act follow inure achieve reduce take remedy re-set
    distribute realize disable feel receive continue place protect
    eliminate elaborate work permit run enter force ...
None


VBD :  309
VBD: verb, past tense
    said produced took recommended commented urged found added praised
    charged listed became announced brought attended wanted voted defeated
    received got stood shot scheduled feared promised made ...
None


CS :  263
CS: conjunction, subordinating
    that as after whether before while like because if since for than altho
    until so unless though providing once lest s'posin' till whereas
    whereupon supposing tho' albeit then so's 'fore
None
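
The bigrams in d) are built over the flat word list, so a noun that opens a sentence is paired with the final token of the previous sentence. That is one likely reason the sentence terminator "." shows up among the preceding tags. A minimal sketch of the same count restricted to pairs inside a sentence, using two hand-tagged toy sentences (hypothetical data with Brown-style tags; `tags_before_nouns` here is an illustrative function, not the list of the same name in the cell above):

```python
from collections import Counter

# Two hand-tagged toy sentences (hypothetical data, Brown-style tags).
sents = [
    [('The', 'AT'), ('jury', 'NN'), ('praised', 'VBD'),
     ('the', 'AT'), ('new', 'JJ'), ('laws', 'NNS'), ('.', '.')],
    [('Voters', 'NNS'), ('like', 'VB'), ('city', 'NN'),
     ('parks', 'NNS'), ('.', '.')],
]


def tags_before_nouns(tagged_sents):
    """Count the tags that immediately precede a noun within a sentence."""
    counts = Counter()
    for sent in tagged_sents:
        # zip(sent, sent[1:]) yields adjacent pairs inside one sentence only
        for (word1, tag1), (word2, tag2) in zip(sent, sent[1:]):
            if tag2.startswith('N'):
                counts[tag1] += 1
    return counts


counts = tags_before_nouns(sents)
for tag in sorted(counts):
    print(tag + ': ' + str(counts[tag]))
```

On the real data the same function could be fed `brown.tagged_sents(...)` for the categories used earlier; "Voters" in the toy data has no predecessor, so "." is never counted as a tag before a noun.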


Exercise 16)


In [ ]: