Exercises from http://www.nltk.org/book_1ed/ch05.html

Author: Nirmal kumar Ravi

Tokenize and tag the following sentence: They wind back the clock, while we chase after the wind. What different pronunciations and parts of speech are involved?


In [11]:
import nltk

text = nltk.word_tokenize("They wind back the clock, while we chase after the wind")
tagged = nltk.pos_tag(text)
for txt, tag in tagged:
    print txt, tag
    # upenn_tagset() prints its description itself and returns None,
    # so we call it rather than print its return value
    nltk.help.upenn_tagset(tag)


They PRP
PRP: pronoun, personal
    hers herself him himself hisself it itself me myself one oneself ours
    ourselves ownself self she thee theirs them themselves they thou thy us
wind VBP
VBP: verb, present tense, not 3rd person singular
    predominate wrap resort sue twist spill cure lengthen brush terminate
    appear tend stray glisten obtain comprise detest tease attract
    emphasize mold postpone sever return wag ...
back RB
RB: adverb
    occasionally unabatingly maddeningly adventurously professedly
    stirringly prominently technologically magisterially predominately
    swiftly fiscally pitilessly ...
the DT
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
clock NN
NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
, ,
,: comma
    ,
while IN
IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...
we PRP
PRP: pronoun, personal
    hers herself him himself hisself it itself me myself one oneself ours
    ourselves ownself self she thee theirs them themselves they thou thy us
chase VBP
VBP: verb, present tense, not 3rd person singular
    predominate wrap resort sue twist spill cure lengthen brush terminate
    appear tend stray glisten obtain comprise detest tease attract
    emphasize mold postpone sever return wag ...
after IN
IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...
the DT
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
wind NN
NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
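  • Two different pronunciations and parts of speech are involved: the first 'wind' is tagged VBP (present-tense verb, pronounced /waɪnd/), while the second is tagged NN (common noun, pronounced /wɪnd/).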

Review the mappings in 5.4. Discuss any other examples of mappings you can think of. What type of information do they map from and to?

  • word => frequency: maps from a string (the word) to an integer (its count). This can be used in document classification; for example, a document about fish will likely use the word 'fish' more often than other documents do (a minimal sketch follows).
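
A minimal sketch of such a mapping (the toy document here is made up for illustration):

from collections import Counter

# toy document; a real classifier would count words over a whole corpus
words = "fish swim and fish eat and big fish eat small fish".split()
freq = Counter(words)   # maps word => frequency
print freq['fish']      # prints 4: a high count hints the text is about fish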

Using the Python interpreter in interactive mode, experiment with the dictionary examples in this chapter. Create a dictionary d, and add some entries. What happens if you try to access a non-existent entry, e.g. d['xyz']?


In [14]:
d = dict()
# insert entries
d['name'] = 'Nirmal kumar Ravi'
d['job'] = 'Software Developer'
# read entries back
print 'My name is %s and I work as a %s' % (d['name'], d['job'])
# accessing a non-existent key raises KeyError
try:
    print d['xyz']
except KeyError:
    print "We get a KeyError if we try d['xyz']"


My name is Nirmal kumar Ravi and I work as a Software Developer
We get a KeyError if we try d['xyz']
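
As an aside, the standard dict.get method avoids the exception entirely, returning None (or a supplied default) for missing keys:

print d.get('xyz')             # None instead of a KeyError
print d.get('xyz', 'unknown')  # or a default of our choosing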

Try deleting an element from a dictionary d, using the syntax del d['abc']. Check that the item was deleted.


In [15]:
del d['job'] 
if 'job' not in d:
    print 'deleted job'


deleted job
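
Relatedly, the standard d.pop(key[, default]) removes a key and returns its value in one step:

d['job'] = 'Software Developer'   # restore the entry for this demo
print d.pop('job')                # 'Software Developer', and the key is removed
print d.pop('job', None)          # already gone: returns the default, no KeyError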

Create two dictionaries, d1 and d2, and add some entries to each. Now issue the command d1.update(d2). What did this do? What might it be useful for?


In [19]:
d1 = {'wow': 3, 'cat': 2}
d2 = {'dog': 4, 'cat': 4, 'rat': 2}

d1.update(d2)
print d1


{'wow': 3, 'dog': 4, 'rat': 2, 'cat': 4}
  • d1.update(d2) copies every key/value pair from d2 into d1: new keys ('dog', 'rat') are added, and keys present in both ('cat') take d2's value
  • it can be used as a combine/merge operation, e.g. overlaying one dictionary on another, as sketched below
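
A hypothetical combine use: overlaying user settings on defaults, where the second dictionary wins on conflicts:

defaults = {'font': 'mono', 'size': 10}
user = {'size': 12}
defaults.update(user)   # keys in user override the defaults
print defaults          # {'size': 12, 'font': 'mono'} (key order may vary)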

Create a dictionary e, to represent a single lexical entry for some word of your choice. Define keys like headword, part-of-speech, sense, and example, and assign them suitable values.


In [1]:
e = {'headword': 'elephant',
     'pos': 'Noun',
     'sense': 'a very large plant-eating mammal with a trunk',
     'example': 'Elephants never forget'}
for k in sorted(e):   # sorted() makes the print order deterministic
    print k, e[k]


example Elephants never forget
headword elephant
pos Noun
sense a very large plant-eating mammal with a trunk
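
Such entries compose naturally into a small lexicon, i.e. a dictionary mapping headwords to entry dictionaries (the second entry here is made up for illustration):

lexicon = {'elephant': e}   # headword => lexical entry
lexicon['mouse'] = {'headword': 'mouse', 'pos': 'Noun',
                    'sense': 'a small rodent with a long tail',
                    'example': 'The mouse ran up the clock'}
print lexicon['elephant']['sense']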

Satisfy yourself that there are restrictions on the distribution of go and went, in the sense that they cannot be freely interchanged in the kinds of contexts illustrated in (3d) in 5.7.

  • I go to school every day.
  • I went to the coffee shop yesterday.
  • The two sentences above show that go and went cannot be freely interchanged: go is present tense and went is past tense, so a swapped version such as 'I go to the coffee shop yesterday' does not make sense.

Train a unigram tagger and run it on some new text. Observe that some words are not assigned a tag. Why not?


In [12]:
import nltk
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
sent = "IN the movie “The Big Short,” Steve Carell plays a slightly altered version of me. In real life, I am a portfolio manager and financial services analyst who over a 25-year career has, at times, been highly critical of bank behavior"
word_tags = unigram_tagger.tag(nltk.word_tokenize(sent))

[w for w, t in word_tags if t is None]


Out[12]:
['IN',
 '\xe2\x80\x9cThe',
 'Short',
 '\xe2\x80\x9d',
 'Carell',
 'portfolio',
 'analyst',
 '25-year']
  • These words are not tagged because they never occur in the training data (Brown news), so the unigram tagger has no statistics for them and returns None; a backoff tagger fixes this, as sketched below
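
One remedy covered in this chapter is to give the unigram tagger a backoff tagger, so unseen words at least receive a default guess instead of None (a sketch, reusing brown_tagged_sents and sent from above):

# any word the unigram tagger has never seen falls back to 'NN'
backoff_tagger = nltk.UnigramTagger(brown_tagged_sents,
                                    backoff=nltk.DefaultTagger('NN'))
print backoff_tagger.tag(nltk.word_tokenize(sent))[:6]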

Learn about the affix tagger (type help(nltk.AffixTagger)). Train an affix tagger and run it on some new text. Experiment with different settings for the affix length and the minimum word length. Discuss your findings.


In [15]:
affix_tagger = nltk.AffixTagger(train=brown_tagged_sents, affix_length=1, min_stem_length=3)
# note: tag with the affix tagger itself, not the unigram tagger from above
affix_tags = affix_tagger.tag(nltk.word_tokenize(sent))
print affix_tags


  • An affix tagger tags each word from a fixed-length leading or trailing substring: a positive affix_length takes a prefix, a negative one a suffix (the default is the three-letter suffix)
  • It is trained on tagged sentences, here the Brown news corpus
  • With affix_length=1 and min_stem_length=3, any word shorter than four characters is left untagged (None), as is any word whose affix never appeared in training; a single leading letter is a weak cue, while longer suffixes such as the default capture endings like -ing and -ed and tag more words correctly (compared quantitatively below)
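
To compare settings quantitatively, one can hold out some sentences and score each configuration; a sketch, using the same 90/10 split as the next exercise:

size = int(len(brown_tagged_sents) * 0.9)
train, test = brown_tagged_sents[:size], brown_tagged_sents[size:]
for affix_len in (1, -1, -2, -3, -4):
    t = nltk.AffixTagger(train, affix_length=affix_len, min_stem_length=2)
    print affix_len, t.evaluate(test)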

Train a bigram tagger with no backoff tagger, and run it on some of the training data. Next, run it on some new data. What happens to the performance of the tagger? Why?


In [22]:
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
bigram_tagger = nltk.BigramTagger(train_sents)
print '%f accuracy on train' % bigram_tagger.evaluate(train_sents)
print '%f accuracy on test' % bigram_tagger.evaluate(test_sents)


0.786436 accuracy on train
0.102761 accuracy on test
  • The tagger performs far better on the training data than on the test data.
  • This is expected: on train_sents the tagger re-sees the very contexts it memorized. On new data it soon meets a word or context it never saw and assigns None, and since each bigram decision conditions on the previous tag, one None leaves the rest of the sentence untagged as well, so accuracy collapses to about 10%. The standard fix, chaining backoff taggers, is sketched below.
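
The chapter's remedy is to combine taggers, letting the bigram tagger handle only contexts it knows and backing off otherwise (this is the book's own t0/t1/t2 combination):

t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
print '%f accuracy on test' % t2.evaluate(test_sents)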

We can use a dictionary to specify the values to be substituted into a formatting string. Read Python's library documentation for formatting strings http://docs.python.org/lib/typesseq-strings.html and use this method to display today's date in two different formats.


In [27]:
date = {'day': '2', 'month': '7', 'year': '2016'}
print 'Day %(day)s Month %(month)s Year %(year)s' % date
print '%(day)s/%(month)s/%(year)s' % date
print '%(month)s/%(day)s/%(year)s' % date


Day 2 Month 7 Year 2016
2/7/2016
7/2/2016
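
To show genuinely today's date rather than a hard-coded one, the same dictionary can be filled from the standard datetime module:

import datetime

today = datetime.date.today()
parts = {'day': today.day, 'month': today.month, 'year': today.year}
print 'Day %(day)s Month %(month)s Year %(year)s' % parts
print '%(month)s/%(day)s/%(year)s' % parts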

Use sorted() and set() to get a sorted list of tags used in the Brown corpus, removing duplicates.


In [35]:
print sorted(set(brown.tagged_words()))[:20]


[(u'!', u'.'), (u'!', u'.-HL'), (u'$.027', u'NNS'), (u'$.03', u'NNS'), (u'$.03', u'NNS-HL'), (u'$.054/mbf', u'NNS'), (u'$.07', u'NNS'), (u'$.07/cwt', u'NNS'), (u'$.076', u'NNS'), (u'$.09', u'NNS'), (u'$.10-a-minute', u'NN-HL'), (u'$.105', u'NNS'), (u'$.12', u'NNS'), (u'$.30', u'NNS'), (u'$.30/mbf', u'NNS'), (u'$.50', u'NN'), (u'$.50', u'NNS'), (u'$.65', u'NNS'), (u'$.75', u'NNS'), (u'$.80', u'NNS')]
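  • Note that set(brown.tagged_words()) yields unique (word, tag) pairs, as the output shows; the tags themselves, which the exercise asks for, are obtained with sorted(set(tag for (word, tag) in brown.tagged_words()))

A natural follow-up is to count how often each tag occurs, e.g. with nltk.FreqDist (a sketch, reusing the imports above):

tag_fd = nltk.FreqDist(tag for (word, tag) in brown.tagged_words(categories='news'))
print tag_fd.max()   # the single most frequent tag in the news category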