Tokenize and tag the following sentence: They wind back the clock, while we chase after the wind. What different pronunciations and parts of speech are involved?
In [11]:
import nltk
text = nltk.word_tokenize("They wind back the clock, while we chase after the wind")
tagged = nltk.pos_tag(text)
for txt, tag in tagged:
    print txt, tag
    nltk.help.upenn_tagset(tag)  # prints the tag's definition and examples
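Both occurrences of wind are spelled alike but differ in part of speech and pronunciation: in "They wind back the clock" it is a present-tense verb, pronounced /waɪnd/ (rhymes with find), while in "chase after the wind" it is a noun, pronounced /wɪnd/ (rhymes with pinned). The tagger captures only the part-of-speech difference (e.g. a verb tag such as VBP versus the noun tag NN); pronunciation is not represented in the tags.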
Review the mappings in 5.4. Discuss any other examples of mappings you can think of. What type of information do they map from and to?
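Other familiar mappings include a phone book (names to phone numbers), a book index (terms to page numbers), and a pronouncing dictionary (spellings to phone sequences). A minimal illustration with made-up entries:
phonebook = {'Alice': '555-0101', 'Bob': '555-0199'}   # name -> phone number
index = {'tagger': [12, 45], 'corpus': [3, 7, 45]}      # term -> list of page numbers
print phonebook['Alice'], index['corpus']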
Using the Python interpreter in interactive mode, experiment with the dictionary examples in this chapter. Create a dictionary d, and add some entries. What happens if you try to access a non-existent entry, e.g. d['xyz']?
In [14]:
d = dict()
#inserts
d['name'] = 'Nirmal kumar Ravi'
d['job'] = 'Software Developer'
#read
print 'My name is %s and I work as a %s' % (d['name'], d['job'])
#access a non-existent key
try:
    print d['intrestedin']
except KeyError:
    print "accessing d['intrestedin'] raises a KeyError because the key does not exist"
Try deleting an element from a dictionary d, using the syntax del d['abc']. Check that the item was deleted.
In [15]:
del d['job']
if 'job' not in d:
    print 'deleted job'
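As with item access, del d['abc'] raises a KeyError if the key is absent; d.pop('abc', None) removes it without complaint when it may be missing.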
Create two dictionaries, d1 and d2, and add some entries to each. Now issue the command d1.update(d2). What did this do? What might it be useful for?
In [19]:
d1 = {'wow': 3, 'cat':2}
d2 = {'dog':4, 'cat':4,'rat':2}
d1.update(d2)
print d1
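d1.update(d2) copies every key-value pair from d2 into d1: 'dog' and 'rat' are added, and the existing 'cat' entry is overwritten with d2's value of 4, giving {'wow': 3, 'cat': 4, 'dog': 4, 'rat': 2} (key order may differ). This is useful for merging dictionaries, for example combining default settings with user-supplied overrides, or applying a batch of changes to an existing dictionary.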
Create a dictionary e, to represent a single lexical entry for some word of your choice. Define keys like headword, part-of-speech, sense, and example, and assign them suitable values.
In [1]:
e = {'headword': 'elephant',
     'part-of-speech': 'Noun',
     'sense': 'a very large plant-eating mammal with a trunk',
     'example': 'Elephants never forget'}
for k in e:
    print k, e[k]
Satisfy yourself that there are restrictions on the distribution of go and went, in the sense that they cannot be freely interchanged in the kinds of contexts illustrated in (3d) in 5.7.
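go and went differ in tense, so each fits contexts the other cannot: They will go home but not *They will went home, and We went yesterday but not *We go yesterday (on the past-time reading). A quick, rough check is to look at the tags the Brown corpus assigns to the two forms; go turns up with base-form tags like VB, went with past-tense tags like VBD:
import nltk
from nltk.corpus import brown
go_tags = set(t for w, t in brown.tagged_words() if w == 'go')
went_tags = set(t for w, t in brown.tagged_words() if w == 'went')
print sorted(go_tags)
print sorted(went_tags)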
Train a unigram tagger and run it on some new text. Observe that some words are not assigned a tag. Why not?
In [12]:
import nltk
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
sent = "IN the movie “The Big Short,” Steve Carell plays a slightly altered version of me. In real life, I am a portfolio manager and financial services analyst who over a 25-year career has, at times, been highly critical of bank behavior"
word_tags = unigram_tagger.tag(nltk.word_tokenize(sent))
print [w for w, t in word_tags if t is None]
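The words left untagged are exactly the ones that never occur in the Brown news training data. A unigram tagger simply memorises the most frequent tag for each word it has seen, so out-of-vocabulary items (for example the proper name Carell) receive the tag None. Supplying a backoff tagger, such as a default NN tagger, is the usual remedy.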
Learn about the affix tagger (type help(nltk.AffixTagger)). Train an affix tagger and run it on some new text. Experiment with different settings for the affix length and the minimum word length. Discuss your findings.
In [15]:
affix_tagger = nltk.AffixTagger(train=brown_tagged_sents, affix_length=1, min_stem_length=3)
affix_tags = affix_tagger.tag(nltk.word_tokenize(sent))
print affix_tags
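To compare settings, one rough sketch (a 90/10 split of brown_tagged_sents is assumed; negative affix_length values use word suffixes, positive values use prefixes, and the exact accuracies will vary):
size = int(len(brown_tagged_sents) * 0.9)
train, test = brown_tagged_sents[:size], brown_tagged_sents[size:]
for affix_len in (-3, -2, 2, 3):
    for min_stem in (2, 3, 5):
        t = nltk.AffixTagger(train, affix_length=affix_len, min_stem_length=min_stem)
        print affix_len, min_stem, t.evaluate(test)
In general suffixes tend to be more informative than prefixes for English, since most inflection (-ed, -ing, -s) appears at the ends of words.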
Train a bigram tagger with no backoff tagger, and run it on some of the training data. Next, run it on some new data. What happens to the performance of the tagger? Why?
In [22]:
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
bigram_tagger = nltk.BigramTagger(train_sents)
print '%f accuracy on train' % bigram_tagger.evaluate(train_sents)
print '%f accuracy on test' % bigram_tagger.evaluate(test_sents)
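Accuracy is high on the training sentences but drops sharply on new data. A bigram tagger conditions on the pair (previous tag, current word), and most such pairs in unseen text never occurred in training, so the tagger assigns None; once one word is untagged, the context of every following word is also novel, and whole stretches of the sentence go untagged. This data sparseness is why bigram taggers are normally combined with a backoff tagger.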
We can use a dictionary to specify the values to be substituted into a formatting string. Read Python's library documentation for formatting strings http://docs.python.org/lib/typesseq-strings.html and use this method to display today's date in two different formats.
In [27]:
d = {'day': '2', 'month': '7', 'year': '2016'}
print 'Day %(day)s Month %(month)s Year %(year)s' % d
print '%(day)s/%(month)s/%(year)s' % d
print '%(month)s/%(day)s/%(year)s' % d
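To use the actual current date rather than hard-coded values, a small sketch with Python's standard datetime module:
import datetime
today = datetime.date.today()
d = {'day': today.day, 'month': today.month, 'year': today.year}
print '%(day)02d/%(month)02d/%(year)d' % d
print 'Day %(day)d Month %(month)d Year %(year)d' % d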
Use sorted() and set() to get a sorted list of tags used in the Brown corpus, removing duplicates.
In [35]:
tags = sorted(set(tag for word, tag in brown.tagged_words()))
print tags[:20]  # first 20 entries of the sorted, de-duplicated tag list