Exercise from http://www.nltk.org/book_1ed/ch07.html

Author : Nirmal kumar Ravi

The IOB format categorizes tagged tokens as I, O and B. Why are three tags necessary? What problem would be caused if we used I and O tags exclusively?

IOB stands for (I)nside, (O)utside and (B)eginning.
B is essential because it marks the beginning of a chunk.
Without a B tag, it would not be possible to tell where one chunk ends and the next begins when two chunks appear next to each other: a run of I tags could be one chunk or several.
In other words, B indicates the start of a new chunk.
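
The ambiguity can be shown with a small decoder. This is a minimal sketch in plain Python, not NLTK code; the function name `chunks_from_tags` and the sample sentence are made up for illustration.

```python
def chunks_from_tags(tokens, tags):
    """Group tokens into chunks from a flat tag sequence.

    'B' always opens a new chunk, 'I' continues the open chunk (or opens
    one if none is open), and 'O' closes any open chunk.
    """
    chunks, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B" or (tag == "I" and not current):
            if current:
                chunks.append(current)
            current = [token]
        elif tag == "I":
            current.append(token)
        else:  # 'O'
            if current:
                chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks

tokens = ["gave", "the", "dog", "a", "bone"]
iob_tags = ["O", "B", "I", "B", "I"]   # two adjacent chunks
io_tags = ["O", "I", "I", "I", "I"]    # the same chunks with B replaced by I

print(chunks_from_tags(tokens, iob_tags))  # [['the', 'dog'], ['a', 'bone']]
print(chunks_from_tags(tokens, io_tags))   # [['the', 'dog', 'a', 'bone']]
```

With only I and O, the two adjacent chunks collapse into one.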

Write a tag pattern to match noun phrases containing plural head nouns, e.g. "many/JJ researchers/NNS", "two/CD weeks/NNS", "both/DT new/JJ positions/NNS". Try to do this by generalizing the tag pattern that handled singular noun phrases.



In [14]:

    
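import nltk
# One or more determiners, numerals or adjectives, then a plural head noun.
# (The looser <[CDJT].*> would also match tags such as CC and TO.)
grammar = r"PHN: {<DT|CD|JJ.*>+<NNS>}"
cp = nltk.RegexpParser(grammar)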
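
The pattern can be checked against the three example phrases from the exercise. This is a minimal sketch, assuming the tag pattern `<DT|CD|JJ.*>+<NNS>` (determiners, numerals or adjectives before a plural head noun); other patterns would also be reasonable answers.

```python
import nltk

# Assumed pattern: determiners, numerals or adjectives, then a plural noun
cp = nltk.RegexpParser(r"PHN: {<DT|CD|JJ.*>+<NNS>}")

examples = [
    [("many", "JJ"), ("researchers", "NNS")],
    [("two", "CD"), ("weeks", "NNS")],
    [("both", "DT"), ("new", "JJ"), ("positions", "NNS")],
]

# Each phrase should come back as a single PHN chunk under the root S
trees = [cp.parse(sent) for sent in examples]
for tree in trees:
    print(tree)
```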

Pick one of the three chunk types in the CoNLL corpus. Inspect the CoNLL corpus and try to observe any patterns in the POS tag sequences that make up this kind of chunk. Develop a simple chunker using the regular expression chunker nltk.RegexpParser. Discuss any tag sequences that are difficult to chunk reliably.



In [15]:

    
from nltk.corpus import conll2000
# Show one training sentence with only its NP chunks marked
print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[100])

(S (NP He/PRP) talked/VBD (NP about/IN 20/CD minutes/NNS) ./.)

We have restricted the chunks to noun phrases (chunk_types=['NP']).
The sentence is "He talked about 20 minutes."
It contains two NP chunks: "He" and "about 20 minutes".
The POS tags are PRP, VBD, IN, CD, NNS.
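
To observe patterns across many sentences rather than one, the tag sequence of each NP chunk can be counted. This is a rough sketch in plain Python; the helper `np_tag_sequences` is hypothetical and expects sentences as (word, pos, iob) triples, the shape that `nltk.chunk.tree2conlltags` returns for a chunk tree.

```python
from collections import Counter

def np_tag_sequences(iob_sents):
    """Count the POS tag sequence of every NP chunk.

    Each sentence is a list of (word, pos, iob) triples, where iob is
    'B-NP', 'I-NP' or 'O'.
    """
    counts = Counter()
    for sent in iob_sents:
        current = []
        for word, pos, iob in sent:
            if iob == "B-NP":
                if current:
                    counts[tuple(current)] += 1
                current = [pos]
            elif iob == "I-NP" and current:
                current.append(pos)
            else:
                if current:
                    counts[tuple(current)] += 1
                current = []
        if current:
            counts[tuple(current)] += 1
    return counts

# The sentence shown above, written as IOB triples
sample = [[("He", "PRP", "B-NP"), ("talked", "VBD", "O"),
           ("about", "IN", "B-NP"), ("20", "CD", "I-NP"),
           ("minutes", "NNS", "I-NP"), (".", ".", "O")]]
counts = np_tag_sequences(sample)
print(counts.most_common())
```

Run over the whole training set, the most frequent sequences suggest which tag patterns a regular expression chunker should cover first.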



In [16]:

    
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
# Chunk any run of tags starting with C, D, J, N or P (e.g. CD, DT, JJ, NN, PRP)
grammar = r"NP: {<[CDJNP].*>+}"
cp = nltk.RegexpParser(grammar)
print(cp.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  87.7%
    Precision:     70.6%
    Recall:        67.8%
    F-Measure:     69.2%

An early definition of chunk was the material that occurs between chinks. Develop a chunker that starts by putting the whole sentence in a single chunk, and then does the rest of its work solely by chinking. Determine which tags (or tag sequences) are most likely to make up chinks with the help of your own utility program. Compare the performance and simplicity of this approach relative to a chunker based entirely on chunk rules.

Here we are asked to develop a chunker that works by chinking instead of chunking.
For example, the sentence "We saw the yellow dog" has two NP chunks: "We" and "the yellow dog".
So in chinking we start with the whole sentence as one chunk and remove (chink) the verb "saw", which leaves the two NP chunks.



In [21]:

    
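sentence = [("We","PRP"),("saw","VBD"),("the","DT"),("yellow","JJ"),("dog","NN")]
# Put the whole sentence in one chunk, then chink out verbs and prepositions
grammar = r"""
NP: {<.*>+}        # chunk everything
    }<VBD|IN>+{    # chink sequences of VBD and IN
"""
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print(result)
result.draw()

(S (NP We/PRP) saw/VBD (NP the/DT yellow/JJ dog/NN))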
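
The exercise also asks for a utility program to find which tags usually make up chinks. One possible sketch in plain Python counts the POS tags of tokens that lie outside every chunk. The helper `chink_tag_counts` is hypothetical and assumes (word, pos, iob) triples as input, such as those produced by `nltk.chunk.tree2conlltags`.

```python
from collections import Counter

def chink_tag_counts(iob_sents):
    """Count the POS tags of tokens outside every chunk (IOB tag 'O')."""
    return Counter(pos
                   for sent in iob_sents
                   for word, pos, iob in sent
                   if iob == "O")

# The example sentence with its gold NP chunks marked
sample = [[("We", "PRP", "B-NP"), ("saw", "VBD", "O"),
           ("the", "DT", "B-NP"), ("yellow", "JJ", "I-NP"),
           ("dog", "NN", "I-NP")]]
counts = chink_tag_counts(sample)
print(counts.most_common(5))
```

The most frequent tags in this count over a training corpus are the natural candidates for the chink rule.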

Write a tag pattern to cover noun phrases that contain gerunds, e.g. "the/DT receiving/VBG end/NN", "assistant/NN managing/VBG editor/NN". Add these patterns to the grammar, one per line. Test your work using some tagged sentences of your own devising.



In [24]:

    
grammar = r"""
GRD: {<DT><VBG><NN>}   # e.g. the/DT receiving/VBG end/NN
     {<NN><VBG><NN>}   # e.g. assistant/NN managing/VBG editor/NN
"""
cp = nltk.RegexpParser(grammar)
brown = nltk.corpus.brown
# Parse only the first sentence to avoid printing the entire Brown corpus
result = cp.parse(brown.tagged_sents()[0])
print(result)

(S
  The/AT
  Fulton/NP-TL
  County/NN-TL
  Grand/JJ-TL
  Jury/NN-TL
  said/VBD
  Friday/NR
  an/AT
  investigation/NN
  of/IN
  Atlanta's/NP$
  recent/JJ
  primary/NN
  election/NN
  produced/VBD
  ``/``
  no/AT
  evidence/NN
  ''/''
  that/CS
  any/DTI
  irregularities/NNS
  took/VBD
  place/NN
  ./.)
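
The first Brown sentence contains no gerund, so it does not exercise the pattern. The sketch below tries two hand-made sentences instead. It assumes one rule per example pattern (`<DT><VBG><NN>` and `<NN><VBG><NN>`); the sentences and the helper `gerund_chunks` are invented for illustration.

```python
import nltk

grammar = r"""
GRD: {<DT><VBG><NN>}   # the/DT receiving/VBG end/NN
     {<NN><VBG><NN>}   # assistant/NN managing/VBG editor/NN
"""
cp = nltk.RegexpParser(grammar)

# Hand-made test sentences (illustrative, not taken from a corpus)
sent1 = [("She", "PRP"), ("was", "VBD"), ("on", "IN"),
         ("the", "DT"), ("receiving", "VBG"), ("end", "NN")]
sent2 = [("The", "DT"), ("assistant", "NN"), ("managing", "VBG"),
         ("editor", "NN"), ("resigned", "VBD")]

def gerund_chunks(tagged_sent):
    """Return the words of each GRD chunk found in a tagged sentence."""
    tree = cp.parse(tagged_sent)
    return [[word for word, tag in sub.leaves()]
            for sub in tree.subtrees(lambda t: t.label() == "GRD")]

print(gerund_chunks(sent1))
print(gerund_chunks(sent2))
```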

Write one or more tag patterns to handle coordinated noun phrases, e.g. "July/NNP and/CC August/NNP", "all/DT your/PRP$ managers/NNS and/CC supervisors/NNS", "company/NN courts/NNS and/CC adjudicators/NNS".

The pattern is one or more nouns (NN, NNS, NNP, ...), followed by a coordinating conjunction (CC) such as "and", followed by one or more nouns.



In [29]:

    
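sentence = [("July","NNP"),("and","CC"),("August","NNP")]
# Optional determiner and possessive pronoun, then nouns, a conjunction, and nouns.
# <NN.*> matches NN, NNS, NNP and NNPS; the original <NN[P|S]> missed plain NN.
grammar = r"CNP: {<DT>?<PRP\$>?<NN.*>+<CC><NN.*>+}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print(result)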

(S (CNP July/NNP and/CC August/NNP))

Carry out the following evaluation tasks for any of the chunkers you have developed earlier. (Note that most chunking corpora contain some internal inconsistencies, such that any reasonable rule-based approach will produce errors.) Evaluate your chunker on 100 sentences from a chunked corpus, and report the precision, recall and F-measure. Use the chunkscore.missed() and chunkscore.incorrect() methods to identify the errors made by your chunker. Discuss. Compare the performance of your chunker to the baseline chunker discussed in the evaluation section of this chapter.



In [36]:

    
from nltk.corpus import conll2000
# Same chunker as before: any run of tags starting with C, D, J, N or P
grammar = r"NP: {<[CDJNP].*>+}"
cp = nltk.RegexpParser(grammar)
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print(cp.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  87.7%
    Precision:     70.6%
    Recall:        67.8%
    F-Measure:     69.2%
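
Precision, recall and F-measure come from comparing the set of guessed chunks with the set of gold chunks, and the missed and incorrect chunks are the two set differences. The sketch below shows that arithmetic in plain Python. It is not NLTK's ChunkScore implementation; the function `score_chunks` and the example spans are hypothetical.

```python
def score_chunks(gold, guessed):
    """Compare two collections of chunks, each chunk a (start, end) token span."""
    gold, guessed = set(gold), set(guessed)
    hits = gold & guessed
    precision = len(hits) / len(guessed) if guessed else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "missed": sorted(gold - guessed),      # gold chunks not found
        "incorrect": sorted(guessed - gold),   # guessed chunks not in gold
    }

# Hypothetical spans for "He talked about 20 minutes ."
gold = [(0, 1), (2, 5)]       # [He] talked [about 20 minutes]
guessed = [(0, 1), (3, 5)]    # [He] talked about [20 minutes]
result = score_chunks(gold, guessed)
print(result)

# Check the reported scores: F = 2PR / (P + R)
p, r = 0.706, 0.678
print(round(2 * p * r / (p + r), 3))  # 0.692
```

The last line confirms that the reported F-measure of 69.2% is the harmonic mean of the reported precision and recall.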



In [31]:

    
# Baseline chunker: an empty grammar creates no chunks at all
from nltk.corpus import conll2000
grammar = ""
cp = nltk.RegexpParser(grammar)
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print(cp.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  43.4%
    Precision:      0.0%
    Recall:         0.0%
    F-Measure:      0.0%

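The baseline chunker creates no chunks, so its precision, recall and F-measure are all 0%; its 43.4% IOB accuracy only reflects the tokens that are correctly tagged O. Compared to this baseline, our chunker performs much better: 87.7% IOB accuracy and an F-measure of 69.2%.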