The IOB format categorizes tagged tokens as I, O and B. Why are three tags necessary? What problem would be caused if we used I and O tags exclusively?
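One way to see the problem is to decode the same sentence twice, once with B tags and once with every B collapsed to I. The sketch below is a plain-Python illustration, not NLTK code; the helper `chunks_from_tags` and the sample sentence are made up for this purpose. With only I and O, two adjacent chunks can no longer be told apart.

```python
def chunks_from_tags(tokens, tags):
    """Group tokens into chunks from a parallel list of B/I/O tags."""
    chunks, current = [], []
    for token, tag in zip(tokens, tags):
        # An O tag closes the open chunk; a B tag closes it and starts a new one.
        if tag in ("O", "B") and current:
            chunks.append(current)
            current = []
        if tag in ("B", "I"):
            current.append(token)
    if current:
        chunks.append(current)
    return chunks

tokens = ["gave", "the", "dog", "a", "bone"]
iob_tags = ["O", "B", "I", "B", "I"]   # two adjacent noun phrases
io_tags = ["O", "I", "I", "I", "I"]    # same sentence with B collapsed to I

with_b = chunks_from_tags(tokens, iob_tags)
without_b = chunks_from_tags(tokens, io_tags)
print(with_b)     # [['the', 'dog'], ['a', 'bone']]
print(without_b)  # [['the', 'dog', 'a', 'bone']]
```

Without B, "the dog" and "a bone" merge into a single chunk, because nothing marks where the second chunk begins.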
Write a tag pattern to match noun phrases containing plural head nouns, e.g. "many/JJ researchers/NNS", "two/CD weeks/NNS", "both/DT new/JJ positions/NNS". Try to do this by generalizing the tag pattern that handled singular noun phrases.
In [14]:
import nltk
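# Generalize the singular pattern <DT>?<JJ.*>*<NN> to plural head nouns.
# A class such as <[CDJT].*> would also match CC and TO, so list the tags explicitly.
grammar = r"PHN: {<DT>?<CD>?<JJ.*>*<NNS>}"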
cp = nltk.RegexpParser(grammar)
Pick one of the three chunk types in the CoNLL corpus. Inspect the CoNLL corpus and try to observe any patterns in the POS tag sequences that make up this kind of chunk. Develop a simple chunker using the regular expression chunker nltk.RegexpParser. Discuss any tag sequences that are difficult to chunk reliably.
In [15]:
from nltk.corpus import conll2000
print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[100])
In [16]:
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
grammar = r"NP: {<[CDJNP].*>+}"
cp = nltk.RegexpParser(grammar)
print(cp.evaluate(test_sents))
An early definition of chunk was the material that occurs between chinks. Develop a chunker that starts by putting the whole sentence in a single chunk, and then does the rest of its work solely by chinking. Determine which tags (or tag sequences) are most likely to make up chinks with the help of your own utility program. Compare the performance and simplicity of this approach relative to a chunker based entirely on chunk rules.
In [21]:
sentence = [("We","PRP"),("saw","VBD"),("the","DT"),("yellow","JJ"),("dog","NN")]
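# Put the whole sentence in one chunk, then remove the verb by chinking.
grammar = r"""
  NP:
    {<.*>+}    # chunk everything
    }<VBD>+{   # chink past-tense verbs
"""
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print(result)
result.draw()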
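The exercise also asks for a utility program that finds which tags tend to make up chinks. A minimal sketch follows, assuming the data is available as (word, POS tag, IOB tag) triples, which is the shape `nltk.chunk.tree2conlltags` returns. The sample sentence and the helper name `chink_tag_counts` are hypothetical; on real data the triples would come from the CoNLL training sentences.

```python
from collections import Counter

# Hypothetical hand-tagged sample; "O" marks tokens outside every chunk.
tagged = [
    ("We", "PRP", "B-NP"), ("saw", "VBD", "O"),
    ("the", "DT", "B-NP"), ("yellow", "JJ", "I-NP"), ("dog", "NN", "I-NP"),
    ("in", "IN", "O"),
    ("the", "DT", "B-NP"), ("park", "NN", "I-NP"),
    (".", ".", "O"),
]

def chink_tag_counts(triples):
    """Count the POS tags of tokens that fall outside any chunk."""
    return Counter(pos for _word, pos, iob in triples if iob == "O")

counts = chink_tag_counts(tagged)
print(counts.most_common())
```

Tags that dominate this count are the natural candidates for chink rules such as `}<VBD|IN>+{`.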
Write a tag pattern to cover noun phrases that contain gerunds, e.g. "the/DT receiving/VBG end/NN", "assistant/NN managing/VBG editor/NN". Add these patterns to the grammar, one per line. Test your work using some tagged sentences of your own devising.
In [24]:
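# One pattern per line; the gerund tag is VBG.
grammar = r"""
  GRD:
    {<DT><VBG><NN>}   # the/DT receiving/VBG end/NN
    {<NN><VBG><NN>}   # assistant/NN managing/VBG editor/NN
"""
cp = nltk.RegexpParser(grammar)
brown = nltk.corpus.brown
# Note: the Brown tagset marks articles as AT, not DT.
for sent in brown.tagged_sents():
    result = cp.parse(sent)
    print(result)
    break  # print only the first sentence, not the entire Brown corpus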
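The exercise asks for a test on hand-made tagged sentences, which needs no corpus download. The sketch below assumes Penn Treebank tags and uses two invented sentences built around the examples in the exercise; the helper `grd_chunks` is hypothetical and just collects the words of each `GRD` chunk.

```python
import nltk

grammar = r"""
  GRD:
    {<DT><VBG><NN>}   # the/DT receiving/VBG end/NN
    {<NN><VBG><NN>}   # assistant/NN managing/VBG editor/NN
"""
cp = nltk.RegexpParser(grammar)

# Invented test sentences, tagged by hand.
sentences = [
    [("He", "PRP"), ("was", "VBD"), ("on", "IN"),
     ("the", "DT"), ("receiving", "VBG"), ("end", "NN")],
    [("The", "DT"), ("assistant", "NN"), ("managing", "VBG"),
     ("editor", "NN"), ("resigned", "VBD")],
]

def grd_chunks(tree):
    """Return the words of every GRD chunk found in a parsed sentence."""
    return [[word for word, _tag in subtree.leaves()]
            for subtree in tree.subtrees() if subtree.label() == "GRD"]

results = [grd_chunks(cp.parse(sent)) for sent in sentences]
print(results)
```

In the second sentence the determiner "The" stays outside the chunk, since neither pattern allows a determiner before the first noun; that is the kind of gap such hand-made tests are meant to expose.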
Write one or more tag patterns to handle coordinated noun phrases, e.g. "July/NNP and/CC August/NNP", "all/DT your/PRP$ managers/NNS and/CC supervisors/NNS", "company/NN courts/NNS and/CC adjudicators/NNS".
In [29]:
sentence = [("July","NNP"),("and","CC"),("August","NNP")]
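# Optional determiner and possessive pronoun, then nouns joined by a conjunction.
# <NN.*> covers NN, NNS, NNP and NNPS; [P|S] would only match one of P, | or S.
grammar = r"CNP: {<DT>?<PRP\$>?<NN.*>+<CC><NN.*>+}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print(result)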
Carry out the following evaluation tasks for any of the chunkers you have developed earlier. (Note that most chunking corpora contain some internal inconsistencies, such that any reasonable rule-based approach will produce errors.) Evaluate your chunker on 100 sentences from a chunked corpus, and report the precision, recall and F-measure. Use the chunkscore.missed() and chunkscore.incorrect() methods to identify the errors made by your chunker. Discuss. Compare the performance of your chunker to the baseline chunker discussed in the evaluation section of this chapter.
In [36]:
from nltk.corpus import conll2000
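grammar = r"NP: {<[CDJNP].*>+}"
cp = nltk.RegexpParser(grammar)
# The exercise asks for 100 sentences, so evaluate on a slice of the test set.
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])[:100]
# The returned ChunkScore reports precision, recall and F-measure;
# its missed() and incorrect() methods list the errors.
chunkscore = cp.evaluate(test_sents)
print(chunkscore)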
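To make the reported figures concrete, here is a small plain-Python sketch of how precision, recall and F-measure follow from the gold and guessed chunk sets. It is an illustration of the standard definitions, not NLTK's implementation; the function `prf` and the (start, end) spans are invented for the example.

```python
def prf(gold, guessed):
    """Precision, recall and F-measure over two sets of chunks."""
    correct = len(gold & guessed)
    precision = correct / len(guessed) if guessed else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Hypothetical chunks, each identified by its (start, end) token offsets.
gold = {(0, 1), (2, 5), (6, 8)}
guessed = {(0, 1), (2, 4), (6, 8), (9, 10)}

precision, recall, f_measure = prf(gold, guessed)
missed = gold - guessed      # gold chunks the chunker failed to find
incorrect = guessed - gold   # guessed chunks that are not in the gold standard
print(precision, recall, f_measure)
print(sorted(missed), sorted(incorrect))
```

Here two of the four guessed chunks are right (precision 0.5) and two of the three gold chunks are found (recall 2/3), so a chunk with a wrong boundary counts against both measures.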
In [31]:
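# Baseline chunker: an empty grammar creates no chunks at all
from nltk.corpus import conll2000
grammar = ""
cp = nltk.RegexpParser(grammar)
# Evaluate on 100 sentences, as the exercise asks
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])[:100]
print(cp.evaluate(test_sents))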