Read up on one of the language technologies mentioned in this section, such as word sense disambiguation, semantic role labeling, question answering, machine translation, or named entity detection. Find out what type and quantity of annotated data are required for developing such systems. Why do you think a large amount of data is required?
Using any of the three classifiers described in this chapter, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?
In [6]:
# import data: labeled male and female first names from the NLTK Names Corpus
from nltk.corpus import names
import random

# use a distinct variable name so the corpus module isn't shadowed
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)
In [41]:
# classifier 1: use just the last letter of the name as the only feature
def gender_features(word):
    return {'last_letter': word[-1]}
In [42]:
# build and train a model using classifier 1's features
import nltk

featuresets = [(gender_features(n), g) for (n, g) in labeled_names]
test_set, dev_test_set, train_set = (featuresets[:500],
                                     featuresets[500:1000],
                                     featuresets[1000:])
classifier = nltk.NaiveBayesClassifier.train(train_set)
In [43]:
print(nltk.classify.accuracy(classifier, dev_test_set))
classifier.show_most_informative_features(5)
In [44]:
print(nltk.classify.accuracy(classifier, test_set))
In [37]:
# classifier 2: use the character bigrams of the name as features
def gender_features(word):
    features = {}
    for i, (c1, c2) in enumerate(nltk.bigrams(word)):
        features['b' + str(i)] = (c1 + c2).lower()
    return features

gender_features('Nirmal')
Out[37]:
{'b0': 'ni', 'b1': 'ir', 'b2': 'rm', 'b3': 'ma', 'b4': 'al'}
In [38]:
# build and train a model using classifier 2's features
import nltk

featuresets = [(gender_features(n), g) for (n, g) in labeled_names]
test_set, dev_test_set, train_set = (featuresets[:500],
                                     featuresets[500:1000],
                                     featuresets[1000:])
classifier = nltk.NaiveBayesClassifier.train(train_set)
In [39]:
print(nltk.classify.accuracy(classifier, dev_test_set))
classifier.show_most_informative_features(5)
In [40]:
print(nltk.classify.accuracy(classifier, test_set))
The Senseval 2 Corpus contains data intended to train word-sense disambiguation classifiers. It contains data for four words: hard, interest, line, and serve. Choose one of these four words and load the corresponding data, as in the next cell. Using this dataset, build a classifier that predicts the correct sense tag for a given instance. See the corpus HOWTO at http://www.nltk.org/howto for information on using the instance objects returned by the Senseval 2 Corpus.
In [45]:
# load the Senseval 2 data for 'hard' and hold out 10% of the instances
from nltk.corpus import senseval
instances = senseval.instances('hard.pos')
size = int(len(instances) * 0.1)
train_set, test_set = instances[size:], instances[:size]
In [47]:
test_set[1]
Out[47]:
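The notebook stops after splitting the data, so the next cell is a minimal sketch of the classifier the exercise asks for. It assumes the instance attributes documented in the corpus HOWTO (position, context, senses); context items are usually (word, tag) pairs but occasionally bare strings, and instances can carry more than one sense, so taking the first sense is a simplification. The neighbouring-word features are just one plausible choice.
In [ ]:
# a minimal sketch, assuming the standard senseval instance attributes
# (position, context, senses) described in the corpus HOWTO
def token_text(tok):
    # context items are usually (word, tag) pairs, occasionally bare strings
    return tok[0] if isinstance(tok, tuple) else tok

def sense_features(instance):
    pos = instance.position
    context = instance.context
    return {
        'left-word': token_text(context[pos - 1]) if pos > 0 else '<START>',
        'right-word': (token_text(context[pos + 1])
                       if pos + 1 < len(context) else '<END>'),
    }

# instances can carry several senses; taking the first is a simplification
train_feats = [(sense_features(i), i.senses[0]) for i in train_set]
test_feats = [(sense_features(i), i.senses[0]) for i in test_set]
sense_classifier = nltk.NaiveBayesClassifier.train(train_feats)
print(nltk.classify.accuracy(sense_classifier, test_feats))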
Using the movie review document classifier discussed in this chapter, generate a list of the 30 features that the classifier finds to be most informative. Can you explain why these particular features are informative? Do you find any of them surprising?
In [48]:
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
In [49]:
# use the 2000 most frequent words in the corpus as candidate features;
# most_common() gives the intended frequency order (FreqDist.keys() does not
# guarantee it and is not sliceable in Python 3)
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for (w, _) in all_words.most_common(2000)]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
In [51]:
featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
In [52]:
print(nltk.classify.accuracy(classifier, test_set))
In [53]:
classifier.show_most_informative_features(30)
Select one of the classification tasks described in this chapter, such as name gender detection, document classification, part-of-speech tagging, or dialog act classification. Using the same training and test data, and the same feature extractor, build three classifiers for the task: a decision tree, a naive Bayes classifier, and a Maximum Entropy classifier. Compare the performance of the three classifiers on your selected task. How do you think that your results might be different if you used a different feature extractor?
In [54]:
# rebuild classifier 1's last-letter feature extractor for the comparison
import nltk

def gender_features(word):
    return {'last_letter': word[-1]}
In [56]:
featuresets = [(gender_features(n), g) for (n, g) in labeled_names]
test_set, dev_test_set, train_set = (featuresets[:500],
                                     featuresets[500:1000],
                                     featuresets[1000:])
nbclassifier = nltk.NaiveBayesClassifier.train(train_set)
In [57]:
print(nltk.classify.accuracy(nbclassifier, dev_test_set))
In [58]:
# decision tree trained on the same features for comparison
treeclassifier = nltk.DecisionTreeClassifier.train(train_set)
In [59]:
print(nltk.classify.accuracy(treeclassifier, dev_test_set))
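The exercise calls for a third, maximum entropy classifier, which the notebook omits. The next cell sketches the missing step, assuming NLTK's default iterative maxent trainer; max_iter=10 is an arbitrary cap to keep training time reasonable.
In [ ]:
# third classifier for the comparison; max_iter=10 is an arbitrary cap
# because the default iterative trainer can be slow
maxentclassifier = nltk.MaxentClassifier.train(train_set, max_iter=10)
print(nltk.classify.accuracy(maxentclassifier, dev_test_set))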
The synonyms strong and powerful pattern differently (try combining them with chip and sales). What features are relevant in this distinction? Build a classifier that predicts when each word should be used.
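No solution is given for this exercise; the cell below sketches a possible starting point rather than a full answer. It collects every occurrence of strong or powerful in the Brown corpus and classifies on the word that immediately follows, where collocational differences of the kind the prompt mentions tend to show up; richer context features would be a natural extension.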
In [ ]:
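# a possible starting point, not a full solution: the word immediately
# following 'strong'/'powerful' in the Brown corpus is the only feature
from nltk.corpus import brown

contexts = []
for sent in brown.sents():
    for i, w in enumerate(sent):
        if w.lower() in ('strong', 'powerful'):
            nxt = sent[i + 1].lower() if i + 1 < len(sent) else '<END>'
            contexts.append(({'next-word': nxt}, w.lower()))

random.shuffle(contexts)
size = int(len(contexts) * 0.1)
train_set, test_set = contexts[size:], contexts[:size]
sp_classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(sp_classifier, test_set))
sp_classifier.show_most_informative_features(10)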