Exercises from http://www.nltk.org/book_1ed/ch06.html

Author : Nirmal kumar Ravi

Read up on one of the language technologies mentioned in this section, such as word sense disambiguation, semantic role labeling, question answering, machine translation, named entity detection. Find out what type and quantity of annotated data is required for developing such systems. Why do you think a large amount of data is required?

  • Let's discuss "question answering". For example, if you run a customer service desk and most of your customers face a similar set of problems, a question-answering system can be built.
  • To build this system we need to train a model that classifies each problem into a category.
  • After classification we may need a further, finer classification, or the category may point directly to an answer.
  • For example, assume we are building a decision tree. We may ask the customer questions like "does the machine produce sound while running?" and, based on the answer, take the next step down the tree (see the toy sketch after this list).
  • To train any model, in our case a decision tree, we need data in large quantity.
  • With more data, the model learns more of the possible problems that can occur and the solution for each.
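
A toy sketch of the idea using NLTK's decision tree learner. The symptom features and problem categories here are invented purely for illustration; a real system would need a large set of annotated support tickets.

In [ ]:
import nltk

# Hypothetical training data: answers to diagnostic questions -> problem category
troubleshooting_data = [
    ({'makes_sound': True,  'powers_on': True},  'fan failure'),
    ({'makes_sound': False, 'powers_on': False}, 'dead power supply'),
    ({'makes_sound': True,  'powers_on': False}, 'dead power supply'),
    ({'makes_sound': False, 'powers_on': True},  'software issue'),
]
# support_cutoff is lowered only because this toy data set is so small
tree = nltk.DecisionTreeClassifier.train(troubleshooting_data, support_cutoff=0)
print tree.classify({'makes_sound': True, 'powers_on': True})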

Using any of the three classifiers described in this chapter, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?


In [6]:
# Load the Names Corpus and build a shuffled list of (name, gender) pairs
# (note: this rebinds `names` from the corpus reader to a plain list)
from nltk.corpus import names
import random
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])
random.shuffle(names)

In [41]:
#classifier 1 
# classify with just last letter of name
def gender_features(word):
    return {'last_letter': word[-1]}

In [42]:
#using classifier 1 to build model
import nltk

featuresets = [(gender_features(n), g) for (n,g) in names]
test_set, dev_test_set, train_set = featuresets[:500], featuresets[500:1000], featuresets[1000:]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [43]:
print nltk.classify.accuracy(classifier, dev_test_set)
classifier.show_most_informative_features(5)


0.76
Most Informative Features
             last_letter = u'a'           female : male   =     35.0 : 1.0
             last_letter = u'k'             male : female =     30.2 : 1.0
             last_letter = u'p'             male : female =     19.4 : 1.0
             last_letter = u'f'             male : female =     13.7 : 1.0
             last_letter = u'v'             male : female =     10.3 : 1.0

In [44]:
print nltk.classify.accuracy(classifier, test_set)


0.788

In [37]:
#classifier 2 
# take bigrams instead
def gender_features(word):
    # one feature per character bigram, keyed by its position in the name
    features = {}
    for i, (a, b) in enumerate(nltk.bigrams(word.lower())):
        features['b' + str(i)] = a + b
    return features

gender_features('Nirmal')


Out[37]:
{'b0': 'ni', 'b1': 'ir', 'b2': 'rm', 'b3': 'ma', 'b4': 'al'}

In [38]:
#using classifier 2 to build model
import nltk

featuresets = [(gender_features(n), g) for (n,g) in names]
test_set, dev_test_set, train_set = featuresets[:500], featuresets[500:1000], featuresets[1000:]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [39]:
print nltk.classify.accuracy(classifier, dev_test_set)
classifier.show_most_informative_features(5)


0.788
Most Informative Features
                      b5 = u'ta'          female : male   =     19.7 : 1.0
                      b0 = u'hu'            male : female =     17.0 : 1.0
                      b2 = u'rk'            male : female =     16.8 : 1.0
                      b5 = u'rd'            male : female =     16.2 : 1.0
                      b3 = u'to'            male : female =     16.1 : 1.0

In [40]:
print nltk.classify.accuracy(classifier, test_set)


0.814
  • Our second classifier, which uses bigram features, is more accurate; the error-analysis step sketched below shows how further incremental improvements could be found.
  • The last few letters are more informative for female names.
  • The first few letters are more informative for male names.
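
The exercise asks for incremental improvements guided by the dev-test set. A sketch of that error-analysis loop, following the pattern used in the chapter: keep the raw names alongside the feature splits so misclassified dev-test names can be inspected.

In [ ]:
# names was shuffled once above, so these slices line up with featuresets
devtest_names = names[500:1000]
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))
# inspect a handful of mistakes to suggest the next feature to add
for (tag, guess, name) in sorted(errors)[:10]:
    print 'correct=%-8s guess=%-8s name=%s' % (tag, guess, name)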

The Senseval 2 Corpus contains data intended to train word-sense disambiguation classifiers. It contains data for four words: hard, interest, line, and serve. Choose one of these four words, and load the corresponding data: Using this dataset, build a classifier that predicts the correct sense tag for a given instance. See the corpus HOWTO at http://www.nltk.org/howto for information on using the instance objects returned by the Senseval 2 Corpus.


In [45]:
from nltk.corpus import senseval
instances = senseval.instances('hard.pos')
size = int(len(instances) * 0.1)  # hold out 10% of the instances for testing
train_set, test_set = instances[size:], instances[:size]

In [47]:
test_set[1]


Out[47]:
SensevalInstance(word=u'hard-a', position=10, context=[('clever', 'NNP'), ('white', 'NNP'), ('house', 'NNP'), ('``', '``'), ('spin', 'VB'), ('doctors', 'NNS'), ("''", "''"), ('are', 'VBP'), ('having', 'VBG'), ('a', 'DT'), ('hard', 'JJ'), ('time', 'NN'), ('helping', 'VBG'), ('president', 'NNP'), ('bush', 'NNP'), ('explain', 'VB'), ('away', 'RB'), ('the', 'DT'), ('economic', 'JJ'), ('bashing', 'NN'), ('that', 'IN'), ('low-and', 'JJ'), ('middle-income', 'JJ'), ('workers', 'NNS'), ('are', 'VBP'), ('taking', 'VBG'), ('these', 'DT'), ('days', 'NNS'), ('.', '.')], senses=('HARD1',))
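
The cell above only loads and splits the data; one possible classifier (a sketch, not the only approach) uses the words in a small window around the target position as features. The window size of 2 and the choice of the first listed sense as the label are both simplifications.

In [ ]:
def sense_features(instance, window=2):
    features = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        i = instance.position + offset
        if 0 <= i < len(instance.context):
            token = instance.context[i]
            # most context tokens are (word, tag) pairs, but a few are bare strings
            word = token[0] if isinstance(token, tuple) else token
            features['word(%d)' % offset] = word
    return features

featuresets = [(sense_features(inst), inst.senses[0]) for inst in instances]
sense_classifier = nltk.NaiveBayesClassifier.train(featuresets[size:])
print nltk.classify.accuracy(sense_classifier, featuresets[:size])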

Using the movie review document classifier discussed in this chapter, generate a list of the 30 features that the classifier finds to be most informative. Can you explain why these particular features are informative? Do you find any of them surprising?


In [48]:
from nltk.corpus import movie_reviews
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

In [49]:
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# In the NLTK version used here, FreqDist.keys() is sorted by decreasing
# frequency, so this takes the 2000 most frequent words (newer NLTK
# versions would use all_words.most_common(2000) instead)
word_features = all_words.keys()[:2000]
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

In [51]:
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [52]:
print nltk.classify.accuracy(classifier, test_set)


0.77

In [53]:
classifier.show_most_informative_features(30)


Most Informative Features
          contains(sans) = True              neg : pos    =      9.0 : 1.0
     contains(dismissed) = True              pos : neg    =      7.0 : 1.0
    contains(mediocrity) = True              neg : pos    =      7.0 : 1.0
   contains(bruckheimer) = True              neg : pos    =      6.3 : 1.0
         contains(wires) = True              neg : pos    =      6.3 : 1.0
     contains(uplifting) = True              pos : neg    =      5.9 : 1.0
        contains(doubts) = True              pos : neg    =      5.8 : 1.0
           contains(ugh) = True              neg : pos    =      5.8 : 1.0
       contains(topping) = True              pos : neg    =      5.7 : 1.0
          contains(wits) = True              pos : neg    =      5.7 : 1.0
        contains(fabric) = True              pos : neg    =      5.7 : 1.0
          contains(lang) = True              pos : neg    =      5.7 : 1.0
           contains(hal) = True              neg : pos    =      5.6 : 1.0
          contains(hugo) = True              pos : neg    =      4.6 : 1.0
         contains(tripe) = True              neg : pos    =      4.6 : 1.0
  contains(effortlessly) = True              pos : neg    =      4.4 : 1.0
      contains(matheson) = True              pos : neg    =      4.4 : 1.0
         contains(spins) = True              pos : neg    =      4.4 : 1.0
          contains(wang) = True              pos : neg    =      4.4 : 1.0
       contains(maxwell) = True              neg : pos    =      4.3 : 1.0
         contains(locks) = True              neg : pos    =      4.3 : 1.0
    contains(cronenberg) = True              pos : neg    =      4.2 : 1.0
       contains(admired) = True              pos : neg    =      4.2 : 1.0
      contains(attorney) = True              pos : neg    =      3.9 : 1.0
     contains(testament) = True              pos : neg    =      3.9 : 1.0
          contains(sant) = True              pos : neg    =      3.8 : 1.0
         contains(gripe) = True              pos : neg    =      3.7 : 1.0
       contains(bandits) = True              pos : neg    =      3.7 : 1.0
     contains(patriarch) = True              pos : neg    =      3.7 : 1.0
   contains(voyeuristic) = True              pos : neg    =      3.7 : 1.0
  • Words that carry strong sentiment in one direction or the other ("uplifting", "mediocrity", "tripe") make good features for this classifier.
  • More surprisingly, several of the most informative features are proper names ("bruckheimer", "cronenberg", "matheson"): the classifier has picked up on particular filmmakers rather than sentiment words.

Select one of the classification tasks described in this chapter, such as name gender detection, document classification, part-of-speech tagging, or dialog act classification. Using the same training and test data, and the same feature extractor, build three classifiers for the task: a decision tree, a naive Bayes classifier, and a Maximum Entropy classifier. Compare the performance of the three classifiers on your selected task. How do you think that your results might be different if you used a different feature extractor?


In [54]:
# Rebuild the last-letter feature extractor (classifier 1) for the comparison
import nltk
def gender_features(word):
    return {'last_letter': word[-1]}

In [56]:
featuresets = [(gender_features(n), g) for (n,g) in names]
test_set, dev_test_set, train_set = featuresets[:500], featuresets[500:1000], featuresets[1000:]
nbclassifier = nltk.NaiveBayesClassifier.train(train_set)

In [57]:
print nltk.classify.accuracy(nbclassifier, dev_test_set)


0.76

In [58]:
treeclassifier = nltk.DecisionTreeClassifier.train(train_set)

In [59]:
print nltk.classify.accuracy(treeclassifier, dev_test_set)


0.76
  • We used the Names Corpus with the last-letter feature extractor.
  • Both classifiers show the same dev-test accuracy (0.76); a Maximum Entropy classifier on the same features is sketched below.
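
The exercise asks for a third model, a Maximum Entropy classifier, trained on the same features. A sketch using the same training set (IIS is slow, so the iteration count is capped here purely to keep the demo quick):

In [ ]:
meclassifier = nltk.MaxentClassifier.train(train_set, algorithm='iis',
                                           trace=0, max_iter=10)
print nltk.classify.accuracy(meclassifier, dev_test_set)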

The synonyms strong and powerful pattern differently (try combining them with chip and sales). What features are relevant in this distinction? Build a classifier that predicts when each word should be used.


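One possible approach (a sketch, not a definitive solution): the neighbouring words are the most telling features ("strong sales" vs. "powerful chip"), so collect every occurrence of either adjective from the Brown corpus and train a naive Bayes classifier on the words on each side. The window of one word in each direction is an arbitrary starting point.

In [ ]:
import nltk
import random
from nltk.corpus import brown

def synonym_features(words, i):
    # the words immediately before and after the adjective
    return {'prev_word': words[i - 1].lower(),
            'next_word': words[i + 1].lower()}

tokens = list(brown.words())
labeled = [(synonym_features(tokens, i), w.lower())
           for i, w in enumerate(tokens)
           if w.lower() in ('strong', 'powerful') and 0 < i < len(tokens) - 1]

random.shuffle(labeled)
size = int(len(labeled) * 0.1)  # hold out 10% for testing
syn_classifier = nltk.NaiveBayesClassifier.train(labeled[size:])
print nltk.classify.accuracy(syn_classifier, labeled[:size])
syn_classifier.show_most_informative_features(5)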