DATA 620 - Project 4

Daina Bouquin

Your assignment in Project 4 is to answer either 6.10 exercise 3 or 6.10 exercise 4 from Natural Language Processing with Python.

4. Using the movie review document classifier discussed in this chapter, generate a list of the 30 features that the classifier finds to be most informative. Can you explain why these particular features are informative? Do you find any of them surprising?


In [21]:
import nltk
from nltk.corpus import movie_reviews
import pandas as pd
import sklearn as sk
import random

In [22]:
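# Select 1000 words from all movie reviews to use as features

words = movie_reviews.words()
all_words = nltk.FreqDist(w.lower() for w in words) # {word: frequency}
# Note: in NLTK 3, FreqDist.keys() is not sorted by frequency, so this slice takes
# 1000 arbitrary words (see the output), not the 1000 most frequent ones.
# all_words.most_common(1000) would select by frequency instead.
word_features = list(all_words.keys())[:1000] # more words slows down the training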

# example of results
word_features[:15]


Out[22]:
[u'sucess',
 u'sonja',
 u'askew',
 u'woods',
 u'spiders',
 u'bazooms',
 u'hanging',
 u'francesca',
 u'comically',
 u'localized',
 u'disobeying',
 u'hennings',
 u'canet',
 u'scold',
 u'originality']
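
The sample above does not look like a list of common words, which suggests the `keys()` slice did not select by frequency. Below is a minimal sketch of frequency-based selection. It assumes NLTK 3, where `FreqDist` subclasses `collections.Counter` and so provides `most_common`. The toy token list and the name `top_words` are illustrative and not part of the original notebook.

```python
from collections import Counter

# Toy tokens standing in for movie_reviews.words() (hypothetical data)
tokens = ["the", "film", "was", "the", "best", "film", "of", "the", "year"]

# Counter plays the role of nltk.FreqDist here
freq = Counter(w.lower() for w in tokens)

# most_common(n) returns (word, count) pairs sorted by descending count
top_words = [word for word, count in freq.most_common(2)]
print(top_words)  # ['the', 'film']
```

In the notebook, the equivalent would be `[w for (w, _) in all_words.most_common(1000)]`. Rerunning with that list would change the feature set and therefore the results reported further down.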

In [23]:
# build list of words and their positive/negative classification from the reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

In [24]:
# Create a feature set for each review against the list of 1000 selected words
# Each feature records whether a given word occurs in the document

def doc_features(document):
    doc_words = set(document) # set membership is faster than searching a list
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in doc_words)
    return features

featuresets = [(doc_features(d), c) for (d,c) in documents]
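
To make the shape of a feature set concrete, here is a small self-contained sketch of the same idea. The vocabulary, the review, and the function name `contains_features` are hypothetical stand-ins rather than the notebook's own data.

```python
# Hypothetical vocabulary and review, for illustration only
vocabulary = ["uplifting", "mediocrity", "plot"]

def contains_features(document, vocabulary):
    """Map each vocabulary word to whether it occurs in the document."""
    doc_words = set(document)  # build the set once, then test membership
    return {"contains({})".format(w): (w in doc_words) for w in vocabulary}

review = ["an", "uplifting", "story", "with", "a", "thin", "plot"]
features = contains_features(review, vocabulary)
print(features)
```

Every review gets the same keys, so absent words appear as `False` rather than being left out.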

In [25]:
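# Split to create training and test data
# Note: documents were never shuffled and are ordered neg then pos,
# so the 100 test reviews are all negative
train_set = featuresets[100:]
test_set = featuresets[:100]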
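
Because `documents` is built one category at a time, slicing off the first 100 items takes them all from a single category. One common remedy is to shuffle with a fixed seed before splitting. The sketch below shows this on toy data; the names `labeled`, `rng`, `train_part`, and `test_part` are assumptions made for illustration.

```python
import random

# Toy labeled data in corpus order: all "neg" first, then all "pos" (hypothetical)
labeled = [({"id": i}, "neg") for i in range(10)] + \
          [({"id": i}, "pos") for i in range(10, 20)]

rng = random.Random(4321)   # seeded generator so the split is reproducible
shuffled = labeled[:]       # copy, leaving the original order intact
rng.shuffle(shuffled)

test_part, train_part = shuffled[:5], shuffled[5:]

# Without shuffling, the first slice contains a single class
print(sorted(set(label for _, label in labeled[:5])))  # ['neg']
print(len(train_part), len(test_part))                 # 15 5
```

Applying this to the notebook would change which reviews end up in the training set, so the informative features would differ from the ones listed here.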

In [26]:
# Train using Naive Bayes classifier
random.seed(4321) # no effect here: nothing is shuffled and training is deterministic
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [27]:
# 30 most important features
classifier.show_most_informative_features(30)


Most Informative Features
    contains(mediocrity) = True              neg : pos    =      8.5 : 1.0
        contains(fabric) = True              pos : neg    =      5.7 : 1.0
     contains(uplifting) = True              pos : neg    =      5.5 : 1.0
        contains(doubts) = True              pos : neg    =      5.2 : 1.0
  contains(accomplishes) = True              pos : neg    =      5.1 : 1.0
       contains(topping) = True              pos : neg    =      5.1 : 1.0
  contains(effortlessly) = True              pos : neg    =      5.0 : 1.0
         contains(locks) = True              neg : pos    =      4.8 : 1.0
           contains(wcw) = True              neg : pos    =      4.1 : 1.0
       contains(maxwell) = True              neg : pos    =      4.1 : 1.0
        contains(minnie) = True              pos : neg    =      4.0 : 1.0
      contains(matheson) = True              pos : neg    =      3.9 : 1.0
          contains(wang) = True              pos : neg    =      3.9 : 1.0
     contains(sumptuous) = True              pos : neg    =      3.8 : 1.0
       contains(admired) = True              pos : neg    =      3.7 : 1.0
      contains(attorney) = True              pos : neg    =      3.5 : 1.0
      contains(troubles) = True              pos : neg    =      3.4 : 1.0
       contains(nebbish) = True              neg : pos    =      3.3 : 1.0
           contains(hal) = True              neg : pos    =      3.3 : 1.0
          contains(olds) = True              neg : pos    =      3.3 : 1.0
     contains(sickening) = True              neg : pos    =      3.3 : 1.0
   contains(unabashedly) = True              neg : pos    =      3.3 : 1.0
     contains(torpedoes) = True              neg : pos    =      3.3 : 1.0
       contains(bandits) = True              pos : neg    =      3.3 : 1.0
     contains(wednesday) = True              pos : neg    =      3.3 : 1.0
   contains(voyeuristic) = True              pos : neg    =      3.3 : 1.0
          contains(caan) = True              neg : pos    =      3.1 : 1.0
          contains(rico) = True              pos : neg    =      3.1 : 1.0
     contains(portrayed) = True              pos : neg    =      3.1 : 1.0
         contains(crowe) = True              pos : neg    =      3.0 : 1.0
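
Each ratio above compares how likely a feature value is under one label versus the other. The sketch below is a rough version of that computation. It assumes the add-0.5 (expected likelihood) smoothing that NLTK's Naive Bayes trainer applies by default. The function names and the counts are hypothetical, chosen only to show how a word seen in about a dozen reviews can produce a large ratio.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Lidstone-smoothed estimate of P(feature = True | label)."""
    return (count + gamma) / (total + gamma * bins)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of P(True | label a) to P(True | label b)."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)

# Hypothetical counts: the word appears in 11 of 900 negative training
# reviews and in 1 of 1000 positive training reviews
ratio = likelihood_ratio(11, 900, 1, 1000)
print(round(ratio, 1))  # 8.5
```

With counts this small, a handful of reviews about a single film can push a name or an unusual word near the top of the list.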

Can you explain why these particular features are informative? Do you find any of them surprising?

Several of the most informative features are unsurprising. A few examples:

Mediocrity | negative: A review that mentions mediocrity is unlikely to be positive.
Uplifting | positive: I would think "uplifting" means the movie was well reviewed.
Accomplishes | positive: Implies that the creator/director/actors accomplished what they were aiming for.
Effortlessly | positive: Overtly positive.
Sickening | negative: Good movies aren't usually described as sickening.
Topping | positive: As in "topping the charts".
Admired | positive: Also overtly positive.

There were, though, some surprising finds:

Maxwell | negative: I'm not sure who Maxwell is, but he seems to be disliked.
Locks | negative: Why "locks" would be so negative and so informative is opaque to me.
Fabric | positive: "Fabric of..." may be a phrase used often in reviews.
Torpedoes | negative: Likely often used as a verb.
Bandits | positive: People like bandits?
WCW | negative: Woman crush Wednesday?
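
One way to check guesses like these is to look at the words surrounding each occurrence. The following is a minimal keyword-in-context sketch. The function name `keyword_in_context` and the sample tokens are hypothetical; NLTK's `Text.concordance` offers a similar view on real corpus text.

```python
def keyword_in_context(tokens, keyword, width=3):
    """Return each occurrence of keyword with `width` tokens on either side."""
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append(" ".join(left + [token] + right))
    return hits

# Hypothetical tokens standing in for one review from the corpus
sample = ["woven", "into", "the", "fabric", "of", "the", "story"]
contexts = keyword_in_context(sample, "fabric", width=2)
print(contexts)  # ['into the fabric of the']
```

Running something like this over the reviews that contain "locks" or "wcw" would show whether the word is tied to one film or to a recurring phrase.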