Your assignment in Project 4 is to answer either 6.10 exercise 3 or 6.10 exercise 4 from Natural Language Processing with Python.

4. Using the movie review document classifier discussed in this chapter, generate a list of the 30 features that the classifier finds to be most informative. Can you explain why these particular features are informative? Do you find any of them surprising?

import nltk
from nltk.corpus import movie_reviews
import pandas as pd
import sklearn as sk
import random

# Find the top 1000 words in all movie reviews

words = movie_reviews.words()
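all_words = nltk.FreqDist(w.lower() for w in words) # maps each word to its frequency
# FreqDist keys are not frequency-sorted in NLTK 3, so ask for the top 1000 explicitly
word_features = [w for (w, _) in all_words.most_common(1000)] # more words slow down training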

# Build a list of (review words, category) pairs, one per review

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents) # mix categories so the train/test slices below contain both classes

# Build a feature set for each review: one boolean feature per word in word_features

def doc_features(document):
    doc_words = set(document) # a set makes membership tests fast
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in doc_words)
    return features

featuresets = [(doc_features(d), c) for (d,c) in documents]
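
To make the shape of a feature set concrete, here is a minimal sketch of the same extractor applied to a toy review. The three-word vocabulary and the sample sentence are made up for illustration, and `toy_doc_features` is a hypothetical variant that takes the vocabulary as an argument instead of reading the global `word_features`.

```python
def toy_doc_features(document, vocabulary):
    """Return {'contains(word)': bool} for every word in the vocabulary."""
    doc_words = set(document)  # set membership is fast to test
    return {'contains({})'.format(word): (word in doc_words)
            for word in vocabulary}

# Hypothetical vocabulary and review, not taken from the corpus
toy_vocabulary = ['plot', 'uplifting', 'boring']
toy_review = 'an uplifting story with a thin plot'.split()

toy_features = toy_doc_features(toy_review, toy_vocabulary)
print(toy_features)
```

Every review gets the same set of keys; only the True/False values differ, which is what the Naive Bayes classifier counts per category.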

# Split to create training and test data
train_set = featuresets[100:]
test_set = featuresets[:100]
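
Because `documents` is built one category at a time, slicing an unshuffled list would put a single class in the test set. The sketch below uses made-up labels (not the real corpus) to show the effect and one way to avoid it with a seeded shuffle. The seed value and the `toy_labels` name are assumptions for illustration.

```python
import random

# Toy stand-in for the corpus order: all 'neg' items first, then all 'pos'
toy_labels = ['neg'] * 1000 + ['pos'] * 1000

# Slicing without shuffling: the first 100 items all belong to one class
unshuffled_test = toy_labels[:100]
print(set(unshuffled_test))

# Shuffle a copy with a fixed seed so the split is reproducible
shuffled = toy_labels[:]
random.Random(42).shuffle(shuffled)
shuffled_test, shuffled_train = shuffled[:100], shuffled[100:]
print(set(shuffled_test))
```

A test set with only one class cannot measure how well the classifier separates the two, so the shuffle matters for any accuracy figure computed afterwards.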

# Train using Naive Bayes classifier
classifier = nltk.NaiveBayesClassifier.train(train_set)

# 30 most important features
classifier.show_most_informative_features(30)

Most Informative Features
    contains(mediocrity) = True              neg : pos    =      8.5 : 1.0
        contains(fabric) = True              pos : neg    =      5.7 : 1.0
     contains(uplifting) = True              pos : neg    =      5.5 : 1.0
        contains(doubts) = True              pos : neg    =      5.2 : 1.0
  contains(accomplishes) = True              pos : neg    =      5.1 : 1.0
       contains(topping) = True              pos : neg    =      5.1 : 1.0
  contains(effortlessly) = True              pos : neg    =      5.0 : 1.0
         contains(locks) = True              neg : pos    =      4.8 : 1.0
           contains(wcw) = True              neg : pos    =      4.1 : 1.0
       contains(maxwell) = True              neg : pos    =      4.1 : 1.0
        contains(minnie) = True              pos : neg    =      4.0 : 1.0
      contains(matheson) = True              pos : neg    =      3.9 : 1.0
          contains(wang) = True              pos : neg    =      3.9 : 1.0
     contains(sumptuous) = True              pos : neg    =      3.8 : 1.0
       contains(admired) = True              pos : neg    =      3.7 : 1.0
      contains(attorney) = True              pos : neg    =      3.5 : 1.0
      contains(troubles) = True              pos : neg    =      3.4 : 1.0
       contains(nebbish) = True              neg : pos    =      3.3 : 1.0
           contains(hal) = True              neg : pos    =      3.3 : 1.0
          contains(olds) = True              neg : pos    =      3.3 : 1.0
     contains(sickening) = True              neg : pos    =      3.3 : 1.0
   contains(unabashedly) = True              neg : pos    =      3.3 : 1.0
     contains(torpedoes) = True              neg : pos    =      3.3 : 1.0
       contains(bandits) = True              pos : neg    =      3.3 : 1.0
     contains(wednesday) = True              pos : neg    =      3.3 : 1.0
   contains(voyeuristic) = True              pos : neg    =      3.3 : 1.0
          contains(caan) = True              neg : pos    =      3.1 : 1.0
          contains(rico) = True              pos : neg    =      3.1 : 1.0
     contains(portrayed) = True              pos : neg    =      3.1 : 1.0
         contains(crowe) = True              pos : neg    =      3.0 : 1.0
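
The ratios in this listing compare how likely a feature value is under one label versus the other. As a rough sketch, assuming NLTK's default expected likelihood estimate (add 0.5 to each count) and two possible values per feature, a ratio can be reproduced from document counts. The counts below are invented for illustration, not measured from the corpus, and both helper functions are hypothetical.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Lidstone-smoothed estimate of P(feature value | label).

    count: documents with this label where the feature is True
    total: all documents with this label
    bins:  number of distinct feature values (True and False)
    gamma: 0.5 gives the expected likelihood estimate
    """
    return (count + gamma) / (total + bins * gamma)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio P(True | label a) / P(True | label b)."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)

# Invented counts: a rare word seen in 8 of 1000 reviews of one class
# and in 1 of 1000 reviews of the other
ratio = likelihood_ratio(8, 1000, 1, 1000)
print(round(ratio, 1))  # 5.7
```

With counts this small, a handful of reviews is enough to produce a large ratio, which is one reason rare words and proper names can rank highly.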

Can you explain why these particular features are informative? Do you find any of them surprising?

A number of the informative features are unsurprising. A few examples:

Mediocrity | negative: A film described as a mediocrity is unlikely to be getting a positive review.
Uplifting | positive: "Uplifting" suggests the movie was well reviewed.
Accomplishes | positive: Implies that the creator/director/actors accomplished what they were aiming for.
Effortlessly | positive: Overtly positive
Sickening | negative: Good movies aren't usually described as sickening
Topping | positive: As in "topping the charts"
Admired | positive: Also positive

There were, though, some surprising finds:

Maxwell | negative: I'm not sure who Maxwell is but he seems to be disliked.
Locks | negative: Why "locks" would be both negative and informative is a bit opaque to me.
Fabric | positive: "Fabric of..." may be a phrase used often in reviews
Torpedoes | negative: Likely often used as a verb
Bandits | positive: People like bandits?
WCW | negative: Woman crush Wednesday?
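
One way to investigate the surprising features is to count how many reviews in each category contain a given word. A word that appears in only a few reviews, perhaps all about one film, would explain a high ratio. The sketch below assumes the same (word list, category) pair structure as `documents`, but runs on made-up toy data, and `doc_frequency_by_label` is a hypothetical helper.

```python
from collections import Counter

def doc_frequency_by_label(documents, word):
    """Count, per category, how many documents contain the word at least once."""
    counts = Counter()
    for words, category in documents:
        # Lowercase the document so the match is case-insensitive
        if word in set(w.lower() for w in words):
            counts[category] += 1
    return counts

# Made-up reviews standing in for the real corpus
toy_documents = [
    (['maxwell', 'was', 'dull'], 'neg'),
    (['Maxwell', 'again', 'maxwell'], 'neg'),
    (['a', 'sumptuous', 'film'], 'pos'),
]

maxwell_counts = doc_frequency_by_label(toy_documents, 'maxwell')
print(dict(maxwell_counts))
```

Running the same helper on the real `documents` list for words such as "maxwell" or "wcw" would show whether they come from a small cluster of reviews.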