In [21]:
import nltk
from nltk.corpus import movie_reviews
import pandas as pd
import sklearn as sk
import random
In [22]:
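# Find the 1000 most frequent words across all movie reviews
words = movie_reviews.words()
all_words = nltk.FreqDist(w.lower() for w in words)  # maps word -> frequency
# keys() is not ordered by frequency in NLTK 3 and cannot be sliced in Python 3;
# most_common() returns (word, count) pairs sorted by descending count.
# A larger vocabulary slows down the training.
word_features = [w for w, _ in all_words.most_common(1000)]
# example of results
word_features[:15]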
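To illustrate the vocabulary step on something small, here is a minimal sketch using a made-up token list. It relies on collections.Counter, which nltk.FreqDist subclasses, so most_common() behaves the same way. The toy tokens and the cutoff of 3 are assumptions for the example, not part of the analysis.

```python
from collections import Counter

# Hypothetical toy tokens standing in for movie_reviews.words()
toy_tokens = ["The", "plot", "was", "dull", "the", "acting", "was", "dull", "THE", "end"]

# Lowercase before counting so "The", "the" and "THE" share one entry
toy_counts = Counter(t.lower() for t in toy_tokens)

# most_common(n) returns (word, count) pairs in descending order of count
toy_features = [w for w, _ in toy_counts.most_common(3)]
print(toy_features)
```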
In [23]:
# build list of words and their positive/negative classification from the reviews
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
In [24]:
# Create feature set/class for each review against list of top 1000 words
# Extract words from document
def doc_features(document):
    """Map a tokenized review to {'contains(word)': bool} over word_features."""
    doc_words = set(document)  # set membership tests are faster than list lookups
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in doc_words)
    return features
featuresets = [(doc_features(d), c) for (d,c) in documents]
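As a small illustration of what the extractor produces, here is a sketch on a toy review. The three-word vocabulary and the helper name toy_doc_features are hypothetical; the helper takes the vocabulary as an argument so the example runs on its own.

```python
# Hypothetical three-word vocabulary standing in for word_features
toy_vocab = ["good", "bad", "plot"]

def toy_doc_features(document, vocab):
    """Return {'contains(word)': bool} for every word in vocab."""
    doc_words = set(document)
    return {"contains({})".format(w): (w in doc_words) for w in vocab}

toy_review = ["the", "plot", "was", "good", "good", "fun"]
toy_result = toy_doc_features(toy_review, toy_vocab)
print(toy_result)
# {'contains(good)': True, 'contains(bad)': False, 'contains(plot)': True}
```

Note that the features only record presence, so the repeated "good" counts once.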
In [25]:
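# Split to create training and test data (first 100 feature sets held out for testing).
# Caveat: documents is ordered by category, so without shuffling the test set contains a single class.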
train_set = featuresets[100:]
test_set = featuresets[:100]
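Because the feature sets are built category by category, slicing off the first 100 yields a test set drawn from one class only. One common remedy is to shuffle before splitting. The sketch below shows the idea on a hypothetical stand-in list (toy_featuresets); the seed and the split size of 5 are arbitrary choices for the example.

```python
import random

# Hypothetical stand-in for featuresets: 10 'neg' items followed by 10 'pos' items
toy_featuresets = ([({"id": i}, "neg") for i in range(10)] +
                   [({"id": i}, "pos") for i in range(10, 20)])

rng = random.Random(4321)       # local generator, seeded for reproducibility
shuffled = toy_featuresets[:]   # copy so the original order is left untouched
rng.shuffle(shuffled)

# Hold out the first 5 shuffled items for testing, train on the rest
toy_test, toy_train = shuffled[:5], shuffled[5:]
print([label for _, label in toy_test])
```

Applied to the real data, the same pattern would be a shuffle of featuresets (or documents) before the two slices.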
In [26]:
# Train using Naive Bayes classifier
random.seed(4321)
classifier = nltk.NaiveBayesClassifier.train(train_set)
In [27]:
# 30 most informative features
classifier.show_most_informative_features(30)
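The ranking is based on how lopsided a feature's per-class probabilities are. Here is a minimal sketch of that ratio on made-up counts. It assumes add-0.5 smoothing, which I believe matches the estimator NLTK uses by default, but treat that as an assumption; the function names and all counts are hypothetical.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Estimate P(feature value | label) with additive smoothing (gamma per bin)."""
    return (count + gamma) / (total + gamma * bins)

def likelihood_ratio(n_pos_with, n_pos, n_neg_with, n_neg):
    """Ratio of P(contains(word)=True | label) across labels, larger over smaller."""
    p_pos = smoothed_prob(n_pos_with, n_pos)
    p_neg = smoothed_prob(n_neg_with, n_neg)
    return max(p_pos, p_neg) / min(p_pos, p_neg)

# Hypothetical rare word: in 7 of 1000 negative reviews and 0 of 1000 positive ones
rare = likelihood_ratio(0, 1000, 7, 1000)
# Hypothetical common word: in 600 of 1000 negative and 500 of 1000 positive reviews
common = likelihood_ratio(500, 1000, 600, 1000)
print(round(rare, 1), round(common, 2))
```

Under these assumptions the rare word scores a ratio of 15 while the common word scores about 1.2, which may help explain why words confined to a handful of reviews of one class can rank so highly.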
Several of the most informative features are unsurprising. A few examples:
Mediocrity | negative: Positive reviews are unlikely to use this word.
Uplifting | positive: "Uplifting" suggests the movie was well reviewed.
Accomplishes | positive: Implies that the creator/director/actors accomplished what they were aiming for.
Effortlessly | positive: Overtly positive
Sickening | negative: Good movies aren't usually described as sickening
Topping | positive: As in "topping the charts"
Admired | positive: Also positive
There were, though, some surprising finds:
Maxwell | negative: I'm not sure who Maxwell is but he seems to be disliked.
Locks | negative: Why "locks" would be both negative and informative is a bit opaque to me.
Fabric | positive: "Fabric of..." may be a phrase used often in reviews
Torpedoes | negative: Likely often used as a verb
Bandits | positive: People like bandits?
WCW | negative: Woman crush Wednesday?