Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

What You Should Already Know

  • neural networks, forward and back-propagation
  • stochastic gradient descent
  • mean squared error
  • train/test splits

Where to Get Help If You Need It

  • Re-watch previous Udacity lectures
  • Use the recommended course reading material, Grokking Deep Learning (40% off: traskud17)
  • Shoot me a tweet @iamtrask

Tutorial Outline:

  • Intro: The Importance of "Framing a Problem"
  • Curate a Dataset
  • Developing a "Predictive Theory"
  • PROJECT 1: Quick Theory Validation
  • Transforming Text to Numbers
  • PROJECT 2: Creating the Input/Output Data
  • Putting It All Together in a Neural Network
  • PROJECT 3: Building our Neural Network
  • Understanding Neural Noise
  • PROJECT 4: Making Learning Faster by Reducing Noise
  • Analyzing Inefficiencies in our Network
  • PROJECT 5: Making our Network Train and Run Faster
  • Further Noise Reduction
  • PROJECT 6: Reducing Noise by Strategically Reducing the Vocabulary
  • Analysis: What's Going On in the Weights?

Lesson: Curate a Dataset


In [50]:
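def pretty_print_review_and_label(i):
    """Print the label and the first 80 characters of review i."""
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

# What we know: one review per line.
# rstrip('\n') removes only the newline, so a final line without one is not truncated.
with open('reviews.txt', 'r') as g:
    reviews = [line.rstrip('\n') for line in g]

# What we WANT to know: one label per line, upper-cased.
with open('labels.txt', 'r') as g:
    labels = [line.rstrip('\n').upper() for line in g]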

In [51]:
len(reviews)


Out[51]:
25000

In [52]:
reviews[0]


Out[52]:
'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [53]:
labels[0]


Out[53]:
'POSITIVE'
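
Before building on these two files, it can help to confirm that every review has a label and to see how the labels are distributed. The following is a minimal sketch on a tiny hand-made sample; `sample_reviews` and `sample_labels` are hypothetical stand-ins for the `reviews` and `labels` lists loaded above, not data from the real files.

```python
from collections import Counter

# Hypothetical stand-ins for the `reviews` and `labels` lists.
sample_reviews = [
    "this movie is terrible",
    "an excellent film",
    "pure trash",
    "a real genius",
]
sample_labels = ["NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE"]

# Every review should have exactly one label.
same_length = len(sample_reviews) == len(sample_labels)

# Count how many examples fall under each label.
label_counts = Counter(sample_labels)

print(same_length)
print(dict(label_counts))
```

Running the same two checks on the full `reviews` and `labels` lists would show whether the files line up and whether one label dominates.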

Lesson: Develop a Predictive Theory


In [54]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)


labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...
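
The six samples above suggest a simple theory: individual words such as "terrible" and "excellent" correlate with the label. One way to probe that idea before counting over the whole dataset is sketched below. The miniature dataset and the helper `count_word_by_label` are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical miniature dataset, loosely modeled on the samples printed above.
sample_reviews = [
    "this movie is terrible but it has some good effects",
    "adrian pasdar is excellent in this film",
    "it is terrible and pure trash",
    "the movie is of excellent quality",
]
sample_labels = ["NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE"]

def count_word_by_label(word, reviews, labels):
    """Return a Counter mapping each label to the number of reviews containing `word`."""
    counts = Counter()
    for review, label in zip(reviews, labels):
        if word in review.split(' '):
            counts[label] += 1
    return counts

terrible_counts = count_word_by_label("terrible", sample_reviews, sample_labels)
excellent_counts = count_word_by_label("excellent", sample_reviews, sample_labels)

print(dict(terrible_counts))
print(dict(excellent_counts))
```

If a word shows up almost exclusively under one label, it is a candidate signal for the network to learn from.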

Project 1: Quick Theory Validation


In [98]:
import numpy as np

In [114]:
from collections import Counter

# Word counts over all reviews, and split by label.
bag_of_words = Counter()
pos_words = Counter()
neg_words = Counter()

for review, label in zip(reviews, labels):
    for word in review.split(' '):
        bag_of_words[word] += 1
        if label == 'POSITIVE':
            pos_words[word] += 1
        elif label == 'NEGATIVE':
            neg_words[word] += 1

# Log of the positive-to-negative count ratio, for words seen more than 500 times.
# The +1 in the denominator avoids division by zero.
words_pos_neg_ratio = []
for word, count in bag_of_words.items():
    if count > 500:
        pos_neg_ratio = pos_words[word] / float(neg_words[word] + 1)
        words_pos_neg_ratio.append((word, np.log(pos_neg_ratio)))

# Most positive words first, most negative words last.
words_pos_neg_ratio = sorted(words_pos_neg_ratio, key=lambda x: x[1], reverse=True)

In [115]:
print('\nTop positive words: \n')
for i in range(10):
    print(words_pos_neg_ratio[i][0],': ', round(words_pos_neg_ratio[i][1], 10), sep='')


Top positive words: 

superb: 1.7091514459
wonderful: 1.5645425925
fantastic: 1.5048433869
excellent: 1.4647538506
amazing: 1.3919815802
powerful: 1.2999662776
favorite: 1.2668956298
perfect: 1.2467424807
brilliant: 1.2287554138
perfectly: 1.1971931173

In [116]:
print('\nTop negative words: \n')
for i in range(-1, -11, -1):
    print(words_pos_neg_ratio[i][0],': ', round(words_pos_neg_ratio[i][1], 10), sep='')


Top negative words: 

waste: -2.619384564
pointless: -2.45530618
worst: -2.2869878962
awful: -2.227194247
poorly: -2.2207550747
lame: -1.9817674589
horrible: -1.910259094
wasted: -1.8382794849
crap: -1.8281271134
badly: -1.7536265995
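
The scores in both lists are logarithms of a positive-to-negative count ratio. One reason to take the log is sketched below with made-up counts. A word twice as common in positive reviews and a word twice as common in negative reviews get scores of equal size and opposite sign, and a word equally common in both scores zero. The helper `log_ratio` is hypothetical and leaves out the `+ 1` smoothing used in the cell above to keep the arithmetic easy to follow.

```python
import math

def log_ratio(pos_count, neg_count):
    """Log of the positive-to-negative count ratio (no smoothing; counts must be > 0)."""
    return math.log(pos_count / neg_count)

twice_positive = log_ratio(200, 100)   # ratio 2.0
twice_negative = log_ratio(100, 200)   # ratio 0.5
neutral = log_ratio(150, 150)          # ratio 1.0

print(round(twice_positive, 4), round(twice_negative, 4), neutral)
```

Without the log, the same three words would score 2.0, 0.5 and 1.0, which makes negative words look much closer to neutral than positive ones.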
