Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

What You Should Already Know

  • neural networks, forward and back-propagation
  • stochastic gradient descent
  • mean squared error
  • train/test splits

Where to Get Help if You Need it

  • Re-watch previous Udacity Lectures
  • Leverage the recommended Course Reading Material - Grokking Deep Learning (40% Off: traskud17)
  • Shoot me a tweet @iamtrask

Tutorial Outline:

  • Intro: The Importance of "Framing a Problem"
  • Curate a Dataset
  • Developing a "Predictive Theory"
  • PROJECT 1: Quick Theory Validation
  • Transforming Text to Numbers
  • PROJECT 2: Creating the Input/Output Data
  • Putting it all together in a Neural Network
  • PROJECT 3: Building our Neural Network
  • Understanding Neural Noise
  • PROJECT 4: Making Learning Faster by Reducing Noise
  • Analyzing Inefficiencies in our Network
  • PROJECT 5: Making our Network Train and Run Faster
  • Further Noise Reduction
  • PROJECT 6: Reducing Noise by Strategically Reducing the Vocabulary
  • Analysis: What's going on in the weights?

Lesson: Curate a Dataset


In [1]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

# What we know!
with open('reviews.txt', 'r') as g:
    reviews = list(map(lambda x: x[:-1], g.readlines()))

# What we WANT to know!
with open('labels.txt', 'r') as g:
    labels = list(map(lambda x: x[:-1].upper(), g.readlines()))

In [2]:
len(reviews)


Out[2]:
25000

In [3]:
reviews[0]


Out[3]:
'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [4]:
labels[0]


Out[4]:
'POSITIVE'

Lesson: Develop a Predictive Theory


In [5]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)


labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...
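The theory suggested by these samples: individual words correlate with sentiment ("terrible" with NEGATIVE, "excellent" with POSITIVE), so counting word frequencies per label should be predictive. A minimal sketch of that counting idea, using `collections.Counter` on a made-up four-review corpus (the reviews and labels here are toy data, not the dataset above):

```python
from collections import Counter

# Toy corpus illustrating the predictive theory: count how often
# each word appears under each label.
reviews = ['this movie is terrible', 'excellent film',
           'terrible acting', 'an excellent story']
labels = ['NEGATIVE', 'POSITIVE', 'NEGATIVE', 'POSITIVE']

counts = {'POSITIVE': Counter(), 'NEGATIVE': Counter()}
for review, label in zip(reviews, labels):
    counts[label].update(review.split())

print(counts['NEGATIVE']['terrible'])  # 2 -- appears only in negative reviews
print(counts['POSITIVE']['terrible'])  # 0
```

The next cell implements the same idea over the full training set with toolz.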

In [36]:
from toolz import frequencies, groupby, merge_with, valmap, mapcat, pluck, get_in
import numpy as np

# Hold out the last 10,000 reviews for testing
training_reviews = reviews[:15000]
training_labels = labels[:15000]
test_reviews = reviews[15000:]
test_labels = labels[15000:]

# Pair each review with its label, then group the pairs by label
review_dict = [{'review': r, 'label': l} for r, l in zip(training_reviews, training_labels)]
grouped = groupby('label', review_dict)

# For each label, count how often every word appears
review_pluck = valmap(lambda x: pluck('review', x), grouped)
review_comb = valmap(lambda x: frequencies(mapcat(lambda y: y.replace('.', '').split(), x)), review_pluck)

# Total count of each word across both labels
combined = merge_with(sum, [review_comb[x] for x in review_comb])

# Score each word in [-1, 1]: +1 means it appears only in positive
# reviews, -1 means it appears only in negative reviews
value_dict = {x: (get_in(['POSITIVE', x], review_comb, 0) -
                  get_in(['NEGATIVE', x], review_comb, 0)) / combined[x]
              for x in combined}
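With a word-score dictionary of this shape, the most polar words fall out of a simple sort. A sketch using a hypothetical `toy_value_dict` (toy scores, not the real `value_dict` built above):

```python
# Toy word-score dict in the same [-1, 1] range as value_dict above
toy_value_dict = {'terrible': -0.95, 'excellent': 0.9, 'the': 0.0,
                  'bad': -0.7, 'great': 0.8}

# Sorting by score puts the most negative words first, most positive last
ranked = sorted(toy_value_dict, key=toy_value_dict.get)
print(ranked[:2])   # ['terrible', 'bad']
print(ranked[-2:])  # ['great', 'excellent']
```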

In [37]:
def label_review(r, d):
    # Sum the per-word scores for the review; a non-negative total
    # predicts POSITIVE, a negative total predicts NEGATIVE
    score = sum(map(lambda x: get_in([x], d, 0), r.replace('.', '').split()))
    if score >= 0:
        return 'POSITIVE'
    else:
        return 'NEGATIVE'
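To see the scoring rule in isolation, here is a standalone version of the same function using plain `dict.get` instead of `toolz.get_in`, applied to a hypothetical `toy_scores` dictionary (the scores are made up for illustration):

```python
# Standalone restatement of the classifier above: sum per-word scores,
# predict POSITIVE when the total is non-negative
def label_review(review, scores):
    total = sum(scores.get(word, 0) for word in review.replace('.', '').split())
    return 'POSITIVE' if total >= 0 else 'NEGATIVE'

toy_scores = {'terrible': -0.9, 'excellent': 0.8, 'movie': 0.01}
print(label_review('this movie is terrible .', toy_scores))  # NEGATIVE
print(label_review('an excellent movie .', toy_scores))      # POSITIVE
```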

In [38]:
# Accuracy on the training set
training_score = np.mean([label_review(r, value_dict) == l for r, l in zip(training_reviews, training_labels)])
training_score


Out[38]:
0.93153333333333332

In [ ]:
# Accuracy on the held-out test set
test_score = np.mean([label_review(r, value_dict) == l for r, l in zip(test_reviews, test_labels)])

test_score
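The `np.mean` over a list of booleans is just classification accuracy, since `True` counts as 1 and `False` as 0. A sketch on hypothetical three-element prediction and truth lists:

```python
# Accuracy as the mean of a boolean comparison: 2 of 3 correct here
predictions = ['POSITIVE', 'NEGATIVE', 'POSITIVE']
truth = ['POSITIVE', 'POSITIVE', 'POSITIVE']

accuracy = sum(p == t for p, t in zip(predictions, truth)) / len(truth)
print(accuracy)  # 2 of 3 correct
```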
