Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

What You Should Already Know

  • neural networks, forward and back-propagation
  • stochastic gradient descent
  • mean squared error
  • train/test splits

Where to Get Help if You Need It

  • Re-watch previous Udacity Lectures
  • Leverage the recommended Course Reading Material - Grokking Deep Learning (40% Off: traskud17)
  • Shoot me a tweet @iamtrask

Tutorial Outline:

  • Intro: The Importance of "Framing a Problem"
  • Curate a Dataset
  • Developing a "Predictive Theory"
  • PROJECT 1: Quick Theory Validation
  • Transforming Text to Numbers
  • PROJECT 2: Creating the Input/Output Data
  • Putting it all together in a Neural Network
  • PROJECT 3: Building our Neural Network
  • Understanding Neural Noise
  • PROJECT 4: Making Learning Faster by Reducing Noise
  • Analyzing Inefficiencies in our Network
  • PROJECT 5: Making our Network Train and Run Faster
  • Further Noise Reduction
  • PROJECT 6: Reducing Noise by Strategically Reducing the Vocabulary
  • Analysis: What's going on in the weights?

Lesson: Curate a Dataset


In [1]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

g = open('reviews.txt', 'r') # What we know!
reviews = list(map(lambda x: x[:-1], g.readlines())) # strip the trailing newline
g.close()

g = open('labels.txt', 'r') # What we WANT to know!
labels = list(map(lambda x: x[:-1].upper(), g.readlines())) # strip newline, normalize case
g.close()

In [2]:
len(reviews)


Out[2]:
25000

In [3]:
reviews[0]


Out[3]:
'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [4]:
labels[0]


Out[4]:
'POSITIVE'
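
A quick sanity check (an extra cell, not part of the original notebook): this IMDB set is balanced, so tallying the labels should show an even split between POSITIVE and NEGATIVE.

In [ ]:
from collections import Counter

# Count how many reviews carry each label; with 25000 reviews we
# expect roughly 12500 of each.
print(Counter(labels))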

Lesson: Develop a Predictive Theory


In [5]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)


labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...

From these samples a simple predictive theory emerges: individual words correlate with sentiment. "terrible" and "trash" cluster in NEGATIVE reviews, while "excellent" and "genius" cluster in POSITIVE ones. If that correlation holds across all 25,000 reviews, word counts alone should carry real signal.

Test Theory


In [6]:
from collections import Counter
import numpy as np

In [7]:
positive_count = Counter()
negative_count = Counter()
total_count = Counter()

In [8]:
for i in range(len(reviews)):
    # Tally each word once per occurrence, split by the review's label.
    if labels[i] == 'POSITIVE':
        for word in reviews[i].split(" "):
            positive_count[word] += 1
            total_count[word] += 1
    else:
        for word in reviews[i].split(" "):
            negative_count[word] += 1
            total_count[word] += 1
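
We can spot-check one strongly polarized word from the samples above (an illustrative extra cell; any word works):

In [ ]:
# 'terrible' should be far more common in negative reviews.
print(positive_count['terrible'], negative_count['terrible'])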

In [9]:
positive_count.most_common()[:15]


Out[9]:
[('', 550468),
 ('the', 173324),
 ('.', 159654),
 ('and', 89722),
 ('a', 83688),
 ('of', 76855),
 ('to', 66746),
 ('is', 57245),
 ('in', 50215),
 ('br', 49235),
 ('it', 48025),
 ('i', 40743),
 ('that', 35630),
 ('this', 35080),
 ('s', 33815)]

The most common words in positive reviews are simply the most common English words, so raw counts can't tell the classes apart. A better signal is the fraction of each word's occurrences that fall in positive (or negative) reviews, restricted to words frequent enough for the fraction to be meaningful; the +1 in the denominator smooths the ratio slightly.

In [10]:
pos_ratios = Counter()
neg_ratios = Counter()

# For each word seen more than 100 times, compute the share of its
# occurrences that come from positive and from negative reviews.
for word, count in total_count.most_common():
    if count > 100:
        pos_ratios[word] = positive_count[word] / float(total_count[word] + 1)
        neg_ratios[word] = negative_count[word] / float(total_count[word] + 1)
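
A common variant of this idea collapses the two ratios into a single signed score per word. The sketch below is an extra cell (the name pos_neg_ratios and the exact +1 smoothing are choices made here, not taken from the cells above): a smoothed log-odds that is positive for positive-leaning words, negative for negative-leaning ones, and near zero for neutral ones.

In [ ]:
import numpy as np

pos_neg_ratios = Counter()
for word, count in total_count.most_common():
    if count > 100:
        # +1 on both sides keeps the ratio finite when a word
        # never appears in one of the two classes.
        pos_neg_ratios[word] = np.log(
            (positive_count[word] + 1) / float(negative_count[word] + 1))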

In [11]:
pos_ratios.most_common()[:15]


Out[11]:
[('edie', 0.990909090909091),
 ('paulie', 0.9833333333333333),
 ('felix', 0.9590163934426229),
 ('polanski', 0.9439252336448598),
 ('matthau', 0.9430379746835443),
 ('victoria', 0.9358974358974359),
 ('mildred', 0.9310344827586207),
 ('gandhi', 0.926829268292683),
 ('flawless', 0.9206349206349206),
 ('superbly', 0.905511811023622),
 ('perfection', 0.896551724137931),
 ('astaire', 0.8947368421052632),
 ('captures', 0.8847926267281107),
 ('voight', 0.8839285714285714),
 ('wonderfully', 0.8830769230769231)]

In [12]:
neg_ratios.most_common()[:15]


Out[12]:
[('boll', 0.9862068965517241),
 ('uwe', 0.9805825242718447),
 ('seagal', 0.9681528662420382),
 ('unwatchable', 0.9537037037037037),
 ('stinker', 0.9514563106796117),
 ('mst', 0.9447513812154696),
 ('incoherent', 0.9424460431654677),
 ('unfunny', 0.9328358208955224),
 ('waste', 0.9314128943758574),
 ('blah', 0.9238578680203046),
 ('pointless', 0.9189723320158103),
 ('horrid', 0.9145299145299145),
 ('atrocious', 0.9137055837563451),
 ('redeeming', 0.9113149847094801),
 ('worst', 0.907427735089645)]
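
These lists line up with the theory: the words with the most extreme ratios are exactly the evaluative words a human would reach for. As a rough validation (a sketch in the spirit of PROJECT 1, not the project's actual solution), we can score each review by summing its words' ratios and check how often the larger sum matches the label:

In [ ]:
def predict(review):
    # Counters return 0 for unseen words, so words outside the
    # ratio tables simply contribute nothing to either sum.
    pos = sum(pos_ratios[word] for word in review.split(" "))
    neg = sum(neg_ratios[word] for word in review.split(" "))
    return 'POSITIVE' if pos > neg else 'NEGATIVE'

correct = sum(predict(reviews[i]) == labels[i] for i in range(len(reviews)))
print("accuracy: {:.2%}".format(correct / len(reviews)))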
