Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

Twitter: @iamtrask
Blog: http://iamtrask.github.io

What You Should Already Know

Neural networks, forward and back-propagation
Stochastic gradient descent
Mean squared error
Train/test splits

Where to Get Help if You Need it

Re-watch previous Udacity lectures
Consult the recommended course reading material: Grokking Deep Learning (40% Off: traskud17)
Shoot me a tweet @iamtrask

Tutorial Outline:

Intro: The Importance of "Framing a Problem"

Curate a Dataset
Developing a "Predictive Theory"
PROJECT 1: Quick Theory Validation

Transforming Text to Numbers
PROJECT 2: Creating the Input/Output Data

Putting it all together in a Neural Network
PROJECT 3: Building our Neural Network

Understanding Neural Noise
PROJECT 4: Making Learning Faster by Reducing Noise

Analyzing Inefficiencies in our Network
PROJECT 5: Making our Network Train and Run Faster

Further Noise Reduction
PROJECT 6: Reducing Noise by Strategically Reducing the Vocabulary

Analysis: What's going on in the weights?

Lesson: Curate a Dataset



In [1]:

    
def pretty_print_review_and_label(i):
    """Print the label and the first 80 characters of review i."""
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

# What we know: one review per line
with open('reviews.txt', 'r') as g:
    reviews = [line.rstrip('\n') for line in g]

# What we WANT to know: one label per line, upper-cased
with open('labels.txt', 'r') as g:
    labels = [line.rstrip('\n').upper() for line in g]



In [2]:

    
len(reviews)









    Out[2]:





25000



In [3]:

    
reviews[0]









    Out[3]:





'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '



In [4]:

    
labels[0]









    Out[4]:





'POSITIVE'

Lesson: Develop a Predictive Theory



In [5]:

    
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)









    



labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...
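
The samples above suggest a theory: individual words such as "terrible" and "excellent" carry most of the signal about the label. A minimal sketch of how that could be checked by hand on a tiny sample follows. The names `toy_reviews`, `toy_labels` and `label_counts_for_word` are hypothetical and only stand in for the real data loaded earlier.

```python
# Hypothetical toy data standing in for reviews.txt / labels.txt.
toy_reviews = [
    "this movie is terrible but it has some good effects .",
    "adrian pasdar is excellent is this film .",
    "it is terrible . it is pure trash .",
    "the movie is of excellent quality",
]
toy_labels = ["NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE"]

def label_counts_for_word(word, reviews, labels):
    """Count how many reviews of each label contain `word` at least once."""
    counts = {"POSITIVE": 0, "NEGATIVE": 0}
    for review, label in zip(reviews, labels):
        if word in review.split():
            counts[label] += 1
    return counts

terrible_counts = label_counts_for_word("terrible", toy_reviews, toy_labels)
excellent_counts = label_counts_for_word("excellent", toy_reviews, toy_labels)
print("terrible :", terrible_counts)
print("excellent:", excellent_counts)
```

If a word shows up almost exclusively under one label, it is a good candidate input for the network. Four hand-picked reviews prove nothing on their own, which is why the next project counts over the whole dataset.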

Project 1: Quick Theory Validation



In [6]:

    
from collections import Counter
import numpy as np



In [7]:

    
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()



In [8]:

    
for i in range(len(reviews)):
    if(labels[i] == 'POSITIVE'):
        for word in reviews[i].split(" "):
            positive_counts[word] += 1
            total_counts[word] += 1
    else:
        for word in reviews[i].split(" "):
            negative_counts[word] += 1
            total_counts[word] += 1
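
The loop above can also be written with `Counter.update`, which adds one count per element of a list. The sketch below uses two made-up reviews (`toy_reviews` and `toy_labels` are hypothetical) and keeps the same `split(" ")` tokenization, so it also shows a side effect of that choice: consecutive spaces produce empty-string tokens, which is why an empty string can appear in the counts.

```python
from collections import Counter

# Hypothetical toy data; note the double space in the first review.
toy_reviews = ["good  movie", "bad movie"]
toy_labels = ["POSITIVE", "NEGATIVE"]

pos, neg, total = Counter(), Counter(), Counter()
for review, label in zip(toy_reviews, toy_labels):
    words = review.split(" ")  # same tokenization as the cell above
    # Route the words to the counter that matches the label.
    (pos if label == "POSITIVE" else neg).update(words)
    total.update(words)

print(total.most_common())
print(repr(toy_reviews[0].split(" ")))  # the '' token comes from the double space
```

Using `review.split()` with no argument would drop the empty tokens, but it would also change the counts, so it is shown here only as an option.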



In [9]:

    
positive_counts.most_common()[:15]









    Out[9]:





[('', 550468),
 ('the', 173324),
 ('.', 159654),
 ('and', 89722),
 ('a', 83688),
 ('of', 76855),
 ('to', 66746),
 ('is', 57245),
 ('in', 50215),
 ('br', 49235),
 ('it', 48025),
 ('i', 40743),
 ('that', 35630),
 ('this', 35080),
 ('s', 33815)]



In [10]:

    
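pos_neg_ratios = Counter()

# Ratio of positive to negative uses, only for words seen more than 100 times
for term, cnt in total_counts.most_common():
    if cnt > 100:
        # +1 in the denominator avoids division by zero
        pos_neg_ratio = positive_counts[term] / float(negative_counts[term] + 1)
        pos_neg_ratios[term] = pos_neg_ratio

# Take logs so the scores are roughly centred on 0:
# positive-leaning words score above 0, negative-leaning words below 0
for word, ratio in pos_neg_ratios.most_common():
    if ratio > 1:
        pos_neg_ratios[word] = np.log(ratio)
    else:
        # equivalent to np.log(ratio + 0.01); the 0.01 avoids log(0)
        pos_neg_ratios[word] = -np.log(1 / (ratio + 0.01))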
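
To see what the log transform does, here is the same arithmetic wrapped in a small function and applied to made-up counts. The function name `signed_log_ratio` and the counts are hypothetical. The formula is copied from the cell above.

```python
import numpy as np

def signed_log_ratio(pos_count, neg_count):
    """Same two steps as above: smoothed ratio, then a log that keeps the sign."""
    ratio = pos_count / float(neg_count + 1)
    if ratio > 1:
        return np.log(ratio)
    # -log(1 / x) is the same as log(x); the 0.01 keeps x away from 0
    return -np.log(1 / (ratio + 0.01))

# Made-up counts: (positive uses, negative uses)
mostly_positive = signed_log_ratio(400, 99)   # ratio = 4.0
neutral = signed_log_ratio(100, 99)           # ratio = 1.0
mostly_negative = signed_log_ratio(99, 399)   # ratio = 0.2475

print(mostly_positive, neutral, mostly_negative)
```

A raw ratio squeezes every negative-leaning word into the interval between 0 and 1, while positive-leaning words can grow without bound. After the log, a word that is four times more common in positive reviews and one that is four times more common in negative reviews get scores of similar magnitude and opposite sign, and neutral words land near 0.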



In [11]:

    
# words with the highest positive-to-negative log ratio, i.e. most characteristic of "POSITIVE" reviews
pos_neg_ratios.most_common()[:15]









    Out[11]:





[('edie', 4.6913478822291435),
 ('paulie', 4.0775374439057197),
 ('felix', 3.1527360223636558),
 ('polanski', 2.8233610476132043),
 ('matthau', 2.8067217286092401),
 ('victoria', 2.6810215287142909),
 ('mildred', 2.6026896854443837),
 ('gandhi', 2.5389738710582761),
 ('flawless', 2.451005098112319),
 ('superbly', 2.2600254785752498),
 ('perfection', 2.1594842493533721),
 ('astaire', 2.1400661634962708),
 ('captures', 2.0386195471595809),
 ('voight', 2.0301704926730531),
 ('wonderfully', 2.0218960560332353)]



In [12]:

    
# words with the lowest positive-to-negative log ratio, i.e. most characteristic of "NEGATIVE" reviews
list(reversed(pos_neg_ratios.most_common()))[0:30]









    Out[12]:





[('boll', -4.0778152602708904),
 ('uwe', -3.9218753018711578),
 ('seagal', -3.3202501058581921),
 ('unwatchable', -3.0269848170580955),
 ('stinker', -2.9876839403711624),
 ('mst', -2.7753833211707968),
 ('incoherent', -2.7641396677532537),
 ('unfunny', -2.5545257844967644),
 ('waste', -2.4907515123361046),
 ('blah', -2.4475792789485005),
 ('horrid', -2.3715779644809971),
 ('pointless', -2.3451073877136341),
 ('atrocious', -2.3187369339642556),
 ('redeeming', -2.2667790015910296),
 ('prom', -2.2601040980178784),
 ('drivel', -2.2476029585766928),
 ('lousy', -2.2118080125207054),
 ('worst', -2.1930856334332267),
 ('laughable', -2.172468615469592),
 ('awful', -2.1385076866397488),
 ('poorly', -2.1326133844207011),
 ('wasting', -2.1178155545614512),
 ('remotely', -2.111046881095167),
 ('existent', -2.0024805005437076),
 ('boredom', -1.9241486572738005),
 ('miserably', -1.9216610938019989),
 ('sucks', -1.9166645809588516),
 ('uninspired', -1.9131499212248517),
 ('lame', -1.9117232884159072),
 ('insult', -1.9085323769376259)]

Transforming Text into Numbers



In [13]:

    
from IPython.display import Image

review = "This was a horrible, terrible movie."

Image(filename='sentiment_network.png')









    Out[13]:



In [14]:

    
review = "The movie was excellent"

Image(filename='sentiment_network_pos.png')









    Out[14]:
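
The two example sentences can be turned into numbers by counting how often each vocabulary word occurs, one input position per word. The sketch below is only an illustration under stated assumptions: `toy_vocab` is a hand-made eight-word vocabulary, and the lower-casing and punctuation stripping are added here so that the example sentences match it. Neither is taken from the dataset.

```python
import numpy as np

# Hypothetical mini vocabulary covering the two example sentences.
toy_vocab = ["a", "excellent", "horrible", "movie", "terrible", "the", "this", "was"]
toy_word2index = {word: i for i, word in enumerate(toy_vocab)}

def count_vector(text):
    """Return one count per vocabulary word for the given text."""
    vec = np.zeros(len(toy_vocab))
    # Assumed preprocessing: lower-case and drop commas and periods.
    cleaned = text.lower().replace(",", " ").replace(".", " ")
    for word in cleaned.split():
        if word in toy_word2index:
            vec[toy_word2index[word]] += 1
    return vec

neg_vec = count_vector("This was a horrible, terrible movie.")
pos_vec = count_vector("The movie was excellent")
print(toy_vocab)
print(neg_vec)
print(pos_vec)
```

Each sentence becomes a fixed-length vector regardless of how many words it contains, which is what lets a network with a fixed number of input nodes consume it.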

Project 2: Creating the Input/Output Data



In [15]:

    
vocab = set(total_counts.keys())
vocab_size = len(vocab)
print(vocab_size)



In [16]:

    
list(vocab)[:15]









    Out[16]:





['',
 'guillermo',
 'scathingly',
 'mortem',
 'irreligious',
 'seduces',
 'kasnoff',
 'armpitted',
 'koboi',
 'mulit',
 'supernovas',
 'upbraids',
 'borkowski',
 'hombres',
 'pedophile']



In [17]:

    
import numpy as np

layer_0 = np.zeros((1, vocab_size))
layer_0









    Out[17]:





array([[ 0.,  0.,  0., ...,  0.,  0.,  0.]])



In [18]:

    
from IPython.display import Image
Image(filename='sentiment_network.png')









    Out[18]:



In [25]:

    
word2index = {}

for i, word in enumerate(vocab):
    word2index[word] = i

word2index_sample = {k: word2index[k] for k in list(word2index.keys())[:15]}

word2index_sample









    Out[25]:





{'': 0,
 'armpitted': 7,
 'borkowski': 12,
 'guillermo': 1,
 'hombres': 13,
 'irreligious': 4,
 'kasnoff': 6,
 'koboi': 8,
 'mortem': 3,
 'mulit': 9,
 'pedophile': 14,
 'scathingly': 2,
 'seduces': 5,
 'supernovas': 10,
 'upbraids': 11}



In [26]:

    
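def update_input_layer(review):
    """Fill the global layer_0 with the word counts of one review."""
    global layer_0

    # clear out previous state: reset every count to 0
    layer_0 *= 0
    for word in review.split(" "):
        layer_0[0][word2index[word]] += 1

update_input_layer(reviews[0])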



In [27]:

    
layer_0









    Out[27]:





array([[ 18.,   0.,   0., ...,   0.,   0.,   0.]])
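
A quick way to sanity-check an input vector is to decode it back into words. In the run shown above, `word2index` maps the empty string to index 0, so the 18. at the first position is presumably the number of empty-string tokens in `reviews[0]` rather than a real word. The sketch below reproduces the idea on a hypothetical five-word vocabulary. Every `toy_` name is made up for illustration.

```python
import numpy as np

# Hypothetical mini vocabulary; '' is included because split(" ") can produce it.
toy_vocab = ["", "bromwell", "high", "is", "cartoon"]
toy_word2index = {word: i for i, word in enumerate(toy_vocab)}
toy_index2word = {i: word for word, i in toy_word2index.items()}

# Fill the vector the same way update_input_layer does.
toy_layer_0 = np.zeros((1, len(toy_vocab)))
for word in "bromwell high is  high".split(" "):
    toy_layer_0[0][toy_word2index[word]] += 1

# Decode: map every non-zero position back to its word and count.
decoded = {
    toy_index2word[int(i)]: int(toy_layer_0[0][i])
    for i in np.nonzero(toy_layer_0[0])[0]
}
print(decoded)
```

The reverse mapping costs one extra dictionary and makes it easy to spot tokens, such as the empty string, that contribute counts without carrying any meaning.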



In [28]:

    
def get_target_for_label(label):
    """Map a label string to the network's target: 1 for 'POSITIVE', 0 otherwise."""
    if (label == 'POSITIVE'):
        return 1
    else:
        return 0



In [29]:

    
labels[0]









    Out[29]:





'POSITIVE'



In [30]:

    
get_target_for_label(labels[0])









    Out[30]:





1



In [31]:

    
labels[1]









    Out[31]:





'NEGATIVE'



In [32]:

    
get_target_for_label(labels[1])









    Out[32]:





0
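
The same mapping can be applied to several labels at once to build a target array. This is a minimal sketch with a hypothetical `toy_labels` list. The function body repeats the rule defined above so that the block runs on its own.

```python
import numpy as np

def get_target_for_label(label):
    # Same rule as above: 'POSITIVE' -> 1, anything else -> 0.
    return 1 if label == 'POSITIVE' else 0

toy_labels = ['POSITIVE', 'NEGATIVE', 'NEGATIVE', 'POSITIVE']  # hypothetical sample
targets = np.array([get_target_for_label(label) for label in toy_labels])
print(targets)
```

Because every string other than 'POSITIVE' maps to 0, a misspelled or lower-case label would silently count as negative. The labels here are upper-cased when they are loaded, which avoids the lower-case case.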


