# Sentiment Classification & How To "Frame Problems" for a Neural Network

### What You Should Already Know

• neural networks, forward and back-propagation
• mean squared error
• train/test splits

### Where to Get Help if You Need it

• Re-watch previous Udacity Lectures
• Leverage the recommended Course Reading Material - Grokking Deep Learning (40% Off: traskud17)
• Shoot me a tweet @iamtrask

### Tutorial Outline:

• Intro: The Importance of "Framing a Problem"
• Curate a Dataset
• Developing a "Predictive Theory"
• PROJECT 1: Quick Theory Validation
• Transforming Text to Numbers
• PROJECT 2: Creating the Input/Output Data
• Putting it all together in a Neural Network
• PROJECT 3: Building our Neural Network
• Understanding Neural Noise
• PROJECT 4: Making Learning Faster by Reducing Noise
• Analyzing Inefficiencies in our Network
• PROJECT 5: Making our Network Train and Run Faster
• Further Noise Reduction
• PROJECT 6: Reducing Noise by Strategically Reducing the Vocabulary
• Analysis: What's going on in the weights?

# Lesson: Curate a Dataset

``````

In [1]:

def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

g = open('reviews.txt','r') # What we know!
reviews = list(map(lambda x:x[:-1],g.readlines()))
g.close()

g = open('labels.txt','r') # What we WANT to know!
labels = list(map(lambda x:x[:-1].upper(),g.readlines()))
g.close()

``````
``````

In [2]:

len(reviews)

``````
``````

Out[2]:

25000

``````
``````

In [3]:

reviews[0]

``````
``````

Out[3]:

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

``````
``````

In [4]:

labels[0]

``````
``````

Out[4]:

'POSITIVE'

``````

# Lesson: Develop a Predictive Theory

``````

In [5]:

print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

``````
``````

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...

``````
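The pattern in these samples suggests a predictive theory: individual words carry sentiment signal, with "terrible" clustering under NEGATIVE and "excellent" under POSITIVE. The idea can be sketched in miniature before running it on the full dataset (the four reviews below are hypothetical, invented just for illustration):

```python
from collections import Counter

# Hypothetical mini-dataset to illustrate the theory
toy_reviews = ["terrible boring waste", "excellent wonderful film",
               "terrible acting", "excellent story"]
toy_labels = ["NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE"]

pos, neg = Counter(), Counter()
for review, label in zip(toy_reviews, toy_labels):
    counts = pos if label == "POSITIVE" else neg
    for word in review.split(" "):
        counts[word] += 1

print(pos["excellent"], neg["excellent"])  # 2 0
print(pos["terrible"], neg["terrible"])    # 0 2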

# Project 1: Quick Theory Validation

``````

In [6]:

from collections import Counter
import numpy as np

``````
``````

In [7]:

positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

``````
``````

In [8]:

for i in range(len(reviews)):
    if(labels[i] == 'POSITIVE'):
        for word in reviews[i].split(" "):
            positive_counts[word] += 1
            total_counts[word] += 1
    else:
        for word in reviews[i].split(" "):
            negative_counts[word] += 1
            total_counts[word] += 1

``````
``````

In [9]:

positive_counts.most_common()[:15]

``````
``````

Out[9]:

[('', 550468),
('the', 173324),
('.', 159654),
('and', 89722),
('a', 83688),
('of', 76855),
('to', 66746),
('is', 57245),
('in', 50215),
('br', 49235),
('it', 48025),
('i', 40743),
('that', 35630),
('this', 35080),
('s', 33815)]

``````
``````

In [10]:

pos_neg_ratios = Counter()

for term,cnt in list(total_counts.most_common()):
    if(cnt > 100):
        pos_neg_ratio = positive_counts[term] / float(negative_counts[term]+1)
        pos_neg_ratios[term] = pos_neg_ratio

for word,ratio in pos_neg_ratios.most_common():
    if(ratio > 1):
        pos_neg_ratios[word] = np.log(ratio)
    else:
        pos_neg_ratios[word] = -np.log((1 / (ratio+0.01)))

``````
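Why the log transform? Raw ratios are asymmetric: a word twice as common in positive reviews gets ratio 2.0, while one twice as common in negative reviews gets 0.5, so equally predictive words have very different distances from the neutral value 1.0. Taking the log centers neutral words at 0 and gives opposing words symmetric magnitudes. A quick standalone numeric check (toy counts, independent of the cell above):

```python
import numpy as np

# A word seen 100 times in positive reviews and 50 in negative...
print(np.log(100 / 50))   # ~0.693
# ...mirrors a word seen 50 times in positive and 100 in negative.
print(np.log(50 / 100))   # ~-0.693
# A perfectly balanced word lands exactly at zero.
print(np.log(75 / 75))    # 0.0
```

This symmetry is what makes the sorted `pos_neg_ratios` list readable from both ends: the most positive words at the top, the most negative at the bottom.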
``````

In [11]:

# words with the highest positive-to-negative ratio
pos_neg_ratios.most_common()[:15]

``````
``````

Out[11]:

[('edie', 4.6913478822291435),
('paulie', 4.0775374439057197),
('felix', 3.1527360223636558),
('polanski', 2.8233610476132043),
('matthau', 2.8067217286092401),
('victoria', 2.6810215287142909),
('mildred', 2.6026896854443837),
('gandhi', 2.5389738710582761),
('flawless', 2.451005098112319),
('superbly', 2.2600254785752498),
('perfection', 2.1594842493533721),
('astaire', 2.1400661634962708),
('captures', 2.0386195471595809),
('voight', 2.0301704926730531),
('wonderfully', 2.0218960560332353)]

``````
``````

In [12]:

# words with the highest negative-to-positive ratio
list(reversed(pos_neg_ratios.most_common()))[0:30]

``````
``````

Out[12]:

[('boll', -4.0778152602708904),
('uwe', -3.9218753018711578),
('seagal', -3.3202501058581921),
('unwatchable', -3.0269848170580955),
('stinker', -2.9876839403711624),
('mst', -2.7753833211707968),
('incoherent', -2.7641396677532537),
('unfunny', -2.5545257844967644),
('waste', -2.4907515123361046),
('blah', -2.4475792789485005),
('horrid', -2.3715779644809971),
('pointless', -2.3451073877136341),
('atrocious', -2.3187369339642556),
('redeeming', -2.2667790015910296),
('prom', -2.2601040980178784),
('drivel', -2.2476029585766928),
('lousy', -2.2118080125207054),
('worst', -2.1930856334332267),
('laughable', -2.172468615469592),
('awful', -2.1385076866397488),
('poorly', -2.1326133844207011),
('wasting', -2.1178155545614512),
('remotely', -2.111046881095167),
('existent', -2.0024805005437076),
('boredom', -1.9241486572738005),
('miserably', -1.9216610938019989),
('sucks', -1.9166645809588516),
('uninspired', -1.9131499212248517),
('lame', -1.9117232884159072),
('insult', -1.9085323769376259)]

``````

# Transforming Text into Numbers

``````

In [13]:

from IPython.display import Image

review = "This was a horrible, terrible movie."

Image(filename='sentiment_network.png')

``````
``````

Out[13]:

``````
``````

In [14]:

review = "The movie was excellent"

Image(filename='sentiment_network_pos.png')

``````
``````

Out[14]:

``````

# Project 2: Creating the Input/Output Data

``````

In [15]:

vocab = set(total_counts.keys())
vocab_size = len(vocab)
print(vocab_size)

``````
``````

74074

``````
``````

In [16]:

list(vocab)[:15]

``````
``````

Out[16]:

['',
'guillermo',
'scathingly',
'mortem',
'irreligious',
'seduces',
'kasnoff',
'armpitted',
'koboi',
'mulit',
'supernovas',
'upbraids',
'borkowski',
'hombres',
'pedophile']

``````
``````

In [17]:

import numpy as np

layer_0 = np.zeros((1, vocab_size))
layer_0

``````
``````

Out[17]:

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.]])

``````
``````

In [18]:

from IPython.display import Image
Image(filename='sentiment_network.png')

``````
``````

Out[18]:

``````
``````

In [25]:

word2index = {}

for i, word in enumerate(vocab):
    word2index[word] = i

word2index_sample = {k: word2index[k] for k in list(word2index.keys())[:15]}

word2index_sample

``````
``````

Out[25]:

{'': 0,
'armpitted': 7,
'borkowski': 12,
'guillermo': 1,
'hombres': 13,
'irreligious': 4,
'kasnoff': 6,
'koboi': 8,
'mortem': 3,
'mulit': 9,
'pedophile': 14,
'scathingly': 2,
'seduces': 5,
'supernovas': 10,
'upbraids': 11}

``````
``````

In [26]:

def update_input_layer(review):

    global layer_0

    # clear out previous state; reset the layer to all 0s
    layer_0 *= 0
    for word in review.split(" "):
        layer_0[0][word2index[word]] += 1

update_input_layer(reviews[0])

``````
``````

In [27]:

layer_0

``````
``````

Out[27]:

array([[ 18.,   0.,   0., ...,   0.,   0.,   0.]])

``````
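`update_input_layer` builds a bag-of-words vector: each slot of `layer_0` holds the count of one vocabulary word in the review (the 18 in the first slot counts the empty-string token, index 0 in `word2index`, produced by splitting on repeated spaces). The same idea, self-contained with a tiny hypothetical vocabulary rather than the notebook's global state:

```python
import numpy as np

vocab = ["great", "bad", "movie"]           # tiny hypothetical vocabulary
word2index = {w: i for i, w in enumerate(vocab)}

def to_bag_of_words(text):
    layer = np.zeros((1, len(vocab)))       # one slot per vocabulary word
    for word in text.split(" "):
        if word in word2index:              # ignore out-of-vocabulary words
            layer[0][word2index[word]] += 1
    return layer

print(to_bag_of_words("great great movie"))  # [[2. 0. 1.]]
```

Returning a fresh array (instead of mutating a global like `layer_0`) avoids stale counts between reviews; the notebook achieves the same thing by zeroing the global with `layer_0 *= 0` on every call.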
``````

In [28]:

def get_target_for_label(label):
    if(label == 'POSITIVE'):
        return 1
    else:
        return 0

``````
``````

In [29]:

labels[0]

``````
``````

Out[29]:

'POSITIVE'

``````
``````

In [30]:

get_target_for_label(labels[0])

``````
``````

Out[30]:

1

``````
``````

In [31]:

labels[1]

``````
``````

Out[31]:

'NEGATIVE'

``````
``````

In [32]:

get_target_for_label(labels[1])

``````
``````

Out[32]:

0

``````