# Sentiment Classification & How To "Frame Problems" for a Neural Network

### What You Should Already Know

• neural networks, forward and back-propagation
• mean squared error
• train/test splits

### Where to Get Help if You Need it

• Re-watch previous Udacity Lectures
• Leverage the recommended Course Reading Material - Grokking Deep Learning (40% Off: traskud17)
• Shoot me a tweet @iamtrask

### Tutorial Outline:

• Intro: The Importance of "Framing a Problem"
• Curate a Dataset
• Developing a "Predictive Theory"
• PROJECT 1: Quick Theory Validation
• Transforming Text to Numbers
• PROJECT 2: Creating the Input/Output Data
• Putting it all together in a Neural Network
• PROJECT 3: Building our Neural Network
• Understanding Neural Noise
• PROJECT 4: Making Learning Faster by Reducing Noise
• Analyzing Inefficiencies in our Network
• PROJECT 5: Making our Network Train and Run Faster
• Further Noise Reduction
• PROJECT 6: Reducing Noise by Strategically Reducing the Vocabulary
• Analysis: What's going on in the weights?

# Lesson: Curate a Dataset

``````

In [14]:

def pretty_print_review_and_label(i):
    print(labels[i] + "\t: " + reviews[i][:70] + "...")

g = open('reviews.txt','r') # What we know!
reviews = list(map(lambda x: x[:-1], g.readlines())) # strip trailing newlines
g.close()

g = open('labels.txt','r') # What we WANT to know!
labels = list(map(lambda x: x[:-1].upper(), g.readlines()))
g.close()

``````
``````

In [2]:

len(reviews)

``````
``````

Out[2]:

25000

``````
``````

In [5]:

reviews[1]

``````
``````

Out[5]:

'story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turned into an insane  violent mob by the crazy chantings of it  s singers . unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting . even those from the era should be turned off . the cryptic dialogue would make shakespeare seem easy to a third grader . on a technical level it  s better than you might think with some good cinematography by future great vilmos zsigmond . future stars sally kirkland and frederic forrest can be seen briefly .  '

``````
``````

In [6]:

labels[1]

``````
``````

Out[6]:

'NEGATIVE'

``````

# Lesson: Develop a Predictive Theory

``````

In [15]:

print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

``````
``````

labels.txt 	 : 	 reviews.txt

NEGATIVE	: this movie is terrible but it has some good effects .  ...
POSITIVE	: adrian pasdar is excellent is this film . he makes a fascinating woman...
NEGATIVE	: comment this movie is impossible . is terrible  very improbable  bad i...
POSITIVE	: excellent episode movie ala pulp fiction .  days   suicides . it doesn...
NEGATIVE	: if you haven  t seen this  it  s terrible . it is pure trash . i saw t...
POSITIVE	: this schiffer guy is a real genius  the movie is of excellent quality ...

``````

Counting all the words:

``````

In [126]:

from collections import Counter
import numpy as np

c = Counter()

for review in reviews:
    for word in str(review).split():
        c[word] += 1

``````

The most common words have no predictive power:

``````

In [125]:

common_words = c.most_common(24)
common_words

``````
``````

Out[125]:

[('the', 336713),
('.', 327192),
('and', 164107),
('a', 163009),
('of', 145864),
('to', 135720),
('is', 107328),
('br', 101872),
('it', 96352),
('in', 93968),
('i', 87623),
('this', 76000),
('that', 73245),
('s', 65361),
('was', 48208),
('as', 46933),
('for', 44343),
('with', 44125),
('movie', 44039),
('but', 42603),
('film', 40155),
('you', 34230),
('on', 34200),
('t', 34081)]

``````
``````

In [88]:

for word in common_words:
    del c[word[0]]
c.most_common(10)

``````
``````

Out[88]:

[('one', 26789),
('all', 23978),
('at', 23513),
('they', 22906),
('by', 22546),
('an', 21560),
('who', 21433),
('so', 20617),
('from', 20498),
('like', 20276)]

``````

Hmm... it would be more useful to have two counters: one for negative reviews and one for positive ones.

``````

In [127]:

negative_words = Counter()
positive_words = Counter()
total_words = Counter()

for i, review in enumerate(reviews):
    for word in str(review).split():
        if labels[i] == "NEGATIVE":
            negative_words[word] += 1
        else:
            positive_words[word] += 1
        total_words[word] += 1

``````
``````

In [172]:

pos_neg_ratios = Counter()
neg_pos_ratios = Counter()
for word, cnt in total_words.most_common():
    if cnt > 500:
        pos_neg_ratios[word] += positive_words[word] / (negative_words[word] + 1.0)
        neg_pos_ratios[word] += negative_words[word] / (positive_words[word] + 1.0)

pos_neg_ratios.most_common(20)

``````
``````

Out[172]:

[('superb', 5.524271844660194),
('wonderful', 4.780487804878049),
('fantastic', 4.503448275862069),
('excellent', 4.326478149100257),
('amazing', 4.022813688212928),
('powerful', 3.669172932330827),
('favorite', 3.5498154981549814),
('perfect', 3.4789915966386555),
('brilliant', 3.4169741697416973),
('perfectly', 3.310810810810811),
('loved', 3.1783625730994154),
('highly', 3.133093525179856),
('tony', 3.125984251968504),
('today', 3.0193548387096776),
('unique', 2.96875),
('beauty', 2.8588235294117648),
('greatest', 2.786802030456853),
('portrayal', 2.77037037037037),
('incredible', 2.7350993377483444),
('sweet', 2.6903225806451614)]

``````
``````

In [171]:

neg_pos_ratios.most_common(20)

``````
``````

Out[171]:

[('waste', 13.58),
('worst', 9.802371541501977),
('awful', 9.21301775147929),
('horrible', 6.705128205128205),
('crap', 6.172413793103448),
('worse', 5.65158371040724),
('terrible', 5.608870967741935),
('stupid', 5.211678832116788),
('boring', 4.425149700598802),
('supposed', 3.583081570996979),
('poor', 3.551558752997602),
('oh', 2.9617486338797816),
('save', 2.683453237410072),
('gore', 2.637630662020906),
('minutes', 2.5676328502415457),
('decent', 2.376093294460641),
('ok', 2.363036303630363),
('attempt', 2.3259493670886076),
('nothing', 2.298232129131437)]

``````
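One awkward property of these raw ratios: positive words run from 1 upward while negative words are squashed between 0 and 1, so equally strong signals have very different magnitudes. A common fix (a sketch, not part of this notebook's code) is to take the log of the positive-to-negative ratio, which makes the two classes symmetric around zero:

```python
import numpy as np

# Hypothetical ratio values in the style of pos_neg_ratios above
pos_neg_ratios = {"superb": 5.52, "the": 1.0, "waste": 1 / 13.58}

log_ratios = {word: np.log(ratio) for word, ratio in pos_neg_ratios.items()}

# Neutral words land near 0; positive and negative words get
# roughly symmetric magnitudes with opposite signs.
print(log_ratios)
```

With this transform a single sorted list surfaces both the most positive words (large positive values) and the most negative ones (large negative values), instead of needing two separate counters.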