Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

What You Should Already Know

  • neural networks, forward and back-propagation
  • stochastic gradient descent
  • mean squared error
  • train/test splits

Where to Get Help if You Need It

  • Re-watch previous Udacity Lectures
  • Leverage the recommended Course Reading Material - Grokking Deep Learning (40% Off: traskud17)
  • Shoot me a tweet @iamtrask

Tutorial Outline:

  • Intro: The Importance of "Framing a Problem"
  • Curate a Dataset
  • Developing a "Predictive Theory"
  • PROJECT 1: Quick Theory Validation
  • Transforming Text to Numbers
  • PROJECT 2: Creating the Input/Output Data
  • Putting it all together in a Neural Network
  • PROJECT 3: Building our Neural Network
  • Understanding Neural Noise
  • PROJECT 4: Making Learning Faster by Reducing Noise
  • Analyzing Inefficiencies in our Network
  • PROJECT 5: Making our Network Train and Run Faster
  • Further Noise Reduction
  • PROJECT 6: Reducing Noise by Strategically Reducing the Vocabulary
  • Analysis: What's going on in the weights?

Lesson: Curate a Dataset


In [14]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t: " + reviews[i][:70] + "...")

g = open('reviews.txt','r') # What we know!
reviews = list(map(lambda x:x[:-1],g.readlines()))
g.close()

g = open('labels.txt','r') # What we WANT to know!
labels = list(map(lambda x:x[:-1].upper(),g.readlines()))
g.close()
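
The two files are expected to line up row for row: the label on line i of labels.txt describes the review on line i of reviews.txt, which is what pretty_print_review_and_label relies on. A minimal sanity check under that assumption (this cell is an added sketch, not part of the original lesson code):

# Sketch: each review should have exactly one label at the same index.
assert len(reviews) == len(labels)
print("loaded", len(reviews), "reviews and", len(labels), "labels")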

In [2]:
len(reviews)


Out[2]:
25000

In [5]:
reviews[1]


Out[5]:
'story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turned into an insane  violent mob by the crazy chantings of it  s singers . unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting . even those from the era should be turned off . the cryptic dialogue would make shakespeare seem easy to a third grader . on a technical level it  s better than you might think with some good cinematography by future great vilmos zsigmond . future stars sally kirkland and frederic forrest can be seen briefly .  '

In [6]:
labels[1]


Out[6]:
'NEGATIVE'

Lesson: Develop a Predictive Theory


In [15]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)


labels.txt 	 : 	 reviews.txt

NEGATIVE	: this movie is terrible but it has some good effects .  ...
POSITIVE	: adrian pasdar is excellent is this film . he makes a fascinating woman...
NEGATIVE	: comment this movie is impossible . is terrible  very improbable  bad i...
POSITIVE	: excellent episode movie ala pulp fiction .  days   suicides . it doesn...
NEGATIVE	: if you haven  t seen this  it  s terrible . it is pure trash . i saw t...
POSITIVE	: this schiffer guy is a real genius  the movie is of excellent quality ...

Counting all the words:


In [126]:
from collections import Counter
import numpy as np

c = Counter()

for review in reviews:
    for word in str(review).split():
        c[word] += 1

The most common words have no predictive power:


In [125]:
common_words = c.most_common(24)
common_words


Out[125]:
[('the', 336713),
 ('.', 327192),
 ('and', 164107),
 ('a', 163009),
 ('of', 145864),
 ('to', 135720),
 ('is', 107328),
 ('br', 101872),
 ('it', 96352),
 ('in', 93968),
 ('i', 87623),
 ('this', 76000),
 ('that', 73245),
 ('s', 65361),
 ('was', 48208),
 ('as', 46933),
 ('for', 44343),
 ('with', 44125),
 ('movie', 44039),
 ('but', 42603),
 ('film', 40155),
 ('you', 34230),
 ('on', 34200),
 ('t', 34081)]

In [88]:
for word, count in common_words:
    del c[word]
c.most_common(10)


Out[88]:
[('one', 26789),
 ('all', 23978),
 ('at', 23513),
 ('they', 22906),
 ('by', 22546),
 ('an', 21560),
 ('who', 21433),
 ('so', 20617),
 ('from', 20498),
 ('like', 20276)]

Hmm... it would be more useful to have two counters: one for negative reviews and one for positive reviews.


In [127]:
negative_words = Counter()
positive_words = Counter()
total_words = Counter()

for i, review in enumerate(reviews):
    for word in str(review).split():
        if labels[i] == "NEGATIVE":
            negative_words[word] += 1
        else:
            positive_words[word] += 1
        total_words[word] += 1
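
With the three counters in place, a simple way to measure how strongly a word leans positive is the ratio of its positive count to its negative count, with +1.0 in the denominator to avoid dividing by zero for words that never appear in negative reviews. As a rough worked example (the word "excellent" is picked purely for illustration, and this cell is an added sketch), the snippet below computes that ratio for a single word; the next cell does the same thing for every sufficiently frequent word:

# Sketch: positive-to-negative ratio for one illustrative word.
word = "excellent"
ratio = positive_words[word] / (negative_words[word] + 1.0)
print(word, "appears", positive_words[word], "times in positive reviews and",
      negative_words[word], "times in negative reviews; ratio ~", round(ratio, 2))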

In [172]:
pos_neg_ratios = Counter()
neg_pos_ratios = Counter()
for word, cnt in total_words.most_common():
    if cnt > 500:
        pos_neg_ratios[word] += positive_words[word] / (negative_words[word] + 1.0)
        neg_pos_ratios[word] += negative_words[word] / (positive_words[word] + 1.0)
        
pos_neg_ratios.most_common(20)


Out[172]:
[('superb', 5.524271844660194),
 ('wonderful', 4.780487804878049),
 ('fantastic', 4.503448275862069),
 ('excellent', 4.326478149100257),
 ('amazing', 4.022813688212928),
 ('powerful', 3.669172932330827),
 ('favorite', 3.5498154981549814),
 ('perfect', 3.4789915966386555),
 ('brilliant', 3.4169741697416973),
 ('perfectly', 3.310810810810811),
 ('loved', 3.1783625730994154),
 ('highly', 3.133093525179856),
 ('tony', 3.125984251968504),
 ('today', 3.0193548387096776),
 ('unique', 2.96875),
 ('beauty', 2.8588235294117648),
 ('greatest', 2.786802030456853),
 ('portrayal', 2.77037037037037),
 ('incredible', 2.7350993377483444),
 ('sweet', 2.6903225806451614)]

In [171]:
neg_pos_ratios.most_common(20)


Out[171]:
[('waste', 13.58),
 ('worst', 9.802371541501977),
 ('awful', 9.21301775147929),
 ('horrible', 6.705128205128205),
 ('crap', 6.172413793103448),
 ('worse', 5.65158371040724),
 ('terrible', 5.608870967741935),
 ('stupid', 5.211678832116788),
 ('boring', 4.425149700598802),
 ('bad', 3.8789308176100628),
 ('supposed', 3.583081570996979),
 ('poor', 3.551558752997602),
 ('oh', 2.9617486338797816),
 ('save', 2.683453237410072),
 ('gore', 2.637630662020906),
 ('minutes', 2.5676328502415457),
 ('decent', 2.376093294460641),
 ('ok', 2.363036303630363),
 ('attempt', 2.3259493670886076),
 ('nothing', 2.298232129131437)]
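
These two ratio tables already suggest a crude way to validate the predictive theory (PROJECT 1 in the outline): score a review by summing the positive-leaning and negative-leaning evidence of its words and comparing the totals. The sketch below is only an informal check added here, not part of the lesson code; score_review is a hypothetical helper, and words missing from the counters simply contribute zero.

def score_review(review):
    # Rough heuristic: total positive-leaning vs. negative-leaning evidence.
    words = review.split()
    pos_score = sum(pos_neg_ratios[w] for w in words)
    neg_score = sum(neg_pos_ratios[w] for w in words)
    return "POSITIVE" if pos_score > neg_score else "NEGATIVE"

# Spot-check against a few of the labeled examples printed earlier.
for i in (2137, 12816, 6267, 21934):
    print(labels[i], "\t predicted:", score_review(reviews[i]))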
