Hi, my name is Andrew Trask. I am currently a PhD student at the University of Oxford studying Deep Learning for Natural Language Processing. Natural Language Processing is the field that studies human language, and today we’re going to be talking about Sentiment Classification, or the classification of whether a section of human-generated text is positive or negative (i.e., happy or sad). Deep Learning, as I’m sure you’re coming to understand, is a set of tools (neural networks) used to take what we “know” and predict what we “want to know”. In this case, we “know” a paragraph of text generated by a human, and we “want to know” whether it has positive or negative sentiment. Our goal is to build a neural network that can make this prediction.
What this tutorial is really about is "framing a problem" so that a neural network can be successful in solving it. Sentiment is a great example because neural networks don't take raw text as input, they take numbers! We have to consider how to efficiently transform our text into numbers so that our network can learn a valuable underlying pattern. I can't stress enough how important this skillset will be to your career. Frameworks (like TensorFlow) will handle backpropagation, gradients, and error measures for you, but "framing the problem" is up to you, the scientist, and if it's not done correctly, your networks will spend forever searching for correlation between your two datasets (and they might never find it).
I am assuming you already know about neural networks, forward and back-propagation, stochastic gradient descent, mean squared error, and train/test splits from previous lessons.
Neural networks, by themselves, cannot do anything. All a neural network really does is search for direct or indirect correlation between two datasets. So, in order for a neural network to learn anything, we have to present it with two meaningful datasets. The first dataset must represent what we “know” and the second dataset must represent what we “want to know”, or what we want the neural network to be able to tell us. As the network trains, it’s going to search for correlation between these two datasets, so that eventually it can take one and predict the other. Let me show you what I mean with our example sentiment dataset.
In [1]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")
g = open('reviews.txt','r')
reviews = list(map(lambda x:x[:-1],g.readlines()))
g.close()
g = open('labels.txt','r')
labels = list(map(lambda x:x[:-1].upper(),g.readlines()))
g.close()
In the cell above, I have loaded two datasets. The first dataset "reviews" is a list of 25,000 movie reviews that people wrote about various movies. The second dataset is a list of whether or not each review is a “positive” review or “negative” review.
In [2]:
reviews[0]
Out[2]:
In [3]:
labels[0]
Out[3]:
I want you to pretend that you’re a neural network for a moment. Consider a few examples from the two datasets below. Do you see any correlation between these two datasets?
In [4]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)
Well, let’s consider several different granularities. At the paragraph level, no two paragraphs are the same, so there can be no “correlation” per se. You have to see two things occur together more than once for there to be any “correlation”. What about at the character level? I’m guessing the letter “b” is used just as much in positive reviews as it is in negative reviews. How about the word level? Ah, I think there's some correlation between the words in these reviews and whether or not the review is positive or negative.
In [5]:
import numpy as np  # needed below for np.log
from collections import Counter
In [6]:
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()
for i in range(len(reviews)):
    if(labels[i] == 'POSITIVE'):
        for word in reviews[i].split(" "):
            positive_counts[word] += 1
            total_counts[word] += 1
    else:
        for word in reviews[i].split(" "):
            negative_counts[word] += 1
            total_counts[word] += 1
pos_neg_ratios = Counter()
for term,cnt in list(total_counts.most_common()):
    if(cnt > 10):
        pos_neg_ratio = positive_counts[term] / float(negative_counts[term]+1)
        pos_neg_ratios[term] = pos_neg_ratio
for word,ratio in pos_neg_ratios.most_common():
    if(ratio > 1):
        pos_neg_ratios[word] = np.log(ratio)
    else:
        pos_neg_ratios[word] = -np.log((1 / (ratio+0.01)))
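To make the log transformation above concrete, here is a quick worked example with made-up counts (the actual value for any particular word depends on the corpus). A word seen 1,000 times in positive reviews and 100 times in negative reviews gets a raw ratio of about 9.9, which the log compresses to roughly 2.3, while a word with the mirror-image counts lands near -2.2. Centering the scale at 0 makes "equally positive and negative" easy to spot.
# hypothetical counts, just to illustrate the transformation above
pos_count, neg_count = 1000, 100                 # a strongly positive-leaning word
ratio = pos_count / float(neg_count + 1)         # ~9.9
print(np.log(ratio))                             # ~2.3
mirror_ratio = neg_count / float(pos_count + 1)  # ~0.1, the mirror-image (negative-leaning) word
print(-np.log(1 / (mirror_ratio + 0.01)))        # ~-2.2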
In [169]:
# words most frequently seen in a review with a "POSITIVE" label
pos_neg_ratios.most_common()
Out[169]:
In [157]:
# words most frequently seen in a review with a "NEGATIVE" label
list(reversed(pos_neg_ratios.most_common()))[0:30]
Out[157]:
Wow, there’s really something to this theory! As we can see, there are clearly terms in movie reviews that have correlation with our output labels. So, if we think there might be strong correlation between the words present in a particular review and the sentiment of that review, what should our network take as input and then predict? Let me put it a different way: If we think that there is correlation between the “vocabulary” of a particular review and the sentiment of that review, what should be the input and output to our neural network? The input should be the “vocabulary of the review” and the output should be whether or not the review is positive or negative!
Now that we have some idea that this task is possible (and where we want the network to find correlation), let’s try to train a neural network to predict sentiment based on the vocabulary of a movie review.
The next challenge is to transform our datasets into something that the neural network can read.
As I’m sure you’ve learned, neural networks are made up of layers of interconnected “neurons”. The first layer is where our input data “goes in” to the network. Any particular “input neuron” can take one of two kinds of input: binary inputs or “real-valued” inputs. Previously, you’ve been training networks on raw, continuous data (real-valued inputs). However, now we’re modeling whether different input terms “exist” or “do not exist” in a movie review. When we model something that either “exists” or “does not exist”, or something that is either “true” or “false”, we want to use “binary” inputs to our neural network. This use of binary values is called "one-hot encoding". Let me show you what I mean.
In [85]:
from IPython.display import Image
review = "This was a horrible, terrible movie."
Image(filename='sentiment_network.png')
Out[85]:
In [86]:
review = "The movie was excellent"
Image(filename='sentiment_network_pos.png')
Out[86]:
Let’s say our entire movie review corpus has 10,000 words. Given a single movie review ("This was a horrible, terrible movie"), we’re going to put a “1” in the input of our neural network for every word that exists in the review, and a “0” everywhere else. So, given our 10,000 words, a movie review with 6 words would have 6 neurons with a “1” and 9,994 neurons with a “0”. The picture above is a miniaturized version of this, displaying how we input a "1" for the words "horrible" and "terrible" while inputting a "0" for the word "excellent", because it was not present in the review.
In the same way, we want our network to either predict that the input is “positive” or “negative”. Now, our networks can’t write “positive” or “negative”, so we’re going to instead have another single neuron that represents “positive” when it is a “1” and “negative” when it is a “0”. In this way, our network can give us a number that we will interpret as “positive” or “negative”.
What we’re actually doing here is creating a “derivative dataset” from our movie reviews. Neural networks, after all, can’t read text. So, what we’re doing is identifying the “source of correlation” in our two datasets and creating a derivative dataset made up of numbers that preserve the patterns that we care about. In our input dataset, that pattern is the existence or non-existence of a particular word. In our output dataset, that pattern is whether a statement is positive or negative. Now we’ve converted our patterns into something our network can understand! Our network is going to look for correlation between the 1s and 0s in our input and the 1s and 0s in our output, and if it can do so it has learned to predict the sentiment of movie reviews. Now that our data is ready for the network, let’s start building the network.
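To make this concrete, here is a tiny, entirely made-up version of that derivative dataset; the code we write next builds this same kind of structure, just with the full 25,000-review corpus and its much larger vocabulary.
# hypothetical two-review corpus, just to picture the derivative dataset
toy_reviews = ["the movie was excellent", "the movie was terrible"]
toy_labels  = ["POSITIVE", "NEGATIVE"]

toy_vocab = sorted(set(" ".join(toy_reviews).split(" ")))
toy_word2index = {word: i for i, word in enumerate(toy_vocab)}

inputs  = []   # 1s and 0s marking which vocabulary words each review contains
targets = []   # 1 for POSITIVE, 0 for NEGATIVE
for review, label in zip(toy_reviews, toy_labels):
    vector = [0] * len(toy_vocab)
    for word in review.split(" "):
        vector[toy_word2index[word]] = 1
    inputs.append(vector)
    targets.append(1 if label == "POSITIVE" else 0)
print(toy_vocab)   # ['excellent', 'movie', 'terrible', 'the', 'was']
print(inputs)      # [[1, 1, 0, 1, 1], [0, 1, 1, 1, 1]]
print(targets)     # [1, 0]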
As we just learned above, in order for our neural network to predict on a movie review, we have to be able to create an input layer of 1s and 0s that correlates with the words present in a review. Let's start by creating a function that can take a review and generate this layer of 1s and 0s.
In order to create this function, we first must decide how many input neurons we need. The answer is quite simple. Since we want our network's input to be able to represent the presence or absence of any word in the vocabulary, we need one node per vocabulary term. So, our input layer size is the size of our vocabulary. Let's calculate that.
In [87]:
vocab = set(total_counts.keys())
vocab_size = len(vocab)
print(vocab_size)
And now we can initialize our (empty) input layer as a vector of 0s. We'll modify it later by putting "1"s in various positions.
In [88]:
import numpy as np
layer_0 = np.zeros((1,vocab_size))
layer_0
Out[88]:
And now we want to create a function that will set our layer_0 list to the correct sequence of 1s and 0s based on a single review. If you remember our picture from before, you might have noticed something: each word had a specific place in the input of our network.
In [89]:
from IPython.display import Image
Image(filename='sentiment_network.png')
Out[89]:
In order to create a function that can update our layer_0 variable based on a review, we have to decide which spots in our layer_0 vector (list of numbers) correspond to each word. Truth be told, it doesn't matter which ones we choose, only that we pick a spot for each word and stick with it. Let's decide those positions now and store them in a python dictionary called "word2index".
In [90]:
word2index = {}
for i,word in enumerate(vocab):
    word2index[word] = i
word2index
Out[90]:
...and now we can use this new "word2index" dictionary to populate our input layer with the right 1s in the right places.
In [91]:
def update_input_layer(review):
    global layer_0
    # clear out previous state, reset the layer to be all 0s
    layer_0 *= 0
    for word in review.split(" "):
        layer_0[0][word2index[word]] = 1
update_input_layer(reviews[0])
In [92]:
layer_0
Out[92]:
In [93]:
def get_target_for_label(label):
    if(label == 'POSITIVE'):
        return 1
    else:
        return 0
In [94]:
get_target_for_label(labels[0])
Out[94]:
In [95]:
get_target_for_label(labels[1])
Out[95]:
In [96]:
from IPython.display import Image
Image(filename='sentiment_network_2.png')
Out[96]:
In [97]:
import time
import sys
import numpy as np
# Let's tweak our network from before to model these phenomena
class SentimentNetwork:
    def __init__(self, reviews, labels, hidden_nodes = 10, learning_rate = 0.1):
        np.random.seed(1)
        self.pre_process_data(reviews, labels)
        self.init_network(len(self.review_vocab), hidden_nodes, 1, learning_rate)

    def pre_process_data(self, reviews, labels):
        review_vocab = set()
        for review in reviews:
            for word in review.split(" "):
                review_vocab.add(word)
        self.review_vocab = list(review_vocab)

        label_vocab = set()
        for label in labels:
            label_vocab.add(label)
        self.label_vocab = list(label_vocab)

        self.review_vocab_size = len(self.review_vocab)
        self.label_vocab_size = len(self.label_vocab)

        self.word2index = {}
        for i, word in enumerate(self.review_vocab):
            self.word2index[word] = i

        self.label2index = {}
        for i, label in enumerate(self.label_vocab):
            self.label2index[label] = i

    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # Set number of nodes in input, hidden and output layers.
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes

        # Initialize weights
        self.weights_0_1 = np.zeros((self.input_nodes, self.hidden_nodes))
        self.weights_1_2 = np.random.normal(0.0, self.output_nodes**-0.5,
                                            (self.hidden_nodes, self.output_nodes))
        self.learning_rate = learning_rate
        self.layer_0 = np.zeros((1, input_nodes))

    def update_input_layer(self, review):
        # clear out previous state, reset the layer to be all 0s
        self.layer_0 *= 0
        for word in review.split(" "):
            if(word in self.word2index.keys()):
                self.layer_0[0][self.word2index[word]] = 1

    def get_target_for_label(self, label):
        if(label == 'POSITIVE'):
            return 1
        else:
            return 0

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_output_2_derivative(self, output):
        return output * (1 - output)

    def train(self, training_reviews, training_labels):
        assert(len(training_reviews) == len(training_labels))
        correct_so_far = 0
        start = time.time()
        for i in range(len(training_reviews)):
            review = training_reviews[i]
            label = training_labels[i]

            ### Forward pass ###
            # Input layer
            self.update_input_layer(review)
            # Hidden layer
            layer_1 = self.layer_0.dot(self.weights_0_1)
            # Output layer
            layer_2 = self.sigmoid(layer_1.dot(self.weights_1_2))

            ### Backward pass ###
            # Output error
            layer_2_error = layer_2 - self.get_target_for_label(label) # Output layer error is the difference between desired target and actual output.
            layer_2_delta = layer_2_error * self.sigmoid_output_2_derivative(layer_2)

            # Backpropagated error
            layer_1_error = layer_2_delta.dot(self.weights_1_2.T) # errors propagated to the hidden layer
            layer_1_delta = layer_1_error # hidden layer gradients - no nonlinearity so it's the same as the error

            # Update the weights
            self.weights_1_2 -= layer_1.T.dot(layer_2_delta) * self.learning_rate # update hidden-to-output weights with gradient descent step
            self.weights_0_1 -= self.layer_0.T.dot(layer_1_delta) * self.learning_rate # update input-to-hidden weights with gradient descent step

            if(np.abs(layer_2_error) < 0.5):
                correct_so_far += 1

            reviews_per_second = i / float(time.time() - start)
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(training_reviews)))[:4] + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] + " #Correct:" + str(correct_so_far) + " #Trained:" + str(i+1) + " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")

    def test(self, testing_reviews, testing_labels):
        correct = 0
        start = time.time()
        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if(pred == testing_labels[i]):
                correct += 1
            reviews_per_second = i / float(time.time() - start)
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(testing_reviews)))[:4]
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5]
                             + " #Correct:" + str(correct) + " #Tested:" + str(i+1) + " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4] + "%")

    def run(self, review):
        # Input layer
        self.update_input_layer(review.lower())
        # Hidden layer
        layer_1 = self.layer_0.dot(self.weights_0_1)
        # Output layer
        layer_2 = self.sigmoid(layer_1.dot(self.weights_1_2))
        if(layer_2[0] > 0.5):
            return "POSITIVE"
        else:
            return "NEGATIVE"
In [98]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000])
In [99]:
# evaluate our model before training (just to show how horrible it is)
mlp.test(reviews[-1000:],labels[-1000:])
In [100]:
# train the network
mlp.train(reviews[:-1000],labels[:-1000])
In [101]:
# evaluate the model after training
mlp.test(reviews[-1000:],labels[-1000:])
In [102]:
mlp.run("That movie was great")
Out[102]:
Even though this network is perfectly trainable on a laptop, we can get a lot more performance out of it, and doing so is all about understanding how the neural network is interacting with our data (again, "framing the problem"). Let's take a moment to consider how layer_1 is generated. First, we're going to create a smaller layer_0 so that we can easily picture all the values in our notebook.
In [103]:
layer_0 = np.zeros(10)
In [104]:
layer_0
Out[104]:
Now, let's set a few of the inputs to 1s, and create a sample weight matrix.
In [105]:
layer_0[4] = 1
layer_0[9] = 1
layer_0
Out[105]:
In [106]:
weights_0_1 = np.random.randn(10,5)
So, given these pieces, layer_1 is created in the following way....
In [107]:
layer_1 = layer_0.dot(weights_0_1)
In [108]:
layer_1
Out[108]:
layer_1 is generated by performing a vector->matrix multiplication; however, most of our input neurons are turned off! Thus, there's actually a lot of computation being wasted. Consider the network below.
In [109]:
Image(filename='sentiment_network_sparse.png')
Out[109]:
If you recall from previous lessons, each edge from one neuron to another represents a single value in our weights_0_1 matrix. When we forward propagate, we take our input neuron's value, multiply it by each weight attached to that neuron, and then sum all the resulting values in the next layer. So, in this case, if only "excellent" was turned on, then all of the multiplications coming out of "horrible" and "terrible" are wasted computation! All of the weights coming out of "horrible" and "terrible" are being multiplied by 0, thus having no effect on our values in layer_1.
In [110]:
Image(filename='sentiment_network_sparse_2.png')
Out[110]:
When we're forward propagating, we multiply our input neuron's value by the weights attached to it. However, in this case, when the neuron is turned on, it's always turned on to exactly 1. So, there's no need for the multiplication at all; what if we skipped this step entirely?
Instead of generating a huge layer_0 vector and then performing a full vector->matrix multiplication across our huge weights_0_1 matrix, we can simply sum the rows of weights_0_1 that correspond to the words in our review. The resulting value of layer_1 will be exactly the same as if we had performed a full matrix multiplication, at a fraction of the computational cost. This is called a "lookup table" or an "embedding layer".
In [111]:
#inefficient thing we did before
layer_1 = layer_0.dot(weights_0_1)
layer_1
Out[111]:
In [112]:
# new, less expensive lookup table version
layer_1 = weights_0_1[4] + weights_0_1[9]
layer_1
Out[112]:
See how they generate exactly the same value? Let's update our new neural network to do this.
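Before we do, if you want an explicit sanity check of that equivalence (and a rough feel for the savings), something along these lines works; the timing numbers vary by machine, and for this toy 10x5 matrix the gap is small, but with the real weight matrix (one row per vocabulary word) the savings are dramatic.
# sanity check: the lookup-table version matches the full vector-matrix product
full_version   = layer_0.dot(weights_0_1)
lookup_version = weights_0_1[4] + weights_0_1[9]
print(np.allclose(full_version, lookup_version))   # True

# rough timing comparison (numbers vary by machine and matrix size)
import timeit
print(timeit.timeit(lambda: layer_0.dot(weights_0_1), number=100000))
print(timeit.timeit(lambda: weights_0_1[4] + weights_0_1[9], number=100000))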
In [397]:
import time
import sys
# Let's tweak our network from before to model these phenomena
class SentimentNetwork:
    def __init__(self, reviews, labels, hidden_nodes = 10, learning_rate = 0.1):
        np.random.seed(1)
        self.pre_process_data(reviews, labels)
        self.init_network(len(self.review_vocab), hidden_nodes, 1, learning_rate)

    def pre_process_data(self, reviews, labels):
        review_vocab = set()
        for review in reviews:
            for word in review.split(" "):
                review_vocab.add(word)
        self.review_vocab = list(review_vocab)

        label_vocab = set()
        for label in labels:
            label_vocab.add(label)
        self.label_vocab = list(label_vocab)

        self.review_vocab_size = len(self.review_vocab)
        self.label_vocab_size = len(self.label_vocab)

        self.word2index = {}
        for i, word in enumerate(self.review_vocab):
            self.word2index[word] = i

        self.label2index = {}
        for i, label in enumerate(self.label_vocab):
            self.label2index[label] = i

    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # Set number of nodes in input, hidden and output layers.
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes

        # Initialize weights
        self.weights_0_1 = np.zeros((self.input_nodes, self.hidden_nodes))
        self.weights_1_2 = np.random.normal(0.0, self.output_nodes**-0.5,
                                            (self.hidden_nodes, self.output_nodes))
        self.learning_rate = learning_rate
        self.layer_0 = np.zeros((1, input_nodes))
        self.layer_1 = np.zeros((1, hidden_nodes))

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_output_2_derivative(self, output):
        return output * (1 - output)

    def update_input_layer(self, review):
        # clear out previous state, reset the layer to be all 0s
        self.layer_0 *= 0
        for word in review.split(" "):
            self.layer_0[0][self.word2index[word]] = 1

    def get_target_for_label(self, label):
        if(label == 'POSITIVE'):
            return 1
        else:
            return 0

    def train(self, training_reviews_raw, training_labels):
        # pre-compute the unique word indices for each review once, up front
        training_reviews = list()
        for review in training_reviews_raw:
            indices = set()
            for word in review.split(" "):
                if(word in self.word2index.keys()):
                    indices.add(self.word2index[word])
            training_reviews.append(list(indices))

        assert(len(training_reviews) == len(training_labels))
        correct_so_far = 0
        start = time.time()
        for i in range(len(training_reviews)):
            review = training_reviews[i]
            label = training_labels[i]

            ### Forward pass ###
            # Hidden layer: instead of layer_0.dot(weights_0_1), just sum the relevant rows
            self.layer_1 *= 0
            for index in review:
                self.layer_1 += self.weights_0_1[index]
            # Output layer
            layer_2 = self.sigmoid(self.layer_1.dot(self.weights_1_2))

            ### Backward pass ###
            # Output error
            layer_2_error = layer_2 - self.get_target_for_label(label) # Output layer error is the difference between desired target and actual output.
            layer_2_delta = layer_2_error * self.sigmoid_output_2_derivative(layer_2)

            # Backpropagated error
            layer_1_error = layer_2_delta.dot(self.weights_1_2.T) # errors propagated to the hidden layer
            layer_1_delta = layer_1_error # hidden layer gradients - no nonlinearity so it's the same as the error

            # Update the weights
            self.weights_1_2 -= self.layer_1.T.dot(layer_2_delta) * self.learning_rate # update hidden-to-output weights with gradient descent step
            for index in review:
                self.weights_0_1[index] -= layer_1_delta[0] * self.learning_rate # update input-to-hidden weights with gradient descent step

            if(np.abs(layer_2_error) < 0.5):
                correct_so_far += 1

            reviews_per_second = i / float(time.time() - start)
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(training_reviews)))[:4] + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] + " #Correct:" + str(correct_so_far) + " #Trained:" + str(i+1) + " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")

    def test(self, testing_reviews, testing_labels):
        correct = 0
        start = time.time()
        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if(pred == testing_labels[i]):
                correct += 1
            reviews_per_second = i / float(time.time() - start)
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(testing_reviews)))[:4]
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5]
                             + " #Correct:" + str(correct) + " #Tested:" + str(i+1) + " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4] + "%")

    def run(self, review):
        # Hidden layer: sum the weight rows for the unique words in the review
        self.layer_1 *= 0
        unique_indices = set()
        for word in review.lower().split(" "):
            if word in self.word2index.keys():
                unique_indices.add(self.word2index[word])
        for index in unique_indices:
            self.layer_1 += self.weights_0_1[index]
        # Output layer
        layer_2 = self.sigmoid(self.layer_1.dot(self.weights_1_2))
        if(layer_2[0] > 0.5):
            return "POSITIVE"
        else:
            return "NEGATIVE"
In [398]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000],learning_rate=0.01)
In [399]:
# train the network
mlp.train(reviews[:-1000],labels[:-1000])
And voilà! Our network learns 10x faster than before while making exactly the same predictions!
In [400]:
# evaluate the trained model on the test set
mlp.test(reviews[-1000:],labels[-1000:])
Our network also tests over twice as fast!
So at first this might seem like the same thing we did in the previous section. However, while the previous section was about looking for computational waste and trimming it out, this section is about looking for noise in our data and trimming it out. When we reduce the "noise" in our data, the neural network can identify correlation much faster and with greater accuracy. While our technique here will be simple, many recently developed state-of-the-art techniques (most notably attention and batch normalization) are all about reducing the amount of noise that your network has to filter through. The more obvious you can make the correlation to your neural network, the better.
Our network is looking for correlation between movie review vocabularies and output positive/negative labels. In order to do this, our network has to come to understand over 70,000 different words in our vocabulary! That's a ton of knowledge that the network has to learn!
This raises the question: are all the words in the vocabulary actually relevant to sentiment? A few pages ago, we counted how often words occurred in positive reviews relative to negative reviews and created a ratio. We could then sort words by this ratio and see the words with the most positive and negative affinity. If you remember, the output looked like this:
In [150]:
# words most frequently seen in a review with a "POSITIVE" label
pos_neg_ratios.most_common()
Out[150]:
In [151]:
# words most frequently seen in a review with a "NEGATIVE" label
list(reversed(pos_neg_ratios.most_common()))[0:30]
Out[151]:
In [7]:
from bokeh.models import ColumnDataSource, LabelSet
from bokeh.plotting import figure, show, output_file
from bokeh.io import output_notebook
output_notebook()
In [158]:
hist, edges = np.histogram(list(map(lambda x:x[1],pos_neg_ratios.most_common())), density=True, bins=100)
p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="Word Positive/Negative Affinity Distribution")
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="#555555")
show(p)
In this graph "0" means that a word has no affinitity for either positive or negative. AS you can see, the vast majority of our words don't have that much direct affinity! So, our network is having to learn about lots of terms that are likely irrelevant to the final prediction. If we remove some of the most irrelevant words, our network will have fewer words that it has to learn about, allowing it to focus more on the words that matters.
Furthermore, check out this graph of simple word frequency:
In [236]:
frequency_frequency = Counter()
for word, cnt in total_counts.most_common():
    frequency_frequency[cnt] += 1
In [239]:
hist, edges = np.histogram(list(map(lambda x:x[1],frequency_frequency.most_common())), density=True, bins=100)
p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="The frequency distribution of the words in our corpus")
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="#555555")
show(p)
As you can see, the vast majority of words in our corpus only happen once or twice. Unfortunately, this isn't enough for any of those words to be correlated with anything. Correlation requires seeing two things occur at the same time on multiple occasions so that you can identify a pattern. We should eliminate these very low frequency terms as well.
In the next network, we eliminate both low-frequency words (via a min_count parameter) and words with low positive/negative affinity.
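Before looking at the full class, here is a minimal sketch of that filtering rule on its own, using the total_counts and pos_neg_ratios we built earlier; the thresholds are just example values, and the class below applies the same logic inside its pre_process_data method.
# example thresholds; the class below exposes these as min_count and polarity_cutoff
example_min_count = 10
example_polarity_cutoff = 0.1

filtered_vocab = set()
for word, cnt in total_counts.most_common():
    if cnt > example_min_count:
        # words frequent enough to have a ratio must also be polarized enough to survive
        if word in pos_neg_ratios:
            if abs(pos_neg_ratios[word]) >= example_polarity_cutoff:
                filtered_vocab.add(word)
        else:
            filtered_vocab.add(word)
print(len(filtered_vocab), "of", len(total_counts), "words survive the filter")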
In [18]:
import time
import sys
import numpy as np
# Let's tweak our network from before to model these phenomena
class SentimentNetwork:
    def __init__(self, reviews, labels, min_count = 10, polarity_cutoff = 0.1, hidden_nodes = 10, learning_rate = 0.1):
        np.random.seed(1)
        self.pre_process_data(reviews, labels, polarity_cutoff, min_count)
        self.init_network(len(self.review_vocab), hidden_nodes, 1, learning_rate)

    def pre_process_data(self, reviews, labels, polarity_cutoff, min_count):
        positive_counts = Counter()
        negative_counts = Counter()
        total_counts = Counter()
        for i in range(len(reviews)):
            if(labels[i] == 'POSITIVE'):
                for word in reviews[i].split(" "):
                    positive_counts[word] += 1
                    total_counts[word] += 1
            else:
                for word in reviews[i].split(" "):
                    negative_counts[word] += 1
                    total_counts[word] += 1

        pos_neg_ratios = Counter()
        for term,cnt in list(total_counts.most_common()):
            if(cnt >= 50):
                pos_neg_ratio = positive_counts[term] / float(negative_counts[term]+1)
                pos_neg_ratios[term] = pos_neg_ratio
        for word,ratio in pos_neg_ratios.most_common():
            if(ratio > 1):
                pos_neg_ratios[word] = np.log(ratio)
            else:
                pos_neg_ratios[word] = -np.log((1 / (ratio + 0.01)))

        # only keep words that are frequent enough and (if scored) polarized enough
        review_vocab = set()
        for review in reviews:
            for word in review.split(" "):
                if(total_counts[word] > min_count):
                    if(word in pos_neg_ratios.keys()):
                        if((pos_neg_ratios[word] >= polarity_cutoff) or (pos_neg_ratios[word] <= -polarity_cutoff)):
                            review_vocab.add(word)
                    else:
                        review_vocab.add(word)
        self.review_vocab = list(review_vocab)

        label_vocab = set()
        for label in labels:
            label_vocab.add(label)
        self.label_vocab = list(label_vocab)

        self.review_vocab_size = len(self.review_vocab)
        self.label_vocab_size = len(self.label_vocab)

        self.word2index = {}
        for i, word in enumerate(self.review_vocab):
            self.word2index[word] = i

        self.label2index = {}
        for i, label in enumerate(self.label_vocab):
            self.label2index[label] = i

    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # Set number of nodes in input, hidden and output layers.
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes

        # Initialize weights
        self.weights_0_1 = np.zeros((self.input_nodes, self.hidden_nodes))
        self.weights_1_2 = np.random.normal(0.0, self.output_nodes**-0.5,
                                            (self.hidden_nodes, self.output_nodes))
        self.learning_rate = learning_rate
        self.layer_0 = np.zeros((1, input_nodes))
        self.layer_1 = np.zeros((1, hidden_nodes))

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_output_2_derivative(self, output):
        return output * (1 - output)

    def update_input_layer(self, review):
        # clear out previous state, reset the layer to be all 0s
        self.layer_0 *= 0
        for word in review.split(" "):
            self.layer_0[0][self.word2index[word]] = 1

    def get_target_for_label(self, label):
        if(label == 'POSITIVE'):
            return 1
        else:
            return 0

    def train(self, training_reviews_raw, training_labels):
        # pre-compute the unique word indices for each review once, up front
        training_reviews = list()
        for review in training_reviews_raw:
            indices = set()
            for word in review.split(" "):
                if(word in self.word2index.keys()):
                    indices.add(self.word2index[word])
            training_reviews.append(list(indices))

        assert(len(training_reviews) == len(training_labels))
        correct_so_far = 0
        start = time.time()
        for i in range(len(training_reviews)):
            review = training_reviews[i]
            label = training_labels[i]

            ### Forward pass ###
            # Hidden layer: sum the weight rows for the words in the review
            self.layer_1 *= 0
            for index in review:
                self.layer_1 += self.weights_0_1[index]
            # Output layer
            layer_2 = self.sigmoid(self.layer_1.dot(self.weights_1_2))

            ### Backward pass ###
            # Output error
            layer_2_error = layer_2 - self.get_target_for_label(label) # Output layer error is the difference between desired target and actual output.
            layer_2_delta = layer_2_error * self.sigmoid_output_2_derivative(layer_2)

            # Backpropagated error
            layer_1_error = layer_2_delta.dot(self.weights_1_2.T) # errors propagated to the hidden layer
            layer_1_delta = layer_1_error # hidden layer gradients - no nonlinearity so it's the same as the error

            # Update the weights
            self.weights_1_2 -= self.layer_1.T.dot(layer_2_delta) * self.learning_rate # update hidden-to-output weights with gradient descent step
            for index in review:
                self.weights_0_1[index] -= layer_1_delta[0] * self.learning_rate # update input-to-hidden weights with gradient descent step

            if(layer_2 >= 0.5 and label == 'POSITIVE'):
                correct_so_far += 1
            if(layer_2 < 0.5 and label == 'NEGATIVE'):
                correct_so_far += 1

            reviews_per_second = i / float(time.time() - start)
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(training_reviews)))[:4] + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] + " #Correct:" + str(correct_so_far) + " #Trained:" + str(i+1) + " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")

    def test(self, testing_reviews, testing_labels):
        correct = 0
        start = time.time()
        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if(pred == testing_labels[i]):
                correct += 1
            reviews_per_second = i / float(time.time() - start)
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(testing_reviews)))[:4]
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5]
                             + " #Correct:" + str(correct) + " #Tested:" + str(i+1) + " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4] + "%")

    def run(self, review):
        # Hidden layer: sum the weight rows for the unique words in the review
        self.layer_1 *= 0
        unique_indices = set()
        for word in review.lower().split(" "):
            if word in self.word2index.keys():
                unique_indices.add(self.word2index[word])
        for index in unique_indices:
            self.layer_1 += self.weights_0_1[index]
        # Output layer
        layer_2 = self.sigmoid(self.layer_1.dot(self.weights_1_2))
        if(layer_2[0] >= 0.5):
            return "POSITIVE"
        else:
            return "NEGATIVE"
In [6]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000],min_count=20,polarity_cutoff=0.05,learning_rate=0.01)
In [7]:
mlp.train(reviews[:-1000],labels[:-1000])
In [8]:
mlp.test(reviews[-1000:],labels[-1000:])
So, using these techniques, we are able to achieve a slightly higher testing score while training 2x faster than before. Furthermore, if we really crank up these cutoffs, we can get some pretty extreme speed with minimal loss in quality (useful if, for example, your business use case requires running very fast).
In [52]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000],min_count=20,polarity_cutoff=0.8,learning_rate=0.01)
In [53]:
mlp.train(reviews[:-1000],labels[:-1000])
In [55]:
mlp.test(reviews[-1000:],labels[-1000:])
In [19]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000],min_count=0,polarity_cutoff=0,learning_rate=0.01)
In [20]:
mlp.train(reviews[:-1000],labels[:-1000])
In [12]:
import matplotlib.colors as colors
In [23]:
words_to_visualize = list()
for word, ratio in pos_neg_ratios.most_common(500):
    if(word in mlp.word2index.keys()):
        words_to_visualize.append(word)
for word, ratio in list(reversed(pos_neg_ratios.most_common()))[0:500]:
    if(word in mlp.word2index.keys()):
        words_to_visualize.append(word)
In [42]:
colors_list = list()
vectors_list = list()
for word in words_to_visualize:
    if word in pos_neg_ratios.keys():
        vectors_list.append(mlp.weights_0_1[mlp.word2index[word]])
        if(pos_neg_ratios[word] > 0):
            colors_list.append("#"+colors.rgb2hex([0,min(255,pos_neg_ratios[word] * 1),0])[3:])
        else:
            colors_list.append("#000000")
            # colors_list.append("#"+colors.rgb2hex([0,0,min(255,pos_neg_ratios[word] * 1)])[3:])
In [36]:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0)
words_top_ted_tsne = tsne.fit_transform(vectors_list)
In [45]:
p = figure(tools="pan,wheel_zoom,reset,save",
toolbar_location="above",
title="vector T-SNE for most polarized words")
source = ColumnDataSource(data=dict(x1=words_top_ted_tsne[:,0],
x2=words_top_ted_tsne[:,1],
names=words_to_visualize))
p.scatter(x="x1", y="x2", size=8, source=source,color=colors_list)
word_labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
text_font_size="8pt", text_color="#555555",
source=source, text_align='center')
# p.add_layout(word_labels)
show(p)
# green indicates positive words, black indicates negative words