Sentiment analysis is a classic natural language processing problem. In this example, a large movie review dataset from IMDB is used for a sentiment classification task with several deep learning approaches. The labeled dataset consists of 50,000 IMDB movie reviews (good or bad): 25,000 highly polar reviews for training and 25,000 for testing. The dataset was originally collected by Stanford researchers and used in a 2011 paper, in which the highest accuracy of 88.33% was achieved without using the unbalanced data. This example illustrates some deep learning approaches to sentiment classification with the BigDL Python API.
The IMDB dataset needs to be loaded into BigDL. Note that the dataset has been pre-processed, and each review is encoded as a sequence of integers. Each integer is the rank of a word by overall frequency in the dataset; for instance, '5' means the 5th most frequent word in the data. This makes it very convenient to filter words by frequency, for example to keep only the top 5,000 most common words and/or to eliminate the top 30 most common words. Let's define functions to load the pre-processed data.
In [1]:
from bigdl.dataset import base
import numpy as np
def download_imdb(dest_dir):
    """Download pre-processed IMDB movie review data.

    :param dest_dir: destination directory to store the data
    :return: the absolute path of the stored data
    """
    file_name = "imdb.npz"
    file_abs_path = base.maybe_download(file_name,
    return file_abs_path


def load_imdb(dest_dir='/tmp/.bigdl/dataset'):
    """Load IMDB dataset.

    :param dest_dir: where to cache the data (relative to `~/.bigdl/dataset`).
    :return: the train, test separated IMDB dataset.
    """
    path = download_imdb(dest_dir)
    f = np.load(path, allow_pickle=True)
    x_train = f['x_train']
    y_train = f['y_train']
    x_test = f['x_test']
    y_test = f['y_test']
    return (x_train, y_train), (x_test, y_test)


print('Processing text dataset')
(x_train, y_train), (x_test, y_test) = load_imdb()
print('finished processing text')
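Because each review is a list of frequency ranks, the filtering mentioned earlier reduces to integer comparisons. The sketch below is a minimal illustration on a toy review; the helper name `filter_by_rank` and its parameters `num_words` and `skip_top` are hypothetical and not part of BigDL, and dropping filtered words outright is only one possible choice.

```python
def filter_by_rank(review, num_words=None, skip_top=0):
    """Keep only word ranks r with skip_top < r <= num_words.

    Ranks are 1-based: rank 1 is the most frequent word.
    Words outside the range are removed from the sequence (an assumption
    of this sketch; they could also be mapped to a placeholder index).
    """
    upper = num_words if num_words is not None else float("inf")
    return [r for r in review if skip_top < r <= upper]


toy_review = [1, 12, 31, 4999, 5000, 5001, 27]
# Keep the top 5,000 words and drop the 30 most common ones.
filtered = filter_by_rank(toy_review, num_words=5000, skip_top=30)
print(filtered)
```

Only the ranks 31, 4999 and 5000 survive here: the first two entries and the last one fall within the 30 most common words, and 5001 is beyond the top 5,000.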
To set a proper max sequence length, we need to look at the data and see how the length of each review is distributed across the dataset. A box and whisker plot of the review length in words is shown below.
In [2]:
import matplotlib
%pylab inline
# Summarize review length
from matplotlib import pyplot
print("Review length: ")
X = np.concatenate((x_train, x_test), axis=0)
result = [len(x) for x in X]
print("Mean %.2f words (%f)" % (np.mean(result), np.std(result)))
# Plot review length as a box and whisker plot
fig = pyplot.figure(1, figsize=(6, 6))
pyplot.boxplot(result)
pyplot.show()
Looking at the box and whisker plot, the max length of a sample in words is 500, and the mean and median are below 250. According to the plot, a clipped length of 400 to 500 probably covers the mass of the distribution. Here we set the max sequence length of each sample to 500.
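One way to check a candidate clip length numerically is to compute the fraction of reviews that fit within it without truncation. The following is a minimal sketch on made-up lengths; the helper `coverage` and the toy values are hypothetical, and with the real data the list would be the review lengths computed above.

```python
def coverage(lengths, max_len):
    """Fraction of sequences whose length is at most max_len."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)


# Toy review lengths in words (not taken from the IMDB data).
toy_lengths = [120, 180, 230, 260, 310, 450, 520, 900]
coverages = {max_len: coverage(toy_lengths, max_len) for max_len in (250, 400, 500)}
for max_len, fraction in coverages.items():
    print("max_len=%d covers %.3f of the reviews" % (max_len, fraction))
```

A clip length is a trade-off: a larger value truncates fewer reviews but makes every padded sample longer and training slower.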
The corresponding vocabulary, sorted by frequency, is also required for embedding the words with pre-trained vectors later. The downloaded vocabulary is in {word: index} format, with each word as a key and its index as the value. It needs to be transformed into {index: word} format.
Let's define a function to obtain the vocabulary.
In [3]:
import json
def get_word_index(dest_dir='/tmp/.bigdl/dataset'):
    """Retrieve the dictionary mapping words to their indices.

    :param dest_dir: where to cache the data (relative to `~/.bigdl/dataset`).
    :return: the word index dictionary.
    """
    file_name = "imdb_word_index.json"
    path = base.maybe_download(file_name,
    with open(path) as f:
        data = json.load(f)
    return data


print('Processing vocabulary')
word_idx = get_word_index()
idx_word = {v:k for k,v in word_idx.items()}
print('finished processing vocabulary')
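With the inverted dictionary, an encoded review can be turned back into text, which is a handy sanity check. The sketch below uses a tiny made-up vocabulary; the `decode` helper and the `?` placeholder for unknown indices are hypothetical. It works on raw indices, before any offset is applied to them.

```python
# Toy vocabulary in the same {word: index} layout as the downloaded file.
toy_word_idx = {"the": 1, "movie": 2, "great": 3}

# Invert it to {index: word}.
toy_idx_word = {index: word for word, index in toy_word_idx.items()}


def decode(sequence, idx_word, unknown="?"):
    """Map a sequence of word indices back to a space-separated string."""
    return " ".join(idx_word.get(i, unknown) for i in sequence)


decoded = decode([1, 3, 2, 9], toy_idx_word)
print(decoded)
```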
Before we train the network, some pre-processing steps need to be applied to the dataset.
Next, let's go through the steps that are applied to the data.
We insert a start_char at the beginning of each sentence to mark the start point. We set it to 2 here, and every other word index is shifted by a constant index_from to keep it distinct from the 'helper indices' (e.g. start_char, oov_char, etc.).
A max_words variable is defined as the maximum index number (the least frequent word) included in the sequence. If a word index is at or above max_words, it is replaced by an out-of-vocabulary number oov_char, which is 3.
Each word index sequence is restricted to the same length. We use left-padding here: if a sequence is longer than the pre-defined sequence_len, as much of its right (end) as possible is kept and its left (head) is dropped; if it is shorter, its left (head) is padded with padding_value.
In [4]:
from bigdl.util.common import Sample


def replace_oov(x, oov_char, max_words):
    """
    Replace the words out of vocabulary with `oov_char`
    :param x: a sequence
    :param max_words: the max number of words to include
    :param oov_char: words out of vocabulary because of exceeding the `max_words`
        limit will be replaced by this character
    :return: The replaced sequence
    """
    return [oov_char if w >= max_words else w for w in x]


def pad_sequence(x, fill_value, length):
    """
    Pads each sequence to the same length
    :param x: a sequence
    :param fill_value: pad the sequence with this value
    :param length: pad sequence to the length
    :return: the padded sequence
    """
    if len(x) >= length:
        # Keep the tail of the sequence and drop the head
        return x[(len(x) - length):]
    return [fill_value] * (length - len(x)) + x


def to_sample(features, label):
    """
    Wrap the `features` and `label` to a training sample object
    :param features: features of a sample
    :param label: label of a sample
    :return: a sample object including features and label
    """
    return Sample.from_ndarray(np.array(features, dtype='float'), np.array(label))


padding_value = 1
start_char = 2
oov_char = 3
index_from = 3
max_words = 5000
sequence_len = 500
print('start transformation')
from zoo.common.nncontext import *
sc = init_nncontext("Sentiment Analysis Example")
train_rdd = sc.parallelize(zip(x_train, y_train), 2) \
.map(lambda record: ([start_char] + [w + index_from for w in record[0]], record[1])) \
.map(lambda record: (replace_oov(record[0], oov_char, max_words), record[1])) \
.map(lambda record: (pad_sequence(record[0], padding_value, sequence_len), record[1])) \
.map(lambda record: to_sample(record[0], record[1]))
test_rdd = sc.parallelize(zip(x_test, y_test), 2) \
.map(lambda record: ([start_char] + [w + index_from for w in record[0]], record[1])) \
.map(lambda record: (replace_oov(record[0], oov_char, max_words), record[1])) \
.map(lambda record: (pad_sequence(record[0], padding_value, sequence_len), record[1])) \
.map(lambda record: to_sample(record[0], record[1]))
print('finish transformation')
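To see what these transformations do to a single review, the three steps can be traced on a toy sequence without Spark or BigDL. This is a minimal sketch: the `preprocess` helper is hypothetical, and `sequence_len=8` is chosen only to keep the output short (the notebook uses 500).

```python
def preprocess(review, start_char=2, oov_char=3, index_from=3,
               max_words=5000, padding_value=1, sequence_len=8):
    # 1. Prepend the start marker and shift every word index by index_from.
    shifted = [start_char] + [w + index_from for w in review]
    # 2. Replace indices at or above max_words with the OOV marker.
    clipped = [oov_char if w >= max_words else w for w in shifted]
    # 3. Keep the tail of long sequences, left-pad short ones.
    if len(clipped) >= sequence_len:
        return clipped[len(clipped) - sequence_len:]
    return [padding_value] * (sequence_len - len(clipped)) + clipped


toy_review = [1, 14, 22, 6000]
padded = preprocess(toy_review)                     # shorter than sequence_len
truncated = preprocess(toy_review, sequence_len=3)  # longer than sequence_len
print(padded)
print(truncated)
```

In the padded result the start marker 2 follows three padding values, and the rare word 6000 has become the OOV marker 3. In the truncated result the start marker itself is dropped, because only the end of the sequence is kept.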
Word embedding is a recent breakthrough in natural language processing. The key idea is to encode words and phrases as distributed representations, so that each word is represented as a vector. There are two widely used algorithms for training word vectors: word2vec, published by Google, and GloVe, published by Stanford. In this example, pre-trained GloVe vectors are loaded into a lookup table and fine-tuned during training. BigDL provides a method in news20 to download and load GloVe.
In [5]:
from bigdl.dataset import news20
import itertools
embedding_dim = 100
print('loading glove')
glove = news20.get_glove_w2v(source_dir='/tmp/.bigdl/dataset', dim=embedding_dim)
print('finish loading glove')
For each word whose index is less than max_words, we try to match its embedding and store it in an array.
For words that cannot be found in GloVe, we randomly sample a vector from a uniform distribution over [-0.05, 0.05].
BigDL usually uses a LookupTable layer for word embedding, so the matrix is loaded into the LookupTable by setting its weight.
In [6]:
print('processing glove')
w2v = [glove.get(idx_word.get(i - index_from), np.random.uniform(-0.05, 0.05, embedding_dim))
       for i in range(1, max_words + 1)]
w2v = np.array(list(itertools.chain(*np.array(w2v, dtype='float'))), dtype='float') \
    .reshape([max_words, embedding_dim])
print('finish processing glove')
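The lookup logic is easier to follow on a toy vocabulary. The sketch below is a pure-Python illustration rather than the notebook's implementation: the helper `build_embedding_rows`, the two-dimensional toy vectors and the fixed seed are all hypothetical. Row `i - 1` holds the vector for encoded index `i`, and any index without a GloVe match (including the helper indices) gets a random vector.

```python
import random


def build_embedding_rows(glove, idx_word, max_words, embedding_dim,
                         index_from=3, seed=0):
    """Return max_words rows; row i - 1 is the vector for encoded index i."""
    rng = random.Random(seed)
    rows = []
    for i in range(1, max_words + 1):
        # Undo the index_from shift to find the original word, if any.
        word = idx_word.get(i - index_from)
        vector = glove.get(word)
        if vector is None:
            # No pre-trained vector: sample uniformly from [-0.05, 0.05].
            vector = [rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
        rows.append(list(vector))
    return rows


toy_glove = {"the": [0.1, 0.2], "movie": [0.3, 0.4]}
toy_idx_word = {1: "the", 2: "movie", 3: "zzz"}
rows = build_embedding_rows(toy_glove, toy_idx_word, max_words=6, embedding_dim=2)
print(rows[3], rows[4])
```

Here encoded index 4 (word rank 1, "the") and index 5 ("movie") receive their pre-trained vectors, while indices 1 to 3 and the unmatched word "zzz" are randomly initialized.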
Next, let's build some deep learning models for the sentiment classification.
Several deep learning models are provided here for tutorial, comparison and demonstration purposes.
LSTM, GRU, Bi-LSTM, CNN and CNN + LSTM models are implemented as options. To decide which model to use, just assign model_type the corresponding string.
In [7]:
from bigdl.nn.layer import *
p = 0.2
def build_model(w2v):
    model = Sequential()

    # Embedding layer initialized with the pre-trained GloVe matrix
    embedding = LookupTable(max_words, embedding_dim)
    embedding.set_weights([w2v])
    model.add(embedding)
    if model_type.lower() == "gru":
        model.add(Recurrent()
                  .add(GRU(embedding_dim, 128, p))) \
            .add(Select(2, -1))
    elif model_type.lower() == "lstm":
        model.add(Recurrent()
                  .add(LSTM(embedding_dim, 128, p)))\
            .add(Select(2, -1))
    elif model_type.lower() == "bi_lstm":
        model.add(BiRecurrent(CAddTable())
                  .add(LSTM(embedding_dim, 128, p)))\
            .add(Select(2, -1))
    elif model_type.lower() == "cnn":
        model.add(Transpose([(2, 3)]))\
            .add(Reshape([embedding_dim, 1, sequence_len]))\
            .add(SpatialConvolution(embedding_dim, 128, 5, 1))\
            .add(SpatialMaxPooling(sequence_len - 5 + 1, 1, 1, 1))\
            .add(Reshape([128]))
    elif model_type.lower() == "cnn_lstm":
        model.add(Transpose([(2, 3)]))\
            .add(Reshape([embedding_dim, 1, sequence_len])) \
            .add(SpatialConvolution(embedding_dim, 64, 5, 1)) \
            .add(ReLU()) \
            .add(SpatialMaxPooling(4, 1, 1, 1)) \
            .add(Squeeze(3)) \
            .add(Transpose([(2, 3)])) \
            .add(Recurrent()
                 .add(LSTM(64, 128, p))) \
            .add(Select(2, -1))

    # Classification head: a single output squashed to (0, 1)
    model.add(Linear(128, 100))\
        .add(Linear(100, 1))\
        .add(Sigmoid())
    return model
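The pooling sizes in the two CNN branches follow from simple length arithmetic. As a sketch, assuming no padding and the kernel widths and strides passed to the layers in the model code (the helper `output_length` is hypothetical): a width-5 convolution over 500 positions leaves 496, so a pooling window of `sequence_len - 5 + 1` collapses the sequence to one value per filter, while the width-4, stride-1 pooling in the CNN + LSTM branch leaves 493 time steps for the LSTM.

```python
def output_length(input_length, kernel, stride=1):
    """Length along one axis after a convolution or pooling with no padding."""
    return (input_length - kernel) // stride + 1


sequence_len = 500

# Width-5 convolution along the sequence axis.
conv_len = output_length(sequence_len, kernel=5)

# "cnn" branch: the pooling window spans the whole convolved sequence.
cnn_pooled_len = output_length(conv_len, kernel=sequence_len - 5 + 1)

# "cnn_lstm" branch: width-4 pooling with stride 1.
cnn_lstm_pooled_len = output_length(conv_len, kernel=4, stride=1)

print(conv_len, cnn_pooled_len, cnn_lstm_pooled_len)
```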
In [8]:
from bigdl.optim.optimizer import *
from bigdl.nn.criterion import *
# max_epoch = 4
max_epoch = 1
batch_size = 64
model_type = 'gru'
optimizer = Optimizer(
To visualize the training process in TensorBoard, training summaries need to be saved as logs.
In [9]:
import datetime as dt
logdir = '/tmp/.bigdl/'
app_name = 'adam-' + dt.datetime.now().strftime("%Y%m%d-%H%M%S")
train_summary = TrainSummary(log_dir=logdir, app_name=app_name)
train_summary.set_summary_trigger("Parameters", SeveralIteration(50))
val_summary = ValidationSummary(log_dir=logdir, app_name=app_name)
Now, let's start training!
In [10]:
train_model = optimizer.optimize()
print ("Optimization Done.")
In [13]:
predictions = train_model.predict(test_rdd)
def map_predict_label(l):
    if l > 0.5:
        return 1
    return 0


def map_groundtruth_label(l):
    return l.to_ndarray()[0]


y_pred = np.array([ map_predict_label(s) for s in predictions.collect()])
y_true = np.array([map_groundtruth_label(s.label) for s in test_rdd.collect()])
Then let's see the prediction accuracy on the validation set.
In [12]:
correct = 0
for i in range(0, y_pred.size):
    if (y_pred[i] == y_true[i]):
        correct += 1
accuracy = float(correct) / y_pred.size
print('Prediction accuracy on validation set is: ', accuracy)
Show the confusion matrix
In [14]:
%pylab inline
import matplotlib.pyplot as plt
import seaborn as sn
import pandas as pd
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_true, y_pred)
df_cm = pd.DataFrame(cm)
plt.figure(figsize = (5,4))
sn.heatmap(df_cm, annot=True,fmt='d')
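To read the heatmap, it helps to know the layout of the matrix: in scikit-learn's `confusion_matrix`, rows correspond to true labels and columns to predicted labels. The sketch below rebuilds the same 2x2 counts by hand on made-up scores; the helpers `threshold` and `confusion_counts` and the toy data are hypothetical.

```python
def threshold(scores, cutoff=0.5):
    """Turn sigmoid scores into 0/1 labels; scores above the cutoff map to 1."""
    return [1 if s > cutoff else 0 for s in scores]


def confusion_counts(y_true, y_pred):
    """2x2 matrix of counts: rows are true labels, columns are predicted labels."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


toy_scores = [0.9, 0.2, 0.6, 0.4, 0.5]
toy_true = [1, 0, 0, 1, 0]
toy_pred = threshold(toy_scores)
toy_cm = confusion_counts(toy_true, toy_pred)

# Accuracy is the sum of the diagonal divided by the number of samples.
toy_accuracy = (toy_cm[0][0] + toy_cm[1][1]) / len(toy_true)
print(toy_cm, toy_accuracy)
```

The off-diagonal cells separate the two kinds of error: `toy_cm[0][1]` counts negative reviews predicted as positive, and `toy_cm[1][0]` counts positive reviews predicted as negative.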
Because of the limit on article length, the results of all the optional models cannot be shown here. Please try the other provided models to see their results. If you are interested in improving the results, try different training parameters that may affect them, such as the max sequence length, batch size, number of training epochs, preprocessing schemes, optimization methods and so on. Among the models, the CNN trains much more quickly. Note that the LSTM and its variants (e.g. GRU) are difficult to train; even an unsuitable batch size may cause the model not to converge. In addition, they are prone to overfitting, so try different dropout thresholds and/or add regularizers.