
In [1]:
# Tensorflow
import tensorflow as tf
print('Tested with TensorFlow 1.2.0')
print('Your TensorFlow version:', tf.__version__) 

# Feeding function for enqueue data
from tensorflow.python.estimator.inputs.queues import feeding_functions as ff

# Rnn common functions
from tensorflow.contrib.learn.python.learn.estimators import rnn_common

# Model builder
from tensorflow.python.estimator import model_fn as model_fn_lib

# Run an experiment
from tensorflow.contrib.learn.python.learn import learn_runner

# Helpers for data processing
import pandas as pd
import numpy as np
import argparse
import random

Loading Data

First, we want to create our word vectors. For simplicity, we're going to be using a pretrained model.

As one of the biggest players in the ML game, Google was able to train a Word2Vec model on a massive Google News dataset containing about 100 billion words. From that model, Google produced 3 million word vectors, each with a dimensionality of 300.

In an ideal scenario, we'd use those vectors, but since the word vectors matrix is quite large (3.6 GB!), we'll be using a much more manageable matrix that is trained using GloVe, a similar word vector generation model. The matrix will contain 400,000 word vectors, each with a dimensionality of 50.

We're going to import two data structures: a Python list with the 400,000 words, and a 400,000 x 50 embedding matrix that holds all of the word vector values.

In [2]:
# data from:
TRAIN_INPUT = 'data/train.csv'
TEST_INPUT = 'data/test.csv'

# data manually generated
MY_TEST_INPUT = 'data/mytest.csv'

# wordtovec
# the matrix will contain 400,000 word vectors, each with a dimensionality of 50.
word_list = np.load('word_list.npy')
word_list = word_list.tolist() # originally loaded as numpy array
word_list = [word.decode('UTF-8') for word in word_list] # decode bytes into UTF-8 strings
print('Loaded the word list, length:', len(word_list))

word_vector = np.load('word_vector.npy')
print('Loaded the word vector, shape:', word_vector.shape)

Loaded the word list, length: 400000
Loaded the word vector, shape: (400000, 50)

We can search our word list for a word like "baseball", and then access its corresponding vector through the embedding matrix.

In [3]:
baseball_index = word_list.index('baseball')
print('Example: baseball')
print(word_vector[baseball_index]) # row of the embedding matrix for "baseball"

Example: baseball
[-1.93270004  1.04209995 -0.78514999  0.91033     0.22711    -0.62158
 -1.64929998  0.07686    -0.58679998  0.058831    0.35628     0.68915999
 -0.50598001  0.70472997  1.26639998 -0.40031001 -0.020687    0.80862999
 -0.90565997 -0.074054   -0.87674999 -0.62910002 -0.12684999  0.11524
 -0.55685002 -1.68260002 -0.26291001  0.22632     0.713      -1.08280003
  2.12310004  0.49869001  0.066711   -0.48225999 -0.17896999  0.47699001
  0.16384     0.16537    -0.11506    -0.15962    -0.94926    -0.42833
 -0.59456998  1.35660005 -0.27506     0.19918001 -0.36008     0.55667001
 -0.70314997  0.17157   ]
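
Note that word_list.index scans the list from the start on every call, which gets slow when many words have to be converted. Below is a minimal sketch of a faster lookup, assuming a toy four-word vocabulary in place of the real 400,000-word list. The names toy_words, toy_vectors and word_to_index are illustrative and not part of the notebook.

```python
import numpy as np

# Toy stand-ins for word_list and word_vector (assumed data, for illustration only)
toy_words = ['the', 'movie', 'baseball', 'great']
toy_vectors = np.arange(12, dtype=np.float32).reshape(4, 3)  # 4 words x 3 dims

# Build the word -> row index mapping once; each later lookup is a dict access
word_to_index = {word: i for i, word in enumerate(toy_words)}

baseball_row = word_to_index['baseball']
print(baseball_row)               # same row that toy_words.index('baseball') finds
print(toy_vectors[baseball_row])  # the vector stored in that row
```

The same idea should carry over to the real word_list: build the dictionary once after loading, then use it wherever word_list.index appears.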

Now that we have our vectors, our first step is to take an input sentence and construct its vector representation. Let's say that we have the input sentence "I thought the movie was incredible and inspiring". To get the word vectors, we can use TensorFlow's embedding lookup function, tf.nn.embedding_lookup. It takes two arguments: the embedding matrix (the word_vector matrix in our case) and the ids of each of the words. The ids vector can be thought of as the integerized representation of the training set: each id is simply the row index of a word in the embedding matrix. Let's look at a quick example to make this concrete.

In [92]:
max_seq_length = 10 # maximum length of sentence
num_dims = 50 # dimensions for each word vector

first_sentence = np.zeros((max_seq_length), dtype='int32')
first_sentence[0] = word_list.index("i")
first_sentence[1] = word_list.index("thought")
first_sentence[2] = word_list.index("the")
first_sentence[3] = word_list.index("movie")
first_sentence[4] = word_list.index("was")
first_sentence[5] = word_list.index("incredible")
first_sentence[6] = word_list.index("and")
first_sentence[7] = word_list.index("inspiring")
# first_sentence[8] = 0
# first_sentence[9] = 0

print(first_sentence) # shows the row index for each word

[    41    804 201534   1005     15   7446      5  13767      0      0]

The 10 x 50 output should contain the 50 dimensional word vectors for each of the 10 words in the sequence.

In [5]:
with tf.Session() as sess:
    print(tf.nn.embedding_lookup(word_vector, first_sentence).eval().shape)

(10, 50)
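
Conceptually, an embedding lookup with a single embedding matrix is just row selection. The sketch below shows the same operation with plain NumPy indexing on a toy matrix. The data and names (toy_matrix, toy_ids) are made up for illustration, and this is not how TensorFlow implements the op internally.

```python
import numpy as np

# Toy embedding matrix: 4 words x 2 dimensions (assumed values)
toy_matrix = np.array([[0.0, 0.5],
                       [1.0, 1.5],
                       [2.0, 2.5],
                       [3.0, 3.5]], dtype=np.float32)

# A "sentence" of 3 word ids, padded with 0 up to a fixed length of 5
toy_ids = np.array([2, 3, 1, 0, 0], dtype=np.int32)

# Indexing with an integer array selects one row per id
looked_up = toy_matrix[toy_ids]
print(looked_up.shape)  # one row per position, one column per dimension
print(looked_up)
```

Padded positions are looked up as well: id 0 simply selects row 0 of the matrix.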

Before creating the ids matrix for the whole training set, let’s first take some time to visualize the type of data that we have. This will help us determine the best value for setting our maximum sequence length. In the previous example, we used a max length of 10, but this value is largely dependent on the inputs you have.

The training set we're going to use is the IMDb movie review dataset. This set has 25,000 movie reviews, with 12,500 positive reviews and 12,500 negative reviews. Each review is stored in a txt file that we need to parse. The positive reviews are stored in one directory and the negative reviews in another. The following piece of code determines the total and average number of words across the reviews.

In [6]:
from os import listdir
from os.path import isfile, join
positiveFiles = ['positiveReviews/' + f for f in listdir('positiveReviews/') if isfile(join('positiveReviews/', f))]
negativeFiles = ['negativeReviews/' + f for f in listdir('negativeReviews/') if isfile(join('negativeReviews/', f))]
numWords = []
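for pf in positiveFiles:
    with open(pf, "r", encoding='utf-8') as f:
        text = f.read() # read the whole review
        counter = len(text.split()) # number of whitespace-separated words
        numWords.append(counter)
print('Positive files finished')

for nf in negativeFiles:
    with open(nf, "r", encoding='utf-8') as f:
        text = f.read() # read the whole review
        counter = len(text.split()) # number of whitespace-separated words
        numWords.append(counter)
print('Negative files finished')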

numFiles = len(numWords)
print('The total number of files is', numFiles)
print('The total number of words in the files is', sum(numWords))
print('The average number of words in the files is', sum(numWords)/len(numWords))

Positive files finished
Negative files finished
The total number of files is 25000
The total number of words in the files is 5844680
The average number of words in the files is 233.7872

We can also use the Matplotlib library to visualize this data as a histogram.

In [7]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(numWords, 50) # 50 bins
plt.xlabel('Sequence Length')
plt.ylabel('Frequency')
plt.axis([0, 1200, 0, 8000])
plt.show()

From the histogram and the average number of words per file, we can say that most reviews fall under 250 words, so that is the maximum sequence length we will use.
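
One way to check a candidate maximum length is to measure the fraction of reviews it covers without truncation. A minimal sketch, using a made-up list of review lengths (toy_lengths) rather than the real numWords values:

```python
import numpy as np

# Made-up review lengths in words (illustrative only, not the IMDb counts)
toy_lengths = np.array([120, 180, 233, 240, 90, 610, 300, 150, 210, 1100])
threshold = 250

# Fraction of reviews that fit entirely within the threshold
fraction_covered = np.mean(toy_lengths <= threshold)
print(fraction_covered)
```

Running the same two lines on numWords with threshold = 250 would show how many of the 25,000 reviews get truncated by this choice.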

In [8]:
max_seq_len = 250


In [73]:
ids_matrix = np.load('ids_matrix.npy').tolist()
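
The ids matrix is loaded here already computed. As a rough sketch of how one of its rows could be built, the hypothetical helper below maps each word to its id, pads short reviews with 0, and truncates long ones. Sending out-of-vocabulary words to a dedicated unknown id is an assumption: the sample batch printed later contains many 399999 entries, which suggests that index plays this role, but the notebook does not say so.

```python
import numpy as np

def sentence_to_ids(sentence, word_to_index, max_len, unknown_id):
    """Convert a sentence to a fixed-length row of word ids (hypothetical helper)."""
    ids = np.zeros(max_len, dtype=np.int32)  # unused positions stay 0 (padding)
    words = sentence.lower().split()[:max_len]  # truncate inputs longer than max_len
    for i, word in enumerate(words):
        ids[i] = word_to_index.get(word, unknown_id)  # fall back for unknown words
    return ids

# Toy vocabulary (assumed); the real one would come from word_list
toy_index = {'pad': 0, 'the': 1, 'movie': 2, 'was': 3, 'great': 4}

row = sentence_to_ids('The movie was surprisingly great', toy_index,
                      max_len=8, unknown_id=5)
print(row)
```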


In [80]:
# Parameters for training
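STEPS = 15000
BATCH_SIZE = 32 # assumption: inferred from the 'tensorboard/batch_32' model directory in the prediction log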

# Parameters for data processing
REVIEW_KEY = 'review'
SEQUENCE_LENGTH_KEY = 'sequence_length'

Separating train and test data

As described above, the IMDb movie review dataset has 25,000 reviews: 12,500 positive and 12,500 negative.

Let's first give a positive label [1, 0] to the first 12,500 reviews, and a negative label [0, 1] to the other reviews.

In [75]:

POSITIVE_REVIEWS = 12500 # the first 12,500 rows of ids_matrix are the positive reviews

# copying sequences
data_sequences = [np.asarray(v, dtype=np.int32) for v in ids_matrix]
# generating labels
data_labels = [[1, 0] if i < POSITIVE_REVIEWS else [0, 1] for i in range(len(ids_matrix))]
# also creating a length column, this will be used by the Dynamic RNN
# see more about it here:
data_length = [max_seq_len for i in range(len(ids_matrix))]

Then, let's shuffle the data and use 90% of the reviews for training and the other 10% for testing.

In [76]:
data = list(zip(data_sequences, data_labels, data_length))
random.shuffle(data) # shuffle

# object array of shape (n, 3): each row holds (sequence, label, length)
data = np.asarray(data, dtype=object)
# separating train and test data
limit = int(len(data) * 0.9)

train_data = data[:limit]
test_data = data[limit:]

Verifying if the train and test data have enough positive and negative examples

In [77]:
LABEL_INDEX = 1 # each row is (sequence, label, length)

def _number_of_pos_labels(df):
    pos_labels = 0
    for value in df:
        if value[LABEL_INDEX] == [1, 0]:
            pos_labels += 1
    return pos_labels

pos_labels_train = _number_of_pos_labels(train_data)
total_labels_train = len(train_data)

pos_labels_test = _number_of_pos_labels(test_data)
total_labels_test = len(test_data)

print('Total number of positive labels:', pos_labels_train + pos_labels_test)
print('Proportion of positive labels on the Train data:', pos_labels_train/total_labels_train)
print('Proportion of positive labels on the Test data:', pos_labels_test/total_labels_test)

Total number of positive labels: 12500
Proportion of positive labels on the Train data: 0.49933333333333335
Proportion of positive labels on the Test data: 0.506

Input functions

In [159]:
def get_input_fn(df, batch_size, num_epochs=1, shuffle=True):  
    def input_fn():
        sequences = np.asarray([v for v in df[:,0]], dtype=np.int32)
        labels = np.asarray([v for v in df[:,1]], dtype=np.int32)
        length = np.asarray(df[:,2], dtype=np.int32)

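        dataset = (
            # tf.contrib.data.Dataset in TF 1.2/1.3; tf.data.Dataset from TF 1.4 onward
            tf.contrib.data.Dataset.from_tensor_slices((sequences, labels, length)) # reading data from memory
            .repeat(num_epochs) # repeat dataset the number of epochs
        )

        # for our "manual" test we don't want to shuffle the data
        if shuffle:
            dataset = dataset.shuffle(buffer_size=100000)

        # batch after shuffling so that individual examples, not whole batches, are shuffled
        dataset = dataset.batch(batch_size)

        # create iterator
        review, label, length = dataset.make_one_shot_iterator().get_next()

        features = {
            REVIEW_KEY: review,
            SEQUENCE_LENGTH_KEY: length,
        }
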
        return features, label
    return input_fn

In [160]:
features, label = get_input_fn(test_data, 2, shuffle=False)()

In [161]:
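with tf.Session() as sess:
    # fetch one batch: the review ids and their labels
    reviews, review_labels = sess.run([features[REVIEW_KEY], label])
    print(reviews)
    print(review_labels)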


[[    36     29   7503    978    465     10 201534    371     65     34
     102      7  12474      6 201534    621      4    567    264   2500
  201534   1607    153      3 201534    281     14     12     20   2444
       4    480   1003     19   2532   6769     65      5     14   1096
      34    301      5    266     12  21853     44    973      4 201534
    2500      3   1487   7763    439     73     37    102     14    182
     749    164      6      7 399999 106337   3349 399999   1301  99048
   11027     13      7   4706 399999 399999      7    333   1983    151
  201534   1570      3  59651     32  12734 201534    371     19   4424
     142   1670   1222    152    164 399999    992   9742    197    109
     246     86     39    234 201534 399999    635      3 201534  22866
    2913      6 201534 399999  33830  24445   2115 201534    215   8183
     295   2956 217684      4    359 399999    401   4537   2280     46
      36      5     76     34     36    338     65     56    941   1088
     615     73     81     94   6597      7   4403 399999 399999 399999
      41    303      4    253   6494    142    161      5 399999   6412
   12193     41     54   1716   8273  14789      0      0      0      0
       0      0      0      0      0      0      0      0      0      0
       0      0      0      0      0      0      0      0      0      0
       0      0      0      0      0      0      0      0      0      0
       0      0      0      0      0      0      0      0      0      0
       0      0      0      0      0      0      0      0      0      0
       0      0      0      0      0      0      0      0      0      0
       0      0      0      0      0      0      0      0      0      0
       0      0      0      0      0      0      0      0      0      0]
 [    37   1005     14     36 399999     58     48    591 399999     40
      12  32795   1945      4    159     20   1089 399999     48     31
  399999   4014 399999   4333     46 399999 399999    588      7   5988
  399999      3 201534   5059 399999    273 399999     84   2182      4
    2199 399999     94    591     30      7 399999     10      7   1594
       3  16580      5      7   1594      3 399999   2050     14     36
     191 399999 399999    169    285    551     13 201534 399999     10
  201534   2661 399999 399999 399999  12073     20     94     33     51
  399999     41 399999    253 399999    127 399999     14    440     73
    1645    599 399999     41   5020 399999      7    219 399999     20
    1349      7    530   1078   1896     37     14 201534    611   1062
       3 201534 214247    238     20     10      7    191   5115 399999
      20    149     36   1089 399999     81    303 201534  34357     56
    5320 399999     66 201534   7118   1255 399999   1062      3 204834
    1176 399999  24235   1666      7    365   3747  91549     25    285
  399999     81   1716     37 399999     43    965     30      7  15002
       5   7894 399999     81     32      7    567 399999    414    303
       4   2065     60   7118 399999 204834 399999   5300   9492     87
    5976   1720   3910 201534     58     87   2459  34357      5     17
     557   1410      0      0      0      0      0      0      0      0
       0      0      0      0      0      0      0      0      0      0
       0      0      0      0      0      0      0      0      0      0
       0      0      0      0      0      0      0      0      0      0
       0      0      0      0      0      0      0      0      0      0
       0      0      0      0      0      0      0      0      0      0]]
[[0 1]
 [0 1]]

In [83]:
train_input_fn = get_input_fn(train_data, BATCH_SIZE, None)
test_input_fn = get_input_fn(test_data, BATCH_SIZE)

Creating the Estimator model

In [166]:
def get_model_fn(rnn_cell_sizes,
                 label_dimension,
                 dnn_layer_sizes=(),
                 optimizer='Adam',      # assumption: the original default is not shown
                 learning_rate=0.001):  # assumption: the original default is not shown

    def model_fn(features, labels, mode):
        review = features[REVIEW_KEY]
        sequence_length = tf.cast(features[SEQUENCE_LENGTH_KEY], tf.int32)

        # Creating embedding: look up the pretrained vector for each word id
        data = tf.nn.embedding_lookup(word_vector, review)

        # Each RNN layer will consist of a LSTM cell
        rnn_layers = [tf.nn.rnn_cell.LSTMCell(size) for size in rnn_cell_sizes]

        # Construct the layers
        multi_rnn_cell = tf.nn.rnn_cell.MultiRNNCell(rnn_layers)

        # Runs the RNN model dynamically
        # more about it at: 
        outputs, final_state = tf.nn.dynamic_rnn(cell=multi_rnn_cell,
                                                 inputs=data,
                                                 dtype=tf.float32)

        # Slice to keep only the last cell of the RNN
        last_activations = rnn_common.select_last_activations(outputs, sequence_length)

        # Construct dense layers on top of the last cell of the RNN
        for units in dnn_layer_sizes:
            last_activations = tf.layers.dense(
                last_activations, units, activation=tf.nn.relu)

        # Final dense layer for prediction
        predictions = tf.layers.dense(last_activations, label_dimension)
        predictions_softmax = tf.nn.softmax(predictions)

        loss = None
        train_op = None
        eval_op = None

        preds_op = {
            'prediction': predictions_softmax,
            'label': labels
        }

        if mode == tf.estimator.ModeKeys.EVAL:
            eval_op = {
                "accuracy": tf.metrics.accuracy(
                    tf.argmax(input=predictions_softmax, axis=1),
                    tf.argmax(input=labels, axis=1))
            }

        if mode != tf.estimator.ModeKeys.PREDICT:
            loss = tf.losses.softmax_cross_entropy(labels, predictions)

        if mode == tf.estimator.ModeKeys.TRAIN:
            train_op = tf.contrib.layers.optimize_loss(
                loss=loss,
                global_step=tf.contrib.framework.get_global_step(),
                optimizer=optimizer,
                learning_rate=learning_rate)

        # predictions are the softmax probabilities, indexed as p[0], p[1] when predicting
        return model_fn_lib.EstimatorSpec(mode,
                                          predictions=predictions_softmax,
                                          loss=loss,
                                          train_op=train_op,
                                          eval_metric_ops=eval_op)
    return model_fn
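
The "keep only the last cell" step can be pictured as picking, for each sequence in the batch, the RNN output at its final real time step. The NumPy sketch below illustrates that idea on toy data. It is not the implementation of rnn_common.select_last_activations, and all names in it are made up.

```python
import numpy as np

batch, max_time, hidden = 2, 4, 3

# Toy RNN outputs with shape (batch, max_time, hidden)
toy_outputs = np.arange(batch * max_time * hidden,
                        dtype=np.float32).reshape(batch, max_time, hidden)
toy_lengths = np.array([4, 2])  # the second sequence has only 2 real steps

# For each sequence b, select toy_outputs[b, toy_lengths[b] - 1, :]
last = toy_outputs[np.arange(batch), toy_lengths - 1]
print(last.shape)  # (batch, hidden)
print(last)
```

In this notebook every length is set to the maximum sequence length, so the selection always lands on the final time step.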

In [167]:
model_fn = get_model_fn(rnn_cell_sizes=[64], # size of the hidden layers
                        label_dimension=2, # since there are just 2 classes
                        dnn_layer_sizes=[128, 64]) # size of units in the dense layers on top of the RNN
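
The final dense layer produces one unnormalized score (logit) per class. Softmax turns those scores into probabilities, and the cross-entropy loss compares them with the one-hot labels. A small NumPy sketch of both steps on toy values follows. Averaging the per-example losses is a simple choice made for this illustration and is not a statement about TensorFlow's exact reduction.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

toy_logits = np.array([[2.0, 0.0],
                       [0.0, 0.0]])
toy_labels = np.array([[1, 0],
                       [0, 1]])  # [1, 0]: positive, [0, 1]: negative

probs = softmax(toy_logits)

# Cross-entropy per example is -sum(label * log(prob)); average over the batch
loss = -np.mean(np.sum(toy_labels * np.log(probs), axis=1))
print(probs)
print(loss)
```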

Create and Run Experiment

In [90]:
# create experiment
def generate_experiment_fn():
    """
    Create an experiment function given hyperparameters.

    Returns:
        A function (output_dir) -> Experiment where output_dir is a string
        representing the location of summaries, checkpoints, and exports.
        This function is used by learn_runner to create an Experiment which
        executes model code provided in the form of an Estimator and
        input functions.

        All listed arguments in the outer function are used to create an
        Estimator, and input functions (training, evaluation, serving).
        Unlisted args are passed through to Experiment.
    """

    def _experiment_fn(run_config, hparams):
        estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)
        return tf.contrib.learn.Experiment(
            estimator,
            train_input_fn=train_input_fn,
            eval_input_fn=test_input_fn,
            train_steps=STEPS
        )
    return _experiment_fn

In [91]:
# run experiment
learn_runner.run(generate_experiment_fn(), run_config=tf.contrib.learn.RunConfig(model_dir='testing2'))

Making Predictions

First let's generate our own sentences to see how the model classifies them.

In [188]:
def string_to_array(s, separator=' '):
    return s.split(separator)

def generate_data_row(sentence, label, max_length):
    sequence = np.zeros((max_length), dtype='int32')
    # note: word_list.index raises ValueError for words missing from the vocabulary
    for i, word in enumerate(string_to_array(sentence)):
        sequence[i] = word_list.index(word)
    return sequence, label, max_length

def generate_data(sentences, labels, max_length):
    data = []
    for s, l in zip(sentences, labels):
        data.append(generate_data_row(s, l, max_length))
    # object array: each row holds (sequence, label, length)
    return np.asarray(data, dtype=object)

sentences = ['i thought the movie was incredible and inspiring', 
             'this is a great movie',
             'this is a good movie but isnt the best',
             'it was fine i guess',
             'it was definitely bad',
             'its not that bad',
             'its not that bad i think its a good movie',
             'its not bad i think its a good movie']

labels = [[1, 0],
          [1, 0],
          [1, 0],
          [0, 1],
          [0, 1],
          [1, 0],
          [1, 0],
          [1, 0]] # [1, 0]: positive, [0, 1]: negative

my_test_data = generate_data(sentences, labels, 10)

In [187]:
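estimator = tf.estimator.Estimator(model_fn=model_fn,
                                   # model directory as reported in the log output below
                                   config=tf.contrib.learn.RunConfig(model_dir='tensorboard/batch_32'))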

preds = estimator.predict(input_fn=get_input_fn(my_test_data, 1, 1, shuffle=False))

for p, s in zip(preds, sentences):
    print('sentence:', s)
    print('good review:', p[0], 'bad review:', p[1])
    print('-' * 10)

INFO:tensorflow:Using config: {'_save_checkpoints_steps': None, '_session_config': None, '_keep_checkpoint_every_n_hours': 10000, '_save_summary_steps': 100, '_num_ps_replicas': 0, '_cluster_spec': < object at 0x7f399109fd68>, '_tf_random_seed': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
, '_model_dir': 'tensorboard/batch_32', '_task_id': 0, '_evaluation_master': '', '_environment': 'local', '_master': '', '_is_chief': True, '_save_checkpoints_secs': 600, '_task_type': None, '_keep_checkpoint_max': 5, '_num_worker_replicas': 0}

WARNING:tensorflow:Input graph does not contain a QueueRunner. That means predict yields forever. This is probably a mistake.
INFO:tensorflow:Restoring parameters from tensorboard/batch_32/model.ckpt-12547
sentence: i thought the movie was incredible and inspiring
good review: 0.932236 bad review: 0.0677641
sentence: this is a great movie
good review: 0.980506 bad review: 0.0194938
sentence: this is a good movie but isnt the best
good review: 0.926685 bad review: 0.0733154
sentence: it was fine i guess
good review: 0.504088 bad review: 0.495912
sentence: it was definitely bad
good review: 0.0672349 bad review: 0.932765
sentence: its not that bad
good review: 0.234982 bad review: 0.765018
sentence: its not that bad i think its a good movie
good review: 0.323138 bad review: 0.676862
sentence: its not bad i think its a good movie
good review: 0.461568 bad review: 0.538432
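
To compare these predictions with the hand-assigned labels, one simple option is to threshold the "good review" probability at 0.5. The sketch below does that using the probabilities copied from the output above. The 0.5 threshold and the variable names are choices made for this illustration.

```python
import numpy as np

# "good review" probabilities copied from the output above
good_probs = np.array([0.932236, 0.980506, 0.926685, 0.504088,
                       0.0672349, 0.234982, 0.323138, 0.461568])

# 1 = positive, 0 = negative, following the labels list defined earlier
expected_positive = np.array([1, 1, 1, 0, 0, 1, 1, 1])

# Call a review positive when its "good review" probability exceeds 0.5
predicted_positive = (good_probs > 0.5).astype(int)
accuracy = np.mean(predicted_positive == expected_positive)
print(predicted_positive)
print(accuracy)
```

On these eight sentences the thresholded predictions agree with the labels half the time. The misses are the hedged or negated phrasings such as "it was fine i guess" and "its not that bad".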