Text classification using Neural Networks

The goal of this notebook is to learn to use Neural Networks for text classification.

In this notebook, we will:

  • Train a shallow model with learnt embeddings
  • Download pre-trained GloVe embeddings
  • Use these pre-trained embeddings

However keep in mind:

  • Deep Learning can be better at text classification than simpler ML techniques, but only on very large datasets and with well designed/tuned models.
  • We won't be using the most efficient (in terms of computing) techniques, as Keras is good for prototyping but rather inefficient for training small embedding models on text.
  • The following projects can replicate similar word embedding models much more efficiently: word2vec and gensim's word2vec (self-supervised learning only), fastText (both supervised and self-supervised learning), Vowpal Wabbit (supervised learning).
  • Plain shallow sparse TF-IDF bigram features without any embedding, combined with Logistic Regression or Multinomial Naive Bayes, are often competitive on small to medium datasets (see the baseline sketch right after this list).
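
To make the last point concrete, here is a minimal baseline sketch with scikit-learn; the vectorizer settings (bigrams, min_df=2) are illustrative assumptions rather than tuned values, and the classifiers keep their default hyperparameters:

In [ ]:
# A minimal sparse baseline (sketch): TF-IDF unigrams + bigrams, no embeddings.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

baseline_train = fetch_20newsgroups(subset='train')
baseline_test = fetch_20newsgroups(subset='test')

# min_df=2 is an arbitrary illustrative choice, not a tuned value.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
X_tfidf_train = vectorizer.fit_transform(baseline_train.data)
X_tfidf_test = vectorizer.transform(baseline_test.data)

for clf in (LogisticRegression(), MultinomialNB()):
    clf.fit(X_tfidf_train, baseline_train.target)
    acc = np.mean(clf.predict(X_tfidf_test) == baseline_test.target)
    print(clf.__class__.__name__, "test accuracy: %0.3f" % acc)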

20 Newsgroups Dataset

The 20 Newsgroups dataset is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups: http://qwone.com/~jason/20Newsgroups/


In [1]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

In [2]:
sample_idx = 1000
print(newsgroups_train["data"][sample_idx])


From: dabl2@nlm.nih.gov (Don A.B. Lindbergh)
Subject: Diamond SS24X, Win 3.1, Mouse cursor
Organization: National Library of Medicine
Lines: 10


Anybody seen mouse cursor distortion running the Diamond 1024x768x256 driver?
Sorry, don't know the version of the driver (no indication in the menus) but it's a recently
delivered Gateway system.  Am going to try the latest drivers from Diamond BBS but wondered
if anyone else had seen this.

post or email

--Don Lindbergh
dabl2@lhc.nlm.nih.gov


In [3]:
target_names = newsgroups_train["target_names"]

target_id = newsgroups_train["target"][sample_idx]
print("Class of previous message:", target_names[target_id])


Class of previous message: comp.os.ms-windows.misc

Here are all the possible classes:


In [4]:
target_names


Out[4]:
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

Preprocessing text for the (supervised) CBOW model

We will implement a simple classification model in Keras. Raw text requires (sometimes a lot of) preprocessing.

The following cells use Keras to preprocess the text:

  • using a tokenizer. You may use different tokenizers (from scikit-learn, NLTK, a custom Python function, etc.). This converts the texts into sequences of indices representing the 20000 most frequent words;
  • sequences have different lengths, so we pad them with 0s (Keras pads at the beginning by default) and truncate them so that each sequence has length 1000;
  • we convert the output classes to one-hot encodings.

In [5]:
from keras.preprocessing.text import Tokenizer

MAX_NB_WORDS = 20000

# get the raw text data
texts_train = newsgroups_train["data"]
texts_test = newsgroups_test["data"]

# finally, vectorize the text samples into a 2D integer tensor
tokenizer = Tokenizer(nb_words=MAX_NB_WORDS, char_level=False)
tokenizer.fit_on_texts(texts_train)
sequences = tokenizer.texts_to_sequences(texts_train)
sequences_test = tokenizer.texts_to_sequences(texts_test)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))


Using TensorFlow backend.
Found 134142 unique tokens.

Each tokenized text is converted to a list of token ids (integer codes):


In [6]:
sequences[0]


Out[6]:
[14,
 4314,
 1358,
 15,
 11388,
 38,
 250,
 29,
 42,
 298,
 9,
 17,
 95,
 78,
 91,
 4314,
 1358,
 15,
 34,
 77,
 3,
 2966,
 610,
 1772,
 32,
 211,
 8,
 26,
 1312,
 27,
 171,
 66,
 47,
 123,
 9966,
 63,
 16,
 17,
 298,
 8,
 708,
 1,
 86,
 263,
 11,
 26,
 4,
 36,
 1502,
 2274,
 298,
 1163,
 2,
 18,
 14,
 1,
 1349,
 14159,
 845,
 15953,
 11,
 26,
 337,
 4,
 1,
 4056,
 80,
 182,
 484,
 7,
 1378,
 1,
 843,
 8318,
 26,
 1837,
 14,
 1,
 818,
 3,
 1,
 726,
 17,
 9,
 44,
 8,
 88,
 27,
 171,
 39,
 4,
 833,
 273,
 1080,
 2912,
 198,
 3,
 2820,
 153,
 17,
 298,
 9,
 239,
 628,
 25,
 809,
 357,
 13,
 21,
 16,
 17,
 384,
 298,
 181,
 112,
 188,
 206,
 1497,
 1342,
 2,
 13,
 35,
 58,
 7944]

The tokenizer object stores a mapping (vocabulary) from word strings to token ids that can be inverted to reconstruct the original message (without formatting):


In [7]:
type(tokenizer.word_index), len(tokenizer.word_index)


Out[7]:
(dict, 134142)

In [8]:
index_to_word = dict((i, w) for w, i in tokenizer.word_index.items())

In [9]:
" ".join([index_to_word[i] for i in sequences[0]])


Out[9]:
"from wam umd edu where's my thing subject what car is this nntp posting host wam umd edu organization university of maryland college park lines 15 i was wondering if anyone out there could enlighten me on this car i saw the other day it was a 2 door sports car looked to be from the late 60s early 70s it was called a the doors were really small in addition the front bumper was separate from the rest of the body this is all i know if anyone can a model name engine specs years of production where this car is made history or whatever info you have on this looking car please e mail thanks il brought to you by your neighborhood"

Let's have a closer look at the tokenized sequences:


In [10]:
seq_lens = [len(s) for s in sequences]
print("average length: %0.1f" % np.mean(seq_lens))
print("max length: %d" % max(seq_lens))


average length: 302.5
max length: 15367

In [11]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.hist(seq_lens, bins=50);


Let's zoom in on the distribution of regular-sized posts. The vast majority of the posts have fewer than 1000 tokens:


In [12]:
plt.hist([l for l in seq_lens if l < 3000], bins=50);


Let's truncate and pad all the sequences to 1000 tokens to build the training set:


In [13]:
from keras.preprocessing.sequence import pad_sequences


MAX_SEQUENCE_LENGTH = 1000

# pad sequences with 0s
x_train = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
x_test = pad_sequences(sequences_test, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', x_train.shape)
print('Shape of data test tensor:', x_test.shape)


Shape of data tensor: (11314, 1000)
Shape of data test tensor: (7532, 1000)

In [14]:
from keras.utils.np_utils import to_categorical
y_train = newsgroups_train["target"]
y_test = newsgroups_test["target"]

y_train = to_categorical(np.asarray(y_train))
print('Shape of label tensor:', y_train.shape)


Shape of label tensor: (11314, 20)

A simple supervised CBOW model in Keras

The following builds a very simple model, similar to the one described in the fastText paper:

  • Build an embedding layer mapping each word to a vector representation
  • Compute the vector representation of all words in each sequence and average them
  • Add a dense layer to output 20 classes (+ softmax)

In [15]:
from keras.layers import Dense, Input, Flatten
from keras.layers import GlobalAveragePooling1D, Embedding
from keras.models import Model

EMBEDDING_DIM = 50
N_CLASSES = len(target_names)

# input: a sequence of MAX_SEQUENCE_LENGTH integers
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')

embedding_layer = Embedding(MAX_NB_WORDS, EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True)
embedded_sequences = embedding_layer(sequence_input)

average = GlobalAveragePooling1D()(embedded_sequences)
predictions = Dense(N_CLASSES, activation='softmax')(average)

model = Model(sequence_input, predictions)
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['acc'])

In [16]:
model.fit(x_train, y_train, validation_split=0.1,
          nb_epoch=10, batch_size=128, verbose=2)


Train on 10182 samples, validate on 1132 samples
Epoch 1/10
1s - loss: 2.9871 - acc: 0.0768 - val_loss: 2.9786 - val_acc: 0.1042
Epoch 2/10
0s - loss: 2.9624 - acc: 0.1167 - val_loss: 2.9540 - val_acc: 0.1396
Epoch 3/10
0s - loss: 2.9288 - acc: 0.1362 - val_loss: 2.9176 - val_acc: 0.1440
Epoch 4/10
0s - loss: 2.8870 - acc: 0.1936 - val_loss: 2.8745 - val_acc: 0.2058
Epoch 5/10
0s - loss: 2.8427 - acc: 0.2136 - val_loss: 2.8308 - val_acc: 0.2332
Epoch 6/10
0s - loss: 2.7943 - acc: 0.2840 - val_loss: 2.7818 - val_acc: 0.2995
Epoch 7/10
0s - loss: 2.7402 - acc: 0.3372 - val_loss: 2.7267 - val_acc: 0.3498
Epoch 8/10
0s - loss: 2.6786 - acc: 0.3894 - val_loss: 2.6657 - val_acc: 0.3860
Epoch 9/10
0s - loss: 2.6098 - acc: 0.4411 - val_loss: 2.5982 - val_acc: 0.4435
Epoch 10/10
0s - loss: 2.5358 - acc: 0.4873 - val_loss: 2.5267 - val_acc: 0.4629
Out[16]:
<keras.callbacks.History at 0x7f10b58eccf8>

Exercise

  • compute model accuracy on test set

In [17]:
# %load solutions/accuracy.py
output_test = model.predict(x_test)
test_classes = np.argmax(output_test, axis=-1)
print("test accuracy:", np.mean(test_classes == y_test))


test accuracy: 0.412373871482

Building more complex models

Exercise

  • From the previous template, build more complex models using:
    • 1D convolution and 1D max pooling. Note that you will still need a GlobalAveragePooling1D or Flatten layer after the convolutions
    • Recurrent neural networks through LSTM (you will need to reduce the sequence length beforehand)

Bonus

  • You may try different architectures with:
    • more intermediate layers, combination of dense, conv, recurrent
    • different recurrent cells (GRU, SimpleRNN)
    • bidirectional LSTMs (a sketch follows the note below)

Note: the goal is to build working models rather than to maximize test accuracy. To achieve much better results, we would need more computation time and more data. Build your models and verify that they converge to reasonable results.
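
For the bidirectional LSTM bonus item (not covered by the solution cells below), here is a minimal untrained sketch. It assumes your Keras version provides the Bidirectional wrapper and reuses MAX_SEQUENCE_LENGTH, N_CLASSES and the embedding_layer defined above:

In [ ]:
# Sketch: shorten the sequence with Conv1D / MaxPooling1D, then read it
# in both directions with a bidirectional LSTM.
from keras.layers import Bidirectional, LSTM, Conv1D, MaxPooling1D, Dense, Input
from keras.models import Model

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)

# Reduce the sequence length (roughly by a factor of 25) before the LSTM.
x = Conv1D(64, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(64, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)

# The bidirectional LSTM reads the shortened sequence in both directions.
x = Bidirectional(LSTM(64))(x)
predictions = Dense(N_CLASSES, activation='softmax')(x)

bidir_model = Model(sequence_input, predictions)
bidir_model.compile(loss='categorical_crossentropy',
                    optimizer='adam', metrics=['acc'])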


In [18]:
# %load solutions/lstm.py
from keras.layers import LSTM, Conv1D, MaxPooling1D

# input: a sequence of MAX_SEQUENCE_LENGTH integers
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)

# 1D convolution with 64 output channels
x = Conv1D(64, 5)(embedded_sequences)
# MaxPool divides the length of the sequence by 5
x = MaxPooling1D(5)(x)
x = Conv1D(64, 5)(x)
x = MaxPooling1D(5)(x)
# LSTM layer with a hidden size of 64
x = LSTM(64)(x)
predictions = Dense(20, activation='softmax')(x)

model = Model(sequence_input, predictions)
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['acc'])

# You will get large speedups with these models by using a GPU
# The model might take a lot of time to converge, and even more
# if you add dropout (needed to prevent overfitting)

In [19]:
# %load solutions/conv1d.py
from keras.layers import Conv1D, MaxPooling1D, Flatten

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)

# A 1D convolution with 128 output channels
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
# MaxPool divides the length of the sequence by 5
x = MaxPooling1D(5)(x)
# A 1D convolution with 64 output channels
x = Conv1D(64, 5, activation='relu')(x)
# MaxPool divides the length of the sequence by 5
x = MaxPooling1D(5)(x)
x = Flatten()(x)

predictions = Dense(20, activation='softmax')(x)

model = Model(sequence_input, predictions)
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['acc'])

In [20]:
model.fit(x_train, y_train, validation_split=0.1,
          nb_epoch=10, batch_size=128, verbose=2)


Train on 10182 samples, validate on 1132 samples
Epoch 1/10
3s - loss: 2.1595 - acc: 0.2782 - val_loss: 1.5395 - val_acc: 0.4647
Epoch 2/10
2s - loss: 1.1247 - acc: 0.6185 - val_loss: 1.0682 - val_acc: 0.6581
Epoch 3/10
2s - loss: 0.6778 - acc: 0.7800 - val_loss: 0.8538 - val_acc: 0.7429
Epoch 4/10
2s - loss: 0.4322 - acc: 0.8677 - val_loss: 0.7542 - val_acc: 0.7774
Epoch 5/10
2s - loss: 0.2852 - acc: 0.9159 - val_loss: 0.7029 - val_acc: 0.7986
Epoch 6/10
2s - loss: 0.1859 - acc: 0.9463 - val_loss: 0.6669 - val_acc: 0.8251
Epoch 7/10
2s - loss: 0.1231 - acc: 0.9672 - val_loss: 0.7186 - val_acc: 0.8207
Epoch 8/10
2s - loss: 0.0803 - acc: 0.9799 - val_loss: 0.6855 - val_acc: 0.8410
Epoch 9/10
2s - loss: 0.0518 - acc: 0.9882 - val_loss: 0.7078 - val_acc: 0.8472
Epoch 10/10
2s - loss: 0.0335 - acc: 0.9946 - val_loss: 0.7555 - val_acc: 0.8366
Out[20]:
<keras.callbacks.History at 0x7f10afaaec50>

Loading pre-trained embeddings

The file glove100K.100d.txt is an extract of the GloVe vectors that were trained on English Wikipedia 2014 + Gigaword 5 (6B tokens).

We extracted the 100,000 most frequent words. Each vector has dimension 100.


In [21]:
embeddings_index = {}
embeddings_vectors = []
f = open('glove100K.100d.txt', 'rb')

word_idx = 0
for line in f:
    values = line.decode('utf-8').split()
    word = values[0]
    vector = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = word_idx
    embeddings_vectors.append(vector)
    word_idx = word_idx + 1
f.close()

inv_index = {v: k for k, v in embeddings_index.items()}
print("found %d different words in the file" % word_idx)


found 100000 different words in the file

In [22]:
# Stack all embeddings in a large numpy array
glove_embeddings = np.vstack(embeddings_vectors)
glove_norms = np.linalg.norm(glove_embeddings, axis=-1, keepdims=True)
glove_embeddings_normed = glove_embeddings / glove_norms
print(glove_embeddings.shape)


(100000, 100)

In [23]:
def get_emb(word):
    idx = embeddings_index.get(word)
    if idx is None:
        return None
    else:
        return glove_embeddings[idx]

    
def get_normed_emb(word):
    idx = embeddings_index.get(word)
    if idx is None:
        return None
    else:
        return glove_embeddings_normed[idx]

In [24]:
get_emb("computer")


Out[24]:
array([ -1.62980005e-01,   3.01409990e-01,   5.79779983e-01,
         6.65479973e-02,   4.58350003e-01,  -1.53290004e-01,
         4.32579994e-01,  -8.92149985e-01,   5.77470005e-01,
         3.63750011e-01,   5.65240026e-01,  -5.62810004e-01,
         3.56590003e-01,  -3.60960007e-01,  -9.96619985e-02,
         5.27530015e-01,   3.88390005e-01,   9.61849988e-01,
         1.88409999e-01,   3.07410002e-01,  -8.78419995e-01,
        -3.24420005e-01,   1.12020004e+00,   7.51259997e-02,
         4.26609993e-01,  -6.06509984e-01,  -1.38929993e-01,
         4.78620008e-02,  -4.51579988e-01,   9.37229991e-02,
         1.74630001e-01,   1.09619999e+00,  -1.00440001e+00,
         6.38889968e-02,   3.80019993e-01,   2.11089998e-01,
        -6.62469983e-01,  -4.07359987e-01,   8.94420028e-01,
        -6.09740019e-01,  -1.85770005e-01,  -1.99129999e-01,
        -6.92260027e-01,  -3.18060011e-01,  -7.85650015e-01,
         2.38309994e-01,   1.29920006e-01,   8.77209976e-02,
         4.32049990e-01,  -2.26620004e-01,   3.15490007e-01,
        -3.17479998e-01,  -2.46319990e-03,   1.66150004e-01,
         4.23579991e-01,  -1.80869997e+00,  -3.66990000e-01,
         2.39490002e-01,   2.54579997e+00,   3.61110002e-01,
         3.94859985e-02,   4.86070007e-01,  -3.69740009e-01,
         5.72820008e-02,  -4.93169993e-01,   2.27650002e-01,
         7.99660027e-01,   2.14279994e-01,   6.98109984e-01,
         1.12619996e+00,  -1.35260001e-01,   7.19720006e-01,
        -9.96049959e-04,  -2.68420011e-01,  -8.30380023e-01,
         2.17800006e-01,   3.43549997e-01,   3.77310008e-01,
        -4.02509987e-01,   3.31239998e-01,   1.25759995e+00,
        -2.71959990e-01,  -8.60930026e-01,   9.00529996e-02,
        -2.48760009e+00,   4.51999992e-01,   6.69449985e-01,
        -5.46480000e-01,  -1.03239998e-01,  -1.69790000e-01,
         5.94370008e-01,   1.12800002e+00,   7.57550001e-01,
        -5.91600016e-02,   1.51519999e-01,  -2.83879995e-01,
         4.94520009e-01,  -9.17029977e-01,   9.12890017e-01,
        -3.09269994e-01], dtype=float32)

Finding most similar words

Exercise

Build a function to find most similar words, given a word as query:

  • lookup the vector for the query word in the Glove index;
  • compute the cosine similarity between a word embedding and all other words;
  • display the top 10 most similar words.

Bonus

Change your function so that it takes multiple words as input (by averaging them)


In [25]:
# %load solutions/most_similar.py
def most_similar(words, topn=10):
    query_emb = 0
    # If we have a list of words instead of one word
    # (bonus question)
    if type(words) == list:
        for word in words:
            query_emb += get_emb(word)       
    else:
        query_emb = get_emb(words)
        
    query_emb = query_emb / np.linalg.norm(query_emb)
    
    # Large numpy vector with all cosine similarities
    # between emb and all other words
    cosines = np.dot(glove_embeddings_normed, query_emb)
    
    # topn most similar indexes corresponding to cosines
    idxs = np.argsort(cosines)[::-1][:topn]
    
    # pretty return with word and similarity
    return [(inv_index[idx], cosines[idx]) for idx in idxs]

In [26]:
most_similar("cpu")


Out[26]:
[('cpu', 0.99999994),
 ('processor', 0.77934384),
 ('cpus', 0.7651844),
 ('microprocessor', 0.73606336),
 ('processors', 0.67348146),
 ('motherboard', 0.66757727),
 ('x86', 0.66559219),
 ('pentium', 0.64758503),
 ('gpu', 0.64488822),
 ('i/o', 0.63523525)]

In [27]:
most_similar("pitt")


Out[27]:
[('pitt', 0.99999988),
 ('angelina', 0.67506433),
 ('jolie', 0.65090513),
 ('parker', 0.60754234),
 ('clooney', 0.59994191),
 ('brad', 0.59897137),
 ('thornton', 0.59552884),
 ('aniston', 0.59510386),
 ("o'donnell", 0.56095952),
 ('willis', 0.55930793)]

In [28]:
most_similar("jolie")


Out[28]:
[('jolie', 0.99999994),
 ('angelina', 0.90007746),
 ('pitt', 0.65090501),
 ('aniston', 0.64121068),
 ('changeling', 0.61878836),
 ('neeson', 0.60497248),
 ('kidman', 0.58190113),
 ('blanchett', 0.58027673),
 ('fonda', 0.57476908),
 ('clooney', 0.56517184)]

Predict the future better than tarot:


In [29]:
np.dot(get_normed_emb('aniston'), get_normed_emb('pitt'))


Out[29]:
0.59510386

In [30]:
np.dot(get_normed_emb('jolie'), get_normed_emb('pitt'))


Out[30]:
0.65090507

In [31]:
most_similar("1")


Out[31]:
[('1', 1.0000001),
 ('2', 0.97136045),
 ('3', 0.95510161),
 ('4', 0.93552566),
 ('5', 0.91438848),
 ('6', 0.90160364),
 ('8', 0.88341451),
 ('7', 0.87454295),
 ('9', 0.84709477),
 ('10', 0.8159793)]

In [32]:
# bonus: yangtze is a chinese river
most_similar(["river", "chinese"])


Out[32]:
[('river', 0.819628),
 ('chinese', 0.78044909),
 ('china', 0.71793413),
 ('mainland', 0.68212992),
 ('along', 0.66073483),
 ('yangtze', 0.6465708),
 ('sea', 0.6431601),
 ('south', 0.64280045),
 ('korean', 0.64098263),
 ('southern', 0.63223374)]

Displaying vectors with t-SNE


In [33]:
from sklearn.manifold import TSNE

word_emb_tsne = TSNE(perplexity=30).fit_transform(glove_embeddings_normed[:1000])

In [34]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure(figsize=(40, 40))
axis = plt.gca()
np.set_printoptions(suppress=True)
plt.scatter(word_emb_tsne[:, 0], word_emb_tsne[:, 1], marker=".", s=1)

for idx in range(1000):
    plt.annotate(inv_index[idx],
                 xy=(word_emb_tsne[idx, 0], word_emb_tsne[idx, 1]),
                 xytext=(0, 0), textcoords='offset points')
plt.savefig("tsne.png")
plt.show()


Using pre-trained embeddings in our model

We want to use these pre-trained embeddings for transfer learning. This process is rather similar to transfer learning in image recognition: the features learnt on words might help us bootstrap the learning process, and increase performance if we don't have enough training data.

  • We initialize the model's embedding matrix with GloVe embeddings:
    • take all words from our 20 Newsgroups vocabulary (MAX_NB_WORDS = 20000) and look up their GloVe embedding
    • place the GloVe embedding at the corresponding index in the matrix
    • if the word is not in the GloVe vocabulary, the corresponding row stays all zeros
  • We may fix these embeddings or fine-tune them

In [35]:
EMBEDDING_DIM = 100

# prepare embedding matrix
nb_words_in_matrix = 0
nb_words = min(MAX_NB_WORDS, len(word_index))
embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i >= MAX_NB_WORDS:
        continue
    embedding_vector = get_emb(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
        nb_words_in_matrix = nb_words_in_matrix + 1
        
print("added %d words in the embedding matrix" % nb_words_in_matrix)


added 15848 words in the embedding matrix

Build a layer with pre-trained embeddings:


In [36]:
pretrained_embedding_layer = Embedding(
    MAX_NB_WORDS, EMBEDDING_DIM,
    weights=[embedding_matrix],
    input_length=MAX_SEQUENCE_LENGTH,
)

A model with pre-trained embeddings

Averaging word embeddings pre-trained with GloVe / word2vec usually works surprisingly well. However, when averaging more than 10-15 words, the resulting vector becomes too noisy and classification performance degrades.


In [37]:
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = pretrained_embedding_layer(sequence_input)
average = GlobalAveragePooling1D()(embedded_sequences)
predictions = Dense(N_CLASSES, activation='softmax')(average)

model = Model(sequence_input, predictions)

# We don't want to fine-tune embeddings
model.layers[1].trainable=False

model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['acc'])

In [38]:
model.fit(x_train, y_train, validation_split=0.1,
          nb_epoch=10, batch_size=128, verbose=2)

# Note: on this type of task, this technique will
# degrade results as we train far fewer parameters
# and we average a large number of pre-trained embeddings.
# You will notice much less overfitting, though!
# Using convolutions / LSTM will help.
# It is also advisable to treat pre-trained embeddings and
# out-of-vocabulary words separately (see the sketch after
# the training output below).


Train on 10182 samples, validate on 1132 samples
Epoch 1/10
0s - loss: 2.9760 - acc: 0.0552 - val_loss: 2.9526 - val_acc: 0.0936
Epoch 2/10
0s - loss: 2.9393 - acc: 0.1145 - val_loss: 2.9232 - val_acc: 0.1484
Epoch 3/10
0s - loss: 2.9125 - acc: 0.1503 - val_loss: 2.8986 - val_acc: 0.1935
Epoch 4/10
0s - loss: 2.8884 - acc: 0.1925 - val_loss: 2.8759 - val_acc: 0.2067
Epoch 5/10
0s - loss: 2.8660 - acc: 0.2310 - val_loss: 2.8547 - val_acc: 0.2412
Epoch 6/10
0s - loss: 2.8450 - acc: 0.2485 - val_loss: 2.8356 - val_acc: 0.2420
Epoch 7/10
0s - loss: 2.8252 - acc: 0.2588 - val_loss: 2.8165 - val_acc: 0.2641
Epoch 8/10
0s - loss: 2.8066 - acc: 0.2812 - val_loss: 2.7987 - val_acc: 0.2809
Epoch 9/10
0s - loss: 2.7890 - acc: 0.3005 - val_loss: 2.7819 - val_acc: 0.2906
Epoch 10/10
0s - loss: 2.7721 - acc: 0.3041 - val_loss: 2.7667 - val_acc: 0.3048
Out[38]:
<keras.callbacks.History at 0x7f1058447c88>
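
As one possible illustration of the last comment in the cell above (a sketch, not the notebook's reference solution): give out-of-vocabulary words small random vectors instead of zeros, and keep the layer trainable so those rows can be learnt while the GloVe rows start from good values. It reuses embedding_matrix, word_index and get_emb defined earlier.

In [ ]:
# Sketch: random init for out-of-vocabulary rows instead of zeros, then
# fine-tune the whole embedding layer (pre-trained rows start from GloVe).
rng = np.random.RandomState(42)
embedding_matrix_oov = np.copy(embedding_matrix)
for word, i in word_index.items():
    if i >= MAX_NB_WORDS:
        continue
    if get_emb(word) is None:
        # the 0.1 scale is an arbitrary choice
        embedding_matrix_oov[i] = rng.normal(scale=0.1, size=EMBEDDING_DIM)

oov_aware_embedding_layer = Embedding(
    MAX_NB_WORDS, EMBEDDING_DIM,
    weights=[embedding_matrix_oov],
    input_length=MAX_SEQUENCE_LENGTH,
    trainable=True,
)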

Reality check

On small/medium datasets, simpler classification methods usually perform better, and are much more efficient to compute.

However, when looking at the learnt features, one can see that classification using such simple methods isn't very robust, and won't generalize well to slightly different domains (e.g. forum posts => emails).

Note: the Keras implementation for text is very slow due to Python overhead and the lack of hashing techniques. The fastText implementation (https://github.com/facebookresearch/fasttext) is much, much faster.
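
To make the "hashing techniques" remark concrete, here is a minimal sketch of the hashing trick with scikit-learn's HashingVectorizer (an illustration of the general idea, not of fastText's internals): tokens are mapped to a fixed number of buckets by a hash function, so no vocabulary has to be built or stored.

In [ ]:
# Sketch of the hashing trick: no vocabulary, tokens (and bigrams) are hashed
# directly into 2 ** 18 feature buckets.
from sklearn.feature_extraction.text import HashingVectorizer

hasher = HashingVectorizer(n_features=2 ** 18, ngram_range=(1, 2))
X_train_hashed = hasher.transform(newsgroups_train["data"])
print(X_train_hashed.shape)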

Going further

  • Compare pre-trained embeddings vs specifically trained embeddings
  • Train your own word vectors in any language using gensim's word2vec (a minimal sketch follows this list)
  • Check the Keras examples on IMDB sentiment analysis
  • Install fastText (Linux or macOS only; use the Linux VM if you are under Windows) and give it a try on the classification example in its repository.
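
A minimal sketch for the gensim bullet above; the parameter is named vector_size in gensim 4+ (size in older versions), and the whitespace tokenization is a deliberate over-simplification:

In [ ]:
# Sketch: train word2vec embeddings on the 20 Newsgroups text with gensim.
from gensim.models import Word2Vec

# Crude lowercasing + whitespace tokenization, just for illustration.
sentences = [doc.lower().split() for doc in newsgroups_train["data"]]

w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)
print(w2v.wv.most_similar("computer", topn=5))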

In [ ]: