Using wrappers for Gensim models with Keras

This tutorial shows how to use Gensim models as part of your Keras models.

The wrappers available (as of now) are:

  • Word2Vec (uses the function get_keras_embedding defined in gensim.models.keyedvectors); a minimal usage sketch is shown below
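As a quick illustration of the basic call, here is a minimal sketch using a tiny, made-up toy corpus; the full 20NewsGroups example below uses the same function with an additional word_index argument.

from gensim.models import word2vec

toy_corpus = [["human", "computer", "interface"], ["graph", "trees", "survey"]]
toy_model = word2vec.Word2Vec(toy_corpus, min_count=1)

# get_keras_embedding returns a keras.layers.Embedding layer initialised
# with the trained word vectors
toy_embedding_layer = toy_model.wv.get_keras_embedding(train_embeddings=False)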

Word2Vec

Integration with Keras: 20NewsGroups Task

To see how Gensim's Word2Vec model can be integrated with Keras for a supervised classification task, we consider the 20NewsGroups dataset. To keep the example small, we use only a subset of the documents, restricting ourselves to three of the newsgroup categories.

First, we import the necessary modules.


In [163]:
import os
import sys
import keras
import numpy as np

from gensim.models import word2vec

from keras.models import Model
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
from keras.layers import Input, Dense, Flatten
from keras.layers import Conv1D, MaxPooling1D

from sklearn.datasets import fetch_20newsgroups

We first load the training data. Then we format our text samples and labels into tensors that can be fed into a neural network. To do this, we rely on the Keras utilities keras.preprocessing.text.Tokenizer, keras.preprocessing.sequence.pad_sequences and keras.utils.np_utils.to_categorical.


In [164]:
dataset = fetch_20newsgroups(subset='train', categories=['alt.atheism', 'comp.graphics', 'sci.space'])

MAX_SEQUENCE_LENGTH = 1000

# Vectorize the text samples into a 2D integer tensor
tokenizer = Tokenizer()
tokenizer.fit_on_texts(dataset.data)
sequences = tokenizer.texts_to_sequences(dataset.data)

x_train = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
y_train = to_categorical(np.asarray(dataset.target))
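Each row of x_train is now a zero-padded (or truncated) sequence of word indices of length MAX_SEQUENCE_LENGTH, and each row of y_train is a one-hot label vector. A minimal sanity check of the resulting shapes (a sketch; the exact number of rows depends on the fetched data):

print(x_train.shape)    # (number of documents, MAX_SEQUENCE_LENGTH)
print(y_train.shape)    # (number of documents, number of categories), 3 categories here
print(x_train[0][:10])  # first ten entries of the first padded/truncated document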

Now we train a Word2Vec model on the documents we have. From the Word2Vec model we then construct the embedding layer to be used in our actual Keras model.

The Keras tokenizer object maintains an internal vocabulary (a token-to-index mapping), which might differ from the vocabulary Gensim builds when training the Word2Vec model. To align the two vocabularies, we pass the Keras tokenizer's vocabulary to the get_keras_embedding function.


In [165]:
keras_w2v = word2vec.Word2Vec([text_to_word_sequence(doc) for doc in dataset.data], min_count=0)
embedding_layer = keras_w2v.wv.get_keras_embedding(word_index=tokenizer.word_index, train_embeddings=True)
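Roughly speaking, get_keras_embedding builds a standard keras.layers.Embedding layer whose weight matrix rows are laid out according to the supplied word_index. A hand-rolled sketch of the equivalent construction (for illustration only, not Gensim's actual implementation):

from keras.layers import Embedding

# Row i of the weight matrix holds the vector of the word that the Keras
# tokenizer maps to index i; index 0 is reserved for padding.
vocab_size = len(tokenizer.word_index) + 1
weights = np.zeros((vocab_size, keras_w2v.wv.vector_size))
for word, idx in tokenizer.word_index.items():
    if word in keras_w2v.wv:
        weights[idx] = keras_w2v.wv[word]

manual_embedding_layer = Embedding(
    input_dim=vocab_size,
    output_dim=keras_w2v.wv.vector_size,
    weights=[weights],
    trainable=True,  # corresponds to train_embeddings=True
)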

Finally, we create a small 1D convnet to solve our classification problem.


In [166]:
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(35)(x)  # global max pooling
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(y_train.shape[1], activation='softmax')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

model.fit(x_train, y_train, epochs=3, validation_split=0.1)


Train on 1491 samples, validate on 166 samples
Epoch 1/3
1491/1491 [==============================] - 16s 11ms/step - loss: 1.0239 - acc: 0.5017 - val_loss: 0.9306 - val_acc: 0.5663
Epoch 2/3
1491/1491 [==============================] - 15s 10ms/step - loss: 0.6941 - acc: 0.7015 - val_loss: 0.6612 - val_acc: 0.7048
Epoch 3/3
1491/1491 [==============================] - 15s 10ms/step - loss: 0.4270 - acc: 0.8404 - val_loss: 0.5119 - val_acc: 0.7892
Out[166]:
<keras.callbacks.History at 0x1373acda0>

We see that the model reaches a reasonable accuracy, considering the small dataset.

Alternatively, we can use embeddings pretrained on a different, larger corpus (GloVe) to see if performance improves.


In [167]:
import gensim.downloader as api

glove_embeddings = api.load("glove-wiki-gigaword-100")

In [168]:
glove_embedding_layer = glove_embeddings.get_keras_embedding(word_index=tokenizer.word_index, train_embeddings=True)

embedded_sequences = glove_embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(35)(x)  # global max pooling
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(y_train.shape[1], activation='softmax')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

model.fit(x_train, y_train, epochs=3, validation_split=0.1)


Train on 1491 samples, validate on 166 samples
Epoch 1/3
1491/1491 [==============================] - 17s 11ms/step - loss: 1.0564 - acc: 0.4514 - val_loss: 0.9083 - val_acc: 0.4578
Epoch 2/3
1491/1491 [==============================] - 16s 11ms/step - loss: 0.5122 - acc: 0.7901 - val_loss: 0.3278 - val_acc: 0.8855
Epoch 3/3
1491/1491 [==============================] - 16s 10ms/step - loss: 0.0902 - acc: 0.9718 - val_loss: 0.2187 - val_acc: 0.9398
Out[168]:
<keras.callbacks.History at 0x11ea8ae48>

We see that the pretrained embeddings result in faster convergence.
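To classify unseen documents with the trained model, the new text has to go through the same tokenizer and padding as the training data. A minimal sketch (the two example sentences are made up for illustration):

new_docs = ["the space shuttle landed safely", "opengl renders 3d graphics quickly"]
new_sequences = tokenizer.texts_to_sequences(new_docs)
x_new = pad_sequences(new_sequences, maxlen=MAX_SEQUENCE_LENGTH)

# model.predict returns one softmax distribution over the three categories per document
predictions = model.predict(x_new)
for doc, probs in zip(new_docs, predictions):
    print(doc, '->', dataset.target_names[np.argmax(probs)])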