Text classification using Neural Networks

The goal of this notebook is to learn to use Neural Networks for text classification.

In this notebook, we will:

  • Train a shallow model with learned embeddings
  • Download pre-trained GloVe embeddings
  • Use these pre-trained embeddings

However keep in mind:

  • Deep Learning can be better at text classification than simpler ML techniques, but only on very large datasets and with well designed/tuned models.
  • We won't be using the most computationally efficient techniques: Keras is good for prototyping but rather inefficient for training small embedding models on text.
  • The following projects can replicate similar word embedding models much more efficiently: word2vec and gensim's word2vec (self-supervised learning only), fastText (both supervised and self-supervised learning), Vowpal Wabbit (supervised learning).
  • Plain shallow sparse TF-IDF bigram features without any embedding, combined with Logistic Regression or Multinomial Naive Bayes, are often competitive on small to medium datasets.

The BBC topic classification dataset

Benchmark topic classification datasets built from BBC news articles (in English) are available at: http://mlg.ucd.ie/datasets/bbc.html.

The raw text (encoded with the latin-1 character encoding) of the news can be downloaded as a ZIP archive:


In [ ]:
import os
import os.path as op
import zipfile
try:
    from urllib.request import urlretrieve
except ImportError:
    from urllib import urlretrieve


BBC_DATASET_URL = "http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip"
zip_filename = BBC_DATASET_URL.rsplit('/', 1)[1]
BBC_DATASET_FOLDER = 'bbc'
if not op.exists(zip_filename):
    print("Downloading %s to %s..." % (BBC_DATASET_URL, zip_filename))
    urlretrieve(BBC_DATASET_URL, zip_filename)

if not op.exists(BBC_DATASET_FOLDER):
    with zipfile.ZipFile(zip_filename, 'r') as f:
        print("Extracting contents of %s..." % zip_filename)
        f.extractall('.')

Each of the five folders contains text files from one of the five topics:


In [ ]:
target_names = sorted(folder for folder in os.listdir(BBC_DATASET_FOLDER)
                      if op.isdir(op.join(BBC_DATASET_FOLDER, folder)))
target_names

Let's randomly partition the text files in a training and test set while recording the target category of each file as an integer:


In [ ]:
import numpy as np
from sklearn.model_selection import train_test_split

target = []
filenames = []
for target_id, target_name in enumerate(target_names):
    class_path = op.join(BBC_DATASET_FOLDER, target_name)
    for filename in sorted(os.listdir(class_path)):
        filenames.append(op.join(class_path, filename))
        target.append(target_id)

target = np.asarray(target, dtype=np.int32)
target_train, target_test, filenames_train, filenames_test = train_test_split(
    target, filenames, test_size=200, random_state=0)

In [ ]:
len(target_train), len(filenames_train)

In [ ]:
len(target_test), len(filenames_test)

Let's check that the text of some documents has been loaded correctly:


In [ ]:
idx = 0

with open(filenames_train[idx], 'rb') as f:
    print("class:", target_names[target_train[idx]])
    print()
    print(f.read().decode('latin-1')[:500] + '...')

In [ ]:
size_in_bytes = sum(op.getsize(fn) for fn in filenames_train)
print("Training set size: %0.3f MB" % (size_in_bytes / 1e6))

This dataset is small, so we can preload it all into memory once and for all to simplify the notebook.


In [ ]:
texts_train = [open(fn, 'rb').read().decode('latin-1') for fn in filenames_train]
texts_test = [open(fn, 'rb').read().decode('latin-1') for fn in filenames_test]

A first baseline model

For simple topic classification problems, one should always try a simple method first. In this case a good baseline is to extract TF-IDF-normalized bag-of-bigrams features and then use a simple linear classifier such as logistic regression.

It's a very efficient method and should give us a strong baseline to compare our deep learning method against.


In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


text_classifier = make_pipeline(
    TfidfVectorizer(min_df=3, max_df=0.8, ngram_range=(1, 2)),
    LogisticRegression(multi_class="multinomial", solver="lbfgs"),
)

In [ ]:
%time _ = text_classifier.fit(texts_train, target_train)

In [ ]:
text_classifier.score(texts_test, target_test)

6 classification errors on 200 test documents for a model fit in less than 10 s. It's quite unlikely that we can significantly beat that baseline with a more complex deep-learning-based model. However, let's try to reach a comparable level of accuracy with embedding-based models, just for teaching purposes.

Preprocessing text for the (supervised) CBOW model

We will implement a simple classification model in Keras. Raw text requires (sometimes a lot of) preprocessing.

The following cells use Keras to preprocess the text:

  • using a tokenizer. You may use different tokenizers (from scikit-learn, NLTK, a custom Python function, etc.). This converts the texts into sequences of indices representing the 20000 most frequent words
  • sequences have different lengths, so we pad them with 0s to a fixed length of 1000 (note that by default Keras pads and truncates at the beginning of the sequence)
  • we convert the output classes to one-hot encodings

In [ ]:
from tensorflow.keras.preprocessing.text import Tokenizer

MAX_NB_WORDS = 20000

# vectorize the text samples into a 2D integer tensor
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, char_level=False)
tokenizer.fit_on_texts(texts_train)
sequences = tokenizer.texts_to_sequences(texts_train)
sequences_test = tokenizer.texts_to_sequences(texts_test)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

The tokenized texts are converted to lists of token ids (integer codes):


In [ ]:
sequences[0][:10]

The tokenizer object stores a mapping (vocabulary) from word strings to token ids that can be inverted to reconstruct the original message (without formatting):


In [ ]:
type(tokenizer.word_index), len(tokenizer.word_index)

In [ ]:
index_to_word = dict((i, w) for w, i in tokenizer.word_index.items())

In [ ]:
" ".join([index_to_word[i] for i in sequences[0]])

Let's have a closer look at the tokenized sequences:


In [ ]:
seq_lens = [len(s) for s in sequences]
print("average length: %0.1f" % np.mean(seq_lens))
print("max length: %d" % max(seq_lens))

In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.hist(seq_lens, bins=50);

Let's zoom in on the distribution of regular-sized documents. The vast majority of the documents have fewer than 1000 tokens:


In [ ]:
plt.hist([l for l in seq_lens if l < 3000], bins=50);

Let's truncate and pad all the sequences to 1000 tokens to build the training set:


In [ ]:
from tensorflow.keras.preprocessing.sequence import pad_sequences


MAX_SEQUENCE_LENGTH = 1000

# pad sequences with 0s
x_train = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
x_test = pad_sequences(sequences_test, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', x_train.shape)
print('Shape of data test tensor:', x_test.shape)

In [ ]:
from tensorflow.keras.utils import to_categorical

y_train = to_categorical(target_train)
print('Shape of label tensor:', y_train.shape)

A simple supervised CBOW model in Keras

The following builds a very simple model, as described in the fastText paper:

  • Build an embedding layer mapping each word to a vector representation
  • Compute the vector representation of all words in each sequence and average them
  • Add a dense layer to output the 5 classes (+ softmax)

In [ ]:
from tensorflow.keras.layers import Dense, Input, Flatten
from tensorflow.keras.layers import GlobalAveragePooling1D, Embedding
from tensorflow.keras.models import Model
from tensorflow.keras import optimizers

EMBEDDING_DIM = 50
N_CLASSES = len(target_names)

# input: a sequence of MAX_SEQUENCE_LENGTH integers
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')

embedding_layer = Embedding(MAX_NB_WORDS, EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True)
embedded_sequences = embedding_layer(sequence_input)

average = GlobalAveragePooling1D()(embedded_sequences)
predictions = Dense(N_CLASSES, activation='softmax')(average)

model = Model(sequence_input, predictions)
model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.Adam(learning_rate=0.01), metrics=['acc'])

In [ ]:
model.fit(x_train, y_train, validation_split=0.1,
          epochs=10, batch_size=32)

Exercises

  • Compute the model accuracy on the test set (a possible solution is sketched after the cell below)

In [ ]:
# %load solutions/accuracy.py
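
If the solution file is not at hand, here is one possible way to compute it (a minimal sketch, not necessarily identical to solutions/accuracy.py):


In [ ]:
# Possible solution sketch: predict class probabilities, take the argmax,
# and compare to the integer labels in target_test.
output_test = model.predict(x_test)
test_classes = np.argmax(output_test, axis=-1)
print("Test accuracy:", np.mean(test_classes == target_test))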

Building more complex models

Exercise

  • From the previous template, build more complex models using:
    • 1D convolution and 1D max pooling. Note that you will still need a GlobalAveragePooling1D or Flatten layer after the convolutions, as the final Dense layer expects a fixed-size input (a possible Conv1D answer is sketched after the solution cells below);
    • Recurrent neural networks through an LSTM layer (you will need to reduce the sequence length before using the LSTM layer).

Bonus

  • You may try different architectures with:
    • more intermediate layers, combining dense, convolutional and recurrent layers
    • different recurrent cells (GRU, SimpleRNN)
    • bidirectional LSTMs

Note: the goal is to build working models rather than to get better test accuracy, as this task is already very well solved by the simple baseline. Build your models and verify that they converge to reasonable results.


In [ ]:
from tensorflow.keras.layers import Embedding, Dense, Input, Flatten
from tensorflow.keras.layers import Conv1D, LSTM, GRU
from tensorflow.keras.layers import MaxPooling1D, GlobalAveragePooling1D 
from tensorflow.keras.models import Model

EMBEDDING_DIM = 50
N_CLASSES = len(target_names)

# input: a sequence of MAX_SEQUENCE_LENGTH integers
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')

embedding_layer = Embedding(MAX_NB_WORDS, EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True)
embedded_sequences = embedding_layer(sequence_input)

# TODO

model = Model(sequence_input, predictions)
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['acc'])

In [ ]:
# %load solutions/conv1d.py

In [ ]:
# %load solutions/lstm.py
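
Before loading the solutions, you may want to check your own model against a working reference. Here is a minimal Conv1D sketch (one possible answer built from the layers imported above, not necessarily the loaded solution):


In [ ]:
# Possible architecture sketch: two 1D convolutions with max pooling, then
# global average pooling so that the final Dense layer gets a fixed-size input.
x = Conv1D(64, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(64, 5, activation='relu')(x)
x = GlobalAveragePooling1D()(x)
predictions = Dense(N_CLASSES, activation='softmax')(x)

model = Model(sequence_input, predictions)
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['acc'])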

In [ ]:
model.fit(x_train, y_train, validation_split=0.1,
          epochs=5, batch_size=32)

output_test = model.predict(x_test)
test_classes = np.argmax(output_test, axis=-1)
print("Test accuracy:", np.mean(test_classes == target_test))

Loading pre-trained embeddings

The file glove100K.100d.txt is an extract of the GloVe vectors that were trained on English Wikipedia 2014 + Gigaword 5 (6B tokens).

We extracted the 100 000 most frequent words; each vector has dimension 100.
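
If glove100K.100d.txt is not shipped alongside the notebook, one way to rebuild a similar extract (an assumption about the data preparation, not the course's own distribution method) is to download the official GloVe 6B archive from the Stanford NLP page and keep the first 100 000 lines of glove.6B.100d.txt, which appear to be ordered by decreasing word frequency:


In [ ]:
# Hedged helper: rebuild an approximate glove100K.100d.txt from the official
# GloVe 6B archive (the archive is large, ~800 MB).
# Reuses op, zipfile and urlretrieve imported at the top of this notebook.
GLOVE_6B_URL = "https://nlp.stanford.edu/data/glove.6B.zip"

if not op.exists("glove100K.100d.txt"):
    if not op.exists("glove.6B.zip"):
        print("Downloading %s..." % GLOVE_6B_URL)
        urlretrieve(GLOVE_6B_URL, "glove.6B.zip")
    with zipfile.ZipFile("glove.6B.zip") as z, \
            z.open("glove.6B.100d.txt") as src, \
            open("glove100K.100d.txt", "wb") as dst:
        for i, line in enumerate(src):
            if i >= 100000:
                break
            dst.write(line)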


In [ ]:
embeddings_index = {}
embeddings_vectors = []
with open('glove100K.100d.txt', 'rb') as f:
    word_idx = 0
    for line in f:
        values = line.decode('utf-8').split()
        word = values[0]
        vector = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = word_idx
        embeddings_vectors.append(vector)
        word_idx = word_idx + 1

inv_index = {v: k for k, v in embeddings_index.items()}
print("found %d different words in the file" % word_idx)

In [ ]:
# Stack all embeddings in a large numpy array
glove_embeddings = np.vstack(embeddings_vectors)
glove_norms = np.linalg.norm(glove_embeddings, axis=-1, keepdims=True)
glove_embeddings_normed = glove_embeddings / glove_norms
print(glove_embeddings.shape)

In [ ]:
def get_emb(word):
    idx = embeddings_index.get(word)
    if idx is None:
        return None
    else:
        return glove_embeddings[idx]

    
def get_normed_emb(word):
    idx = embeddings_index.get(word)
    if idx is None:
        return None
    else:
        return glove_embeddings_normed[idx]

In [ ]:
get_emb("computer")

Finding most similar words

Exercise

Build a function that finds the most similar words, given a word as query (a possible implementation is sketched after the solution cell below):

  • look up the vector of the query word in the GloVe index;
  • compute the cosine similarity between the query embedding and all other word embeddings;
  • display the top 10 most similar words.

Bonus

Change your function so that it takes multiple words as input (by averaging their embeddings).


In [ ]:
# %load solutions/most_similar.py
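
In case the solution file is not at hand, here is a possible implementation (a minimal sketch, not necessarily identical to solutions/most_similar.py); it already handles the bonus case of a list of words:


In [ ]:
# Possible sketch: cosine similarities are dot products between L2-normalized
# embeddings, so we can score all 100 000 words with a single matrix product.
def most_similar(words, topn=10):
    if isinstance(words, str):
        words = [words]
    embs = [get_normed_emb(w) for w in words]
    embs = [e for e in embs if e is not None]
    if not embs:
        return []
    query = np.mean(embs, axis=0)
    query /= np.linalg.norm(query)
    sims = glove_embeddings_normed.dot(query)
    best = np.argsort(-sims)
    return [(inv_index[idx], float(sims[idx]))
            for idx in best if inv_index[idx] not in words][:topn]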

In [ ]:
most_similar("cpu")

In [ ]:
most_similar("pitt")

In [ ]:
most_similar("jolie")

Predict the future better than tarot:


In [ ]:
np.dot(get_normed_emb('aniston'), get_normed_emb('pitt'))

In [ ]:
np.dot(get_normed_emb('jolie'), get_normed_emb('pitt'))

In [ ]:
most_similar("1")

In [ ]:
# bonus: yangtze is a chinese river
most_similar(["river", "chinese"])

Displaying vectors with t-SNE


In [ ]:
from sklearn.manifold import TSNE

word_emb_tsne = TSNE(perplexity=30).fit_transform(glove_embeddings_normed[:1000])

In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure(figsize=(40, 40))
axis = plt.gca()
np.set_printoptions(suppress=True)
plt.scatter(word_emb_tsne[:, 0], word_emb_tsne[:, 1], marker=".", s=1)

for idx in range(1000):
    plt.annotate(inv_index[idx],
                 xy=(word_emb_tsne[idx, 0], word_emb_tsne[idx, 1]),
                 xytext=(0, 0), textcoords='offset points')
plt.savefig("tsne.png")
plt.show()

Using pre-trained embeddings in our model

We want to use these pre-trained embeddings for transfer learning. This process is rather similar to transfer learning in image recognition: the features learned on words might help us bootstrap the learning process, and increase performance if we don't have enough training data.

  • We initialize the model's embedding matrix with GloVe embeddings:
    • take all unique words from our BBC news dataset to build a vocabulary (MAX_NB_WORDS = 20000), and look up their GloVe embedding
    • place the GloVe embedding at the corresponding index in the matrix
    • if the word is not in the GloVe vocabulary, we leave zeros at that index in the matrix
  • We may freeze these embeddings or fine-tune them

In [ ]:
EMBEDDING_DIM = 100

# prepare the embedding matrix; its shape must match the Embedding
# layer below: (MAX_NB_WORDS, EMBEDDING_DIM)
nb_words_in_matrix = 0
embedding_matrix = np.zeros((MAX_NB_WORDS, EMBEDDING_DIM))
for word, i in word_index.items():
    if i >= MAX_NB_WORDS:
        continue
    embedding_vector = get_emb(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
        nb_words_in_matrix = nb_words_in_matrix + 1
        
print("added %d words in the embedding matrix" % nb_words_in_matrix)

Build a layer with pre-trained embeddings:


In [ ]:
pretrained_embedding_layer = Embedding(
    MAX_NB_WORDS, EMBEDDING_DIM,
    weights=[embedding_matrix],
    input_length=MAX_SEQUENCE_LENGTH,
)

A model with pre-trained Embeddings

Averaging word embeddings pre-trained with GloVe / word2vec usually works surprisingly well. However, when averaging more than 10-15 words, the resulting vector becomes too noisy and classification performance degrades.


In [ ]:
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = pretrained_embedding_layer(sequence_input)
average = GlobalAveragePooling1D()(embedded_sequences)
predictions = Dense(N_CLASSES, activation='softmax')(average)

model = Model(sequence_input, predictions)

# We don't want to fine-tune embeddings
model.layers[1].trainable = False

model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.Adam(learning_rate=0.01), metrics=['acc'])

In [ ]:
model.fit(x_train, y_train, validation_split=0.1,
          epochs=15, batch_size=32)

Remarks:

  • On this type of task, using pre-trained embeddings can degrade results, as we train far fewer parameters and we average a large number of pre-trained embeddings.

  • Pre-trained embeddings followed by global averaging prevents overfitting but can also cause some underfitting.

  • Using convolutions / LSTM should help counter the underfitting effect.

  • It is also advisable to treat pre-trained embeddings and out-of-vocabulary words separately (one possible approach is sketched below).
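
For instance, one could give out-of-vocabulary words small random vectors instead of zeros, so that unseen words still get distinct (trainable) representations. The cell below is a hedged illustration of that idea (an assumption, not the course's solution):


In [ ]:
# Possible sketch: re-initialize the all-zero (out-of-vocabulary) rows of the
# embedding matrix with small random vectors; keep the padding row (index 0) at zero.
rng = np.random.RandomState(42)
oov_mask = np.all(embedding_matrix == 0.0, axis=1)
oov_mask[0] = False  # index 0 is reserved for padding
embedding_matrix_oov = embedding_matrix.copy()
embedding_matrix_oov[oov_mask] = rng.normal(
    scale=0.05, size=(oov_mask.sum(), EMBEDDING_DIM))
print("randomly initialized %d out-of-vocabulary rows" % oov_mask.sum())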

Pre-trained embeddings can be very useful when the training set is small and the individual text documents to classify are short: in this case there might be a single very important word in a test document that drives the label. If that word has never been seen in the training set but some synonyms were seen, the semantic similarity captured by the embedding will allow the model to generalize beyond the restricted training set vocabulary.

We did not observe this effect here because the documents are long enough that the topic can be guessed from many redundant cues. Shortening the documents to make the task more difficult could possibly highlight this benefit.

Reality check

On small/medium datasets, simpler classification methods usually perform better and are much more efficient to compute.

However, when inspecting the learned features, one can see that classification using simple methods isn't very robust and won't generalize well to slightly different domains (e.g. forum posts => emails).

Note: the text-processing implementation in Keras is very slow due to Python overhead and the lack of hashing tricks. The fastText implementation (https://github.com/facebookresearch/fasttext) is much, much faster.

Going further

  • Compare pre-trained embeddings vs specifically trained embeddings
  • Train your own word vectors in any language using gensim's word2vec (a minimal sketch follows this list)
  • Check the Keras examples on IMDB sentiment analysis
  • Install fastText (Linux or macOS only, use the Linux VM if under Windows) and give it a try on the classification example in its repository.
  • Today, state-of-the-art text classification can be achieved by transfer learning from a pre-trained language model instead of using traditional word embeddings. See for instance: ULMFiT (Fine-tuned Language Models for Text Classification), ELMo, GPT, BERT, GPT-2. The second notebook introduces how to train such a language model from unlabeled data.
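
As an example of the second bullet point, here is a minimal gensim sketch (assuming gensim >= 4 is installed; the vector_size parameter was named size in gensim 3.x):


In [ ]:
# Minimal sketch: train word2vec vectors on the raw BBC training texts with gensim.
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# tokenize the raw texts into lists of lowercase word tokens
tokenized_texts = [simple_preprocess(text) for text in texts_train]

w2v = Word2Vec(sentences=tokenized_texts, vector_size=100, window=5,
               min_count=5, workers=4, epochs=10)
print(w2v.wv.most_similar("economy", topn=5))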

In [ ]: