The goal of this notebook is to learn to use Neural Networks for text classification.
In this notebook, we will:
- preprocess and vectorize raw text with the Keras tokenizer,
- train a simple embedding-averaging classifier, then 1D convolutional and LSTM variants,
- load pre-trained GloVe word vectors and explore word similarities,
- reuse these pre-trained embeddings for transfer learning.
However, keep in mind:
- the goal is to build working models rather than to reach state-of-the-art accuracy;
- on small/medium text datasets, simpler methods (e.g. TF-IDF with a linear classifier) are often competitive and much cheaper to compute (see the end of the notebook).
The 20 Newsgroups dataset is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups: http://qwone.com/~jason/20Newsgroups/
In [1]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
In [2]:
sample_idx = 1000
print(newsgroups_train["data"][sample_idx])
In [3]:
target_names = newsgroups_train["target_names"]
target_id = newsgroups_train["target"][sample_idx]
print("Class of previous message:", target_names[target_id])
Here are all the possible classes:
In [4]:
target_names
Out[4]:
We will implement a simple classification model in Keras. Raw text requires (sometimes a lot of) preprocessing.
The following cells use Keras to preprocess the text with a tokenizer: each post is converted into a sequence of integer indices over the 20000 most frequent words; since the sequences have different lengths, they will later be padded with 0s (or truncated) to a fixed length of 1000.
In [5]:
from keras.preprocessing.text import Tokenizer
MAX_NB_WORDS = 20000
# get the raw text data
texts_train = newsgroups_train["data"]
texts_test = newsgroups_test["data"]
# finally, vectorize the text samples into a 2D integer tensor
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, char_level=False)
tokenizer.fit_on_texts(texts_train)
sequences = tokenizer.texts_to_sequences(texts_train)
sequences_test = tokenizer.texts_to_sequences(texts_test)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
Tokenized sequences are converted to list of token ids (with an integer code):
In [6]:
sequences[0]
Out[6]:
The tokenizer object stores a mapping (vocabulary) from word strings to token ids that can be inverted to reconstruct the original message (without formatting):
In [7]:
type(tokenizer.word_index), len(tokenizer.word_index)
Out[7]:
In [8]:
index_to_word = dict((i, w) for w, i in tokenizer.word_index.items())
In [9]:
" ".join([index_to_word[i] for i in sequences[0]])
Out[9]:
Let's have a closer look at the tokenized sequences:
In [10]:
seq_lens = [len(s) for s in sequences]
print("average length: %0.1f" % np.mean(seq_lens))
print("max length: %d" % max(seq_lens))
In [11]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.hist(seq_lens, bins=50);
Let's zoom in on the distribution of regular-sized posts. The vast majority of the posts have fewer than 1000 tokens:
In [12]:
plt.hist([l for l in seq_lens if l < 3000], bins=50);
Let's truncate and pad all the sequences to 1000 tokens to build the training set:
In [13]:
from keras.preprocessing.sequence import pad_sequences
MAX_SEQUENCE_LENGTH = 1000
# pad sequences with 0s
x_train = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
x_test = pad_sequences(sequences_test, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', x_train.shape)
print('Shape of data test tensor:', x_test.shape)
In [14]:
from keras.utils.np_utils import to_categorical
y_train = newsgroups_train["target"]
y_test = newsgroups_test["target"]
y_train = to_categorical(np.asarray(y_train))
print('Shape of label tensor:', y_train.shape)
The following builds a very simple model, similar to the fastText architecture: word embeddings are averaged and fed to a softmax classifier.
In [15]:
from keras.layers import Dense, Input, Flatten
from keras.layers import GlobalAveragePooling1D, Embedding
from keras.models import Model
EMBEDDING_DIM = 50
N_CLASSES = len(target_names)
# input: a sequence of MAX_SEQUENCE_LENGTH integers
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedding_layer = Embedding(MAX_NB_WORDS, EMBEDDING_DIM,
input_length=MAX_SEQUENCE_LENGTH,
trainable=True)
embedded_sequences = embedding_layer(sequence_input)
average = GlobalAveragePooling1D()(embedded_sequences)
predictions = Dense(N_CLASSES, activation='softmax')(average)
model = Model(sequence_input, predictions)
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['acc'])
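Before training, it can be useful to check the number of parameters of this model; the following model.summary() call is an addition to the original notebook:
# Inspect the architecture: the embedding layer alone has
# MAX_NB_WORDS * EMBEDDING_DIM = 20000 * 50 = 1,000,000 parameters,
# and the softmax layer has 50 * 20 + 20 = 1,020 parameters.
model.summary()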
In [16]:
model.fit(x_train, y_train, validation_split=0.1,
          epochs=10, batch_size=128, verbose=2)
Out[16]:
Exercise: compute the accuracy of this model on the test set.
In [17]:
# %load solutions/accuracy.py
output_test = model.predict(x_test)
test_classes = np.argmax(output_test, axis=-1)
print("test accuracy:", np.mean(test_classes == y_test))
Exercise
Build a model that takes word order into account, for instance by stacking 1D convolutions and/or an LSTM on top of the embedding layer.
Bonus
Compare the training behavior and test accuracy of the convolutional and recurrent variants (solutions below).
Note: The goal is to build working models rather than to maximize test accuracy. Achieving much better results would require more computation time and more data. Build your models and verify that they converge to reasonable results.
In [18]:
# %load solutions/lstm.py
from keras.layers import LSTM, Conv1D, MaxPooling1D
# input: a sequence of MAX_SEQUENCE_LENGTH integers
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
# 1D convolution with 64 output channels
x = Conv1D(64, 5, activation='relu')(embedded_sequences)
# MaxPool divides the length of the sequence by 5
x = MaxPooling1D(5)(x)
x = Conv1D(64, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
# LSTM layer with a hidden size of 64
x = LSTM(64)(x)
predictions = Dense(20, activation='softmax')(x)
model = Model(sequence_input, predictions)
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['acc'])
# You will get large speedups with these models by using a GPU
# The model might take a lot of time to converge, and even more
# if you add dropout (needed to prevent overfitting)
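As hinted in the comments above, here is a minimal sketch of how dropout could be added to this convolution + LSTM model to reduce overfitting (an illustration with untuned rates, not the original solution):
# Sketch (assumption): same Conv1D / LSTM stack with dropout added;
# the 0.2 rates are illustrative, not tuned.
from keras.layers import Dropout
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(64, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Dropout(0.2)(x)
x = Conv1D(64, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
# dropout on the LSTM inputs and recurrent connections
x = LSTM(64, dropout=0.2, recurrent_dropout=0.2)(x)
predictions = Dense(N_CLASSES, activation='softmax')(x)
model_with_dropout = Model(sequence_input, predictions)
model_with_dropout.compile(loss='categorical_crossentropy',
                           optimizer='adam', metrics=['acc'])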
In [19]:
# %load solutions/conv1d.py
from keras.layers import Conv1D, MaxPooling1D, Flatten
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
# A 1D convolution with 128 output channels
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
# MaxPool divides the length of the sequence by 5
x = MaxPooling1D(5)(x)
# A 1D convolution with 64 output channels
x = Conv1D(64, 5, activation='relu')(x)
# MaxPool divides the length of the sequence by 5
x = MaxPooling1D(5)(x)
x = Flatten()(x)
predictions = Dense(20, activation='softmax')(x)
model = Model(sequence_input, predictions)
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['acc'])
In [20]:
model.fit(x_train, y_train, validation_split=0.1,
          epochs=10, batch_size=128, verbose=2)
Out[20]:
The file glove100K.100d.txt is an extract of the GloVe word vectors that were trained on English Wikipedia 2014 + Gigaword 5 (6B tokens). We extracted the 100,000 most frequent words; each vector has dimension 100.
In [21]:
embeddings_index = {}
embeddings_vectors = []
f = open('glove100K.100d.txt', 'rb')
word_idx = 0
for line in f:
values = line.decode('utf-8').split()
word = values[0]
vector = np.asarray(values[1:], dtype='float32')
embeddings_index[word] = word_idx
embeddings_vectors.append(vector)
word_idx = word_idx + 1
f.close()
inv_index = {v: k for k, v in embeddings_index.items()}
print("found %d different words in the file" % word_idx)
In [22]:
# Stack all embeddings in a large numpy array
glove_embeddings = np.vstack(embeddings_vectors)
glove_norms = np.linalg.norm(glove_embeddings, axis=-1, keepdims=True)
glove_embeddings_normed = glove_embeddings / glove_norms
print(glove_embeddings.shape)
In [23]:
def get_emb(word):
idx = embeddings_index.get(word)
if idx is None:
return None
else:
return glove_embeddings[idx]
def get_normed_emb(word):
idx = embeddings_index.get(word)
if idx is None:
return None
else:
return glove_embeddings_normed[idx]
In [24]:
get_emb("computer")
Out[24]:
Exercise
Build a function to find most similar words, given a word as query:
Bonus
Change your function so that it takes multiple words as input (by averaging their embeddings).
In [25]:
# %load solutions/most_similar.py
def most_similar(words, topn=10):
query_emb = 0
# If we have a list of words instead of one word
# (bonus question)
if type(words) == list:
for word in words:
query_emb += get_emb(word)
else:
query_emb = get_emb(words)
query_emb = query_emb / np.linalg.norm(query_emb)
# Large numpy vector with all cosine similarities
# between emb and all other words
cosines = np.dot(glove_embeddings_normed, query_emb)
# topn most similar indexes corresponding to cosines
idxs = np.argsort(cosines)[::-1][:topn]
# pretty return with word and similarity
return [(inv_index[idx], cosines[idx]) for idx in idxs]
In [26]:
most_similar("cpu")
Out[26]:
In [27]:
most_similar("pitt")
Out[27]:
In [28]:
most_similar("jolie")
Out[28]:
Predict the future better than tarot:
In [29]:
np.dot(get_normed_emb('aniston'), get_normed_emb('pitt'))
Out[29]:
In [30]:
np.dot(get_normed_emb('jolie'), get_normed_emb('pitt'))
Out[30]:
In [31]:
most_similar("1")
Out[31]:
In [32]:
# bonus: yangtze is a chinese river
most_similar(["river", "chinese"])
Out[32]:
In [33]:
from sklearn.manifold import TSNE
word_emb_tsne = TSNE(perplexity=30).fit_transform(glove_embeddings_normed[:1000])
In [34]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(40, 40))
axis = plt.gca()
np.set_printoptions(suppress=True)
plt.scatter(word_emb_tsne[:, 0], word_emb_tsne[:, 1], marker=".", s=1)
for idx in range(1000):
plt.annotate(inv_index[idx],
xy=(word_emb_tsne[idx, 0], word_emb_tsne[idx, 1]),
xytext=(0, 0), textcoords='offset points')
plt.savefig("tsne.png")
plt.show()
We want to use these pre-trained embeddings for transfer learning. This process is rather similar to transfer learning in image recognition: the features learnt on words might help us bootstrap the learning process, and increase performance if we don't have enough training data.
For each of the MAX_NB_WORDS = 20000 most frequent words of our training vocabulary, we look up its GloVe embedding to build an embedding matrix:
In [35]:
EMBEDDING_DIM = 100
# prepare embedding matrix
nb_words_in_matrix = 0
nb_words = min(MAX_NB_WORDS, len(word_index))
embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM))
for word, i in word_index.items():
if i >= MAX_NB_WORDS:
continue
embedding_vector = get_emb(word)
if embedding_vector is not None:
# words not found in embedding index will be all-zeros.
embedding_matrix[i] = embedding_vector
nb_words_in_matrix = nb_words_in_matrix + 1
print("added %d words in the embedding matrix" % nb_words_in_matrix)
Build a layer with pre-trained embeddings:
In [36]:
pretrained_embedding_layer = Embedding(
MAX_NB_WORDS, EMBEDDING_DIM,
weights=[embedding_matrix],
input_length=MAX_SEQUENCE_LENGTH,
)
In [37]:
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = pretrained_embedding_layer(sequence_input)
average = GlobalAveragePooling1D()(embedded_sequences)
predictions = Dense(N_CLASSES, activation='softmax')(average)
model = Model(sequence_input, predictions)
# We don't want to fine-tune embeddings
model.layers[1].trainable=False
model.compile(loss='categorical_crossentropy',
optimizer='adam', metrics=['acc'])
In [38]:
model.fit(x_train, y_train, validation_split=0.1,
          epochs=10, batch_size=128, verbose=2)
# Note: on this type of task, this technique will
# degrade results, as we train far fewer parameters
# and we average a large number of pre-trained embeddings.
# You will notice much less overfitting, though!
# Using convolutions / an LSTM will help.
# It is also advisable to treat pre-trained embeddings
# and out-of-vocabulary words separately.
Out[38]:
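One simple way to treat out-of-vocabulary words separately, as suggested in the comments above, is to give them small random embeddings instead of all-zero rows; a sketch (an assumption, not the original notebook's solution):
# Sketch (assumption): give words without a GloVe vector a small random
# embedding instead of zeros, so that they keep a distinct representation.
rng = np.random.RandomState(0)
embedding_matrix_oov = embedding_matrix.copy()
for word, i in word_index.items():
    if i >= MAX_NB_WORDS:
        continue
    if get_emb(word) is None:
        embedding_matrix_oov[i] = rng.normal(scale=0.1, size=EMBEDDING_DIM)
# This matrix can then be passed as weights=[embedding_matrix_oov]
# to a new Embedding layer, exactly as above.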
On small/medium datasets, simpler classification methods usually perform better, and are much more efficient to compute. Here are two resources to go further:
However, when looking at the features, one can see that classification with such simple methods isn't very robust, and won't generalize well to slightly different domains (e.g. from forum posts to emails).
Note: the Keras implementation for text is very slow due to Python overhead and the lack of hashing techniques. The fastText implementation https://github.com/facebookresearch/fasttext is much, much faster.
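As a point of comparison for the claim above that simpler methods do well on this kind of dataset, here is a minimal sketch of a TF-IDF + linear classifier baseline on the same data (an illustration, not part of the original notebook; hyperparameters are untuned):
# Simple linear baseline: TF-IDF bag-of-words + logistic regression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
baseline = make_pipeline(
    TfidfVectorizer(max_features=50000),
    LogisticRegression(max_iter=1000),
)
baseline.fit(texts_train, newsgroups_train["target"])
print("baseline test accuracy:",
      baseline.score(texts_test, newsgroups_test["target"]))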
Going further: try these models on the IMDB sentiment analysis dataset.
In [ ]: