Preprocessing: FastText Sequences & Embeddings

Based on the tokenized questions and a pre-built word embedding database, build fixed-length (padded) sequences of word indices for each question, as well as a lookup matrix that maps word indices to word vectors.

Imports

The pygoose utility package imports numpy, pandas, matplotlib, and a helper kg module into the global namespace via a wildcard import.


In [1]:
from pygoose import *

In [2]:
from gensim.models.wrappers.fasttext import FastText

Hide all GPUs from TensorFlow so that it does not automatically occupy any GPU RAM; this notebook only does CPU-bound preprocessing.


In [3]:
kg.gpu.cuda_disable_gpus()
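
The kg.gpu helper is opaque here; a minimal sketch of the standard way to achieve the same effect, assuming nothing about the helper's internals, is to clear CUDA_VISIBLE_DEVICES before the TensorFlow backend initializes:

import os

# Make no CUDA devices visible to TensorFlow (must run before the backend starts).
os.environ['CUDA_VISIBLE_DEVICES'] = ''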

In [4]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences


Using TensorFlow backend.

Config

Automatically discover the paths to various data folders and compose the project structure.


In [5]:
project = kg.Project.discover()

The maximum vocabulary size kept in the embedding matrix and the maximum length the sequences will be padded or truncated to.


In [6]:
MAX_VOCAB_SIZE = 125000
MAX_SEQUENCE_LENGTH = 30

Load data

Preprocessed and tokenized questions. Stopwords are kept here because neural sequence models can make use of them.


In [7]:
tokens_train = kg.io.load(project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_train.pickle')
tokens_test = kg.io.load(project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_test.pickle')
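
Judging from how the tokens are consumed below, each loaded element is assumed to be a pair of token lists, one list per question in the pair. A quick sanity check might look like this (the printed tokens are illustrative, not taken from the actual data):

# Illustrative only: each entry is assumed to be (question1_tokens, question2_tokens).
pair = tokens_train[0]
assert len(pair) == 2
print(pair[0][:5], pair[1][:5])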

Word embedding vectors queried from the trained FastText model, stored in word2vec text format.


In [8]:
embedding_model = FastText.load_word2vec_format(project.aux_dir + 'fasttext_vocab.vec')

In [9]:
EMBEDDING_DIM = len(embedding_model['apple'])

Build features

Collect all texts


In [10]:
texts_q1_train = [' '.join(pair[0]) for pair in tokens_train]
texts_q2_train = [' '.join(pair[1]) for pair in tokens_train]

In [11]:
texts_q1_test = [' '.join(pair[0]) for pair in tokens_test]
texts_q2_test = [' '.join(pair[1]) for pair in tokens_test]

In [12]:
unique_question_texts = list(set(texts_q1_train + texts_q2_train + texts_q1_test + texts_q2_test))

Create question sequences


In [13]:
tokenizer = Tokenizer(
    num_words=MAX_VOCAB_SIZE,
    split=' ',
    lower=True,
    char_level=False,
)

In [14]:
tokenizer.fit_on_texts(unique_question_texts)

In [15]:
sequences_q1_train = tokenizer.texts_to_sequences(texts_q1_train)
sequences_q2_train = tokenizer.texts_to_sequences(texts_q2_train)

In [16]:
sequences_q1_test = tokenizer.texts_to_sequences(texts_q1_test)
sequences_q2_test = tokenizer.texts_to_sequences(texts_q2_test)
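
Each question is now a list of integer word indices. With num_words set, Keras's Tokenizer still records every word in word_index, but texts_to_sequences only emits indices below the cap, silently dropping rarer words. A toy illustration (the sentences and the printed output are made up for this sketch):

# Illustrative only: a tiny tokenizer to show the output format of texts_to_sequences.
toy = Tokenizer(num_words=10)
toy.fit_on_texts(['how do i learn python', 'how do i learn fast'])
print(toy.texts_to_sequences(['how do i learn python']))  # e.g. [[1, 2, 3, 4, 5]]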

Create embedding lookup matrix


In [17]:
num_words = min(MAX_VOCAB_SIZE, len(tokenizer.word_index))

Allocate the embedding matrix. One extra row is reserved because Keras word indices start at 1, so index 0 corresponds to the NULL (padding) word.


In [18]:
embedding_matrix = np.zeros((num_words + 1, EMBEDDING_DIM))

Fill the matrix with the vector for each word in the tokenizer vocabulary. Words missing from the FastText vocabulary keep all-zero rows.


In [19]:
for word, index in progressbar(tokenizer.word_index.items()):
    if index > num_words:
        continue  # indices beyond the vocab cap never appear in the sequences
    if word in embedding_model.vocab:
        embedding_matrix[index] = embedding_model[word]


100%|██████████| 101563/101563 [00:00<00:00, 153373.20it/s]
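
For reference, downstream models would typically consume this matrix as frozen weights of a Keras Embedding layer; a minimal sketch (the layer parameters are illustrative, not taken from the actual models in this project):

from keras.layers import Embedding

# Frozen embedding layer initialized from the precomputed FastText vectors.
embedding_layer = Embedding(
    input_dim=num_words + 1,
    output_dim=EMBEDDING_DIM,
    weights=[embedding_matrix],
    input_length=MAX_SEQUENCE_LENGTH,
    trainable=False,
)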

Save features

Word embedding lookup matrix.


In [20]:
kg.io.save(embedding_matrix, project.aux_dir + 'fasttext_vocab_embedding_matrix.pickle')

Padded word index sequences.


In [21]:
sequences_q1_padded_train = pad_sequences(sequences_q1_train, maxlen=MAX_SEQUENCE_LENGTH)
sequences_q2_padded_train = pad_sequences(sequences_q2_train, maxlen=MAX_SEQUENCE_LENGTH)

In [22]:
sequences_q1_padded_test = pad_sequences(sequences_q1_test, maxlen=MAX_SEQUENCE_LENGTH)
sequences_q2_padded_test = pad_sequences(sequences_q2_test, maxlen=MAX_SEQUENCE_LENGTH)
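
By default, pad_sequences pre-pads shorter sequences with zeros (the NULL index) and truncates longer ones from the front, so every row ends up with exactly MAX_SEQUENCE_LENGTH entries. A quick illustration of the behavior and the resulting shape:

# Both question matrices have shape (num_pairs, MAX_SEQUENCE_LENGTH).
print(sequences_q1_padded_train.shape)
print(pad_sequences([[5, 7, 9]], maxlen=5))  # e.g. [[0 0 5 7 9]]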

In [23]:
kg.io.save(sequences_q1_padded_train, project.preprocessed_data_dir + 'sequences_q1_fasttext_train.pickle')
kg.io.save(sequences_q2_padded_train, project.preprocessed_data_dir + 'sequences_q2_fasttext_train.pickle')

In [24]:
kg.io.save(sequences_q1_padded_test, project.preprocessed_data_dir + 'sequences_q1_fasttext_test.pickle')
kg.io.save(sequences_q2_padded_test, project.preprocessed_data_dir + 'sequences_q2_fasttext_test.pickle')