Preprocessing: FastText Sequences & Embeddings

Based on the tokenized questions and a pre-built word embedding database, build fixed-length (padded) sequences of word indices for each question, as well as a lookup matrix that maps word indices to word vectors.

Imports

The pygoose utility package imports numpy, pandas, matplotlib, and a helper kg module into the global namespace via a wildcard import.


In [1]:
from pygoose import *

In [2]:
from gensim.models.wrappers.fasttext import FastText

Hide all GPUs from TensorFlow so that it does not automatically occupy any GPU RAM; this notebook only does CPU-bound preprocessing.


In [3]:
kg.gpu.cuda_disable_gpus()
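
The kg.gpu helper is opaque here; a minimal sketch of the standard way to achieve the same effect, assuming nothing about the helper's internals, is to clear CUDA_VISIBLE_DEVICES before the TensorFlow backend initializes:

import os

# Make no CUDA devices visible to TensorFlow (must run before the backend starts).
os.environ['CUDA_VISIBLE_DEVICES'] = ''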

In [4]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences


Using TensorFlow backend.

Config

Automatically discover the paths to various data folders and compose the project structure.


In [5]:
project = kg.Project.discover()

The maximum vocabulary size kept in the embedding matrix and the maximum length the sequences will be padded or truncated to.


In [6]:
MAX_VOCAB_SIZE = 125000
MAX_SEQUENCE_LENGTH = 30

Load data

Preprocessed and tokenized questions. Stopwords are kept here because neural sequence models can make use of them.


In [7]:
tokens_train = kg.io.load(project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_train.pickle')
tokens_test = kg.io.load(project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_test.pickle')
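
Judging from how the tokens are consumed below, each loaded element is assumed to be a pair of token lists, one list per question in the pair. A quick sanity check might look like this (the printed tokens are illustrative, not taken from the actual data):

# Illustrative only: each entry is assumed to be (question1_tokens, question2_tokens).
pair = tokens_train[0]
assert len(pair) == 2
print(pair[0][:5], pair[1][:5])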

Word embedding vectors queried from the trained FastText model, stored in word2vec text format.


In [8]:
embedding_model = FastText.load_word2vec_format(project.aux_dir + 'fasttext_vocab.vec')

In [9]:
EMBEDDING_DIM = len(embedding_model['apple'])

Build features

Collect all texts


In [10]:
texts_q1_train = [' '.join(pair[0]) for pair in tokens_train]
texts_q2_train = [' '.join(pair[1]) for pair in tokens_train]

In [11]:
texts_q1_test = [' '.join(pair[0]) for pair in tokens_test]
texts_q2_test = [' '.join(pair[1]) for pair in tokens_test]

In [12]:
unique_question_texts = list(set(texts_q1_train + texts_q2_train + texts_q1_test + texts_q2_test))

Create question sequences


In [13]:
tokenizer = Tokenizer(
    num_words=MAX_VOCAB_SIZE,
    split=' ',
    lower=True,
    char_level=False,
)

In [14]:
tokenizer.fit_on_texts(unique_question_texts)

In [15]:
sequences_q1_train = tokenizer.texts_to_sequences(texts_q1_train)
sequences_q2_train = tokenizer.texts_to_sequences(texts_q2_train)

In [16]:
sequences_q1_test = tokenizer.texts_to_sequences(texts_q1_test)
sequences_q2_test = tokenizer.texts_to_sequences(texts_q2_test)
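
Each question is now a list of integer word indices. With num_words set, Keras's Tokenizer still records every word in word_index, but texts_to_sequences only emits indices below the cap, silently dropping rarer words. A toy illustration (the sentences and the printed output are made up for this sketch):

# Illustrative only: a tiny tokenizer to show the output format of texts_to_sequences.
toy = Tokenizer(num_words=10)
toy.fit_on_texts(['how do i learn python', 'how do i learn fast'])
print(toy.texts_to_sequences(['how do i learn python']))  # e.g. [[1, 2, 3, 4, 5]]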

Create embedding lookup matrix


In [17]:
num_words = min(MAX_VOCAB_SIZE, len(tokenizer.word_index))

Allocate the embedding matrix. One extra row is reserved because Keras word indices start at 1, so index 0 corresponds to the NULL (padding) word.


In [18]:
embedding_matrix = np.zeros((num_words + 1, EMBEDDING_DIM))

Fill the matrix with the vector for each word in the tokenizer vocabulary. Words missing from the FastText vocabulary keep all-zero rows.


In [19]:
for word, index in progressbar(tokenizer.word_index.items()):
    if index > num_words:
        continue  # indices beyond the vocab cap never appear in the sequences
    if word in embedding_model.vocab:
        embedding_matrix[index] = embedding_model[word]


100%|██████████| 101563/101563 [00:00<00:00, 153373.20it/s]
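
For reference, downstream models would typically consume this matrix as frozen weights of a Keras Embedding layer; a minimal sketch (the layer parameters are illustrative, not taken from the actual models in this project):

from keras.layers import Embedding

# Frozen embedding layer initialized from the precomputed FastText vectors.
embedding_layer = Embedding(
    input_dim=num_words + 1,
    output_dim=EMBEDDING_DIM,
    weights=[embedding_matrix],
    input_length=MAX_SEQUENCE_LENGTH,
    trainable=False,
)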

Save features

Word embedding lookup matrix.


In [20]:
kg.io.save(embedding_matrix, project.aux_dir + 'fasttext_vocab_embedding_matrix.pickle')

Padded word index sequences.


In [21]:
sequences_q1_padded_train = pad_sequences(sequences_q1_train, maxlen=MAX_SEQUENCE_LENGTH)
sequences_q2_padded_train = pad_sequences(sequences_q2_train, maxlen=MAX_SEQUENCE_LENGTH)

In [22]:
sequences_q1_padded_test = pad_sequences(sequences_q1_test, maxlen=MAX_SEQUENCE_LENGTH)
sequences_q2_padded_test = pad_sequences(sequences_q2_test, maxlen=MAX_SEQUENCE_LENGTH)
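
By default, pad_sequences pre-pads shorter sequences with zeros (the NULL index) and truncates longer ones from the front, so every row ends up with exactly MAX_SEQUENCE_LENGTH entries. A quick illustration of the behavior and the resulting shape:

# Both question matrices have shape (num_pairs, MAX_SEQUENCE_LENGTH).
print(sequences_q1_padded_train.shape)
print(pad_sequences([[5, 7, 9]], maxlen=5))  # e.g. [[0 0 5 7 9]]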

In [23]:
kg.io.save(sequences_q1_padded_train, project.preprocessed_data_dir + 'sequences_q1_fasttext_train.pickle')
kg.io.save(sequences_q2_padded_train, project.preprocessed_data_dir + 'sequences_q2_fasttext_train.pickle')

In [24]:
kg.io.save(sequences_q1_padded_test, project.preprocessed_data_dir + 'sequences_q1_fasttext_test.pickle')
kg.io.save(sequences_q2_padded_test, project.preprocessed_data_dir + 'sequences_q2_fasttext_test.pickle')