Based on the tokenized questions and a pre-built word embedding database, build fixed-length (padded) sequences of word indices for each question, as well as a lookup matrix that maps word indices to word vectors.
The pygoose utility package imports numpy, pandas, matplotlib, and the kg helper module into the global namespace.
In [1]:
from pygoose import *
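For reference, the wildcard import is roughly equivalent to the explicit imports below. This is a sketch: the conventional np/pd/plt aliases are an assumption about what pygoose exposes, not something shown in this notebook.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pygoose import kg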
In [2]:
from gensim.models.wrappers.fasttext import FastText
Hide all GPUs from TensorFlow so that it does not automatically occupy any GPU RAM.
In [3]:
kg.gpu.cuda_disable_gpus()
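The helper is a convenience; the usual way to achieve the same effect (and presumably what kg.gpu.cuda_disable_gpus does internally, though that is an assumption) is to clear CUDA_VISIBLE_DEVICES before TensorFlow is imported for the first time:

import os

# Hide all CUDA devices so TensorFlow falls back to CPU and does not grab GPU RAM.
# This must run before TensorFlow is imported for the first time.
os.environ['CUDA_VISIBLE_DEVICES'] = ''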
In [4]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
Automatically discover the paths to various data folders and compose the project structure.
In [5]:
project = kg.Project.discover()
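The returned project object exposes the directory paths used throughout this notebook; a quick sanity check (the actual paths will differ per machine):

# Inspect the discovered directory layout used below.
print(project.preprocessed_data_dir)
print(project.aux_dir)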
The maximum allowed vocabulary size for the embedding matrix, and the maximum length the sequences will be padded or truncated to.
In [6]:
MAX_VOCAB_SIZE = 125000
MAX_SEQUENCE_LENGTH = 30
Load the preprocessed and tokenized questions. Stopwords are kept, since neural models can make use of them.
In [7]:
tokens_train = kg.io.load(project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_train.pickle')
tokens_test = kg.io.load(project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_test.pickle')
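Each record is expected to be a pair of token lists, one per question in the pair (this is what the join calls below rely on); a quick look at the first training record:

# Each record holds the tokens of question 1 and question 2 for one question pair.
q1_tokens, q2_tokens = tokens_train[0]
print(q1_tokens[:10])
print(q2_tokens[:10])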
Load the word embedding database exported from the trained FastText model.
In [8]:
embedding_model = FastText.load_word2vec_format(project.aux_dir + 'fasttext_vocab.vec')
In [9]:
EMBEDDING_DIM = len(embedding_model['apple'])
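Only in-vocabulary lookups are safe on the exported vectors, so membership is worth checking first (the same .vocab check is used when building the matrix below); a small sketch:

# The exported .vec file only contains whole-word vectors, so check membership first.
word = 'apple'
if word in embedding_model.vocab:
    vector = embedding_model[word]
    print(word, vector.shape)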
In [10]:
texts_q1_train = [' '.join(pair[0]) for pair in tokens_train]
texts_q2_train = [' '.join(pair[1]) for pair in tokens_train]
In [11]:
texts_q1_test = [' '.join(pair[0]) for pair in tokens_test]
texts_q2_test = [' '.join(pair[1]) for pair in tokens_test]
In [12]:
unique_question_texts = list(set(texts_q1_train + texts_q2_train + texts_q1_test + texts_q2_test))
In [13]:
tokenizer = Tokenizer(
    num_words=MAX_VOCAB_SIZE,
    split=' ',
    lower=True,
    char_level=False,
)
In [14]:
tokenizer.fit_on_texts(unique_question_texts)
In [15]:
sequences_q1_train = tokenizer.texts_to_sequences(texts_q1_train)
sequences_q2_train = tokenizer.texts_to_sequences(texts_q2_train)
In [16]:
sequences_q1_test = tokenizer.texts_to_sequences(texts_q1_test)
sequences_q2_test = tokenizer.texts_to_sequences(texts_q2_test)
In [17]:
num_words = min(MAX_VOCAB_SIZE, len(tokenizer.word_index))
Allocate the embedding matrix. Row 0 is reserved for the NULL (padding) word, since the tokenizer's word indices start at 1.
In [18]:
embedding_matrix = np.zeros((num_words + 1, EMBEDDING_DIM))
Fill the matrix with the vector of each word that is present in the FastText vocabulary.
In [19]:
for word, index in progressbar(tokenizer.word_index.items()):
    # Skip indices beyond the capped vocabulary; words missing from FastText keep zero rows.
    if index > num_words:
        continue
    if word in embedding_model.vocab:
        embedding_matrix[index] = embedding_model[word]
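As a sanity check, it can be useful to see what fraction of the vocabulary actually received a FastText vector (rows left as all zeros will behave like unknown words); a small sketch:

# Count how many rows of the matrix were filled with a pre-trained vector.
num_covered = int(np.count_nonzero(np.abs(embedding_matrix).sum(axis=1)))
print('Words with pre-trained vectors: {} / {}'.format(num_covered, num_words))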
Save the word embedding lookup matrix.
In [20]:
kg.io.save(embedding_matrix, project.aux_dir + 'fasttext_vocab_embedding_matrix.pickle')
Pad (or truncate) the word index sequences to MAX_SEQUENCE_LENGTH, then save them for the downstream models.
In [21]:
sequences_q1_padded_train = pad_sequences(sequences_q1_train, maxlen=MAX_SEQUENCE_LENGTH)
sequences_q2_padded_train = pad_sequences(sequences_q2_train, maxlen=MAX_SEQUENCE_LENGTH)
In [22]:
sequences_q1_padded_test = pad_sequences(sequences_q1_test, maxlen=MAX_SEQUENCE_LENGTH)
sequences_q2_padded_test = pad_sequences(sequences_q2_test, maxlen=MAX_SEQUENCE_LENGTH)
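The result is a 2-D integer array of shape (num_examples, MAX_SEQUENCE_LENGTH) per question column; by default Keras pre-pads short sequences with zeros and pre-truncates long ones:

# Every question is now represented by exactly MAX_SEQUENCE_LENGTH word indices.
print(sequences_q1_padded_train.shape)
print(sequences_q2_padded_train.shape)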
In [23]:
kg.io.save(sequences_q1_padded_train, project.preprocessed_data_dir + 'sequences_q1_fasttext_train.pickle')
kg.io.save(sequences_q2_padded_train, project.preprocessed_data_dir + 'sequences_q2_fasttext_train.pickle')
In [24]:
kg.io.save(sequences_q1_padded_test, project.preprocessed_data_dir + 'sequences_q1_fasttext_test.pickle')
kg.io.save(sequences_q2_padded_test, project.preprocessed_data_dir + 'sequences_q2_fasttext_test.pickle')
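Downstream notebooks can plug the saved lookup matrix into a Keras Embedding layer and feed it the padded sequences. A minimal sketch of that usage (the layer configuration here is illustrative, not taken from this project):

from keras.layers import Embedding

# Frozen embedding layer backed by the FastText lookup matrix built above.
embedding_layer = Embedding(
    input_dim=embedding_matrix.shape[0],
    output_dim=EMBEDDING_DIM,
    weights=[embedding_matrix],
    input_length=MAX_SEQUENCE_LENGTH,
    trainable=False,
)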