Preprocessing: Create a FastText Vector Database

Based on the vocabulary extracted from question texts, use a pretrained FastText model to query and save word vectors.

Imports

This utility package imports numpy, pandas, matplotlib and a helper kg module into the root namespace.


In [1]:
from pygoose import *

In [2]:
import os
import subprocess

Config

Automatically discover the paths to various data folders and compose the project structure.


In [3]:
project = kg.Project.discover()

Number of word embedding dimensions.


In [4]:
EMBEDDING_DIM = 300

Path to FastText executable.


In [5]:
FASTTEXT_EXECUTABLE = 'fasttext'

Path to the FastText binary model pre-trained on Wikipedia.


In [6]:
PRETRAINED_MODEL_FILE = os.path.join(project.aux_dir, 'fasttext', 'wiki.en.bin')

Input vocab file (one word per line).


In [7]:
VOCAB_FILE = os.path.join(project.preprocessed_data_dir, 'tokens_lowercase_spellcheck.vocab')

Vector output file (one vector per line).


In [8]:
OUTPUT_FILE = os.path.join(project.aux_dir, 'fasttext_vocab.vec')

Save FastText metadata

Write a header line containing the vocabulary size and the number of embedding dimensions, so that the resulting file can be loaded by gensim (e.g. via KeyedVectors.load_word2vec_format).


In [9]:
vocab = kg.io.load_lines(VOCAB_FILE)

In [10]:
with open(OUTPUT_FILE, 'w') as f:
    print(f'{len(vocab)} {EMBEDDING_DIM}', file=f)
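The header plus one "word followed by its components" line per word is the word2vec text format. A minimal sketch of that layout, using a toy two-word vocabulary with made-up 3-dimensional vectors (the words and values below are placeholders, not real FastText output):

```python
import io

# Toy vocabulary and vectors (placeholders for illustration only).
vectors = {
    'hello': [0.1, 0.2, 0.3],
    'world': [0.4, 0.5, 0.6],
}
embedding_dim = 3

buf = io.StringIO()
# Header line: "<vocab_size> <embedding_dim>", as written above.
print(f'{len(vectors)} {embedding_dim}', file=buf)
# One data line per word: the word itself, then its vector components.
for word, vec in vectors.items():
    print(word, *vec, file=buf)

lines = buf.getvalue().splitlines()
n_words, dim = map(int, lines[0].split())
```

Each data line is space-separated, which is why the header and the FastText output can simply be concatenated into one file.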

Query and save FastText vectors

Replicate the shell command fasttext print-word-vectors model.bin < words.txt >> vectors.vec.


In [11]:
with open(VOCAB_FILE) as f_vocab:
    with open(OUTPUT_FILE, 'a') as f_output:
        subprocess.run(
            [FASTTEXT_EXECUTABLE, 'print-word-vectors', PRETRAINED_MODEL_FILE],
            stdin=f_vocab,
            stdout=f_output,
            check=True,
        )
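Before handing the file to downstream notebooks, it is worth checking that the header written earlier agrees with what FastText actually produced. A small validation sketch (a hypothetical helper, not part of the pipeline; it assumes the word2vec text format described above):

```python
def validate_vec_file(path):
    """Check that a .vec file's header matches its contents.

    Returns (n_words, dim) from the header if the file is consistent,
    otherwise raises AssertionError.
    """
    with open(path) as f:
        # Header: "<vocab_size> <embedding_dim>".
        n_words, dim = map(int, f.readline().split())
        count = 0
        for line in f:
            parts = line.rstrip().split(' ')
            # Each data line: the word itself, then `dim` components.
            assert len(parts) == dim + 1, f'malformed line for {parts[0]!r}'
            count += 1
    assert count == n_words, f'header says {n_words} words, found {count}'
    return n_words, dim
```

Running something like validate_vec_file(OUTPUT_FILE) after the cell above would catch a truncated or duplicated header before gensim rejects the file.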