Preprocessing: Create a FastText Vector Database

Based on the vocabulary extracted from question texts, use a pretrained FastText model to query and save word vectors.

Imports

This utility package imports numpy, pandas, matplotlib and a helper kg module into the root namespace.


In [1]:
from pygoose import *

In [2]:
import os
import subprocess

Config

Automatically discover the paths to various data folders and compose the project structure.


In [3]:
project = kg.Project.discover()

Number of word embedding dimensions.


In [4]:
EMBEDDING_DIM = 300

Path to FastText executable.


In [5]:
FASTTEXT_EXECUTABLE = 'fasttext'

Path to the FastText binary model pre-trained on Wikipedia.


In [6]:
PRETRAINED_MODEL_FILE = os.path.join(project.aux_dir, 'fasttext', 'wiki.en.bin')

Input vocab file (one word per line).


In [7]:
VOCAB_FILE = os.path.join(project.preprocessed_data_dir, 'tokens_lowercase_spellcheck.vocab')

Vector output file (one vector per line).


In [8]:
OUTPUT_FILE = os.path.join(project.aux_dir, 'fasttext_vocab.vec')

Save FastText metadata

Write a header line containing the vocabulary size and the number of embedding dimensions, so that the resulting file can be loaded by gensim (e.g. via KeyedVectors.load_word2vec_format).


In [9]:
vocab = kg.io.load_lines(VOCAB_FILE)

In [10]:
with open(OUTPUT_FILE, 'w') as f:
    print(f'{len(vocab)} {EMBEDDING_DIM}', file=f)
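The header plus one "word followed by its components" line per word is the word2vec text format. A minimal sketch of that layout, using a toy two-word vocabulary with made-up 3-dimensional vectors (the words and values below are placeholders, not real FastText output):

```python
import io

# Toy vocabulary and vectors (placeholders for illustration only).
vectors = {
    'hello': [0.1, 0.2, 0.3],
    'world': [0.4, 0.5, 0.6],
}
embedding_dim = 3

buf = io.StringIO()
# Header line: "<vocab_size> <embedding_dim>", as written above.
print(f'{len(vectors)} {embedding_dim}', file=buf)
# One data line per word: the word itself, then its vector components.
for word, vec in vectors.items():
    print(word, *vec, file=buf)

lines = buf.getvalue().splitlines()
n_words, dim = map(int, lines[0].split())
```

Each data line is space-separated, which is why the header and the FastText output can simply be concatenated into one file.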

Query and save FastText vectors

Replicate the shell command fasttext print-word-vectors model.bin < words.txt >> vectors.vec.


In [11]:
with open(VOCAB_FILE) as f_vocab:
    with open(OUTPUT_FILE, 'a') as f_output:
        subprocess.run(
            [FASTTEXT_EXECUTABLE, 'print-word-vectors', PRETRAINED_MODEL_FILE],
            stdin=f_vocab,
            stdout=f_output,
            check=True,
        )
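Before handing the file to downstream notebooks, it is worth checking that the header written earlier agrees with what FastText actually produced. A small validation sketch (a hypothetical helper, not part of the pipeline; it assumes the word2vec text format described above):

```python
def validate_vec_file(path):
    """Check that a .vec file's header matches its contents.

    Returns (n_words, dim) from the header if the file is consistent,
    otherwise raises AssertionError.
    """
    with open(path) as f:
        # Header: "<vocab_size> <embedding_dim>".
        n_words, dim = map(int, f.readline().split())
        count = 0
        for line in f:
            parts = line.rstrip().split(' ')
            # Each data line: the word itself, then `dim` components.
            assert len(parts) == dim + 1, f'malformed line for {parts[0]!r}'
            count += 1
    assert count == n_words, f'header says {n_words} words, found {count}'
    return n_words, dim
```

Running something like validate_vec_file(OUTPUT_FILE) after the cell above would catch a truncated or duplicated header before gensim rejects the file.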