Using the vocabulary extracted from the question texts, query a pretrained FastText model for word vectors and save them to a file.
The pygoose utility package imports numpy, pandas, matplotlib, and a helper kg module into the root namespace.
In [1]:
from pygoose import *
In [2]:
import os
import subprocess
Automatically discover the paths to various data folders and compose the project structure.
In [3]:
project = kg.Project.discover()
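As a quick orientation (an optional sketch, not part of the original notebook), the discovered project object is assumed to expose the directory attributes used below; printing them confirms where data will be read from and written to.
# Print the directories this notebook relies on; the attribute names
# (aux_dir, preprocessed_data_dir) are taken from their usage further below.
print(project.aux_dir)
print(project.preprocessed_data_dir)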
Number of word embedding dimensions.
In [4]:
EMBEDDING_DIM = 300
Path to FastText executable.
In [5]:
FASTTEXT_EXECUTABLE = 'fasttext'
Path to the FastText binary model pre-trained on Wikipedia.
In [6]:
PRETRAINED_MODEL_FILE = os.path.join(project.aux_dir, 'fasttext', 'wiki.en.bin')
Input vocab file (one word per line).
In [7]:
VOCAB_FILE = project.preprocessed_data_dir + 'tokens_lowercase_spellcheck.vocab'
Vector output file (one vector per line).
In [8]:
OUTPUT_FILE = project.aux_dir + 'fasttext_vocab.vec'
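Before invoking fastText, an optional sanity check (not in the original notebook) can confirm that the executable is on the PATH and that the input files exist; shutil.which is used to locate the binary.
import shutil

# Optional pre-flight checks: fail early if the executable or inputs are missing.
assert shutil.which(FASTTEXT_EXECUTABLE) is not None, 'fasttext executable not found on PATH'
assert os.path.isfile(PRETRAINED_MODEL_FILE), f'Missing pretrained model: {PRETRAINED_MODEL_FILE}'
assert os.path.isfile(VOCAB_FILE), f'Missing vocabulary file: {VOCAB_FILE}'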
Load the vocabulary, then write a header line containing the number of words and the embedding size, so that the output file is readable by gensim as a word2vec-format text file.
In [9]:
vocab = kg.io.load_lines(VOCAB_FILE)
In [10]:
with open(OUTPUT_FILE, 'w') as f:
    # Write the word2vec-style header line: "<number of words> <embedding size>".
    print(f'{len(vocab)} {EMBEDDING_DIM}', file=f)
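For reference, once the vectors have been appended in the next step, the finished file should be loadable with gensim's word2vec text-format reader; a sketch, assuming gensim is installed:
from gensim.models import KeyedVectors

# Load the completed file (header line followed by "<word> <v1> ... <v300>" lines).
# Run this only after the vectors have been appended below.
word_vectors = KeyedVectors.load_word2vec_format(OUTPUT_FILE, binary=False)
print(word_vectors.vector_size)  # expected to equal EMBEDDING_DIM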
Replicate the shell command fasttext print-word-vectors model.bin < words.txt >> vectors.vec, appending one vector per line to the output file.
In [11]:
with open(VOCAB_FILE) as f_vocab:
    with open(OUTPUT_FILE, 'a') as f_output:
        # Stream the vocabulary into fastText via stdin and append
        # one vector per word to the output file via stdout.
        subprocess.run(
            [FASTTEXT_EXECUTABLE, 'print-word-vectors', PRETRAINED_MODEL_FILE],
            stdin=f_vocab,
            stdout=f_output,
        )
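As an optional final check (not part of the original notebook), the output can be verified against the header and the vocabulary size, assuming every vocabulary entry is a single whitespace-free token.
# Verify the header and the shape of the appended vectors.
with open(OUTPUT_FILE) as f:
    n_words, dim = map(int, f.readline().split())
    vector_lines = f.readlines()

assert n_words == len(vocab) == len(vector_lines)
assert dim == EMBEDDING_DIM
# Each line should contain the word followed by EMBEDDING_DIM values.
assert all(len(line.split()) == EMBEDDING_DIM + 1 for line in vector_lines)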