Feature: Phrase Embedding Distances

Using the pre-trained word embeddings, we'll aggregate each question into two phrase vectors: the mean of its word embeddings, and the unit-length normalized sum of its word embeddings. We'll then compute vector distances between these aggregate vectors for each question pair.
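For intuition, here is a minimal self-contained sketch of the two aggregation schemes and the three distances used below; the toy vectors are made up purely for illustration.

import numpy as np
from scipy.spatial.distance import cosine, euclidean, cityblock

# Toy word vectors for two short "questions" (values are made up).
q1_vectors = np.array([[0.1, 0.3], [0.5, -0.2]])
q2_vectors = np.array([[0.2, 0.2], [0.4, 0.0]])

# Scheme 1: mean of the word vectors.
q1_mean, q2_mean = q1_vectors.mean(axis=0), q2_vectors.mean(axis=0)

# Scheme 2: sum of the word vectors, scaled to unit length.
q1_sum, q2_sum = q1_vectors.sum(axis=0), q2_vectors.sum(axis=0)
q1_norm, q2_norm = q1_sum / np.linalg.norm(q1_sum), q2_sum / np.linalg.norm(q2_sum)

# Distances between the aggregate vectors (cityblock is log-scaled later).
print(cosine(q1_mean, q2_mean), cityblock(q1_mean, q2_mean), euclidean(q1_mean, q2_mean))
print(cosine(q1_norm, q2_norm), cityblock(q1_norm, q2_norm), euclidean(q1_norm, q2_norm))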

Imports

This utility package wildcard-imports numpy, pandas, matplotlib, and a helper kg module into the root namespace.


In [1]:
from pygoose import *

In [2]:
from gensim.models.wrappers.fasttext import FastText

In [3]:
from scipy.spatial.distance import cosine, euclidean, cityblock

Config

Automatically discover the paths to various data folders and compose the project structure.


In [4]:
project = kg.Project.discover()
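The directory attributes used later in this notebook can be sanity-checked right away (a quick inspection, assuming discovery succeeded):

# Inspect the auto-discovered folders referenced later in this notebook.
print(project.preprocessed_data_dir)
print(project.aux_dir)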

Identifier for storing these features on disk and referring to them later.


In [5]:
feature_list_id = 'phrase_embedding'

Read Data

Preprocessed and tokenized questions.


In [6]:
tokens_train = kg.io.load(project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_no_stopwords_train.pickle')
tokens_test = kg.io.load(project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_no_stopwords_test.pickle')

In [7]:
tokens = tokens_train + tokens_test
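Each item in tokens is a pair of token lists, one per question, which is exactly the shape the feature mapper below consumes. A quick structural check (the printed contents are data-dependent):

# Each item is a pair of token lists, one per question.
pair = tokens[0]
assert len(pair) == 2
print(pair[0][:5], pair[1][:5])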

Pre-trained fastText word vector model.


In [8]:
embedding_model = FastText.load_word2vec_format(project.aux_dir + 'fasttext_vocab.vec')

# Dimensionality of the word vectors (used for the zero-vector fallback below).
word_vector_dim = embedding_model.vector_size
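The loaded model supports the two operations the feature mapper relies on: a vocabulary membership test (token in embedding_model) and a vector lookup (embedding_model[token]). A quick check, where 'what' is an assumed in-vocabulary token:

# Membership test and vector lookup, as used by the feature mapper below.
# 'what' is an assumed in-vocabulary token; any known token works.
if 'what' in embedding_model:
    print(embedding_model['what'].shape)  # (word_vector_dim,)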

Build Features


In [9]:
def get_phrase_embedding_distances(pair):
    # Look up vectors only for tokens present in the embedding vocabulary.
    q1_vectors = [embedding_model[token] for token in pair[0] if token in embedding_model]
    q2_vectors = [embedding_model[token] for token in pair[1] if token in embedding_model]

    # Fall back to a zero vector if a question has no in-vocabulary tokens.
    # (For such pairs the cosine distance is undefined and comes out as NaN.)
    if len(q1_vectors) == 0:
        q1_vectors.append(np.zeros(word_vector_dim))
    if len(q2_vectors) == 0:
        q2_vectors.append(np.zeros(word_vector_dim))

    # Aggregation scheme 1: mean of the word vectors.
    q1_mean = np.mean(q1_vectors, axis=0)
    q2_mean = np.mean(q2_vectors, axis=0)

    # Aggregation scheme 2: sum of the word vectors, scaled to unit length.
    q1_sum = np.sum(q1_vectors, axis=0)
    q2_sum = np.sum(q2_vectors, axis=0)

    q1_norm = q1_sum / np.sqrt((q1_sum ** 2).sum())
    q2_norm = q2_sum / np.sqrt((q2_sum ** 2).sum())

    # Three distances per aggregation scheme (cityblock is log-scaled).
    return [
        cosine(q1_mean, q2_mean),
        np.log(cityblock(q1_mean, q2_mean) + 1),
        euclidean(q1_mean, q2_mean),

        cosine(q1_norm, q2_norm),
        np.log(cityblock(q1_norm, q2_norm) + 1),
        euclidean(q1_norm, q2_norm),
    ]
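Before launching the full parallel run, the mapper can be smoke-tested on a single pair; it should return six distance values:

# Smoke test on the first question pair: six distance features.
print(get_phrase_embedding_distances(tokens[0]))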

In [10]:
distances = kg.jobs.map_batch_parallel(
    tokens,
    item_mapper=get_phrase_embedding_distances,
    batch_size=1000,
)


Batches: 100%|██████████| 2751/2751 [02:53<00:00, 16.49it/s]
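map_batch_parallel processes the list in batches across worker processes and returns the results in input order. If pygoose is unavailable, a rough stand-in (not the library's implementation) could look like this:

# Rough stand-in for the parallel batch map (not pygoose's implementation).
# Assumes a POSIX fork start method so workers inherit the global embedding_model.
from multiprocessing import Pool

def map_parallel_fallback(items, mapper, processes=None):
    with Pool(processes=processes) as pool:
        return pool.map(mapper, items)

# distances = map_parallel_fallback(tokens, get_phrase_embedding_distances)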

In [11]:
distances = np.array(distances)

In [12]:
X_train = distances[:len(tokens_train)]
X_test = distances[len(tokens_train):]

In [13]:
print('X_train:', X_train.shape)
print('X_test: ', X_test.shape)


X_train: (404290, 6)
X_test:  (2345796, 6)

Save Features


In [14]:
feature_names = [
    'phrase_emb_mean_cosine',
    'phrase_emb_mean_cityblock_log',
    'phrase_emb_mean_euclidean',
    
    'phrase_emb_normsum_cosine',
    'phrase_emb_normsum_cityblock_log',
    'phrase_emb_normsum_euclidean',
]

In [15]:
project.save_features(X_train, X_test, feature_names, feature_list_id)
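save_features persists both matrices together with the feature names under the feature_list_id. As a rough equivalent (not pygoose's actual implementation; the features_dir attribute and the file name below are assumptions), the same data could be stored as pickled DataFrames:

# Rough equivalent of the save step (not pygoose's implementation).
# project.features_dir and the file name are hypothetical.
import pandas as pd
pd.DataFrame(X_train, columns=feature_names).to_pickle(
    project.features_dir + 'X_train_' + feature_list_id + '.pickle'
)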