Feature: TF-IDF Distances

Create TF-IDF vectors from question texts and compute vector distances between them.

Imports

The pygoose utility package imports numpy, pandas, matplotlib, and its helper kg module into the root namespace.


In [1]:
from pygoose import *
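
For reference, the star import above is roughly equivalent to the following explicit imports (a sketch; the exact set of exported names depends on the installed pygoose version):

# Approximate expansion of `from pygoose import *`:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pygoose import kg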

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

Config

Automatically discover the paths to various data folders and compose the project structure.


In [3]:
project = kg.Project.discover()

Identifier for storing these features on disk and referring to them later.


In [4]:
feature_list_id = 'tfidf'

Read Data

Preprocessed and tokenized questions.


In [5]:
tokens_train = kg.io.load(project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_no_stopwords_train.pickle')
tokens_test = kg.io.load(project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_no_stopwords_test.pickle')

In [6]:
tokens = tokens_train + tokens_test
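
Each item in tokens is one question pair: two lists of preprocessed tokens, as the distance function below assumes. A hypothetical illustration (not actual data):

# e.g. tokens[0] == (['what', 'is', 'ai'], ['define', 'ai'])
q1_tokens, q2_tokens = tokens[0]
assert isinstance(q1_tokens, list) and isinstance(q2_tokens, list)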

Extract the set of unique question texts to serve as the document corpus. Questions repeat across pairs, so deduplicating keeps document frequencies unbiased.


In [7]:
all_questions_flat = np.array(tokens, dtype=object).ravel()

In [8]:
documents = list(set(' '.join(question) for question in all_questions_flat))

In [9]:
del all_questions_flat

Train TF-IDF vectorizer

Create a bag-of-token-unigrams vectorizer.


In [10]:
vectorizer = TfidfVectorizer(
    encoding='utf-8',
    analyzer='word',
    strip_accents='unicode',
    ngram_range=(1, 1),
    lowercase=True,
    norm='l2',
    use_idf=True,
    smooth_idf=True,
    sublinear_tf=True,
)
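
With these settings, scikit-learn weighs each term by a sublinear term frequency scaled by a smoothed inverse document frequency, then L2-normalizes each document vector. A small numeric sketch of the formula, using hypothetical counts:

import numpy as np

# Hypothetical counts: the term occurs 3 times in the document
# and appears in 10 of 1000 corpus documents.
n_docs, df, count = 1000, 10, 3

tf = 1 + np.log(count)                     # sublinear_tf=True
idf = np.log((1 + n_docs) / (1 + df)) + 1  # smooth_idf=True
weight = tf * idf                          # rows are then L2-normalized (norm='l2')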

In [11]:
vectorizer.fit(documents)


Out[11]:
TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents='unicode', sublinear_tf=True,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [12]:
model_filename = 'tfidf_vectorizer_{}_ngrams_{}_{}_penalty_{}.pickle'.format(
    vectorizer.analyzer,
    vectorizer.ngram_range[0],
    vectorizer.ngram_range[1],
    vectorizer.norm,
)
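
With the vectorizer settings above, this resolves to tfidf_vectorizer_word_ngrams_1_1_penalty_l2.pickle.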

In [13]:
kg.io.save(vectorizer, project.trained_model_dir + model_filename)

Vectorize train and test sets, compute distances


In [14]:
def compute_pair_distances(pair):
    # Rebuild the question texts from their token lists.
    q1_doc = ' '.join(pair[0])
    q2_doc = ' '.join(pair[1])

    # Vectorize both questions in a single document-term matrix.
    pair_dtm = vectorizer.transform([q1_doc, q2_doc])
    q1_doc_vec = pair_dtm[0]
    q2_doc_vec = pair_dtm[1]

    # Return the two distance features for this pair.
    return [
        cosine_distances(q1_doc_vec, q2_doc_vec)[0][0],
        euclidean_distances(q1_doc_vec, q2_doc_vec)[0][0],
    ]
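
Since norm='l2' makes every document vector unit length, the two distances are monotonically related: the squared Euclidean distance equals twice the cosine distance, so the features carry largely redundant information. A quick sanity check on random unit vectors (illustrative only):

import numpy as np
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

# For unit vectors u, v: ||u - v||^2 = 2 - 2*cos(u, v) = 2 * cosine_distance(u, v)
u = np.random.rand(1, 50); u /= np.linalg.norm(u)
v = np.random.rand(1, 50); v /= np.linalg.norm(v)
assert np.isclose(euclidean_distances(u, v)[0, 0],
                  np.sqrt(2 * cosine_distances(u, v)[0, 0]))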

In [15]:
features = kg.jobs.map_batch_parallel(
    tokens,
    item_mapper=compute_pair_distances,
    batch_size=1000,
)


Batches: 100%|██████████| 2751/2751 [16:11<00:00,  3.01it/s]
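
For readers without pygoose, a rough standard-library equivalent of the parallel map above (a sketch; kg.jobs.map_batch_parallel additionally batches the work and reports progress):

from multiprocessing import Pool

# Rough stand-in for kg.jobs.map_batch_parallel (no batching or progress bar);
# relies on fork-based process start so workers inherit `vectorizer`.
with Pool() as pool:
    features = pool.map(compute_pair_distances, tokens, chunksize=1000)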

In [16]:
X_train = np.array(features[:len(tokens_train)], dtype='float64')
X_test = np.array(features[len(tokens_train):], dtype='float64')

In [17]:
print('X_train:', X_train.shape)
print('X_test: ', X_test.shape)


X_train: (404290, 2)
X_test:  (2345796, 2)

Save features


In [18]:
feature_names = [
    'tfidf_cosine',
    'tfidf_euclidean',
]

In [19]:
project.save_features(X_train, X_test, feature_names, feature_list_id)