Feature: LDA Topic Distances

Train a Latent Dirichlet Allocation (LDA) model with 300 topics on the question corpus, then compute distances between the topic distributions of each question pair.

Imports

This utility package imports numpy, pandas, matplotlib, and a helper kg module into the root namespace.


In [ ]:
from pygoose import *

In [ ]:
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

In [ ]:
from nltk.stem import SnowballStemmer

In [ ]:
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

Config

Automatically discover the paths to various data folders and compose the project structure.


In [ ]:
project = kg.Project.discover()

Identifier for storing these features on disk and referring to them later.


In [ ]:
feature_list_id = 'lda'

Number of LDA topics to train.


In [ ]:
NUM_TOPICS = 300

Make subsequent runs reproducible.


In [ ]:
RANDOM_SEED = 42

Read Data

Preprocessed and tokenized questions.


In [ ]:
tokens_train = kg.io.load(project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_no_stopwords_train.pickle')
tokens_test = kg.io.load(project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_no_stopwords_test.pickle')
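
Each loaded item is expected to be a pair of token lists, one per question. A quick sanity check of that assumed structure:


In [ ]:
# Each item should be a pair: [question1_tokens, question2_tokens].
assert len(tokens_train[0]) == 2
print(tokens_train[0][0][:10])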

Train LDA

Build a corpus of stemmed documents.


In [ ]:
stemmer = SnowballStemmer('english')

In [ ]:
def stem_pair(pair):
    return [
        [stemmer.stem(token) for token in pair[0]],
        [stemmer.stem(token) for token in pair[1]],
    ]
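
A quick illustration on a hypothetical pair (exact output depends on the stemmer version):


In [ ]:
# The Snowball stemmer collapses inflected word forms to a common stem.
stem_pair([['running', 'quickly'], ['runs', 'faster']])
# Expected output (approximately): [['run', 'quick'], ['run', 'faster']]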

In [ ]:
tokens = kg.jobs.map_batch_parallel(
    tokens_train + tokens_test,
    item_mapper=stem_pair,
    batch_size=1000,
)

In [ ]:
# Flatten the (q1, q2) pairs into one flat list of documents, preserving order.
documents = [document for pair in tokens for document in pair]

From the corpus, build the bag-of-words dictionary and train the topic model.


In [ ]:
dictionary = Dictionary(documents)

In [ ]:
corpus = [dictionary.doc2bow(document) for document in documents]
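
doc2bow encodes each document as a sparse list of (token_id, token_count) tuples. The encoding of the first document can be inspected directly:


In [ ]:
# Sparse bag-of-words representation: [(token_id, count), ...]
print(corpus[0][:10])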

In [ ]:
model = LdaMulticore(
    corpus,
    num_topics=NUM_TOPICS,
    id2word=dictionary,
    random_state=RANDOM_SEED,
)

In [ ]:
model.save(project.trained_model_dir + f'lda_{NUM_TOPICS}.pickle')
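
As a quick sanity check, print the top words of a few topics (purely illustrative; the exact topics depend on the corpus and seed):


In [ ]:
# Eyeball a handful of topics to verify the model learned coherent word groups.
model.show_topics(num_topics=5, num_words=8)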

Build topic vectors, compute distances


In [ ]:
def compute_topic_distances(pair):
    # Convert both questions to bag-of-words and infer their topic distributions.
    q1_bow = dictionary.doc2bow(pair[0])
    q2_bow = dictionary.doc2bow(pair[1])

    # Build dense topic vectors indexed by topic id. Gensim may omit
    # near-zero topics even with minimum_probability=0, so indexing by
    # position alone could misalign the two vectors.
    q1_topic_vec = np.zeros((1, NUM_TOPICS))
    q2_topic_vec = np.zeros((1, NUM_TOPICS))
    for topic_id, prob in model.get_document_topics(q1_bow, minimum_probability=0):
        q1_topic_vec[0, topic_id] = prob
    for topic_id, prob in model.get_document_topics(q2_bow, minimum_probability=0):
        q2_topic_vec[0, topic_id] = prob

    return [
        cosine_distances(q1_topic_vec, q2_topic_vec)[0][0],
        euclidean_distances(q1_topic_vec, q2_topic_vec)[0][0],
    ]
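
A smoke test on the first stemmed pair before launching the full parallel job; the result should be a list of two floats, [cosine, euclidean]:


In [ ]:
# Sanity-check the distance computation on a single pair.
compute_topic_distances(tokens[0])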

In [ ]:
distances = kg.jobs.map_batch_parallel(
    tokens,
    item_mapper=compute_topic_distances,
    batch_size=1000,
)

In [ ]:
X_train = np.array(distances[:len(tokens_train)], dtype='float64')
X_test = np.array(distances[len(tokens_train):], dtype='float64')

In [ ]:
print('X_train:', X_train.shape)
print('X_test: ', X_test.shape)

Save features


In [ ]:
feature_names = [
    'lda_cosine',
    'lda_euclidean',
]

In [ ]:
project.save_features(X_train, X_test, feature_names, feature_list_id)

In [ ]:
pd.DataFrame(X_train, columns=feature_names).describe()

In [ ]:
pd.DataFrame(X_test, columns=feature_names).describe()

In [ ]:
pd.DataFrame(X_train, columns=feature_names).plot.hist()