Preprocessing: Unique Question Corpus

Extract the list of unique question texts from the combined training and test sets.

Imports

The pygoose utility package imports numpy (as np), pandas (as pd), matplotlib, and a helper kg module into the root namespace.


In [1]:
from pygoose import *
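
This single wildcard import stands in for the usual boilerplate, roughly like the sketch below (the exact set of re-exported names is defined by pygoose itself):

import numpy as np               # available as np after the wildcard import
import pandas as pd              # available as pd
import matplotlib.pyplot as plt  # plotting helpers
from pygoose import kg           # Kaggle-workflow helper module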

In [2]:
import nltk

Config

Automatically discover the paths to various data folders and compose the project structure.


In [3]:
project = kg.Project.discover()
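
The discovered project object exposes the directory paths used below; a quick way to inspect them (the attribute names are the ones this notebook relies on, while the actual locations depend on your checkout):

print(project.data_dir)               # where the raw competition CSVs live
print(project.preprocessed_data_dir)  # where the preprocessing outputs go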

Read data

Load the original question datasets, replacing missing question fields with empty strings.


In [4]:
df_train = pd.read_csv(project.data_dir + 'train.csv').fillna('')
df_test = pd.read_csv(project.data_dir + 'test.csv').fillna('')
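
The fillna('') call replaces missing question fields with empty strings, so the tokenizer below never encounters NaN. A quick sanity check on the expected schema (column names as in the original Quora Question Pairs data):

assert {'question1', 'question2'} <= set(df_train.columns)
assert {'question1', 'question2'} <= set(df_test.columns)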

Load tools

Build a word-level tokenizer that keeps only runs of alphanumeric characters.

In [5]:
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
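
The \w+ pattern matches runs of alphanumeric characters and underscores, so every punctuation mark (including apostrophes) acts as a token boundary:

# tokenizer.tokenize("What's the best way to learn C++?")
# -> ['What', 's', 'the', 'best', 'way', 'to', 'learn', 'C']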

Remove duplicate questions

Combine the train and test questions and keep each distinct question text once.

In [6]:
df = pd.concat([df_train, df_test])
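
The two frames carry different auxiliary columns (is_duplicate on the train side, test_id on the test side), so pd.concat fills the gaps with NaN; this is harmless here because only question1 and question2 are used below.

# Both question columns survive the concat intact: empty strings at worst, never NaN.
assert df[['question1', 'question2']].notnull().all().all()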

In [7]:
unique_question_texts = [
    question.strip(' \'"')
    for question in np.unique(df[['question1', 'question2']].values.ravel())
]
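
np.unique returns the sorted distinct values of both question columns flattened into a single array, and strip(' \'"') then trims surrounding whitespace and stray quote characters from each text:

# '"What is the capital of France?" '.strip(' \'"')
# -> 'What is the capital of France?'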

Tokenize unique questions

Lowercase each unique question and split it into word tokens, processing the corpus in parallel batches.

In [8]:
def tokenize_question_text(q):
    return tokenizer.tokenize(q.lower())
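
Lowercasing before tokenization makes the downstream corpus case-insensitive:

# tokenize_question_text('How do I learn Python?')
# -> ['how', 'do', 'i', 'learn', 'python']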

In [9]:
unique_question_tokens = kg.jobs.map_batch_parallel(
    unique_question_texts,
    item_mapper=tokenize_question_text,
    batch_size=1000,
)


Batches: 100%|██████████| 4790/4790 [00:22<00:00, 212.41it/s]
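
The progress bar reports 4,790 batches of 1,000 questions, i.e. roughly 4.8 million unique question texts. Semantically, the parallel map is equivalent to a plain list comprehension; pygoose just splits the input into batches and distributes them across CPU cores (a sketch of the behavior, not of pygoose's implementation):

# unique_question_tokens = [tokenize_question_text(q) for q in unique_question_texts]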

Save preprocessed data

Persist both the cleaned question texts and their token lists for the downstream notebooks.

In [10]:
kg.io.save_lines(unique_question_texts, project.preprocessed_data_dir + 'unique_questions_raw.txt')

In [11]:
kg.io.save(unique_question_tokens, project.preprocessed_data_dir + 'unique_questions_tokenized.pickle')
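
save_lines writes one raw question per line, while save pickles the token lists. Downstream notebooks can reload both artifacts with the matching loaders (assuming pygoose exposes kg.io.load_lines and kg.io.load as counterparts):

# unique_question_texts = kg.io.load_lines(project.preprocessed_data_dir + 'unique_questions_raw.txt')
# unique_question_tokens = kg.io.load(project.preprocessed_data_dir + 'unique_questions_tokenized.pickle')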