Preprocessing: Clean Up & Tokenize Questions

Break question texts into tokens and perform token-level normalization: spell out digits, expand negated contractions, and correct common misspellings.

Imports

This utility package imports numpy, pandas, matplotlib, and the helper kg module into the global namespace.


In [1]:
from pygoose import *

In [2]:
import nltk

Config

Automatically discover the paths to various data folders and compose the project structure.


In [3]:
project = kg.Project.discover()

Load Data

Original question datasets.


In [4]:
df_train = pd.read_csv(project.data_dir + 'train.csv').fillna('none')
df_test = pd.read_csv(project.data_dir + 'test.csv').fillna('none')

In [5]:
df_all = pd.concat([df_train, df_test])

Stopword list customized for the Quora dataset.


In [6]:
stopwords = set(kg.io.load_lines(project.aux_dir + 'stopwords.vocab'))

Prebuilt spelling correction dictionary.


In [7]:
spelling_corrections = kg.io.load_json(project.aux_dir + 'spelling_corrections.json')
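
The file is a flat JSON object mapping misspelled tokens to their corrections, which is what correct_spelling below expects. To inspect a few entries (a quick sketch, not part of the original pipeline):

# spelling_corrections is a plain dict, e.g. {'quikest': 'quickest', ...} (hypothetical entry).
list(spelling_corrections.items())[:5]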

Load Tools


In [8]:
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
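
The r'\w+' pattern keeps only runs of alphanumeric characters, so punctuation is dropped and apostrophes act as separators (e.g. "What's" splits into "What" and "s"). A quick sanity check, not part of the original pipeline:

tokenizer.tokenize("What's the best way to learn?")
# ['What', 's', 'the', 'best', 'way', 'to', 'learn']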

Preprocess and tokenize questions


In [9]:
def translate(text, translation):
    # Replace each occurrence with its expansion, padded with spaces so it stays a separate token.
    for token, replacement in translation.items():
        text = text.replace(token, ' ' + replacement + ' ')
    # Collapse the double spaces introduced by the padding.
    text = text.replace('  ', ' ')
    return text

In [10]:
def spell_digits(text):
    translation = {
        '0': 'zero',
        '1': 'one',
        '2': 'two',
        '3': 'three',
        '4': 'four',
        '5': 'five',
        '6': 'six',
        '7': 'seven',
        '8': 'eight',
        '9': 'nine',
    }
    return translate(text, translation)
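
For example, spell_digits replaces every digit with its word form, and the space padding added by translate keeps the result tokenizable; multi-digit numbers are spelled out digit by digit. A quick check:

spell_digits('i have 2 cats and 1 dog')
# 'i have two cats and one dog'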

In [11]:
def expand_negations(text):
    translation = {
        "can't": 'can not',
        "won't": 'will not',
        "shan't": 'shall not',
    }
    text = translate(text, translation)
    return text.replace("n't", " not")
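
A quick check: the irregular contractions are expanded via the dictionary, and the generic "n't" rule handles the regular ones:

expand_negations("I can't believe it doesn't work")
# 'I can not believe it does not work'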

In [12]:
def correct_spelling(text):
    return ' '.join(
        spelling_corrections.get(token, token)
        for token in tokenizer.tokenize(text)
    )
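
correct_spelling looks up every \w+ token in the dictionary and keeps it unchanged when there is no entry; as a side effect of re-joining the tokens, punctuation and apostrophes are dropped at this stage. Assuming the dictionary contains a hypothetical entry mapping 'quikest' to 'quickest', it would behave like this:

# Hypothetical example: the actual output depends on the contents of spelling_corrections.json.
correct_spelling('what is the quikest way to learn')
# 'what is the quickest way to learn'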

In [13]:
def get_question_tokens(question, lowercase=True, spellcheck=True, remove_stopwords=True):
    if lowercase:
        question = question.lower()

    if spellcheck:
        question = correct_spelling(question)

    question = spell_digits(question)
    question = expand_negations(question)

    # The question is already lowercased above when requested.
    tokens = tokenizer.tokenize(question)
    if remove_stopwords:
        tokens = [token for token in tokens if token not in stopwords]

    # Append a sentinel token marking the end of the question.
    tokens.append('.')
    return tokens
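
Putting the pieces together, a sketch of the full pipeline on a sample question. The exact output depends on stopwords.vocab and spelling_corrections.json; here we assume no corrections apply and that only 'the' and 'to' are in the stopword list:

get_question_tokens("What's the best way to learn 2 languages?")
# ['what', 's', 'best', 'way', 'learn', 'two', 'languages', '.']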

In [14]:
def get_question_pair_tokens_spellcheck(pair):
    return [
        get_question_tokens(pair[0], lowercase=False, spellcheck=True, remove_stopwords=False),
        get_question_tokens(pair[1], lowercase=False, spellcheck=True, remove_stopwords=False),
    ]

In [15]:
def get_question_pair_tokens_lowercase_spellcheck(pair):
    return [
        get_question_tokens(pair[0], lowercase=True, spellcheck=True, remove_stopwords=False),
        get_question_tokens(pair[1], lowercase=True, spellcheck=True, remove_stopwords=False),
    ]

In [16]:
def get_question_pair_tokens_lowercase_spellcheck_remove_stopwords(pair):
    return [
        get_question_tokens(pair[0], lowercase=True, spellcheck=True, remove_stopwords=True),
        get_question_tokens(pair[1], lowercase=True, spellcheck=True, remove_stopwords=True),
    ]

Tokenize the questions and correct spelling, but keep the original upper/lower case and the stopwords.


In [17]:
tokens_spellcheck = kg.jobs.map_batch_parallel(
    df_all[['question1', 'question2']].values,
    item_mapper=get_question_pair_tokens_spellcheck,
    batch_size=1000,
)


Batches: 100%|██████████| 2751/2751 [00:29<00:00, 92.31it/s] 
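
map_batch_parallel splits the rows into batches of 1000 and maps item_mapper over every row in parallel. Conceptually (a sketch, ignoring batching and parallelism) it computes the same thing as a plain list comprehension:

# Sequential equivalent of the parallel call above -- much slower, shown for clarity only.
pairs = df_all[['question1', 'question2']].values
tokens_spellcheck_sequential = [get_question_pair_tokens_spellcheck(pair) for pair in pairs]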

Tokenize the questions, convert them to lowercase, and correct spelling, but keep the stopwords (useful for neural models).


In [18]:
tokens_lowercase_spellcheck = kg.jobs.map_batch_parallel(
    df_all[['question1', 'question2']].values,
    item_mapper=get_question_pair_tokens_lowercase_spellcheck,
    batch_size=1000,
)


Batches: 100%|██████████| 2751/2751 [00:32<00:00, 84.48it/s] 

Just as before, but also with stopwords removed.


In [19]:
tokens_lowercase_spellcheck_no_stopwords = kg.jobs.map_batch_parallel(
    df_all[['question1', 'question2']].values,
    item_mapper=get_question_pair_tokens_lowercase_spellcheck_remove_stopwords,
    batch_size=1000,
)


Batches: 100%|██████████| 2751/2751 [00:34<00:00, 80.72it/s] 

Extract question vocabulary


In [20]:
vocab = set()
for question in progressbar(np.array(tokens_lowercase_spellcheck, dtype=object).ravel()):
    for token in question:
        vocab.add(token)


100%|██████████| 5500172/5500172 [00:11<00:00, 468180.97it/s]

In [21]:
vocab_no_stopwords = vocab - stopwords

Save preprocessed data

Tokenized questions.


In [22]:
kg.io.save(
    tokens_spellcheck[:len(df_train)],
    project.preprocessed_data_dir + 'tokens_spellcheck_train.pickle'
)
kg.io.save(
    tokens_spellcheck[len(df_train):],
    project.preprocessed_data_dir + 'tokens_spellcheck_test.pickle'
)

In [23]:
kg.io.save(
    tokens_lowercase_spellcheck[:len(df_train)],
    project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_train.pickle'
)
kg.io.save(
    tokens_lowercase_spellcheck[len(df_train):],
    project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_test.pickle'
)

In [24]:
kg.io.save(
    tokens_lowercase_spellcheck_no_stopwords[:len(df_train)],
    project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_no_stopwords_train.pickle'
)
kg.io.save(
    tokens_lowercase_spellcheck_no_stopwords[len(df_train):],
    project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_no_stopwords_test.pickle'
)

Question vocabulary.


In [25]:
kg.io.save_lines(
    sorted(list(vocab)),
    project.preprocessed_data_dir + 'tokens_lowercase_spellcheck.vocab'
)

In [26]:
kg.io.save_lines(
    sorted(list(vocab_no_stopwords)),
    project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_no_stopwords.vocab'
)

Ground truth.


In [27]:
kg.io.save(df_train['is_duplicate'].values, project.features_dir + 'y_train.pickle')