Feature: Question Occurrence Frequencies

This is a "magic" (leaky) feature published by Jared Turkewitz. It ignores the content of the question text and uses only how often each question appears across the combined training and test sets: questions that occur more often are more likely to be part of a duplicate pair.

Imports

The pygoose utility package imports numpy (as np), pandas (as pd), matplotlib, and a helper kg module into the root namespace.


In [1]:
from pygoose import *

Config

Automatically discover the paths to various data folders and compose the project structure.


In [2]:
project = kg.Project.discover()

Identifier for storing these features on disk and referring to them later.


In [3]:
feature_list_id = 'magic_frequencies'

Read data

Load the preprocessed question pairs, which are already tokenized, lowercased, and spell-checked.


In [4]:
tokens_train = kg.io.load(project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_train.pickle')
tokens_test = kg.io.load(project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_test.pickle')

Build features

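Rejoin the tokens of each question into a single string, then collect the unique question texts across the train and test sets and give each one an integer ID.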


In [5]:
df_all_pairs = pd.DataFrame(
    [
        [' '.join(pair[0]), ' '.join(pair[1])]
        for pair in tokens_train + tokens_test
    ],
    columns=['question1', 'question2'],
)

In [6]:
df_unique_texts = pd.DataFrame(np.unique(df_all_pairs.values.ravel()), columns=['question'])

In [7]:
question_ids = pd.Series(df_unique_texts.index.values, index=df_unique_texts['question'].values).to_dict()
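
To make the ID lookup concrete, here is a minimal sketch of the same steps on two toy pairs. The names toy_pairs, toy_unique, and toy_question_ids and the question strings are made up for illustration; only the numpy and pandas calls mirror the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df_all_pairs.
toy_pairs = pd.DataFrame(
    [
        ['how do i learn python', 'what is the best way to learn python'],
        ['how do i learn python', 'is java hard'],
    ],
    columns=['question1', 'question2'],
)

# Flatten both columns into one array and keep the sorted distinct texts.
toy_unique = pd.DataFrame(np.unique(toy_pairs.values.ravel()), columns=['question'])

# The row index of the uniques table serves as the question ID.
toy_question_ids = pd.Series(
    toy_unique.index.values,
    index=toy_unique['question'].values,
).to_dict()
```

Because np.unique sorts its result, the IDs follow the alphabetical order of the texts rather than the order of first appearance. That is harmless here, since the IDs are only used as keys for counting.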

Label every question with its integer ID from the table of unique texts.


In [8]:
df_all_pairs['q1_id'] = df_all_pairs['question1'].map(question_ids)
df_all_pairs['q2_id'] = df_all_pairs['question2'].map(question_ids)

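Map to frequency space: count how often each question ID appears in either position (question1 or question2) across all pairs, and attach that total to both sides of every pair.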


In [9]:
q1_counts = df_all_pairs['q1_id'].value_counts().to_dict()
q2_counts = df_all_pairs['q2_id'].value_counts().to_dict()

In [10]:
df_all_pairs['q1_freq'] = df_all_pairs['q1_id'].map(lambda x: q1_counts.get(x, 0) + q2_counts.get(x, 0))
df_all_pairs['q2_freq'] = df_all_pairs['q2_id'].map(lambda x: q1_counts.get(x, 0) + q2_counts.get(x, 0))
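
The counting logic can be checked by hand on a small example. The sketch below uses collections.Counter instead of the two value_counts dictionaries, which should give the same totals under the assumption that a question is counted once for every position it occupies. The toy pairs and variable names are hypothetical.

```python
from collections import Counter

# Hypothetical toy pairs; each letter stands for one distinct question.
toy_pairs = [
    ('a', 'b'),
    ('a', 'c'),
    ('d', 'a'),
    ('b', 'c'),
]

# Count every occurrence of a question, regardless of position.
freq = Counter()
for q1, q2 in toy_pairs:
    freq[q1] += 1
    freq[q2] += 1

# Per-pair features: (frequency of question 1, frequency of question 2).
toy_features = [(freq[q1], freq[q2]) for q1, q2 in toy_pairs]
```

Question 'a' appears twice as question1 and once as question2, so its frequency is 3 wherever it occurs, while 'd' appears only once.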

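Calculate the ratio of the two frequencies in both directions. Every frequency is at least 1, because each question is counted in its own pair, so the division is always defined.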


In [11]:
df_all_pairs['freq_ratio'] = df_all_pairs['q1_freq'] / df_all_pairs['q2_freq']
df_all_pairs['freq_ratio_inverse'] = df_all_pairs['q2_freq'] / df_all_pairs['q1_freq']
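
As a quick illustration of the two ratio columns, the sketch below applies the same divisions to a hypothetical three-row frame named toy. The frequency values are invented.

```python
import pandas as pd

# Hypothetical frequencies for three question pairs.
toy = pd.DataFrame({'q1_freq': [3, 1, 2], 'q2_freq': [2, 3, 2]})

# Element-wise division of the two frequency columns, in both directions.
toy['freq_ratio'] = toy['q1_freq'] / toy['q2_freq']
toy['freq_ratio_inverse'] = toy['q2_freq'] / toy['q1_freq']
```

The two columns are reciprocals of each other, so their product is 1 in every row. They carry the same information on different scales.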

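Build the final features: select the feature columns, then split the rows back into train and test sets. The training pairs come first because tokens_train was concatenated before tokens_test.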


In [12]:
columns_to_keep = [
    'q1_freq',
    'q2_freq',
    'freq_ratio',
    'freq_ratio_inverse',
]

In [13]:
X_train = df_all_pairs[columns_to_keep].values[:len(tokens_train)]
X_test = df_all_pairs[columns_to_keep].values[len(tokens_train):]
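
The split relies purely on row order. A minimal sketch of that positional slicing, using a made-up 5-row array and a hypothetical n_train of 3:

```python
import numpy as np

# Hypothetical combined feature matrix: 3 training rows followed by 2 test rows.
n_train = 3
features = np.arange(10).reshape(5, 2)

# Rows before the boundary are training rows; the rest are test rows.
X_train_toy = features[:n_train]
X_test_toy = features[n_train:]
```

Any step that reorders or drops rows of the combined frame before this point would silently misalign the split, so it is worth keeping the frame in its original order.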

In [14]:
print('X train:', X_train.shape)
print('X test :', X_test.shape)


X train: (404290, 4)
X test : (2345796, 4)

Save features


In [15]:
feature_names = [
    'magic_freq_q1',
    'magic_freq_q2',
    'magic_freq_q1_q2_ratio',
    'magic_freq_q2_q1_ratio',
]

In [16]:
project.save_features(X_train, X_test, feature_names, feature_list_id)