This is a "magic" (leaky) feature published by Jared Turkewitz that doesn't rely on the question text. Questions that occur more often across the training and test sets are more likely to be part of a duplicate pair.
This utility package imports numpy, pandas, matplotlib, and a helper kg module into the root namespace.
In [1]:
from pygoose import *
Automatically discover the paths to various data folders and compose the project structure.
In [2]:
project = kg.Project.discover()
Identifier for storing these features on disk and referring to them later.
In [3]:
feature_list_id = 'magic_frequencies'
Preprocessed and tokenized questions.
In [4]:
tokens_train = kg.io.load(project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_train.pickle')
tokens_test = kg.io.load(project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_test.pickle')
Join the tokens back into question strings for all train and test pairs, then collect the unique question texts.
In [5]:
df_all_pairs = pd.DataFrame(
    [
        [' '.join(pair[0]), ' '.join(pair[1])]
        for pair in tokens_train + tokens_test
    ],
    columns=['question1', 'question2'],
)
In [6]:
df_unique_texts = pd.DataFrame(np.unique(df_all_pairs.values.ravel()), columns=['question'])
In [7]:
question_ids = pd.Series(df_unique_texts.index.values, index=df_unique_texts['question'].values).to_dict()
Label every question with its ID from the table of unique texts.
In [8]:
df_all_pairs['q1_id'] = df_all_pairs['question1'].map(question_ids)
df_all_pairs['q2_id'] = df_all_pairs['question2'].map(question_ids)
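As a toy illustration of the ID assignment above, here is a minimal sketch that uses only the standard library instead of pandas and NumPy. The sample questions and the names `pairs`, `unique_texts`, `toy_ids`, and `pair_ids` are made up for this example. Sorting the set is meant to mimic the sorted output of `np.unique`.

```python
# Hypothetical toy data: each pair is (question1, question2), already joined into strings.
pairs = [
    ("how do i learn python", "what is the best way to learn python"),
    ("how do i learn python", "is python hard to learn"),
    ("what is a neural network", "is python hard to learn"),
]

# Collect every distinct question text from both columns, in sorted order.
unique_texts = sorted({text for pair in pairs for text in pair})

# Map each text to its position in the sorted list of unique texts.
toy_ids = {text: idx for idx, text in enumerate(unique_texts)}

# Replace each question with its ID.
pair_ids = [(toy_ids[q1], toy_ids[q2]) for q1, q2 in pairs]
print(pair_ids)
```

Identical texts always receive the same ID, regardless of whether they appear as the first or second question of a pair.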
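Replace each question ID with its frequency: the number of times the question appears in either position across all train and test pairs.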
In [9]:
q1_counts = df_all_pairs['q1_id'].value_counts().to_dict()
q2_counts = df_all_pairs['q2_id'].value_counts().to_dict()
In [10]:
df_all_pairs['q1_freq'] = df_all_pairs['q1_id'].map(lambda x: q1_counts.get(x, 0) + q2_counts.get(x, 0))
df_all_pairs['q2_freq'] = df_all_pairs['q2_id'].map(lambda x: q1_counts.get(x, 0) + q2_counts.get(x, 0))
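The frequency lookup can be sketched on toy data with `collections.Counter`, which plays the role of `value_counts().to_dict()` here. The ID pairs and the helper `total_freq` are hypothetical and exist only for this illustration.

```python
from collections import Counter

# Hypothetical question IDs for three pairs: (q1_id, q2_id).
pair_ids = [(0, 3), (0, 1), (2, 1)]

# Count occurrences separately per column, one counter per question position.
q1_counts = Counter(q1 for q1, _ in pair_ids)
q2_counts = Counter(q2 for _, q2 in pair_ids)

def total_freq(question_id):
    # A Counter returns 0 for missing keys, which matches dict.get(x, 0).
    return q1_counts[question_id] + q2_counts[question_id]

# Frequency of each question, summed over both positions.
freqs = [(total_freq(q1), total_freq(q2)) for q1, q2 in pair_ids]
print(freqs)
```

Summing both counters matters because a question that only ever appears in the second position would otherwise get a frequency of zero when looked up through the first-position counts.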
Calculate the ratio of the two frequencies in both directions.
In [11]:
df_all_pairs['freq_ratio'] = df_all_pairs['q1_freq'] / df_all_pairs['q2_freq']
df_all_pairs['freq_ratio_inverse'] = df_all_pairs['q2_freq'] / df_all_pairs['q1_freq']
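A small worked example of the two ratios, using hypothetical frequency values rather than the real data:

```python
# Hypothetical (q1_freq, q2_freq) values for three pairs.
freqs = [(2, 1), (2, 2), (1, 2)]

# Every question in a pair occurs at least once (in that very pair),
# so neither denominator can be zero.
ratios = [(q1 / q2, q2 / q1) for q1, q2 in freqs]
print(ratios)
```

The two values in each tuple are reciprocals of each other, so a pair whose questions are equally frequent gets 1.0 in both columns.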
Build the final feature matrices, splitting the rows back into train and test.
In [12]:
columns_to_keep = [
    'q1_freq',
    'q2_freq',
    'freq_ratio',
    'freq_ratio_inverse',
]
In [13]:
X_train = df_all_pairs[columns_to_keep].values[:len(tokens_train)]
X_test = df_all_pairs[columns_to_keep].values[len(tokens_train):]
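The split relies on row order: the combined table was built from the train pairs followed by the test pairs, so slicing at the train length recovers both parts. A minimal sketch with plain lists, where `rows`, `n_train`, `toy_train`, and `toy_test` are hypothetical stand-ins:

```python
# Hypothetical feature rows: the first two come from train, the last one from test.
rows = [[2, 1, 2.0, 0.5], [2, 2, 1.0, 1.0], [1, 2, 0.5, 2.0]]
n_train = 2  # stands in for len(tokens_train)

# Rows keep their original order, so slicing at n_train restores the split.
toy_train = rows[:n_train]
toy_test = rows[n_train:]
print(len(toy_train), len(toy_test))
```

This only holds as long as no step in between reorders or drops rows.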
In [14]:
print('X train:', X_train.shape)
print('X test :', X_test.shape)
In [15]:
feature_names = [
    'magic_freq_q1',
    'magic_freq_q2',
    'magic_freq_q1_q2_ratio',
    'magic_freq_q2_q1_ratio',
]
In [16]:
project.save_features(X_train, X_test, feature_names, feature_list_id)