Feature Engineering / Analysis

We can approach this problem from several angles. The most obvious one is to look for a relation between two questions that makes them duplicates of each other. Alternatively, we can look at the duplicate likelihood of a single question on its own: some features might make one question more likely to be a duplicate than another (one idea that comes to mind: short question titles, because shorter questions tend to be simpler, and simpler questions are more likely to have been asked before).

Therefore, this problem is less about crafting features from a single data set (as most other competitions are) and more about crafting features from the relation between two texts (which are essentially two data sets) and then deducing a boolean decision from those features.

Individual questions

As the latter is simpler, let's start there and see whether some titles are more likely to be considered duplicates. For this, we first need to create a new data set that contains only the question title and a ratio of how often the question was rated a duplicate of another question. For example, if a question appears five times in the original data set and was rated a duplicate in three of those five appearances, it has a duplicate likelihood of 0.6.


In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

In [43]:
df = pd.read_csv('../data/raw/train.csv')
df['question1'] = df['question1'].apply(str)
df['question2'] = df['question2'].apply(str)

df2 = df.copy(deep=True)

# Stack both question columns into one long frame of (qid, question, is_duplicate)
df = df.rename(columns={'question1': 'question', 'qid1': 'qid'})
df = df.drop(['id', 'qid2', 'question2'], axis=1)

df2 = df2.rename(columns={'question2': 'question', 'qid2': 'qid'})
df2 = df2.drop(['id', 'qid1', 'question1'], axis=1)

df_full = pd.concat([df, df2])

# Mean of is_duplicate per qid = fraction of appearances rated as duplicate
df_likelihood = df_full.groupby('qid')[['is_duplicate']].mean()
df_unique = df_full.drop_duplicates(subset='question')
df_unique = df_unique.set_index('qid')
df_unique = df_unique.drop('is_duplicate', axis=1)
df_likelihood = df_likelihood.join(df_unique)
df_likelihood = df_likelihood.rename(columns={'is_duplicate': 'dup_llh'})
df_likelihood['question'] = df_likelihood['question'].apply(str)

df_likelihood[(df_likelihood['dup_llh'] < 1) & (df_likelihood['dup_llh'] > 0)].head()


Out[43]:
dup_llh question
qid
39 0.500000 Which is the best digital marketing institutio...
42 0.333333 Why are rockets and boosters painted white?
45 0.250000 What are the questions should not ask on Quora?
46 0.428571 Which question should I ask on Quora?
50 0.333333 How many times a day do a clock’s hands overlap?
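The ratio logic behind `dup_llh` can be illustrated on a toy frame (hypothetical data, just to show the groupby-mean step):

```python
import pandas as pd

# qid 1 appears five times, rated duplicate three times -> 3/5 = 0.6
toy = pd.DataFrame({
    'qid':          [1, 1, 1, 1, 1, 2, 2],
    'is_duplicate': [1, 1, 1, 0, 0, 0, 0],
})

# Mean of a 0/1 column per group is exactly the duplicate ratio
dup_llh = toy.groupby('qid')['is_duplicate'].mean()
print(dup_llh)  # qid 1 -> 0.6, qid 2 -> 0.0
```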

Now we can check whether there is a relation between question length and duplicate likelihood.


In [44]:
df_likelihood['question_length'] = df_likelihood['question'].apply(len)

sns.jointplot(x='question_length', y='dup_llh', data=df_likelihood);


There is not much of a relation between question length and duplicate likelihood, but there is a very small one we might take into consideration: questions longer than about 500 characters are never rated as duplicates.
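That observation can be sanity-checked numerically rather than read off the plot. A sketch with a tiny stand-in frame (in the notebook, the `df_likelihood` built above would be used instead):

```python
import pandas as pd

# Stand-in for df_likelihood: one short question, one 600-character question
df_likelihood = pd.DataFrame({
    'question': ['Which is better, tea or coffee?', 'x' * 600],
    'dup_llh':  [0.4, 0.0],
})
df_likelihood['question_length'] = df_likelihood['question'].apply(len)

# All questions above 500 characters; if the claim holds, dup_llh is 0 for every one
long_qs = df_likelihood[df_likelihood['question_length'] > 500]
print(long_qs['dup_llh'].max())
```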

What happens if we ignore all questions whose likelihood is exactly 0 or exactly 1 (most of these occur only once in the data)?


In [45]:
df_relative = df_likelihood[(df_likelihood['dup_llh'] < 1) & (df_likelihood['dup_llh'] > 0)]
sns.jointplot(x='question_length', y='dup_llh', data=df_relative);


No correlation here.
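To back up the visual impression with a number, we could compute the Pearson coefficient instead of eyeballing the joint plot. A sketch with stand-in values (in the notebook, the `df_relative` frame from the previous cell would be used):

```python
import pandas as pd

# Stand-in for df_relative: a few (length, dup_llh) pairs
df_relative = pd.DataFrame({
    'question_length': [40, 55, 70, 90, 120],
    'dup_llh':         [0.5, 0.33, 0.25, 0.43, 0.33],
})

# Pearson correlation; values near 0 support the "no correlation" reading
r = df_relative['question_length'].corr(df_relative['dup_llh'])
print(round(r, 3))
```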