Feature Engineering / Analysis

We can approach this problem from several angles. The most obvious one is to look for a relation between two questions that makes them duplicates of each other. Alternatively, we can look at the duplicate likelihood of a single question on its own: some features might make one question more likely to be a duplicate than another (one idea that comes to mind: short question titles, because shorter questions tend to be simpler, and simpler questions are more likely to have been asked before).

Therefore, this problem is less about crafting features from a single data set (as most other competitions are) and more about crafting features from the relation between two texts (which are essentially two data sets) and then deducing a boolean decision from those features.

Individual questions

As the latter is simpler, let's start there and see whether some titles are more likely to be considered duplicates. For this, we first need to create a new data set that contains only the question title and a ratio of how often the question was rated a duplicate of another question. For example, if a question appears five times in the original data set and was rated a duplicate in three of those five appearances, it has a duplicate likelihood of 0.6.


In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

In [43]:
df = pd.read_csv('../data/raw/train.csv')
df['question1'] = df['question1'].apply(str)
df['question2'] = df['question2'].apply(str)

df2 = df.copy(deep=True)

# Stack both question columns into one long frame of (qid, question, is_duplicate)
df = df.rename(columns={'question1': 'question', 'qid1': 'qid'})
df = df.drop(['id', 'qid2', 'question2'], axis=1)

df2 = df2.rename(columns={'question2': 'question', 'qid2': 'qid'})
df2 = df2.drop(['id', 'qid1', 'question1'], axis=1)

df_full = pd.concat([df, df2])

# Mean of is_duplicate per qid = fraction of appearances rated as duplicate
df_likelihood = df_full.groupby('qid')[['is_duplicate']].mean()
df_unique = df_full.drop_duplicates(subset='question')
df_unique = df_unique.set_index('qid')
df_unique = df_unique.drop('is_duplicate', axis=1)
df_likelihood = df_likelihood.join(df_unique)
df_likelihood = df_likelihood.rename(columns={'is_duplicate': 'dup_llh'})
df_likelihood['question'] = df_likelihood['question'].apply(str)

df_likelihood[(df_likelihood['dup_llh'] < 1) & (df_likelihood['dup_llh'] > 0)].head()


Out[43]:
dup_llh question
qid
39 0.500000 Which is the best digital marketing institutio...
42 0.333333 Why are rockets and boosters painted white?
45 0.250000 What are the questions should not ask on Quora?
46 0.428571 Which question should I ask on Quora?
50 0.333333 How many times a day do a clock’s hands overlap?
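The ratio logic behind `dup_llh` can be illustrated on a toy frame (hypothetical data, just to show the groupby-mean step):

```python
import pandas as pd

# qid 1 appears five times, rated duplicate three times -> 3/5 = 0.6
toy = pd.DataFrame({
    'qid':          [1, 1, 1, 1, 1, 2, 2],
    'is_duplicate': [1, 1, 1, 0, 0, 0, 0],
})

# Mean of a 0/1 column per group is exactly the duplicate ratio
dup_llh = toy.groupby('qid')['is_duplicate'].mean()
print(dup_llh)  # qid 1 -> 0.6, qid 2 -> 0.0
```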

Now we can check whether there is a relation between question length and duplicate likelihood.


In [44]:
df_likelihood['question_length'] = df_likelihood['question'].apply(len)

sns.jointplot(x='question_length', y='dup_llh', data=df_likelihood);


There is not much of a relation between question length and duplicate likelihood, but there is a very small one we might take into consideration: questions longer than about 500 characters are never rated as duplicates.
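That observation can be sanity-checked numerically rather than read off the plot. A sketch with a tiny stand-in frame (in the notebook, the `df_likelihood` built above would be used instead):

```python
import pandas as pd

# Stand-in for df_likelihood: one short question, one 600-character question
df_likelihood = pd.DataFrame({
    'question': ['Which is better, tea or coffee?', 'x' * 600],
    'dup_llh':  [0.4, 0.0],
})
df_likelihood['question_length'] = df_likelihood['question'].apply(len)

# All questions above 500 characters; if the claim holds, dup_llh is 0 for every one
long_qs = df_likelihood[df_likelihood['question_length'] > 500]
print(long_qs['dup_llh'].max())
```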

What happens if we ignore all questions whose likelihood is exactly 0 or exactly 1 (most of these occur only once in the data)?


In [45]:
df_relative = df_likelihood[(df_likelihood['dup_llh'] < 1) & (df_likelihood['dup_llh'] > 0)]
sns.jointplot(x='question_length', y='dup_llh', data=df_relative);


No correlation here.
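To back up the visual impression with a number, we could compute the Pearson coefficient instead of eyeballing the joint plot. A sketch with stand-in values (in the notebook, the `df_relative` frame from the previous cell would be used):

```python
import pandas as pd

# Stand-in for df_relative: a few (length, dup_llh) pairs
df_relative = pd.DataFrame({
    'question_length': [40, 55, 70, 90, 120],
    'dup_llh':         [0.5, 0.33, 0.25, 0.43, 0.33],
})

# Pearson correlation; values near 0 support the "no correlation" reading
r = df_relative['question_length'].corr(df_relative['dup_llh'])
print(round(r, 3))
```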