Final Prediction: LightGBM

Train a LightGBM model with stratified K-fold cross-validation and use the mean test prediction across the folds as the final submission.

Imports

The pygoose utility package imports numpy, pandas, matplotlib, and a helper kg module into the root namespace.


In [1]:
from pygoose import *
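
For reference, the wildcard import above is roughly equivalent to the following explicit imports (an assumption based on the description above, not pygoose's documented contents):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pygoose import kg  # project discovery and pickle IO helpers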

In [2]:
import datetime

In [3]:
import lightgbm as lgb

In [4]:
from sklearn.model_selection import StratifiedKFold

Config

Automatically discover the paths to various data folders and compose the project structure.


In [5]:
project = kg.Project.discover()

Number of CV folds.


In [6]:
NUM_FOLDS = 5

Make subsequent runs reproducible. (This seeds NumPy's global RNG; LightGBM and the K-fold splitter are seeded separately via their own parameters below.)


In [7]:
RANDOM_SEED = 2017

In [8]:
np.random.seed(RANDOM_SEED)

Read Data

Load all features we extracted earlier.


In [9]:
feature_lists = [
    'simple_summaries',
    'jaccard_ngrams',
    'fuzzy',
    'tfidf',
    'lda',
    'nlp_tags',
    'wordnet_similarity',
    'phrase_embedding',
    'wmd',
    'wm_intersect',
    
    '3rdparty_abhishek',
    '3rdparty_dasolmar_whq',
    '3rdparty_mephistopheies',
    '3rdparty_image_similarity',
    
    'magic_pagerank',
    'magic_frequencies',
    'magic_cooccurrence_matrix',
    
    'oofp_nn_mlp_with_magic',
    'oofp_nn_cnn_with_magic',
    'oofp_nn_bi_lstm_with_magic',
    'oofp_nn_siamese_lstm_attention',
]

In [10]:
df_train, df_test, feature_list_ix = project.load_feature_lists(feature_lists)

In [11]:
X_train = df_train.values
X_test = df_test.values

In [12]:
y_train = kg.io.load(project.features_dir + 'y_train.pickle')

View feature summary.


In [13]:
print('X train:', X_train.shape)
print('X test: ', X_test.shape)
print('y train:', y_train.shape)


X train: (404290, 195)
X test:  (2345796, 195)
y train: (404290,)

In [14]:
pd.DataFrame(feature_list_ix, columns=['feature_list', 'start_index', 'end_index'])


Out[14]:
feature_list start_index end_index
0 simple_summaries 0 8
1 jaccard_ngrams 9 23
2 fuzzy 24 30
3 tfidf 31 32
4 lda 33 34
5 nlp_tags 35 70
6 wordnet_similarity 71 72
7 phrase_embedding 73 78
8 wmd 79 79
9 wm_intersect 80 81
10 3rdparty_abhishek 82 97
11 3rdparty_dasolmar_whq 98 146
12 3rdparty_mephistopheies 147 178
13 3rdparty_image_similarity 179 179
14 magic_pagerank 180 181
15 magic_frequencies 182 185
16 magic_cooccurrence_matrix 186 190
17 oofp_nn_mlp_with_magic 191 191
18 oofp_nn_cnn_with_magic 192 192
19 oofp_nn_bi_lstm_with_magic 193 193
20 oofp_nn_siamese_lstm_attention 194 194
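
The (feature_list, start_index, end_index) triples make it easy to slice one feature group out of the matrices. A small hypothetical helper, assuming the end indices are inclusive (as the row counts in the table above imply):

def feature_group(X, feature_list_ix, name):
    # Return the columns of a single feature group; end_index is inclusive.
    for group, start, end in feature_list_ix:
        if group == name:
            return X[:, start:end + 1]
    raise KeyError(name)

feature_group(X_train, feature_list_ix, 'magic_pagerank').shape  # (404290, 2)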

Train models & compute test predictions from each fold

Create the stratified CV partitions.


In [15]:
kfold = StratifiedKFold(
    n_splits=NUM_FOLDS,
    shuffle=True,
    random_state=RANDOM_SEED
)
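
As a quick illustration on toy data (hypothetical values, purely for demonstration), stratification keeps each validation fold's class balance close to the overall ratio:

toy_y = np.array([0] * 8 + [1] * 4)  # 33% positives overall
toy_X = np.zeros((len(toy_y), 1))
for _, val_ix in StratifiedKFold(n_splits=4, shuffle=True, random_state=0).split(toy_X, toy_y):
    print(toy_y[val_ix].mean())  # ~0.33 in every fold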

In [16]:
y_test_pred = np.zeros((len(X_test), NUM_FOLDS))

Fit all folds.


In [17]:
cv_scores = []

In [18]:
%%time

for fold_num, (ix_train, ix_val) in enumerate(kfold.split(X_train, y_train)):
    print(f'Fitting fold {fold_num + 1} of {kfold.n_splits}')
    
    X_fold_train = X_train[ix_train]
    X_fold_val = X_train[ix_val]

    y_fold_train = y_train[ix_train]
    y_fold_val = y_train[ix_val]
    
    lgb_params = {
        'objective': 'binary',
        'metric': 'binary_logloss',
        'boosting': 'gbdt',
        'device': 'cpu',
        'feature_fraction': 0.486,           # fraction of features sampled per tree
        'num_leaves': 158,                   # maximum number of leaves per tree
        'lambda_l2': 50,                     # L2 regularization strength
        'learning_rate': 0.01,
        'num_boost_round': 5000,             # upper bound; early stopping cuts this short
        'early_stopping_rounds': 10,
        'verbose': 1,
        'bagging_fraction_seed': RANDOM_SEED,
        'feature_fraction_seed': RANDOM_SEED,
    }
    
    lgb_data_train = lgb.Dataset(X_fold_train, y_fold_train)
    lgb_data_val = lgb.Dataset(X_fold_val, y_fold_val)    
    evals_result = {}
    
    model = lgb.train(
        lgb_params,
        lgb_data_train,
        valid_sets=[lgb_data_train, lgb_data_val],
        evals_result=evals_result,
        num_boost_round=lgb_params['num_boost_round'],
        early_stopping_rounds=lgb_params['early_stopping_rounds'],
        verbose_eval=False,
    )
    
    fold_train_scores = evals_result['training'][lgb_params['metric']]
    fold_val_scores = evals_result['valid_1'][lgb_params['metric']]
    
    print('Fold {}: {} rounds, training loss {:.6f}, validation loss {:.6f}'.format(
        fold_num + 1,
        len(fold_train_scores),
        fold_train_scores[-1],
        fold_val_scores[-1],
    ))
    print()
    
    cv_scores.append(fold_val_scores[-1])
    y_test_pred[:, fold_num] = model.predict(X_test).reshape(-1)


Fitting fold 1 of 5
Fold 1: 3115 rounds, training loss 0.120046, validation loss 0.188116

Fitting fold 2 of 5
Fold 2: 2652 rounds, training loss 0.127358, validation loss 0.188434

Fitting fold 3 of 5
Fold 3: 2922 rounds, training loss 0.123138, validation loss 0.188292

Fitting fold 4 of 5
Fold 4: 2661 rounds, training loss 0.127212, validation loss 0.189076

Fitting fold 5 of 5
Fold 5: 3251 rounds, training loss 0.118947, validation loss 0.184049

CPU times: user 7h 14min 19s, sys: 58.2 s, total: 7h 15min 18s
Wall time: 59min 6s

View the feature importance of the last fold's model, then compute the mean CV score across folds.


In [19]:
pd.DataFrame({
    'column': list(df_train.columns),
    'importance': model.feature_importance(),
}).sort_values(by='importance')


Out[19]:
column importance
48 ner_q1_time 0
46 ner_q1_product 0
49 ner_q1_quantity 7
89 abh_jaccard_distance 14
58 ner_q2_loc 20
42 ner_q1_loc 30
64 ner_q2_time 34
137 das_where_both 35
139 das_q2_when 48
138 das_q1_when 48
132 das_q1_who 54
136 das_q2_where 54
62 ner_q2_product 61
133 das_q2_who 61
135 das_q1_where 63
65 ner_q2_quantity 86
131 das_which_both 87
129 das_q1_which 89
140 das_when_both 122
134 das_who_both 129
47 ner_q1_date 129
130 das_q2_which 144
63 ner_q2_date 144
126 das_q1_what 188
60 ner_q2_norp 191
123 das_q1_how 193
128 das_what_both 204
44 ner_q1_norp 221
124 das_q2_how 225
127 das_q2_what 225
142 das_q2_why 235
125 das_how_both 239
141 das_q1_why 255
61 ner_q2_person 335
59 ner_q2_org 405
45 ner_q1_person 413
150 meph_log_abs_diff_len1_len2 429
43 ner_q1_org 461
143 das_why_both 473
41 ner_q1_gpe 479
66 ner_q2_cardinal 485
57 ner_q2_gpe 512
146 whq_count_diff 539
50 ner_q1_cardinal 557
149 meph_abs_diff_len1_len2 631
70 ner_tag_count_diff 787
36 pos_q1_adv 999
147 meph_len1 1062
145 whq_count_q2 1118
52 pos_q2_adv 1128
144 whq_count_q1 1163
39 pos_q1_num 1179
110 das_diff_len 1196
55 pos_q2_num 1220
112 das_caps_count_q2 1229
148 meph_len2 1238
161 meph_trigram_all_jaccard_max 1275
190 magic_comatrix_svd_manhattan 1308
6 token_len_diff_log 1373
35 pos_q1_adj 1375
158 meph_bigram_all_jaccard_max 1377
51 pos_q2_adj 1399
117 das_len_word_q1 1401
189 magic_comatrix_svd_euclidean 1433
111 das_caps_count_q1 1455
101 das_shared_count 1467
118 das_len_word_q2 1479
69 ner_tag_euclidean 1479
84 abh_fuzz_WRatio 1506
187 magic_comatrix_euclidean 1527
0 shorter_char_len_log 1533
40 pos_q1_verb 1562
183 magic_freq_q2 1625
56 pos_q2_verb 1626
38 pos_q1_propn 1631
4 shorter_token_len_log 1663
159 meph_trigram_jaccard 1719
182 magic_freq_q1 1720
15 jaccard_ix_4gram 1728
73 phrase_emb_mean_cosine 1735
12 jaccard_ix_3gram 1749
185 magic_freq_q2_q1_ratio 1756
80 q1_q2_intersect 1761
54 pos_q2_propn 1769
119 das_diff_len_word 1816
17 jaccard_ix_norm_q2_4gram 1830
82 abh_common_words 1837
53 pos_q2_noun 1898
37 pos_q1_noun 1918
160 meph_trigram_all_jaccard 1922
155 meph_unigram_all_jaccard_max 1973
186 magic_comatrix_cosine 2043
5 longer_token_len_log 2081
114 das_len_char_q1 2083
7 token_len_ratio 2148
113 das_diff_caps 2150
152 meph_log_ratio_len1_len2 2155
115 das_len_char_q2 2169
165 meph_trigram_tf_l2_euclidean 2177
16 jaccard_ix_norm_q1_4gram 2185
108 das_len_q1 2197
18 jaccard_ix_5gram 2200
157 meph_bigram_all_jaccard 2203
14 jaccard_ix_norm_q2_3gram 2224
184 magic_freq_q1_q2_ratio 2238
83 abh_fuzz_qratio 2357
13 jaccard_ix_norm_q1_3gram 2357
27 fuzz_token_set_ratio 2374
109 das_len_q2 2455
9 jaccard_ix_2gram 2469
24 fuzz_ratio 2530
20 jaccard_ix_norm_q2_5gram 2542
1 longer_char_len_log 2591
19 jaccard_ix_norm_q1_5gram 2654
193 oofp_nn_bi_lstm_with_magic 2662
194 oofp_nn_siamese_lstm_attention 2758
156 meph_bigram_jaccard 2772
78 phrase_emb_normsum_euclidean 2809
91 abh_euclidean_distance 2822
93 abh_braycurtis_distance 2875
76 phrase_emb_normsum_cosine 2882
154 meph_unigram_all_jaccard 2920
116 das_diff_len_char 2931
3 char_len_ratio 2987
28 fuzz_partial_token_sort_ratio 3084
88 abh_cityblock_distance 3112
87 abh_cosine_distance 3302
192 oofp_nn_cnn_with_magic 3304
68 pos_tag_euclidean 3384
151 meph_ratio_len1_len2 3416
77 phrase_emb_normsum_cityblock_log 3514
191 oofp_nn_mlp_with_magic 3528
10 jaccard_ix_norm_q1_2gram 3543
168 meph_m_q1_q2_tf_svd1 3552
25 fuzz_partial_ratio 3562
29 jaro 3613
11 jaccard_ix_norm_q2_2gram 3618
92 abh_minkowski_distance 3717
26 fuzz_token_sort_ratio 3779
153 meph_unigram_jaccard 3831
71 wordnet_similarity_raw 3896
2 char_len_diff_log 3918
30 jaro_winkler 3921
90 abh_canberra_distance 3929
162 meph_trigram_tfidf_cosine 3983
102 das_stops1_ratio 4063
99 das_word_match_2root 4218
103 das_stops2_ratio 4358
163 meph_trigram_tfidf_l2_euclidean 4478
164 meph_trigram_tfidf_l1_euclidean 4546
75 phrase_emb_mean_euclidean 4570
81 q1_q2_wm_ratio 4601
98 das_word_match 4604
74 phrase_emb_mean_cityblock_log 4633
172 meph_m_vstack_svd_q1_q1_cosine 4673
8 word_diff_ratio 4738
188 magic_comatrix_svd_cosine 4778
32 tfidf_euclidean 4834
31 tfidf_cosine 4901
79 wmd 4966
120 das_avg_word_len1 5008
177 meph_1wl_tf_l2_euclidean 5013
72 wordnet_similarity_brown 5015
171 meph_m_vstack_svd_q1_q1_euclidean 5032
174 meph_m_vstack_svd_absdiff_q1_q2_oof 5050
21 jaccard_ix_diff_2_3 5162
96 abh_kur_q1vec 5259
97 abh_kur_q2vec 5385
105 das_cosine 5428
94 abh_skew_q1vec 5460
167 meph_m_q1_q2_tf_svd0 5471
100 das_tfidf_word_match 5524
121 das_avg_word_len2 5536
175 meph_1wl_tfidf_cosine 5551
95 abh_skew_q2vec 5565
176 meph_1wl_tfidf_l2_euclidean 5592
104 das_shared_2gram 5652
106 das_words_hamming 5705
180 pagerank_q1 5713
170 meph_m_diff_q1_q2_tf_oof 5747
34 lda_euclidean 5835
179 image_similarity 5836
169 meph_m_q1_q2_tf_svd100_oof 5859
173 meph_m_vstack_svd_mult_q1_q2_oof 5884
33 lda_cosine 6241
166 meph_m_q1_q2_tf_oof 6456
22 jaccard_ix_diff_3_4 6460
107 das_diff_stops_r 6465
181 pagerank_q2 6501
85 abh_wmd 6612
86 abh_norm_wmd 6690
23 jaccard_ix_diff_4_5 6728
67 pos_tag_cosine 6892
122 das_diff_avg_word 7017
178 meph_m_w1l_tfidf_oof 7728
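
Note that the table above reflects only the model from the last fold. A sketch for averaging importances across folds, assuming the CV loop had also collected each booster with models.append(model) (which it does not above):

# Assumes the loop stored each fold's booster in a `models` list.
mean_importance = np.mean([m.feature_importance() for m in models], axis=0)
pd.DataFrame({
    'column': list(df_train.columns),
    'importance': mean_importance,
}).sort_values(by='importance')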

In [20]:
final_cv_score = np.mean(cv_scores)

In [21]:
print('Final CV score:', final_cv_score)


Final CV score: 0.187593465728

Generate submission


In [22]:
y_test = np.mean(y_test_pred, axis=1)

In [23]:
submission_id = datetime.datetime.now().strftime('%Y-%m-%d-%H%M')

In [24]:
df_submission = pd.DataFrame({
    'test_id': range(len(y_test)),
    'is_duplicate': y_test
})

Recalibrate predictions for the different target balance of the test set

$\alpha = \frac{p_{test}}{p_{train}}$

$\beta = \frac{1 - p_{test}}{1 - p_{train}}$

$\hat{y}_{test}^{\prime} = \frac{\alpha \hat{y}_{test}}{\alpha \hat{y}_{test} + \beta(1 - \hat{y}_{test})}$

The positive class ratio is 36.92% in the training set but an estimated ~16.5% in the test set; the adjustment above rescales the predicted odds by the ratio of the test and train priors.


In [25]:
def recalibrate_prediction(pred, train_pos_ratio=0.3692, test_pos_ratio=0.165):
    a = test_pos_ratio / train_pos_ratio
    b = (1 - test_pos_ratio) / (1 - train_pos_ratio)
    return a * pred / (a * pred + b * (1 - pred))
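
As a quick sanity check: with $p_{train} = 0.3692$ and $p_{test} = 0.165$, $\alpha \approx 0.447$ and $\beta \approx 1.324$, so a raw prediction of 0.5 maps to $\alpha / (\alpha + \beta) \approx 0.252$:

print(recalibrate_prediction(0.5))  # ~0.2524
print(recalibrate_prediction(0.9))  # ~0.7524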

In [26]:
df_submission['is_duplicate'] = df_submission['is_duplicate'].map(recalibrate_prediction)

In [27]:
df_submission = df_submission[['test_id', 'is_duplicate']]

Explore and save submission


In [28]:
pd.DataFrame(y_test).plot.hist()


Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f562bc44fd0>

In [29]:
print('Test duplicates with >0.9 confidence:', len(df_submission[df_submission.is_duplicate > 0.9]))
print('Test mean prediction:', np.mean(y_test))
print('Calibrated mean prediction:', df_submission['is_duplicate'].mean())


Test duplicates with >0.9 confidence: 42993
Test mean prediction: 0.127606241908
Calibrated mean prediction: 0.07546706870442038

In [30]:
df_submission.to_csv(
    project.submissions_dir + f'{submission_id}-submission-draft-cv-{final_cv_score:.6f}.csv',
    header=True,
    float_format='%.8f',
    index=None,
)