Final Prediction: LightGBM

Train a LightGBM model with stratified K-fold cross-validation and use the mean test prediction across the folds as the final submission.

Imports

The pygoose utility package imports numpy, pandas, matplotlib, and a helper kg module into the root namespace.


In [1]:
from pygoose import *
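
For reference, the wildcard import above is roughly equivalent to the following explicit imports (an assumption based on the description above, not pygoose's documented contents):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pygoose import kg  # project discovery and pickle IO helpers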

In [2]:
import datetime

In [3]:
import lightgbm as lgb

In [4]:
from sklearn.model_selection import StratifiedKFold

Config

Automatically discover the paths to various data folders and compose the project structure.


In [5]:
project = kg.Project.discover()

Number of CV folds.


In [6]:
NUM_FOLDS = 5

Make subsequent runs reproducible. (This seeds NumPy's global RNG; LightGBM and the K-fold splitter are seeded separately via their own parameters below.)


In [7]:
RANDOM_SEED = 2017

In [8]:
np.random.seed(RANDOM_SEED)

Read Data

Load all features we extracted earlier.


In [9]:
feature_lists = [
    'simple_summaries',
    'jaccard_ngrams',
    'fuzzy',
    'tfidf',
    'lda',
    'nlp_tags',
    'wordnet_similarity',
    'phrase_embedding',
    'wmd',
    'wm_intersect',
    
    '3rdparty_abhishek',
    '3rdparty_dasolmar_whq',
    '3rdparty_mephistopheies',
    '3rdparty_image_similarity',
    
    'magic_pagerank',
    'magic_frequencies',
    'magic_cooccurrence_matrix',
    
    'oofp_nn_mlp_with_magic',
    'oofp_nn_cnn_with_magic',
    'oofp_nn_bi_lstm_with_magic',
    'oofp_nn_siamese_lstm_attention',
]

In [10]:
df_train, df_test, feature_list_ix = project.load_feature_lists(feature_lists)

In [11]:
X_train = df_train.values
X_test = df_test.values

In [12]:
y_train = kg.io.load(project.features_dir + 'y_train.pickle')

View feature summary.


In [13]:
print('X train:', X_train.shape)
print('X test: ', X_test.shape)
print('y train:', y_train.shape)


X train: (404290, 195)
X test:  (2345796, 195)
y train: (404290,)

In [14]:
pd.DataFrame(feature_list_ix, columns=['feature_list', 'start_index', 'end_index'])


Out[14]:
feature_list start_index end_index
0 simple_summaries 0 8
1 jaccard_ngrams 9 23
2 fuzzy 24 30
3 tfidf 31 32
4 lda 33 34
5 nlp_tags 35 70
6 wordnet_similarity 71 72
7 phrase_embedding 73 78
8 wmd 79 79
9 wm_intersect 80 81
10 3rdparty_abhishek 82 97
11 3rdparty_dasolmar_whq 98 146
12 3rdparty_mephistopheies 147 178
13 3rdparty_image_similarity 179 179
14 magic_pagerank 180 181
15 magic_frequencies 182 185
16 magic_cooccurrence_matrix 186 190
17 oofp_nn_mlp_with_magic 191 191
18 oofp_nn_cnn_with_magic 192 192
19 oofp_nn_bi_lstm_with_magic 193 193
20 oofp_nn_siamese_lstm_attention 194 194
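
The (feature_list, start_index, end_index) triples make it easy to slice one feature group out of the matrices. A small hypothetical helper, assuming the end indices are inclusive (as the row counts in the table above imply):

def feature_group(X, feature_list_ix, name):
    # Return the columns of a single feature group; end_index is inclusive.
    for group, start, end in feature_list_ix:
        if group == name:
            return X[:, start:end + 1]
    raise KeyError(name)

feature_group(X_train, feature_list_ix, 'magic_pagerank').shape  # (404290, 2)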

Train models & compute test predictions from each fold

Create the stratified CV partitions.


In [15]:
kfold = StratifiedKFold(
    n_splits=NUM_FOLDS,
    shuffle=True,
    random_state=RANDOM_SEED
)
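
As a quick illustration on toy data (hypothetical values, purely for demonstration), stratification keeps each validation fold's class balance close to the overall ratio:

toy_y = np.array([0] * 8 + [1] * 4)  # 33% positives overall
toy_X = np.zeros((len(toy_y), 1))
for _, val_ix in StratifiedKFold(n_splits=4, shuffle=True, random_state=0).split(toy_X, toy_y):
    print(toy_y[val_ix].mean())  # ~0.33 in every fold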

In [16]:
y_test_pred = np.zeros((len(X_test), NUM_FOLDS))

Fit all folds.


In [17]:
cv_scores = []

In [18]:
%%time

for fold_num, (ix_train, ix_val) in enumerate(kfold.split(X_train, y_train)):
    print(f'Fitting fold {fold_num + 1} of {kfold.n_splits}')
    
    X_fold_train = X_train[ix_train]
    X_fold_val = X_train[ix_val]

    y_fold_train = y_train[ix_train]
    y_fold_val = y_train[ix_val]
    
    lgb_params = {
        'objective': 'binary',
        'metric': 'binary_logloss',
        'boosting': 'gbdt',
        'device': 'cpu',
        'feature_fraction': 0.486,           # fraction of features sampled per tree
        'num_leaves': 158,                   # maximum number of leaves per tree
        'lambda_l2': 50,                     # L2 regularization strength
        'learning_rate': 0.01,
        'num_boost_round': 5000,             # upper bound; early stopping cuts this short
        'early_stopping_rounds': 10,
        'verbose': 1,
        'bagging_fraction_seed': RANDOM_SEED,
        'feature_fraction_seed': RANDOM_SEED,
    }
    
    lgb_data_train = lgb.Dataset(X_fold_train, y_fold_train)
    lgb_data_val = lgb.Dataset(X_fold_val, y_fold_val)    
    evals_result = {}
    
    model = lgb.train(
        lgb_params,
        lgb_data_train,
        valid_sets=[lgb_data_train, lgb_data_val],
        evals_result=evals_result,
        num_boost_round=lgb_params['num_boost_round'],
        early_stopping_rounds=lgb_params['early_stopping_rounds'],
        verbose_eval=False,
    )
    
    fold_train_scores = evals_result['training'][lgb_params['metric']]
    fold_val_scores = evals_result['valid_1'][lgb_params['metric']]
    
    print('Fold {}: {} rounds, training loss {:.6f}, validation loss {:.6f}'.format(
        fold_num + 1,
        len(fold_train_scores),
        fold_train_scores[-1],
        fold_val_scores[-1],
    ))
    print()
    
    cv_scores.append(fold_val_scores[-1])
    y_test_pred[:, fold_num] = model.predict(X_test).reshape(-1)


Fitting fold 1 of 5
Fold 1: 3115 rounds, training loss 0.120046, validation loss 0.188116

Fitting fold 2 of 5
Fold 2: 2652 rounds, training loss 0.127358, validation loss 0.188434

Fitting fold 3 of 5
Fold 3: 2922 rounds, training loss 0.123138, validation loss 0.188292

Fitting fold 4 of 5
Fold 4: 2661 rounds, training loss 0.127212, validation loss 0.189076

Fitting fold 5 of 5
Fold 5: 3251 rounds, training loss 0.118947, validation loss 0.184049

CPU times: user 7h 14min 19s, sys: 58.2 s, total: 7h 15min 18s
Wall time: 59min 6s

View the feature importance of the last fold's model, then compute the mean CV score across folds.


In [19]:
pd.DataFrame({
    'column': list(df_train.columns),
    'importance': model.feature_importance(),
}).sort_values(by='importance')


Out[19]:
column importance
48 ner_q1_time 0
46 ner_q1_product 0
49 ner_q1_quantity 7
89 abh_jaccard_distance 14
58 ner_q2_loc 20
42 ner_q1_loc 30
64 ner_q2_time 34
137 das_where_both 35
139 das_q2_when 48
138 das_q1_when 48
132 das_q1_who 54
136 das_q2_where 54
62 ner_q2_product 61
133 das_q2_who 61
135 das_q1_where 63
65 ner_q2_quantity 86
131 das_which_both 87
129 das_q1_which 89
140 das_when_both 122
134 das_who_both 129
47 ner_q1_date 129
130 das_q2_which 144
63 ner_q2_date 144
126 das_q1_what 188
60 ner_q2_norp 191
123 das_q1_how 193
128 das_what_both 204
44 ner_q1_norp 221
124 das_q2_how 225
127 das_q2_what 225
142 das_q2_why 235
125 das_how_both 239
141 das_q1_why 255
61 ner_q2_person 335
59 ner_q2_org 405
45 ner_q1_person 413
150 meph_log_abs_diff_len1_len2 429
43 ner_q1_org 461
143 das_why_both 473
41 ner_q1_gpe 479
66 ner_q2_cardinal 485
57 ner_q2_gpe 512
146 whq_count_diff 539
50 ner_q1_cardinal 557
149 meph_abs_diff_len1_len2 631
70 ner_tag_count_diff 787
36 pos_q1_adv 999
147 meph_len1 1062
145 whq_count_q2 1118
52 pos_q2_adv 1128
144 whq_count_q1 1163
39 pos_q1_num 1179
110 das_diff_len 1196
55 pos_q2_num 1220
112 das_caps_count_q2 1229
148 meph_len2 1238
161 meph_trigram_all_jaccard_max 1275
190 magic_comatrix_svd_manhattan 1308
6 token_len_diff_log 1373
35 pos_q1_adj 1375
158 meph_bigram_all_jaccard_max 1377
51 pos_q2_adj 1399
117 das_len_word_q1 1401
189 magic_comatrix_svd_euclidean 1433
111 das_caps_count_q1 1455
101 das_shared_count 1467
118 das_len_word_q2 1479
69 ner_tag_euclidean 1479
84 abh_fuzz_WRatio 1506
187 magic_comatrix_euclidean 1527
0 shorter_char_len_log 1533
40 pos_q1_verb 1562
183 magic_freq_q2 1625
56 pos_q2_verb 1626
38 pos_q1_propn 1631
4 shorter_token_len_log 1663
159 meph_trigram_jaccard 1719
182 magic_freq_q1 1720
15 jaccard_ix_4gram 1728
73 phrase_emb_mean_cosine 1735
12 jaccard_ix_3gram 1749
185 magic_freq_q2_q1_ratio 1756
80 q1_q2_intersect 1761
54 pos_q2_propn 1769
119 das_diff_len_word 1816
17 jaccard_ix_norm_q2_4gram 1830
82 abh_common_words 1837
53 pos_q2_noun 1898
37 pos_q1_noun 1918
160 meph_trigram_all_jaccard 1922
155 meph_unigram_all_jaccard_max 1973
186 magic_comatrix_cosine 2043
5 longer_token_len_log 2081
114 das_len_char_q1 2083
7 token_len_ratio 2148
113 das_diff_caps 2150
152 meph_log_ratio_len1_len2 2155
115 das_len_char_q2 2169
165 meph_trigram_tf_l2_euclidean 2177
16 jaccard_ix_norm_q1_4gram 2185
108 das_len_q1 2197
18 jaccard_ix_5gram 2200
157 meph_bigram_all_jaccard 2203
14 jaccard_ix_norm_q2_3gram 2224
184 magic_freq_q1_q2_ratio 2238
83 abh_fuzz_qratio 2357
13 jaccard_ix_norm_q1_3gram 2357
27 fuzz_token_set_ratio 2374
109 das_len_q2 2455
9 jaccard_ix_2gram 2469
24 fuzz_ratio 2530
20 jaccard_ix_norm_q2_5gram 2542
1 longer_char_len_log 2591
19 jaccard_ix_norm_q1_5gram 2654
193 oofp_nn_bi_lstm_with_magic 2662
194 oofp_nn_siamese_lstm_attention 2758
156 meph_bigram_jaccard 2772
78 phrase_emb_normsum_euclidean 2809
91 abh_euclidean_distance 2822
93 abh_braycurtis_distance 2875
76 phrase_emb_normsum_cosine 2882
154 meph_unigram_all_jaccard 2920
116 das_diff_len_char 2931
3 char_len_ratio 2987
28 fuzz_partial_token_sort_ratio 3084
88 abh_cityblock_distance 3112
87 abh_cosine_distance 3302
192 oofp_nn_cnn_with_magic 3304
68 pos_tag_euclidean 3384
151 meph_ratio_len1_len2 3416
77 phrase_emb_normsum_cityblock_log 3514
191 oofp_nn_mlp_with_magic 3528
10 jaccard_ix_norm_q1_2gram 3543
168 meph_m_q1_q2_tf_svd1 3552
25 fuzz_partial_ratio 3562
29 jaro 3613
11 jaccard_ix_norm_q2_2gram 3618
92 abh_minkowski_distance 3717
26 fuzz_token_sort_ratio 3779
153 meph_unigram_jaccard 3831
71 wordnet_similarity_raw 3896
2 char_len_diff_log 3918
30 jaro_winkler 3921
90 abh_canberra_distance 3929
162 meph_trigram_tfidf_cosine 3983
102 das_stops1_ratio 4063
99 das_word_match_2root 4218
103 das_stops2_ratio 4358
163 meph_trigram_tfidf_l2_euclidean 4478
164 meph_trigram_tfidf_l1_euclidean 4546
75 phrase_emb_mean_euclidean 4570
81 q1_q2_wm_ratio 4601
98 das_word_match 4604
74 phrase_emb_mean_cityblock_log 4633
172 meph_m_vstack_svd_q1_q1_cosine 4673
8 word_diff_ratio 4738
188 magic_comatrix_svd_cosine 4778
32 tfidf_euclidean 4834
31 tfidf_cosine 4901
79 wmd 4966
120 das_avg_word_len1 5008
177 meph_1wl_tf_l2_euclidean 5013
72 wordnet_similarity_brown 5015
171 meph_m_vstack_svd_q1_q1_euclidean 5032
174 meph_m_vstack_svd_absdiff_q1_q2_oof 5050
21 jaccard_ix_diff_2_3 5162
96 abh_kur_q1vec 5259
97 abh_kur_q2vec 5385
105 das_cosine 5428
94 abh_skew_q1vec 5460
167 meph_m_q1_q2_tf_svd0 5471
100 das_tfidf_word_match 5524
121 das_avg_word_len2 5536
175 meph_1wl_tfidf_cosine 5551
95 abh_skew_q2vec 5565
176 meph_1wl_tfidf_l2_euclidean 5592
104 das_shared_2gram 5652
106 das_words_hamming 5705
180 pagerank_q1 5713
170 meph_m_diff_q1_q2_tf_oof 5747
34 lda_euclidean 5835
179 image_similarity 5836
169 meph_m_q1_q2_tf_svd100_oof 5859
173 meph_m_vstack_svd_mult_q1_q2_oof 5884
33 lda_cosine 6241
166 meph_m_q1_q2_tf_oof 6456
22 jaccard_ix_diff_3_4 6460
107 das_diff_stops_r 6465
181 pagerank_q2 6501
85 abh_wmd 6612
86 abh_norm_wmd 6690
23 jaccard_ix_diff_4_5 6728
67 pos_tag_cosine 6892
122 das_diff_avg_word 7017
178 meph_m_w1l_tfidf_oof 7728
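
Note that the table above reflects only the model from the last fold. A sketch for averaging importances across folds, assuming the CV loop had also collected each booster with models.append(model) (which it does not above):

# Assumes the loop stored each fold's booster in a `models` list.
mean_importance = np.mean([m.feature_importance() for m in models], axis=0)
pd.DataFrame({
    'column': list(df_train.columns),
    'importance': mean_importance,
}).sort_values(by='importance')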

In [20]:
final_cv_score = np.mean(cv_scores)

In [21]:
print('Final CV score:', final_cv_score)


Final CV score: 0.187593465728

Generate submission


In [22]:
y_test = np.mean(y_test_pred, axis=1)

In [23]:
submission_id = datetime.datetime.now().strftime('%Y-%m-%d-%H%M')

In [24]:
df_submission = pd.DataFrame({
    'test_id': range(len(y_test)),
    'is_duplicate': y_test
})

Recalibrate predictions for the different target balance of the test set

$\alpha = \frac{p_{test}}{p_{train}}$

$\beta = \frac{1 - p_{test}}{1 - p_{train}}$

$\hat{y}_{test}^{\prime} = \frac{\alpha \hat{y}_{test}}{\alpha \hat{y}_{test} + \beta(1 - \hat{y}_{test})}$

The positive class ratio is 36.92% in the training set but an estimated ~16.5% in the test set; the adjustment above rescales the predicted odds by the ratio of the test and train priors.


In [25]:
def recalibrate_prediction(pred, train_pos_ratio=0.3692, test_pos_ratio=0.165):
    a = test_pos_ratio / train_pos_ratio
    b = (1 - test_pos_ratio) / (1 - train_pos_ratio)
    return a * pred / (a * pred + b * (1 - pred))
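
As a quick sanity check: with $p_{train} = 0.3692$ and $p_{test} = 0.165$, $\alpha \approx 0.447$ and $\beta \approx 1.324$, so a raw prediction of 0.5 maps to $\alpha / (\alpha + \beta) \approx 0.252$:

print(recalibrate_prediction(0.5))  # ~0.2524
print(recalibrate_prediction(0.9))  # ~0.7524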

In [26]:
df_submission['is_duplicate'] = df_submission['is_duplicate'].map(recalibrate_prediction)

In [27]:
df_submission = df_submission[['test_id', 'is_duplicate']]

Explore and save submission


In [28]:
pd.DataFrame(y_test).plot.hist()


Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f562bc44fd0>

In [29]:
print('Test duplicates with >0.9 confidence:', len(df_submission[df_submission.is_duplicate > 0.9]))
print('Test mean prediction:', np.mean(y_test))
print('Calibrated mean prediction:', df_submission['is_duplicate'].mean())


Test duplicates with >0.9 confidence: 42993
Test mean prediction: 0.127606241908
Calibrated mean prediction: 0.07546706870442038

In [30]:
df_submission.to_csv(
    project.submissions_dir + f'{submission_id}-submission-draft-cv-{final_cv_score:.6f}.csv',
    header=True,
    float_format='%.8f',
    index=None,
)