Task: the train set contains about 70,000 rows — a time stamp and various slices of sales data for each product item_id. The goal is to learn to predict sales 3 weeks ahead.

Let's import everything we need and read the data.


In [1]:
import pandas as pd
from sklearn import model_selection, metrics
import numpy as np
import matplotlib.pyplot as plt
import seaborn
import xgboost
import os
%pylab inline

train = pd.read_csv("train.tsv")
test = pd.read_csv("test.tsv")
sample_submission = pd.read_csv("sample_submission.tsv")
# outputs of earlier runs of models A and B below, reused as submission templates
sample_submission_a = pd.read_csv("boost_submission_a.csv")
sample_submission_b = pd.read_csv("boost_submission_b.csv")


/home/boyalex/anaconda2/lib/python2.7/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
Populating the interactive namespace from numpy and matplotlib

Part 1

First, let's write the SMAPE error function and create a scorer so it can be passed to cross_val_score and GridSearchCV.


In [2]:
def score_func(y_true, y_pred):
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    return (np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred))).mean() * 200.

scorer = metrics.make_scorer(score_func=score_func, greater_is_better=False)
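
A quick sanity check on hand-computed values (a toy example, not part of the original run):

# |10-20|/(10+20) = 1/3 and |40-40|/(40+40) = 0, so the mean is 1/6 and SMAPE = 33.33...
print score_func([10., 40.], [20., 40.])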

We will train on all the data.


In [3]:
frac = 1.  # fraction of the train set to use (1.0 = all of it)

train = train.sample(frac=frac, random_state=42)

X = train.drop(['Num', 'y'], axis=1)
y = train['y']

pd.set_option('max_columns', 64)

Part 2

Now let's look at the data. Features f1-f60 immediately raise suspicion: some of them may be dependent on one another, and it may be worth getting rid of them.


In [4]:
X.head(20)


Out[4]:
year week shift item_id f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12 f13 f14 f15 f16 f17 f18 f19 f20 f21 f22 f23 f24 f25 f26 f27 f28 f29 f30 f31 f32 f33 f34 f35 f36 f37 f38 f39 f40 f41 f42 f43 f44 f45 f46 f47 f48 f49 f50 f51 f52 f53 f54 f55 f56 f57 f58 f59 f60
44283 2014 13 1 20452327 129441.0 104610.0 121114.0 133780.0 122580.0 126830.0 102220.0 117450.0 113460.0 150970.0 72710.0 110530.0 128170.0 139070.0 91350.0 116210.0 129225.0 162615.0 99390.0 55160.0 95880.0 88520.0 107270.0 79610.0 99210.0 114561.0 93790.0 98070.0 83980.0 105240.0 129441.0 104610.0 121114.0 133780.0 122580.0 126830.0 102220.0 117450.0 113460.0 150970.0 72710.0 110530.0 128170.0 139070.0 91350.0 116210.0 129225.0 162615.0 99390.0 55160.0 95880.0 88520.0 107270.0 79610.0 99210.0 114561.0 93790.0 98070.0 83980.0 105240.0
50871 2014 23 1 20441989 4162.0 6760.0 7210.0 11330.0 6950.0 6798.0 9470.0 13332.0 6860.0 3270.0 8200.0 7580.0 9066.0 6906.0 8184.0 9144.0 9404.0 7075.0 7319.0 8162.0 9932.0 8908.0 10464.0 7431.0 10334.0 11548.0 8698.0 8696.0 8160.0 7570.0 4162.0 6760.0 7210.0 11330.0 6950.0 6798.0 9470.0 13332.0 6860.0 3270.0 8200.0 7580.0 9066.0 6906.0 8184.0 9144.0 9404.0 7075.0 7319.0 8162.0 9932.0 8908.0 10464.0 7431.0 10334.0 11548.0 8698.0 8696.0 8160.0 7570.0
13810 2013 21 3 20438706 24931.0 30338.0 30690.0 37930.0 21420.0 28240.0 28685.0 39205.0 22670.0 29780.0 31855.0 54106.0 18690.0 16835.0 27255.0 25706.0 24705.0 24015.0 27735.0 26515.0 34475.0 22390.0 27124.0 29660.0 30105.0 28054.0 31545.0 28185.0 34890.0 28790.0 24931.0 30338.0 30690.0 37930.0 21420.0 28240.0 28685.0 39205.0 22670.0 29780.0 31855.0 54106.0 18690.0 16835.0 27255.0 25706.0 24705.0 24015.0 27735.0 26515.0 34475.0 22390.0 27124.0 29660.0 30105.0 28054.0 31545.0 28185.0 34890.0 28790.0
10062 2013 15 2 20438591 11505.0 13550.0 15360.0 14750.0 12961.0 9880.0 11950.0 11269.0 15840.0 7720.0 11150.0 11370.0 15980.0 7990.0 12120.0 12370.0 21010.0 5730.0 4880.0 13690.0 10920.0 14030.0 8060.0 8430.0 9980.0 13930.0 6340.0 7810.0 8960.0 10260.0 11505.0 13550.0 15360.0 14750.0 12961.0 9880.0 11950.0 11269.0 15840.0 7720.0 11150.0 11370.0 15980.0 7990.0 12120.0 12370.0 21010.0 5730.0 4880.0 13690.0 10920.0 14030.0 8060.0 8430.0 9980.0 13930.0 6340.0 7810.0 8960.0 10260.0
37186 2014 3 2 20449525 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 52.0 2090.0 2470.0 1145.0 2955.0 3915.0 740.0 2260.0 1403.0 1417.0 980.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 52.0 2090.0 2470.0 1145.0 2955.0 3915.0 740.0 2260.0 1403.0 1417.0 980.0
11779 2013 17 1 20442117 58566.0 30476.0 27444.0 35176.0 37784.0 61653.0 24411.0 33302.0 40347.0 66246.0 21820.0 32870.0 32826.0 76585.0 14414.0 17140.0 30783.0 25557.0 50060.0 18640.0 23909.0 22744.0 47030.0 17352.0 18392.0 20060.0 35480.0 18256.0 16277.0 14780.0 58566.0 30476.0 27444.0 35176.0 37784.0 61653.0 24411.0 33302.0 40347.0 66246.0 21820.0 32870.0 32826.0 76585.0 14414.0 17140.0 30783.0 25557.0 50060.0 18640.0 23909.0 22744.0 47030.0 17352.0 18392.0 20060.0 35480.0 18256.0 16277.0 14780.0
19487 2013 28 1 20427463 8280.0 8900.0 14780.0 5194.0 4070.0 8050.0 9540.0 10020.0 5520.0 7870.0 7710.0 12471.0 5340.0 7612.0 7716.0 8643.0 7860.0 7950.0 8896.0 13340.0 8190.0 7600.0 11130.0 11349.0 16290.0 7410.0 5360.0 10620.0 12701.0 8141.0 8280.0 8900.0 14780.0 5194.0 4070.0 8050.0 9540.0 10020.0 5520.0 7870.0 7710.0 12471.0 5340.0 7612.0 7716.0 8643.0 7860.0 7950.0 8896.0 13340.0 8190.0 7600.0 11130.0 11349.0 16290.0 7410.0 5360.0 10620.0 12701.0 8141.0
13814 2013 21 3 20438803 54836.0 73042.0 77779.0 134052.0 61791.0 56961.0 80549.0 96744.0 81685.0 81735.0 89350.0 214545.0 62586.0 44214.0 60180.0 61831.0 113651.0 57998.0 57975.0 65771.0 112410.0 60409.0 78766.0 81732.0 140618.0 142027.0 67727.0 74731.0 102966.0 117584.0 54836.0 73042.0 77779.0 134052.0 61791.0 56961.0 80549.0 96744.0 81685.0 81735.0 89350.0 214545.0 62586.0 44214.0 60180.0 61831.0 113651.0 57998.0 57975.0 65771.0 112410.0 60409.0 78766.0 81732.0 140618.0 142027.0 67727.0 74731.0 102966.0 117584.0
9022 2013 13 1 20442103 2051366.0 816429.0 825582.0 1022983.0 1584683.0 1117102.0 866330.0 927534.0 1321817.0 1745788.0 670413.0 983465.0 975872.0 1530131.0 781707.0 1019808.0 1460916.0 2289616.0 610850.0 383526.0 844620.0 949393.0 852544.0 730940.0 807951.0 894810.0 1263319.0 582537.0 742668.0 756275.0 2051366.0 816429.0 825582.0 1022983.0 1584683.0 1117102.0 866330.0 927534.0 1321817.0 1745788.0 670413.0 983465.0 975872.0 1530131.0 781707.0 1019808.0 1460916.0 2289616.0 610850.0 383526.0 844620.0 949393.0 852544.0 730940.0 807951.0 894810.0 1263319.0 582537.0 742668.0 756275.0
47174 2014 18 2 20448095 54018.0 53682.0 52926.0 55286.0 50816.0 64685.0 27994.0 62230.0 52830.0 68099.0 43832.0 54177.0 69691.0 72417.0 49369.0 20720.0 50071.0 53756.0 56800.0 49161.0 56305.0 63654.0 62497.0 57887.0 48204.0 72346.0 64235.0 69257.0 67125.0 68337.0 54018.0 53682.0 52926.0 55286.0 50816.0 64685.0 27994.0 62230.0 52830.0 68099.0 43832.0 54177.0 69691.0 72417.0 49369.0 20720.0 50071.0 53756.0 56800.0 49161.0 56305.0 63654.0 62497.0 57887.0 48204.0 72346.0 64235.0 69257.0 67125.0 68337.0
8188 2013 13 3 20443031 36158.0 26956.0 38741.0 29039.0 28409.0 28615.0 36120.0 21031.0 23760.0 30173.0 23829.0 28682.0 22575.0 27010.0 31840.0 24190.0 26020.0 24200.0 31885.0 48535.0 11580.0 11850.0 23675.0 22060.0 29410.0 22475.0 25055.0 24700.0 28050.0 25477.0 36158.0 26956.0 38741.0 29039.0 28409.0 28615.0 36120.0 21031.0 23760.0 30173.0 23829.0 28682.0 22575.0 27010.0 31840.0 24190.0 26020.0 24200.0 31885.0 48535.0 11580.0 11850.0 23675.0 22060.0 29410.0 22475.0 25055.0 24700.0 28050.0 25477.0
45417 2014 16 3 20448549 29688.0 35178.0 28580.0 35520.0 32411.0 30950.0 32930.0 36190.0 44840.0 15620.0 35150.0 31290.0 40790.0 24230.0 27750.0 48640.0 32870.0 38370.0 13630.0 21120.0 24360.0 41130.0 23560.0 25790.0 35460.0 23890.0 38050.0 25632.0 42280.0 38489.0 29688.0 35178.0 28580.0 35520.0 32411.0 30950.0 32930.0 36190.0 44840.0 15620.0 35150.0 31290.0 40790.0 24230.0 27750.0 48640.0 32870.0 38370.0 13630.0 21120.0 24360.0 41130.0 23560.0 25790.0 35460.0 23890.0 38050.0 25632.0 42280.0 38489.0
33970 2013 50 2 20440443 2420.0 4510.0 4530.0 5460.0 4542.0 2680.0 6355.0 5670.0 5390.0 4395.0 6620.0 5820.0 5780.0 3233.0 5040.0 8990.0 9871.0 5360.0 6920.0 5640.0 5710.0 5690.0 3850.0 5274.0 5749.0 5020.0 2525.0 3820.0 3990.0 5130.0 2420.0 4510.0 4530.0 5460.0 4542.0 2680.0 6355.0 5670.0 5390.0 4395.0 6620.0 5820.0 5780.0 3233.0 5040.0 8990.0 9871.0 5360.0 6920.0 5640.0 5710.0 5690.0 3850.0 5274.0 5749.0 5020.0 2525.0 3820.0 3990.0 5130.0
63326 2014 42 2 20440984 19550.0 26817.0 33955.0 28920.0 22477.0 27281.0 33313.0 39569.0 27078.0 27579.0 34919.0 41062.0 33118.0 25189.0 32026.0 32818.0 28168.0 27300.0 24838.0 27712.0 30372.0 22128.0 31824.0 29352.0 38576.0 30748.0 23808.0 28059.0 27949.0 40034.0 19550.0 26817.0 33955.0 28920.0 22477.0 27281.0 33313.0 39569.0 27078.0 27579.0 34919.0 41062.0 33118.0 25189.0 32026.0 32818.0 28168.0 27300.0 24838.0 27712.0 30372.0 22128.0 31824.0 29352.0 38576.0 30748.0 23808.0 28059.0 27949.0 40034.0
24847 2013 36 1 20443031 22475.0 25055.0 24700.0 28050.0 25477.0 28805.0 32266.0 34945.0 29521.0 29964.0 36580.0 38055.0 28120.0 28319.0 32640.0 31935.0 32050.0 29260.0 21740.0 35782.0 33567.0 21740.0 25022.0 28744.0 26780.0 28835.0 27460.0 33635.0 43720.0 28850.0 22475.0 25055.0 24700.0 28050.0 25477.0 28805.0 32266.0 34945.0 29521.0 29964.0 36580.0 38055.0 28120.0 28319.0 32640.0 31935.0 32050.0 29260.0 21740.0 35782.0 33567.0 21740.0 25022.0 28744.0 26780.0 28835.0 27460.0 33635.0 43720.0 28850.0
28049 2013 42 3 20449360 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5929 2013 9 2 20438604 6078.0 8120.0 6463.0 7387.0 7544.0 10212.0 4943.0 5085.0 5828.0 7433.0 3500.0 3635.0 4078.0 4731.0 6451.0 3793.0 5149.0 5062.0 5943.0 4410.0 6035.0 5373.0 8665.0 2945.0 2400.0 3285.0 3701.0 5950.0 3300.0 3735.0 6078.0 8120.0 6463.0 7387.0 7544.0 10212.0 4943.0 5085.0 5828.0 7433.0 3500.0 3635.0 4078.0 4731.0 6451.0 3793.0 5149.0 5062.0 5943.0 4410.0 6035.0 5373.0 8665.0 2945.0 2400.0 3285.0 3701.0 5950.0 3300.0 3735.0
41877 2014 10 2 20447919 3820.0 2190.0 2270.0 2520.0 3160.0 1730.0 1990.0 1812.0 2420.0 1850.0 1830.0 2060.0 2700.0 2995.0 820.0 2310.0 1460.0 3060.0 2572.0 1350.0 2426.0 3270.0 1920.0 1700.0 1360.0 1380.0 1870.0 1284.0 1500.0 1880.0 3820.0 2190.0 2270.0 2520.0 3160.0 1730.0 1990.0 1812.0 2420.0 1850.0 1830.0 2060.0 2700.0 2995.0 820.0 2310.0 1460.0 3060.0 2572.0 1350.0 2426.0 3270.0 1920.0 1700.0 1360.0 1380.0 1870.0 1284.0 1500.0 1880.0
28935 2013 42 1 20441989 15270.0 17366.0 18330.0 13730.0 15060.0 16900.0 12494.0 14882.0 15390.0 16476.0 23410.0 9350.0 9220.0 14270.0 15732.0 10512.0 11158.0 11170.0 13310.0 16140.0 9748.0 11854.0 13590.0 17520.0 7936.0 10100.0 9368.0 10830.0 13440.0 7010.0 15270.0 17366.0 18330.0 13730.0 15060.0 16900.0 12494.0 14882.0 15390.0 16476.0 23410.0 9350.0 9220.0 14270.0 15732.0 10512.0 11158.0 11170.0 13310.0 16140.0 9748.0 11854.0 13590.0 17520.0 7936.0 10100.0 9368.0 10830.0 13440.0 7010.0
49540 2014 21 1 20441790 11790.0 14383.0 7360.0 13571.0 11870.0 13400.0 9945.0 12460.0 15520.0 19350.0 9590.0 4760.0 10760.0 10260.0 11676.0 11130.0 10545.0 11500.0 11930.0 10530.0 9260.0 12220.0 12800.0 11030.0 10850.0 10690.0 11490.0 12580.0 10765.0 10195.0 11790.0 14383.0 7360.0 13571.0 11870.0 13400.0 9945.0 12460.0 15520.0 19350.0 9590.0 4760.0 10760.0 10260.0 11676.0 11130.0 10545.0 11500.0 11930.0 10530.0 9260.0 12220.0 12800.0 11030.0 10850.0 10690.0 11490.0 12580.0 10765.0 10195.0

Notice that features f31-f60 are an exact copy of f1-f30. We will verify this and get rid of them. We will build three xgboost models: the first on all the data with all the features, the second without the duplicated f31-f60, and the last with even fewer features.

Now let's tune the xgboost parameters one after another. Searching over all parameters at once takes far too long, so we go from the main parameters, such as the number of trees or the tree depth, to the (presumably) less important ones. For each parameter we first search with a large step and then shrink the step, as sketched below.
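
Schematically, the coarse-to-fine idea looks like this (an illustrative sketch only; in the cells below the refinement is done by hand, one grid per cell, since each fit on the full data is expensive):

center = 150
for step in (50, 10, 5):  # shrink the step around the current best value
    grid = {'learning_rate': [0.1],
            'n_estimators': [max(center - step, 1), center, center + step]}
    gs = model_selection.GridSearchCV(xgboost.XGBRegressor(), grid,
                                      scoring=scorer,
                                      cv=model_selection.TimeSeriesSplit())
    gs.fit(X, y)
    center = gs.best_params_['n_estimators']
print center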

Since the data is time-dependent, we use a special split that respects the temporal ordering (no chunks of data from the "future" may be used for training).
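
To see what this split does, here is a toy run on 12 synthetic indices (not our data): every fold trains on a prefix of the data and validates on the block that immediately follows it.

for train_idx, val_idx in model_selection.TimeSeriesSplit(n_splits=3).split(np.arange(12)):
    print train_idx, val_idx
# [0 1 2] [3 4 5]
# [0 1 2 3 4 5] [6 7 8]
# [0 1 2 3 4 5 6 7 8] [9 10 11]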

The models take a long time to fit, so it is not worth rerunning them: the tuned parameters are saved right before each model is built. In practice, the first and simplest model combined with work on the data gave the best results, so the parameter-tuning code is shown only for it, not for the others.

The first model: all the data, all the features.


In [5]:
params_height = {
    'learning_rate': [0.1],
    'n_estimators': [70, 100, 120, 150, 175, 200, 225]
}

zero = model_selection.GridSearchCV(xgboost.XGBRegressor(silent=False), params_height, 
                                 scoring=scorer, cv=model_selection.TimeSeriesSplit(), 
                                 fit_params={'eval_metric' : 'mae'})

In [6]:
%%time
zero.fit(X, y)


CPU times: user 12min 18s, sys: 4.6 s, total: 12min 23s
Wall time: 3min 16s
Out[6]:
GridSearchCV(cv=TimeSeriesSplit(n_splits=3), error_score='raise',
       estimator=XGBRegressor(base_score=0.5, colsample_bylevel=1, colsample_bytree=1, gamma=0,
       learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='reg:linear', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=False, subsample=1),
       fit_params={'eval_metric': 'mae'}, iid=True, n_jobs=1,
       param_grid={'n_estimators': [70, 100, 120, 150, 175, 200, 225], 'learning_rate': [0.1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=make_scorer(score_func, greater_is_better=False), verbose=0)

In [7]:
# play a short 'beep' when the long-running cell finishes (uses the SoX `play` utility)
length = 0.4
frequency = 1000
os.system('play --no-show-progress --null --channels 1 synth %s sine %f' % (length, frequency))


Out[7]:
0

In [8]:
zero.best_params_


Out[8]:
{'learning_rate': 0.1, 'n_estimators': 120}

In [9]:
params_height = {
    'learning_rate': [0.1],
    'n_estimators': [115, 120, 125, 130, 135]
}

zero = model_selection.GridSearchCV(xgboost.XGBRegressor(silent=False), params_height, 
                                 scoring=scorer, cv=model_selection.TimeSeriesSplit(), 
                                 fit_params={'eval_metric' : 'mae'})

In [10]:
%%time
zero.fit(X, y)


CPU times: user 8min 12s, sys: 3.6 s, total: 8min 16s
Wall time: 2min 23s
Out[10]:
GridSearchCV(cv=TimeSeriesSplit(n_splits=3), error_score='raise',
       estimator=XGBRegressor(base_score=0.5, colsample_bylevel=1, colsample_bytree=1, gamma=0,
       learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='reg:linear', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=False, subsample=1),
       fit_params={'eval_metric': 'mae'}, iid=True, n_jobs=1,
       param_grid={'n_estimators': [115, 120, 125, 130, 135], 'learning_rate': [0.1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=make_scorer(score_func, greater_is_better=False), verbose=0)

In [11]:
length = 0.4
frequency = 1000
os.system('play --no-show-progress --null --channels 1 synth %s sine %f' % (length, frequency))


Out[11]:
0

In [12]:
zero.best_params_


Out[12]:
{'learning_rate': 0.1, 'n_estimators': 115}

In [13]:
params2 = {
    'n_estimators' : [115],
    'learning_rate': [0.1],
    'max_depth': [2, 6, 10, 14, 18],
    'min_child_weight' : [1, 3, 5]
}

zero = model_selection.GridSearchCV(xgboost.XGBRegressor(silent=False), params2, 
                                 scoring=scorer, cv=model_selection.TimeSeriesSplit(), 
                                 fit_params={'eval_metric' : 'mae'})

In [14]:
%%time
zero.fit(X, y)


CPU times: user 1h 15min 5s, sys: 30.9 s, total: 1h 15min 36s
Wall time: 21min 31s
Out[14]:
GridSearchCV(cv=TimeSeriesSplit(n_splits=3), error_score='raise',
       estimator=XGBRegressor(base_score=0.5, colsample_bylevel=1, colsample_bytree=1, gamma=0,
       learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='reg:linear', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=False, subsample=1),
       fit_params={'eval_metric': 'mae'}, iid=True, n_jobs=1,
       param_grid={'n_estimators': [115], 'learning_rate': [0.1], 'max_depth': [2, 6, 10, 14, 18], 'min_child_weight': [1, 3, 5]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=make_scorer(score_func, greater_is_better=False), verbose=0)

In [15]:
length = 0.4
frequency = 1000
os.system('play --no-show-progress --null --channels 1 synth %s sine %f' % (length, frequency))


Out[15]:
0

In [16]:
zero.best_params_


Out[16]:
{'learning_rate': 0.1,
 'max_depth': 18,
 'min_child_weight': 1,
 'n_estimators': 115}

In [17]:
params3 = {
    'n_estimators' : [115],
    'learning_rate': [0.1],
    'max_depth': [18],
    'min_child_weight' : [1],
    'gamma': [i / 10.0 for i in range(0, 5)]
}

zero = model_selection.GridSearchCV(xgboost.XGBRegressor(silent=False), params3, 
                                 scoring=scorer, cv=model_selection.TimeSeriesSplit(), 
                                 fit_params={'eval_metric' : 'mae'})

In [18]:
%%time
zero.fit(X, y)


CPU times: user 54min 12s, sys: 38.3 s, total: 54min 51s
Wall time: 16min 6s
Out[18]:
GridSearchCV(cv=TimeSeriesSplit(n_splits=3), error_score='raise',
       estimator=XGBRegressor(base_score=0.5, colsample_bylevel=1, colsample_bytree=1, gamma=0,
       learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='reg:linear', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=False, subsample=1),
       fit_params={'eval_metric': 'mae'}, iid=True, n_jobs=1,
       param_grid={'n_estimators': [115], 'gamma': [0.0, 0.1, 0.2, 0.3, 0.4], 'learning_rate': [0.1], 'max_depth': [18], 'min_child_weight': [1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=make_scorer(score_func, greater_is_better=False), verbose=0)

In [19]:
length = 0.4
frequency = 1000
os.system('play --no-show-progress --null --channels 1 synth %s sine %f' % (length, frequency))


Out[19]:
0

In [20]:
zero.best_params_


Out[20]:
{'gamma': 0.3,
 'learning_rate': 0.1,
 'max_depth': 18,
 'min_child_weight': 1,
 'n_estimators': 115}

In [21]:
params4 = {
    'gamma': [0.3],
    'n_estimators' : [115],
    'learning_rate': [0.1],
    'max_depth': [18],
    'subsample': [i / 10.0 for i in range(6, 10)],
    'colsample_bytree': [i / 10.0 for i in range(6, 10)]
}

zero = model_selection.GridSearchCV(xgboost.XGBRegressor(silent=False), params4, 
                                 scoring=scorer, cv=model_selection.TimeSeriesSplit(), 
                                 fit_params={'eval_metric' : 'mae'})

In [22]:
%%time
zero.fit(X, y)


CPU times: user 1h 37min 29s, sys: 18.1 s, total: 1h 37min 47s
Wall time: 25min 12s
Out[22]:
GridSearchCV(cv=TimeSeriesSplit(n_splits=3), error_score='raise',
       estimator=XGBRegressor(base_score=0.5, colsample_bylevel=1, colsample_bytree=1, gamma=0,
       learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='reg:linear', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=False, subsample=1),
       fit_params={'eval_metric': 'mae'}, iid=True, n_jobs=1,
       param_grid={'colsample_bytree': [0.6, 0.7, 0.8, 0.9], 'learning_rate': [0.1], 'n_estimators': [115], 'subsample': [0.6, 0.7, 0.8, 0.9], 'max_depth': [18], 'gamma': [0.3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=make_scorer(score_func, greater_is_better=False), verbose=0)

In [23]:
length = 0.4
frequency = 1000
os.system('play --no-show-progress --null --channels 1 synth %s sine %f' % (length, frequency))


Out[23]:
0

In [24]:
zero.best_params_


Out[24]:
{'colsample_bytree': 0.9,
 'gamma': 0.3,
 'learning_rate': 0.1,
 'max_depth': 18,
 'n_estimators': 115,
 'subsample': 0.9}

Since everything takes a long time to compute, let's store the parameters found above.


In [12]:
zero_best_params = {
    'colsample_bytree': 0.9,
    'gamma': 0.3,
    'learning_rate': 0.1,
    'max_depth': 18,
    'n_estimators': 115,
    'subsample': 0.9
}

In [13]:
best_zero = xgboost.XGBRegressor(**zero_best_params)

In [14]:
%%time
model = best_zero
model.fit(X, y, eval_metric='mae')

test_drop = test.drop(['Num'], axis=1)
preds = model.predict(test_drop)


print len(preds)
print len(sample_submission)


2016
2016
CPU times: user 43min 40s, sys: 7.96 s, total: 43min 48s
Wall time: 11min 19s

In [15]:
sample_submission['y'] = preds

In [16]:
sample_submission['y'] = sample_submission['y'].map(lambda x: x if x > 0 else 0.0)
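
Equivalently, pandas' built-in clipping does the same thing in one call:

sample_submission['y'] = sample_submission['y'].clip(lower=0.0)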

In [17]:
sample_submission.to_csv("boost_submission_zero.csv", sep=',', index=False)

In [18]:
model_selection.cross_val_score(model, X, y, scoring=scorer)


Out[18]:
array([-21.70117872, -21.35792838, -21.63656115])

This solution gives a SMAPE of about 23 on the public leaderboard, which is close to the cross-validation estimate obtained above.

The second model: dropping f31-f60.

Let's verify that f31-f60 are an exact copy of f1-f30.


In [19]:
# count mismatching cells between the f1-f30 and f31-f60 blocks (columns 4..33 vs 34..63)
sum(sum(X[X.columns[i + 4]] != X[X.columns[i + 34]] for i in xrange(30)))


Out[19]:
0
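
The same check can be written with pandas built-ins (an equivalent sketch: relabel the second block so that equals() compares values positionally):

left = X[['f%d' % i for i in xrange(1, 31)]]
right = X[['f%d' % i for i in xrange(31, 61)]]
right.columns = left.columns  # align labels so equals() compares values only
print left.equals(right)      # True iff f31-f60 duplicate f1-f30 exactly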

Let's remove these copies and fit the model.


In [20]:
X = X.drop(X.columns[34:], axis=1)

Save the parameters.


In [21]:
a_best_params = {
    'colsample_bytree': 0.9,
    'gamma': 0.3,
    'learning_rate': 0.01,
    'max_depth': 18,
    'n_estimators': 115*10,
    'subsample': 0.9
}

In [22]:
best = xgboost.XGBRegressor(**a_best_params)

In [23]:
%%time
model = best
model.fit(X, y, eval_metric='mae')

test_drop = test.drop(['Num'], axis=1)
test_drop = test_drop.drop(test_drop.columns[34:], axis=1)
preds = model.predict(test_drop)


print len(preds)
print len(sample_submission_a)


2016
2016
CPU times: user 32min 6s, sys: 30.2 s, total: 32min 36s
Wall time: 10min 36s

In [24]:
sample_submission_a['y'] = preds

In [25]:
sample_submission_a['y'] = sample_submission_a['y'].map(lambda x: x if x > 0 else 0.0)

In [26]:
sample_submission_a.to_csv("boost_submission_a.csv", sep=',', index=False)

In [27]:
model_selection.cross_val_score(model, X, y, scoring=scorer)


Out[27]:
array([-21.76252417, -21.36676325, -21.66864735])

Now, given how correlated the f features are, let's try dropping some of them and look at the results.

The third model: keeping only a few features.

Let's compute the correlation coefficients between the remaining features. For the most correlated pairs, one of the two features should be dropped.


In [28]:
X.corr('kendall')


Out[28]:
year week shift item_id f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12 f13 f14 f15 f16 f17 f18 f19 f20 f21 f22 f23 f24 f25 f26 f27 f28 f29 f30
year 1.000000 -0.040407 0.010925 0.043550 0.040941 0.032403 0.032694 0.032910 0.033017 0.034271 0.033766 0.033634 0.033337 0.032657 0.033653 0.032863 0.032093 0.030454 0.031617 0.030607 0.029523 0.027499 0.027007 0.026095 0.024730 0.023550 0.021096 0.022329 0.020997 0.021032 0.018882 0.020021 0.019753 0.017332
week -0.040407 1.000000 -0.000080 0.022726 -0.003695 -0.004569 -0.004469 -0.003750 -0.000716 0.001711 0.002639 0.004036 0.006365 0.009487 0.011438 0.013189 0.016123 0.020705 0.022904 0.023710 0.024737 0.028234 0.030329 0.029608 0.027980 0.027194 0.028244 0.025780 0.022690 0.020287 0.021413 0.019904 0.017088 0.017104
shift 0.010925 -0.000080 1.000000 0.000628 -0.000854 -0.001094 -0.001009 -0.000919 -0.000943 -0.000939 -0.001123 -0.000967 -0.000818 -0.001073 -0.001112 -0.000938 -0.000725 -0.000847 -0.000848 -0.000819 -0.000414 -0.000007 -0.000070 -0.000351 0.000097 0.000396 0.000023 0.000058 0.000080 0.000306 0.000387 -0.000072 -0.000378 0.001500
item_id 0.043550 0.022726 0.000628 1.000000 -0.084510 -0.082547 -0.079873 -0.077088 -0.074288 -0.071403 -0.068628 -0.065832 -0.062845 -0.059768 -0.056604 -0.053552 -0.050409 -0.047284 -0.044096 -0.041074 -0.038044 -0.035209 -0.032366 -0.029711 -0.027024 -0.024244 -0.021373 -0.018428 -0.015728 -0.013009 -0.010290 -0.007413 -0.004674 -0.001977
f1 0.040941 -0.003695 -0.000854 -0.084510 1.000000 0.895411 0.879341 0.876172 0.895312 0.884445 0.862932 0.856184 0.864702 0.870442 0.843907 0.835686 0.835290 0.848292 0.828759 0.815665 0.811954 0.821869 0.814197 0.797690 0.791761 0.795196 0.796363 0.779899 0.769958 0.770015 0.778354 0.763380 0.751649 0.747743
f2 0.032403 -0.004569 -0.001094 -0.082547 0.895411 1.000000 0.901690 0.885347 0.883280 0.903618 0.890108 0.868792 0.862117 0.871048 0.877589 0.849126 0.841354 0.840904 0.855644 0.833150 0.820146 0.816683 0.828159 0.818664 0.801271 0.795483 0.799285 0.801767 0.782781 0.773940 0.773998 0.783923 0.766782 0.754085
f3 0.032694 -0.004469 -0.001009 -0.079873 0.879341 0.901690 1.000000 0.901379 0.885228 0.883894 0.903272 0.889655 0.868130 0.861414 0.870840 0.876901 0.848354 0.840400 0.840692 0.854640 0.832041 0.818839 0.815780 0.827002 0.817260 0.799660 0.793723 0.797967 0.800540 0.781409 0.772496 0.772835 0.782533 0.763998
f4 0.032910 -0.003750 -0.000919 -0.077088 0.876172 0.885347 0.901379 1.000000 0.901266 0.885735 0.883494 0.902814 0.889102 0.867368 0.861209 0.870171 0.876268 0.847408 0.840146 0.839701 0.853487 0.830760 0.817871 0.814646 0.825588 0.815733 0.797833 0.792395 0.796755 0.799240 0.779955 0.771237 0.771395 0.779676
f5 0.033017 -0.000716 -0.000943 -0.074288 0.895312 0.883280 0.885228 0.901266 1.000000 0.901245 0.885885 0.883197 0.902363 0.888502 0.866791 0.860784 0.869480 0.875418 0.846624 0.839518 0.838842 0.852363 0.829503 0.817101 0.813733 0.824423 0.814244 0.796298 0.791789 0.795753 0.798113 0.778550 0.770303 0.768777
f6 0.034271 0.001711 -0.000939 -0.071403 0.884445 0.903618 0.883894 0.885735 0.901245 1.000000 0.902406 0.886317 0.883383 0.902521 0.888163 0.867244 0.860687 0.869241 0.874537 0.846904 0.839343 0.838259 0.851165 0.829450 0.816956 0.813203 0.823585 0.812898 0.796556 0.791403 0.795242 0.796900 0.778383 0.768294
f7 0.033766 0.002639 -0.001123 -0.068628 0.862932 0.890108 0.903272 0.883494 0.885885 0.902406 1.000000 0.901996 0.885986 0.883213 0.903033 0.887597 0.866666 0.860005 0.869507 0.873549 0.845586 0.837991 0.837575 0.850228 0.827809 0.815214 0.811353 0.822415 0.811443 0.795027 0.789821 0.794226 0.795416 0.775300
f8 0.033634 0.004036 -0.000967 -0.065832 0.856184 0.868792 0.889655 0.902814 0.883197 0.886317 0.901996 1.000000 0.901479 0.885464 0.883099 0.902682 0.887048 0.865759 0.859822 0.868652 0.872641 0.844261 0.836990 0.836495 0.849133 0.826242 0.813430 0.809928 0.821321 0.810143 0.793529 0.788408 0.792883 0.792708
f9 0.033337 0.006365 -0.000818 -0.062845 0.864702 0.862117 0.868130 0.889102 0.902363 0.883383 0.885986 0.901479 1.000000 0.900839 0.885161 0.882494 0.902067 0.886200 0.865179 0.859012 0.867559 0.871496 0.843013 0.836016 0.835172 0.847845 0.824412 0.811850 0.808920 0.820093 0.808794 0.791986 0.787023 0.789931
f10 0.032657 0.009487 -0.001073 -0.059768 0.870442 0.871048 0.861414 0.867368 0.888502 0.902521 0.883213 0.885464 0.900839 1.000000 0.900625 0.884854 0.881822 0.901316 0.885858 0.864617 0.858127 0.866332 0.870481 0.842242 0.835050 0.833766 0.846419 0.822926 0.811116 0.807810 0.818927 0.807502 0.791036 0.784227
f11 0.033653 0.011438 -0.001112 -0.056604 0.843907 0.877589 0.870840 0.861209 0.866791 0.888163 0.903033 0.883099 0.885161 0.900625 1.000000 0.901029 0.884601 0.881355 0.900517 0.886048 0.864191 0.857259 0.864890 0.870099 0.841733 0.834131 0.832571 0.844969 0.822786 0.810348 0.806961 0.817642 0.807110 0.788717
f12 0.032863 0.013189 -0.000938 -0.053552 0.835686 0.849126 0.876901 0.870171 0.860784 0.867244 0.887597 0.902682 0.882494 0.884854 0.901029 1.000000 0.900584 0.883934 0.881746 0.899764 0.885093 0.863036 0.856698 0.864026 0.868826 0.840099 0.832370 0.831400 0.843806 0.821426 0.808799 0.805859 0.816253 0.804264
f13 0.032093 0.016123 -0.000725 -0.050409 0.835290 0.841354 0.848354 0.876268 0.869480 0.860687 0.866666 0.887048 0.902067 0.881822 0.884601 0.900584 1.000000 0.900049 0.883834 0.881232 0.899212 0.884297 0.862220 0.856132 0.863153 0.867772 0.838654 0.831197 0.830782 0.842861 0.820351 0.807636 0.804846 0.813731
f14 0.030454 0.020705 -0.000847 -0.047284 0.848292 0.840904 0.840400 0.847408 0.875418 0.869241 0.860005 0.865759 0.886200 0.901316 0.881355 0.883934 0.900049 1.000000 0.900140 0.883450 0.880508 0.898422 0.883684 0.861915 0.855249 0.862101 0.866525 0.837589 0.830748 0.829900 0.841942 0.819450 0.806917 0.802288
f15 0.031617 0.022904 -0.000848 -0.044096 0.828759 0.855644 0.840692 0.840146 0.846624 0.874537 0.869507 0.859822 0.865179 0.885858 0.900517 0.881746 0.883834 0.900140 1.000000 0.901268 0.883983 0.880536 0.897751 0.884387 0.862561 0.855216 0.861912 0.865764 0.838593 0.830983 0.829944 0.841269 0.819980 0.805458
f16 0.030607 0.023710 -0.000819 -0.041074 0.815665 0.833150 0.854640 0.839701 0.839518 0.846904 0.873549 0.868652 0.859012 0.864617 0.886048 0.899764 0.881232 0.883450 0.901268 1.000000 0.900518 0.883311 0.880802 0.897371 0.883177 0.861361 0.853895 0.861491 0.864807 0.837627 0.829952 0.829635 0.840146 0.817437
f17 0.029523 0.024737 -0.000414 -0.038044 0.811954 0.820146 0.832041 0.853487 0.838842 0.839343 0.845586 0.872641 0.867559 0.858127 0.864191 0.885093 0.899212 0.880508 0.883983 0.900518 1.000000 0.900021 0.883528 0.880401 0.896802 0.882141 0.860232 0.853493 0.860983 0.864093 0.836753 0.829455 0.828753 0.838086
f18 0.027499 0.028234 -0.000007 -0.035209 0.821869 0.816683 0.818839 0.830760 0.852363 0.838259 0.837991 0.844261 0.871496 0.866332 0.857259 0.863036 0.884297 0.898422 0.880536 0.883311 0.900021 1.000000 0.900098 0.883258 0.879749 0.896219 0.881095 0.859654 0.853237 0.860460 0.863398 0.836052 0.828622 0.826649
f19 0.027007 0.030329 -0.000070 -0.032366 0.814197 0.828159 0.815780 0.817871 0.829503 0.851165 0.837575 0.836990 0.843013 0.870481 0.864890 0.856698 0.862220 0.883684 0.897751 0.880802 0.883528 0.900098 1.000000 0.900814 0.883875 0.879788 0.896341 0.880632 0.860742 0.853543 0.860639 0.862822 0.836388 0.827086
f20 0.026095 0.029608 -0.000351 -0.029711 0.797690 0.818664 0.827002 0.814646 0.817101 0.829450 0.850228 0.836495 0.836016 0.842242 0.870099 0.864026 0.856132 0.861915 0.884387 0.897371 0.880401 0.883258 0.900814 1.000000 0.900302 0.883284 0.879336 0.896480 0.880674 0.860523 0.853263 0.860690 0.862444 0.834243
f21 0.024730 0.027980 0.000097 -0.027024 0.791761 0.801271 0.817260 0.825588 0.813733 0.816956 0.827809 0.849133 0.835172 0.835050 0.841733 0.868826 0.863153 0.855249 0.862561 0.883177 0.896802 0.879749 0.883875 0.900302 1.000000 0.899208 0.882512 0.879505 0.896091 0.880293 0.859842 0.853105 0.859814 0.861080
f22 0.023550 0.027194 0.000396 -0.024244 0.795196 0.795483 0.799660 0.815733 0.824423 0.813203 0.815214 0.826242 0.847845 0.833766 0.834131 0.840099 0.867772 0.862101 0.855216 0.861361 0.882141 0.896219 0.879788 0.883284 0.899208 1.000000 0.898184 0.882330 0.879141 0.895601 0.879787 0.859472 0.852018 0.857615
f23 0.021096 0.028244 0.000023 -0.021373 0.796363 0.799285 0.793723 0.797833 0.814244 0.823585 0.811353 0.813430 0.824412 0.846419 0.832571 0.832370 0.838654 0.866525 0.861912 0.853895 0.860232 0.881095 0.896341 0.879336 0.882512 0.898184 1.000000 0.898183 0.882482 0.878812 0.895190 0.879762 0.858956 0.849798
f24 0.022329 0.025780 0.000058 -0.018428 0.779899 0.801767 0.797967 0.792395 0.796298 0.812898 0.822415 0.809928 0.811850 0.822926 0.844969 0.831400 0.831197 0.837589 0.865764 0.861491 0.853493 0.859654 0.880632 0.896480 0.879505 0.882330 0.898183 1.000000 0.899611 0.882982 0.879070 0.894834 0.880326 0.857545
f25 0.020997 0.022690 0.000080 -0.015728 0.769958 0.782781 0.800540 0.796755 0.791789 0.796556 0.811443 0.821321 0.808920 0.811116 0.822786 0.843806 0.830782 0.830748 0.838593 0.864807 0.860983 0.853237 0.860742 0.880674 0.896091 0.879141 0.882482 0.899611 1.000000 0.899834 0.883051 0.880199 0.894481 0.878938
f26 0.021032 0.020287 0.000306 -0.013009 0.770015 0.773940 0.781409 0.799240 0.795753 0.791403 0.795027 0.810143 0.820093 0.807810 0.810348 0.821426 0.842861 0.829900 0.830983 0.837627 0.864093 0.860460 0.853543 0.860523 0.880293 0.895601 0.878812 0.882982 0.899834 1.000000 0.899584 0.883056 0.879717 0.892815
f27 0.018882 0.021413 0.000387 -0.010290 0.778354 0.773998 0.772496 0.779955 0.798113 0.795242 0.789821 0.793529 0.808794 0.818927 0.806961 0.808799 0.820351 0.841942 0.829944 0.829952 0.836753 0.863398 0.860639 0.853263 0.859842 0.879787 0.895190 0.879070 0.883051 0.899584 1.000000 0.899785 0.882588 0.877680
f28 0.020021 0.019904 -0.000072 -0.007413 0.763380 0.783923 0.772835 0.771237 0.778550 0.796900 0.794226 0.788408 0.791986 0.807502 0.817642 0.805859 0.807636 0.819450 0.841269 0.829635 0.829455 0.836052 0.862822 0.860690 0.853105 0.859472 0.879762 0.894834 0.880199 0.883056 0.899785 1.000000 0.900697 0.880776
f29 0.019753 0.017088 -0.000378 -0.004674 0.751649 0.766782 0.782533 0.771395 0.770303 0.778383 0.795416 0.792883 0.787023 0.791036 0.807110 0.816253 0.804846 0.806917 0.819980 0.840146 0.828753 0.828622 0.836388 0.862444 0.859814 0.852018 0.858956 0.880326 0.894481 0.879717 0.882588 0.900697 1.000000 0.898693
f30 0.017332 0.017104 0.001500 -0.001977 0.747743 0.754085 0.763998 0.779676 0.768777 0.768294 0.775300 0.792708 0.789931 0.784227 0.788717 0.804264 0.813731 0.802288 0.805458 0.817437 0.838086 0.826649 0.827086 0.834243 0.861080 0.857615 0.849798 0.857545 0.878938 0.892815 0.877680 0.880776 0.898693 1.000000

The f features correlate with each other very strongly, and the larger the gap between their indices, the weaker the correlation. So let's pick just a few of them.
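
To make the choice less ad hoc, the most correlated pairs can be listed explicitly (a small sketch recomputing the same Kendall matrix as displayed above):

corr = X.corr('kendall')
f_cols = [c for c in X.columns if c.startswith('f')]
pairs = []
for i, a in enumerate(f_cols):
    for b in f_cols[i + 1:]:
        pairs.append((corr.loc[a, b], a, b))
pairs.sort(reverse=True)
print pairs[:5]  # the most strongly correlated feature pairs first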


In [29]:
columns = X.columns[[0, 1, 2, 3, 4, 14, 24, -1]]
columns


Out[29]:
Index([u'year', u'week', u'shift', u'item_id', u'f1', u'f11', u'f21', u'f30'], dtype='object')

In [30]:
X = X[columns]

Save the parameters.


In [31]:
b_best_params = {
    'colsample_bytree': 0.9,
    'gamma': 0.2,
    'learning_rate': 0.01,
    'max_depth': 21,
    'n_estimators': 85*10,
    'subsample': 0.9
}

In [32]:
best_b = xgboost.XGBRegressor(**b_best_params)

In [33]:
%%time
model = best_b
model.fit(X, y, eval_metric='mae')

preds = model.predict(test_drop[test_drop.columns[[0, 1, 2, 3, 4, 14, 24, -1]]])

print len(preds)
print len(sample_submission_b)


2016
2016
CPU times: user 9min 32s, sys: 11.3 s, total: 9min 43s
Wall time: 3min

In [34]:
sample_submission_b['y'] = preds

In [35]:
sample_submission_b['y'] = sample_submission_b['y'].map(lambda x: x if x > 0 else 0.0)

In [36]:
sample_submission_b.to_csv("boost_submission_b.csv", sep=',', index=False)

In [37]:
model_selection.cross_val_score(model, X, y, scoring=scorer)


Out[37]:
array([-21.88711999, -21.57399576, -21.38635692])

Although reducing the number of correlated features should have had a positive effect on the results, there is no noticeable improvement.

Part 3

Now let's take a closer look at the data.


In [38]:
train = pd.read_csv("train.tsv")
test = pd.read_csv("test.tsv")
train = train.drop(train.columns[36:], axis=1)  # keep only Num, y, year, week, shift, item_id, f1-f30

sorted_train = train.sort_values(['item_id', 'year', 'week', 'shift'])
check_id = sorted_train['item_id'][0]
sorted_train[sorted_train['item_id'] == check_id].head(20)


Out[38]:
Num y year week shift item_id f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12 f13 f14 f15 f16 f17 f18 f19 f20 f21 f22 f23 f24 f25 f26 f27 f28 f29 f30
0 0 123438 2012 52 1 20442076 4915.0 38056.0 40185.0 45733.0 59710.0 39982.0 45846.0 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0
698 3470 49217 2013 1 1 20442076 38056.0 40185.0 45733.0 59710.0 39982.0 45846.0 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0
237 237 49217 2013 1 2 20442076 4915.0 38056.0 40185.0 45733.0 59710.0 39982.0 45846.0 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0
1394 6937 34819 2013 2 1 20442076 40185.0 45733.0 59710.0 39982.0 45846.0 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0
930 3702 34819 2013 2 2 20442076 38056.0 40185.0 45733.0 59710.0 39982.0 45846.0 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0
469 469 34819 2013 2 3 20442076 4915.0 38056.0 40185.0 45733.0 59710.0 39982.0 45846.0 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0
2083 10395 77143 2013 3 1 20442076 45733.0 59710.0 39982.0 45846.0 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0
1623 7166 77143 2013 3 2 20442076 40185.0 45733.0 59710.0 39982.0 45846.0 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0
1159 3931 77143 2013 3 3 20442076 38056.0 40185.0 45733.0 59710.0 39982.0 45846.0 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0
2787 13865 67781 2013 4 1 20442076 59710.0 39982.0 45846.0 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0 47915.0
2326 10638 67781 2013 4 2 20442076 45733.0 59710.0 39982.0 45846.0 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0
1866 7409 67781 2013 4 3 20442076 40185.0 45733.0 59710.0 39982.0 45846.0 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0
3470 17308 67306 2013 5 1 20442076 39982.0 45846.0 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0 47915.0 42100.0
3003 14081 67306 2013 5 2 20442076 59710.0 39982.0 45846.0 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0 47915.0
2542 10854 67306 2013 5 3 20442076 45733.0 59710.0 39982.0 45846.0 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0
4170 20762 57592 2013 6 1 20442076 45846.0 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0 47915.0 42100.0 41805.0
3703 17541 57592 2013 6 2 20442076 39982.0 45846.0 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0 47915.0 42100.0
3236 14314 57592 2013 6 3 20442076 59710.0 39982.0 45846.0 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0 47915.0
4870 24215 61601 2013 7 1 20442076 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0 47915.0 42100.0 41805.0 35772.0
4406 20998 61601 2013 7 2 20442076 45846.0 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0 47915.0 42100.0 41805.0

Notice that the f feature values for a single item_id are very similar. Let's take the slice with shift=1.


In [39]:
sorted_train[(sorted_train['shift'] == 1) & (sorted_train['item_id'] == check_id)].head(20)

Out[39]:
Num y year week shift item_id f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12 f13 f14 f15 f16 f17 f18 f19 f20 f21 f22 f23 f24 f25 f26 f27 f28 f29 f30
0 0 123438 2012 52 1 20442076 4915.0 38056.0 40185.0 45733.0 59710.0 39982.0 45846.0 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0
698 3470 49217 2013 1 1 20442076 38056.0 40185.0 45733.0 59710.0 39982.0 45846.0 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0
1394 6937 34819 2013 2 1 20442076 40185.0 45733.0 59710.0 39982.0 45846.0 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0
2083 10395 77143 2013 3 1 20442076 45733.0 59710.0 39982.0 45846.0 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0
2787 13865 67781 2013 4 1 20442076 59710.0 39982.0 45846.0 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0 47915.0
3470 17308 67306 2013 5 1 20442076 39982.0 45846.0 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0 47915.0 42100.0
4170 20762 57592 2013 6 1 20442076 45846.0 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0 47915.0 42100.0 41805.0
4870 24215 61601 2013 7 1 20442076 43680.0 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0 47915.0 42100.0 41805.0 35772.0
5564 27662 63194 2013 8 1 20442076 48325.0 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0 47915.0 42100.0 41805.0 35772.0 38262.0
6258 31114 71711 2013 9 1 20442076 42685.0 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0 47915.0 42100.0 41805.0 35772.0 38262.0 39251.0
6954 34563 53761 2013 10 1 20442076 40605.0 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0 47915.0 42100.0 41805.0 35772.0 38262.0 39251.0 44541.0
7639 38001 60075 2013 11 1 20442076 44601.0 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0 47915.0 42100.0 41805.0 35772.0 38262.0 39251.0 44541.0 33392.0
8337 41448 67394 2013 12 1 20442076 41965.0 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0 47915.0 42100.0 41805.0 35772.0 38262.0 39251.0 44541.0 33392.0 37314.0
9026 44889 64515 2013 13 1 20442076 56221.0 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0 47915.0 42100.0 41805.0 35772.0 38262.0 39251.0 44541.0 33392.0 37314.0 41860.0
9716 48330 64697 2013 14 1 20442076 34260.0 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0 47915.0 42100.0 41805.0 35772.0 38262.0 39251.0 44541.0 33392.0 37314.0 41860.0 40072.0
10397 51764 60650 2013 15 1 20442076 39914.0 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0 47915.0 42100.0 41805.0 35772.0 38262.0 39251.0 44541.0 33392.0 37314.0 41860.0 40072.0 40185.0
11088 55209 65919 2013 16 1 20442076 42322.0 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0 47915.0 42100.0 41805.0 35772.0 38262.0 39251.0 44541.0 33392.0 37314.0 41860.0 40072.0 40185.0 37671.0
11770 58644 81232 2013 17 1 20442076 48903.0 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0 47915.0 42100.0 41805.0 35772.0 38262.0 39251.0 44541.0 33392.0 37314.0 41860.0 40072.0 40185.0 37671.0 40944.0
12454 62083 63063 2013 18 1 20442076 42090.0 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0 47915.0 42100.0 41805.0 35772.0 38262.0 39251.0 44541.0 33392.0 37314.0 41860.0 40072.0 40185.0 37671.0 40944.0 50455.0
13141 65522 64683 2013 19 1 20442076 36690.0 39423.0 41765.0 52590.0 31452.0 44420.0 41865.0 52705.0 36102.0 44163.0 45239.0 76670.0 30570.0 21627.0 47915.0 42100.0 41805.0 35772.0 38262.0 39251.0 44541.0 33392.0 37314.0 41860.0 40072.0 40185.0 37671.0 40944.0 50455.0 39170.0

For a single item_id with shift=1, the feature values coincide along the anti-diagonals: each row is the previous row shifted one position over. Let's remember this.
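
This pattern is easy to verify programmatically (a sketch for check_id: with shift=1, every row's f1-f29 must equal the previous row's f2-f30):

one_id = sorted_train[(sorted_train['item_id'] == check_id) &
                      (sorted_train['shift'] == 1)]
vals = one_id[['f%d' % i for i in xrange(1, 31)]].values
# each row should be the previous row shifted one position to the left
print (vals[1:, :-1] == vals[:-1, 1:]).all()  # expected: True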

Let's plot the target variable together with, for example, the last feature f30, keeping in mind that all the other features coincide with it along the diagonal.


In [40]:
first_id_data = sorted_train[sorted_train['item_id']==check_id]
first_id_data = first_id_data[first_id_data['shift']==1]

time = np.arange(len(first_id_data))

In [41]:
plt.figure(figsize=(16, 10))
plt.plot(time, first_id_data['y'], label='y')
plt.plot(time, first_id_data['f30'], label='f30')
plt.xlabel('time')
plt.grid(True)
plt.legend(loc='best')
plt.show()


Clearly the values coincide up to a multiplicative factor and a shift of one step in time.

So the answer is the f30 values multiplied by an estimated coefficient between y and f30. But for the last few weeks of each item_id there is no corresponding f30 value (because of the time shift), so for those weeks we use the predictions obtained earlier.
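
For instance, the coefficient can be estimated per item as the median ratio between next week's f30 and the current y (a compact sketch of the idea; the full submission code below also handles the final test weeks):

one_id = train[(train['item_id'] == check_id) & (train['shift'] == 1)]
one_id = one_id.sort_values(['year', 'week'])
# week t's sales y reappear (rescaled) as week t+1's f30
w0 = (one_id['f30'].shift(-1) / one_id['y']).median()
print w0  # prediction: y ~= next week's f30 / w0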


In [48]:
boost_sample = pd.read_csv('boost_submission_zero.csv')
boost_sample = dict(boost_sample.values)

In [49]:
uids = test['item_id'].unique()
train = train[['Num', 'item_id', 'shift', 'week', 'year', 'f30', 'y']]
test = test[['Num', 'item_id', 'shift', 'week', 'year', 'f30']]

def get_table(item_id):
    id_table_train = train[train['item_id'] == item_id]
    id_table_test = test[test['item_id'] == item_id]

    id_table_train = id_table_train.sort_values(['year', 'week', 'shift'])
    id_table_train = id_table_train[id_table_train['shift'] == 1]
    id_table_test_full = id_table_test.sort_values(['year', 'week', 'shift'])

    # .copy() so that adding the Extra_Num columns below does not trigger
    # SettingWithCopyWarning on a view of id_table_test_full
    id_table_test = id_table_test_full[id_table_test_full['shift'] == 1].copy()

    # remember the submission row numbers of the shift=2 and shift=3 duplicates
    id_table_test_shift = id_table_test_full[id_table_test_full['shift'] == 2]
    id_table_test['Extra_Num_2'] = id_table_test_shift['Num'].values

    id_table_test_shift = id_table_test_full[id_table_test_full['shift'] == 3]
    id_table_test['Extra_Num_3'] = id_table_test_shift['Num'].values

    full_table = id_table_train.append(id_table_test)

    full_table = full_table.drop(['year', 'week', 'shift', 'item_id'], axis=1)
    # shift f30 up one row so that week t's y is aligned with week t+1's f30
    full_table['f30'] = (list(full_table['f30']) + [None])[1:]

    return full_table

In [50]:
with open('submission.csv', 'w') as f:
    f.write('Num,y\n')
    for item_id in uids:
        table = get_table(item_id=item_id)
        # estimate the per-item multiplier w0 on the train rows (no Extra_Num_2 there)
        w = []
        for row in table[table['Extra_Num_2'].isnull()].values:
            try:
                w.append(row[-2] / row[-1])
            except TypeError:
                print table
        w0 = np.median(w)
        # test rows: use the rescaled f30 where it exists, otherwise fall back on
        # the median of the earlier boost predictions for all three shifts
        table = table.drop(['y'], axis=1)[table['Extra_Num_2'].notnull()]
        for row in table.values:
            if not np.isnan(row[-1]):
                y = row[-1] / w0
            else:
                y = np.median([boost_sample[row[0]], boost_sample[row[1]], boost_sample[row[2]]])
            f.write('{},{}\n'.format(int(row[0]), y))
            f.write('{},{}\n'.format(int(row[1]), y))
            f.write('{},{}\n'.format(int(row[2]), y))



Having tried the 3 xgboost variants, the first one turns out to work somewhat better: the public leaderboard SMAPE scores were 8.06, 13, and 8.37. Thus the best results came from the simple model with well-tuned parameters plus careful work with the data.

Extra question


In [2]:
train['y'].mean()


Out[2]:
198575.91203058365

In [5]:
y = pd.read_csv('sample_submission.tsv')
y.head()


Out[5]:
Num y
0 348622 198575.912031
1 348623 198575.912031
2 348624 198575.912031
3 348625 198575.912031
4 348626 198575.912031

So sample_submission.tsv is simply the mean of y over the train set.
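
A one-line check of this (assuming numpy is imported as np in this session):

print np.allclose(y['y'], train['y'].mean())  # expected: True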

