1 Rossmann GBT Modeling
- 1.1 Data Preparation
- 1.2 Model Training



In [1]:

    
from jupyterthemes import get_themes
from jupyterthemes.stylefx import set_nb_theme
themes = get_themes()
set_nb_theme(themes[3])









    Out[1]:



In [2]:

    
# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format='retina'

import os
import json
import time
import numpy as np
import pandas as pd

%watermark -a 'Ethen' -d -t -v -p numpy,pandas,pyarrow,sklearn









    



Ethen 2019-08-09 13:25:04 

CPython 3.6.4
IPython 7.7.0

numpy 1.17.0
pandas 0.25.0
pyarrow 0.14.1
sklearn 0.21.2

Rossmann GBT Modeling

Data Preparation

We've done most of our data preparation and feature engineering in the previous notebook. We'll still perform a few additional steps here, but this notebook focuses on getting the data ready for fitting a Gradient Boosted Tree model. For the model, we will be leveraging lightgbm.



In [3]:

    
data_dir = 'cleaned_data'
path_train = os.path.join(data_dir, 'train_clean.parquet')
path_test = os.path.join(data_dir, 'test_clean.parquet')
engine = 'pyarrow'

df_train = pd.read_parquet(path_train, engine)
df_test = pd.read_parquet(path_test, engine)
print('train dimension: ', df_train.shape)
print('test dimension: ', df_test.shape)
df_train.head()









    



train dimension:  (1017209, 71)
test dimension:  (41088, 70)






    Out[3]:







  
    
      
	Store	DayOfWeek	Date	Sales	Customers	Open	Promo	StateHoliday	SchoolHoliday	Year	...	CompetitionMonthsOpen	Promo2Since	Promo2Days	Promo2Weeks	AfterSchoolHoliday	AfterStateHoliday	AfterPromo	BeforeSchoolHoliday	BeforeStateHoliday	BeforePromo
0	1	5	2015-07-31	5263	555	1	1	False	1	2015	...	24	1900-01-01	42214	25	0	57	0	0	-48	0
1	2	5	2015-07-31	6064	625	1	1	False	1	2015	...	24	2010-03-29	1950	25	0	67	0	0	0	0
2	3	5	2015-07-31	8314	821	1	1	False	1	2015	...	24	2011-04-04	1579	25	0	57	0	0	-48	0
3	4	5	2015-07-31	13995	1498	1	1	False	1	2015	...	24	1900-01-01	42214	25	0	67	0	0	0	0
4	5	5	2015-07-31	4822	559	1	1	False	1	2015	...	3	1900-01-01	42214	25	0	57	0	0	0	0
    
  

5 rows × 71 columns

We've pulled most of our configurable parameters out into a json configuration file. In the ideal scenario, we can move all of our code into a python script and only change the configuration file to experiment with different settings and see which one leads to the best overall performance.
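As a rough sketch of the script-driven workflow described above, the config loading could be wrapped in a small function that fails early when a required section is missing. The names `load_config` and `REQUIRED_KEYS` are hypothetical and not part of this notebook; the section names mirror the configuration printed in the next cell.

```python
import json
import os
import tempfile

# top-level sections that this notebook reads from the configuration file
REQUIRED_KEYS = [
    'columns', 'model_task', 'model_type', 'model_parameters',
    'model_hyper_parameters', 'model_fit_parameters', 'search_parameters'
]


def load_config(config_path):
    """Load a json config and fail early if a top-level section is missing."""
    with open(config_path) as f:
        config = json.load(f)

    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError('missing config sections: {}'.format(missing))

    return config


# tiny stand-in config written to a temporary file for demonstration
demo_config = {key: {} for key in REQUIRED_KEYS}
demo_config['model_type'] = 'lgb'
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_config.json')
    with open(demo_path, 'w') as f:
        json.dump(demo_config, f)

    loaded_config = load_config(demo_path)

print(loaded_config['model_type'])
```

Validating up front means a typo in the configuration file surfaces before any expensive training starts.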



In [4]:

    
config_path = os.path.join('config', 'gbt_training_template.json')
with open(config_path) as f:
    config_file = json.load(f)
    
config_file









    Out[4]:





{'columns': {'num_cols_pattern': ['CloudCover',
   'CompetitionDistance',
   'Max_Humidity',
   'Max_TemperatureC',
   'Max_Wind_SpeedKm_h',
   'Mean_Humidity',
   'Mean_TemperatureC',
   'Mean_Wind_SpeedKm_h',
   'Min_Humidity',
   'Min_TemperatureC',
   'Promo',
   'SchoolHoliday',
   'trend',
   'trend_DE',
   'AfterSchoolHoliday',
   'AfterStateHoliday',
   'AfterPromo',
   'BeforeSchoolHoliday',
   'BeforeStateHoliday',
   'BeforePromo'],
  'cat_cols_pattern': ['Assortment',
   'CompetitionMonthsOpen',
   'CompetitionOpenSinceYear',
   'Day',
   'DayOfWeek',
   'Events',
   'Month',
   'Promo2SinceYear',
   'Promo2Weeks',
   'PromoInterval',
   'State',
   'StateHoliday',
   'Store',
   'StoreType',
   'Week',
   'Year'],
  'id_cols': ['Id'],
  'label_col': 'Sales',
  'weights_col': None},
 'model_task': 'regression',
 'model_type': 'lgb',
 'model_parameters': {'lgb': {'n_jobs': -1,
   'learning_rate': 0.01,
   'n_estimators': 3000,
   'min_data_in_leaf': 100}},
 'model_hyper_parameters': {'lgb': {'max_depth': [3, 5, 8, 10, 12],
   'colsampl_bytree': [0.7, 0.8, 0.9],
   'subsample': [0.7, 0.8, 0.9]}},
 'model_fit_parameters': {'lgb': {'eval_metric': 'l2',
   'early_stopping_rounds': 5,
   'verbose': 100}},
 'search_parameters': {'n_iter': 3,
  'n_jobs': -1,
  'verbose': 1,
  'scoring': 'neg_mean_squared_error',
  'random_state': 1234,
  'return_train_score': True}}



In [5]:

    
# extract settings from the configuration file into local variables
columns = config_file['columns']
num_cols = columns['num_cols_pattern']
cat_cols = columns['cat_cols_pattern']
id_cols = columns['id_cols']
label_col = columns['label_col']
weights_col = columns['weights_col']

model_task = config_file['model_task']
model_type = config_file['model_type']
model_parameters = config_file['model_parameters'][model_type]
model_hyper_parameters = config_file['model_hyper_parameters'][model_type]
model_fit_parameters = config_file['model_fit_parameters'][model_type]
search_parameters = config_file['search_parameters']

Here, we will remove all records where the store had zero sales / was closed (feel free to experiment with keeping the zero-sales records and see if it improves performance).

We also perform a train/validation split. The validation split will be used in our hyper-parameter tuning process and for early stopping. Because this is a time series application, where we are trying to predict different stores' daily sales, it's important not to perform a random train/test split, but instead to divide the training and validation sets based on time/date.

Our training data is already sorted by date in decreasing order, hence we can create the validation set by checking how big our test set is and selecting the top-N observations, which gives a validation set of similar size to our test set. We say similar size rather than exact size because we make sure that all the records from the same date fall under either the training or the validation set.
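To make the date-aligned cut concrete, here is a minimal sketch on toy data. The frame `df_toy` and the value of `n_test` are made up for illustration; the indexing logic follows the same idea as the cell below.

```python
import pandas as pd

# toy data sorted by date in decreasing order, three stores per day
dates = pd.to_datetime(['2015-07-03'] * 3 + ['2015-07-02'] * 3 + ['2015-07-01'] * 3)
df_toy = pd.DataFrame({'Date': dates, 'Sales': range(9)})

# pretend the test set has 4 rows, so we want roughly 4 validation rows
n_test = 4

# the date found at position n_test marks the oldest date in the validation set
boundary_date = df_toy['Date'].iloc[n_test]

# extend the cut to the last row sharing that date, so no date is split in two
mask = df_toy['Date'] == boundary_date
toy_val_index = df_toy.loc[mask, 'Date'].index.max()

df_val = df_toy.iloc[:toy_val_index + 1]
df_fit = df_toy.iloc[toy_val_index + 1:]
print(toy_val_index, len(df_val), len(df_fit))
```

The validation set ends up with 6 rows instead of 4 because the whole of 2015-07-02 is kept together, which is the "similar size, not exact size" behaviour described above.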



In [6]:

    
df_train = df_train[df_train[label_col] != 0].reset_index(drop=True)

mask = df_train['Date'] == df_train['Date'].iloc[len(df_test)]
val_index = df_train.loc[mask, 'Date'].index.max()
val_index









    Out[6]:





41395

The validation fold we're creating is used for sklearn's PredefinedSplit, where we set the index to 0 for all samples that are part of the validation set, and to -1 for all other samples.
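A small sketch of how sklearn's `PredefinedSplit` interprets such an array (the toy array is made up for illustration): samples marked -1 are never placed in a test fold, so a single 0-valued fold yields exactly one train/validation split.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# first 3 samples form the validation fold (0), the rest are train-only (-1)
toy_fold = np.array([0, 0, 0, -1, -1, -1, -1])
splitter = PredefinedSplit(test_fold=toy_fold)

# each split is a (train indices, validation indices) pair
splits = list(splitter.split())
train_idx, val_idx = splits[0]
print(splitter.get_n_splits(), train_idx, val_idx)
```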



In [7]:

    
val_fold = np.full(df_train.shape[0], fill_value=-1)
val_fold[:(val_index + 1)] = 0
val_fold









    Out[7]:





array([ 0,  0,  0, ..., -1, -1, -1])

Here, we assign the validation fold back to the original dataframe to illustrate the point; this is technically not required for the rest of the pipeline. Notice in the dataframe we've printed out that the last record's date, 2015-06-18, is different from the rest, and the record's val_fold takes on a value of -1. This means that all records on or after 2015-06-19 will become our validation set.



In [8]:

    
df_train['val_fold'] = val_fold
df_train[(val_index - 2):(val_index + 2)]









    Out[8]:







  
    
      
	Store	DayOfWeek	Date	Sales	Customers	Open	Promo	StateHoliday	SchoolHoliday	Year	...	Promo2Since	Promo2Days	Promo2Weeks	AfterSchoolHoliday	AfterStateHoliday	AfterPromo	BeforeSchoolHoliday	BeforeStateHoliday	BeforePromo	val_fold
41393	1113	5	2015-06-19	7114	700	1	1	False	0	2015	...	1900-01-01	42172	25	35	25	0	-31	-90	0	0
41394	1114	5	2015-06-19	21834	3211	1	1	False	0	2015	...	1900-01-01	42172	25	35	25	0	-27	-90	0	0
41395	1115	5	2015-06-19	8291	535	1	1	False	0	2015	...	2012-05-28	1117	25	70	15	0	-38	-90	0	0
41396	1	4	2015-06-18	4645	498	1	1	False	0	2015	...	1900-01-01	42171	25	69	14	0	-39	-91	0	-1
    
  

4 rows × 72 columns

We proceed to extract the necessary columns, both numerical and categorical, that we'll use for modeling.



In [9]:

    
# the model id is used as the indicator when saving the model
model_id = 'gbt'
input_cols = num_cols + cat_cols

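# copy the column subset so the assignments below modify a new dataframe
# rather than a slice of the original (avoids pandas' SettingWithCopyWarning)
df_train = df_train[input_cols + [label_col]].copy()

# we will perform the modeling at the log-scale
df_train[label_col] = np.log(df_train[label_col])
df_test = df_test[input_cols + id_cols].copy()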

print('train dimension: ', df_train.shape)
print('test dimension: ', df_test.shape)
df_train.head()









    



train dimension:  (844338, 37)
test dimension:  (41088, 37)






    Out[9]:







  
    
      
	CloudCover	CompetitionDistance	Max_Humidity	Max_TemperatureC	Max_Wind_SpeedKm_h	Mean_Humidity	Mean_TemperatureC	Mean_Wind_SpeedKm_h	Min_Humidity	Min_TemperatureC	...	Promo2SinceYear	Promo2Weeks	PromoInterval	State	StateHoliday	Store	StoreType	Week	Year	Sales
0	1.0	1270.0	98	23	24	54	16	11	18	8	...	1900	25	None	HE	False	1	c	31	2015	8.568456
1	4.0	570.0	100	19	14	62	13	11	25	7	...	2010	25	Jan,Apr,Jul,Oct	TH	False	2	a	31	2015	8.710125
2	2.0	14130.0	100	21	14	61	13	5	24	6	...	2011	25	Jan,Apr,Jul,Oct	NW	False	3	a	31	2015	9.025696
3	6.0	620.0	94	19	23	61	14	16	30	9	...	1900	25	None	BE	False	4	c	31	2015	9.546455
4	4.0	29910.0	82	20	14	55	15	11	26	10	...	1900	25	None	SN	False	5	a	31	2015	8.480944
    
  

5 rows × 37 columns



In [10]:

    
for cat_col in cat_cols:
    df_train[cat_col] = df_train[cat_col].astype('category')
    df_test[cat_col] = df_test[cat_col].astype('category')

df_train.head()









    Out[10]:







  
    
      
	CloudCover	CompetitionDistance	Max_Humidity	Max_TemperatureC	Max_Wind_SpeedKm_h	Mean_Humidity	Mean_TemperatureC	Mean_Wind_SpeedKm_h	Min_Humidity	Min_TemperatureC	...	Promo2SinceYear	Promo2Weeks	PromoInterval	State	StateHoliday	Store	StoreType	Week	Year	Sales
0	1.0	1270.0	98	23	24	54	16	11	18	8	...	1900	25	NaN	HE	False	1	c	31	2015	8.568456
1	4.0	570.0	100	19	14	62	13	11	25	7	...	2010	25	Jan,Apr,Jul,Oct	TH	False	2	a	31	2015	8.710125
2	2.0	14130.0	100	21	14	61	13	5	24	6	...	2011	25	Jan,Apr,Jul,Oct	NW	False	3	a	31	2015	9.025696
3	6.0	620.0	94	19	23	61	14	16	30	9	...	1900	25	NaN	BE	False	4	c	31	2015	9.546455
4	4.0	29910.0	82	20	14	55	15	11	26	10	...	1900	25	NaN	SN	False	5	a	31	2015	8.480944
    
  

5 rows × 37 columns
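One thing to keep in mind with the cast above: when train and test are converted to the category dtype separately, pandas infers the set of categories for each frame independently, so the same label can receive a different integer code in each. Whether that matters depends on how the downstream library handles pandas categoricals. A minimal sketch on made-up toy frames shows the effect, and one way to keep the codes aligned by reusing the training categories:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

toy_train = pd.DataFrame({'StoreType': ['a', 'c', 'd']})
toy_test = pd.DataFrame({'StoreType': ['c', 'd']})

# casting separately: 'c' gets code 1 in train but code 0 in test
separate_train = toy_train['StoreType'].astype('category')
separate_test = toy_test['StoreType'].astype('category')

# casting test with a dtype built from the training categories keeps codes aligned
shared_dtype = CategoricalDtype(categories=separate_train.cat.categories)
shared_test = toy_test['StoreType'].astype(shared_dtype)

print(list(separate_test.cat.codes), list(shared_test.cat.codes))
```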

Model Training

We use a helper class to train a boosted tree model, generate the prediction on our test set, create the submission file, check the feature importance of the tree-based model and also make sure we can save and re-load the model.
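The internals of `GBTPipeline` are not shown in this notebook. As a rough sketch of the kind of search it appears to run, the snippet below combines sklearn's `RandomizedSearchCV` with a `PredefinedSplit`. The synthetic data is made up, and sklearn's `GradientBoostingRegressor` is used purely as a stand-in for lightgbm so the example stays self-contained.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import PredefinedSplit, RandomizedSearchCV

# synthetic regression data standing in for the real features and label
rng = np.random.RandomState(1234)
X = rng.normal(size=(200, 3))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

# first 50 rows play the role of the most recent dates (validation fold)
toy_val_fold = np.full(X.shape[0], fill_value=-1)
toy_val_fold[:50] = 0

search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(n_estimators=20, random_state=1234),
    param_distributions={'max_depth': [2, 3, 4], 'subsample': [0.7, 0.8, 0.9]},
    n_iter=3,                          # sample 3 candidate settings
    cv=PredefinedSplit(toy_val_fold),  # a single, fixed train/validation split
    scoring='neg_mean_squared_error',
    random_state=1234,
    return_train_score=True
)
search.fit(X, y)
print(search.best_params_, len(search.cv_results_['params']))
```

With one predefined fold and `n_iter=3`, the search performs 3 fits in total, which matches the "Fitting 1 folds for each of 3 candidates" message in the training log. As a side note, the configuration spells one hyper-parameter `colsampl_bytree`, whereas LightGBM's scikit-learn wrapper names it `colsample_bytree`, so that setting may not have been applied as intended.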



In [11]:

    
from gbt_module.model import GBTPipeline

model = GBTPipeline(input_cols, cat_cols, label_col, weights_col,
                    model_task, model_id, model_type, model_parameters,
                    model_hyper_parameters, search_parameters)
model









    Out[11]:





GBTPipeline(cat_cols=['Assortment', 'CompetitionMonthsOpen',
                      'CompetitionOpenSinceYear', 'Day', 'DayOfWeek', 'Events',
                      'Month', 'Promo2SinceYear', 'Promo2Weeks',
                      'PromoInterval', 'State', 'StateHoliday', 'Store',
                      'StoreType', 'Week', 'Year'],
            input_cols=['CloudCover', 'CompetitionDistance', 'Max_Humidity',
                        'Max_TemperatureC', 'Max_Wind_SpeedKm_h',
                        'Mean_Humidity', 'Mean...
                                    'max_depth': [3, 5, 8, 10, 12],
                                    'subsample': [0.7, 0.8, 0.9]},
            model_id='gbt',
            model_parameters={'learning_rate': 0.01, 'min_data_in_leaf': 100,
                              'n_estimators': 3000, 'n_jobs': -1},
            model_task='regression', model_type='lgb',
            search_parameters={'n_iter': 3, 'n_jobs': -1, 'random_state': 1234,
                               'return_train_score': True,
                               'scoring': 'neg_mean_squared_error',
                               'verbose': 1},
            weights_col=None)



In [12]:

    
start = time.time()
model.fit(df_train, val_fold, model_fit_parameters)
elapsed = time.time() - start
print('elapsed minutes: ', elapsed / 60)









    



Fitting 1 folds for each of 3 candidates, totalling 3 fits






    



/Users/mingyuliu/anaconda3/lib/python3.6/site-packages/lightgbm/__init__.py:46: UserWarning: Starting from version 2.2.1, the library file in distribution wheels for macOS is built by the Apple Clang (Xcode_8.3.3) compiler.
This means that in case of installing LightGBM from PyPI via the ``pip install lightgbm`` command, you don't need to install the gcc compiler anymore.
Instead of that, you need to install the OpenMP library, which is required for running LightGBM on the system with the Apple Clang compiler.
You can install the OpenMP library by the following command: ``brew install libomp``.
  "You can install the OpenMP library by the following command: ``brew install libomp``.", UserWarning)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed: 20.0min finished
/Users/mingyuliu/anaconda3/lib/python3.6/site-packages/lightgbm/basic.py:1209: UserWarning: categorical_feature in Dataset is overridden.
New categorical_feature is ['Assortment', 'CompetitionMonthsOpen', 'CompetitionOpenSinceYear', 'Day', 'DayOfWeek', 'Events', 'Month', 'Promo2SinceYear', 'Promo2Weeks', 'PromoInterval', 'State', 'StateHoliday', 'Store', 'StoreType', 'Week', 'Year']
  'New categorical_feature is {}'.format(sorted(list(categorical_feature))))






    



Training until validation scores don't improve for 5 rounds.
[100]	valid_0's l2: 0.0718742	valid_0's l2: 0.0718742	valid_1's l2: 0.0654301	valid_1's l2: 0.0654301
[200]	valid_0's l2: 0.0450121	valid_0's l2: 0.0450121	valid_1's l2: 0.0404465	valid_1's l2: 0.0404465
[300]	valid_0's l2: 0.0336353	valid_0's l2: 0.0336353	valid_1's l2: 0.0314702	valid_1's l2: 0.0314702
[400]	valid_0's l2: 0.027701	valid_0's l2: 0.027701	valid_1's l2: 0.0269131	valid_1's l2: 0.0269131
[500]	valid_0's l2: 0.0240796	valid_0's l2: 0.0240796	valid_1's l2: 0.0240597	valid_1's l2: 0.0240597
[600]	valid_0's l2: 0.0218724	valid_0's l2: 0.0218724	valid_1's l2: 0.0218025	valid_1's l2: 0.0218025
[700]	valid_0's l2: 0.0201603	valid_0's l2: 0.0201603	valid_1's l2: 0.0201185	valid_1's l2: 0.0201185
[800]	valid_0's l2: 0.0184851	valid_0's l2: 0.0184851	valid_1's l2: 0.0184089	valid_1's l2: 0.0184089
[900]	valid_0's l2: 0.0168288	valid_0's l2: 0.0168288	valid_1's l2: 0.0166787	valid_1's l2: 0.0166787
[1000]	valid_0's l2: 0.015335	valid_0's l2: 0.015335	valid_1's l2: 0.0151688	valid_1's l2: 0.0151688
[1100]	valid_0's l2: 0.0142658	valid_0's l2: 0.0142658	valid_1's l2: 0.0141122	valid_1's l2: 0.0141122
[1200]	valid_0's l2: 0.0135508	valid_0's l2: 0.0135508	valid_1's l2: 0.013411	valid_1's l2: 0.013411
[1300]	valid_0's l2: 0.012954	valid_0's l2: 0.012954	valid_1's l2: 0.0128038	valid_1's l2: 0.0128038
[1400]	valid_0's l2: 0.0124204	valid_0's l2: 0.0124204	valid_1's l2: 0.0122977	valid_1's l2: 0.0122977
[1500]	valid_0's l2: 0.0119961	valid_0's l2: 0.0119961	valid_1's l2: 0.011941	valid_1's l2: 0.011941
[1600]	valid_0's l2: 0.01163	valid_0's l2: 0.01163	valid_1's l2: 0.0115995	valid_1's l2: 0.0115995
[1700]	valid_0's l2: 0.0113139	valid_0's l2: 0.0113139	valid_1's l2: 0.0112803	valid_1's l2: 0.0112803
[1800]	valid_0's l2: 0.0110184	valid_0's l2: 0.0110184	valid_1's l2: 0.0109884	valid_1's l2: 0.0109884
[1900]	valid_0's l2: 0.0107764	valid_0's l2: 0.0107764	valid_1's l2: 0.0107527	valid_1's l2: 0.0107527
[2000]	valid_0's l2: 0.0105697	valid_0's l2: 0.0105697	valid_1's l2: 0.0105339	valid_1's l2: 0.0105339
[2100]	valid_0's l2: 0.0103475	valid_0's l2: 0.0103475	valid_1's l2: 0.0103047	valid_1's l2: 0.0103047
[2200]	valid_0's l2: 0.0101752	valid_0's l2: 0.0101752	valid_1's l2: 0.0101483	valid_1's l2: 0.0101483
[2300]	valid_0's l2: 0.0100167	valid_0's l2: 0.0100167	valid_1's l2: 0.00999843	valid_1's l2: 0.00999843
[2400]	valid_0's l2: 0.00986898	valid_0's l2: 0.00986898	valid_1's l2: 0.00983577	valid_1's l2: 0.00983577
[2500]	valid_0's l2: 0.00972949	valid_0's l2: 0.00972949	valid_1's l2: 0.00969056	valid_1's l2: 0.00969056
[2600]	valid_0's l2: 0.00960752	valid_0's l2: 0.00960752	valid_1's l2: 0.00955977	valid_1's l2: 0.00955977
[2700]	valid_0's l2: 0.00948489	valid_0's l2: 0.00948489	valid_1's l2: 0.00944049	valid_1's l2: 0.00944049
[2800]	valid_0's l2: 0.00936427	valid_0's l2: 0.00936427	valid_1's l2: 0.00928775	valid_1's l2: 0.00928775
[2900]	valid_0's l2: 0.00924005	valid_0's l2: 0.00924005	valid_1's l2: 0.00915616	valid_1's l2: 0.00915616
[3000]	valid_0's l2: 0.00912308	valid_0's l2: 0.00912308	valid_1's l2: 0.00904488	valid_1's l2: 0.00904488
Did not meet early stopping. Best iteration is:
[3000]	valid_0's l2: 0.00912308	valid_0's l2: 0.00912308	valid_1's l2: 0.00904488	valid_1's l2: 0.00904488
elapsed minutes:  22.69019781748454



In [13]:

    
pd.DataFrame(model.model_tuned_.cv_results_)









    Out[13]:







  
    
      
	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_subsample	param_max_depth	param_colsampl_bytree	params	split0_test_score	mean_test_score	std_test_score	rank_test_score	split0_train_score	mean_train_score	std_train_score
0	338.971463	0.0	23.137303	0.0	0.7	10	0.9	{'subsample': 0.7, 'max_depth': 10, 'colsampl_...	-0.015877	-0.015877	0.0	2	-0.011191	-0.011191	0.0
1	452.571008	0.0	62.282989	0.0	0.8	5	0.9	{'subsample': 0.8, 'max_depth': 5, 'colsampl_b...	-0.016854	-0.016854	0.0	3	-0.012256	-0.012256	0.0
2	339.133200	0.0	24.371682	0.0	0.9	12	0.9	{'subsample': 0.9, 'max_depth': 12, 'colsampl_...	-0.015859	-0.015859	0.0	1	-0.011161	-0.011161	0.0
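Because the search uses the `neg_mean_squared_error` scoring, the scores are negated mean squared errors on the log scale, and higher (closer to zero) is better. A small sketch, using the three `mean_test_score` values from the table above, converts them into root mean squared errors that are easier to read:

```python
import numpy as np

# mean_test_score values copied from the cv_results_ table above
neg_mse_scores = np.array([-0.015877, -0.016854, -0.015859])

# flip the sign to recover the MSE, then take the root to get the RMSE (log scale)
rmse_log_scale = np.sqrt(-neg_mse_scores)

# the best candidate is the one with the largest (least negative) score
best_candidate = int(np.argmax(neg_mse_scores))
print(rmse_log_scale.round(4), best_candidate)
```

The best candidate is row 2, in line with its `rank_test_score` of 1. Since the label was log-transformed, an RMSE of this size can be loosely read as a typical relative error on the original sales scale, though that reading is only an approximation.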



In [14]:

    
# we logged our label, remember to exponentiate it back to the original scale
prediction_test = model.predict(df_test[input_cols])
df_test[label_col] = np.exp(prediction_test)

submission_cols = id_cols + [label_col]
df_test[submission_cols] = df_test[submission_cols].astype('int')

submission_dir = 'submission'
# exist_ok=True already makes this a no-op when the directory exists
os.makedirs(submission_dir, exist_ok=True)

submission_file = 'rossmann_submission_{}.csv'.format(model_id)
submission_path = os.path.join(submission_dir, submission_file)
df_test[submission_cols].to_csv(submission_path, index=False)

df_test[submission_cols].head()
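The log / exponentiate round trip used for the label can be sketched on a few sales values taken from the training data shown earlier. The example also illustrates that `astype('int')` truncates rather than rounds; the value 4031.9 is made up for illustration.

```python
import numpy as np

sales = np.array([5263.0, 6064.0, 8314.0])

# np.log is only finite for strictly positive values, so the transform
# relies on the zero-sales rows having been removed beforehand
log_sales = np.log(sales)

# np.exp undoes the transform, up to floating point error
restored = np.exp(log_sales)

# astype('int') truncates toward zero, whereas rounding first picks the nearest integer
float_prediction = np.array([4031.9])
truncated = float_prediction.astype('int')
rounded = np.round(float_prediction).astype('int')
print(log_sales.round(6), truncated, rounded)
```

Whether to truncate or round before writing the submission is a judgment call; the difference is at most one unit of sales per row.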



In [15]:

    
model.get_feature_importance()









    Out[15]:





[('Store', 0.6707),
 ('Promo', 0.1081),
 ('BeforePromo', 0.0653),
 ('DayOfWeek', 0.0574),
 ('Week', 0.0463),
 ('Day', 0.019),
 ('AfterStateHoliday', 0.0049),
 ('BeforeStateHoliday', 0.0045),
 ('CompetitionDistance', 0.0041),
 ('StoreType', 0.0039),
 ('Month', 0.003),
 ('Year', 0.0022),
 ('State', 0.0016),
 ('CompetitionOpenSinceYear', 0.0015),
 ('AfterPromo', 0.001)]
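The values above look like normalized shares that sum to roughly 1, sorted in decreasing order. How `get_feature_importance` computes them is not shown in this notebook, so the following is only a sketch of one plausible approach. The function name `rank_feature_importance`, the `threshold` cut-off and the raw importance values are all assumptions made for illustration.

```python
import numpy as np


def rank_feature_importance(feature_names, raw_importances, threshold=0.001):
    """Normalize raw importances to sum to 1 and return (name, share) pairs, largest first."""
    raw = np.asarray(raw_importances, dtype=float)
    shares = raw / raw.sum()

    # argsort is ascending, so reverse it to list the largest share first
    order = np.argsort(shares)[::-1]
    return [(feature_names[i], round(float(shares[i]), 4))
            for i in order if shares[i] >= threshold]


ranked = rank_feature_importance(['Store', 'Promo', 'Year'], [60.0, 30.0, 10.0])
print(ranked)
```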



In [16]:

    
model_checkpoint = os.path.join('models', model_id + '.pkl')
model.save(model_checkpoint)

loaded_model = GBTPipeline.load(model_checkpoint)

# print the cv_results_ again to ensure the checkpointing works
pd.DataFrame(loaded_model.model_tuned_.cv_results_)









    Out[16]:







  
    
      
	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_subsample	param_max_depth	param_colsampl_bytree	params	split0_test_score	mean_test_score	std_test_score	rank_test_score	split0_train_score	mean_train_score	std_train_score
0	338.971463	0.0	23.137303	0.0	0.7	10	0.9	{'subsample': 0.7, 'max_depth': 10, 'colsampl_...	-0.015877	-0.015877	0.0	2	-0.011191	-0.011191	0.0
1	452.571008	0.0	62.282989	0.0	0.8	5	0.9	{'subsample': 0.8, 'max_depth': 5, 'colsampl_b...	-0.016854	-0.016854	0.0	3	-0.012256	-0.012256	0.0
2	339.133200	0.0	24.371682	0.0	0.9	12	0.9	{'subsample': 0.9, 'max_depth': 12, 'colsampl_...	-0.015859	-0.015859	0.0	1	-0.011161	-0.011161	0.0
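The `save` and `load` methods of `GBTPipeline` are not shown here. The `.pkl` extension suggests a pickle-style checkpoint, so the round trip can be sketched with the standard library's `pickle` module. The dictionary below is a made-up stand-in for a fitted model object.

```python
import os
import pickle
import tempfile

# stand-in for a fitted model object; any picklable Python object works here
toy_model = {'model_id': 'gbt', 'best_params': {'max_depth': 12, 'subsample': 0.9}}

with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint = os.path.join(tmp_dir, 'gbt.pkl')

    # serialize the object to disk
    with open(checkpoint, 'wb') as f:
        pickle.dump(toy_model, f)

    # read it back and confirm nothing was lost along the way
    with open(checkpoint, 'rb') as f:
        restored_model = pickle.load(f)

print(restored_model == toy_model)
```

Comparing an attribute of the reloaded object against the original, as done with `cv_results_` above, is a simple check that the checkpoint is complete.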

	Id	Sales
0	1	4031
1	2	7236
2	3	9290
3	4	7094
4	5	7371

