In [1]:
from jupyterthemes import get_themes
from jupyterthemes.stylefx import set_nb_theme
themes = get_themes()
set_nb_theme(themes[3])


Out[1]:

In [2]:
# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format='retina'

import os
import json
import time
import numpy as np
import pandas as pd

%watermark -a 'Ethen' -d -t -v -p numpy,pandas,pyarrow,sklearn


Ethen 2019-08-09 13:25:04 

CPython 3.6.4
IPython 7.7.0

numpy 1.17.0
pandas 0.25.0
pyarrow 0.14.1
sklearn 0.21.2

Rossmann GBT Modeling

Data Preparation

We did most of the data preparation and feature engineering in the previous notebook. We'll perform a few additional steps here, but this notebook focuses on getting the data ready for fitting a Gradient Boosted Tree model. For the model, we will be leveraging lightgbm.


In [3]:
data_dir = 'cleaned_data'
path_train = os.path.join(data_dir, 'train_clean.parquet')
path_test = os.path.join(data_dir, 'test_clean.parquet')
engine = 'pyarrow'

df_train = pd.read_parquet(path_train, engine)
df_test = pd.read_parquet(path_test, engine)
print('train dimension: ', df_train.shape)
print('test dimension: ', df_test.shape)
df_train.head()


train dimension:  (1017209, 71)
test dimension:  (41088, 70)
Out[3]:
Store DayOfWeek Date Sales Customers Open Promo StateHoliday SchoolHoliday Year ... CompetitionMonthsOpen Promo2Since Promo2Days Promo2Weeks AfterSchoolHoliday AfterStateHoliday AfterPromo BeforeSchoolHoliday BeforeStateHoliday BeforePromo
0 1 5 2015-07-31 5263 555 1 1 False 1 2015 ... 24 1900-01-01 42214 25 0 57 0 0 -48 0
1 2 5 2015-07-31 6064 625 1 1 False 1 2015 ... 24 2010-03-29 1950 25 0 67 0 0 0 0
2 3 5 2015-07-31 8314 821 1 1 False 1 2015 ... 24 2011-04-04 1579 25 0 57 0 0 -48 0
3 4 5 2015-07-31 13995 1498 1 1 False 1 2015 ... 24 1900-01-01 42214 25 0 67 0 0 0 0
4 5 5 2015-07-31 4822 559 1 1 False 1 2015 ... 3 1900-01-01 42214 25 0 57 0 0 0 0

5 rows × 71 columns

We've pulled most of our configurable parameters out into a JSON configuration file. Ideally, we can move all of our code into a Python script and change only the configuration file to experiment with different settings and see which one leads to the best overall performance.


In [4]:
config_path = os.path.join('config', 'gbt_training_template.json')
with open(config_path) as f:
    config_file = json.load(f)
    
config_file


Out[4]:
{'columns': {'num_cols_pattern': ['CloudCover',
   'CompetitionDistance',
   'Max_Humidity',
   'Max_TemperatureC',
   'Max_Wind_SpeedKm_h',
   'Mean_Humidity',
   'Mean_TemperatureC',
   'Mean_Wind_SpeedKm_h',
   'Min_Humidity',
   'Min_TemperatureC',
   'Promo',
   'SchoolHoliday',
   'trend',
   'trend_DE',
   'AfterSchoolHoliday',
   'AfterStateHoliday',
   'AfterPromo',
   'BeforeSchoolHoliday',
   'BeforeStateHoliday',
   'BeforePromo'],
  'cat_cols_pattern': ['Assortment',
   'CompetitionMonthsOpen',
   'CompetitionOpenSinceYear',
   'Day',
   'DayOfWeek',
   'Events',
   'Month',
   'Promo2SinceYear',
   'Promo2Weeks',
   'PromoInterval',
   'State',
   'StateHoliday',
   'Store',
   'StoreType',
   'Week',
   'Year'],
  'id_cols': ['Id'],
  'label_col': 'Sales',
  'weights_col': None},
 'model_task': 'regression',
 'model_type': 'lgb',
 'model_parameters': {'lgb': {'n_jobs': -1,
   'learning_rate': 0.01,
   'n_estimators': 3000,
   'min_data_in_leaf': 100}},
 'model_hyper_parameters': {'lgb': {'max_depth': [3, 5, 8, 10, 12],
   'colsampl_bytree': [0.7, 0.8, 0.9],
   'subsample': [0.7, 0.8, 0.9]}},
 'model_fit_parameters': {'lgb': {'eval_metric': 'l2',
   'early_stopping_rounds': 5,
   'verbose': 100}},
 'search_parameters': {'n_iter': 3,
  'n_jobs': -1,
  'verbose': 1,
  'scoring': 'neg_mean_squared_error',
  'random_state': 1234,
  'return_train_score': True}}

In [5]:
# extract settings from the configuration file into local variables
columns = config_file['columns']
num_cols = columns['num_cols_pattern']
cat_cols = columns['cat_cols_pattern']
id_cols = columns['id_cols']
label_col = columns['label_col']
weights_col = columns['weights_col']

model_task = config_file['model_task']
model_type = config_file['model_type']
model_parameters = config_file['model_parameters'][model_type]
model_hyper_parameters = config_file['model_hyper_parameters'][model_type]
model_fit_parameters = config_file['model_fit_parameters'][model_type]
search_parameters = config_file['search_parameters']
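
One thing worth checking when settings live in a configuration file is the spelling of the parameter names. The search space printed above contains the key `colsampl_bytree`, which looks like a misspelling of LightGBM's `colsample_bytree`. Since LightGBM's scikit-learn wrapper accepts extra keyword arguments, a typo like this may pass through without raising an error. The following is a minimal sketch of a sanity check using only the standard library. `KNOWN_PARAMS` is a hand-written subset for illustration and `find_suspect_params` is a hypothetical helper; neither is part of the notebook's `gbt_module`.

```python
import difflib

# Hand-written subset of LightGBM scikit-learn parameter names (assumption:
# extend this list as needed for the parameters you actually tune).
KNOWN_PARAMS = [
    'max_depth', 'num_leaves', 'learning_rate', 'n_estimators',
    'subsample', 'colsample_bytree', 'min_child_samples', 'n_jobs',
]


def find_suspect_params(params, known=KNOWN_PARAMS):
    """Map each unrecognized name to its closest known name (or None)."""
    suspects = {}
    for name in params:
        if name not in known:
            close = difflib.get_close_matches(name, known, n=1)
            suspects[name] = close[0] if close else None
    return suspects


# Same keys as the 'model_hyper_parameters' entry in the config above.
hyper_params = {
    'max_depth': [3, 5, 8, 10, 12],
    'colsampl_bytree': [0.7, 0.8, 0.9],
    'subsample': [0.7, 0.8, 0.9],
}
suspects = find_suspect_params(hyper_params)
print(suspects)
```

Any name reported here deserves a second look before launching a long hyper-parameter search.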

Here, we remove all records where the store had zero sales or was closed. Feel free to experiment with keeping the zero-sales records to see whether it improves performance.

We also perform a train/validation split. The validation split will be used in our hyper-parameter tuning process and for early stopping. Because this is a time series application, where we are trying to predict different stores' daily sales, it's important not to perform a random train/test split, but instead to divide the training and validation sets based on date.

Our training data is already sorted by date in decreasing order, so we can create the validation set by checking how big our test set is and selecting the top-N observations, which gives a validation set of similar size to our test set. We say similar size rather than exact size because we make sure that all the records from the same date fall into either the training set or the validation set, never both.


In [6]:
df_train = df_train[df_train[label_col] != 0].reset_index(drop=True)

mask = df_train['Date'] == df_train['Date'].iloc[len(df_test)]
val_index = df_train.loc[mask, 'Date'].index.max()
val_index


Out[6]:
41395
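
To make the boundary logic easier to follow, here is a small sketch of the same idea on a made-up frame. The toy data, `n_test` and the variable names are assumptions for illustration only; the indexing mirrors the cell above.

```python
import pandas as pd

# Toy frame sorted by date in decreasing order, three rows per day.
dates = pd.to_datetime(['2015-01-04'] * 3 + ['2015-01-03'] * 3 +
                       ['2015-01-02'] * 3 + ['2015-01-01'] * 3)
toy = pd.DataFrame({'Date': dates, 'Sales': range(12)})

n_test = 4  # pretend the test set has 4 rows

# The date found at position n_test is the oldest day kept for validation.
boundary_date = toy['Date'].iloc[n_test]
mask = toy['Date'] == boundary_date

# Largest row index sharing that date, so the whole day stays together.
toy_val_index = toy.loc[mask, 'Date'].index.max()

toy_val = toy.iloc[:toy_val_index + 1]
toy_train = toy.iloc[toy_val_index + 1:]
print(toy_val_index, len(toy_val), len(toy_train))
```

The validation set ends up with six rows rather than four, because the split is extended to the end of the boundary day instead of cutting that day in half.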

The validation fold we're creating is used for sklearn's PredefinedSplit, where we set the fold index to 0 for all samples that are part of the validation set and to -1 for all other samples, which are then always kept in the training set.


In [7]:
val_fold = np.full(df_train.shape[0], fill_value=-1)
val_fold[:(val_index + 1)] = 0
val_fold


Out[7]:
array([ 0,  0,  0, ..., -1, -1, -1])
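
As a quick illustration of how scikit-learn interprets such an array, the sketch below builds a `PredefinedSplit` from a tiny hand-made fold vector (the seven-element array is an assumption for the example) and prints the single train/validation split it yields.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# 0 marks validation rows, -1 marks rows that are never used for validation.
toy_fold = np.array([0, 0, 0, -1, -1, -1, -1])
splitter = PredefinedSplit(toy_fold)

n_splits = splitter.get_n_splits()
splits = list(splitter.split())
print('number of splits:', n_splits)
for train_idx, val_idx in splits:
    print('train:', train_idx, 'validation:', val_idx)
```

Because only one distinct non-negative fold index is present, the splitter produces exactly one fold, so each hyper-parameter candidate is fitted and scored once.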

Here, we assign the validation fold back to the original dataframe to illustrate the point; this is technically not required for the rest of the pipeline. Notice in the dataframe printed below that the last record's date, 2015-06-18, is different from the rest, and that record's val_fold takes on a value of -1. This means that all records dated on or after 2015-06-19 will become our validation set.


In [8]:
df_train['val_fold'] = val_fold
df_train[(val_index - 2):(val_index + 2)]


Out[8]:
Store DayOfWeek Date Sales Customers Open Promo StateHoliday SchoolHoliday Year ... Promo2Since Promo2Days Promo2Weeks AfterSchoolHoliday AfterStateHoliday AfterPromo BeforeSchoolHoliday BeforeStateHoliday BeforePromo val_fold
41393 1113 5 2015-06-19 7114 700 1 1 False 0 2015 ... 1900-01-01 42172 25 35 25 0 -31 -90 0 0
41394 1114 5 2015-06-19 21834 3211 1 1 False 0 2015 ... 1900-01-01 42172 25 35 25 0 -27 -90 0 0
41395 1115 5 2015-06-19 8291 535 1 1 False 0 2015 ... 2012-05-28 1117 25 70 15 0 -38 -90 0 0
41396 1 4 2015-06-18 4645 498 1 1 False 0 2015 ... 1900-01-01 42171 25 69 14 0 -39 -91 0 -1

4 rows × 72 columns

We proceed to extract the columns, both numerical and categorical, that we'll use for modeling.


In [9]:
# the model id is used as the indicator when saving the model
model_id = 'gbt'
input_cols = num_cols + cat_cols

# .copy() avoids pandas' SettingWithCopyWarning when we assign columns below
df_train = df_train[input_cols + [label_col]].copy()

# we will perform the modeling at the log-scale
df_train[label_col] = np.log(df_train[label_col])
df_test = df_test[input_cols + id_cols].copy()

print('train dimension: ', df_train.shape)
print('test dimension: ', df_test.shape)
df_train.head()


train dimension:  (844338, 37)
test dimension:  (41088, 37)
Out[9]:
CloudCover CompetitionDistance Max_Humidity Max_TemperatureC Max_Wind_SpeedKm_h Mean_Humidity Mean_TemperatureC Mean_Wind_SpeedKm_h Min_Humidity Min_TemperatureC ... Promo2SinceYear Promo2Weeks PromoInterval State StateHoliday Store StoreType Week Year Sales
0 1.0 1270.0 98 23 24 54 16 11 18 8 ... 1900 25 None HE False 1 c 31 2015 8.568456
1 4.0 570.0 100 19 14 62 13 11 25 7 ... 2010 25 Jan,Apr,Jul,Oct TH False 2 a 31 2015 8.710125
2 2.0 14130.0 100 21 14 61 13 5 24 6 ... 2011 25 Jan,Apr,Jul,Oct NW False 3 a 31 2015 9.025696
3 6.0 620.0 94 19 23 61 14 16 30 9 ... 1900 25 None BE False 4 c 31 2015 9.546455
4 4.0 29910.0 82 20 14 55 15 11 26 10 ... 1900 25 None SN False 5 a 31 2015 8.480944

5 rows × 37 columns


In [10]:
for cat_col in cat_cols:
    df_train[cat_col] = df_train[cat_col].astype('category')
    df_test[cat_col] = df_test[cat_col].astype('category')

df_train.head()


Out[10]:
CloudCover CompetitionDistance Max_Humidity Max_TemperatureC Max_Wind_SpeedKm_h Mean_Humidity Mean_TemperatureC Mean_Wind_SpeedKm_h Min_Humidity Min_TemperatureC ... Promo2SinceYear Promo2Weeks PromoInterval State StateHoliday Store StoreType Week Year Sales
0 1.0 1270.0 98 23 24 54 16 11 18 8 ... 1900 25 NaN HE False 1 c 31 2015 8.568456
1 4.0 570.0 100 19 14 62 13 11 25 7 ... 2010 25 Jan,Apr,Jul,Oct TH False 2 a 31 2015 8.710125
2 2.0 14130.0 100 21 14 61 13 5 24 6 ... 2011 25 Jan,Apr,Jul,Oct NW False 3 a 31 2015 9.025696
3 6.0 620.0 94 19 23 61 14 16 30 9 ... 1900 25 NaN BE False 4 c 31 2015 9.546455
4 4.0 29910.0 82 20 14 55 15 11 26 10 ... 1900 25 NaN SN False 5 a 31 2015 8.480944

5 rows × 37 columns
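
Casting the training and test frames separately gives each frame its own set of categories, so the integer code behind a given label can differ between the two. Whether the helper class or LightGBM reconciles this at prediction time is not shown in this notebook, so the following is only a sketch of one way to make the encoding explicit: build a shared `CategoricalDtype` from the training categories and apply it to the test column. The toy frames are assumptions for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

toy_train = pd.DataFrame({'StoreType': ['a', 'b', 'c']})
toy_test = pd.DataFrame({'StoreType': ['c', 'b']})

# Separate casts: each column only knows the labels it has seen.
separate_train = toy_train['StoreType'].astype('category')
separate_test = toy_test['StoreType'].astype('category')

# Shared dtype built from the training categories.
shared_dtype = CategoricalDtype(categories=separate_train.cat.categories)
aligned_test = toy_test['StoreType'].astype(shared_dtype)

separate_codes = separate_test.cat.codes.tolist()
aligned_codes = aligned_test.cat.codes.tolist()
print('separate cast codes:', separate_codes)
print('aligned cast codes: ', aligned_codes)
```

With the shared dtype, 'c' maps to the same code in both frames. Note that labels absent from the training data become missing values under this approach, which is usually the behavior you want to be aware of rather than discover later.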

Model Training

We use a helper class to train a boosted tree model, generate predictions on our test set, create the submission file, check the feature importance of the tree-based model, and make sure we can save and re-load the model.


In [11]:
from gbt_module.model import GBTPipeline

model = GBTPipeline(input_cols, cat_cols, label_col, weights_col,
                    model_task, model_id, model_type, model_parameters,
                    model_hyper_parameters, search_parameters)
model


Out[11]:
GBTPipeline(cat_cols=['Assortment', 'CompetitionMonthsOpen',
                      'CompetitionOpenSinceYear', 'Day', 'DayOfWeek', 'Events',
                      'Month', 'Promo2SinceYear', 'Promo2Weeks',
                      'PromoInterval', 'State', 'StateHoliday', 'Store',
                      'StoreType', 'Week', 'Year'],
            input_cols=['CloudCover', 'CompetitionDistance', 'Max_Humidity',
                        'Max_TemperatureC', 'Max_Wind_SpeedKm_h',
                        'Mean_Humidity', 'Mean...
                                    'max_depth': [3, 5, 8, 10, 12],
                                    'subsample': [0.7, 0.8, 0.9]},
            model_id='gbt',
            model_parameters={'learning_rate': 0.01, 'min_data_in_leaf': 100,
                              'n_estimators': 3000, 'n_jobs': -1},
            model_task='regression', model_type='lgb',
            search_parameters={'n_iter': 3, 'n_jobs': -1, 'random_state': 1234,
                               'return_train_score': True,
                               'scoring': 'neg_mean_squared_error',
                               'verbose': 1},
            weights_col=None)

In [12]:
start = time.time()
model.fit(df_train, val_fold, model_fit_parameters)
elapsed = time.time() - start
print('elapsed minutes: ', elapsed / 60)


Fitting 1 folds for each of 3 candidates, totalling 3 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed: 20.0min finished
/Users/mingyuliu/anaconda3/lib/python3.6/site-packages/lightgbm/basic.py:1209: UserWarning: categorical_feature in Dataset is overridden.
New categorical_feature is ['Assortment', 'CompetitionMonthsOpen', 'CompetitionOpenSinceYear', 'Day', 'DayOfWeek', 'Events', 'Month', 'Promo2SinceYear', 'Promo2Weeks', 'PromoInterval', 'State', 'StateHoliday', 'Store', 'StoreType', 'Week', 'Year']
  'New categorical_feature is {}'.format(sorted(list(categorical_feature))))
Training until validation scores don't improve for 5 rounds.
[100]	valid_0's l2: 0.0718742	valid_0's l2: 0.0718742	valid_1's l2: 0.0654301	valid_1's l2: 0.0654301
[200]	valid_0's l2: 0.0450121	valid_0's l2: 0.0450121	valid_1's l2: 0.0404465	valid_1's l2: 0.0404465
[300]	valid_0's l2: 0.0336353	valid_0's l2: 0.0336353	valid_1's l2: 0.0314702	valid_1's l2: 0.0314702
[400]	valid_0's l2: 0.027701	valid_0's l2: 0.027701	valid_1's l2: 0.0269131	valid_1's l2: 0.0269131
[500]	valid_0's l2: 0.0240796	valid_0's l2: 0.0240796	valid_1's l2: 0.0240597	valid_1's l2: 0.0240597
[600]	valid_0's l2: 0.0218724	valid_0's l2: 0.0218724	valid_1's l2: 0.0218025	valid_1's l2: 0.0218025
[700]	valid_0's l2: 0.0201603	valid_0's l2: 0.0201603	valid_1's l2: 0.0201185	valid_1's l2: 0.0201185
[800]	valid_0's l2: 0.0184851	valid_0's l2: 0.0184851	valid_1's l2: 0.0184089	valid_1's l2: 0.0184089
[900]	valid_0's l2: 0.0168288	valid_0's l2: 0.0168288	valid_1's l2: 0.0166787	valid_1's l2: 0.0166787
[1000]	valid_0's l2: 0.015335	valid_0's l2: 0.015335	valid_1's l2: 0.0151688	valid_1's l2: 0.0151688
[1100]	valid_0's l2: 0.0142658	valid_0's l2: 0.0142658	valid_1's l2: 0.0141122	valid_1's l2: 0.0141122
[1200]	valid_0's l2: 0.0135508	valid_0's l2: 0.0135508	valid_1's l2: 0.013411	valid_1's l2: 0.013411
[1300]	valid_0's l2: 0.012954	valid_0's l2: 0.012954	valid_1's l2: 0.0128038	valid_1's l2: 0.0128038
[1400]	valid_0's l2: 0.0124204	valid_0's l2: 0.0124204	valid_1's l2: 0.0122977	valid_1's l2: 0.0122977
[1500]	valid_0's l2: 0.0119961	valid_0's l2: 0.0119961	valid_1's l2: 0.011941	valid_1's l2: 0.011941
[1600]	valid_0's l2: 0.01163	valid_0's l2: 0.01163	valid_1's l2: 0.0115995	valid_1's l2: 0.0115995
[1700]	valid_0's l2: 0.0113139	valid_0's l2: 0.0113139	valid_1's l2: 0.0112803	valid_1's l2: 0.0112803
[1800]	valid_0's l2: 0.0110184	valid_0's l2: 0.0110184	valid_1's l2: 0.0109884	valid_1's l2: 0.0109884
[1900]	valid_0's l2: 0.0107764	valid_0's l2: 0.0107764	valid_1's l2: 0.0107527	valid_1's l2: 0.0107527
[2000]	valid_0's l2: 0.0105697	valid_0's l2: 0.0105697	valid_1's l2: 0.0105339	valid_1's l2: 0.0105339
[2100]	valid_0's l2: 0.0103475	valid_0's l2: 0.0103475	valid_1's l2: 0.0103047	valid_1's l2: 0.0103047
[2200]	valid_0's l2: 0.0101752	valid_0's l2: 0.0101752	valid_1's l2: 0.0101483	valid_1's l2: 0.0101483
[2300]	valid_0's l2: 0.0100167	valid_0's l2: 0.0100167	valid_1's l2: 0.00999843	valid_1's l2: 0.00999843
[2400]	valid_0's l2: 0.00986898	valid_0's l2: 0.00986898	valid_1's l2: 0.00983577	valid_1's l2: 0.00983577
[2500]	valid_0's l2: 0.00972949	valid_0's l2: 0.00972949	valid_1's l2: 0.00969056	valid_1's l2: 0.00969056
[2600]	valid_0's l2: 0.00960752	valid_0's l2: 0.00960752	valid_1's l2: 0.00955977	valid_1's l2: 0.00955977
[2700]	valid_0's l2: 0.00948489	valid_0's l2: 0.00948489	valid_1's l2: 0.00944049	valid_1's l2: 0.00944049
[2800]	valid_0's l2: 0.00936427	valid_0's l2: 0.00936427	valid_1's l2: 0.00928775	valid_1's l2: 0.00928775
[2900]	valid_0's l2: 0.00924005	valid_0's l2: 0.00924005	valid_1's l2: 0.00915616	valid_1's l2: 0.00915616
[3000]	valid_0's l2: 0.00912308	valid_0's l2: 0.00912308	valid_1's l2: 0.00904488	valid_1's l2: 0.00904488
Did not meet early stopping. Best iteration is:
[3000]	valid_0's l2: 0.00912308	valid_0's l2: 0.00912308	valid_1's l2: 0.00904488	valid_1's l2: 0.00904488
elapsed minutes:  22.69019781748454
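
The log above reports "Training until validation scores don't improve for 5 rounds" and ends with "Did not meet early stopping", which means the validation loss was still improving when the 3000-round budget ran out. The stopping rule itself can be sketched in a few lines of plain Python. This is a simplified illustration, not LightGBM's implementation; `early_stopping_round` is a hypothetical helper and the loss sequences are made up.

```python
def early_stopping_round(scores, patience):
    """Return (best_round, stopped_early) for a list of per-round losses.

    Rounds are 1-indexed and lower scores are better. Training stops once
    `patience` consecutive rounds pass without a new best score.
    """
    best_score = float('inf')
    best_round = 0
    for round_idx, score in enumerate(scores, start=1):
        if score < best_score:
            best_score, best_round = score, round_idx
        elif round_idx - best_round >= patience:
            return best_round, True
    return best_round, False


# Loss keeps improving: the budget runs out before the rule triggers.
improving = [0.07, 0.05, 0.04, 0.03, 0.02]
# Loss plateaus after round 2: the rule triggers five rounds later.
plateau = [0.07, 0.05, 0.06, 0.06, 0.06, 0.06, 0.06, 0.06]

result_improving = early_stopping_round(improving, patience=5)
result_plateau = early_stopping_round(plateau, patience=5)
print(result_improving, result_plateau)
```

The first case corresponds to what the log shows: the best round is the last one, which suggests that a larger number of estimators could be worth trying.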

In [13]:
pd.DataFrame(model.model_tuned_.cv_results_)


Out[13]:
mean_fit_time std_fit_time mean_score_time std_score_time param_subsample param_max_depth param_colsampl_bytree params split0_test_score mean_test_score std_test_score rank_test_score split0_train_score mean_train_score std_train_score
0 338.971463 0.0 23.137303 0.0 0.7 10 0.9 {'subsample': 0.7, 'max_depth': 10, 'colsampl_... -0.015877 -0.015877 0.0 2 -0.011191 -0.011191 0.0
1 452.571008 0.0 62.282989 0.0 0.8 5 0.9 {'subsample': 0.8, 'max_depth': 5, 'colsampl_b... -0.016854 -0.016854 0.0 3 -0.012256 -0.012256 0.0
2 339.133200 0.0 24.371682 0.0 0.9 12 0.9 {'subsample': 0.9, 'max_depth': 12, 'colsampl_... -0.015859 -0.015859 0.0 1 -0.011161 -0.011161 0.0
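
Because the search uses `neg_mean_squared_error` as its scoring function, the scores are negated mean squared errors, so values closer to zero are better. Taking the square root of the negated score gives an RMSE on the log-sales scale, which can be easier to read. The sketch below copies the score columns from the table above into a small frame; the `val_rmse` and `train_rmse` column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Values copied from the cv_results_ table above.
scores = pd.DataFrame({
    'param_max_depth': [10, 5, 12],
    'mean_test_score': [-0.015877, -0.016854, -0.015859],
    'mean_train_score': [-0.011191, -0.012256, -0.011161],
})

# Flip the sign back to MSE, then take the square root to get RMSE.
scores['val_rmse'] = np.sqrt(-scores['mean_test_score'])
scores['train_rmse'] = np.sqrt(-scores['mean_train_score'])

best = scores.sort_values('val_rmse').iloc[0]
print(scores.round(4))
print('best max_depth:', int(best['param_max_depth']))
```

The ranking matches `rank_test_score` in the table: the candidate with `max_depth` of 12 has the lowest validation error.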

In [14]:
# we logged our label, remember to exponentiate it back to the original scale
prediction_test = model.predict(df_test[input_cols])
df_test[label_col] = np.exp(prediction_test)

submission_cols = id_cols + [label_col]
df_test[submission_cols] = df_test[submission_cols].astype('int')

submission_dir = 'submission'
# exist_ok=True makes this a no-op when the directory already exists
os.makedirs(submission_dir, exist_ok=True)

submission_file = 'rossmann_submission_{}.csv'.format(model_id)
submission_path = os.path.join(submission_dir, submission_file)
df_test[submission_cols].to_csv(submission_path, index=False)

df_test[submission_cols].head()


Out[14]:
Id Sales
0 1 4031
1 2 7236
2 3 9290
3 4 7094
4 5 7371
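
One detail of the cell above: `astype('int')` drops the fractional part instead of rounding, so each prediction is shifted down by up to one unit. For sales in the thousands this is negligible, but the difference is easy to see in a small sketch. The three values below are made up for illustration.

```python
import numpy as np

# Pretend these are predictions on the log scale.
log_pred = np.log(np.array([4031.9, 7236.2, 9290.7]))

# Back to the original sales scale.
sales_pred = np.exp(log_pred)

truncated = sales_pred.astype('int')        # drops the fractional part
rounded = np.rint(sales_pred).astype('int')  # rounds to the nearest integer
print('truncated:', truncated)
print('rounded:  ', rounded)
```

If nearest-integer predictions are preferred, `np.rint` (or `np.round`) before the integer cast is one option.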

In [15]:
model.get_feature_importance()


Out[15]:
[('Store', 0.6707),
 ('Promo', 0.1081),
 ('BeforePromo', 0.0653),
 ('DayOfWeek', 0.0574),
 ('Week', 0.0463),
 ('Day', 0.019),
 ('AfterStateHoliday', 0.0049),
 ('BeforeStateHoliday', 0.0045),
 ('CompetitionDistance', 0.0041),
 ('StoreType', 0.0039),
 ('Month', 0.003),
 ('Year', 0.0022),
 ('State', 0.0016),
 ('CompetitionOpenSinceYear', 0.0015),
 ('AfterPromo', 0.001)]
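
The values listed above sum to roughly 0.99, which suggests they are shares of the total importance, with only the top features shown. How the helper class computes them is not shown here, so the following is only a sketch of one plausible approach: divide raw importances by their sum and keep the largest ones. `normalized_importance` is a hypothetical function and the raw numbers are made up.

```python
import numpy as np


def normalized_importance(names, raw_importance, top_k=None, digits=4):
    """Return (name, share) pairs sorted by decreasing share of the total."""
    raw = np.asarray(raw_importance, dtype=float)
    shares = raw / raw.sum()
    order = np.argsort(shares)[::-1]
    if top_k is not None:
        order = order[:top_k]
    return [(names[i], round(float(shares[i]), digits)) for i in order]


names = ['Store', 'Promo', 'DayOfWeek', 'Year']
raw_gain = [600.0, 250.0, 100.0, 50.0]  # made-up numbers for illustration
top_features = normalized_importance(names, raw_gain, top_k=3)
print(top_features)
```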

In [16]:
model_checkpoint = os.path.join('models', model_id + '.pkl')
model.save(model_checkpoint)

loaded_model = GBTPipeline.load(model_checkpoint)

# print the cv_results_ again to ensure the checkpointing works
pd.DataFrame(loaded_model.model_tuned_.cv_results_)


Out[16]:
mean_fit_time std_fit_time mean_score_time std_score_time param_subsample param_max_depth param_colsampl_bytree params split0_test_score mean_test_score std_test_score rank_test_score split0_train_score mean_train_score std_train_score
0 338.971463 0.0 23.137303 0.0 0.7 10 0.9 {'subsample': 0.7, 'max_depth': 10, 'colsampl_... -0.015877 -0.015877 0.0 2 -0.011191 -0.011191 0.0
1 452.571008 0.0 62.282989 0.0 0.8 5 0.9 {'subsample': 0.8, 'max_depth': 5, 'colsampl_b... -0.016854 -0.016854 0.0 3 -0.012256 -0.012256 0.0
2 339.133200 0.0 24.371682 0.0 0.9 12 0.9 {'subsample': 0.9, 'max_depth': 12, 'colsampl_... -0.015859 -0.015859 0.0 1 -0.011161 -0.011161 0.0