In [1]:
from jupyterthemes import get_themes
from jupyterthemes.stylefx import set_nb_theme
themes = get_themes()
set_nb_theme(themes[3])
Out[1]:
In [2]:
# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format='retina'
import os
import json
import time
import numpy as np
import pandas as pd
%watermark -a 'Ethen' -d -t -v -p numpy,pandas,pyarrow,sklearn
We did most of the data preparation and feature engineering in the previous notebook. We'll perform a few additional steps here, but this notebook focuses on getting the data ready for fitting a gradient boosted tree model. For the model, we will be leveraging lightgbm.
In [3]:
data_dir = 'cleaned_data'
path_train = os.path.join(data_dir, 'train_clean.parquet')
path_test = os.path.join(data_dir, 'test_clean.parquet')
engine = 'pyarrow'
df_train = pd.read_parquet(path_train, engine)
df_test = pd.read_parquet(path_test, engine)
print('train dimension: ', df_train.shape)
print('test dimension: ', df_test.shape)
df_train.head()
Out[3]:
We've pulled most of our configurable parameters out into a JSON configuration file. Ideally, we could move all of our code into a Python script and change only the configuration file to experiment with different settings and see which one leads to the best overall performance.
In [4]:
config_path = os.path.join('config', 'gbt_training_template.json')
with open(config_path) as f:
    config_file = json.load(f)
config_file
Out[4]:
In [5]:
# extract settings from the configuration file into local variables
columns = config_file['columns']
num_cols = columns['num_cols_pattern']
cat_cols = columns['cat_cols_pattern']
id_cols = columns['id_cols']
label_col = columns['label_col']
weights_col = columns['weights_col']
model_task = config_file['model_task']
model_type = config_file['model_type']
model_parameters = config_file['model_parameters'][model_type]
model_hyper_parameters = config_file['model_hyper_parameters'][model_type]
model_fit_parameters = config_file['model_fit_parameters'][model_type]
search_parameters = config_file['search_parameters']
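The actual contents of gbt_training_template.json are not shown here, so the following is only a sketch of a configuration with the shape that the extraction code above expects. The key names come from the cell above; every value (column names, model type, parameters) is a hypothetical placeholder, not the notebook's real configuration.

```python
import json

# Hypothetical configuration text; only the key names are taken from the notebook,
# all values below are made-up placeholders for illustration.
config_text = """
{
    "columns": {
        "num_cols_pattern": ["num_feature_1", "num_feature_2"],
        "cat_cols_pattern": ["cat_feature_1"],
        "id_cols": ["Id"],
        "label_col": "Sales",
        "weights_col": null
    },
    "model_task": "regression",
    "model_type": "lightgbm",
    "model_parameters": {"lightgbm": {"n_jobs": -1}},
    "model_hyper_parameters": {"lightgbm": {"max_depth": [4, 6, 8]}},
    "model_fit_parameters": {"lightgbm": {}},
    "search_parameters": {"n_iter": 5, "random_state": 1234}
}
"""
config_file = json.loads(config_text)

# the same lookups as in the notebook cell: settings nested under the
# model type are selected with the model_type key
model_type = config_file['model_type']
model_parameters = config_file['model_parameters'][model_type]
input_cols = (config_file['columns']['num_cols_pattern']
              + config_file['columns']['cat_cols_pattern'])
```

Nesting the model settings under the model type means that switching models would only require changing model_type, provided the file also carries an entry for the new type.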
Here, we remove all records where the store had zero sales or was closed (feel free to experiment with keeping the zero-sales records and see if that improves performance).
We also perform a train/validation split. The validation split will be used in our hyper-parameter tuning process and for early stopping. Because this is a time series application, where we are trying to predict different stores' daily sales, it's important not to perform a random train/test split, but instead to divide the training and validation sets based on time/date.
Our training data is already sorted by date in decreasing order, hence we can create the validation set by checking how big our test set is and selecting the top-N observations, so that the validation set has a similar size to our test set. We say similar size and not exact size because we make sure that all the records from the same date fall under either the training or the validation set, never both.
In [6]:
df_train = df_train[df_train[label_col] != 0].reset_index(drop=True)
mask = df_train['Date'] == df_train['Date'].iloc[len(df_test)]
val_index = df_train.loc[mask, 'Date'].index.max()
val_index
Out[6]:
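To make the date-based split concrete, here is a minimal sketch on toy data that mirrors the logic of the cell above. The dataframe, its columns and the stand-in test set size n_test are hypothetical; only the two lines computing mask and val_index follow the notebook.

```python
import pandas as pd

# hypothetical toy data: 2 stores observed over 5 days,
# sorted by date in decreasing order like the real training set
dates = pd.date_range('2015-06-15', periods=5, freq='D')
df = pd.DataFrame({
    'Date': list(dates[::-1].repeat(2)),
    'Store': [1, 2] * 5,
})

# pretend the test set has 3 rows; the row at position n_test determines
# the earliest date that still belongs to the validation set
n_test = 3
mask = df['Date'] == df['Date'].iloc[n_test]
val_index = df.loc[mask, 'Date'].index.max()

# rows 0..val_index form the validation set, the rest is used for training
df_val = df.loc[:val_index]
df_fit = df.loc[(val_index + 1):]
```

In this toy case the validation set ends up with 4 rows rather than 3, because both records of the boundary date are kept together, and every validation date is strictly later than every training date.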
The validation fold we're creating is used for sklearn's PredefinedSplit, where we set the index to 0 for all samples that are part of the validation set, and to -1 for all other samples.
In [7]:
val_fold = np.full(df_train.shape[0], fill_value=-1)
val_fold[:(val_index + 1)] = 0
val_fold
Out[7]:
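As a small illustration of how sklearn interprets such a fold array, the sketch below builds a PredefinedSplit from a six-element toy array (the array is made up for the example) and lists the single train/validation split it generates.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# toy fold assignment: the first three samples are the validation set (fold 0),
# the remaining three are marked -1 and therefore always stay in the training set
toy_val_fold = np.array([0, 0, 0, -1, -1, -1])
split = PredefinedSplit(test_fold=toy_val_fold)

# each element is a (training indices, validation indices) pair
splits = [(train_idx.tolist(), val_idx.tolist())
          for train_idx, val_idx in split.split()]
n_splits = split.get_n_splits()
```

Because only one distinct non-negative fold id is present, there is exactly one split, which is what we want for a single held-out validation period.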
Here, we assign the validation fold back to the original dataframe to illustrate the point; this is not required for the rest of the pipeline. Notice in the dataframe that we've printed out that the last record's date, 2015-06-18, is different from the rest, and that record's val_fold takes on a value of -1. This means that all records on or after 2015-06-19 will become our validation set.
In [8]:
df_train['val_fold'] = val_fold
df_train[(val_index - 2):(val_index + 2)]
Out[8]:
We proceed to extract the columns, both numerical and categorical, that we'll use for modeling.
In [9]:
# the model id is used as the indicator when saving the model
model_id = 'gbt'
input_cols = num_cols + cat_cols
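# take explicit copies so the column assignments below modify
# independent dataframes and don't trigger SettingWithCopyWarning
df_train = df_train[input_cols + [label_col]].copy()
# we will perform the modeling at the log-scale
df_train[label_col] = np.log(df_train[label_col])
df_test = df_test[input_cols + id_cols].copy()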
print('train dimension: ', df_train.shape)
print('test dimension: ', df_test.shape)
df_train.head()
Out[9]:
In [10]:
for cat_col in cat_cols:
    df_train[cat_col] = df_train[cat_col].astype('category')
    df_test[cat_col] = df_test[cat_col].astype('category')
df_train.head()
Out[10]:
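One thing to keep in mind with the conversion above: calling astype('category') on the training and test sets separately derives the categories from each dataframe independently, so the underlying integer codes can differ when the two sets don't contain the same values. Whether this matters depends on how the downstream library consumes the column. If aligned codes are desired, a shared dtype is one option; the sketch below uses a hypothetical column name and toy values.

```python
import pandas as pd

# hypothetical toy data: the test set is missing category 'a'
toy_train = pd.DataFrame({'cat_feature': ['a', 'b', 'c']})
toy_test = pd.DataFrame({'cat_feature': ['b', 'c']})

# converting separately: 'b' is code 1 in train but code 0 in test
separate_train_codes = toy_train['cat_feature'].astype('category').cat.codes.tolist()
separate_test_codes = toy_test['cat_feature'].astype('category').cat.codes.tolist()

# converting with a dtype built from the training categories keeps codes aligned
shared_dtype = pd.CategoricalDtype(categories=sorted(toy_train['cat_feature'].unique()))
aligned_train_codes = toy_train['cat_feature'].astype(shared_dtype).cat.codes.tolist()
aligned_test_codes = toy_test['cat_feature'].astype(shared_dtype).cat.codes.tolist()
```

Note that with a shared dtype, any test value that never appeared in the training set becomes missing (code -1).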
We use a helper class to train a boosted tree model, generate the prediction on our test set, create the submission file, check the feature importance of the tree-based model and also make sure we can save and re-load the model.
In [11]:
from gbt_module.model import GBTPipeline
model = GBTPipeline(input_cols, cat_cols, label_col, weights_col,
                    model_task, model_id, model_type, model_parameters,
                    model_hyper_parameters, search_parameters)
model
Out[11]:
In [12]:
start = time.time()
model.fit(df_train, val_fold, model_fit_parameters)
elapsed = time.time() - start
print('elapsed minutes: ', elapsed / 60)
In [13]:
pd.DataFrame(model.model_tuned_.cv_results_)
Out[13]:
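The internals of GBTPipeline are not shown in this notebook. Since it exposes a model_tuned_ attribute with cv_results_, it presumably wraps a sklearn-style search object. As a rough sketch of that idea, and not the helper's actual implementation, the example below runs a RandomizedSearchCV with a PredefinedSplit on synthetic data. sklearn's GradientBoostingRegressor is used here as a stand-in for lightgbm, and the data and parameter grid are made up.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import PredefinedSplit, RandomizedSearchCV

# synthetic regression data, purely for illustration
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# same construction as in the notebook: first rows are validation (0), rest are -1
toy_val_fold = np.full(X.shape[0], fill_value=-1)
toy_val_fold[:50] = 0

search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(n_estimators=30, random_state=0),
    param_distributions={'max_depth': [2, 3, 4], 'learning_rate': [0.05, 0.1, 0.2]},
    n_iter=4,
    cv=PredefinedSplit(toy_val_fold),
    scoring='neg_mean_squared_error',
    random_state=0,
)
search.fit(X, y)

# one row per sampled parameter combination; with a single predefined split
# there is only a split0_test_score column
toy_cv_results = pd.DataFrame(search.cv_results_)
```

With one validation fold, mean_test_score is simply the score on that fold, so the standard deviation columns in cv_results_ carry no information.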
In [14]:
# we logged our label, remember to exponentiate it back to the original scale
prediction_test = model.predict(df_test[input_cols])
df_test[label_col] = np.exp(prediction_test)
submission_cols = id_cols + [label_col]
df_test[submission_cols] = df_test[submission_cols].astype('int')
submission_dir = 'submission'
if not os.path.isdir(submission_dir):
    os.makedirs(submission_dir, exist_ok=True)
submission_file = 'rossmann_submission_{}.csv'.format(model_id)
submission_path = os.path.join(submission_dir, submission_file)
df_test[submission_cols].to_csv(submission_path, index=False)
df_test[submission_cols].head()
Out[14]:
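The following sketch, on made-up numbers, illustrates the two transformations used in the cell above: taking the log of the label and exponentiating predictions back, and the integer cast applied before writing the submission. The point about the cast is that astype('int') truncates the fractional part instead of rounding it; whether rounding first would be preferable is a judgment call that this notebook does not evaluate.

```python
import numpy as np

# made-up sales values: exp undoes log up to floating point error
toy_sales = np.array([5263.0, 6064.0, 8314.0])
recovered = np.exp(np.log(toy_sales))
round_trip_ok = np.allclose(recovered, toy_sales)

# made-up predictions on the original scale
toy_pred = np.array([5262.9, 6064.2])
truncated = toy_pred.astype('int')            # drops the fractional part
rounded = np.round(toy_pred).astype('int')    # rounds to the nearest integer first
```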
In [15]:
model.get_feature_importance()
Out[15]:
In [16]:
model_checkpoint = os.path.join('models', model_id + '.pkl')
model.save(model_checkpoint)
loaded_model = GBTPipeline.load(model_checkpoint)
# print the cv_results_ again to ensure the checkpointing works
pd.DataFrame(loaded_model.model_tuned_.cv_results_)
Out[16]: