In [1]:
from jupyterthemes import get_themes
from jupyterthemes.stylefx import set_nb_theme
themes = get_themes()
set_nb_theme(themes[3])


Out[1]:

In [2]:
# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format='retina'

import os
import torch
import numpy as np
import pandas as pd

%watermark -a 'Ethen' -d -t -v -p numpy,pandas,pyarrow,torch,fastai


Ethen 2019-08-09 13:50:25 

CPython 3.6.4
IPython 7.7.0

numpy 1.17.0
pandas 0.25.0
pyarrow 0.14.1
torch 1.1.0.post2
fastai 1.0.55

Rossmann Deep Learning Modeling

The success of deep learning is most often discussed in domains such as computer vision and natural language processing. Another use case that is also powerful, but receives far less attention, is deep learning on tabular data. By tabular data, we mean data that we would usually put in a dataframe or a relational database, which is one of the most commonly encountered types of data in industry.

Embeddings for Categorical Variables

One key technique for making the most of deep learning on tabular data is to use embeddings for our categorical variables. This approach allows relationships between categories to be captured. Given a categorical feature with high cardinality (a large number of distinct categories), it often works best to embed the categories into a lower-dimensional numeric space. For example, zip code embeddings might capture that certain zip codes are geographically near each other without us needing to tell the model explicitly. Similarly, for a feature such as day of week, the embedding might capture that Saturday and Sunday behave similarly, and that Friday perhaps behaves like an average of a weekend day and a weekday. By converting raw categories into embeddings, our hope is that they capture richer, more complex relationships that ultimately improve the performance of our models.

For instance, a 4-dimensional version of an embedding for day of week could look like:

Sunday   [.8, .2, .1, .1]
Monday   [.1, .2, .9, .9]
Tuesday  [.2, .1, .9, .8]

Here, Monday and Tuesday are fairly similar, yet they are both quite different from Sunday. In practice, our neural network would learn the best representations for each category while it is training, and we can experiment with the number of dimensions that are allowed to capture these rich relationships.
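One way to make "fairly similar" concrete is cosine similarity. The sketch below applies it to the illustrative vectors above. The helper name cosine_similarity is our own, and the numbers are the made-up example values, not learned embeddings.

```python
import math

# illustrative (made-up) day-of-week embeddings from the example above
embeddings = {
    'Sunday':  [.8, .2, .1, .1],
    'Monday':  [.1, .2, .9, .9],
    'Tuesday': [.2, .1, .9, .8],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_mon_tue = cosine_similarity(embeddings['Monday'], embeddings['Tuesday'])
sim_sun_mon = cosine_similarity(embeddings['Sunday'], embeddings['Monday'])
print(round(sim_mon_tue, 2), round(sim_sun_mon, 2))
```

The Monday/Tuesday similarity comes out close to 1, while the Sunday/Monday similarity is much lower, matching the intuition described above.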

People have shared use cases and success stories of leveraging embeddings, e.g.

  • Instacart has embeddings for its stores, groceries, and customers.
  • Pinterest has embeddings for its pins.

Another interesting property of embeddings is that once they are trained, we can reuse them in other scenarios, e.g. as input features for our tree-based models.
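As a rough sketch of reusing embeddings as features, we can join a category's embedding vector onto each row of a dataframe. The embedding matrix, column names and data below are all hypothetical. In practice the matrix would be read from the trained network (for a PyTorch Embedding layer, its weight attribute).

```python
import numpy as np
import pandas as pd

# hypothetical learned embedding matrix: one row per category, 3 dims each
day_names = ['Sunday', 'Monday', 'Tuesday']
emb_matrix = np.array([
    [.8, .2, .1],
    [.1, .2, .9],
    [.2, .1, .9],
])

# one lookup row per category, with one column per embedding dimension
emb_df = pd.DataFrame(emb_matrix, columns=['dow_emb_0', 'dow_emb_1', 'dow_emb_2'])
emb_df['DayOfWeek'] = day_names

# toy data; the left join adds the embedding columns to every row
df = pd.DataFrame({
    'DayOfWeek': ['Monday', 'Sunday', 'Monday', 'Tuesday'],
    'Sales': [5, 9, 6, 7],
})
features = df.merge(emb_df, on='DayOfWeek', how='left')
print(features)
```

The resulting numeric columns can then be fed to a tree-based model in place of, or alongside, the raw categorical column.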

Data Preparation


In [3]:
data_dir = 'cleaned_data'
path_train = os.path.join(data_dir, 'train_clean.parquet')
path_test = os.path.join(data_dir, 'test_clean.parquet')
engine = 'pyarrow'

df_train = pd.read_parquet(path_train, engine)
df_test = pd.read_parquet(path_test, engine)
print('train dimension: ', df_train.shape)
print('test dimension: ', df_test.shape)
df_train.head()


train dimension:  (1017209, 71)
test dimension:  (41088, 70)
Out[3]:
Store DayOfWeek Date Sales Customers Open Promo StateHoliday SchoolHoliday Year ... CompetitionMonthsOpen Promo2Since Promo2Days Promo2Weeks AfterSchoolHoliday AfterStateHoliday AfterPromo BeforeSchoolHoliday BeforeStateHoliday BeforePromo
0 1 5 2015-07-31 5263 555 1 1 False 1 2015 ... 24 1900-01-01 42214 25 0 57 0 0 -48 0
1 2 5 2015-07-31 6064 625 1 1 False 1 2015 ... 24 2010-03-29 1950 25 0 67 0 0 0 0
2 3 5 2015-07-31 8314 821 1 1 False 1 2015 ... 24 2011-04-04 1579 25 0 57 0 0 -48 0
3 4 5 2015-07-31 13995 1498 1 1 False 1 2015 ... 24 1900-01-01 42214 25 0 67 0 0 0 0
4 5 5 2015-07-31 4822 559 1 1 False 1 2015 ... 3 1900-01-01 42214 25 0 57 0 0 0 0

5 rows × 71 columns


In [4]:
cat_names = ['Store', 'DayOfWeek', 'Year', 'Month', 'Day', 'StateHoliday',
    'CompetitionMonthsOpen', 'Promo2Weeks', 'StoreType', 'Assortment',
    'PromoInterval', 'CompetitionOpenSinceYear', 'Promo2SinceYear',
    'State', 'Week', 'Events']

cont_names = ['CompetitionDistance', 'Max_TemperatureC', 'Mean_TemperatureC',
    'Min_TemperatureC', 'Max_Humidity', 'Mean_Humidity', 'Min_Humidity',
    'Max_Wind_SpeedKm_h', 'Mean_Wind_SpeedKm_h', 'CloudCover', 'trend',
    'trend_DE', 'Promo', 'SchoolHoliday', 'AfterSchoolHoliday',
    'AfterStateHoliday', 'AfterPromo', 'BeforeSchoolHoliday',
    'BeforeStateHoliday', 'BeforePromo']

dep_var = 'Sales'

Here, we remove all records where the store had zero sales or was closed (feel free to experiment with keeping the zero-sales records to see if it improves performance).

We also perform a train/validation split. The validation set will be used for hyper-parameter tuning and for early stopping. Because this is a time series application, in which we are trying to predict each store's daily sales, it's important not to perform a random split, but instead to divide the training and validation sets by date.


In [5]:
df_train = df_train[df_train[dep_var] != 0].reset_index(drop=True)

We print out the min/max timestamps of the training and test sets to confirm that the two sets don't overlap.


In [6]:
df_test['Date'].min(), df_test['Date'].max()


Out[6]:
(Timestamp('2015-08-01 00:00:00'), Timestamp('2015-09-17 00:00:00'))

In [7]:
# the minimum date of the test set is larger than the maximum date of the
# training set
df_train['Date'].min(), df_train['Date'].max()


Out[7]:
(Timestamp('2013-01-01 00:00:00'), Timestamp('2015-07-31 00:00:00'))

Our training data is already sorted by date in decreasing order, so we can create the validation set by checking how big our test set is and selecting the top-N observations, giving a validation set of similar size to our test set. We say similar size rather than exact size because we make sure that all records from the same date fall in either the training set or the validation set.


In [8]:
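# look up the date of the record at position len(df_test), then take the
# largest index that still shares this date as the cut point
mask = df_train['Date'] == df_train['Date'].iloc[len(df_test)]
cut = df_train.loc[mask, 'Date'].index.max()

# fastai expects a collection of int for specifying which index belongs
# to the validation set
valid_idx = range(cut)
valid_idx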


Out[8]:
range(0, 41395)

Here, we print out the rows around the train/validation cut to illustrate the point; this is not required for the rest of the pipeline. Notice that the last record's date, 2015-06-18, differs from the rest, so the cut falls on a date boundary: records dated 2015-06-19 or later make up our validation set. (Strictly speaking, range(cut) excludes index cut itself, so the single row at index 41395 remains in the training set.)


In [9]:
df_train.loc[(cut - 2):(cut + 1)]


Out[9]:
Store DayOfWeek Date Sales Customers Open Promo StateHoliday SchoolHoliday Year ... CompetitionMonthsOpen Promo2Since Promo2Days Promo2Weeks AfterSchoolHoliday AfterStateHoliday AfterPromo BeforeSchoolHoliday BeforeStateHoliday BeforePromo
41393 1113 5 2015-06-19 7114 700 1 1 False 0 2015 ... 24 1900-01-01 42172 25 35 25 0 -31 -90 0
41394 1114 5 2015-06-19 21834 3211 1 1 False 0 2015 ... 24 1900-01-01 42172 25 35 25 0 -27 -90 0
41395 1115 5 2015-06-19 8291 535 1 1 False 0 2015 ... 24 2012-05-28 1117 25 70 15 0 -38 -90 0
41396 1 4 2015-06-18 4645 498 1 1 False 0 2015 ... 24 1900-01-01 42171 25 69 14 0 -39 -91 0

4 rows × 71 columns
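The same idea can also be expressed directly in terms of dates. The sketch below is an alternative formulation on a toy dataframe, not the code used in this notebook; the names toy, n_valid_target and valid_idx_by_date are made up for illustration.

```python
import pandas as pd

# toy data sorted by date in decreasing order, mimicking df_train
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2015-07-03', '2015-07-03', '2015-07-02',
                            '2015-07-02', '2015-07-01', '2015-07-01']),
    'Sales': [10, 20, 30, 40, 50, 60],
})

n_valid_target = 3  # hypothetical desired validation size

# date of the row sitting at the target position
cut_date = toy['Date'].iloc[n_valid_target]

# every row on or after that date goes to the validation set, so no date
# is ever shared between the training and validation sets
is_valid = toy['Date'] >= cut_date
valid_idx_by_date = toy.index[is_valid].tolist()
print(valid_idx_by_date)
```

Because the mask is defined on the date itself, the validation set ends up slightly larger than the target (4 rows instead of 3 here) rather than splitting a day in two.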


In [10]:
df_train = df_train[cat_names + cont_names + [dep_var]]
df_test = df_test[cat_names + cont_names + ['Id']]

print('train dimension: ', df_train.shape)
print('test dimension: ', df_test.shape)
df_train.head()


train dimension:  (844338, 37)
test dimension:  (41088, 37)
Out[10]:
Store DayOfWeek Year Month Day StateHoliday CompetitionMonthsOpen Promo2Weeks StoreType Assortment ... trend_DE Promo SchoolHoliday AfterSchoolHoliday AfterStateHoliday AfterPromo BeforeSchoolHoliday BeforeStateHoliday BeforePromo Sales
0 1 5 2015 7 31 False 24 25 c a ... 83 1 1 0 57 0 0 -48 0 5263
1 2 5 2015 7 31 False 24 25 a a ... 83 1 1 0 67 0 0 0 0 6064
2 3 5 2015 7 31 False 24 25 a a ... 83 1 1 0 57 0 0 -48 0 8314
3 4 5 2015 7 31 False 24 25 c c ... 83 1 1 0 67 0 0 0 0 13995
4 5 5 2015 7 31 False 3 25 a a ... 83 1 1 0 57 0 0 0 0 4822

5 rows × 37 columns

Model Training


In [11]:
from fastai.tabular import DatasetType
from fastai.tabular import defaults, tabular_learner, exp_rmspe, TabularList
from fastai.tabular import Categorify, Normalize, FillMissing, FloatList

fastai will automatically fit a regression model when the dependent variable is a float, but not when it's an int. To apply regression, we therefore need to tell fastai that the label is a float type, hence the argument label_cls=FloatList when creating the DataBunch required for training the model. We also pass log=True so that the model is trained on the log of the sales.

The procs variable holds fastai's preprocessing steps, i.e. the transformation logic that will be applied to our variables. Here we transform all categorical variables into categories (a unique numeric id that represents the original category). We also replace missing values in continuous variables with the column median (in addition to imputing, this creates a new column indicating whether the original value was missing) and normalize them (similar to sklearn's StandardScaler).


In [12]:
procs = [FillMissing, Categorify, Normalize]

# regression
data = (TabularList
        .from_df(df_train, path=data_dir, cat_names=cat_names,
                 cont_names=cont_names, procs=procs)
        .split_by_idx(valid_idx)
        .label_from_df(cols=dep_var, label_cls=FloatList, log=True)
        .add_test(TabularList.from_df(df_test, path=data_dir,
                                      cat_names=cat_names, cont_names=cont_names))
        .databunch())
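To make the three procs more concrete, here is a rough pandas equivalent on a toy dataframe. This is a sketch of the idea, not fastai's implementation: the data is made up, and reserving id 0 for missing or unseen categories is our assumption about fastai's convention.

```python
import pandas as pd

toy = pd.DataFrame({
    'StoreType': ['a', 'c', 'a', 'b'],
    'CompetitionDistance': [100.0, None, 300.0, 500.0],
})

# FillMissing: flag the missing rows, then impute with the median
toy['CompetitionDistance_na'] = toy['CompetitionDistance'].isna()
median = toy['CompetitionDistance'].median()
toy['CompetitionDistance'] = toy['CompetitionDistance'].fillna(median)

# Categorify: map each category to an integer id; the + 1 leaves id 0 free
# for missing or unseen categories (an assumption, see the note above)
toy['StoreType_id'] = toy['StoreType'].astype('category').cat.codes + 1

# Normalize: subtract the mean and divide by the standard deviation
col = toy['CompetitionDistance']
toy['CompetitionDistance'] = (col - col.mean()) / col.std()
print(toy)
```

In a real pipeline, the median, category mapping, mean and standard deviation would be computed on the training set only and then reused for the validation and test sets.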

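We can specify a cap on our predictions. Because the label was log-transformed, the range is given on the log scale: the predicted log sales will never be negative (i.e. the predicted sales are at least 1) and will never exceed the log of 1.2 times the maximum sales value we see in the dataset.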


In [13]:
max_log_y = np.log(np.max(df_train[dep_var]) * 1.2)
y_range = torch.tensor([0, max_log_y], device=defaults.device)
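To our understanding, a range like this is typically enforced by passing the network's final output through a sigmoid and rescaling it to the requested interval. The sketch below illustrates the idea in plain Python; scale_to_range is our own helper, not a fastai function, and max_sales is a hypothetical value.

```python
import math


def scale_to_range(raw_output, low, high):
    """Squash an unbounded value into (low, high) using a sigmoid."""
    sigmoid = 1.0 / (1.0 + math.exp(-raw_output))
    return (high - low) * sigmoid + low


max_sales = 40000  # hypothetical maximum, for illustration only
max_log_sales = math.log(max_sales * 1.2)

# however extreme the raw output, the result stays inside the range
bounded = [scale_to_range(raw, 0.0, max_log_sales) for raw in (-20.0, 0.0, 20.0)]

# back on the original scale, predictions lie between 1 and 1.2 * max_sales
sales_scale = [math.exp(b) for b in bounded]
print(bounded, sales_scale)
```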

We'll now use all the information we have to create a fastai TabularModel. Here we define a fixed model with 2 hidden layers and try to avoid overfitting by applying regularization in the form of dropout: the argument ps (the more commonly seen one) specifies the dropout probability for each hidden layer, and emb_drop specifies the dropout applied to the embeddings (the input).


In [14]:
learn = tabular_learner(data, layers=[1000, 500], ps=[0.001, 0.01], emb_drop=0.04, 
                        y_range=y_range, metrics=exp_rmspe)
learn.model


Out[14]:
TabularModel(
  (embeds): ModuleList(
    (0): Embedding(1116, 81)
    (1): Embedding(8, 5)
    (2): Embedding(4, 3)
    (3): Embedding(13, 7)
    (4): Embedding(32, 11)
    (5): Embedding(3, 3)
    (6): Embedding(50, 14)
    (7): Embedding(27, 10)
    (8): Embedding(5, 4)
    (9): Embedding(4, 3)
    (10): Embedding(4, 3)
    (11): Embedding(24, 9)
    (12): Embedding(9, 5)
    (13): Embedding(13, 7)
    (14): Embedding(53, 15)
    (15): Embedding(22, 9)
    (16): Embedding(3, 3)
    (17): Embedding(3, 3)
  )
  (emb_drop): Dropout(p=0.04)
  (bn_cont): BatchNorm1d(20, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layers): Sequential(
    (0): Linear(in_features=215, out_features=1000, bias=True)
    (1): ReLU(inplace)
    (2): BatchNorm1d(1000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): Dropout(p=0.001)
    (4): Linear(in_features=1000, out_features=500, bias=True)
    (5): ReLU(inplace)
    (6): BatchNorm1d(500, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (7): Dropout(p=0.01)
    (8): Linear(in_features=500, out_features=1, bias=True)
  )
)

Printing out the model architecture, we see that it first consists of a list of Embedding layers, one for each categorical variable. Recall that the shape of an Embedding layer is (number of distinct categories, dimension of the embedding). When creating our fastai learner, we didn't pass the emb_szs argument, which lets us specify the embedding size for each categorical variable, so the embedding sizes were determined algorithmically. For example, the first embedding is for our Store feature: it has 1116 categories, and 81 is the embedding size that was chosen. Note that there are 18 embeddings for our 16 categorical variables; the two extra ones most likely correspond to the missing-value indicator columns added by FillMissing.
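As far as we can tell, the default sizes follow a simple heuristic of the cardinality. The rule below is our assumption about what fastai 1.0.x does rather than something stated in this notebook, but it does reproduce every embedding size in the printed model.

```python
def emb_sz_rule(n_cat):
    """Assumed heuristic: grow sub-linearly with cardinality, capped at 600."""
    return min(600, round(1.6 * n_cat ** 0.56))


# cardinalities copied from the printed model above
cardinalities = [1116, 8, 4, 13, 32, 3, 50, 27, 5, 4, 4, 24, 9, 13, 53, 22, 3, 3]
sizes = [emb_sz_rule(n) for n in cardinalities]
print(sizes)
print(sum(sizes))
```

To override the default for a particular variable, emb_szs can be passed when creating the learner.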

Moving on to the first Linear layer, we can see that its input size equals the sum of the embedding dimensions plus the number of continuous variables (195 + 20 = 215), showing that they are concatenated together before moving to the next stage in the network.


In [15]:
learn.model.n_emb + learn.model.n_cont


Out[15]:
215
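The concatenation itself can be sketched with a few lines of PyTorch. This toy module is our own simplification for illustration: it omits the dropout, batch normalization and y_range scaling of the real TabularModel, and the class name TinyTabularNet is hypothetical.

```python
import torch
from torch import nn


class TinyTabularNet(nn.Module):
    """Toy model: embed each categorical column, concatenate with continuous ones."""

    def __init__(self, emb_szs, n_cont, n_hidden):
        super().__init__()
        self.embeds = nn.ModuleList(
            [nn.Embedding(n_cat, dim) for n_cat, dim in emb_szs])
        n_emb = sum(dim for _, dim in emb_szs)
        self.layers = nn.Sequential(
            nn.Linear(n_emb + n_cont, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, 1),
        )

    def forward(self, x_cat, x_cont):
        # x_cat: (batch, n_categorical) integer ids
        # x_cont: (batch, n_cont) floats
        embedded = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeds)]
        x = torch.cat(embedded + [x_cont], dim=1)
        return self.layers(x)


emb_szs = [(1116, 81), (8, 5)]  # (cardinality, embedding size) pairs
model = TinyTabularNet(emb_szs, n_cont=20, n_hidden=16)

# a random batch of 4 rows: two categorical ids and 20 continuous values each
x_cat = torch.stack(
    [torch.randint(0, 1116, (4,)), torch.randint(0, 8, (4,))], dim=1)
x_cont = torch.randn(4, 20)
out = model(x_cat, x_cont)
print(out.shape)
```

With two embeddings of size 81 and 5 plus 20 continuous features, the first Linear layer takes 106 inputs, which mirrors how 215 arises in the full model.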

In [16]:
# training time shown here is for an 8-core cpu
learn.fit_one_cycle(6, 1e-3, wd=0.2)


epoch  train_loss  valid_loss  exp_rmspe  time
0      0.024063    0.019387    0.142367   05:41
1      0.020623    0.018925    0.130072   05:59
2      0.017074    0.021191    0.148419   07:25
3      0.015322    0.014938    0.121601   08:28
4      0.010684    0.012374    0.108619   09:41
5      0.010553    0.012929    0.108629   10:02
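The exp_rmspe column reports the root mean squared percentage error, computed after exponentiating the log-scale predictions and targets back to the original sales scale. The function below is our own re-implementation of that idea for illustration, not fastai's code, and the numbers are made up.

```python
import math


def exp_rmspe_sketch(log_preds, log_targets):
    """RMSPE computed after mapping log-scale values back to the original scale."""
    squared_pct_errors = []
    for log_pred, log_target in zip(log_preds, log_targets):
        pred, target = math.exp(log_pred), math.exp(log_target)
        squared_pct_errors.append(((target - pred) / target) ** 2)
    return math.sqrt(sum(squared_pct_errors) / len(squared_pct_errors))


# predictions of 110 and 180 against true sales of 100 and 200:
# each prediction is off by 10% of the true value
log_preds = [math.log(110), math.log(180)]
log_targets = [math.log(100), math.log(200)]
score = exp_rmspe_sketch(log_preds, log_targets)
print(score)
```

Read this way, the final value of roughly 0.109 in the table above means the validation predictions are off by about 11% of the true sales in the root-mean-square sense.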

We can use the get_preds method to return the predictions and targets for a given dataset type. For the test set, we're only interested in the predictions (the targets returned are just placeholder zeros).


In [17]:
test_preds = learn.get_preds(ds_type=DatasetType.Test)
test_preds[:5]


Out[17]:
[tensor([[ 8.3923],
         [ 8.8337],
         [ 9.0851],
         ...,
         [ 8.7029],
         [10.0019],
         [ 8.8206]]), tensor([0, 0, 0,  ..., 0, 0, 0])]

In [18]:
# we logged our label, remember to exponentiate it back to the original scale
df_test[dep_var] = np.exp(test_preds[0].numpy().ravel())
df_test[['Id', dep_var]] = df_test[['Id', dep_var]].astype('int')

submission_dir = 'submission'
if not os.path.isdir(submission_dir):
    os.makedirs(submission_dir, exist_ok=True)

submission_path = os.path.join(submission_dir, 'rossmann_submission_fastai.csv')
df_test[['Id', dep_var]].to_csv(submission_path, index=False)

df_test[['Id', dep_var]].head()


Out[18]:
Id Sales
0 1 4412
1 2 6861
2 3 8822
3 4 7279
4 5 7265

Reference