In [1]:

    
from jupyterthemes import get_themes
from jupyterthemes.stylefx import set_nb_theme
themes = get_themes()
set_nb_theme(themes[3])









    Out[1]:



In [2]:

    
# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format='retina'

import os
import torch
import numpy as np
import pandas as pd

%watermark -a 'Ethen' -d -t -v -p numpy,pandas,pyarrow,torch,fastai









    



Ethen 2019-08-09 13:50:25 

CPython 3.6.4
IPython 7.7.0

numpy 1.17.0
pandas 0.25.0
pyarrow 0.14.1
torch 1.1.0.post2
fastai 1.0.55

Rossmann Deep Learning Modeling

The success of deep learning is often mentioned in domains such as computer vision and natural language processing. Another use case that is also powerful but receives far less attention is deep learning on tabular data. By tabular data, we mean data that we would usually put in a dataframe or a relational database, which is one of the most commonly encountered types of data in industry.

Embeddings for Categorical Variables

One key technique for making the most of deep learning on tabular data is to use embeddings for our categorical variables. This approach allows relationships between categories to be captured. Given a categorical feature with high cardinality (a large number of distinct categories), it often works best to embed the categories into a lower-dimensional numeric space. For example, embeddings of zip codes might be able to capture which zip codes are geographically near each other without us needing to explicitly say so. Similarly, for a feature such as day of week, the embedding might capture that Saturday and Sunday behave similarly, and that Friday perhaps behaves like an average of a weekend day and a weekday. By converting our raw categories into embeddings, our hope is that the embeddings capture richer, more complex relationships that will ultimately improve the performance of our models.

For instance, a 4-dimensional version of an embedding for day of week could look like:

Sunday   [.8, .2, .1, .1]
Monday   [.1, .2, .9, .9]
Tuesday  [.2, .1, .9, .8]

Here, Monday and Tuesday are fairly similar, yet they are both quite different from Sunday. In practice, our neural network would learn the best representations for each category while it is training, and we can experiment with the number of dimensions that are allowed to capture these rich relationships.
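To make "fairly similar" concrete, one option is to compare the vectors with cosine similarity. The snippet below is a small sketch using the toy values from the example above; the helper name cosine_similarity is introduced here only for illustration, and cosine similarity is just one of several reasonable ways to compare embeddings.

```python
import numpy as np

# Toy 4-dimensional day-of-week embeddings copied from the example above.
embeddings = {
    'Sunday': np.array([0.8, 0.2, 0.1, 0.1]),
    'Monday': np.array([0.1, 0.2, 0.9, 0.9]),
    'Tuesday': np.array([0.2, 0.1, 0.9, 0.8]),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


sim_mon_tue = cosine_similarity(embeddings['Monday'], embeddings['Tuesday'])
sim_mon_sun = cosine_similarity(embeddings['Monday'], embeddings['Sunday'])

# Monday/Tuesday should score much closer to 1 than Monday/Sunday.
print(round(sim_mon_tue, 3), round(sim_mon_sun, 3))
```

With these toy numbers the Monday/Tuesday pair scores far higher than the Monday/Sunday pair, which matches the reading given above.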

People have shared use cases and success stories of leveraging embeddings, e.g.

Instacart has embeddings for its stores, groceries, and customers.
Pinterest has embeddings for its pins.

Another interesting property of embeddings is that once we have trained them, we can reuse them in other scenarios, e.g. as features for our tree-based models.
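As a rough sketch of that reuse, an embedding lookup is just row indexing into a weight matrix, so the learned vectors can be stacked next to other numeric features. Everything below is hypothetical: the random matrix stands in for trained weights (in a trained fastai model these would presumably be read from the weight attribute of the corresponding Embedding layer), and the codes and extra feature are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained embedding matrix: 8 day-of-week slots,
# each mapped to a 5-dimensional vector.
day_of_week_embedding = rng.normal(size=(8, 5))

# Integer-encoded categorical column, one code per row of the dataset.
day_of_week_codes = np.array([5, 5, 4, 1, 7])

# Looking up an embedding is just row indexing into the matrix.
embedded = day_of_week_embedding[day_of_week_codes]

# Another numeric feature for the same rows (made-up values).
numeric_features = np.array([[1.0], [0.0], [1.0], [0.0], [1.0]])

# Dense feature matrix that a tree-based model could consume.
features = np.hstack([embedded, numeric_features])
print(features.shape)  # (5, 6)
```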

Data Preparation



In [3]:

    
data_dir = 'cleaned_data'
path_train = os.path.join(data_dir, 'train_clean.parquet')
path_test = os.path.join(data_dir, 'test_clean.parquet')
engine = 'pyarrow'

df_train = pd.read_parquet(path_train, engine)
df_test = pd.read_parquet(path_test, engine)
print('train dimension: ', df_train.shape)
print('test dimension: ', df_test.shape)
df_train.head()









    



train dimension:  (1017209, 71)
test dimension:  (41088, 70)






    Out[3]:







  
    
      
	Store	DayOfWeek	Date	Sales	Customers	Open	Promo	StateHoliday	SchoolHoliday	Year	...	CompetitionMonthsOpen	Promo2Since	Promo2Days	Promo2Weeks	AfterSchoolHoliday	AfterStateHoliday	AfterPromo	BeforeSchoolHoliday	BeforeStateHoliday	BeforePromo
0	1	5	2015-07-31	5263	555	1	1	False	1	2015	...	24	1900-01-01	42214	25	0	57	0	0	-48	0
1	2	5	2015-07-31	6064	625	1	1	False	1	2015	...	24	2010-03-29	1950	25	0	67	0	0	0	0
2	3	5	2015-07-31	8314	821	1	1	False	1	2015	...	24	2011-04-04	1579	25	0	57	0	0	-48	0
3	4	5	2015-07-31	13995	1498	1	1	False	1	2015	...	24	1900-01-01	42214	25	0	67	0	0	0	0
4	5	5	2015-07-31	4822	559	1	1	False	1	2015	...	3	1900-01-01	42214	25	0	57	0	0	0	0
    
  

5 rows × 71 columns



In [4]:

    
cat_names = ['Store', 'DayOfWeek', 'Year', 'Month', 'Day', 'StateHoliday',
    'CompetitionMonthsOpen', 'Promo2Weeks', 'StoreType', 'Assortment',
    'PromoInterval', 'CompetitionOpenSinceYear', 'Promo2SinceYear',
    'State', 'Week', 'Events']

cont_names = ['CompetitionDistance', 'Max_TemperatureC', 'Mean_TemperatureC',
    'Min_TemperatureC', 'Max_Humidity', 'Mean_Humidity', 'Min_Humidity',
    'Max_Wind_SpeedKm_h', 'Mean_Wind_SpeedKm_h', 'CloudCover', 'trend',
    'trend_DE', 'Promo', 'SchoolHoliday', 'AfterSchoolHoliday',
    'AfterStateHoliday', 'AfterPromo', 'BeforeSchoolHoliday',
    'BeforeStateHoliday', 'BeforePromo']

dep_var = 'Sales'

Here, we remove all records where the store had zero sales or was closed (feel free to experiment with keeping the zero-sales records to see if that improves performance).

We also perform a train/validation split. The validation split will be used in our hyper-parameter tuning process and for early stopping. Because this is a time series application, where we are trying to predict different stores' daily sales, it's important not to perform a random train/test split, but instead to divide the training and validation sets based on date.



In [5]:

    
df_train = df_train[df_train[dep_var] != 0].reset_index(drop=True)

We print out the min/max timestamps of the training and test sets to confirm that the two sets don't overlap.



In [6]:

    
df_test['Date'].min(), df_test['Date'].max()









    Out[6]:





(Timestamp('2015-08-01 00:00:00'), Timestamp('2015-09-17 00:00:00'))



In [7]:

    
# the minimum date of the test set is larger than the maximum date of the
# training set
df_train['Date'].min(), df_train['Date'].max()









    Out[7]:





(Timestamp('2013-01-01 00:00:00'), Timestamp('2015-07-31 00:00:00'))

Our training data is already sorted by date in decreasing order, so we can create the validation set by checking how big our test set is and selecting the top-N observations, giving a validation set of similar size to our test set. We say similar size rather than exact size because we want all the records from the same date to fall under either the training or the validation set.



In [8]:

    
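# date of the record sitting at position len(df_test); the data is sorted by
# date in decreasing order, so this is the oldest date in the validation set
mask = df_train['Date'] == df_train['Date'].iloc[len(df_test)]
# largest row index that still carries that date
cut = df_train.loc[mask, 'Date'].index.max()

# fastai expects a collection of int for specifying which index belongs
# to the validation set; note that range(cut) stops at cut - 1, so the row
# at index cut stays in the training set (range(cut + 1) would include it)
valid_idx = range(cut)
valid_idx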









    Out[8]:





range(0, 41395)

Here, we print out the rows around the train/validation cut to illustrate the point; this is not required for the rest of the pipeline. Notice that the last record's date, 2015-06-18, is different from the rest. This means that records dated 2015-06-19 or later make up our validation set, apart from the single row at index cut itself (41395), which range(cut) leaves in the training set.



In [9]:

    
df_train.loc[(cut - 2):(cut + 1)]









    Out[9]:







  
    
      
	Store	DayOfWeek	Date	Sales	Customers	Open	Promo	StateHoliday	SchoolHoliday	Year	...	CompetitionMonthsOpen	Promo2Since	Promo2Days	Promo2Weeks	AfterSchoolHoliday	AfterStateHoliday	AfterPromo	BeforeSchoolHoliday	BeforeStateHoliday	BeforePromo
41393	1113	5	2015-06-19	7114	700	1	1	False	0	2015	...	24	1900-01-01	42172	25	35	25	0	-31	-90	0
41394	1114	5	2015-06-19	21834	3211	1	1	False	0	2015	...	24	1900-01-01	42172	25	35	25	0	-27	-90	0
41395	1115	5	2015-06-19	8291	535	1	1	False	0	2015	...	24	2012-05-28	1117	25	70	15	0	-38	-90	0
41396	1	4	2015-06-18	4645	498	1	1	False	0	2015	...	24	1900-01-01	42171	25	69	14	0	-39	-91	0
    
  

4 rows × 71 columns
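The date-based cut can be reproduced on a toy dataframe. The sketch below uses made-up data and a hypothetical test-set size n_test; it follows the same mask/cut logic as the notebook and also shows a range(cut + 1) variant that keeps every record of the cut date in the validation set.

```python
import pandas as pd

# Toy stand-in for df_train: two stores, sorted by date in decreasing order.
toy_train = pd.DataFrame({
    'Store': [1, 2, 1, 2, 1, 2, 1, 2],
    'Date': pd.to_datetime([
        '2015-07-31', '2015-07-31', '2015-07-30', '2015-07-30',
        '2015-07-29', '2015-07-29', '2015-07-28', '2015-07-28',
    ]),
})
n_test = 3  # hypothetical test-set size

# Date of the row sitting at position n_test.
cut_date = toy_train['Date'].iloc[n_test]
# Largest row index that still carries that date.
cut = toy_train.index[toy_train['Date'] == cut_date].max()

valid_idx = range(cut)           # rows 0 .. cut - 1
valid_idx_full = range(cut + 1)  # variant that also includes row `cut`

# With the variant, no date is shared between validation and training rows.
valid_dates = set(toy_train.loc[list(valid_idx_full), 'Date'])
train_dates = set(toy_train.drop(index=list(valid_idx_full))['Date'])
print(cut, valid_dates.isdisjoint(train_dates))
```

Which variant to use is a judgment call: the difference is a single row, but range(cut + 1) matches the stated goal of keeping each date entirely on one side of the split.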



In [10]:

    
df_train = df_train[cat_names + cont_names + [dep_var]]
df_test = df_test[cat_names + cont_names + ['Id']]

print('train dimension: ', df_train.shape)
print('test dimension: ', df_test.shape)
df_train.head()









    



train dimension:  (844338, 37)
test dimension:  (41088, 37)






    Out[10]:







  
    
      
	Store	DayOfWeek	Year	Month	Day	StateHoliday	CompetitionMonthsOpen	Promo2Weeks	StoreType	Assortment	...	trend_DE	Promo	SchoolHoliday	AfterSchoolHoliday	AfterStateHoliday	AfterPromo	BeforeSchoolHoliday	BeforeStateHoliday	BeforePromo	Sales
0	1	5	2015	7	31	False	24	25	c	a	...	83	1	1	0	57	0	0	-48	0	5263
1	2	5	2015	7	31	False	24	25	a	a	...	83	1	1	0	67	0	0	0	0	6064
2	3	5	2015	7	31	False	24	25	a	a	...	83	1	1	0	57	0	0	-48	0	8314
3	4	5	2015	7	31	False	24	25	c	c	...	83	1	1	0	67	0	0	0	0	13995
4	5	5	2015	7	31	False	3	25	a	a	...	83	1	1	0	57	0	0	0	0	4822
    
  

5 rows × 37 columns

Model Training



In [11]:

    
from fastai.tabular import DatasetType
from fastai.tabular import defaults, tabular_learner, exp_rmspe, TabularList
from fastai.tabular import Categorify, Normalize, FillMissing, FloatList

fastai will automatically fit a regression model when the dependent variable is a float, but not when it's an int. So in order to apply regression we need to tell fastai that the label is a float type, hence the argument label_cls=FloatList when creating the DataBunch that is required for training the model. We also pass log=True so that the model is trained on the log of Sales.

The procs variable lists fastai's preprocessing steps, i.e. the transformation logic that will be applied to our variables. Here we transform all categorical variables into categories (a unique numeric id that represents the original category). We also replace missing values of continuous variables with the column median (apart from imputing the missing values, this also creates a new column that indicates whether the original value was missing) and normalize them (similar to sklearn's StandardScaler).



In [12]:

    
procs = [FillMissing, Categorify, Normalize]

# regression
data = (TabularList
        .from_df(df_train, path=data_dir, cat_names=cat_names,
                 cont_names=cont_names, procs=procs)
        .split_by_idx(valid_idx)
        .label_from_df(cols=dep_var, label_cls=FloatList, log=True)
        .add_test(TabularList.from_df(df_test, path=data_dir,
                                      cat_names=cat_names, cont_names=cont_names))
        .databunch())
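To get a feel for what the three preprocessing steps do, here is a rough pandas approximation on a toy dataframe. It is a sketch under assumptions, not fastai's implementation: the data is made up, and the exact encoding details (for instance which integer is reserved for missing categories) may differ from what fastai does internally.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'StoreType': ['a', 'c', 'a', 'b'],
    'CompetitionDistance': [1270.0, np.nan, 570.0, 620.0],
})

# FillMissing (approximation): flag missing values, then impute the median.
toy['CompetitionDistance_na'] = toy['CompetitionDistance'].isna()
median = toy['CompetitionDistance'].median()
toy['CompetitionDistance'] = toy['CompetitionDistance'].fillna(median)

# Categorify (approximation): map each category to an integer id.
# Adding 1 leaves 0 free for missing values, which pandas codes as -1.
toy['StoreType'] = pd.Categorical(toy['StoreType']).codes + 1

# Normalize (approximation): subtract the mean and divide by the std.
mean = toy['CompetitionDistance'].mean()
std = toy['CompetitionDistance'].std()
toy['CompetitionDistance'] = (toy['CompetitionDistance'] - mean) / std
print(toy)
```

In a real pipeline the median, category mapping, mean and standard deviation should be computed on the training rows only and then reused for the validation and test rows.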

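We can specify a range that caps our predictions. Because the label is log-transformed, the range is defined on the log scale: the lower bound of 0 corresponds to a sales value of 1, and the upper bound corresponds to 1.2 times the maximum sales value we see in the dataset.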



In [13]:

    
max_log_y = np.log(np.max(df_train[dep_var]) * 1.2)
y_range = torch.tensor([0, max_log_y], device=defaults.device)
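As far as I can tell, fastai v1 enforces y_range by passing the final activation through a sigmoid and rescaling it to the given interval. The numpy sketch below illustrates that idea; the function name scale_to_range and the maximum sales value are assumptions made for the example, not taken from the dataset or the library.

```python
import numpy as np


def scale_to_range(raw_output, low, high):
    """Squash unbounded activations into [low, high] with a sigmoid."""
    sigmoid = 1.0 / (1.0 + np.exp(-raw_output))
    return low + (high - low) * sigmoid


# Hypothetical maximum sales value, used only for this illustration.
max_sales = 40000
max_log_y = np.log(max_sales * 1.2)

raw = np.array([-50.0, -1.0, 0.0, 1.0, 50.0])  # unbounded network outputs
log_pred = scale_to_range(raw, 0.0, max_log_y)

# Exponentiate to get back to the original sales scale.
sales_pred = np.exp(log_pred)
print(np.round(sales_pred, 1))
```

However extreme the raw output is, the prediction on the sales scale stays between 1 and 1.2 times the assumed maximum.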

We'll now use all the information we have to create a fastai TabularModel. Here we've defined a fixed model with 2 hidden layers. We also try to avoid overfitting by applying regularization in the form of dropout: the argument ps specifies the dropout probability after each hidden layer (the more commonly seen kind), and the argument emb_drop specifies the dropout applied to the embeddings (the input).



In [14]:

    
learn = tabular_learner(data, layers=[1000, 500], ps=[0.001, 0.01], emb_drop=0.04, 
                        y_range=y_range, metrics=exp_rmspe)
learn.model









    Out[14]:





TabularModel(
  (embeds): ModuleList(
    (0): Embedding(1116, 81)
    (1): Embedding(8, 5)
    (2): Embedding(4, 3)
    (3): Embedding(13, 7)
    (4): Embedding(32, 11)
    (5): Embedding(3, 3)
    (6): Embedding(50, 14)
    (7): Embedding(27, 10)
    (8): Embedding(5, 4)
    (9): Embedding(4, 3)
    (10): Embedding(4, 3)
    (11): Embedding(24, 9)
    (12): Embedding(9, 5)
    (13): Embedding(13, 7)
    (14): Embedding(53, 15)
    (15): Embedding(22, 9)
    (16): Embedding(3, 3)
    (17): Embedding(3, 3)
  )
  (emb_drop): Dropout(p=0.04)
  (bn_cont): BatchNorm1d(20, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layers): Sequential(
    (0): Linear(in_features=215, out_features=1000, bias=True)
    (1): ReLU(inplace)
    (2): BatchNorm1d(1000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): Dropout(p=0.001)
    (4): Linear(in_features=1000, out_features=500, bias=True)
    (5): ReLU(inplace)
    (6): BatchNorm1d(500, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (7): Dropout(p=0.01)
    (8): Linear(in_features=500, out_features=1, bias=True)
  )
)

Printing out the model architecture, we see that it first consists of a list of Embedding layers, one for each categorical variable. There are 18 of them for our 16 categorical variables, most likely because FillMissing added a missing-value indicator column for each continuous variable that had missing values, and these indicators are treated as categorical. Recall that the shape of an Embedding layer is (number of distinct categories, dimension of the embedding). When we created our fastai learner, we didn't specify the emb_szs argument, which lets us set the embedding size for each of our categorical variables, so the embedding sizes were determined algorithmically. E.g. our first embedding is for the Store feature: it has 1116 rows, and 81 is the embedding size that was chosen for it.

Moving on to the first Linear layer, we can see that its input size equals the sum of the embedding sizes plus the number of continuous variables, showing that the embeddings and continuous variables are concatenated before moving to the next stage of the network.



In [15]:

    
learn.model.n_emb + learn.model.n_cont









    Out[15]:





215
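The embedding sizes in the printed model are consistent with a simple rule of thumb, which I believe is fastai v1's default: min(600, round(1.6 * n_cat ** 0.56)). The sketch below applies that assumed rule to the category counts from the model printout and adds the 20 continuous variables to recover the input size of the first Linear layer.

```python
def emb_sz_rule(n_cat):
    """Rule of thumb for the embedding width given the number of categories."""
    return min(600, round(1.6 * n_cat ** 0.56))


# Number of rows of each Embedding layer in the model printed above.
n_categories = [1116, 8, 4, 13, 32, 3, 50, 27, 5, 4, 4, 24, 9, 13, 53, 22, 3, 3]
emb_sizes = [emb_sz_rule(n) for n in n_categories]

n_cont = 20  # size of the bn_cont layer in the printout

print(emb_sizes)
print(sum(emb_sizes) + n_cont)
```

To override the rule for specific variables, the emb_szs argument mentioned above can be used instead.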



In [16]:

    
# training time shown here is for an 8-core CPU
learn.fit_one_cycle(6, 1e-3, wd=0.2)
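The exp_rmspe metric passed to tabular_learner reports the error on the original sales scale even though the model is trained on log sales. Below is a minimal numpy sketch of the computation, assuming the metric exponentiates the log-scale predictions and targets and then takes the root mean squared percentage error; the function name exp_rmspe_sketch and the sales figures are made up for illustration.

```python
import numpy as np


def exp_rmspe_sketch(log_pred, log_target):
    """RMSPE computed after mapping log-scale values back with exp."""
    pred, target = np.exp(log_pred), np.exp(log_target)
    pct_error = (target - pred) / target
    return float(np.sqrt(np.mean(pct_error ** 2)))


# Made-up sales figures, purely for illustration.
target = np.array([5000.0, 6000.0, 8000.0])
pred = np.array([5500.0, 5400.0, 8000.0])

# Two predictions are off by 10% and one is exact.
score = exp_rmspe_sketch(np.log(pred), np.log(target))
print(round(score, 4))
```

Because the error is relative to the true value, a miss of 500 on a store selling 5000 counts as much as a miss of 600 on a store selling 6000.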

We can use the get_preds method to return the predictions and targets for a given type of dataset. For the test set we're only interested in the predictions: it has no labels, so the returned targets are just placeholder zeros.



In [17]:

    
test_preds = learn.get_preds(ds_type=DatasetType.Test)
test_preds[:5]









    Out[17]:





[tensor([[ 8.3923],
         [ 8.8337],
         [ 9.0851],
         ...,
         [ 8.7029],
         [10.0019],
         [ 8.8206]]), tensor([0, 0, 0,  ..., 0, 0, 0])]



In [18]:

    
# we logged our label, remember to exponentiate it back to the original scale
df_test[dep_var] = np.exp(test_preds[0].numpy().ravel())
df_test[['Id', dep_var]] = df_test[['Id', dep_var]].astype('int')

submission_dir = 'submission'
# exist_ok=True makes this a no-op when the directory already exists
os.makedirs(submission_dir, exist_ok=True)

submission_path = os.path.join(submission_dir, 'rossmann_submission_fastai.csv')
df_test[['Id', dep_var]].to_csv(submission_path, index=False)

df_test[['Id', dep_var]].head()

Output of In [16], the per-epoch training log of learn.fit_one_cycle:

epoch	train_loss	valid_loss	exp_rmspe	time
0	0.024063	0.019387	0.142367	05:41
1	0.020623	0.018925	0.130072	05:59
2	0.017074	0.021191	0.148419	07:25
3	0.015322	0.014938	0.121601	08:28
4	0.010684	0.012374	0.108619	09:41
5	0.010553	0.012929	0.108629	10:02

Output of In [18], the first rows of the submission:

	Id	Sales
0	1	4412
1	2	6861
2	3	8822
3	4	7279
4	5	7265

