This notebook contains an implementation of the third-place result in the Rossmann Kaggle competition, as detailed in Guo/Berkhahn's Entity Embeddings of Categorical Variables.

The motivation behind exploring this architecture is its relevance to real-world applications. Much of our focus has been on computer vision and NLP tasks, which largely deal with unstructured data.

However, most of the data informing KPIs in industry is structured, time-series data. Here we explore the end-to-end process of using neural networks on practical structured-data problems.


In [1]:
%matplotlib inline

In [2]:
import math, keras, datetime, pandas as pd, numpy as np, keras.backend as K
import matplotlib.pyplot as plt, xgboost, operator, random, pickle


Using TensorFlow backend.
/home/roebius/pj/p3/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

In [3]:
from utils2 import *

In [4]:
np.set_printoptions(threshold=50, edgeitems=20)

In [5]:
limit_mem()

In [6]:
from isoweek import Week
from pandas_summary import DataFrameSummary

In [7]:
%cd data/rossman/


/home/roebius/pj/fastai/nbs2/data/rossman

Create datasets

In addition to the provided data, we will be using external datasets put together by participants in the Kaggle competition. You can download all of them here.

For completeness, the implementation used to put them together is included below.


In [8]:
def concat_csvs(dirname):
    os.chdir(dirname)
    filenames=glob.glob("*.csv")

    wrote_header = False
    with open("../"+dirname+".csv","w") as outputfile:
        for filename in filenames:
            name = filename.split(".")[0]
            with open(filename) as f:
                line = f.readline()
                if not wrote_header:
                    wrote_header = True
                    outputfile.write("file,"+line)
                for line in f:
                     outputfile.write(name + "," + line)
                outputfile.write("\n")

    os.chdir("..")

In [9]:
# concat_csvs('googletrend')
# concat_csvs('weather')

Feature Space:

  • train: Training set provided by competition
  • store: List of stores
  • store_states: mapping of store to the German state they are in
  • state_names: list of German state names
  • googletrend: trend of certain google keywords over time, found by users to correlate well w/ given data
  • weather: weather conditions for each German state
  • test: testing set

In [10]:
table_names = ['train', 'store', 'store_states', 'state_names', 
               'googletrend', 'weather', 'test']

We'll be using the popular data manipulation framework pandas.

Among other things, pandas allows you to manipulate tables/dataframes in Python much as one would in a database.

We're going to go ahead and load all of our CSVs as dataframes into a list called tables.


In [11]:
tables = [pd.read_csv(fname+'.csv', low_memory=False) for fname in table_names]

In [12]:
from IPython.display import HTML

We can use head() to get a quick look at the contents of each table:

  • train: Contains store information on a daily basis; tracks things like sales, customers, whether that day was a holiday, etc.
  • store: general info about the store including competition, etc.
  • store_states: maps store to state it is in
  • state_names: Maps state abbreviations to names
  • googletrend: trend data for particular week/state
  • weather: weather conditions for each state
  • test: Same as training table, w/o sales and customers

In [13]:
for t in tables: display(t.head())


Store DayOfWeek Date Sales Customers Open Promo StateHoliday SchoolHoliday
0 1 5 2015-07-31 5263 555 1 1 0 1
1 2 5 2015-07-31 6064 625 1 1 0 1
2 3 5 2015-07-31 8314 821 1 1 0 1
3 4 5 2015-07-31 13995 1498 1 1 0 1
4 5 5 2015-07-31 4822 559 1 1 0 1
Store StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear PromoInterval
0 1 c a 1270.0 9.0 2008.0 0 NaN NaN NaN
1 2 a a 570.0 11.0 2007.0 1 13.0 2010.0 Jan,Apr,Jul,Oct
2 3 a a 14130.0 12.0 2006.0 1 14.0 2011.0 Jan,Apr,Jul,Oct
3 4 c c 620.0 9.0 2009.0 0 NaN NaN NaN
4 5 a a 29910.0 4.0 2015.0 0 NaN NaN NaN
Store State
0 1 HE
1 2 TH
2 3 NW
3 4 BE
4 5 SN
StateName State
0 BadenWuerttemberg BW
1 Bayern BY
2 Berlin BE
3 Brandenburg BB
4 Bremen HB
file week trend
0 Rossmann_DE_SN 2012-12-02 - 2012-12-08 96
1 Rossmann_DE_SN 2012-12-09 - 2012-12-15 95
2 Rossmann_DE_SN 2012-12-16 - 2012-12-22 91
3 Rossmann_DE_SN 2012-12-23 - 2012-12-29 48
4 Rossmann_DE_SN 2012-12-30 - 2013-01-05 67
file Date Max_TemperatureC Mean_TemperatureC Min_TemperatureC Dew_PointC MeanDew_PointC Min_DewpointC Max_Humidity Mean_Humidity ... Max_VisibilityKm Mean_VisibilityKm Min_VisibilitykM Max_Wind_SpeedKm_h Mean_Wind_SpeedKm_h Max_Gust_SpeedKm_h Precipitationmm CloudCover Events WindDirDegrees
0 NordrheinWestfalen 2013-01-01 8 4 2 7 5 1 94 87 ... 31.0 12.0 4.0 39 26 58.0 5.08 6.0 Rain 215
1 NordrheinWestfalen 2013-01-02 7 4 1 5 3 2 93 85 ... 31.0 14.0 10.0 24 16 NaN 0.00 6.0 Rain 225
2 NordrheinWestfalen 2013-01-03 11 8 6 10 8 4 100 93 ... 31.0 8.0 2.0 26 21 NaN 1.02 7.0 Rain 240
3 NordrheinWestfalen 2013-01-04 9 9 8 9 9 8 100 94 ... 11.0 5.0 2.0 23 14 NaN 0.25 7.0 Rain 263
4 NordrheinWestfalen 2013-01-05 8 8 7 8 7 6 100 94 ... 10.0 6.0 3.0 16 10 NaN 0.00 7.0 Rain 268

5 rows × 24 columns

Id Store DayOfWeek Date Open Promo StateHoliday SchoolHoliday
0 1 1 4 2015-09-17 1.0 1 0 0
1 2 3 4 2015-09-17 1.0 1 0 0
2 3 7 4 2015-09-17 1.0 1 0 0
3 4 8 4 2015-09-17 1.0 1 0 0
4 5 9 4 2015-09-17 1.0 1 0 0

This is very representative of a typical industry dataset.

The following returns summarized aggregate information for each table across each field.


In [14]:
for t in tables: display(DataFrameSummary(t).summary())


Store DayOfWeek Date Sales Customers Open Promo StateHoliday SchoolHoliday
count 1.01721e+06 1.01721e+06 NaN 1.01721e+06 1.01721e+06 1.01721e+06 1.01721e+06 NaN 1.01721e+06
mean 558.43 3.99834 NaN 5773.82 633.146 0.830107 0.381515 NaN 0.178647
std 321.909 1.99739 NaN 3849.93 464.412 0.375539 0.485759 NaN 0.383056
min 1 1 NaN 0 0 0 0 NaN 0
25% 280 2 NaN 3727 405 1 0 NaN 0
50% 558 4 NaN 5744 609 1 0 NaN 0
75% 838 6 NaN 7856 837 1 1 NaN 0
max 1115 7 NaN 41551 7388 1 1 NaN 1
counts 1017209 1017209 1017209 1017209 1017209 1017209 1017209 1017209 1017209
uniques 1115 7 942 21734 4086 2 2 4 2
missing 0 0 0 0 0 0 0 0 0
missing_perc 0% 0% 0% 0% 0% 0% 0% 0% 0%
types numeric numeric categorical numeric numeric bool bool categorical bool
Store StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear PromoInterval
count 1115 NaN NaN 1112 761 761 1115 571 571 NaN
mean 558 NaN NaN 5404.9 7.2247 2008.67 0.512108 23.5954 2011.76 NaN
std 322.017 NaN NaN 7663.17 3.21235 6.19598 0.500078 14.142 1.67494 NaN
min 1 NaN NaN 20 1 1900 0 1 2009 NaN
25% 279.5 NaN NaN 717.5 4 2006 0 13 2011 NaN
50% 558 NaN NaN 2325 8 2010 1 22 2012 NaN
75% 836.5 NaN NaN 6882.5 10 2013 1 37 2013 NaN
max 1115 NaN NaN 75860 12 2015 1 50 2015 NaN
counts 1115 1115 1115 1112 761 761 1115 571 571 571
uniques 1115 4 3 654 12 23 2 24 7 3
missing 0 0 0 3 354 354 0 544 544 544
missing_perc 0% 0% 0% 0.27% 31.75% 31.75% 0% 48.79% 48.79% 48.79%
types numeric categorical categorical numeric numeric numeric bool numeric numeric categorical
Store State
count 1115 NaN
mean 558 NaN
std 322.017 NaN
min 1 NaN
25% 279.5 NaN
50% 558 NaN
75% 836.5 NaN
max 1115 NaN
counts 1115 1115
uniques 1115 12
missing 0 0
missing_perc 0% 0%
types numeric categorical
StateName State
count 16 16
unique 16 16
top Bayern SH
freq 1 1
counts 16 16
uniques 16 16
missing 0 0
missing_perc 0% 0%
types unique unique
file week trend
count NaN NaN 2072
mean NaN NaN 63.8142
std NaN NaN 12.6502
min NaN NaN 0
25% NaN NaN 55
50% NaN NaN 64
75% NaN NaN 72
max NaN NaN 100
counts 2072 2072 2072
uniques 14 148 68
missing 0 0 0
missing_perc 0% 0% 0%
types categorical categorical numeric
file Date Max_TemperatureC Mean_TemperatureC Min_TemperatureC Dew_PointC MeanDew_PointC Min_DewpointC Max_Humidity Mean_Humidity ... Max_VisibilityKm Mean_VisibilityKm Min_VisibilitykM Max_Wind_SpeedKm_h Mean_Wind_SpeedKm_h Max_Gust_SpeedKm_h Precipitationmm CloudCover Events WindDirDegrees
count NaN NaN 15840 15840 15840 15840 15840 15840 15840 15840 ... 15459 15459 15459 15840 15840 3604 15840 14667 NaN 15840
mean NaN NaN 14.6441 10.389 6.19899 8.58782 6.20581 3.62614 93.6596 74.2829 ... 24.0576 12.2398 7.02516 22.7666 11.9722 48.8643 0.831718 5.55131 NaN 175.897
std NaN NaN 8.64601 7.37926 6.52639 6.24478 6.08677 6.12839 7.67853 13.4866 ... 8.9768 5.06794 4.9806 8.98862 5.87284 13.027 2.51351 1.68771 NaN 101.589
min NaN NaN -11 -13 -15 -14 -15 -73 44 30 ... 0 0 0 3 2 21 0 0 NaN -1
25% NaN NaN 8 4 1 4 2 -1 90.75 65 ... 14 10 3 16 8 39 0 5 NaN 80
50% NaN NaN 15 11 7 9 7 4 94 76 ... 31 11 7 21 11 48 0 6 NaN 202
75% NaN NaN 21 16 11 13 11 8 100 85 ... 31 14 10 27 14 55 0.25 7 NaN 256
max NaN NaN 39 31 24 25 20 19 100 100 ... 31 31 31 101 53 111 58.93 8 NaN 360
counts 15840 15840 15840 15840 15840 15840 15840 15840 15840 15840 ... 15459 15459 15459 15840 15840 3604 15840 14667 11889 15840
uniques 16 990 51 45 40 40 36 40 53 71 ... 24 32 24 44 29 47 41 9 21 362
missing 0 0 0 0 0 0 0 0 0 0 ... 381 381 381 0 0 12236 0 1173 3951 0
missing_perc 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% ... 2.41% 2.41% 2.41% 0% 0% 77.25% 0% 7.41% 24.94% 0%
types categorical categorical numeric numeric numeric numeric numeric numeric numeric numeric ... numeric numeric numeric numeric numeric numeric numeric numeric categorical numeric

13 rows × 24 columns

Id Store DayOfWeek Date Open Promo StateHoliday SchoolHoliday
count 41088 41088 41088 NaN 41077 41088 NaN 41088
mean 20544.5 555.9 3.97917 NaN 0.854322 0.395833 NaN 0.443487
std 11861.2 320.274 2.01548 NaN 0.352787 0.489035 NaN 0.496802
min 1 1 1 NaN 0 0 NaN 0
25% 10272.8 279.75 2 NaN 1 0 NaN 0
50% 20544.5 553.5 4 NaN 1 0 NaN 0
75% 30816.2 832.25 6 NaN 1 1 NaN 1
max 41088 1115 7 NaN 1 1 NaN 1
counts 41088 41088 41088 41088 41077 41088 41088 41088
uniques 41088 856 7 48 2 2 2 2
missing 0 0 0 0 11 0 0 0
missing_perc 0% 0% 0% 0% 0.03% 0% 0% 0%
types numeric numeric numeric categorical bool bool bool bool

Data Cleaning / Feature Engineering

As a structured data problem, we necessarily have to go through all the cleaning and feature engineering, even though we're using a neural network.


In [15]:
train, store, store_states, state_names, googletrend, weather, test = tables

In [16]:
len(train),len(test)


Out[16]:
(1017209, 41088)

Turn state holidays into booleans


In [17]:
train.StateHoliday = train.StateHoliday!='0'
test.StateHoliday = test.StateHoliday!='0'

Define a function for joining tables on specific fields.

By default, we'll do a left outer join of right onto the left argument, using the given fields for each table.

Pandas does joins using the merge method. The suffixes argument describes the naming convention for duplicate fields. We've elected to leave the duplicate field names on the left untouched, and append a "_y" to those on the right.


In [18]:
def join_df(left, right, left_on, right_on=None):
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on, 
                      suffixes=("", "_y"))

Join weather/state names.


In [19]:
weather = join_df(weather, state_names, "file", "StateName")

In pandas you can add new columns to a dataframe by simply assigning to them. We'll do this for googletrend by extracting dates and state names from the given data and adding those columns.

We're also going to replace all instances of state name 'NI' with the usage in the rest of the table, 'HB,NI'. This is a good opportunity to highlight pandas indexing. We can use .loc[rows, cols] to select a list of rows and a list of columns from the dataframe. In this case, we're selecting rows with state name 'NI' via the boolean mask googletrend.State=='NI' and selecting the "State" column.


In [20]:
googletrend['Date'] = googletrend.week.str.split(' - ', expand=True)[0]
googletrend['State'] = googletrend.file.str.split('_', expand=True)[2]
googletrend.loc[googletrend.State=='NI', "State"] = 'HB,NI'

The following extracts particular date fields from a complete datetime for the purpose of constructing categoricals.

You should always consider this feature extraction step when working with date-time. Without expanding your date-time into these additional fields, you can't capture any trend/cyclical behavior as a function of time at any of these granularities.


In [21]:
def add_datepart(df):
    df.Date = pd.to_datetime(df.Date)
    df["Year"] = df.Date.dt.year
    df["Month"] = df.Date.dt.month
    df["Week"] = df.Date.dt.week
    df["Day"] = df.Date.dt.day

We'll apply this to every table with a date field.


In [22]:
add_datepart(weather)
add_datepart(googletrend)
add_datepart(train)
add_datepart(test)

In [23]:
trend_de = googletrend[googletrend.file == 'Rossmann_DE']

Now we can outer join all of our data into a single dataframe.

Recall that in outer joins, every time a value in the joining field of the left table has no corresponding value in the right table, the corresponding row in the new table has null values for all right-table fields.

One way to check that all records are consistent and complete is to check for Null values post-join, as we do here.

Aside: Why not just do an inner join? If you are assuming that all records are complete and match on the field you desire, an inner join will do the same thing as an outer join. However, in the event you are wrong or a mistake is made, an outer join followed by a null check will catch it. (Comparing before/after row counts for an inner join is equivalent, but requires keeping track of those counts; the outer join is easier.)
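
For illustration, here is what the equivalent inner-join row-count check would look like (a sketch only; the cells below use the left-join plus null-check approach):


In [ ]:
# Sketch: the row-count check one would need with an inner join.
# If the counts match, every row of `store` found a matching state.
n_before = len(store)
n_after = len(store.merge(store_states, how='inner', on='Store'))
n_before, n_after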


In [24]:
store = join_df(store, store_states, "Store")
len(store[store.State.isnull()])


Out[24]:
0

In [25]:
joined = join_df(train, store, "Store")
len(joined[joined.StoreType.isnull()])


Out[25]:
0

In [26]:
joined = join_df(joined, googletrend, ["State","Year", "Week"])
len(joined[joined.trend.isnull()])


Out[26]:
0

In [27]:
joined = joined.merge(trend_de, 'left', ["Year", "Week"], suffixes=('', '_DE'))
len(joined[joined.trend_DE.isnull()])


Out[27]:
0

In [28]:
joined = join_df(joined, weather, ["State","Date"])
len(joined[joined.Mean_TemperatureC.isnull()])


Out[28]:
0

In [29]:
joined_test = test.merge(store, how='left', left_on='Store', right_index=True)
len(joined_test[joined_test.StoreType.isnull()])


Out[29]:
48

Next we'll fill in missing values to avoid complications w/ na's.


In [30]:
joined.CompetitionOpenSinceYear = joined.CompetitionOpenSinceYear.fillna(1900).astype(np.int32)
joined.CompetitionOpenSinceMonth = joined.CompetitionOpenSinceMonth.fillna(1).astype(np.int32)
joined.Promo2SinceYear = joined.Promo2SinceYear.fillna(1900).astype(np.int32)
joined.Promo2SinceWeek = joined.Promo2SinceWeek.fillna(1).astype(np.int32)

Next we'll extract the features "CompetitionOpenSince" and "CompetitionDaysOpen". Note the use of apply() to map a function across dataframe rows.


In [31]:
joined["CompetitionOpenSince"] = pd.to_datetime(joined.apply(lambda x: datetime.datetime(
    x.CompetitionOpenSinceYear, x.CompetitionOpenSinceMonth, 15), axis=1).astype(pd.datetime))
joined["CompetitionDaysOpen"] = joined.Date.subtract(joined["CompetitionOpenSince"]).dt.days

We'll replace some erroneous / outlying data.


In [32]:
joined.loc[joined.CompetitionDaysOpen<0, "CompetitionDaysOpen"] = 0
joined.loc[joined.CompetitionOpenSinceYear<1990, "CompetitionDaysOpen"] = 0

Added "CompetitionMonthsOpen" field, limit the maximum to 2 years to limit number of unique embeddings.


In [33]:
joined["CompetitionMonthsOpen"] = joined["CompetitionDaysOpen"]//30
joined.loc[joined.CompetitionMonthsOpen>24, "CompetitionMonthsOpen"] = 24
joined.CompetitionMonthsOpen.unique()


Out[33]:
array([24,  3, 19,  9,  0, 16, 17,  7, 15, 22, 11, 13,  2, 23, 12,  4, 10,
        1, 14, 20,  8, 18,  6, 21,  5])

Same process for Promo dates.


In [34]:
joined["Promo2Since"] = pd.to_datetime(joined.apply(lambda x: Week(
    x.Promo2SinceYear, x.Promo2SinceWeek).monday(), axis=1).astype(pd.datetime))
joined["Promo2Days"] = joined.Date.subtract(joined["Promo2Since"]).dt.days

In [35]:
joined.loc[joined.Promo2Days<0, "Promo2Days"] = 0
joined.loc[joined.Promo2SinceYear<1990, "Promo2Days"] = 0

In [36]:
joined["Promo2Weeks"] = joined["Promo2Days"]//7
joined.loc[joined.Promo2Weeks<0, "Promo2Weeks"] = 0
joined.loc[joined.Promo2Weeks>25, "Promo2Weeks"] = 25
joined.Promo2Weeks.unique()


Out[36]:
array([ 0, 25, 17,  8, 13, 24, 16,  7, 12, 23, 15,  6, 11, 22, 14,  5, 10,
       21,  4,  9, 20,  3, 19,  2, 18,  1])

Durations

It is common when working with time series data to extract data that explains relationships across rows as opposed to columns, e.g.:

  • Running averages
  • Time until next event
  • Time since last event

This is often difficult to do with most table manipulation frameworks, since they are designed to work with relationships across columns. As such, we've created a class to handle this type of data.


In [37]:
columns = ["Date", "Store", "Promo", "StateHoliday", "SchoolHoliday"]

We've defined a class elapsed for tracking elapsed time across a sorted dataframe.

Given a particular field fld to monitor, this object tracks the time since that field was last seen. Each time the field occurs again, the counter resets to zero.

Upon initialization, this results in datetime NAs until the field is first encountered. The counter is also reset every time a new store is seen.

We'll see how to use this shortly.


In [38]:
class elapsed(object):
    def __init__(self, fld):
        self.fld = fld
        self.last = pd.to_datetime(np.nan)
        self.last_store = 0
        
    def get(self, row):
        if row.Store != self.last_store:
            self.last = pd.to_datetime(np.nan)
            self.last_store = row.Store
        if (row[self.fld]): self.last = row.Date
        return row.Date-self.last

In [39]:
df = train[columns]

And a function for applying said class across dataframe rows and adding values to a new column.


In [40]:
def add_elapsed(fld, prefix):
    sh_el = elapsed(fld)
    df[prefix+fld] = df.apply(sh_el.get, axis=1)

Let's walk through an example.

Say we're looking at SchoolHoliday. We'll first sort by Store, then Date, and then call add_elapsed('SchoolHoliday', 'After'). This generates an instance of the elapsed class for SchoolHoliday:

  • The instance is applied to every row of the dataframe, in order of store and date
  • It adds to the dataframe the days since last seeing a school holiday
  • If we sort in the other direction, this counts the days until the next school holiday instead.
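
As an aside, the same "days since the last event" feature can often be computed in a vectorized way rather than with a row-wise apply(). A minimal sketch of one way to do it, assuming df is sorted by Store and Date as above:


In [ ]:
# Sketch of a vectorized alternative to the elapsed/apply approach.
# Carry forward the date of the most recent school holiday within each store,
# then subtract it from the current date.
d = df.sort_values(['Store', 'Date'])
last_holiday = d.Date.where(d.SchoolHoliday == 1)     # holiday date on holiday rows, NaT elsewhere
last_holiday = last_holiday.groupby(d.Store).ffill()  # forward-fill within each store
days_since = (d.Date - last_holiday).dt.days          # NaN before a store's first holiday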

In [41]:
fld = 'SchoolHoliday'
df = df.sort_values(['Store', 'Date'])
add_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
add_elapsed(fld, 'Before')

We'll do this for two more fields.


In [42]:
fld = 'StateHoliday'
df = df.sort_values(['Store', 'Date'])
add_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
add_elapsed(fld, 'Before')

In [43]:
fld = 'Promo'
df = df.sort_values(['Store', 'Date'])
add_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
add_elapsed(fld, 'Before')

In [44]:
display(df.head())


Date Store Promo StateHoliday SchoolHoliday AfterSchoolHoliday BeforeSchoolHoliday AfterStateHoliday BeforeStateHoliday AfterPromo BeforePromo
0 2015-07-31 1 1 False 1 0 days 0 days 57 days NaT 0 days 00:00:00 0 days
1115 2015-07-30 1 1 False 1 0 days 0 days 56 days NaT 0 days 00:00:00 0 days
2230 2015-07-29 1 1 False 1 0 days 0 days 55 days NaT 0 days 00:00:00 0 days
3345 2015-07-28 1 1 False 1 0 days 0 days 54 days NaT 0 days 00:00:00 0 days
4460 2015-07-27 1 1 False 1 0 days 0 days 53 days NaT 0 days 00:00:00 0 days

We're going to set the active index to Date.


In [45]:
df = df.set_index("Date")

Then set null values from elapsed field calculations to 0.


In [46]:
columns = ['SchoolHoliday', 'StateHoliday', 'Promo']

In [47]:
for o in ['Before', 'After']:
    for p in columns:
        a = o+p
        df[a] = df[a].fillna(pd.Timedelta(0)).dt.days

Next we'll demonstrate window functions in pandas to calculate rolling quantities.

Here we're sorting by date (sort_index()) and, grouped by Store (groupby()), counting the number of events of interest (sum()) defined in columns over a 7-day window (rolling()). Sorting in ascending order gives counts over the preceding week; doing the same on the descending sort gives counts over the following week.


In [48]:
bwd = df[['Store']+columns].sort_index().groupby("Store").rolling(7, min_periods=1).sum()

In [49]:
fwd = df[['Store']+columns].sort_index(ascending=False
                                      ).groupby("Store").rolling(7, min_periods=1).sum()

Next we want to drop the Store column that got rolled up in the window function (it was summed along with the other columns), and reset the grouped index back into regular columns.

Often in pandas, there is an option to do this in place. This is time and memory efficient when working with large datasets.


In [50]:
bwd.drop('Store',1,inplace=True)
bwd.reset_index(inplace=True)

In [51]:
fwd.drop('Store',1,inplace=True)
fwd.reset_index(inplace=True)

In [52]:
df.reset_index(inplace=True)

Now we'll merge these values onto the df.


In [53]:
df = df.merge(bwd, 'left', ['Date', 'Store'], suffixes=['', '_bw'])
df = df.merge(fwd, 'left', ['Date', 'Store'], suffixes=['', '_fw'])

In [54]:
df.drop(columns,1,inplace=True)

In [55]:
df.head()


Out[55]:
Date Store AfterSchoolHoliday BeforeSchoolHoliday AfterStateHoliday BeforeStateHoliday AfterPromo BeforePromo SchoolHoliday_bw StateHoliday_bw Promo_bw SchoolHoliday_fw StateHoliday_fw Promo_fw
0 2015-07-31 1 0 0 57 0 0 0 5.0 0.0 5.0 1.0 0.0 1.0
1 2015-07-30 1 0 0 56 0 0 0 4.0 0.0 4.0 2.0 0.0 2.0
2 2015-07-29 1 0 0 55 0 0 0 3.0 0.0 3.0 3.0 0.0 3.0
3 2015-07-28 1 0 0 54 0 0 0 2.0 0.0 2.0 4.0 0.0 4.0
4 2015-07-27 1 0 0 53 0 0 0 1.0 0.0 1.0 5.0 0.0 5.0

It's usually a good idea to back up large tables of extracted/wrangled features before you join them onto another one; that way you can return to them easily if you need to make changes.


In [56]:
df.to_csv('df.csv')

In [57]:
df = pd.read_csv('df.csv', index_col=0)

In [58]:
df["Date"] = pd.to_datetime(df.Date)

In [59]:
df.columns


Out[59]:
Index(['Date', 'Store', 'AfterSchoolHoliday', 'BeforeSchoolHoliday',
       'AfterStateHoliday', 'BeforeStateHoliday', 'AfterPromo', 'BeforePromo',
       'SchoolHoliday_bw', 'StateHoliday_bw', 'Promo_bw', 'SchoolHoliday_fw',
       'StateHoliday_fw', 'Promo_fw'],
      dtype='object')

In [60]:
joined = join_df(joined, df, ['Store', 'Date'])

We'll back this up as well.


In [61]:
joined.to_csv('joined.csv')

We now have our final set of engineered features.


In [62]:
joined = pd.read_csv('joined.csv', index_col=0)
joined["Date"] = pd.to_datetime(joined.Date)
joined.columns


Out[62]:
Index(['Store', 'DayOfWeek', 'Date', 'Sales', 'Customers', 'Open', 'Promo',
       'StateHoliday', 'SchoolHoliday', 'Year', 'Month', 'Week', 'Day',
       'StoreType', 'Assortment', 'CompetitionDistance',
       'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2',
       'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval', 'State', 'file',
       'week', 'trend', 'Date_y', 'Month_y', 'Day_y', 'file_DE', 'week_DE',
       'trend_DE', 'Date_DE', 'State_DE', 'Month_DE', 'Day_DE', 'file_y',
       'Max_TemperatureC', 'Mean_TemperatureC', 'Min_TemperatureC',
       'Dew_PointC', 'MeanDew_PointC', 'Min_DewpointC', 'Max_Humidity',
       'Mean_Humidity', 'Min_Humidity', 'Max_Sea_Level_PressurehPa',
       'Mean_Sea_Level_PressurehPa', 'Min_Sea_Level_PressurehPa',
       'Max_VisibilityKm', 'Mean_VisibilityKm', 'Min_VisibilitykM',
       'Max_Wind_SpeedKm_h', 'Mean_Wind_SpeedKm_h', 'Max_Gust_SpeedKm_h',
       'Precipitationmm', 'CloudCover', 'Events', 'WindDirDegrees',
       'StateName', 'Year_y', 'Month_y.1', 'Week_y', 'Day_y.1',
       'CompetitionOpenSince', 'CompetitionDaysOpen', 'CompetitionMonthsOpen',
       'Promo2Since', 'Promo2Days', 'Promo2Weeks', 'AfterSchoolHoliday',
       'BeforeSchoolHoliday', 'AfterStateHoliday', 'BeforeStateHoliday',
       'AfterPromo', 'BeforePromo', 'SchoolHoliday_bw', 'StateHoliday_bw',
       'Promo_bw', 'SchoolHoliday_fw', 'StateHoliday_fw', 'Promo_fw'],
      dtype='object')

While these steps were explicitly outlined in the paper, they are all fairly typical feature engineering steps for dealing with time series data and are applicable in any similar setting.

Create features

Now that we've engineered all our features, we need to convert them into input compatible with a neural network.

This includes converting categorical variables into contiguous integers or one-hot encodings, normalizing continuous features to standard normal, etc.


In [63]:
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import LabelEncoder, Imputer, StandardScaler

This dictionary maps categorical variables to their embedding dimensionality. In general, variables we might expect to be conceptually more complex are given a larger dimension.


In [64]:
cat_var_dict = {'Store': 50, 'DayOfWeek': 6, 'Year': 2, 'Month': 6,
'Day': 10, 'StateHoliday': 3, 'CompetitionMonthsOpen': 2,
'Promo2Weeks': 1, 'StoreType': 2, 'Assortment': 3, 'PromoInterval': 3,
'CompetitionOpenSinceYear': 4, 'Promo2SinceYear': 4, 'State': 6,
'Week': 2, 'Events': 4, 'Promo_fw': 1,
'Promo_bw': 1, 'StateHoliday_fw': 1,
'StateHoliday_bw': 1, 'SchoolHoliday_fw': 1,
'SchoolHoliday_bw': 1}

Name categorical variables


In [65]:
cat_vars = [o[0] for o in 
            sorted(cat_var_dict.items(), key=operator.itemgetter(1), reverse=True)]

In [66]:
"""cat_vars = ['Store', 'DayOfWeek', 'Year', 'Month', 'Day', 'StateHoliday',
           'StoreType', 'Assortment', 'Week', 'Events', 'Promo2SinceYear',
            'CompetitionOpenSinceYear', 'PromoInterval', 'Promo', 'SchoolHoliday', 'State']"""


Out[66]:
"cat_vars = ['Store', 'DayOfWeek', 'Year', 'Month', 'Day', 'StateHoliday',\n           'StoreType', 'Assortment', 'Week', 'Events', 'Promo2SinceYear',\n            'CompetitionOpenSinceYear', 'PromoInterval', 'Promo', 'SchoolHoliday', 'State']"

Likewise for continuous


In [67]:
# mean/max wind; min temp; cloud; min/mean humid; 
contin_vars = ['CompetitionDistance', 
   'Max_TemperatureC', 'Mean_TemperatureC', 'Min_TemperatureC',
   'Max_Humidity', 'Mean_Humidity', 'Min_Humidity', 'Max_Wind_SpeedKm_h', 
   'Mean_Wind_SpeedKm_h', 'CloudCover', 'trend', 'trend_DE',
   'AfterStateHoliday', 'BeforeStateHoliday', 'Promo', 'SchoolHoliday']

In [68]:
"""contin_vars = ['CompetitionDistance', 'Max_TemperatureC', 'Mean_TemperatureC', 
   'Max_Humidity', 'trend', 'trend_DE', 'AfterStateHoliday', 'BeforeStateHoliday']"""


Out[68]:
"contin_vars = ['CompetitionDistance', 'Max_TemperatureC', 'Mean_TemperatureC', \n   'Max_Humidity', 'trend', 'trend_DE', 'AfterStateHoliday', 'BeforeStateHoliday']"

Replace nulls w/ 0 for continuous, "" for categorical.


In [69]:
for v in contin_vars: joined.loc[joined[v].isnull(), v] = 0
for v in cat_vars: joined.loc[joined[v].isnull(), v] = ""

Here we create a list of tuples, each containing a variable and an instance of a transformer for that variable.

For categoricals, we use a label encoder that maps categories to contiguous integers. For continuous variables, we standardize them.


In [70]:
cat_maps = [(o, LabelEncoder()) for o in cat_vars]
contin_maps = [([o], StandardScaler()) for o in contin_vars]

The same instances need to be used for the test set as well, so values are mapped/standardized appropriately.

DataFrameMapper will keep track of these variable-to-instance mappings.


In [71]:
cat_mapper = DataFrameMapper(cat_maps)
cat_map_fit = cat_mapper.fit(joined)
cat_cols = len(cat_map_fit.features)
cat_cols


Out[71]:
22

In [72]:
contin_mapper = DataFrameMapper(contin_maps)
contin_map_fit = contin_mapper.fit(joined)
contin_cols = len(contin_map_fit.features)
contin_cols


/home/roebius/pj/p3/lib/python3.5/site-packages/sklearn/utils/validation.py:429: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
  warnings.warn(msg, _DataConversionWarning)
Out[72]:
16

Example: the first five columns of the zeroth row, transformed appropriately.


In [73]:
cat_map_fit.transform(joined)[0,:5], contin_map_fit.transform(joined)[0,:5]


/home/roebius/pj/p3/lib/python3.5/site-packages/sklearn/utils/validation.py:429: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
  warnings.warn(msg, _DataConversionWarning)
Out[73]:
(array([ 0, 30,  4,  6,  4]),
 array([-0.53772351,  1.02444552,  0.81863226,  0.32504749,  0.61112004]))

We can also pickle these mappings, which is great for portability!


In [74]:
pickle.dump(contin_map_fit, open('contin_maps.pickle', 'wb'))
pickle.dump(cat_map_fit, open('cat_maps.pickle', 'wb'))
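
The saved mappers can later be restored with pickle.load, e.g. in a separate inference process:


In [ ]:
# - Restore the fitted mappers later if needed (uncomment to use)
# contin_map_fit = pickle.load(open('contin_maps.pickle', 'rb'))
# cat_map_fit = pickle.load(open('cat_maps.pickle', 'rb'))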

In [75]:
[len(o[1].classes_) for o in cat_map_fit.features]


Out[75]:
[1115, 31, 7, 12, 12, 22, 8, 23, 4, 3, 2, 25, 3, 4, 52, 6, 8, 3, 6, 8, 3, 26]

Sample data

Next, the authors removed all instances where the store had zero sales or was closed.


In [76]:
joined_sales = joined[joined.Sales!=0]
n = len(joined_sales)

We speculate that this may have cost them a higher standing in the competition. One reason is that a little EDA reveals there are often periods where stores are closed, typically for refurbishment. Before and after these periods there are natural spikes in sales, as one might expect. By omitting this data from their training, the authors gave up the ability to leverage information about these periods to predict this otherwise volatile behavior.
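
For instance, a minimal sketch of the kind of EDA that reveals these closed stretches (store 150 here is an arbitrary example):


In [ ]:
# Sketch: lengths of consecutive closed periods for a single store
s = joined[joined.Store == 150].sort_values('Date')
closed = (s.Open == 0)
run_id = closed.ne(closed.shift()).cumsum()            # label consecutive runs of open/closed days
closed_run_lengths = s[closed].groupby(run_id[closed]).size()
closed_run_lengths.sort_values(ascending=False).head()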


In [77]:
n


Out[77]:
844338

We're going to run on a sample.


In [78]:
samp_size = 100000
np.random.seed(42)
idxs = sorted(np.random.choice(n, samp_size, replace=False))

In [79]:
joined_samp = joined_sales.iloc[idxs].set_index("Date")

In [80]:
samp_size = n
joined_samp = joined_sales.set_index("Date")

In time series data, cross-validation is not random. Instead, our holdout data is always the most recent data, as it would be in a real application.

We've taken the last 10% as our validation set.


In [81]:
train_ratio = 0.9
train_size = int(samp_size * train_ratio)

In [82]:
train_size


Out[82]:
759904

In [83]:
joined_valid = joined_samp[train_size:]
joined_train = joined_samp[:train_size]
len(joined_valid), len(joined_train)


Out[83]:
(84434, 759904)

Here's a preprocessor for our categoricals using our instance mapper.


In [84]:
def cat_preproc(dat):
    return cat_map_fit.transform(dat).astype(np.int64)

In [85]:
cat_map_train = cat_preproc(joined_train)
cat_map_valid = cat_preproc(joined_valid)

Same for continuous.


In [86]:
def contin_preproc(dat):
    return contin_map_fit.transform(dat).astype(np.float32)

In [87]:
contin_map_train = contin_preproc(joined_train)
contin_map_valid = contin_preproc(joined_valid)


/home/roebius/pj/p3/lib/python3.5/site-packages/sklearn/utils/validation.py:429: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
  warnings.warn(msg, _DataConversionWarning)

Grab our targets.


In [88]:
y_train_orig = joined_train.Sales
y_valid_orig = joined_valid.Sales

Finally, the authors modified the target values by applying a logarithmic transformation and normalizing to unit scale by dividing by the maximum log value.

Log transformations are used on this type of data frequently to attain a nicer shape.

Further, by scaling to the unit interval we can use a sigmoid output in our neural network. To invert the transform, we multiply by the maximum log value to recover the log of sales, then exponentiate.


In [89]:
max_log_y = np.max(np.log(joined_samp.Sales))
y_train = np.log(y_train_orig)/max_log_y
y_valid = np.log(y_valid_orig)/max_log_y
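
As a quick sanity check (a sketch), the transform should round-trip back to the original sales:


In [ ]:
# exp of the rescaled value should recover the original sales (up to float error)
np.allclose(np.exp(y_train * max_log_y), y_train_orig)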

Note: Some testing shows this doesn't make a big difference.


In [90]:
"""#y_train = np.log(y_train)
ymean=y_train_orig.mean()
ystd=y_train_orig.std()
y_train = (y_train_orig-ymean)/ystd
#y_valid = np.log(y_valid)
y_valid = (y_valid_orig-ymean)/ystd"""


Out[90]:
'#y_train = np.log(y_train)\nymean=y_train_orig.mean()\nystd=y_train_orig.std()\ny_train = (y_train_orig-ymean)/ystd\n#y_valid = np.log(y_valid)\ny_valid = (y_valid_orig-ymean)/ystd'

Root mean square percentage error (RMSPE) is the metric Kaggle used for this competition.
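
Written out (matching the rmspe function below):

$$\mathrm{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{y_i}\right)^2}$$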


In [91]:
def rmspe(y_pred, targ = y_valid_orig):
    pct_var = (targ - y_pred)/targ
    return math.sqrt(np.square(pct_var).mean())

These undo the target transformations.


In [92]:
def log_max_inv(preds, mx = max_log_y):
    return np.exp(preds * mx)

In [93]:
# - This can be used if ymean and ystd are calculated above (they are currently commented out)
def normalize_inv(preds):
    return preds * ystd + ymean

Create models

Now we're ready to put together our models.

Much of the following code has commented out portions / alternate implementations.


In [94]:
"""
1 97s - loss: 0.0104 - val_loss: 0.0083
2 93s - loss: 0.0076 - val_loss: 0.0076
3 90s - loss: 0.0071 - val_loss: 0.0076
4 90s - loss: 0.0068 - val_loss: 0.0075
5 93s - loss: 0.0066 - val_loss: 0.0075
6 95s - loss: 0.0064 - val_loss: 0.0076
7 98s - loss: 0.0063 - val_loss: 0.0077
8 97s - loss: 0.0062 - val_loss: 0.0075
9 95s - loss: 0.0061 - val_loss: 0.0073
0 101s - loss: 0.0061 - val_loss: 0.0074
"""


Out[94]:
'\n1 97s - loss: 0.0104 - val_loss: 0.0083\n2 93s - loss: 0.0076 - val_loss: 0.0076\n3 90s - loss: 0.0071 - val_loss: 0.0076\n4 90s - loss: 0.0068 - val_loss: 0.0075\n5 93s - loss: 0.0066 - val_loss: 0.0075\n6 95s - loss: 0.0064 - val_loss: 0.0076\n7 98s - loss: 0.0063 - val_loss: 0.0077\n8 97s - loss: 0.0062 - val_loss: 0.0075\n9 95s - loss: 0.0061 - val_loss: 0.0073\n0 101s - loss: 0.0061 - val_loss: 0.0074\n'

In [95]:
def split_cols(arr):
    return np.hsplit(arr,arr.shape[1])

In [96]:
# - This gives the correct list length for the model
# - (list of 23 elements: 22 embeddings + 1 array of 16-dim elements)  
map_train = split_cols(cat_map_train) + [contin_map_train]
map_valid = split_cols(cat_map_valid) + [contin_map_valid]

In [97]:
len(map_train)


Out[97]:
23

In [98]:
# map_train = split_cols(cat_map_train) + split_cols(contin_map_train)
# map_valid = split_cols(cat_map_valid) + split_cols(contin_map_valid)

Helper function for getting categorical name and dim.


In [99]:
def cat_map_info(feat): return feat[0], len(feat[1].classes_)

In [100]:
cat_map_info(cat_map_fit.features[1])


Out[100]:
('Day', 31)

In [101]:
# - In Keras 2 the "initializations" module is not available.
# - To keep here the custom initializer the code from Keras 1 "uniform" initializer is exploited 
def my_init(scale):
#     return lambda shape, name=None: initializations.uniform(shape, scale=scale, name=name)
    # Return a closure so that `shape` and `name` are supplied by Keras when the initializer is called
    return lambda shape, name=None: K.variable(
        np.random.uniform(low=-scale, high=scale, size=shape), name=name)

In [102]:
# - In Keras 2 the "initializations" module is not available.
# - To keep here the custom initializer the code from Keras 1 "uniform" initializer is exploited 
def emb_init(shape, name=None): 
#     return initializations.uniform(shape, scale=2/(shape[1]+1), name=name)
    return K.variable(np.random.uniform(low=-2/(shape[1]+1), high=2/(shape[1]+1), size=shape),
                      name=name)

Helper function for constructing embeddings. Notice the commented-out code: there are several different ways to compute the embedding size at play.

Also note that we're flattening the embedding. Embeddings in Keras come out as an element of a sequence, like we might use for a sequence of words; here we just want to concatenate them, so we flatten the length-1 sequence into a vector.


In [103]:
def get_emb(feat):
    name, c = cat_map_info(feat)
    #c2 = cat_var_dict[name]
    c2 = (c+1)//2
    if c2>50: c2=50
    inp = Input((1,), dtype='int64', name=name+'_in')
    # , kernel_regularizer=l2(1e-6)  # Keras 2
    u = Flatten(name=name+'_flt')(Embedding(c, c2, input_length=1, embeddings_initializer=emb_init)(inp))  # Keras 2
#     u = Flatten(name=name+'_flt')(Embedding(c, c2, input_length=1)(inp))
    return inp,u

Helper function for continuous inputs.


In [104]:
def get_contin(feat):
    name = feat[0][0]
    inp = Input((1,), name=name+'_in')
    return inp, Dense(1, name=name+'_d', kernel_initializer=my_init(1.))(inp)  # Keras 2

Let's build them.


In [105]:
contin_inp = Input((contin_cols,), name='contin')
contin_out = Dense(contin_cols*10, activation='relu', name='contin_d')(contin_inp)
#contin_out = BatchNormalization()(contin_out)

Now we can put them together. Given the inputs, continuous and categorical embeddings, we're going to concatenate all of them.

Next, we're going to pass through some dropout, then two dense layers w/ ReLU activations, then dropout again, then the sigmoid activation we mentioned earlier.


In [106]:
embs = [get_emb(feat) for feat in cat_map_fit.features]
#conts = [get_contin(feat) for feat in contin_map_fit.features]
#contin_d = [d for inp,d in conts]
x = concatenate([emb for inp,emb in embs] + [contin_out])  # Keras 2
#x = concatenate([emb for inp,emb in embs] + contin_d)  # Keras 2

x = Dropout(0.02)(x)
x = Dense(1000, activation='relu', kernel_initializer='uniform')(x)
x = Dense(500, activation='relu', kernel_initializer='uniform')(x)
x = Dropout(0.2)(x)
x = Dense(1, activation='sigmoid')(x)

model = Model([inp for inp,emb in embs] + [contin_inp], x)
#model = Model([inp for inp,emb in embs] + [inp for inp,d in conts], x)
model.compile('adam', 'mean_absolute_error')
#model.compile(Adam(), 'mse')

Start training


In [107]:
%%time 
hist = model.fit(map_train, y_train, batch_size=128, epochs=25,
                 verbose=1, validation_data=(map_valid, y_valid))


Train on 759904 samples, validate on 84434 samples
Epoch 1/25
759904/759904 [==============================] - 53s - loss: 0.0115 - val_loss: 0.0109
Epoch 2/25
759904/759904 [==============================] - 52s - loss: 0.0081 - val_loss: 0.0106
Epoch 3/25
759904/759904 [==============================] - 53s - loss: 0.0072 - val_loss: 0.0103
Epoch 4/25
759904/759904 [==============================] - 53s - loss: 0.0068 - val_loss: 0.0098
Epoch 5/25
759904/759904 [==============================] - 53s - loss: 0.0066 - val_loss: 0.0099
Epoch 6/25
759904/759904 [==============================] - 54s - loss: 0.0064 - val_loss: 0.0098
Epoch 7/25
759904/759904 [==============================] - 54s - loss: 0.0063 - val_loss: 0.0099
Epoch 8/25
759904/759904 [==============================] - 54s - loss: 0.0062 - val_loss: 0.0099
Epoch 9/25
759904/759904 [==============================] - 54s - loss: 0.0062 - val_loss: 0.0098
Epoch 10/25
759904/759904 [==============================] - 53s - loss: 0.0061 - val_loss: 0.0097
Epoch 11/25
759904/759904 [==============================] - 53s - loss: 0.0061 - val_loss: 0.0101
Epoch 12/25
759904/759904 [==============================] - 53s - loss: 0.0061 - val_loss: 0.0096
Epoch 13/25
759904/759904 [==============================] - 53s - loss: 0.0060 - val_loss: 0.0098
Epoch 14/25
759904/759904 [==============================] - 53s - loss: 0.0060 - val_loss: 0.0098
Epoch 15/25
759904/759904 [==============================] - 53s - loss: 0.0060 - val_loss: 0.0096
Epoch 16/25
759904/759904 [==============================] - 53s - loss: 0.0060 - val_loss: 0.0096
Epoch 17/25
759904/759904 [==============================] - 53s - loss: 0.0059 - val_loss: 0.0100
Epoch 18/25
759904/759904 [==============================] - 53s - loss: 0.0059 - val_loss: 0.0095
Epoch 19/25
759904/759904 [==============================] - 53s - loss: 0.0059 - val_loss: 0.0097
Epoch 20/25
759904/759904 [==============================] - 53s - loss: 0.0059 - val_loss: 0.0098
Epoch 21/25
759904/759904 [==============================] - 53s - loss: 0.0059 - val_loss: 0.0094
Epoch 22/25
759904/759904 [==============================] - 53s - loss: 0.0059 - val_loss: 0.0095
Epoch 23/25
759904/759904 [==============================] - 53s - loss: 0.0058 - val_loss: 0.0095
Epoch 24/25
759904/759904 [==============================] - 53s - loss: 0.0058 - val_loss: 0.0096
Epoch 25/25
759904/759904 [==============================] - 53s - loss: 0.0058 - val_loss: 0.0097
CPU times: user 48min 2s, sys: 5min 7s, total: 53min 10s
Wall time: 22min 22s

In [108]:
hist.history


Out[108]:
{'loss': [0.011543708899975998,
  0.0081428832266230763,
  0.0071605561921078793,
  0.0067688872935217853,
  0.006562436936816074,
  0.0064265123611733288,
  0.0063263529084675255,
  0.0062496530353846023,
  0.0061991263113358392,
  0.0061371925805966789,
  0.0060975233105218709,
  0.0060683156664236917,
  0.0060377205630576351,
  0.0060002038249019713,
  0.0059846025952508429,
  0.0059559546078189734,
  0.0059432193379071567,
  0.005925937787276579,
  0.0058936794265827342,
  0.0058852603196882548,
  0.0058711308764912897,
  0.0058566379312929573,
  0.0058427490107691602,
  0.0058289598719467768,
  0.0058106531752574247],
 'val_loss': [0.010913814287089577,
  0.010564941083243145,
  0.01027382288614701,
  0.0098251879501716419,
  0.0098826151931447094,
  0.0098126728207583811,
  0.0098805974064248396,
  0.0099482648354793594,
  0.0097641247889236529,
  0.0097114029873534794,
  0.0100529663411497,
  0.0096321621593877232,
  0.0097647817569319374,
  0.0098252719552163852,
  0.0096129882340110218,
  0.0095683800178879407,
  0.010039474427044233,
  0.0095306029156341735,
  0.009686993617640945,
  0.0097661329346007009,
  0.0093741671631562863,
  0.0094836538342612225,
  0.0095238257312254323,
  0.0096323451664686414,
  0.0097270023006603736]}

In [109]:
plot_train(hist)



In [110]:
preds = np.squeeze(model.predict(map_valid, 1024))

Result on validation data: 0.1678 (samp 150k, 0.75 trn)


In [111]:
log_max_inv(preds)


Out[111]:
array([  7165.78076172,   8693.43945312,   5352.79882812,   7027.24267578,
         5618.58349609,   5587.34033203,   6442.33349609,   8173.66503906,
         5827.05566406,   6670.58496094,   9821.00585938,   6997.49658203,
         6749.18359375,   5440.39208984,   7297.14794922,   6256.46826172,
         8004.54150391,   5506.05957031,   3748.50585938,   6218.23828125,
       ...,   5906.76220703,  19001.56835938,   4391.00683594,
         4296.71191406,   8192.12109375,  22114.27734375,   3086.32592773,
         5314.42626953,   3874.55859375,   8417.43261719,   4149.20214844,
         3612.04711914,   2941.67211914,  15090.69824219,   4947.65966797,
         4927.15380859,  13834.59179688,   6490.71289062,   4514.25390625,
         7183.3515625 ], dtype=float32)
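
To express this in terms of the competition metric, the validation RMSPE can be computed with the helpers defined above (a sketch):


In [ ]:
# Sketch: undo the target transform and score against the original validation sales
rmspe(log_max_inv(preds))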

In [114]:
# - This will work if ymean and ystd are calculated in the "Data" section above (in this case uncomment)
# normalize_inv(preds)

Using 3rd place data


In [ ]:
pkl_path = '/data/jhoward/github/entity-embedding-rossmann/'

In [ ]:
def load_pickle(fname): 
    return pickle.load(open(pkl_path+fname + '.pickle', 'rb'))

In [ ]:
[x_pkl_orig, y_pkl_orig] = load_pickle('feature_train_data')

In [ ]:
max_log_y_pkl = np.max(np.log(y_pkl_orig))
y_pkl = np.log(y_pkl_orig)/max_log_y_pkl

In [ ]:
pkl_vars = ['Open', 'Store', 'DayOfWeek', 'Promo', 'Year', 'Month', 'Day', 
     'StateHoliday', 'SchoolHoliday', 'CompetitionMonthsOpen', 'Promo2Weeks', 
    'Promo2Weeks_L', 'CompetitionDistance',
    'StoreType', 'Assortment', 'PromoInterval', 'CompetitionOpenSinceYear',
    'Promo2SinceYear', 'State', 'Week', 'Max_TemperatureC', 'Mean_TemperatureC', 
    'Min_TemperatureC', 'Max_Humidity', 'Mean_Humidity', 'Min_Humidity', 'Max_Wind_SpeedKm_h', 
    'Mean_Wind_SpeedKm_h', 'CloudCover','Events', 'Promo_fw', 'Promo_bw', 
    'StateHoliday_fw', 'StateHoliday_bw', 'AfterStateHoliday', 'BeforeStateHoliday', 
    'SchoolHoliday_fw', 'SchoolHoliday_bw', 'trend_DE', 'trend']

In [ ]:
x_pkl = np.array(x_pkl_orig)

In [ ]:
gt_enc = StandardScaler()
gt_enc.fit(x_pkl[:,-2:])

In [ ]:
x_pkl[:,-2:] = gt_enc.transform(x_pkl[:,-2:])

In [ ]:
x_pkl.shape

In [ ]:
x_pkl = x_pkl[idxs]
y_pkl = y_pkl[idxs]

In [ ]:
x_pkl_trn, x_pkl_val = x_pkl[:train_size], x_pkl[train_size:]
y_pkl_trn, y_pkl_val = y_pkl[:train_size], y_pkl[train_size:]

In [ ]:
x_pkl_trn.shape

In [ ]:
xgb_parms = {'learning_rate': 0.1, 'subsample': 0.6, 
             'colsample_bylevel': 0.6, 'silent': True, 'objective': 'reg:linear'}

In [ ]:
xdata_pkl = xgboost.DMatrix(x_pkl_trn, y_pkl_trn, feature_names=pkl_vars)

In [ ]:
xdata_val_pkl = xgboost.DMatrix(x_pkl_val, y_pkl_val, feature_names=pkl_vars)

In [ ]:
xgb_parms['seed'] = random.randint(0,1e9)
model_pkl = xgboost.train(xgb_parms, xdata_pkl)

In [ ]:
model_pkl.eval(xdata_val_pkl)

In [ ]:
#0.117473

In [ ]:
importance = model_pkl.get_fscore()
importance = sorted(importance.items(), key=operator.itemgetter(1))

df = pd.DataFrame(importance, columns=['feature', 'fscore'])
df['fscore'] = df['fscore'] / df['fscore'].sum()

df.plot(kind='barh', x='feature', y='fscore', legend=False, figsize=(6, 10))
plt.title('XGBoost Feature Importance')
plt.xlabel('relative importance');

Neural net


In [ ]:
#np.savez_compressed('vars.npz', pkl_cats, pkl_contins)
#np.savez_compressed('deps.npz', y_pkl)

In [ ]:
pkl_cats = np.stack([x_pkl[:,pkl_vars.index(f)] for f in cat_vars], 1)
pkl_contins = np.stack([x_pkl[:,pkl_vars.index(f)] for f in contin_vars], 1)

In [ ]:
co_enc = StandardScaler().fit(pkl_contins)
pkl_contins = co_enc.transform(pkl_contins)

In [ ]:
pkl_contins_trn, pkl_contins_val = pkl_contins[:train_size], pkl_contins[train_size:]
pkl_cats_trn, pkl_cats_val = pkl_cats[:train_size], pkl_cats[train_size:]
y_pkl_trn, y_pkl_val = y_pkl[:train_size], y_pkl[train_size:]

In [ ]:
def get_emb_pkl(feat):
    name, c = cat_map_info(feat)
    c2 = (c+2)//3
    if c2>50: c2=50
    inp = Input((1,), dtype='int64', name=name+'_in')
    u = Flatten(name=name+'_flt')(Embedding(c, c2, input_length=1, embeddings_initializer=emb_init)(inp))  # Keras 2
    return inp,u

In [ ]:
n_pkl_contin = pkl_contins_trn.shape[1]
contin_inp = Input((n_pkl_contin,), name='contin')
contin_out = BatchNormalization()(contin_inp)

In [ ]:
map_train_pkl = split_cols(pkl_cats_trn) + [pkl_contins_trn]
map_valid_pkl = split_cols(pkl_cats_val) + [pkl_contins_val]

In [ ]:
def train_pkl(bs=128, ne=10):
    return model_pkl.fit(map_train_pkl, y_pkl_trn, batch_size=bs, epochs=ne,  # Keras 2: epochs
                 verbose=0, validation_data=(map_valid_pkl, y_pkl_val))

In [ ]:
def get_model_pkl(): 
    # conts = [get_contin_pkl(feat) for feat in contin_map_fit.features]  # (unused; get_contin_pkl is not defined)
    embs = [get_emb_pkl(feat) for feat in cat_map_fit.features]
    x = concatenate([emb for inp,emb in embs] + [contin_out])  # Keras 2

    x = Dropout(0.02)(x)
    x = Dense(1000, activation='relu', kernel_initializer='uniform')(x)  # Keras 2
    x = Dense(500, activation='relu', kernel_initializer='uniform')(x)  # Keras 2
    x = Dense(1, activation='sigmoid')(x)

    model_pkl = Model([inp for inp,emb in embs] + [contin_inp], x)
    model_pkl.compile('adam', 'mean_absolute_error')
    #model.compile(Adam(), 'mse')
    return model_pkl

In [ ]:
model_pkl = get_model_pkl()

In [ ]:
train_pkl(128, 10).history['val_loss']

In [ ]:
K.set_value(model_pkl.optimizer.lr, 1e-4)
train_pkl(128, 5).history['val_loss']

In [ ]:
"""
1 97s - loss: 0.0104 - val_loss: 0.0083
2 93s - loss: 0.0076 - val_loss: 0.0076
3 90s - loss: 0.0071 - val_loss: 0.0076
4 90s - loss: 0.0068 - val_loss: 0.0075
5 93s - loss: 0.0066 - val_loss: 0.0075
6 95s - loss: 0.0064 - val_loss: 0.0076
7 98s - loss: 0.0063 - val_loss: 0.0077
8 97s - loss: 0.0062 - val_loss: 0.0075
9 95s - loss: 0.0061 - val_loss: 0.0073
0 101s - loss: 0.0061 - val_loss: 0.0074
"""

In [ ]:
plot_train(hist)

In [ ]:
preds = np.squeeze(model_pkl.predict(map_valid_pkl, 1024))

In [ ]:
y_orig_pkl_val = log_max_inv(y_pkl_val, max_log_y_pkl)

In [ ]:
rmspe(log_max_inv(preds, max_log_y_pkl), y_orig_pkl_val)

XGBoost

Xgboost is extremely quick and easy to use. Aside from being a powerful predictive model, it gives us information about feature importance.


In [115]:
X_train = np.concatenate([cat_map_train, contin_map_train], axis=1)

In [116]:
X_valid = np.concatenate([cat_map_valid, contin_map_valid], axis=1)

In [117]:
all_vars = cat_vars + contin_vars

In [118]:
xgb_parms = {'learning_rate': 0.1, 'subsample': 0.6, 
             'colsample_bylevel': 0.6, 'silent': True, 'objective': 'reg:linear'}

In [119]:
xdata = xgboost.DMatrix(X_train, y_train, feature_names=all_vars)

In [120]:
xdata_val = xgboost.DMatrix(X_valid, y_valid, feature_names=all_vars)

In [121]:
xgb_parms['seed'] = random.randint(0,1e9)
model = xgboost.train(xgb_parms, xdata)

In [122]:
model.eval(xdata_val)


Out[122]:
b'[0]\teval-rmse:0.113776'

In [123]:
model.eval(xdata_val)


Out[123]:
b'[0]\teval-rmse:0.113776'

Competition distance is easily the most important feature, while events are not important at all.

In real applications, putting together a feature importance plot is often a first step. Oftentimes, we can remove hundreds of thousands of features from consideration with importance plots.


In [124]:
importance = model.get_fscore()
importance = sorted(importance.items(), key=operator.itemgetter(1))

df = pd.DataFrame(importance, columns=['feature', 'fscore'])
df['fscore'] = df['fscore'] / df['fscore'].sum()

df.plot(kind='barh', x='feature', y='fscore', legend=False, figsize=(6, 10))
plt.title('XGBoost Feature Importance')
plt.xlabel('relative importance');


End


In [ ]: