FastAI DL1 Lesson 3 CodeAlong II:

Rossmann - Structured Data

1 Structured and Time Series Data

This notebook contains an implementation of the third place result in the Rossmann Kaggle competition as detailed in Gui/Berkhahn's Entity Embeddings of Categorical Variables.

The motivation behind exploring this architecture is it's relevance to real-world application. Most data used for decision making day-to-day in industry is structured and/or time-series data. Here we explore the end-to-end process of using neural networks with practical structured data problems.



In [1]:

    
%matplotlib inline
%reload_ext autoreload
%autoreload 2



In [2]:

    
# from fastai.imports import *
# from fastai.torch_imports import *
from fastai.structured import * # non-PyTorch specfc Machine-Learning tools; indep lib
# from fastai.dataset import * # lets us do fastai PyTorch stuff w/ structured columnar data
from fastai.column_data import *

np.set_printoptions(threshold=50, edgeitems=20)
# from sklearn_pandas import DataFrameMapper
# from sklearn.preprocessing import LabelEncoder, Imputer, StandardScaler
# import operator

PATH = 'data/rossmann/'

1.1 Create Datasets

In addition to the provided data, we'll be using external datasets put together by participants in the Kaggle competition. You can download all of them here.

For completeness, the implementation used to put them together is included below.



In [3]:

    
def concat_csvs(dirname):
    path = f'{PATH}{dirname}'
    filenames = glob.glob(f'{path}/*.csv')
    
    wrote_header = False
    with open(f'{path}.csv', 'w') as outputfle:
        for filename in filenames:
            name = filename.split('.')[0]
            with open(filename) as f:
                line = f.readline()
                if not wrote_header:
                    wrote_header = True
                    outputfile.write('file' + line)
                for line in f:
                    outputfile.write(name + ',' + line)
                outputfile.write('\n')



In [4]:

    
# concat_csvs('googletrend')
# concat_csvs('weather')

Feature Space:

train: Training set provided by competition
store: List of stores
store_states: Mapping of store to the German state they're in
List of German state names
googletrend: Trend of certain google keywords over time, found by users to correlate well w/ given daya
weather: Weather
test: Testing set



In [5]:

    
table_names = ['train', 'store', 'store_states', 'state_names', 
              'googletrend', 'weather', 'test']

We'll be using the popular data manipulation framework pandas. Among other things, Pandas allows you to manipulate tables/DataFrames in Python as one would in a database.

We're going to go ahead and load all our csv's as DataFrames into the list tables.



In [6]:

    
tables = [pd.read_csv(f'{PATH}{fname}.csv', low_memory=False) for fname in table_names]



In [7]:

    
from IPython.display import HTML

We can use head() to get a quick look at the contents of each table:

train: Contains store information on a daily basis, tracks things like sales, customers, whether that day was a holiday, etc.
store: General info about the store including competition, etc.
store_states: Maps store to state it's in
state_names: Maps state abbreviations to names
googletrend: Trend data for particular week/state
weather: Weather conditions for each state
test: Same as training table, w/o sales and customers



In [8]:

    
for t in tables: display(t.head())









    







  
    
      
      Store
      DayOfWeek
      Date
      Sales
      Customers
      Open
      Promo
      StateHoliday
      SchoolHoliday
    
  
  
    
      0
      1
      5
      2015-07-31
      5263
      555
      1
      1
      0
      1
    
    
      1
      2
      5
      2015-07-31
      6064
      625
      1
      1
      0
      1
    
    
      2
      3
      5
      2015-07-31
      8314
      821
      1
      1
      0
      1
    
    
      3
      4
      5
      2015-07-31
      13995
      1498
      1
      1
      0
      1
    
    
      4
      5
      5
      2015-07-31
      4822
      559
      1
      1
      0
      1
    
  








    







  
    
      
      Store
      StoreType
      Assortment
      CompetitionDistance
      CompetitionOpenSinceMonth
      CompetitionOpenSinceYear
      Promo2
      Promo2SinceWeek
      Promo2SinceYear
      PromoInterval
    
  
  
    
      0
      1
      c
      a
      1270.0
      9.0
      2008.0
      0
      NaN
      NaN
      NaN
    
    
      1
      2
      a
      a
      570.0
      11.0
      2007.0
      1
      13.0
      2010.0
      Jan,Apr,Jul,Oct
    
    
      2
      3
      a
      a
      14130.0
      12.0
      2006.0
      1
      14.0
      2011.0
      Jan,Apr,Jul,Oct
    
    
      3
      4
      c
      c
      620.0
      9.0
      2009.0
      0
      NaN
      NaN
      NaN
    
    
      4
      5
      a
      a
      29910.0
      4.0
      2015.0
      0
      NaN
      NaN
      NaN
    
  








    







  
    
      
      Store
      State
    
  
  
    
      0
      1
      HE
    
    
      1
      2
      TH
    
    
      2
      3
      NW
    
    
      3
      4
      BE
    
    
      4
      5
      SN
    
  








    







  
    
      
      StateName
      State
    
  
  
    
      0
      BadenWuerttemberg
      BW
    
    
      1
      Bayern
      BY
    
    
      2
      Berlin
      BE
    
    
      3
      Brandenburg
      BB
    
    
      4
      Bremen
      HB
    
  








    







  
    
      
      file
      week
      trend
    
  
  
    
      0
      Rossmann_DE_SN
      2012-12-02 - 2012-12-08
      96
    
    
      1
      Rossmann_DE_SN
      2012-12-09 - 2012-12-15
      95
    
    
      2
      Rossmann_DE_SN
      2012-12-16 - 2012-12-22
      91
    
    
      3
      Rossmann_DE_SN
      2012-12-23 - 2012-12-29
      48
    
    
      4
      Rossmann_DE_SN
      2012-12-30 - 2013-01-05
      67
    
  








    







  
    
      
      file
      Date
      Max_TemperatureC
      Mean_TemperatureC
      Min_TemperatureC
      Dew_PointC
      MeanDew_PointC
      Min_DewpointC
      Max_Humidity
      Mean_Humidity
      ...
      Max_VisibilityKm
      Mean_VisibilityKm
      Min_VisibilitykM
      Max_Wind_SpeedKm_h
      Mean_Wind_SpeedKm_h
      Max_Gust_SpeedKm_h
      Precipitationmm
      CloudCover
      Events
      WindDirDegrees
    
  
  
    
      0
      NordrheinWestfalen
      2013-01-01
      8
      4
      2
      7
      5
      1
      94
      87
      ...
      31.0
      12.0
      4.0
      39
      26
      58.0
      5.08
      6.0
      Rain
      215
    
    
      1
      NordrheinWestfalen
      2013-01-02
      7
      4
      1
      5
      3
      2
      93
      85
      ...
      31.0
      14.0
      10.0
      24
      16
      NaN
      0.00
      6.0
      Rain
      225
    
    
      2
      NordrheinWestfalen
      2013-01-03
      11
      8
      6
      10
      8
      4
      100
      93
      ...
      31.0
      8.0
      2.0
      26
      21
      NaN
      1.02
      7.0
      Rain
      240
    
    
      3
      NordrheinWestfalen
      2013-01-04
      9
      9
      8
      9
      9
      8
      100
      94
      ...
      11.0
      5.0
      2.0
      23
      14
      NaN
      0.25
      7.0
      Rain
      263
    
    
      4
      NordrheinWestfalen
      2013-01-05
      8
      8
      7
      8
      7
      6
      100
      94
      ...
      10.0
      6.0
      3.0
      16
      10
      NaN
      0.00
      7.0
      Rain
      268
    
  

5 rows × 24 columns







    







  
    
      
      Id
      Store
      DayOfWeek
      Date
      Open
      Promo
      StateHoliday
      SchoolHoliday
    
  
  
    
      0
      1
      1
      4
      2015-09-17
      1.0
      1
      0
      0
    
    
      1
      2
      3
      4
      2015-09-17
      1.0
      1
      0
      0
    
    
      2
      3
      7
      4
      2015-09-17
      1.0
      1
      0
      0
    
    
      3
      4
      8
      4
      2015-09-17
      1.0
      1
      0
      0
    
    
      4
      5
      9
      4
      2015-09-17
      1.0
      1
      0
      0

This is vert representative of a typical industry dataset.

The following returns summarized aggregate information to each table accross each field.



In [9]:

    
# NOTE .summary() isn't printing, it's returning. So display() is 
# needed to print the multiple calls.
for t in tables: display(DataFrameSummary(t).summary())









    







  
    
      
      Store
      DayOfWeek
      Date
      Sales
      Customers
      Open
      Promo
      StateHoliday
      SchoolHoliday
    
  
  
    
      count
      1.01721e+06
      1.01721e+06
      NaN
      1.01721e+06
      1.01721e+06
      1.01721e+06
      1.01721e+06
      NaN
      1.01721e+06
    
    
      mean
      558.43
      3.99834
      NaN
      5773.82
      633.146
      0.830107
      0.381515
      NaN
      0.178647
    
    
      std
      321.909
      1.99739
      NaN
      3849.93
      464.412
      0.375539
      0.485759
      NaN
      0.383056
    
    
      min
      1
      1
      NaN
      0
      0
      0
      0
      NaN
      0
    
    
      25%
      280
      2
      NaN
      3727
      405
      1
      0
      NaN
      0
    
    
      50%
      558
      4
      NaN
      5744
      609
      1
      0
      NaN
      0
    
    
      75%
      838
      6
      NaN
      7856
      837
      1
      1
      NaN
      0
    
    
      max
      1115
      7
      NaN
      41551
      7388
      1
      1
      NaN
      1
    
    
      counts
      1017209
      1017209
      1017209
      1017209
      1017209
      1017209
      1017209
      1017209
      1017209
    
    
      uniques
      1115
      7
      942
      21734
      4086
      2
      2
      4
      2
    
    
      missing
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      missing_perc
      0%
      0%
      0%
      0%
      0%
      0%
      0%
      0%
      0%
    
    
      types
      numeric
      numeric
      categorical
      numeric
      numeric
      bool
      bool
      categorical
      bool
    
  








    







  
    
      
      Store
      StoreType
      Assortment
      CompetitionDistance
      CompetitionOpenSinceMonth
      CompetitionOpenSinceYear
      Promo2
      Promo2SinceWeek
      Promo2SinceYear
      PromoInterval
    
  
  
    
      count
      1115
      NaN
      NaN
      1112
      761
      761
      1115
      571
      571
      NaN
    
    
      mean
      558
      NaN
      NaN
      5404.9
      7.2247
      2008.67
      0.512108
      23.5954
      2011.76
      NaN
    
    
      std
      322.017
      NaN
      NaN
      7663.17
      3.21235
      6.19598
      0.500078
      14.142
      1.67494
      NaN
    
    
      min
      1
      NaN
      NaN
      20
      1
      1900
      0
      1
      2009
      NaN
    
    
      25%
      279.5
      NaN
      NaN
      717.5
      4
      2006
      0
      13
      2011
      NaN
    
    
      50%
      558
      NaN
      NaN
      2325
      8
      2010
      1
      22
      2012
      NaN
    
    
      75%
      836.5
      NaN
      NaN
      6882.5
      10
      2013
      1
      37
      2013
      NaN
    
    
      max
      1115
      NaN
      NaN
      75860
      12
      2015
      1
      50
      2015
      NaN
    
    
      counts
      1115
      1115
      1115
      1112
      761
      761
      1115
      571
      571
      571
    
    
      uniques
      1115
      4
      3
      654
      12
      23
      2
      24
      7
      3
    
    
      missing
      0
      0
      0
      3
      354
      354
      0
      544
      544
      544
    
    
      missing_perc
      0%
      0%
      0%
      0.27%
      31.75%
      31.75%
      0%
      48.79%
      48.79%
      48.79%
    
    
      types
      numeric
      categorical
      categorical
      numeric
      numeric
      numeric
      bool
      numeric
      numeric
      categorical
    
  








    







  
    
      
      Store
      State
    
  
  
    
      count
      1115
      NaN
    
    
      mean
      558
      NaN
    
    
      std
      322.017
      NaN
    
    
      min
      1
      NaN
    
    
      25%
      279.5
      NaN
    
    
      50%
      558
      NaN
    
    
      75%
      836.5
      NaN
    
    
      max
      1115
      NaN
    
    
      counts
      1115
      1115
    
    
      uniques
      1115
      12
    
    
      missing
      0
      0
    
    
      missing_perc
      0%
      0%
    
    
      types
      numeric
      categorical
    
  








    







  
    
      
      StateName
      State
    
  
  
    
      count
      16
      16
    
    
      unique
      16
      16
    
    
      top
      Saarland
      SL
    
    
      freq
      1
      1
    
    
      counts
      16
      16
    
    
      uniques
      16
      16
    
    
      missing
      0
      0
    
    
      missing_perc
      0%
      0%
    
    
      types
      unique
      unique
    
  








    







  
    
      
      file
      week
      trend
    
  
  
    
      count
      NaN
      NaN
      2072
    
    
      mean
      NaN
      NaN
      63.8142
    
    
      std
      NaN
      NaN
      12.6502
    
    
      min
      NaN
      NaN
      0
    
    
      25%
      NaN
      NaN
      55
    
    
      50%
      NaN
      NaN
      64
    
    
      75%
      NaN
      NaN
      72
    
    
      max
      NaN
      NaN
      100
    
    
      counts
      2072
      2072
      2072
    
    
      uniques
      14
      148
      68
    
    
      missing
      0
      0
      0
    
    
      missing_perc
      0%
      0%
      0%
    
    
      types
      categorical
      categorical
      numeric
    
  








    







  
    
      
      file
      Date
      Max_TemperatureC
      Mean_TemperatureC
      Min_TemperatureC
      Dew_PointC
      MeanDew_PointC
      Min_DewpointC
      Max_Humidity
      Mean_Humidity
      ...
      Max_VisibilityKm
      Mean_VisibilityKm
      Min_VisibilitykM
      Max_Wind_SpeedKm_h
      Mean_Wind_SpeedKm_h
      Max_Gust_SpeedKm_h
      Precipitationmm
      CloudCover
      Events
      WindDirDegrees
    
  
  
    
      count
      NaN
      NaN
      15840
      15840
      15840
      15840
      15840
      15840
      15840
      15840
      ...
      15459
      15459
      15459
      15840
      15840
      3604
      15840
      14667
      NaN
      15840
    
    
      mean
      NaN
      NaN
      14.6441
      10.389
      6.19899
      8.58782
      6.20581
      3.62614
      93.6596
      74.2829
      ...
      24.0576
      12.2398
      7.02516
      22.7666
      11.9722
      48.8643
      0.831718
      5.55131
      NaN
      175.897
    
    
      std
      NaN
      NaN
      8.64601
      7.37926
      6.52639
      6.24478
      6.08677
      6.12839
      7.67853
      13.4866
      ...
      8.9768
      5.06794
      4.9806
      8.98862
      5.87284
      13.027
      2.51351
      1.68771
      NaN
      101.589
    
    
      min
      NaN
      NaN
      -11
      -13
      -15
      -14
      -15
      -73
      44
      30
      ...
      0
      0
      0
      3
      2
      21
      0
      0
      NaN
      -1
    
    
      25%
      NaN
      NaN
      8
      4
      1
      4
      2
      -1
      90.75
      65
      ...
      14
      10
      3
      16
      8
      39
      0
      5
      NaN
      80
    
    
      50%
      NaN
      NaN
      15
      11
      7
      9
      7
      4
      94
      76
      ...
      31
      11
      7
      21
      11
      48
      0
      6
      NaN
      202
    
    
      75%
      NaN
      NaN
      21
      16
      11
      13
      11
      8
      100
      85
      ...
      31
      14
      10
      27
      14
      55
      0.25
      7
      NaN
      256
    
    
      max
      NaN
      NaN
      39
      31
      24
      25
      20
      19
      100
      100
      ...
      31
      31
      31
      101
      53
      111
      58.93
      8
      NaN
      360
    
    
      counts
      15840
      15840
      15840
      15840
      15840
      15840
      15840
      15840
      15840
      15840
      ...
      15459
      15459
      15459
      15840
      15840
      3604
      15840
      14667
      11889
      15840
    
    
      uniques
      16
      990
      51
      45
      40
      40
      36
      40
      53
      71
      ...
      24
      32
      24
      44
      29
      47
      41
      9
      21
      362
    
    
      missing
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      381
      381
      381
      0
      0
      12236
      0
      1173
      3951
      0
    
    
      missing_perc
      0%
      0%
      0%
      0%
      0%
      0%
      0%
      0%
      0%
      0%
      ...
      2.41%
      2.41%
      2.41%
      0%
      0%
      77.25%
      0%
      7.41%
      24.94%
      0%
    
    
      types
      categorical
      categorical
      numeric
      numeric
      numeric
      numeric
      numeric
      numeric
      numeric
      numeric
      ...
      numeric
      numeric
      numeric
      numeric
      numeric
      numeric
      numeric
      numeric
      categorical
      numeric
    
  

13 rows × 24 columns







    







  
    
      
      Id
      Store
      DayOfWeek
      Date
      Open
      Promo
      StateHoliday
      SchoolHoliday
    
  
  
    
      count
      41088
      41088
      41088
      NaN
      41077
      41088
      NaN
      41088
    
    
      mean
      20544.5
      555.9
      3.97917
      NaN
      0.854322
      0.395833
      NaN
      0.443487
    
    
      std
      11861.2
      320.274
      2.01548
      NaN
      0.352787
      0.489035
      NaN
      0.496802
    
    
      min
      1
      1
      1
      NaN
      0
      0
      NaN
      0
    
    
      25%
      10272.8
      279.75
      2
      NaN
      1
      0
      NaN
      0
    
    
      50%
      20544.5
      553.5
      4
      NaN
      1
      0
      NaN
      0
    
    
      75%
      30816.2
      832.25
      6
      NaN
      1
      1
      NaN
      1
    
    
      max
      41088
      1115
      7
      NaN
      1
      1
      NaN
      1
    
    
      counts
      41088
      41088
      41088
      41088
      41077
      41088
      41088
      41088
    
    
      uniques
      41088
      856
      7
      48
      2
      2
      2
      2
    
    
      missing
      0
      0
      0
      0
      11
      0
      0
      0
    
    
      missing_perc
      0%
      0%
      0%
      0%
      0.03%
      0%
      0%
      0%
    
    
      types
      numeric
      numeric
      numeric
      categorical
      bool
      bool
      bool
      bool

1.2 Data Cleaning / Feature Engineering

As a structuered data problem, we necessarily have to go through all the cleaning and feature engineering, even though we're using a neural network.



In [10]:

    
train, store, store_states, state_names, googletrend, weather, test = tables



In [11]:

    
len(train), len(test)









    Out[11]:





(1017209, 41088)

We turn state Holidays to booleans, to make them more convenient for modeling. We can do calculations on pandas fields using notation very similar (often identical) to numpy.



In [12]:

    
train.StateHoliday = train.StateHoliday!='0'
test.StateHoliday  = test.StateHoliday !='0'

join_df is a function for joining tables on specific fields. By default, we'll be doing a left outer join of right on the left argument using the given fields for each table.

Pandas does joins using the merge method. The suffixes argument doescribes the naming convention for duplicate fields. We've elected to leave the duplicate field names on the left entouched, and append a "_y" to those on the right.



In [13]:

    
def join_df(left, right, left_on, right_on=None, suffix='_y'):
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on, 
                      suffixes=("", suffix))

Join weather/state names:



In [14]:

    
weather = join_df(weather, state_names, "file", "StateName")

In Pandas you can add new columns to a DataFrame by simply definint it. We'll do this for googletrends by extracting dates and state names from the given data an dadding those columns.

We're also going to replace all instances of state name 'NI' to match the usage in the rest of the data: 'HB,NI'. This is a good opportunity to highlight Pandas indexing. We can use .loc[rows, cols] to select a list of rows and a list of columns from the DataFrame. In this case, we're selecting rows w/ statename 'NI' by using a boolean list googletrend.State=='NI' and selecting "State".



In [15]:

    
googletrend['Date'] = googletrend.week.str.split(' - ', expand=True)[0]
googletrend['State']= googletrend.file.str.split('_', expand=True)[2]
googletrend.loc[googletrend.State=='NI', "State"] = 'HB,NI'

The following extracts particular date fields from a complete datetime for the purpose of constructing categoricals.

You should always consider this feature extraction step when working with date-time. Without expanding your date-time into these additional fields, you can't capture any trend/cyclical behavior as a function of time at any of these granularities. We'll add to every table with a date field.



In [16]:

    
add_datepart(weather, "Date", drop=False)
add_datepart(googletrend, "Date", drop=False)
add_datepart(train, "Date", drop=False)
add_datepart(test, "Date", drop=False)

The Google Trends data has a special category for the whole of the US - we'll pull that out so we can use it explicitly. -- DE?



In [17]:

    
trend_de = googletrend[googletrend.file == 'Rossmann_DE']

Now we can outer join all of our data into a single DataFrame. Recall that in outer joins everytime a value in the joining field on the left table doesn't have a corresponding value on the right table, the corresponding row in the new table has Null values for all right table fields.One way to check that all records are consistent and complete is to check for Null values post-join, as we do here.

Aside Why not just do an inner join? If you're assuming that all records are complete and match on the field you desire, an inner join will do the same thing as an outer join. However, in the event that you're wrong or a mistake is made, an outer join followed by a Null-Check will catch it (Comparing before/after # of rows for inner join is equivalent, but requires keeping track of before/after row #'s. Outer join is easier.).



In [18]:

    
store = join_df(store, store_states, "Store")
len(store[store.State.isnull()])
# NOTE: if you do this multiple times, as per join_df(..)'s code, 
# you'll add the last column onto store, with '_y' as its suffix.









    Out[18]:





0



In [19]:

    
joined = join_df(train, store, "Store")
joined_test = join_df(test, store, "Store")
len(joined[joined.StoreType.isnull()]), len(joined_test[joined_test.StoreType.isnull()])









    Out[19]:





(0, 0)



In [20]:

    
joined = join_df(joined, googletrend, ["State","Year","Week"])
joined_test = join_df(joined_test, googletrend, ["State","Year","Week"])
len(joined[joined.trend.isnull()]), len(joined_test[joined_test.trend.isnull()])









    Out[20]:





(0, 0)



In [21]:

    
joined = joined.merge(trend_de, 'left', ["Year", "Week"], suffixes=('', '_DE'))
joined_test = joined_test.merge(trend_de, 'left', ["Year", "Week"], suffixes=('', '_DE'))
len(joined[joined.trend_DE.isnull()]), len(joined_test[joined_test.trend_DE.isnull()])









    Out[21]:





(0, 0)



In [22]:

    
joined = join_df(joined, weather, ["State","Date"])
joined_test = join_df(joined_test, weather, ["State","Date"])
len(joined[joined.Mean_TemperatureC.isnull()]), len(joined_test[joined_test.Mean_TemperatureC.isnull()])









    Out[22]:





(0, 0)



In [23]:

    
# oh very cool -- I think this function removes extra/accidental column joins
for df in (joined, joined_test):
    for c in df.columns:
        if c.endswith('_y'):
            if c in df.columns: df.drop(c, inplace=True, axis=1)

Next we'll fill in missing values to avoid complications with NA's. NA (not available) is how Pandas indicates missing values; many models have problems when missing values are present, so it's always important to think about how to deal with them. In these cases, we're picking an aribtirary signal value that doesn't otherwise appear in the data.



In [24]:

    
for df in (joined, joined_test):
    df['CompetitionOpenSinceYear'] = df.CompetitionOpenSinceYear.fillna(1900).astype(np.int32)
    df['CompetitionOpenSinceMonth']= df.CompetitionOpenSinceMonth.fillna(1).astype(np.int32)
    df['Promo2SinceYear'] = df.Promo2SinceYear.fillna(1900).astype(np.int32)
    df['Promo2SinceWeek'] = df.Promo2SinceWeek.fillna(1).astype(np.int32)

Next we'll extract features "CompetitionOpenSince" and "CompetitionDaysOpen". Note the use of apply() in mapping a function across DataFrame values.



In [25]:

    
for df in (joined, joined_test):
    df["CompetitionOpenSince"] = pd.to_datetime(dict(year=df.CompetitionOpenSinceYear, 
                                                     month=df.CompetitionOpenSinceMonth, day=15))
    df["CompetitionDaysOpen"] = df.Date.subtract(df.CompetitionOpenSince).dt.days

We'll replace some erroneous / outlying data



In [26]:

    
for df in (joined, joined_test):
    df.loc[df.CompetitionDaysOpen<0, "CompetitionDaysOpen"] = 0
    df.loc[df.CompetitionOpenSinceYear<1990, "CompetitionDaysOpen"] = 0

We add the "CompetitionMonthsOpen" field, limiting the maximum to 2 years to limit number of unique categories.



In [27]:

    
for df in (joined, joined_test):
    df["CompetitionMonthsOpen"] = df["CompetitionDaysOpen"]//30
    df.loc[df.CompetitionMonthsOpen>24, "CompetitionMonthsOpen"] = 24

joined.CompetitionMonthsOpen.unique()









    Out[27]:





array([24,  3, 19,  9,  0, 16, 17,  7, 15, 22, 11, 13,  2, 23, 12,  4, 10,  1, 14, 20,  8, 18,  6, 21,  5])

Same process for Promo dates.



In [28]:

    
for df in (joined, joined_test):
    df["Promo2Since"] = pd.to_datetime(df.apply(lambda x: Week(
        x.Promo2SinceYear, x.Promo2SinceWeek).monday(), axis=1).astype(pd.datetime))
    df["Promo2Days"] = df.Date.subtract(df["Promo2Since"]).dt.days



In [29]:

    
for df in (joined, joined_test):
    df.loc[df.Promo2Days<0, "Promo2Days"] = 0
    df.loc[df.Promo2SinceYear<1990, "Promo2Days"] = 0
    df["Promo2Weeks"] = df["Promo2Days"]//7
    df.loc[df.Promo2Weeks<0, "Promo2Weeks"] = 0
    df.loc[df.Promo2Weeks>25, "Promo2Weeks"] = 25
    df.Promo2Weeks.unique()



In [30]:

    
joined.to_feather(f'{PATH}joined')
joined_test.to_feather(f'{PATH}joined_test')

NOTE: make sure joined or joined_test is loaded into memory (either reinitialized from above, or loaded from disk if saved as below) when switching between train and test datasets.

Actually you'll probably have to make sure you have them both read in anyway, if you're going to run some of the lines further below in the same cells together.

1.3 Durations

It's common when working with time series data to extract data that explains relationships accross rows as opposed to columns, eg:

Running averages
Time until next event
Time since last event

This is often difficult to do with most table manipulation frameworks, since they're designed to work with relationships across columns. As such, we've created a class to handle this type of data.

We'll define a function get_elapsed for cumulative counting across a sorted DataFrame. Given a particular field fld to monitor, this function will start tracking time since the last occurrence of that field. When the field is seen again, the counter is set to zero.

Upon initialization, this will result in datetime NA's until the field is encountered. This is reset every time a new store is seen. We'll see how to use this shortly.



In [ ]:

    
?? proc_df



In [31]:

    
def get_elapsed(fld, pre):
    day1 = np.timedelta64(1, 'D')
    last_date = np.datetime64()
    last_store = 0
    res = []
    
    for s,v,d in zip(df.Store.values, df[fld].values, df.Date.values):
        if s != last_store:
            last_date = np.datetime64()
            last_store = s
        if v: last_date = d
        res.append(((d-last_date).astype('timedelta64[D]') / day1).astype(int))
    df[pre+fld] = res

We'll be applying this to a subset of columns:



In [ ]:

    
# # reading back in joined & joined_test when switching df betwen train/test:
# joined = pd.read_feather(f'{PATH}joined')
# joined_test = pd.read_feather(f'{PATH}joined_test')

NOTE: Init columns and define df as train, then run all cells below until joined = join_df(.), then Re-init columns and define df as test, then run all cells below, this time with joined_test = join_df(.).

This took me a very long time to learn -- Hakan



In [52]:

    
columns = ["Date", "Store", "Promo", "StateHoliday", "SchoolHoliday"]



In [33]:

    
df = train[columns]



In [53]:

    
df = test[columns]

NOTE: You must rerun the cell below before running either df = train[columns] or df = test[columns], ie: before switching between train and test datasets -- since columns is redefined for the purpose of converting NaNs to zeros further on down below.

NOTE: when running on the train-set, do train[columns], when running on the test-set, do test[columns] --- idk yet if you have to run all the cells afterwards, but I think you do.

Let's walk through an exmaple:

Say we're looking at School Holiday. We'll first sort by Store, then Date, and then call add_elapsed('SchoolHoliday', 'After'): This'll apply to each row with School Holiday:

Applied to every row of the DataFrame in order of store and date
Will add to the DataFrame the days since seeing a School Holiday
If we sort in the other direction, this'll count the days until another holiday



In [54]:

    
fld = 'SchoolHoliday'
df = df.sort_values(['Store', 'Date'])
get_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
get_elapsed(fld, 'Before')

We'll do this for two more fields:



In [55]:

    
fld = 'StateHoliday'
df = df.sort_values(['Store', 'Date'])
get_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
get_elapsed(fld, 'Before')



In [56]:

    
fld = 'Promo'
df = df.sort_values(['Store', 'Date'])
get_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
get_elapsed(fld, 'Before')

We're going to set the active index to Date



In [57]:

    
df = df.set_index("Date")

Then set Null values from elapsed field calculations to 0.



In [58]:

    
columns = ['SchoolHoliday', 'StateHoliday', 'Promo']



In [59]:

    
for o in ['Before', 'After']:
    for p in columns:
        a = o+p
        df[a] = df[a].fillna(0)

Next we'll demonstrate window functions in Pandas to calculate rolling quantities.

Here we're sorting by date (sort_index()) and counting the number of events of interest (sum() defined in columns in the following week (rolling()), grouped by Store (groupby()). We do the same in the opposite direction.



In [60]:

    
bwd = df[['Store']+columns].sort_index().groupby("Store").rolling(7, min_periods=1).sum()



In [61]:

    
fwd = df[['Store']+columns].sort_index(ascending=False 
                                      ).groupby("Store").rolling(7, min_periods=1).sum()

Next we want to drop the Store indices grouped together in the window function.

Often in Pandas, there's an option to do this in place. This is time and memory efficient when working with large datasets.



In [62]:

    
bwd.drop('Store', 1, inplace=True)
bwd.reset_index(inplace=True)



In [63]:

    
fwd.drop('Store', 1, inplace=True)
fwd.reset_index(inplace=True)



In [64]:

    
df.reset_index(inplace=True)

Now we'll merge these values onto the df.



In [65]:

    
df = df.merge(bwd, 'left', ['Date', 'Store'], suffixes=['', '_bw'])
df = df.merge(fwd, 'left', ['Date', 'Store'], suffixes=['', '_fw'])



In [66]:

    
df.drop(columns, 1, inplace=True)



In [67]:

    
df.head()









    Out[67]:







  
    
      
      Date
      Store
      AfterSchoolHoliday
      BeforeSchoolHoliday
      AfterStateHoliday
      BeforeStateHoliday
      AfterPromo
      BeforePromo
      SchoolHoliday_bw
      StateHoliday_bw
      Promo_bw
      SchoolHoliday_fw
      StateHoliday_fw
      Promo_fw
    
  
  
    
      0
      2015-09-17
      1
      13
      -9223372036854775808
      -9223372036854775808
      -9223372036854775808
      0
      0
      0.0
      0.0
      4.0
      0.0
      0.0
      1.0
    
    
      1
      2015-09-16
      1
      12
      -9223372036854775808
      -9223372036854775808
      -9223372036854775808
      0
      0
      0.0
      0.0
      3.0
      0.0
      0.0
      2.0
    
    
      2
      2015-09-15
      1
      11
      -9223372036854775808
      -9223372036854775808
      -9223372036854775808
      0
      0
      0.0
      0.0
      2.0
      0.0
      0.0
      3.0
    
    
      3
      2015-09-14
      1
      10
      -9223372036854775808
      -9223372036854775808
      -9223372036854775808
      0
      0
      0.0
      0.0
      1.0
      0.0
      0.0
      4.0
    
    
      4
      2015-09-13
      1
      9
      -9223372036854775808
      -9223372036854775808
      -9223372036854775808
      9
      -1
      0.0
      0.0
      0.0
      0.0
      0.0
      4.0

It's usually a good idea to back up large tables of extracted / wrangled features before you join them onto another one, that way you can go back to it easily if you need to make changes to it.



In [68]:

    
df.to_feather(f'{PATH}df')



In [ ]:

    
# df = pd.read_feather(f'{PATH}df', index_col=0)



In [69]:

    
df["Date"] = pd.to_datetime(df.Date)



In [70]:

    
df.columns









    Out[70]:





Index(['Date', 'Store', 'AfterSchoolHoliday', 'BeforeSchoolHoliday',
       'AfterStateHoliday', 'BeforeStateHoliday', 'AfterPromo', 'BeforePromo',
       'SchoolHoliday_bw', 'StateHoliday_bw', 'Promo_bw', 'SchoolHoliday_fw',
       'StateHoliday_fw', 'Promo_fw'],
      dtype='object')



In [51]:

    
joined = join_df(joined, df, ['Store', 'Date'])



In [71]:

    
joined_test = join_df(joined_test, df, ['Store', 'Date'])

NOTE: *you'll get a Buffer dtype mismatch ValueError here unless you have joined or joined_test loaded in from before up above. -- Note for when rerunning this notebook: switching between test and train datasets.

The authors also removed all instances where the store had zero sale / was closed. We speculate that this may have cost them a higher standing in the competition. One reason this may be the case is that a little exploratory data analysis reveals that there are often periods where astores are closed, typically for refurbishment. Before and after these periods, there are naturally spikes in sales that one might expect. By ommitting this data from their training, the authors gave up the ability to leverage information about these periods to predict this otherwise volatile behavior.



In [72]:

    
joined = joined[joined.Sales != 0]

We'll back this up as well



In [73]:

    
joined.reset_index(inplace=True)
joined_test.reset_index(inplace=True)



In [74]:

    
joined.to_feather(f'{PATH}joined')
joined_test.to_feather(f'{PATH}joined_test')

We now have our final set of engineered features.

While these steps were explicitly outlined in the paper, these are all fairly typical feature engineering steps for dealing with time series data and are practical in any similar setting.

1.4 Create Features



In [172]:

    
joined = pd.read_feather(f'{PATH}joined')
joined_test = pd.read_feather(f'{PATH}joined_test')



In [75]:

    
joined.head().T.head(40)









    Out[75]:







  
    
      
      0
      1
      2
      3
      4
    
  
  
    
      index
      0
      1
      2
      3
      4
    
    
      Store
      1
      2
      3
      4
      5
    
    
      DayOfWeek
      5
      5
      5
      5
      5
    
    
      Date
      2015-07-31 00:00:00
      2015-07-31 00:00:00
      2015-07-31 00:00:00
      2015-07-31 00:00:00
      2015-07-31 00:00:00
    
    
      Sales
      5263
      6064
      8314
      13995
      4822
    
    
      Customers
      555
      625
      821
      1498
      559
    
    
      Open
      1
      1
      1
      1
      1
    
    
      Promo
      1
      1
      1
      1
      1
    
    
      StateHoliday
      False
      False
      False
      False
      False
    
    
      SchoolHoliday
      1
      1
      1
      1
      1
    
    
      Year
      2015
      2015
      2015
      2015
      2015
    
    
      Month
      7
      7
      7
      7
      7
    
    
      Week
      31
      31
      31
      31
      31
    
    
      Day
      31
      31
      31
      31
      31
    
    
      Dayofweek
      4
      4
      4
      4
      4
    
    
      Dayofyear
      212
      212
      212
      212
      212
    
    
      Is_month_end
      True
      True
      True
      True
      True
    
    
      Is_month_start
      False
      False
      False
      False
      False
    
    
      Is_quarter_end
      False
      False
      False
      False
      False
    
    
      Is_quarter_start
      False
      False
      False
      False
      False
    
    
      Is_year_end
      False
      False
      False
      False
      False
    
    
      Is_year_start
      False
      False
      False
      False
      False
    
    
      Elapsed
      1438300800
      1438300800
      1438300800
      1438300800
      1438300800
    
    
      StoreType
      c
      a
      a
      c
      a
    
    
      Assortment
      a
      a
      a
      c
      a
    
    
      CompetitionDistance
      1270
      570
      14130
      620
      29910
    
    
      CompetitionOpenSinceMonth
      9
      11
      12
      9
      4
    
    
      CompetitionOpenSinceYear
      2008
      2007
      2006
      2009
      2015
    
    
      Promo2
      0
      1
      1
      0
      0
    
    
      Promo2SinceWeek
      1
      13
      14
      1
      1
    
    
      Promo2SinceYear
      1900
      2010
      2011
      1900
      1900
    
    
      PromoInterval
      NaN
      Jan,Apr,Jul,Oct
      Jan,Apr,Jul,Oct
      NaN
      NaN
    
    
      State
      HE
      TH
      NW
      BE
      SN
    
    
      file
      Rossmann_DE_HE
      Rossmann_DE_TH
      Rossmann_DE_NW
      Rossmann_DE_BE
      Rossmann_DE_SN
    
    
      week
      2015-08-02 - 2015-08-08
      2015-08-02 - 2015-08-08
      2015-08-02 - 2015-08-08
      2015-08-02 - 2015-08-08
      2015-08-02 - 2015-08-08
    
    
      trend
      85
      80
      86
      74
      82
    
    
      file_DE
      Rossmann_DE
      Rossmann_DE
      Rossmann_DE
      Rossmann_DE
      Rossmann_DE
    
    
      week_DE
      2015-08-02 - 2015-08-08
      2015-08-02 - 2015-08-08
      2015-08-02 - 2015-08-08
      2015-08-02 - 2015-08-08
      2015-08-02 - 2015-08-08
    
    
      trend_DE
      83
      83
      83
      83
      83
    
    
      Date_DE
      2015-08-02 00:00:00
      2015-08-02 00:00:00
      2015-08-02 00:00:00
      2015-08-02 00:00:00
      2015-08-02 00:00:00

Now that we've engineered all our features, we need to convert to input compatible with a neural network.

This includes converting categorical variables into contiguous integers or one-hot encodings, normalizing continuous features to standard normal, etc..



In [173]:

    
cat_vars = ['Store', 'DayOfWeek', 'Year', 'Month', 'Day', 'StateHoliday', 'CompetitionMonthsOpen',
    'Promo2Weeks', 'StoreType', 'Assortment', 'PromoInterval', 'CompetitionOpenSinceYear', 'Promo2SinceYear',
    'State', 'Week', 'Events', 'Promo_fw', 'Promo_bw', 'StateHoliday_fw', 'StateHoliday_bw',
    'SchoolHoliday_fw', 'SchoolHoliday_bw']

contin_vars = ['CompetitionDistance', 'Max_TemperatureC', 'Mean_TemperatureC', 'Min_TemperatureC',
   'Max_Humidity', 'Mean_Humidity', 'Min_Humidity', 'Max_Wind_SpeedKm_h', 
   'Mean_Wind_SpeedKm_h', 'CloudCover', 'trend', 'trend_DE',
   'AfterStateHoliday', 'BeforeStateHoliday', 'Promo', 'SchoolHoliday']

n = len(joined); n









    Out[173]:





844338



In [174]:

    
dep = 'Sales'
joined = joined[cat_vars+contin_vars+[dep, 'Date']].copy()



In [175]:

    
joined_test[dep] = 0
joined_test = joined_test[cat_vars+contin_vars+[dep, 'Date', 'Id']].copy()



In [176]:

    
for v in cat_vars: joined[v] = joined[v].astype('category').cat.as_ordered()



In [177]:

    
apply_cats(joined_test, joined)



In [178]:

    
for v in contin_vars:
    joined[v] = joined[v].astype('float32')
    joined_test[v] = joined_test[v].astype('float32')

We're going to run on a sample:



In [179]:

    
idxs = get_cv_idxs(n, val_pct=150000/n)
joined_samp = joined.iloc[idxs].set_index("Date")
samp_size = len(joined_samp); samp_size









    Out[179]:





150000

To run on the full dataset, use this instead:



In [180]:

    
samp_size = n
joined_samp = joined.set_index("Date")

We can now process our data...



In [181]:

    
joined_samp.head(2)









    Out[181]:







  
    
      
      Store
      DayOfWeek
      Year
      Month
      Day
      StateHoliday
      CompetitionMonthsOpen
      Promo2Weeks
      StoreType
      Assortment
      ...
      Max_Wind_SpeedKm_h
      Mean_Wind_SpeedKm_h
      CloudCover
      trend
      trend_DE
      AfterStateHoliday
      BeforeStateHoliday
      Promo
      SchoolHoliday
      Sales
    
    
      Date
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      2015-07-31
      1
      5
      2015
      7
      31
      False
      24
      0
      c
      a
      ...
      24.0
      11.0
      1.0
      85.0
      83.0
      57.0
      -9.223372e+18
      1.0
      1.0
      5263
    
    
      2015-07-31
      2
      5
      2015
      7
      31
      False
      24
      25
      a
      a
      ...
      14.0
      11.0
      4.0
      80.0
      83.0
      67.0
      -9.223372e+18
      1.0
      1.0
      6064
    
  

2 rows × 39 columns



In [182]:

    
df, y, nas, mapper = proc_df(joined_samp, 'Sales', do_scale=True)
yλ = np.log(y) # the amount of times I mix up `1` with `l` ... so I use `λ`



In [183]:

    
joined_test = joined_test.set_index("Date")



In [184]:

    
df_test, _, nas, mapper = proc_df(joined_test, 'Sales', do_scale=True, skip_flds=['Id'], 
                                  mapper=mapper, na_dict=nas)



In [185]:

    
df.head(2)









    Out[185]:







  
    
      
      Store
      DayOfWeek
      Year
      Month
      Day
      StateHoliday
      CompetitionMonthsOpen
      Promo2Weeks
      StoreType
      Assortment
      ...
      Mean_Wind_SpeedKm_h
      CloudCover
      trend
      trend_DE
      AfterStateHoliday
      BeforeStateHoliday
      Promo
      SchoolHoliday
      CompetitionDistance_na
      CloudCover_na
    
    
      Date
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      2015-07-31
      1
      5
      3
      7
      31
      1
      25
      1
      3
      1
      ...
      -0.142774
      -2.805667
      1.732492
      1.724334
      0.0
      -0.0
      1.113717
      2.04105
      -0.050948
      -0.29609
    
    
      2015-07-31
      2
      5
      3
      7
      31
      1
      25
      26
      1
      1
      ...
      -0.142774
      -0.961766
      1.294578
      1.724334
      0.0
      -0.0
      1.113717
      2.04105
      -0.050948
      -0.29609
    
  

2 rows × 40 columns

In time series data, cross-validation is not random. Instead, our holdout data is generally the most recent data, as it would be a in a real application. This issue is discussed in detail in this post on our website.

One approach is to take the last 25% of rows (sorted by date) as our validation set.



In [186]:

    
train_ratio = 0.75
# train_ration = 0.9
train_size = int(samp_size + train_ratio); train_size
val_idx = list(range(train_size, len(df)))

An even better option for picking a validation set is using the exact same length of time period as the test set uses - this is implemented here:



In [187]:

    
val_idx = np.flatnonzero(
    (df.index<=datetime.datetime(2014,9,17)) & (df.index>=datetime.datetime(2014,8,1)))



In [188]:

    
val_idx=[0]

1.5 DL

We're ready to put together our models.

Root-Mean-Squared Percent Error is the metric Kaggle used for this competition.



In [189]:

    
def inv_y(a): return np.exp(a)

def exp_rmspe(y_pred, targ):
    targ = inv_y(targ)
    pct_var = (targ - inv_y(y_pred))/targ
    return math.sqrt((pct_var**2).mean())

max_log_y = np.max(yλ)
y_range = (0, max_log_y)

We can create a ModelData object directly from our DataFrame.



In [190]:

    
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yλ.astype(np.float32), cat_flds=cat_vars, bs=128, 
                                       test_df=df_test)

Some categorical variables have a lot more levels than others. Store, in particular, has over a thousand!



In [191]:

    
cat_sz = [(c, len(joined_samp[c].cat.categories)+1) for c in cat_vars]



In [192]:

    
cat_sz









    Out[192]:





[('Store', 1116),
 ('DayOfWeek', 8),
 ('Year', 4),
 ('Month', 13),
 ('Day', 32),
 ('StateHoliday', 3),
 ('CompetitionMonthsOpen', 26),
 ('Promo2Weeks', 27),
 ('StoreType', 5),
 ('Assortment', 4),
 ('PromoInterval', 4),
 ('CompetitionOpenSinceYear', 24),
 ('Promo2SinceYear', 9),
 ('State', 13),
 ('Week', 53),
 ('Events', 22),
 ('Promo_fw', 7),
 ('Promo_bw', 7),
 ('StateHoliday_fw', 4),
 ('StateHoliday_bw', 4),
 ('SchoolHoliday_fw', 9),
 ('SchoolHoliday_bw', 9)]

We use the cardinality of each variable (the number of unique values) to decide how large to make its embeddings. Each level will be associated with a vector with length defined as below.

Take the cardinality of the variable, divide it by 2, but don't make it bigger than 50:



In [193]:

    
emb_szs = [(c, min(50, (c+1)//2)) for _, c in cat_sz]



In [194]:

    
emb_szs









    Out[194]:





[(1116, 50),
 (8, 4),
 (4, 2),
 (13, 7),
 (32, 16),
 (3, 2),
 (26, 13),
 (27, 14),
 (5, 3),
 (4, 2),
 (4, 2),
 (24, 12),
 (9, 5),
 (13, 7),
 (53, 27),
 (22, 11),
 (7, 4),
 (7, 4),
 (4, 2),
 (4, 2),
 (9, 5),
 (9, 5)]

You need to tell the FastAI modeldata.get_learner() method how big to make the embedding matrices (emb_szs), and also how many of the columns in the dataframe are continuous variables.

So here: total num. cols minus num categorical vars.

The FastAI/PyTorch Neural Net will then know how to properly handle the data.

The parameters supplied to .get_learner():

number of continuous variables: n_cont = len(df.columns) - len(cat_vars)
embedding matrix's dropout (here, emb_drop = 0.04)
number of outputs of last Lin.Layer (out_sz = 1 for predicting a single number: sales)
output size: here 1000 activations in the 1st Linear layer, 500 in the 2nd, (szs = [1000,500])
dropout par for 1st & 2nd Lin.Layers (drops = [0.001, 0.01])

I think the y_range parameter is for ensuring proper Normalization, as DNNs like numbers clustered near zero, and above y_range is defined as a tuple of (0, max(log(y))) -- where max & log are NumPy functions.



In [195]:

    
m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars), 
                   0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)
λr = 1e-3



In [118]:

    
m.lr_find()









    





 
 










    



 64%|██████▎   | 745/1172 [00:06<00:03, 109.17it/s, loss=0.3]



In [119]:

    
m.sched.plot(100)



In [151]:

    
# learning rate finder run on full dataset:
m.lr_find()
m.sched.plot(100)









    





 
 










    



 55%|█████▍    | 3606/6597 [00:31<00:25, 116.25it/s, loss=0.169] 





    












    



 55%|█████▍    | 3606/6597 [00:50<00:41, 72.06it/s, loss=0.169]

1.5.1 Sample



In [120]:

    
m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars), 
                   0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)
λr = 1e-3

You can pass in custom metrics to learner.fit() that say: print out a number at the end of every epoch by calling this function (in this case: [exp_rmspe] I think it's in an array in case multiple metrics, or consistency.).

The metric used here is Root-Mean-Squared-Percentage



In [121]:

    
m.fit(λr, 3, metrics=[exp_rmspe])









    





 
 










    



[ 0.       0.02477  0.00066  0.026  ]                            
[ 1.       0.01824  0.02135  0.15733]                            
[ 2.       0.01758  0.01487  0.12971]



In [122]:

    
m.fit(λr, 5, metrics=[exp_rmspe], cycle_len=1)









    





 
 










    



[ 0.       0.01213  0.00032  0.01804]                            
[ 1.       0.01142  0.00601  0.08059]                            
[ 2.       0.0105   0.00199  0.0456 ]                            
[ 3.       0.01026  0.00011  0.01073]                             
[ 4.       0.01006  0.00236  0.04982]



In [123]:

    
m.fit(λr, 2, metrics=[exp_rmspe], cycle_len=4)









    





 
 










    



[ 0.       0.01196  0.00272  0.05358]                            
[ 1.       0.00966  0.00149  0.03937]                             
[ 2.       0.0077   0.0022   0.04806]                             
[ 3.       0.0073   0.00267  0.05308]                             
[ 4.       0.01085  0.00576  0.07885]                            
[ 5.       0.00878  0.00577  0.07891]                             
[ 6.       0.0071   0.00605  0.08091]                             
[ 7.       0.0063   0.00432  0.06794]

1.5.2 All



In [196]:

    
m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars), 
                   0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)
λr = 1e-3



In [197]:

    
m.fit(λr, 1, metrics=[exp_rmspe])









    





 
 










    



[ 0.       0.01693  0.00059  0.02453]



In [198]:

    
m.fit(λr, 3, metrics=[exp_rmspe])









    





 
 










    



[ 0.       0.01237  0.0021   0.04693]                            
[ 1.       0.00991  0.0001   0.00976]                             
[ 2.       0.00915  0.00014  0.01162]



In [199]:

    
m.fit(λr, 3, metrics=[exp_rmspe], cycle_len=1)









    





 
 










    



[ 0.       0.00696  0.00033  0.01799]                             
[ 1.       0.00682  0.00005  0.00739]                             
[ 2.       0.00656  0.0001   0.01015]

1.5.3 Test



In [200]:

    
m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars), 
                   0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)
λr = 1e-3



In [201]:

    
m.fit(λr, 3, metrics=[exp_rmspe])









    





 
 










    



[ 0.       0.01557  0.00003  0.00511]                            
[ 1.       0.01138  0.00316  0.05469]                            
[ 2.       0.01109  0.00079  0.02774]



In [202]:

    
m.fit(λr, 3, metrics=[exp_rmspe], cycle_len=1)









    





 
 










    



[ 0.       0.00808  0.00068  0.02634]                             
[ 1.       0.00705  0.00001  0.00299]                             
[ 2.       0.00672  0.       0.00161]



In [159]:

    
m.save('val0')



In [ ]:

    
# m.load('val0')



In [203]:

    
x,y = m.predict_with_targs()



In [204]:

    
exp_rmspe(x,y)









    Out[204]:





0.0016104093362837733



In [162]:

    
pred_test = m.predict(True)



In [163]:

    
pred_test = np.exp(pred_test)



In [164]:

    
joined_test['Sales'] = pred_test



In [165]:

    
csv_fn = f'{PATH}tmp/sub.csv'



In [166]:

    
joined_test[['Id','Sales']].to_csv(csv_fn, index=False)



In [167]:

    
FileLink(csv_fn)









    Out[167]:




data/rossmann/tmp/sub.csv

holy shit that go 30th of 3303 on the Kaggle Rossmann competition. That's top 0.9%. Score: 0.11076 (pvt)

1.6 RF

Random Forest



In [168]:

    
from sklearn.ensemble import RandomForestRegressor



In [169]:

    
((val, trn), (y_val, y_trn)) = split_by_idx(val_idx, df.values, yλ)



In [170]:

    
m = RandomForestRegressor(n_estimators=40, max_features=0.99, min_samples_leaf=2, 
                          n_jobs=-1, oob_score=True)
m.fit(trn, y_trn);



In [171]:

    
preds = m.predict(val)
m.score(trn, y_trn), m.score(val, y_val), m.oob_score_, exp_rmspe(preds, y_val)









    Out[171]:





(0.98114303593848995, 0.0, 0.92057288556583106, 0.07290035213948422)

X - Notes

If you want to use Embeddings on your own dataset:

- List the Categorical Variable Names
- List the Continuous Variable Names
- Put them in a Pandas DataFrame

Create a list of which row indices you want in your data set:

val_idx = np.flatnonzero(df.index<=datetime.datetime(2014,9,17)) &
                         (df.index>=datetime.datetime(2014,8,1)))

Call this line of code:

md = ColummnarModelData.from_data_frame(PATH, val_idx, df, yλ, 
                                         cat_flds=cat_vars, bs=128)

Create your list of how big you want each embedding matrix to be:

cat_sz = [(c, len(joined_samp[c].cat.categories)+1) for c in cat_vars]
emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]

Call .get_learner():

m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
                    0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)

Call .fit():
```
m.fit(λr, 3, metrics=[exp_rmspe])
```



In [ ]:

	Store	DayOfWeek	Date	Sales	Customers	Open	Promo	SchoolHoliday
0	1	5	2015-07-31	5263	555	1	1	1
1	2	5	2015-07-31	6064	625	1	1	1
2	3	5	2015-07-31	8314	821	1	1	1
3	4	5	2015-07-31	13995	1498	1	1	1
4	5	5	2015-07-31	4822	559	1	1	1

	Store	StoreType	Assortment	CompetitionDistance	CompetitionOpenSinceMonth	CompetitionOpenSinceYear	Promo2	Promo2SinceWeek	Promo2SinceYear	PromoInterval
0	1	c	a	1270.0	9.0	2008.0	0	NaN	NaN	NaN
1	2	a	a	570.0	11.0	2007.0	1	13.0	2010.0	Jan,Apr,Jul,Oct
2	3	a	a	14130.0	12.0	2006.0	1	14.0	2011.0	Jan,Apr,Jul,Oct
3	4	c	c	620.0	9.0	2009.0	0	NaN	NaN	NaN
4	5	a	a	29910.0	4.0	2015.0	0	NaN	NaN	NaN

	StateName	State
0	BadenWuerttemberg	BW
1	Bayern	BY
2	Berlin	BE
3	Brandenburg	BB
4	Bremen	HB

	file	week	trend
0	Rossmann_DE_SN	2012-12-02 - 2012-12-08	96
1	Rossmann_DE_SN	2012-12-09 - 2012-12-15	95
2	Rossmann_DE_SN	2012-12-16 - 2012-12-22	91
3	Rossmann_DE_SN	2012-12-23 - 2012-12-29	48
4	Rossmann_DE_SN	2012-12-30 - 2013-01-05	67

	file	Date	Max_TemperatureC	Mean_TemperatureC	Min_TemperatureC	Dew_PointC	MeanDew_PointC	Min_DewpointC	Max_Humidity	Mean_Humidity	...	Max_VisibilityKm	Mean_VisibilityKm	Min_VisibilitykM	Max_Wind_SpeedKm_h	Mean_Wind_SpeedKm_h	Max_Gust_SpeedKm_h	Precipitationmm	CloudCover	Events	WindDirDegrees
0	NordrheinWestfalen	2013-01-01	8	4	2	7	5	1	94	87	...	31.0	12.0	4.0	39	26	58.0	5.08	6.0	Rain	215
1	NordrheinWestfalen	2013-01-02	7	4	1	5	3	2	93	85	...	31.0	14.0	10.0	24	16	NaN	0.00	6.0	Rain	225
2	NordrheinWestfalen	2013-01-03	11	8	6	10	8	4	100	93	...	31.0	8.0	2.0	26	21	NaN	1.02	7.0	Rain	240
3	NordrheinWestfalen	2013-01-04	9	9	8	9	9	8	100	94	...	11.0	5.0	2.0	23	14	NaN	0.25	7.0	Rain	263
4	NordrheinWestfalen	2013-01-05	8	8	7	8	7	6	100	94	...	10.0	6.0	3.0	16	10	NaN	0.00	7.0	Rain	268

	Id	Store	DayOfWeek	Date	Open	Promo
0	1	1	4	2015-09-17	1.0	1
1	2	3	4	2015-09-17	1.0	1
2	3	7	4	2015-09-17	1.0	1
3	4	8	4	2015-09-17	1.0	1
4	5	9	4	2015-09-17	1.0	1

	Store	DayOfWeek	Date	Sales	Customers	Open	Promo	StateHoliday	SchoolHoliday
count	1.01721e+06	1.01721e+06	NaN	1.01721e+06	1.01721e+06	1.01721e+06	1.01721e+06	NaN	1.01721e+06
mean	558.43	3.99834	NaN	5773.82	633.146	0.830107	0.381515	NaN	0.178647
std	321.909	1.99739	NaN	3849.93	464.412	0.375539	0.485759	NaN	0.383056
min	1	1	NaN	0	0	0	0	NaN	0
25%	280	2	NaN	3727	405	1	0	NaN	0
50%	558	4	NaN	5744	609	1	0	NaN	0
75%	838	6	NaN	7856	837	1	1	NaN	0
max	1115	7	NaN	41551	7388	1	1	NaN	1
counts	1017209	1017209	1017209	1017209	1017209	1017209	1017209	1017209	1017209
uniques	1115	7	942	21734	4086	2	2	4	2
missing	0	0	0	0	0	0	0	0	0
missing_perc	0%	0%	0%	0%	0%	0%	0%	0%	0%
types	numeric	numeric	categorical	numeric	numeric	bool	bool	categorical	bool

	Store	StoreType	Assortment	CompetitionDistance	CompetitionOpenSinceMonth	CompetitionOpenSinceYear	Promo2	Promo2SinceWeek	Promo2SinceYear	PromoInterval
count	1115	NaN	NaN	1112	761	761	1115	571	571	NaN
mean	558	NaN	NaN	5404.9	7.2247	2008.67	0.512108	23.5954	2011.76	NaN
std	322.017	NaN	NaN	7663.17	3.21235	6.19598	0.500078	14.142	1.67494	NaN
min	1	NaN	NaN	20	1	1900	0	1	2009	NaN
25%	279.5	NaN	NaN	717.5	4	2006	0	13	2011	NaN
50%	558	NaN	NaN	2325	8	2010	1	22	2012	NaN
75%	836.5	NaN	NaN	6882.5	10	2013	1	37	2013	NaN
max	1115	NaN	NaN	75860	12	2015	1	50	2015	NaN
counts	1115	1115	1115	1112	761	761	1115	571	571	571
uniques	1115	4	3	654	12	23	2	24	7	3
missing	0	0	0	3	354	354	0	544	544	544
missing_perc	0%	0%	0%	0.27%	31.75%	31.75%	0%	48.79%	48.79%	48.79%
types	numeric	categorical	categorical	numeric	numeric	numeric	bool	numeric	numeric	categorical

	StateName	State
count	16	16
unique	16	16
top	Saarland	SL
freq	1	1
counts	16	16
uniques	16	16
missing	0	0
missing_perc	0%	0%
types	unique	unique

	file	week	trend
count	NaN	NaN	2072
mean	NaN	NaN	63.8142
std	NaN	NaN	12.6502
min	NaN	NaN	0
25%	NaN	NaN	55
50%	NaN	NaN	64
75%	NaN	NaN	72
max	NaN	NaN	100
counts	2072	2072	2072
uniques	14	148	68
missing	0	0	0
missing_perc	0%	0%	0%
types	categorical	categorical	numeric

	file	Date	Max_TemperatureC	Mean_TemperatureC	Min_TemperatureC	Dew_PointC	MeanDew_PointC	Min_DewpointC	Max_Humidity	Mean_Humidity	...	Max_VisibilityKm	Mean_VisibilityKm	Min_VisibilitykM	Max_Wind_SpeedKm_h	Mean_Wind_SpeedKm_h	Max_Gust_SpeedKm_h	Precipitationmm	CloudCover	Events	WindDirDegrees
count	NaN	NaN	15840	15840	15840	15840	15840	15840	15840	15840	...	15459	15459	15459	15840	15840	3604	15840	14667	NaN	15840
mean	NaN	NaN	14.6441	10.389	6.19899	8.58782	6.20581	3.62614	93.6596	74.2829	...	24.0576	12.2398	7.02516	22.7666	11.9722	48.8643	0.831718	5.55131	NaN	175.897
std	NaN	NaN	8.64601	7.37926	6.52639	6.24478	6.08677	6.12839	7.67853	13.4866	...	8.9768	5.06794	4.9806	8.98862	5.87284	13.027	2.51351	1.68771	NaN	101.589
min	NaN	NaN	-11	-13	-15	-14	-15	-73	44	30	...	0	0	0	3	2	21	0	0	NaN	-1
25%	NaN	NaN	8	4	1	4	2	-1	90.75	65	...	14	10	3	16	8	39	0	5	NaN	80
50%	NaN	NaN	15	11	7	9	7	4	94	76	...	31	11	7	21	11	48	0	6	NaN	202
75%	NaN	NaN	21	16	11	13	11	8	100	85	...	31	14	10	27	14	55	0.25	7	NaN	256
max	NaN	NaN	39	31	24	25	20	19	100	100	...	31	31	31	101	53	111	58.93	8	NaN	360
counts	15840	15840	15840	15840	15840	15840	15840	15840	15840	15840	...	15459	15459	15459	15840	15840	3604	15840	14667	11889	15840
uniques	16	990	51	45	40	40	36	40	53	71	...	24	32	24	44	29	47	41	9	21	362
missing	0	0	0	0	0	0	0	0	0	0	...	381	381	381	0	0	12236	0	1173	3951	0
missing_perc	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%	...	2.41%	2.41%	2.41%	0%	0%	77.25%	0%	7.41%	24.94%	0%
types	categorical	categorical	numeric	numeric	numeric	numeric	numeric	numeric	numeric	numeric	...	numeric	numeric	numeric	numeric	numeric	numeric	numeric	numeric	categorical	numeric

	Id	Store	DayOfWeek	Date	Open	Promo	StateHoliday	SchoolHoliday
count	41088	41088	41088	NaN	41077	41088	NaN	41088
mean	20544.5	555.9	3.97917	NaN	0.854322	0.395833	NaN	0.443487
std	11861.2	320.274	2.01548	NaN	0.352787	0.489035	NaN	0.496802
min	1	1	1	NaN	0	0	NaN	0
25%	10272.8	279.75	2	NaN	1	0	NaN	0
50%	20544.5	553.5	4	NaN	1	0	NaN	0
75%	30816.2	832.25	6	NaN	1	1	NaN	1
max	41088	1115	7	NaN	1	1	NaN	1
counts	41088	41088	41088	41088	41077	41088	41088	41088
uniques	41088	856	7	48	2	2	2	2
missing	0	0	0	0	11	0	0	0
missing_perc	0%	0%	0%	0%	0.03%	0%	0%	0%
types	numeric	numeric	numeric	categorical	bool	bool	bool	bool

	Date	Store	AfterSchoolHoliday	BeforeSchoolHoliday	AfterStateHoliday	BeforeStateHoliday	AfterPromo	BeforePromo	Promo_bw	Promo_fw
0	2015-09-17	1	13	-9223372036854775808	-9223372036854775808	-9223372036854775808	0	0	4.0	1.0
1	2015-09-16	1	12	-9223372036854775808	-9223372036854775808	-9223372036854775808	0	0	3.0	2.0
2	2015-09-15	1	11	-9223372036854775808	-9223372036854775808	-9223372036854775808	0	0	2.0	3.0
3	2015-09-14	1	10	-9223372036854775808	-9223372036854775808	-9223372036854775808	0	0	1.0	4.0
4	2015-09-13	1	9	-9223372036854775808	-9223372036854775808	-9223372036854775808	9	-1	0.0	4.0

	0	1	2	3	4
index	0	1	2	3	4
Store	1	2	3	4	5
DayOfWeek	5	5	5	5	5
Date	2015-07-31 00:00:00	2015-07-31 00:00:00	2015-07-31 00:00:00	2015-07-31 00:00:00	2015-07-31 00:00:00
Sales	5263	6064	8314	13995	4822
Customers	555	625	821	1498	559
Open	1	1	1	1	1
Promo	1	1	1	1	1
StateHoliday	False	False	False	False	False
SchoolHoliday	1	1	1	1	1
Year	2015	2015	2015	2015	2015
Month	7	7	7	7	7
Week	31	31	31	31	31
Day	31	31	31	31	31
Dayofweek	4	4	4	4	4
Dayofyear	212	212	212	212	212
Is_month_end	True	True	True	True	True
Is_month_start	False	False	False	False	False
Is_quarter_end	False	False	False	False	False
Is_quarter_start	False	False	False	False	False
Is_year_end	False	False	False	False	False
Is_year_start	False	False	False	False	False
Elapsed	1438300800	1438300800	1438300800	1438300800	1438300800
StoreType	c	a	a	c	a
Assortment	a	a	a	c	a
CompetitionDistance	1270	570	14130	620	29910
CompetitionOpenSinceMonth	9	11	12	9	4
CompetitionOpenSinceYear	2008	2007	2006	2009	2015
Promo2	0	1	1	0	0
Promo2SinceWeek	1	13	14	1	1
Promo2SinceYear	1900	2010	2011	1900	1900
PromoInterval	NaN	Jan,Apr,Jul,Oct	Jan,Apr,Jul,Oct	NaN	NaN
State	HE	TH	NW	BE	SN
file	Rossmann_DE_HE	Rossmann_DE_TH	Rossmann_DE_NW	Rossmann_DE_BE	Rossmann_DE_SN
week	2015-08-02 - 2015-08-08	2015-08-02 - 2015-08-08	2015-08-02 - 2015-08-08	2015-08-02 - 2015-08-08	2015-08-02 - 2015-08-08
trend	85	80	86	74	82
file_DE	Rossmann_DE	Rossmann_DE	Rossmann_DE	Rossmann_DE	Rossmann_DE
week_DE	2015-08-02 - 2015-08-08	2015-08-02 - 2015-08-08	2015-08-02 - 2015-08-08	2015-08-02 - 2015-08-08	2015-08-02 - 2015-08-08
trend_DE	83	83	83	83	83
Date_DE	2015-08-02 00:00:00	2015-08-02 00:00:00	2015-08-02 00:00:00	2015-08-02 00:00:00	2015-08-02 00:00:00

	Store	DayOfWeek	Year	Month	Day	StateHoliday	CompetitionMonthsOpen	Promo2Weeks	StoreType	Assortment	...	Mean_Wind_SpeedKm_h	CloudCover	trend	trend_DE	AfterStateHoliday	BeforeStateHoliday	Promo	SchoolHoliday	CompetitionDistance_na	CloudCover_na
Date
2015-07-31	1	5	3	7	31	1	25	1	3	1	...	-0.142774	-2.805667	1.732492	1.724334	0.0	-0.0	1.113717	2.04105	-0.050948	-0.29609
2015-07-31	2	5	3	7	31	1	25	26	1	1	...	-0.142774	-0.961766	1.294578	1.724334	0.0	-0.0	1.113717	2.04105	-0.050948	-0.29609