This notebook contains an implementation of the third-place result in the Rossmann Kaggle competition, as detailed in Guo/Berkhahn's Entity Embeddings of Categorical Variables.

The motivation behind exploring this architecture is its relevance to real-world applications. Much of our focus has been on computer vision and NLP tasks, which largely deal with unstructured data.

However, most of the data informing KPIs in industry is structured, time-series data. Here we explore the end-to-end process of using neural networks on practical structured-data problems.


In [1]:
%matplotlib inline

In [2]:
import math, keras, datetime, pandas as pd, numpy as np, keras.backend as K
import matplotlib.pyplot as plt, xgboost, operator, random, pickle


Using TensorFlow backend.
/home/roebius/pj/p3/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

In [3]:
from utils2 import *

In [4]:
np.set_printoptions(threshold=50, edgeitems=20)

In [5]:
limit_mem()

In [6]:
from isoweek import Week
from pandas_summary import DataFrameSummary

In [7]:
%cd data/rossman/


/home/roebius/pj/fastai/nbs2/data/rossman

Create datasets

In addition to the provided data, we will be using external datasets put together by participants in the Kaggle competition. You can download all of them here.

For completeness, the implementation used to put them together is included below.


In [8]:
def concat_csvs(dirname):
    os.chdir(dirname)
    filenames=glob.glob("*.csv")

    wrote_header = False
    with open("../"+dirname+".csv","w") as outputfile:
        for filename in filenames:
            name = filename.split(".")[0]
            with open(filename) as f:
                line = f.readline()
                if not wrote_header:
                    wrote_header = True
                    outputfile.write("file,"+line)
                for line in f:
                     outputfile.write(name + "," + line)
                outputfile.write("\n")

    os.chdir("..")

In [9]:
# concat_csvs('googletrend')
# concat_csvs('weather')

Feature Space:

  • train: Training set provided by competition
  • store: List of stores
  • store_states: mapping of store to the German state they are in
  • state_names: list of German state names
  • googletrend: trend of certain google keywords over time, found by users to correlate well w/ given data
  • weather: weather conditions for each German state
  • test: testing set

In [10]:
table_names = ['train', 'store', 'store_states', 'state_names', 
               'googletrend', 'weather', 'test']

We'll be using the popular data manipulation framework pandas.

Among other things, pandas allows you to manipulate tables/dataframes in Python much as one would in a database.

We're going to go ahead and load all of our CSVs as dataframes into a list called tables.


In [11]:
tables = [pd.read_csv(fname+'.csv', low_memory=False) for fname in table_names]

In [12]:
from IPython.display import HTML

We can use head() to get a quick look at the contents of each table:

  • train: Contains store information on a daily basis; tracks things like sales, customers, whether that day was a holiday, etc.
  • store: general info about the store including competition, etc.
  • store_states: maps store to state it is in
  • state_names: Maps state abbreviations to names
  • googletrend: trend data for particular week/state
  • weather: weather conditions for each state
  • test: Same as training table, w/o sales and customers

In [13]:
for t in tables: display(t.head())


Store DayOfWeek Date Sales Customers Open Promo StateHoliday SchoolHoliday
0 1 5 2015-07-31 5263 555 1 1 0 1
1 2 5 2015-07-31 6064 625 1 1 0 1
2 3 5 2015-07-31 8314 821 1 1 0 1
3 4 5 2015-07-31 13995 1498 1 1 0 1
4 5 5 2015-07-31 4822 559 1 1 0 1
Store StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear PromoInterval
0 1 c a 1270.0 9.0 2008.0 0 NaN NaN NaN
1 2 a a 570.0 11.0 2007.0 1 13.0 2010.0 Jan,Apr,Jul,Oct
2 3 a a 14130.0 12.0 2006.0 1 14.0 2011.0 Jan,Apr,Jul,Oct
3 4 c c 620.0 9.0 2009.0 0 NaN NaN NaN
4 5 a a 29910.0 4.0 2015.0 0 NaN NaN NaN
Store State
0 1 HE
1 2 TH
2 3 NW
3 4 BE
4 5 SN
StateName State
0 BadenWuerttemberg BW
1 Bayern BY
2 Berlin BE
3 Brandenburg BB
4 Bremen HB
file week trend
0 Rossmann_DE_SN 2012-12-02 - 2012-12-08 96
1 Rossmann_DE_SN 2012-12-09 - 2012-12-15 95
2 Rossmann_DE_SN 2012-12-16 - 2012-12-22 91
3 Rossmann_DE_SN 2012-12-23 - 2012-12-29 48
4 Rossmann_DE_SN 2012-12-30 - 2013-01-05 67
file Date Max_TemperatureC Mean_TemperatureC Min_TemperatureC Dew_PointC MeanDew_PointC Min_DewpointC Max_Humidity Mean_Humidity ... Max_VisibilityKm Mean_VisibilityKm Min_VisibilitykM Max_Wind_SpeedKm_h Mean_Wind_SpeedKm_h Max_Gust_SpeedKm_h Precipitationmm CloudCover Events WindDirDegrees
0 NordrheinWestfalen 2013-01-01 8 4 2 7 5 1 94 87 ... 31.0 12.0 4.0 39 26 58.0 5.08 6.0 Rain 215
1 NordrheinWestfalen 2013-01-02 7 4 1 5 3 2 93 85 ... 31.0 14.0 10.0 24 16 NaN 0.00 6.0 Rain 225
2 NordrheinWestfalen 2013-01-03 11 8 6 10 8 4 100 93 ... 31.0 8.0 2.0 26 21 NaN 1.02 7.0 Rain 240
3 NordrheinWestfalen 2013-01-04 9 9 8 9 9 8 100 94 ... 11.0 5.0 2.0 23 14 NaN 0.25 7.0 Rain 263
4 NordrheinWestfalen 2013-01-05 8 8 7 8 7 6 100 94 ... 10.0 6.0 3.0 16 10 NaN 0.00 7.0 Rain 268

5 rows × 24 columns

Id Store DayOfWeek Date Open Promo StateHoliday SchoolHoliday
0 1 1 4 2015-09-17 1.0 1 0 0
1 2 3 4 2015-09-17 1.0 1 0 0
2 3 7 4 2015-09-17 1.0 1 0 0
3 4 8 4 2015-09-17 1.0 1 0 0
4 5 9 4 2015-09-17 1.0 1 0 0

This is very representative of a typical industry dataset.

The following returns summarized aggregate information for each table across each field.


In [14]:
for t in tables: display(DataFrameSummary(t).summary())


Store DayOfWeek Date Sales Customers Open Promo StateHoliday SchoolHoliday
count 1.01721e+06 1.01721e+06 NaN 1.01721e+06 1.01721e+06 1.01721e+06 1.01721e+06 NaN 1.01721e+06
mean 558.43 3.99834 NaN 5773.82 633.146 0.830107 0.381515 NaN 0.178647
std 321.909 1.99739 NaN 3849.93 464.412 0.375539 0.485759 NaN 0.383056
min 1 1 NaN 0 0 0 0 NaN 0
25% 280 2 NaN 3727 405 1 0 NaN 0
50% 558 4 NaN 5744 609 1 0 NaN 0
75% 838 6 NaN 7856 837 1 1 NaN 0
max 1115 7 NaN 41551 7388 1 1 NaN 1
counts 1017209 1017209 1017209 1017209 1017209 1017209 1017209 1017209 1017209
uniques 1115 7 942 21734 4086 2 2 4 2
missing 0 0 0 0 0 0 0 0 0
missing_perc 0% 0% 0% 0% 0% 0% 0% 0% 0%
types numeric numeric categorical numeric numeric bool bool categorical bool
Store StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear PromoInterval
count 1115 NaN NaN 1112 761 761 1115 571 571 NaN
mean 558 NaN NaN 5404.9 7.2247 2008.67 0.512108 23.5954 2011.76 NaN
std 322.017 NaN NaN 7663.17 3.21235 6.19598 0.500078 14.142 1.67494 NaN
min 1 NaN NaN 20 1 1900 0 1 2009 NaN
25% 279.5 NaN NaN 717.5 4 2006 0 13 2011 NaN
50% 558 NaN NaN 2325 8 2010 1 22 2012 NaN
75% 836.5 NaN NaN 6882.5 10 2013 1 37 2013 NaN
max 1115 NaN NaN 75860 12 2015 1 50 2015 NaN
counts 1115 1115 1115 1112 761 761 1115 571 571 571
uniques 1115 4 3 654 12 23 2 24 7 3
missing 0 0 0 3 354 354 0 544 544 544
missing_perc 0% 0% 0% 0.27% 31.75% 31.75% 0% 48.79% 48.79% 48.79%
types numeric categorical categorical numeric numeric numeric bool numeric numeric categorical
Store State
count 1115 NaN
mean 558 NaN
std 322.017 NaN
min 1 NaN
25% 279.5 NaN
50% 558 NaN
75% 836.5 NaN
max 1115 NaN
counts 1115 1115
uniques 1115 12
missing 0 0
missing_perc 0% 0%
types numeric categorical
StateName State
count 16 16
unique 16 16
top Bayern SH
freq 1 1
counts 16 16
uniques 16 16
missing 0 0
missing_perc 0% 0%
types unique unique
file week trend
count NaN NaN 2072
mean NaN NaN 63.8142
std NaN NaN 12.6502
min NaN NaN 0
25% NaN NaN 55
50% NaN NaN 64
75% NaN NaN 72
max NaN NaN 100
counts 2072 2072 2072
uniques 14 148 68
missing 0 0 0
missing_perc 0% 0% 0%
types categorical categorical numeric
file Date Max_TemperatureC Mean_TemperatureC Min_TemperatureC Dew_PointC MeanDew_PointC Min_DewpointC Max_Humidity Mean_Humidity ... Max_VisibilityKm Mean_VisibilityKm Min_VisibilitykM Max_Wind_SpeedKm_h Mean_Wind_SpeedKm_h Max_Gust_SpeedKm_h Precipitationmm CloudCover Events WindDirDegrees
count NaN NaN 15840 15840 15840 15840 15840 15840 15840 15840 ... 15459 15459 15459 15840 15840 3604 15840 14667 NaN 15840
mean NaN NaN 14.6441 10.389 6.19899 8.58782 6.20581 3.62614 93.6596 74.2829 ... 24.0576 12.2398 7.02516 22.7666 11.9722 48.8643 0.831718 5.55131 NaN 175.897
std NaN NaN 8.64601 7.37926 6.52639 6.24478 6.08677 6.12839 7.67853 13.4866 ... 8.9768 5.06794 4.9806 8.98862 5.87284 13.027 2.51351 1.68771 NaN 101.589
min NaN NaN -11 -13 -15 -14 -15 -73 44 30 ... 0 0 0 3 2 21 0 0 NaN -1
25% NaN NaN 8 4 1 4 2 -1 90.75 65 ... 14 10 3 16 8 39 0 5 NaN 80
50% NaN NaN 15 11 7 9 7 4 94 76 ... 31 11 7 21 11 48 0 6 NaN 202
75% NaN NaN 21 16 11 13 11 8 100 85 ... 31 14 10 27 14 55 0.25 7 NaN 256
max NaN NaN 39 31 24 25 20 19 100 100 ... 31 31 31 101 53 111 58.93 8 NaN 360
counts 15840 15840 15840 15840 15840 15840 15840 15840 15840 15840 ... 15459 15459 15459 15840 15840 3604 15840 14667 11889 15840
uniques 16 990 51 45 40 40 36 40 53 71 ... 24 32 24 44 29 47 41 9 21 362
missing 0 0 0 0 0 0 0 0 0 0 ... 381 381 381 0 0 12236 0 1173 3951 0
missing_perc 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% ... 2.41% 2.41% 2.41% 0% 0% 77.25% 0% 7.41% 24.94% 0%
types categorical categorical numeric numeric numeric numeric numeric numeric numeric numeric ... numeric numeric numeric numeric numeric numeric numeric numeric categorical numeric

13 rows × 24 columns

Id Store DayOfWeek Date Open Promo StateHoliday SchoolHoliday
count 41088 41088 41088 NaN 41077 41088 NaN 41088
mean 20544.5 555.9 3.97917 NaN 0.854322 0.395833 NaN 0.443487
std 11861.2 320.274 2.01548 NaN 0.352787 0.489035 NaN 0.496802
min 1 1 1 NaN 0 0 NaN 0
25% 10272.8 279.75 2 NaN 1 0 NaN 0
50% 20544.5 553.5 4 NaN 1 0 NaN 0
75% 30816.2 832.25 6 NaN 1 1 NaN 1
max 41088 1115 7 NaN 1 1 NaN 1
counts 41088 41088 41088 41088 41077 41088 41088 41088
uniques 41088 856 7 48 2 2 2 2
missing 0 0 0 0 11 0 0 0
missing_perc 0% 0% 0% 0% 0.03% 0% 0% 0%
types numeric numeric numeric categorical bool bool bool bool

Data Cleaning / Feature Engineering

As a structured data problem, we necessarily have to go through all the cleaning and feature engineering, even though we're using a neural network.


In [15]:
train, store, store_states, state_names, googletrend, weather, test = tables

In [16]:
len(train),len(test)


Out[16]:
(1017209, 41088)

Turn state holidays into booleans


In [17]:
train.StateHoliday = train.StateHoliday!='0'
test.StateHoliday = test.StateHoliday!='0'

Define a function for joining tables on specific fields.

By default, we'll do a left outer join of right onto the left argument, using the given fields for each table.

Pandas does joins using the merge method. The suffixes argument describes the naming convention for duplicate fields. We've elected to leave the duplicate field names on the left untouched, and append a "_y" to those on the right.


In [18]:
def join_df(left, right, left_on, right_on=None):
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on, 
                      suffixes=("", "_y"))

Join weather/state names.


In [19]:
weather = join_df(weather, state_names, "file", "StateName")

In pandas you can add new columns to a dataframe by simply assigning to them. We'll do this for googletrend by extracting dates and state names from the given data and adding those columns.

We're also going to replace all instances of state name 'NI' with the usage in the rest of the table, 'HB,NI'. This is a good opportunity to highlight pandas indexing. We can use .loc[rows, cols] to select a list of rows and a list of columns from the dataframe. In this case, we're selecting rows with state name 'NI' via the boolean mask googletrend.State=='NI' and selecting the "State" column.


In [20]:
googletrend['Date'] = googletrend.week.str.split(' - ', expand=True)[0]
googletrend['State'] = googletrend.file.str.split('_', expand=True)[2]
googletrend.loc[googletrend.State=='NI', "State"] = 'HB,NI'

The following extracts particular date fields from a complete datetime for the purpose of constructing categoricals.

You should always consider this feature extraction step when working with date-time. Without expanding your date-time into these additional fields, you can't capture any trend/cyclical behavior as a function of time at any of these granularities.


In [21]:
def add_datepart(df):
    df.Date = pd.to_datetime(df.Date)
    df["Year"] = df.Date.dt.year
    df["Month"] = df.Date.dt.month
    df["Week"] = df.Date.dt.week
    df["Day"] = df.Date.dt.day

We'll apply this to every table with a date field.


In [22]:
add_datepart(weather)
add_datepart(googletrend)
add_datepart(train)
add_datepart(test)

In [23]:
trend_de = googletrend[googletrend.file == 'Rossmann_DE']

Now we can outer join all of our data into a single dataframe.

Recall that in outer joins, every time a value in the joining field of the left table has no corresponding value in the right table, the corresponding row in the new table has null values for all right-table fields.

One way to check that all records are consistent and complete is to check for Null values post-join, as we do here.

Aside: Why not just do an inner join? If you are assuming that all records are complete and match on the field you desire, an inner join will do the same thing as an outer join. However, in the event you are wrong or a mistake is made, an outer join followed by a null check will catch it. (Comparing before/after row counts for an inner join is equivalent, but requires keeping track of those counts; the outer join is easier.)
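
For illustration, here is what the equivalent inner-join row-count check would look like (a sketch only; the cells below use the left-join plus null-check approach):


In [ ]:
# Sketch: the row-count check one would need with an inner join.
# If the counts match, every row of `store` found a matching state.
n_before = len(store)
n_after = len(store.merge(store_states, how='inner', on='Store'))
n_before, n_after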


In [24]:
store = join_df(store, store_states, "Store")
len(store[store.State.isnull()])


Out[24]:
0

In [25]:
joined = join_df(train, store, "Store")
len(joined[joined.StoreType.isnull()])


Out[25]:
0

In [26]:
joined = join_df(joined, googletrend, ["State","Year", "Week"])
len(joined[joined.trend.isnull()])


Out[26]:
0

In [27]:
joined = joined.merge(trend_de, 'left', ["Year", "Week"], suffixes=('', '_DE'))
len(joined[joined.trend_DE.isnull()])


Out[27]:
0

In [28]:
joined = join_df(joined, weather, ["State","Date"])
len(joined[joined.Mean_TemperatureC.isnull()])


Out[28]:
0

In [29]:
joined_test = test.merge(store, how='left', left_on='Store', right_index=True)
len(joined_test[joined_test.StoreType.isnull()])


Out[29]:
48

Next we'll fill in missing values to avoid complications w/ na's.


In [30]:
joined.CompetitionOpenSinceYear = joined.CompetitionOpenSinceYear.fillna(1900).astype(np.int32)
joined.CompetitionOpenSinceMonth = joined.CompetitionOpenSinceMonth.fillna(1).astype(np.int32)
joined.Promo2SinceYear = joined.Promo2SinceYear.fillna(1900).astype(np.int32)
joined.Promo2SinceWeek = joined.Promo2SinceWeek.fillna(1).astype(np.int32)

Next we'll extract the features "CompetitionOpenSince" and "CompetitionDaysOpen". Note the use of apply() to map a function across dataframe rows.


In [31]:
joined["CompetitionOpenSince"] = pd.to_datetime(joined.apply(lambda x: datetime.datetime(
    x.CompetitionOpenSinceYear, x.CompetitionOpenSinceMonth, 15), axis=1).astype(pd.datetime))
joined["CompetitionDaysOpen"] = joined.Date.subtract(joined["CompetitionOpenSince"]).dt.days

We'll replace some erroneous / outlying data.


In [32]:
joined.loc[joined.CompetitionDaysOpen<0, "CompetitionDaysOpen"] = 0
joined.loc[joined.CompetitionOpenSinceYear<1990, "CompetitionDaysOpen"] = 0

Added "CompetitionMonthsOpen" field, limit the maximum to 2 years to limit number of unique embeddings.


In [33]:
joined["CompetitionMonthsOpen"] = joined["CompetitionDaysOpen"]//30
joined.loc[joined.CompetitionMonthsOpen>24, "CompetitionMonthsOpen"] = 24
joined.CompetitionMonthsOpen.unique()


Out[33]:
array([24,  3, 19,  9,  0, 16, 17,  7, 15, 22, 11, 13,  2, 23, 12,  4, 10,
        1, 14, 20,  8, 18,  6, 21,  5])

Same process for Promo dates.


In [34]:
joined["Promo2Since"] = pd.to_datetime(joined.apply(lambda x: Week(
    x.Promo2SinceYear, x.Promo2SinceWeek).monday(), axis=1).astype(pd.datetime))
joined["Promo2Days"] = joined.Date.subtract(joined["Promo2Since"]).dt.days

In [35]:
joined.loc[joined.Promo2Days<0, "Promo2Days"] = 0
joined.loc[joined.Promo2SinceYear<1990, "Promo2Days"] = 0

In [36]:
joined["Promo2Weeks"] = joined["Promo2Days"]//7
joined.loc[joined.Promo2Weeks<0, "Promo2Weeks"] = 0
joined.loc[joined.Promo2Weeks>25, "Promo2Weeks"] = 25
joined.Promo2Weeks.unique()


Out[36]:
array([ 0, 25, 17,  8, 13, 24, 16,  7, 12, 23, 15,  6, 11, 22, 14,  5, 10,
       21,  4,  9, 20,  3, 19,  2, 18,  1])

Durations

It is common when working with time series data to extract data that explains relationships across rows as opposed to columns, e.g.:

  • Running averages
  • Time until next event
  • Time since last event

This is often difficult to do with most table manipulation frameworks, since they are designed to work with relationships across columns. As such, we've created a class to handle this type of data.


In [37]:
columns = ["Date", "Store", "Promo", "StateHoliday", "SchoolHoliday"]

We've defined a class elapsed for tracking elapsed time across a sorted dataframe.

Given a particular field fld to monitor, this object tracks the time since that field was last seen. Each time the field occurs again, the counter resets to zero.

Upon initialization, this results in datetime NAs until the field is first encountered. The counter is also reset every time a new store is seen.

We'll see how to use this shortly.


In [38]:
class elapsed(object):
    def __init__(self, fld):
        self.fld = fld
        self.last = pd.to_datetime(np.nan)
        self.last_store = 0
        
    def get(self, row):
        if row.Store != self.last_store:
            self.last = pd.to_datetime(np.nan)
            self.last_store = row.Store
        if (row[self.fld]): self.last = row.Date
        return row.Date-self.last

In [39]:
df = train[columns]

And a function for applying said class across dataframe rows and adding values to a new column.


In [40]:
def add_elapsed(fld, prefix):
    sh_el = elapsed(fld)
    df[prefix+fld] = df.apply(sh_el.get, axis=1)

Let's walk through an example.

Say we're looking at SchoolHoliday. We'll first sort by Store, then Date, and then call add_elapsed('SchoolHoliday', 'After'). This generates an instance of the elapsed class for SchoolHoliday:

  • The instance is applied to every row of the dataframe, in order of store and date
  • It adds to the dataframe the days since last seeing a school holiday
  • If we sort in the other direction, this counts the days until the next school holiday instead.
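
As an aside, the same "days since the last event" feature can often be computed in a vectorized way rather than with a row-wise apply(). A minimal sketch of one way to do it, assuming df is sorted by Store and Date as above:


In [ ]:
# Sketch of a vectorized alternative to the elapsed/apply approach.
# Carry forward the date of the most recent school holiday within each store,
# then subtract it from the current date.
d = df.sort_values(['Store', 'Date'])
last_holiday = d.Date.where(d.SchoolHoliday == 1)     # holiday date on holiday rows, NaT elsewhere
last_holiday = last_holiday.groupby(d.Store).ffill()  # forward-fill within each store
days_since = (d.Date - last_holiday).dt.days          # NaN before a store's first holiday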

In [41]:
fld = 'SchoolHoliday'
df = df.sort_values(['Store', 'Date'])
add_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
add_elapsed(fld, 'Before')

We'll do this for two more fields.


In [42]:
fld = 'StateHoliday'
df = df.sort_values(['Store', 'Date'])
add_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
add_elapsed(fld, 'Before')

In [43]:
fld = 'Promo'
df = df.sort_values(['Store', 'Date'])
add_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
add_elapsed(fld, 'Before')

In [44]:
display(df.head())


Date Store Promo StateHoliday SchoolHoliday AfterSchoolHoliday BeforeSchoolHoliday AfterStateHoliday BeforeStateHoliday AfterPromo BeforePromo
0 2015-07-31 1 1 False 1 0 days 0 days 57 days NaT 0 days 00:00:00 0 days
1115 2015-07-30 1 1 False 1 0 days 0 days 56 days NaT 0 days 00:00:00 0 days
2230 2015-07-29 1 1 False 1 0 days 0 days 55 days NaT 0 days 00:00:00 0 days
3345 2015-07-28 1 1 False 1 0 days 0 days 54 days NaT 0 days 00:00:00 0 days
4460 2015-07-27 1 1 False 1 0 days 0 days 53 days NaT 0 days 00:00:00 0 days

We're going to set the active index to Date.


In [45]:
df = df.set_index("Date")

Then set null values from elapsed field calculations to 0.


In [46]:
columns = ['SchoolHoliday', 'StateHoliday', 'Promo']

In [47]:
for o in ['Before', 'After']:
    for p in columns:
        a = o+p
        df[a] = df[a].fillna(pd.Timedelta(0)).dt.days

Next we'll demonstrate window functions in pandas to calculate rolling quantities.

Here we're sorting by date (sort_index()) and, grouped by Store (groupby()), counting the number of events of interest (sum()) defined in columns over a 7-day window (rolling()). Sorting in ascending order gives counts over the preceding week; doing the same on the descending sort gives counts over the following week.


In [48]:
bwd = df[['Store']+columns].sort_index().groupby("Store").rolling(7, min_periods=1).sum()

In [49]:
fwd = df[['Store']+columns].sort_index(ascending=False
                                      ).groupby("Store").rolling(7, min_periods=1).sum()

Next we want to drop the Store column that got rolled up in the window function (it was summed along with the other columns), and reset the grouped index back into regular columns.

Often in pandas, there is an option to do this in place. This is time and memory efficient when working with large datasets.


In [50]:
bwd.drop('Store',1,inplace=True)
bwd.reset_index(inplace=True)

In [51]:
fwd.drop('Store',1,inplace=True)
fwd.reset_index(inplace=True)

In [52]:
df.reset_index(inplace=True)

Now we'll merge these values onto the df.


In [53]:
df = df.merge(bwd, 'left', ['Date', 'Store'], suffixes=['', '_bw'])
df = df.merge(fwd, 'left', ['Date', 'Store'], suffixes=['', '_fw'])

In [54]:
df.drop(columns,1,inplace=True)

In [55]:
df.head()


Out[55]:
Date Store AfterSchoolHoliday BeforeSchoolHoliday AfterStateHoliday BeforeStateHoliday AfterPromo BeforePromo SchoolHoliday_bw StateHoliday_bw Promo_bw SchoolHoliday_fw StateHoliday_fw Promo_fw
0 2015-07-31 1 0 0 57 0 0 0 5.0 0.0 5.0 1.0 0.0 1.0
1 2015-07-30 1 0 0 56 0 0 0 4.0 0.0 4.0 2.0 0.0 2.0
2 2015-07-29 1 0 0 55 0 0 0 3.0 0.0 3.0 3.0 0.0 3.0
3 2015-07-28 1 0 0 54 0 0 0 2.0 0.0 2.0 4.0 0.0 4.0
4 2015-07-27 1 0 0 53 0 0 0 1.0 0.0 1.0 5.0 0.0 5.0

It's usually a good idea to back up large tables of extracted/wrangled features before you join them onto another one; that way you can return to them easily if you need to make changes.


In [56]:
df.to_csv('df.csv')

In [57]:
df = pd.read_csv('df.csv', index_col=0)

In [58]:
df["Date"] = pd.to_datetime(df.Date)

In [59]:
df.columns


Out[59]:
Index(['Date', 'Store', 'AfterSchoolHoliday', 'BeforeSchoolHoliday',
       'AfterStateHoliday', 'BeforeStateHoliday', 'AfterPromo', 'BeforePromo',
       'SchoolHoliday_bw', 'StateHoliday_bw', 'Promo_bw', 'SchoolHoliday_fw',
       'StateHoliday_fw', 'Promo_fw'],
      dtype='object')

In [60]:
joined = join_df(joined, df, ['Store', 'Date'])

We'll back this up as well.


In [61]:
joined.to_csv('joined.csv')

We now have our final set of engineered features.


In [62]:
joined = pd.read_csv('joined.csv', index_col=0)
joined["Date"] = pd.to_datetime(joined.Date)
joined.columns


Out[62]:
Index(['Store', 'DayOfWeek', 'Date', 'Sales', 'Customers', 'Open', 'Promo',
       'StateHoliday', 'SchoolHoliday', 'Year', 'Month', 'Week', 'Day',
       'StoreType', 'Assortment', 'CompetitionDistance',
       'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2',
       'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval', 'State', 'file',
       'week', 'trend', 'Date_y', 'Month_y', 'Day_y', 'file_DE', 'week_DE',
       'trend_DE', 'Date_DE', 'State_DE', 'Month_DE', 'Day_DE', 'file_y',
       'Max_TemperatureC', 'Mean_TemperatureC', 'Min_TemperatureC',
       'Dew_PointC', 'MeanDew_PointC', 'Min_DewpointC', 'Max_Humidity',
       'Mean_Humidity', 'Min_Humidity', 'Max_Sea_Level_PressurehPa',
       'Mean_Sea_Level_PressurehPa', 'Min_Sea_Level_PressurehPa',
       'Max_VisibilityKm', 'Mean_VisibilityKm', 'Min_VisibilitykM',
       'Max_Wind_SpeedKm_h', 'Mean_Wind_SpeedKm_h', 'Max_Gust_SpeedKm_h',
       'Precipitationmm', 'CloudCover', 'Events', 'WindDirDegrees',
       'StateName', 'Year_y', 'Month_y.1', 'Week_y', 'Day_y.1',
       'CompetitionOpenSince', 'CompetitionDaysOpen', 'CompetitionMonthsOpen',
       'Promo2Since', 'Promo2Days', 'Promo2Weeks', 'AfterSchoolHoliday',
       'BeforeSchoolHoliday', 'AfterStateHoliday', 'BeforeStateHoliday',
       'AfterPromo', 'BeforePromo', 'SchoolHoliday_bw', 'StateHoliday_bw',
       'Promo_bw', 'SchoolHoliday_fw', 'StateHoliday_fw', 'Promo_fw'],
      dtype='object')

While these steps were explicitly outlined in the paper, they are all fairly typical feature engineering steps for dealing with time series data and are applicable in any similar setting.

Create features

Now that we've engineered all our features, we need to convert them into input compatible with a neural network.

This includes converting categorical variables into contiguous integers or one-hot encodings, normalizing continuous features to standard normal, etc.


In [63]:
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import LabelEncoder, Imputer, StandardScaler

This dictionary maps categorical variables to their embedding dimensionality. In general, variables we might expect to be conceptually more complex are given a larger dimension.


In [64]:
cat_var_dict = {'Store': 50, 'DayOfWeek': 6, 'Year': 2, 'Month': 6,
'Day': 10, 'StateHoliday': 3, 'CompetitionMonthsOpen': 2,
'Promo2Weeks': 1, 'StoreType': 2, 'Assortment': 3, 'PromoInterval': 3,
'CompetitionOpenSinceYear': 4, 'Promo2SinceYear': 4, 'State': 6,
'Week': 2, 'Events': 4, 'Promo_fw': 1,
'Promo_bw': 1, 'StateHoliday_fw': 1,
'StateHoliday_bw': 1, 'SchoolHoliday_fw': 1,
'SchoolHoliday_bw': 1}

Name categorical variables


In [65]:
cat_vars = [o[0] for o in 
            sorted(cat_var_dict.items(), key=operator.itemgetter(1), reverse=True)]

In [66]:
"""cat_vars = ['Store', 'DayOfWeek', 'Year', 'Month', 'Day', 'StateHoliday',
           'StoreType', 'Assortment', 'Week', 'Events', 'Promo2SinceYear',
            'CompetitionOpenSinceYear', 'PromoInterval', 'Promo', 'SchoolHoliday', 'State']"""


Out[66]:
"cat_vars = ['Store', 'DayOfWeek', 'Year', 'Month', 'Day', 'StateHoliday',\n           'StoreType', 'Assortment', 'Week', 'Events', 'Promo2SinceYear',\n            'CompetitionOpenSinceYear', 'PromoInterval', 'Promo', 'SchoolHoliday', 'State']"

Likewise for continuous


In [67]:
# mean/max wind; min temp; cloud; min/mean humid; 
contin_vars = ['CompetitionDistance', 
   'Max_TemperatureC', 'Mean_TemperatureC', 'Min_TemperatureC',
   'Max_Humidity', 'Mean_Humidity', 'Min_Humidity', 'Max_Wind_SpeedKm_h', 
   'Mean_Wind_SpeedKm_h', 'CloudCover', 'trend', 'trend_DE',
   'AfterStateHoliday', 'BeforeStateHoliday', 'Promo', 'SchoolHoliday']

In [68]:
"""contin_vars = ['CompetitionDistance', 'Max_TemperatureC', 'Mean_TemperatureC', 
   'Max_Humidity', 'trend', 'trend_DE', 'AfterStateHoliday', 'BeforeStateHoliday']"""


Out[68]:
"contin_vars = ['CompetitionDistance', 'Max_TemperatureC', 'Mean_TemperatureC', \n   'Max_Humidity', 'trend', 'trend_DE', 'AfterStateHoliday', 'BeforeStateHoliday']"

Replace nulls w/ 0 for continuous, "" for categorical.


In [69]:
for v in contin_vars: joined.loc[joined[v].isnull(), v] = 0
for v in cat_vars: joined.loc[joined[v].isnull(), v] = ""

Here we create a list of tuples, each containing a variable and an instance of a transformer for that variable.

For categoricals, we use a label encoder that maps categories to contiguous integers. For continuous variables, we standardize them.


In [70]:
cat_maps = [(o, LabelEncoder()) for o in cat_vars]
contin_maps = [([o], StandardScaler()) for o in contin_vars]

The same instances need to be used for the test set as well, so values are mapped/standardized appropriately.

DataFrameMapper will keep track of these variable-to-instance mappings.


In [71]:
cat_mapper = DataFrameMapper(cat_maps)
cat_map_fit = cat_mapper.fit(joined)
cat_cols = len(cat_map_fit.features)
cat_cols


Out[71]:
22

In [72]:
contin_mapper = DataFrameMapper(contin_maps)
contin_map_fit = contin_mapper.fit(joined)
contin_cols = len(contin_map_fit.features)
contin_cols


/home/roebius/pj/p3/lib/python3.5/site-packages/sklearn/utils/validation.py:429: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
  warnings.warn(msg, _DataConversionWarning)
Out[72]:
16

Example: the first five columns of the zeroth row, transformed appropriately.


In [73]:
cat_map_fit.transform(joined)[0,:5], contin_map_fit.transform(joined)[0,:5]


/home/roebius/pj/p3/lib/python3.5/site-packages/sklearn/utils/validation.py:429: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
  warnings.warn(msg, _DataConversionWarning)
Out[73]:
(array([ 0, 30,  4,  6,  4]),
 array([-0.53772351,  1.02444552,  0.81863226,  0.32504749,  0.61112004]))

We can also pickle these mappings, which is great for portability!


In [74]:
pickle.dump(contin_map_fit, open('contin_maps.pickle', 'wb'))
pickle.dump(cat_map_fit, open('cat_maps.pickle', 'wb'))
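
The saved mappers can later be restored with pickle.load, e.g. in a separate inference process:


In [ ]:
# - Restore the fitted mappers later if needed (uncomment to use)
# contin_map_fit = pickle.load(open('contin_maps.pickle', 'rb'))
# cat_map_fit = pickle.load(open('cat_maps.pickle', 'rb'))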

In [75]:
[len(o[1].classes_) for o in cat_map_fit.features]


Out[75]:
[1115, 31, 7, 12, 12, 22, 8, 23, 4, 3, 2, 25, 3, 4, 52, 6, 8, 3, 6, 8, 3, 26]

Sample data

Next, the authors removed all instances where the store had zero sales or was closed.


In [76]:
joined_sales = joined[joined.Sales!=0]
n = len(joined_sales)

We speculate that this may have cost them a higher standing in the competition. One reason is that a little EDA reveals there are often periods where stores are closed, typically for refurbishment. Before and after these periods there are natural spikes in sales, as one might expect. By omitting this data from their training, the authors gave up the ability to leverage information about these periods to predict this otherwise volatile behavior.
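
For instance, a minimal sketch of the kind of EDA that reveals these closed stretches (store 150 here is an arbitrary example):


In [ ]:
# Sketch: lengths of consecutive closed periods for a single store
s = joined[joined.Store == 150].sort_values('Date')
closed = (s.Open == 0)
run_id = closed.ne(closed.shift()).cumsum()            # label consecutive runs of open/closed days
closed_run_lengths = s[closed].groupby(run_id[closed]).size()
closed_run_lengths.sort_values(ascending=False).head()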


In [77]:
n


Out[77]:
844338

We're going to run on a sample.


In [78]:
samp_size = 100000
np.random.seed(42)
idxs = sorted(np.random.choice(n, samp_size, replace=False))

In [79]:
joined_samp = joined_sales.iloc[idxs].set_index("Date")

In [80]:
samp_size = n
joined_samp = joined_sales.set_index("Date")

In time series data, cross-validation is not random. Instead, our holdout data is always the most recent data, as it would be in a real application.

We've taken the last 10% as our validation set.


In [81]:
train_ratio = 0.9
train_size = int(samp_size * train_ratio)

In [82]:
train_size


Out[82]:
759904

In [83]:
joined_valid = joined_samp[train_size:]
joined_train = joined_samp[:train_size]
len(joined_valid), len(joined_train)


Out[83]:
(84434, 759904)

Here's a preprocessor for our categoricals using our instance mapper.


In [84]:
def cat_preproc(dat):
    return cat_map_fit.transform(dat).astype(np.int64)

In [85]:
cat_map_train = cat_preproc(joined_train)
cat_map_valid = cat_preproc(joined_valid)

Same for continuous.


In [86]:
def contin_preproc(dat):
    return contin_map_fit.transform(dat).astype(np.float32)

In [87]:
contin_map_train = contin_preproc(joined_train)
contin_map_valid = contin_preproc(joined_valid)


/home/roebius/pj/p3/lib/python3.5/site-packages/sklearn/utils/validation.py:429: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
  warnings.warn(msg, _DataConversionWarning)

Grab our targets.


In [88]:
y_train_orig = joined_train.Sales
y_valid_orig = joined_valid.Sales

Finally, the authors modified the target values by applying a logarithmic transformation and normalizing to unit scale by dividing by the maximum log value.

Log transformations are used on this type of data frequently to attain a nicer shape.

Further, by scaling to the unit interval we can use a sigmoid output in our neural network. To invert the transform, we multiply by the maximum log value to recover the log of sales, then exponentiate.


In [89]:
max_log_y = np.max(np.log(joined_samp.Sales))
y_train = np.log(y_train_orig)/max_log_y
y_valid = np.log(y_valid_orig)/max_log_y
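
As a quick sanity check (a sketch), the transform should round-trip back to the original sales:


In [ ]:
# exp of the rescaled value should recover the original sales (up to float error)
np.allclose(np.exp(y_train * max_log_y), y_train_orig)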

Note: Some testing shows this doesn't make a big difference.


In [90]:
"""#y_train = np.log(y_train)
ymean=y_train_orig.mean()
ystd=y_train_orig.std()
y_train = (y_train_orig-ymean)/ystd
#y_valid = np.log(y_valid)
y_valid = (y_valid_orig-ymean)/ystd"""


Out[90]:
'#y_train = np.log(y_train)\nymean=y_train_orig.mean()\nystd=y_train_orig.std()\ny_train = (y_train_orig-ymean)/ystd\n#y_valid = np.log(y_valid)\ny_valid = (y_valid_orig-ymean)/ystd'

Root mean square percentage error (RMSPE) is the metric Kaggle used for this competition.
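
Written out (matching the rmspe function below):

$$\mathrm{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{y_i}\right)^2}$$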


In [91]:
def rmspe(y_pred, targ = y_valid_orig):
    pct_var = (targ - y_pred)/targ
    return math.sqrt(np.square(pct_var).mean())

These undo the target transformations.


In [92]:
def log_max_inv(preds, mx = max_log_y):
    return np.exp(preds * mx)

In [93]:
# - This can be used if ymean and ystd are calculated above (they are currently commented out)
def normalize_inv(preds):
    return preds * ystd + ymean

Create models

Now we're ready to put together our models.

Much of the following code has commented out portions / alternate implementations.


In [94]:
"""
1 97s - loss: 0.0104 - val_loss: 0.0083
2 93s - loss: 0.0076 - val_loss: 0.0076
3 90s - loss: 0.0071 - val_loss: 0.0076
4 90s - loss: 0.0068 - val_loss: 0.0075
5 93s - loss: 0.0066 - val_loss: 0.0075
6 95s - loss: 0.0064 - val_loss: 0.0076
7 98s - loss: 0.0063 - val_loss: 0.0077
8 97s - loss: 0.0062 - val_loss: 0.0075
9 95s - loss: 0.0061 - val_loss: 0.0073
0 101s - loss: 0.0061 - val_loss: 0.0074
"""


Out[94]:
'\n1 97s - loss: 0.0104 - val_loss: 0.0083\n2 93s - loss: 0.0076 - val_loss: 0.0076\n3 90s - loss: 0.0071 - val_loss: 0.0076\n4 90s - loss: 0.0068 - val_loss: 0.0075\n5 93s - loss: 0.0066 - val_loss: 0.0075\n6 95s - loss: 0.0064 - val_loss: 0.0076\n7 98s - loss: 0.0063 - val_loss: 0.0077\n8 97s - loss: 0.0062 - val_loss: 0.0075\n9 95s - loss: 0.0061 - val_loss: 0.0073\n0 101s - loss: 0.0061 - val_loss: 0.0074\n'

In [95]:
def split_cols(arr):
    return np.hsplit(arr,arr.shape[1])

In [96]:
# - This gives the correct list length for the model
# - (list of 23 elements: 22 embeddings + 1 array of 16-dim elements)  
map_train = split_cols(cat_map_train) + [contin_map_train]
map_valid = split_cols(cat_map_valid) + [contin_map_valid]

In [97]:
len(map_train)


Out[97]:
23

In [98]:
# map_train = split_cols(cat_map_train) + split_cols(contin_map_train)
# map_valid = split_cols(cat_map_valid) + split_cols(contin_map_valid)

Helper function for getting categorical name and dim.


In [99]:
def cat_map_info(feat): return feat[0], len(feat[1].classes_)

In [100]:
cat_map_info(cat_map_fit.features[1])


Out[100]:
('Day', 31)

In [101]:
# - In Keras 2 the "initializations" module is not available.
# - To keep here the custom initializer the code from Keras 1 "uniform" initializer is exploited 
def my_init(scale):
#     return lambda shape, name=None: initializations.uniform(shape, scale=scale, name=name)
    # Return a closure so that `shape` and `name` are supplied by Keras when the initializer is called
    return lambda shape, name=None: K.variable(
        np.random.uniform(low=-scale, high=scale, size=shape), name=name)

In [102]:
# - In Keras 2 the "initializations" module is not available.
# - To keep here the custom initializer the code from Keras 1 "uniform" initializer is exploited 
def emb_init(shape, name=None): 
#     return initializations.uniform(shape, scale=2/(shape[1]+1), name=name)
    return K.variable(np.random.uniform(low=-2/(shape[1]+1), high=2/(shape[1]+1), size=shape),
                      name=name)

Helper function for constructing embeddings. Notice the commented-out code: there are several different ways to compute the embedding size at play.

Also note that we're flattening the embedding. Embeddings in Keras come out as an element of a sequence, like we might use for a sequence of words; here we just want to concatenate them, so we flatten the length-1 sequence into a vector.


In [103]:
def get_emb(feat):
    name, c = cat_map_info(feat)
    #c2 = cat_var_dict[name]
    c2 = (c+1)//2
    if c2>50: c2=50
    inp = Input((1,), dtype='int64', name=name+'_in')
    # , kernel_regularizer=l2(1e-6)  # Keras 2
    u = Flatten(name=name+'_flt')(Embedding(c, c2, input_length=1, embeddings_initializer=emb_init)(inp))  # Keras 2
#     u = Flatten(name=name+'_flt')(Embedding(c, c2, input_length=1)(inp))
    return inp,u

Helper function for continuous inputs.


In [104]:
def get_contin(feat):
    name = feat[0][0]
    inp = Input((1,), name=name+'_in')
    return inp, Dense(1, name=name+'_d', kernel_initializer=my_init(1.))(inp)  # Keras 2

Let's build them.


In [105]:
contin_inp = Input((contin_cols,), name='contin')
contin_out = Dense(contin_cols*10, activation='relu', name='contin_d')(contin_inp)
#contin_out = BatchNormalization()(contin_out)

Now we can put them together. Given the inputs, continuous and categorical embeddings, we're going to concatenate all of them.

Next, we're going to pass through some dropout, then two dense layers w/ ReLU activations, then dropout again, then the sigmoid activation we mentioned earlier.


In [106]:
embs = [get_emb(feat) for feat in cat_map_fit.features]
#conts = [get_contin(feat) for feat in contin_map_fit.features]
#contin_d = [d for inp,d in conts]
x = concatenate([emb for inp,emb in embs] + [contin_out])  # Keras 2
#x = concatenate([emb for inp,emb in embs] + contin_d)  # Keras 2

x = Dropout(0.02)(x)
x = Dense(1000, activation='relu', kernel_initializer='uniform')(x)
x = Dense(500, activation='relu', kernel_initializer='uniform')(x)
x = Dropout(0.2)(x)
x = Dense(1, activation='sigmoid')(x)

model = Model([inp for inp,emb in embs] + [contin_inp], x)
#model = Model([inp for inp,emb in embs] + [inp for inp,d in conts], x)
model.compile('adam', 'mean_absolute_error')
#model.compile(Adam(), 'mse')

Start training


In [107]:
%%time 
hist = model.fit(map_train, y_train, batch_size=128, epochs=25,
                 verbose=1, validation_data=(map_valid, y_valid))


Train on 759904 samples, validate on 84434 samples
Epoch 1/25
759904/759904 [==============================] - 53s - loss: 0.0115 - val_loss: 0.0109
Epoch 2/25
759904/759904 [==============================] - 52s - loss: 0.0081 - val_loss: 0.0106
Epoch 3/25
759904/759904 [==============================] - 53s - loss: 0.0072 - val_loss: 0.0103
Epoch 4/25
759904/759904 [==============================] - 53s - loss: 0.0068 - val_loss: 0.0098
Epoch 5/25
759904/759904 [==============================] - 53s - loss: 0.0066 - val_loss: 0.0099
Epoch 6/25
759904/759904 [==============================] - 54s - loss: 0.0064 - val_loss: 0.0098
Epoch 7/25
759904/759904 [==============================] - 54s - loss: 0.0063 - val_loss: 0.0099
Epoch 8/25
759904/759904 [==============================] - 54s - loss: 0.0062 - val_loss: 0.0099
Epoch 9/25
759904/759904 [==============================] - 54s - loss: 0.0062 - val_loss: 0.0098
Epoch 10/25
759904/759904 [==============================] - 53s - loss: 0.0061 - val_loss: 0.0097
Epoch 11/25
759904/759904 [==============================] - 53s - loss: 0.0061 - val_loss: 0.0101
Epoch 12/25
759904/759904 [==============================] - 53s - loss: 0.0061 - val_loss: 0.0096
Epoch 13/25
759904/759904 [==============================] - 53s - loss: 0.0060 - val_loss: 0.0098
Epoch 14/25
759904/759904 [==============================] - 53s - loss: 0.0060 - val_loss: 0.0098
Epoch 15/25
759904/759904 [==============================] - 53s - loss: 0.0060 - val_loss: 0.0096
Epoch 16/25
759904/759904 [==============================] - 53s - loss: 0.0060 - val_loss: 0.0096
Epoch 17/25
759904/759904 [==============================] - 53s - loss: 0.0059 - val_loss: 0.0100
Epoch 18/25
759904/759904 [==============================] - 53s - loss: 0.0059 - val_loss: 0.0095
Epoch 19/25
759904/759904 [==============================] - 53s - loss: 0.0059 - val_loss: 0.0097
Epoch 20/25
759904/759904 [==============================] - 53s - loss: 0.0059 - val_loss: 0.0098
Epoch 21/25
759904/759904 [==============================] - 53s - loss: 0.0059 - val_loss: 0.0094
Epoch 22/25
759904/759904 [==============================] - 53s - loss: 0.0059 - val_loss: 0.0095
Epoch 23/25
759904/759904 [==============================] - 53s - loss: 0.0058 - val_loss: 0.0095
Epoch 24/25
759904/759904 [==============================] - 53s - loss: 0.0058 - val_loss: 0.0096
Epoch 25/25
759904/759904 [==============================] - 53s - loss: 0.0058 - val_loss: 0.0097
CPU times: user 48min 2s, sys: 5min 7s, total: 53min 10s
Wall time: 22min 22s

In [108]:
hist.history


Out[108]:
{'loss': [0.011543708899975998,
  0.0081428832266230763,
  0.0071605561921078793,
  0.0067688872935217853,
  0.006562436936816074,
  0.0064265123611733288,
  0.0063263529084675255,
  0.0062496530353846023,
  0.0061991263113358392,
  0.0061371925805966789,
  0.0060975233105218709,
  0.0060683156664236917,
  0.0060377205630576351,
  0.0060002038249019713,
  0.0059846025952508429,
  0.0059559546078189734,
  0.0059432193379071567,
  0.005925937787276579,
  0.0058936794265827342,
  0.0058852603196882548,
  0.0058711308764912897,
  0.0058566379312929573,
  0.0058427490107691602,
  0.0058289598719467768,
  0.0058106531752574247],
 'val_loss': [0.010913814287089577,
  0.010564941083243145,
  0.01027382288614701,
  0.0098251879501716419,
  0.0098826151931447094,
  0.0098126728207583811,
  0.0098805974064248396,
  0.0099482648354793594,
  0.0097641247889236529,
  0.0097114029873534794,
  0.0100529663411497,
  0.0096321621593877232,
  0.0097647817569319374,
  0.0098252719552163852,
  0.0096129882340110218,
  0.0095683800178879407,
  0.010039474427044233,
  0.0095306029156341735,
  0.009686993617640945,
  0.0097661329346007009,
  0.0093741671631562863,
  0.0094836538342612225,
  0.0095238257312254323,
  0.0096323451664686414,
  0.0097270023006603736]}

In [109]:
plot_train(hist)



In [110]:
preds = np.squeeze(model.predict(map_valid, 1024))

Result on validation data: 0.1678 (samp 150k, 0.75 trn)


In [111]:
log_max_inv(preds)


Out[111]:
array([  7165.78076172,   8693.43945312,   5352.79882812,   7027.24267578,
         5618.58349609,   5587.34033203,   6442.33349609,   8173.66503906,
         5827.05566406,   6670.58496094,   9821.00585938,   6997.49658203,
         6749.18359375,   5440.39208984,   7297.14794922,   6256.46826172,
         8004.54150391,   5506.05957031,   3748.50585938,   6218.23828125,
       ...,   5906.76220703,  19001.56835938,   4391.00683594,
         4296.71191406,   8192.12109375,  22114.27734375,   3086.32592773,
         5314.42626953,   3874.55859375,   8417.43261719,   4149.20214844,
         3612.04711914,   2941.67211914,  15090.69824219,   4947.65966797,
         4927.15380859,  13834.59179688,   6490.71289062,   4514.25390625,
         7183.3515625 ], dtype=float32)
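
To express this in terms of the competition metric, the validation RMSPE can be computed with the helpers defined above (a sketch):


In [ ]:
# Sketch: undo the target transform and score against the original validation sales
rmspe(log_max_inv(preds))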

In [114]:
# - This will work if ymean and ystd are calculated in the "Data" section above (in this case uncomment)
# normalize_inv(preds)

Using 3rd place data


In [ ]:
pkl_path = '/data/jhoward/github/entity-embedding-rossmann/'

In [ ]:
def load_pickle(fname): 
    return pickle.load(open(pkl_path+fname + '.pickle', 'rb'))

In [ ]:
[x_pkl_orig, y_pkl_orig] = load_pickle('feature_train_data')

In [ ]:
max_log_y_pkl = np.max(np.log(y_pkl_orig))
y_pkl = np.log(y_pkl_orig)/max_log_y_pkl

In [ ]:
pkl_vars = ['Open', 'Store', 'DayOfWeek', 'Promo', 'Year', 'Month', 'Day', 
     'StateHoliday', 'SchoolHoliday', 'CompetitionMonthsOpen', 'Promo2Weeks', 
    'Promo2Weeks_L', 'CompetitionDistance',
    'StoreType', 'Assortment', 'PromoInterval', 'CompetitionOpenSinceYear',
    'Promo2SinceYear', 'State', 'Week', 'Max_TemperatureC', 'Mean_TemperatureC', 
    'Min_TemperatureC', 'Max_Humidity', 'Mean_Humidity', 'Min_Humidity', 'Max_Wind_SpeedKm_h', 
    'Mean_Wind_SpeedKm_h', 'CloudCover','Events', 'Promo_fw', 'Promo_bw', 
    'StateHoliday_fw', 'StateHoliday_bw', 'AfterStateHoliday', 'BeforeStateHoliday', 
    'SchoolHoliday_fw', 'SchoolHoliday_bw', 'trend_DE', 'trend']

In [ ]:
x_pkl = np.array(x_pkl_orig)

In [ ]:
gt_enc = StandardScaler()
gt_enc.fit(x_pkl[:,-2:])

In [ ]:
x_pkl[:,-2:] = gt_enc.transform(x_pkl[:,-2:])

In [ ]:
x_pkl.shape

In [ ]:
x_pkl = x_pkl[idxs]
y_pkl = y_pkl[idxs]

In [ ]:
x_pkl_trn, x_pkl_val = x_pkl[:train_size], x_pkl[train_size:]
y_pkl_trn, y_pkl_val = y_pkl[:train_size], y_pkl[train_size:]

In [ ]:
x_pkl_trn.shape

In [ ]:
xgb_parms = {'learning_rate': 0.1, 'subsample': 0.6, 
             'colsample_bylevel': 0.6, 'silent': True, 'objective': 'reg:linear'}

In [ ]:
xdata_pkl = xgboost.DMatrix(x_pkl_trn, y_pkl_trn, feature_names=pkl_vars)

In [ ]:
xdata_val_pkl = xgboost.DMatrix(x_pkl_val, y_pkl_val, feature_names=pkl_vars)

In [ ]:
xgb_parms['seed'] = random.randint(0,1e9)
model_pkl = xgboost.train(xgb_parms, xdata_pkl)

In [ ]:
model_pkl.eval(xdata_val_pkl)

In [ ]:
#0.117473

In [ ]:
importance = model_pkl.get_fscore()
importance = sorted(importance.items(), key=operator.itemgetter(1))

df = pd.DataFrame(importance, columns=['feature', 'fscore'])
df['fscore'] = df['fscore'] / df['fscore'].sum()

df.plot(kind='barh', x='feature', y='fscore', legend=False, figsize=(6, 10))
plt.title('XGBoost Feature Importance')
plt.xlabel('relative importance');

Neural net


In [ ]:
#np.savez_compressed('vars.npz', pkl_cats, pkl_contins)
#np.savez_compressed('deps.npz', y_pkl)

In [ ]:
pkl_cats = np.stack([x_pkl[:,pkl_vars.index(f)] for f in cat_vars], 1)
pkl_contins = np.stack([x_pkl[:,pkl_vars.index(f)] for f in contin_vars], 1)

In [ ]:
co_enc = StandardScaler().fit(pkl_contins)
pkl_contins = co_enc.transform(pkl_contins)

In [ ]:
pkl_contins_trn, pkl_contins_val = pkl_contins[:train_size], pkl_contins[train_size:]
pkl_cats_trn, pkl_cats_val = pkl_cats[:train_size], pkl_cats[train_size:]
y_pkl_trn, y_pkl_val = y_pkl[:train_size], y_pkl[train_size:]

In [ ]:
def get_emb_pkl(feat):
    name, c = cat_map_info(feat)
    c2 = (c+2)//3
    if c2>50: c2=50
    inp = Input((1,), dtype='int64', name=name+'_in')
    u = Flatten(name=name+'_flt')(Embedding(c, c2, input_length=1, embeddings_initializer=emb_init)(inp))  # Keras 2
    return inp,u

In [ ]:
n_pkl_contin = pkl_contins_trn.shape[1]
contin_inp = Input((n_pkl_contin,), name='contin')
contin_out = BatchNormalization()(contin_inp)

In [ ]:
map_train_pkl = split_cols(pkl_cats_trn) + [pkl_contins_trn]
map_valid_pkl = split_cols(pkl_cats_val) + [pkl_contins_val]

In [ ]:
def train_pkl(bs=128, ne=10):
    return model_pkl.fit(map_train_pkl, y_pkl_trn, batch_size=bs, epochs=ne,  # Keras 2: epochs
                 verbose=0, validation_data=(map_valid_pkl, y_pkl_val))

In [ ]:
def get_model_pkl(): 
    # conts = [get_contin_pkl(feat) for feat in contin_map_fit.features]  # (unused; get_contin_pkl is not defined)
    embs = [get_emb_pkl(feat) for feat in cat_map_fit.features]
    x = concatenate([emb for inp,emb in embs] + [contin_out])  # Keras 2

    x = Dropout(0.02)(x)
    x = Dense(1000, activation='relu', kernel_initializer='uniform')(x)  # Keras 2
    x = Dense(500, activation='relu', kernel_initializer='uniform')(x)  # Keras 2
    x = Dense(1, activation='sigmoid')(x)

    model_pkl = Model([inp for inp,emb in embs] + [contin_inp], x)
    model_pkl.compile('adam', 'mean_absolute_error')
    #model.compile(Adam(), 'mse')
    return model_pkl

In [ ]:
model_pkl = get_model_pkl()

In [ ]:
train_pkl(128, 10).history['val_loss']

In [ ]:
K.set_value(model_pkl.optimizer.lr, 1e-4)
train_pkl(128, 5).history['val_loss']

In [ ]:
"""
1 97s - loss: 0.0104 - val_loss: 0.0083
2 93s - loss: 0.0076 - val_loss: 0.0076
3 90s - loss: 0.0071 - val_loss: 0.0076
4 90s - loss: 0.0068 - val_loss: 0.0075
5 93s - loss: 0.0066 - val_loss: 0.0075
6 95s - loss: 0.0064 - val_loss: 0.0076
7 98s - loss: 0.0063 - val_loss: 0.0077
8 97s - loss: 0.0062 - val_loss: 0.0075
9 95s - loss: 0.0061 - val_loss: 0.0073
0 101s - loss: 0.0061 - val_loss: 0.0074
"""

In [ ]:
plot_train(hist)

In [ ]:
preds = np.squeeze(model_pkl.predict(map_valid_pkl, 1024))

In [ ]:
y_orig_pkl_val = log_max_inv(y_pkl_val, max_log_y_pkl)

In [ ]:
rmspe(log_max_inv(preds, max_log_y_pkl), y_orig_pkl_val)

XGBoost

Xgboost is extremely quick and easy to use. Aside from being a powerful predictive model, it gives us information about feature importance.


In [115]:
X_train = np.concatenate([cat_map_train, contin_map_train], axis=1)

In [116]:
X_valid = np.concatenate([cat_map_valid, contin_map_valid], axis=1)

In [117]:
all_vars = cat_vars + contin_vars

In [118]:
xgb_parms = {'learning_rate': 0.1, 'subsample': 0.6, 
             'colsample_bylevel': 0.6, 'silent': True, 'objective': 'reg:linear'}

In [119]:
xdata = xgboost.DMatrix(X_train, y_train, feature_names=all_vars)

In [120]:
xdata_val = xgboost.DMatrix(X_valid, y_valid, feature_names=all_vars)

In [121]:
xgb_parms['seed'] = random.randint(0,1e9)
model = xgboost.train(xgb_parms, xdata)

In [122]:
model.eval(xdata_val)


Out[122]:
b'[0]\teval-rmse:0.113776'

In [123]:
model.eval(xdata_val)


Out[123]:
b'[0]\teval-rmse:0.113776'

Competition distance is easily the most important feature, while events are not important at all.

In real applications, putting together a feature importance plot is often a first step. Oftentimes, we can remove hundreds of thousands of features from consideration with importance plots.


In [124]:
importance = model.get_fscore()
importance = sorted(importance.items(), key=operator.itemgetter(1))

df = pd.DataFrame(importance, columns=['feature', 'fscore'])
df['fscore'] = df['fscore'] / df['fscore'].sum()

df.plot(kind='barh', x='feature', y='fscore', legend=False, figsize=(6, 10))
plt.title('XGBoost Feature Importance')
plt.xlabel('relative importance');


End


In [ ]: