The Kaggle Rossmann Challenge:

Gradient Boosting vs. Deep Learning




Ágoston Török, Krisztián Varga, Ádám Divák, Balázs Fehér PhD

Who we are

Ágoston

  • Data scientist @Synetiq
  • Junior researcher @MTA TTK Brain Imaging Centre
  • Lecturer & PhD candidate @ELTE

Balázs

  • Data scientist @Deloitte | London from March 2016
  • Data scientist @Infomatix | Budapest Nov 2015 - Feb 2016
  • PhD in Cognitive Science | Saarbrücken 2011 - 2015

Krisztián

  • Research Engineer @Bell Labs, Nokia

Ádám

  • CTO @Synetiq

Why compete on Kaggle

  • Fame and Fortune
  • It is fun
  • It is a great way to learn

Let's get started


In [1]:
%matplotlib inline
%pylab inline

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings as w

import rossmann_helper_functions as rosshellfun #:)

sns.set_style('whitegrid')
pd.set_option('display.max_columns', None)
w.filterwarnings("ignore", category=DeprecationWarning) 
w.filterwarnings("ignore", category=FutureWarning)


Populating the interactive namespace from numpy and matplotlib
Rossmann Helper Functions imported

First look at the competition data


Training data set
train.csv


- 2.5 years of daily sales data (Jan 2013 – Jul 2015)
- number of customers
- promotion activities
- store opening hours
- state / school holiday details
Store details
stores.csv


- store type code
- assortment type code
- long term promotions
- nearest competitor details
- store locations are not given
Test data set
test.csv


- same as train.csv
- no sales and customer data

Load the training, store and test data using pandas


In [2]:
types = {
    'CompetitionOpenSinceYear': np.dtype(int),
    'CompetitionOpenSinceMonth': np.dtype(int),
    'StateHoliday': np.dtype(str),
    'Promo2SinceWeek': np.dtype(int),
    'SchoolHoliday': np.dtype(int),
    'PromoInterval': np.dtype(str)
}
train = pd.read_csv("../train.csv", parse_dates=[2], dtype=types)
test  = pd.read_csv("../test.csv",  parse_dates=[3], dtype=types)
store = pd.read_csv("../store.csv")

train['Day']   = train.Date.dt.day
train['Month'] = train.Date.dt.month
train['Year']  = train.Date.dt.year

train.sort(['Store','Year','Month','Day'], inplace = True)

In [3]:
train.head(5)


Out[3]:
Store DayOfWeek Date Sales Customers Open Promo StateHoliday SchoolHoliday Day Month Year
1016095 1 2 2013-01-01 0 0 0 0 a 1 1 1 2013
1014980 1 3 2013-01-02 5530 668 1 0 0 1 2 1 2013
1013865 1 4 2013-01-03 4327 578 1 0 0 1 3 1 2013
1012750 1 5 2013-01-04 4486 619 1 0 0 1 4 1 2013
1011635 1 6 2013-01-05 4997 635 1 0 0 1 5 1 2013

In [4]:
test.head(5)


Out[4]:
Id Store DayOfWeek Date Open Promo StateHoliday SchoolHoliday
0 1 1 4 2015-09-17 1 1 0 0
1 2 3 4 2015-09-17 1 1 0 0
2 3 7 4 2015-09-17 1 1 0 0
3 4 8 4 2015-09-17 1 1 0 0
4 5 9 4 2015-09-17 1 1 0 0

Exploration and preparation

- Explore the data

- Drop bad values and impute missing ones

- Feature widening

Data exploration – days of week

Sundays (where stores are open) and Mondays dominate


In [5]:
fig, (axis1) = plt.subplots(1, 1, figsize=(15,4))
train_plot = train[train["Open"] == 1]
sns.boxplot(x='DayOfWeek',y='Sales', data=train_plot.sort('DayOfWeek'), palette="husl", ax=axis1)
plt


Out[5]:
<module 'matplotlib.pyplot' from 'c:\\python34\\lib\\site-packages\\matplotlib\\pyplot.py'>

In [6]:
fig, (axis1) = plt.subplots(1, 1, figsize=(15,4))
train_plot = train[train["Open"] == 1]
sns.boxplot(x='DayOfWeek',y='Sales', data=train_plot.sort('DayOfWeek'), palette="husl", ax=axis1)
plt.ylim(ymax=12500, ymin=3000)


Out[6]:
(3000, 12500)

Data exploration – days of month

"Payday waves", otherwise stable


In [7]:
fig, (axis1) = plt.subplots(1, 1, figsize=(15,7))
sns.boxplot(x='Day',y='Sales', data=train_plot.sort('Day'), palette="husl", ax=axis1)


Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x8ada588>

Data exploration – Closing periods

- Short closing periods (e.g. weekends) are visible on the day-of-week plot

- But what about longer periods? (renovation, etc.)


In [8]:
for fest in ['a','b','c']:
    print('Max. consecutive day(s) for holiday type "%s" is %i' 
          % (fest, rosshellfun.longest_streak(train['StateHoliday'].values, fest)))


Max. consecutive day(s) for holiday type "a" is 1
Max. consecutive day(s) for holiday type "b" is 1
Max. consecutive day(s) for holiday type "c" is 2

In [9]:
print('Max. consecutive day(s) where a store was marked as closed is %i!' 
      % rosshellfun.longest_streak(train['Open'].values, 0))
print('Max. consecutive day(s) where a store was marked with zero sales is %i!' 
      % rosshellfun.longest_streak(train['Sales'].values, 0))


Max. consecutive day(s) where a store was marked as closed is 185!
Max. consecutive day(s) where a store was marked with zero sales is 185!
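
The longest_streak helper itself is not shown in this notebook; a minimal sketch of what it might look like (an assumption, not the actual rosshellfun implementation):

def longest_streak(values, target):
    # length of the longest run of consecutive entries equal to `target`
    best = current = 0
    for v in values:
        current = current + 1 if v == target else 0
        best = max(best, current)
    return best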

Example store


In [10]:
plt.plot(train.loc[train['Store'] == 708, 'Sales'], 'b.')
plt.ylim(ymax=20000, ymin=200)


Out[10]:
(200, 20000)

Add 'DaysLeft' and 'DaysAfter' columns

Sales by the number of days after reopening and before a longer closing


In [11]:
rosshellfun.addTimeToClose(train)
rosshellfun.addTimeToClose(test)
plt.xlabel('Days after/before longer closed period')


Out[11]:
<matplotlib.text.Text at 0x14621b70>

Impute data & drop bad data

- Check for missing values

- Omit closed and zero-sales days from the training set (their prediction is trivial)

- Impute missing values


In [12]:
print("There are %i missing Open/Close values in the TRAINING set." 
      % sum(np.isnan(train.loc[:,'Open'])))
print("There are %i missing Open/Close values in the TEST set."
      % sum(np.isnan(test.loc[:,'Open'])))

train = train[train["Open"] != 0]  
train = train[train["Sales"] > 0]

test.fillna(1, inplace=True)


There are 0 missing Open/Close values in the TRAINING set.
There are 11 missing Open/Close values in the TEST set.

External data and feature widening

Trivial join with store information


In [13]:
train = pd.merge(train, store, on='Store')
test = pd.merge(test, store, on='Store')

External sources



Weather

Currency (EUR/USD)

Search popularity

Weather

  • WunderGround has an easy-to-parse REST interface and data for many cities in Germany
  • A ratio threshold was used to determine the overall likelihood of bad weather on a given day (a hedged sketch follows)
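
The weather helper is not listed here; a hedged sketch of how a per-date WeatherBad ratio could be derived from per-city event data (the event names and column layout are assumptions, not the actual file format):

BAD_EVENTS = ('Rain', 'Snow', 'Fog', 'Thunderstorm', 'Hail')  # hypothetical event names

def weather_bad_ratio(weather_df):
    # weather_df is assumed to hold one row per (Date, City) with an 'Events' string
    is_bad = weather_df['Events'].fillna('').apply(
        lambda ev: any(e in ev for e in BAD_EVENTS))
    return (weather_df.assign(IsBad=is_bad)
                      .groupby('Date')['IsBad'].mean()
                      .rename('WeatherBad')
                      .reset_index())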

EUR-USD currency

Google search for "Rossmann"

Other ideas

  • Population density (store location backtracking was not possible)
  • Child births == diaper++ (we found no suitable data source)

In [14]:
# Load external data
ext_curr = rosshellfun.ext_load_currency_data(
    "../ExternalData/ext-currfx-eur-usd.csv") # EurUsdRate
ext_search = rosshellfun.ext_load_search_data(
    "../ExternalData/ext-rossmann-google-search-intensities-2013-2015.csv") # SearchRossmanGermany, SearchRossmanWorld
ext_weather = rosshellfun.ext_load_weather_data(
    "../ExternalData/ext-weather-events-germany.csv") # WeatherBad

# Join the existing data sets with the external data
train2 = train \
    .merge(ext_weather, on="Date") \
    .merge(ext_search, on="Date") \
    .merge(ext_curr, on="Date")
test2 = test \
    .merge(ext_weather, on="Date") \
    .merge(ext_search, on="Date") \
    .merge(ext_curr, on="Date")

print("Shape of vectors before/after joined external data:\nTrain", 
      train.shape, "=", train2.shape, "\nTest", test.shape, "=", test2.shape)

train = train2
test = test2


Shape of vectors before/after joined external data:
Train (844338, 23) = (844338, 27) 
Test (41088, 19) = (41088, 23)

Feature widening – a few more ideas

  • Map letter codes to integers (a, b, c, ... -> 1, 2, 3, ...)
  • Add 'WeekOfYear'
  • Add 'CompetitionOpen' (months since the nearest competitor opened)
  • Add 'IsPromoMonth' (is the current month part of the Promo2 interval?)

In [15]:
train = rosshellfun.build_features(train)
test = rosshellfun.build_features(test)
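
The rosshellfun.build_features helper called above is not listed in the notebook; a hedged sketch of the kind of transformations listed above (the real helper may differ in its details):

def build_features_sketch(df):
    # map letter codes to integers (a, b, c, d -> 1, 2, 3, 4)
    mapping = {'0': 0, 'a': 1, 'b': 2, 'c': 3, 'd': 4}
    for col in ['StoreType', 'Assortment', 'StateHoliday']:
        df[col] = df[col].astype(str).map(mapping).fillna(0).astype(int)

    # fill missing competition / promo information with 0 (as seen in the merged table)
    for col in ['CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear',
                'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval']:
        df[col] = df[col].fillna(0)

    # make sure the date parts exist (the test set does not have them yet)
    df['Year'] = df.Date.dt.year
    df['Month'] = df.Date.dt.month
    df['Day'] = df.Date.dt.day
    df['WeekOfYear'] = df.Date.dt.weekofyear

    # months since the nearest competitor opened
    df['CompetitionOpen'] = 12 * (df.Year - df.CompetitionOpenSinceYear) + \
                            (df.Month - df.CompetitionOpenSinceMonth)

    # months since the Promo2 campaign started
    df['PromoOpen'] = 12 * (df.Year - df.Promo2SinceYear) + \
                      (df.WeekOfYear - df.Promo2SinceWeek) / 4.0
    df.loc[df.Promo2SinceYear == 0, 'PromoOpen'] = 0

    # is the current month one of the months listed in PromoInterval?
    month_abbr = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                  'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    df['IsPromoMonth'] = df.apply(
        lambda r: int(month_abbr[int(r['Month']) - 1] in str(r['PromoInterval'])), axis=1)
    return df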

Where are we now


In [16]:
train.head(5)


Out[16]:
Store DayOfWeek Date Sales Customers Open Promo StateHoliday SchoolHoliday Day Month Year DaysLeft DaysAfter StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear PromoInterval WeatherBad SearchRossmanGermany SearchRossmanWorld EurUsdRate WeekOfYear CompetitionOpen PromoOpen IsPromoMonth
0 1 2 2013-01-02 5530 668 1 0 0 1 2 1 2013 0 0 3 1 1270 9 2008 0 0 0 0.746032 69 74 1.32214 1 52 0.00 0
1 2 2 2013-01-02 4422 650 1 0 0 1 2 1 2013 0 0 1 1 570 11 2007 1 13 2010 Jan,Apr,Jul,Oct 0.746032 69 74 1.32214 1 62 33.00 1
2 3 2 2013-01-02 6823 805 1 0 0 1 2 1 2013 0 0 1 1 14130 12 2006 1 14 2011 Jan,Apr,Jul,Oct 0.746032 69 74 1.32214 1 73 20.75 1
3 4 2 2013-01-02 9941 1429 1 0 0 1 2 1 2013 0 0 3 3 620 9 2009 0 0 0 0.746032 69 74 1.32214 1 40 0.00 0
4 5 2 2013-01-02 4253 577 1 0 0 1 2 1 2013 0 0 1 1 29910 4 2015 0 0 0 0.746032 69 74 1.32214 1 -27 0.00 0

In [17]:
test.head(5)


Out[17]:
Id Store DayOfWeek Date Open Promo StateHoliday SchoolHoliday DaysLeft DaysAfter StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear PromoInterval WeatherBad SearchRossmanGermany SearchRossmanWorld EurUsdRate Year Month Day WeekOfYear CompetitionOpen PromoOpen IsPromoMonth
0 1 1 3 2015-09-17 1 1 0 0 0 0 3 1 1270 9 2008 0 0 0 0.945763 75 66 1.13029 2015 9 17 38 84 0 0
1 2 3 3 2015-09-17 1 1 0 0 0 0 1 1 14130 12 2006 1 14 2011 Jan,Apr,Jul,Oct 0.945763 75 66 1.13029 2015 9 17 38 105 54 0
2 3 7 3 2015-09-17 1 1 0 0 0 0 1 3 24000 4 2013 0 0 0 0.945763 75 66 1.13029 2015 9 17 38 29 0 0
3 4 8 3 2015-09-17 1 1 0 0 0 0 1 1 7520 10 2014 0 0 0 0.945763 75 66 1.13029 2015 9 17 38 11 0 0
4 5 9 3 2015-09-17 1 1 0 0 0 0 1 3 2030 8 2000 0 0 0 0.945763 75 66 1.13029 2015 9 17 38 181 0 0

Creating store categories based on sales similarity

  • We have 1000+ stores
  • It is reasonable to assume that many stores' sales co-vary strongly
  • It is reasonable to assume that there are groups of similar stores (store type groups)
  • This is an a posteriori grouping (StoreType was already in the data)
  • We used hierarchical clustering

In [18]:
table = pd.pivot_table(train.loc[:,['Date','Sales','Store']],index = ['Date'], columns = ['Store'], values = ['Sales'] )

store_type_labels = rosshellfun.cluster_all_stores(table, cut_thresh = 0.05, method = 'ward')
# Ward's method merges clusters so as to minimize the increase in within-cluster variance, rather than using raw distances
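
The clustering helper is likewise not listed; a hedged sketch of hierarchically clustering the store-by-date sales matrix with Ward linkage (the actual rosshellfun.cluster_all_stores may differ):

from scipy.cluster.hierarchy import linkage, fcluster

def cluster_all_stores_sketch(table, cut_thresh=0.05, method='ward'):
    # rows = stores, columns = dates; fill a store's missing days with its mean sales
    X = table.T.apply(lambda r: r.fillna(r.mean()), axis=1).values
    Z = linkage(X, method=method)                        # agglomerative clustering
    # cut the dendrogram at a fraction of the largest merge distance
    return fcluster(Z, t=cut_thresh * Z[:, 2].max(), criterion='distance')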


Let's check if everything is added correctly

  • This is again a crucial step
  • 'Pandas eat only indexes'


In [19]:
# add store_type_labels to training and test
train = pd.merge(train, pd.DataFrame([store_type_labels,table.columns.get_level_values(1)],
                             columns = table.columns.get_level_values(1), 
                             index = ['StoreCluster','Store'] ).T, on='Store')
test = pd.merge(test, pd.DataFrame([store_type_labels,table.columns.get_level_values(1)],
                             columns = table.columns.get_level_values(1), 
                             index = ['StoreCluster','Store'] ).T, on='Store')

In [20]:
train.loc[:,['Store','StoreCluster']].drop_duplicates(inplace = False).sort('Store').head(10)


Out[20]:
Store StoreCluster
0 1 8
781 2 8
1565 3 22
2344 4 17
3128 5 8
3907 6 7
4687 7 21
5473 8 7
6257 9 22
7036 10 7

In [21]:
test.loc[:,['Store','StoreCluster']].drop_duplicates(inplace = False).sort('Store').head(10)


Out[21]:
Store StoreCluster
0 1 8
48 3 22
96 7 21
144 8 7
192 9 22
240 10 7
288 11 21
336 12 20
384 13 13
432 14 7

Create dummy columns

  • Categorical data is handled as dummy columns
  • At this point we create a validation set

In [22]:
# create_dummy_columns adds one 0/1 column per category, named '<value>_<column name>'
train = rosshellfun.create_dummy_columns(train, ['StoreCluster','DayOfWeek','DaysLeft','DaysAfter'])
test = rosshellfun.create_dummy_columns(test, ['StoreCluster','DayOfWeek','DaysLeft','DaysAfter'])
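
The create_dummy_columns helper is not listed; a hedged sketch that reproduces the naming scheme visible in the next cell ('<value>_<column>', with the original column overwritten by the same label):

def create_dummy_columns_sketch(df, columns):
    for col in columns:
        labels = df[col].astype(int).astype(str) + '_' + col   # e.g. '8_StoreCluster'
        df = pd.concat([df, pd.get_dummies(labels)], axis=1)    # one 0/1 column per label
        df[col] = labels                                        # keep the label in the original column
    return df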

In [23]:
train.head(5)


Out[23]:
Store DayOfWeek Date Sales Customers Open Promo StateHoliday SchoolHoliday Day Month Year DaysLeft DaysAfter StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear PromoInterval WeatherBad SearchRossmanGermany SearchRossmanWorld EurUsdRate WeekOfYear CompetitionOpen PromoOpen IsPromoMonth StoreCluster 10_StoreCluster 11_StoreCluster 12_StoreCluster 13_StoreCluster 14_StoreCluster 15_StoreCluster 16_StoreCluster 17_StoreCluster 18_StoreCluster 19_StoreCluster 1_StoreCluster 20_StoreCluster 21_StoreCluster 22_StoreCluster 23_StoreCluster 24_StoreCluster 2_StoreCluster 3_StoreCluster 4_StoreCluster 5_StoreCluster 6_StoreCluster 7_StoreCluster 8_StoreCluster 9_StoreCluster 0_DayOfWeek 1_DayOfWeek 2_DayOfWeek 3_DayOfWeek 4_DayOfWeek 5_DayOfWeek 6_DayOfWeek 0_DaysLeft 1_DaysLeft 2_DaysLeft 3_DaysLeft 4_DaysLeft 5_DaysLeft 6_DaysLeft 0_DaysAfter 1_DaysAfter 2_DaysAfter 3_DaysAfter 4_DaysAfter 5_DaysAfter 6_DaysAfter
0 1 2_DayOfWeek 2013-01-02 5530 668 1 0 0 1 2 1 2013 0_DaysLeft 0_DaysAfter 3 1 1270 9 2008 0 0 0 0.746032 69 74 1.32214 1 52 0 0 8_StoreCluster 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0
1 1 3_DayOfWeek 2013-01-03 4327 578 1 0 0 1 3 1 2013 0_DaysLeft 0_DaysAfter 3 1 1270 9 2008 0 0 0 0.980159 69 74 1.31698 1 52 0 0 8_StoreCluster 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0
2 1 4_DayOfWeek 2013-01-04 4486 619 1 0 0 1 4 1 2013 0_DaysLeft 0_DaysAfter 3 1 1270 9 2008 0 0 0 0.835294 69 74 1.30743 1 52 0 0 8_StoreCluster 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0
3 1 5_DayOfWeek 2013-01-05 4997 635 1 0 0 1 5 1 2013 0_DaysLeft 0_DaysAfter 3 1 1270 9 2008 0 0 0 0.658333 69 74 1.30743 1 52 0 0 8_StoreCluster 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0
4 1 0_DayOfWeek 2013-01-07 7176 785 1 1 0 1 7 1 2013 0_DaysLeft 0_DaysAfter 3 1 1270 9 2008 0 0 0 0.392996 63 59 1.30526 2 52 0 0 8_StoreCluster 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0

In [24]:
# note Id is not sorted now
test = test.sort('Id')
test.head(5)


Out[24]:
Id Store DayOfWeek Date Open Promo StateHoliday SchoolHoliday DaysLeft DaysAfter StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear PromoInterval WeatherBad SearchRossmanGermany SearchRossmanWorld EurUsdRate Year Month Day WeekOfYear CompetitionOpen PromoOpen IsPromoMonth StoreCluster 10_StoreCluster 11_StoreCluster 12_StoreCluster 13_StoreCluster 14_StoreCluster 15_StoreCluster 16_StoreCluster 17_StoreCluster 18_StoreCluster 19_StoreCluster 1_StoreCluster 20_StoreCluster 21_StoreCluster 22_StoreCluster 23_StoreCluster 24_StoreCluster 2_StoreCluster 3_StoreCluster 4_StoreCluster 5_StoreCluster 6_StoreCluster 7_StoreCluster 8_StoreCluster 9_StoreCluster 0_DayOfWeek 1_DayOfWeek 2_DayOfWeek 3_DayOfWeek 4_DayOfWeek 5_DayOfWeek 6_DayOfWeek 0_DaysLeft 1_DaysLeft 2_DaysLeft 3_DaysLeft 4_DaysLeft 5_DaysLeft 6_DaysLeft 0_DaysAfter 1_DaysAfter 2_DaysAfter 3_DaysAfter 4_DaysAfter 5_DaysAfter 6_DaysAfter
0 1 1 3_DayOfWeek 2015-09-17 1 1 0 0 0_DaysLeft 0_DaysAfter 3 1 1270 9 2008 0 0 0 0.945763 75 66 1.13029 2015 9 17 38 84 0 0 8_StoreCluster 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0
48 2 3 3_DayOfWeek 2015-09-17 1 1 0 0 0_DaysLeft 0_DaysAfter 1 1 14130 12 2006 1 14 2011 Jan,Apr,Jul,Oct 0.945763 75 66 1.13029 2015 9 17 38 105 54 0 22_StoreCluster 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0
96 3 7 3_DayOfWeek 2015-09-17 1 1 0 0 0_DaysLeft 0_DaysAfter 1 3 24000 4 2013 0 0 0 0.945763 75 66 1.13029 2015 9 17 38 29 0 0 21_StoreCluster 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0
144 4 8 3_DayOfWeek 2015-09-17 1 1 0 0 0_DaysLeft 0_DaysAfter 1 1 7520 10 2014 0 0 0 0.945763 75 66 1.13029 2015 9 17 38 11 0 0 7_StoreCluster 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0
192 5 9 3_DayOfWeek 2015-09-17 1 1 0 0 0_DaysLeft 0_DaysAfter 1 3 2030 8 2000 0 0 0 0.945763 75 66 1.13029 2015 9 17 38 181 0 0 22_StoreCluster 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0

In [25]:
# set up training features 

# from this we have to remove the original categorical and string columns,
# AND the Customers column, which is not available in the test set
# Gert also predicted Customers and used that prediction to further improve his Sales prediction
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

features = train.select_dtypes(include=numerics).columns.values.tolist()
features.remove('Customers')
features.remove('Store') # categorical but is numeric
features.remove('Sales') # Of course let's remove the outcome
print(features)


['Open', 'Promo', 'StateHoliday', 'SchoolHoliday', 'Day', 'Month', 'Year', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'WeatherBad', 'SearchRossmanGermany', 'SearchRossmanWorld', 'EurUsdRate', 'WeekOfYear', 'CompetitionOpen', 'PromoOpen', 'IsPromoMonth', '10_StoreCluster', '11_StoreCluster', '12_StoreCluster', '13_StoreCluster', '14_StoreCluster', '15_StoreCluster', '16_StoreCluster', '17_StoreCluster', '18_StoreCluster', '19_StoreCluster', '1_StoreCluster', '20_StoreCluster', '21_StoreCluster', '22_StoreCluster', '23_StoreCluster', '24_StoreCluster', '2_StoreCluster', '3_StoreCluster', '4_StoreCluster', '5_StoreCluster', '6_StoreCluster', '7_StoreCluster', '8_StoreCluster', '9_StoreCluster', '0_DayOfWeek', '1_DayOfWeek', '2_DayOfWeek', '3_DayOfWeek', '4_DayOfWeek', '5_DayOfWeek', '6_DayOfWeek', '0_DaysLeft', '1_DaysLeft', '2_DaysLeft', '3_DaysLeft', '4_DaysLeft', '5_DaysLeft', '6_DaysLeft', '0_DaysAfter', '1_DaysAfter', '2_DaysAfter', '3_DaysAfter', '4_DaysAfter', '5_DaysAfter', '6_DaysAfter']

In [26]:
from sklearn.cross_validation import train_test_split

# create test set
X = train.loc[:,features]
y = train.loc[:,'Sales']
train_base_x, train_validation_x, train_base_y, train_validation_y = train_test_split( X, y, test_size = 0.2 )

# Note: Gert did not choose the hold-out set randomly;
# he used consecutive days (a hedged sketch of such a time-based split follows)
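
A hedged sketch of such a time-based hold-out (keeping roughly the last six weeks of the training period for validation, similar in length to the real test period; the exact cutoff is an assumption):

cutoff = train['Date'].max() - pd.Timedelta(days=42)   # last ~6 weeks as hold-out
time_train = train[train['Date'] <= cutoff]
time_valid = train[train['Date'] > cutoff]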

Models

What models have we tried?

  • Linear regression, Ridge, ElasticNet, SVM, etc. (we are not going to talk about these)
  • Trees, forests (we are not going to talk about these either)
  • Gradient Boosting (scikit-learn)
  • Neural networks (Keras + Theano)

Gradient boosting

Gradient boosting is the Holy Grail for many machine learning problems today

  • Almost every winning solution on Kaggle used it

Gradient boosting is essentially a large ensemble of decision trees

  • The term gradient refers to gradient descent
  • Each new tree is a step along the gradient of the loss
  • Examples that are predicted poorly get more weight for the next tree
  • The learning rate trades off against the number of trees (smaller steps need more trees)

Strong regularization:

  • Control model complexity:
    • tree size and structure
  • Add randomness to the training
    • subset of features, subset of samples etc.

The difference between the scikit-learn GBM and XGBoost? As far as we could tell, XGBoost mainly gives you a few more options for logistic regression problems (e.g. max_delta_step).

Why gradient boosting

  1. Handles features of different scales well
  2. Supports different loss functions
  3. Can capture complex feature interactions and patterns

However:

  1. Slow to train
  2. Often comes as a black box

How it works

The objective function consists of the sum of the Loss and Regularization parameter ($\Omega$). So:

$$Objective(\Theta) = Loss(\Theta) + \Omega(\Theta)$$

Where $\Theta$ is the parameter set to be defined. The loss function that we used was

$$\begin{align} Loss(\Theta) = \left\{ \begin{array}{cl} \frac{1}{2} \left[y-\hat{y}\right]^2 & \text{for }|y-\hat{y}| \le \alpha, \\ \alpha \left(|y-\hat{y}|-\alpha/2\right) & \text{for }|y-\hat{y}| \gt \alpha \end{array}\right. \end{align} $$

Where $\alpha$ was set to $Q(0.56)$, the 0.56 quantile of the absolute residuals (roughly the median). This Huber loss applies squared loss to small residuals and absolute loss to larger ones, which can work better than plain squared loss when outliers would otherwise pull the fit away from the optimum.

Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1), 73-101.
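
For illustration, a small numpy sketch of this loss (not the scikit-learn internals), with alpha taken as the given quantile of the absolute residuals:

import numpy as np

def huber_loss(y_true, y_pred, quantile=0.56):
    resid = np.abs(y_true - y_pred)
    alpha = np.percentile(resid, quantile * 100)   # the threshold Q(0.56)
    quadratic = 0.5 * resid ** 2                   # squared loss for small residuals
    linear = alpha * (resid - 0.5 * alpha)         # linear (absolute) loss beyond alpha
    return np.where(resid <= alpha, quadratic, linear).mean()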

The hyperparameters

There are several parameters to set (both for the loss and for the regularization)

  • One can use grid search or random search for the parameter tuning
  • Or start from parameter sets shared on the forum and only fine-tune them (as Gert did)

Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13(1), 281-305.


In [27]:
import random
from sklearn.grid_search import RandomizedSearchCV
from sklearn import ensemble
from scipy.stats import uniform, randint

# it is not necessary to search the parameters using the complete training set
training_subset = random.sample(range(len(train_base_x)), 800) # intentionally this low for the demo; it was 80000 in the competition
train_sample_x = train_base_x.iloc[training_subset,]
train_sample_y = train_base_y.iloc[training_subset,]

# note, this is not the whole parameter set of scikit (neither of XGBoost)
parameters = { 'learning_rate': uniform(), #0.426348401220647
               'max_depth' : [4, None], # None
               'min_samples_leaf' : randint(10, 100), #45
               'loss' : ['ls', 'lad', 'huber', 'quantile'], #'huber'
               'subsample' : uniform(), #0.7675104474139473
               'alpha': uniform()} #0.560116434370429

model = ensemble.GradientBoostingRegressor(n_estimators = 100, verbose = 0)

params_found = RandomizedSearchCV(model, param_distributions = parameters,
                               n_iter = 100, n_jobs = -1).fit(train_sample_x,train_sample_y)
params_found.best_params_


Out[27]:
{'alpha': 0.1488202120136839,
 'learning_rate': 0.3114262700288696,
 'loss': 'ls',
 'max_depth': 4,
 'min_samples_leaf': 11,
 'subsample': 0.9627646211439049}

Training and fitting of gradient boost model


In [28]:
model = ensemble.GradientBoostingRegressor(n_estimators = 100, verbose = 100, # in the challenge we used 3000 estimators
                                               learning_rate = params_found.best_params_['learning_rate'],
                                               loss = params_found.best_params_['loss'],
                                               max_depth = params_found.best_params_['max_depth'],
                                               min_samples_leaf = params_found.best_params_['min_samples_leaf'],
                                               subsample = params_found.best_params_['subsample'])

model = model.fit(train_base_x,train_base_y)


      Iter       Train Loss      OOB Improve   Remaining Time 
         1     7924886.3356     1694369.7223            9.25m
         2     6854009.4164     1132722.4515           10.39m
         3     6045046.3758      802708.0835            9.75m
         4     5306670.7300      737789.4309            9.05m
         5     4794341.1294      521248.8958            8.37m
         6     4349407.7204      424369.9625            8.30m
         7     3993613.6836      352966.2646            8.08m
         8     3732460.8015      262263.7742            7.71m
         9     3458674.5739      267828.3985            7.48m
        10     3214343.8622      257118.6627            7.31m
        11     3022715.9556      189443.5004            7.30m
        12     2844473.4652      170463.1538            7.16m
        13     2711873.5502      144630.4449            6.92m
        14     2599217.1565      117704.8078            6.74m
        15     2484636.6463      118964.8835            6.75m
        16     2398059.7977       90492.0325            6.61m
        17     2301836.2740       92128.7265            6.55m
        18     2231418.7853       69876.5811            6.42m
        19     2162917.7892       66652.5616            6.25m
        20     2107310.5006       56528.8757            6.13m
        21     2056627.5739       49859.0708            6.01m
        22     1986463.1787       67626.8977            5.98m
        23     1946469.5178       31450.2771            5.86m
        24     1904485.7656       42375.9036            5.78m
        25     1874292.5074       30544.3420            5.68m
        26     1836282.6458       36172.6973            5.58m
        27     1805944.4321       34269.8407            5.53m
        28     1775456.4518       31916.1725            5.45m
        29     1750780.2910       20522.9038            5.34m
        30     1728234.8670       22382.2920            5.27m
        31     1705849.7573       18336.8186            5.19m
        32     1696083.0218       11657.3725            5.09m
        33     1672407.8042       20678.3525            5.02m
        34     1655014.9339       14566.1379            4.93m
        35     1622166.4533       32774.0191            4.90m
        36     1610322.8518       14295.6877            4.80m
        37     1597551.6558       13097.2122            4.71m
        38     1577772.9512       17900.0960            4.63m
        39     1565017.9301       12888.5553            4.55m
        40     1544617.3809       20486.2002            4.50m
        41     1508433.4780       35404.9662            4.44m
        42     1497778.8369       10273.6939            4.35m
        43     1489812.7585        8489.3945            4.27m
        44     1481106.0725        5358.9700            4.18m
        45     1476006.1154        7931.1543            4.09m
        46     1466603.5175        6141.5811            4.02m
        47     1458927.5571        7780.1065            3.94m
        48     1445946.9609       14986.9274            3.88m
        49     1437324.2484       10808.7616            3.80m
        50     1428395.6313        9780.4278            3.73m
        51     1421732.0404        5238.9681            3.65m
        52     1411865.3564        8111.7093            3.58m
        53     1407186.2329        5438.7442            3.51m
        54     1402738.4485        5564.3776            3.43m
        55     1398041.6234        3967.6655            3.35m
        56     1388634.9897        8824.0922            3.28m
        57     1382463.7704        3721.0624            3.20m
        58     1375111.8005        7949.8628            3.12m
        59     1368169.2526        7329.5270            3.05m
        60     1355172.2165       10532.3874            2.99m
        61     1349368.0073        5813.2439            2.91m
        62     1345793.7896        4116.2429            2.82m
        63     1338888.3069        9507.6903            2.74m
        64     1334179.5252        5467.3345            2.68m
        65     1328172.1450        5081.1248            2.61m
        66     1325405.6624        3519.6267            2.53m
        67     1313722.2704       10667.1395            2.46m
        68     1306467.4245        9919.1880            2.38m
        69     1295690.5339        9729.6104            2.31m
        70     1291453.5662        1886.3609            2.23m
        71     1289069.8800        1874.3648            2.15m
        72     1284112.9839        4586.9766            2.07m
        73     1282241.3223        4668.6873            2.00m
        74     1276148.5099        3420.6221            1.92m
        75     1270807.8985        5120.0382            1.85m
        76     1265229.2299        6156.7764            1.78m
        77     1262642.0370        2885.4124            1.71m
        78     1254783.7294        5156.0675            1.64m
        79     1249981.8648        9361.1182            1.57m
        80     1244534.6621         199.6996            1.49m
        81     1242618.6210        3764.8827            1.41m
        82     1239972.8612        3880.7277            1.34m
        83     1236972.8838        2875.1280            1.27m
        84     1233547.5433        4485.0109            1.20m
        85     1226295.7607        5079.3669            1.12m
        86     1222079.8621        5684.7369            1.05m
        87     1218470.7264        3731.3162           58.38s
        88     1213958.9449        2492.2334           53.75s
        89     1211262.0233        3641.6902           49.33s
        90     1209055.4887        2174.6666           44.75s
        91     1205447.7677        5760.9629           40.34s
        92     1201554.4237        4011.1632           35.92s
        93     1197130.5160        3446.9511           31.51s
        94     1197259.9298        1611.3755           26.97s
        95     1193324.4202        -388.0543           22.42s
        96     1192187.5952        1838.7264           17.93s
        97     1188067.6579        2071.5737           13.44s
        98     1188631.4456        1824.7069            8.94s
        99     1181994.7580        6161.9572            4.48s
       100     1178359.2738        3670.0596            0.00s

In [29]:
# Let's see how the model performed
rosshellfun.plot_results(model, train_base_x, train_validation_x, 
                         train_base_y, train_validation_y, savemodel = True)


Saving model as 02_25_2016_18hrs05mins_0.226071770056.pkl

In [30]:
# let's see the relative importance of features
rosshellfun.plot_feature_importances(model, features)


Summary of the gradient boosting

  • It is a good, transparent (at least we think) and reliable solution for regression
  • This model was not able to compete with the highest-ranked models
    • The reason is that those solutions typically used additional steps (such as predicting the number of Customers first)
    • We also made a mistake that took a long time to find: the day-of-week dummy columns were converted to float at some point, so this feature was not included in the prediction

In [31]:
test['Sales'] = model.predict(test.loc[:,features])
test.loc[test['Open'] == 0,'Sales'] = 0
test.loc[:,['Id','Sales']].set_index('Id',drop = True).to_csv('submission_gbrt.csv')

Neural Networks and Deep Learning

Can we use neural networks/deep learning for time series prediction?

“Yes, you can kill a mouse with a rocket propelled grenade. And yes, you can use neural networks on time series.”

(answer from Quora)

Building a Model is not necessarily easy

  • Weight initialization

  • Activation functions

  • Size of network

    • Number of layers
    • Number of neurons per layer
  • Regularization / normalization

    • Dropout
    • Batch Normalization
  • Optimizers

  • Batch Size

We used Keras

  • Keras is a deep learning framework that runs on top of Theano or TensorFlow
    • Theano is one of the three most important deep learning frameworks
    • (other important frameworks being Torch and Caffe)
    • TensorFlow is Google's deep learning framework, which was made publicly available last year

Weight initialization

Weight initialization defines the probability distribution used to set the initial random weights of layers.

Available methods in Keras:

  • uniform
  • lecun_uniform: Uniform initialization scaled by the square root of the number of inputs (LeCun 98)
  • normal
  • identity: Use with square 2D layers (shape[0] == shape[1])
  • orthogonal: Use with square 2D layers (shape[0] == shape[1])
  • zero
  • glorot_normal: Gaussian initialization scaled by fan_in + fan_out (Glorot 2010)
  • glorot_uniform
  • he_normal: Gaussian initialization scaled by fan_in (He et al., 2014)
  • he_uniform

Activation functions

Keras also provides activations that are more complex than a simple Theano/TensorFlow function (e.g. learnable or configurable activations):

  • ReLU
  • LeakyReLU
  • PReLU
    • Parametric Rectified Linear Unit (PReLU). This activation function adaptively learns the parameters of the rectifiers, and improves accuracy at negligible extra computational cost
    • http://arxiv.org/pdf/1502.01852v1.pdf

Optimizers

  • SGD
    • Stochastic gradient descent, with support for momentum, decay, and Nesterov momentum.
  • RMSProp
  • Adagrad
  • Adadelta
  • Adam

In [ ]:
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.optimizers import Adam
from keras.layers.advanced_activations import PReLU
from keras.layers.normalization import BatchNormalization

model = Sequential()
model.add(Dense(50, input_dim=1242, init='he_normal'))
model.add(PReLU())
#model.add(BatchNormalization())
#model.add(Dropout(0.5))

model.add(Dense(50, init='he_normal'))
model.add(PReLU())
#model.add(BatchNormalization())
#model.add(Dropout(0.5))

model.add(Dense(1, init='he_normal'))
model.add(Activation('sigmoid'))
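# note: Sales is min-max scaled to [0, 1] in the scaling cells below,
# so a sigmoid output is usable for this regression target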

adam = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
model.compile(loss='mse', optimizer=adam)

Scaling relevant features for use in neural networks


In [ ]:
def featureScaling(arr):
    min_arr = min(arr); max_arr = max(arr); new_arr = []
    for i in arr:
        new_arr.append((i - min_arr) / (max_arr - min_arr))
    return new_arr
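
This is plain min-max scaling to the [0, 1] range; for example, featureScaling([2, 4, 6]) returns [0.0, 0.5, 1.0].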

In [ ]:
class Vectorizer():
    def __init__(self):
        self.vectCache = dict()

    def vect(self, size, idx):
        """Retrieves a vector such as [0,0,0,...,idx=1,0,0,...,0] where the size of the vector is 'size'
        and all items are zero except the 'idx' numbered element, which is one. 
        Note: 'idx' numbering starts from 1..size, not from zero, it is 1-indexed.
        """
        if size not in self.vectCache:
            print("Vectorizer: Generating vectors for size", size)
            arr = []
            for i in range(size):
                e = [0] * size
                e[i] = 1.0
                arr.append(e)
            self.vectCache[size] = arr
        return self.vectCache[size][idx-1]

v = Vectorizer()
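
For example, v.vect(7, 3) returns [0, 0, 1.0, 0, 0, 0, 0], a one-hot encoding of the 3rd of 7 categories; it is used later to one-hot encode Day, Store, WeekOfYear, DayOfWeek and the other categorical features.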

Scaling train file


In [ ]:
df = pd.read_csv('train_extended.csv')
df.Sales = df.Sales.apply(np.log1p)
df.Customers = df.Customers.apply(np.log1p)
df['CompetitionDistanceLog'] = df.CompetitionDistance.apply(np.log1p)
df.CompetitionOpen = df.CompetitionOpen + 31
df['CompetitionOpenLog'] = df.CompetitionOpen.apply(np.log1p)
df.EurUsdRate = df.EurUsdRate - 0.7
df.CompetitionOpenNull = df.CompetitionOpenNull + 31
df['CompetitionOpenNullLog'] = df.CompetitionOpenNull.apply(np.log1p)

# min-max scale each numeric feature to the [0, 1] range
for col in ['Sales', 'Customers', 'CompetitionDistance', 'CompetitionOpen',
            'PromoOpen', 'CompetitionOpenNull', 'CompetitionDistanceLog',
            'CompetitionOpenLog', 'CompetitionOpenNullLog']:
    df[col] = featureScaling(df[col])

df.to_csv('scaled_train.csv', na_rep='0', header=True, index=False)

Scaling test file


In [ ]:
df = pd.read_csv('test_extended.csv')
df['CompetitionDistanceLog'] = df.CompetitionDistance.apply(np.log1p)
df.CompetitionOpen = df.CompetitionOpen + 31
df['CompetitionOpenLog'] = df.CompetitionOpen.apply(np.log1p)
df.EurUsdRate = df.EurUsdRate - 0.7
df.DaysLeft = df.DaysLeft.astype(int)
df.DaysAfter = df.DaysAfter.astype(int)
df.CompetitionOpenNull = df.CompetitionOpenNull + 31
df['CompetitionOpenNullLog'] = df.CompetitionOpenNull.apply(np.log1p)


# min-max scale each numeric feature to the [0, 1] range
for col in ['CompetitionDistance', 'CompetitionOpen', 'PromoOpen',
            'CompetitionOpenNull', 'CompetitionDistanceLog',
            'CompetitionOpenLog', 'CompetitionOpenNullLog']:
    df[col] = featureScaling(df[col])

df.to_csv('scaled_test.csv', na_rep='0', header=True, index=False)

Train Test Split


In [ ]:
train = pd.read_csv('scaled_train.csv')
shape =  train.shape 

# shuffle the data set and split it into train and test sets
df = pd.DataFrame(train)
shuffled_train = df.reindex(np.random.permutation(df.index))

indice_5_percent = int((shape[0]/100.0)* 5)
indice_10_percent = int((shape[0]/100.0)* 10)
indice_15_percent = int((shape[0]/100.0)* 15)

#for experimenting
shuffled_train[:indice_5_percent].to_csv('test_5_split.csv', index=True, columns=train.columns)
shuffled_train[indice_5_percent:indice_10_percent].to_csv('train_5_split.csv', index=True, columns=train.columns)

#for full scale training
shuffled_train[:indice_15_percent].to_csv('test_15_split.csv', index=True, columns=train.columns)
shuffled_train[indice_15_percent:].to_csv('train_85_split.csv', index=True, columns=train.columns)

Feature selection and parameter setting


In [ ]:
train_file = 'train_5_split.csv'
test_file = 'test_5_split.csv'
num_epochs = 1
batch_size = 10
logging = True  #sets whether to write log file and create submission files

def feature_list(row):
    # One-hot blocks (31+1115+52+7+12+3+4+3+4 = 1231 dims) plus 11 scalar
    # features = 1242 inputs, matching input_dim of the network above.
    features = v.vect(31, (int(row['Day']))) + \
        v.vect(1115, (int(row['Store']))) +\
        v.vect(52, (int(row['WeekOfYear']))) +\
        v.vect(7, (int(row['DayOfWeek']))) +\
        [row['PromoOpen']] +\
        v.vect(12, (int(row['Month']))) +\
        v.vect(3, (int(row['Year'])-2013)) +\
        [row['Promo']] +\
        v.vect(4, (int(row['StoreType']))) +\
        [row['CompetitionDistance']] +\
        [row['CompetitionOpenNull']] +\
        [row['SchoolHoliday']] +\
        v.vect(3, (int(row['Assortment']))) +\
        [row['IsPromoMonth']] +\
        [row['Promo2']] +\
        v.vect(4, (int(row['StateHoliday']))) +\
        [row['WeatherBad']] +\
        [row['CompetitionDistanceLog']] + \
        [row['CompetitionOpenLog']] +\
        [row['CompetitionOpen']]  
              
    return features

Training the neural network


In [ ]:
import csv
import time

training_subset_size = 10000
loss = 1
losslog = []
Time = (time.strftime("Day:%d Time:%H:%M"))

df = pd.DataFrame.from_csv(train_file)
print ("Assembling training set and training neural network...")

for epoch in range(1, num_epochs + 1):
    n = 1
    i = 1
    list_a = []
    list_b = []
    
    df = df.reindex(np.random.permutation(df.index))
    for index, row in df.iterrows():
        if i <= training_subset_size:
            if row['Open'] == 1:
                features = feature_list(row)
                list_a.append(features)
                list_b.append(row['Sales'])
                i = i+1
        
        else:
            x = np.array(list_a, dtype=float)
            y = np.array(list_b, dtype=float)
            model.fit(x, y, batch_size=batch_size, nb_epoch=1, verbose=0)

            i = 1
            list_a = []
            list_b = []        
        
        if (n % 10000 == 0):
            print (n)
        n = n + 1
   
    '''
    Evaluating model
    '''
    n = 1
    list_a = []
    list_b = []
    with open(test_file) as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            features = feature_list(row)
            list_a.append(features)
            list_b.append(row['Sales'])

            if (n % 10000 == 0):
                print (n)
            n = n + 1
            
        test_x = np.array(list_a, dtype=float)
        test_y = np.array(list_b, dtype=float)
        
        print ("Performance of the model over validation data (loss score): \
        " + str(model.evaluate(test_x, test_y, verbose=2)) + " at epoch:" + str(epoch))
        
        if logging == True:
            losslog.append(("Performance of the model over validation data (loss score): \
            " + str(model.evaluate(test_x, test_y, verbose=2)) + " at epoch:" + str(epoch) \
            + "_"+time.strftime("Day:%d Time:%H:%M")))
            losslog_df = pd.DataFrame(losslog)
            losslog_df.to_csv('losslog_'+Time+'_train:_'+str(train_file)+'_test:_'+str(test_file) \
                              +'_batch_'+str(batch_size)+'_epoch_'+str(num_epochs)+'.csv', index_label='Id')
    
    if model.evaluate(test_x, test_y, verbose=2) < loss and logging == True:
        
        '''
        Creating submission file on scaled_test.csv
        '''
        print ("Testing neural network on test file and writing results into submission file...")

        results = []
        with open('scaled_test.csv') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                list_a = []
                features = feature_list(row)

                if row['Open'] == '':
                    op = [1.0]
                else:
                    op = [row['Open']]

                list_a.append(features)
                x = np.array(list_a, dtype=float)

                if row['Open'] == '1.0':
                    result = model.predict(x, batch_size=1)
                else:
                    result = [np.array([0])]

                results.extend(result)

        '''
        Writing results into submission.csv
        '''

        results = pd.DataFrame(results)
        # undo the scaling: multiply back by the constant used to scale
        # log1p(Sales) into [0, 1] (presumably max(log1p(Sales)) on the
        # training data) and invert the log1p transform
        results = np.exp(results * 10.634700933902529) - 1

        results.columns = ['Sales']
        results.index += 1
        results.to_csv('submission_train_'+str(train_file)+'_test_'+str(test_file) \
                       +'_batch_'+str(batch_size) +'_best_loss:_'+str(model.evaluate(test_x, test_y, verbose=2)) \
                       +'_epoch:_'+str(epoch)+"_"+time.strftime("Day:%d Time:%H:%M")+'.csv', index_label='Id')
        #
        print ("Finished!!!")
        loss = model.evaluate(test_x, test_y, verbose=2)

Bagging submissions

  1. NN 0.11884
  2. NN 0.12003
  3. NN 0.12017
  4. NN 0.12300
  5. NN 0.12488
  6. NN 0.12553
  7. NN 0.12628
  8. NN 0.12769
  9. NN 0.12792
  10. NN 0.13235
  11. NN 0.13462
  12. NN 0.13484
  13. NN 0.13491
  14. NN 0.13550
  15. NN 0.13580
  16. NN 0.13913
  17. NN 0.13945
  18. NN 0.13962

Bagged NN 0.11053


In [ ]:
df1 = pd.read_csv('tobag/11884.csv')
df2 = pd.read_csv('tobag/12003.csv')
df3 = pd.read_csv('tobag/12017.csv')
df4 = pd.read_csv('tobag/12300.csv')
df5 = pd.read_csv('tobag/12488.csv')
df6 = pd.read_csv('tobag/12553.csv')
df7 = pd.read_csv('tobag/12628.csv')
df8 = pd.read_csv('tobag/12769.csv')
df9 = pd.read_csv('tobag/12792.csv')
df10 = pd.read_csv('tobag/13235.csv')
df11 = pd.read_csv('tobag/13462.csv')
df12 = pd.read_csv('tobag/13484.csv')
df13 = pd.read_csv('tobag/13491.csv')
df14 = pd.read_csv('tobag/13550.csv')
df15 = pd.read_csv('tobag/13580.csv')
df16 = pd.read_csv('tobag/13913.csv')
df17 = pd.read_csv('tobag/13945.csv')
df18 = pd.read_csv('tobag/13962.csv')

results = (df1.Sales * 0.985 + df2.Sales * 0.985 + df3.Sales * 0.985 + \
            df4.Sales * 0.985 + df5.Sales * 0.985 + df6.Sales * 0.985 + \
            df7.Sales * 0.985 + df8.Sales * 0.985 + df9.Sales * 0.985 + \
            df10.Sales * 0.985 + df11.Sales * 0.985 + df12.Sales * 0.985 + df13.Sales * 0.985 + \
            df14.Sales * 0.985 + df15.Sales * 0.985 + df16.Sales * 0.985 + \
            df17.Sales * 0.985 + df18.Sales) / 18.0
            
results = pd.DataFrame(results)
results.T
results.columns = ['Sales']
results.index += 1
results.to_csv('bagged.csv', index_label='Id')

Summary

  • Competing on Kaggle is fun
  • Competing on Kaggle is hard
  • Nuances matter
  • Clever ideas matter

Notes for the next challenge:

  • Use VCS
  • Never ever ever ever stop using indexes in pandas :)
  • Read the forum more often

Thanks to Laci and Gabi for organizing, to Robi for the expert contribution, and to Hiflylabs for the food and drinks

Thank you for your attention!

Questions?