This notebook compares Random Forest and MERF models on the Rossmann Kaggle competition dataset. The feature engineering steps are borrowed from the competition's 3rd-place solution, and the code is taken from the fast.ai DL1 course notebook. We skip the entity-embedding steps shown in that notebook, since our goal here is only to compare RF and MERF. The original solution also uses external data (weather and googletrend); we skip that as well.
The task is to predict the NEXT six weeks out: day N+1 all the way to day N+(7x6) = N+42.
In their first Kaggle competition, Rossmann is challenging you to predict 6 weeks of daily sales for 1,115 stores located across Germany.
In [1]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2
In [2]:
import os, sys
import re
sys.path.append('..')
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context("poster")
import numpy as np
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
from IPython.display import HTML, display
import matplotlib.pyplot as plt
from merf.merf import MERF
from sklearn.model_selection import train_test_split
from merf.evaluator import plot_bhat, plot_training_stats
In [3]:
pd.options.display.float_format = '{:.2f}'.format
In [4]:
train = pd.read_csv('../data/kaggle_train.csv')
test = pd.read_csv('../data/kaggle_test.csv')
store = pd.read_csv('../data/kaggle_store.csv')
Feature Space:
In [5]:
train.head()
Out[5]:
In [6]:
test.head()
Out[6]:
In [7]:
store.head()
Out[7]:
In [8]:
# There are 1115 Stores
train['Store'].unique()
Out[8]:
In [9]:
train.shape, store.shape, test.shape
Out[9]:
The data is sorted in descending order by date.
In [10]:
test[test['Store'] == 1].head()
Out[10]:
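As a quick sanity check (an illustrative sketch, not a cell from the original notebook), we can confirm the ordering directly for one store:

# sketch: dates should be non-increasing within a store's rows
pd.to_datetime(test[test['Store'] == 1]['Date']).is_monotonic_decreasing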
In [11]:
h = train.groupby('Store').count()['Sales'].hist()
So each store has either roughly 760 or roughly 930 days of data.
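A quick tabulation (an illustrative sketch, not a cell from the original notebook) makes the two groups explicit:

# sketch: how many stores fall into each record-count bucket
train.groupby('Store')['Sales'].count().value_counts()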
In [12]:
tables = [train, store, test]
In [13]:
# data types
for t in tables: print(t.dtypes, '\n')
In [14]:
# summaries
for t in tables: display(t.describe())
Code borrowed from https://github.com/fastai/fastai/blob/master/courses/dl1/lesson3-rossman.ipynb
We turn state holidays into booleans, to make them more convenient for modeling. We can do calculations on pandas fields using notation very similar (often identical) to numpy.
In [15]:
train.StateHoliday = train.StateHoliday!='0'
test.StateHoliday = test.StateHoliday!='0'
In [16]:
def add_datepart(df, fldname, drop=True, time=False):
    # Expand a date column into calendar components (year, month, week, day, ...)
    # plus an 'Elapsed' column of seconds since the Unix epoch.
    fld = df[fldname]
    fld_dtype = fld.dtype
    if isinstance(fld_dtype, pd.core.dtypes.dtypes.DatetimeTZDtype):
        fld_dtype = np.datetime64
    if not np.issubdtype(fld_dtype, np.datetime64):
        df[fldname] = fld = pd.to_datetime(fld, infer_datetime_format=True)
    targ_pre = re.sub('[Dd]ate$', '', fldname)
    attr = ['Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear',
            'Is_month_end', 'Is_month_start', 'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start']
    if time: attr = attr + ['Hour', 'Minute', 'Second']
    for n in attr: df[targ_pre + n] = getattr(fld.dt, n.lower())
    df[targ_pre + 'Elapsed'] = fld.astype(np.int64) // 10 ** 9
    if drop: df.drop(fldname, axis=1, inplace=True)
In [17]:
add_datepart(train, "Date", drop=False)
add_datepart(test, "Date", drop=False)
Now we can join all of our data into a single dataframe. Recall that in (left) outer joins, every time a value in the joining field on the left table has no corresponding value on the right table, the corresponding row in the new table has Null values for all right-table fields. One way to check that all records are consistent and complete is to check for Null values post-join, as we do here.
Aside: Why not just do an inner join? If you are assuming that all records are complete and match on the field you desire, an inner join will do the same thing as an outer join. However, in the event you are wrong or a mistake is made, an outer join followed by a null check will catch it. (Comparing the before/after number of rows for an inner join is equivalent, but requires keeping track of those counts; the outer join is easier.)
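For illustration only (this check is not part of the original notebook), the row-count alternative mentioned above would look roughly like this:

# sketch: inner-join row-count check; any drop in rows means unmatched Store ids
n_before = len(train)
n_after = len(train.merge(store, how='inner', on='Store'))
assert n_before == n_after, 'some Store ids in train have no match in store'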
In [18]:
def join_df(left, right, left_on, right_on=None, suffix='_y'):
    # Left (outer) join: keeps every row of `left`; unmatched rows get NaNs
    # in the right-table columns.
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on,
                      suffixes=("", suffix))
In [19]:
joined = join_df(train, store, "Store")
joined_test = join_df(test, store, "Store")
len(joined[joined.StoreType.isnull()]),len(joined_test[joined_test.StoreType.isnull()])
Out[19]:
In [20]:
joined_test.columns
Out[20]:
In [21]:
for df in (joined,joined_test):
    df['CompetitionOpenSinceYear'] = df.CompetitionOpenSinceYear.fillna(1900).astype(np.int32)
    df['CompetitionOpenSinceMonth'] = df.CompetitionOpenSinceMonth.fillna(1).astype(np.int32)
    df['Promo2SinceYear'] = df.Promo2SinceYear.fillna(1900).astype(np.int32)
    df['Promo2SinceWeek'] = df.Promo2SinceWeek.fillna(1).astype(np.int32)
We replace the null months and weeks with 1 rather than 0, because 0 is not a valid month or week value for assembling dates in pandas, and these columns are used below to build more features.
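To see why 0 would not work (an illustrative check, not a cell from the original notebook): pandas cannot assemble a date from a month of 0, while 1 is fine.

# pd.to_datetime(dict(year=[1900], month=[0], day=[15]))   # raises ValueError
pd.to_datetime(dict(year=[1900], month=[1], day=[15]))      # works: 1900-01-15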
Next we'll extract the features "CompetitionOpenSince" and "CompetitionDaysOpen", by assembling a date from the competition year and month columns and taking its difference (in days) from the sale date.
In [22]:
joined.filter(like = 'Competition', axis=1)[:2]
Out[22]:
In [23]:
for df in (joined,joined_test):
    df["CompetitionOpenSince"] = pd.to_datetime(dict(year=df.CompetitionOpenSinceYear,
                                                     month=df.CompetitionOpenSinceMonth, day=15))
    df["CompetitionDaysOpen"] = df.Date.subtract(df.CompetitionOpenSince).dt.days
In [24]:
for df in (joined,joined_test):
    df.loc[df.CompetitionDaysOpen<0, "CompetitionDaysOpen"] = 0
    df.loc[df.CompetitionOpenSinceYear<1990, "CompetitionDaysOpen"] = 0
In [25]:
for df in (joined,joined_test):
    df["CompetitionMonthsOpen"] = df["CompetitionDaysOpen"]//30
    df.loc[df.CompetitionMonthsOpen>24, "CompetitionMonthsOpen"] = 24
joined.CompetitionMonthsOpen.unique()
Out[25]:
In [26]:
for df in (joined,joined_test):
    df["Promo2Since"] = pd.to_datetime(df.Promo2SinceWeek.astype(str)+
                                       df.Promo2SinceYear.astype(str).add('-0'),
                                       format='%W%Y-%w')
    df["Promo2Days"] = df.Date.subtract(df["Promo2Since"]).dt.days
In [27]:
for df in (joined,joined_test):
    df.loc[df.Promo2Days<0, "Promo2Days"] = 0
    df.loc[df.Promo2SinceYear<1990, "Promo2Days"] = 0
    df["Promo2Weeks"] = df["Promo2Days"]//7
    df.loc[df.Promo2Weeks<0, "Promo2Weeks"] = 0
    df.loc[df.Promo2Weeks>25, "Promo2Weeks"] = 25
    df.Promo2Weeks.unique()
In [28]:
joined.columns
Out[28]:
In [29]:
cat_vars = ['Store',
'DayOfWeek',
'Year',
'Month',
'Day',
'StateHoliday',
'CompetitionMonthsOpen',
'Promo2Weeks',
'StoreType',
'Assortment',
'PromoInterval',
'CompetitionOpenSinceYear',
'Promo2SinceYear',
'Week']
contin_vars = ['CompetitionDistance',
'Promo',
'SchoolHoliday']
n = len(joined); n
Out[29]:
In [30]:
dep = 'Sales'
joined = joined[cat_vars+contin_vars+[dep, 'Date']].copy()
In [31]:
joined_test[dep] = 0
joined_test = joined_test[cat_vars+contin_vars+[dep, 'Date', 'Id']].copy()
In [32]:
for v in cat_vars: joined[v] = joined[v].astype('category').cat.as_ordered()
In [33]:
def apply_cats(df, trn):
    # Encode categorical columns of `df` using the categories learned on `trn`,
    # so that train and test share the same integer codes.
    for n,c in df.items():
        if (n in trn.columns) and (trn[n].dtype.name=='category'):
            df[n] = pd.Categorical(c, categories=trn[n].cat.categories, ordered=True)
            df[n] = df[n].cat.codes
In [34]:
apply_cats(joined_test, joined)
In [35]:
for col in joined.columns:
    if joined[col].dtype.name=='category':
        joined[col] = joined[col].cat.codes
In [36]:
for v in contin_vars:
    joined[v] = joined[v].fillna(0).astype('float32')
    joined_test[v] = joined_test[v].fillna(0).astype('float32')
In [37]:
n = 5000
joined_samp = joined.sample(n).set_index("Date")
samp_size = len(joined_samp); samp_size
Out[37]:
In [38]:
joined_samp = joined_samp[joined_samp.Sales!=0]
In [39]:
joined_samp.head(2)
Out[39]:
In [40]:
y = joined_samp['Sales']
In [41]:
df = joined_samp.drop(labels='Sales',axis=1)
In [42]:
df.head(2)
Out[42]:
In [43]:
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.20, random_state=42)
In [62]:
model = RandomForestRegressor(n_jobs=-1, oob_score=True, n_estimators=300)
In [63]:
model.fit(X_train, y_train)
Out[63]:
In [64]:
from sklearn.metrics import mean_squared_error
import math
In [65]:
def rmspe(targ, y_pred):
    # Root mean square percentage error, the Rossmann competition metric
    pct_var = (targ - y_pred)/targ
    return math.sqrt((pct_var**2).mean())
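A tiny sanity check of the metric (illustrative only, not a cell from the original notebook):

# perfect predictions give 0; a uniform 10% over-prediction gives 0.1
rmspe(pd.Series([100.0, 200.0]), pd.Series([100.0, 200.0]))   # -> 0.0
rmspe(pd.Series([100.0, 200.0]), pd.Series([110.0, 220.0]))   # -> 0.1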
In [66]:
preds = model.predict(X_test)
model.score(X_train, y_train), model.score(X_test, y_test), model.oob_score_, rmspe(y_test, preds)
Out[66]:
In [49]:
X_train.head(2)
Out[49]:
In [50]:
from merf import MERF
In [51]:
merf = MERF()
Store is the grouping (cluster) variable, so we treat it as the candidate random effect rather than a fixed-effect covariate: MERF fits a forest on the fixed-effect features and a per-cluster random effect on Z. With Z a single column of ones, each store gets its own random intercept.
In [52]:
clusters_train = X_train['Store']
In [53]:
X_train_merf = X_train.drop(['Store'],axis=1)
In [54]:
Z_train = np.ones(shape=(X_train.shape[0],1))
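Z_train above is a column of ones, so the per-store random effect is just an intercept. Purely for illustration (this hypothetical Z_slope is not used anywhere in the notebook), a per-store random slope on a covariate such as Promo would simply add that column to Z:

# hypothetical design matrix: random intercept plus a random slope on Promo
Z_slope = np.column_stack([np.ones(len(X_train)), X_train['Promo'].values])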
In [55]:
merf.fit(X_train_merf, Z_train, clusters_train, y_train)
Out[55]:
Predictions
In [59]:
clusters_test = X_test['Store']
X_test_merf = X_test.drop(['Store'],axis=1)
Z_test = np.ones(shape=(X_test.shape[0],1))
In [60]:
y_hat = merf.predict(X_test_merf, Z_test, clusters_test)
In [61]:
rmspe(y_test, y_hat)  # compare with rmspe(y_test, preds) for the RF above
Out[61]:
Keeping everything else the same, MERF achieved a lower RMSPE than the plain Random Forest.
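A side-by-side print (an illustrative sketch, not an executed cell from the original notebook) makes the comparison explicit:

# RF vs. MERF on the same held-out split
print('RF   RMSPE:', rmspe(y_test, preds))
print('MERF RMSPE:', rmspe(y_test, y_hat))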