In [1]:
from jupyterthemes import get_themes
from jupyterthemes.stylefx import set_nb_theme
themes = get_themes()
set_nb_theme(themes[3])
Out[1]:
In [2]:
# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format='retina'
import os
import time
import numba
import numpy as np
import pandas as pd
%watermark -a 'Ethen' -d -t -v -p numpy,pandas,numba
In addition to the data provided by the competition, we will be using external datasets put together by participants in the Kaggle competition. We can download all of them here. We should then untar them into the directory that data_dir points to.
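For example, assuming everything was downloaded as a single archive named rossmann.tgz (the archive name here is just an assumption), it could be extracted with Python's tarfile module:
import tarfile

# extract the downloaded archive into the 'rossmann' directory that
# data_dir points to below (the archive name is hypothetical)
with tarfile.open('rossmann.tgz') as tar:
    tar.extractall('rossmann')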
In [3]:
data_dir = 'rossmann'
print('available files: ', os.listdir(data_dir))
file_names = ['train', 'store', 'store_states', 'state_names',
'googletrend', 'weather', 'test']
path_names = {file_name: os.path.join(data_dir, file_name + '.csv')
for file_name in file_names}
df_train = pd.read_csv(path_names['train'], low_memory=False)
df_test = pd.read_csv(path_names['test'], low_memory=False)
print('training data dimension: ', df_train.shape)
print('testing data dimension: ', df_test.shape)
df_train.head()
Out[3]:
We convert the state holidays to booleans, to make them more convenient for modeling.
In [4]:
df_train['StateHoliday'] = df_train['StateHoliday'] != '0'
df_test['StateHoliday'] = df_test['StateHoliday'] != '0'
For the weather and state names data, we perform a join on a state name field and create a single dataframe.
In [5]:
df_weather = pd.read_csv(path_names['weather'])
print('weather data dimension: ', df_weather.shape)
df_weather.head()
Out[5]:
In [6]:
df_state_names = pd.read_csv(path_names['state_names'])
print('state names data dimension: ', df_state_names.shape)
df_state_names.head()
Out[6]:
In [7]:
df_weather = df_weather.rename(columns={'file': 'StateName'})
df_weather = df_weather.merge(df_state_names, on="StateName", how='left')
df_weather.head()
Out[7]:
For the Google trend data, we extract the state and date information from the raw dataset and replace all instances of the state name 'NI' with 'HB,NI' to match the usage in the rest of the data.
In [8]:
df_googletrend = pd.read_csv(path_names['googletrend'])
print('google trend data dimension: ', df_googletrend.shape)
df_googletrend.head()
Out[8]:
In [9]:
df_googletrend['Date'] = df_googletrend['week'].str.split(' - ', expand=True)[0]
df_googletrend['State'] = df_googletrend['file'].str.split('_', expand=True)[2]
df_googletrend.loc[df_googletrend['State'] == 'NI', 'State'] = 'HB,NI'
df_googletrend.head()
Out[9]:
The following code chunk extracts particular date fields from a complete datetime for the purpose of constructing categoricals.
We should always consider this feature extraction step when working with date-time data. Without expanding our date-time into these additional fields, we can't capture any trend/cyclical behavior as a function of time at any of these granularities. We'll apply this to every table that has a date field.
In [10]:
DEFAULT_DT_ATTRIBUTES = [
'Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear',
'Is_month_end', 'Is_month_start', 'Is_quarter_end',
'Is_quarter_start', 'Is_year_end', 'Is_year_start'
]
def add_datepart(df, colname, drop_original_col=False,
dt_attributes=DEFAULT_DT_ATTRIBUTES,
add_elapse_col=True):
"""
Extract various datetime components out of a date column. This modifies
the dataframe inplace.
References
----------
- https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-date-components
"""
df[colname] = pd.to_datetime(df[colname], infer_datetime_format=True)
if dt_attributes:
for attr in dt_attributes:
df[attr] = getattr(df[colname].dt, attr.lower())
# representing the number of seconds elapsed from 1970-01-01 00:00:00
# https://stackoverflow.com/questions/15203623/convert-pandas-datetimeindex-to-unix-time
if add_elapse_col:
df['Elapsed'] = df[colname].astype(np.int64) // 10 ** 9
if drop_original_col:
df = df.drop(colname, axis=1)
return df
In [11]:
df_weather.head()
Out[11]:
In [12]:
df_weather = add_datepart(
df_weather, 'Date',
dt_attributes=None, add_elapse_col=False)
df_googletrend = add_datepart(
df_googletrend, 'Date', drop_original_col=True,
dt_attributes=['Year', 'Week'], add_elapse_col=False)
df_train = add_datepart(df_train, 'Date')
df_test = add_datepart(df_test, 'Date')
print('training data dimension: ', df_train.shape)
df_train.head()
Out[12]:
The Google trends data has a special category for the whole of Germany - we'll pull that out so we can use it explicitly.
In [13]:
df_trend_de = df_googletrend.loc[df_googletrend['file'] == 'Rossmann_DE',
['Year', 'Week', 'trend']]
df_trend_de.head()
Out[13]:
Now we can outer join all of our data into a single dataframe. Recall that in outer joins, every time a value in the joining field on the left table does not have a corresponding value on the right table, the corresponding row in the new table has Null values for all right table fields. One way to check that all records are consistent and complete is to check for Null values post-join, as we do here.
Aside: Why not just do an inner join? If we are assuming that all records are complete and match on the field we desire, an inner join will do the same thing as an outer join. However, in the event we are not sure, an outer join followed by a null-check will catch it. (Comparing the number of rows before and after an inner join is an equivalent approach.)
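As a minimal sketch of that equivalent check (not used in the rest of this notebook), using the train and store tables that are merged below:
# sketch of the inner-join sanity check: if no rows are dropped,
# every record in df_train found a matching store record
n_before = len(df_train)
n_after = len(df_train.merge(df_store, on='Store', how='inner'))
assert n_before == n_after, 'some rows in train have no matching store record'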
During the merging process, we'll print out the first few rows of the dataframe and the column names so we can keep track of how the dataframe evolves as we join with each new data source.
In [14]:
df_store = pd.read_csv(path_names['store'])
print('store data dimension: ', df_store.shape)
df_store.head()
Out[14]:
In [15]:
df_store_states = pd.read_csv(path_names['store_states'])
print('store states data dimension: ', df_store_states.shape)
df_store_states.head()
Out[15]:
In [16]:
df_store = df_store.merge(df_store_states, on='Store', how='left')
print('null count: ', len(df_store[df_store['State'].isnull()]))
df_store.head()
Out[16]:
In [17]:
df_joined_train = df_train.merge(df_store, on='Store', how='left')
df_joined_test = df_test.merge(df_store, on='Store', how='left')
null_count_train = len(df_joined_train[df_joined_train['StoreType'].isnull()])
null_count_test = len(df_joined_test[df_joined_test['StoreType'].isnull()])
print('null count: ', null_count_train, null_count_test)
print('dimension: ', df_joined_train.shape)
df_joined_train.head()
Out[17]:
In [18]:
df_joined_train.columns
Out[18]:
In [19]:
df_joined_train = df_joined_train.merge(df_weather, on=['State', 'Date'], how='left')
df_joined_test = df_joined_test.merge(df_weather, on=['State', 'Date'], how='left')
null_count_train = len(df_joined_train[df_joined_train['Mean_TemperatureC'].isnull()])
null_count_test = len(df_joined_test[df_joined_test['Mean_TemperatureC'].isnull()])
print('null count: ', null_count_train, null_count_test)
print('dimension: ', df_joined_train.shape)
df_joined_train.head()
Out[19]:
In [20]:
df_joined_train.columns
Out[20]:
In [21]:
df_joined_train = df_joined_train.merge(df_googletrend,
on=['State', 'Year', 'Week'],
how='left')
df_joined_test = df_joined_test.merge(df_googletrend,
on=['State', 'Year', 'Week'],
how='left')
null_count_train = len(df_joined_train[df_joined_train['trend'].isnull()])
null_count_test = len(df_joined_test[df_joined_test['trend'].isnull()])
print('null count: ', null_count_train, null_count_test)
print('dimension: ', df_joined_train.shape)
df_joined_train.head()
Out[21]:
In [22]:
df_joined_train.columns
Out[22]:
In [23]:
df_joined_train = df_joined_train.merge(df_trend_de,
on=['Year', 'Week'],
suffixes=('', '_DE'),
how='left')
df_joined_test = df_joined_test.merge(df_trend_de,
on=['Year', 'Week'],
suffixes=('', '_DE'),
how='left')
null_count_train = len(df_joined_train[df_joined_train['trend_DE'].isnull()])
null_count_test = len(df_joined_test[df_joined_test['trend_DE'].isnull()])
print('null count: ', null_count_train, null_count_test)
print('dimension: ', df_joined_train.shape)
df_joined_train.head()
Out[23]:
In [24]:
df_joined_train.columns
Out[24]:
After merging all the various data sources to create our master dataframe, we'll perform some additional feature engineering steps, including filling in missing values for the competition/promotion start date fields and deriving how long the competition and the Promo2 promotion have been running.
In [25]:
for df in (df_joined_train, df_joined_test):
df['CompetitionOpenSinceYear'] = (df['CompetitionOpenSinceYear']
.fillna(1900)
.astype(np.int32))
df['CompetitionOpenSinceMonth'] = (df['CompetitionOpenSinceMonth']
.fillna(1)
.astype(np.int32))
df['Promo2SinceYear'] = df['Promo2SinceYear'].fillna(1900).astype(np.int32)
df['Promo2SinceWeek'] = df['Promo2SinceWeek'].fillna(1).astype(np.int32)
In [26]:
for df in (df_joined_train, df_joined_test):
df['CompetitionOpenSince'] = pd.to_datetime(dict(
year=df['CompetitionOpenSinceYear'],
month=df['CompetitionOpenSinceMonth'],
day=15
))
df['CompetitionDaysOpen'] = df['Date'].subtract(df['CompetitionOpenSince']).dt.days
For the CompetitionMonthsOpen field, we cap the value at 2 years (24 months) to limit the number of unique categories.
In [27]:
for df in (df_joined_train, df_joined_test):
df['CompetitionMonthsOpen'] = df['CompetitionDaysOpen'] // 30
df.loc[df['CompetitionMonthsOpen'] > 24, 'CompetitionMonthsOpen'] = 24
df.loc[df['CompetitionMonthsOpen'] < -24, 'CompetitionMonthsOpen'] = -24
df_joined_train['CompetitionMonthsOpen'].unique()
Out[27]:
We repeat the same process for Promo2.
In [28]:
from isoweek import Week
for df in (df_joined_train, df_joined_test):
df['Promo2Since'] = pd.to_datetime(df.apply(lambda x: Week(
x.Promo2SinceYear, x.Promo2SinceWeek).monday(), axis=1))
df['Promo2Days'] = df['Date'].subtract(df['Promo2Since']).dt.days
In [29]:
for df in (df_joined_train, df_joined_test):
df['Promo2Weeks'] = df['Promo2Days'] // 7
df.loc[df['Promo2Weeks'] < 0, 'Promo2Weeks'] = 0
df.loc[df['Promo2Weeks'] > 25, 'Promo2Weeks'] = 25
df_joined_train['Promo2Weeks'].unique()
Out[29]:
In [30]:
df_joined_train.columns
Out[30]:
It is common when working with time series data to extract features that capture relationships across rows instead of between columns, e.g. time since the last event, time until the next event.
Here, we would like to compute features such as days since the last promotion and days until the next promotion. The same process can be repeated for state/school holidays.
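As a toy illustration of the "days since last event" idea (not part of the actual pipeline, which uses the numba-accelerated function below), a pure pandas version for a single store might look like:
# toy example: days since the last promotion for a single store
toy = pd.DataFrame({
    'Date': pd.date_range('2015-01-01', periods=6, freq='D'),
    'Promo': [1, 0, 0, 1, 0, 0]
})
# keep the date only on promo days, then forward fill to get the last promo date
last_promo_date = toy['Date'].where(toy['Promo'] == 1).ffill()
toy['DaysSincePromo'] = (toy['Date'] - last_promo_date).dt.days
toy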
In [31]:
columns = ['Date', 'Store', 'Promo', 'StateHoliday', 'SchoolHoliday']
df = df_joined_train[columns].append(df_joined_test[columns])
df['DateUnixSeconds'] = df['Date'].astype(np.int64) // 10 ** 9
df.head()
Out[31]:
In [32]:
@numba.njit
def compute_duration(store_arr, date_unix_seconds_arr, field_arr):
"""
For each store, track the days since/before the occurrence of a field.
The store and date are assumed to be already sorted.
Parameters
----------
store_arr : 1d ndarray[int]
date_unix_seconds_arr : 1d ndarray[int]
The date should be represented in unix timestamp (seconds).
field_arr : 1d ndarray[bool]/ndarray[int]
The field that we're interested in. If int, it should take value
of 1/0 indicating whether the field/event occurred or not.
Returns
-------
result : list[int]
Days since/before the occurrence of a field.
"""
result = []
last_store = 0
zipped = zip(store_arr, date_unix_seconds_arr, field_arr)
for store, date_unix_seconds, field in zipped:
if store != last_store:
last_store = store
last_date = date_unix_seconds
if field:
last_date = date_unix_seconds
diff_day = (date_unix_seconds - last_date) // 86400
result.append(diff_day)
return result
In [33]:
df = df.sort_values(['Store', 'Date'])
start = time.time()
for col in ('SchoolHoliday', 'StateHoliday', 'Promo'):
result = compute_duration(df['Store'].values,
df['DateUnixSeconds'].values,
df[col].values)
df['After' + col] = result
end = time.time()
print('elapsed: ', end - start)
df.head(10)
Out[33]:
If we look at the values in the AfterStateHoliday column, we can see that the StateHoliday column in the first row is True, so the corresponding AfterStateHoliday is 0, indicating that day is itself a state holiday. After encountering a state holiday, the AfterStateHoliday column starts incrementing until it sees the next StateHoliday, which resets the counter.
Note that for Promo, the column starts out with a 0, yet AfterPromo starts accumulating until it sees the next Promo. Here, we're not exactly sure when the last promo before 2013-01-01 was, since we don't have the data for it. Nonetheless, we'll still start incrementing the counter. Another approach is to fill these leading values with 0.
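A minimal sketch of that alternative (not applied here) would be to zero out the "After" counter for rows that come before a store's first observed event, since we don't actually know when the previous event happened:
# sketch only, assuming df is still sorted by Store and Date ascending
def zero_before_first_event(df, col):
    # cumulative count of events within each store; 0 means no event seen yet
    events_seen = df[col].astype(int).groupby(df['Store']).cumsum()
    df.loc[events_seen == 0, 'After' + col] = 0
    return df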
In [34]:
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
start = time.time()
for col in ('SchoolHoliday', 'StateHoliday', 'Promo'):
result = compute_duration(df['Store'].values,
df['DateUnixSeconds'].values,
df[col].values)
df['Before' + col] = result
end = time.time()
print('elapsed: ', end - start)
df.head(10)
Out[34]:
After creating these new features, we join them back to the original dataframe.
In [35]:
df = df.drop(['Promo', 'StateHoliday', 'SchoolHoliday', 'DateUnixSeconds'], axis=1)
df_joined_train = df_joined_train.merge(df, on=['Date', 'Store'], how='inner')
df_joined_test = df_joined_test.merge(df, on=['Date', 'Store'], how='inner')
print('dimension: ', df_joined_train.shape)
df_joined_train.head()
Out[35]:
In [36]:
df_joined_train.columns
Out[36]:
We save the cleaned data so we won't have to repeat this data preparation step.
In [37]:
output_dir = 'cleaned_data'
if not os.path.isdir(output_dir):
os.makedirs(output_dir, exist_ok=True)
engine = 'pyarrow'
output_path_train = os.path.join(output_dir, 'train_clean.parquet')
output_path_test = os.path.join(output_dir, 'test_clean.parquet')
df_joined_train.to_parquet(output_path_train, engine=engine)
df_joined_test.to_parquet(output_path_test, engine=engine)
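Later on, the cleaned data can be loaded back with pd.read_parquet, e.g.:
# reading the cleaned data back (same paths and engine as above)
df_joined_train = pd.read_parquet(output_path_train, engine=engine)
df_joined_test = pd.read_parquet(output_path_test, engine=engine)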