FastAI DL1 Lesson 3 CodeAlong II:

Rossmann - Structured Data

Code Along of lesson3-rossman.ipynb

1 Structured and Time Series Data

This notebook contains an implementation of the third place result in the Rossmann Kaggle competition as detailed in Guo/Berkhahn's Entity Embeddings of Categorical Variables.

The motivation behind exploring this architecture is its relevance to real-world applications. Most data used for day-to-day decision making in industry is structured and/or time-series data. Here we explore the end-to-end process of using neural networks with practical structured data problems.


In [221]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

In [222]:
# from fastai.imports import *
# from fastai.torch_imports import *
from fastai.structured import * # non-PyTorch-specific machine learning tools; independent library
# from fastai.dataset import * # lets us do fastai PyTorch stuff w/ structured columnar data
from fastai.column_data import *

# np.set_printoptions(threshold=50, edgeitems=20)
# from sklearn_pandas import DataFrameMapper
# from sklearn.preprocessing import LabelEncoder, Imputer, StandardScaler
# import operator

PATH = 'data/rossmann/'

1.1 Create Datasets


In [223]:
def concat_csvs(dirname):
    path = f'{PATH}{dirname}'
    filenames = glob.glob(f'{path}/*.csv')
    
    wrote_header = False
    with open(f'{path}.csv', 'w') as outputfile:
        for filename in filenames:
            name = filename.split('.')[0]
            with open(filename) as f:
                line = f.readline()
                if not wrote_header:
                    wrote_header = True
                    outputfile.write('file,' + line)
                for line in f:
                    outputfile.write(name + ',' + line)
                outputfile.write('\n')

In [4]:
# concat_csvs('googletrend')
# concat_csvs('weather')

Feature Space:

  • train: Training set provided by competition
  • store: List of stores
  • store_states: Mapping of store to the German state they're in
  • state_names: List of German state names
  • googletrend: Trend of certain Google keywords over time, found by users to correlate well w/ given data
  • weather: Weather
  • test: Testing set

In [224]:
table_names = ['train', 'store', 'store_states', 'state_names', 
               'googletrend', 'weather', 'test']

We'll be using the popular data manipulation framework pandas. Among other things, Pandas allows you to manipulate tables/DataFrames in Python as one would in a database.

We're going to go ahead and load all our csv's as DataFrames into the list tables.


In [225]:
tables = [pd.read_csv(f'{PATH}{fname}.csv', low_memory=False) for fname in table_names]

In [226]:
from IPython.display import HTML

We can use head() to get a quick look at the contents of each table:

  • train: Contains store information on a daily basis, tracks things like sales, customers, whether that day was a holiday, etc.
  • store: General info about the store including competition, etc.
  • store_states: Maps store to state it's in
  • state_names: Maps state abbreviations to names
  • googletrend: Trend data for particular week/state
  • weather: Weather conditions for each state
  • test: Same as training table, w/o sales and customers

In [227]:
for t in tables: display(t.head())


Store DayOfWeek Date Sales Customers Open Promo StateHoliday SchoolHoliday
0 1 5 2015-07-31 5263 555 1 1 0 1
1 2 5 2015-07-31 6064 625 1 1 0 1
2 3 5 2015-07-31 8314 821 1 1 0 1
3 4 5 2015-07-31 13995 1498 1 1 0 1
4 5 5 2015-07-31 4822 559 1 1 0 1
Store StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear PromoInterval
0 1 c a 1270.0 9.0 2008.0 0 NaN NaN NaN
1 2 a a 570.0 11.0 2007.0 1 13.0 2010.0 Jan,Apr,Jul,Oct
2 3 a a 14130.0 12.0 2006.0 1 14.0 2011.0 Jan,Apr,Jul,Oct
3 4 c c 620.0 9.0 2009.0 0 NaN NaN NaN
4 5 a a 29910.0 4.0 2015.0 0 NaN NaN NaN
Store State
0 1 HE
1 2 TH
2 3 NW
3 4 BE
4 5 SN
StateName State
0 BadenWuerttemberg BW
1 Bayern BY
2 Berlin BE
3 Brandenburg BB
4 Bremen HB
file week trend
0 Rossmann_DE_SN 2012-12-02 - 2012-12-08 96
1 Rossmann_DE_SN 2012-12-09 - 2012-12-15 95
2 Rossmann_DE_SN 2012-12-16 - 2012-12-22 91
3 Rossmann_DE_SN 2012-12-23 - 2012-12-29 48
4 Rossmann_DE_SN 2012-12-30 - 2013-01-05 67
file Date Max_TemperatureC Mean_TemperatureC Min_TemperatureC Dew_PointC MeanDew_PointC Min_DewpointC Max_Humidity Mean_Humidity ... Max_VisibilityKm Mean_VisibilityKm Min_VisibilitykM Max_Wind_SpeedKm_h Mean_Wind_SpeedKm_h Max_Gust_SpeedKm_h Precipitationmm CloudCover Events WindDirDegrees
0 NordrheinWestfalen 2013-01-01 8 4 2 7 5 1 94 87 ... 31.0 12.0 4.0 39 26 58.0 5.08 6.0 Rain 215
1 NordrheinWestfalen 2013-01-02 7 4 1 5 3 2 93 85 ... 31.0 14.0 10.0 24 16 NaN 0.00 6.0 Rain 225
2 NordrheinWestfalen 2013-01-03 11 8 6 10 8 4 100 93 ... 31.0 8.0 2.0 26 21 NaN 1.02 7.0 Rain 240
3 NordrheinWestfalen 2013-01-04 9 9 8 9 9 8 100 94 ... 11.0 5.0 2.0 23 14 NaN 0.25 7.0 Rain 263
4 NordrheinWestfalen 2013-01-05 8 8 7 8 7 6 100 94 ... 10.0 6.0 3.0 16 10 NaN 0.00 7.0 Rain 268

5 rows × 24 columns

Id Store DayOfWeek Date Open Promo StateHoliday SchoolHoliday
0 1 1 4 2015-09-17 1.0 1 0 0
1 2 3 4 2015-09-17 1.0 1 0 0
2 3 7 4 2015-09-17 1.0 1 0 0
3 4 8 4 2015-09-17 1.0 1 0 0
4 5 9 4 2015-09-17 1.0 1 0 0

This is very representative of a typical industry dataset.

The following returns summarized aggregate information for each table across each field.


In [228]:
for t in tables: display(DataFrameSummary(t).summary())


Store DayOfWeek Date Sales Customers Open Promo StateHoliday SchoolHoliday
count 1.01721e+06 1.01721e+06 NaN 1.01721e+06 1.01721e+06 1.01721e+06 1.01721e+06 NaN 1.01721e+06
mean 558.43 3.99834 NaN 5773.82 633.146 0.830107 0.381515 NaN 0.178647
std 321.909 1.99739 NaN 3849.93 464.412 0.375539 0.485759 NaN 0.383056
min 1 1 NaN 0 0 0 0 NaN 0
25% 280 2 NaN 3727 405 1 0 NaN 0
50% 558 4 NaN 5744 609 1 0 NaN 0
75% 838 6 NaN 7856 837 1 1 NaN 0
max 1115 7 NaN 41551 7388 1 1 NaN 1
counts 1017209 1017209 1017209 1017209 1017209 1017209 1017209 1017209 1017209
uniques 1115 7 942 21734 4086 2 2 4 2
missing 0 0 0 0 0 0 0 0 0
missing_perc 0% 0% 0% 0% 0% 0% 0% 0% 0%
types numeric numeric categorical numeric numeric bool bool categorical bool
Store StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear PromoInterval
count 1115 NaN NaN 1112 761 761 1115 571 571 NaN
mean 558 NaN NaN 5404.9 7.2247 2008.67 0.512108 23.5954 2011.76 NaN
std 322.017 NaN NaN 7663.17 3.21235 6.19598 0.500078 14.142 1.67494 NaN
min 1 NaN NaN 20 1 1900 0 1 2009 NaN
25% 279.5 NaN NaN 717.5 4 2006 0 13 2011 NaN
50% 558 NaN NaN 2325 8 2010 1 22 2012 NaN
75% 836.5 NaN NaN 6882.5 10 2013 1 37 2013 NaN
max 1115 NaN NaN 75860 12 2015 1 50 2015 NaN
counts 1115 1115 1115 1112 761 761 1115 571 571 571
uniques 1115 4 3 654 12 23 2 24 7 3
missing 0 0 0 3 354 354 0 544 544 544
missing_perc 0% 0% 0% 0.27% 31.75% 31.75% 0% 48.79% 48.79% 48.79%
types numeric categorical categorical numeric numeric numeric bool numeric numeric categorical
Store State
count 1115 NaN
mean 558 NaN
std 322.017 NaN
min 1 NaN
25% 279.5 NaN
50% 558 NaN
75% 836.5 NaN
max 1115 NaN
counts 1115 1115
uniques 1115 12
missing 0 0
missing_perc 0% 0%
types numeric categorical
StateName State
count 16 16
unique 16 16
top NordrheinWestfalen BY
freq 1 1
counts 16 16
uniques 16 16
missing 0 0
missing_perc 0% 0%
types unique unique
file week trend
count NaN NaN 2072
mean NaN NaN 63.8142
std NaN NaN 12.6502
min NaN NaN 0
25% NaN NaN 55
50% NaN NaN 64
75% NaN NaN 72
max NaN NaN 100
counts 2072 2072 2072
uniques 14 148 68
missing 0 0 0
missing_perc 0% 0% 0%
types categorical categorical numeric
file Date Max_TemperatureC Mean_TemperatureC Min_TemperatureC Dew_PointC MeanDew_PointC Min_DewpointC Max_Humidity Mean_Humidity ... Max_VisibilityKm Mean_VisibilityKm Min_VisibilitykM Max_Wind_SpeedKm_h Mean_Wind_SpeedKm_h Max_Gust_SpeedKm_h Precipitationmm CloudCover Events WindDirDegrees
count NaN NaN 15840 15840 15840 15840 15840 15840 15840 15840 ... 15459 15459 15459 15840 15840 3604 15840 14667 NaN 15840
mean NaN NaN 14.6441 10.389 6.19899 8.58782 6.20581 3.62614 93.6596 74.2829 ... 24.0576 12.2398 7.02516 22.7666 11.9722 48.8643 0.831718 5.55131 NaN 175.897
std NaN NaN 8.64601 7.37926 6.52639 6.24478 6.08677 6.12839 7.67853 13.4866 ... 8.9768 5.06794 4.9806 8.98862 5.87284 13.027 2.51351 1.68771 NaN 101.589
min NaN NaN -11 -13 -15 -14 -15 -73 44 30 ... 0 0 0 3 2 21 0 0 NaN -1
25% NaN NaN 8 4 1 4 2 -1 90.75 65 ... 14 10 3 16 8 39 0 5 NaN 80
50% NaN NaN 15 11 7 9 7 4 94 76 ... 31 11 7 21 11 48 0 6 NaN 202
75% NaN NaN 21 16 11 13 11 8 100 85 ... 31 14 10 27 14 55 0.25 7 NaN 256
max NaN NaN 39 31 24 25 20 19 100 100 ... 31 31 31 101 53 111 58.93 8 NaN 360
counts 15840 15840 15840 15840 15840 15840 15840 15840 15840 15840 ... 15459 15459 15459 15840 15840 3604 15840 14667 11889 15840
uniques 16 990 51 45 40 40 36 40 53 71 ... 24 32 24 44 29 47 41 9 21 362
missing 0 0 0 0 0 0 0 0 0 0 ... 381 381 381 0 0 12236 0 1173 3951 0
missing_perc 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% ... 2.41% 2.41% 2.41% 0% 0% 77.25% 0% 7.41% 24.94% 0%
types categorical categorical numeric numeric numeric numeric numeric numeric numeric numeric ... numeric numeric numeric numeric numeric numeric numeric numeric categorical numeric

13 rows × 24 columns

Id Store DayOfWeek Date Open Promo StateHoliday SchoolHoliday
count 41088 41088 41088 NaN 41077 41088 NaN 41088
mean 20544.5 555.9 3.97917 NaN 0.854322 0.395833 NaN 0.443487
std 11861.2 320.274 2.01548 NaN 0.352787 0.489035 NaN 0.496802
min 1 1 1 NaN 0 0 NaN 0
25% 10272.8 279.75 2 NaN 1 0 NaN 0
50% 20544.5 553.5 4 NaN 1 0 NaN 0
75% 30816.2 832.25 6 NaN 1 1 NaN 1
max 41088 1115 7 NaN 1 1 NaN 1
counts 41088 41088 41088 41088 41077 41088 41088 41088
uniques 41088 856 7 48 2 2 2 2
missing 0 0 0 0 11 0 0 0
missing_perc 0% 0% 0% 0% 0.03% 0% 0% 0%
types numeric numeric numeric categorical bool bool bool bool

1.2 Data Cleaning / Feature Engineering

As a structured data problem, we necessarily have to go through all the cleaning and feature engineering, even though we're using a neural network.


In [229]:
train, store, store_states, state_names, googletrend, weather, test = tables

In [230]:
len(train), len(test)


Out[230]:
(1017209, 41088)

We turn state holidays into booleans, to make them more convenient for modeling. We can do calculations on pandas fields using notation very similar (often identical) to numpy.


In [231]:
train.StateHoliday = train.StateHoliday != '0'
test.StateHoliday = test.StateHoliday != '0'

join_df is a function for joining tables on specific fields. By default, we'll be doing a left outer join of right on the left argument using the given fields for each table.

Pandas does joins using the merge method. The suffixes argument describes the naming convention for duplicate fields. We've elected to leave the duplicate field names on the left untouched, and append a "_y" to those on the right.


In [232]:
def join_df(left, right, left_on, right_on=None, suffix='_y'):
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on, 
                      suffixes=("", suffix))

Join weather/state names:


In [233]:
weather = join_df(weather, state_names, "file", "StateName")

In Pandas you can add new columns to a DataFrame by simply defining them. We'll do this for googletrend by extracting dates and state names from the given data and adding those columns.

We're also going to replace all instances of state name 'NI' to match the usage in the rest of the data: 'HB,NI'. This is a good opportunity to highlight Pandas indexing. We can use .loc[rows, cols] to select a list of rows and a list of columns from the DataFrame. In this case, we're selecting rows w/ state name 'NI' by using the boolean list googletrend.State=='NI' and selecting the "State" column.


In [234]:
googletrend['Date'] = googletrend.week.str.split(' - ', expand=True)[0]
googletrend['State'] = googletrend.file.str.split('_', expand=True)[2]
googletrend.loc[googletrend.State=='NI', 'State'] = 'HB,NI'

The following extracts particular date fields from a complete datetime for the purpose of constructing categoricals.

You should always consider this feature extraction step when working with date-time. Without expanding your date-time into these additional fields, you can't capture any trend/cyclical behavior as a function of time at any of these granularities. We'll add to every table with a date field.
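
Roughly speaking, add_datepart (from fastai.structured) expands a datetime column into a set of attribute columns. Here's a simplified sketch of the idea (my own illustration; the name add_datepart_sketch and the exact column list are a paraphrase of the library helper, whose output you can see in joined.head().T further down):

import numpy as np
import pandas as pd

def add_datepart_sketch(df, fldname, drop=True):
    fld = pd.to_datetime(df[fldname])
    # one column per date attribute
    for attr in ['Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear',
                 'Is_month_end', 'Is_month_start', 'Is_quarter_end',
                 'Is_quarter_start', 'Is_year_end', 'Is_year_start']:
        df[attr] = getattr(fld.dt, attr.lower())   # note: .dt.week is deprecated in newer pandas
    df['Elapsed'] = fld.astype(np.int64) // 10**9  # seconds since the Unix epoch
    if drop: df.drop(fldname, axis=1, inplace=True)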


In [235]:
add_datepart(weather, "Date", drop=False)
add_datepart(googletrend, "Date", drop=False)
add_datepart(train, "Date", drop=False)
add_datepart(test, "Date", drop=False)
add_datepart(googletrend, "Date", drop=False)

The Google Trends data has a special category for the whole of Germany ('Rossmann_DE') - we'll pull that out so we can use it explicitly.


In [236]:
trend_de = googletrend[googletrend.file == 'Rossmann_DE']

Now we can outer join all of our data into a single DataFrame. Recall that in outer joins, every time a value in the joining field on the left table doesn't have a corresponding value on the right table, the corresponding row in the new table has Null values for all right-table fields. One way to check that all records are consistent and complete is to check for Null values post-join, as we do here.

Aside: Why not just do an inner join? If you're assuming that all records are complete and match on the field you desire, an inner join will do the same thing as an outer join. However, in the event that you're wrong or a mistake is made, an outer join followed by a null check will catch it. (Comparing the number of rows before and after an inner join is equivalent, but requires keeping track of those row counts; the outer join is easier.)


In [237]:
store = join_df(store, store_states, "Store")
len(store[store.State.isnull()])


Out[237]:
0

In [238]:
joined = join_df(train, store, "Store")
joined_test = join_df(test, store, "Store")
len(joined[joined.StoreType.isnull()]), len(joined_test[joined_test.StoreType.isnull()])


Out[238]:
(0, 0)

In [239]:
joined = join_df(joined, googletrend, ["State","Year","Week"])
joined_test = join_df(joined_test, googletrend, ["State","Year","Week"])
len(joined[joined.trend.isnull()]), len(joined_test[joined_test.trend.isnull()])


Out[239]:
(0, 0)

In [240]:
joined = joined.merge(trend_de, 'left', ["Year", "Week"], suffixes=('', '_DE'))
joined_test = joined_test.merge(trend_de, 'left', ["Year", "Week"], suffixes=('', '_DE'))
len(joined[joined.trend_DE.isnull()]), len(joined_test[joined_test.trend_DE.isnull()])


Out[240]:
(0, 0)

In [241]:
joined = join_df(joined, weather, ["State","Date"])
joined_test = join_df(joined_test, weather, ["State","Date"])
len(joined[joined.Mean_TemperatureC.isnull()]), len(joined_test[joined_test.Mean_TemperatureC.isnull()])


Out[241]:
(0, 0)

In [242]:
for df in (joined, joined_test):
    for c in df.columns:
        if c.endswith('_y'):
            if c in df.columns: df.drop(c, inplace=True, axis=1)

Next we'll fill in missing values to avoid complications with NA's. NA (not available) is how Pandas indicates missing values; many models have problems when missing values are present, so it's always important to think about how to deal with them. In these cases, we're picking an arbitrary signal value that doesn't otherwise appear in the data.


In [243]:
for df in (joined, joined_test):
    df['CompetitionOpenSinceYear'] = df.CompetitionOpenSinceYear.fillna(1900).astype(np.int32)
    df['CompetitionOpenSinceMonth']= df.CompetitionOpenSinceMonth.fillna(1).astype(np.int32)
    df['Promo2SinceYear'] = df.Promo2SinceYear.fillna(1900).astype(np.int32)
    df['Promo2SinceWeek'] = df.Promo2SinceWeek.fillna(1).astype(np.int32)

Next we'll extract features "CompetitionOpenSince" and "CompetitionDaysOpen". Note the use of apply() in mapping a function across DataFrame values.


In [244]:
for df in (joined, joined_test):
    df["CompetitionOpenSince"] = pd.to_datetime(dict(year=df.CompetitionOpenSinceYear, 
                                                     month=df.CompetitionOpenSinceMonth, day=15))
    df["CompetitionDaysOpen"] = df.Date.subtract(df.CompetitionOpenSince).dt.days

We'll replace some erroneous / outlying data


In [245]:
for df in (joined, joined_test):
    df.loc[df.CompetitionDaysOpen < 0, "CompetitionDaysOpen"] = 0
    df.loc[df.CompetitionOpenSinceYear < 1990, "CompetitionDaysOpen"] = 0

We add the "CompetitionMonthsOpen" field, limiting the maximum to 2 years to limit number of unique categories.


In [246]:
for df in (joined, joined_test):
    df["CompetitionMonthsOpen"] = df["CompetitionDaysOpen"] // 30
    df.loc[df.CompetitionMonthsOpen > 24, "CompetitionMonthsOpen"] = 24
joined.CompetitionMonthsOpen.unique()


Out[246]:
array([24,  3, 19,  9,  0, 16, 17,  7, 15, 22, 11, 13,  2, 23, 12,  4, 10,  1, 14, 20,  8, 18,  6, 21,  5])

Same process for Promo dates.


In [247]:
for df in (joined, joined_test):
    df["Promo2Since"] = pd.to_datetime(df.apply(lambda x: Week(
        x.Promo2SinceYear, x.Promo2SinceWeek).monday(), axis=1).astype(pd.datetime))
    df["Promo2Days"] = df.Date.subtract(df["Promo2Since"]).dt.days

In [248]:
for df in (joined, joined_test):
    df.loc[df.Promo2Days < 0, "Promo2Days"] = 0
    df.loc[df.Promo2SinceYear < 1990, "Promo2Days"] = 0
    df["Promo2Weeks"] = df["Promo2Days"] // 7
    df.loc[df.Promo2Weeks < 0, "Promo2Weeks"] = 0
    df.loc[df.Promo2Weeks > 25, "Promo2Weeks"] = 25
    df.Promo2Weeks.unique()

NOTE: make sure joined or joined_test is loaded into memory (either reinitialized from above, or loaded from disk if saved as below) when switching between train and test datasets.

Actually, you'll probably need both of them read in anyway if you're going to run some of the cells further below together.


In [249]:
joined.to_feather(f'{PATH}joined')
joined_test.to_feather(f'{PATH}joined_test')

In [171]:
joined = pd.read_feather(f'{PATH}joined')
joined_test = pd.read_feather(f'{PATH}joined_test')

1.3 Durations

It's common when working with time series data to extract data that explains relationships across rows as opposed to columns, e.g.:

  • Running averages
  • Time until next event
  • Time since last event

This is often difficult to do with most table manipulation frameworks, since they're designed to work with relationships across columns. As such, we've created a class to handle this type of data.

We'll define a function get_elapsed for cumulative counting across a sorted DataFrame. Given a particular field fld to monitor, this function will start tracking time since the last occurrence of that field. When the field is seen again, the counter is set to zero.

Upon initialization, this will result in datetime NA's until the field is encountered. This is reset every time a new store is seen. We'll see how to use this shortly.


In [250]:
def get_elapsed(fld, pre):
    day1 = np.timedelta64(1, 'D')
    last_date = np.datetime64()    # NaT until the first occurrence of fld is seen
    last_store = 0
    res = []
    
    # NOTE: operates on the global `df` defined in the cells below
    for s,v,d in zip(df.Store.values, df[fld].values, df.Date.values):
        if s != last_store:        # new store: reset the counter
            last_date = np.datetime64()
            last_store = s
        if v: last_date = d        # field seen: restart elapsed time from this date
        res.append(((d - last_date).astype('timedelta64[D]') / day1).astype(int))
    df[pre + fld] = res
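
To make the behavior concrete, here's a minimal, self-contained sketch of the same logic on a made-up frame (my own example; get_elapsed above does the same thing, but reads from and writes to the global df):

import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'Store': [1, 1, 1, 1, 2, 2],
    'Date':  pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-03',
                             '2015-01-04', '2015-01-01', '2015-01-02']),
    'SchoolHoliday': [1, 0, 0, 1, 0, 1],
}).sort_values(['Store', 'Date'])

def elapsed_since(frame, fld):
    day1 = np.timedelta64(1, 'D')
    last_date, last_store, out = np.datetime64(), 0, []   # NaT until the first event
    for s, v, d in zip(frame.Store.values, frame[fld].values, frame.Date.values):
        if s != last_store: last_date, last_store = np.datetime64(), s  # new store: reset
        if v: last_date = d                                             # event seen: restart clock
        out.append(((d - last_date).astype('timedelta64[D]') / day1).astype(int))
    return out

demo['AfterSchoolHoliday'] = elapsed_since(demo, 'SchoolHoliday')
# store 1: 0, 1, 2, 0 days since the last school holiday; store 2's first row has no prior
# event, so the NaT arithmetic yields a huge negative sentinel (the same values show up
# in df.head() further down).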

We'll be applying this to a subset of columns:

NOTE: You must rerun the cell below before running either df = train[columns] or df = test[columns], i.e. before switching between train and test datasets, since columns is redefined further below for the purpose of converting NaNs to zeros.


In [253]:
columns = ["Date", "Store", "Promo", "StateHoliday", "SchoolHoliday"]

NOTE: when running on the train set, use df = train[columns]; when running on the test set, use df = test[columns]. I don't know yet whether you have to run all the cells afterwards, but I think you do.


In [33]:
df = train[columns]

In [254]:
df = test[columns]

Let's walk through an example:

Say we're looking at SchoolHoliday. We'll first sort by Store, then Date, and then call get_elapsed('SchoolHoliday', 'After'). This:

  • is applied to every row of the DataFrame in order of store and date
  • adds to the DataFrame the days since last seeing a SchoolHoliday
  • counts the days until the next holiday if we sort in the other direction

In [255]:
fld = 'SchoolHoliday'
df = df.sort_values(['Store', 'Date'])
get_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
get_elapsed(fld, 'Before')

We'll do this for two more fields:


In [256]:
fld = 'StateHoliday'
df = df.sort_values(['Store', 'Date'])
get_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
get_elapsed(fld, 'Before')

In [257]:
fld = 'Promo'
df = df.sort_values(['Store', 'Date'])
get_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
get_elapsed(fld, 'Before')

We're going to set the active index to Date


In [258]:
df = df.set_index("Date")

Then set Null values from elapsed field calculations to 0.


In [259]:
columns = ['SchoolHoliday', 'StateHoliday', 'Promo']

In [260]:
for o in ['Before', 'After']:
    for p in columns:
        a = o + p
        df[a] = df[a].fillna(0)

Next we'll demonstrate window functions in Pandas to calculate rolling quantities.

Here we're sorting by date (sort_index()) and counting the number of events of interest (sum()) defined in columns over the following week (rolling()), grouped by Store (groupby()). We do the same in the opposite direction.
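
As a toy illustration of the pattern (my own made-up frame, not from the notebook):

import pandas as pd

toy = pd.DataFrame({'Store': [1]*5, 'Promo': [1, 0, 1, 1, 0]},
                   index=pd.date_range('2015-01-01', periods=5))
toy.groupby('Store').Promo.rolling(3, min_periods=1).sum()
# -> 1, 1, 2, 2, 2 : the number of promo days in the trailing 3-day window for that store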


In [261]:
bwd = df[['Store']+columns].sort_index().groupby("Store").rolling(7, min_periods=1).sum()

In [262]:
fwd = df[['Store']+columns].sort_index(ascending=False
                                      ).groupby("Store").rolling(7, min_periods=1).sum()

Next we want to drop the Store indices grouped together in the window function.

Often in Pandas, there's an option to do this in place. This is time and memory efficient when working with large datasets.


In [263]:
bwd.drop('Store',1,inplace=True)
bwd.reset_index(inplace=True)

In [264]:
fwd.drop('Store',1,inplace=True)
fwd.reset_index(inplace=True)

In [265]:
df.reset_index(inplace=True)

Now we'll merge these values onto the df.


In [266]:
df = df.merge(bwd, 'left', ['Date', 'Store'], suffixes=['', '_bw'])
df = df.merge(fwd, 'left', ['Date', 'Store'], suffixes=['', '_fw'])

In [267]:
df.drop(columns,1,inplace=True)

In [268]:
df.head()


Out[268]:
Date Store AfterSchoolHoliday BeforeSchoolHoliday AfterStateHoliday BeforeStateHoliday AfterPromo BeforePromo SchoolHoliday_bw StateHoliday_bw Promo_bw SchoolHoliday_fw StateHoliday_fw Promo_fw
0 2015-09-17 1 13 -9223372036854775808 -9223372036854775808 -9223372036854775808 0 0 0.0 0.0 4.0 0.0 0.0 1.0
1 2015-09-16 1 12 -9223372036854775808 -9223372036854775808 -9223372036854775808 0 0 0.0 0.0 3.0 0.0 0.0 2.0
2 2015-09-15 1 11 -9223372036854775808 -9223372036854775808 -9223372036854775808 0 0 0.0 0.0 2.0 0.0 0.0 3.0
3 2015-09-14 1 10 -9223372036854775808 -9223372036854775808 -9223372036854775808 0 0 0.0 0.0 1.0 0.0 0.0 4.0
4 2015-09-13 1 9 -9223372036854775808 -9223372036854775808 -9223372036854775808 9 -1 0.0 0.0 0.0 0.0 0.0 4.0

It's usually a good idea to back up large tables of extracted / wrangled features before you join them onto another one; that way you can go back to them easily if you need to make changes.


In [48]:
df.to_feather(f'{PATH}df')

In [272]:
df.to_feather(f'{PATH}df_test')

In [ ]:
# df = pd.read_feather(f'{PATH}df', index_col=0)

In [273]:
df["Date"] = pd.to_datetime(df.Date)

In [274]:
df.columns


Out[274]:
Index(['Date', 'Store', 'AfterSchoolHoliday', 'BeforeSchoolHoliday',
       'AfterStateHoliday', 'BeforeStateHoliday', 'AfterPromo', 'BeforePromo',
       'SchoolHoliday_bw', 'StateHoliday_bw', 'Promo_bw', 'SchoolHoliday_fw',
       'StateHoliday_fw', 'Promo_fw'],
      dtype='object')

NOTE: you'll get a Buffer dtype mismatch ValueError here unless you have joined or joined_test loaded in from above. (A note for when rerunning this notebook and switching between test and train datasets.)


In [150]:
joined = join_df(joined, df, ['Store', 'Date'])

In [275]:
joined_test = join_df(joined_test, df, ['Store', 'Date'])

The authors also removed all instances where the store had zero sales / was closed. We speculate that this may have cost them a higher standing in the competition. One reason this may be the case is that a little exploratory data analysis reveals that there are often periods where stores are closed, typically for refurbishment. Before and after these periods, there are naturally spikes in sales, as one might expect. By omitting this data from their training, the authors gave up the ability to leverage information about these periods to predict this otherwise volatile behavior.


In [279]:
joined = joined[joined.Sales != 0]

We'll back this up as well


In [285]:
joined.reset_index(inplace=True)
joined_test.reset_index(inplace=True)
# so `` ValueError: cannot insert level_0, already exists `` just means I've already done this once, eh?


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-285-518289b81130> in <module>()
----> 1 joined.reset_index(inplace=True)
      2 joined_test.reset_index(inplace=True)
      3 # so `` ValueError: cannot insert level_0, already exists `` just means Ive already done this once eh?

~/miniconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/frame.py in reset_index(self, level, drop, inplace, col_level, col_fill)
   3377                 # to ndarray and maybe infer different dtype
   3378                 level_values = _maybe_casted_values(lev, lab)
-> 3379                 new_obj.insert(0, name, level_values)
   3380 
   3381         new_obj.index = new_index

~/miniconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/frame.py in insert(self, loc, column, value, allow_duplicates)
   2611         value = self._sanitize_column(column, value, broadcast=False)
   2612         self._data.insert(loc, column, value,
-> 2613                           allow_duplicates=allow_duplicates)
   2614 
   2615     def assign(self, **kwargs):

~/miniconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/internals.py in insert(self, loc, item, value, allow_duplicates)
   4056         if not allow_duplicates and item in self.items:
   4057             # Should this be a different kind of error??
-> 4058             raise ValueError('cannot insert {}, already exists'.format(item))
   4059 
   4060         if not isinstance(loc, int):

ValueError: cannot insert level_0, already exists

In [282]:
joined.to_feather(f'{PATH}joined')
joined_test.to_feather(f'{PATH}joined_test')

We now have our final set of engineered features.

While these steps were explicitly outlined in the paper, these are all fairly typical feature engineering steps for dealing with time series data and are practical in any similar setting.

1.4 Create Features


In [286]:
joined = pd.read_feather(f'{PATH}joined')
joined_test = pd.read_feather(f'{PATH}joined_test')

In [287]:
joined.head().T.head(40)


Out[287]:
0 1 2 3 4
level_0 0 1 2 3 4
index 0 1 2 3 4
Store 1 2 3 4 5
DayOfWeek 5 5 5 5 5
Date 2015-07-31 00:00:00 2015-07-31 00:00:00 2015-07-31 00:00:00 2015-07-31 00:00:00 2015-07-31 00:00:00
Sales 5263 6064 8314 13995 4822
Customers 555 625 821 1498 559
Open 1 1 1 1 1
Promo 1 1 1 1 1
StateHoliday False False False False False
SchoolHoliday 1 1 1 1 1
Year 2015 2015 2015 2015 2015
Month 7 7 7 7 7
Week 31 31 31 31 31
Day 31 31 31 31 31
Dayofweek 4 4 4 4 4
Dayofyear 212 212 212 212 212
Is_month_end True True True True True
Is_month_start False False False False False
Is_quarter_end False False False False False
Is_quarter_start False False False False False
Is_year_end False False False False False
Is_year_start False False False False False
Elapsed 1438300800 1438300800 1438300800 1438300800 1438300800
StoreType c a a c a
Assortment a a a c a
CompetitionDistance 1270 570 14130 620 29910
CompetitionOpenSinceMonth 9 11 12 9 4
CompetitionOpenSinceYear 2008 2007 2006 2009 2015
Promo2 0 1 1 0 0
Promo2SinceWeek 1 13 14 1 1
Promo2SinceYear 1900 2010 2011 1900 1900
PromoInterval None Jan,Apr,Jul,Oct Jan,Apr,Jul,Oct None None
State HE TH NW BE SN
file Rossmann_DE_HE Rossmann_DE_TH Rossmann_DE_NW Rossmann_DE_BE Rossmann_DE_SN
week 2015-08-02 - 2015-08-08 2015-08-02 - 2015-08-08 2015-08-02 - 2015-08-08 2015-08-02 - 2015-08-08 2015-08-02 - 2015-08-08
trend 85 80 86 74 82
file_DE Rossmann_DE Rossmann_DE Rossmann_DE Rossmann_DE Rossmann_DE
week_DE 2015-08-02 - 2015-08-08 2015-08-02 - 2015-08-08 2015-08-02 - 2015-08-08 2015-08-02 - 2015-08-08 2015-08-02 - 2015-08-08
trend_DE 83 83 83 83 83

In [288]:
joined_test.columns


Out[288]:
Index(['level_0', 'index', 'Id', 'Store', 'DayOfWeek', 'Date', 'Open', 'Promo',
       'StateHoliday', 'SchoolHoliday',
       ...
       'AfterStateHoliday_y', 'BeforeStateHoliday_y', 'AfterPromo_y',
       'BeforePromo_y', 'SchoolHoliday_bw_y', 'StateHoliday_bw_y',
       'Promo_bw_y', 'SchoolHoliday_fw_y', 'StateHoliday_fw_y', 'Promo_fw_y'],
      dtype='object', length=105)

In [289]:
joined.columns


Out[289]:
Index(['level_0', 'index', 'Store', 'DayOfWeek', 'Date', 'Sales', 'Customers',
       'Open', 'Promo', 'StateHoliday', 'SchoolHoliday', 'Year', 'Month',
       'Week', 'Day', 'Dayofweek', 'Dayofyear', 'Is_month_end',
       'Is_month_start', 'Is_quarter_end', 'Is_quarter_start', 'Is_year_end',
       'Is_year_start', 'Elapsed', 'StoreType', 'Assortment',
       'CompetitionDistance', 'CompetitionOpenSinceMonth',
       'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek',
       'Promo2SinceYear', 'PromoInterval', 'State', 'file', 'week', 'trend',
       'file_DE', 'week_DE', 'trend_DE', 'Date_DE', 'State_DE', 'Month_DE',
       'Day_DE', 'Dayofweek_DE', 'Dayofyear_DE', 'Is_month_end_DE',
       'Is_month_start_DE', 'Is_quarter_end_DE', 'Is_quarter_start_DE',
       'Is_year_end_DE', 'Is_year_start_DE', 'Elapsed_DE', 'Max_TemperatureC',
       'Mean_TemperatureC', 'Min_TemperatureC', 'Dew_PointC', 'MeanDew_PointC',
       'Min_DewpointC', 'Max_Humidity', 'Mean_Humidity', 'Min_Humidity',
       'Max_Sea_Level_PressurehPa', 'Mean_Sea_Level_PressurehPa',
       'Min_Sea_Level_PressurehPa', 'Max_VisibilityKm', 'Mean_VisibilityKm',
       'Min_VisibilitykM', 'Max_Wind_SpeedKm_h', 'Mean_Wind_SpeedKm_h',
       'Max_Gust_SpeedKm_h', 'Precipitationmm', 'CloudCover', 'Events',
       'WindDirDegrees', 'StateName', 'CompetitionOpenSince',
       'CompetitionDaysOpen', 'CompetitionMonthsOpen', 'Promo2Since',
       'Promo2Days', 'Promo2Weeks'],
      dtype='object')

Now that we've engineered all our features, we need to convert to input compatible with a neural network.

This includes converting categorical variables into contiguous integers or one-hot encodings, normalizing continuous features to standard normal, etc.
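
As a rough sketch of the two conversions (my own illustration; proc_df and apply_cats below do this properly, and also handle missing values and keep the train/test mappings consistent):

# illustration only, using one column of each type
store_type = joined['StoreType'].astype('category')
store_type_codes = store_type.cat.codes + 1      # contiguous integer codes, with 0 left for missing

cd = joined['CompetitionDistance'].astype('float32')
cd_standardized = (cd - cd.mean()) / cd.std()    # scaled to roughly zero mean, unit variance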


In [290]:
cat_vars = ['Store', 'DayOfWeek', 'Year', 'Month', 'Day', 'StateHoliday', 'CompetitionMonthsOpen',
    'Promo2Weeks', 'StoreType', 'Assortment', 'PromoInterval', 'CompetitionOpenSinceYear', 'Promo2SinceYear',
    'State', 'Week', 'Events', 'Promo_fw', 'Promo_bw', 'StateHoliday_fw', 'StateHoliday_bw',
    'SchoolHoliday_fw', 'SchoolHoliday_bw']

contin_vars = ['CompetitionDistance', 'Max_TemperatureC', 'Mean_TemperatureC', 'Min_TemperatureC',
   'Max_Humidity', 'Mean_Humidity', 'Min_Humidity', 'Max_Wind_SpeedKm_h', 
   'Mean_Wind_SpeedKm_h', 'CloudCover', 'trend', 'trend_DE',
   'AfterStateHoliday', 'BeforeStateHoliday', 'Promo', 'SchoolHoliday']

n = len(joined); n


Out[290]:
844338

In [291]:
dep = 'Sales'
joined_test[dep] = 0
joined = joined[cat_vars + contin_vars + [dep, 'Date']].copy()
joined_test = joined_test[cat_vars + contin_vars + [dep, 'Date', 'Id']].copy()


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-291-f329469a9f7e> in <module>()
      1 dep = 'Sales'
      2 joined_test[dep] = 0
----> 3 joined = joined[cat_vars + contin_vars + [dep, 'Date']].copy()
      4 joined_test = joined_test[cat_vars + contin_vars + [dep, 'Date', 'Id']].copy()

~/miniconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2131         if isinstance(key, (Series, np.ndarray, Index, list)):
   2132             # either boolean or fancy integer index
-> 2133             return self._getitem_array(key)
   2134         elif isinstance(key, DataFrame):
   2135             return self._getitem_frame(key)

~/miniconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_array(self, key)
   2175             return self._take(indexer, axis=0, convert=False)
   2176         else:
-> 2177             indexer = self.loc._convert_to_indexer(key, axis=1)
   2178             return self._take(indexer, axis=1, convert=True)
   2179 

~/miniconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/indexing.py in _convert_to_indexer(self, obj, axis, is_setter)
   1267                 if mask.any():
   1268                     raise KeyError('{mask} not in index'
-> 1269                                    .format(mask=objarr[mask]))
   1270 
   1271                 return _values_from_object(indexer)

KeyError: "['Promo_fw' 'Promo_bw' 'StateHoliday_fw' 'StateHoliday_bw' 'SchoolHoliday_fw' 'SchoolHoliday_bw'\n 'AfterStateHoliday' 'BeforeStateHoliday'] not in index"

In [197]:
for v in cat_vars: joined[v] = joined[v].astype('category').cat.as_ordered()

In [198]:
apply_cats(joined_test, joined)

In [199]:
for v in contin_vars:
    joined[v] = joined[v].astype('float32')
    joined_test[v] = joined_test[v].astype('float32')

We're going to run on a sample:


In [64]:
idxs = get_cv_idxs(n, val_pct = 150000 / n)
joined_samp = joined.iloc[idxs].set_index("Date")
samp_size = len(joined_samp); samp_size


Out[64]:
150000

To run on the full dataset, use this instead:


In [86]:
samp_size = n
joined_samp = joined.set_index("Date")

We can now process our data...


In [87]:
joined_samp.head(2)


Out[87]:
Store DayOfWeek Year Month Day StateHoliday CompetitionMonthsOpen Promo2Weeks StoreType Assortment ... Max_Wind_SpeedKm_h Mean_Wind_SpeedKm_h CloudCover trend trend_DE AfterStateHoliday BeforeStateHoliday Promo SchoolHoliday Sales
Date
2015-07-31 1 5 2015 7 31 False 24 0 c a ... 24.0 11.0 1.0 85.0 83.0 57.0 -9.223372e+18 1.0 1.0 5263
2015-07-31 2 5 2015 7 31 False 24 25 a a ... 14.0 11.0 4.0 80.0 83.0 67.0 -9.223372e+18 1.0 1.0 6064

2 rows × 39 columns


In [88]:
df, y, nas, mapper = proc_df(joined_samp, 'Sales', do_scale=True)
yl = np.log(y)
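
The four return values are the processed DataFrame (categoricals numericalized, continuous fields scaled), the dependent variable array, a dict recording which columns had missing values (nas), and the fitted scaler (mapper). As I understand the fastai 0.7 helper, passing nas and mapper back in when processing the test set (as in the cell below) keeps the test preprocessing consistent with training.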

In [165]:
joined_test = joined_test.set_index("Date")

In [200]:
df_test, _, nas, mapper, = proc_df(joined_test, 'Sales', do_scale=True, skip_flds=['Id'], 
                                   mapper=mapper, na_dict=nas)


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-200-705c063ce1b2> in <module>()
      1 df_test, _, nas, mapper, = proc_df(joined_test, 'Sales', do_scale=True, skip_flds=['Id'], 
----> 2                                    mapper=mapper, na_dict=nas)

~/Kaukasos/FADL1/fastai/structured.py in proc_df(df, y_fld, skip_flds, do_scale, na_dict, preproc_fn, max_n_cat, subset, mapper)
    425     if na_dict is None: na_dict = {}
    426     for n,c in df.items(): na_dict = fix_missing(df, c, n, na_dict)
--> 427     if do_scale: mapper = scale_vars(df, mapper)
    428     for n,c in df.items(): numericalize(df, c, n, max_n_cat)
    429     res = [pd.get_dummies(df, dummy_na=True), y, na_dict]

~/Kaukasos/FADL1/fastai/structured.py in scale_vars(df, mapper)
    322         map_f = [([n],StandardScaler()) for n in df.columns if is_numeric_dtype(df[n])]
    323         mapper = DataFrameMapper(map_f).fit(df)
--> 324     df[mapper.transformed_names_] = mapper.transform(df)
    325     return mapper
    326 

~/miniconda3/envs/fastai/lib/python3.6/site-packages/sklearn_pandas/dataframe_mapper.py in transform(self, X)
    277             if transformers is not None:
    278                 with add_column_names_to_exception(columns):
--> 279                     Xt = transformers.transform(Xt)
    280             extracted.append(_handle_feature(Xt))
    281 

~/miniconda3/envs/fastai/lib/python3.6/site-packages/sklearn/preprocessing/data.py in transform(self, X, y, copy)
    679         copy = copy if copy is not None else self.copy
    680         X = check_array(X, accept_sparse='csr', copy=copy, warn_on_dtype=True,
--> 681                         estimator=self, dtype=FLOAT_DTYPES)
    682 
    683         if sparse.issparse(X):

~/miniconda3/envs/fastai/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    451                              % (array.ndim, estimator_name))
    452         if force_all_finite:
--> 453             _assert_all_finite(array)
    454 
    455     shape_repr = _shape_repr(array.shape)

~/miniconda3/envs/fastai/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
     42             and not np.isfinite(X).all()):
     43         raise ValueError("Input contains NaN, infinity"
---> 44                          " or a value too large for %r." % X.dtype)
     45 
     46 

ValueError: ['AfterStateHoliday']: Input contains NaN, infinity or a value too large for dtype('float32').

In [89]:
df.head(2)


Out[89]:
Store DayOfWeek Year Month Day StateHoliday CompetitionMonthsOpen Promo2Weeks StoreType Assortment ... Mean_Wind_SpeedKm_h CloudCover trend trend_DE AfterStateHoliday BeforeStateHoliday Promo SchoolHoliday CompetitionDistance_na CloudCover_na
Date
2015-07-31 1 5 3 7 31 1 25 1 3 1 ... -0.142774 -2.805667 1.732492 1.724334 0.0 -0.0 1.113717 2.04105 -0.050948 -0.29609
2015-07-31 2 5 3 7 31 1 25 26 1 1 ... -0.142774 -0.961766 1.294578 1.724334 0.0 -0.0 1.113717 2.04105 -0.050948 -0.29609

2 rows × 40 columns

In time series data, cross-validation is not random. Instead, our holdout data is generally the most recent data, as it would be in a real application. This issue is discussed in detail in this post on our website.

One approach is to take the last 25% of rows (sorted by date) as our validation set.


In [90]:
train_ratio = 0.75
# train_ratio = 0.90
train_size = int(samp_size * train_ratio); train_size
val_idx = list(range(train_size, len(df)))

An even better option for picking a validation set is using the exact same length of time period as the test set uses - this is implemented here:


In [91]:
val_idx = np.flatnonzero(
    (df.index<=datetime.datetime(2014,9,17)) & (df.index>=datetime.datetime(2014,8,1)))

In [ ]:
# val_idx=[0]

1.5 DL

We're ready to put together our models.

Root-Mean-Squared percent error is the metric Kaggle used for this competition.
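
For reference, the metric is RMSPE = sqrt(mean(((y - y_pred) / y)^2)), computed on actual sales. Since we train on log(sales), exp_rmspe below exponentiates predictions back to sales space before computing the percentage error, and y_range constrains predictions to (0, 1.2 * max log-sales).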


In [92]:
def inv_y(a): return np.exp(a)

def exp_rmspe(y_pred, targ):
    targ = inv_y(targ)
    pct_var = (targ - inv_y(y_pred))/targ
    return math.sqrt((pct_var**2).mean())

max_log_y = np.max(yl)
y_range = (0, max_log_y*1.2)

We can create a ModelData object directly from our DataFrame.


In [93]:
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl, cat_flds=cat_vars, bs=128)

# md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl, cat_flds=cat_vars, bs=128, 
#                                        test_df=df_test)

Some categorical variables have a lot more levels than others. Store, in particular, has over a thousand!


In [94]:
cat_sz = [(c, len(joined_samp[c].cat.categories)+1) for c in cat_vars]

In [95]:
cat_sz


Out[95]:
[('Store', 1116),
 ('DayOfWeek', 8),
 ('Year', 4),
 ('Month', 13),
 ('Day', 32),
 ('StateHoliday', 3),
 ('CompetitionMonthsOpen', 26),
 ('Promo2Weeks', 27),
 ('StoreType', 5),
 ('Assortment', 4),
 ('PromoInterval', 4),
 ('CompetitionOpenSinceYear', 24),
 ('Promo2SinceYear', 9),
 ('State', 13),
 ('Week', 53),
 ('Events', 22),
 ('Promo_fw', 7),
 ('Promo_bw', 7),
 ('StateHoliday_fw', 4),
 ('StateHoliday_bw', 4),
 ('SchoolHoliday_fw', 9),
 ('SchoolHoliday_bw', 9)]

We use the cardinality of each variable (the number of unique values) to decide how large to make its embeddings. Each level will be associated with a vector with length defined as below.
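
For example, Store has 1115 levels plus one reserved for unknown/missing (1116 in cat_sz above), so its embedding width is min(50, (1116+1)//2) = 50; DayOfWeek has 8, giving min(50, (8+1)//2) = 4.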


In [96]:
emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_sz]

In [97]:
emb_szs


Out[97]:
[(1116, 50),
 (8, 4),
 (4, 2),
 (13, 7),
 (32, 16),
 (3, 2),
 (26, 13),
 (27, 14),
 (5, 3),
 (4, 2),
 (4, 2),
 (24, 12),
 (9, 5),
 (13, 7),
 (53, 27),
 (22, 11),
 (7, 4),
 (7, 4),
 (4, 2),
 (4, 2),
 (9, 5),
 (9, 5)]

In [98]:
m = md.get_learner(emb_szs, len(df.columns) - len(cat_vars),   # embedding sizes, # of continuous vars
                   0.04, 1, [1000, 500], [0.001, 0.01], y_range=y_range)  # emb dropout, output size, hidden layer sizes, hidden layer dropouts
lr = 1e-3

In [99]:
m.lr_find()


 60%|██████    | 3802/6297 [00:38<00:25, 99.69it/s, loss=0.364] 

In [100]:
m.sched.plot(100)


 60%|██████    | 3802/6297 [00:50<00:32, 75.97it/s, loss=0.364]

1.5.1 Sample


In [80]:
m = md.get_learner(emb_szs, len(df.columns) - len(cat_vars), 
                   0.04, 1, [1000,500], [0.001, 0.01], y_range=y_range)
lr = 1e-3

In [81]:
m.fit(lr, 3, metrics=[exp_rmspe])


[ 0.       0.0307   0.02693  0.15822]                            
[ 1.       0.0205   0.02718  0.15008]                            
[ 2.       0.01654  0.01893  0.13223]                            
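
(The columns printed by fit are: epoch, training loss, validation loss, and then each metric passed in; here that's exp_rmspe.)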


In [82]:
m.fit(lr, 5, metrics=[exp_rmspe], cycle_len=1)


[ 0.       0.01193  0.01488  0.1159 ]                            
[ 1.       0.01113  0.01431  0.11417]                            
[ 2.       0.01072  0.01398  0.11373]                            
[ 3.       0.00929  0.01401  0.11293]                             
[ 4.       0.00994  0.01385  0.11216]                             


In [83]:
m.fit(lr, 2, metrics=[exp_rmspe], cycle_len=4)


[ 0.       0.01181  0.01521  0.12101]                            
[ 1.       0.01026  0.01378  0.11835]                             
[ 2.       0.00838  0.01414  0.11314]                             
[ 3.       0.00702  0.01332  0.11125]                             
[ 4.       0.01005  0.01471  0.11704]                             
[ 5.       0.00934  0.01473  0.12419]                             
[ 6.       0.00648  0.01344  0.11299]                             
[ 7.       0.00587  0.01354  0.11398]                             

1.5.2 All


In [101]:
m = md.get_learner(emb_szs, len(df.columns) - len(cat_vars), 
                   0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)
lr = 1e-3

In [102]:
m.fit(lr, 3, metrics=[exp_rmspe])



[ 0.       0.01323  0.01364  0.11332]                            
[ 1.       0.0115   0.01293  0.10698]                            
[ 2.       0.00938  0.01132  0.10187]                             


In [103]:
m.fit(lr, 3, metrics=[exp_rmspe], cycle_len=1)


[ 0.       0.00675  0.0105   0.09742]                             
[ 1.       0.00711  0.01019  0.09696]                             
[ 2.       0.00783  0.01034  0.09656]                             

Test


In [ ]:
m = md.get_learner(emb_szs, len(df.columns) - len(cat_vars), 
                   0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)
lr = 1e-3

In [ ]:
m.fit(lr, 3, metrics=[exp_rmspe])

In [ ]:
m.fit(lr, 3, metrics=[exp_rmspe], cycle_len=1)

In [ ]:
m.save('val0')

In [ ]:
m.load('val0')

In [ ]:
x,y=m.predict_with_targs()

In [ ]:
exp_rmspe(x,y)

In [ ]:
pred_test=m.predict(True)

In [ ]:
pred_test = np.exp(pred_test)

In [ ]:
joined_test['Sales']=pred_test

In [ ]:
csv_fn=f'{PATH}tmp/sub.csv'

In [ ]:
FileLink(csv_fn)

1.6 RF


In [ ]:
from sklearn.ensemble import RandomForestRegressor

In [ ]:
((val,trn), (y_val,y_trn)) = split_by_idx(val_idx, df.values, yl)

In [ ]:
m = RandomForestRegressor(n_estimators=40, max_features=0.99, min_samples_leaf=2, 
                          n_jobs=-1, oob_score=True)
m.fit(trn, y_trn)

In [ ]:
preds = m.predict(val)
m.score(trn, y_trn), m.score(val, y_val), m.oob_score_, exp_rmspe(preds, y_val)
