Kaggle.com Sales in stormy weather.

(From http://www.kaggle.com/c/walmart-recruiting-sales-in-stormy-weather)

Walmart operates 11,450 stores in 27 countries, managing inventory across varying climates and cultures. Extreme weather events, like hurricanes, blizzards, and floods, can have a huge impact on sales at the store and product level.

In their second Kaggle recruiting competition, Walmart challenges participants to accurately predict the sales of 111 potentially weather-sensitive products (like umbrellas, bread, and milk) around the time of major weather events at 45 of their retail locations.

Intuitively, we may expect an uptick in the sales of umbrellas before a big thunderstorm, but it's difficult for replenishment managers to correctly predict the level of inventory needed to avoid being out-of-stock or overstock during and after that storm. Walmart relies on a variety of vendor tools to predict sales around extreme weather events, but it's an ad-hoc and time-consuming process that lacks a systematic measure of effectiveness.

Helping Walmart better predict sales of weather-sensitive products will keep valued customers out of the rain. It could also earn you a position at one of the most data-driven retailers in the world!

Data

(From http://www.kaggle.com/c/walmart-recruiting-sales-in-stormy-weather/data)

You have been provided with sales data for 111 products whose sales may be affected by the weather (such as milk, bread, umbrellas, etc.). These 111 products are sold in stores at 45 different Walmart locations. Some of the products may be a similar item (such as milk) but have a different id in different stores/regions/suppliers. The 45 locations are covered by 20 weather stations (i.e. some of the stores are nearby and share a weather station).

The competition task is to predict the amount of each product sold around the time of major weather events. For the purposes of this competition, we have defined a weather event as any day in which more than an inch of rain or two inches of snow was observed. You are asked to predict the units sold for a window of ±3 days surrounding each storm.

The following graphic shows the layout of the test windows. The blue dots are the training set days, the red dots are the test set days, and the event=True are the days with storms. Note that this plot is for the 20 weather stations. All days prior to 2013-04-01 are given out as training data.

You are provided with the full observed weather covering the entire data set. You do not need to forecast weather in addition to sales (it's as though you have a perfect weather forecast at your disposal).

You will not be provided with more information about the products, store locations, or other details.
Because the storms occur at variable times and in variable locations, use the test set file (or sample submission) as your guide to know which days and stores you must forecast.
The sales data does not capture the difference between the stock and the demand. In other words, sales number 0 doesn't necessarily mean there was no demand for this product; it may mean it was in stock but none were sold, or it could mean that the product was out of stock, or discontinued and not available.

Field descriptions

date - the day of sales or weather
store_nbr - an id representing one of the 45 stores
station_nbr - an id representing one of 20 weather stations
item_nbr - an id representing one of the 111 products
units - the quantity sold of an item on a given day
id - a triplet representing a store_nbr, item_nbr, and date. Form the id by concatenating these (in that order) with an underscore. E.g. "2_1_2013-04-01" represents store 2, item 1, sold on 2013-04-01.

File descriptions

key.csv - the relational mapping between stores and the weather stations that cover them
sampleSubmission.csv - file that gives the prediction format
train.csv - sales data for all stores & dates in the training set
test.csv - stores & dates for forecasting (missing 'units', which you must predict) NOTE: This file has been encrypted. To get the password, please fill out Walmart's Recruiting Survey
weather.csv - a file containing the NOAA weather information for each station and day
noaa_weather_qclcd_documentation.pdf - a guide to understand the data provided in the weather.csv file

Cleaning, merging.



In [21]:

    
#importing libraries
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os

weather measurements With respect to the noaa_weather_qclcd_documentation.pdf:

M - Missing data, or data that is normally nor supported by the station
"*" - not supported by the station
"-" - values is not availible for sunset/sunrise



In [22]:

    
weather = pd.read_csv(os.path.join("data", "weather.csv"), na_values=["M", "-", "*"])



In [23]:

    
weather.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 20517 entries, 0 to 20516
Data columns (total 20 columns):
station_nbr    20517 non-null int64
date           20517 non-null object
tmax           19611 non-null float64
tmin           19609 non-null float64
tavg           19048 non-null float64
depart         9006 non-null float64
dewpoint       19851 non-null float64
wetbulb        19265 non-null float64
heat           19048 non-null float64
cool           19048 non-null float64
sunrise        10861 non-null float64
sunset         10861 non-null float64
codesum        20517 non-null object
snowfall       13293 non-null object
preciptotal    19657 non-null object
stnpressure    19588 non-null float64
sealevel       18793 non-null float64
resultspeed    19928 non-null float64
resultdir      19928 non-null float64
avgspeed       19642 non-null float64
dtypes: float64(15), int64(1), object(4)



In [24]:

    
weather.head(15)









    Out[24]:






  
    
      
      station_nbr
      date
      tmax
      tmin
      tavg
      depart
      dewpoint
      wetbulb
      heat
      cool
      sunrise
      sunset
      codesum
      snowfall
      preciptotal
      stnpressure
      sealevel
      resultspeed
      resultdir
      avgspeed
    
  
  
    
      0 
        1
       2012-01-01
       52
       31
       42
      NaN
       36
       40
       23
        0
       NaN
        NaN
           RA FZFG BR
       NaN
       0.05
       29.78
       29.92
        3.6
       20
        4.6
    
    
      1 
        2
       2012-01-01
       48
       33
       41
       16
       37
       39
       24
        0
       716
       1626
                   RA
       0.0
       0.07
       28.82
       29.91
        9.1
       23
       11.3
    
    
      2 
        3
       2012-01-01
       55
       34
       45
        9
       24
       36
       20
        0
       735
       1720
                     
       0.0
       0.00
       29.77
       30.47
        9.9
       31
       10.0
    
    
      3 
        4
       2012-01-01
       63
       47
       55
        4
       28
       43
       10
        0
       728
       1742
                     
       0.0
       0.00
       29.79
       30.48
        8.0
       35
        8.2
    
    
      4 
        6
       2012-01-01
       63
       34
       49
        0
       31
       43
       16
        0
       727
       1742
                     
       0.0
       0.00
       29.95
       30.47
       14.0
       36
       13.8
    
    
      5 
        7
       2012-01-01
       50
       33
       42
      NaN
       26
       35
       23
        0
       NaN
        NaN
                     
       0.0
       0.00
       29.15
       30.54
       10.3
       32
       10.2
    
    
      6 
        8
       2012-01-01
       66
       45
      NaN
      NaN
       34
       46
      NaN
      NaN
       NaN
        NaN
                RA BR
       NaN
       0.00
       30.05
         NaN
       11.0
       36
       10.9
    
    
      7 
        9
       2012-01-01
       34
       19
       27
      NaN
       17
       23
       38
        0
       NaN
        NaN
                   UP
       NaN
          T
       29.34
       30.09
       22.8
       30
       22.5
    
    
      8 
       10
       2012-01-01
       73
       53
       63
      NaN
       55
       58
        2
        0
       723
       1738
            FG+ FG BR
       NaN
       0.00
       30.16
       30.19
        5.1
       24
        5.5
    
    
      9 
       11
       2012-01-01
       72
       48
       60
        7
       54
       56
        5
        0
       724
       1737
            FG+ FG BR
       0.0
       0.00
       30.15
       30.18
        4.6
       23
        4.8
    
    
      10
       12
       2012-01-01
       72
       48
       60
      NaN
       54
       57
        5
        0
       NaN
        NaN
            FG+ FG BR
       0.0
       0.00
       30.15
       30.19
        4.4
       25
        4.8
    
    
      11
       13
       2012-01-01
       38
       18
       28
      NaN
       12
       21
       37
        0
       NaN
        NaN
                     
       NaN
       0.00
       24.60
       30.59
        2.5
       16
        6.1
    
    
      12
       14
       2012-01-01
       50
       34
       42
        5
       25
       35
       23
        0
       739
       1729
                     
       0.0
       0.00
       29.13
       30.52
       11.4
       32
       11.3
    
    
      13
       15
       2012-01-01
       48
       26
       37
       16
       35
       38
       28
        0
       718
       1621
       RA FG+ FZFG BR
       0.0
       0.09
       29.53
       29.89
        2.5
       17
        3.8
    
    
      14
       16
       2012-01-01
       50
       30
       40
      NaN
       37
       40
       25
        0
       NaN
        NaN
           RA MIFG BR
       NaN
          T
       29.80
       29.89
        4.2
       21
        6.8



In [25]:

    
#convert date column into the date format
#weather["date"] = weather["date"].map(pd.to_datetime)



In [26]:

    
weather.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 20517 entries, 0 to 20516
Data columns (total 20 columns):
station_nbr    20517 non-null int64
date           20517 non-null object
tmax           19611 non-null float64
tmin           19609 non-null float64
tavg           19048 non-null float64
depart         9006 non-null float64
dewpoint       19851 non-null float64
wetbulb        19265 non-null float64
heat           19048 non-null float64
cool           19048 non-null float64
sunrise        10861 non-null float64
sunset         10861 non-null float64
codesum        20517 non-null object
snowfall       13293 non-null object
preciptotal    19657 non-null object
stnpressure    19588 non-null float64
sealevel       18793 non-null float64
resultspeed    19928 non-null float64
resultdir      19928 non-null float64
avgspeed       19642 non-null float64
dtypes: float64(15), int64(1), object(4)



In [27]:

    
set(weather["snowfall"])









    Out[27]:





{nan,
 '  T',
 '0.0',
 '0.1',
 '0.2',
 '0.3',
 '0.4',
 '0.5',
 '0.6',
 '0.7',
 '0.8',
 '0.9',
 '1.0',
 '1.1',
 '1.2',
 '1.3',
 '1.4',
 '1.5',
 '1.6',
 '1.7',
 '1.8',
 '1.9',
 '10.5',
 '11.0',
 '12.2',
 '13.1',
 '13.5',
 '14.9',
 '15.2',
 '16.2',
 '2.0',
 '2.1',
 '2.2',
 '2.3',
 '2.4',
 '2.5',
 '2.6',
 '2.7',
 '2.8',
 '2.9',
 '3.0',
 '3.1',
 '3.2',
 '3.3',
 '3.4',
 '3.5',
 '3.6',
 '3.7',
 '3.8',
 '3.9',
 '4.0',
 '4.1',
 '4.2',
 '4.3',
 '4.4',
 '4.6',
 '4.7',
 '4.9',
 '5.0',
 '5.1',
 '5.2',
 '5.4',
 '5.5',
 '5.7',
 '5.8',
 '6.1',
 '6.2',
 '6.4',
 '6.6',
 '6.7',
 '7.0',
 '7.1',
 '7.2',
 '7.7',
 '7.8',
 '8.0',
 '9.3'}

T above will mean Trace that with respect to internet for snow will mean that is was some snow, but it was less than 0.1 inch. I will assume that T is equal to 0.01, just to distinguish it from 0.



In [28]:

    
def change_snowfall(x):
    if x == "  T":
        return 0.01    
    else:
        return float(x)
    
weather["snowfall"] = weather["snowfall"].map(change_snowfall)

Similar story with percipitation, although in this case T means less than 0.01 inch



In [29]:

    
def change_preciptotal(x):
    if x == "  T":
        return 0.001    
    else:
        return float(x)
    
weather["preciptotal"] = weather["preciptotal"].map(change_preciptotal)



In [30]:

    
weather.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 20517 entries, 0 to 20516
Data columns (total 20 columns):
station_nbr    20517 non-null int64
date           20517 non-null object
tmax           19611 non-null float64
tmin           19609 non-null float64
tavg           19048 non-null float64
depart         9006 non-null float64
dewpoint       19851 non-null float64
wetbulb        19265 non-null float64
heat           19048 non-null float64
cool           19048 non-null float64
sunrise        10861 non-null float64
sunset         10861 non-null float64
codesum        20517 non-null object
snowfall       13293 non-null float64
preciptotal    19657 non-null float64
stnpressure    19588 non-null float64
sealevel       18793 non-null float64
resultspeed    19928 non-null float64
resultdir      19928 non-null float64
avgspeed       19642 non-null float64
dtypes: float64(17), int64(1), object(2)

Now I need to work with "codesum" that describes summary of the weather. I would guess that it is most important feature.

Note: FG+ will give 1 to FG and also to the FG+ column



In [31]:

    
codesum_columns = set(' '.join(set(weather["codesum"])).strip().split())



In [32]:

    
codesum_columns









    Out[32]:





{'BCFG',
 'BLDU',
 'BLSN',
 'BR',
 'DU',
 'DZ',
 'FG',
 'FG+',
 'FU',
 'FZDZ',
 'FZFG',
 'FZRA',
 'GR',
 'GS',
 'HZ',
 'MIFG',
 'PL',
 'PRFG',
 'RA',
 'SG',
 'SN',
 'SQ',
 'TS',
 'TSRA',
 'TSSN',
 'UP',
 'VCFG',
 'VCTS'}



In [33]:

    
codesum = pd.DataFrame(index=weather.index, columns=codesum_columns)



In [34]:

    
for column in codesum.columns:
    for i in range(len(weather.index)):
        if column in weather["codesum"][i]:
            codesum[column][i] = 1



In [35]:

    
weather = weather.drop("codesum", 1)



In [36]:

    
weather = weather.join(codesum.fillna(0))



In [37]:

    
weather.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 20517 entries, 0 to 20516
Data columns (total 47 columns):
station_nbr    20517 non-null int64
date           20517 non-null object
tmax           19611 non-null float64
tmin           19609 non-null float64
tavg           19048 non-null float64
depart         9006 non-null float64
dewpoint       19851 non-null float64
wetbulb        19265 non-null float64
heat           19048 non-null float64
cool           19048 non-null float64
sunrise        10861 non-null float64
sunset         10861 non-null float64
snowfall       13293 non-null float64
preciptotal    19657 non-null float64
stnpressure    19588 non-null float64
sealevel       18793 non-null float64
resultspeed    19928 non-null float64
resultdir      19928 non-null float64
avgspeed       19642 non-null float64
HZ             20517 non-null int64
FU             20517 non-null int64
UP             20517 non-null int64
TSSN           20517 non-null int64
VCTS           20517 non-null int64
DZ             20517 non-null int64
BR             20517 non-null int64
FG             20517 non-null int64
BCFG           20517 non-null int64
DU             20517 non-null int64
FZRA           20517 non-null int64
TS             20517 non-null int64
RA             20517 non-null int64
PL             20517 non-null int64
GS             20517 non-null int64
GR             20517 non-null int64
FZDZ           20517 non-null int64
VCFG           20517 non-null int64
PRFG           20517 non-null int64
FG+            20517 non-null int64
TSRA           20517 non-null int64
FZFG           20517 non-null int64
BLDU           20517 non-null int64
MIFG           20517 non-null int64
SQ             20517 non-null int64
BLSN           20517 non-null int64
SN             20517 non-null int64
SG             20517 non-null int64
dtypes: float64(17), int64(29), object(1)



In [38]:

    
#Save modfied weather file
weather.to_csv(os.path.join("data", "weather_modified.csv"))



In [39]:

    
#read file that desribes correspondance between store_nbr and station_nbr
key = pd.read_csv(os.path.join("data", "key.csv"))



In [40]:

    
#read train set
training = pd.read_csv(os.path.join("data", "train.csv"))



In [41]:

    
training.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 4617600 entries, 0 to 4617599
Data columns (total 4 columns):
date         object
store_nbr    int64
item_nbr     int64
units        int64
dtypes: int64(3), object(1)



In [42]:

    
training = training.merge(key, on="store_nbr")



In [43]:

    
training.head()









    Out[43]:






  
    
      
      date
      store_nbr
      item_nbr
      units
      station_nbr
    
  
  
    
      0
       2012-01-01
       1
       1
       0
       1
    
    
      1
       2012-01-01
       1
       2
       0
       1
    
    
      2
       2012-01-01
       1
       3
       0
       1
    
    
      3
       2012-01-01
       1
       4
       0
       1
    
    
      4
       2012-01-01
       1
       5
       0
       1



In [44]:

    
weather.head()









    Out[44]:






  
    
      
      station_nbr
      date
      tmax
      tmin
      tavg
      depart
      dewpoint
      wetbulb
      heat
      cool
      ...
      PRFG
      FG+
      TSRA
      FZFG
      BLDU
      MIFG
      SQ
      BLSN
      SN
      SG
    
  
  
    
      0
       1
       2012-01-01
       52
       31
       42
      NaN
       36
       40
       23
       0
      ...
       0
       0
       0
       1
       0
       0
       0
       0
       0
       0
    
    
      1
       2
       2012-01-01
       48
       33
       41
       16
       37
       39
       24
       0
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
    
    
      2
       3
       2012-01-01
       55
       34
       45
        9
       24
       36
       20
       0
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
    
    
      3
       4
       2012-01-01
       63
       47
       55
        4
       28
       43
       10
       0
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
    
    
      4
       6
       2012-01-01
       63
       34
       49
        0
       31
       43
       16
       0
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
    
  

5 rows × 47 columns



In [ ]:

    
training = pd.merge(training, weather)



In [ ]:

    
#save training set with added weather conditions to file
training.to_csv(os.path.join("data", "training_modified.csv"))



In [ ]:

    
training.head()



In [216]:

    
#read test set
testing = pd.read_csv(os.path.join("data", "test.csv"))



In [217]:

    
testing = testing.merge(key, on="store_nbr")



In [218]:

    
testing = pd.merge(testing, weather)



In [219]:

    
#save testing set with added weather conditions to file
testing.to_csv(os.path.join("data", "testing_modified.csv"))

Filling missing values

First naive approach that will just fill NaN with mean over the column



In [1]:

    
for column in training.columns:
    print column
    mean_column = training[column].mean()
    #training[column] = training[column].fillna(mean_columns)









    



---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-abb1a876df1d> in <module>()
----> 1 for column in training.columns:
      2     print column
      3     mean_column = training[column].mean()
      4     #training[column] = training[column].fillna(mean_columns)

NameError: name 'training' is not defined



In [ ]:

	station_nbr	date	tmax	tmin	tavg	depart	dewpoint	wetbulb	heat	cool	sunrise	sunset	codesum	snowfall	preciptotal	stnpressure	sealevel	resultspeed	resultdir	avgspeed
0	1	2012-01-01	52	31	42	NaN	36	40	23	0	NaN	NaN	RA FZFG BR	NaN	0.05	29.78	29.92	3.6	20	4.6
1	2	2012-01-01	48	33	41	16	37	39	24	0	716	1626	RA	0.0	0.07	28.82	29.91	9.1	23	11.3
2	3	2012-01-01	55	34	45	9	24	36	20	0	735	1720		0.0	0.00	29.77	30.47	9.9	31	10.0
3	4	2012-01-01	63	47	55	4	28	43	10	0	728	1742		0.0	0.00	29.79	30.48	8.0	35	8.2
4	6	2012-01-01	63	34	49	0	31	43	16	0	727	1742		0.0	0.00	29.95	30.47	14.0	36	13.8
5	7	2012-01-01	50	33	42	NaN	26	35	23	0	NaN	NaN		0.0	0.00	29.15	30.54	10.3	32	10.2
6	8	2012-01-01	66	45	NaN	NaN	34	46	NaN	NaN	NaN	NaN	RA BR	NaN	0.00	30.05	NaN	11.0	36	10.9
7	9	2012-01-01	34	19	27	NaN	17	23	38	0	NaN	NaN	UP	NaN	T	29.34	30.09	22.8	30	22.5
8	10	2012-01-01	73	53	63	NaN	55	58	2	0	723	1738	FG+ FG BR	NaN	0.00	30.16	30.19	5.1	24	5.5
9	11	2012-01-01	72	48	60	7	54	56	5	0	724	1737	FG+ FG BR	0.0	0.00	30.15	30.18	4.6	23	4.8
10	12	2012-01-01	72	48	60	NaN	54	57	5	0	NaN	NaN	FG+ FG BR	0.0	0.00	30.15	30.19	4.4	25	4.8
11	13	2012-01-01	38	18	28	NaN	12	21	37	0	NaN	NaN		NaN	0.00	24.60	30.59	2.5	16	6.1
12	14	2012-01-01	50	34	42	5	25	35	23	0	739	1729		0.0	0.00	29.13	30.52	11.4	32	11.3
13	15	2012-01-01	48	26	37	16	35	38	28	0	718	1621	RA FG+ FZFG BR	0.0	0.09	29.53	29.89	2.5	17	3.8
14	16	2012-01-01	50	30	40	NaN	37	40	25	0	NaN	NaN	RA MIFG BR	NaN	T	29.80	29.89	4.2	21	6.8