Kaggle.com Sales in stormy weather.

(From http://www.kaggle.com/c/walmart-recruiting-sales-in-stormy-weather)

Walmart operates 11,450 stores in 27 countries, managing inventory across varying climates and cultures. Extreme weather events, like hurricanes, blizzards, and floods, can have a huge impact on sales at the store and product level.

In their second Kaggle recruiting competition, Walmart challenges participants to accurately predict the sales of 111 potentially weather-sensitive products (like umbrellas, bread, and milk) around the time of major weather events at 45 of their retail locations.

Intuitively, we may expect an uptick in the sales of umbrellas before a big thunderstorm, but it's difficult for replenishment managers to correctly predict the level of inventory needed to avoid being out-of-stock or overstock during and after that storm. Walmart relies on a variety of vendor tools to predict sales around extreme weather events, but it's an ad-hoc and time-consuming process that lacks a systematic measure of effectiveness.

Helping Walmart better predict sales of weather-sensitive products will keep valued customers out of the rain. It could also earn you a position at one of the most data-driven retailers in the world!

Data

(From http://www.kaggle.com/c/walmart-recruiting-sales-in-stormy-weather/data)

You have been provided with sales data for 111 products whose sales may be affected by the weather (such as milk, bread, umbrellas, etc.). These 111 products are sold in stores at 45 different Walmart locations. Some of the products may be a similar item (such as milk) but have a different id in different stores/regions/suppliers. The 45 locations are covered by 20 weather stations (i.e. some of the stores are nearby and share a weather station).

The competition task is to predict the amount of each product sold around the time of major weather events. For the purposes of this competition, we have defined a weather event as any day in which more than an inch of rain or two inches of snow was observed. You are asked to predict the units sold for a window of ±3 days surrounding each storm.

The following graphic shows the layout of the test windows. The blue dots are the training set days, the red dots are the test set days, and the event=True are the days with storms. Note that this plot is for the 20 weather stations. All days prior to 2013-04-01 are given out as training data.

You are provided with the full observed weather covering the entire data set. You do not need to forecast weather in addition to sales (it's as though you have a perfect weather forecast at your disposal).

  • You will not be provided with more information about the products, store locations, or other details.
  • Because the storms occur at variable times and in variable locations, use the test set file (or sample submission) as your guide to know which days and stores you must forecast.
  • The sales data does not capture the difference between the stock and the demand. In other words, sales number 0 doesn't necessarily mean there was no demand for this product; it may mean it was in stock but none were sold, or it could mean that the product was out of stock, or discontinued and not available.

Field descriptions

  • date - the day of sales or weather
  • store_nbr - an id representing one of the 45 stores
  • station_nbr - an id representing one of 20 weather stations
  • item_nbr - an id representing one of the 111 products
  • units - the quantity sold of an item on a given day
  • id - a triplet representing a store_nbr, item_nbr, and date. Form the id by concatenating these (in that order) with an underscore. E.g. "2_1_2013-04-01" represents store 2, item 1, sold on 2013-04-01.

File descriptions

  • key.csv - the relational mapping between stores and the weather stations that cover them
  • sampleSubmission.csv - file that gives the prediction format
  • train.csv - sales data for all stores & dates in the training set
  • test.csv - stores & dates for forecasting (missing 'units', which you must predict) NOTE: This file has been encrypted. To get the password, please fill out Walmart's Recruiting Survey
  • weather.csv - a file containing the NOAA weather information for each station and day
  • noaa_weather_qclcd_documentation.pdf - a guide to understand the data provided in the weather.csv file

Cleaning, merging.


In [21]:
#importing libraries
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os

weather measurements With respect to the noaa_weather_qclcd_documentation.pdf:

  • M - Missing data, or data that is normally nor supported by the station
  • "*" - not supported by the station
  • "-" - values is not availible for sunset/sunrise

In [22]:
weather = pd.read_csv(os.path.join("data", "weather.csv"), na_values=["M", "-", "*"])

In [23]:
weather.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 20517 entries, 0 to 20516
Data columns (total 20 columns):
station_nbr    20517 non-null int64
date           20517 non-null object
tmax           19611 non-null float64
tmin           19609 non-null float64
tavg           19048 non-null float64
depart         9006 non-null float64
dewpoint       19851 non-null float64
wetbulb        19265 non-null float64
heat           19048 non-null float64
cool           19048 non-null float64
sunrise        10861 non-null float64
sunset         10861 non-null float64
codesum        20517 non-null object
snowfall       13293 non-null object
preciptotal    19657 non-null object
stnpressure    19588 non-null float64
sealevel       18793 non-null float64
resultspeed    19928 non-null float64
resultdir      19928 non-null float64
avgspeed       19642 non-null float64
dtypes: float64(15), int64(1), object(4)

In [24]:
weather.head(15)


Out[24]:
station_nbr date tmax tmin tavg depart dewpoint wetbulb heat cool sunrise sunset codesum snowfall preciptotal stnpressure sealevel resultspeed resultdir avgspeed
0 1 2012-01-01 52 31 42 NaN 36 40 23 0 NaN NaN RA FZFG BR NaN 0.05 29.78 29.92 3.6 20 4.6
1 2 2012-01-01 48 33 41 16 37 39 24 0 716 1626 RA 0.0 0.07 28.82 29.91 9.1 23 11.3
2 3 2012-01-01 55 34 45 9 24 36 20 0 735 1720 0.0 0.00 29.77 30.47 9.9 31 10.0
3 4 2012-01-01 63 47 55 4 28 43 10 0 728 1742 0.0 0.00 29.79 30.48 8.0 35 8.2
4 6 2012-01-01 63 34 49 0 31 43 16 0 727 1742 0.0 0.00 29.95 30.47 14.0 36 13.8
5 7 2012-01-01 50 33 42 NaN 26 35 23 0 NaN NaN 0.0 0.00 29.15 30.54 10.3 32 10.2
6 8 2012-01-01 66 45 NaN NaN 34 46 NaN NaN NaN NaN RA BR NaN 0.00 30.05 NaN 11.0 36 10.9
7 9 2012-01-01 34 19 27 NaN 17 23 38 0 NaN NaN UP NaN T 29.34 30.09 22.8 30 22.5
8 10 2012-01-01 73 53 63 NaN 55 58 2 0 723 1738 FG+ FG BR NaN 0.00 30.16 30.19 5.1 24 5.5
9 11 2012-01-01 72 48 60 7 54 56 5 0 724 1737 FG+ FG BR 0.0 0.00 30.15 30.18 4.6 23 4.8
10 12 2012-01-01 72 48 60 NaN 54 57 5 0 NaN NaN FG+ FG BR 0.0 0.00 30.15 30.19 4.4 25 4.8
11 13 2012-01-01 38 18 28 NaN 12 21 37 0 NaN NaN NaN 0.00 24.60 30.59 2.5 16 6.1
12 14 2012-01-01 50 34 42 5 25 35 23 0 739 1729 0.0 0.00 29.13 30.52 11.4 32 11.3
13 15 2012-01-01 48 26 37 16 35 38 28 0 718 1621 RA FG+ FZFG BR 0.0 0.09 29.53 29.89 2.5 17 3.8
14 16 2012-01-01 50 30 40 NaN 37 40 25 0 NaN NaN RA MIFG BR NaN T 29.80 29.89 4.2 21 6.8

In [25]:
#convert date column into the date format
#weather["date"] = weather["date"].map(pd.to_datetime)

In [26]:
weather.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 20517 entries, 0 to 20516
Data columns (total 20 columns):
station_nbr    20517 non-null int64
date           20517 non-null object
tmax           19611 non-null float64
tmin           19609 non-null float64
tavg           19048 non-null float64
depart         9006 non-null float64
dewpoint       19851 non-null float64
wetbulb        19265 non-null float64
heat           19048 non-null float64
cool           19048 non-null float64
sunrise        10861 non-null float64
sunset         10861 non-null float64
codesum        20517 non-null object
snowfall       13293 non-null object
preciptotal    19657 non-null object
stnpressure    19588 non-null float64
sealevel       18793 non-null float64
resultspeed    19928 non-null float64
resultdir      19928 non-null float64
avgspeed       19642 non-null float64
dtypes: float64(15), int64(1), object(4)

In [27]:
set(weather["snowfall"])


Out[27]:
{nan,
 '  T',
 '0.0',
 '0.1',
 '0.2',
 '0.3',
 '0.4',
 '0.5',
 '0.6',
 '0.7',
 '0.8',
 '0.9',
 '1.0',
 '1.1',
 '1.2',
 '1.3',
 '1.4',
 '1.5',
 '1.6',
 '1.7',
 '1.8',
 '1.9',
 '10.5',
 '11.0',
 '12.2',
 '13.1',
 '13.5',
 '14.9',
 '15.2',
 '16.2',
 '2.0',
 '2.1',
 '2.2',
 '2.3',
 '2.4',
 '2.5',
 '2.6',
 '2.7',
 '2.8',
 '2.9',
 '3.0',
 '3.1',
 '3.2',
 '3.3',
 '3.4',
 '3.5',
 '3.6',
 '3.7',
 '3.8',
 '3.9',
 '4.0',
 '4.1',
 '4.2',
 '4.3',
 '4.4',
 '4.6',
 '4.7',
 '4.9',
 '5.0',
 '5.1',
 '5.2',
 '5.4',
 '5.5',
 '5.7',
 '5.8',
 '6.1',
 '6.2',
 '6.4',
 '6.6',
 '6.7',
 '7.0',
 '7.1',
 '7.2',
 '7.7',
 '7.8',
 '8.0',
 '9.3'}

T above will mean Trace that with respect to internet for snow will mean that is was some snow, but it was less than 0.1 inch. I will assume that T is equal to 0.01, just to distinguish it from 0.


In [28]:
def change_snowfall(x):
    if x == "  T":
        return 0.01    
    else:
        return float(x)
    
weather["snowfall"] = weather["snowfall"].map(change_snowfall)

Similar story with percipitation, although in this case T means less than 0.01 inch


In [29]:
def change_preciptotal(x):
    if x == "  T":
        return 0.001    
    else:
        return float(x)
    
weather["preciptotal"] = weather["preciptotal"].map(change_preciptotal)

In [30]:
weather.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 20517 entries, 0 to 20516
Data columns (total 20 columns):
station_nbr    20517 non-null int64
date           20517 non-null object
tmax           19611 non-null float64
tmin           19609 non-null float64
tavg           19048 non-null float64
depart         9006 non-null float64
dewpoint       19851 non-null float64
wetbulb        19265 non-null float64
heat           19048 non-null float64
cool           19048 non-null float64
sunrise        10861 non-null float64
sunset         10861 non-null float64
codesum        20517 non-null object
snowfall       13293 non-null float64
preciptotal    19657 non-null float64
stnpressure    19588 non-null float64
sealevel       18793 non-null float64
resultspeed    19928 non-null float64
resultdir      19928 non-null float64
avgspeed       19642 non-null float64
dtypes: float64(17), int64(1), object(2)

Now I need to work with "codesum" that describes summary of the weather. I would guess that it is most important feature.

Note: FG+ will give 1 to FG and also to the FG+ column


In [31]:
codesum_columns = set(' '.join(set(weather["codesum"])).strip().split())

In [32]:
codesum_columns


Out[32]:
{'BCFG',
 'BLDU',
 'BLSN',
 'BR',
 'DU',
 'DZ',
 'FG',
 'FG+',
 'FU',
 'FZDZ',
 'FZFG',
 'FZRA',
 'GR',
 'GS',
 'HZ',
 'MIFG',
 'PL',
 'PRFG',
 'RA',
 'SG',
 'SN',
 'SQ',
 'TS',
 'TSRA',
 'TSSN',
 'UP',
 'VCFG',
 'VCTS'}

In [33]:
codesum = pd.DataFrame(index=weather.index, columns=codesum_columns)

In [34]:
for column in codesum.columns:
    for i in range(len(weather.index)):
        if column in weather["codesum"][i]:
            codesum[column][i] = 1

In [35]:
weather = weather.drop("codesum", 1)

In [36]:
weather = weather.join(codesum.fillna(0))

In [37]:
weather.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 20517 entries, 0 to 20516
Data columns (total 47 columns):
station_nbr    20517 non-null int64
date           20517 non-null object
tmax           19611 non-null float64
tmin           19609 non-null float64
tavg           19048 non-null float64
depart         9006 non-null float64
dewpoint       19851 non-null float64
wetbulb        19265 non-null float64
heat           19048 non-null float64
cool           19048 non-null float64
sunrise        10861 non-null float64
sunset         10861 non-null float64
snowfall       13293 non-null float64
preciptotal    19657 non-null float64
stnpressure    19588 non-null float64
sealevel       18793 non-null float64
resultspeed    19928 non-null float64
resultdir      19928 non-null float64
avgspeed       19642 non-null float64
HZ             20517 non-null int64
FU             20517 non-null int64
UP             20517 non-null int64
TSSN           20517 non-null int64
VCTS           20517 non-null int64
DZ             20517 non-null int64
BR             20517 non-null int64
FG             20517 non-null int64
BCFG           20517 non-null int64
DU             20517 non-null int64
FZRA           20517 non-null int64
TS             20517 non-null int64
RA             20517 non-null int64
PL             20517 non-null int64
GS             20517 non-null int64
GR             20517 non-null int64
FZDZ           20517 non-null int64
VCFG           20517 non-null int64
PRFG           20517 non-null int64
FG+            20517 non-null int64
TSRA           20517 non-null int64
FZFG           20517 non-null int64
BLDU           20517 non-null int64
MIFG           20517 non-null int64
SQ             20517 non-null int64
BLSN           20517 non-null int64
SN             20517 non-null int64
SG             20517 non-null int64
dtypes: float64(17), int64(29), object(1)

In [38]:
#Save modfied weather file
weather.to_csv(os.path.join("data", "weather_modified.csv"))

In [39]:
#read file that desribes correspondance between store_nbr and station_nbr
key = pd.read_csv(os.path.join("data", "key.csv"))

In [40]:
#read train set
training = pd.read_csv(os.path.join("data", "train.csv"))

In [41]:
training.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 4617600 entries, 0 to 4617599
Data columns (total 4 columns):
date         object
store_nbr    int64
item_nbr     int64
units        int64
dtypes: int64(3), object(1)

In [42]:
training = training.merge(key, on="store_nbr")

In [43]:
training.head()


Out[43]:
date store_nbr item_nbr units station_nbr
0 2012-01-01 1 1 0 1
1 2012-01-01 1 2 0 1
2 2012-01-01 1 3 0 1
3 2012-01-01 1 4 0 1
4 2012-01-01 1 5 0 1

In [44]:
weather.head()


Out[44]:
station_nbr date tmax tmin tavg depart dewpoint wetbulb heat cool ... PRFG FG+ TSRA FZFG BLDU MIFG SQ BLSN SN SG
0 1 2012-01-01 52 31 42 NaN 36 40 23 0 ... 0 0 0 1 0 0 0 0 0 0
1 2 2012-01-01 48 33 41 16 37 39 24 0 ... 0 0 0 0 0 0 0 0 0 0
2 3 2012-01-01 55 34 45 9 24 36 20 0 ... 0 0 0 0 0 0 0 0 0 0
3 4 2012-01-01 63 47 55 4 28 43 10 0 ... 0 0 0 0 0 0 0 0 0 0
4 6 2012-01-01 63 34 49 0 31 43 16 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 47 columns


In [ ]:
training = pd.merge(training, weather)

In [ ]:
#save training set with added weather conditions to file
training.to_csv(os.path.join("data", "training_modified.csv"))

In [ ]:
training.head()

In [216]:
#read test set
testing = pd.read_csv(os.path.join("data", "test.csv"))

In [217]:
testing = testing.merge(key, on="store_nbr")

In [218]:
testing = pd.merge(testing, weather)

In [219]:
#save testing set with added weather conditions to file
testing.to_csv(os.path.join("data", "testing_modified.csv"))

Filling missing values

First naive approach that will just fill NaN with mean over the column


In [1]:
for column in training.columns:
    print column
    mean_column = training[column].mean()
    #training[column] = training[column].fillna(mean_columns)


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-abb1a876df1d> in <module>()
----> 1 for column in training.columns:
      2     print column
      3     mean_column = training[column].mean()
      4     #training[column] = training[column].fillna(mean_columns)

NameError: name 'training' is not defined

In [ ]: