(From http://www.kaggle.com/c/walmart-recruiting-sales-in-stormy-weather)
Walmart operates 11,450 stores in 27 countries, managing inventory across varying climates and cultures. Extreme weather events, like hurricanes, blizzards, and floods, can have a huge impact on sales at the store and product level.
In their second Kaggle recruiting competition, Walmart challenges participants to accurately predict the sales of 111 potentially weather-sensitive products (like umbrellas, bread, and milk) around the time of major weather events at 45 of their retail locations.
Intuitively, we may expect an uptick in the sales of umbrellas before a big thunderstorm, but it's difficult for replenishment managers to correctly predict the level of inventory needed to avoid being out-of-stock or overstock during and after that storm. Walmart relies on a variety of vendor tools to predict sales around extreme weather events, but it's an ad-hoc and time-consuming process that lacks a systematic measure of effectiveness.
Helping Walmart better predict sales of weather-sensitive products will keep valued customers out of the rain. It could also earn you a position at one of the most data-driven retailers in the world!
(From http://www.kaggle.com/c/walmart-recruiting-sales-in-stormy-weather/data)
You have been provided with sales data for 111 products whose sales may be affected by the weather (such as milk, bread, umbrellas, etc.). These 111 products are sold in stores at 45 different Walmart locations. Some of the products may be a similar item (such as milk) but have a different id in different stores/regions/suppliers. The 45 locations are covered by 20 weather stations (i.e. some of the stores are nearby and share a weather station).
The competition task is to predict the amount of each product sold around the time of major weather events. For the purposes of this competition, we have defined a weather event as any day in which more than an inch of rain or two inches of snow was observed. You are asked to predict the units sold for a window of ±3 days surrounding each storm.
The following graphic shows the layout of the test windows. The blue dots are the training-set days, the red dots are the test-set days, and days marked event=True are the days with storms. Note that this plot covers all 20 weather stations. All days prior to 2013-04-01 are given out as training data.
You are provided with the full observed weather covering the entire data set. You do not need to forecast weather in addition to sales (it's as though you have a perfect weather forecast at your disposal).
In [21]:
#importing libraries
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os
The weather measurements are described in noaa_weather_qclcd_documentation.pdf:
In [22]:
weather = pd.read_csv(os.path.join("data", "weather.csv"), na_values=["M", "-", "*"])
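As a quick sanity check, the `na_values` list turns the documentation's missing-value sentinels ("M", "-", "*") into proper NaNs at read time. A minimal sketch with made-up rows (the column names here are just for illustration):

```python
import io
import pandas as pd

# Hypothetical miniature weather file: "M", "-" and "*" mark missing values
sample = io.StringIO(
    "station_nbr,tmax,tmin,snowfall\n"
    "1,52,31,0.0\n"
    "2,M,-,*\n"
)
df = pd.read_csv(sample, na_values=["M", "-", "*"])

# All three sentinel strings in the second row become NaN
print(df.isna().sum().sum())  # 3
```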
In [23]:
weather.info()
In [24]:
weather.head(15)
Out[24]:
In [25]:
#convert the date column to datetime
#weather["date"] = weather["date"].map(pd.to_datetime)
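If this conversion is enabled, calling `pd.to_datetime` on the whole column is the vectorized (and faster) form of the commented-out element-wise `map`. A small sketch with made-up dates in the same YYYY-MM-DD format:

```python
import pandas as pd

# Hypothetical date strings in the same format as weather.csv
dates = pd.Series(["2012-01-01", "2012-01-02", "2013-04-01"])

# Vectorized conversion of the whole Series at once
converted = pd.to_datetime(dates)
print(converted.dt.year.tolist())  # [2012, 2012, 2013]
```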
In [26]:
weather.info()
In [27]:
set(weather["snowfall"])
Out[27]:
The value T above means Trace, which, according to sources on the internet, means for snowfall that some snow fell but less than 0.1 inch was measured. I will assume that T equals 0.01, just to distinguish it from 0.
In [28]:
def change_snowfall(x):
    if x == " T":
        return 0.01
    else:
        return float(x)
weather["snowfall"] = weather["snowfall"].map(change_snowfall)
The story is similar for precipitation, although in this case T means less than 0.01 inch.
In [29]:
def change_preciptotal(x):
    if x == " T":
        return 0.001
    else:
        return float(x)
weather["preciptotal"] = weather["preciptotal"].map(change_preciptotal)
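Both trace replacements above can also be done without an element-wise function, using a vectorized `replace` followed by a single `astype`. A sketch with made-up values, shown here for the snowfall case (trace = 0.01):

```python
import pandas as pd

# Hypothetical raw snowfall column with trace (" T") values, as in weather.csv
snow = pd.Series(["0.0", " T", "2.5", " T"])

# Replace the trace sentinel, then convert the whole column in one call
snow_clean = snow.replace(" T", "0.01").astype(float)
print(snow_clean.tolist())  # [0.0, 0.01, 2.5, 0.01]
```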
In [30]:
weather.info()
Now I need to work with "codesum", which describes a summary of the weather conditions. I would guess that it is the most important feature.
Note: a code of FG+ will put a 1 in the FG column as well as in the FG+ column, because the check below uses substring matching.
In [31]:
codesum_columns = set(' '.join(set(weather["codesum"])).strip().split())
In [32]:
codesum_columns
Out[32]:
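To see what the join/split expression above does, here is a sketch on a few made-up codesum strings: joining every cell into one big string and splitting it back apart collects the set of distinct codes.

```python
# Hypothetical codesum cells; each holds space-separated weather codes
codesums = ["RA BR", "SN", "", "RA SN BR", "FG+"]

# Join all distinct cells into one string, then split to get the unique codes
codes = set(" ".join(set(codesums)).strip().split())
print(sorted(codes))  # ['BR', 'FG+', 'RA', 'SN']
```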
In [33]:
codesum = pd.DataFrame(index=weather.index, columns=codesum_columns)
In [34]:
for column in codesum.columns:
    for i in range(len(weather.index)):
        if column in weather["codesum"][i]:
            codesum.loc[i, column] = 1
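The nested loop above can also be expressed in vectorized form with `str.contains`. Because it uses plain substring matching (`regex=False`), a cell containing FG+ also sets the FG indicator, matching the earlier note. A sketch with made-up codesum values:

```python
import pandas as pd

# Hypothetical codesum column, as in weather.csv
weather_codes = pd.Series(["RA BR", "FG+", "SN RA", ""])

# One indicator column per code; substring matching, so "FG" fires on "FG+"
indicators = pd.DataFrame(
    {code: weather_codes.str.contains(code, regex=False).astype(int)
     for code in ["RA", "BR", "FG", "FG+", "SN"]}
)
print(indicators["FG"].tolist())  # [0, 1, 0, 0]
```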
In [35]:
weather = weather.drop("codesum", axis=1)
In [36]:
weather = weather.join(codesum.fillna(0))
In [37]:
weather.info()
In [38]:
#save modified weather file
weather.to_csv(os.path.join("data", "weather_modified.csv"))
In [39]:
#read the file that describes the correspondence between store_nbr and station_nbr
key = pd.read_csv(os.path.join("data", "key.csv"))
In [40]:
#read train set
training = pd.read_csv(os.path.join("data", "train.csv"))
In [41]:
training.info()
In [42]:
training = training.merge(key, on="store_nbr")
In [43]:
training.head()
Out[43]:
In [44]:
weather.head()
Out[44]:
In [ ]:
training = pd.merge(training, weather)
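Since no `on=` argument is given, `pd.merge` joins on all columns the two frames have in common, which here are station_nbr and date. A miniature sketch with made-up rows:

```python
import pandas as pd

# Hypothetical miniature frames mirroring the merge keys in the notebook
sales = pd.DataFrame({"store_nbr": [1, 2], "station_nbr": [1, 1],
                      "date": ["2012-01-01", "2012-01-01"], "units": [5, 7]})
wx = pd.DataFrame({"station_nbr": [1], "date": ["2012-01-01"],
                   "preciptotal": [0.2]})

# With no `on=`, pandas joins on the shared columns (station_nbr, date)
merged = pd.merge(sales, wx)
print(merged["preciptotal"].tolist())  # [0.2, 0.2]
```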
In [ ]:
#save training set with added weather conditions to file
training.to_csv(os.path.join("data", "training_modified.csv"))
In [ ]:
training.head()
In [216]:
#read test set
testing = pd.read_csv(os.path.join("data", "test.csv"))
In [217]:
testing = testing.merge(key, on="store_nbr")
In [218]:
testing = pd.merge(testing, weather)
In [219]:
#save testing set with added weather conditions to file
testing.to_csv(os.path.join("data", "testing_modified.csv"))
A first naive approach: just fill NaN values with the mean of each column.
In [1]:
for column in training.columns:
    print(column)
    mean_column = training[column].mean()
    #training[column] = training[column].fillna(mean_column)
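The per-column loop can be collapsed into a single call: `df.mean(numeric_only=True)` returns one mean per numeric column, and passing that Series to `fillna` fills each column with its own mean. A sketch with made-up values standing in for the training columns:

```python
import pandas as pd

# Hypothetical numeric frame with gaps
df = pd.DataFrame({"tmax": [50.0, None, 70.0],
                   "units": [1.0, 2.0, None]})

# Fill each numeric column's NaNs with that column's mean in one call
df_filled = df.fillna(df.mean(numeric_only=True))
print(df_filled["tmax"].tolist())  # [50.0, 60.0, 70.0]
```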