We propose to compete in the Kaggle competition set up by Rossmann to predict each store's sales for the next 6 weeks of business from historical business data. All the information on the dataset is \textcolor{blue}{\href{https://www.kaggle.com/c/rossmann-store-sales/data}{here}}. It includes information about competitors, daily sales, and more.
Our exploratory code is available \textcolor{blue}{\href{https://github.com/meissnereric/rossman_predictor}{on Eric's GitHub}.}
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
%pwd
Out[1]:
In [ ]:
#Partial code taken from (https://www.kaggle.com/mmourafiq/rossmann-store-sales/data-viz/notebook):
train = pd.read_csv('data/train.csv')
print(train[:5])
store = pd.read_csv('data/store.csv')
print(store[:5])
all_data = pd.merge(train, store, on='Store', how='left')
test = pd.read_csv('data/test.csv')
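# Normalize the integer 0 to the string '0' so StateHoliday has a single type
# (.loc avoids chained assignment, which may silently fail to update all_data)
all_data.loc[all_data['StateHoliday'] == 0, 'StateHoliday'] = '0'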
Basic statistics of the data. Note in particular that the standard deviation is high for both sales and customers, implying that the data is highly variable and reassuring us that a predictor is necessary.
In [15]:
all_data.describe()
Out[15]:
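To make "high standard deviation" concrete, one option is the coefficient of variation (standard deviation divided by mean), which is comparable across columns with different scales. The snippet below is a minimal sketch on a toy frame whose values are invented for illustration; on the real data the same expression could be applied to `all_data[['Sales', 'Customers']]`.

```python
import pandas as pd

# Toy values, invented for illustration only; not taken from the Rossmann data
toy = pd.DataFrame({
    'Sales':     [0, 4000, 6000, 10000],
    'Customers': [0, 400, 700, 900],
})

# Coefficient of variation: sample standard deviation relative to the mean
cv = toy.std() / toy.mean()
print(cv)
```

A value close to or above 1 means the spread is about as large as the mean itself.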
In [16]:
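# Totals per weekday (numeric columns only; Date and StateHoliday are not numeric)
print(train.groupby('DayOfWeek').sum(numeric_only=True))
# Sales and customer statistics per weekday, split by whether the store was open
print(all_data[['DayOfWeek', 'Open', 'Sales', 'Customers']].groupby(['DayOfWeek', 'Open']).agg(['sum', 'mean', 'std']))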
Unsurprisingly, a state holiday strongly affects sales.
In [4]:
avg_stateholiday = all_data[['Sales', 'Customers', 'StateHoliday']].groupby('StateHoliday').mean()
avg_stateholiday.plot(kind='bar')
Out[4]:
Notably, running a promotion on a given day increases average sales by a large amount, while the number of customers rises far less. This implies that customers spend more per visit on promotion days, rather than the store simply attracting more customers.
In [5]:
avg_promotion = all_data[['Sales', 'Customers', 'Promo']].groupby('Promo').mean()
avg_promotion.plot(kind='bar')
Out[5]:
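The per-visit spending claim can be checked directly by dividing sales by customers within each promotion group. This is a minimal sketch on a toy frame: the numbers are invented, and only the column names `Promo`, `Sales`, and `Customers` match the dataset.

```python
import pandas as pd

# Toy rows, invented for illustration only; not taken from the Rossmann data
toy = pd.DataFrame({
    'Promo':     [0, 0, 1, 1],
    'Sales':     [4000, 6000, 7000, 9000],
    'Customers': [500, 700, 600, 800],
})

# Drop days without customers so the ratio is well defined
toy = toy[toy['Customers'] > 0]

# Total sales divided by total customers, per promotion group
totals = toy.groupby('Promo')[['Sales', 'Customers']].sum()
sales_per_customer = totals['Sales'] / totals['Customers']
print(sales_per_customer)
```

If the ratio is higher for `Promo == 1` on the real data, that would support the reading that promotions raise spending per customer.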
Most stores have a competitor nearby, so competition distances cluster at small values. In general, the distance does not appear to affect sales strongly.
In [17]:
all_data[['CompetitionDistance', 'Sales']].plot(kind='scatter', x='CompetitionDistance', y='Sales')
all_data.hist('CompetitionDistance')
Out[17]:
In [8]:
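# Bin the competition distance into 10 equal-width bins (10 bins need 11 edges)
bins = np.linspace(all_data['CompetitionDistance'].min(), all_data['CompetitionDistance'].max(), 11)
# include_lowest=True keeps the minimum distance inside the first bin
competition_bins = all_data[['Sales', 'Customers']].groupby(pd.cut(all_data['CompetitionDistance'], bins, include_lowest=True))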
competition_avg = competition_bins.mean()
competition_avg.plot(kind='bar')
Out[8]:
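A correlation coefficient gives a single number to go with the plots above. The sketch below uses synthetic data (randomly generated, with sales independent of distance by construction) purely to show the calculation. The rank-based variant is computed as a Pearson correlation on ranks, which is less sensitive to the long right tail of the distances.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only: distance and sales are independent here
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'CompetitionDistance': rng.exponential(scale=5000, size=200),
    'Sales': rng.normal(loc=6000, scale=1500, size=200),
})
# Mimic stores with no recorded competitor distance
toy.loc[::25, 'CompetitionDistance'] = np.nan

# Remove rows with missing distances before comparing the two columns
clean = toy.dropna(subset=['CompetitionDistance'])

# Linear (Pearson) correlation
pearson = clean['CompetitionDistance'].corr(clean['Sales'])
# Rank (Spearman) correlation, computed as Pearson on the ranks
spearman = clean['CompetitionDistance'].rank().corr(clean['Sales'].rank())
print(pearson, spearman)
```

On the real data, values near zero for both would back up the impression that distance alone explains little of the variation in sales.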
Sundays have almost no sales.
Sales peak in July and December, the height of summer and the Christmas season. (Rossmann stores are based in Germany, where Christmas is a major holiday.)
In [23]:
#Done By Neal
train['Date'] = pd.to_datetime(train['Date'])
# dt.dayofweek runs from 0 (Monday) to 6 (Sunday);
# this overwrites the DayOfWeek column loaded from train.csv
train['DayOfWeek'] = train['Date'].dt.dayofweek
train['Month'] = train['Date'].dt.month
train['Year'] = train['Date'].dt.year
avg_month = train[['Sales', 'Month']].groupby('Month').mean()
avg_month.plot(kind='bar')
avg_day = train[['Sales', 'DayOfWeek']].groupby('DayOfWeek').mean()
avg_day.plot(kind='bar')
#sales by day of week
sale_dayofweek = pd.pivot_table(train, values='Sales', index=['Year','Store'], columns=['DayOfWeek'])
#sales by month
sale_month = pd.pivot_table(train, values='Sales', index=['Year','Store'], columns=['Month'])
sale_month[:5]
#sale_dayofweek.plot(kind='box')
#sale_month.plot(kind='box')
Out[23]:
Our model will follow a typical procedure for batch data analysis: a cleaning phase that feeds into a predictive regression model for sales.
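As a rough illustration of that two-phase structure, the sketch below cleans a toy frame and fits an ordinary least-squares baseline with NumPy. Everything specific here is an assumption made for the example: the toy values are invented, the helper names `clean` and `design_matrix` are hypothetical, and the feature set (promotion flag and log distance) is only a placeholder, not the model we have settled on.

```python
import numpy as np
import pandas as pd

# Toy data using the dataset's column names; the values are invented
toy = pd.DataFrame({
    'Open':  [1, 1, 0, 1, 1, 1, 0, 1],
    'Promo': [0, 1, 0, 1, 0, 1, 0, 0],
    'CompetitionDistance': [500.0, np.nan, 1200.0, 300.0, 8000.0, 450.0, 450.0, 2500.0],
    'Sales': [5200, 8100, 0, 7900, 4800, 8300, 0, 5000],
})

def clean(df):
    """Cleaning phase (hypothetical helper): drop closed days, fill missing distances."""
    out = df[df['Open'] == 1].copy()
    median_distance = out['CompetitionDistance'].median()
    out['CompetitionDistance'] = out['CompetitionDistance'].fillna(median_distance)
    return out

def design_matrix(df):
    """Feature phase (hypothetical helper): intercept, promo flag, log distance."""
    return np.column_stack([
        np.ones(len(df)),
        df['Promo'].to_numpy(dtype=float),
        np.log1p(df['CompetitionDistance'].to_numpy(dtype=float)),
    ])

cleaned = clean(toy)
X = design_matrix(cleaned)
y = cleaned['Sales'].to_numpy(dtype=float)

# Regression phase: ordinary least squares as a simple baseline
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
predictions = X @ coef
print(coef)
```

Keeping cleaning and feature construction in separate functions means the same steps can be applied unchanged to the test set before predicting.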
The next step is to choose a particular predictive model, both through empirical testing and by surveying which methods for time series regression are popular in the field.
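Empirical testing needs a score to compare candidate models on held-out data. To our understanding the competition is scored with the root mean square percentage error (RMSPE), ignoring days with zero sales; this should be checked against the competition's evaluation page. Under that assumption, a minimal sketch of the metric (the function name `rmspe` is our own) is:

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean square percentage error over entries with non-zero true sales."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Assumption: zero-sales days are excluded, since the percentage error is undefined there
    mask = y_true != 0
    pct_error = (y_true[mask] - y_pred[mask]) / y_true[mask]
    return float(np.sqrt(np.mean(pct_error ** 2)))

# Two scored days, each off by 10%; the zero-sales day is ignored
score = rmspe([100, 200, 0], [110, 180, 50])
print(score)
```

Lower is better, and because the error is relative, a miss of 100 on a small store counts for more than the same miss on a large one.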
Because our goal is to build a sales predictor, visualizing the results is not a primary goal. We do plan to visualize and discuss the major patterns our analysis finds, such as a promotion increasing a store's sales by X, or the relation of stores to their competitors. We will also include a visualization of our predictor if relevant.