Overview

We are proposing to compete in the Kaggle competition set up by Rossmann to predict store sales for the next 6 weeks of business from previous business data. All the information on the dataset is \textcolor{blue}{\href{https://www.kaggle.com/c/rossmann-store-sales/data}{here}}. It includes information about competitors, daily sales totals, customer counts, and more.

Code Location

Our exploratory code can be found \textcolor{blue}{\href{https://github.com/meissnereric/rossman_predictor}{on Eric's GitHub}}.

Descriptive Statistics and Discussion


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
%pwd


Out[1]:
u'/home/eric/dev/projects/data_science/rossman_predictor'
Our data came in three major files: train, test, and store. The training data includes daily sales and customers over a period of roughly two years (~1 million data points), while the test data covers a period of 6 weeks for particular stores (notably, not all of them). The store file has metadata about each store, which presumably affects that store's sales. We first merge the store and training data for ease of analysis and display general information. We also clean the StateHoliday field so its values are consistently strings.

In [ ]:
#Partial code taken from (https://www.kaggle.com/mmourafiq/rossmann-store-sales/data-viz/notebook): 

train = pd.read_csv('data/train.csv')
print(train[:5])
store = pd.read_csv('data/store.csv')
print(store[:5])
all_data = pd.merge(train, store, on='Store', how='left')
test = pd.read_csv('data/test.csv')
all_data.loc[all_data['StateHoliday'] == 0, 'StateHoliday'] = '0'

Basic statistics of the data. Notice in particular that the standard deviation is high for both sales and customers, implying the data is highly variable and reassuring us that a predictor is worthwhile.


In [15]:
all_data.describe()


Out[15]:
Store DayOfWeek Sales Customers Open Promo SchoolHoliday CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear
count 1017209.000000 1017209.000000 1017209.000000 1017209.000000 1017209.000000 1017209.000000 1017209.000000 1014567.000000 693861.000000 693861.000000 1017209.000000 509178.000000 509178.000000
mean 558.429727 3.998341 5773.818972 633.145946 0.830107 0.381515 0.178647 5430.085652 7.222866 2008.690228 0.500564 23.269093 2011.752774
std 321.908651 1.997391 3849.926175 464.411734 0.375539 0.485759 0.383056 7715.323700 3.211832 5.992644 0.500000 14.095973 1.662870
min 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 20.000000 1.000000 1900.000000 0.000000 1.000000 2009.000000
25% 280.000000 2.000000 3727.000000 405.000000 1.000000 0.000000 0.000000 710.000000 4.000000 2006.000000 0.000000 13.000000 2011.000000
50% 558.000000 4.000000 5744.000000 609.000000 1.000000 0.000000 0.000000 2330.000000 8.000000 2010.000000 1.000000 22.000000 2012.000000
75% 838.000000 6.000000 7856.000000 837.000000 1.000000 1.000000 0.000000 6890.000000 10.000000 2013.000000 1.000000 37.000000 2013.000000
max 1115.000000 7.000000 41551.000000 7388.000000 1.000000 1.000000 1.000000 75860.000000 12.000000 2015.000000 1.000000 50.000000 2015.000000

In [16]:
# Aggregate sales, customers, open days, etc. by day of week,
# then break sales and customers down by whether the store was open.
print(train.groupby(['DayOfWeek']).sum())
print(all_data[['DayOfWeek', 'Open', 'Sales', 'Customers']].groupby(['DayOfWeek', 'Open']).agg([np.sum, np.mean, np.std]))


              Store       Sales  Customers    Open  Promo  SchoolHoliday
DayOfWeek                                                               
1          80821168  1130203012  117675012  137560  77760          34060
2          81344288  1020411930  110848063  143961  77580          36595
3          81345276   954962863  105117642  141936  77580          34636
4          81443338   911177709  101732938  134644  77580          34747
5          81443338   980555941  108384820  138640  77580          36235
6          80821168   846317735   95103854  144058      0           2724
7          80821168    29551433    5179426    3593      0           2724
                     Sales                            Customers               \
                       sum         mean          std        sum         mean   
DayOfWeek Open                                                                 
1         0              0     0.000000     0.000000          0     0.000000   
          1     1130203012  8216.073074  3691.768817  117675012   855.444984   
2         0              0     0.000000     0.000000          0     0.000000   
          1     1020411930  7088.113656  3066.237108  110848063   769.986753   
3         0              0     0.000000     0.000000          0     0.000000   
          1      954962863  6728.122978  2781.775153  105117642   740.598876   
4         0              0     0.000000     0.000000          0     0.000000   
          1      911177709  6767.310159  2764.548064  101732938   755.569784   
5         0              0     0.000000     0.000000          0     0.000000   
          1      980555941  7072.677012  2764.915731  108384820   781.771639   
6         0              0     0.000000     0.000000          0     0.000000   
          1      846317735  5874.840238  2852.730271   95103854   660.177526   
7         0              0     0.000000     0.000000          0     0.000000   
          1       29551433  8224.723908  6235.152925    5179426  1441.532424   

                             
                        std  
DayOfWeek Open               
1         0        0.000000  
          1      420.044045  
2         0        0.000000  
          1      389.826885  
3         0        0.000000  
          1      372.526509  
4         0        0.000000  
          1      379.039576  
5         0        0.000000  
          1      383.003916  
6         0        0.000000  
          1      386.119153  
7         0        0.000000  
          1     1112.072839  

Unsurprisingly, a state holiday strongly affects sales.


In [4]:
avg_stateholiday = all_data[['Sales', 'Customers', 'StateHoliday']].groupby('StateHoliday').mean()
avg_stateholiday.plot(kind='bar')


Out[4]:
[bar chart: mean Sales and Customers grouped by StateHoliday]

Notably, having a promotion running on a particular day increases sales by a large amount while increasing customers at a much lower rate. This implies people spend more per visit on those days, rather than simply more customers showing up.


In [5]:
avg_promotion = all_data[['Sales', 'Customers', 'Promo']].groupby('Promo').mean()
avg_promotion.plot(kind='bar')


Out[5]:
[bar chart: mean Sales and Customers grouped by Promo]
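
One way to check this directly (a minimal sketch, reusing the merged all_data frame from above) is to compare average spend per customer with and without a promotion; a clearly higher value on promotion days would support the per-customer interpretation.


In [ ]:
# Sketch: average spend per customer on promo vs. non-promo days,
# using the merged all_data frame from above. Rows with zero customers
# (closed stores) are dropped to avoid dividing by zero.
open_days = all_data[all_data['Customers'] > 0].copy()
open_days['SalesPerCustomer'] = open_days['Sales'] / open_days['Customers']
spend_by_promo = open_days.groupby('Promo')['SalesPerCustomer'].mean()
print(spend_by_promo)
spend_by_promo.plot(kind='bar')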

Competition distances are clustered at small values (most stores have a nearby competitor), and in general competition distance does not appear to affect sales strongly.


In [17]:
all_data[['CompetitionDistance', 'Sales']].plot(kind='scatter', x='CompetitionDistance', y='Sales')
all_data.hist('CompetitionDistance')


Out[17]:
[scatter plot: Sales vs. CompetitionDistance; histogram of CompetitionDistance]

In [8]:
# Bin the competition distance into equal-width bins (10 edges, i.e. 9 bins)
bins = np.linspace(all_data['CompetitionDistance'].min(), all_data.CompetitionDistance.max(), 10)

competition_bins = all_data[['Sales', 'Customers']].groupby(pd.cut(all_data['CompetitionDistance'], bins))
competition_avg = competition_bins.mean()
competition_avg.plot(kind='bar')


Out[8]:
[bar chart: mean Sales and Customers by binned CompetitionDistance]
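
As a rough numerical check on that impression (a sketch using the same all_data frame), the linear correlation between competition distance and sales can be computed directly; a value close to zero would be consistent with the plots.


In [ ]:
# Pearson correlation between competition distance and daily sales,
# dropping rows where the competition distance is missing.
subset = all_data[['CompetitionDistance', 'Sales']].dropna()
print(subset['CompetitionDistance'].corr(subset['Sales']))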

Sundays have almost no sales.

Sales peak in July and December, the height of summer and the Christmas season. (Rossmann stores are based in Germany, where Christmas is a major holiday.)


In [23]:
#Done By Neal
train['Date'] = pd.to_datetime(train['Date'])
# Note: pandas dayofweek is 0-6 (Monday=0), unlike the original 1-7 DayOfWeek column it replaces.
train['DayOfWeek'] = train['Date'].dt.dayofweek
train['Month'] = train['Date'].dt.month
train['Year'] = train['Date'].dt.year
avg_month = train[['Sales', 'Month']].groupby('Month').mean()
avg_month.plot(kind='bar')
avg_day = train[['Sales', 'DayOfWeek']].groupby('DayOfWeek').mean()
avg_day.plot(kind='bar')
# sales by day of week
sale_dayofweek = pd.pivot_table(train, values='Sales', index=['Year','Store'], columns=['DayOfWeek'])

# sales by month
sale_month = pd.pivot_table(train, values='Sales', index=['Year','Store'], columns=['Month'])
sale_month[:5]

#sale_dayofweek.plot(kind='box')
#sale_month.plot(kind='box')


Out[23]:
Month 1 2 3 4 5 6 7 8 9 10 11 12
Year Store
2013 1 4142.935484 4473.964286 4682.870968 3933.633333 3813.709677 3780.366667 4130.064516 3816.774194 3628.900000 3698.096774 4253.200000 4682.838710
2 3715.193548 3968.357143 4210.483871 3895.933333 3754.161290 3862.366667 4476.709677 4037.516129 3969.366667 3882.645161 4287.600000 4845.612903
3 5343.645161 5720.250000 6102.548387 5739.600000 5495.935484 5682.066667 6658.903226 5751.677419 5443.000000 5428.290323 5652.666667 6915.064516
4 7571.387097 8231.178571 8242.387097 7566.033333 7589.806452 7840.900000 7357.935484 7685.903226 7318.333333 7254.935484 7964.600000 9184.677419
5 3530.387097 3611.642857 3769.677419 3538.033333 3557.806452 3509.333333 4899.741935 4063.516129 3677.933333 3735.645161 3985.566667 4758.580645

Place in Data Science Model

Our model will follow a typical data science procedure for batch data analysis: a cleaning phase that feeds into a predictive regression model for sales prediction.
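
As a rough illustration of that shape only (the feature list and the regressor below are placeholders, not a committed design), the pipeline might look like the following sketch, with a cleaning step feeding a scikit-learn regressor.


In [ ]:
# Hypothetical sketch of the cleaning -> regression pipeline.
# The feature list and RandomForestRegressor are placeholders, not final choices.
from sklearn.ensemble import RandomForestRegressor

features = ['Store', 'DayOfWeek', 'Open', 'Promo', 'SchoolHoliday', 'CompetitionDistance']

def clean(df):
    df = df.copy()
    # Make StateHoliday values consistently strings, as above.
    df.loc[df['StateHoliday'] == 0, 'StateHoliday'] = '0'
    # Fill missing competition distances with the median as a simple default.
    df['CompetitionDistance'] = df['CompetitionDistance'].fillna(df['CompetitionDistance'].median())
    return df

cleaned = clean(all_data)
model = RandomForestRegressor(n_estimators=10)
model.fit(cleaned[features], cleaned['Sales'])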

Next Steps

The next steps mainly involve deciding on a particular predictive model, through both empirical testing and a survey of the methods commonly used for time series regression.
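
For the empirical comparison we expect to score candidates with the competition's evaluation metric, which we believe is root mean square percentage error (RMSPE) with zero-sales days excluded. A small helper like the sketch below would let us compare models consistently on a hold-out split (the valid frame and fitted model referenced in the usage comment are hypothetical).


In [ ]:
# RMSPE as described on the competition page; rows with zero sales are ignored.
def rmspe(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    return np.sqrt(np.mean(((y_true[mask] - y_pred[mask]) / y_true[mask]) ** 2))

# Hypothetical usage with a fitted model and a hold-out frame named valid:
# print(rmspe(valid['Sales'], model.predict(valid[features])))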

Because our goal is to build a sales predictor, visualization of the results isn't a particular goal, though we plan to visualize and discuss major patterns in the data that our analysis finds, such as how running a promotion increases a store's sales by X, or the relationship between stores and their competitors. We will also include a visualization of our predictor's output if relevant.

