Overview

We are proposing to compete in the Kaggle competition setup by Rossman Stores to predict store sales for the next 6 weeks of business from previous business data.. All the information on the dataset is \textcolor{blue}{\href{https://www.kaggle.com/c/rossmann-store-sales/data}{here}}. There's information about competitors, daily sales grosses, etc.

Code Location

Our exploratory code is found \textcolor{blue}{\href{https://github.com/meissnereric/rossman_predictor}{on Eric's github}.}

Descriptive Statistics and Discussion



In [1]:

    
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
%pwd









    Out[1]:





u'/home/eric/dev/projects/data_science/rossman_predictor'

Our data came in three major files: train, test, and store. Training has the training data including daily sales and customers over a period of ~2 years (~1 million data points). , while test data has a period of 6 weeks for particular stores (not all of them, notably). Store has metadata about each store, which presumably affects the sales of a the stores. We first merge the store and training data together for ease of analysis, and display general information. We also clean the StateHoliday field for string/int happiness.



In [ ]:

    
#Partial code taken from (https://www.kaggle.com/mmourafiq/rossmann-store-sales/data-viz/notebook): 

train = pd.read_csv('data/train.csv')
print(train[:5])
store = pd.read_csv('data/store.csv')
print(store[:5])
all_data = pd.merge(train, store, on='Store', how='left')
test = pd.read_csv('data/test.csv')
all_data['StateHoliday'][all_data['StateHoliday'] == 0 ] = '0'

Basic stats of the data. Notice particularly the standard deviation is high for both sales and customers, ipmlyign the data is highly variable and reassuring us that a predictor is necessary.



In [15]:

    
all_data.describe()









    Out[15]:






  
    
      
      Store
      DayOfWeek
      Sales
      Customers
      Open
      Promo
      SchoolHoliday
      CompetitionDistance
      CompetitionOpenSinceMonth
      CompetitionOpenSinceYear
      Promo2
      Promo2SinceWeek
      Promo2SinceYear
    
  
  
    
      count
      1017209.000000
      1017209.000000
      1017209.000000
      1017209.000000
      1017209.000000
      1017209.000000
      1017209.000000
      1014567.000000
      693861.000000
      693861.000000
      1017209.000000
      509178.000000
      509178.000000
    
    
      mean
      558.429727
      3.998341
      5773.818972
      633.145946
      0.830107
      0.381515
      0.178647
      5430.085652
      7.222866
      2008.690228
      0.500564
      23.269093
      2011.752774
    
    
      std
      321.908651
      1.997391
      3849.926175
      464.411734
      0.375539
      0.485759
      0.383056
      7715.323700
      3.211832
      5.992644
      0.500000
      14.095973
      1.662870
    
    
      min
      1.000000
      1.000000
      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
      20.000000
      1.000000
      1900.000000
      0.000000
      1.000000
      2009.000000
    
    
      25%
      280.000000
      2.000000
      3727.000000
      405.000000
      1.000000
      0.000000
      0.000000
      710.000000
      4.000000
      2006.000000
      0.000000
      13.000000
      2011.000000
    
    
      50%
      558.000000
      4.000000
      5744.000000
      609.000000
      1.000000
      0.000000
      0.000000
      2330.000000
      8.000000
      2010.000000
      1.000000
      22.000000
      2012.000000
    
    
      75%
      838.000000
      6.000000
      7856.000000
      837.000000
      1.000000
      1.000000
      0.000000
      6890.000000
      10.000000
      2013.000000
      1.000000
      37.000000
      2013.000000
    
    
      max
      1115.000000
      7.000000
      41551.000000
      7388.000000
      1.000000
      1.000000
      1.000000
      75860.000000
      12.000000
      2015.000000
      1.000000
      50.000000
      2015.000000



In [16]:

    
# count open stores by week day
print(train.groupby(['DayOfWeek']).sum())
print(all_data[['DayOfWeek', 'Open', 'Sales', 'Customers']].groupby(['DayOfWeek', 'Open']).agg([np.sum, np.mean, np.std]))









    



              Store       Sales  Customers    Open  Promo  SchoolHoliday
DayOfWeek                                                               
1          80821168  1130203012  117675012  137560  77760          34060
2          81344288  1020411930  110848063  143961  77580          36595
3          81345276   954962863  105117642  141936  77580          34636
4          81443338   911177709  101732938  134644  77580          34747
5          81443338   980555941  108384820  138640  77580          36235
6          80821168   846317735   95103854  144058      0           2724
7          80821168    29551433    5179426    3593      0           2724
                     Sales                            Customers               \
                       sum         mean          std        sum         mean   
DayOfWeek Open                                                                 
1         0              0     0.000000     0.000000          0     0.000000   
          1     1130203012  8216.073074  3691.768817  117675012   855.444984   
2         0              0     0.000000     0.000000          0     0.000000   
          1     1020411930  7088.113656  3066.237108  110848063   769.986753   
3         0              0     0.000000     0.000000          0     0.000000   
          1      954962863  6728.122978  2781.775153  105117642   740.598876   
4         0              0     0.000000     0.000000          0     0.000000   
          1      911177709  6767.310159  2764.548064  101732938   755.569784   
5         0              0     0.000000     0.000000          0     0.000000   
          1      980555941  7072.677012  2764.915731  108384820   781.771639   
6         0              0     0.000000     0.000000          0     0.000000   
          1      846317735  5874.840238  2852.730271   95103854   660.177526   
7         0              0     0.000000     0.000000          0     0.000000   
          1       29551433  8224.723908  6235.152925    5179426  1441.532424   

                             
                        std  
DayOfWeek Open               
1         0        0.000000  
          1      420.044045  
2         0        0.000000  
          1      389.826885  
3         0        0.000000  
          1      372.526509  
4         0        0.000000  
          1      379.039576  
5         0        0.000000  
          1      383.003916  
6         0        0.000000  
          1      386.119153  
7         0        0.000000  
          1     1112.072839

Unsurprisingly, a state holiday strongly affects sales.



In [4]:

    
avg_stateholiday = all_data[['Sales', 'Customers', 'StateHoliday']].groupby('StateHoliday').mean()
avg_stateholiday.plot(kind='bar')









    Out[4]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f950d056190>

Notably, having a promotion running for a particular day increases sales by quite a large amount, while not increasing customers by nearly the same rate. This implies people are spending more during those days, instead of having simply higher rates of customers.



In [5]:

    
avg_promotion = all_data[['Sales', 'Customers', 'Promo']].groupby('Promo').mean()
avg_promotion.plot(kind='bar')









    Out[5]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f950cfec0d0>

Competition distances are clustered around having close competitors, and in general it doesn't appear to affect sales strongly.



In [17]:

    
all_data[['CompetitionDistance', 'Sales']].plot(kind='scatter', x='CompetitionDistance', y='Sales')
all_data.hist('CompetitionDistance')









    Out[17]:





array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f950c90da10>]], dtype=object)



In [8]:

    
# Bin the competition distance with 10 bins
bins = np.linspace(all_data['CompetitionDistance'].min(), all_data.CompetitionDistance.max(), 10)

competition_bins = all_data[['Sales', 'Customers']].groupby(pd.cut(all_data['CompetitionDistance'], bins))
competition_avg = competition_bins.mean()
competition_avg.plot(kind='bar')









    Out[8]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f9507ec8150>

Sundays have almost no sales.

Sales peak in July and December, the peak of summer and Christmas season (The data is from Rossman stores, which are based in Germany so Christmas would be a major holiday.)



In [23]:

    
#Done By Neal
train['Date'] = pd.to_datetime(train['Date'])
train[:5]['Date'].dt.dayofweek
train['DayOfWeek'] = train['Date'].dt.dayofweek
train['Month'] = train['Date'].dt.month
train['Year'] = train['Date'].dt.year
avg_month = train[['Sales', 'Month']].groupby('Month').mean()
avg_month.plot(kind='bar')
avg_day = train[['Sales', 'DayOfWeek']].groupby('DayOfWeek').mean()
avg_day.plot(kind='bar')
#sales by day of week
sale_dayofweek = pd.pivot_table(train, values='Sales', index=['Year','Store'], columns=['DayOfWeek'])

#sales by month
sale_month = pd.pivot_table(train, values='Sales', index=['Year','Store'], columns=['Month'])
sale_month[:5]

#sale_dayofweek.plot(kind='box')
#sale_month.plot(kind='box')









    Out[23]:






  
    
      
      Month
      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      11
      12
    
    
      Year
      Store
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      2013
      1
      4142.935484
      4473.964286
      4682.870968
      3933.633333
      3813.709677
      3780.366667
      4130.064516
      3816.774194
      3628.900000
      3698.096774
      4253.200000
      4682.838710
    
    
      2
      3715.193548
      3968.357143
      4210.483871
      3895.933333
      3754.161290
      3862.366667
      4476.709677
      4037.516129
      3969.366667
      3882.645161
      4287.600000
      4845.612903
    
    
      3
      5343.645161
      5720.250000
      6102.548387
      5739.600000
      5495.935484
      5682.066667
      6658.903226
      5751.677419
      5443.000000
      5428.290323
      5652.666667
      6915.064516
    
    
      4
      7571.387097
      8231.178571
      8242.387097
      7566.033333
      7589.806452
      7840.900000
      7357.935484
      7685.903226
      7318.333333
      7254.935484
      7964.600000
      9184.677419
    
    
      5
      3530.387097
      3611.642857
      3769.677419
      3538.033333
      3557.806452
      3509.333333
      4899.741935
      4063.516129
      3677.933333
      3735.645161
      3985.566667
      4758.580645

Place in Data Science Model

Our model will use a typical data science procedure for batch data analysis including a cleaning phase that feeds into a predictive regression model for sales prediction.

Next Steps

The next steps mainly include deciding on a particular predictive mode, through both empirical testing and looking at what popular methods for time series regression are used in the field.

Because our goal is to build a sales predictor, visualization of the results isn't a particular goal, though we plan to visualize and discuss major patterns in the data that our analysis finds, such as having a store with a promotion increasing sales by X or discussing the relation of stores to their competitors. We will also include a visualization of our predictor if relevant.



In [ ]:

	Store	DayOfWeek	Sales	Customers	Open	Promo	SchoolHoliday	CompetitionDistance	CompetitionOpenSinceMonth	CompetitionOpenSinceYear	Promo2	Promo2SinceWeek	Promo2SinceYear
count	1017209.000000	1017209.000000	1017209.000000	1017209.000000	1017209.000000	1017209.000000	1017209.000000	1014567.000000	693861.000000	693861.000000	1017209.000000	509178.000000	509178.000000
mean	558.429727	3.998341	5773.818972	633.145946	0.830107	0.381515	0.178647	5430.085652	7.222866	2008.690228	0.500564	23.269093	2011.752774
std	321.908651	1.997391	3849.926175	464.411734	0.375539	0.485759	0.383056	7715.323700	3.211832	5.992644	0.500000	14.095973	1.662870
min	1.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	20.000000	1.000000	1900.000000	0.000000	1.000000	2009.000000
25%	280.000000	2.000000	3727.000000	405.000000	1.000000	0.000000	0.000000	710.000000	4.000000	2006.000000	0.000000	13.000000	2011.000000
50%	558.000000	4.000000	5744.000000	609.000000	1.000000	0.000000	0.000000	2330.000000	8.000000	2010.000000	1.000000	22.000000	2012.000000
75%	838.000000	6.000000	7856.000000	837.000000	1.000000	1.000000	0.000000	6890.000000	10.000000	2013.000000	1.000000	37.000000	2013.000000
max	1115.000000	7.000000	41551.000000	7388.000000	1.000000	1.000000	1.000000	75860.000000	12.000000	2015.000000	1.000000	50.000000	2015.000000

	Month	1	2	3	4	5	6	7	8	9	10	11	12
Year	Store
2013	1	4142.935484	4473.964286	4682.870968	3933.633333	3813.709677	3780.366667	4130.064516	3816.774194	3628.900000	3698.096774	4253.200000	4682.838710
	2	3715.193548	3968.357143	4210.483871	3895.933333	3754.161290	3862.366667	4476.709677	4037.516129	3969.366667	3882.645161	4287.600000	4845.612903
	3	5343.645161	5720.250000	6102.548387	5739.600000	5495.935484	5682.066667	6658.903226	5751.677419	5443.000000	5428.290323	5652.666667	6915.064516
	4	7571.387097	8231.178571	8242.387097	7566.033333	7589.806452	7840.900000	7357.935484	7685.903226	7318.333333	7254.935484	7964.600000	9184.677419
	5	3530.387097	3611.642857	3769.677419	3538.033333	3557.806452	3509.333333	4899.741935	4063.516129	3677.933333	3735.645161	3985.566667	4758.580645