``````

In [1]:

%matplotlib inline
import numpy as np
import pandas as pd
from scipy.stats import probplot
from matplotlib import pyplot as plt

``````

Most of the fields are self-explanatory. The following are descriptions for those that aren't.

• Id - an Id that represents a (Store, Date) duple within the test set
• Store - a unique Id for each store
• Sales - the turnover for any given day (this is what you are predicting)
• Customers - the number of customers on a given day
• Open - an indicator for whether the store was open: 0 = closed, 1 = open
• StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
• SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
• StoreType - differentiates between 4 different store models: a, b, c, d
• Assortment - describes an assortment level: a = basic, b = extra, c = extended
• CompetitionDistance - distance in meters to the nearest competitor store
• CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
• Promo - indicates whether a store is running a promo on that day
• Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
• Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
• PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store
``````

In [2]:

!unzip -o '*.csv.zip'
!rm *.zip

``````
``````

unzip:  cannot find or open *.csv.zip, *.csv.zip.zip or *.csv.zip.ZIP.

No zipfiles found.
rm: *.zip: No such file or directory

``````
``````

In [3]:

ls

``````
``````

explore.ipynb          sample_submission.csv  test.csv
rf1.csv                store.csv              train.csv

``````
``````

In [4]:

``````
``````

"Store","DayOfWeek","Date","Sales","Customers","Open","Promo","StateHoliday","SchoolHoliday"
1,5,2015-07-31,5263,555,1,1,"0","1"
2,5,2015-07-31,6064,625,1,1,"0","1"
3,5,2015-07-31,8314,821,1,1,"0","1"
4,5,2015-07-31,13995,1498,1,1,"0","1"
5,5,2015-07-31,4822,559,1,1,"0","1"
6,5,2015-07-31,5651,589,1,1,"0","1"
7,5,2015-07-31,15344,1414,1,1,"0","1"
8,5,2015-07-31,8492,833,1,1,"0","1"
9,5,2015-07-31,8565,687,1,1,"0","1"

``````
``````

In [5]:

for c in df.columns:
v = ["2015-07-31", 'etc.'] if c == 'Date' else np.sort(df[c].unique())
hasna = df[c].isnull().any()
print "{}: {}\n  has NA values: {}\n".format(c, v, hasna)

``````
``````

Store: [   1    2    3 ..., 1113 1114 1115]
has NA values: False

DayOfWeek: [1 2 3 4 5 6 7]
has NA values: False

Date: ['2015-07-31', 'etc.']
has NA values: False

Sales: [    0    46   124 ..., 38484 38722 41551]
has NA values: False

Customers: [   0    3    5 ..., 5458 5494 7388]
has NA values: False

Open: [0 1]
has NA values: False

Promo: [0 1]
has NA values: False

StateHoliday: [0 '0' 'a' 'b' 'c']
has NA values: False

SchoolHoliday: [0 1]
has NA values: False

/Users/jimbijwaard/anaconda/lib/python2.7/site-packages/IPython/core/interactiveshell.py:2902: DtypeWarning: Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)

``````
``````

In [74]:

data_types = {
"Store": int,
"DayOfWeek": int,       # {1, 2, 3, 4, 5, 6, 7}
"Date": object,         # e.g. "2015-07-31"
"Sales": int,
"Customers": int,
"Open": bool,
"Promo": bool,
"StateHoliday": str,    # {'0', 'a', 'b', 'c'}
"SchoolHoliday": bool,
}

``````
``````

In [6]:

``````
``````

"Id","Store","DayOfWeek","Date","Open","Promo","StateHoliday","SchoolHoliday"
1,1,4,2015-09-17,1,1,"0","0"
2,3,4,2015-09-17,1,1,"0","0"
3,7,4,2015-09-17,1,1,"0","0"
4,8,4,2015-09-17,1,1,"0","0"
5,9,4,2015-09-17,1,1,"0","0"
6,10,4,2015-09-17,1,1,"0","0"
7,11,4,2015-09-17,1,1,"0","0"
8,12,4,2015-09-17,1,1,"0","0"
9,13,4,2015-09-17,1,1,"0","0"

``````
``````

In [93]:

for c in df.columns:
v =  np.sort(df[c].unique())
hasna = df[c].isnull().any()
print "{}: {}\n  dtype: {}\n  has NA values: {}\n".format(c, v, df[c].dtype, hasna)

``````
``````

Id: [    1     2     3 ..., 41086 41087 41088]
dtype: int64
has NA values: False

Store: [   1    3    7    8    9   10   11   12   13   14   15   16   19   20   21
22   23   24   25   27   29   30   31   32   33   35   36   38   39   40
41   42   43   45   46   47   48   49   50   51   52   53   56   58   61
62   63   64   66   67   68   69   70   71   72   73   74   75   76   77
79   80   81   82   83   84   86   89   90   91   92   93   94   98   99
100  101  102  105  107  108  109  110  111  112  113  115  117  118  119
120  122  124  126  127  128  129  130  131  132  135  136  137  139  140
141  142  143  144  145  146  147  149  150  152  153  154  155  157  158
159  160  161  162  164  165  166  168  169  170  171  172  174  175  176
179  180  181  182  183  184  186  187  188  189  190  191  192  193  194
195  197  199  200  201  202  204  206  207  209  210  212  213  214  215
216  217  218  219  220  221  224  226  227  228  229  230  231  233  234
235  238  239  241  242  243  244  245  246  247  248  249  250  251  252
253  254  255  256  258  259  262  263  265  267  268  269  272  273  274
275  277  278  279  280  281  283  284  285  287  288  289  290  294  295
296  297  298  299  300  301  302  303  304  305  306  308  309  310  311
312  314  315  316  317  319  320  323  325  326  328  329  330  331  333
334  335  336  337  338  339  340  341  342  343  344  345  346  347  348
350  351  352  353  354  355  356  358  359  362  364  365  367  368  369
370  371  372  373  377  378  379  380  383  385  386  387  388  389  391
392  393  394  395  397  398  399  403  404  405  406  407  408  409  410
411  412  413  414  415  416  417  418  420  421  422  424  425  426  427
428  429  430  431  432  433  434  435  440  441  442  445  446  447  448
449  450  451  452  453  455  456  457  458  459  461  463  465  466  467
468  470  471  472  473  475  477  481  484  485  486  487  488  490  491
492  493  495  497  498  499  500  501  502  504  505  506  507  508  509
510  511  512  514  515  516  517  518  519  520  521  522  524  527  528
529  530  531  532  533  534  535  536  537  538  539  540  541  542  543
545  547  548  549  550  551  552  553  554  555  557  558  561  562  563
564  565  566  567  568  570  571  572  573  574  575  577  578  579  580
581  582  584  585  586  587  588  589  590  591  592  593  597  598  600
601  602  603  604  605  610  611  612  615  616  618  619  620  621  622
623  624  625  627  628  629  631  632  633  636  637  638  639  640  641
642  643  644  645  646  647  650  651  653  655  656  657  658  659  660
661  662  663  665  666  667  669  670  671  673  674  675  676  677  678
680  681  684  685  687  689  690  691  692  693  694  695  696  697  699
700  701  702  703  705  706  707  710  711  712  713  714  716  717  718
719  720  721  722  723  724  725  727  728  729  731  732  733  734  736
737  738  739  740  741  742  744  746  748  749  750  751  752  753  756
757  758  759  762  763  764  765  766  767  768  769  770  771  772  773
774  775  776  777  778  782  784  785  789  790  791  792  793  795  796
797  799  800  801  802  803  804  805  806  807  809  810  811  813  815
816  818  819  820  822  823  824  825  826  829  831  832  833  835  837
840  842  844  845  846  847  848  849  850  851  852  853  855  856  857
858  859  860  861  862  863  864  865  866  867  868  871  872  874  875
877  879  880  881  882  883  884  885  886  887  888  890  891  893  894
895  896  897  900  901  902  903  904  905  906  907  908  909  911  912
913  914  915  916  917  919  920  922  924  925  926  927  928  929  930
931  932  934  935  936  937  938  939  941  942  943  944  945  946  947
948  950  951  952  954  955  956  960  961  962  964  965  966  967  969
970  973  974  975  976  977  980  983  984  985  986  988  989  991  992
994  997  998 1000 1003 1004 1005 1007 1008 1009 1010 1011 1012 1013 1014
1015 1016 1019 1020 1022 1024 1025 1026 1027 1028 1031 1036 1037 1038 1039
1040 1041 1042 1044 1045 1047 1048 1049 1050 1051 1052 1053 1054 1056 1057
1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1070 1071 1072 1073
1076 1077 1078 1079 1080 1083 1084 1086 1087 1088 1089 1091 1092 1094 1096
1097 1099 1100 1101 1102 1103 1104 1105 1106 1107 1109 1111 1112 1113 1114
1115]
dtype: int64
has NA values: False

DayOfWeek: [1 2 3 4 5 6 7]
dtype: int64
has NA values: False

Date: ['2015-08-01' '2015-08-02' '2015-08-03' '2015-08-04' '2015-08-05'
'2015-08-06' '2015-08-07' '2015-08-08' '2015-08-09' '2015-08-10'
'2015-08-11' '2015-08-12' '2015-08-13' '2015-08-14' '2015-08-15'
'2015-08-16' '2015-08-17' '2015-08-18' '2015-08-19' '2015-08-20'
'2015-08-21' '2015-08-22' '2015-08-23' '2015-08-24' '2015-08-25'
'2015-08-26' '2015-08-27' '2015-08-28' '2015-08-29' '2015-08-30'
'2015-08-31' '2015-09-01' '2015-09-02' '2015-09-03' '2015-09-04'
'2015-09-05' '2015-09-06' '2015-09-07' '2015-09-08' '2015-09-09'
'2015-09-10' '2015-09-11' '2015-09-12' '2015-09-13' '2015-09-14'
'2015-09-15' '2015-09-16' '2015-09-17']
dtype: object
has NA values: False

Open: [  0.   1.  nan]
dtype: float64
has NA values: True

Promo: [0 1]
dtype: int64
has NA values: False

StateHoliday: ['0' 'a']
dtype: object
has NA values: False

SchoolHoliday: [0 1]
dtype: int64
has NA values: False

``````
``````

In [61]:

``````
``````

"Store","StoreType","Assortment","CompetitionDistance","CompetitionOpenSinceMonth","CompetitionOpenSinceYear","Promo2","Promo2SinceWeek","Promo2SinceYear","PromoInterval"
1,"c","a",1270,9,2008,0,,,""
2,"a","a",570,11,2007,1,13,2010,"Jan,Apr,Jul,Oct"
3,"a","a",14130,12,2006,1,14,2011,"Jan,Apr,Jul,Oct"
4,"c","c",620,9,2009,0,,,""
5,"a","a",29910,4,2015,0,,,""
6,"a","a",310,12,2013,0,,,""
7,"a","c",24000,4,2013,0,,,""
8,"a","a",7520,10,2014,0,,,""
9,"a","c",2030,8,2000,0,,,""

``````
``````

In [98]:

for c in df.columns:
v = [20, 30, 'etc.', np.nan] if c == 'CompetitionDistance' else np.sort(df[c].unique())
print "{}: {}\n  dtype: {}\n  has NA values: {}\n".format(c, v, df[c].dtype, df[c].isnull().any())

``````
``````

Store: [   1    2    3 ..., 1113 1114 1115]
dtype: int64
has NA values: False

StoreType: ['a' 'b' 'c' 'd']
dtype: object
has NA values: False

Assortment: ['a' 'b' 'c']
dtype: object
has NA values: False

CompetitionDistance: [20, 30, 'etc.', nan]
dtype: float64
has NA values: True

CompetitionOpenSinceMonth: [  1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.  12.  nan]
dtype: float64
has NA values: True

CompetitionOpenSinceYear: [ 1900.  1961.  1990.  1994.  1995.  1998.  1999.  2000.  2001.  2002.
2003.  2004.  2005.  2006.  2007.  2008.  2009.  2010.  2011.  2012.
2013.  2014.  2015.    nan]
dtype: float64
has NA values: True

Promo2: [0 1]
dtype: int64
has NA values: False

Promo2SinceWeek: [  1.   5.   6.   9.  10.  13.  14.  18.  22.  23.  26.  27.  28.  31.  35.
36.  37.  39.  40.  44.  45.  48.  49.  50.  nan]
dtype: float64
has NA values: True

Promo2SinceYear: [ 2009.  2010.  2011.  2012.  2013.  2014.  2015.    nan]
dtype: float64
has NA values: True

PromoInterval: [nan 'Feb,May,Aug,Nov' 'Jan,Apr,Jul,Oct' 'Mar,Jun,Sept,Dec']
dtype: object
has NA values: True

``````
``````

In [120]:

# check consistency of 'Promo2' flag
for c in ('PromoInterval', 'Promo2SinceWeek', 'Promo2SinceYear'):
print (df.Promo2[df[c].isnull()].unique(), df.Promo2[~df[c].isnull()].unique()) == ([0], [1])

``````
``````

True
True
True

``````
``````

In [121]:

print all(df['CompetitionDistance'].isnull() == df['CompetitionOpenSinceYear'].isnull())
print all(df['CompetitionDistance'].isnull() == df['CompetitionOpenSinceMonth'].isnull())
print all(df['CompetitionOpenSinceMonth'].isnull() == df['CompetitionOpenSinceYear'].isnull())

``````
``````

False
False
True

``````
``````

In [122]:

``````
``````

Out[122]:

Store
StoreType
Assortment
CompetitionDistance
CompetitionOpenSinceMonth
CompetitionOpenSinceYear
Promo2
Promo2SinceWeek
Promo2SinceYear
PromoInterval

11
12
a
c
1070
NaN
NaN
1
13
2010
Jan,Apr,Jul,Oct

12
13
d
a
310
NaN
NaN
1
45
2009
Feb,May,Aug,Nov

15
16
a
c
3270
NaN
NaN
0
NaN
NaN
NaN

18
19
a
c
3240
NaN
NaN
1
22
2011
Mar,Jun,Sept,Dec

21
22
a
a
1040
NaN
NaN
1
22
2012
Jan,Apr,Jul,Oct

25
26
d
a
2300
NaN
NaN
0
NaN
NaN
NaN

28
29
d
c
2170
NaN
NaN
0
NaN
NaN
NaN

31
32
a
a
2910
NaN
NaN
1
45
2009
Feb,May,Aug,Nov

39
40
a
a
180
NaN
NaN
1
45
2009
Feb,May,Aug,Nov

40
41
d
c
1180
NaN
NaN
1
31
2013
Jan,Apr,Jul,Oct

41
42
a
c
290
NaN
NaN
1
40
2011
Jan,Apr,Jul,Oct

42
43
d
a
4880
NaN
NaN
1
37
2009
Jan,Apr,Jul,Oct

61
62
a
a
2050
NaN
NaN
0
NaN
NaN
NaN

63
64
d
c
22560
NaN
NaN
1
14
2013
Jan,Apr,Jul,Oct

65
66
d
a
7660
NaN
NaN
1
37
2009
Jan,Apr,Jul,Oct

67
68
a
c
250
NaN
NaN
1
35
2012
Mar,Jun,Sept,Dec

68
69
c
c
1130
NaN
NaN
1
40
2011
Jan,Apr,Jul,Oct

69
70
c
c
4840
NaN
NaN
0
NaN
NaN
NaN

73
74
a
a
330
NaN
NaN
0
NaN
NaN
NaN

78
79
a
a
3320
NaN
NaN
0
NaN
NaN
NaN

79
80
d
a
7910
NaN
NaN
0
NaN
NaN
NaN

82
83
a
a
2710
NaN
NaN
0
NaN
NaN
NaN

90
91
c
a
2410
NaN
NaN
1
35
2011
Jan,Apr,Jul,Oct

91
92
c
a
240
NaN
NaN
1
45
2009
Feb,May,Aug,Nov

92
93
a
a
16690
NaN
NaN
1
14
2011
Jan,Apr,Jul,Oct

93
94
d
c
14620
NaN
NaN
0
NaN
NaN
NaN

96
97
d
c
8980
NaN
NaN
0
NaN
NaN
NaN

99
100
d
a
17930
NaN
NaN
0
NaN
NaN
NaN

100
101
d
c
2440
NaN
NaN
1
22
2012
Mar,Jun,Sept,Dec

104
105
a
c
6190
NaN
NaN
1
23
2013
Mar,Jun,Sept,Dec

...
...
...
...
...
...
...
...
...
...
...

1035
1036
d
c
9560
NaN
NaN
1
36
2013
Jan,Apr,Jul,Oct

1036
1037
a
c
150
NaN
NaN
0
NaN
NaN
NaN

1041
1042
a
a
3440
NaN
NaN
1
31
2013
Feb,May,Aug,Nov

1046
1047
a
a
3750
NaN
NaN
1
45
2009
Feb,May,Aug,Nov

1051
1052
a
c
5080
NaN
NaN
1
31
2013
Feb,May,Aug,Nov

1053
1054
a
c
13190
NaN
NaN
1
45
2013
Feb,May,Aug,Nov

1055
1056
d
c
5350
NaN
NaN
1
40
2012
Jan,Apr,Jul,Oct

1057
1058
a
c
180
NaN
NaN
1
35
2010
Mar,Jun,Sept,Dec

1059
1060
a
c
3430
NaN
NaN
1
31
2013
Feb,May,Aug,Nov

1060
1061
d
c
8110
NaN
NaN
0
NaN
NaN
NaN

1062
1063
a
c
6250
NaN
NaN
0
NaN
NaN
NaN

1063
1064
a
c
420
NaN
NaN
0
NaN
NaN
NaN

1064
1065
a
a
1290
NaN
NaN
1
35
2011
Mar,Jun,Sept,Dec

1065
1066
a
a
3350
NaN
NaN
0
NaN
NaN
NaN

1067
1068
d
c
5010
NaN
NaN
1
5
2013
Jan,Apr,Jul,Oct

1068
1069
a
c
18050
NaN
NaN
1
14
2011
Jan,Apr,Jul,Oct

1072
1073
a
c
1710
NaN
NaN
1
44
2012
Jan,Apr,Jul,Oct

1075
1076
a
c
90
NaN
NaN
1
1
2013
Jan,Apr,Jul,Oct

1077
1078
d
c
670
NaN
NaN
1
40
2011
Jan,Apr,Jul,Oct

1078
1079
a
a
16680
NaN
NaN
1
37
2009
Jan,Apr,Jul,Oct

1079
1080
a
a
2410
NaN
NaN
1
40
2014
Jan,Apr,Jul,Oct

1082
1083
d
c
11540
NaN
NaN
1
5
2013
Feb,May,Aug,Nov

1083
1084
a
a
190
NaN
NaN
1
13
2010
Jan,Apr,Jul,Oct

1089
1090
a
a
330
NaN
NaN
1
14
2011
Jan,Apr,Jul,Oct

1090
1091
a
c
9990
NaN
NaN
0
NaN
NaN
NaN

1095
1096
a
c
1130
NaN
NaN
1
10
2014
Mar,Jun,Sept,Dec

1099
1100
a
a
540
NaN
NaN
1
14
2011
Jan,Apr,Jul,Oct

1112
1113
a
c
9260
NaN
NaN
0
NaN
NaN
NaN

1113
1114
a
c
870
NaN
NaN
0
NaN
NaN
NaN

1114
1115
d
c
5350
NaN
NaN
1
22
2012
Mar,Jun,Sept,Dec

351 rows × 10 columns

``````
``````

In [77]:

data_types = {
"Store": int,
"StoreType": str,
"Assortment": str,
"CompetitionDistance": float,
"CompetitionOpenSinceMonth": float,
"CompetitionOpenSinceYear": float,
"Promo2": bool,
"Promo2SinceWeek": float,
"Promo2SinceYear": float,
"PromoInterval": object,
}

``````

Explore target

``````

In [45]:

``````
``````

Out[45]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f1720175690>

``````

This looks like a log-normal distribution...

``````

In [51]:

``````
``````

Out[51]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f170d683250>

``````

Check QQ plot to see how well this is modeled by a Gaussian distribution:

``````

In [60]:

``````
``````

Out[60]:

((array([-4.79314476, -4.61214391, -4.51425117, ...,  4.51425117,
4.61214391,  4.79314476]),
array([ 1.66275783,  2.09342169,  2.12385164, ...,  4.58528021,
4.58795778,  4.61858148])),
(0.18438248551453371, 3.8033618126742885, 0.99829684200204238))

``````

Explore features

``````

In [ ]:

``````