Arundo's Take home challenge

Given Data:

Arundo_take_home_challenge_training_set.csv
Arundo_take_home_challenge_test_set.csv

Task:

Arundo_take_home_challenge_training_set.csv will be used to train a model to predict request_count (target variable).
Predict request_count that is missing in Arundo_take_home_challenge_test_set.csv.

Note: Arundo_take_home_challenge_training_set.csv though will be loaded as a test_data (considering the name of the given file) should not be confused with the test data or validation data of the ML model. Validation/test data will actually come by splitting Arundo_take_home_challenge_training_set.csv.

The problem is solved in Jupyter's notebook with following main steps

Data Visualization
Features selection and preprocessing
Testing with different ML models
Output csv files with predicted request_values using the trained ML models



In [2]:

    
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn import svm
import matplotlib.pyplot as plt
import seaborn as sns
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import np_utils
import time









    



Using TensorFlow backend.

Read the given dataset



In [4]:

    
#ALso remember to parse the date column. This will be helpful in the next step
data=pd.read_csv('Arundo_take_home_challenge_training_set.csv',sep=',',parse_dates=['date'])
#Have a look at the data
data.head(15)









    Out[4]:







  
    
      
      date
      calendar_code
      request_count
      site_count
      max_temp
      min_temp
      precipitation
      events
    
  
  
    
      0
      2014-09-01
      0.0
      165
      6
      30.6
      22.8
      0.0
      Rain
    
    
      1
      2014-09-02
      1.0
      138
      7
      32.8
      22.8
      15.5
      Rain-Thunderstorm
    
    
      2
      2014-09-03
      1.0
      127
      7
      29.4
      18.3
      0.0
      None
    
    
      3
      2014-09-04
      1.0
      174
      7
      29.4
      17.2
      0.0
      None
    
    
      4
      2014-09-05
      1.0
      196
      7
      30.6
      21.7
      0.0
      Fog
    
    
      5
      2014-09-06
      1.0
      314
      7
      32.8
      22.2
      9.9
      Rain-Thunderstorm
    
    
      6
      2014-09-07
      1.0
      156
      7
      27.2
      17.8
      0.0
      None
    
    
      7
      2014-09-08
      1.0
      150
      7
      23.3
      17.8
      0.0
      None
    
    
      8
      2014-09-09
      1.0
      137
      7
      24.4
      20.6
      0.0
      None
    
    
      9
      2014-09-10
      1.0
      156
      7
      26.7
      17.8
      0.0
      None
    
    
      10
      2014-09-11
      1.0
      135
      7
      30.0
      20.6
      0.0
      None
    
    
      11
      2014-09-12
      1.0
      187
      7
      24.4
      15.6
      0.0
      None
    
    
      12
      2014-09-13
      1.0
      342
      7
      21.1
      12.8
      11.7
      Rain
    
    
      13
      2014-09-14
      1.0
      202
      7
      20.6
      9.4
      0.0
      None
    
    
      14
      2014-09-15
      1.0
      141
      7
      22.8
      8.3
      0.0
      None



In [5]:

    
data.tail(5)









    Out[5]:







  
    
      
      date
      calendar_code
      request_count
      site_count
      max_temp
      min_temp
      precipitation
      events
    
  
  
    
      147
      2015-02-26
      1.0
      378
      15
      0.6
      -3.9
      0.3
      Snow
    
    
      148
      2015-02-27
      1.0
      556
      16
      -1.7
      -6.7
      0.3
      Snow
    
    
      149
      2015-02-28
      1.0
      570
      14
      -0.6
      -10.0
      0.0
      None
    
    
      150
      2015-03-01
      1.0
      615
      16
      0.0
      -8.3
      13.7
      Rain-Snow
    
    
      151
      2015-03-02
      1.0
      369
      16
      3.3
      -2.8
      0.0
      None



In [6]:

    
print(data.isnull().any())









    



date             False
calendar_code    False
request_count    False
site_count       False
max_temp         False
min_temp         False
precipitation    False
events           False
dtype: bool

Quick observations

Daily data related to weather variations
Covers mainly the winter months
A better visualization of correlation of request_count needed with other variables
No nulls

Let us try to get a better insight into the data. First let us have a look at the dependence of request counts on the float variables.



In [4]:

    
data.hist('max_temp',weights=data['request_count'])
data.hist('min_temp',weights=data['request_count'])
data.hist('precipitation',weights=data['request_count'])
plt.show()

From the above histograms we see most of the requests comes when

maximum temperature is below 10C
min temperature is below 2C
When there is zero precipitation

All in all the distribtion of request count is strongly correlated to floating variables and hence all floating variables are to be considered as features. Let's now evaluate the correlation of two variables 'events' and 'calendar code' with request_count which clearly are categorical variables.



In [5]:

    
#Sort request_count with events  
data.groupby('events').request_count.agg(['mean','max','min']).plot(kind='bar')
plt.show()
data.groupby('events').request_count.agg(['count','mean','max','min'])









    












    Out[5]:







  
    
      
      count
      mean
      max
      min
    
    
      events
      
      
      
      
    
  
  
    
      Fog
      1
      196.000000
      196
      196
    
    
      Fog-Rain
      5
      206.400000
      383
      27
    
    
      Fog-Rain-Snow
      1
      210.000000
      210
      210
    
    
      Fog-Snow
      3
      355.000000
      531
      213
    
    
      None
      89
      246.662921
      570
      0
    
    
      Rain
      31
      256.483871
      473
      0
    
    
      Rain-Snow
      6
      334.833333
      615
      137
    
    
      Rain-Thunderstorm
      2
      226.000000
      314
      138
    
    
      Snow
      14
      365.642857
      593
      247

Clearly, support request comes in more when the weather condition is overcast which is understandable. But not many data instances are available in most of the weather events except when the events are 'None', 'Rain' and 'Snow' which will make it challenging to split the data honestly, train the ML model and test the accuracy.



In [6]:

    
# Now sort request_count with calendar code
data.groupby('calendar_code').request_count.agg(['mean','max','min']).plot(kind='bar')
plt.show()
data.groupby('calendar_code').request_count.agg(['count','mean','max','min'])









    












    Out[6]:







  
    
      
      count
      mean
      max
      min
    
    
      calendar_code
      
      
      
      
    
  
  
    
      0.0
      63
      228.555556
      410
      0
    
    
      1.0
      89
      287.505618
      615
      101

Calendar code probably refers to the intensity of weather variations in a single day. The distribution of calendar code behavior would be interesting to see.

We further use violin plot for dependence on the categorical variables (https://blog.modeanalytics.com/violin-plot-examples/)



In [7]:

    
var_name = "events"
col_order = np.sort(data[var_name].unique()).tolist()
plt.figure(figsize=(16,6))
sns.violinplot(x=var_name, y='request_count', data=data, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of request count with "+var_name, fontsize=15)
plt.show()

For events with less data points, it is difficult to to see the distribution which will eventually might reflect as a error in the train model or else would be difficult or not possible to split into taining, test and validation data sets. Certainly, more data points corresponding to these events would help in better trained model for these weather event.



In [9]:

    
var_name = "calendar_code"
col_order = np.sort(data[var_name].unique()).tolist()
plt.figure(figsize=(16,6))
sns.violinplot(x=var_name, y='request_count', data=data, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of request count with "+var_name, fontsize=15)
plt.show()

Significant distribution of request count over calendar code is visible and hence would be the part of feature metrix.

Next, lets analayze the impact of date. The dates covers mainly the winter months and may not be represented well if it is used as it is given. I realized may be the site maintenence is dependant on the working day or weekend instead. We start by adding an additional column with the week day (0: Monday, ... 6: Sunday)



In [10]:

    
data['day_of_week'] = data['date'].dt.dayofweek
data['week_day'] = data['date'].dt.weekday_name
data.head()









    Out[10]:







  
    
      
      date
      calendar_code
      request_count
      site_count
      max_temp
      min_temp
      precipitation
      events
      day_of_week
      week_day
    
  
  
    
      0
      2014-09-01
      0.0
      165
      6
      30.6
      22.8
      0.0
      Rain
      0
      Monday
    
    
      1
      2014-09-02
      1.0
      138
      7
      32.8
      22.8
      15.5
      Rain-Thunderstorm
      1
      Tuesday
    
    
      2
      2014-09-03
      1.0
      127
      7
      29.4
      18.3
      0.0
      None
      2
      Wednesday
    
    
      3
      2014-09-04
      1.0
      174
      7
      29.4
      17.2
      0.0
      None
      3
      Thursday
    
    
      4
      2014-09-05
      1.0
      196
      7
      30.6
      21.7
      0.0
      Fog
      4
      Friday



In [11]:

    
# We again choose the groupby and violin plots to see the underlying behaviour
data.groupby('week_day').request_count.agg(['mean','max','min']).plot(kind='bar')
plt.show()
data.groupby('week_day').request_count.agg(['count','mean','max','min'])

Clearly, Weekends (Friday-Sunday) are the most active period while Monday-Thursday show nearly uniform mean request_count. This indicates the best way to reflect the effect of date as numeric feature would be through the week days.

Next, we convert the events into some unique identifiers (integers). This will result in an additional column "events_code".



In [12]:

    
data['events_code'] = pd.Categorical(data["events"]).codes
data.head()









    Out[12]:







  
    
      
      date
      calendar_code
      request_count
      site_count
      max_temp
      min_temp
      precipitation
      events
      day_of_week
      week_day
      events_code
    
  
  
    
      0
      2014-09-01
      0.0
      165
      6
      30.6
      22.8
      0.0
      Rain
      0
      Monday
      5
    
    
      1
      2014-09-02
      1.0
      138
      7
      32.8
      22.8
      15.5
      Rain-Thunderstorm
      1
      Tuesday
      7
    
    
      2
      2014-09-03
      1.0
      127
      7
      29.4
      18.3
      0.0
      None
      2
      Wednesday
      4
    
    
      3
      2014-09-04
      1.0
      174
      7
      29.4
      17.2
      0.0
      None
      3
      Thursday
      4
    
    
      4
      2014-09-05
      1.0
      196
      7
      30.6
      21.7
      0.0
      Fog
      4
      Friday
      0

Since request_count is the target variable, we store it separately as "y" for ML model



In [13]:

    
y=data["request_count"]
print("Shape of y ",y.shape)









    



Shape of y  (152,)

Drop the redundant columns now "date","events","request_count" and week_day.



In [14]:

    
data_orig = data #Save data in data_orig before droping reduntant variables
data = data.drop(["date","events","request_count","week_day"],axis=1)
data.head()









    Out[14]:







  
    
      
      calendar_code
      site_count
      max_temp
      min_temp
      precipitation
      day_of_week
      events_code
    
  
  
    
      0
      0.0
      6
      30.6
      22.8
      0.0
      0
      5
    
    
      1
      1.0
      7
      32.8
      22.8
      15.5
      1
      7
    
    
      2
      1.0
      7
      29.4
      18.3
      0.0
      2
      4
    
    
      3
      1.0
      7
      29.4
      17.2
      0.0
      3
      4
    
    
      4
      1.0
      7
      30.6
      21.7
      0.0
      4
      0

The categorical values day_of_week, events_code and calender code need to be one-hot-encoded to be used as a feature input vector.



In [15]:

    
data= pd.get_dummies(data,columns=["calendar_code","events_code","day_of_week"],prefix=["calendar","event","week"])
data.head()









    Out[15]:







  
    
      
      site_count
      max_temp
      min_temp
      precipitation
      calendar_0.0
      calendar_1.0
      event_0
      event_1
      event_2
      event_3
      ...
      event_6
      event_7
      event_8
      week_0
      week_1
      week_2
      week_3
      week_4
      week_5
      week_6
    
  
  
    
      0
      6
      30.6
      22.8
      0.0
      1
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      1
      0
      0
      0
      0
      0
      0
    
    
      1
      7
      32.8
      22.8
      15.5
      0
      1
      0
      0
      0
      0
      ...
      0
      1
      0
      0
      1
      0
      0
      0
      0
      0
    
    
      2
      7
      29.4
      18.3
      0.0
      0
      1
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      1
      0
      0
      0
      0
    
    
      3
      7
      29.4
      17.2
      0.0
      0
      1
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      1
      0
      0
      0
    
    
      4
      7
      30.6
      21.7
      0.0
      0
      1
      1
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      1
      0
      0
    
  

5 rows × 22 columns

DataFrame is now ready to be used as feature matrix. Lets assign data values to X.



In [17]:

    
X=data.values
X.shape









    Out[17]:





(152, 22)



In [20]:

    
plt.figure(1)
plt.plot(X[:,0],y[:],'r.')
plt.xlabel("No. of sites")
plt.ylabel("No. of requests")
plt.show()

plt.figure(1)
plt.plot((X[:,1]+X[:,2])**2/2.0,y[:],'r.')
plt.xlabel("Mean temperature")
plt.ylabel("No. of requests")
plt.show()

plt.figure(1)
plt.plot(X[:,3],y[:],'r.')
plt.xlabel("Precipitation")
plt.ylabel("No. of requests")
plt.show()

It appears that the no of requests has some kind of a quadratic dependence on the mean temperature so in addition to max and min temperature we should construct a new feature $((mintemp+maxtemp)/2)^2$



In [21]:

    
X=np.column_stack([X,(X[:,1]+X[:,2])**2.0])
X.shape









    Out[21]:





(152, 23)



In [22]:

    
#Split the data into training and validation test, shuffling of data won't be necessary since data seems to be already random
X_train, X_val, y_train, y_val =  train_test_split(X,y,test_size=0.2,random_state = 0)

Conduct a multivariate linear regression on the dataset.



In [23]:

    
#Multivariabte regression
regr = linear_model.LinearRegression()
start_time =time.time()
regr.fit(X_train, y_train)
print("--- %s seconds ---" % (time.time() - start_time))
y_train_pred=regr.predict(X_train)
print("Mean squared error: %.2f" % np.mean((regr.predict(X_train) - y_train) ** 2))
print("Mean squared error with validation set: %.2f" % np.mean((regr.predict(X_val) - y_val) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(X_train, y_train))
print('Variance score with validation set: %.2f' % regr.score(X_val, y_val))









    



--- 0.013047933578491211 seconds ---
Mean squared error: 2050.26
Mean squared error with validation set: 3354.57
Variance score: 0.82
Variance score with validation set: 0.77

I also checked the variance score decreases if the new column of $((mintemp+maxtemp)/2)^2$ is not considered. Here, Recursive Feature Elimination ( RFE ) technique is ignored since with analysis above I see each of the variable has its own impact. This is something I can do for larger set of features but not in this case.

Now apply regression on data after ignoring the least appeared events to try if it gives us any better result.



In [632]:

    
#Delete the least appeared events
plt.figure(1)
plt.plot((X[:,1]+X[:,2])**2/2.0,y[:],'r.')
plt.xlabel("Mean temperature")
plt.ylabel("No. of requests")
plt.show()
data_orig = data_orig[(data_orig['events'] != 'Fog') & (data_orig['events'] != 'Fog-Rain-Snow') & (data_orig['events'] != 'Rain-Thunderstorm')]
data_orig.shape









    Out[632]:





(148, 11)



In [633]:

    
#Preprocess data before sending to multivariant regression
data_orig['day_of_week'] = data_orig['date'].dt.dayofweek
data_orig['events_code'] = pd.Categorical(data_orig["events"]).codes
data_orig= pd.get_dummies(data_orig,columns=["calendar_code","events_code","day_of_week"],prefix=["calendar","event","week"])
y_red=data_orig["request_count"]
data_orig = data_orig.drop(["date","events","request_count","week_day"],axis=1)
data_orig.head()









    Out[633]:







  
    
      
      site_count
      max_temp
      min_temp
      precipitation
      calendar_0.0
      calendar_1.0
      event_0
      event_1
      event_2
      event_3
      event_4
      event_5
      week_0
      week_1
      week_2
      week_3
      week_4
      week_5
      week_6
    
  
  
    
      0
      6
      30.6
      22.8
      0.0
      1
      0
      0
      0
      0
      1
      0
      0
      1
      0
      0
      0
      0
      0
      0
    
    
      2
      7
      29.4
      18.3
      0.0
      0
      1
      0
      0
      1
      0
      0
      0
      0
      0
      1
      0
      0
      0
      0
    
    
      3
      7
      29.4
      17.2
      0.0
      0
      1
      0
      0
      1
      0
      0
      0
      0
      0
      0
      1
      0
      0
      0
    
    
      6
      7
      27.2
      17.8
      0.0
      0
      1
      0
      0
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
      1
    
    
      7
      7
      23.3
      17.8
      0.0
      0
      1
      0
      0
      1
      0
      0
      0
      1
      0
      0
      0
      0
      0
      0



In [634]:

    
#Split the data into training and validation test, shuffling of data won't be necessary since data seems to be already random
X_red = data_orig.values
X_red_train, X_red_val, y_red_train, y_red_val =  train_test_split(X_red,y_red,test_size=0.2,random_state = 0)



In [635]:

    
# Multivariant regression

regr = linear_model.LinearRegression()
regr.fit(X_red_train, y_red_train)
y_red_train_pred=regr.predict(X_red_train)
print("Mean squared error: %.2f" % np.mean((regr.predict(X_red_train) - y_red_train) ** 2))
print("Mean squared error with validation set: %.2f" % np.mean((regr.predict(X_red_val) - y_red_val) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(X_red_train, y_red_train))
print('Variance score with validation set: %.2f' % regr.score(X_red_val, y_red_val))









    



Mean squared error: 2303.49
Mean squared error with validation set: 3448.02
Variance score: 0.81
Variance score with validation set: 0.77

This actually has increased the error, so removing least appeared events in not helping at all in reducing the error. Lets go back to our original regression fit.



In [636]:

    
#Multivariabte regression
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
y_train_pred=regr.predict(X_train)
print("Mean squared error: %.2f" % np.mean((regr.predict(X_train) - y_train) ** 2))
print("Mean squared error with validation set: %.2f" % np.mean((regr.predict(X_val) - y_val) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(X_train, y_train))
print('Variance score with validation set: %.2f' % regr.score(X_val, y_val))









    



Mean squared error: 2050.26
Mean squared error with validation set: 3354.57
Variance score: 0.82
Variance score with validation set: 0.77

Load CSV file with missing request_count and predict request_count. This is one of the tasks



In [637]:

    
test_data=pd.read_csv('Arundo_take_home_challenge_test_set.csv',sep=',',parse_dates=['date'])
# Sort by events to see if all the events in training data set
test_data.groupby('events').site_count.agg(['mean','max','min']).plot(kind='bar')
plt.show()
test_data.groupby('events').site_count.agg(['count','mean','max','min'])









    












    Out[637]:







  
    
      
      count
      mean
      max
      min
    
    
      events
      
      
      
      
    
  
  
    
      Fog
      1
      10.000000
      10
      10
    
    
      Fog-Rain
      1
      12.000000
      12
      12
    
    
      None
      17
      8.647059
      11
      7
    
    
      Rain
      11
      8.545455
      10
      7
    
    
      Rain-Thunderstorm
      1
      10.000000
      10
      10

Clearly 4 events Fog-Rain-Snow, Fog-Snow, Rain-Snow and Snow are not listed. This I just wanted to see if I can remove the least appeared events that won't be needed in predicting request_count in given test csv file. The only least appeared event not listed here is Fog-Rain-Snow which might not be very helpful in improving our fit. But now I already have checked and this won't give us any improved result.



In [638]:

    
# We must process the test csv data in similar way before predicting request_count 
# To ensure to have codes for catogorical variables, we will merge the given test data with training data csv (in csv files)
# and after inclusion of all events codes and week day codes we will remove the training data

data=pd.read_csv('Arundo_take_home_challenge_training_set.csv',sep=',',parse_dates=['date'])
data['key1'] = 1
data = data.drop(["request_count"],axis=1)
data.head()
test_data['key2'] = 0
frames = [test_data,data]
merg_frame = pd.concat(frames)
merg_frame['day_of_week'] = merg_frame['date'].dt.dayofweek
merg_frame['events_code'] = pd.Categorical(merg_frame["events"]).codes
merg_frame.head()









    Out[638]:







  
    
      
      calendar_code
      date
      events
      key1
      key2
      max_temp
      min_temp
      precipitation
      site_count
      day_of_week
      events_code
    
  
  
    
      0
      1.0
      2014-10-01
      None
      NaN
      0.0
      24.4
      15.0
      0.0
      8
      2
      4
    
    
      1
      1.0
      2014-10-02
      None
      NaN
      0.0
      21.1
      12.8
      0.0
      8
      3
      4
    
    
      2
      1.0
      2014-10-03
      Rain
      NaN
      0.0
      22.2
      12.2
      2.8
      8
      4
      5
    
    
      3
      1.0
      2014-10-04
      Rain
      NaN
      0.0
      21.7
      7.2
      10.7
      7
      5
      5
    
    
      4
      1.0
      2014-10-05
      None
      NaN
      0.0
      16.7
      3.9
      0.0
      8
      6
      4



In [639]:

    
# Drop data and Events columns
merg_frame = merg_frame.drop(["date","events"],axis=1)
merg_frame.head()









    Out[639]:







  
    
      
      calendar_code
      key1
      key2
      max_temp
      min_temp
      precipitation
      site_count
      day_of_week
      events_code
    
  
  
    
      0
      1.0
      NaN
      0.0
      24.4
      15.0
      0.0
      8
      2
      4
    
    
      1
      1.0
      NaN
      0.0
      21.1
      12.8
      0.0
      8
      3
      4
    
    
      2
      1.0
      NaN
      0.0
      22.2
      12.2
      2.8
      8
      4
      5
    
    
      3
      1.0
      NaN
      0.0
      21.7
      7.2
      10.7
      7
      5
      5
    
    
      4
      1.0
      NaN
      0.0
      16.7
      3.9
      0.0
      8
      6
      4



In [640]:

    
merg_frame= pd.get_dummies(merg_frame,columns=["calendar_code","events_code","day_of_week"],prefix=["calendar","event","week"])



In [641]:

    
test_data = merg_frame[merg_frame['key2'] == 0]
test_data = test_data.drop(["key1","key2"],axis=1)
test_data.head()









    Out[641]:







  
    
      
      max_temp
      min_temp
      precipitation
      site_count
      calendar_0.0
      calendar_1.0
      event_0
      event_1
      event_2
      event_3
      ...
      event_6
      event_7
      event_8
      week_0
      week_1
      week_2
      week_3
      week_4
      week_5
      week_6
    
  
  
    
      0
      24.4
      15.0
      0.0
      8
      0
      1
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      1
      0
      0
      0
      0
    
    
      1
      21.1
      12.8
      0.0
      8
      0
      1
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      1
      0
      0
      0
    
    
      2
      22.2
      12.2
      2.8
      8
      0
      1
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      1
      0
      0
    
    
      3
      21.7
      7.2
      10.7
      7
      0
      1
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      1
      0
    
    
      4
      16.7
      3.9
      0.0
      8
      0
      1
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      1
    
  

5 rows × 22 columns



In [651]:

    
# Assign test data to X_test and add ((minx+maxx)/2)^2 as an additional column
X_test=test_data.values
X_test=np.column_stack([X_test,(X_test[:,1]+X_test[:,2])**2.0])
X_test.shape









    Out[651]:





(31, 23)



In [672]:

    
y_test_pred=regr.predict(X_test)
np.shape(y_test_pred)









    Out[672]:





(31,)



In [667]:

    
pred = pd.DataFrame({'Predicted_request_counts':y_test_pred})
pred.head()









    Out[667]:







  
    
      
      Predicted_request_counts
    
  
  
    
      0
      460.767914
    
    
      1
      405.878501
    
    
      2
      535.667112
    
    
      3
      564.255783
    
    
      4
      437.806621



In [645]:

    
pred.to_csv('predicted_request_counts_regression.csv', index=False)

Now lets try a neural network, if it can decrease the mean square error



In [691]:

    
m,input_layer_size=X.shape
hidden_layer_size = input_layer_size   
ANN_classifier = Sequential()
ANN_classifier.add(Dense(units = 10, kernel_initializer = 'uniform', activation = 'relu', input_dim = input_layer_size))
ANN_classifier.add(Dense(units = 10, kernel_initializer = 'uniform', activation = 'relu'))
ANN_classifier.add(Dense(units = 1, kernel_initializer = 'normal'))
start_time = time.time()
ANN_classifier.compile(loss='mean_squared_error', optimizer='adam')
history=ANN_classifier.fit(X_train, y_train, batch_size = 15, epochs = 4000,verbose=0)
print("--- %s seconds ---" % (time.time() - start_time))









    



--- 47.88141584396362 seconds ---



In [658]:

    
pred_train = ANN_classifier.predict(X_train)
pred = ANN_classifier.predict(X_val)
print("Mean squared error: ", np.mean((pred_train - y_train.values.reshape(-1,1)) ** 2))
print("Mean squared error validation: ", np.mean((pred - y_val.values.reshape(-1,1)) ** 2))









    



Mean squared error:  1819.91797975
Mean squared error validation:  2880.4640133

Clearly Neural network has improved the fit but much more slower than regression

Now create Predicted_request_counts.csv with neural network fit



In [686]:

    
y_test_pred_ANN=ANN_classifier.predict(X_test)
pred_ANN = pd.DataFrame({'Predicted_request_counts':[y_test_pred_ANN]},index =[0])
pred_ANN.head()









    Out[686]:







  
    
      
      Predicted_request_counts
    
  
  
    
      0
      [[465.023], [407.046], [566.8], [636.315], [49...



In [687]:

    
pred_ANN.to_csv('predicted_request_counts_ANN.csv', index=False)

Conclusion

Though size of the given data is small but the catogorical data carrying valuable undelying structure helped in achieving a reasonable accuracy.
Conversion of date to week days revealed important informations.
- Weekend is the most active period to receive requests.
- Weekdays showed a more calm and uniform period.
One-hot-encoding turned out to be a smart way to expose the hidden structure in catogorical variables and enable us to use them as numerical feature.
Neural network produced better error compare to linear regression but at the cost more execution time. Execution time though is not very important to take into account for this specific problem.
Tried with different no of hidden layers but using more than one tends to overfit.
Expected the removing the least appearing events would improve the results slightly but that was not true.
Decided to not to use the Recursive Feature Elimination ( RFE ) since the data is small and execution time of models is not significant.
In further work, Various methods could have been used to create feature ranking matrix.
One easy fix is needed. I noticed the output predict is not integer which it should be. Don't have time for this now but will fix asap.

	date	calendar_code	request_count	site_count	max_temp	min_temp	precipitation	events
0	2014-09-01	0.0	165	6	30.6	22.8	0.0	Rain
1	2014-09-02	1.0	138	7	32.8	22.8	15.5	Rain-Thunderstorm
2	2014-09-03	1.0	127	7	29.4	18.3	0.0	None
3	2014-09-04	1.0	174	7	29.4	17.2	0.0	None
4	2014-09-05	1.0	196	7	30.6	21.7	0.0	Fog
5	2014-09-06	1.0	314	7	32.8	22.2	9.9	Rain-Thunderstorm
6	2014-09-07	1.0	156	7	27.2	17.8	0.0	None
7	2014-09-08	1.0	150	7	23.3	17.8	0.0	None
8	2014-09-09	1.0	137	7	24.4	20.6	0.0	None
9	2014-09-10	1.0	156	7	26.7	17.8	0.0	None
10	2014-09-11	1.0	135	7	30.0	20.6	0.0	None
11	2014-09-12	1.0	187	7	24.4	15.6	0.0	None
12	2014-09-13	1.0	342	7	21.1	12.8	11.7	Rain
13	2014-09-14	1.0	202	7	20.6	9.4	0.0	None
14	2014-09-15	1.0	141	7	22.8	8.3	0.0	None

	date	calendar_code	request_count	site_count	max_temp	min_temp	precipitation	events
147	2015-02-26	1.0	378	15	0.6	-3.9	0.3	Snow
148	2015-02-27	1.0	556	16	-1.7	-6.7	0.3	Snow
149	2015-02-28	1.0	570	14	-0.6	-10.0	0.0	None
150	2015-03-01	1.0	615	16	0.0	-8.3	13.7	Rain-Snow
151	2015-03-02	1.0	369	16	3.3	-2.8	0.0	None

	count	mean	max	min
events
Fog	1	196.000000	196	196
Fog-Rain	5	206.400000	383	27
Fog-Rain-Snow	1	210.000000	210	210
Fog-Snow	3	355.000000	531	213
None	89	246.662921	570	0
Rain	31	256.483871	473	0
Rain-Snow	6	334.833333	615	137
Rain-Thunderstorm	2	226.000000	314	138
Snow	14	365.642857	593	247

	count	mean	max	min
week_day
Friday	21	300.857143	556	82
Monday	23	234.434783	405	133
Saturday	22	313.090909	570	128
Sunday	22	295.818182	615	156
Thursday	21	219.571429	391	0
Tuesday	22	230.045455	423	101
Wednesday	21	248.047619	405	27

	count	mean	max	min
events
Fog	1	10.000000	10	10
Fog-Rain	1	12.000000	12	12
None	17	8.647059	11	7
Rain	11	8.545455	10	7
Rain-Thunderstorm	1	10.000000	10	10

	calendar_code	date	events	key1	max_temp	min_temp	precipitation	site_count	day_of_week	events_code
0	1.0	2014-10-01	None	NaN	24.4	15.0	0.0	8	2	4
1	1.0	2014-10-02	None	NaN	21.1	12.8	0.0	8	3	4
2	1.0	2014-10-03	Rain	NaN	22.2	12.2	2.8	8	4	5
3	1.0	2014-10-04	Rain	NaN	21.7	7.2	10.7	7	5	5
4	1.0	2014-10-05	None	NaN	16.7	3.9	0.0	8	6	4