Arundo's Take home challenge

Given Data:

  • Arundo_take_home_challenge_training_set.csv
  • Arundo_take_home_challenge_test_set.csv

Task:

  • Arundo_take_home_challenge_training_set.csv will be used to train a model to predict request_count (target variable).
  • Predict request_count that is missing in Arundo_take_home_challenge_test_set.csv.

Note: Arundo_take_home_challenge_training_set.csv though will be loaded as a test_data (considering the name of the given file) should not be confused with the test data or validation data of the ML model. Validation/test data will actually come by splitting Arundo_take_home_challenge_training_set.csv.

The problem is solved in Jupyter's notebook with following main steps

  • Data Visualization
  • Features selection and preprocessing
  • Testing with different ML models
  • Output csv files with predicted request_values using the trained ML models

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn import svm
import matplotlib.pyplot as plt
import seaborn as sns
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import np_utils
import time


Using TensorFlow backend.

Read the given dataset


In [4]:
#ALso remember to parse the date column. This will be helpful in the next step
data=pd.read_csv('Arundo_take_home_challenge_training_set.csv',sep=',',parse_dates=['date'])
#Have a look at the data
data.head(15)


Out[4]:
date calendar_code request_count site_count max_temp min_temp precipitation events
0 2014-09-01 0.0 165 6 30.6 22.8 0.0 Rain
1 2014-09-02 1.0 138 7 32.8 22.8 15.5 Rain-Thunderstorm
2 2014-09-03 1.0 127 7 29.4 18.3 0.0 None
3 2014-09-04 1.0 174 7 29.4 17.2 0.0 None
4 2014-09-05 1.0 196 7 30.6 21.7 0.0 Fog
5 2014-09-06 1.0 314 7 32.8 22.2 9.9 Rain-Thunderstorm
6 2014-09-07 1.0 156 7 27.2 17.8 0.0 None
7 2014-09-08 1.0 150 7 23.3 17.8 0.0 None
8 2014-09-09 1.0 137 7 24.4 20.6 0.0 None
9 2014-09-10 1.0 156 7 26.7 17.8 0.0 None
10 2014-09-11 1.0 135 7 30.0 20.6 0.0 None
11 2014-09-12 1.0 187 7 24.4 15.6 0.0 None
12 2014-09-13 1.0 342 7 21.1 12.8 11.7 Rain
13 2014-09-14 1.0 202 7 20.6 9.4 0.0 None
14 2014-09-15 1.0 141 7 22.8 8.3 0.0 None

In [5]:
data.tail(5)


Out[5]:
date calendar_code request_count site_count max_temp min_temp precipitation events
147 2015-02-26 1.0 378 15 0.6 -3.9 0.3 Snow
148 2015-02-27 1.0 556 16 -1.7 -6.7 0.3 Snow
149 2015-02-28 1.0 570 14 -0.6 -10.0 0.0 None
150 2015-03-01 1.0 615 16 0.0 -8.3 13.7 Rain-Snow
151 2015-03-02 1.0 369 16 3.3 -2.8 0.0 None

In [6]:
print(data.isnull().any())


date             False
calendar_code    False
request_count    False
site_count       False
max_temp         False
min_temp         False
precipitation    False
events           False
dtype: bool

Quick observations

  • Daily data related to weather variations
  • Covers mainly the winter months
  • A better visualization of correlation of request_count needed with other variables
  • No nulls

Let us try to get a better insight into the data. First let us have a look at the dependence of request counts on the float variables.


In [4]:
data.hist('max_temp',weights=data['request_count'])
data.hist('min_temp',weights=data['request_count'])
data.hist('precipitation',weights=data['request_count'])
plt.show()


From the above histograms we see most of the requests comes when

  • maximum temperature is below 10C
  • min temperature is below 2C
  • When there is zero precipitation

All in all the distribtion of request count is strongly correlated to floating variables and hence all floating variables are to be considered as features. Let's now evaluate the correlation of two variables 'events' and 'calendar code' with request_count which clearly are categorical variables.


In [5]:
#Sort request_count with events  
data.groupby('events').request_count.agg(['mean','max','min']).plot(kind='bar')
plt.show()
data.groupby('events').request_count.agg(['count','mean','max','min'])


Out[5]:
count mean max min
events
Fog 1 196.000000 196 196
Fog-Rain 5 206.400000 383 27
Fog-Rain-Snow 1 210.000000 210 210
Fog-Snow 3 355.000000 531 213
None 89 246.662921 570 0
Rain 31 256.483871 473 0
Rain-Snow 6 334.833333 615 137
Rain-Thunderstorm 2 226.000000 314 138
Snow 14 365.642857 593 247

Clearly, support request comes in more when the weather condition is overcast which is understandable. But not many data instances are available in most of the weather events except when the events are 'None', 'Rain' and 'Snow' which will make it challenging to split the data honestly, train the ML model and test the accuracy.


In [6]:
# Now sort request_count with calendar code
data.groupby('calendar_code').request_count.agg(['mean','max','min']).plot(kind='bar')
plt.show()
data.groupby('calendar_code').request_count.agg(['count','mean','max','min'])


Out[6]:
count mean max min
calendar_code
0.0 63 228.555556 410 0
1.0 89 287.505618 615 101

Calendar code probably refers to the intensity of weather variations in a single day. The distribution of calendar code behavior would be interesting to see.

We further use violin plot for dependence on the categorical variables (https://blog.modeanalytics.com/violin-plot-examples/)

In [7]:
var_name = "events"
col_order = np.sort(data[var_name].unique()).tolist()
plt.figure(figsize=(16,6))
sns.violinplot(x=var_name, y='request_count', data=data, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of request count with "+var_name, fontsize=15)
plt.show()


For events with less data points, it is difficult to to see the distribution which will eventually might reflect as a error in the train model or else would be difficult or not possible to split into taining, test and validation data sets. Certainly, more data points corresponding to these events would help in better trained model for these weather event.


In [9]:
var_name = "calendar_code"
col_order = np.sort(data[var_name].unique()).tolist()
plt.figure(figsize=(16,6))
sns.violinplot(x=var_name, y='request_count', data=data, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of request count with "+var_name, fontsize=15)
plt.show()


Significant distribution of request count over calendar code is visible and hence would be the part of feature metrix.

Next, lets analayze the impact of date. The dates covers mainly the winter months and may not be represented well if it is used as it is given. I realized may be the site maintenence is dependant on the working day or weekend instead. We start by adding an additional column with the week day (0: Monday, ... 6: Sunday)


In [10]:
data['day_of_week'] = data['date'].dt.dayofweek
data['week_day'] = data['date'].dt.weekday_name
data.head()


Out[10]:
date calendar_code request_count site_count max_temp min_temp precipitation events day_of_week week_day
0 2014-09-01 0.0 165 6 30.6 22.8 0.0 Rain 0 Monday
1 2014-09-02 1.0 138 7 32.8 22.8 15.5 Rain-Thunderstorm 1 Tuesday
2 2014-09-03 1.0 127 7 29.4 18.3 0.0 None 2 Wednesday
3 2014-09-04 1.0 174 7 29.4 17.2 0.0 None 3 Thursday
4 2014-09-05 1.0 196 7 30.6 21.7 0.0 Fog 4 Friday

In [11]:
# We again choose the groupby and violin plots to see the underlying behaviour
data.groupby('week_day').request_count.agg(['mean','max','min']).plot(kind='bar')
plt.show()
data.groupby('week_day').request_count.agg(['count','mean','max','min'])


Out[11]:
count mean max min
week_day
Friday 21 300.857143 556 82
Monday 23 234.434783 405 133
Saturday 22 313.090909 570 128
Sunday 22 295.818182 615 156
Thursday 21 219.571429 391 0
Tuesday 22 230.045455 423 101
Wednesday 21 248.047619 405 27

Clearly, Weekends (Friday-Sunday) are the most active period while Monday-Thursday show nearly uniform mean request_count. This indicates the best way to reflect the effect of date as numeric feature would be through the week days.

Next, we convert the events into some unique identifiers (integers). This will result in an additional column "events_code".


In [12]:
data['events_code'] = pd.Categorical(data["events"]).codes
data.head()


Out[12]:
date calendar_code request_count site_count max_temp min_temp precipitation events day_of_week week_day events_code
0 2014-09-01 0.0 165 6 30.6 22.8 0.0 Rain 0 Monday 5
1 2014-09-02 1.0 138 7 32.8 22.8 15.5 Rain-Thunderstorm 1 Tuesday 7
2 2014-09-03 1.0 127 7 29.4 18.3 0.0 None 2 Wednesday 4
3 2014-09-04 1.0 174 7 29.4 17.2 0.0 None 3 Thursday 4
4 2014-09-05 1.0 196 7 30.6 21.7 0.0 Fog 4 Friday 0

Since request_count is the target variable, we store it separately as "y" for ML model


In [13]:
y=data["request_count"]
print("Shape of y ",y.shape)


Shape of y  (152,)

Drop the redundant columns now "date","events","request_count" and week_day.


In [14]:
data_orig = data #Save data in data_orig before droping reduntant variables
data = data.drop(["date","events","request_count","week_day"],axis=1)
data.head()


Out[14]:
calendar_code site_count max_temp min_temp precipitation day_of_week events_code
0 0.0 6 30.6 22.8 0.0 0 5
1 1.0 7 32.8 22.8 15.5 1 7
2 1.0 7 29.4 18.3 0.0 2 4
3 1.0 7 29.4 17.2 0.0 3 4
4 1.0 7 30.6 21.7 0.0 4 0

The categorical values day_of_week, events_code and calender code need to be one-hot-encoded to be used as a feature input vector.


In [15]:
data= pd.get_dummies(data,columns=["calendar_code","events_code","day_of_week"],prefix=["calendar","event","week"])
data.head()


Out[15]:
site_count max_temp min_temp precipitation calendar_0.0 calendar_1.0 event_0 event_1 event_2 event_3 ... event_6 event_7 event_8 week_0 week_1 week_2 week_3 week_4 week_5 week_6
0 6 30.6 22.8 0.0 1 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
1 7 32.8 22.8 15.5 0 1 0 0 0 0 ... 0 1 0 0 1 0 0 0 0 0
2 7 29.4 18.3 0.0 0 1 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
3 7 29.4 17.2 0.0 0 1 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
4 7 30.6 21.7 0.0 0 1 1 0 0 0 ... 0 0 0 0 0 0 0 1 0 0

5 rows × 22 columns

DataFrame is now ready to be used as feature matrix. Lets assign data values to X.


In [17]:
X=data.values
X.shape


Out[17]:
(152, 22)

In [20]:
plt.figure(1)
plt.plot(X[:,0],y[:],'r.')
plt.xlabel("No. of sites")
plt.ylabel("No. of requests")
plt.show()

plt.figure(1)
plt.plot((X[:,1]+X[:,2])**2/2.0,y[:],'r.')
plt.xlabel("Mean temperature")
plt.ylabel("No. of requests")
plt.show()

plt.figure(1)
plt.plot(X[:,3],y[:],'r.')
plt.xlabel("Precipitation")
plt.ylabel("No. of requests")
plt.show()


It appears that the no of requests has some kind of a quadratic dependence on the mean temperature so in addition to max and min temperature we should construct a new feature $((mintemp+maxtemp)/2)^2$


In [21]:
X=np.column_stack([X,(X[:,1]+X[:,2])**2.0])
X.shape


Out[21]:
(152, 23)

In [22]:
#Split the data into training and validation test, shuffling of data won't be necessary since data seems to be already random
X_train, X_val, y_train, y_val =  train_test_split(X,y,test_size=0.2,random_state = 0)

Conduct a multivariate linear regression on the dataset.


In [23]:
#Multivariabte regression
regr = linear_model.LinearRegression()
start_time =time.time()
regr.fit(X_train, y_train)
print("--- %s seconds ---" % (time.time() - start_time))
y_train_pred=regr.predict(X_train)
print("Mean squared error: %.2f" % np.mean((regr.predict(X_train) - y_train) ** 2))
print("Mean squared error with validation set: %.2f" % np.mean((regr.predict(X_val) - y_val) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(X_train, y_train))
print('Variance score with validation set: %.2f' % regr.score(X_val, y_val))


--- 0.013047933578491211 seconds ---
Mean squared error: 2050.26
Mean squared error with validation set: 3354.57
Variance score: 0.82
Variance score with validation set: 0.77

I also checked the variance score decreases if the new column of $((mintemp+maxtemp)/2)^2$ is not considered. Here, Recursive Feature Elimination ( RFE ) technique is ignored since with analysis above I see each of the variable has its own impact. This is something I can do for larger set of features but not in this case.

Now apply regression on data after ignoring the least appeared events to try if it gives us any better result.


In [632]:
#Delete the least appeared events
plt.figure(1)
plt.plot((X[:,1]+X[:,2])**2/2.0,y[:],'r.')
plt.xlabel("Mean temperature")
plt.ylabel("No. of requests")
plt.show()
data_orig = data_orig[(data_orig['events'] != 'Fog') & (data_orig['events'] != 'Fog-Rain-Snow') & (data_orig['events'] != 'Rain-Thunderstorm')]
data_orig.shape


Out[632]:
(148, 11)

In [633]:
#Preprocess data before sending to multivariant regression
data_orig['day_of_week'] = data_orig['date'].dt.dayofweek
data_orig['events_code'] = pd.Categorical(data_orig["events"]).codes
data_orig= pd.get_dummies(data_orig,columns=["calendar_code","events_code","day_of_week"],prefix=["calendar","event","week"])
y_red=data_orig["request_count"]
data_orig = data_orig.drop(["date","events","request_count","week_day"],axis=1)
data_orig.head()


Out[633]:
site_count max_temp min_temp precipitation calendar_0.0 calendar_1.0 event_0 event_1 event_2 event_3 event_4 event_5 week_0 week_1 week_2 week_3 week_4 week_5 week_6
0 6 30.6 22.8 0.0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0
2 7 29.4 18.3 0.0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0
3 7 29.4 17.2 0.0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0
6 7 27.2 17.8 0.0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1
7 7 23.3 17.8 0.0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0

In [634]:
#Split the data into training and validation test, shuffling of data won't be necessary since data seems to be already random
X_red = data_orig.values
X_red_train, X_red_val, y_red_train, y_red_val =  train_test_split(X_red,y_red,test_size=0.2,random_state = 0)

In [635]:
# Multivariant regression

regr = linear_model.LinearRegression()
regr.fit(X_red_train, y_red_train)
y_red_train_pred=regr.predict(X_red_train)
print("Mean squared error: %.2f" % np.mean((regr.predict(X_red_train) - y_red_train) ** 2))
print("Mean squared error with validation set: %.2f" % np.mean((regr.predict(X_red_val) - y_red_val) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(X_red_train, y_red_train))
print('Variance score with validation set: %.2f' % regr.score(X_red_val, y_red_val))


Mean squared error: 2303.49
Mean squared error with validation set: 3448.02
Variance score: 0.81
Variance score with validation set: 0.77

This actually has increased the error, so removing least appeared events in not helping at all in reducing the error. Lets go back to our original regression fit.


In [636]:
#Multivariabte regression
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
y_train_pred=regr.predict(X_train)
print("Mean squared error: %.2f" % np.mean((regr.predict(X_train) - y_train) ** 2))
print("Mean squared error with validation set: %.2f" % np.mean((regr.predict(X_val) - y_val) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(X_train, y_train))
print('Variance score with validation set: %.2f' % regr.score(X_val, y_val))


Mean squared error: 2050.26
Mean squared error with validation set: 3354.57
Variance score: 0.82
Variance score with validation set: 0.77

Load CSV file with missing request_count and predict request_count. This is one of the tasks


In [637]:
test_data=pd.read_csv('Arundo_take_home_challenge_test_set.csv',sep=',',parse_dates=['date'])
# Sort by events to see if all the events in training data set
test_data.groupby('events').site_count.agg(['mean','max','min']).plot(kind='bar')
plt.show()
test_data.groupby('events').site_count.agg(['count','mean','max','min'])


Out[637]:
count mean max min
events
Fog 1 10.000000 10 10
Fog-Rain 1 12.000000 12 12
None 17 8.647059 11 7
Rain 11 8.545455 10 7
Rain-Thunderstorm 1 10.000000 10 10

Clearly 4 events Fog-Rain-Snow, Fog-Snow, Rain-Snow and Snow are not listed. This I just wanted to see if I can remove the least appeared events that won't be needed in predicting request_count in given test csv file. The only least appeared event not listed here is Fog-Rain-Snow which might not be very helpful in improving our fit. But now I already have checked and this won't give us any improved result.


In [638]:
# We must process the test csv data in similar way before predicting request_count 
# To ensure to have codes for catogorical variables, we will merge the given test data with training data csv (in csv files)
# and after inclusion of all events codes and week day codes we will remove the training data

data=pd.read_csv('Arundo_take_home_challenge_training_set.csv',sep=',',parse_dates=['date'])
data['key1'] = 1
data = data.drop(["request_count"],axis=1)
data.head()
test_data['key2'] = 0
frames = [test_data,data]
merg_frame = pd.concat(frames)
merg_frame['day_of_week'] = merg_frame['date'].dt.dayofweek
merg_frame['events_code'] = pd.Categorical(merg_frame["events"]).codes
merg_frame.head()


Out[638]:
calendar_code date events key1 key2 max_temp min_temp precipitation site_count day_of_week events_code
0 1.0 2014-10-01 None NaN 0.0 24.4 15.0 0.0 8 2 4
1 1.0 2014-10-02 None NaN 0.0 21.1 12.8 0.0 8 3 4
2 1.0 2014-10-03 Rain NaN 0.0 22.2 12.2 2.8 8 4 5
3 1.0 2014-10-04 Rain NaN 0.0 21.7 7.2 10.7 7 5 5
4 1.0 2014-10-05 None NaN 0.0 16.7 3.9 0.0 8 6 4

In [639]:
# Drop data and Events columns
merg_frame = merg_frame.drop(["date","events"],axis=1)
merg_frame.head()


Out[639]:
calendar_code key1 key2 max_temp min_temp precipitation site_count day_of_week events_code
0 1.0 NaN 0.0 24.4 15.0 0.0 8 2 4
1 1.0 NaN 0.0 21.1 12.8 0.0 8 3 4
2 1.0 NaN 0.0 22.2 12.2 2.8 8 4 5
3 1.0 NaN 0.0 21.7 7.2 10.7 7 5 5
4 1.0 NaN 0.0 16.7 3.9 0.0 8 6 4

In [640]:
merg_frame= pd.get_dummies(merg_frame,columns=["calendar_code","events_code","day_of_week"],prefix=["calendar","event","week"])

In [641]:
test_data = merg_frame[merg_frame['key2'] == 0]
test_data = test_data.drop(["key1","key2"],axis=1)
test_data.head()


Out[641]:
max_temp min_temp precipitation site_count calendar_0.0 calendar_1.0 event_0 event_1 event_2 event_3 ... event_6 event_7 event_8 week_0 week_1 week_2 week_3 week_4 week_5 week_6
0 24.4 15.0 0.0 8 0 1 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
1 21.1 12.8 0.0 8 0 1 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
2 22.2 12.2 2.8 8 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
3 21.7 7.2 10.7 7 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
4 16.7 3.9 0.0 8 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1

5 rows × 22 columns


In [651]:
# Assign test data to X_test and add ((minx+maxx)/2)^2 as an additional column
X_test=test_data.values
X_test=np.column_stack([X_test,(X_test[:,1]+X_test[:,2])**2.0])
X_test.shape


Out[651]:
(31, 23)

In [672]:
y_test_pred=regr.predict(X_test)
np.shape(y_test_pred)


Out[672]:
(31,)

In [667]:
pred = pd.DataFrame({'Predicted_request_counts':y_test_pred})
pred.head()


Out[667]:
Predicted_request_counts
0 460.767914
1 405.878501
2 535.667112
3 564.255783
4 437.806621

In [645]:
pred.to_csv('predicted_request_counts_regression.csv', index=False)

Now lets try a neural network, if it can decrease the mean square error


In [691]:
m,input_layer_size=X.shape
hidden_layer_size = input_layer_size   
ANN_classifier = Sequential()
ANN_classifier.add(Dense(units = 10, kernel_initializer = 'uniform', activation = 'relu', input_dim = input_layer_size))
ANN_classifier.add(Dense(units = 10, kernel_initializer = 'uniform', activation = 'relu'))
ANN_classifier.add(Dense(units = 1, kernel_initializer = 'normal'))
start_time = time.time()
ANN_classifier.compile(loss='mean_squared_error', optimizer='adam')
history=ANN_classifier.fit(X_train, y_train, batch_size = 15, epochs = 4000,verbose=0)
print("--- %s seconds ---" % (time.time() - start_time))


--- 47.88141584396362 seconds ---

In [658]:
pred_train = ANN_classifier.predict(X_train)
pred = ANN_classifier.predict(X_val)
print("Mean squared error: ", np.mean((pred_train - y_train.values.reshape(-1,1)) ** 2))
print("Mean squared error validation: ", np.mean((pred - y_val.values.reshape(-1,1)) ** 2))


Mean squared error:  1819.91797975
Mean squared error validation:  2880.4640133

Clearly Neural network has improved the fit but much more slower than regression

Now create Predicted_request_counts.csv with neural network fit


In [686]:
y_test_pred_ANN=ANN_classifier.predict(X_test)
pred_ANN = pd.DataFrame({'Predicted_request_counts':[y_test_pred_ANN]},index =[0])
pred_ANN.head()


Out[686]:
Predicted_request_counts
0 [[465.023], [407.046], [566.8], [636.315], [49...

In [687]:
pred_ANN.to_csv('predicted_request_counts_ANN.csv', index=False)

Conclusion

  • Though size of the given data is small but the catogorical data carrying valuable undelying structure helped in achieving a reasonable accuracy.
  • Conversion of date to week days revealed important informations.
    • Weekend is the most active period to receive requests.
    • Weekdays showed a more calm and uniform period.
  • One-hot-encoding turned out to be a smart way to expose the hidden structure in catogorical variables and enable us to use them as numerical feature.
  • Neural network produced better error compare to linear regression but at the cost more execution time. Execution time though is not very important to take into account for this specific problem.
  • Tried with different no of hidden layers but using more than one tends to overfit.
  • Expected the removing the least appearing events would improve the results slightly but that was not true.
  • Decided to not to use the Recursive Feature Elimination ( RFE ) since the data is small and execution time of models is not significant.
  • In further work, Various methods could have been used to create feature ranking matrix.
  • One easy fix is needed. I noticed the output predict is not integer which it should be. Don't have time for this now but will fix asap.