Note: Arundo_take_home_challenge_test_set.csv, though loaded as test_data (judging by the file name), should not be confused with the test or validation data of the ML model. The model's validation/test data will actually come from splitting Arundo_take_home_challenge_training_set.csv.
In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn import svm
import matplotlib.pyplot as plt
import seaborn as sns
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import np_utils
import time
In [4]:
#Also remember to parse the date column. This will be helpful in the next step
data=pd.read_csv('Arundo_take_home_challenge_training_set.csv',sep=',',parse_dates=['date'])
#Have a look at the data
data.head(15)
Out[4]:
In [5]:
data.tail(5)
Out[5]:
In [6]:
print(data.isnull().any())
In [4]:
data.hist('max_temp',weights=data['request_count'])
data.hist('min_temp',weights=data['request_count'])
data.hist('precipitation',weights=data['request_count'])
plt.show()
All in all, the distribution of request_count is strongly correlated with the continuous (float) variables, so all of them will be kept as features. Let's now evaluate the correlation of the two clearly categorical variables, 'events' and 'calendar_code', with request_count.
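The correlation claim above can be checked numerically with `DataFrame.corr`; a minimal sketch on a synthetic stand-in for the CSV (hypothetical values, real column names):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the challenge CSV: hypothetical values, real column names
rng = np.random.default_rng(0)
max_temp = rng.uniform(-5, 20, 200)
df = pd.DataFrame({
    'max_temp': max_temp,
    'min_temp': max_temp - rng.uniform(2, 8, 200),
    'precipitation': rng.uniform(0, 10, 200),
})
# Assume requests scale with temperature plus noise (illustration only)
df['request_count'] = 2.0 * df['max_temp'] + rng.normal(0, 1, 200)

# Pearson correlation of each float column with the target
corr = df.corr()['request_count'].drop('request_count')
print(corr.round(2))
```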
In [5]:
#Aggregate request_count by events
data.groupby('events').request_count.agg(['mean','max','min']).plot(kind='bar')
plt.show()
data.groupby('events').request_count.agg(['count','mean','max','min'])
Out[5]:
Clearly, more support requests come in when the weather condition is overcast, which is understandable. However, few data instances are available for most weather events other than 'None', 'Rain' and 'Snow', which will make it challenging to split the data honestly, train the ML model and test its accuracy.
In [6]:
# Now aggregate request_count by calendar_code
data.groupby('calendar_code').request_count.agg(['mean','max','min']).plot(kind='bar')
plt.show()
data.groupby('calendar_code').request_count.agg(['count','mean','max','min'])
Out[6]:
Calendar code probably encodes the intensity of weather variation within a single day. The distribution of request_count across calendar codes would be interesting to see.
In [7]:
var_name = "events"
col_order = np.sort(data[var_name].unique()).tolist()
plt.figure(figsize=(16,6))
sns.violinplot(x=var_name, y='request_count', data=data, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('request_count', fontsize=12)
plt.title("Distribution of request count with "+var_name, fontsize=15)
plt.show()
For events with few data points it is difficult to see the distribution; this may eventually show up as error in the trained model, or make it difficult or even impossible to split the data into training, test and validation sets. More data points for these events would certainly help train a better model for them.
In [9]:
var_name = "calendar_code"
col_order = np.sort(data[var_name].unique()).tolist()
plt.figure(figsize=(16,6))
sns.violinplot(x=var_name, y='request_count', data=data, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('request_count', fontsize=12)
plt.title("Distribution of request count with "+var_name, fontsize=15)
plt.show()
A significant variation of request_count across calendar codes is visible, so calendar_code will be part of the feature matrix.
Next, let's analyze the impact of the date. The dates cover mainly the winter months and may not be represented well if used as given. Site maintenance likely depends on whether it is a working day or a weekend instead. We start by adding an additional column with the week day (0: Monday, ..., 6: Sunday).
In [10]:
data['day_of_week'] = data['date'].dt.dayofweek
data['week_day'] = data['date'].dt.day_name() # dt.weekday_name was removed in newer pandas
data.head()
Out[10]:
In [11]:
# We again use groupby bar plots to see the underlying behaviour
data.groupby('week_day').request_count.agg(['mean','max','min']).plot(kind='bar')
plt.show()
data.groupby('week_day').request_count.agg(['count','mean','max','min'])
Out[11]:
Clearly, the weekend (Friday-Sunday) is the most active period, while Monday-Thursday show a nearly uniform mean request_count. This suggests the best way to reflect the effect of the date as a numeric feature is through the day of the week.
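Given the Friday-Sunday pattern, a simpler alternative to one-hot weekday columns would be a binary active-period flag; a sketch on hypothetical dates (the notebook itself keeps the full day_of_week encoding):

```python
import pandas as pd

# Hypothetical dates; the notebook derives day_of_week from the parsed 'date' column
df = pd.DataFrame({'date': pd.to_datetime(['2013-01-04', '2013-01-06', '2013-01-08'])})
df['day_of_week'] = df['date'].dt.dayofweek                      # 0 = Monday ... 6 = Sunday
df['is_active'] = df['day_of_week'].isin([4, 5, 6]).astype(int)  # Friday-Sunday
print(df['is_active'].tolist())  # [1, 1, 0]
```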
Next, we convert the events into unique integer identifiers. This results in an additional column, "events_code".
In [12]:
data['events_code'] = pd.Categorical(data["events"]).codes
data.head()
Out[12]:
Since request_count is the target variable, we store it separately as "y" for the ML model.
In [13]:
y=data["request_count"]
print("Shape of y ",y.shape)
Now drop the redundant columns: "date", "events", "request_count" and "week_day".
In [14]:
data_orig = data #Save data in data_orig before dropping redundant columns
data = data.drop(["date","events","request_count","week_day"],axis=1)
data.head()
Out[14]:
The categorical columns day_of_week, events_code and calendar_code need to be one-hot encoded before they can be used in the feature matrix.
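As a minimal illustration of the column naming `pd.get_dummies` produces with a `prefix` (toy frame, not the challenge data):

```python
import pandas as pd

# Toy frame mirroring one categorical column from the notebook
df = pd.DataFrame({'site_count': [10, 20, 30], 'calendar_code': [1, 2, 1]})
encoded = pd.get_dummies(df, columns=['calendar_code'], prefix=['calendar'])
print(list(encoded.columns))  # ['site_count', 'calendar_1', 'calendar_2']
```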
In [15]:
data= pd.get_dummies(data,columns=["calendar_code","events_code","day_of_week"],prefix=["calendar","event","week"])
data.head()
Out[15]:
The DataFrame is now ready to be used as the feature matrix. Let's assign its values to X.
In [17]:
X=data.values
X.shape
Out[17]:
In [20]:
plt.figure(1)
plt.plot(X[:,0],y[:],'r.')
plt.xlabel("No. of sites")
plt.ylabel("No. of requests")
plt.show()
plt.figure(1)
plt.plot((X[:,1]+X[:,2])/2.0,y[:],'r.') # mean temperature = (max_temp + min_temp)/2
plt.xlabel("Mean temperature")
plt.ylabel("No. of requests")
plt.show()
plt.figure(1)
plt.plot(X[:,3],y[:],'r.')
plt.xlabel("Precipitation")
plt.ylabel("No. of requests")
plt.show()
It appears that the number of requests has some kind of quadratic dependence on the mean temperature, so in addition to max and min temperature we should construct a new feature $((mintemp+maxtemp)/2)^2$.
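The benefit of such a constructed feature can be sketched on synthetic data with an assumed quadratic relation (hypothetical coefficients, not fitted to the challenge data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
min_t = rng.uniform(-10, 10, 300)
max_t = min_t + rng.uniform(2, 8, 300)
mean_t = (min_t + max_t) / 2.0
# Assumed quadratic dependence of requests on mean temperature, plus noise
y_demo = 5.0 * mean_t**2 + rng.normal(0, 5.0, 300)

X_lin = np.column_stack([min_t, max_t])
X_quad = np.column_stack([min_t, max_t, mean_t**2])

r2_lin = LinearRegression().fit(X_lin, y_demo).score(X_lin, y_demo)
r2_quad = LinearRegression().fit(X_quad, y_demo).score(X_quad, y_demo)
print(round(r2_lin, 3), round(r2_quad, 3))  # the squared-mean feature lifts R^2
```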
In [21]:
#(min+max)^2 differs from ((min+max)/2)^2 only by a constant factor of 4, which the regression coefficient absorbs
X=np.column_stack([X,(X[:,1]+X[:,2])**2.0])
X.shape
Out[21]:
In [22]:
#Split the data into training and validation sets; shuffling isn't strictly necessary since the data already appears random
X_train, X_val, y_train, y_val = train_test_split(X,y,test_size=0.2,random_state = 0)
In [23]:
#Multivariate regression
regr = linear_model.LinearRegression()
start_time =time.time()
regr.fit(X_train, y_train)
print("--- %s seconds ---" % (time.time() - start_time))
y_train_pred=regr.predict(X_train)
print("Mean squared error: %.2f" % np.mean((regr.predict(X_train) - y_train) ** 2))
print("Mean squared error with validation set: %.2f" % np.mean((regr.predict(X_val) - y_val) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(X_train, y_train))
print('Variance score with validation set: %.2f' % regr.score(X_val, y_val))
I also checked that the variance score decreases if the new $((mintemp+maxtemp)/2)^2$ column is not included. Recursive Feature Elimination (RFE) is skipped here, since the analysis above shows each variable has its own impact; it would make sense for a larger feature set, but not in this case.
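For reference, with a larger feature set RFE could look like the following sketch (synthetic data; only the first three features are informative):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
# Only the first three synthetic features drive the target
y_demo = 3*X_demo[:, 0] - 2*X_demo[:, 1] + X_demo[:, 2] + rng.normal(0, 0.1, 200)

# Recursively drop the weakest feature until three remain
selector = RFE(LinearRegression(), n_features_to_select=3).fit(X_demo, y_demo)
print(selector.support_)  # True for the retained features
```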
In [632]:
#Drop the least-frequent events
data_orig = data_orig[~data_orig['events'].isin(['Fog','Fog-Rain-Snow','Rain-Thunderstorm'])]
data_orig.shape
Out[632]:
In [633]:
#Preprocess data before fitting the multivariate regression
data_orig['day_of_week'] = data_orig['date'].dt.dayofweek
data_orig['events_code'] = pd.Categorical(data_orig["events"]).codes
data_orig= pd.get_dummies(data_orig,columns=["calendar_code","events_code","day_of_week"],prefix=["calendar","event","week"])
y_red=data_orig["request_count"]
data_orig = data_orig.drop(["date","events","request_count","week_day"],axis=1)
data_orig.head()
Out[633]:
In [634]:
#Split the data into training and validation sets; shuffling isn't strictly necessary since the data already appears random
X_red = data_orig.values
X_red_train, X_red_val, y_red_train, y_red_val = train_test_split(X_red,y_red,test_size=0.2,random_state = 0)
In [635]:
# Multivariate regression
regr = linear_model.LinearRegression()
regr.fit(X_red_train, y_red_train)
y_red_train_pred=regr.predict(X_red_train)
print("Mean squared error: %.2f" % np.mean((regr.predict(X_red_train) - y_red_train) ** 2))
print("Mean squared error with validation set: %.2f" % np.mean((regr.predict(X_red_val) - y_red_val) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(X_red_train, y_red_train))
print('Variance score with validation set: %.2f' % regr.score(X_red_val, y_red_val))
This has actually increased the error, so removing the least-frequent events does not help at all. Let's go back to our original regression fit.
In [636]:
#Multivariate regression
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
y_train_pred=regr.predict(X_train)
print("Mean squared error: %.2f" % np.mean((regr.predict(X_train) - y_train) ** 2))
print("Mean squared error with validation set: %.2f" % np.mean((regr.predict(X_val) - y_val) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(X_train, y_train))
print('Variance score with validation set: %.2f' % regr.score(X_val, y_val))
In [637]:
test_data=pd.read_csv('Arundo_take_home_challenge_test_set.csv',sep=',',parse_dates=['date'])
# Group by events to see whether all events from the training data set are present
test_data.groupby('events').site_count.agg(['mean','max','min']).plot(kind='bar')
plt.show()
test_data.groupby('events').site_count.agg(['count','mean','max','min'])
Out[637]:
Clearly, four events (Fog-Rain-Snow, Fog-Snow, Rain-Snow and Snow) are not listed. I wanted to see whether the least-frequent events, which won't be needed for predicting request_count in the given test CSV, could simply be removed. The only least-frequent event not listed here is Fog-Rain-Snow, which could in principle be dropped; but as already checked above, removing rare events does not improve the result.
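The reason for the merge in the next cell is column alignment: encoding train and test separately would give mismatched dummy columns. A toy sketch (hypothetical frames, same pattern as the notebook's key trick):

```python
import pandas as pd

# Toy stand-ins: 'Snow' appears in training rows but not in test rows
train = pd.DataFrame({'events': ['Rain', 'Snow', 'Rain']})
test = pd.DataFrame({'events': ['Rain', 'Rain']})

# Encoding the test frame alone misses the 'Snow' column entirely
cols_separate = set(pd.get_dummies(test, columns=['events']).columns)

# Concatenating first guarantees both frames share the full dummy-column set
merged = pd.concat([train.assign(is_test=0), test.assign(is_test=1)])
dummies = pd.get_dummies(merged, columns=['events'])
test_encoded = dummies[dummies['is_test'] == 1].drop(columns='is_test')
print(sorted(test_encoded.columns))  # ['events_Rain', 'events_Snow']
```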
In [638]:
# We must process the test CSV data the same way before predicting request_count
# To ensure consistent codes for the categorical variables, we merge the given test data with the training data
# and, once all event codes and week-day codes are included, we remove the training rows
data=pd.read_csv('Arundo_take_home_challenge_training_set.csv',sep=',',parse_dates=['date'])
data['key1'] = 1
data = data.drop(["request_count"],axis=1)
data.head()
test_data['key2'] = 0
frames = [test_data,data]
merg_frame = pd.concat(frames)
merg_frame['day_of_week'] = merg_frame['date'].dt.dayofweek
merg_frame['events_code'] = pd.Categorical(merg_frame["events"]).codes
merg_frame.head()
Out[638]:
In [639]:
# Drop the date and events columns
merg_frame = merg_frame.drop(["date","events"],axis=1)
merg_frame.head()
Out[639]:
In [640]:
merg_frame= pd.get_dummies(merg_frame,columns=["calendar_code","events_code","day_of_week"],prefix=["calendar","event","week"])
In [641]:
test_data = merg_frame[merg_frame['key2'] == 0]
test_data = test_data.drop(["key1","key2"],axis=1)
test_data.head()
Out[641]:
In [651]:
# Assign test data to X_test and add (min_temp+max_temp)^2 as an additional column, matching the training feature
X_test=test_data.values
X_test=np.column_stack([X_test,(X_test[:,1]+X_test[:,2])**2.0])
X_test.shape
Out[651]:
In [672]:
y_test_pred=regr.predict(X_test)
np.shape(y_test_pred)
Out[672]:
In [667]:
pred = pd.DataFrame({'Predicted_request_counts':y_test_pred})
pred.head()
Out[667]:
In [645]:
pred.to_csv('predicted_request_counts_regression.csv', index=False)
In [691]:
m,input_layer_size=X.shape
# Despite the "classifier" name, this is a regression network: a single linear output trained with MSE loss
ANN_classifier = Sequential()
ANN_classifier.add(Dense(units = 10, kernel_initializer = 'uniform', activation = 'relu', input_dim = input_layer_size))
ANN_classifier.add(Dense(units = 10, kernel_initializer = 'uniform', activation = 'relu'))
ANN_classifier.add(Dense(units = 1, kernel_initializer = 'normal'))
start_time = time.time()
ANN_classifier.compile(loss='mean_squared_error', optimizer='adam')
history=ANN_classifier.fit(X_train, y_train, batch_size = 15, epochs = 4000,verbose=0)
print("--- %s seconds ---" % (time.time() - start_time))
In [658]:
pred_train = ANN_classifier.predict(X_train)
pred = ANN_classifier.predict(X_val)
print("Mean squared error: ", np.mean((pred_train - y_train.values.reshape(-1,1)) ** 2))
print("Mean squared error validation: ", np.mean((pred - y_val.values.reshape(-1,1)) ** 2))
Now create the predicted request counts CSV with the neural-network fit.
In [686]:
y_test_pred_ANN=ANN_classifier.predict(X_test)
pred_ANN = pd.DataFrame({'Predicted_request_counts': y_test_pred_ANN.flatten()}) # flatten the (n,1) predictions into one row per sample
pred_ANN.head()
Out[686]:
In [687]:
pred_ANN.to_csv('predicted_request_counts_ANN.csv', index=False)