Kaggle: Bike Sharing

In the Kaggle bike sharing competition, the task is to predict hourly bike rental demand from historical checkout data and associated features such as season, temperature, humidity, and weather. First we'll read in the data and set the datetime column as the index. We'll also generate features from the timestamp, such as the hour and the day of the week.


In [5]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestRegressor

In [10]:
#load data
train_df = pd.read_csv('./data/train.csv')
test_df = pd.read_csv('./data/test.csv')
train_df = train_df.dropna()

#parse the timestamps and use them as the index
train_df['datetime'] = pd.to_datetime(train_df['datetime'])
test_df['datetime'] = pd.to_datetime(test_df['datetime'])

train_ts = train_df.set_index('datetime')
test_ts = test_df.set_index('datetime')

#engineer calendar features from the timestamp
train_ts['hour'] = train_ts.index.hour
train_ts['day'] = train_ts.index.day_name()
test_ts['hour'] = test_ts.index.hour
test_ts['day'] = test_ts.index.day_name()

train_ts.head()


Out[10]:
season holiday workingday weather temp atemp humidity windspeed casual registered count hour day
datetime
2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0 3 13 16 0 Saturday
2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0 8 32 40 1 Saturday
2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0 5 27 32 2 Saturday
2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0.0 3 10 13 3 Saturday
2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0.0 0 1 1 4 Saturday
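
Before modeling, it helps to confirm how the competition splits the data: the training set covers the first 19 days of each month and the test set covers day 20 onward, which is what makes the walk-forward training scheme at the end of this notebook possible. A quick check (not part of the original pipeline):

#train should span days 1-19 of each month, test days 20 onward
print(train_ts.index.day.min(), train_ts.index.day.max())
print(test_ts.index.day.min(), test_ts.index.day.max())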

Next, we'll visualize bike checkouts by hour and day of the week. On weekdays, demand peaks around 8am and 5pm, matching commute hours, while on weekends demand rises between noon and 3pm.


In [11]:
#data exploration: mean hourly demand for each day of the week
plt.figure()
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']
for day in day_order:
    counts_mean, counts_std = [], []
    train_day = train_ts.loc[train_ts['day'] == day]
    for i in range(24):
        train_day_hour = train_day.loc[train_day['hour'] == i]
        counts_mean.append(train_day_hour['count'].mean())
        counts_std.append(train_day_hour['count'].std())
    #plt.errorbar(range(24), counts_mean, yerr=counts_std, label=day)
    plt.plot(counts_mean, label=day)
plt.legend(loc=2, prop={'size':15})
plt.xlabel('hours', fontsize=20)
plt.ylabel('counts', fontsize=20)
plt.show()
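
Since seaborn is already imported but unused, the same hour-by-day profile can be drawn more compactly. This is an optional sketch, not part of the original pipeline; pointplot aggregates the per-hour means and adds confidence intervals automatically.

#alternative: let seaborn compute the per-hour means and confidence intervals
sns.pointplot(data=train_ts, x='hour', y='count', hue='day')
plt.show()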


We can use a random forest regressor to rank our features by predictive ability. Notice that the most predictive feature is the hour, which we engineered from the timestamp, followed by temperature, humidity, and day of the week.


In [12]:
#feature ranking
mapping = {'Monday': 0, 'Tuesday': 1, 'Wednesday': 2, 'Thursday': 3,
           'Friday': 4, 'Saturday': 5, 'Sunday': 6}
train_ts['day'] = train_ts['day'].map(mapping).astype(int)
test_ts['day'] = test_ts['day'].map(mapping).astype(int)

#exclude the targets from the feature set
train_cols = [col for col in train_ts.columns if col not in ['count', 'registered', 'casual']]
X_all = train_ts[train_cols].values
y_all = train_ts['count'].values

rf = RandomForestRegressor(n_estimators=100, max_depth=20, random_state=0)
rf.fit(X_all, y_all)

feature_ranks = rf.feature_importances_
num_features = len(feature_ranks)

plt.figure()
plt.bar(range(num_features), feature_ranks, width=0.35)
plt.xticks(range(num_features), train_cols, rotation='vertical', fontsize=16)
plt.subplots_adjust(bottom=0.25)
plt.show()
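
For a numeric view of the same ranking, the importances can also be printed in sorted order; this small addition uses only objects already defined above.

#tabulate feature importances, largest first
importances = pd.Series(rf.feature_importances_, index=train_cols)
print(importances.sort_values(ascending=False))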


Finally, we generate our submission by iterating over each year and month in the test set, training only on data timestamped before the start of that test window (avoiding look-ahead bias), and predicting for the current month.


In [13]:
#initialize the submission with zero counts
submission = pd.DataFrame(0, index=test_ts.index, columns=['count'])

#walk forward in time: use only past data for training
for year in np.unique(test_ts.index.year):
    for month in np.unique(test_ts.index.month):
        print("Predicting Year: %d, Month: %d" % (year, month))
        test_locs = np.logical_and(test_ts.index.year == year, test_ts.index.month == month)
        test_subset = test_ts[test_locs]
        train_locs = train_ts.index <= test_subset.index.min()
        train_subset = train_ts[train_locs]

        X_train = train_subset[train_cols].values
        y_train = train_subset['count'].values

        rf = RandomForestRegressor(n_estimators=100, max_depth=20, random_state=0)
        rf.fit(X_train, y_train)

        X_test = test_subset[train_cols].values
        counts = rf.predict(X_test)

        submission.loc[test_locs, 'count'] = np.round(counts).astype(int)

submission.head()


Predicting Year: 2011, Month: 1
Predicting Year: 2011, Month: 2
Predicting Year: 2011, Month: 3
Predicting Year: 2011, Month: 4
Predicting Year: 2011, Month: 5
Predicting Year: 2011, Month: 6
Predicting Year: 2011, Month: 7
Predicting Year: 2011, Month: 8
Predicting Year: 2011, Month: 9
Predicting Year: 2011, Month: 10
Predicting Year: 2011, Month: 11
Predicting Year: 2011, Month: 12
Predicting Year: 2012, Month: 1
Predicting Year: 2012, Month: 2
Predicting Year: 2012, Month: 3
Predicting Year: 2012, Month: 4
Predicting Year: 2012, Month: 5
Predicting Year: 2012, Month: 6
Predicting Year: 2012, Month: 7
Predicting Year: 2012, Month: 8
Predicting Year: 2012, Month: 9
Predicting Year: 2012, Month: 10
Predicting Year: 2012, Month: 11
Predicting Year: 2012, Month: 12
Out[13]:
count
datetime
2011-01-20 00:00:00 8
2011-01-20 01:00:00 12
2011-01-20 02:00:00 11
2011-01-20 03:00:00 4
2011-01-20 04:00:00 3
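
All that remains is writing the predictions to disk. A minimal sketch, assuming the standard datetime,count submission format for this competition (the filename is arbitrary):

#write the submission file in the expected datetime,count format
submission.to_csv('submission.csv', index_label='datetime')

Since the competition is scored on RMSLE, a common refinement is to train on np.log1p of the counts and invert the predictions with np.expm1, which keeps high-demand hours from dominating the loss.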