Kaggle: Bike Sharing

In the Kaggle bike sharing competition, the task is to predict hourly bike rental demand from historical checkout data and associated features such as season, temperature, humidity, and weather. First we'll read in the data and set the datetime column as the index. We'll also generate features from the timestamp, such as the hour and the day of the week.


In [5]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestRegressor

In [10]:
#load data
train_df = pd.read_csv('./data/train.csv')
test_df = pd.read_csv('./data/test.csv')
train_df = train_df.dropna()

#parse the timestamps and use them as the index
train_df['datetime'] = pd.to_datetime(train_df['datetime'])
test_df['datetime'] = pd.to_datetime(test_df['datetime'])

train_ts = train_df.set_index('datetime')
test_ts = test_df.set_index('datetime')

#engineer calendar features from the timestamp
train_ts['hour'] = train_ts.index.hour
train_ts['day'] = train_ts.index.day_name()
test_ts['hour'] = test_ts.index.hour
test_ts['day'] = test_ts.index.day_name()

train_ts.head()


Out[10]:
season holiday workingday weather temp atemp humidity windspeed casual registered count hour day
datetime
2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0 3 13 16 0 Saturday
2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0 8 32 40 1 Saturday
2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0 5 27 32 2 Saturday
2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0.0 3 10 13 3 Saturday
2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0.0 0 1 1 4 Saturday
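
Before modeling, it helps to confirm how the competition splits the data: the training set covers the first 19 days of each month and the test set covers day 20 onward, which is what makes the walk-forward training scheme at the end of this notebook possible. A quick check (not part of the original pipeline):

#train should span days 1-19 of each month, test days 20 onward
print(train_ts.index.day.min(), train_ts.index.day.max())
print(test_ts.index.day.min(), test_ts.index.day.max())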

Next, we'll visualize bike checkouts by hour and day of the week. On weekdays, demand peaks around 8am and 5pm, matching commute hours, while on weekends demand rises between noon and 3pm.


In [11]:
#data exploration: mean hourly demand for each day of the week
plt.figure()
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']
for day in day_order:
    counts_mean, counts_std = [], []
    train_day = train_ts.loc[train_ts['day'] == day]
    for i in range(24):
        train_day_hour = train_day.loc[train_day['hour'] == i]
        counts_mean.append(train_day_hour['count'].mean())
        counts_std.append(train_day_hour['count'].std())
    #plt.errorbar(range(24), counts_mean, yerr=counts_std, label=day)
    plt.plot(counts_mean, label=day)
plt.legend(loc=2, prop={'size':15})
plt.xlabel('hours', fontsize=20)
plt.ylabel('counts', fontsize=20)
plt.show()
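
Since seaborn is already imported but unused, the same hour-by-day profile can be drawn more compactly. This is an optional sketch, not part of the original pipeline; pointplot aggregates the per-hour means and adds confidence intervals automatically.

#alternative: let seaborn compute the per-hour means and confidence intervals
sns.pointplot(data=train_ts, x='hour', y='count', hue='day')
plt.show()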


We can use a random forest regressor to rank our features by predictive ability. Notice that the most predictive feature is the hour, which we engineered from the timestamp, followed by temperature, humidity, and day of the week.


In [12]:
#feature ranking
mapping = {'Monday': 0, 'Tuesday': 1, 'Wednesday': 2, 'Thursday': 3,
           'Friday': 4, 'Saturday': 5, 'Sunday': 6}
train_ts['day'] = train_ts['day'].map(mapping).astype(int)
test_ts['day'] = test_ts['day'].map(mapping).astype(int)

#exclude the targets from the feature set
train_cols = [col for col in train_ts.columns if col not in ['count', 'registered', 'casual']]
X_all = train_ts[train_cols].values
y_all = train_ts['count'].values

rf = RandomForestRegressor(n_estimators=100, max_depth=20, random_state=0)
rf.fit(X_all, y_all)

feature_ranks = rf.feature_importances_
num_features = len(feature_ranks)

plt.figure()
plt.bar(range(num_features), feature_ranks, width=0.35)
plt.xticks(range(num_features), train_cols, rotation='vertical', fontsize=16)
plt.subplots_adjust(bottom=0.25)
plt.show()
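
For a numeric view of the same ranking, the importances can also be printed in sorted order; this small addition uses only objects already defined above.

#tabulate feature importances, largest first
importances = pd.Series(rf.feature_importances_, index=train_cols)
print(importances.sort_values(ascending=False))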


Finally, we generate our submission by iterating over each year and month in the test set, training only on data timestamped before the start of that test window (avoiding look-ahead bias), and predicting for the current month.


In [13]:
#initialize the submission with zero counts
submission = pd.DataFrame(0, index=test_ts.index, columns=['count'])

#walk forward in time: use only past data for training
for year in np.unique(test_ts.index.year):
    for month in np.unique(test_ts.index.month):
        print("Predicting Year: %d, Month: %d" % (year, month))
        test_locs = np.logical_and(test_ts.index.year == year, test_ts.index.month == month)
        test_subset = test_ts[test_locs]
        train_locs = train_ts.index <= test_subset.index.min()
        train_subset = train_ts[train_locs]

        X_train = train_subset[train_cols].values
        y_train = train_subset['count'].values

        rf = RandomForestRegressor(n_estimators=100, max_depth=20, random_state=0)
        rf.fit(X_train, y_train)

        X_test = test_subset[train_cols].values
        counts = rf.predict(X_test)

        submission.loc[test_locs, 'count'] = np.round(counts).astype(int)

submission.head()


Predicting Year: 2011, Month: 1
Predicting Year: 2011, Month: 2
Predicting Year: 2011, Month: 3
Predicting Year: 2011, Month: 4
Predicting Year: 2011, Month: 5
Predicting Year: 2011, Month: 6
Predicting Year: 2011, Month: 7
Predicting Year: 2011, Month: 8
Predicting Year: 2011, Month: 9
Predicting Year: 2011, Month: 10
Predicting Year: 2011, Month: 11
Predicting Year: 2011, Month: 12
Predicting Year: 2012, Month: 1
Predicting Year: 2012, Month: 2
Predicting Year: 2012, Month: 3
Predicting Year: 2012, Month: 4
Predicting Year: 2012, Month: 5
Predicting Year: 2012, Month: 6
Predicting Year: 2012, Month: 7
Predicting Year: 2012, Month: 8
Predicting Year: 2012, Month: 9
Predicting Year: 2012, Month: 10
Predicting Year: 2012, Month: 11
Predicting Year: 2012, Month: 12
Out[13]:
count
datetime
2011-01-20 00:00:00 8
2011-01-20 01:00:00 12
2011-01-20 02:00:00 11
2011-01-20 03:00:00 4
2011-01-20 04:00:00 3
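
All that remains is writing the predictions to disk. A minimal sketch, assuming the standard datetime,count submission format for this competition (the filename is arbitrary):

#write the submission file in the expected datetime,count format
submission.to_csv('submission.csv', index_label='datetime')

Since the competition is scored on RMSLE, a common refinement is to train on np.log1p of the counts and invert the predictions with np.expm1, which keeps high-demand hours from dominating the loss.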