Look at the hourly data

Now on to the real challenge: the hourly data. So, as usual, make some plots, do a bit of analysis, and try to get a feel for the data.

Remaining TO DO:

  • Try preprocessing - standardize the features, which I really should do...
  • Try a one-hot encoder on day of the week (each day should be treated as a separate feature instead of a continuous one)
    • When doing this, we can get rid of the 'workingday' feature
  • Fit independently to the 'casual' and 'registered' values as targets, and then, when we predict, sum the two predictions to compare to the test values - this is probably only useful/effective for certain estimators (see the sketch after this list)
  • Evaluate the estimators
  • Clean up the plots and make them look nicer
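
Here's a minimal sketch of that dual-target idea. The names ytrain_casual and ytrain_registered are hypothetical: they'd be built the same way as ytrain below, but keeping the 'casual' and 'registered' columns as targets instead of 'cnt'.

In [ ]:
# hypothetical sketch: fit one forest per rider population, then sum
# the predictions. ytrain_casual / ytrain_registered are assumed to be
# built like ytrain below, from the 'casual' and 'registered' columns.
from sklearn.ensemble import RandomForestRegressor

reg_casual = RandomForestRegressor(n_estimators=100).fit(Xtrain, ytrain_casual)
reg_registered = RandomForestRegressor(n_estimators=100).fit(Xtrain, ytrain_registered)

# sum the two predictions and compare against the 'cnt' test values
pred_total = reg_casual.predict(Xtest) + reg_registered.predict(Xtest)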

In [13]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [14]:
hourly = pd.read_csv('Bike-Sharing-Dataset/hour.csv', header=0)

In [15]:
hourly.head(20)


Out[15]:
instant dteday season yr mnth hr holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
0 1 2011-01-01 1 0 1 0 0 6 0 1 0.24 0.2879 0.81 0.0000 3 13 16
1 2 2011-01-01 1 0 1 1 0 6 0 1 0.22 0.2727 0.80 0.0000 8 32 40
2 3 2011-01-01 1 0 1 2 0 6 0 1 0.22 0.2727 0.80 0.0000 5 27 32
3 4 2011-01-01 1 0 1 3 0 6 0 1 0.24 0.2879 0.75 0.0000 3 10 13
4 5 2011-01-01 1 0 1 4 0 6 0 1 0.24 0.2879 0.75 0.0000 0 1 1
5 6 2011-01-01 1 0 1 5 0 6 0 2 0.24 0.2576 0.75 0.0896 0 1 1
6 7 2011-01-01 1 0 1 6 0 6 0 1 0.22 0.2727 0.80 0.0000 2 0 2
7 8 2011-01-01 1 0 1 7 0 6 0 1 0.20 0.2576 0.86 0.0000 1 2 3
8 9 2011-01-01 1 0 1 8 0 6 0 1 0.24 0.2879 0.75 0.0000 1 7 8
9 10 2011-01-01 1 0 1 9 0 6 0 1 0.32 0.3485 0.76 0.0000 8 6 14
10 11 2011-01-01 1 0 1 10 0 6 0 1 0.38 0.3939 0.76 0.2537 12 24 36
11 12 2011-01-01 1 0 1 11 0 6 0 1 0.36 0.3333 0.81 0.2836 26 30 56
12 13 2011-01-01 1 0 1 12 0 6 0 1 0.42 0.4242 0.77 0.2836 29 55 84
13 14 2011-01-01 1 0 1 13 0 6 0 2 0.46 0.4545 0.72 0.2985 47 47 94
14 15 2011-01-01 1 0 1 14 0 6 0 2 0.46 0.4545 0.72 0.2836 35 71 106
15 16 2011-01-01 1 0 1 15 0 6 0 2 0.44 0.4394 0.77 0.2985 40 70 110
16 17 2011-01-01 1 0 1 16 0 6 0 2 0.42 0.4242 0.82 0.2985 41 52 93
17 18 2011-01-01 1 0 1 17 0 6 0 2 0.44 0.4394 0.82 0.2836 15 52 67
18 19 2011-01-01 1 0 1 18 0 6 0 3 0.42 0.4242 0.88 0.2537 9 26 35
19 20 2011-01-01 1 0 1 19 0 6 0 3 0.42 0.4242 0.88 0.2537 6 31 37

In [16]:
# quick check: iterate over a few rows to see what iterrows gives us
for row_index, row in hourly.iterrows():
    print row_index, row['registered']
    if (row_index > 5):
        break


0 13
1 32
2 27
3 10
4 1
5 1
6 0

In [17]:
# accumulate registered rides into an hour (rows) x weekday (columns) grid
weekly = np.zeros((24,7))
for row_index, row in hourly.iterrows():
    weekly[row['hr'], row['weekday']] += row['registered']

In [18]:
print np.max(weekly)


50883.0

In [19]:
plt.imshow(np.sqrt(weekly/np.max(weekly)), interpolation='nearest', aspect=0.5)


Out[19]:
<matplotlib.image.AxesImage at 0x114839a50>

In [20]:
# same grid, but for casual rides
weekly = np.zeros((24,7))
for row_index, row in hourly.iterrows():
    weekly[row['hr'], row['weekday']] += row['casual']
plt.imshow(np.sqrt(weekly/np.max(weekly)), interpolation='nearest', aspect=0.5)


Out[20]:
<matplotlib.image.AxesImage at 0x1150a35d0>

In [21]:
# and for the total count
weekly = np.zeros((24,7))
for row_index, row in hourly.iterrows():
    weekly[row['hr'], row['weekday']] += row['cnt']
plt.imshow(np.sqrt(weekly/np.max(weekly)), interpolation='nearest', aspect=0.5)


Out[21]:
<matplotlib.image.AxesImage at 0x115190650>
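
As an aside, pandas can build the same hour-by-weekday grid without the explicit iterrows loop (a sketch, assuming a pandas recent enough to have the index/columns keywords on pivot_table):

In [ ]:
# equivalent to the loops above: sum 'registered' rides into an
# hour (rows) x weekday (columns) grid, then plot it the same way
weekly_df = hourly.pivot_table(values='registered', index='hr',
                               columns='weekday', aggfunc='sum')
plt.imshow(np.sqrt(weekly_df.values / weekly_df.values.max()),
           interpolation='nearest', aspect=0.5)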

It's quite clear that registered and casual users are very different

I already knew this from analyzing the daily data, but this shows that registered riders tend to commute, while casual riders joyride on the weekends. And if you look closely, you can see that registered users tend to head out of work a bit earlier on Fridays. To get the best model, we'll need to include weekday (which we may want to run through a one-hot encoder first).

Get ready for ML

I want to use the same training/test cut that Kaggle does, so I need to put the day of the month into a column and then drop the unnecessary columns.


In [22]:
hourly['day'] = pd.DatetimeIndex(hourly.dteday).day  # day of the month
hourly = hourly.drop(['instant','dteday','casual','registered'], axis=1)

In [83]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# quick LabelEncoder pass over weekday (exploratory; not used below)
le = LabelEncoder()
weekday_labels = le.fit_transform(hourly['weekday'])

hourly[['weekday','day']]


Out[83]:
weekday day
0 6 1
1 6 1
2 6 1
3 6 1
4 6 1
5 6 1
6 6 1
7 6 1
8 6 1
9 6 1
10 6 1
11 6 1
12 6 1
13 6 1
14 6 1
15 6 1
16 6 1
17 6 1
18 6 1
19 6 1
20 6 1
21 6 1
22 6 1
23 6 1
24 0 2
25 0 2
26 0 2
27 0 2
28 0 2
29 0 2
... ... ...
17349 0 30
17350 0 30
17351 0 30
17352 0 30
17353 0 30
17354 0 30
17355 1 31
17356 1 31
17357 1 31
17358 1 31
17359 1 31
17360 1 31
17361 1 31
17362 1 31
17363 1 31
17364 1 31
17365 1 31
17366 1 31
17367 1 31
17368 1 31
17369 1 31
17370 1 31
17371 1 31
17372 1 31
17373 1 31
17374 1 31
17375 1 31
17376 1 31
17377 1 31
17378 1 31

17379 rows × 2 columns


In [92]:
# try one-hot encoding on the weekday column: each of the seven days
# becomes its own binary feature
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit(hourly[['weekday']])
weekday_onehot = enc.transform(hourly[['weekday']])
print weekday_onehot[0:100].toarray()


[[ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.  0.]]
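
These columns aren't actually used in the models below, but if we wanted to feed them to an estimator, one way (a sketch; weekday_onehot is the sparse matrix from the cell above) is to drop the raw 'weekday' column and stack the seven binary columns onto the design matrix:

In [ ]:
# splice the one-hot weekday columns into a dense feature matrix
features = hourly.drop(['cnt', 'weekday'], axis=1).values
X_onehot = np.hstack([features, weekday_onehot.toarray()])
print X_onehot.shape  # seven binary columns replace the single weekday column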

In [23]:
Xtrain = hourly[hourly.day < 19].drop('cnt',axis=1).values #the data for the training set
ytrain = (hourly[hourly.day < 19])['cnt'].values #the target of the training set

Xtest = hourly[hourly.day >= 19].drop('cnt',axis=1).values #the data for the test set
ytest = (hourly[hourly.day >= 19])['cnt'].values #the target of the test set

In [24]:
print ytrain.shape
print Xtrain.shape
print ytest.shape
print Xtest.shape
print Xtest[0]


(10312,)
(10312, 13)
(7067,)
(7067, 13)
[  1.       0.       1.       0.       0.       3.       1.       2.       0.22
   0.2727   0.93     0.      19.    ]

In [25]:
from sklearn import cross_validation
from sklearn.ensemble import RandomForestRegressor
from sklearn.grid_search import GridSearchCV

In [26]:
cv = cross_validation.ShuffleSplit(len(Xtrain), n_iter=3, test_size=0.2,
                                   random_state=0)

for train, test in cv:
    reg = RandomForestRegressor(n_estimators=500).fit(Xtrain[train], ytrain[train])
    print reg.score(Xtrain[train], ytrain[train]), reg.score(Xtrain[test], ytrain[test])
# reg is now the model from the last fold; score it on the held-out days
print reg.score(Xtest, ytest)


0.99328116142 0.953972144668
0.993364056771 0.947201659262
0.993132688776 0.949909553821
0.876479656556

In [28]:
estimators = [10,100,500]
grid = GridSearchCV(estimator=reg, param_grid=dict(n_estimators=estimators), n_jobs=-1)
grid.fit(Xtrain,ytrain)        

print grid.best_score_                                  

print grid.best_estimator_.n_estimators


0.698622292755
500

In [29]:
print grid.grid_scores_


[mean: 0.68889, std: 0.13296, params: {'n_estimators': 10}, mean: 0.69643, std: 0.12743, params: {'n_estimators': 100}, mean: 0.69862, std: 0.12848, params: {'n_estimators': 500}]

RandomForestRegressor does a much better job than the other estimators. I'm not seeing much difference when I change n_estimators, and the mean score is much lower than I'd expect (i.e., it doesn't match the CV scores above). I wonder if I did this right...
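
A likely culprit (an assumption, not verified here): with no cv argument, this version of GridSearchCV defaults to 3-fold cross-validation without shuffling, so each fold is a contiguous block of the time-ordered rows. Passing the same ShuffleSplit used above should make the grid-search scores comparable to the CV scores. A sketch:

In [ ]:
# hand GridSearchCV shuffled folds instead of its default contiguous ones
cv = cross_validation.ShuffleSplit(len(Xtrain), n_iter=3, test_size=0.2,
                                   random_state=0)
grid = GridSearchCV(estimator=reg, param_grid=dict(n_estimators=estimators),
                    cv=cv, n_jobs=-1)
grid.fit(Xtrain, ytrain)
print grid.best_score_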


In [30]:
pred = reg.predict(Xtest) #put the predicted values into an array
hrInd=3 #the column number of the hr column
weekdayInd = 5 
weeklyPredict = np.zeros((24,7))
weeklyActual = np.zeros((24,7))

for i in range(0,len(ytest)):
    weeklyPredict[Xtest[i,hrInd], Xtest[i,weekdayInd]] += pred[i]
    weeklyActual[Xtest[i,hrInd], Xtest[i,weekdayInd]] += ytest[i]

def makeDifferencePlot(weeklyPredict, weeklyActual):
    plt.figure(1, figsize=(12,6))
    plt.subplot(141)  # predicted counts
    plt.imshow(np.sqrt(weeklyPredict), interpolation='nearest', aspect=0.5)

    plt.subplot(142)  # actual counts
    plt.imshow(np.sqrt(weeklyActual), interpolation='nearest', aspect=0.5)

    plt.subplot(143)  # difference (predicted - actual)
    plt.imshow((weeklyPredict-weeklyActual), interpolation='nearest', aspect=0.5)

    plt.subplot(144)  # ratio (predicted / actual)
    plt.imshow((weeklyPredict/weeklyActual), interpolation='nearest', aspect=0.5)

    plt.show()

makeDifferencePlot(weeklyPredict, weeklyActual)


This looks reasonable. Obviously not perfect, and the difference plot has some structure. The ratio is reasonably flat, though. This is also for the entire test set - there may be more pronounced differences under certain conditions (i.e., weather, temperature, season, etc.).

Take a look at the same plots when the weather is good and the temperature is high.


In [31]:
weeklyPredict = np.zeros((24,7))
weeklyActual = np.zeros((24,7))
weatherInd=7
atempInd = 9
for i in range(0,len(ytest)):
    if (Xtest[i, weatherInd] < 3 and Xtest[i,atempInd] > .6):  # good weather, warm
        weeklyPredict[Xtest[i,hrInd], Xtest[i,weekdayInd]] += pred[i]
        weeklyActual[Xtest[i,hrInd], Xtest[i,weekdayInd]] += ytest[i]

makeDifferencePlot(weeklyPredict, weeklyActual)



In [32]:
weeklyPredict = np.zeros((24,7))
weeklyActual = np.zeros((24,7))
weatherInd=7
atempInd = 9
for i in range(0,len(ytest)):
    if (Xtest[i, weatherInd] > 2 and Xtest[i,atempInd] < .6):  # bad weather, cool
        weeklyPredict[Xtest[i,hrInd], Xtest[i,weekdayInd]] += pred[i]
        weeklyActual[Xtest[i,hrInd], Xtest[i,weekdayInd]] += ytest[i]

makeDifferencePlot(weeklyPredict, weeklyActual)


The model is way overpredicting Saturday rides in bad weather.

Now plot the learning curve

This method is lifted from the scikit-learn docs.


In [34]:
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see
        sklearn.cross_validation module for the list of possible objects

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

In [38]:
from sklearn.learning_curve import learning_curve
title = "Learning Curves (Random forest)"
# Cross validation with 30 iterations to get smoother mean test and train
# score curves, each time with 30% of the data randomly selected as a validation set.
cv = cross_validation.ShuffleSplit(len(Xtrain), n_iter=30,
                                   test_size=0.3, random_state=0)  
reg = RandomForestRegressor(n_estimators = 100)
plot_learning_curve(reg, title, Xtrain, ytrain, ylim=(0.6, 1.01), cv=cv, n_jobs=1)


Out[38]:
<module 'matplotlib.pyplot' from '/usr/local/lib/python2.7/site-packages/matplotlib/pyplot.pyc'>

In [37]:
plt.show()

This plot can be interpreted as needing more training data, but it can also be interpreted as needing fewer features (or more regularization). That is to say, the model does very well on the training set but noticeably worse on the cross-validation set, so it's overfitting / has high variance.
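
One quick way to probe the high-variance hypothesis (a sketch; the max_depth and min_samples_leaf values here are arbitrary starting points, not tuned): constrain the trees and see whether the gap between the two curves closes.

In [ ]:
# a deliberately constrained forest, trading variance for bias
reg_constrained = RandomForestRegressor(n_estimators=100, max_depth=10,
                                        min_samples_leaf=5)
plot_learning_curve(reg_constrained, "Learning Curves (constrained forest)",
                    Xtrain, ytrain, ylim=(0.6, 1.01), cv=cv, n_jobs=1)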


In [ ]: