Look at the hourly data

Now on to the real challenge: the hourly data. So, as usual, make some plots, do a bit of analysis, and try to get a feel for the data.

Remaining TO DO:

  • Try preprocessing - standardize the features, which I really should do...
  • Try a one-hot encoder on day of the week (each day should be treated as a separate feature instead of a continuous one)
    • When doing this, we can get rid of the 'workingday' feature
  • Fit independently to the 'casual' and 'registered' values as targets, and then, when we predict, sum the two predictions to compare to the test values - this is probably only useful/effective for certain estimators (see the sketch after this list)
  • Evaluate the estimators
  • Clean up the plots and make them look nicer
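
Here's a minimal sketch of that dual-target idea. The names ytrain_casual and ytrain_registered are hypothetical: they'd be built the same way as ytrain below, but keeping the 'casual' and 'registered' columns as targets instead of 'cnt'.

In [ ]:
# hypothetical sketch: fit one forest per rider population, then sum
# the predictions. ytrain_casual / ytrain_registered are assumed to be
# built like ytrain below, from the 'casual' and 'registered' columns.
from sklearn.ensemble import RandomForestRegressor

reg_casual = RandomForestRegressor(n_estimators=100).fit(Xtrain, ytrain_casual)
reg_registered = RandomForestRegressor(n_estimators=100).fit(Xtrain, ytrain_registered)

# sum the two predictions and compare against the 'cnt' test values
pred_total = reg_casual.predict(Xtest) + reg_registered.predict(Xtest)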

In [13]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [14]:
hourly = pd.read_csv('Bike-Sharing-Dataset/hour.csv', header=0)

In [15]:
hourly.head(20)


Out[15]:
instant dteday season yr mnth hr holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
0 1 2011-01-01 1 0 1 0 0 6 0 1 0.24 0.2879 0.81 0.0000 3 13 16
1 2 2011-01-01 1 0 1 1 0 6 0 1 0.22 0.2727 0.80 0.0000 8 32 40
2 3 2011-01-01 1 0 1 2 0 6 0 1 0.22 0.2727 0.80 0.0000 5 27 32
3 4 2011-01-01 1 0 1 3 0 6 0 1 0.24 0.2879 0.75 0.0000 3 10 13
4 5 2011-01-01 1 0 1 4 0 6 0 1 0.24 0.2879 0.75 0.0000 0 1 1
5 6 2011-01-01 1 0 1 5 0 6 0 2 0.24 0.2576 0.75 0.0896 0 1 1
6 7 2011-01-01 1 0 1 6 0 6 0 1 0.22 0.2727 0.80 0.0000 2 0 2
7 8 2011-01-01 1 0 1 7 0 6 0 1 0.20 0.2576 0.86 0.0000 1 2 3
8 9 2011-01-01 1 0 1 8 0 6 0 1 0.24 0.2879 0.75 0.0000 1 7 8
9 10 2011-01-01 1 0 1 9 0 6 0 1 0.32 0.3485 0.76 0.0000 8 6 14
10 11 2011-01-01 1 0 1 10 0 6 0 1 0.38 0.3939 0.76 0.2537 12 24 36
11 12 2011-01-01 1 0 1 11 0 6 0 1 0.36 0.3333 0.81 0.2836 26 30 56
12 13 2011-01-01 1 0 1 12 0 6 0 1 0.42 0.4242 0.77 0.2836 29 55 84
13 14 2011-01-01 1 0 1 13 0 6 0 2 0.46 0.4545 0.72 0.2985 47 47 94
14 15 2011-01-01 1 0 1 14 0 6 0 2 0.46 0.4545 0.72 0.2836 35 71 106
15 16 2011-01-01 1 0 1 15 0 6 0 2 0.44 0.4394 0.77 0.2985 40 70 110
16 17 2011-01-01 1 0 1 16 0 6 0 2 0.42 0.4242 0.82 0.2985 41 52 93
17 18 2011-01-01 1 0 1 17 0 6 0 2 0.44 0.4394 0.82 0.2836 15 52 67
18 19 2011-01-01 1 0 1 18 0 6 0 3 0.42 0.4242 0.88 0.2537 9 26 35
19 20 2011-01-01 1 0 1 19 0 6 0 3 0.42 0.4242 0.88 0.2537 6 31 37

In [16]:
# quick check: iterate over a few rows to see what iterrows gives us
for row_index, row in hourly.iterrows():
    print row_index, row['registered']
    if (row_index > 5):
        break


0 13
1 32
2 27
3 10
4 1
5 1
6 0

In [17]:
# accumulate registered rides into an hour (rows) x weekday (columns) grid
weekly = np.zeros((24,7))
for row_index, row in hourly.iterrows():
    weekly[row['hr'], row['weekday']] += row['registered']

In [18]:
print np.max(weekly)


50883.0

In [19]:
plt.imshow(np.sqrt(weekly/np.max(weekly)), interpolation='nearest', aspect=0.5)


Out[19]:
<matplotlib.image.AxesImage at 0x114839a50>

In [20]:
# same grid, but for casual rides
weekly = np.zeros((24,7))
for row_index, row in hourly.iterrows():
    weekly[row['hr'], row['weekday']] += row['casual']
plt.imshow(np.sqrt(weekly/np.max(weekly)), interpolation='nearest', aspect=0.5)


Out[20]:
<matplotlib.image.AxesImage at 0x1150a35d0>

In [21]:
# and for the total count
weekly = np.zeros((24,7))
for row_index, row in hourly.iterrows():
    weekly[row['hr'], row['weekday']] += row['cnt']
plt.imshow(np.sqrt(weekly/np.max(weekly)), interpolation='nearest', aspect=0.5)


Out[21]:
<matplotlib.image.AxesImage at 0x115190650>
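
As an aside, pandas can build the same hour-by-weekday grid without the explicit iterrows loop (a sketch, assuming a pandas recent enough to have the index/columns keywords on pivot_table):

In [ ]:
# equivalent to the loops above: sum 'registered' rides into an
# hour (rows) x weekday (columns) grid, then plot it the same way
weekly_df = hourly.pivot_table(values='registered', index='hr',
                               columns='weekday', aggfunc='sum')
plt.imshow(np.sqrt(weekly_df.values / weekly_df.values.max()),
           interpolation='nearest', aspect=0.5)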

It's quite clear that registered and casual users are very different

I already knew this from analyzing the daily data, but this shows that registered riders tend to commute, while casual riders joyride on the weekends. And if you look closely, you can see that registered users tend to head out of work a bit earlier on Fridays. To get the best model, we'll need to include weekday (which we may want to run through a one-hot encoder first).

Get ready for ML

I want to use the same training/test cut that Kaggle does, so I need to put the day of the month into a column and then drop the unnecessary columns.


In [22]:
hourly['day'] = pd.DatetimeIndex(hourly.dteday).day  # day of the month
hourly = hourly.drop(['instant','dteday','casual','registered'], axis=1)

In [83]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# quick LabelEncoder pass over weekday (exploratory; not used below)
le = LabelEncoder()
weekday_labels = le.fit_transform(hourly['weekday'])

hourly[['weekday','day']]


Out[83]:
weekday day
0 6 1
1 6 1
2 6 1
3 6 1
4 6 1
5 6 1
6 6 1
7 6 1
8 6 1
9 6 1
10 6 1
11 6 1
12 6 1
13 6 1
14 6 1
15 6 1
16 6 1
17 6 1
18 6 1
19 6 1
20 6 1
21 6 1
22 6 1
23 6 1
24 0 2
25 0 2
26 0 2
27 0 2
28 0 2
29 0 2
... ... ...
17349 0 30
17350 0 30
17351 0 30
17352 0 30
17353 0 30
17354 0 30
17355 1 31
17356 1 31
17357 1 31
17358 1 31
17359 1 31
17360 1 31
17361 1 31
17362 1 31
17363 1 31
17364 1 31
17365 1 31
17366 1 31
17367 1 31
17368 1 31
17369 1 31
17370 1 31
17371 1 31
17372 1 31
17373 1 31
17374 1 31
17375 1 31
17376 1 31
17377 1 31
17378 1 31

17379 rows × 2 columns


In [92]:
# try one-hot encoding on the weekday column: each of the seven days
# becomes its own binary feature
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit(hourly[['weekday']])
weekday_onehot = enc.transform(hourly[['weekday']])
print weekday_onehot[0:100].toarray()


[[ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  1.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.  0.]]
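
These columns aren't actually used in the models below, but if we wanted to feed them to an estimator, one way (a sketch; weekday_onehot is the sparse matrix from the cell above) is to drop the raw 'weekday' column and stack the seven binary columns onto the design matrix:

In [ ]:
# splice the one-hot weekday columns into a dense feature matrix
features = hourly.drop(['cnt', 'weekday'], axis=1).values
X_onehot = np.hstack([features, weekday_onehot.toarray()])
print X_onehot.shape  # seven binary columns replace the single weekday column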

In [23]:
Xtrain = hourly[hourly.day < 19].drop('cnt',axis=1).values #the data for the training set
ytrain = (hourly[hourly.day < 19])['cnt'].values #the target of the training set

Xtest = hourly[hourly.day >= 19].drop('cnt',axis=1).values #the data for the test set
ytest = (hourly[hourly.day >= 19])['cnt'].values #the target of the test set

In [24]:
print ytrain.shape
print Xtrain.shape
print ytest.shape
print Xtest.shape
print Xtest[0]


(10312,)
(10312, 13)
(7067,)
(7067, 13)
[  1.       0.       1.       0.       0.       3.       1.       2.       0.22
   0.2727   0.93     0.      19.    ]

In [25]:
from sklearn import cross_validation
from sklearn.ensemble import RandomForestRegressor
from sklearn.grid_search import GridSearchCV

In [26]:
cv = cross_validation.ShuffleSplit(len(Xtrain), n_iter=3, test_size=0.2,
                                   random_state=0)

for train, test in cv:
    reg = RandomForestRegressor(n_estimators=500).fit(Xtrain[train], ytrain[train])
    print reg.score(Xtrain[train], ytrain[train]), reg.score(Xtrain[test], ytrain[test])
# reg is now the model from the last fold; score it on the held-out days
print reg.score(Xtest, ytest)


0.99328116142 0.953972144668
0.993364056771 0.947201659262
0.993132688776 0.949909553821
0.876479656556

In [28]:
estimators = [10,100,500]
grid = GridSearchCV(estimator=reg, param_grid=dict(n_estimators=estimators), n_jobs=-1)
grid.fit(Xtrain,ytrain)        

print grid.best_score_                                  

print grid.best_estimator_.n_estimators


0.698622292755
500

In [29]:
print grid.grid_scores_


[mean: 0.68889, std: 0.13296, params: {'n_estimators': 10}, mean: 0.69643, std: 0.12743, params: {'n_estimators': 100}, mean: 0.69862, std: 0.12848, params: {'n_estimators': 500}]

RandomForestRegressor does a much better job than the other estimators. I'm not seeing much difference when I change n_estimators, and the mean score is much lower than I'd expect (i.e., it doesn't match the CV scores above). I wonder if I did this right...
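
A likely culprit (an assumption, not verified here): with no cv argument, this version of GridSearchCV defaults to 3-fold cross-validation without shuffling, so each fold is a contiguous block of the time-ordered rows. Passing the same ShuffleSplit used above should make the grid-search scores comparable to the CV scores. A sketch:

In [ ]:
# hand GridSearchCV shuffled folds instead of its default contiguous ones
cv = cross_validation.ShuffleSplit(len(Xtrain), n_iter=3, test_size=0.2,
                                   random_state=0)
grid = GridSearchCV(estimator=reg, param_grid=dict(n_estimators=estimators),
                    cv=cv, n_jobs=-1)
grid.fit(Xtrain, ytrain)
print grid.best_score_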


In [30]:
pred = reg.predict(Xtest) #put the predicted values into an array
hrInd=3 #the column number of the hr column
weekdayInd = 5 
weeklyPredict = np.zeros((24,7))
weeklyActual = np.zeros((24,7))

for i in range(0,len(ytest)):
    weeklyPredict[Xtest[i,hrInd], Xtest[i,weekdayInd]] += pred[i]
    weeklyActual[Xtest[i,hrInd], Xtest[i,weekdayInd]] += ytest[i]

def makeDifferencePlot(weeklyPredict, weeklyActual):
    plt.figure(1, figsize=(12,6))
    plt.subplot(141)  # predicted counts
    plt.imshow(np.sqrt(weeklyPredict), interpolation='nearest', aspect=0.5)

    plt.subplot(142)  # actual counts
    plt.imshow(np.sqrt(weeklyActual), interpolation='nearest', aspect=0.5)

    plt.subplot(143)  # difference (predicted - actual)
    plt.imshow((weeklyPredict-weeklyActual), interpolation='nearest', aspect=0.5)

    plt.subplot(144)  # ratio (predicted / actual)
    plt.imshow((weeklyPredict/weeklyActual), interpolation='nearest', aspect=0.5)

    plt.show()

makeDifferencePlot(weeklyPredict, weeklyActual)


This looks reasonable. Obviously not perfect, and the difference plot has some structure. The ratio is reasonably flat, though. This is also for the entire test set - there may be more pronounced differences under certain conditions (i.e., weather, temperature, season, etc.).

Take a look at the same plots when the weather is good and the temperature is high.


In [31]:
weeklyPredict = np.zeros((24,7))
weeklyActual = np.zeros((24,7))
weatherInd=7
atempInd = 9
for i in range(0,len(ytest)):
    if (Xtest[i, weatherInd] < 3 and Xtest[i,atempInd] > .6):  # good weather, warm
        weeklyPredict[Xtest[i,hrInd], Xtest[i,weekdayInd]] += pred[i]
        weeklyActual[Xtest[i,hrInd], Xtest[i,weekdayInd]] += ytest[i]

makeDifferencePlot(weeklyPredict, weeklyActual)



In [32]:
weeklyPredict = np.zeros((24,7))
weeklyActual = np.zeros((24,7))
weatherInd=7
atempInd = 9
for i in range(0,len(ytest)):
    if (Xtest[i, weatherInd] > 2 and Xtest[i,atempInd] < .6):  # bad weather, cool
        weeklyPredict[Xtest[i,hrInd], Xtest[i,weekdayInd]] += pred[i]
        weeklyActual[Xtest[i,hrInd], Xtest[i,weekdayInd]] += ytest[i]

makeDifferencePlot(weeklyPredict, weeklyActual)


The model is way overpredicting Saturday rides in bad weather.

Now plot the learning curve

This method is lifted from the scikit-learn docs.


In [34]:
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see
        sklearn.cross_validation module for the list of possible objects

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

In [38]:
from sklearn.learning_curve import learning_curve
title = "Learning Curves (Random forest)"
# Cross validation with 30 iterations to get smoother mean test and train
# score curves, each time with 30% of the data randomly selected as a validation set.
cv = cross_validation.ShuffleSplit(len(Xtrain), n_iter=30,
                                   test_size=0.3, random_state=0)  
reg = RandomForestRegressor(n_estimators = 100)
plot_learning_curve(reg, title, Xtrain, ytrain, ylim=(0.6, 1.01), cv=cv, n_jobs=1)


Out[38]:
<module 'matplotlib.pyplot' from '/usr/local/lib/python2.7/site-packages/matplotlib/pyplot.pyc'>

In [37]:
plt.show()

This plot can be interpreted as needing more training data, but it can also be interpreted as needing fewer features (or more regularization). That is to say, the model does very well on the training set but noticeably worse on the cross-validation set, so it's overfitting / has high variance.
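
One quick way to probe the high-variance hypothesis (a sketch; the max_depth and min_samples_leaf values here are arbitrary starting points, not tuned): constrain the trees and see whether the gap between the two curves closes.

In [ ]:
# a deliberately constrained forest, trading variance for bias
reg_constrained = RandomForestRegressor(n_estimators=100, max_depth=10,
                                        min_samples_leaf=5)
plot_learning_curve(reg_constrained, "Learning Curves (constrained forest)",
                    Xtrain, ytrain, ylim=(0.6, 1.01), cv=cv, n_jobs=1)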


In [ ]: