In [13]:

import pandas as pd
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import make_scorer





In [4]:




Out[4]:

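| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0 | 3 | 13 | 16 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0 | 8 | 32 | 40 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0 | 5 | 27 | 32 |
| 3 | 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0 | 3 | 10 | 13 |
| 4 | 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0 | 0 | 1 | 1 |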


1. datetime - hourly date + timestamp
2. season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
3. holiday - whether the day is considered a holiday
4. workingday - whether the day is neither a weekend nor holiday
5. weather - weather category:
   - 1: Clear, Few clouds, Partly cloudy
   - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
   - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
   - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
6. temp - temperature in Celsius
7. atemp - "feels like" temperature in Celsius
8. humidity - relative humidity
9. windspeed - wind speed
10. casual - number of non-registered user rentals initiated
11. registered - number of registered user rentals initiated
12. count - number of total rentals
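The field list implies that `count` is the sum of `casual` and `registered`. A minimal sketch to check this, using only the five rows shown in `Out[4]` (the name `train_head` is hypothetical and stands in for the full training frame):

```python
import pandas as pd

# The first five rows, copied by hand from the Out[4] output above.
# `train_head` is a hypothetical name used only for this check.
train_head = pd.DataFrame({
    "datetime": [
        "2011-01-01 00:00:00",
        "2011-01-01 01:00:00",
        "2011-01-01 02:00:00",
        "2011-01-01 03:00:00",
        "2011-01-01 04:00:00",
    ],
    "casual": [3, 8, 5, 3, 0],
    "registered": [13, 32, 27, 10, 1],
    "count": [16, 40, 32, 13, 1],
})

# Parse the timestamp strings so the hour can be extracted.
train_head["datetime"] = pd.to_datetime(train_head["datetime"])
hours = train_head["datetime"].dt.hour.tolist()

# `count` must be accessed with brackets: `train_head.count` is a DataFrame method.
totals_match = (
    train_head["casual"] + train_head["registered"] == train_head["count"]
).all()

print(hours, totals_match)
```

Because `casual` and `registered` add up to the target, using either one as a feature would leak the answer into the model.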

# Build a model



In [7]:

model = ExtraTreesRegressor()



## Train

Prepare two things: `X`, the feature matrix, and `y`, the target vector.



In [8]:

X = train[ ['season', 'holiday', 'workingday', 'weather', 'temp', 'atemp', 'humidity', 'windspeed'] ].values
y = train['count'].values

model.fit(X, y)




Out[8]:

ExtraTreesRegressor(bootstrap=False, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
verbose=0, warm_start=False)



# Evaluate a model

$$\sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 }$$

where

- $n$ is the number of hours in the test set
- $p_i$ is the predicted count
- $a_i$ is the actual count
- $\log(x)$ is the natural logarithm
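To make the formula concrete, here is a small worked sketch. The actual counts come from the first rows of the table above; the predictions are made up for illustration (one exact, one roughly half the actual value, one roughly double).

```python
import numpy as np

# Actual counts taken from the first three rows of the training data.
a = np.array([16, 40, 32])
# Hypothetical predictions: exact, about half, about double.
p = np.array([16, 20, 64])

# Per-row log difference, exactly as in the formula.
log_diff = np.log(p + 1) - np.log(a + 1)

# Square, average over the n rows, then take the square root.
score = np.sqrt(np.mean(log_diff ** 2))

# np.log1p(x) computes log(1 + x), so this is the same quantity.
score_log1p = np.sqrt(np.mean((np.log1p(p) - np.log1p(a)) ** 2))

print(log_diff, score)
```

The second and third rows contribute errors of similar magnitude with opposite signs, which shows that the metric measures relative error rather than absolute error: being off by a factor of two costs about the same whether the true count is small or large.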



In [10]:

def rmsle(y_true, y_pred):
    # Difference of log(x + 1) between predicted and actual counts
    diff = np.log(y_pred + 1) - np.log(y_true + 1)
    mean_error = np.square(diff).mean()
    return np.sqrt(mean_error)

scorer = make_scorer(rmsle, greater_is_better=False)
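The `scorer` object is not used anywhere else in this notebook. One way it could be used is with `cross_val_score`, which scores the model on held-out folds. The sketch below is an assumption about usage, not part of the original solution: it runs on synthetic data (`X_demo`, `y_demo` are made-up names and values), and imports `cross_val_score` from `sklearn.model_selection`, which requires a reasonably recent scikit-learn.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score


def rmsle(y_true, y_pred):
    diff = np.log(y_pred + 1) - np.log(y_true + 1)
    return np.sqrt(np.square(diff).mean())


# greater_is_better=False tells scikit-learn this is a loss,
# so the scorer returns the negated RMSLE.
scorer = make_scorer(rmsle, greater_is_better=False)

# Synthetic, non-negative count data standing in for the real features.
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = rng.poisson(lam=20 * X_demo[:, 0] + 1)

model_demo = ExtraTreesRegressor(n_estimators=10, random_state=0)

# Each of the 5 scores is computed on a fold the model was not fitted on.
scores = cross_val_score(model_demo, X_demo, y_demo, cv=5, scoring=scorer)

# Flip the sign back to read the result as an ordinary RMSLE.
mean_rmsle = -scores.mean()
print(mean_rmsle)
```

Note the sign convention: because of `greater_is_better=False`, every value in `scores` is zero or negative, and a value closer to zero means a better model.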




In [14]:

y_pred = model.predict(X)
rmsle(y, y_pred)




Out[14]:

0.56071612274478655



## That's all :)

To be honest, this solution is very simple, but it has some big problems... do you see them?


