In [13]:
import pandas as pd
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import make_scorer

Read data


In [4]:
train = pd.read_csv('train.csv')
train.head()


Out[4]:
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0 3 13 16
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0 8 32 40
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0 5 27 32
3 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0 3 10 13
4 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0 0 1 1
  1. datetime - hourly date + timestamp (see the parsing sketch after this list)
  2. season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
  3. holiday - whether the day is considered a holiday
  4. workingday - whether the day is neither a weekend nor holiday
  5. weather -
     1: Clear, Few clouds, Partly cloudy
     2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
     3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
     4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
  6. temp - temperature in Celsius
  7. atemp - "feels like" temperature in Celsius
  8. humidity - relative humidity
  9. windspeed - wind speed
  10. casual - number of non-registered user rentals initiated
  11. registered - number of registered user rentals initiated
  12. count - number of total rentals
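
The datetime column is the only field above that pandas does not parse automatically; read_csv leaves it as a plain string. A minimal, purely illustrative sketch of turning it into real timestamps (not used by the model below):

# Illustrative only: parse the string column into pandas timestamps
parsed = pd.to_datetime(train['datetime'])
print(parsed.dt.hour.head())  # the hourly component described in item 1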

Build a model


In [7]:
# Instantiate an extra-trees regressor; all hyperparameters are left at their defaults
model = ExtraTreesRegressor()

Prepare two things: X, the feature matrix, and y, the target vector.


In [8]:
# Feature matrix: the numeric columns available at prediction time
X = train[['season', 'holiday', 'workingday', 'weather', 'temp', 'atemp', 'humidity', 'windspeed']].values
# Target vector: total number of rentals in each hour
y = train['count'].values

model.fit(X, y)


Out[8]:
ExtraTreesRegressor(bootstrap=False, criterion='mse', max_depth=None,
          max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
          min_samples_split=2, min_weight_fraction_leaf=0.0,
          n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
          verbose=0, warm_start=False)
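
The repr above shows that every hyperparameter is at its default for this scikit-learn version (only n_estimators=10 trees, no fixed random_state). A minimal sketch of setting them explicitly; the values below are arbitrary placeholders, not tuned ones:

# Hypothetical settings, not used in this notebook: more trees and a fixed
# seed make the forest stronger and the results reproducible.
tuned_model = ExtraTreesRegressor(n_estimators=100, random_state=0)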

Evaluate the model

The evaluation metric is the Root Mean Squared Logarithmic Error (RMSLE):

$$ \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 }$$

where
$n$ is the number of hours in the test set
$p_i$ is your predicted count
$a_i$ is the actual count
$\log(x)$ is the natural logarithm


In [10]:
def rmsle(y_true, y_pred):
    # Root Mean Squared Logarithmic Error, as defined above
    diff = np.log(y_pred + 1) - np.log(y_true + 1)
    mean_error = np.square(diff).mean()
    return np.sqrt(mean_error)

# Wrap it as a scikit-learn scorer; RMSLE is an error, so lower is better
scorer = make_scorer(rmsle, greater_is_better=False)
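
The scorer is created here but not used anywhere else in this notebook. As a sketch of how it could plug into scikit-learn's cross-validation (the cv=5 choice is arbitrary; import from sklearn.cross_validation on releases older than 0.18):

from sklearn.model_selection import cross_val_score

# make_scorer with greater_is_better=False negates the metric, so
# cross_val_score returns -RMSLE; flip the sign to read it as an error.
scores = cross_val_score(model, X, y, scoring=scorer, cv=5)
print(-scores.mean())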

In [14]:
y_pred = model.predict(X)
rmsle(y, y_pred)


Out[14]:
0.56071612274478655

That's all :)

To be honest, this solution is very simple, but it contains some big problems... do you see them?

