Bicycle-stations Availability Prediction

Use a supervised machine learning algorithm to predict the availability for each bicyle-sharing stations in Lyon (France) based on the history data.

I use the tree method XGBoost to predict a "probability" of bikes availability for each station. A number close to 1. means that you have several available bikes. A number close to 0. means you don't have many bikes.

Some Imports


In [1]:
%matplotlib inline

In [2]:
import numpy as np
import pandas as pd
from sklearn.externals import joblib
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import TimeSeriesSplit, train_test_split

In [3]:
import xgboost as xgb
from xgboost import plot_tree


/Users/babou/anaconda/envs/python3DS/lib/python3.5/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

In [4]:
import matplotlib as mpl
from matplotlib import pyplot as plt
import seaborn as sns

In [5]:
import folium

In [6]:
%load_ext watermark

In [7]:
%watermark -d -v -p numpy,pandas,xgboost,matplotlib,folium -g -m -w


2018-05-30 

CPython 3.5.1
IPython 6.1.0

numpy 1.13.1
pandas 0.20.3
xgboost 0.6
matplotlib 2.0.2
folium 0.5.0

compiler   : GCC 4.2.1 (Apple Inc. build 5577)
system     : Darwin
release    : 17.5.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit
Git hash   : c480226b781fdeeb879124b13a3f6b68c9cd5642
watermark 1.5.0

The module prediction.py contains some functions dedicated to the bicyle-sharing stations predictions.


In [8]:
from prediction import (datareader, complete_data, cleanup, bikes_probability,
                        time_resampling, prepare_data_for_training, fit, prediction,
                       get_weather)
np.random.seed(42)

Read and prepare the data


In [9]:
DATAFILE = './data/lyon.csv'

In [11]:
raw = datareader(DATAFILE)


2018-05-30 11:20:25,678 [85498] INFO     prediction: read the file './data/lyon.csv'

In [12]:
raw[raw['number'] == 1005].tail()


Out[12]:
number last_update bike_stands available_bike_stands available_bikes availabilitycode availability bonus status
7522659 1005 2017-09-26 14:31:50 10 10 0 3 Orange Non OPEN
7523006 1005 2017-09-26 14:41:54 10 10 0 3 Orange Non OPEN
7523353 1005 2017-09-26 14:41:54 10 10 0 3 Orange Non OPEN
7523700 1005 2017-09-26 14:51:57 10 10 0 3 Orange Non OPEN
7524047 1005 2017-09-26 14:55:17 10 10 0 3 Orange Non OPEN

Min and max dates of the timeseries


In [13]:
print(raw.last_update.min())
print(raw.last_update.max())


2017-07-08 00:30:12
2017-09-26 14:58:46

In [14]:
raw[(raw.number == 1001) &(raw.last_update >= "2017-07-26 09:00:00")].head(10)


Out[14]:
number last_update bike_stands available_bike_stands available_bikes availabilitycode availability bonus status
1613382 1001 2017-07-26 09:04:23 16 13 3 1 Vert Non OPEN
1613730 1001 2017-07-26 09:05:33 16 14 2 1 Vert Non OPEN
1614078 1001 2017-07-26 09:05:33 16 14 2 1 Vert Non OPEN
1614426 1001 2017-07-26 09:18:02 16 16 0 3 Orange Non OPEN
1614774 1001 2017-07-26 09:18:02 16 16 0 3 Orange Non OPEN
1615122 1001 2017-07-26 09:28:07 16 16 0 3 Orange Non OPEN
1615470 1001 2017-07-26 09:29:43 16 15 1 1 Vert Non OPEN
1615818 1001 2017-07-26 09:36:35 16 14 2 1 Vert Non OPEN
1616166 1001 2017-07-26 09:39:16 16 14 2 1 Vert Non OPEN
1616514 1001 2017-07-26 09:49:21 16 15 1 1 Vert Non OPEN

In [15]:
df_clean = cleanup(raw)


2018-05-30 11:20:45,209 [85498] INFO     prediction: cleanup processing

Pipe some processing data functions :

  • time resampling to get data every 10 minutes (i.e. 10T)
  • add some columns : day of the week, hour of the day, etc.
  • compute the bikes (number) "probability". It's just a normalization for each station with num_avail_bikes / total

In [16]:
df_clean.head()


Out[16]:
station ts stands bikes bonus status
0 10063 2017-07-08 23:49:09 10 23 Non OPEN
2 8038 2017-07-08 23:49:26 6 14 Non OPEN
3 7045 2017-07-08 23:52:43 13 7 Non OPEN
4 6045 2017-07-08 23:52:14 7 8 Non OPEN
5 7062 2017-07-08 23:57:27 5 11 Non OPEN

In [17]:
df = (df_clean.pipe(time_resampling)
      .pipe(complete_data)
      .pipe(bikes_probability))


2018-05-30 11:20:51,046 [85498] INFO     prediction: Time resampling for each station by '10T'
2018-05-30 11:20:59,356 [85498] INFO     prediction: complete some data
2018-05-30 11:22:00,897 [85498] INFO     prediction: bikes probability

In [18]:
import gc
del df_clean
del raw
gc.collect()


Out[18]:
61

In [19]:
df.head()


Out[19]:
station ts bikes stands is_open day hour minute probability
0 1001 2017-07-09 00:00:00 15.0 1.0 1.0 6 0 0 0.93750
1 1001 2017-07-09 00:10:00 15.0 1.0 1.0 6 0 10 0.93750
2 1001 2017-07-09 00:20:00 14.5 1.5 1.0 6 0 20 0.90625
3 1001 2017-07-09 00:30:00 14.5 1.5 1.0 6 0 30 0.90625
4 1001 2017-07-09 00:40:00 11.5 4.5 1.0 6 0 40 0.71875

In [20]:
df.shape


Out[20]:
(3880901, 9)

In [21]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3880901 entries, 0 to 3880900
Data columns (total 9 columns):
station        int64
ts             datetime64[ns]
bikes          float64
stands         float64
is_open        float64
day            int64
hour           int64
minute         int64
probability    float64
dtypes: datetime64[ns](1), float64(4), int64(4)
memory usage: 266.5 MB

WIP


In [ ]:
from numpy import fft
def fourierExtrapolation(x, segments, harmo):
    n = x.size
    n_harm = round((n * harmo / 100) -0.5) # number of harmonics in model
    print(n_harm)
    t = np.arange(0, n)
    p = np.polyfit(t, x, 2)         # find linear trend in x
    x_notrend = x - p[0] * t        # detrended x
    x_freqdom = fft.fft(x_notrend)  # detrended x in frequency domain
    f = fft.fftfreq(n)              # frequencies
    indexes = list(range(n))
    # sort indexes by frequency, lower -> higher
    indexes.sort(key = lambda i: np.absolute(f[i]))
 
    t = np.arange(0, n + segments)
    restored_sig = np.zeros(t.size)
    for i in indexes[:1 + n_harm * 2]:
        ampli = np.absolute(x_freqdom[i]) / n   # amplitude
        phase = np.angle(x_freqdom[i])          # phase
        restored_sig += ampli * np.cos(2 * np.pi * f[i] * t + phase)
    return restored_sig + p[0] * t

In [ ]:
def plot_fourrier(data,station, segments=5, harmo=10, return_df=False):
    
    lags = 550
    my_wine = data.loc[(data.station == station)].copy()
    
    X1 = my_wine[:-segments].tail(lags).copy()
    X2 = my_wine[-segments:].tail(lags).copy()
    
    extrapolation = fourierExtrapolation(X1.probability, segments, harmo)
    
    X1['fourrier'] = extrapolation[:-segments]
    X2['fourrier'] = extrapolation[-segments:]
    
    #plot
    fig, (axis1) = plt.subplots(1 ,figsize=(15,5))
    # Axe 1
    axis1.plot(X1.ts, X1.probability, '-', label='Historical')
    axis1.plot(X2.ts, X2.probability, '-', label='Reality')
    axis1.plot(X1.ts, X1.fourrier, '-', label='fourrier Train')
    axis1.plot(X2.ts, X2.fourrier, '-', label='fourrier extrapolation')
    
    plt.legend(loc='best')
    #plt.title(title)
    
    if return_df is True:
        return my_wine

In [ ]:
lol = plot_fourrier(df, station=12001, segments=12, harmo=10, return_df=True)

In [ ]:
ok = fourierExtrapolation(lol.probability, segments=10, harmo=5)
len(ok)

In [195]:
lol['ff'] = ok[0:len(lol)]

In [199]:
lag = 1000
plt.plot(lol.ts.tail(lag), lol.probability.tail(lag))
plt.plot(lol.ts.tail(lag), lol.ff.tail(lag))


Out[199]:
[<matplotlib.lines.Line2D at 0x13162ecc0>]

In [179]:
ok


Out[179]:
array([ 0.32282496,  0.32820109,  0.33357722,  0.33895335,  0.34432948,
        0.34970561,  0.35508174,  0.36045787,  0.365834  ,  0.37121013,
        0.37658626,  0.38196239,  0.38733852,  0.39271465,  0.39809078,
        0.40346691,  0.40884304,  0.41421917,  0.4195953 ,  0.42497142])

In [176]:
ok


Out[176]:
array([ 0.77140794,  0.8524127 ,  0.86706836, ...,  0.26172551,
        0.16631123,  0.11497797])

In [171]:
lol['ff'] = ok


Out[171]:
11467

In [150]:
len(ok)


Out[150]:
11460

In [151]:
len(lol)


Out[151]:
11450

In [144]:
fft.fft(lol.probability)[0:2]


Out[144]:
array([ 5072.94970184 +0.j        ,  -331.07366223-82.17605142j])

In [137]:
lol.tail()


Out[137]:
station ts bikes stands is_open day hour minute probability date fft
22928 1002 2017-09-26 14:10:00 5.0 17.0 1.0 1 14 10 0.227273 2017-09-26 (-171.788087592+87.1589375286j)
22929 1002 2017-09-26 14:20:00 2.0 19.5 1.0 1 14 20 0.093023 2017-09-26 (-72.7384019554-121.274762818j)
22930 1002 2017-09-26 14:30:00 2.0 20.0 1.0 1 14 30 0.090909 2017-09-26 (-210.787479074+182.78960152j)
22931 1002 2017-09-26 14:40:00 3.0 19.0 1.0 1 14 40 0.136364 2017-09-26 (-5.05692277419+0.000912249385151j)
22932 1002 2017-09-26 14:50:00 6.0 16.0 1.0 1 14 50 0.272727 2017-09-26 (-331.073662233+82.1760514231j)

In [136]:
plt.plot(lol[120:].ts, lol[120:].probability)
plt.plot(lol[120:].ts, lol[120:].fft)


2017-11-10 14:37:20,462 [14571] WARNING  py.warnings: /Users/babou/anaconda/envs/python3DS/lib/python3.5/site-packages/numpy/core/numeric.py:531: ComplexWarning: Casting complex values to real discards the imaginary part
  return array(a, dtype, copy=False, order=order)

Out[136]:
[<matplotlib.lines.Line2D at 0x12fa9d898>]

In [79]:
data = df[df.station == 1001]

In [80]:
data.head()


Out[80]:
station ts bikes stands is_open day hour minute probability date
0 1001 2017-07-09 00:00:00 15.0 1.0 1.0 6 0 0 0.93750 2017-07-09
1 1001 2017-07-09 00:10:00 15.0 1.0 1.0 6 0 10 0.93750 2017-07-09
2 1001 2017-07-09 00:20:00 14.5 1.5 1.0 6 0 20 0.90625 2017-07-09
3 1001 2017-07-09 00:30:00 14.5 1.5 1.0 6 0 30 0.90625 2017-07-09
4 1001 2017-07-09 00:40:00 11.5 4.5 1.0 6 0 40 0.71875 2017-07-09

In [81]:
lol = fft.rfft(data.probability)

In [82]:
data.shape


Out[82]:
(11466, 10)

In [83]:
x = data.probability

In [86]:
x[0:5]


Out[86]:
0    0.93750
1    0.93750
2    0.90625
3    0.90625
4    0.71875
Name: probability, dtype: float64

In [96]:
n = x.size
n_harm = 3                     # number of harmonics in model
t = np.arange(0, n)
p = np.polyfit(t, x, 2)         # find linear trend in x
x_notrend = x - p[0] * t        # detrended x
x_freqdom = fft.fft(x_notrend)  # detrended x in frequency domain
f = fft.fftfreq(n)              # frequencies
indexes = list(range(n))

In [95]:
range(2005 - 3, 2005 + 4, 1)


Out[95]:
range(2002, 2009)

This is the final dataset. For further prediction, I could add some weather forecasts data to these features.

Prediction

Let's select a time window (start, stop) to a single prediction.

  • Two weeks of data for training
  • Predict the 'probability' for the next 30 minutes and 1 hour for every station.
  • Split the dataset into train and test dataset
  • Compute the RMSE (Root Mean Squared Error) between the observations (i.e. probability) and the predicted values

In [21]:
#start = pd.Timestamp("2017-08-01T00:10:00") # Tuesday OLD
start = pd.Timestamp("2017-08-01T02:00:00") # Tuesday
predict_date = pd.Timestamp("2017-09-22T09:00:00") # wednesday
# predict the next 30 minutes
freq = '1H'
# number of predictions at 'predict_date'.
# Here, the next 30 minutes and the next hour (30 minutes + 30 minutes).
# If you want to predict the next 3 hours, every 30 minutes, thus set periods=6
periods = 1

In [22]:
prepare_data_for_training??

In [22]:
#train_X, train_Y, test_X, test_Y  
#X, y 
#train_X, train_Y, test_X, test_Y  
train_X, train_Y, test_X, test_Y  = prepare_data_for_training(df,
                                 predict_date,
                                 freq=freq,
                                 start=start,
                                 periods=periods,
                                 observation='probability',
                                # how=None
                                )


2018-02-13 17:28:18,082 [31214] INFO     prediction: prepare data for training
2018-02-13 17:28:18,083 [31214] INFO     prediction: New version 9
2018-02-13 17:28:18,084 [31214] INFO     prediction: Get summer holiday features
2018-02-13 17:28:30,058 [31214] INFO     prediction: Get public holiday features
2018-02-13 17:28:40,472 [31214] INFO     prediction: compute delta with  public holiday on '5' days
2018-02-13 17:28:41,386 [31214] INFO     prediction: Get cluster station features
2018-02-13 17:28:41,694 [31214] INFO     prediction: Get Geo cluster station features
2018-02-13 17:28:41,978 [31214] INFO     prediction: Get ratio station open by time
2018-02-13 17:28:42,846 [31214] INFO     prediction: Get ratio station geo cluster open by time
2018-02-13 17:28:44,104 [31214] INFO     prediction: Create Bin hours of the day
2018-02-13 17:28:45,389 [31214] INFO     prediction: sort values (station, ts)
2018-02-13 17:28:46,417 [31214] INFO     prediction: compute the future availability at '1H'
2018-02-13 17:28:46,595 [31214] INFO     prediction: merge data with the future availability
2018-02-13 17:29:17,789 [31214] INFO     prediction: availability label as values
2018-02-13 17:29:18,084 [31214] INFO     prediction: Create shift features
2018-02-13 17:29:19,465 [31214] INFO     prediction: Create cumulative trending features
2018-02-13 17:29:22,307 [31214] INFO     prediction: Create recenlty open station indicator
2018-02-13 17:29:24,316 [31214] INFO     prediction: Create ratio bike filling on geo cluster station on time
2018-02-13 17:29:25,837 [31214] INFO     prediction: Create mean transformation
2018-02-13 17:29:27,994 [31214] INFO     prediction: Create std transformation
2018-02-13 17:29:30,209 [31214] INFO     prediction: Create median transformation
2018-02-13 17:29:33,106 [31214] INFO     prediction: Create interaction features with 'mean_6' and 'median_6' 
2018-02-13 17:29:33,156 [31214] INFO     prediction: create bool empty full station 
2018-02-13 17:29:33,440 [31214] INFO     prediction: Create sum transformation on empty full station
2018-02-13 17:29:35,820 [31214] INFO     prediction: cut date 2017-09-22 08:00:00
2018-02-13 17:29:35,822 [31214] INFO     prediction: stop date 2017-09-22 10:00:00
2018-02-13 17:29:36,087 [31214] INFO     prediction: split train and test according to a prediction date
2018-02-13 17:29:36,192 [31214] INFO     prediction: Create mean by station / day / hours_binned

Rolling Validation


In [ ]:
def fit_rolling(train_X, train_Y, test_X, test_Y, param, num_round=25):
    """Train the xgboost model

    Return the booster trained model
    """
    # param = {'objective': 'reg:linear'}

    #if param
    #param = {'objective': 'reg:logistic'}
    #param['eta'] = 0.2
    #param['max_depth'] = 4 # 6 original
    #param['silent'] = 1
    #param['nthread'] = 4
    # used num_class only for classification (e.g. a level of availability)
    # param = {'objective': 'multi:softmax'}
    # param['num_class'] = train_Y.nunique()
    xg_train = xgb.DMatrix(train_X, label=train_Y)
    xg_test = xgb.DMatrix(test_X, label=test_Y)
    watchlist = [(xg_train, 'train'), (xg_test, 'test')]
    evals_result = {}
    bst = xgb.train(param, xg_train, num_round, watchlist, evals_result=evals_result)
    return bst, evals_result

In [ ]:


In [3]:
result = [] # All macro result by fold
learning_progress = {} # micro result by fold
pred_serie = np.full(X.shape[0], np.nan) # Store pred during the rolling windows
fold_serie = np.full(X.shape[0], np.nan) # Store fold during the rolling windows

tscv = TimeSeriesSplit(n_splits=3)
for (train_index, test_index), k in zip (tscv.split(X), range(1, tscv.n_splits+1, 1)):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    print("Train : ")
    print(X_train.index.min())
    print(X_train.index.max())
    print(len(X_train))
    print("Test  : ")
    print(X_test.index.min())
    print(X_test.index.max())
    print(len(X_test))
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    # Split tranning for XGB
    X_tr, X_te, y_tr, y_te = train_test_split(X_train, y_train,
                                              test_size=0.33, random_state=42)
    
    # Adding Meteo features
    # Learning for Xgb
    X_tr = get_weather(X_tr, how='learning')
    X_te = get_weather(X_te, how='forecast', freq=freq)
    
    X_test = get_weather(X_test, how='forecast', freq=freq)
    
    # Learning XGB$
    # the 'booster'
    param ={"objective": "reg:logistic",
              "booster" : "gbtree",
              "eta": 0.2,
              "max_depth": 9,
              "silent": 1,
              "seed": 42}
    bst, evals_result = fit_rolling(X_tr, y_tr, X_te, y_te, param, num_round=60)
    
    pred = prediction(bst, X_test, y_test)
    rmse = np.sqrt(np.mean((pred - y_test)**2))
    
    pred_serie[test_index] = pred
    fold_serie[test_index] = k
    
    learning_progress[tscv.n_splits] = evals_result
    
    result.append({'fold' : k,
                  'train_min_date' : X_train.index.min(),
                  'train_max_date' : X_train.index.max(),
                   'len_train' : len(X_train),
                  'test_min_date' : X_test.index.min(),
                  'test_max_date' : X_test.index.max(),
                   'len_test' : len(X_test),
                  'rmse_train' : evals_result['train']['rmse'][-1],
                  'rmse_test' : evals_result['test']['rmse'][-1],
                  'rmse_validation' : rmse})
    print("---------------------------------------------")
df_result = pd.DataFrame(result)
X['pred'] = pred_serie
X['fold'] = fold_serie


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-3-b53478a470f4> in <module>()
      1 result = [] # All macro result by fold
      2 learning_progress = {} # micro result by fold
----> 3 pred_serie = np.full(X.shape[0], np.nan) # Store pred during the rolling windows
      4 fold_serie = np.full(X.shape[0], np.nan) # Store fold during the rolling windows
      5 

NameError: name 'np' is not defined

In [45]:
df_result.head()


Out[45]:
fold len_test len_train rmse_test rmse_train rmse_validation test_max_date test_min_date train_max_date train_min_date
0 1 632150 632151 0.180164 0.165386 0.244768 2017-08-26 23:30:00 2017-08-14 00:40:00 2017-08-14 00:40:00 2017-08-01 02:00:00
1 2 632150 1264301 0.193480 0.183436 0.245872 2017-09-08 22:20:00 2017-08-26 23:30:00 2017-08-26 23:30:00 2017-08-01 02:00:00
2 3 632150 1896451 0.201394 0.194458 0.246370 2017-09-21 21:00:00 2017-09-08 22:20:00 2017-09-08 22:20:00 2017-08-01 02:00:00

In [201]:
df_result.head()


Out[201]:
fold len_test len_train rmse_test rmse_train rmse_validation test_max_date test_min_date train_max_date train_min_date
0 3 637743 637746 0.092354 0.083158 0.096094 2017-08-27 05:00:00 2017-08-14 03:30:00 2017-08-14 03:30:00 2017-08-01 02:00:00
1 3 637743 1275489 0.092316 0.086886 0.105826 2017-09-09 06:30:00 2017-08-27 05:00:00 2017-08-27 05:00:00 2017-08-01 02:00:00
2 3 637743 1913232 0.095415 0.091773 0.109602 2017-09-22 08:00:00 2017-09-09 06:30:00 2017-09-09 06:30:00 2017-08-01 02:00:00

In [202]:
X['y'] = y

In [203]:
X.tail()


Out[203]:
station bikes stands is_open day hour minute is_holiday public_holiday public_holiday_count ... total_stand filling_station_by_geo_cluster paa sax paa_minus_sax paa_multi_sax paa_ratio_sax pred fold y
ts
2017-09-22 08:00:00 11001 1.0 17.0 1.0 4 8 0 0 0 0.0 ... 18.0 0.359069 0.088889 0.087707 0.001182 0.007796 1.013481 0.136377 3.0 0.222222
2017-09-22 08:00:00 11002 16.0 4.0 1.0 4 8 0 0 0 0.0 ... 20.0 0.386667 0.735000 0.736834 -0.001834 0.541573 0.997511 0.590673 3.0 0.600000
2017-09-22 08:00:00 11003 17.0 3.0 1.0 4 8 0 0 0 0.0 ... 20.0 0.359069 0.865000 0.868574 -0.003574 0.751316 0.995885 0.813607 3.0 0.800000
2017-09-22 08:00:00 12001 5.0 20.0 1.0 4 8 0 0 0 0.0 ... 25.0 0.340893 0.202000 0.200820 0.001180 0.040566 1.005878 0.363851 3.0 0.420000
2017-09-22 08:00:00 12002 2.0 18.0 1.0 4 8 0 0 0 0.0 ... 20.0 0.340893 0.055000 0.056097 -0.001097 0.003085 0.980447 0.112957 3.0 0.050000

5 rows × 28 columns


In [239]:
#import matplotlib.ticker as ticker
import matplotlib.dates as mdates

def plot_station_pred(result, station):
    """Plot available bikes and bike stands for a given station"""
    
    data = result[result.station == station].copy()
    
    nb_flod = data.fold.nunique()
    
    fig, axarr = plt.subplots(nb_flod ,figsize=(17,10))
    
    for i in range(0, nb_flod):
        fold_data = data[data.fold == i+1]
        rmse_fold = np.round(np.sqrt(np.mean((fold_data.pred - fold_data.y)**2)), 3)
        axarr[i].plot(fold_data[fold_data.is_open == 1].index, fold_data[fold_data.is_open == 1].y,
                      '-', label='Open', alpha=0.6)
        axarr[i].plot(fold_data[fold_data.is_open == 0].index, fold_data[fold_data.is_open == 0].y,
                      '-', label='Closed', alpha=0.6)
        axarr[i].plot(fold_data.index, fold_data.pred, 'r-', label='Prediction')
        axarr[i].set_title('Fold '+str(i+1)+' : RMSE '+str(rmse_fold), fontsize =25)
    
    ##ax = plt.gca()
    # set major ticks location every day
    ##ax.xaxis.set_major_locator(mdates.DayLocator())
    # set major ticks format
    ##ax.xaxis.set_major_formatter(mdates.DateFormatter('\n\n\n%a %d.%m.%Y'))
    # set minor ticks location every two hours
    ##ax.xaxis.set_minor_locator(mdates.HourLocator(interval=2))
    # set minor ticks format
    ##ax.xaxis.set_minor_formatter(mdates.DateFormatter('%H:%M'))
    ##plt.setp(ax.xaxis.get_minorticklabels(), rotation=45)

    plt.legend(loc='best')
    fig.tight_layout()
#    print("TRAIN")
#    print(train.tail(5))
#    print("DATA")
#    print(data)

In [243]:
plot_station_pred(X, station=2005)


Single Validation

The fit function create some data structure for the XGBoost from the train and test DataFrames (i.e. xgb.DMatrix), configure the model and launch it with the objective: 'reg:logistic'. It's a regression, not a classification.


In [23]:
# Spliting train in learning / validation
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(train_X, train_Y, 
                                                    test_size=0.33, random_state=42)

In [24]:
X_train.tail()


Out[24]:
station bikes stands is_open day hour minute probability is_holiday public_holiday ... filling_station_by_geo_cluster mean_6 std_9 median_6 mean_6_minus_median_6 mean_6_multi_median_6 mean_6_ratio_median_6 warning_empty_full rolling_sum_9_empty_full_station proba_mean_by_sta_day_binned_hour
ts
2017-08-03 08:10:00 3102 1.0 15.0 1.0 3 8 10 0.062500 1 0 ... 0.418919 0.135417 0.103645 0.125000 1.041667e-02 0.016927 1.083333 1 3.0 0.259223
2017-09-04 18:10:00 5002 11.0 7.0 1.0 0 18 10 0.611111 0 0 ... 0.332661 0.611111 0.027778 0.611111 1.765255e-14 0.373457 1.000000 0 0.0 0.435836
2017-09-18 08:20:00 10030 4.0 14.0 1.0 0 8 20 0.222222 0 0 ... 0.508080 0.226852 0.035258 0.222222 4.629630e-03 0.050412 1.020833 0 0.0 0.541058
2017-09-15 17:50:00 6036 2.5 19.5 1.0 4 17 50 0.113636 0 0 ... 0.454545 0.428823 0.301153 0.443182 -1.435870e-02 0.190047 0.967601 0 2.0 NaN
2017-09-15 13:00:00 1035 6.0 10.0 1.0 4 13 0 0.375000 0 0 ... 0.345997 0.359375 0.038273 0.375000 -1.562500e-02 0.134766 0.958333 0 0.0 0.171875

5 rows × 32 columns


In [25]:
# Adding Meteo features
# Learning for Xgb
X_train = get_weather(X_train, how='learning')
X_test = get_weather(X_test, how='forecast', freq=freq)

# Value in futur
test_X = get_weather(test_X, how='forecast', freq=freq)

In [26]:
# the 'booster'
param ={"objective": "reg:logistic",
          "booster" : "gbtree",
          "eta": 0.2,
          "max_depth": 9,
          "silent": 1,
          "seed": 42}
bst = fit(X_train, y_train, X_test, y_test, param, num_round=60)


2018-02-13 17:29:42,726 [31214] INFO     prediction: fit
[0]	train-rmse:0.272245	test-rmse:0.272204
[1]	train-rmse:0.232445	test-rmse:0.232539
[2]	train-rmse:0.202643	test-rmse:0.202854
[3]	train-rmse:0.180608	test-rmse:0.180925
[4]	train-rmse:0.164537	test-rmse:0.164953
[5]	train-rmse:0.152986	test-rmse:0.153485
[6]	train-rmse:0.144806	test-rmse:0.145376
[7]	train-rmse:0.139083	test-rmse:0.139721
[8]	train-rmse:0.135079	test-rmse:0.135787
[9]	train-rmse:0.132289	test-rmse:0.133059
[10]	train-rmse:0.13033	test-rmse:0.131156
[11]	train-rmse:0.128871	test-rmse:0.129752
[12]	train-rmse:0.127844	test-rmse:0.128771
[13]	train-rmse:0.127072	test-rmse:0.128054
[14]	train-rmse:0.126421	test-rmse:0.127455
[15]	train-rmse:0.125928	test-rmse:0.127013
[16]	train-rmse:0.125413	test-rmse:0.126544
[17]	train-rmse:0.125023	test-rmse:0.126207
[18]	train-rmse:0.124475	test-rmse:0.125737
[19]	train-rmse:0.124167	test-rmse:0.125485
[20]	train-rmse:0.123746	test-rmse:0.125119
[21]	train-rmse:0.123438	test-rmse:0.124873
[22]	train-rmse:0.123011	test-rmse:0.124507
[23]	train-rmse:0.122746	test-rmse:0.124301
[24]	train-rmse:0.122479	test-rmse:0.124084
[25]	train-rmse:0.122176	test-rmse:0.12385
[26]	train-rmse:0.121985	test-rmse:0.123711
[27]	train-rmse:0.121695	test-rmse:0.123498
[28]	train-rmse:0.121555	test-rmse:0.12341
[29]	train-rmse:0.121333	test-rmse:0.123241
[30]	train-rmse:0.121237	test-rmse:0.123176
[31]	train-rmse:0.120982	test-rmse:0.122973
[32]	train-rmse:0.120818	test-rmse:0.122865
[33]	train-rmse:0.120668	test-rmse:0.122774
[34]	train-rmse:0.120438	test-rmse:0.122603
[35]	train-rmse:0.120243	test-rmse:0.122488
[36]	train-rmse:0.120112	test-rmse:0.122419
[37]	train-rmse:0.119982	test-rmse:0.122364
[38]	train-rmse:0.119895	test-rmse:0.122324
[39]	train-rmse:0.119734	test-rmse:0.122219
[40]	train-rmse:0.119615	test-rmse:0.122132
[41]	train-rmse:0.119518	test-rmse:0.122077
[42]	train-rmse:0.11944	test-rmse:0.122029
[43]	train-rmse:0.119218	test-rmse:0.121898
[44]	train-rmse:0.119137	test-rmse:0.121878
[45]	train-rmse:0.119076	test-rmse:0.121839
[46]	train-rmse:0.119023	test-rmse:0.121818
[47]	train-rmse:0.11899	test-rmse:0.121802
[48]	train-rmse:0.118963	test-rmse:0.121798
[49]	train-rmse:0.118854	test-rmse:0.121717
[50]	train-rmse:0.118784	test-rmse:0.121708
[51]	train-rmse:0.118606	test-rmse:0.121598
[52]	train-rmse:0.118533	test-rmse:0.121576
[53]	train-rmse:0.118477	test-rmse:0.121571
[54]	train-rmse:0.118234	test-rmse:0.121405
[55]	train-rmse:0.118091	test-rmse:0.121325
[56]	train-rmse:0.118037	test-rmse:0.121292
[57]	train-rmse:0.117998	test-rmse:0.121282
[58]	train-rmse:0.117925	test-rmse:0.121259
[59]	train-rmse:0.117877	test-rmse:0.121241

In [27]:
# compute the prediction from test_* (Validation)
pred = prediction(bst, test_X, test_Y)

In [28]:
pred[:5]


Out[28]:
array([ 0.28615296,  0.30260265,  0.11917941,  0.24870376,  0.47009993], dtype=float32)

In [29]:
test_Y[:5].tolist()


Out[29]:
[0.3125, 0.2558139534883721, 0.13333333333333333, 0.5, 0.5227272727272727]

In [30]:
print("Number of predictions: {}".format(len(pred)))


Number of predictions: 2373

In [31]:
# Compute the RMSE
rmse = np.sqrt(np.mean((pred - test_Y)**2))
rmse


Out[31]:
0.13600357835540022

In [32]:
# New periode time

#start = pd.Timestamp("2017-08-01T00:10:00") # Tuesday
#predict_date = pd.Timestamp("2017-09-22T09:00:00") # wednesday
# predict the next 30 minutes
#freq = '1H'
# number of predictions at 'predict_date'.
# Here, the next 30 minutes and the next hour (30 minutes + 30 minutes).
# If you want to predict the next 3 hours, every 30 minutes, thus set periods=6
#periods = 1

# train-rmse:0.134421	test-rmse:0.134516 / 0.14425 : original : max_depth 6 / num_round = 25
# train-rmse:0.133858	test-rmse:0.133971 / 0.14436 : shift features
# train-rmse:0.132937	test-rmse:0.133034 / 0.14285 : Trending features
# train-rmse:0.132937	test-rmse:0.133034 / 0.14285 : Was recently open features (No improve at all, no use of features
# train-rmse:0.126300	test-rmse:0.127283 / 0.13869 : Idem max_dept 8 / num_round = 60. Stil no use of recently_open
# train-rmse:0.123439	test-rmse:0.124447 / 0.13477 : Cluster lyon station activite
# train-rmse:0.122884	test-rmse:0.123860 / 0.13571 : Cluster Lyon geo
# train-rmse:0.122716	test-rmse:0.123831 / 0.13432 : filling_station_by_geo_cluster & Total stand
# train-rmse:0.105716	test-rmse:0.107058 / 0.10607 : Add PAA tranformation (segments = 10 )
# train-rmse:0.104273	test-rmse:0.105701 / 0.10400 : Add Sax transformation (segments=10, symbols=8)
# train-rmse:0.104693	test-rmse:0.106081 / 0.10391 : Add public_holiday 
# train-rmse:0.104431	test-rmse:0.105907 / 0.10435 : add public_holiday_count (delta from x last holiday)
# train-rmse:0.104448	test-rmse:0.105840 / 0.10461 : Add summer holiday
# train-rmse:0.104113	test-rmse:0.105526 / 0.10457 : absolute public_holiday_count
# train-rmse:0.104328	test-rmse:0.106016 / 0.09623 : start time = "2017-08-01T02:00:00" (old "2017-08-01T00:10:00") for forecast weather matching
# train-rmse:0.104084	test-rmse:0.106086 / 0.09596 : Add weather ['temp', 'humidity', 'weather_desc']
# train-rmse:0.104084	test-rmse:0.106086 / 0.09596 : add ratio_station_open by time (not use)
# train-rmse:0.101491	test-rmse:0.104886 / 0.09667 : max_depth = 9 (ratio_station_open still not use)
# train-rmse:0.101491	test-rmse:0.104886 / 0.09667 : drop ratio_station_open (Features add nothing)
# train-rmse:0.101676	test-rmse:0.105003 / 0.09540 : add ratio_station_geo_cluster_open features
# train-rmse:0.101713	test-rmse:0.104963 / 0.09545 : Fix minutes
# train-rmse:0.101713	test-rmse:0.104963 / 0.09545 : [X] Binned hours of day (7 grp) (Didn't use it)
# train-rmse:0.101835	test-rmse:0.105203 / 0.09624 : [X]"subsample" : 0.85
# train-rmse:0.101666	test-rmse:0.105105 / 0.09575 : [X] Binned hours of day (5 grp) use it but not worth
# train-rmse:0.095662	test-rmse:0.099130 / 0.08941 : Interaction features with paa & sax
# Data Leak with paa & sax so delete it (with interaction)
# Back to Add public_holiday without paa & sax whit #start = pd.Timestamp("2017-08-01T00:10:00") # Tuesday
# -----------------------
# train-rmse:0.122716	test-rmse:0.123831 / 0.13432 : filling_station_by_geo_cluster & Total stand (Old one)
# train-rmse:0.119564	test-rmse:0.121620 / 0.13353 : Add absolute public_holiday_count 
# train-rmse:0.119425	test-rmse:0.121437 / 0.13446 : Add summer holiday
# train-rmse:0.119737	test-rmse:0.121806 / 0.13486 : start time = "2017-08-01T02:00:00" (old "2017-08-01T00:10:00") for forecast weather matching
# train-rmse:0.119416	test-rmse:0.122441 / 0.13383 : Add weather ['temp', 'humidity', 'weather_desc']
# train-rmse:0.119327	test-rmse:0.122504 / 0.13391 : Add weather cloudiness [+/-]
# train-rmse:0.119563	test-rmse:0.122719 / 0.13406 : [x] filling_bike_on_geo_cluster on bike (not bike shift feature)
# train-rmse:0.11972	test-rmse:0.122786 / 0.13378 : [X] Drop ratio_station_geo_cluster_open
# train-rmse:0.119536	test-rmse:0.122669 / 0.13558 : Add rolling mean (6)
# train-rmse:0.11953	test-rmse:0.122635 / 0.13620 : Add rolling std (9)
# train-rmse:0.119579	test-rmse:0.122756 / 0.13582 : Add median (6)
# train-rmse:0.119772	test-rmse:0.122874 / 0.13605 : add Interaction features with mean_6 & median_6
# train-rmse:0.119234	test-rmse:0.122487 / 0.13655 : [?1] add bin hours (5 groups, night / morning / midi / aprem / soire)
# train-rmse:0.119772	test-rmse:0.122874 / 0.13605 : [X] update bin hours (7 groups) [Algo don't use it as feature]
# train-rmse:0.119772	test-rmse:0.122874 / 0.13605 : [V?1] Fix bug on binned hour (aprem 17 -> 18). Correct this nug (create NaN value for 17hours give worst result) ???
# train-rmse:0.117679	test-rmse:0.121040 / 0.13530 : [/!\]add create_mean_by_sta_day_binned_hours (grp by : 'station', 'day', 'hours_binned'. Only open station)
# train-rmse:0.117875	test-rmse:0.121231 / 0.13542 : Add minus create_mean_by_sta_day_binned_hours by proba
# train-rmse:0.117794	test-rmse:0.121224 / 0.13598 : [/!\ v] Fix create_mean_by_sta_day_binned_hours (train test leak)
                                                # To create this features have to keep 'probabilty' in train / test
# train-rmse:0.117999	test-rmse:0.121282 / 0.13548 : [X] Add minus create_mean_by_sta_day_binned_hours by proba
# train-rmse:0.117996	test-rmse:0.121267 / 0.13576 : create_bool_empty_full_station
# train-rmse:0.117877	test-rmse:0.121241 / 0.13600 : create_rolling_median_features empty_full_station (nb_shift = 9)
# train-rmse:0.117877	test-rmse:0.121241 /  : create_rolling_median_features empty_full_station (nb_shift = 6)

In [37]:
#start = pd.Timestamp("2017-07-11T00:00:00") # Tuesday
#predict_date = pd.Timestamp("2017-07-26T10:00:00") # wednesday
# predict the next 30 minutes
#freq = '3OT'
# number of predictions at 'predict_date'.
# Here, the next 30 minutes and the next hour (30 minutes + 30 minutes).
# If you want to predict the next 3 hours, every 30 minutes, thus set periods=6
#periods = 2

#0.092800 original direct on test set

# Learning on : 501686 rows
# Testing on : 247100 rows
# Validation on : 2366 rows

## train / test / validation : comment
# train-rmse:0.087365	test-rmse:0.091362 / 0.09200 : 
# train-rmse:0.086774	test-rmse:0.090772 / 0.09222 :  shift 30 min
# train-rmse:0.086267	test-rmse:0.091045 / 0.092387 : trending KPI
# train-rmse:0.084771	test-rmse:0.089825 / 0.091426 : Cluster Lyon station activité
# train-rmse:0.084847	test-rmse:0.089818 / 0.091285 : was recently open
# train-rmse:0.083754	test-rmse:0.089106 / 0.091335 : Cluster Lyon geo
# train-rmse:0.08413	test-rmse:0.089307 / 0.0915887 : Cluster Lyon geo implement (Checking problem with -1)
# train-rmse:0.084369	test-rmse:0.089433 / 0.091028 : Fix nan was recently open with min_periods=1
# train-rmse:0.083312	test-rmse:0.089216 / 0.091014 : Adding total_stand; total_stand_geo_cluster; bikes_shift_30Tmin_geo_cluster; filling_station_by_geo_cluster
# train-rmse:0.083598	test-rmse:0.089433 / 0.091099 : DROP total_stand (features in filling_bike_on_geo_cluster)
# train-rmse:0.083386	test-rmse:0.089155 / 0.091078 : DROP total_stand_geo_cluster (...)
# train-rmse:0.083296	test-rmse:0.089036 / 0.090905 : DROP bikes_shift_30Tmin_geo_cluster (...)
# train-rmse:0.08354	test-rmse:0.089057 / 0.090567 : [V] Keep only filling_station_by_geo_cluster
# train-rmse:0.083809	test-rmse:0.089234 / 0.090531 : [X] Chage recently open (0->24) to boolean 24->1 else 0 (1 si la station est ouverte depuis 4H par exemple)
# train-rmse:0.083872	test-rmse:0.089464 / 0.090993 : [X] filling_station and filling_station_minus_filling_geo_station
# train-rmse:0.083918	test-rmse:0.089440 / 0.09133  : [X] DROP filling_station
# train-rmse:0.085091	test-rmse:0.090080 / 0.091526 : [X] DROP Station for more general learning
# train-rmse:0.063492	test-rmse:0.069996 / 0.059212 : Add PAA & SAX transformation
#

In [38]:
#start = pd.Timestamp("2017-07-01T00:00:00") # 
#predict_date = pd.Timestamp("2017-07-27T16:00:00") # Thursday
# predict the next 30 minutes
#freq = '1H'
# number of predictions at 'predict_date'.
# Here, the next 30 minutes and the next hour (30 minutes + 30 minutes).
# If you want to predict the next 3 hours, every 30 minutes, thus set periods=6
#periods = 1

# train-rmse:0.113644	test-rmse:0.117976 / 0.17470 : Same as previous best run
# train-rmse:0.114026	test-rmse:0.118588 / 0.1756795 : adding paa (segment=20)
# train-rmse:0.113203	test-rmse:0.117770 / 0.175806 : adding paa (segment=6 - 1 hours)
# train-rmse:0.098235	test-rmse:0.103058 / 0.133363 : Fix Paa function (bad value before)
# train-rmse:0.097559	test-rmse:0.102510 / 0.132045 : Sax function (drop paa)
# train-rmse:0.096851	test-rmse:0.102011 / 0.130106 : PAA and SAX

In [45]:
# must install graphviz
# plot_tree(bst)

In [33]:
def plot_features_importance(bst):
    """
    Plot features importance for a xgboost model
    """
    df_importance = pd.DataFrame([bst.get_score()]).T.reset_index()
    df_importance.columns = ['features', 'importance']
    df_importance.sort_values('importance', ascending=0, inplace=True)
    plt.figure(figsize=(10, 6))
    sns.barplot('importance', 'features', data=df_importance)
    plt.xlabel('Relative importance')
    plt.ylabel('Features importance')

In [34]:
plot_features_importance(bst)



In [34]:
plot_features_importance(bst)


Store the results


In [66]:
result = test_X.copy()
result['ts_future'] = test_Y.index.shift(1, freq=freq)

result['observation'] = test_Y.copy()
result['ts_future'] = test_Y.index.shift(1, freq=freq) ##??
result['prediction'] = pred
result['error'] = pred - test_Y
result['error_abs'] = np.abs(pred - test_Y)
result['relative_error'] = 100. * np.abs(pred - test_Y) / test_Y
result['quad_error'] = (pred - test_Y)**2
result.to_csv("prediction-freq-{}-{}.csv".format(freq, predict_date))

In [67]:
result.sort_values('error_abs', ascending=1).tail(10)


Out[67]:
station bikes stands is_open day hour minute probability is_holiday public_holiday ... humidity weather_desc cloudiness ts_future observation prediction error error_abs relative_error quad_error
ts
2017-09-22 09:30:00 1021 1.0 10.0 1.0 4 9 30 0.090909 0 0 ... 79 0 8 2017-09-22 10:30:00 0.727273 0.195490 -0.531782 0.531782 73.120075 0.282792
2017-09-22 09:10:00 9014 7.0 15.0 1.0 4 9 10 0.318182 0 0 ... 79 0 8 2017-09-22 10:10:00 0.863636 0.324194 -0.539442 0.539442 62.461718 0.290998
2017-09-22 09:50:00 1021 1.5 9.5 1.0 4 9 50 0.136364 0 0 ... 79 0 8 2017-09-22 10:50:00 0.818182 0.276557 -0.541625 0.541625 66.198607 0.293358
2017-09-22 09:10:00 7035 11.5 17.5 1.0 4 9 10 0.396552 0 0 ... 79 0 8 2017-09-22 10:10:00 0.896552 0.354651 -0.541900 0.541900 60.442736 0.293656
2017-09-22 09:00:00 7035 12.0 17.0 1.0 4 9 0 0.413793 0 0 ... 79 0 8 2017-09-22 10:00:00 0.965517 0.420354 -0.545164 0.545164 56.463375 0.297203
2017-09-22 10:00:00 8053 12.0 3.0 1.0 4 10 0 0.800000 0 0 ... 79 0 8 2017-09-22 11:00:00 0.200000 0.749965 0.549965 0.549965 274.982655 0.302462
2017-09-22 09:00:00 10101 8.5 25.5 1.0 4 9 0 0.250000 0 0 ... 79 0 8 2017-09-22 10:00:00 0.941176 0.367589 -0.573587 0.573587 60.943637 0.329002
2017-09-22 09:00:00 1032 4.0 18.0 1.0 4 9 0 0.181818 0 0 ... 79 0 8 2017-09-22 10:00:00 0.863636 0.252201 -0.611436 0.611436 70.797811 0.373854
2017-09-22 09:00:00 4014 1.0 12.0 1.0 4 9 0 0.076923 0 0 ... 79 0 8 2017-09-22 10:00:00 0.769231 0.145658 -0.623573 0.623573 81.064448 0.388843
2017-09-22 09:10:00 4014 1.0 12.0 1.0 4 9 10 0.076923 0 0 ... 79 0 8 2017-09-22 10:10:00 0.769231 0.139498 -0.629733 0.629733 81.865292 0.396564

10 rows × 42 columns


In [68]:
trainning = X_train.copy()
train_pred = prediction(bst, X_train, y_train)
trainning['train_pred'] = train_pred
trainning['y'] = y_train 

# Sorting on time
trainning = trainning.sort_index()

In [69]:
trainning.head()


Out[69]:
station bikes stands is_open day hour minute probability is_holiday public_holiday ... mean_6_multi_median_6 mean_6_ratio_median_6 proba_mean_by_sta_day_binned_hour minus_create_mean_by_sta_day_binned_hours_proba temp humidity weather_desc cloudiness train_pred y
ts
2017-08-01 02:00:00 3037 6.0 14.0 1.0 1 2 0 0.300000 1 0 ... 0.090000 1.0 0.298516 -0.001484 22.16 47.0 0.0 0.0 0.299718 0.250000
2017-08-01 02:00:00 9040 11.0 1.0 1.0 1 2 0 0.916667 1 0 ... 0.840278 1.0 0.610054 -0.306612 22.16 47.0 0.0 0.0 0.906871 0.916667
2017-08-01 02:00:00 6011 11.0 4.0 1.0 1 2 0 0.733333 1 0 ... 0.537778 1.0 0.794324 0.060990 22.16 47.0 0.0 0.0 0.728479 0.733333
2017-08-01 02:00:00 10114 11.0 9.0 1.0 1 2 0 0.550000 1 0 ... 0.302500 1.0 0.351540 -0.198460 22.16 47.0 0.0 0.0 0.566881 0.550000
2017-08-01 02:00:00 7004 29.0 1.0 1.0 1 2 0 0.966667 1 0 ... 0.934444 1.0 0.934624 -0.032042 22.16 47.0 0.0 0.0 0.962215 0.966667

5 rows × 37 columns


In [70]:
result['proba'] = result['bikes'] / (result['bikes'] + result['stands'])

In [120]:
from sklearn.metrics import mean_absolute_error

def plot_station(result, trainning, station, lag=750):
    """Plot available bikes and bike stands for a given station"""
    
    train = trainning[trainning.station == station].tail(lag)
    data = result[result.station == station].copy()
    
    fig, ax = plt.subplots(figsize=(18,5))
    plt.plot(train.index, train.y, '-', label='trainning target', alpha=0.6)
    plt.plot(train.index, train.train_pred, 'r--', label='trainning pred', alpha=0.4)
    #plt.plot(train.index, train.paa, '-', label='PAA', alpha=0.6)
    #plt.plot(train.index, train.sax, '-', label='SAX', alpha=0.6)
    plt.plot(data.index, data.observation, 'g-', label='Real')
    plt.plot(data.index, data.prediction, 'r-', label='Prediction')
    
    
    
    train['total_stand'] = train['bikes'] + train['stands']
    train['bikes_pred'] = np.round(train['total_stand'] * train['train_pred'])
    train['bikes_true'] = train['total_stand'] * train['y']
    
    data['total_stand'] = data['bikes'] + data['stands']
    data['bikes_pred'] = np.round(data['total_stand'] * data['prediction'])
    data['bikes_true'] = data['total_stand'] * data['observation']
    
    rmse_train = np.sqrt(np.mean((train.train_pred - train.y)**2))
    mae_train = mean_absolute_error(train.bikes_true, train.bikes_pred)
    rmse_test = np.sqrt(np.mean((data.prediction - data.observation)**2))
    mae_test = mean_absolute_error(data.bikes_true, data.bikes_pred)
    
    ax = plt.gca()
    # set major ticks location every day
    ax.xaxis.set_major_locator(mdates.DayLocator())
    # set major ticks format
    ax.xaxis.set_major_formatter(mdates.DateFormatter('\n\n\n%a %d.%m.%Y'))
    # set minor ticks location every two hours
    ax.xaxis.set_minor_locator(mdates.HourLocator(interval=2))
    # set minor ticks format
    ax.xaxis.set_minor_formatter(mdates.DateFormatter('%H:%M'))
    plt.setp(ax.xaxis.get_minorticklabels(), rotation=45)

    plt.legend(loc='best')
    print("RMSE train : " + str(np.round(rmse_train, 4)) + " - MAE : " + str(np.round(mae_train, 4)))
    print("RMSE test : " + str(np.round(rmse_test, 4)) + " - MAE : " + str(np.round(mae_test, 4)))

In [124]:
# Il semble que l'on est pas toute les données

plot_station(result, trainning, 10101, lag=350)


RMSE train : 0.1285 - MAE : 2.995
RMSE test : 0.2443 - MAE : 5.381

In [57]:
testing = result[result.station == 10113].copy()

In [58]:
testing = testing.reset_index()

In [59]:
testing.head()


Out[59]:
ts station bikes stands is_open day hour minute probability is_holiday ... ts_future observation prediction error error_abs relative_error quad_error proba bikes_pred bikes_true
0 2017-09-22 09:00:00 10113 3.5 18.5 1.0 4 9 0 0.159091 0 ... 2017-09-22 10:00:00 0.409091 0.156872 -0.252219 0.252219 61.653412 0.063614 0.159091 3.0 9.0
1 2017-09-22 09:10:00 10113 5.0 17.0 1.0 4 9 10 0.227273 0 ... 2017-09-22 10:10:00 0.318182 0.230708 -0.087474 0.087474 27.491752 0.007652 0.227273 5.0 7.0
2 2017-09-22 09:20:00 10113 6.0 16.0 1.0 4 9 20 0.272727 0 ... 2017-09-22 10:20:00 0.227273 0.238216 0.010943 0.010943 4.814914 0.000120 0.272727 5.0 5.0
3 2017-09-22 09:30:00 10113 7.0 15.0 1.0 4 9 30 0.318182 0 ... 2017-09-22 10:30:00 0.181818 0.296440 0.114622 0.114622 63.042249 0.013138 0.318182 7.0 4.0
4 2017-09-22 09:40:00 10113 7.5 14.5 1.0 4 9 40 0.340909 0 ... 2017-09-22 10:40:00 0.181818 0.339183 0.157365 0.157365 86.550783 0.024764 0.340909 7.0 4.0

5 rows × 46 columns


In [60]:
station_result = result.groupby('station')[['bikes_true','bikes_pred']].apply(lambda row: mean_absolute_error(row.bikes_true, row.bikes_pred))
station_result_df = pd.DataFrame(station_result).reset_index()
station_result_df.columns = ['station', 'mae']

In [61]:
station_result_df.sort_values('mae')


Out[61]:
station mae
116 5002 0.000000
263 9052 0.000000
261 9050 0.000000
258 9043 0.000000
216 8011 0.000000
262 9051 0.000000
133 5045 0.000000
114 4042 0.000000
129 5036 0.000000
322 10110 0.000000
128 5031 0.000000
95 4001 0.000000
132 5044 0.000000
235 8056 0.142857
248 9013 0.142857
127 5030 0.142857
17 1036 0.142857
338 12002 0.142857
313 10086 0.142857
211 8006 0.214286
260 9049 0.285714
276 10018 0.285714
293 10049 0.285714
255 9040 0.285714
200 7053 0.285714
138 5055 0.285714
137 5054 0.322751
125 5026 0.357143
229 8041 0.357143
226 8038 0.428571
... ... ...
156 6031 3.785714
79 3082 3.785714
247 9011 3.928571
59 3016 3.934524
190 7033 4.000000
321 10103 4.128079
35 2020 4.214286
143 6005 4.261905
81 3084 4.563776
123 5015 4.571429
1 1002 4.635760
249 9014 4.714286
297 10056 4.738776
188 7030 4.785714
43 2037 4.803665
140 6002 4.904762
267 10005 5.000000
78 3080 5.304147
319 10101 5.380952
14 1032 5.714286
19 2002 5.785714
36 2022 5.857143
268 10006 6.012673
51 3004 6.252747
183 7020 6.285714
10 1022 6.861751
192 7035 8.214286
26 2009 8.928571
203 7057 9.083851
22 2005 9.500000

339 rows × 2 columns


In [160]:
testing.tail()


Out[160]:
ts station bikes stands is_open day hour minute is_holiday public_holiday ... ts_future observation prediction error error_abs relative_error quad_error proba bikes_pred bikes_true
2 2017-09-22 09:20:00 10113 6.0 16.0 1.0 4 9 20 0 0 ... 2017-09-22 10:20:00 0.227273 0.308096 0.080823 0.080823 35.562210 0.006532 0.272727 7.0 5.0
3 2017-09-22 09:30:00 10113 7.0 15.0 1.0 4 9 30 0 0 ... 2017-09-22 10:30:00 0.181818 0.287022 0.105204 0.105204 57.862130 0.011068 0.318182 6.0 4.0
4 2017-09-22 09:40:00 10113 7.5 14.5 1.0 4 9 40 0 0 ... 2017-09-22 10:40:00 0.181818 0.287512 0.105694 0.105694 58.131635 0.011171 0.340909 6.0 4.0
5 2017-09-22 09:50:00 10113 8.5 13.5 1.0 4 9 50 0 0 ... 2017-09-22 10:50:00 0.136364 0.298288 0.161924 0.161924 118.744284 0.026219 0.386364 7.0 3.0
6 2017-09-22 10:00:00 10113 9.0 13.0 1.0 4 10 0 0 0 ... 2017-09-22 11:00:00 0.181818 0.278439 0.096620 0.096620 53.141180 0.009335 0.409091 6.0 4.0

5 rows × 39 columns

Signal transformation


In [50]:
data.head(10)


Out[50]:
ts station bikes stands is_open day hour minute is_holiday public_holiday ... paa sax paa_minus_sax paa_multi_sax paa_ratio_sax y probability prob_shit prob_diff prob_outlier
2402491 2017-09-19 07:00:00 12001 2.000000 23.000000 1.0 1 7 0 0 0 ... 0.092000 0.102565 -0.010565 0.009436 0.896992 0.18 0.080000 0.080000 0.000000 0
2402830 2017-09-19 07:10:00 12001 2.500000 22.500000 1.0 1 7 10 0 0 ... 0.092000 0.108671 -0.016671 0.009998 0.846590 0.16 0.100000 0.080000 0.020000 0
2403169 2017-09-19 07:20:00 12001 3.000000 22.000000 1.0 1 7 20 0 0 ... 0.092000 0.114778 -0.022778 0.010560 0.801550 0.16 0.120000 0.100000 0.020000 0
2403508 2017-09-19 07:30:00 12001 5.000000 20.000000 1.0 1 7 30 0 0 ... 0.092000 0.120884 -0.028884 0.011121 0.761061 0.20 0.200000 0.120000 0.080000 0
2403847 2017-09-19 07:40:00 12001 5.666667 19.333333 1.0 1 7 40 0 0 ... 0.630667 0.602541 0.028126 0.380002 1.046679 0.20 0.226667 0.200000 0.026667 0
2404186 2017-09-19 07:50:00 12001 6.000000 19.000000 1.0 1 7 50 0 0 ... 0.630667 0.608647 0.022020 0.383853 1.036178 0.20 0.240000 0.226667 0.013333 0
2404525 2017-09-19 08:00:00 12001 6.000000 19.000000 1.0 1 8 0 0 0 ... 0.630667 0.614753 0.015913 0.387704 1.025886 0.16 0.240000 0.240000 0.000000 0
2404864 2017-09-19 08:10:00 12001 20.000000 5.000000 1.0 1 8 10 0 0 ... 0.630667 0.620860 0.009807 0.391555 1.015796 0.16 0.800000 0.240000 0.560000 1
2405203 2017-09-19 08:20:00 12001 20.000000 5.000000 1.0 1 8 20 0 0 ... 0.630667 0.626966 0.003701 0.395406 1.005903 0.16 0.800000 0.800000 0.000000 0
2405542 2017-09-19 08:30:00 12001 20.000000 5.000000 1.0 1 8 30 0 0 ... 0.630667 0.633072 -0.002405 0.399257 0.996200 0.16 0.800000 0.800000 0.000000 0

10 rows × 31 columns


In [69]:
def plot_station_transfo(data, station):
    """Plot available bikes and bike stands for a given station"""
    
    nb_row = 750
    data = data[data.station == station].tail(nb_row).copy()
    
    fig, ax = plt.subplots(figsize=(18,5))
    #plt.plot(train.index, train.sax, '-', label='SAX', alpha=0.6)
    plt.plot(data.index, data.y, 'g-', label='Real')
    plt.plot(data.index, data.paa, '-', label='PAA', alpha=0.6)
    #plt.plot(data.index, data.prediction, 'r-', label='Prediction')
    
    
    ax = plt.gca()
    # set major ticks location every day
    ax.xaxis.set_major_locator(mdates.DayLocator())
    # set major ticks format
    ax.xaxis.set_major_formatter(mdates.DateFormatter('\n\n\n%a %d.%m.%Y'))
    # set minor ticks location every two hours
    ax.xaxis.set_minor_locator(mdates.HourLocator(interval=2))
    # set minor ticks format
    ax.xaxis.set_minor_formatter(mdates.DateFormatter('%H:%M'))
    plt.setp(ax.xaxis.get_minorticklabels(), rotation=45)
    
    plt.legend(loc='best')

In [70]:
plot_station_transfo(data, 12001)


---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-70-5a5d107ee213> in <module>()
----> 1 plot_station_transfo(data, 12001)

<ipython-input-69-5395562ebdf5> in plot_station_transfo(data, station)
     21     # set minor ticks format
     22     ax.xaxis.set_minor_formatter(mdates.DateFormatter('%H:%M'))
---> 23     plt.setp(ax.xaxis.get_minorticklabels(), rotation=45)
     24 
     25     plt.legend(loc='best')

~/anaconda/envs/python3DS/lib/python3.5/site-packages/matplotlib/axis.py in get_minorticklabels(self)
   1196     def get_minorticklabels(self):
   1197         'Return a list of Text instances for the minor ticklabels'
-> 1198         ticks = self.get_minor_ticks()
   1199         labels1 = [tick.label1 for tick in ticks if tick.label1On]
   1200         labels2 = [tick.label2 for tick in ticks if tick.label2On]

~/anaconda/envs/python3DS/lib/python3.5/site-packages/matplotlib/axis.py in get_minor_ticks(self, numticks)
   1341         'get the minor tick instances; grow as necessary'
   1342         if numticks is None:
-> 1343             numticks = len(self.get_minor_locator()())
   1344 
   1345         if len(self.minorTicks) < numticks:

~/anaconda/envs/python3DS/lib/python3.5/site-packages/matplotlib/dates.py in __call__(self)
    804             return []
    805 
--> 806         return self.tick_values(dmin, dmax)
    807 
    808     def tick_values(self, vmin, vmax):

~/anaconda/envs/python3DS/lib/python3.5/site-packages/matplotlib/dates.py in tick_values(self, vmin, vmax)
    826         if len(dates) == 0:
    827             return date2num([vmin, vmax])
--> 828         return self.raise_if_exceeds(date2num(dates))
    829 
    830     def _get_unit(self):

~/anaconda/envs/python3DS/lib/python3.5/site-packages/matplotlib/ticker.py in raise_if_exceeds(self, locs)
   1295             msg = ('Locator attempting to generate %d ticks from %s to %s: ' +
   1296                    'exceeds Locator.MAXTICKS') % (len(locs), locs[0], locs[-1])
-> 1297             raise RuntimeError(msg)
   1298 
   1299         return locs

RuntimeError: Locator attempting to generate 1664626 ticks from 2396185.625 to 2534904.375: exceeds Locator.MAXTICKS
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
~/anaconda/envs/python3DS/lib/python3.5/site-packages/IPython/core/formatters.py in __call__(self, obj)
    330                 pass
    331             else:
--> 332                 return printer(obj)
    333             # Finally look for special method names
    334             method = get_real_method(obj, self.print_method)

~/anaconda/envs/python3DS/lib/python3.5/site-packages/IPython/core/pylabtools.py in <lambda>(fig)
    235 
    236     if 'png' in formats:
--> 237         png_formatter.for_type(Figure, lambda fig: print_figure(fig, 'png', **kwargs))
    238     if 'retina' in formats or 'png2x' in formats:
    239         png_formatter.for_type(Figure, lambda fig: retina_figure(fig, **kwargs))

~/anaconda/envs/python3DS/lib/python3.5/site-packages/IPython/core/pylabtools.py in print_figure(fig, fmt, bbox_inches, **kwargs)
    119 
    120     bytes_io = BytesIO()
--> 121     fig.canvas.print_figure(bytes_io, **kw)
    122     data = bytes_io.getvalue()
    123     if fmt == 'svg':

~/anaconda/envs/python3DS/lib/python3.5/site-packages/matplotlib/backend_bases.py in print_figure(self, filename, dpi, facecolor, edgecolor, orientation, format, **kwargs)
   2198                     orientation=orientation,
   2199                     dryrun=True,
-> 2200                     **kwargs)
   2201                 renderer = self.figure._cachedRenderer
   2202                 bbox_inches = self.figure.get_tightbbox(renderer)

~/anaconda/envs/python3DS/lib/python3.5/site-packages/matplotlib/backends/backend_agg.py in print_png(self, filename_or_obj, *args, **kwargs)
    543 
    544     def print_png(self, filename_or_obj, *args, **kwargs):
--> 545         FigureCanvasAgg.draw(self)
    546         renderer = self.get_renderer()
    547         original_dpi = renderer.dpi

~/anaconda/envs/python3DS/lib/python3.5/site-packages/matplotlib/backends/backend_agg.py in draw(self)
    462 
    463         try:
--> 464             self.figure.draw(self.renderer)
    465         finally:
    466             RendererAgg.lock.release()

~/anaconda/envs/python3DS/lib/python3.5/site-packages/matplotlib/artist.py in draw_wrapper(artist, renderer, *args, **kwargs)
     61     def draw_wrapper(artist, renderer, *args, **kwargs):
     62         before(artist, renderer)
---> 63         draw(artist, renderer, *args, **kwargs)
     64         after(artist, renderer)
     65 

~/anaconda/envs/python3DS/lib/python3.5/site-packages/matplotlib/figure.py in draw(self, renderer)
   1142 
   1143             mimage._draw_list_compositing_images(
-> 1144                 renderer, self, dsu, self.suppressComposite)
   1145 
   1146             renderer.close_group('figure')

~/anaconda/envs/python3DS/lib/python3.5/site-packages/matplotlib/image.py in _draw_list_compositing_images(renderer, parent, dsu, suppress_composite)
    137     if not_composite or not has_images:
    138         for zorder, a in dsu:
--> 139             a.draw(renderer)
    140     else:
    141         # Composite any adjacent images together

~/anaconda/envs/python3DS/lib/python3.5/site-packages/matplotlib/artist.py in draw_wrapper(artist, renderer, *args, **kwargs)
     61     def draw_wrapper(artist, renderer, *args, **kwargs):
     62         before(artist, renderer)
---> 63         draw(artist, renderer, *args, **kwargs)
     64         after(artist, renderer)
     65 

~/anaconda/envs/python3DS/lib/python3.5/site-packages/matplotlib/axes/_base.py in draw(self, renderer, inframe)
   2424             renderer.stop_rasterizing()
   2425 
-> 2426         mimage._draw_list_compositing_images(renderer, self, dsu)
   2427 
   2428         renderer.close_group('axes')

~/anaconda/envs/python3DS/lib/python3.5/site-packages/matplotlib/image.py in _draw_list_compositing_images(renderer, parent, dsu, suppress_composite)
    137     if not_composite or not has_images:
    138         for zorder, a in dsu:
--> 139             a.draw(renderer)
    140     else:
    141         # Composite any adjacent images together

~/anaconda/envs/python3DS/lib/python3.5/site-packages/matplotlib/artist.py in draw_wrapper(artist, renderer, *args, **kwargs)
     61     def draw_wrapper(artist, renderer, *args, **kwargs):
     62         before(artist, renderer)
---> 63         draw(artist, renderer, *args, **kwargs)
     64         after(artist, renderer)
     65 

~/anaconda/envs/python3DS/lib/python3.5/site-packages/matplotlib/axis.py in draw(self, renderer, *args, **kwargs)
   1134         renderer.open_group(__name__)
   1135 
-> 1136         ticks_to_draw = self._update_ticks(renderer)
   1137         ticklabelBoxes, ticklabelBoxes2 = self._get_tick_bboxes(ticks_to_draw,
   1138                                                                 renderer)

~/anaconda/envs/python3DS/lib/python3.5/site-packages/matplotlib/axis.py in _update_ticks(self, renderer)
    967 
    968         interval = self.get_view_interval()
--> 969         tick_tups = [t for t in self.iter_ticks()]
    970         if self._smart_bounds:
    971             # handle inverted limits

~/anaconda/envs/python3DS/lib/python3.5/site-packages/matplotlib/axis.py in <listcomp>(.0)
    967 
    968         interval = self.get_view_interval()
--> 969         tick_tups = [t for t in self.iter_ticks()]
    970         if self._smart_bounds:
    971             # handle inverted limits

~/anaconda/envs/python3DS/lib/python3.5/site-packages/matplotlib/axis.py in iter_ticks(self)
    910         Iterate through all of the major and minor ticks.
    911         """
--> 912         majorLocs = self.major.locator()
    913         majorTicks = self.get_major_ticks(len(majorLocs))
    914         self.major.formatter.set_locs(majorLocs)

~/anaconda/envs/python3DS/lib/python3.5/site-packages/matplotlib/dates.py in __call__(self)
    804             return []
    805 
--> 806         return self.tick_values(dmin, dmax)
    807 
    808     def tick_values(self, vmin, vmax):

~/anaconda/envs/python3DS/lib/python3.5/site-packages/matplotlib/dates.py in tick_values(self, vmin, vmax)
    826         if len(dates) == 0:
    827             return date2num([vmin, vmax])
--> 828         return self.raise_if_exceeds(date2num(dates))
    829 
    830     def _get_unit(self):

~/anaconda/envs/python3DS/lib/python3.5/site-packages/matplotlib/ticker.py in raise_if_exceeds(self, locs)
   1295             msg = ('Locator attempting to generate %d ticks from %s to %s: ' +
   1296                    'exceeds Locator.MAXTICKS') % (len(locs), locs[0], locs[-1])
-> 1297             raise RuntimeError(msg)
   1298 
   1299         return locs

RuntimeError: Locator attempting to generate 138719 ticks from 2396186.0 to 2534904.0: exceeds Locator.MAXTICKS
<matplotlib.figure.Figure at 0x13ae36898>

In [68]:
data[data.station == 12001][['ts', 'bikes', 'stands', 'y', 'paa']].tail(22)


Out[68]:
ts bikes stands y paa
2521480 2017-09-21 17:30:00 2.0 23.0 0.04 0.148
2521819 2017-09-21 17:40:00 1.0 24.0 0.04 0.148
2522158 2017-09-21 17:50:00 1.0 24.0 0.04 0.148
2522497 2017-09-21 18:00:00 1.0 24.0 0.04 0.040
2522836 2017-09-21 18:10:00 1.0 24.0 0.04 0.040
2523175 2017-09-21 18:20:00 1.0 24.0 0.04 0.040
2523514 2017-09-21 18:30:00 1.0 24.0 0.08 0.040
2523853 2017-09-21 18:40:00 1.0 24.0 0.08 0.040
2524192 2017-09-21 18:50:00 1.0 24.0 0.08 0.040
2524531 2017-09-21 19:00:00 1.0 24.0 0.08 0.040
2524870 2017-09-21 19:10:00 1.0 24.0 0.12 0.040
2525209 2017-09-21 19:20:00 1.0 24.0 0.12 0.040
2525548 2017-09-21 19:30:00 1.0 24.0 0.16 0.040
2525887 2017-09-21 19:40:00 1.0 24.0 0.16 0.048
2526226 2017-09-21 19:50:00 1.0 24.0 0.16 0.048
2526565 2017-09-21 20:00:00 1.0 24.0 0.20 0.048
2526904 2017-09-21 20:10:00 1.0 24.0 0.24 0.048
2527243 2017-09-21 20:20:00 1.0 24.0 0.24 0.048
2527582 2017-09-21 20:30:00 1.0 24.0 0.20 0.048
2527921 2017-09-21 20:40:00 1.0 24.0 0.22 0.048
2528260 2017-09-21 20:50:00 1.0 24.0 0.32 0.048
2528599 2017-09-21 21:00:00 2.0 23.0 0.42 0.048

In [66]:
data.loc[2528599]


Out[66]:
ts                                2017-09-21 21:00:00
station                                         12001
bikes                                               2
stands                                             23
is_open                                             1
day                                                 3
hour                                               21
minute                                              0
is_holiday                                          0
public_holiday                                      0
public_holiday_count                                0
cluster                                             3
station_cluster_geo                                 3
ratio_station_geo_cluster_open                      1
bikes_shift_12Hmin                                  1
cumsum_trend_sup                                    1
cumsum_trend_inf                                    0
cumsum_trend_equal                                  0
was_recently_open                                  24
total_stand                                        25
filling_station_by_geo_cluster               0.288066
paa                                             0.048
sax                                         0.0601906
paa_minus_sax                              -0.0121906
paa_multi_sax                              0.00288915
paa_ratio_sax                                0.797466
y                                                0.42
probability                                      0.08
prob_shit                                        0.04
prob_diff                                        0.04
prob_outlier                                        0
Name: 2528599, dtype: object

In [ ]:


In [ ]:


In [ ]:


In [ ]:

météo


In [90]:
lyon_meteo = pd.read_csv('data/lyon_weather.csv', parse_dates=['date'])

In [91]:
lyon_meteo.rename(columns={'date':'ts'}, inplace=True)

In [92]:
lyon_meteo[55:62]


Out[92]:
ts weather_id weather_desc temp temp_min temp_max pressure humidity wind_speed cloudiness
55 2017-08-01 09:20:00 800 Clear 24.47 21.0 26.0 1013.0 54 1.0 0
56 2017-08-01 09:30:00 800 Clear 24.47 21.0 26.0 1013.0 54 1.0 0
57 2017-08-01 09:40:00 800 Clear 25.46 22.0 27.0 1013.0 51 1.5 0
58 2017-08-01 10:00:00 800 Clear 25.46 22.0 27.0 1013.0 51 1.5 0
59 2017-08-01 10:10:00 800 Clear 26.71 22.0 29.0 1013.0 48 1.5 0
60 2017-08-01 10:20:00 800 Clear 26.70 22.0 29.0 1013.0 48 1.5 0
61 2017-08-01 10:30:00 800 Clear 26.70 22.0 29.0 1013.0 48 1.5 0

In [67]:
lyon_meteo.pressure.describe()


Out[67]:
count    8714.000000
mean     1017.444658
std         4.307897
min       986.100000
25%      1015.000000
50%      1018.000000
75%      1020.000000
max      1025.000000
Name: pressure, dtype: float64

In [70]:
lyon_meteo[lyon_meteo.pressure >= 1020.000000].tail(10)


Out[70]:
ts weather_id weather_desc temp temp_min temp_max pressure humidity wind_speed cloudiness
8704 2017-09-30 22:20:00 800 Clear 12.49 12.0 13.0 1022.0 87 2.1 0
8705 2017-09-30 22:30:00 800 Clear 12.48 12.0 13.0 1022.0 87 2.1 0
8706 2017-09-30 22:40:00 800 Clear 12.23 11.0 13.0 1022.0 87 1.0 0
8707 2017-09-30 22:50:00 800 Clear 12.23 11.0 13.0 1022.0 87 1.0 0
8708 2017-09-30 23:00:00 800 Clear 12.23 11.0 13.0 1022.0 87 1.0 0
8709 2017-09-30 23:10:00 800 Clear 12.23 11.0 13.0 1022.0 87 1.0 0
8710 2017-09-30 23:20:00 800 Clear 11.99 11.0 13.0 1022.0 93 1.0 0
8711 2017-09-30 23:30:00 800 Clear 11.99 11.0 13.0 1022.0 93 1.0 0
8712 2017-09-30 23:40:00 800 Clear 11.99 11.0 13.0 1022.0 93 1.0 0
8713 2017-09-30 23:50:00 803 Clouds 11.99 11.0 13.0 1022.0 93 1.5 75

In [58]:
lyon_meteo.weather_desc.unique()


Out[58]:
array(['Clear', 'Thunderstorm', 'Clouds', 'Rain', 'Mist', 'Fog', 'Drizzle'], dtype=object)

In [59]:


In [60]:
LE = LabelEncoder()

In [61]:
lyon_meteo['weather_desc'] = LE.fit_transform(lyon_meteo['weather_desc'])

In [100]:
joblib.dump(LE, 'model/Label_Encoder_Weather.pkl')


Out[100]:
['model/Label_Encoder_Weather.pkl']

In [101]:
clf = joblib.load('model/Label_Encoder_Weather.pkl')

In [104]:
clf.transform(['Cleare'])


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-104-e545bfc7b670> in <module>()
----> 1 clf.transform(['Cleare'])

~/anaconda/envs/python3DS/lib/python3.5/site-packages/sklearn/preprocessing/label.py in transform(self, y)
    131         if len(np.intersect1d(classes, self.classes_)) < len(classes):
    132             diff = np.setdiff1d(classes, self.classes_)
--> 133             raise ValueError("y contains new labels: %s" % str(diff))
    134         return np.searchsorted(self.classes_, y)
    135 

ValueError: y contains new labels: ['Cleare']

In [62]:
lyon_meteo.tail()


Out[62]:
ts weather_id weather_desc temp temp_min temp_max pressure humidity wind_speed cloudiness
8709 2017-09-30 23:10:00 800 0 12.23 11.0 13.0 1022.0 87 1.0 0
8710 2017-09-30 23:20:00 800 0 11.99 11.0 13.0 1022.0 93 1.0 0
8711 2017-09-30 23:30:00 800 0 11.99 11.0 13.0 1022.0 93 1.0 0
8712 2017-09-30 23:40:00 800 0 11.99 11.0 13.0 1022.0 93 1.0 0
8713 2017-09-30 23:50:00 803 1 11.99 11.0 13.0 1022.0 93 1.5 75

resample meteo


In [75]:
clean_lyon_meteo = lyon_meteo.resample("10T", on="ts").mean().bfill().reset_index()

In [76]:
clean_lyon_meteo[55:62]


Out[76]:
ts weather_id weather_desc temp temp_min temp_max pressure humidity wind_speed cloudiness
55 2017-08-01 09:20:00 800.0 0.0 24.47 21.0 26.0 1013.0 54.0 1.0 0.0
56 2017-08-01 09:30:00 800.0 0.0 24.47 21.0 26.0 1013.0 54.0 1.0 0.0
57 2017-08-01 09:40:00 800.0 0.0 25.46 22.0 27.0 1013.0 51.0 1.5 0.0
58 2017-08-01 09:50:00 800.0 0.0 25.46 22.0 27.0 1013.0 51.0 1.5 0.0
59 2017-08-01 10:00:00 800.0 0.0 25.46 22.0 27.0 1013.0 51.0 1.5 0.0
60 2017-08-01 10:10:00 800.0 0.0 26.71 22.0 29.0 1013.0 48.0 1.5 0.0
61 2017-08-01 10:20:00 800.0 0.0 26.70 22.0 29.0 1013.0 48.0 1.5 0.0

Merging train_X


In [65]:
train_X.reset_index(inplace=True)

In [77]:
train_X.head()


Out[77]:
ts station bikes stands is_open day hour minute is_holiday public_holiday ... station_cluster_geo bikes_shift_1Hmin cumsum_trend_sup cumsum_trend_inf cumsum_trend_equal was_recently_open total_stand filling_station_by_geo_cluster paa sax
0 2017-08-01 00:10:00 1001 10.0 6.0 1.0 1 0 0 1 0 ... 5 10.0 0 0 0 1.0 16.0 0.456444 0.625000 0.623393
1 2017-08-01 00:10:00 1002 17.0 5.0 1.0 1 0 0 1 0 ... 5 17.0 0 0 1 24.0 22.0 0.456444 0.570455 0.595576
2 2017-08-01 00:10:00 1003 14.0 1.0 1.0 1 0 0 1 0 ... 5 14.0 0 0 1 24.0 15.0 0.456444 0.740000 0.767377
3 2017-08-01 00:10:00 1005 1.0 9.0 1.0 1 0 0 1 0 ... 5 1.0 0 0 1 24.0 10.0 0.456444 0.125000 0.150893
4 2017-08-01 00:10:00 1006 11.0 11.0 1.0 1 0 0 1 0 ... 5 11.0 0 0 3 24.0 22.0 0.456444 0.402273 0.430791

5 rows × 22 columns


In [78]:
train_X.shape


Out[78]:
(2554704, 22)

In [79]:
lol = train_X.merge(clean_lyon_meteo[['ts', 'temp', 'humidity', 'weather_desc']], on='ts', how='left')

In [80]:
pd.isnull(lol[['temp', 'humidity']]).sum()


Out[80]:
temp        0
humidity    0
dtype: int64

In [81]:
lol.shape


Out[81]:
(2554704, 25)

In [82]:
lol.head()


Out[82]:
ts station bikes stands is_open day hour minute is_holiday public_holiday ... cumsum_trend_inf cumsum_trend_equal was_recently_open total_stand filling_station_by_geo_cluster paa sax temp humidity weather_desc
0 2017-08-01 00:10:00 1001 10.0 6.0 1.0 1 0 0 1 0 ... 0 0 1.0 16.0 0.456444 0.625000 0.623393 26.7 39.0 0.0
1 2017-08-01 00:10:00 1002 17.0 5.0 1.0 1 0 0 1 0 ... 0 1 24.0 22.0 0.456444 0.570455 0.595576 26.7 39.0 0.0
2 2017-08-01 00:10:00 1003 14.0 1.0 1.0 1 0 0 1 0 ... 0 1 24.0 15.0 0.456444 0.740000 0.767377 26.7 39.0 0.0
3 2017-08-01 00:10:00 1005 1.0 9.0 1.0 1 0 0 1 0 ... 0 1 24.0 10.0 0.456444 0.125000 0.150893 26.7 39.0 0.0
4 2017-08-01 00:10:00 1006 11.0 11.0 1.0 1 0 0 1 0 ... 0 3 24.0 22.0 0.456444 0.402273 0.430791 26.7 39.0 0.0

5 rows × 25 columns


In [435]:
print(lyon_meteo.date.min())
print(lyon_meteo.date.max())


2017-08-01 00:10:00
2017-09-30 23:50:00

In [44]:
lyon_meteo = pd.read_csv('data/lyon_weather.csv', parse_dates=['date'])
lyon_meteo.rename(columns={'date':'ts'}, inplace=True)

# have to labelencode weather_desc
LE = LabelEncoder()
lyon_meteo['weather_desc'] = LE.fit_transform(lyon_meteo['weather_desc'])

# Dump LabelEncoder
joblib.dump(LE, 'model/Label_Encoder_Weather.pkl')

# Resemple data on 10
clean_lyon_meteo = lyon_meteo.resample("10T", on="ts").mean().bfill().reset_index()

In [45]:
clean_lyon_meteo.head()


Out[45]:
ts weather_id weather_desc temp temp_min temp_max pressure humidity wind_speed cloudiness
0 2017-08-01 00:10:00 800.0 0.0 26.70 24.0 28.0 1011.0 39.0 3.1 0.0
1 2017-08-01 00:20:00 800.0 0.0 26.22 24.0 27.0 1011.0 39.0 1.0 0.0
2 2017-08-01 00:30:00 800.0 0.0 26.22 24.0 27.0 1011.0 39.0 1.0 0.0
3 2017-08-01 00:40:00 800.0 0.0 26.22 24.0 27.0 1011.0 39.0 1.0 0.0
4 2017-08-01 00:50:00 800.0 0.0 24.72 21.0 27.0 1011.0 44.0 1.5 0.0

In [46]:
X_train.head()


Out[46]:
station bikes stands is_open day hour minute is_holiday public_holiday public_holiday_count ... station_cluster_geo bikes_shift_1Hmin cumsum_trend_sup cumsum_trend_inf cumsum_trend_equal was_recently_open total_stand filling_station_by_geo_cluster paa sax
ts
2017-09-13 21:20:00 3067 14.0 1.0 1.0 2 21 10 0 0 0.0 ... 2 13.5 2 0 0 24.0 15.0 0.422381 0.910000 0.884635
2017-08-29 16:40:00 7012 12.0 1.0 1.0 1 16 30 1 0 0.0 ... 4 12.0 0 0 19 24.0 13.0 0.657379 0.923077 0.925245
2017-09-16 04:10:00 7010 6.0 14.0 1.0 5 4 0 0 0 0.0 ... 0 5.0 1 0 0 24.0 20.0 0.660374 0.245000 0.266291
2017-08-28 15:30:00 8003 2.0 14.0 1.0 0 15 20 1 0 0.0 ... 11 2.0 0 0 13 24.0 16.0 0.498904 0.125000 0.123527
2017-08-25 23:30:00 4006 1.0 21.0 1.0 4 23 20 1 0 0.0 ... 8 1.0 0 0 21 24.0 22.0 0.349383 0.045455 0.046222

5 rows × 21 columns


In [214]:
def get_weather(df, how='learning', freq=None):
    """
    Match timeseries with weather data.
    If type == learning :
        Matching with historitical data weather
    if type == forcast :
        Matching with forcast data. Freq must be fill with this opton
    """

    df = df.reset_index()

    # Check params
    if how not in ['learning', 'forecast']:
        logger.error('Bad option for get_weather. You must choose between learning or forecast')
        return df

    if how == 'forecast' and freq is None:
        logger.error("For forecast option, we must specify freq. Ex freq='1H'")


    # Process for learning matching
    if how == 'learning':
        lyon_meteo = pd.read_csv('data/lyon_weather.csv', parse_dates=['date'])
        lyon_meteo.rename(columns={'date':'ts'}, inplace=True)

        # have to labelencode weather_desc
        LE = LabelEncoder()
        lyon_meteo['weather_desc'] = LE.fit_transform(lyon_meteo['weather_desc'])

        # Dump LabelEncoder
        joblib.dump(LE, 'model/Label_Encoder_Weather.pkl')

        # Resemple data on 10
        clean_lyon_meteo = lyon_meteo.resample("10T", on="ts").mean().bfill().reset_index()
        df = df.merge(clean_lyon_meteo[['ts', 'temp', 'humidity', 'weather_desc']], on='ts', how='left')
        
        df = df.set_index('ts')
        return df

    # Process for forecast matching
    if how == 'forecast':
        lyon_forecast = pd.read_csv('data/lyon_forecast.csv', parse_dates=['forecast_at', 'ts'])
        lyon_forecast['delta'] = lyon_forecast['ts'] - lyon_forecast['forecast_at']

        # Filter on delta with freq
        lyon_forecast = lyon_forecast[lyon_forecast['delta'] == freq]
        lyon_forecast.drop_duplicates(subset=['ts', 'delta'], keep='first', inplace=True)
        
        # Label encode weather_desc
        LE = joblib.load('model/Label_Encoder_Weather.pkl') 
        lyon_forecast['weather_desc'] = LE.transform(lyon_forecast['weather_desc'])

        #Merging
        # We talk the last forecast (on freq) using backward merging
        df = df.sort_values('ts')
        df = pd.merge_asof(left=df, right=lyon_forecast[['ts','temp', 'humidity', 'weather_desc']], on='ts', direction='backward')

        df = df.set_index('ts')
        return df

In [53]:
del lol

Forecast


In [88]:
lyon_forecast = pd.read_csv('data/lyon_forecast.csv', parse_dates=['forecast_at', 'ts'])

In [89]:
lyon_forecast.head()


Out[89]:
forecast_at ts weather_id weather_desc temp temp_min temp_max rain_3h snow_3h pressure humidity wind_speed cloudiness
0 2017-08-01 2017-08-01 02:00:00 800 Clear 26.06 26.06 26.16 NaN NaN 982.16 36 6.51 8
1 2017-08-01 2017-08-01 05:00:00 800 Clear 25.64 25.64 25.72 NaN NaN 981.74 36 7.09 8
2 2017-08-01 2017-08-01 08:00:00 500 Rain 25.83 25.83 25.88 0.025 NaN 982.54 44 6.83 20
3 2017-08-01 2017-08-01 11:00:00 500 Rain 29.75 29.75 29.77 0.160 NaN 983.02 44 6.05 0
4 2017-08-01 2017-08-01 14:00:00 803 Clouds 32.56 32.56 32.56 NaN NaN 982.83 35 7.16 76

In [90]:
lyon_forecast.weather_desc.unique()


Out[90]:
array(['Clear', 'Rain', 'Clouds'], dtype=object)

In [91]:
lyon_forecast.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86102 entries, 0 to 86101
Data columns (total 13 columns):
forecast_at     86102 non-null datetime64[ns]
ts              86102 non-null datetime64[ns]
weather_id      86102 non-null int64
weather_desc    86102 non-null object
temp            86102 non-null float64
temp_min        86102 non-null float64
temp_max        86102 non-null float64
rain_3h         25824 non-null float64
snow_3h         0 non-null float64
pressure        86102 non-null float64
humidity        86102 non-null int64
wind_speed      86102 non-null float64
cloudiness      86102 non-null int64
dtypes: datetime64[ns](2), float64(7), int64(3), object(1)
memory usage: 8.5+ MB

In [92]:
lyon_forecast['delta'] = lyon_forecast['ts'] - lyon_forecast['forecast_at']

In [93]:
lyon_forecast = lyon_forecast[lyon_forecast['delta'] == "1H"]
lyon_forecast.drop_duplicates(subset=['ts', 'delta'], keep='first', inplace=True)

In [126]:
LE = joblib.load('model/Label_Encoder_Weather.pkl') 
lyon_forecast['weather_desc'] = LE.transform(lyon_forecast['weather_desc'])

In [229]:
lyon_forecast[(lyon_forecast.ts >= "2017-08-01 07:00:00") & (lyon_forecast.delta == "1H")].head()


Out[229]:
forecast_at ts weather_id weather_desc temp temp_min temp_max rain_3h snow_3h pressure humidity wind_speed cloudiness delta
273 2017-08-01 07:00:00 2017-08-01 08:00:00 500 5 20.82 20.82 25.88 0.025 NaN 982.54 44 6.83 20 01:00:00
385 2017-08-01 10:00:00 2017-08-01 11:00:00 500 5 29.44 29.44 29.77 0.160 NaN 983.02 44 6.05 0 01:00:00
494 2017-08-01 13:00:00 2017-08-01 14:00:00 803 1 31.95 31.95 32.56 NaN NaN 982.83 35 7.16 76 01:00:00
600 2017-08-01 16:00:00 2017-08-01 17:00:00 802 1 30.55 30.07 30.55 NaN NaN 983.80 48 5.96 48 01:00:00
720 2017-08-01 19:00:00 2017-08-01 20:00:00 804 1 27.31 27.31 27.99 NaN NaN 985.40 55 2.58 88 01:00:00

In [153]:
#lyon_forecast[(lyon_forecast.ts >= "2017-08-01 00:10:00") & (lyon_forecast.delta == "1H")]

In [215]:
X_test.head(2)


Out[215]:
station bikes stands is_open day hour minute is_holiday public_holiday public_holiday_count ... station_cluster_geo bikes_shift_1Hmin cumsum_trend_sup cumsum_trend_inf cumsum_trend_equal was_recently_open total_stand filling_station_by_geo_cluster paa sax
ts
2017-08-23 18:00:00 9010 10.0 9.0 1.0 2 18 0 1 0 0.0 ... 1 10.0 0 0 6 24.0 19.0 0.596376 0.552632 0.550343
2017-09-16 14:30:00 1022 30.0 1.0 1.0 5 14 20 0 0 0.0 ... 5 30.0 0 0 1 24.0 31.0 0.758573 0.686022 0.711584

2 rows × 21 columns


In [216]:
lol = get_weather(X_test, how='forecast', freq='1H')

In [235]:
lol[lol.index.isin(['2017-08-01 10:00:00','2017-08-01 11:00:00'])].tail()


Out[235]:
station bikes stands is_open day hour minute is_holiday public_holiday public_holiday_count ... cumsum_trend_inf cumsum_trend_equal was_recently_open total_stand filling_station_by_geo_cluster paa sax temp humidity weather_desc
ts
2017-08-01 11:00:00 3088 14.0 2.0 1.0 1 11 0 1 0 0.0 ... 0 1 24.0 16.0 0.664074 0.834375 0.836449 29.44 44 5
2017-08-01 11:00:00 6001 14.0 1.0 1.0 1 11 0 1 0 0.0 ... 0 2 24.0 15.0 0.664074 0.926667 0.922586 29.44 44 5
2017-08-01 11:00:00 3029 4.0 19.0 1.0 1 11 0 1 0 0.0 ... 1 0 24.0 23.0 0.421277 0.160870 0.163770 29.44 44 5
2017-08-01 11:00:00 4023 14.0 7.0 1.0 1 11 0 1 0 0.0 ... 0 3 24.0 21.0 0.640138 0.650000 0.652840 29.44 44 5
2017-08-01 11:00:00 3051 11.0 2.0 1.0 1 11 0 1 0 0.0 ... 0 6 24.0 13.0 0.537528 0.730769 0.731793 29.44 44 5

5 rows × 24 columns


In [232]:
lol.head()


Out[232]:
station bikes stands is_open day hour minute is_holiday public_holiday public_holiday_count ... cumsum_trend_inf cumsum_trend_equal was_recently_open total_stand filling_station_by_geo_cluster paa sax temp humidity weather_desc
ts
2017-08-01 02:00:00 10088 18.0 1.0 1.0 1 2 0 1 0 0.0 ... 0 2 24.0 19.0 0.460417 0.947368 0.948087 24.47 36 0
2017-08-01 02:00:00 10102 6.0 24.0 1.0 1 2 0 1 0 0.0 ... 0 1 24.0 30.0 0.375505 0.200000 0.199909 24.47 36 0
2017-08-01 02:00:00 2014 1.0 15.0 1.0 1 2 0 1 0 0.0 ... 0 2 24.0 16.0 0.363962 0.062500 0.063656 24.47 36 0
2017-08-01 02:00:00 4003 1.0 16.0 1.0 1 2 0 1 0 0.0 ... 0 92 24.0 17.0 0.336585 0.058824 0.059792 24.47 36 0
2017-08-01 02:00:00 8008 4.0 9.0 1.0 1 2 0 1 0 0.0 ... 0 1 24.0 13.0 0.853293 0.319231 0.292275 24.47 36 0

5 rows × 24 columns


In [ ]:


In [222]:
lyon_forecast[(lyon_forecast.ts >= "2017-08-01 07:00:00")].head(5)


Out[222]:
forecast_at ts weather_id weather_desc temp temp_min temp_max rain_3h snow_3h pressure humidity wind_speed cloudiness delta
273 2017-08-01 07:00:00 2017-08-01 08:00:00 500 5 20.82 20.82 25.88 0.025 NaN 982.54 44 6.83 20 01:00:00
385 2017-08-01 10:00:00 2017-08-01 11:00:00 500 5 29.44 29.44 29.77 0.160 NaN 983.02 44 6.05 0 01:00:00
494 2017-08-01 13:00:00 2017-08-01 14:00:00 803 1 31.95 31.95 32.56 NaN NaN 982.83 35 7.16 76 01:00:00
600 2017-08-01 16:00:00 2017-08-01 17:00:00 802 1 30.55 30.07 30.55 NaN NaN 983.80 48 5.96 48 01:00:00
720 2017-08-01 19:00:00 2017-08-01 20:00:00 804 1 27.31 27.31 27.99 NaN NaN 985.40 55 2.58 88 01:00:00

In [166]:
#X_test = X_test.reset_index()
#Merging
# We talk the last forecast (on freq) using backward merging
#X_test = X_test.sort_values('ts')
lol = pd.merge_asof(left=X_test, right=lyon_forecast[['ts','temp', 'humidity', 'weather_desc']], on='ts', direction='backward')

In [ ]:


In [176]:
print(lyon_forecast.ts.min())
print(lyon_forecast.ts.max())


2017-08-01 02:00:00
2017-09-30 23:00:00

In [ ]:

Some maps

CSV file with station coordinates


In [48]:
locations = pd.read_csv("./data/lyon-stations.csv")

In [49]:
locations.shape


Out[49]:
(348, 4)

Some stations were removed when the data were cleaned up. Remove them from the location data.


In [50]:
mask = locations['idstation'].isin(result.station.unique())

In [51]:
mask.sum()


Out[51]:
337

In [52]:
locations = locations[mask]

In [53]:
locations = locations.rename_axis({'idstation': 'station'}, axis=1)

In [54]:
locations.head()


Out[54]:
station nom lat lon
0 10027 Mairie de Villeurbanne 45.766831 4.879894
1 10030 Greuze 45.773844 4.893848
2 10034 MJC 45.761788 4.886157
3 10036 Chaplin / Dutriévoz 45.774357 4.859155
4 10038 Condorcet / 11 Nov. 1918 45.779046 4.866778

Some station names contains the ' character. Replace it by the HTML code for folium.


In [78]:
locations["nom"] = locations['nom'].str.replace("'", "&apos;")

Select the prediction data for a specific timestamp


In [79]:
data_to_plot = result.loc[predict_date]

In [80]:
data_to_plot.shape


Out[80]:
(337, 12)

In [81]:
data_to_plot.head()


Out[81]:
station bikes stands day hour minute ts_future observation prediction error relative_error quad_error
ts
2017-07-26 10:00:00 1001 2.0 14.0 2 10 0 2017-07-26 10:30:00 0.375000 0.140736 -0.234264 62.470392 0.054880
2017-07-26 10:00:00 1002 8.0 14.0 2 10 0 2017-07-26 10:30:00 0.255814 0.358517 0.102703 40.147679 0.010548
2017-07-26 10:00:00 1003 2.0 13.0 2 10 0 2017-07-26 10:30:00 0.133333 0.145994 0.012661 9.495718 0.000160
2017-07-26 10:00:00 1005 2.0 8.0 2 10 0 2017-07-26 10:30:00 0.300000 0.227014 -0.072986 24.328576 0.005327
2017-07-26 10:00:00 1006 2.0 19.0 2 10 0 2017-07-26 10:30:00 0.047619 0.109391 0.061772 129.721656 0.003816

In [83]:
yhat = data_to_plot[['station', 'prediction']].merge(locations, on='station')
yhat.head()


Out[83]:
station prediction nom lat lon
0 1001 0.140736 Terreaux / Terme 45.767735 4.832135
1 1002 0.358517 Opéra 45.767507 4.836282
2 1003 0.145994 Place Sathonay 45.769266 4.830080
3 1005 0.227014 Meissonier 45.766089 4.832890
4 1006 0.109391 Subsistances 45.767896 4.819817

In [104]:
y = data_to_plot[['station', 'observation']].merge(locations, on='station')

In [140]:
error = data_to_plot[['station', 'error']].merge(locations, on='station')

In [122]:
colormap = 'RdYlBu'
cmap = plt.get_cmap(colormap)

In [137]:
# show the colormap use to plot the stations, values [0, 1]
gradient = np.linspace(0, 1, 256)
gradient = np.vstack((gradient, gradient))
fig, ax = plt.subplots(1)
fig.subplots_adjust(top=0.95, bottom=0.80, left=0.2, right=0.99)
ax.set_xticks([0., 64, 128, 192, 256])
ax.set_xticklabels([0., 0.25, 0.5, 0.75, 1.])
ax.set_xlabel('Bikes Availability')
ax.imshow(gradient, aspect='auto', cmap=cmap, vmin=0, vmax=1)
plt.title('Colormap used to plot stations')


Out[137]:
<matplotlib.text.Text at 0x6e04ec46ec88>
  • Red stations if there are 0 or a few available bikes
  • Blue stations if there are several available bikes

In [95]:
color = lambda x: mpl.colors.to_hex(cmap(x))

In [89]:
# Lyon (France) Position
position = [45.750000, 4.850000]

In [106]:
mp_pred = folium.Map(location=position, zoom_start=13, tiles='cartodbpositron')

In [107]:
# Map of the predicted values
for _,row in yhat.iterrows():
    folium.CircleMarker(
        location=[row['lat'], row['lon']],
        radius=2,
        popup=row['nom'],
        color=color(row['prediction']),
        fill=True,
        fill_opacity=0.3,
        fill_color=color(row['prediction'])
    ).add_to(mp_pred)

In [102]:
mp_pred


Out[102]:

In [108]:
# Map for the observation
mp_obs = folium.Map(location=position, zoom_start=13, tiles='cartodbpositron')

In [109]:
# Map of the observations
for _,row in y.iterrows():
    folium.CircleMarker(
        location=[row['lat'], row['lon']],
        radius=2,
        popup=row['nom'],
        color=color(row['observation']),
        fill=True,
        fill_opacity=0.3,
        fill_color=color(row['observation'])
    ).add_to(mp_obs)

In [138]:
mp_obs


Out[138]:

In [153]:
# Colormap for error (by default, the color map fits for [0, 1] values)
norm = mpl.colors.Normalize(vmin=-1, vmax=1)
color_error = lambda x: mpl.colors.to_hex(cmap(norm(x)))

In [154]:
# Map for the errors
mp_error = folium.Map(location=position, zoom_start=13, tiles='cartodbpositron')

In [155]:
# Map of the errors
for _,row in error.iterrows():
    folium.CircleMarker(
        location=[row['lat'], row['lon']],
        radius=2,
        popup=row['nom'],
        color=color_error(row['error']),
        fill=True,
        fill_opacity=0.3,
        fill_color=color_error(row['error'])
    ).add_to(mp_error)

In [156]:
mp_error


Out[156]:

Stations lyon geo


In [23]:
station_lyon = pd.read_csv('data/lyon-stations.csv')

In [24]:
station_lyon.head()


Out[24]:
idstation nom lat lon
0 10027 Mairie de Villeurbanne 45.766831 4.879894
1 10030 Greuze 45.773844 4.893848
2 10034 MJC 45.761788 4.886157
3 10036 Chaplin / Dutriévoz 45.774357 4.859155
4 10038 Condorcet / 11 Nov. 1918 45.779046 4.866778

In [25]:
X = station_lyon[['lat', 'lon']].copy()

In [26]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

In [27]:
scaler = StandardScaler()
X_scale = scaler.fit_transform(X)
k_means = KMeans(init='k-means++', n_clusters=12).fit(X_scale)

In [28]:
station_lyon['station_cluster_geo'] = k_means.labels_

In [29]:
sns.lmplot(x="lat", y="lon", data=station_lyon, hue='station_cluster_geo', scatter=True, fit_reg=False, aspect=2)


Out[29]:
<seaborn.axisgrid.FacetGrid at 0x12e8b6dd8>

In [30]:
# export 
station_lyon.head()
station_lyon.rename(columns={'idstation':'station'}, inplace=True)
station_lyon = station_lyon[['station', 'station_cluster_geo']]
station_lyon.to_csv('data/station_cluster_geo_armand.csv', index=False)

In [ ]: