Time series data about the number of bikes and bike stands in the city of Lyon (France), between 2017-07-11 and 2017-09-26.
In [1]:
%matplotlib inline
In [2]:
import numpy as np
import pandas as pd
In [3]:
from sklearn.cluster import KMeans
In [4]:
from matplotlib import pyplot as plt
In [5]:
import seaborn as sns
sns.set_context('talk')
In [6]:
%load_ext watermark
In [7]:
%watermark -d -v -p numpy,pandas,sklearn,matplotlib,seaborn -g -m -w
In [9]:
DATA = "../data/lyon.csv"
In [10]:
raw = pd.read_csv(DATA, parse_dates=['last_update'])
In [11]:
raw.drop_duplicates(inplace=True)
In [12]:
raw = raw.sort_values(["number", "last_update"])
In [13]:
MIN_DATE = '2017-07-11'
MAX_DATE = '2017-09-26'
In [14]:
print(raw.shape)
In [15]:
drop_columns = ["availability", "bonus", "status", "bike_stands", "available_bike_stands", "availabilitycode"]
In [16]:
data = (raw.copy()
           .query("last_update >= '{}' and last_update <= '{}'".format(MIN_DATE, MAX_DATE))
           .drop(drop_columns, axis=1)
           .rename(columns={"available_bikes": "bikes", "number": "station", "last_update": "ts"}))
In [17]:
data.head()
Out[17]:
Some stations look wrong: their maximum number of available bikes over the whole period is 0.
In [19]:
max_bikes = data.groupby("station")["bikes"].max()
max_bikes
Out[19]:
In [20]:
wrong_stations = max_bikes[max_bikes == 0].index.tolist()
wrong_stations
Out[20]:
In [21]:
# keep only the stations that reported at least one available bike
well_station_mask = np.logical_not(data['station'].isin(wrong_stations))
In [22]:
data = data[well_station_mask]
print(data.shape)
Resample the data to 5-minute periods.
In [23]:
# average the number of bikes over 5-minute periods for each station,
# back-filling the gaps left by the resampling
df = (data.set_index("ts")
          .groupby("station")["bikes"]
          .resample("5T")
          .mean()
          .bfill())
In [24]:
df.head(10)
Out[24]:
In [25]:
# pivot: one column per station, timestamps as index
df = df.unstack(0)
In [26]:
df.head()
Out[26]:
Get rid of Saturday and Sunday.
In [27]:
weekday = df.index.weekday  # Monday=0 ... Sunday=6
mask = weekday < 5          # keep weekdays only
In [28]:
df = df[mask]
In [29]:
df['hour'] = df.index.hour
In [30]:
# average weekday profile per station: one value per hour of the day
df = df.groupby("hour").mean()
In [31]:
df.head()
Out[31]:
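Each column now holds a station's average weekday profile over the 24 hours of the day. A quick sanity check on the shape (an addition, not part of the original run):
In [ ]:
# 24 hourly rows, one column per remaining station
assert df.shape[0] == 24
df.shape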
In [32]:
n_clusters = 4
Normalization
In [33]:
df_norm = df / df.max()
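Dividing by the per-station maximum scales every profile into [0, 1], so k-means compares the shapes of the daily patterns rather than the station capacities. A sanity check (not in the original notebook, and assuming the back-filling left no NaN):
In [ ]:
# each normalized station profile peaks at exactly 1.0
assert np.allclose(df_norm.max(), 1.0)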
In [34]:
kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(df_norm.T)
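n_clusters is fixed to 4 above. If one wanted to double-check that choice, a quick elbow inspection could look like this (a sketch, not part of the original analysis):
In [ ]:
# inertia for a range of cluster counts; look for the "elbow"
inertias = {k: KMeans(n_clusters=k, random_state=0).fit(df_norm.T).inertia_
            for k in range(2, 9)}
pd.Series(inertias).plot(marker='o')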
In [35]:
label = pd.Series(kmeans.labels_)
Number of stations for each label, i.e. for each usage pattern.
In [36]:
label.groupby(label).count()
Out[36]:
Colors for each cluster
In [59]:
colors = sns.color_palette('Set1', n_clusters)
In [60]:
sns.palplot(colors)
Daily profile
In [75]:
pd.DataFrame(kmeans.cluster_centers_).to_csv("../data/lyon_clusters.csv", index=False)
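Saving the centers means the daily profiles can be reused elsewhere without re-running the clustering; reading them back is a one-liner (a usage sketch, same path as above):
In [ ]:
centers = pd.read_csv("../data/lyon_clusters.csv")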
In [69]:
with sns.axes_style("darkgrid", {'xtick.major.size': 8.0}):
    fig, ax = plt.subplots(figsize=(10, 6))
    for center, cluster, color in zip(kmeans.cluster_centers_, range(n_clusters), colors):
        plt.plot(100 * center, color=color, label=cluster)
    plt.legend()
    plt.xlabel('Hour')
    plt.xticks(np.linspace(0, 24, 13))
    plt.yticks(np.linspace(0, 100, 11))
    plt.ylabel("Available bikes (%)")
    sns.despine()
    plt.savefig("../images/lyon-pattern.png")
Station names and lat/lon coordinates.
In [41]:
locations = pd.read_csv("../data/lyon-stations.csv")
In [42]:
mask = np.logical_not(locations['idstation'].isin(wrong_stations))
In [43]:
locations = locations[mask]
In [44]:
locations.head()
Out[44]:
In [45]:
dflabel = pd.DataFrame({"label": kmeans.labels_}, index=df.columns)
In [46]:
locations = locations.merge(dflabel, right_index=True, left_on='idstation')
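merge defaults to an inner join, so any clustered station missing from the locations file would be dropped silently. A check to make that assumption explicit (an addition, not in the original notebook):
In [ ]:
# every clustered station should have coordinates
assert len(locations) == len(dflabel)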
In [47]:
locations.head()
Out[47]:
In [48]:
locations["nom"] = locations['nom'].str.replace("'", "'")
In [49]:
import folium
In [50]:
# Lyon (France) Position
position = [45.750000, 4.850000]
In [51]:
mp = folium.Map(location=position, zoom_start=13, tiles='cartodbpositron')
#mp = folium.Map(location=position, zoom_start=13)
In [52]:
hex_colors = colors.as_hex()
In [70]:
for _, row in locations.iterrows():
    folium.CircleMarker(
        location=[row['lat'], row['lon']],
        radius=5,
        popup=row['nom'],
        color=hex_colors[row['label']],
        fill=True,
        fill_opacity=0.5,
        fill_color=hex_colors[row['label']]
    ).add_to(mp)
In [71]:
mp.save("../images/lyon-map-n_clusters-{}.html".format(n_clusters))
In [72]:
mp
Out[72]: