Bike-Sharing Clustering in Lyon

Time series data about the number of available bikes and bike stands for the city of Lyon (France), between 2017-07-11 and 2017-09-26.

Some Imports


In [1]:
%matplotlib inline

In [2]:
import numpy as np
import pandas as pd

In [3]:
from sklearn.cluster import KMeans

In [4]:
from matplotlib import pyplot as plt

In [5]:
import seaborn as sns
sns.set_context('talk')

Dependency Versions


In [6]:
%load_ext watermark

In [7]:
%watermark -d -v -p numpy,pandas,sklearn,matplotlib,seaborn -g -m -w


2018-02-05 

CPython 3.5.3
IPython 6.1.0

numpy 1.13.1
pandas 0.20.3
sklearn 0.19.0
matplotlib 2.0.2
seaborn 0.8.0

compiler   : GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)
system     : Darwin
release    : 17.4.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit
Git hash   : febf6e8141decfff5a420116759bff7d67cb4ffe
watermark 1.5.0

Load Lyon Data


In [8]:
DATA = "./data/lyon.csv"

In [9]:
raw = pd.read_csv(DATA, parse_dates=['last_update'])

In [10]:
raw.drop_duplicates(inplace=True)

In [11]:
raw = raw.sort_values(["number", "last_update"])

In [12]:
MIN_DATE = '2017-07-11'
MAX_DATE = '2017-09-26'

Some Cleaning


In [13]:
print(raw.shape)


(4569378, 9)

In [14]:
drop_columns = ["availability", "bonus", "status", "bike_stands", "available_bike_stands", "availabilitycode"]

In [15]:
data = (raw.copy()
        .query("last_update >= '{}' and last_update <= '{}'".format(MIN_DATE, MAX_DATE))
        .drop(drop_columns, axis=1)
        .rename(columns={"available_bikes": "bikes", "number": "station", "last_update": "ts"}))

In [16]:
data.head()


Out[16]:
station ts bikes
199806 1001 2017-07-11 00:05:17 12
200154 1001 2017-07-11 00:12:24 13
200502 1001 2017-07-11 00:17:19 12
201198 1001 2017-07-11 00:27:23 12
201546 1001 2017-07-11 00:31:09 13

Some stations look suspicious: their maximum number of available bikes over the whole period is 0.


In [17]:
max_bikes = data.groupby("station")["bikes"].max()
max_bikes


Out[17]:
station
1001     16
1002     23
1003     15
1005     10
1006     22
1012     20
1013     11
1016     17
1020     18
1021     11
1022     31
1023     20
1024     17
1031     20
1032     22
1034     16
1035     16
1036     17
2001     32
2002     20
2003     10
2004     18
2005     32
2006     14
2007     22
2008     18
2009     26
2010     22
2011     13
2012     16
         ..
10080    17
10083    15
10084    30
10086    10
10087    12
10088    19
10089    15
10091    16
10092    17
10101    34
10102    30
10103    30
10110    18
10111    16
10112    32
10113    22
10114    20
10115    22
10116     0
10117    22
10118    22
10119    25
10120    16
10121    16
10122    21
11001    18
11002    20
11003    20
12001    25
12002    18
Name: bikes, Length: 348, dtype: int64

In [18]:
wrong_stations = max_bikes[max_bikes == 0].index.tolist()
wrong_stations


Out[18]:
[2026, 2035, 3010, 3032, 3090, 8020, 10021, 10025, 10116]

In [19]:
well_station_mask = np.logical_not(data['station'].isin(wrong_stations))

In [20]:
data = data[well_station_mask]
print(data.shape)


(4416849, 3)
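
A quick sanity check (a sketch, not part of the original run) that the filtering worked as intended: all remaining timestamps fall inside the studied period and none of the discarded stations are left.

# Sanity check (sketch): dates inside [MIN_DATE, MAX_DATE] and no "wrong" stations left
assert data["ts"].between(MIN_DATE, MAX_DATE).all()
assert not data["station"].isin(wrong_stations).any()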

Time Resampling

Resample the data to 10-minute bins, averaging the measurements in each bin and back-filling missing values.


In [21]:
df = (data.set_index("ts")
      .groupby("station")["bikes"]
      .resample("10T")
      .mean()
      .bfill())

In [22]:
df.head(10)


Out[22]:
station  ts                 
1001     2017-07-11 00:00:00    12.000000
         2017-07-11 00:10:00    12.500000
         2017-07-11 00:20:00    12.000000
         2017-07-11 00:30:00    13.000000
         2017-07-11 00:40:00    13.000000
         2017-07-11 00:50:00    13.000000
         2017-07-11 01:00:00    12.000000
         2017-07-11 01:10:00    11.000000
         2017-07-11 01:20:00    10.666667
         2017-07-11 01:30:00    10.000000
Name: bikes, dtype: float64

In [23]:
df = df.unstack(0)

In [24]:
df['hour'] = df.index.hour

In [25]:
df.head()


Out[25]:
station 1001 1002 1003 1005 1006 1012 1013 1016 1020 1021 ... 10119 10120 10121 10122 11001 11002 11003 12001 12002 hour
ts
2017-07-11 00:00:00 12.0 6.0 15.0 1.0 4.5 19.0 10.5 3.0 1.0 1.0 ... 0.0 15.0 16.0 5.0 0.0 10.0 12.0 0.0 0.0 0
2017-07-11 00:10:00 12.5 7.0 15.0 1.0 5.0 19.0 10.0 3.0 1.0 2.0 ... 0.0 15.0 16.0 5.0 0.0 10.0 12.0 0.0 0.0 0
2017-07-11 00:20:00 12.0 5.5 15.0 1.0 5.0 19.0 10.0 3.0 1.0 2.0 ... 0.0 15.0 16.0 5.0 0.0 10.0 12.0 0.0 0.0 0
2017-07-11 00:30:00 13.0 6.5 15.0 1.0 5.0 18.0 10.0 3.0 1.0 2.0 ... 0.0 15.0 16.0 5.0 0.0 10.0 12.0 0.0 0.0 0
2017-07-11 00:40:00 13.0 8.5 15.0 1.0 5.0 18.0 10.0 3.0 1.0 3.0 ... 0.0 15.0 16.0 5.0 0.0 10.0 12.0 0.0 0.0 0

5 rows × 340 columns


In [26]:
profile = df.groupby("hour").mean()
profile.head()


Out[26]:
station 1001 1002 1003 1005 1006 1012 1013 1016 1020 1021 ... 10118 10119 10120 10121 10122 11001 11002 11003 12001 12002
hour
0 12.095238 12.022727 8.251804 5.725469 7.784632 13.549062 7.263709 10.433622 9.532107 5.524170 ... 13.767316 1.424242 11.105700 12.632395 9.186147 1.730519 11.711039 14.163420 2.810967 1.897186
1 9.803391 8.627345 7.364358 4.807720 8.012987 10.051587 6.724387 9.081890 6.038600 3.054113 ... 13.981602 1.542208 12.025974 13.230519 9.337662 1.967532 12.348124 14.208874 2.048341 1.431818
2 8.153319 5.669553 6.957071 4.011183 8.014069 7.668831 5.814214 7.712843 4.313492 1.892857 ... 14.230159 1.638528 12.268398 13.503247 9.462121 2.007576 12.829004 14.257937 2.513709 1.372294
3 6.759740 3.895743 6.477273 3.530664 7.915584 6.183622 5.227633 6.645382 3.146104 1.026696 ... 14.338745 1.685065 12.549784 13.645382 9.587662 2.019481 13.115440 14.316017 2.933983 1.304473
4 6.299062 2.962482 5.975108 2.797980 7.762987 5.418831 4.520202 5.929654 2.536436 0.889250 ... 14.313131 1.766234 12.800866 13.813853 9.649351 2.075758 13.330087 14.341991 3.641053 1.281385

5 rows × 339 columns
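
Before clustering, it can help to eyeball a few individual hourly profiles from the profile frame above (a minimal sketch; station ids 1001, 1006 and 2001 are arbitrary examples taken from the table).

# Plot the average hourly profile of a few example stations (sketch)
fig, ax = plt.subplots(figsize=(10, 6))
for station in [1001, 1006, 2001]:
    ax.plot(profile.index, profile[station], label=str(station))
ax.set_xlabel("Hour")
ax.set_ylabel("Average available bikes")
ax.legend(title="station")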

Clustering


In [36]:
n_clusters = 4

In [28]:
#df = df.fillna(method='bfill')

In [29]:
#df = df.fillna(method='ffill')

Normalization

Each station's hourly profile is divided by its own maximum, so that clustering compares the shape of the daily usage rather than the station capacity.


In [30]:
df_norm = profile / profile.max()

In [37]:
kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(df_norm.T)
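
Each fitted cluster centre is a 24-value normalized hourly profile; wrapping the centres in a DataFrame makes them easier to inspect (a small sketch, not part of the original notebook).

# Cluster centres as an (n_clusters x 24 hours) table (sketch)
centers = pd.DataFrame(kmeans.cluster_centers_, columns=df_norm.index)
centers.round(2)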

In [34]:
df_norm.T.head()


Out[34]:
hour 0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
station
1001 0.878248 0.711835 0.592021 0.490832 0.457382 0.450623 0.422464 0.359545 0.186557 0.151745 ... 0.426498 0.485200 0.578452 0.729306 0.876703 0.947873 0.966314 0.972312 1.000000 0.960525
1002 0.625143 0.448594 0.294798 0.202566 0.154040 0.123502 0.152464 0.198102 0.190130 0.258765 ... 0.535255 0.597419 0.661646 0.720302 0.810077 0.950179 0.993847 1.000000 0.960909 0.788805
1003 0.708546 0.632345 0.597373 0.556175 0.513056 0.495214 0.460769 0.333240 0.137193 0.111886 ... 0.181644 0.194808 0.239321 0.376421 0.585076 0.824118 0.960351 1.000000 0.969767 0.869746
1005 0.709160 0.595487 0.496828 0.437310 0.346559 0.304468 0.286372 0.265371 0.179580 0.199330 ... 0.490483 0.450089 0.542627 0.667560 0.862422 0.914388 0.994861 1.000000 0.992002 0.837265
1006 0.971371 0.999865 1.000000 0.987711 0.968670 0.981499 0.968175 0.870358 0.718524 0.585955 ... 0.567634 0.577133 0.559532 0.596219 0.600945 0.758857 0.842539 0.904434 0.936709 0.946973

5 rows × 24 columns


In [35]:
# Elbow plot: we choose 4 clusters
distortions = []
for i in range(1, 15):
    km = KMeans(random_state=0, n_clusters=i).fit(df_norm.T)
    distortions.append(km.inertia_)
plt.plot(range(1, 15), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()
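
As a complement to the elbow plot, a silhouette score can be computed for each candidate number of clusters (a sketch using the same random_state; higher is better).

# Silhouette score per candidate number of clusters (sketch)
from sklearn.metrics import silhouette_score

for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=0).fit(df_norm.T)
    print(k, round(silhouette_score(df_norm.T, km.labels_), 3))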



In [38]:
label = pd.Series(kmeans.labels_)

Number of stations for each label, i.e. for each usage pattern.


In [39]:
label.groupby(label).count()


Out[39]:
0     84
1    143
2     33
3     79
dtype: int64
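
Equivalently, the cluster sizes can be expressed as shares of the 339 stations (a small sketch):

# Share of stations in each cluster (sketch)
label.value_counts(normalize=True).sort_index().round(2)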

Colors for each cluster


In [40]:
colors = sns.color_palette('Set1', n_clusters)

In [41]:
sns.palplot(colors)


Daily profile


In [42]:
with sns.axes_style("darkgrid", {'xtick.major.size': 5.0}):
    fig, ax = plt.subplots(figsize=(10,6))

for center, cluster_id, color in zip(kmeans.cluster_centers_, range(n_clusters), colors):
    plt.plot(100 * center, color=color, label=cluster_id)

plt.legend()
plt.xlabel('Hour')
plt.ylabel('Available bikes (%)')
sns.despine()



In [43]:
len(kmeans.labels_)


Out[43]:
339

In [44]:
df_norm.T.shape


Out[44]:
(339, 24)

Extract cluster assignments


In [45]:
cluster_station = df_norm.T
cluster_station['cluster'] = kmeans.labels_
cluster_station.reset_index(inplace=True)
cluster_station = cluster_station[['station', 'cluster']]

In [52]:
cluster_station.head()


Out[52]:
hour station cluster
0 1001 2
1 1002 2
2 1003 2
3 1005 2
4 1006 0

In [54]:
cluster_station.station.nunique()


Out[54]:
339

In [146]:
cluster_station.to_csv('data/cluster_lyon_armand.csv', index=False)

Map

Station names and lat/lon coordinates.


In [64]:
locations = pd.read_csv("./data/lyon-stations.csv")

In [65]:
mask = np.logical_not(locations['idstation'].isin(wrong_stations))

In [66]:
locations = locations[mask]

In [67]:
locations.head()


Out[67]:
idstation nom lat lon
0 10027 Mairie de Villeurbanne 45.766831 4.879894
1 10030 Greuze 45.773844 4.893848
2 10034 MJC 45.761788 4.886157
3 10036 Chaplin / Dutriévoz 45.774357 4.859155
4 10038 Condorcet / 11 Nov. 1918 45.779046 4.866778

In [68]:
# one label per station (the df_norm columns are the station ids)
dflabel = pd.DataFrame({"label": kmeans.labels_}, index=df_norm.columns)

In [69]:
locations = locations.merge(dflabel, right_index=True, left_on='idstation')

In [70]:
locations.head()


Out[70]:
idstation nom lat lon label
0 10027 Mairie de Villeurbanne 45.766831 4.879894 2
1 10030 Greuze 45.773844 4.893848 2
2 10034 MJC 45.761788 4.886157 0
3 10036 Chaplin / Dutriévoz 45.774357 4.859155 1
4 10038 Condorcet / 11 Nov. 1918 45.779046 4.866778 1

In [71]:
locations["nom"] = locations['nom'].str.replace("'", "&apos;")

In [72]:
import folium

In [73]:
# Lyon (France) Position
position = [45.750000, 4.850000]

In [76]:
mp = folium.Map(location=position, zoom_start=13, tiles='cartodbpositron')
#mp = folium.Map(location=position, zoom_start=13)

In [74]:
hex_colors = colors.as_hex()

In [77]:
for _, row in locations.iterrows():
    folium.CircleMarker(
        location=[row['lat'], row['lon']],
        radius=5,
        popup=row['nom'],
        color=hex_colors[row['label']],
        fill=True,
        fill_opacity=0.7,
        fill_color=hex_colors[row['label']]
    ).add_to(mp)

In [78]:
mp.save("lyon-map-n_clusters-{}.html".format(n_clusters))

In [79]:
mp


Out[79]: