Clustering Bike-Sharing Stations in Lyon

Timeseries data about the number of bikes and bike stands for the city of Lyon (France), between 2017-07-11 and 2017-09-26.

Some Imports


In [1]:
%matplotlib inline

In [2]:
import numpy as np
import pandas as pd

In [3]:
from sklearn.cluster import KMeans

In [4]:
from matplotlib import pyplot as plt

In [5]:
import seaborn as sns
sns.set_context('talk')

Dependency Versions


In [6]:
%load_ext watermark

In [7]:
%watermark -d -v -p numpy,pandas,sklearn,matplotlib,seaborn -g -m -w


2017-11-20 

CPython 3.5.2
IPython 6.2.1

numpy 1.13.3
pandas 0.20.3
sklearn 0.18.1
matplotlib 2.0.2
seaborn 0.8.1

compiler   : GCC 5.4.0 20160609
system     : Linux
release    : 4.4.0-98-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 4
interpreter: 64bit
Git hash   : 213e0344fc447e7cc4c6119eec239ca29346e680
watermark 1.5.0

Load Lyon Data


In [9]:
DATA = "../data/lyon.csv"

In [10]:
raw = pd.read_csv(DATA, parse_dates=['last_update'])

In [11]:
raw.drop_duplicates(inplace=True)

In [12]:
raw = raw.sort_values(["number", "last_update"])

In [13]:
MIN_DATE = '2017-07-11'
MAX_DATE = '2017-09-26'

Some Cleaning


In [14]:
print(raw.shape)


(4569378, 9)

In [15]:
drop_columns = ["availability", "bonus", "status", "bike_stands", "available_bike_stands", "availabilitycode"]

In [16]:
data = (raw.copy()
        .query("last_update >= '{}' and last_update <= '{}'".format(MIN_DATE, MAX_DATE))
        .drop(drop_columns, axis=1)
        .rename(columns={"available_bikes": "bikes", "number": "station", "last_update": "ts"}))

In [17]:
data.head()


Out[17]:
station ts bikes
199806 1001 2017-07-11 00:05:17 12
200154 1001 2017-07-11 00:12:24 13
200502 1001 2017-07-11 00:17:19 12
201198 1001 2017-07-11 00:27:23 12
201546 1001 2017-07-11 00:31:09 13

Some stations look suspicious: their maximum number of available bikes over the whole period is 0. Find and remove them.


In [19]:
max_bikes = data.groupby("station")["bikes"].max()
max_bikes


Out[19]:
station
1001     16
1002     23
1003     15
1005     10
1006     22
1012     20
1013     11
1016     17
1020     18
1021     11
1022     31
1023     20
1024     17
1031     20
1032     22
1034     16
1035     16
1036     17
2001     32
2002     20
2003     10
2004     18
2005     32
2006     14
2007     22
2008     18
2009     26
2010     22
2011     13
2012     16
         ..
10080    17
10083    15
10084    30
10086    10
10087    12
10088    19
10089    15
10091    16
10092    17
10101    34
10102    30
10103    30
10110    18
10111    16
10112    32
10113    22
10114    20
10115    22
10116     0
10117    22
10118    22
10119    25
10120    16
10121    16
10122    21
11001    18
11002    20
11003    20
12001    25
12002    18
Name: bikes, Length: 348, dtype: int64

In [20]:
wrong_stations = max_bikes[max_bikes == 0].index.tolist()
wrong_stations


Out[20]:
[2026, 2035, 3010, 3032, 3090, 8020, 10021, 10025, 10116]

In [21]:
well_station_mask = ~data['station'].isin(wrong_stations)

In [22]:
data = data[well_station_mask]
print(data.shape)


(4416849, 3)

Time Resampling

Resample the data to one value per station every 5 minutes.


In [23]:
df = (data.set_index("ts")
      .groupby("station")["bikes"]
      .resample("5T")
      .mean()
      .bfill())
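
As a quick illustration of what this chain does, here is a self-contained toy example (with hypothetical timestamps mimicking the raw feed): irregular observations are averaged onto a regular 5-minute grid, and bfill() backfills any bin left empty.

# Toy series with irregular timestamps, as in the raw feed.
s = pd.Series([12, 13, 12],
              index=pd.to_datetime(["2017-07-11 00:05:17",
                                    "2017-07-11 00:12:24",
                                    "2017-07-11 00:17:19"]))
# Average onto a regular 5-minute grid; empty bins would be backfilled.
s.resample("5T").mean().bfill()
# 2017-07-11 00:05:00    12.0
# 2017-07-11 00:10:00    13.0
# 2017-07-11 00:15:00    12.0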

In [24]:
df.head(10)


Out[24]:
station  ts                 
1001     2017-07-11 00:05:00    12.0
         2017-07-11 00:10:00    13.0
         2017-07-11 00:15:00    12.0
         2017-07-11 00:20:00    12.0
         2017-07-11 00:25:00    12.0
         2017-07-11 00:30:00    13.0
         2017-07-11 00:35:00    13.0
         2017-07-11 00:40:00    13.0
         2017-07-11 00:45:00    13.0
         2017-07-11 00:50:00    13.0
Name: bikes, dtype: float64

In [25]:
df = df.unstack(0)

In [26]:
df.head()


Out[26]:
station 1001 1002 1003 1005 1006 1012 1013 1016 1020 1021 ... 10118 10119 10120 10121 10122 11001 11002 11003 12001 12002
ts
2017-07-11 00:00:00 NaN 6.0 15.0 NaN 4.0 NaN NaN 3.0 NaN NaN ... 21.0 0.0 NaN 16.0 NaN 0.0 10.0 12.0 NaN 0.0
2017-07-11 00:05:00 12.0 6.0 15.0 1.0 5.0 19.0 10.5 3.0 1.0 1.0 ... 21.0 0.0 15.0 16.0 5.0 0.0 10.0 12.0 0.0 0.0
2017-07-11 00:10:00 13.0 7.0 15.0 1.0 5.0 19.0 10.0 3.0 1.0 2.0 ... 21.0 0.0 15.0 16.0 5.0 0.0 10.0 12.0 0.0 0.0
2017-07-11 00:15:00 12.0 7.0 15.0 1.0 5.0 19.0 10.0 3.0 1.0 2.0 ... 21.0 0.0 15.0 16.0 5.0 0.0 10.0 12.0 0.0 0.0
2017-07-11 00:20:00 12.0 7.0 15.0 1.0 5.0 19.0 10.0 3.0 1.0 2.0 ... 21.0 0.0 15.0 16.0 5.0 0.0 10.0 12.0 0.0 0.0

5 rows × 339 columns

Daily Profile

Keep weekdays only: get rid of Saturdays and Sundays.


In [27]:
# weekday: 0 = Monday, ..., 6 = Sunday; keep Monday to Friday.
weekday = df.index.weekday
mask = weekday < 5

In [28]:
df = df[mask]

In [29]:
# Add the hour of day, then average each station over all weekdays.
df['hour'] = df.index.hour

In [30]:
df = df.groupby("hour").mean()

In [31]:
df.head()


Out[31]:
station 1001 1002 1003 1005 1006 1012 1013 1016 1020 1021 ... 10118 10119 10120 10121 10122 11001 11002 11003 12001 12002
hour
0 12.076631 11.878788 8.628788 5.750379 7.943939 13.265554 7.610774 10.513636 9.105463 4.952959 ... 13.681818 1.304545 11.389226 13.613636 9.081942 1.705303 12.339394 13.903788 2.295903 2.139394
1 10.571970 8.608333 8.380303 5.054545 8.107576 10.893182 7.221212 9.557576 6.308333 2.918939 ... 13.907576 1.387879 12.133333 13.934848 9.230303 2.003030 12.887121 13.878788 1.803030 1.696212
2 9.370455 6.311364 8.151515 4.311364 8.287121 8.960606 6.528030 8.517424 4.905303 1.760606 ... 14.127273 1.490909 12.250000 13.998485 9.412121 2.021212 13.186364 13.892424 2.218182 1.643939
3 8.350758 4.744697 7.722727 4.005303 8.345455 8.031061 6.053030 7.806061 3.887879 1.096212 ... 14.233333 1.489394 12.481818 13.936364 9.510606 2.080303 13.431818 13.898485 2.557576 1.581818
4 7.959848 3.763636 7.209848 3.461364 8.289394 7.259091 5.669697 7.300000 3.303030 1.050000 ... 14.218939 1.507576 12.660606 13.875000 9.520455 2.087121 13.550000 13.916667 3.231818 1.581818

5 rows × 339 columns

Clustering


In [32]:
n_clusters = 4


Normalization


In [33]:
# Scale each station's daily profile by its own maximum, so that all
# profiles lie in [0, 1] and stations of different sizes are comparable.
df_norm = df / df.max()

In [34]:
kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(df_norm.T)
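
The number of clusters is fixed to four above. A quick way to sanity-check that choice, not part of the original analysis, is an elbow plot of the K-means inertia for a few values of k; a minimal sketch reusing df_norm:

# Elbow check: within-cluster sum of squares for k = 2..8.
inertias = [KMeans(n_clusters=k, random_state=0).fit(df_norm.T).inertia_
            for k in range(2, 9)]
plt.plot(range(2, 9), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')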

In [35]:
label = pd.Series(kmeans.labels_)

Number of stations for each label, i.e. each usage pattern


In [36]:
label.groupby(label).count()


Out[36]:
0     84
1     81
2    134
3     40
dtype: int64

Colors for each cluster


In [59]:
colors = sns.color_palette('Set1', n_clusters)

In [60]:
sns.palplot(colors)


Cluster daily profiles
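
Each row of kmeans.cluster_centers_ is a cluster centroid, i.e. a normalized 24-hour usage profile (one value per hour). Save the centroids for later reuse, then plot them.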


In [75]:
pd.DataFrame(kmeans.cluster_centers_).to_csv("../data/lyon_clusters.csv", index=False)

In [69]:
with sns.axes_style("darkgrid", {'xtick.major.size': 8.0}):
    fig, ax = plt.subplots(figsize=(10, 6))

# One curve per cluster centroid, converted to percentages.
for center, cluster, color in zip(kmeans.cluster_centers_, range(n_clusters), colors):
    plt.plot(100 * center, color=color, label=cluster)

plt.legend()
plt.xlabel('Hour')
plt.xticks(np.linspace(0, 24, 13))
plt.yticks(np.linspace(0, 100, 11))
plt.ylabel('Available bikes (%)')
sns.despine()
plt.savefig("../images/lyon-pattern.png")


Map

Station names and lat/lon coordinates.


In [41]:
locations = pd.read_csv("../data/lyon-stations.csv")

In [42]:
mask = ~locations['idstation'].isin(wrong_stations)

In [43]:
locations = locations[mask]

In [44]:
locations.head()


Out[44]:
idstation nom lat lon
0 10027 Mairie de Villeurbanne 45.766831 4.879894
1 10030 Greuze 45.773844 4.893848
2 10034 MJC 45.761788 4.886157
3 10036 Chaplin / Dutriévoz 45.774357 4.859155
4 10038 Condorcet / 11 Nov. 1918 45.779046 4.866778

In [45]:
dflabel = pd.DataFrame({"label": kmeans.labels_}, index=df.columns)

In [46]:
# Attach the cluster label of each station (dflabel is indexed by station id).
locations = locations.merge(dflabel, right_index=True, left_on='idstation')

In [47]:
locations.head()


Out[47]:
idstation nom lat lon label
0 10027 Mairie de Villeurbanne 45.766831 4.879894 2
1 10030 Greuze 45.773844 4.893848 2
2 10034 MJC 45.761788 4.886157 0
3 10036 Chaplin / Dutriévoz 45.774357 4.859155 1
4 10038 Condorcet / 11 Nov. 1918 45.779046 4.866778 1

In [48]:
locations["nom"] = locations['nom'].str.replace("'", "&apos;")

In [49]:
import folium

In [50]:
# Lyon (France) Position
position = [45.750000, 4.850000]

In [51]:
mp = folium.Map(location=position, zoom_start=13, tiles='cartodbpositron')

In [52]:
hex_colors = colors.as_hex()

In [70]:
# One circle marker per station, colored by its cluster label.
for _, row in locations.iterrows():
    folium.CircleMarker(
        location=[row['lat'], row['lon']],
        radius=5,
        popup=row['nom'],
        color=hex_colors[row['label']],
        fill=True,
        fill_opacity=0.5,
        fill_color=hex_colors[row['label']]
    ).add_to(mp)

In [71]:
mp.save("../images/lyon-map-n_clusters-{}.html".format(n_clusters))

In [72]:
mp


Out[72]:
(Interactive Folium map of the stations, colored by cluster.)