CASE - Bike count data
DS Data manipulation, analysis and visualisation in Python
December 2019. © 2016, Joris Van den Bossche and Stijn Van Hoey (jorisvandenbossche@gmail.com, stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons.
In this case study, we will make use of the freely available bike count data of the city of Ghent. At the Coupure Links, next to the Faculty of Bioscience Engineering, a counter keeps track of the number of passing cyclists in both directions.
Those data are available on the open data portal of the city: https://data.stad.gent/data/236
In [2]:
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
%matplotlib notebook
The data are available in CSV, JSON and XML format. We will make use of the CSV data. The link to download the data can be found on the webpage. For the first dataset, this is:
link = "https://datatank.stad.gent/4/mobiliteit/fietstellingencoupure.csv"
The size of the requested data set can be controlled by adding a limit parameter to the URL:
link = "https://datatank.stad.gent/4/mobiliteit/fietstellingencoupure.csv?limit=100000"
Those datasets contain the historical data of the bike counters, and consist of the following columns:
- datum is the date, in dd/mm/yy format
- tijd is the time of the day, in hh:mm format
- ri Centrum and ri Mariakerke are the counts at that point in time (counts between this timestamp and the previous one)
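To make the layout concrete, here is a minimal sketch of what the raw rows look like; the values below are made up, but the semicolon-separated structure and column names match the description above:

```python
from io import StringIO

import pandas as pd

# Hypothetical sample mimicking the raw CSV layout described above
sample = StringIO(
    "datum;tijd;ri Centrum;ri Mariakerke\n"
    "01/01/2014;00:00;5;3\n"
    "01/01/2014;00:15;2;1\n"
)
df_sample = pd.read_csv(sample, sep=';')
print(df_sample)
```

Note that datum and tijd are read in as plain strings; turning them into real datetimes is exactly what we do next.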
In [3]:
limit = 200000
link = "https://datatank.stad.gent/4/mobiliteit/fietstellingencoupure.csv?limit={}".format(limit)
In [4]:
df = pd.read_csv(link, sep=';')
In [5]:
df.head()
Out[5]:
In [6]:
df.tail()
Out[6]:
In [7]:
len(df)
Out[7]:
In [8]:
df.dtypes
Out[8]:
As explained above, the first and second columns (respectively datum and tijd) indicate the date and the time of day. To obtain a time series, we have to combine those two columns into one series of actual datetime values and set it as the index. We then drop the original columns (tip: check the drop method) and give the count columns clearer names (tip: check the rename method).
In [9]:
combined = df['datum'] + ' ' + df['tijd']
combined.head()
Out[9]:
In [10]:
df.index = pd.to_datetime(combined, format="%d/%m/%Y %H:%M")
In [11]:
df = df.drop(columns=['datum', 'tijd'])
In [12]:
df = df.rename(columns={'ri Centrum': 'direction_centre', 'ri Mariakerke':'direction_mariakerke'})
In [13]:
df.head()
Out[13]:
Having the data available with an interpreted datetime index makes time-aware plotting possible:
In [14]:
fig, ax = plt.subplots(figsize=(10, 6))
df.plot(colormap='coolwarm', ax=ax)
Out[14]:
When we just want to interpret the dates, without specifying how the dates are formatted, pandas makes a best-effort attempt:
In [14]:
%timeit -n 1 -r 1 pd.to_datetime(combined, dayfirst=True)
However, when we already know the format of the dates (and if this is consistent throughout the full dataset), we can use this information to interpret the dates:
In [15]:
%timeit pd.to_datetime(combined, format="%d/%m/%Y %H:%M")
Write a function process_bike_count_data(df) that performs the processing steps as done above for an input DataFrame and returns the updated DataFrame.
In [16]:
def process_bike_count_data(df):
"""
Process the provided dataframe: parse datetimes and rename columns.
"""
df.index = pd.to_datetime(df['datum'] + ' ' + df['tijd'], format="%d/%m/%Y %H:%M")
df = df.drop(columns=['datum', 'tijd'])
df = df.rename(columns={'ri Centrum': 'direction_centre', 'ri Mariakerke':'direction_mariakerke'})
return df
In [17]:
df_raw = pd.read_csv(link, sep=';')
df_preprocessed = process_bike_count_data(df_raw)
Having finished our data-collection step, we want to save this result as an interim data output of our small investigation. That way, we do not have to re-download all the files each time something goes wrong, but can restart from our interim step.
In [18]:
df_preprocessed.to_csv("bike_count_interim.csv")
We now have a cleaned-up dataset of the bike counts at Coupure Links. Next, we want to get an impression of the characteristics and properties of the data.
We read the data back in from the interim file (when you want to rerun the whole analysis on the updated online data, you would comment out this cell):
In [19]:
df = pd.read_csv("bike_count_interim.csv", index_col=0, parse_dates=True)
The number of bikers is counted in intervals of approximately 15 minutes. But let's check if this is indeed the case.
For this, we want to calculate the difference between each of the consecutive values of the index. We can use the Series.diff()
method:
In [20]:
pd.Series(df.index).diff()
Out[20]:
Again, the count of the possible intervals is of interest:
In [21]:
pd.Series(df.index).diff().value_counts()
Out[21]:
There are a few records that are not exactly 15 minutes apart. But since there are only a few of them, we will ignore this and keep them as such for this explorative study.
Bonus question: do you know where the values of -1 days +23:15:01 and 01:15:00 are coming from?
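One plausible explanation (a sketch with synthetic timestamps, not the actual counter data) is that the timestamps are logged in local time, so the switch to and from daylight saving time shifts the clock by one hour. Around the spring transition a 15-minute step becomes 15 min + 1 h = 01:15:00, and around the autumn transition it becomes 15 min - 1 h = -45 min, which pandas displays as -1 days +23:15:00 (the stray extra second in the data would then just be logging noise):

```python
import pandas as pd

# Spring forward: the local clock jumps from 02:00 straight to 03:00,
# so two "consecutive" 15-minute records are 1h15 apart.
spring = pd.to_datetime(["2014-03-30 01:45", "2014-03-30 03:00"])
print(pd.Series(spring).diff().iloc[-1])   # 0 days 01:15:00

# Fall back: the local clock moves from 03:00 back to 02:00,
# so the next record appears 45 minutes *before* the previous one.
autumn = pd.to_datetime(["2014-10-26 02:45", "2014-10-26 02:00"])
print(pd.Series(autumn).diff().iloc[-1])   # -1 days +23:15:00
```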
In [22]:
df.describe()
Out[22]:
Create a new Series df_both which contains the sum of the counts of both directions.
In [23]:
df_both = df.sum(axis=1)
df_both
Using df_both from the previous exercise, create a new Series df_quiet which contains only those intervals during which fewer than 5 cyclists passed in both directions combined.
In [24]:
df_quiet = df_both[df_both < 5]
In [25]:
df[(df['direction_centre'] < 3) | (df['direction_mariakerke'] < 3)]
Out[25]:
In [26]:
df.mean()
Out[26]:
In [27]:
df.resample('H').sum().mean()
Out[27]:
In [28]:
df['direction_centre'].nlargest(10)
# alternative:
# df['direction_centre'].sort_values(ascending=False).head(10)
Out[28]:
In [29]:
df_both = df.sum(axis=1)
In [30]:
df_daily = df_both.resample('D').sum()
In [31]:
df_daily.max()
Out[31]:
In [32]:
df_daily.nlargest(10)
Out[32]:
2013-06-05 was the first time more than 10,000 bikers passed on one day. Apparently, this was no coincidence... http://www.nieuwsblad.be/cnt/dmf20130605_022
In [33]:
df_monthly = df.resample('M').sum()
df_monthly.plot()
Out[33]:
In [34]:
df_hourly = df.resample('H').sum()
In [35]:
df_hourly.head()
Out[35]:
In [36]:
df_hourly['2014-01-01':'2014-01-20'].plot()
Out[36]:
Select the data of the night of New Year's Eve 2013-2014 into a variable newyear and plot this subset. Then use the rolling function (check the documentation of the function!) to smooth the data of this period and make a plot of the smoothed version.
In [37]:
newyear = df["2013-12-31 12:00:00": "2014-01-01 12:00:00"]
In [38]:
newyear.plot()
Out[38]:
In [39]:
newyear.rolling(10, center=True).mean().plot(linewidth=2)
Out[39]:
A more advanced usage of matplotlib to create a combined plot:
In [40]:
# A more in-detail plotting version of the graph.
fig, ax = plt.subplots()
newyear.plot(ax=ax, color=['LightGreen', 'LightBlue'], legend=False, rot=0)
newyear.rolling(10, center=True).mean().plot(linewidth=2, ax=ax, color=['DarkGreen', 'DarkBlue'], rot=0)
ax.set_xlabel('')
ax.set_ylabel('Cyclists count')
Out[40]:
Looking at the data in the above exercises, there clearly seems to be:
- a daily pattern
- a weekly pattern
- a yearly/seasonal pattern
Such patterns can easily be calculated and visualized in pandas using the DatetimeIndex attributes (such as weekday and hour) combined with groupby functionality. Below is a taste of the possibilities; we will learn more about this in the following notebooks:
Weekly pattern:
In [41]:
df_daily = df.resample('D').sum()
In [42]:
df_daily.groupby(df_daily.index.weekday).mean().plot(kind='bar')
Out[42]:
Daily pattern:
In [43]:
df_hourly.groupby(df_hourly.index.hour).mean().plot()
Out[43]:
So the daily pattern clearly differs between the two directions. In the morning more people go north, in the evening more people go south. The morning peak is also more condensed.
Monthly pattern
In [44]:
df_monthly = df.resample('M').sum()
In [45]:
from calendar import month_abbr
In [46]:
ax = df_monthly.groupby(df_monthly.index.month).mean().plot()
ax.set_ylim(0)
xlabels = ax.set_xticklabels(list(month_abbr)[0::2])  # too lazy to type the month names yourself...
Thanks to the city of Ghent for opening up their data.