CASE - Bike count data

DS Data manipulation, analysis and visualisation in Python
December, 2019

© 2016, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons


In this case study, we will make use of the freely available bike count data of the city of Ghent. At the Coupure Links, next to the Faculty of Bioscience Engineering, a counter keeps track of the number of passing cyclists in both directions.

Those data are available on the open data portal of the city: https://data.stad.gent/data/236


In [2]:
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')

%matplotlib notebook

Reading and processing the data

Read csv data from URL

The data are available in CSV, JSON and XML format. We will make use of the CSV data. The link to download the data can be found on the webpage. For the first dataset, this is:

link = "https://datatank.stad.gent/4/mobiliteit/fietstellingencoupure.csv"

The size of the requested data set is defined by adding a limit parameter to the URL:

link = "https://datatank.stad.gent/4/mobiliteit/fietstellingencoupure.csv?limit=100000"

Those datasets contain the historical data of the bike counters, and consist of the following columns:

  • The first column datum is the date, in dd/mm/yyyy format
  • The second column tijd is the time of the day, in hh:mm format
  • The third and fourth columns, ri Centrum and ri Mariakerke, are the counts at that point in time (i.e. the counts between this timestamp and the previous one)

In [3]:
limit = 200000
link = "https://datatank.stad.gent/4/mobiliteit/fietstellingencoupure.csv?limit={}".format(limit)
EXERCISE:
  • Read the CSV file from the URL into a DataFrame `df`; the delimiter of the data is `;`
  • Inspect the first and last 5 rows, and check the number of observations
  • Inspect the data types of the different columns

In [4]:
df = pd.read_csv(link, sep=';')

In [5]:
df.head()


Out[5]:
datum tijd ri Centrum ri Mariakerke
0 14/07/2011 17:44 21 46
1 14/07/2011 17:59 17 60
2 14/07/2011 18:14 13 48
3 14/07/2011 18:29 22 48
4 14/07/2011 18:44 15 51

In [6]:
df.tail()


Out[6]:
datum tijd ri Centrum ri Mariakerke
191442 18/01/2017 22:54 16 32
191443 18/01/2017 23:09 8 22
191444 18/01/2017 23:24 10 29
191445 18/01/2017 23:39 11 31
191446 18/01/2017 23:54 12 19

In [7]:
len(df)


Out[7]:
191447

In [8]:
df.dtypes


Out[8]:
datum            object
tijd             object
ri Centrum        int64
ri Mariakerke     int64
dtype: object
Remark: If the download is very slow, consider resetting the limit variable to a lower value, as most exercises will work just as well with the first 100000 records.

Data processing

As explained above, the first and second columns (datum and tijd respectively) indicate the date and the hour of the day. To obtain a time series, we have to combine those two columns into one series of actual datetime values.

EXERCISE: Preprocess the data
  • Combine the 'datum' and 'tijd' columns into one Series of string datetime values (Hint: concatenating strings can be done with the addition operation)
  • Parse the string datetime values (Hint: specifying the format will make this a lot faster)
  • Set the resulting dates as the index
  • Remove the original 'datum' and 'tijd' columns (Hint: check the drop method)
  • Rename the 'ri Centrum', 'ri Mariakerke' to 'direction_centre', 'direction_mariakerke' (Hint: check the rename function)

In [9]:
combined = df['datum'] + ' ' + df['tijd']
combined.head()


Out[9]:
0    14/07/2011 17:44
1    14/07/2011 17:59
2    14/07/2011 18:14
3    14/07/2011 18:29
4    14/07/2011 18:44
dtype: object

In [10]:
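# parse the combined strings into datetime values; the format string matches e.g. "14/07/2011 17:44"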
df.index = pd.to_datetime(combined, format="%d/%m/%Y %H:%M")

In [11]:
df = df.drop(columns=['datum', 'tijd'])

In [12]:
df = df.rename(columns={'ri Centrum': 'direction_centre', 'ri Mariakerke':'direction_mariakerke'})

In [13]:
df.head()


Out[13]:
direction_centre direction_mariakerke
2011-07-14 17:44:00 21 46
2011-07-14 17:59:00 17 60
2011-07-14 18:14:00 13 48
2011-07-14 18:29:00 22 48
2011-07-14 18:44:00 15 51

Having the data available with an interpreted datetime index provides the possibility of time-aware plotting:


In [14]:
fig, ax = plt.subplots(figsize=(10, 6))
df.plot(colormap='coolwarm', ax=ax)


Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fbeb516e0d0>
Remark: Interpreting the dates with and without a predefined date format.

When we just want to interpret the dates without specifying how they are formatted, pandas makes a best-effort attempt:


In [14]:
%timeit -n 1 -r 1 pd.to_datetime(combined, dayfirst=True)


17.5 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

However, when we already know the format of the dates (and if this is consistent throughout the full dataset), we can use this information to interpret the dates:


In [15]:
%timeit pd.to_datetime(combined, format="%d/%m/%Y %H:%M")


500 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Remember: Whenever possible, specify the date format when parsing dates to datetime values!

Write the data set cleaning as a function

In order to make it easier to reuse the preprocessing code we have now implemented, let's convert it to a Python function.

EXERCISE:
  • Write a function process_bike_count_data(df) that performs the processing steps as done above for an input DataFrame and returns the updated DataFrame

In [16]:
def process_bike_count_data(df):
    """
    Process the provided dataframe: parse datetimes and rename columns.
    
    """
    df.index = pd.to_datetime(df['datum'] + ' ' + df['tijd'], format="%d/%m/%Y %H:%M")
    df = df.drop(columns=['datum', 'tijd'])
    df = df.rename(columns={'ri Centrum': 'direction_centre', 'ri Mariakerke':'direction_mariakerke'})
    return df

In [17]:
df_raw = pd.read_csv(link, sep=';')
df_preprocessed = process_bike_count_data(df_raw)

Store our collected dataset as an interim data product

As we have finished our data-collection step, we want to save this result as an interim data output of our small investigation. That way, we do not have to re-download all the files each time something goes wrong, but can restart from our interim step.


In [18]:
df_preprocessed.to_csv("bike_count_interim.csv")

Data exploration and analysis

We now have a cleaned-up dataset of the bike counts at Coupure Links. Next, we want to get an impression of the characteristics and properties of the data.

Load the interim data

Reading the data back in from the interim file (when you want to rerun the whole analysis on the updated online data, you would comment out this cell):


In [19]:
df = pd.read_csv("bike_count_interim.csv", index_col=0, parse_dates=True)

Count interval verification

The number of bikers is counted for intervals of approximately 15 minutes, but let's check whether this is indeed the case.
For this, we want to calculate the difference between each of the consecutive values of the index. We can use the Series.diff() method:


In [20]:
pd.Series(df.index).diff()


Out[20]:
0             NaT
1        00:15:00
2        00:15:00
3        00:15:00
4        00:15:00
           ...   
191442   00:15:00
191443   00:15:00
191444   00:15:00
191445   00:15:00
191446   00:15:00
Length: 191447, dtype: timedelta64[ns]

Next, the frequency of each of these interval values is of interest:


In [21]:
pd.Series(df.index).diff().value_counts()


Out[21]:
0 days 00:15:00      191356
0 days 00:16:00           9
0 days 00:09:00           6
-1 days +23:15:00         6
0 days 00:13:00           6
0 days 00:04:00           5
0 days 01:15:00           5
0 days 00:06:00           4
0 days 00:12:00           4
0 days 00:11:00           4
0 days 00:03:00           4
0 days 00:00:00           4
0 days 00:02:00           4
0 days 00:08:00           4
0 days 00:10:00           3
0 days 00:01:00           3
0 days 00:38:00           3
0 days 00:37:00           3
0 days 00:14:00           2
0 days 00:05:00           2
0 days 01:11:00           1
0 days 00:26:00           1
0 days 00:20:00           1
0 days 00:29:00           1
0 days 01:24:00           1
0 days 00:30:00           1
0 days 06:49:00           1
0 days 00:07:00           1
20 days 21:29:00          1
dtype: int64

There are a few intervals that are not exactly 15 min. But given that these are only a handful of records, we will ignore this and keep them as they are for this explorative case study.

Bonus question: do you know where the values of -1 days +23:15:00 and 01:15:00 come from?
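
If you want to investigate this yourself, here is a minimal sketch (an addition to the original notebook) that looks up the timestamps at which these two unusual jumps occur; the timedelta values come from the table above (-1 days +23:15:00 equals -45 minutes):

diffs = pd.Series(df.index).diff()
# flag the positions where the jump is -45 min or +1h15 instead of the usual 15 min
mask = (diffs == pd.Timedelta(minutes=-45)) | (diffs == pd.Timedelta(minutes=75))
# look up the corresponding timestamps in the index
df.index[mask.values]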


In [22]:
df.describe()


Out[22]:
direction_centre direction_mariakerke
count 191447.000000 191447.000000
mean 25.545237 25.338585
std 31.625619 26.903810
min 0.000000 0.000000
25% 5.000000 7.000000
50% 17.000000 17.000000
75% 34.000000 34.000000
max 361.000000 299.000000

Quiet periods

EXERCISE:
  • Create a new Series, df_both, which contains the sum of the counts of both directions

_Tip:_ check the purpose of the `axis` argument of the `sum` function

In [23]:
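# axis=1 sums over the columns, giving one value per row (i.e. per time interval)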
df_both = df.sum(axis=1)
EXERCISE:
  • Using the df_both from the previous exercise, create a new Series df_quiet which contains only those intervals for which less than 5 cyclists passed in both directions combined

In [24]:
df_quiet = df_both[df_both < 5]
EXERCISE:
  • Using the original data, select only the intervals for which less than 3 cyclists passed in one or the other direction. Hence, less than 3 cyclists towards the centre or less than 3 cyclists towards Mariakerke.

In [25]:
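# | is the element-wise OR; each comparison needs its own parentheses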
df[(df['direction_centre'] < 3) | (df['direction_mariakerke'] < 3)]


Out[25]:
direction_centre direction_mariakerke
2011-07-14 23:59:00 1 13
2011-07-15 00:14:00 2 3
2011-07-15 00:44:00 2 6
2011-07-15 00:59:00 0 4
2011-07-15 01:29:00 1 10
... ... ...
2017-01-18 06:09:00 2 2
2017-01-18 06:24:00 7 1
2017-01-18 06:39:00 2 2
2017-01-18 06:54:00 8 2
2017-01-18 07:09:00 9 2

38007 rows × 2 columns

Count statistics

EXERCISE:
  • What is the average number of bikers passing each 15 min?

In [26]:
df.mean()


Out[26]:
direction_centre        25.545237
direction_mariakerke    25.338585
dtype: float64
EXERCISE:
  • What is the average number of bikers passing each hour?
_Tip:_ you can use `resample` to first calculate the number of bikers passing each hour.

In [27]:
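# resample('H').sum() gives the total count per hour; the mean then averages over all hours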
df.resample('H').sum().mean()


Out[27]:
direction_centre        101.113548
direction_mariakerke    100.295573
dtype: float64
EXERCISE:
  • What are the 10 highest peak values observed during any of the intervals for the direction towards the centre of Ghent?

In [28]:
df['direction_centre'].nlargest(10)
# alternative:
# df['direction_centre'].sort_values(ascending=False).head(10)


Out[28]:
2013-10-16 21:38:00    361
2014-09-23 08:12:00    333
2016-10-06 09:16:00    325
2011-10-04 08:27:00    319
2013-10-03 08:49:00    318
2016-03-31 00:45:00    316
2016-10-13 09:16:00    316
2014-09-30 08:05:00    315
2013-09-23 08:34:00    314
2016-10-04 09:01:00    314
Name: direction_centre, dtype: int64
EXERCISE:
  • What is the maximum number of cyclists that passed on a single day, counting both directions combined?

In [29]:
df_both = df.sum(axis=1)

In [30]:
df_daily = df_both.resample('D').sum()

In [31]:
df_daily.max()


Out[31]:
10770

In [32]:
df_daily.nlargest(10)


Out[32]:
2016-09-29    10770
2016-09-28    10640
2016-09-27    10394
2014-11-05    10321
2016-10-04    10295
2012-10-30    10213
2016-10-05    10179
2016-10-06    10157
2013-06-05    10041
2016-10-13     9996
dtype: int64

On 2013-06-05, more than 10,000 bikers passed on a single day. Apparently, this was not just a coincidence... http://www.nieuwsblad.be/cnt/dmf20130605_022

EXERCISE:
  • What does the long-term trend look like? Calculate monthly sums and plot the result.

In [33]:
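# resample('M') aggregates to monthly (month-end) frequency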
df_monthly = df.resample('M').sum()
df_monthly.plot()


Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f61ff179cd0>
EXERCISE:
  • Let's have a look at some short-term patterns. For the data of the first 3 weeks of January 2014, calculate the hourly counts and visualize them.

In [34]:
df_hourly = df.resample('H').sum()

In [35]:
df_hourly.head()


Out[35]:
direction_centre direction_mariakerke
2011-07-14 17:00:00 38 106
2011-07-14 18:00:00 66 177
2011-07-14 19:00:00 62 92
2011-07-14 20:00:00 45 40
2011-07-14 21:00:00 28 30

In [36]:
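# slicing a DatetimeIndex with date strings selects that period (both endpoints included)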
df_hourly['2014-01-01':'2014-01-20'].plot()


Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f61fed60390>

New Year's Eve 2013-2014

EXERCISE:
  • Select a subset of the data set from 2013-12-31 12:00:00 until 2014-01-01 12:00:00, store it as the variable newyear, and plot this subset
  • Use a rolling function (check documentation of the function!) to smooth the data of this period and make a plot of the smoothed version

In [37]:
newyear = df["2013-12-31 12:00:00": "2014-01-01 12:00:00"]

In [38]:
newyear.plot()


Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f61ff0b87d0>

In [39]:
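# centered rolling mean over a window of 10 intervals (10 x 15 min = 2.5 hours)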
newyear.rolling(10, center=True).mean().plot(linewidth=2)


Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f61ff031750>

A more advanced usage of matplotlib to create a combined plot:


In [40]:
# A more in-detail plotting version of the graph.
fig, ax = plt.subplots()
newyear.plot(ax=ax, color=['LightGreen', 'LightBlue'], legend=False, rot=0)
newyear.rolling(10, center=True).mean().plot(linewidth=2, ax=ax, color=['DarkGreen', 'DarkBlue'], rot=0)

ax.set_xlabel('')
ax.set_ylabel('Cyclists count')


Out[40]:
Text(0, 0.5, 'Cyclists count')

The power of groupby...

Looking at the data in the above exercises, there clearly seems to be a:

  • daily pattern
  • weekly pattern
  • yearly pattern

Such patterns can easily be calculated and visualized in pandas by combining DatetimeIndex attributes such as weekday with the groupby functionality. Below is a taste of the possibilities; we will learn more about this in the following notebooks:

Weekly pattern:


In [41]:
df_daily = df.resample('D').sum()

In [42]:
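# index.weekday: Monday=0 through Sunday=6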
df_daily.groupby(df_daily.index.weekday).mean().plot(kind='bar')


Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f61fee93410>

Daily pattern:


In [43]:
df_hourly.groupby(df_hourly.index.hour).mean().plot()


Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f61fee629d0>

So the daily pattern is clearly different for both directions. In the morning more people go north, in the evening more people go south. The morning peak is also more condensed.

Monthly pattern


In [44]:
df_monthly = df.resample('M').sum()

In [45]:
from calendar import month_abbr

In [46]:
ax = df_monthly.groupby(df_monthly.index.month).mean().plot()
ax.set_ylim(0)
xlabels = ax.set_xticklabels(list(month_abbr)[0::2])  # too lazy to write the month labels yourself...


Acknowledgements

Thanks to the city of Ghent for opening up their data.