Hourly Distribution of Inbound and Outbound Trips

Some Divvy stations, particularly those heavily used by commuters, see significant imbalance in inbound and outbound trips throughout the day. Knowing the hourly usage pattern for a station allows for proactive rebalancing to prevent service disruption by not having available bikes when demand is high.

This notebook demonstrates the concat function of pandas which is useful for joining DataFrames that share the same index. It works similar to a join in SQL.

This data powers the time distribution area charts of the visualization.


In [1]:
from __future__ import print_function, division
import pandas as pd
import locale
import datetime

In [2]:
trips = pd.read_csv('../data/Divvy_Stations_Trips_2013/Divvy_Trips_2013.csv')
stations = pd.read_csv('../data/Divvy_Stations_Trips_2013/Divvy_Stations_2013.csv')
# Convert to numeric
trips.from_station_id = trips.from_station_id.convert_objects(convert_numeric=True)
trips.to_station_id = trips.to_station_id.convert_objects(convert_numeric=True)

# Convert trip duration to numeric
locale.setlocale(locale.LC_NUMERIC, '')
trips.tripduration = trips.tripduration.apply(locale.atof)

# Convert date columns to pandas datetime objects
trips.starttime = pd.to_datetime(trips.starttime)
trips.stoptime = pd.to_datetime(trips.stoptime)


/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py:1070: DtypeWarning: Columns (10) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)

In [3]:
# Create an hour column by getting the hour attribute from starttime
trips['hour'] = trips.starttime.apply(lambda x: x.time().hour)

In [4]:
# Create two groups, one for outbounf and one for inbound
# Group both by hour
from_group = trips.groupby(by=['from_station_name','hour'])['trip_id'].count()
to_group = trips.groupby(by=['to_station_name','hour'])['trip_id'].count()

In [5]:
# Merge together, pandas automatically joins on the shared index
hourly_frame = pd.concat([from_group, to_group], axis=1)

In [6]:
# Rename the columns
hourly_frame.index.names = ['station', 'hour']
hourly_frame.rename(columns={0:'fromtrips', 1:'totrips'}, inplace=True)

In [7]:
# Fill zeros for instances where one type of trip had a value
# but the other did not
hourly_frame.fillna(value=0, inplace=True)

In [8]:
# Reset the index to get the type of trip as a single column
# This enables easier filtering with D3.js in the visualization
hrly_stack = pd.DataFrame(hourly_frame.stack())

# Rename columns
hrly_stack.index.names = ['station', 'hour','type']
hrly_stack.rename(columns={0:'trips'},inplace=True)

# Divide by number of days in dataset to get daily counts
hrly_stack['trips'] = np.round(hrly_stack.trips / 187, decimals=1)

# See below for results of this step
# Before each type was a column, now combined into index
hrly_stack.head()


Out[8]:
trips
station hour type
900 W Harrison 0 fromtrips 0.1
totrips 0.0
1 fromtrips 0.1
totrips 0.1
2 fromtrips 0.0

5 rows × 1 columns


In [9]:
hrly_stack.to_csv("../data/hourly.csv")