This notebook originally appeared as a post on the blog Pythonic Perambulations. The content is BSD licensed.
This week Pronto CycleShare, Seattle's Bicycle Share system, turned one year old. To celebrate this, Pronto made available a large cache of data from the first year of operation and announced the Pronto Cycle Share's Data Challenge, which offers prizes for different categories of analysis.
There are a lot of tools out there that you could use to analyze data like this, but my tool of choice is (obviously) Python. In this post, I want to show how you can get started analyzing this data and joining it with other available data sources using the PyData stack, namely NumPy, Pandas, Matplotlib, and Seaborn. Here I'll take a look at some of the basic questions you can answer with this data. Later I hope to find the time to dig deeper and ask some more interesting and creative questions – stay tuned!
For those who aren't familiar, this post is composed in the form of a Jupyter Notebook, which is an open document format that combines text, code, data, and graphics and is viewable through the web browser – if you have not used it before I encourage you to try it out! You can download the notebook containing this post here, open it with Jupyter, and start asking your own questions of the data.
We'll start by downloading the data (available on Pronto's Website) which you can do by uncommenting the following shell commands (the exclamation mark here is a special IPython syntax to run a shell command). The total download is about 70MB, and the unzipped files are around 900MB.
In [1]:
# !curl -O https://s3.amazonaws.com/pronto-data/open_data_year_one.zip
# !unzip open_data_year_one.zip
Next we need some standard Python package imports:
In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns; sns.set()
And now we load the trip data with Pandas:
In [3]:
trips = pd.read_csv('2015_trip_data.csv',
                    parse_dates=['starttime', 'stoptime'],
                    infer_datetime_format=True)
trips.head()
Out[3]:
Each row of this trip dataset is a single ride by a single person, and the data contains over 140,000 rows!
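As a quick sanity check (a minimal sketch, not part of the analysis itself), you can confirm the size of the table and the column names used throughout the rest of this post:
print(len(trips))              # should be well over 140,000
print(trips.columns.tolist())  # includes starttime, stoptime, usertype, tripduration, ...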
In [4]:
# Find the start date
ind = pd.DatetimeIndex(trips.starttime)
trips['date'] = ind.date.astype('datetime64')
trips['hour'] = ind.hour
In [5]:
# Count trips by date
by_date = trips.pivot_table('trip_id', aggfunc='count',
                            index='date',
                            columns='usertype')
In [6]:
fig, ax = plt.subplots(2, figsize=(16, 8))
fig.subplots_adjust(hspace=0.4)
by_date.iloc[:, 0].plot(ax=ax[0], title='Annual Members');
by_date.iloc[:, 1].plot(ax=ax[1], title='Day-Pass Users');
This plot shows the daily trend, separated by Annual members (top) and Day-Pass users (bottom). A couple of observations: both groups show a strong weekly oscillation in their ridership, and the pattern of use changes noticeably around January 2015.
Let's zoom in on this weekly trend by averaging all rides by day of week. Because of the change in pattern around January 2015, we'll split the data both by year and by day of week:
In [7]:
by_weekday = by_date.groupby([by_date.index.year,
                              by_date.index.dayofweek]).mean()
by_weekday.columns.name = None # remove label for plot
fig, ax = plt.subplots(1, 2, figsize=(16, 6), sharey=True)
by_weekday.loc[2014].plot(title='Average Use by Day of Week (2014)', ax=ax[0]);
by_weekday.loc[2015].plot(title='Average Use by Day of Week (2015)', ax=ax[1]);
for axi in ax:
    axi.set_xticklabels(['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'])
We see a complementary pattern overall: annual members tend to ride Monday through Friday (i.e. as part of a commute), while day-pass users tend to ride on the weekend. This pattern didn't fully develop until the start of 2015, however, especially for annual members: it seems that for the first couple of months, users had not yet adapted their commute habits to make use of Pronto!
It's also quite interesting to view the average hourly trips by weekday and weekend. This takes a bit of manipulation:
In [8]:
# count trips by date and by hour
by_hour = trips.pivot_table('trip_id', aggfunc='count',
                            index=['date', 'hour'],
                            columns='usertype').fillna(0).reset_index('hour')
# average these counts by weekend
by_hour['weekend'] = (by_hour.index.dayofweek >= 5)
by_hour = by_hour.groupby(['weekend', 'hour']).mean()
by_hour.index.set_levels([['weekday', 'weekend'],
                          ["{0}:00".format(i) for i in range(24)]],
                         inplace=True);
by_hour.columns.name = None
Now we can plot the results to see the hourly trends:
In [9]:
fig, ax = plt.subplots(1, 2, figsize=(16, 6), sharey=True)
by_hour.loc['weekday'].plot(title='Average Hourly Use (Mon-Fri)', ax=ax[0])
by_hour.loc['weekend'].plot(title='Average Hourly Use (Sat-Sun)', ax=ax[1])
ax[0].set_ylabel('Average Trips per Hour');
We see a clear difference between a "commute" pattern, which sharply peaks in the morning and evening (e.g. annual members during weekdays) and a "recreational" pattern, which has a broad peak in the early afternoon (e.g. annual members on weekends, and short-term users all the time). Interestingly, the average behavior of annual pass holders on weekends seems to be almost identical to the average behavior of day-pass users on weekdays!
For those who have read my previous posts, you might recognize this as very similar to the patterns I found in the Fremont Bridge bicycle counts.
Next let's take a look at the durations of trips. Pronto rides are designed to be up to 30 minutes; any single use that is longer than this incurs a usage fee of a couple dollars for the first half hour, and about ten dollars per hour thereafter.
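As a rough illustration of that pricing structure (the dollar amounts below are placeholders based on the approximate rates just described, not Pronto's official fee schedule), the overage charge for a single trip might be sketched like this:
def approximate_usage_fee(minutes, half_hour_fee=2.0, hourly_fee=10.0):
    """Sketch of the overage fee for one trip (illustrative rates only)."""
    if minutes <= 30:
        return 0.0            # within the free period
    elif minutes <= 60:
        return half_hour_fee  # a couple dollars for the first extra half hour
    else:
        # roughly ten dollars per hour beyond the first hour
        return half_hour_fee + hourly_fee * (minutes - 60) / 60.0
For example, a 75-minute ride would come out to roughly 2 + 10 * 15/60 = $4.50 under these assumed rates.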
Let's look at the distribution of trip durations for Annual members and short-term pass holders:
In [10]:
trips['minutes'] = trips.tripduration / 60
trips.groupby('usertype')['minutes'].hist(bins=np.arange(61), alpha=0.5, normed=True);
plt.xlabel('Duration (minutes)')
plt.ylabel('relative frequency')
plt.title('Trip Durations')
plt.text(34, 0.09, "Free Trips\n\nAdditional Fee", ha='right',
         size=18, rotation=90, alpha=0.5, color='red')
plt.legend(['Annual Members', 'Short-term Pass'])
plt.axvline(30, linestyle='--', color='red', alpha=0.3);
Here I have added a red dashed line separating the free rides (left) from the rides which incur a usage fee (right). Annual members appear to be much more savvy about the system rules: only a small tail of their trip distribution lies beyond 30 minutes. Around one in four day-pass rides, on the other hand, is longer than the half-hour limit and incurs additional fees. My hunch is that many of these day-pass users aren't fully aware of the pricing structure ("I paid for the day, right?") and likely walk away unhappy with the experience.
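To check that one-in-four figure directly rather than eyeballing the histogram, a quick computation over the minutes column we created above does the trick:
# fraction of rides exceeding the free half hour, by user type
(trips.minutes > 30).groupby(trips.usertype).mean()
This returns, for each user type, the fraction of rides longer than 30 minutes; expect a small value for annual members and something near 0.25 for short-term pass holders.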
It's also interesting to look at the distance of the trips. Distances between stations are not included in Pronto's data release, so we need to find them via another source. Let's start by loading the station data – and because some of the trips start and end at Pronto's shop, we'll add this as another "station":
In [11]:
stations = pd.read_csv('2015_station_data.csv')
pronto_shop = dict(id=54, name="Pronto shop",
                   terminal="Pronto shop",
                   lat=47.6173156, long=-122.3414776,
                   dockcount=100, online='10/13/2014')
stations = stations.append(pronto_shop, ignore_index=True)
Now we need to find bicycling distances between pairs of lat/lon coordinates. Fortunately, Google Maps has a Distance Matrix API that we can use for free.
Reading the fine print, free use is limited to 2500 distances per day, and 100 distances per 10 seconds. With 55 stations we have $55 \times 54 / 2 = 1485$ unique nonzero distances, so we can just query all of them within a few minutes on a single day for free (if we do it carefully).
To do this, we'll query one (partial) row at a time, waiting 10+ seconds between queries (Note: we might also use the googlemaps Python package instead, but it requires obtaining an API key).
In [12]:
import json
from time import sleep

def query_distances(stations=stations):
    """Query the Google API for bicycling distances"""
    latlon_list = ['{0},{1}'.format(lat, long)
                   for (lat, long) in zip(stations.lat, stations.long)]

    def create_url(i):
        URL = ('https://maps.googleapis.com/maps/api/distancematrix/json?'
               'origins={origins}&destinations={destinations}&mode=bicycling')
        return URL.format(origins=latlon_list[i],
                          destinations='|'.join(latlon_list[i + 1:]))

    for i in range(len(latlon_list) - 1):
        url = create_url(i)
        filename = "distances_{0}.json".format(stations.terminal.iloc[i])
        print(i, filename)
        !curl "{url}" -o {filename}
        sleep(11)  # stay under the one-query-per-10-seconds limit

def build_distance_matrix(stations=stations):
    """Build a matrix from the Google API results"""
    dist = np.zeros((len(stations), len(stations)), dtype=float)
    for i, term in enumerate(stations.terminal[:-1]):
        filename = 'queried_distances/distances_{0}.json'.format(term)
        row = json.load(open(filename))
        dist[i, i + 1:] = [el['distance']['value'] for el in row['rows'][0]['elements']]
    dist += dist.T
    distances = pd.DataFrame(dist, index=stations.terminal,
                             columns=stations.terminal)
    distances.to_csv('station_distances.csv')
    return distances
# only call this the first time
import os
if not os.path.exists('station_distances.csv'):
    # Note: you can call this function at most ~twice per day!
    query_distances()

    # Move all the queried files into a directory
    # so we don't accidentally overwrite them
    if not os.path.exists('queried_distances'):
        os.makedirs('queried_distances')
    !mv distances_*.json queried_distances

    # Build distance matrix and save to CSV
    distances = build_distance_matrix()
Here's what the first 5x5 section of the distance matrix looks like:
In [13]:
distances = pd.read_csv('station_distances.csv', index_col='terminal')
distances.iloc[:5, :5]
Out[13]:
Let's convert these distances to miles and join them to our trips data:
In [14]:
stacked = distances.stack() / 1609.34 # convert meters to miles
stacked.name = 'distance'
trips = trips.join(stacked, on=['from_station_id', 'to_station_id'])
Now we can plot the distribution of trip distances:
In [15]:
fig, ax = plt.subplots(figsize=(12, 4))
trips.groupby('usertype')['distance'].hist(bins=np.linspace(0, 6.99, 50),
                                           alpha=0.5, ax=ax);
plt.xlabel('Distance between start & end (miles)')
plt.ylabel('relative frequency')
plt.title('Minimum Distance of Trip')
plt.legend(['Annual Members', 'Short-term Pass']);
Keep in mind that this shows the shortest possible distance between stations, and thus is a lower bound on the actual distance ridden on each trip. Many trips (especially for day pass users) begin and end within a few blocks. Beyond this, trips peak at around 1 mile, though some extreme users are pushing their trips out to four or more miles.
In [16]:
trips['speed'] = trips.distance * 60 / trips.minutes
trips.groupby('usertype')['speed'].hist(bins=np.linspace(0, 15, 50), alpha=0.5, normed=True);
plt.xlabel('lower bound riding speed (MPH)')
plt.ylabel('relative frequency')
plt.title('Rider Speed Lower Bound (MPH)')
plt.legend(['Annual Members', 'Short-term Pass']);
Interestingly, the distributions are quite different, with annual riders showing on average a higher inferred speed. You might be tempted to conclude here that annual members ride faster than day-pass users, but the data alone aren't sufficient to support this conclusion. This data could also be explained if annual users tend to go from point A to point B by the most direct route, while day pass users tend to meander around and get to their destination indirectly. I suspect that the reality is some mix of these two effects.
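For a quick numerical summary of the two speed distributions (a sketch using the columns computed above), per-group statistics make the gap concrete:
# summary of the lower-bound riding speed by user type
trips.groupby('usertype')['speed'].describe()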
It is also informative to take a look at the relationship between distance and speed:
In [17]:
g = sns.FacetGrid(trips, col="usertype", hue='usertype', size=6)
g.map(plt.scatter, "distance", "speed", s=4, alpha=0.2)
# Add lines and labels
x = np.linspace(0, 10)
g.axes[0, 0].set_ylabel('Lower Bound Speed')
for ax in g.axes.flat:
    ax.set_xlabel('Lower Bound Distance')
    ax.plot(x, 2 * x, '--r', alpha=0.3)
    ax.text(9.8, 16.5, "Free Trips\n\nAdditional Fee", ha='right',
            size=18, rotation=39, alpha=0.5, color='red')
    ax.axis([0, 10, 0, 25])
Overall, we see that longer rides tend to be faster, though this is subject to the same lower-bound caveats as above. As before, I have plotted the line separating free trips (above the red line) from trips requiring an additional fee (below the red line); because free trips last at most half an hour, this boundary is simply speed = 2 × distance. Again we see that annual members are much more careful about staying under the half-hour limit than day-pass users: the sharp cutoff in the cloud of points suggests that they keep close track of their time to avoid an extra charge!
One oft-mentioned concern with the feasibility of bike share in Seattle is that it is a very hilly city – before the launch, armchair analysts predicted that there would be a steady flow of bikes from uphill to downhill, and that this would add up with other challenges to spell the demise of the system ("Sure, bikeshare works other places, but it can't work here: Seattle is special! We're just so special!")
Elevation data is not included in the data release, but again we can turn to the Google Maps API to get what we need; see this site for a description of the elevation API.
In this case the free-use limit is 2500 requests per day & 512 elevations per request. Since we need just 55 elevations, we can do it in a single query:
In [18]:
def get_station_elevations(stations):
    """Get station elevations via Google Maps API"""
    URL = "https://maps.googleapis.com/maps/api/elevation/json?locations="
    locs = '|'.join(['{0},{1}'.format(lat, long)
                     for (lat, long) in zip(stations.lat, stations.long)])
    URL += locs
    !curl "{URL}" -o elevations.json

def process_station_elevations():
    """Convert Elevations JSON output to CSV"""
    import json
    D = json.load(open('elevations.json'))

    def unnest(D):
        loc = D.pop('location')
        loc.update(D)
        return loc

    elevs = pd.DataFrame([unnest(item) for item in D['results']])
    elevs.to_csv('station_elevations.csv')
    return elevs

# only run this the first time:
import os
if not os.path.exists('station_elevations.csv'):
    get_station_elevations(stations)
    process_station_elevations()
Now let's read in the elevation data:
In [19]:
elevs = pd.read_csv('station_elevations.csv', index_col=0)
elevs.head()
Out[19]:
Just to make ourselves feel better, we'll double check that the latitudes and longitudes match:
In [20]:
# double check that locations match
print(np.allclose(stations.long, elevs.lng))
print(np.allclose(stations.lat, elevs.lat))
Now we can join the elevations with the trip data by way of the station data:
In [21]:
stations['elevation'] = elevs['elevation']
elevs.index = stations['terminal']
trips['elevation_start'] = trips.join(elevs, on='from_station_id')['elevation']
trips['elevation_end'] = trips.join(elevs, on='to_station_id')['elevation']
trips['elevation_gain'] = trips['elevation_end'] - trips['elevation_start']
Let's take a look at the distribution of elevation gain by rider type:
In [22]:
g = sns.FacetGrid(trips, col="usertype", hue='usertype')
g.map(plt.hist, "elevation_gain", bins=np.arange(-145, 150, 10))
g.fig.set_figheight(6)
g.fig.set_figwidth(16);
# plot some lines to guide the eye
for lim in range(60, 150, 20):
    x = np.linspace(-lim, lim, 3)
    for ax in g.axes.flat:
        ax.fill(x, 100 * (lim - abs(x)),
                color='gray', alpha=0.1, zorder=0)
We have plotted some shading in the background to help guide the eye. Again, there is a big difference between annual members and short-term users: annual users show a definite preference for downhill trips (the left side of the distribution), while daily users don't show this as strongly, instead showing a preference for rides which start and end at the same elevation (i.e. the same station).
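One way to check the "start and end at the same station" observation directly is to compute the fraction of round trips per user type (a small sketch using the station ID columns):
# fraction of trips that begin and end at the same station, by user type
round_trip = (trips.from_station_id == trips.to_station_id)
round_trip.groupby(trips.usertype).mean()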
To make the effect of elevation change more quantitative, let's compute the numbers:
In [23]:
print("total downhill trips:", (trips.elevation_gain < 0).sum())
print("total uphill trips: ", (trips.elevation_gain > 0).sum())
We see that the first year had 30,000 more downhill trips than uphill trips – that's about 60% more. Given current usage levels, that means that Pronto staff must be shuttling an average of about 100 bikes per day from low-lying stations to higher-up stations.
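The "about 100 bikes per day" estimate is just the downhill/uphill imbalance divided by the number of days of operation; here is that arithmetic as a sketch:
# rough estimate of the daily rebalancing needed to counter the downhill bias
n_downhill = (trips.elevation_gain < 0).sum()
n_uphill = (trips.elevation_gain > 0).sum()
n_days = trips.date.nunique()
print((n_downhill - n_uphill) / n_days)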
One other dataset in the release is daily weather data for the year; let's load it and see which columns it contains:
In [24]:
weather = pd.read_csv('2015_weather_data.csv', index_col='Date', parse_dates=True)
weather.columns
Out[24]:
Let's join this weather data with the trip data:
In [25]:
by_date = trips.groupby(['date', 'usertype'])['trip_id'].count()
by_date.name = 'count'
by_date = by_date.reset_index('usertype').join(weather)
Now we can take a look at how the number of rides scales with both Temperature and Precipitation, splitting the data by weekday and weekend:
In [26]:
# add a flag indicating weekend
by_date['weekend'] = (by_date.index.dayofweek >= 5)
#----------------------------------------------------------------
# Plot Temperature Trend
g = sns.FacetGrid(by_date, col="weekend", hue='usertype', size=6)
g.map(sns.regplot, "Mean_Temperature_F", "count")
g.add_legend();
# do some formatting
g.axes[0, 0].set_title('')
g.axes[0, 1].set_title('')
g.axes[0, 0].text(0.05, 0.95, 'Monday - Friday', va='top', size=14,
                  transform=g.axes[0, 0].transAxes)
g.axes[0, 1].text(0.05, 0.95, 'Saturday - Sunday', va='top', size=14,
                  transform=g.axes[0, 1].transAxes)
g.fig.text(0.45, 1, "Trend With Temperature", ha='center', va='top', size=16);
#----------------------------------------------------------------
# Plot Precipitation
g = sns.FacetGrid(by_date, col="weekend", hue='usertype', size=6)
g.map(sns.regplot, "Precipitation_In ", "count")
g.add_legend();
# do some formatting
g.axes[0, 0].set_ylim(-50, 600);
g.axes[0, 0].set_title('')
g.axes[0, 1].set_title('')
g.axes[0, 0].text(0.95, 0.95, 'Monday - Friday', ha='right', va='top', size=14,
                  transform=g.axes[0, 0].transAxes)
g.axes[0, 1].text(0.95, 0.95, 'Saturday - Sunday', ha='right', va='top', size=14,
                  transform=g.axes[0, 1].transAxes)
g.fig.text(0.45, 1, "Trend With Precipitation", ha='center', va='top', size=16);
For both temperature and precipitation, we see trends in the direction we might expect (people ride more on warm, sunny days). But there are some interesting details: during the work week, annual and short-term users are about equally affected by the weather. On the weekend, however, annual members appear less influenced by weather, while short-term users appear more influenced. I don't have any good theories for why this would be the case – please let me know if you have any ideas!
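To put rough numbers on these visual trends, one could compute the correlation between daily ride counts and each weather variable within every group (a sketch; the column names, including the trailing space in the precipitation column, are those used above):
# correlation of daily counts with weather, by user type and weekend flag
for col in ['Mean_Temperature_F', 'Precipitation_In ']:
    corr = by_date.groupby(['usertype', 'weekend']).apply(
        lambda grp: grp['count'].corr(grp[col]))
    print(col)
    print(corr)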
There is a lot to glean from the above discussion, but I think there are several main takeaway points that we can learn from this look at the data:
- Annual members and day-pass users ride very differently: annual members largely follow a weekday commute pattern, while day-pass users ride mostly on weekends and in the early afternoon.
- Annual members are far better at staying within the 30-minute free period; roughly a quarter of day-pass rides run over the limit and incur extra fees.
- Downhill trips outnumber uphill trips by roughly 60%, which means Pronto staff must shuttle on the order of 100 bikes per day back uphill.
- Ridership rises with temperature and falls with precipitation, with the weather sensitivity of the two user types diverging on weekends.
There is much more we could do with this data, and I'm hoping to dig more into it over the next couple weeks to seek some more detailed insights, but I think this is a good start! If you're interested in entering the Pronto Data Challenge, feel free to use these scripts as a jumping-off point to answer your own questions about the data.
Thanks for reading!
This post was written entirely in the IPython notebook. You can [download](http://jakevdp.github.io/downloads/notebooks/ProntoData.ipynb) this notebook, or see a static view [here](http://nbviewer.ipython.org/url/jakevdp.github.io/downloads/notebooks/ProntoData.ipynb).