Digging Into the Pronto Data Release

Pronto Data Challenge entry by Jake VanderPlas

In October, Seattle's bike-sharing service, Pronto, turned one year old and released a treasure trove of data on the 140,000 individual trips taken during its first year. Here I want to dig into this data and answer a few questions:

  • Many naysayers insist that Seattle is too cold, too wet, and too hilly for bicycling to take off. How do these elements actually affect users of the Pronto system?
  • What is the difference in Pronto usage by annual members and short-term users? How might Pronto evolve to be more useful to these groups?
  • How do Pronto trips compare to trips by other cyclists in the city? Can characteristics of Pronto use give us insight into deeper trends within the city?
  • Can we cleverly de-anonymize the data and learn about the usage patterns of individual members?

If you're interested in how the plots and figures below were created, I have made available all the Python code I used to run this analysis. For details, see the GitHub repository or follow the links below each figure.

Dataset Overview

The Pronto dataset catalogs 140,000 rides from October 2014 through October 2015, divided among Annual Members and Short-Term Pass Holders. As we will see, these two types of users have very different ride characteristics, and so we will start by splitting them and taking a look at the daily counts for each type of rider:

What stands out here is a strong weekly cycle for both Annual Members and Short-Term Pass Holders. From the inset plots we see that Annual member ride numbers generally peak mid-week, while the short-term ride numbers generally peak on weekends.

Although this observation mostly holds, there are several weeks during the year that buck this trend. One notable instance of this is the large spike in short-term ridership in mid-April: this is likely due to the American Planning Association national conference which was held in Seattle that week.
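For readers who want to reproduce the split-and-count step, here is a minimal sketch using only the standard library. It is not the code from the repository: the record layout (an ISO-formatted start time plus a rider-type string) and the function names are my own assumptions for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical trip records: (start time as ISO string, rider type).
trips = [
    ("2015-04-13T08:05:00", "Annual Member"),
    ("2015-04-13T17:40:00", "Annual Member"),
    ("2015-04-18T13:15:00", "Short-Term Pass Holder"),
    ("2015-04-18T14:02:00", "Short-Term Pass Holder"),
    ("2015-04-18T09:30:00", "Annual Member"),
]

def daily_counts(trips):
    """Count trips per (calendar date, rider type)."""
    counts = Counter()
    for start, rider_type in trips:
        day = datetime.fromisoformat(start).date()
        counts[(day, rider_type)] += 1
    return counts

def weekday_totals(counts):
    """Sum the daily counts per (rider type, day of week); Monday is 0."""
    totals = Counter()
    for (day, rider_type), n in counts.items():
        totals[(rider_type, day.weekday())] += n
    return totals

counts = daily_counts(trips)
totals = weekday_totals(counts)
```

Plotting `counts` over the year gives the daily series, while `totals` gives the kind of day-of-week summary shown in the inset plots.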

In addition to the predictable weekly rise and fall, there is a predictable daily rise and fall which we can see by plotting hourly trip counts averaged across all days. Here we will split not only by the type of user, but also by weekday versus weekend in order to illuminate the differences there:

This displays two distinct patterns of use within the data: a double-peaked "commute pattern" of Annual riders from Monday to Friday, and a broad single-peaked "recreational pattern" for the remaining rides. We will explore these hourly trends in more detail further below.
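The hourly averages described above can be sketched along the following lines. This is an illustration under my own assumptions (ISO start times, a rider-type string, and the hypothetical helper name `hourly_profile`), not the code behind the figure. Note that a day only counts toward the average for a group if that group had at least one trip on it.

```python
from collections import defaultdict
from datetime import datetime

def hourly_profile(trips):
    """Mean trips per hour of day, keyed by (rider type, 'weekday' or 'weekend')."""
    counts = defaultdict(lambda: [0] * 24)   # group -> total trips in each hour
    days = defaultdict(set)                  # group -> distinct dates with trips
    for start, rider_type in trips:
        t = datetime.fromisoformat(start)
        kind = "weekend" if t.weekday() >= 5 else "weekday"
        group = (rider_type, kind)
        counts[group][t.hour] += 1
        days[group].add(t.date())
    # Divide each hourly total by the number of days observed for that group.
    return {g: [c / len(days[g]) for c in hours] for g, hours in counts.items()}
```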

Speed and Distance

The dataset contains the duration of each trip, as well as the start and end points. By querying the Google Maps API for bicycling directions between each pair of stations, we can get an estimate of the distance ridden on each trip. We must keep in mind, however, that there's no guarantee that riders will go directly from point A to point B, so this distance estimate is effectively a lower bound on the true distance ridden. From this distance estimate we can compute a lower bound on the speed of each ride.
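The speed estimate itself is just the routed distance divided by the trip duration. A minimal sketch, assuming the distance is in miles and the duration in seconds (the units and the function name are my choices for illustration):

```python
def lower_bound_speed_mph(route_miles, duration_seconds):
    """Lower bound on average speed: shortest routed distance over actual trip time."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return route_miles * 3600 / duration_seconds
```

A round trip that starts and ends at the same station has a routed distance of zero, so it comes out with a speed of zero regardless of how far the rider actually went.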

The distribution of these speeds and distances over the year yields interesting insights into the aggregate behavior of annual and short-term users:

The left panel shows the distribution of trip durations. For annual members, the most common ride length is around 5 minutes, while short-term users' rides are two to three times longer. Annual members appear to be very savvy about the 30-minute free ride limit, with only a small number of their trips surpassing this and being subject to additional fees. Short-term users, on the other hand, either don't mind the extra cost of longer rides, or don't understand the intended use of the system, and frequently go longer than the 30-minute free limit. My hunch is that these short-term users aren't fully aware of this pricing structure ("I paid for the day, right?") and likely walk away unhappy with the experience. If I were advising Pronto, I'd recommend they do more to make sure day-pass users understand the pricing structure!
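To put a number on how often each group exceeds the free limit, one could compute the share of trips longer than 30 minutes per rider type. This is a small sketch with a hypothetical input layout (a dict mapping rider type to a list of durations in seconds):

```python
FREE_LIMIT_SECONDS = 30 * 60  # the 30-minute free ride limit

def share_over_limit(durations_by_type):
    """Fraction of trips strictly longer than the free limit, per rider type."""
    return {
        rider_type: sum(d > FREE_LIMIT_SECONDS for d in durations) / len(durations)
        for rider_type, durations in durations_by_type.items()
        if durations  # skip rider types with no trips
    }
```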

The right panel shows the distribution of lower-bound speed estimates. There is a spike at speed zero for both sets of users, indicating rides that start and stop at or near the same location. This is much more prevalent for short-term users, and probably indicates visitors using bikes to explore a neighborhood rather than to get from point A to point B. Beyond this, the distributions for annual and short-term users are quite different, with annual riders showing on average a higher estimated speed. You might be tempted to conclude here that annual members ride faster than day-pass users, but the data alone aren't sufficient to support this conclusion. This trend could also be explained if annual users tend to go from point A to point B by the most direct route, while day pass users tend to meander around and get to their destination indirectly. I suspect that the reality is some mix of these two effects.

We can see another interesting view of these data by plotting the speed and distance against each other:

Again, the red line shows the boundary between free trips and trips which incur an additional fee; we see the very sharp cutoff of annual users at this line. The sharpness of this cutoff suggests that many users plan their rides so as not to exceed the 30-minute limit, and that they would likely take longer rides if the policy were changed. Short-term use is less affected by the 30-minute cutoff, but as I suggested above, I believe this is due more to a misunderstanding of the usage policy than to users being willing to fork over extra cash.

Seattle's Challenges: Elevation and Weather

One oft-mentioned concern with the feasibility of bike share in Seattle is that it is a very hilly, cold, and rainy city – before Pronto's launch, armchair analysts predicted that nobody would ride when the weather is bad, and even in good weather all rides would just be downhill! This idea was usually brought up as an argument against the feasibility of the system within the city ("Sure, bikeshare works other places, but it can't work here: Seattle is special! We're just so special!") Let's take a look at ride trends with elevation and weather to see if this prediction was realized.

Elevation data is not included in the data release, but again we can turn to the Google Maps API to get what we need. The distribution of elevation changes over the year is shown below:

This shows that particularly with the annual members, downhill rides outnumber uphill rides by nearly a factor of 2! This is especially true for rides with an elevation change of greater than 50 meters or so (for reference, 50 meters is about the elevation difference between 11th & Pine in Capitol Hill and 2nd & Pine downtown). Of the 142,000 trips logged in Pronto's first year, there were about 80,000 total downhill trips and 50,000 total uphill trips, which means Pronto staff had to shuttle almost 100 bikes per day from low-lying stations to higher-elevation stations!
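The rebalancing estimate follows from simple arithmetic on the rounded trip counts quoted above. The sketch below assumes that every net downhill trip eventually requires one bike to be moved back uphill, which is my simplification rather than anything stated in the data:

```python
downhill_trips = 80_000   # approximate count from the text
uphill_trips = 50_000     # approximate count from the text
days_in_year = 365

net_downhill = downhill_trips - uphill_trips   # bikes that end up lower than they started
bikes_per_day = net_downhill / days_in_year

print(round(bikes_per_day))  # prints 82
```

With these rounded inputs the result is a bit over 80 bikes per day, the same order as the "almost 100" above; the unrounded counts would shift it somewhat.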

Next let's take a look at the trends with weather. We will look at the effect of temperature and precipitation, separating weekdays from weekends and annual users from short-term users:

The broad trends are exactly as one might expect: more people opt to ride their bicycles city-wide on warm, sunny days. One interesting feature here is seen in both the precipitation and temperature plots: on Mondays through Fridays, the slope of the trend line is about equal for Annual and Short-term users. But on weekends, the effect of weather on annual members gets weaker, while the effect of weather on short-term riders gets stronger. This suggests that the number of "opportunistic" riders — those who see a nice day and decide to go on a Pronto ride — is larger for Annual members during the week (perhaps when folks at work decide to grab a bike to head to lunch), and is larger for short-term members on the weekend (perhaps when tourists decide they'll explore by bike). Additionally, we see that on Monday to Friday, annual users essentially always outnumber short-term users, while on the weekends short-term users outnumber annual users as long as the weather is good.
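The "slope of the trend line" compared above is the slope of an ordinary least-squares fit of daily ride counts against temperature or precipitation. A minimal sketch of such a fit follows; the helper name is hypothetical and the example numbers are made up purely to show the call, not taken from the Pronto data:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up illustration: daily high temperature vs. rides on that day.
temps = [40, 50, 60, 70]
rides = [100, 150, 200, 250]
slope, intercept = fit_line(temps, rides)
```

Fitting one such line per group (annual or short-term, weekday or weekend) and comparing the slopes is what lets us say the weather effect is stronger or weaker for a given group.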

How does all this bode for Seattle's cycle share? The trends are just what anyone would have expected (people like to bike downhill on warm sunny days), and I suspect that for most readers, the extent to which the realization of these predictions condemns Pronto's model is most closely correlated with how you felt before seeing the data.

Comparing with the Fremont Bridge

Another interesting question we can answer is how the number of Pronto rides relates to the number of total bicycle trips in Seattle. The latter numbers are very difficult to pin down, but we do have a nice source of ridership data in the Fremont Bridge Bike Counter, which has been logging bicycle trips across the Fremont Bridge for the past three years. The figure below shows the ratio of daily Pronto trips to daily Fremont Bridge trips over the year:

We see that the ratio for annual members hovers at around 10% throughout the year: that is, each day, for every annual-member Pronto trip, there are about ten bicycle trips across the Fremont Bridge. This number is remarkably stable throughout the year, though the ratio dips slightly during the summer months. It would be interesting to track this ratio over more years of Pronto use: such data would be a good indication of trends in Pronto's share of total local bicycle traffic.
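Computing this ratio only requires aligning the two daily series by date. A small sketch, assuming each series is a dict from date to trip count (the layout and function name are hypothetical):

```python
def daily_ratio(pronto_counts, fremont_counts):
    """Pronto trips divided by Fremont Bridge trips, for days present in both series."""
    return {
        day: pronto_counts[day] / fremont_counts[day]
        for day in pronto_counts
        if fremont_counts.get(day, 0) > 0  # skip days with no bridge count
    }
```

Days missing from the bridge counter are dropped rather than treated as zero, so that a sensor outage does not show up as a spike in the ratio.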

Data Summary

The above views of the data paint an interesting picture regarding the use of Pronto, and I see several main takeaway points:

  • Annual Members and Day Pass users show markedly different behavior in aggregate: annual members seem to use Pronto mostly for commuting from point A to point B on Monday-Friday, while short-term users use Pronto primarily on weekends to explore particular areas of town.
  • While annual members seem savvy to the pricing structure, one out of four short-term-pass rides exceeds the half-hour limit and incurs an additional usage fee. For the sake of the customer, Pronto should probably make an effort to better inform short-term users of this pricing structure.
  • Elevation and weather affect use just as you would expect: among annual members, there are nearly twice as many downhill trips as uphill trips, and cold & rain significantly decrease the number of rides on a given day. The effect of weather on Pronto ride numbers over the course of the year is comparable to that seen for riders crossing the Fremont Bridge.
  • Pronto's share of bicycle trips in the city has been relatively steady since day 1. This is interesting because we might expect the number of Pronto trips to have grown as the year went on, but it seems that adjusted for weather, the number of trips has been basically constant.

Diving Deeper with Machine Learning

With this basic understanding of what is in the data, we can now go on to ask some questions that require some slightly more sophisticated modeling of the data. Here I'll dig into the data a bit to learn about the behavior of Pronto users, both in aggregate and at the individual level.

What Days do Pronto Users Work?

We have found above that there are distinct differences in the hourly ride counts between annual and short-term users, and between weekdays and weekends. One way we can explore this more deeply is to use unsupervised machine learning approaches to try to discover structure in these hourly trends. What I'm going to do here is a bit abstract, but bear with me: each day has 24 hours, and we can count the number of rides over the course of a day to get a 24-component vector which "represents" the rides on that day. In this way, each of the days of the year can be viewed as a single point in a 24-dimensional space, and from there we can begin to ask questions about the relationships between days, using these 24-dimensional points as a proxy.
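To make the "one day, one 24-component vector" idea concrete, here is a minimal sketch that builds those vectors from trip start times. It assumes ISO-formatted timestamps, and the function name is my own:

```python
from collections import defaultdict
from datetime import datetime

def day_vectors(start_times):
    """Map each calendar date to a list of 24 hourly trip counts."""
    vectors = defaultdict(lambda: [0] * 24)
    for start in start_times:
        t = datetime.fromisoformat(start)
        vectors[t.date()][t.hour] += 1
    return dict(vectors)
```

Each value in the returned dict is one point in the 24-dimensional space described above.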

Now, humans are very good at visualizing two-dimensional or three-dimensional data: the plots above are mostly two-dimensional (plotting "x" values vs. "y" values), but as the dimension grows such visualization becomes increasingly difficult. To gain an intuition about high-dimensional data, scientists often make use of what are known as dimensionality reduction algorithms. In this case, we would like to reduce the dimensions of the data from 24 to 2, while maintaining some reflection of the data structure.

A very common method for such dimensionality reduction is Principal Component Analysis, which is a way of automatically rotating and stretching high-dimensional data to create a suitable low-dimensional projection which preserves important relationships. Applying such an analysis to the Pronto hourly data over the course of the year yields the following representation of the data, where we color the points by total daily rides:

Again, these two dimensions are chosen automatically because they best preserve important relationships between the higher-dimensional points. What's notable here is that there are two distinct oblong clusters within the data, and that the more rides there are in a given day, the more the clusters diverge.
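For the curious, here is a dependency-free sketch of what a two-component PCA does under the hood: center the data, form the covariance matrix, and extract its two leading eigenvectors (here by power iteration with deflation, which assumes a clear gap between the leading eigenvalues). In practice one would normally reach for a library implementation; all function names and the synthetic hourly profiles below are made up for illustration and are not the Pronto data.

```python
import math
import random

def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def leading_eigenpair(m, iterations=500, seed=0):
    """Power iteration: dominant eigenvector and eigenvalue of a symmetric matrix."""
    rng = random.Random(seed)
    v = [rng.random() for _ in m]
    for _ in range(iterations):
        w = matvec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, matvec(m, v)))
    return v, eigenvalue

def pca(rows, n_components=2):
    """Return (projected points, principal directions)."""
    centered = center(rows)
    cov = covariance(centered)
    components = []
    for k in range(n_components):
        v, lam = leading_eigenpair(cov, seed=k)
        components.append(v)
        # Deflation: remove the direction just found so the next one can emerge.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(len(v))]
               for i in range(len(v))]
    projected = [[sum(a * b for a, b in zip(r, c)) for c in components]
                 for r in centered]
    return projected, components

def synthetic_day(kind, volume, rng):
    """Made-up hourly counts: two sharp peaks for 'commute', one broad bump otherwise."""
    day = []
    for hour in range(24):
        if kind == "commute":
            shape = math.exp(-(hour - 8) ** 2 / 2) + math.exp(-(hour - 17) ** 2 / 2)
        else:
            shape = math.exp(-(hour - 14) ** 2 / 18)
        day.append(volume * shape + rng.random())
    return day

rng = random.Random(42)
days = [synthetic_day("commute", rng.uniform(50, 150), rng) for _ in range(20)]
days += [synthetic_day("recreation", rng.uniform(50, 150), rng) for _ in range(20)]
points_2d, components = pca(days)   # one (x, y) pair per day
```

Each entry of `points_2d` is the two-dimensional stand-in for one 24-dimensional day, which is what gets scattered in a plot like the one above.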

To see what these clusters actually represent, we might use another unsupervised machine learning method, a clustering algorithm (specifically a Gaussian Mixture Model) to identify groups within the 24-dimensional space and automatically assign cluster membership. After doing this, we can re-visualize the data with these cluster labels and show the hourly trends within each group:

We see that the pattern reflected in these two groups of points is exactly the commute/recreation split that we saw in the hourly data above. But we've gone a bit further than before: rather than deciding a priori to plot weekends vs. weekdays, we have created a model whereby we can automatically classify any day of the year as a "commute day" in red or a "recreation day" in purple.
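To give a feel for how a Gaussian Mixture Model assigns cluster membership, here is a deliberately simplified sketch: a two-component mixture fitted by expectation-maximization on a single number per day, rather than on the full 24-dimensional vectors used above. The choice of feature (any one-dimensional summary of a day) and the function names are my assumptions for illustration only.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_two_component_gmm(values, iterations=100, min_var=1e-6):
    """EM for a 1-D mixture of two Gaussians; returns (labels, means)."""
    n = len(values)
    grand_mean = sum(values) / n
    spread = max(sum((v - grand_mean) ** 2 for v in values) / n, min_var)
    means = [min(values), max(values)]   # start the two components at the extremes
    variances = [spread, spread]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for v in values:
            p = [weights[k] * normal_pdf(v, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            variances[k] = max(
                sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk,
                min_var,
            )
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return labels, means
```

The labels play the role of the red and purple cluster assignments above: each day goes to whichever Gaussian explains it best.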

It is interesting to look at how these commute/recreation days divide between annual and short-term users:

The results match our intuition from exploring the data above: the red "commute" cluster is made up entirely of annual riders, while the purple "recreational" cluster is a mix of annual and short-term riders (with one lone short-term day straying into the commute cluster – this is a day with a very low total ride count, where the random spikes in usage fool the clustering algorithm).

Our intuition is that "commute" patterns would happen from Monday to Friday, while "recreational" patterns would happen on the weekends, but the interesting thing is that some days buck this trend: these are weekdays which don't show a commute pattern. We show them as the red points on the following plot, and every single one of them corresponds to a holiday that falls on a weekday.

This tells us that as a group, Pronto users did not commute during Thanksgiving week, Christmas week, or New Year's week, nor on Memorial Day, Independence Day, Labor Day, or (surprisingly) Columbus Day.

Similarly, we can identify several federal holidays that lie in the "commute" cluster: these are the green points in the above plot. These are holidays on which Pronto users as a whole still went to work: Presidents Day, Veterans Day, and Martin Luther King Day.

On the other hand, when we look on the other side of the mismatch we find not a single weekend day with a commute pattern: apparently Pronto users really know how to take their days off. I find it immensely entertaining that from bicycling data alone, we can learn about the work habits of Seattle bicycle users!

Finding Pronto's Power-Users

The Pronto data release is almost entirely anonymized: that is, there is no "rider ID" that lets you track the behavior of an individual user over the course of the year. But as with most anonymized data, it is possible (if you are clever) to effectively de-anonymize some users using patterns within the data.

We have several pieces of information within the data that might help us identify individuals:

  • The birth year of the rider on each trip
  • The gender of the rider on each trip
  • The start and end points, which may be similar from day to day for any individual
  • The speed of riding, which probably is similar day to day for any individual
  • The start/end time of trips, which could show a pattern for any individual

Notice that the last three points are based on the idea that people tend to be creatures of habit: the question is, can we use these potential habits to identify individual frequent Pronto users? Let's start by taking a look at the number of trips as a function of distance between stations:

On the outskirts of the distribution (marked by the red box) is a point representing a pair of stations nearly seven miles apart, which saw nearly 100 trips between them over the course of the year — five to ten times more trips than other station pairs at similar distances. Could this be the work of an individual Pronto power-user?

Looking at the trips in detail, we see that the answer is a definitive yes! All but a few of these trips through the year are from a single user: a male born in 1979, who is riding from downtown to University Village in the morning, and returning in the evening:

Now, it is certainly the case that there could be more than one 36-year-old male who rides 6.8 miles at the same time each day, but this strikes me as very unlikely. When we find such a distinct, consistent pattern of use, we can be fairly confident that we have identified the rides of an individual person.

With this in mind, how many other individuals might we find? Doing this via a manual search would be a long and painful process: with over 50 stations, there are over 1200 unique station pairs and any number of riders of various ages and genders. So instead of combing through the data by hand, we can use another unsupervised algorithm to do the search automatically. We are again looking for clusters, but want an algorithm that is sensitive to overdensities within a noisy background: an algorithm known as Mean-Shift Clustering is a good candidate for this.
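To illustrate why mean-shift suits this problem, here is a one-dimensional sketch with a flat kernel: every point repeatedly moves to the mean of the points within one bandwidth of it, so points in a dense clump collapse onto a shared mode while isolated background points stay put. The real search works on more than one feature at a time; the single feature used here (trip start hour for one station pair), the bandwidth, and the function names are my assumptions for illustration.

```python
def mean_shift_1d(points, bandwidth, iterations=50):
    """Shift every point to the mean of its neighbours (flat kernel) until it settles."""
    modes = list(points)
    for _ in range(iterations):
        shifted = []
        for m in modes:
            neighbours = [p for p in points if abs(p - m) <= bandwidth]
            shifted.append(sum(neighbours) / len(neighbours) if neighbours else m)
        modes = shifted
    return modes

def densest_cluster(points, bandwidth):
    """Group the converged modes and return (center, size) of the largest group."""
    groups = []  # each entry: [center, member count]
    for m in mean_shift_1d(points, bandwidth):
        for g in groups:
            if abs(g[0] - m) <= bandwidth / 2:
                g[1] += 1
                break
        else:
            groups.append([m, 1])
    center, size = max(groups, key=lambda g: g[1])
    return center, size

# Made-up start hours for one station pair: a habitual 8 a.m. rider plus scattered trips.
start_hours = [8.0, 8.1, 7.9, 8.05, 7.95, 12.0, 15.5, 19.0]
center, size = densest_cluster(start_hours, bandwidth=0.5)
```

Unlike the mixture model used earlier, nothing here requires us to say in advance how many clusters there are, which is what makes the approach tolerant of a noisy background.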

After running the clustering model on the trips between each pair of stations, we will define a metric that measures how well we have localized an individual rider, rank the station pairs by this metric, and then only visually examine the dozen or so with the highest rankings. Once this ranking is computed, we find that the top match is the 36-year-old male we identified above!

To learn a bit more, let's look at when his rides occurred over the course of the year, and how his ride times changed as time went on:

We see that he rode from the beginning of July until mid-September, which aligns tightly with the summer quarter dates at UW. Perhaps this was a person who was working or studying at UW during the summer quarter, while living in Belltown, who decided that a Pronto membership was more economical than a bus pass. Over the course of the summer, his average ride time decreased by about a minute each direction, indicating that 13.5 daily miles on a Pronto bike helped him get in shape!

In the same way, the other top cluster results show similar patterns for other individual riders. We'll do the same plot for a few other top candidates:

This final one is the most impressive: this 25-year-old male used Pronto almost daily for the full year, and I'm willing to bet that elsewhere he rode the 18 further trips that would put him on the list of Top 20 Pronto Users.

Looking at the trends among all five of these identified individuals, we see a few commonalities:

  • Morning commute rides tend to be a few minutes faster than evening commute rides. Perhaps people are tired at the end of the day, or perhaps they are more relaxed when they don't need to get to work at a certain time.

  • In all these cases, rider speed increased gradually over time. For the more extreme riders, this improvement is as much as 10-15%, which indicates a fairly substantial improvement in health!

  • These are probably new(ish) riders. Riders who were already riding their own bike frequently before Pronto existed would not likely show such a significant improvement in fitness over time. So I think we can be fairly confident that these power-users are riding more frequently now with Pronto than they were before Pronto existed.

We could certainly go further and find less significant clusters, and probably identify dozens or hundreds more individuals based on their ridership patterns. We could also cross-match nearby stations to determine whether anyone is using, say, a different station on the way to work versus on the way home. We could compute aggregate statistics on how Pronto is improving health, based on this probabilistically de-anonymized data. The possibilities really are endless, but I've written enough here already!