Exploring the Divvy challenge dataset

Ideas

  • Customer vs subscriber behavior and trends
  • Weather impact
  • Time-of-day and weekday trends
  • Most popular routes
  • Most popular destinations from a given station
  • Demographics of subscribers
  • Time limit overage instances
  • Estimated distances traveled

In [1]:
import pandas as pd
import numpy as np
import locale
import datetime
import math
from pylab import *

In [2]:
# Load data
trips = pd.read_csv('Divvy_Stations_Trips_2013/Divvy_Trips_2013.csv')
stations = pd.read_csv('Divvy_Stations_Trips_2013/Divvy_Stations_2013.csv')

In [3]:
trips.tail()


Out[3]:
trip_id starttime stoptime bikeid tripduration from_station_id from_station_name to_station_id to_station_name usertype gender birthyear
759783 1109257 12/31/2013 21:12 12/31/2013 21:16 1917 284 154 Sheffield Ave & Kingsbury St 190 Sheffield Ave & Willow St Subscriber Female 1983
759784 1109276 12/31/2013 21:36 12/31/2013 22:01 2859 1,493 235 Southport Ave & Waveland Ave 143 Damen Ave & Melrose Ave Subscriber Female 1978
759785 1109283 12/31/2013 22:03 12/31/2013 22:13 198 650 340 Michigan Ave & Jackson Blvd 1 Michigan Ave & Washington St Subscriber Female 1976
759786 1109309 12/31/2013 22:10 12/31/2013 22:16 2048 346 136 Damen Ave & Division St 65 Damen Ave & Pierce Ave Subscriber Female 1976
759787 1109338 12/31/2013 22:35 12/31/2013 22:49 2239 868 256 California Ave & Division St 65 Damen Ave & Pierce Ave Subscriber Female 1978

In [4]:
# Convert date columns to pandas datetime objects
trips.starttime = pd.to_datetime(trips.starttime)
trips.stoptime = pd.to_datetime(trips.stoptime)

In [5]:
# Convert trip duration to numeric
locale.setlocale(locale.LC_NUMERIC, '')
trips.tripduration = trips.tripduration.apply(locale.atof)

In [6]:
# Create an age field
trips['age'] = 2013 - trips.birthyear

In [10]:
# Join the lat/lon coordinates to the trips dataframe
trips_lat_lon = pd.merge(trips,stations,left_on='from_station_name',right_on='name')
trips_geo = pd.merge(trips_lat_lon,stations,left_on='to_station_name',right_on='name')

In [25]:
# Haversine distance formula
def gc_dist(lat1, lat2, lon1, lon2):
    return 2*math.asin(sqrt((sin((radians(lat1)-radians(lat2))/2))^2 + 
                       cos(radians(lat1))*cos(radians(lat2))*(sin((radians(lon1)-radians(lon2))/2))^2))*6371/1.6

# Manhattan distance formula
def manhattan_dist(lat1, lat2, lon1, lon2):
    return (abs(lat2 - lat1) + abs(lon2 - lon1)) * 111 / 1.6

In [27]:
dist_func = lambda x: manhattan_dist(x['latitude_x'], x['latitude_y'],
                                     x['longitude_x'], x['longitude_y'])

gc_dist_func = lambda x: gc_dist(x['latitude_x'], x['latitude_y'],
                                     x['longitude_x'], x['longitude_y'])

trips_geo['dist'] = trips_geo.apply(dist_func, axis=1)
#trips_geo['gcdist'] = trips_geo.apply(gc_dist_func, axis=1)

In [29]:
# Summary statistics
trips_geo.sort(columns=['dist'],ascending=False)[['from_station_name','to_station_name','dist','tripduration']].head()


Out[29]:
from_station_name to_station_name dist tripduration
701822 Lincoln Ave & Eastwood Ave Shore Drive & 55th St 19.176152 2491
145657 Ellis Ave & 58th St Lincoln Ave & Eastwood Ave 18.194287 1776
367801 Lincoln Ave & Eastwood Ave Ellis Ave & 58th St 18.194287 1893
367802 Lincoln Ave & Eastwood Ave Ellis Ave & 58th St 18.194287 1572
702127 Ravenswood Ave & Irving Park Rd Shore Drive & 55th St 17.530577 3678

In [30]:
# Find date range
2491 / 60


Out[30]:
41

Quick first impressions of the data

  • 760k total trips
  • Average trip duration is 20.5 minutes, median is 13.5 minutes
  • 75th percentile duration is about 22.5 minutes, still well under the 30 minute time limit
  • Large outliers obvious in duration and age
  • Median member age is 33
  • Data available for 188 days beginning June 27 and ending Dec 31

Stations and Routes


In [31]:
# Most popular station by trip origin
trip_origins = trips[['from_station_name']].groupby(by=['from_station_name']).count()
trip_origins.sort(columns=['from_station_name'],ascending=False)[0:10]


Out[31]:
from_station_name
from_station_name
Millennium Park 17272
Streeter Dr & Illinois St 16710
Lake Shore Dr & Monroe St 15673
Clinton St & Washington Blvd 14967
Michigan Ave & Oak St 13274
Museum Campus 12676
McClurg Ct & Illinois St 10847
Michigan Ave & Lake St 10760
Canal St & Jackson Blvd 10141
Franklin St & Jackson Blvd 9784

The 10 most popular stations are centered around the lakefront and major rail stations. Canal St and Jackson is the large Divvy station in front of Union Station, and Clinton St and Washington Blvd is next to Ogilvie Transportation Center. Franklin St and Jackson is the Station around the corner from Willis Tower.


In [32]:
# Repeat with an ascending sort to find the least popular stations
trip_origins.sort(columns=['from_station_name'])[0:10]


Out[32]:
from_station_name
from_station_name
Calumet Ave & 35th St 33
Cottage Grove Ave & 47th St 36
Western Ave & 24th St 38
Cottage Grove Ave & 43rd St 62
Halsted St & 37th St 100
Damen Ave & Coulter St 100
Indiana Ave & 40th St 122
Wallace Ave & 35th St 133
Ashland Ave & 21st St 143
State St & 29th St 149

The 10 least popular stations are all located on the southside. A quick glance at the Divvy station map (as of February 2014) shows that many of these stations are located at the periphery of the Divvy service area. Overall station density on the southside is much lower as well which reduces the utility of bikeshare.


In [33]:
# Find most popular station pairs
group = trips[['from_station_name','to_station_name','trip_id']].groupby(by=['from_station_name','to_station_name']).count()

In [34]:
group.sort(columns=['trip_id'],ascending=False).head()


Out[34]:
from_station_name to_station_name trip_id
from_station_name to_station_name
Lake Shore Dr & Monroe St Streeter Dr & Illinois St 2113 2113 2113
Lake Shore Dr & Monroe St 2101 2101 2101
Millennium Park Millennium Park 2001 2001 2001
Michigan Ave & Oak St Michigan Ave & Oak St 1853 1853 1853
Streeter Dr & Illinois St Streeter Dr & Illinois St 1568 1568 1568

Members vs Non-Members


In [35]:
trip_cust = trips[['usertype']].groupby(by=['usertype']).count()
trip_cust


Out[35]:
usertype
usertype
Customer 356752
Subscriber 403036

Subscribers took about 50,000 more trips than those who purchased day passes.


In [36]:
# Create date and time columns for time series analysis
trips['startdate'] = trips.starttime.apply(lambda x: x.date())
trips['starttimetime'] = trips.starttime.apply(lambda x: x.time())

In [37]:
trips_ts = trips[['startdate']].groupby(by=['startdate']).count()
trips_ts.rename(columns={'startdate':'trips'},inplace=True)
trips_ma = pd.rolling_mean(trips_ts.trips,7)

In [38]:
# Plot rides time series

ticks = np.arange(0,len(trips_ma),14)

fig = plt.figure(figsize=(7,5), dpi=80)
ax = fig.add_subplot(111)

lns = ax.plot(trips_ma,linewidth=2.5)
ax.set_xticks(ticks)
ax.set_xticklabels(trips_ma.index[ticks])

for label in ax.xaxis.get_ticklabels():
    # label is a Text instance
    label.set_rotation(45)
    
ax.set_title('Total Daily Trips, 7D Moving Avg')
    
show()


Trips really ramped up toward the end of July after about a month since Divvy's introduction and continued at high rates until mid-October when they fell off sharply and have declined since then.


In [39]:
trips_user_ts = trips[['usertype','startdate']].groupby(by=['startdate','usertype']).count()
trips_user_ts.rename(columns={'startdate':'trips','usertype':'trips2'},inplace=True)
trips_user_ts.reset_index(inplace=True)
#trips_user_ts.drop('trips2')
#trips_ma = pd.rolling_mean(trips_ts.trips,7)

In [40]:
trips_user_pivot = trips_user_ts.pivot(index='startdate',columns='usertype',values='trips')

In [41]:
# Plot customers vs subscriber trips
ticks = np.arange(0,len(trips_ma),14)
fig = plt.figure(figsize=(7,5))
ax = fig.add_subplot(111)

custs = ax.plot(pd.rolling_mean(trips_user_pivot.Customer, 7),linewidth=2.5)
subs = ax.plot(pd.rolling_mean(trips_user_pivot.Subscriber, 7), linewidth=2.5)

ax.set_xticks(ticks)
ax.set_xticklabels(trips_user_pivot.index[ticks])

for label in ax.xaxis.get_ticklabels():
    # label is a Text instance
    label.set_rotation(45)
    
ax.set_title("Customer and Subscriber Daily Trips, 7D Moving Avg")

show()


Day pass holder trips fell off sharply after the end of tourist season while subscriber trips have declined at a much slower rate with the onset of cold weather. During the holiday period in December subscribers were averaging several hundred trips per day in adversse conditions.


In [22]: