Station Summary Statistics

Computing summary statistics such as means and medians is one of the first tasks typically performed when analyzing a dataset. Pandas offers a variety of summary functions that can be applied to DataFrames to quickly gain insight into the shape of the data. Ranks are also computed for each metric in descending order, so the station with the most trips is ranked first relative to the other stations.

The merge function in pandas is used in this notebook to join the geo data from the stations file with the trips file. The geo data is an input to the distance-estimating function.

The data generated in this notebook powers the summary statistics and ranks in the visualization.


In [2]:
from __future__ import print_function, division
import pandas as pd
import locale
import datetime
import numpy as np

In [3]:
trips = pd.read_csv('../data/Divvy_Stations_Trips_2013/Divvy_Trips_2013.csv')
stations = pd.read_csv('../data/Divvy_Stations_Trips_2013/Divvy_Stations_2013.csv')
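# Convert station ids to numeric; values that cannot be parsed become NaN.
# (convert_objects was removed from later pandas releases; pd.to_numeric replaces it.)
trips.from_station_id = pd.to_numeric(trips.from_station_id, errors='coerce')
trips.to_station_id = pd.to_numeric(trips.to_station_id, errors='coerce')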

# Convert trip duration to numeric
locale.setlocale(locale.LC_NUMERIC, '')
trips.tripduration = trips.tripduration.apply(locale.atof)

# Convert date columns to pandas datetime objects
trips.starttime = pd.to_datetime(trips.starttime)
trips.stoptime = pd.to_datetime(trips.stoptime)


/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py:1070: DtypeWarning: Columns (10) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)
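
The use of locale.atof above suggests that tripduration is stored as text with thousands separators. As a minimal sketch of a locale-independent alternative, assuming the separator is a comma (the sample values below are made up, not taken from the Divvy file), the separators can be stripped before converting:

```python
import pandas as pd

# Hypothetical duration strings; the comma separator is an assumption.
raw = pd.Series(['316', '1,045', '12,300'])

# Remove the separators as plain text, then convert to numbers.
parsed = pd.to_numeric(raw.str.replace(',', '', regex=False))
print(parsed.tolist())
```

This avoids depending on whichever locale the machine running the notebook happens to have configured.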

In [4]:
# Compute member age
trips['age'] = 2013 - trips.birthyear

Distance Estimation

To estimate distance I used the Manhattan distance formula, which sums the absolute differences in latitude and longitude and treats the coordinates as Cartesian. Unlike great circle distance, it does not take the curvature of the earth into account, but it probably better reflects how people actually move through the city's square blocks.


In [8]:
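# Manhattan distance formula: sum of absolute coordinate differences (degrees),
# scaled by roughly 111 km per degree and divided by 1.6 km per mile
def manhattan_dist(lat1, lat2, lon1, lon2):
    return (abs(lat2 - lat1) + abs(lon2 - lon1)) * 111 / 1.6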

# Add lat and lon to the trips table to calculate distances
trips_geo = pd.merge(trips,stations,left_on='from_station_name',right_on='name')
trips_geo = pd.merge(trips_geo,stations,left_on='to_station_name',right_on='name')
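
Because the stations table is merged in twice, the second merge finds columns with the same names on both sides, and pandas disambiguates them with its default `_x` and `_y` suffixes. The sketch below uses made-up station names and coordinates to show that `latitude_x` ends up holding the origin and `latitude_y` the destination:

```python
import pandas as pd

# Toy data for illustration only; names and coordinates are invented.
stations = pd.DataFrame({
    'name': ['A', 'B'],
    'latitude': [41.88, 41.90],
    'longitude': [-87.63, -87.65],
})
trips = pd.DataFrame({
    'trip_id': [1, 2],
    'from_station_name': ['A', 'B'],
    'to_station_name': ['B', 'A'],
})

# The first merge attaches origin coordinates, the second destination coordinates.
geo = pd.merge(trips, stations, left_on='from_station_name', right_on='name')
geo = pd.merge(geo, stations, left_on='to_station_name', right_on='name')

# Columns that appear on both sides of the second merge get _x / _y suffixes.
print(sorted(c for c in geo.columns if c.startswith(('latitude', 'longitude'))))
```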

In [9]:
dist_func = lambda x: manhattan_dist(x['latitude_x'], x['latitude_y'],
                                     x['longitude_x'], x['longitude_y'])
trips_geo['dist'] = trips_geo.apply(dist_func, axis=1)
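
Applying a function row by row can be slow on a table with many trips. Since manhattan_dist only uses arithmetic and abs, it can also be called on whole columns at once. The following is a sketch on made-up coordinates, not a change to the notebook's method, showing that both approaches give the same values:

```python
import numpy as np
import pandas as pd

def manhattan_dist(lat1, lat2, lon1, lon2):
    return (abs(lat2 - lat1) + abs(lon2 - lon1)) * 111 / 1.6

# Toy coordinates for illustration only.
toy = pd.DataFrame({
    'latitude_x': [41.88, 41.90],
    'latitude_y': [41.90, 41.88],
    'longitude_x': [-87.63, -87.65],
    'longitude_y': [-87.65, -87.63],
})

# Row-by-row, as in the cell above.
row_wise = toy.apply(
    lambda x: manhattan_dist(x['latitude_x'], x['latitude_y'],
                             x['longitude_x'], x['longitude_y']),
    axis=1)

# Column-wise: the same function applied to entire Series.
vectorized = manhattan_dist(toy['latitude_x'], toy['latitude_y'],
                            toy['longitude_x'], toy['longitude_y'])

print(np.allclose(row_wise, vectorized))
```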

Days in Operation

The Divvy station rollout happened in phases: some stations came online much later than the initial rollout in June and therefore have lower absolute numbers of trips. Assuming someone used each station on the first day it was installed, the number of days in operation can be calculated from the station's first recorded trip and used to correct the number of daily trips for stations open for varying lengths of time.


In [10]:
# Days in operation
first_tr_dt = trips_geo.groupby(by=['from_station_name'])['starttime'].min()
op_days = first_tr_dt.apply(lambda x: (datetime.datetime(2013,12,31,0,0,0)- x).days)
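
To make the normalization concrete, here is a small sketch on invented trips (the station names and timestamps are hypothetical). It takes each station's first trip as its opening date, counts whole days until December 31, 2013, and divides the trip count by that number:

```python
import pandas as pd

# Invented trips: station A opens in June, station B in October.
toy = pd.DataFrame({
    'from_station_name': ['A', 'A', 'B'],
    'starttime': pd.to_datetime(['2013-06-27 12:00',
                                 '2013-07-01 08:00',
                                 '2013-10-01 09:30']),
})

# First recorded trip per station, used as a proxy for the opening date.
first_trip = toy.groupby('from_station_name')['starttime'].min()

# Whole days between the first trip and the end of the year.
op_days = (pd.Timestamp('2013-12-31') - first_trip).dt.days

# Trips per day in operation.
trips_per_day = toy.groupby('from_station_name').size() / op_days
print(op_days.to_dict())
```

Partial days are truncated, so a station whose first trip was at noon counts one day fewer than one whose first trip was at midnight on the same date.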

In [11]:
# Count the number of trips and take the median of the other metrics
daily = trips_geo.groupby(by=['from_station_name'])['trip_id'].count()
dist = trips_geo.groupby(by=['from_station_name'])['dist'].median()
duration = trips_geo.groupby(by=['from_station_name'])['tripduration'].median()
age = trips_geo.groupby(by=['from_station_name'])['age'].median()

In [12]:
# Create a rank Series for each metric
dly_rank = daily.rank(method='first', ascending=False)
dst_rank = dist.rank(method='first', ascending=False)
age_rank = age.rank(method='first', ascending=False)
dur_rank = duration.rank(method='first', ascending=False)
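
With method='first', tied values do not share a rank. Each station gets a distinct rank, and according to the pandas documentation ties are broken by the order in which the values appear. A small sketch with made-up trip counts:

```python
import pandas as pd

# Made-up trip counts; stations A and C are tied.
counts = pd.Series([50, 120, 50, 80], index=['A', 'B', 'C', 'D'])

# Descending rank: the largest count is ranked 1, ties get consecutive ranks.
ranks = counts.rank(method='first', ascending=False)
print(ranks)
```

This guarantees that every rank from 1 to the number of stations is used exactly once, which is convenient for display, at the cost of an arbitrary ordering among tied stations.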

In [13]:
# Summarize into one DataFrame
df = pd.DataFrame(np.round(daily / op_days, decimals=1))
df.rename(columns={0: 'dailytrips'}, inplace=True)
df['tripranks'] = dly_rank
df['dist'] = np.round(dist, decimals=1)
df['distranks'] = dst_rank
df['duration'] = np.round(duration / 60, decimals=1)
df['durranks'] = dur_rank
df['age'] = age
df['ageranks'] = age_rank
df.head()


Out[13]:
                            dailytrips  tripranks  dist  distranks  duration  durranks  age  ageranks
from_station_name
900 W Harrison                    11.1        142   0.9        289       7.8       294   34        71
Aberdeen St & Jackson Blvd        15.0         86   1.4        190      10.2       244   33       101
Aberdeen St & Madison St          22.7         57   1.3        213       9.2       270   34        72
Ada St & Washington Blvd           6.9        239   1.4        173       8.7       282   35        51
Adler Planetarium                  6.4        168   2.0         65      23.3         6   31       181

5 rows × 8 columns


In [14]:
df.to_csv('../data/dailytripranks.csv')