Station Summary Statistics

Computing summary statistics such as means and medians is one of the first tasks typically performed when analyzing a dataset. Pandas offers a variety of summary functions that can be applied to DataFrames to quickly gain insight into the shape of the data. Ranks are also computed for each metric in descending order, so the station with the most trips is ranked first relative to the other stations.

The merge function in pandas is used in this notebook to join the geo data from the stations file with the trips file. The geo data is an input to the distance-estimating function.

The data generated in this notebook powers the summary statistics and ranks in the visualization.
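
As a minimal sketch of the join described above, the toy frames below stand in for the trips and stations files; the station names and coordinates are made up for illustration. Merging twice against the same stations table makes pandas append `_x` and `_y` suffixes to the overlapping column names, which is where names like `latitude_x` and `latitude_y` come from.

```python
import pandas as pd

# Hypothetical stand-ins for the stations and trips files
stations = pd.DataFrame({
    'name': ['A St', 'B St'],
    'latitude': [41.88, 41.90],
    'longitude': [-87.63, -87.65],
})
trips = pd.DataFrame({
    'trip_id': [1, 2],
    'from_station_name': ['A St', 'B St'],
    'to_station_name': ['B St', 'B St'],
})

# First merge attaches the origin station's coordinates
geo = pd.merge(trips, stations, left_on='from_station_name', right_on='name')
# Second merge attaches the destination's; overlapping columns get _x / _y
geo = pd.merge(geo, stations, left_on='to_station_name', right_on='name')

print(sorted(geo.columns))
```

The default inner join drops trips whose station name has no match in the stations table, so comparing `len(trips)` with `len(geo)` is a quick sanity check.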



In [2]:

    
from __future__ import print_function, division
import pandas as pd
import locale
import datetime
import numpy as np



In [3]:

    
trips = pd.read_csv('../data/Divvy_Stations_Trips_2013/Divvy_Trips_2013.csv')
stations = pd.read_csv('../data/Divvy_Stations_Trips_2013/Divvy_Stations_2013.csv')
# Convert station ids to numeric (unparseable values become NaN)
trips.from_station_id = pd.to_numeric(trips.from_station_id, errors='coerce')
trips.to_station_id = pd.to_numeric(trips.to_station_id, errors='coerce')

# Convert trip duration to numeric
locale.setlocale(locale.LC_NUMERIC, '')
trips.tripduration = trips.tripduration.apply(locale.atof)

# Convert date columns to pandas datetime objects
trips.starttime = pd.to_datetime(trips.starttime)
trips.stoptime = pd.to_datetime(trips.stoptime)









    



/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py:1070: DtypeWarning: Columns (10) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)
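
The DtypeWarning above arises when a column holds values that pandas cannot assign to a single type. One way to sidestep it, sketched here on a made-up in-memory CSV rather than the real Divvy file, is to read the ambiguous columns as strings and convert them explicitly. `pd.to_numeric` with `errors='coerce'` turns unparseable entries into NaN, and stripping the thousands separator by hand is a locale-independent alternative to `locale.atof`.

```python
import io
import pandas as pd

# Made-up sample; the real file's columns and values may differ
csv_text = (
    'trip_id,tripduration,from_station_id\n'
    '1,"1,200",85\n'
    '2,316,unknown\n'
)

# Read the ambiguous columns as strings so no type guessing happens
sample = pd.read_csv(io.StringIO(csv_text),
                     dtype={'tripduration': str, 'from_station_id': str})

# Remove thousands separators, then convert to numbers
sample['tripduration'] = pd.to_numeric(
    sample['tripduration'].str.replace(',', ''))
# Non-numeric ids become NaN instead of raising an error
sample['from_station_id'] = pd.to_numeric(
    sample['from_station_id'], errors='coerce')

print(sample.dtypes)
```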



In [4]:

    
# Compute member age
trips['age'] = 2013 - trips.birthyear

Distance Estimation

To estimate distance I used the Manhattan distance formula. This treats the latitude and longitude of the two points as Cartesian coordinates and sums the absolute differences along each axis. Unlike great-circle distance, it does not take into account the curvature of the earth, but it probably better reflects how people actually move through the city's square blocks.
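
To make the trade-off concrete, here is a small sketch that compares the Manhattan estimate with a great-circle (haversine) distance for one pair of points. The coordinates are hypothetical, the constants (about 111 km per degree, 1.6 km per mile, an Earth radius of 6371 km) are approximations, and the haversine helper is an illustration that is not part of this notebook.

```python
from math import radians, sin, cos, asin, sqrt

def manhattan_miles(lat1, lat2, lon1, lon2):
    # Sum of absolute coordinate differences, degrees -> km -> miles
    return (abs(lat2 - lat1) + abs(lon2 - lon1)) * 111 / 1.6

def haversine_miles(lat1, lat2, lon1, lon2):
    # Great-circle distance on a sphere of radius 6371 km, converted to miles
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * 6371 * asin(sqrt(a)) / 1.6

# Hypothetical pair of points roughly in downtown Chicago
lat1, lon1 = 41.88, -87.63
lat2, lon2 = 41.90, -87.65

m = manhattan_miles(lat1, lat2, lon1, lon2)
h = haversine_miles(lat1, lat2, lon1, lon2)
print(round(m, 2), round(h, 2))
```

The Manhattan figure comes out larger, as expected for a path that follows a grid. Note that a degree of longitude spans less ground than a degree of latitude at Chicago's latitude, so applying 111 km to both axes tends to overstate east-west distances.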



In [8]:

    
# Manhattan distance formula
# Degrees are converted to km (~111 km per degree), then to miles (1.6 km per mile)
def manhattan_dist(lat1, lat2, lon1, lon2):
    return (abs(lat2 - lat1) + abs(lon2 - lon1)) * 111 / 1.6

# Add lat and lon to the trips table to calculate distances
trips_geo = pd.merge(trips,stations,left_on='from_station_name',right_on='name')
trips_geo = pd.merge(trips_geo,stations,left_on='to_station_name',right_on='name')



In [9]:

    
dist_func = lambda x: manhattan_dist(x['latitude_x'], x['latitude_y'],
                                     x['longitude_x'], x['longitude_y'])
trips_geo['dist'] = trips_geo.apply(dist_func, axis=1)
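
Row-wise `apply` calls the Python function once per trip, which can be slow on a large table. Because the formula uses only arithmetic, the same result can be computed on whole columns at once. A sketch, using a tiny made-up frame that carries the `_x`/`_y` column names produced by the merges:

```python
import pandas as pd

# Made-up rows with the suffixed coordinate columns
toy = pd.DataFrame({
    'latitude_x': [41.88, 41.90],
    'latitude_y': [41.90, 41.90],
    'longitude_x': [-87.63, -87.65],
    'longitude_y': [-87.65, -87.65],
})

# Column arithmetic applies the formula to every row in one pass
toy['dist'] = (
    (toy['latitude_y'] - toy['latitude_x']).abs()
    + (toy['longitude_y'] - toy['longitude_x']).abs()
) * 111 / 1.6

print(toy['dist'].round(2).tolist())
```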

Days in Operation

The Divvy station rollout happened in phases: some stations came online much later than the initial rollout in June and therefore have lower absolute numbers of trips. Assuming someone used each station on the first day it was installed, the number of days in operation can be calculated and used to normalize trip counts for stations that were open for different lengths of time.



In [10]:

    
# Days in operation
first_tr_dt = trips_geo.groupby(by=['from_station_name'])['starttime'].min()
op_days = first_tr_dt.apply(lambda x: (datetime.datetime(2013,12,31,0,0,0)- x).days)
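
The correction can be illustrated on a toy table (station names and timestamps are invented): find each station's first trip, count the whole days from then until the end of 2013, and divide the trip count by that span.

```python
import pandas as pd

toy = pd.DataFrame({
    'from_station_name': ['Early St', 'Early St', 'Late St'],
    'starttime': pd.to_datetime(['2013-06-27 12:00', '2013-07-01 08:00',
                                 '2013-12-01 09:00']),
})

# First recorded trip per station, used as a proxy for the install date
first = toy.groupby('from_station_name')['starttime'].min()
# Whole days between the first trip and the cutoff date
toy_op_days = (pd.Timestamp('2013-12-31') - first).dt.days
# Trips per day in operation
toy_counts = toy.groupby('from_station_name')['starttime'].count()
per_day = toy_counts / toy_op_days

print(toy_op_days)
print(per_day.round(3))
```

A station whose first trip falls on the cutoff date would get zero days in operation, so real data may need a floor of one day before dividing.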



In [11]:

    
# Count the total trips per station (divided by days in operation below)
# and take the median of the other metrics
daily = trips_geo.groupby(by=['from_station_name'])['trip_id'].count()
dist = trips_geo.groupby(by=['from_station_name'])['dist'].median()
duration = trips_geo.groupby(by=['from_station_name'])['tripduration'].median()
age = trips_geo.groupby(by=['from_station_name'])['age'].median()
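
The four separate groupby calls can also be expressed as a single aggregation. This is a sketch using named aggregation, which assumes a reasonably recent pandas (0.25 or later); the toy values and the output column names are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    'from_station_name': ['A St', 'A St', 'A St', 'B St'],
    'trip_id': [1, 2, 3, 4],
    'dist': [1.0, 2.0, 4.0, 0.5],
    'tripduration': [300, 600, 900, 120],
    'age': [30, 40, 35, 25],
})

# One pass over the groups: count trips, take medians of the other metrics
summary = toy.groupby('from_station_name').agg(
    trips=('trip_id', 'count'),
    dist=('dist', 'median'),
    duration=('tripduration', 'median'),
    age=('age', 'median'),
)
print(summary)
```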



In [12]:

    
# Create a rank Series for each metric
dly_rank = daily.rank(method='first', ascending=False)
dst_rank = dist.rank(method='first', ascending=False)
age_rank = age.rank(method='first', ascending=False)
dur_rank = duration.rank(method='first', ascending=False)
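
How `rank` behaves with these arguments is easiest to see on a small Series (values invented). With `ascending=False` the largest value gets rank 1, and `method='first'` breaks ties by order of appearance instead of assigning an averaged rank, so every station ends up with a distinct rank.

```python
import pandas as pd

# Two stations tie at 50 trips
counts = pd.Series([50, 120, 50, 10], index=['A', 'B', 'C', 'D'])

ranks = counts.rank(method='first', ascending=False)
print(ranks)
```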



In [13]:

    
# Summarize into one DataFrame
df = pd.DataFrame(np.round(daily / op_days, decimals=1))
df.rename(columns={0: 'dailytrips'}, inplace=True)
df['tripranks'] = dly_rank
df['dist'] = np.round(dist, decimals=1)
df['distranks'] = dst_rank
df['duration'] = np.round(duration / 60, decimals=1)
df['durranks'] = dur_rank
df['age'] = age
df['ageranks'] = age_rank
df.head()









    Out[13]:

from_station_name           dailytrips  tripranks  dist  distranks  duration  durranks  age  ageranks
900 W Harrison                    11.1        142   0.9        289       7.8       294   34        71
Aberdeen St & Jackson Blvd        15.0         86   1.4        190      10.2       244   33       101
Aberdeen St & Madison St          22.7         57   1.3        213       9.2       270   34        72
Ada St & Washington Blvd           6.9        239   1.4        173       8.7       282   35        51
Adler Planetarium                  6.4        168   2.0         65      23.3         6   31       181

5 rows × 8 columns



In [14]:

    
df.to_csv('../data/dailytripranks.csv')
