Visualization of Bus Bunching

When we are working on spatio-temporal data sets, it will be handy if we can visualize the spatial components of data while understanding their relations with time series. In this post, I present an example of how to visualize the bus bunching (buses of the same service number arriving at the same stop) of New York city.

The original data set are available at New York City Bus Data. In addition, we download the bus stop information of New York City

This jupyter notebook is available at my Github page: VisualizeBusBunching.ipynb, and it is part of the repository jqlearning.

Preprocessing Data

Before loading the city bus data of 2017-06, an additional preprocessing step is required. The file 'mta_1706.csv' has additional ',' that prevents us to load the correct number of columns with pd.read_csv(). For instance, if you refer to the line 53292, you will notice that information such as 'CPW/110st (non-public, for GEO)'. The comma that follows the word non-public causes the loading error. We can thus remove this additional comma by Regular Expressions.
The data set of 2017-06 has about 1 Gb size. For simplicity, we focus on the records of 2017-06-01, and filter any row record having missing values.
Some vehicle recods could have duplicate records due to different scheduled arrival time values. We should filter the duplicate records by choosing only one of the vehicle records.



In [1]:

    
import pandas as pd
import numpy as np



In [2]:

    
df = pd.read_csv("mta_1706.csv")



In [3]:

    
# Set to datetime object
df['RecordedAtTime'] = pd.to_datetime(df['RecordedAtTime'])



In [4]:

    
df = df[(df['RecordedAtTime'] < pd.Timestamp('2017-06-02')) & (df['RecordedAtTime'] > pd.Timestamp('2017-05-31'))]



In [5]:

    
# filter missing values
df = df.dropna(axis=0, how='any')



In [6]:

    
# BusCoord records both Longitude and Latidue info
df['BusCoord'] = list(zip(df["VehicleLocation.Longitude"], df["VehicleLocation.Latitude"]))



In [7]:

    
df.head(5)









    Out[7]:







  
    
      
      RecordedAtTime
      DirectionRef
      PublishedLineName
      OriginName
      OriginLat
      OriginLong
      DestinationName
      DestinationLat
      DestinationLong
      VehicleRef
      VehicleLocation.Latitude
      VehicleLocation.Longitude
      NextStopPointName
      ArrivalProximityText
      DistanceFromStop
      ExpectedArrivalTime
      ScheduledArrivalTime
      BusCoord
    
  
  
    
      0
      2017-06-01 00:03:34
      0
      B8
      4 AV/95 ST
      40.616104
      -74.031143
      BROWNSVILLE ROCKAWAY AV
      40.656048
      -73.907379
      NYCT_430
      40.635170
      -73.960803
      FOSTER AV/E 18 ST
      approaching
      76.0
      2017-06-01 00:03:59
      24:06:14
      (-73.960803, 40.63517)
    
    
      1
      2017-06-01 00:03:43
      1
      S61
      ST GEORGE FERRY/S61 & S91
      40.643169
      -74.073494
      S I MALL YUKON AV
      40.575935
      -74.167686
      NYCT_8263
      40.590802
      -74.158340
      MERRYMOUNT ST/TRAVIS AV
      approaching
      62.0
      2017-06-01 00:03:56
      23:58:02
      (-74.15834, 40.590802000000004)
    
    
      2
      2017-06-01 00:03:49
      0
      Bx10
      E 206 ST/BAINBRIDGE AV
      40.875008
      -73.880142
      RIVERDALE 263 ST
      40.912376
      -73.902534
      NYCT_4223
      40.886010
      -73.912647
      HENRY HUDSON PKY E/W 235 ST
      at stop
      5.0
      2017-06-01 00:03:56
      24:00:53
      (-73.91264699999999, 40.88601)
    
    
      3
      2017-06-01 00:03:31
      0
      Q5
      TEARDROP/LAYOVER
      40.701748
      -73.802399
      ROSEDALE LIRR STA via MERRICK
      40.666012
      -73.735939
      NYCT_8422
      40.668002
      -73.729348
      HOOK CREEK BL/SUNRISE HY
      < 1 stop away
      267.0
      2017-06-01 00:04:03
      24:03:00
      (-73.729348, 40.668002)
    
    
      4
      2017-06-01 00:03:22
      1
      Bx1
      RIVERDALE AV/W 231 ST
      40.881187
      -73.909340
      MOTT HAVEN 136 ST via CONCOURSE
      40.809654
      -73.928360
      NYCT_4710
      40.868134
      -73.893032
      GRAND CONCOURSE/E 196 ST
      at stop
      11.0
      2017-06-01 00:03:56
      23:59:38
      (-73.89303199999999, 40.868134000000005)



In [8]:

    
vehicle_gb = df.groupby(["RecordedAtTime", "PublishedLineName", "DirectionRef", "VehicleRef"])



In [9]:

    
vehicle_cnt_df = vehicle_gb.count()



In [10]:

    
# The following demonstrates an example of duplicate record
vehicle_gb.get_group(vehicle_cnt_df[vehicle_cnt_df["BusCoord"] > 1].index[0])









    Out[10]:







  
    
      
      RecordedAtTime
      DirectionRef
      PublishedLineName
      OriginName
      OriginLat
      OriginLong
      DestinationName
      DestinationLat
      DestinationLong
      VehicleRef
      VehicleLocation.Latitude
      VehicleLocation.Longitude
      NextStopPointName
      ArrivalProximityText
      DistanceFromStop
      ExpectedArrivalTime
      ScheduledArrivalTime
      BusCoord
    
  
  
    
      181
      2017-06-01 00:03:28
      1
      S46
      ST GEORGE FERRY/S46 & S96
      40.643429
      -74.073654
      W SHORE PLZ via CASTLETON
      40.601971
      -74.19133
      NYCT_8188
      40.637748
      -74.076454
      VICTORY BL/MONTGOMERY AV
      approaching
      112.0
      2017-06-01 00:04:03
      00:02:22
      (-74.076454, 40.637747999999995)
    
    
      182
      2017-06-01 00:03:28
      1
      S46
      ST GEORGE FERRY/S46 & S96
      40.643429
      -74.073654
      W SHORE PLZ via CASTLETON
      40.601971
      -74.19133
      NYCT_8188
      40.637748
      -74.076454
      VICTORY BL/MONTGOMERY AV
      approaching
      112.0
      2017-06-01 00:04:03
      00:02:26
      (-74.076454, 40.637747999999995)



In [11]:

    
bus_df = vehicle_gb.head(1).copy()

New York City Bus Stop Info



In [12]:

    
stop_bronx_df = pd.read_csv("stops_bronx.txt")
stop_brooklyn_df = pd.read_csv("stops_brooklyn.txt")
stop_manhattan_df = pd.read_csv("stops_manhattan.txt")
stop_queens_df = pd.read_csv("stops_queens.txt")
stop_staten_island_df = pd.read_csv("stops_staten_island.txt")



In [13]:

    
stop_new_york_df = pd.concat([stop_bronx_df, 
                              stop_brooklyn_df,
                             stop_manhattan_df,
                             stop_queens_df,
                             stop_staten_island_df], axis=0)



In [14]:

    
stop_new_york_df.drop_duplicates(inplace=True)

Determine the Bus Bunching Based on the NextStopPointName



In [15]:

    
bus_at_stop_df =  bus_df[bus_df["ArrivalProximityText"] == "at stop"]



In [16]:

    
# For bus bunching, buses of the same service number arriving at the same stop
bus_at_stop_gb = bus_at_stop_df.groupby(["RecordedAtTime", "PublishedLineName", "DirectionRef", "NextStopPointName"])
bus_at_stop_cnt = bus_at_stop_gb.count()



In [17]:

    
# Bunched buses have multiple locations
bunched_bus_df = bus_at_stop_cnt[bus_at_stop_cnt["BusCoord"] > 1]



In [18]:

    
# An example of a bus bunchin scenario
bus_at_stop_gb.get_group(bunched_bus_df.index[0])









    Out[18]:







  
    
      
      RecordedAtTime
      DirectionRef
      PublishedLineName
      OriginName
      OriginLat
      OriginLong
      DestinationName
      DestinationLat
      DestinationLong
      VehicleRef
      VehicleLocation.Latitude
      VehicleLocation.Longitude
      NextStopPointName
      ArrivalProximityText
      DistanceFromStop
      ExpectedArrivalTime
      ScheduledArrivalTime
      BusCoord
    
  
  
    
      16857
      2017-06-01 06:13:45
      0
      Q58
      PALMETTO ST/MYRTLE AV
      40.700176
      -73.910255
      LTD FLUSHING MAIN ST
      40.757343
      -73.829361
      NYCT_4520
      40.757344
      -73.829513
      41 RD/MAIN ST
      at stop
      10.0
      2017-06-01 06:13:50
      06:02:00
      (-73.829513, 40.757344)
    
    
      17113
      2017-06-01 06:13:45
      0
      Q58
      PALMETTO ST/MYRTLE AV
      40.700176
      -73.910255
      FLUSHING MAIN ST
      40.757343
      -73.829361
      NYCT_6563
      40.757285
      -73.829638
      41 RD/MAIN ST
      at stop
      23.0
      2017-06-01 06:13:50
      06:08:00
      (-73.829638, 40.757284999999996)

Prepare Data for Visualization

Assign bus bunching status for all the bus records.
Group the RecordedAtTime into 30 seconds interval.
Convert the latitude, longitude information since Bokeh employs Web Mercator projection for mapping. Credit goes to Charlie Harper's post on Visualizing Data with Bokeh and Pandas, where he detailed a helper function for the conversion using Pyproj.



In [19]:

    
# Initilialize as False
bus_df["BunchedStatus"] = False
bus_df["BunchedStatus"] = bus_df.apply(lambda x: True if (x["RecordedAtTime"], 
                                                      x["PublishedLineName"], 
                                                      x["DirectionRef"],
                                                      x["NextStopPointName"]) in bunched_bus_df.index else False, axis=1)



In [20]:

    
bus_df['TimeInterval'] = bus_df['RecordedAtTime'].map(lambda x: x.floor('30s'))



In [21]:

    
from pyproj import Proj, transform 

# helper function to convert lat/long to easting/northing for mapping
def LongLat_to_EN(long, lat):
    try:
        easting, northing = transform(Proj(init='epsg:4326'), Proj(init='epsg:3857'), long, lat)
        return easting, northing
    except:
        return None, None



In [22]:

    
bus_df['VehicleLocation.E'], bus_df['VehicleLocation.N'] = zip(*bus_df.apply(
    lambda x: LongLat_to_EN(x['VehicleLocation.Longitude'], x['VehicleLocation.Latitude']), axis=1))



In [23]:

    
stop_new_york_df["stop.E"], stop_new_york_df["stop.N"] = zip(*stop_new_york_df.apply(
    lambda x: LongLat_to_EN(x["stop_lon"], x["stop_lat"]), axis=1))

Visualization with Bokeh



In [24]:

    
from bokeh.plotting import figure, show, ColumnDataSource
from bokeh.tile_providers import CARTODBPOSITRON
from bokeh.io import output_notebook, push_notebook
from bokeh.models import HoverTool

output_notebook()









    





    
        
        Loading BokehJS ...



In [25]:

    
def busSources(datetime):
    # given a datetime, separate bunched bus and non-bunched bus sources.
    
    # Selected data sourceNonBunched for bunched bus
    sourceBunched = ColumnDataSource(data=dict(
        lon=bus_df[(bus_df['BunchedStatus'] == True) & (bus_df['TimeInterval'] == datetime)]["VehicleLocation.E"],
        lat=bus_df[(bus_df['BunchedStatus'] == True) & (bus_df['TimeInterval'] == datetime)]["VehicleLocation.N"],
        PublishedLineName=bus_df[(bus_df['BunchedStatus'] == True) & (bus_df['TimeInterval'] == datetime)]["PublishedLineName"],
        DirectionRef=bus_df[(bus_df['BunchedStatus'] == True) & (bus_df['TimeInterval'] == datetime)]["DirectionRef"],
        VehicleRef=bus_df[(bus_df['BunchedStatus'] == True) & (bus_df['TimeInterval'] == datetime)]["VehicleRef"],
        RecordedAtTime=bus_df[(bus_df['BunchedStatus'] == True) & (bus_df['TimeInterval'] == datetime)]["RecordedAtTime"],
        NextStopPoint=bus_df[(bus_df['BunchedStatus'] == True) & (bus_df['TimeInterval'] == datetime)]["NextStopPointName"]
    ))
    
    # Selected data sourceNonBunchedfor non-bunching bus
    sourceNonBunched = ColumnDataSource(data=dict(
        lon=bus_df[(bus_df['BunchedStatus'] == False) & (bus_df['TimeInterval'] == datetime)]["VehicleLocation.E"],
        lat=bus_df[(bus_df['BunchedStatus'] == False) & (bus_df['TimeInterval'] == datetime)]["VehicleLocation.N"],
        PublishedLineName=bus_df[(bus_df['BunchedStatus'] == False) & (bus_df['TimeInterval'] == datetime)]["PublishedLineName"],
        DirectionRef=bus_df[(bus_df['BunchedStatus'] == False) & (bus_df['TimeInterval'] == datetime)]["DirectionRef"],
        VehicleRef=bus_df[(bus_df['BunchedStatus'] == False) & (bus_df['TimeInterval'] == datetime)]["VehicleRef"],
        RecordedAtTime=bus_df[(bus_df['BunchedStatus'] == False) & (bus_df['TimeInterval'] == datetime)]["RecordedAtTime"],
        NextStopPoint=bus_df[(bus_df['BunchedStatus'] == False) & (bus_df['TimeInterval'] == datetime)]["NextStopPointName"]        
    ))
    return sourceBunched, sourceNonBunched



In [26]:

    
def visualize_selected_time(datetime):
    plot = figure(x_range=(bus_df["VehicleLocation.E"].min(), bus_df["VehicleLocation.E"].max()), 
               y_range=(bus_df["VehicleLocation.N"].min(), bus_df["VehicleLocation.N"].max()),
               x_axis_type="mercator", y_axis_type="mercator")
    plot.add_tile(CARTODBPOSITRON)
    sourceBunched, sourceNonBunched = busSources(datetime)
            
    # Bus Stop Info
    bus_stops = ColumnDataSource(data=dict(
        lon=stop_new_york_df["stop.E"],
        lat=stop_new_york_df["stop.N"],
        Name=stop_new_york_df["stop_name"]))

    circle1 = plot.circle('lon', 'lat', size=2.5, color="orange", alpha=0.3, source=bus_stops)
    # add a circle renderer with a size, color, and alpha
    circle2 = plot.circle('lon', 'lat', size=5, color="navy", alpha=0.5, source=sourceNonBunched)
    circle3 = plot.circle('lon', 'lat', size=8, color="red", alpha=0.5, source=sourceBunched)
    
    plot.add_tools(HoverTool(renderers=[circle2, circle3], tooltips=[("BusService", "@PublishedLineName"), 
                                    ("Direction", "@DirectionRef"),
                                    ("Vehicle", "@VehicleRef"),
                                    # The timestamp will be automatically converted to epoch time by default
                                    # https://bokeh.pydata.org/en/latest/docs/reference/models/formatters.html#bokeh.models.formatters.NumeralTickFormatter.format
                                    ("Time", "@RecordedAtTime{%F %T}"),
                                    ("NextStopPoint", "@NextStopPoint")],
                                    formatters={"RecordedAtTime": "datetime"}))
    
    plot.add_tools(HoverTool(renderers=[circle1], tooltips=[
                                    ("Stop Name", "@Name")]))        
    
    return plot, sourceBunched, sourceNonBunched



In [27]:

    
plot, sourceBunched, sourceNonBunched = visualize_selected_time(pd.Timestamp("20170601 12:44:00"))

The following plot visualizes the bus bunching instance at period around 12:44:00. The bunched bus instances are highlighted by red, whereas the other bus records are makred by blue. There are thousands of bus stops available at the New York city, and they are shown as orange circles in this plot



In [28]:

    
show(plot, notebook_handle=True)









    










  







    














    Out[28]:




<Bokeh Notebook handle for In[28]>

Finally, we can use ipywidgets to generate a dropdown menu, which allows us to visualize the potential bus bunching at any selected time interval.
Note that this interactive menu requires a functioning local server. You may download this notebook and play with the visualization using the following dropdown menu.



In [29]:

    
TimeIntervalStr = bus_df["TimeInterval"].astype(str)
UniqueTimeIntervalStr = TimeIntervalStr.unique()
UniqueTimeIntervalStr = sorted(UniqueTimeIntervalStr)



In [30]:

    
def update_plot(datetime="2017-06-01 12:44:00"):
    timestamp = pd.Timestamp(datetime)
    newBunched, newNonBunched = busSources(timestamp)
    sourceBunched.data = newBunched.data
    sourceNonBunched.data = newNonBunched.data
    push_notebook()



In [31]:

    
from ipywidgets import interact

interact_panel = interact(update_plot, datetime=UniqueTimeIntervalStr)

	RecordedAtTime	DirectionRef	PublishedLineName	OriginName	OriginLat	OriginLong	DestinationName	DestinationLat	DestinationLong	VehicleRef	VehicleLocation.Latitude	VehicleLocation.Longitude	NextStopPointName	ArrivalProximityText	DistanceFromStop	ExpectedArrivalTime	ScheduledArrivalTime	BusCoord
0	2017-06-01 00:03:34	0	B8	4 AV/95 ST	40.616104	-74.031143	BROWNSVILLE ROCKAWAY AV	40.656048	-73.907379	NYCT_430	40.635170	-73.960803	FOSTER AV/E 18 ST	approaching	76.0	2017-06-01 00:03:59	24:06:14	(-73.960803, 40.63517)
1	2017-06-01 00:03:43	1	S61	ST GEORGE FERRY/S61 & S91	40.643169	-74.073494	S I MALL YUKON AV	40.575935	-74.167686	NYCT_8263	40.590802	-74.158340	MERRYMOUNT ST/TRAVIS AV	approaching	62.0	2017-06-01 00:03:56	23:58:02	(-74.15834, 40.590802000000004)
2	2017-06-01 00:03:49	0	Bx10	E 206 ST/BAINBRIDGE AV	40.875008	-73.880142	RIVERDALE 263 ST	40.912376	-73.902534	NYCT_4223	40.886010	-73.912647	HENRY HUDSON PKY E/W 235 ST	at stop	5.0	2017-06-01 00:03:56	24:00:53	(-73.91264699999999, 40.88601)
3	2017-06-01 00:03:31	0	Q5	TEARDROP/LAYOVER	40.701748	-73.802399	ROSEDALE LIRR STA via MERRICK	40.666012	-73.735939	NYCT_8422	40.668002	-73.729348	HOOK CREEK BL/SUNRISE HY	< 1 stop away	267.0	2017-06-01 00:04:03	24:03:00	(-73.729348, 40.668002)
4	2017-06-01 00:03:22	1	Bx1	RIVERDALE AV/W 231 ST	40.881187	-73.909340	MOTT HAVEN 136 ST via CONCOURSE	40.809654	-73.928360	NYCT_4710	40.868134	-73.893032	GRAND CONCOURSE/E 196 ST	at stop	11.0	2017-06-01 00:03:56	23:59:38	(-73.89303199999999, 40.868134000000005)

	RecordedAtTime	DirectionRef	PublishedLineName	OriginName	OriginLat	OriginLong	DestinationName	DestinationLat	DestinationLong	VehicleRef	VehicleLocation.Latitude	VehicleLocation.Longitude	NextStopPointName	ArrivalProximityText	DistanceFromStop	ExpectedArrivalTime	ScheduledArrivalTime	BusCoord
181	2017-06-01 00:03:28	1	S46	ST GEORGE FERRY/S46 & S96	40.643429	-74.073654	W SHORE PLZ via CASTLETON	40.601971	-74.19133	NYCT_8188	40.637748	-74.076454	VICTORY BL/MONTGOMERY AV	approaching	112.0	2017-06-01 00:04:03	00:02:22	(-74.076454, 40.637747999999995)
182	2017-06-01 00:03:28	1	S46	ST GEORGE FERRY/S46 & S96	40.643429	-74.073654	W SHORE PLZ via CASTLETON	40.601971	-74.19133	NYCT_8188	40.637748	-74.076454	VICTORY BL/MONTGOMERY AV	approaching	112.0	2017-06-01 00:04:03	00:02:26	(-74.076454, 40.637747999999995)