In [1]:
library(ggplot2)
library(dplyr)
library(ggmap)
library(maps)
library(mapdata)
library(lubridate)
options(jupyter.plot_mimetypes = 'image/png')
In [2]:
path <- "./google_transit"
data_path <- "./d3js-test/"
extension <- ".txt"
In [3]:
stopEdge_shapePts <- read.csv(file=file.path(path, paste("stopEdge_shapePts", extension, sep="")), header=T)
shape_stopEdge <- read.csv(file=file.path(path, paste("shape_stopEdge", extension, sep="")), header=T)
shapes <- read.csv(file=file.path(path, paste("shapes", extension, sep="")), header=T)
trip_idTOshape_id <- read.csv(file=file.path(path, paste("trip_idTOshape_id", extension, sep="")), header=T)
stop_times <- read.csv(file=file.path(path, paste("stop_times", extension, sep="")), header=T)
stops <- read.csv(file=file.path(path, paste("stops", extension, sep="")), header=T)
stop_edge_table <- read.csv(file=file.path(data_path, paste("stop_edge_table", extension, sep="")), header=T)
In [4]:
shape_id_test = '202_8_22'
shape_stop_edge_test <- shape_stopEdge[shape_stopEdge$shape_id == shape_id_test,]
stop_edge_list_test <- shape_stop_edge_test$stop_edge_id
trip_id_test <- trip_idTOshape_id[trip_idTOshape_id$shape_id == shape_id_test,]$trip_id[[1]]
stops_list_test <- stop_times[stop_times$trip_id == trip_id_test,]
stops_test <- merge(stops_list_test, stops)
In [5]:
stopEdge_shapePts_test <- filter(stopEdge_shapePts, stop_edge_id %in% stop_edge_list_test)
In [6]:
size_bbox <- 0.25
bbox <- make_bbox(lon = stopEdge_shapePts_test$shape_pt_lon, lat = stopEdge_shapePts_test$shape_pt_lat, f = size_bbox)
sq_map <- get_map(location = bbox, maptype = "roadmap", source = "google")
In [7]:
stop_edge_test_path_plot <- geom_path(data = stopEdge_shapePts_test,
mapping = aes(x = shape_pt_lon, y = shape_pt_lat, colour=stop_edge_id),
alpha = 0.9, size = 2, show.legend = F)
stops_test_plot = geom_point(data = stops_test,
mapping = aes(x = stop_lon, y = stop_lat),
color="black", alpha = 0.5, size = 5, show.legend = F)
In [8]:
ggmap(sq_map) + stop_edge_test_path_plot + stops_test_plot
This map displays a subset (from route "202") of what I have defined as "stop edges". Each of the black points marks a stop taken on this route. Each of the colored edges represents a "stop edge". Each "stop edge" contains data about the average velocity the buses were able to traverse this edge, with a table of entries depending on hour and day of the week.
Since the script I wrote for gathering the data has not been running very long, the data shown below is an aggregated average velocity of all the stop edges, over all the days of the week
In [9]:
stop_edge_table <- stop_edge_table[(complete.cases(stop_edge_table)),]
by_hour_aggregate <- aggregate(stop_edge_table$v_avg, by = list(time = stop_edge_table$hour),
FUN=mean, na.rm=TRUE)
by_hour_aggregate$v_avg <- by_hour_aggregate$x
In [10]:
by_hour_plot = geom_line(data = by_hour_aggregate,
mapping = aes(x = time, y = v_avg),
alpha = 0.9, size = 2)
In [11]:
ggplot() + by_hour_plot
The project is still incomplete (d3 uses a functional programming paradigm that I was unfamiliar with). The final visualization will use D3 and the google maps api, and be interactive, displaying the relevant graph when mousing over each edge. Below I drafted a quick visualization to demonstrate how it may look
In [12]:
stop_edge_listed <- unique(stop_edge_table$stop_edge_id)
stopEdge_shapePts_listed <- filter(stopEdge_shapePts, stop_edge_id %in% stop_edge_listed)
In [13]:
size_big_bbox <- 0.25
big_bbox <- make_bbox(lon = stopEdge_shapePts_listed$shape_pt_lon, lat = stopEdge_shapePts_listed$shape_pt_lat, f = size_big_bbox)
big_sq_map <- get_map(location = big_bbox, maptype = "roadmap", source = "google")
In [14]:
stopEdge_shapePts_listed_plot <- geom_path(data = stopEdge_shapePts_listed,
mapping = aes(x = shape_pt_lon, y = shape_pt_lat, colour=stop_edge_id),
alpha = 0.5, size = 1, show.legend = F)
In [15]:
ggmap(big_sq_map) + stopEdge_shapePts_listed_plot
SDMTS, like most transit agencies, uses the General Transit Feed Specification (GTFS) to format their data. It is divided into two main parts, the Static Transit and the Realtime Extension
This specification defines the infrastructure of the entire system, the key components being:
The information comes in the form of 15 csv files. Here I've included a diagram of the most relevant 7 and their relations:
The important, not obvious values:
shape_id - Corresponds to a series of coordinates that define a path
stop_times - Defines a list of stop visited by a trip
trip_id - Corresponds to a route, the shape it takes, and the stops it visits
We need to generate several of our own tables for the pipeline to function:
stop_edges.txt - Using the stop_times.txt set, we can create all of the unique edges by connecting the adjacent stop_ids per trip. I defined each stop_edge by ordering the two ids smallest first, and concatenating them with a underscore in between.
stop_edge_table.txt - This table simply contains and entry for each stop_edge, as well as all 24 hours of the day, 7 days of the week. Each entry has a average velocity entry (v_avg), that will be continuously updated (by an exponential moving average)
shape_stopEdge.txt -
stopEdge_shapes.txt -
With these data tables, we can properly contextualize and record the data coming in from the live feed
The real time feed can be downloaded in the form of protocol buffers (the .proto file for the GTFS spec can be found here. I've also included a link to a human readable version of the feed here.
I've written a python script that will run every minute, gathering data from the feed, keep track of active trips, parse and contextualize the data, and update the data table. Below is a diagram of its workflow (with minor details omitted):
In [16]:
shapes_plot <- geom_point(data = shapes,
mapping = aes(x = shape_pt_lon, y = shape_pt_lat, colour=shape_id),
alpha = 0.7, size = 2, show.legend = F)
stops_plot <- geom_point(data = stops,
mapping = aes(x = stop_lon, y = stop_lat),
color="black", alpha = 0.5, size = 5, show.legend = F)
stop_edge_path_plot <- geom_point(data = stopEdge_shapePts,
mapping = aes(x = shape_pt_lon, y = shape_pt_lat, colour=stop_edge_id),
alpha = 0.7, size = 2, show.legend = F)
In [17]:
ggmap(sq_map) + shapes_plot
Compare amount of shapes vertices to sparseness of stop vertices
In [18]:
ggmap(sq_map) + stop_edge_path_plot
Removed a significant amount of shape points in order to save data. Points were removed by lowering precision of coordinate, then removing middle points that were in a line. This was done because of a significant overlap in points, as well as high point densities in various locations.
In [19]:
shapes_test <- shapes[shapes$shape_id == shape_id_test,]
shapes_test_plot <- geom_point(data = shapes_test,
mapping = aes(x = shape_pt_lon, y = shape_pt_lat, colour=shape_id),
alpha = 0.9, size = 2, show.legend = F)
In [20]:
ggplot() + shapes_test_plot
In [21]:
stop_edge_test_path_points_plot <- geom_point(data = stopEdge_shapePts_test,
mapping = aes(x = shape_pt_lon, y = shape_pt_lat, colour=stop_edge_id),
alpha = 0.9, size = 5, show.legend = F)
In [22]:
ggplot() + stop_edge_test_path_points_plot + stop_edge_test_path_plot
In [23]:
nrow(shapes_test)
nrow(stopEdge_shapePts_test)
Bugs: Some velocity values are clearly incorrect
In [24]:
qplot(stop_edge_table$hour, stop_edge_table$v_avg)
As you can see, there are clearly some unrealistic values
In [25]:
v_high <- 100
stop_edge_table_max <- stop_edge_table[stop_edge_table$v_avg == max(stop_edge_table$v_avg),]
stop_edge_table_high <- stop_edge_table[stop_edge_table$v_avg >= v_high,]
stop_edge_table_max
In [26]:
stop_edge_table_max_list <- stop_edge_table_max$stop_edge_id
stop_edge_table_high_list <- stop_edge_table_high$stop_edge_id
In [27]:
v_high <- 100
stop_edge_table_max <- stop_edge_table[stop_edge_table$v_avg == max(stop_edge_table$v_avg),]
stop_edge_table_high <- stop_edge_table[stop_edge_table$v_avg >= v_high,]
stop_edge_table_max_list <- stop_edge_table_max$stop_edge_id
stop_edge_table_high_list <- stop_edge_table_high$stop_edge_id
shape_id_max <- filter(shape_stopEdge, stop_edge_id %in% stop_edge_table_max_list)
shape_id_high <- filter(shape_stopEdge, stop_edge_id %in% stop_edge_table_high_list)
stop_edge_table_max <- merge(x = shape_id_max, y = stop_edge_table_max, by="stop_edge_id", all.x = TRUE)
stop_edge_table_high <- merge(x = shape_id_high, y = stop_edge_table_high, by="stop_edge_id", all.x = TRUE)
stop_edge_table_max
stop_edge_table_high
These problems seem to be related to small index changes and negative index changes. In addition, these routes seem to involve loops
In [28]:
shapes_992 <- shapes[shapes$shape_id == "992_9_48",]
In [29]:
qplot(shapes_992$shape_pt_lon, shapes_992$shape_pt_lat)