Outline

  • Preliminary Setup
  • Final Result Demonstration
  • The Pipeline
  • Design Decisions (and their implications)
  • Future Updates

Preliminary Setup


In [1]:
library(ggplot2)
library(dplyr)
library(ggmap)
library(maps)
library(mapdata)
library(lubridate)
options(jupyter.plot_mimetypes = 'image/png')


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Google Maps API Terms of Service: http://developers.google.com/maps/terms.
Please cite ggmap if you use it: see citation("ggmap") for details.

Attaching package: ‘lubridate’

The following object is masked from ‘package:base’:

    date


In [2]:
path <- "./google_transit"
data_path <- "./d3js-test/"
extension <- ".txt"

In [3]:
stopEdge_shapePts <- read.csv(file=file.path(path, paste("stopEdge_shapePts", extension, sep="")), header=T)
shape_stopEdge <- read.csv(file=file.path(path, paste("shape_stopEdge", extension, sep="")), header=T)
shapes <- read.csv(file=file.path(path, paste("shapes", extension, sep="")), header=T)
trip_idTOshape_id <- read.csv(file=file.path(path, paste("trip_idTOshape_id", extension, sep="")), header=T)
stop_times <- read.csv(file=file.path(path, paste("stop_times", extension, sep="")), header=T)
stops <- read.csv(file=file.path(path, paste("stops", extension, sep="")), header=T)
stop_edge_table <- read.csv(file=file.path(data_path, paste("stop_edge_table", extension, sep="")), header=T)

Final Result Demonstration


In [4]:
shape_id_test =  '202_8_22'
shape_stop_edge_test <- shape_stopEdge[shape_stopEdge$shape_id ==  shape_id_test,]
stop_edge_list_test <- shape_stop_edge_test$stop_edge_id

trip_id_test <- trip_idTOshape_id[trip_idTOshape_id$shape_id == shape_id_test,]$trip_id[[1]]
stops_list_test <- stop_times[stop_times$trip_id == trip_id_test,]
stops_test <- merge(stops_list_test, stops)

In [5]:
stopEdge_shapePts_test <- filter(stopEdge_shapePts, stop_edge_id %in% stop_edge_list_test)

In [6]:
size_bbox <- 0.25
bbox <- make_bbox(lon = stopEdge_shapePts_test$shape_pt_lon, lat = stopEdge_shapePts_test$shape_pt_lat, f = size_bbox)
sq_map <- get_map(location = bbox, maptype = "roadmap", source = "google")


Warning message:
“bounding box given to google - spatial extent only approximate.”converting bounding box to center/zoom specification. (experimental)
Source : https://maps.googleapis.com/maps/api/staticmap?center=32.87185,-117.2265&zoom=15&size=640x640&scale=2&maptype=roadmap&language=en-EN

In [7]:
stop_edge_test_path_plot <- geom_path(data = stopEdge_shapePts_test,
                                      mapping = aes(x = shape_pt_lon, y = shape_pt_lat, colour=stop_edge_id), 
                                      alpha = 0.9, size = 2, show.legend = F)

stops_test_plot = geom_point(data = stops_test,
                             mapping = aes(x = stop_lon, y = stop_lat), 
                             color="black", alpha = 0.5, size = 5, show.legend = F)

In [8]:
ggmap(sq_map) + stop_edge_test_path_plot + stops_test_plot


This map displays a subset (from route "202") of what I have defined as "stop edges". Each of the black points marks a stop taken on this route. Each of the colored edges represents a "stop edge". Each "stop edge" contains data about the average velocity the buses were able to traverse this edge, with a table of entries depending on hour and day of the week.

Since the script I wrote for gathering the data has not been running very long, the data shown below is an aggregated average velocity of all the stop edges, over all the days of the week


In [9]:
stop_edge_table <- stop_edge_table[(complete.cases(stop_edge_table)),]
by_hour_aggregate <- aggregate(stop_edge_table$v_avg, by = list(time = stop_edge_table$hour), 
                               FUN=mean, na.rm=TRUE)
by_hour_aggregate$v_avg <- by_hour_aggregate$x

In [10]:
by_hour_plot = geom_line(data = by_hour_aggregate,
                                mapping = aes(x = time, y = v_avg), 
                                alpha = 0.9, size = 2)

In [11]:
ggplot() + by_hour_plot


The project is still incomplete (d3 uses a functional programming paradigm that I was unfamiliar with). The final visualization will use D3 and the google maps api, and be interactive, displaying the relevant graph when mousing over each edge. Below I drafted a quick visualization to demonstrate how it may look


In [12]:
stop_edge_listed <- unique(stop_edge_table$stop_edge_id)
stopEdge_shapePts_listed <- filter(stopEdge_shapePts, stop_edge_id %in% stop_edge_listed)

In [13]:
size_big_bbox <- 0.25
big_bbox <- make_bbox(lon = stopEdge_shapePts_listed$shape_pt_lon, lat = stopEdge_shapePts_listed$shape_pt_lat, f = size_big_bbox)
big_sq_map <- get_map(location = big_bbox, maptype = "roadmap", source = "google")


Warning message:
“bounding box given to google - spatial extent only approximate.”converting bounding box to center/zoom specification. (experimental)
Source : https://maps.googleapis.com/maps/api/staticmap?center=32.8337,-116.9445&zoom=10&size=640x640&scale=2&maptype=roadmap&language=en-EN

In [14]:
stopEdge_shapePts_listed_plot <- geom_path(data = stopEdge_shapePts_listed,
                                            mapping = aes(x = shape_pt_lon, y = shape_pt_lat, colour=stop_edge_id), 
                                            alpha = 0.5, size = 1, show.legend = F)

In [15]:
ggmap(big_sq_map) + stopEdge_shapePts_listed_plot


The Pipeline

SDMTS, like most transit agencies, uses the General Transit Feed Specification (GTFS) to format their data. It is divided into two main parts, the Static Transit and the Realtime Extension

Static Transit

This specification defines the infrastructure of the entire system, the key components being:

  • route information
  • trip information and scheduling
  • stop locations

The information comes in the form of 15 csv files. Here I've included a diagram of the most relevant 7 and their relations:

The important, not obvious values:

  • shape_id - Corresponds to a series of coordinates that define a path

  • stop_times - Defines a list of stop visited by a trip

  • trip_id - Corresponds to a route, the shape it takes, and the stops it visits

We need to generate several of our own tables for the pipeline to function:

  • stop_edges.txt - Using the stop_times.txt set, we can create all of the unique edges by connecting the adjacent stop_ids per trip. I defined each stop_edge by ordering the two ids smallest first, and concatenating them with a underscore in between.

  • stop_edge_table.txt - This table simply contains and entry for each stop_edge, as well as all 24 hours of the day, 7 days of the week. Each entry has a average velocity entry (v_avg), that will be continuously updated (by an exponential moving average)

  • shape_stopEdge.txt -

  • stopEdge_shapes.txt -

With these data tables, we can properly contextualize and record the data coming in from the live feed

Realtime Extension

The real time feed can be downloaded in the form of protocol buffers (the .proto file for the GTFS spec can be found here. I've also included a link to a human readable version of the feed here.

I've written a python script that will run every minute, gathering data from the feed, keep track of active trips, parse and contextualize the data, and update the data table. Below is a diagram of its workflow (with minor details omitted):

Design Decisions (and their Implications)

Here I will make a case for several of the design decisions I made

Why stop edges

In building this network, there were three choices


In [16]:
shapes_plot <- geom_point(data = shapes,
                           mapping = aes(x = shape_pt_lon, y = shape_pt_lat, colour=shape_id), 
                           alpha = 0.7, size = 2, show.legend = F)
stops_plot <- geom_point(data = stops,
                             mapping = aes(x = stop_lon, y = stop_lat), 
                             color="black", alpha = 0.5, size = 5, show.legend = F)
stop_edge_path_plot <- geom_point(data = stopEdge_shapePts,
                                      mapping = aes(x = shape_pt_lon, y = shape_pt_lat, colour=stop_edge_id), 
                                      alpha = 0.7, size = 2, show.legend = F)

In [17]:
ggmap(sq_map) + shapes_plot


Warning message:
“Removed 120722 rows containing missing values (geom_point).”

Compare amount of shapes vertices to sparseness of stop vertices


In [18]:
ggmap(sq_map) + stop_edge_path_plot


Warning message:
“Removed 18146 rows containing missing values (geom_point).”

Approximating shape points for each stop edge

Removed a significant amount of shape points in order to save data. Points were removed by lowering precision of coordinate, then removing middle points that were in a line. This was done because of a significant overlap in points, as well as high point densities in various locations.


In [19]:
shapes_test <- shapes[shapes$shape_id == shape_id_test,]
shapes_test_plot <- geom_point(data = shapes_test,
                           mapping = aes(x = shape_pt_lon, y = shape_pt_lat, colour=shape_id), 
                           alpha = 0.9, size = 2, show.legend = F)

In [20]:
ggplot() + shapes_test_plot



In [21]:
stop_edge_test_path_points_plot <- geom_point(data = stopEdge_shapePts_test,
                                              mapping = aes(x = shape_pt_lon, y = shape_pt_lat, colour=stop_edge_id), 
                                              alpha = 0.9, size = 5, show.legend = F)

In [22]:
ggplot() + stop_edge_test_path_points_plot + stop_edge_test_path_plot



In [23]:
nrow(shapes_test)
nrow(stopEdge_shapePts_test)


308
40

Bugs: Some velocity values are clearly incorrect


In [24]:
qplot(stop_edge_table$hour, stop_edge_table$v_avg)


As you can see, there are clearly some unrealistic values


In [25]:
v_high <- 100
stop_edge_table_max <- stop_edge_table[stop_edge_table$v_avg == max(stop_edge_table$v_avg),]
stop_edge_table_high <- stop_edge_table[stop_edge_table$v_avg >= v_high,]
stop_edge_table_max


stop_edge_idhourdowv_avg
4031212995_1299615 6 337.95
11608010090_1009415 6 337.95
12464810075_1299615 6 337.95
25132011612_1163815 6 337.95
29600811612_1238915 6 337.95
49004810449_1238915 6 337.95
54078410449_9902615 6 337.95
75061612995_6050015 6 337.95

In [26]:
stop_edge_table_max_list <- stop_edge_table_max$stop_edge_id
stop_edge_table_high_list <- stop_edge_table_high$stop_edge_id

In [27]:
v_high <- 100
stop_edge_table_max <- stop_edge_table[stop_edge_table$v_avg == max(stop_edge_table$v_avg),]
stop_edge_table_high <- stop_edge_table[stop_edge_table$v_avg >= v_high,]

stop_edge_table_max_list <- stop_edge_table_max$stop_edge_id
stop_edge_table_high_list <- stop_edge_table_high$stop_edge_id

shape_id_max  <- filter(shape_stopEdge, stop_edge_id %in% stop_edge_table_max_list)
shape_id_high <- filter(shape_stopEdge, stop_edge_id %in% stop_edge_table_high_list)
stop_edge_table_max  <- merge(x = shape_id_max, y = stop_edge_table_max, by="stop_edge_id", all.x = TRUE)
stop_edge_table_high <- merge(x = shape_id_high, y = stop_edge_table_high, by="stop_edge_id", all.x = TRUE)
stop_edge_table_max
stop_edge_table_high


stop_edge_idshape_idstart_idxstop_idxhourdowv_avg
10075_12996923_2_57 147 150 15 6 337.95
10075_12996992_9_48 135 60 15 6 337.95
10090_10094923_2_57 193 197 15 6 337.95
10090_10094992_9_48 18 14 15 6 337.95
10449_12389923_2_57 180 182 15 6 337.95
10449_12389992_9_48 168 29 15 6 337.95
10449_99026923_2_57 182 184 15 6 337.95
10449_99026992_9_48 29 27 15 6 337.95
11612_11638923_2_57 175 177 15 6 337.95
11612_11638992_9_48 163 165 15 6 337.95
11612_12389923_2_57 177 180 15 6 337.95
11612_12389992_9_48 165 168 15 6 337.95
12995_12996923_2_57 143 147 15 6 337.95
12995_12996992_9_48 131 135 15 6 337.95
12995_60500992_9_48 95 131 15 6 337.95
stop_edge_idshape_idstart_idxstop_idxhourdowv_avg
10032_1038430_1_242 159 161 20 6 178.7077
10032_1038430_1_235 408 410 20 6 178.7077
10032_1038430_1_234 408 410 20 6 178.7077
10032_1038430_1_241 54 56 20 6 178.7077
10032_1038430_1_238 26 28 20 6 178.7077
10032_1038430_1_232 159 161 20 6 178.7077
10032_1038430_1_236 408 410 20 6 178.7077
10032_1039530_1_242 161 168 20 6 178.7077
10032_1039530_1_241 56 63 20 6 178.7077
10032_1039530_1_236 410 417 20 6 178.7077
10032_1039530_1_234 410 417 20 6 178.7077
10032_1039530_1_232 161 168 20 6 178.7077
10032_1039530_1_238 28 35 20 6 178.7077
10032_1039530_1_235 410 417 20 6 178.7077
10044_1039530_1_241 63 67 20 6 178.7077
10044_1039530_1_234 417 421 20 6 178.7077
10044_1039530_1_232 168 172 20 6 178.7077
10044_1039530_1_242 168 172 20 6 178.7077
10044_1039530_1_238 35 39 20 6 178.7077
10044_1039530_1_236 417 421 20 6 178.7077
10044_1039530_1_235 417 421 20 6 178.7077
10075_12996992_9_48 135 60 16 5 167.6377
10075_12996992_9_48 135 60 15 6 337.9500
10075_12996923_2_57 147 150 16 5 167.6377
10075_12996923_2_57 147 150 15 6 337.9500
10086_10090923_2_57 189 193 16 5 151.4291
10086_10090992_9_48 22 18 16 5 151.4291
10086_10454992_9_48 25 22 16 5 151.5022
10086_10454290_0_13 12 15 16 5 151.5022
10086_10454280_0_12 12 15 16 5 151.5022
13510_9404220_1_205 127 148 10 4 130.7392
13510_9404220_1_205 127 148 9 6 226.8000
13510_9404220_1_205 127 148 20 5 110.5975
13510_9404220_1_207 246 267 10 4 130.7392
13510_9404220_1_207 246 267 9 6 226.8000
13510_9404220_1_207 246 267 20 5 110.5975
13510_9404220_1_202 648 669 10 4 130.7392
13510_9404220_1_202 648 669 9 6 226.8000
13510_9404220_1_202 648 669 20 5 110.5975
23002_23018235_1_31 167 233 10 4 104.8084
23002_23018235_1_31 167 233 16 6 107.1532
23002_99375235_1_32 0 114 11 5 109.6460
23002_99375235_1_31 233 347 11 5 109.6460
23006_9939420_0_201 340 351 9 6 110.5142
23006_9939420_0_206 337 348 9 6 110.5142
60004_60005901_0_85 300 303 6 6 105.7671
60004_60005901_0_83 318 321 6 6 105.7671
60004_60005901_0_81 300 303 6 6 105.7671
60004_60057901_0_83 313 318 6 6 105.7671
60004_60057901_0_81 295 300 6 6 105.7671
60004_60057901_0_85 295 300 6 6 105.7671
60056_60057901_0_81 290 295 6 6 105.7671
60056_60057901_0_83 308 313 6 6 105.7671
60056_60057901_0_85 290 295 6 6 105.7671
60500_99471923_3_59 9 11 16 5 137.4051
60500_99471992_9_48 93 95 16 5 137.4051
99356_99859992_9_48 31 34 16 5 137.6283
99356_99859992_9_49 31 34 16 5 137.6283
99356_99859923_3_58 33 36 16 5 137.6283
99496_99497235_0_29 504 566 5 6 109.5419

These problems seem to be related to small index changes and negative index changes. In addition, these routes seem to involve loops


In [28]:
shapes_992 <- shapes[shapes$shape_id == "992_9_48",]

In [29]:
qplot(shapes_992$shape_pt_lon, shapes_992$shape_pt_lat)


Future Updates

  • finish data visualization
  • automate static data updating