How to Visualize New York City Using Taxi Location Data and ggplot2

by Max Woolf

This notebook is the complement to my blog post How to Visualize New York City Using Taxi Location Data and ggplot2.

This notebook is licensed under the MIT License. If you use the code or data visualization designs contained within this notebook, it would be greatly appreciated if proper attribution is given back to this notebook and/or myself. Thanks! :)


In [1]:
options(warn=-1)

# IMPORTANT: This assumes that all packages in "Rstart.R" are installed,
# and the fonts "Source Sans Pro" and "Open Sans Condensed Bold" are installed
# via extrafont. If ggplot2 charts fail to render, you may need to change/remove the theme call.

source("Rstart.R")
library(bigrquery)
library(methods) # needed for query_exec in Jupyter: https://github.com/hadley/bigrquery/issues/32

options(repr.plot.mimetypes = 'image/png', repr.plot.width=4, repr.plot.height=3, repr.plot.res=300)

sessionInfo()


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Registering fonts with R

Attaching package: ‘scales’

The following objects are masked from ‘package:readr’:

    col_factor, col_numeric

Out[1]:
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.1 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] bigrquery_0.1.0    stringr_1.0.0      digest_0.6.8       RColorBrewer_1.1-2
[5] scales_0.3.0       extrafont_0.17     ggplot2_1.0.1      dplyr_0.4.3       
[9] readr_0.1.1       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.1      Rttf2pt1_1.3.3   magrittr_1.5     MASS_7.3-43     
 [5] munsell_0.4.2    uuid_0.1-2       colorspace_1.2-6 R6_2.1.1        
 [9] httr_1.0.0       plyr_1.8.3       tools_3.2.2      parallel_3.2.2  
[13] gtable_0.1.2     DBI_0.3.1        extrafontdb_1.0  assertthat_0.1  
[17] IRdisplay_0.3    reshape2_1.4.1   repr_0.4         base64enc_0.1-3 
[21] IRkernel_0.5     evaluate_0.8     rzmq_0.7.7       stringi_0.5-5   
[25] jsonlite_0.9.17  proto_0.3-10    

bigrquery

This uses the bigrquery R package to query the data. Ensure that it is set up correctly, with your own project name from BigQuery.


In [2]:
project_id <- "<FILL IN>"   # your own BigQuery project ID. DO NOT SHARE!

Getting the Data via BigQuery

Build the query and execute it. The query may take a couple of minutes to run, plus about 15 minutes to retrieve the results! It also uses a query optimization recommended by Felipe Hoffa.


In [3]:
query <- "SELECT ROUND(pickup_latitude, 4) AS lat,
ROUND(pickup_longitude, 4) AS long,
COUNT(*) AS num_pickups,
SUM(fare_amount) AS total_revenue
FROM [nyc-tlc:yellow.trips]
WHERE fare_amount/trip_distance BETWEEN 2 AND 10
GROUP BY lat, long"

df <- tbl_df(query_exec(query, project=project_id, max_pages=Inf))


Retrieving data: 853.8s
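
To make the query's logic concrete, here is a minimal local sketch of the same aggregation in base R: filter on fare per mile, round the coordinates to 4 decimal places, then count pickups and sum fares per rounded location. The `trips` data frame and its values are made up for illustration; the real work happens in BigQuery.

```r
# Toy stand-in for a few raw trips; the values are made up for illustration.
trips <- data.frame(
  pickup_latitude  = c(40.75081, 40.75079, 40.75084, 40.77172),
  pickup_longitude = c(-73.99162, -73.99158, -73.99161, -73.95907),
  fare_amount      = c(10.0, 12.5, 8.0, 20.0),
  trip_distance    = c(2.0, 2.5, 1.6, 4.0)
)

# Same filter as the WHERE clause: fare per unit distance between 2 and 10.
rate  <- trips$fare_amount / trips$trip_distance
trips <- trips[rate >= 2 & rate <= 10, ]

# Round coordinates to 4 decimal places, like ROUND(..., 4) in the query.
trips$lat  <- round(trips$pickup_latitude, 4)
trips$long <- round(trips$pickup_longitude, 4)

# One row per (lat, long) pair, with a pickup count and summed fares.
trips$num_pickups <- 1
grid <- aggregate(cbind(num_pickups, fare_amount) ~ lat + long,
                  data = trips, FUN = sum)
names(grid)[names(grid) == "fare_amount"] <- "total_revenue"
print(grid)
```

The first three toy trips collapse into a single rounded location, which is exactly how the query shrinks the raw trip table down to something R can hold in memory.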

Examine the data and cache it locally. (Comment/uncomment the read/write lines as appropriate.)


In [2]:
df <- read.csv("nyc-taxi-data.csv", header=T)

df %>% head(10)

sprintf("# of Rows in Dataframe: %s", nrow(df))
sprintf("Dataframe Size: %s", format(object.size(df), units = "MB"))

# write.csv(df, "nyc-taxi-data.csv", row.names=F)


Out[2]:
        lat     long num_pickups total_revenue
1   40.7772 -73.9552       12181      115615.1
2  -73.9968  40.7372          10         101.5
3   40.7508 -73.9916        7140       70124.4
4   40.7574 -73.9724        4025       42911.6
5   40.7904 -73.9769        8748       88959.9
6   40.7182 -73.9893        3829       44125.6
7   40.7121 -73.9591        5271       66410.8
8   40.7717 -73.9591       16957      137926.3
9   40.6884 -73.9805        4870       67445.5
10  40.7551 -73.9715       11107      110735.7
Out[2]:
'# of Rows in Dataframe: 4457474'
Out[2]:
'Dataframe Size: 119 Mb'
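
If toggling comments gets tedious, the read/write step can be wrapped in a small helper. This is only a sketch: `load_or_fetch` and `fake_query` are hypothetical names, and `fake_query` stands in for the slow `query_exec` call with made-up rows.

```r
# Hypothetical helper: return the cached CSV if it exists; otherwise call
# `fetch` (a function that produces the data frame) and cache its result.
load_or_fetch <- function(cache_file, fetch) {
  if (file.exists(cache_file)) {
    return(read.csv(cache_file, header = TRUE))
  }
  df <- fetch()
  write.csv(df, cache_file, row.names = FALSE)
  df
}

# Stand-in for the slow BigQuery call; the rows here are made up.
fake_query <- function() {
  data.frame(lat = c(40.7508, 40.7717), long = c(-73.9916, -73.9591),
             num_pickups = c(3L, 1L), total_revenue = c(30.5, 20))
}

cache_file <- file.path(tempdir(), "nyc-taxi-data-example.csv")
unlink(cache_file)  # start from a clean slate for the demo

first  <- load_or_fetch(cache_file, fake_query)  # runs fake_query, writes the CSV
second <- load_or_fetch(cache_file, fake_query)  # reads the CSV instead
```

The second call never touches `fake_query`, so with the real data you would only pay for the 15-minute retrieval once.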

In [3]:
min_lat <- 40.5774
max_lat <- 40.9176
min_long <- -74.15
max_long <- -73.7004

Build the theme for the map, stripping out most of the grid and replacing it with black.


In [4]:
theme_map_dark <- function(palate_color = "Greys") {

  palate <- brewer.pal(palate_color, n=9)
  color.background = "black"
  color.grid.minor = "black"
  color.grid.major = "black"
  color.axis.text = palate[1]
  color.axis.title = palate[1]
  color.title = palate[1]

  font.title <- "Source Sans Pro"
  font.axis <- "Open Sans Condensed Bold"

  theme_bw(base_size=5) +
    theme(panel.background=element_rect(fill=color.background, color=color.background)) +
    theme(plot.background=element_rect(fill=color.background, color=color.background)) +
    theme(panel.border=element_rect(color=color.background)) +
    theme(panel.grid.major=element_blank()) +
    theme(panel.grid.minor=element_blank()) +
    theme(axis.ticks=element_blank()) +
    theme(legend.background = element_rect(fill=color.background)) +
    theme(legend.text = element_text(size=3,colour=color.axis.title,family=font.axis)) +
    theme(legend.title = element_blank(), legend.position="top", legend.direction="horizontal") +
    theme(legend.key.width=unit(1, "cm"), legend.key.height=unit(0.25, "cm"), legend.margin=unit(-0.5,"cm")) +
    theme(plot.title=element_text(colour=color.title,family=font.title, size=5)) +
    theme(axis.text.x=element_blank()) +
    theme(axis.text.y=element_blank()) +
    theme(axis.title.y=element_blank()) +
    theme(axis.title.x=element_blank()) +
    theme(plot.margin = unit(c(0.0, -0.5, -1, -0.75), "cm")) +
    theme(strip.background = element_rect(fill=color.background, color=color.background),strip.text=element_text(size=7,colour=color.axis.title,family=font.title))

}

Plotting NYC

Now we can plot NYC! Let's test using the most basic ggplot2 plot possible.


In [32]:
plot <- ggplot(df, aes(x=long, y=lat)) +
            geom_point(size=0.06)

png("nyc-taxi-1.png", w=600, h=600)
plot
dev.off()


Out[32]:
pdf: 2

Latitude and Longitude in the thousands? That's definitely not right.

Let's force the bounding box.


In [33]:
plot <- ggplot(df, aes(x=long, y=lat)) +
            geom_point(size=0.06) +
            scale_x_continuous(limits=c(min_long, max_long)) +
            scale_y_continuous(limits=c(min_lat, max_lat))

png("nyc-taxi-2.png", w=600, h=600)
plot
dev.off()


Out[33]:
pdf: 2
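
Setting `limits` on a continuous scale makes ggplot2 drop the points that fall outside them. A rough base R equivalent is to filter the data frame against the bounding box before plotting. The sketch below uses a handful of illustrative rows: one has its coordinates swapped (like row 2 of the `head()` output earlier), and the `(0, 0)` row is a made-up example of a junk coordinate.

```r
# Bounding box for NYC, same values as defined earlier in the notebook.
min_lat  <- 40.5774
max_lat  <- 40.9176
min_long <- -74.15
max_long <- -73.7004

# Illustrative rows shaped like df: two valid points, one with lat/long
# swapped, and one made-up (0, 0) junk coordinate.
pts <- data.frame(
  lat  = c(40.7772, -73.9968, 40.7508, 0),
  long = c(-73.9552, 40.7372, -73.9916, 0)
)

# TRUE for rows that fall inside the bounding box on both axes.
in_box <- pts$lat  >= min_lat  & pts$lat  <= max_lat &
          pts$long >= min_long & pts$long <= max_long

pts_nyc <- pts[in_box, ]
print(pts_nyc)
```

Only the two valid points survive. Filtering up front also means ggplot2 has fewer rows to process, which may matter with several million points.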

Better. Let's apply the theme, drop locations with 10 or fewer pickups, and render at 300 dpi to reduce aliasing.


In [34]:
plot <- ggplot(df %>% filter(num_pickups > 10), aes(x=long, y=lat)) +
            geom_point(color="white", size=0.06) +
            scale_x_continuous(limits=c(min_long, max_long)) +
            scale_y_continuous(limits=c(min_lat, max_lat)) +
            theme_map_dark()

png("nyc-taxi-3.png", w=600, h=600, res=300)
plot
dev.off()


Out[34]:
pdf: 2

Even better. Make the final improvements and annotations to the chart.


In [49]:
plot <- ggplot(df %>% filter(num_pickups > 10), aes(x=long, y=lat, color=num_pickups)) +
            geom_point(size=0.06) +
            scale_x_continuous(limits=c(min_long, max_long)) +
            scale_y_continuous(limits=c(min_lat, max_lat)) +
            theme_map_dark() +
            scale_color_gradient(low="#CCCCCC", high="#8E44AD", trans="log") +
            labs(title = "Map of NYC, Plotted Using Locations Of All Yellow Taxi Pickups") +
            theme(legend.position="none") +
            coord_equal()

png("nyc-taxi-4.png", w=600, h=600, res=300)
plot
dev.off()


Out[49]:
pdf: 2
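
Why `trans="log"` on the color scale? Pickup counts are heavily skewed, so on a linear scale nearly every point would land at the low end of the gradient. The sketch below is a rough illustration of the idea (not ggplot2's actual internals): rescale some made-up counts to the 0-1 range, with and without taking the log first. `rescale01` is a hypothetical helper.

```r
# Made-up pickup counts spanning several orders of magnitude.
counts <- c(11, 100, 1000, 10000, 100000)

# Hypothetical helper: map values linearly onto the 0-1 range.
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

linear_pos <- rescale01(counts)       # position on a linear gradient
log_pos    <- rescale01(log(counts))  # position on a log-transformed gradient

print(round(data.frame(counts, linear_pos, log_pos), 3))
```

On the linear scale, everything up to 1,000 pickups sits within the bottom 1% of the gradient and would look the same color; after the log transform, the same counts spread out across the gradient.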

coord_equal results in white space above the chart. Here's a (failed) attempt to fix it using tools from the grid package.


In [88]:
#map_save <- function(filename, plot) {
#    vp = viewport(grid.layout(1, 1))
#    
#    png(filename, w=600, h=600, res=300)
#    grid.newpage()
#    print(plot, vp=vp)
#    grid.rect(gp=gpar(fill="black", col="black"))
#    pushViewport(viewport(grid.layout(1, 1)))
#    print(plot, vp=vp)
#    dev.off()
#}

Add hex bins on top of the map we've generated.


In [23]:
plot <- ggplot(df %>% filter(num_pickups > 20), aes(x=long, y=lat, z=total_revenue)) +
            geom_point(size=0.06, color="#999999") +
            stat_summary_hex(fun = sum, bins=100, alpha=0.7) +
            scale_x_continuous(limits=c(min_long, max_long)) +
            scale_y_continuous(limits=c(min_lat, max_lat)) +
            theme_map_dark() +
            scale_fill_gradient(low="#CCCCCC", high="#27AE60", labels=dollar) +
            labs(title = "Total Revenue for NYC Yellow Taxis by Pickup Location, from Jan 2009 ― June 2015") +
            coord_equal()

png("nyc-taxi-5.png", w=950, h=860, res=300)
plot
dev.off()


Out[23]:
pdf: 2

This conveys the information accurately, but the aesthetics could be improved. One last try:


In [42]:
# Helper function to make bins not show if total revenue is below threshold
total_rev <- function(x, threshold = 10^5) {
    if (sum(x) < threshold) {return (NA)}
    else {return (sum(x))}
}

plot <- ggplot(df %>% filter(num_pickups > 10), aes(x=long, y=lat, z=total_revenue)) +
            geom_point(size=0.06, color="#999999") +
            stat_summary_hex(fun = total_rev, bins=100, alpha=0.5) +
            scale_x_continuous(limits=c(-74.0224, -73.8521)) +
            scale_y_continuous(limits=c(40.6959, 40.8348)) +
            theme_map_dark() +
            scale_fill_gradient(low="#FFFFFF", high="#E74C3C", labels=dollar, trans="log", breaks=c(10^(6:8))) +
            labs(title = "Total Revenue for NYC Yellow Taxis by Pickup Location, from Jan 2009 ― June 2015") +
            coord_equal()

png("nyc-taxi-6.png", w=900, h=900, res=300)
plot
dev.off()


Out[42]:
pdf: 2
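
To see what the `total_rev` helper does on its own, here it is applied to two made-up bins. `stat_summary_hex` calls the function once per hexagon with the `z` values (here, `total_revenue`) of the points that fall inside it; the vectors below are illustrative stand-ins for those values.

```r
# Same helper as in the cell above: the sum of x, or NA if it is below threshold.
total_rev <- function(x, threshold = 10^5) {
    if (sum(x) < threshold) {return (NA)}
    else {return (sum(x))}
}

busy_bin  <- c(60000, 55000)  # made-up revenues; they sum to 115000
quiet_bin <- c(1200, 800)     # made-up revenues; they sum to 2000

print(total_rev(busy_bin))    # at or above 10^5, so the sum is returned
print(total_rev(quiet_bin))   # below 10^5, so NA is returned

# A lower threshold lets the quiet bin through.
print(total_rev(quiet_bin, threshold = 1000))
```

Returning `NA` rather than `0` matters here: the fill scale uses `trans="log"`, and a zero would not have a finite log.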

The MIT License (MIT)

Copyright (c) 2015 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.