Analyzing When and Where San Francisco Criminal Arrests Occur Using R and ggplot2

by Max Woolf

This notebook is the complement to my blog posts Analyzing San Francisco Crime Data to Determine When Arrests Frequently Occur and Mapping Where Arrests Frequently Occur in San Francisco Using Crime Data.

This notebook is licensed under the MIT License. If you use the code or data visualization designs contained within this notebook, it would be greatly appreciated if proper attribution is given back to this notebook and/or myself. Thanks! :)


In [1]:
options(warn = -1)

# IMPORTANT: This assumes that all packages in "Rstart.R" are installed,
# and the fonts "Source Sans Pro" and "Open Sans Condensed Bold" are installed
# via extrafont. If ggplot2 charts fail to render, you may need to change/remove the theme call.

source("Rstart.R")
library(ggmap)

options(repr.plot.mimetypes = 'image/png', repr.plot.width = 4, repr.plot.height = 3, repr.plot.res = 300)

sessionInfo()


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Registering fonts with R

Attaching package: ‘scales’

The following objects are masked from ‘package:readr’:

    col_factor, col_numeric

Note: the specification for S3 class “AsIs” in package ‘RJSONIO’ seems equivalent to one from package ‘jsonlite’: not turning on duplicate class definitions for this class.
Out[1]:
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.1 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] ggmap_2.5.2        stringr_1.0.0      digest_0.6.8       RColorBrewer_1.1-2
[5] scales_0.3.0       extrafont_0.17     ggplot2_1.0.1      dplyr_0.4.3       
[9] readr_0.1.1       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.1         plyr_1.8.3          base64enc_0.1-3    
 [4] tools_3.2.2         uuid_0.1-2          jsonlite_0.9.17    
 [7] evaluate_0.8        gtable_0.1.2        lattice_0.20-33    
[10] png_0.1-7           IRdisplay_0.3       DBI_0.3.1          
[13] mapproj_1.2-4       IRkernel_0.5        parallel_3.2.2     
[16] proto_0.3-10        rzmq_0.7.7          Rttf2pt1_1.3.3     
[19] repr_0.4            maps_3.0.0-2        RgoogleMaps_1.2.0.7
[22] R6_2.1.1            jpeg_0.1-8          RJSONIO_1.3-0      
[25] sp_1.2-0            reshape2_1.4.1      extrafontdb_1.0    
[28] magrittr_1.5        MASS_7.3-43         assertthat_0.1     
[31] colorspace_1.2-6    geosphere_1.4-3     stringi_0.5-5      
[34] munsell_0.4.2       rjson_0.2.15       

Processing the Data

We load the data using readr and read_csv() since it's faster. Since there is a lot of redundant data (e.g. address, coordinates), we only load the columns we need.


In [2]:
path <- "~/Downloads/SFPD_Incidents_-_from_1_January_2003.csv"

df <- read_csv(path)


|================================================================================| 100%  360 MB

In [3]:
df %>% head(10)
sprintf("# of Rows in Dataframe: %s", nrow(df))
sprintf("Dataframe Size: %s", format(object.size(df), units = "MB"))


Out[3]:
IncidntNumCategoryDescriptDayOfWeekDateTimePdDistrictResolutionAddressXYLocationPdId
1150996567BURGLARYBURGLARY OF APARTMENT HOUSE, UNLAWFUL ENTRYSunday11/15/201523:58INGLESIDENONE3200 Block of HARRISON ST-122.411537.74605(37.7460485796086, -122.411460219918)1.509966e+13
2156283485LARCENY/THEFTPETTY THEFT OF PROPERTYSunday11/15/201523:30BAYVIEWNONE17TH ST / DEHARO ST-122.401637.76484(37.7648403636386, -122.401600659931)1.562835e+13
3150997195VANDALISMMALICIOUS MISCHIEF, GRAFFITISunday11/15/201523:30PARKNONE2500 Block of 15TH ST-122.437837.76603(37.7660311509137, -122.437848713459)1.509972e+13
4150996501VANDALISMMALICIOUS MISCHIEF, VANDALISM OF VEHICLESSunday11/15/201523:15SOUTHERNNONE1ST ST / FOLSOM ST-122.394537.7873(37.7872982355244, -122.394484874311)1.509965e+13
5150996501SUSPICIOUS OCCSUSPICIOUS OCCURRENCESunday11/15/201523:15SOUTHERNNONE1ST ST / FOLSOM ST-122.394537.7873(37.7872982355244, -122.394484874311)1.509965e+13
6156280936LARCENY/THEFTPETTY THEFT OF PROPERTYSunday11/15/201523:00NORTHERNNONE1800 Block of GEARY BL-122.43237.78425(37.7842501079896, -122.432035315509)1.562809e+13
7150998171VEHICLE THEFTSTOLEN AUTOMOBILESunday11/15/201523:00TARAVALNONE2300 Block of 30TH AV-122.487537.74345(37.7434503392393, -122.487471191928)1.509982e+13
8150996777VANDALISMMALICIOUS MISCHIEF, BREAKING WINDOWSSunday11/15/201523:00CENTRALARREST, BOOKED400 Block of STOCKTON ST-122.40737.78992(37.789918101686, -122.406977563692)1.509968e+13
9151001038VEHICLE THEFTSTOLEN AUTOMOBILESunday11/15/201523:00SOUTHERNNONEHOWARD ST / 9TH ST-122.413237.77499(37.7749926445385, -122.413163134276)1.51001e+13
10150996498ASSAULTAGGRAVATED ASSAULT WITH BODILY FORCESunday11/15/201522:59MISSIONNONE3100 Block of 16TH ST-122.423637.76487(37.7648666651043, -122.423637302048)1.509965e+13
Out[3]:
'# of Rows in Dataframe: 1842050'
Out[3]:
'Dataframe Size: 180.9 Mb'

In [4]:
columns = c("Category", "Descript", "DayOfWeek", "Date", "Time", "PdDistrict", "Resolution", "X", "Y")

# select() requires column indices, so use which() to find them
df <- df %>% select(which(names(df) %in% columns))

df %>% head(10)
sprintf("# of Rows in Dataframe: %s", nrow(df))
sprintf("Dataframe Size: %s", format(object.size(df), units = "MB"))


Out[4]:
CategoryDescriptDayOfWeekDateTimePdDistrictResolutionXY
1BURGLARYBURGLARY OF APARTMENT HOUSE, UNLAWFUL ENTRYSunday11/15/201523:58INGLESIDENONE-122.411537.74605
2LARCENY/THEFTPETTY THEFT OF PROPERTYSunday11/15/201523:30BAYVIEWNONE-122.401637.76484
3VANDALISMMALICIOUS MISCHIEF, GRAFFITISunday11/15/201523:30PARKNONE-122.437837.76603
4VANDALISMMALICIOUS MISCHIEF, VANDALISM OF VEHICLESSunday11/15/201523:15SOUTHERNNONE-122.394537.7873
5SUSPICIOUS OCCSUSPICIOUS OCCURRENCESunday11/15/201523:15SOUTHERNNONE-122.394537.7873
6LARCENY/THEFTPETTY THEFT OF PROPERTYSunday11/15/201523:00NORTHERNNONE-122.43237.78425
7VEHICLE THEFTSTOLEN AUTOMOBILESunday11/15/201523:00TARAVALNONE-122.487537.74345
8VANDALISMMALICIOUS MISCHIEF, BREAKING WINDOWSSunday11/15/201523:00CENTRALARREST, BOOKED-122.40737.78992
9VEHICLE THEFTSTOLEN AUTOMOBILESunday11/15/201523:00SOUTHERNNONE-122.413237.77499
10ASSAULTAGGRAVATED ASSAULT WITH BODILY FORCESunday11/15/201522:59MISSIONNONE-122.423637.76487
Out[4]:
'# of Rows in Dataframe: 1842050'
Out[4]:
'Dataframe Size: 126.9 Mb'

The All-Caps text is ugly: let's force the text in the appropriate columns into proper case. (see this Stack Overflow question)


In [5]:
proper_case <- function(x) {
    return (gsub("\\b([A-Z])([A-Z]+)", "\\U\\1\\L\\2" , x, perl = TRUE))
}

df <- df %>% mutate(Category = proper_case(Category),
                 Descript = proper_case(Descript),
                 PdDistrict = proper_case(PdDistrict),
                 Resolution = proper_case(Resolution))

df %>% head(10)


Out[5]:
CategoryDescriptDayOfWeekDateTimePdDistrictResolutionXY
1BurglaryBurglary Of Apartment House, Unlawful EntrySunday11/15/201523:58InglesideNone-122.411537.74605
2Larceny/TheftPetty Theft Of PropertySunday11/15/201523:30BayviewNone-122.401637.76484
3VandalismMalicious Mischief, GraffitiSunday11/15/201523:30ParkNone-122.437837.76603
4VandalismMalicious Mischief, Vandalism Of VehiclesSunday11/15/201523:15SouthernNone-122.394537.7873
5Suspicious OccSuspicious OccurrenceSunday11/15/201523:15SouthernNone-122.394537.7873
6Larceny/TheftPetty Theft Of PropertySunday11/15/201523:00NorthernNone-122.43237.78425
7Vehicle TheftStolen AutomobileSunday11/15/201523:00TaravalNone-122.487537.74345
8VandalismMalicious Mischief, Breaking WindowsSunday11/15/201523:00CentralArrest, Booked-122.40737.78992
9Vehicle TheftStolen AutomobileSunday11/15/201523:00SouthernNone-122.413237.77499
10AssaultAggravated Assault With Bodily ForceSunday11/15/201522:59MissionNone-122.423637.76487

Filtering the Data

Let's filter df by Arrests to aggregate some intersting statistics.


In [6]:
# grepl() is the best way to do in-text search
df_arrest <- df %>% filter(grepl("Arrest", Resolution))

df_arrest %>% head(10)
sprintf("# of Rows in Dataframe: %s", nrow(df_arrest))
sprintf("Dataframe Size: %s", format(object.size(df_arrest), units = "MB"))


Out[6]:
CategoryDescriptDayOfWeekDateTimePdDistrictResolutionXY
1VandalismMalicious Mischief, Breaking WindowsSunday11/15/201523:00CentralArrest, Booked-122.40737.78992
2AssaultBatterySunday11/15/201522:53NorthernArrest, Booked-122.418737.78501
3AssaultChild Abuse (Physical)Sunday11/15/201522:53NorthernArrest, Booked-122.418737.78501
4Other OffensesDrivers License, Suspended Or RevokedSunday11/15/201522:35SouthernArrest, Booked-122.41237.7809
5Stolen PropertyStolen Property, Possession With Knowledge, ReceivingSunday11/15/201522:20CentralArrest, Booked-122.418537.80615
6Other OffensesTampering With A VehicleSunday11/15/201522:20CentralArrest, Booked-122.418537.80615
7WarrantsEnroute To Department Of CorrectionsSunday11/15/201522:20CentralArrest, Booked-122.418537.80615
8Secondary CodesDomestic ViolenceSunday11/15/201522:00NorthernArrest, Booked-122.438537.79941
9AssaultThreats Against LifeSunday11/15/201522:00NorthernArrest, Booked-122.438537.79941
10Larceny/TheftLost Property, Petty TheftSunday11/15/201521:40CentralArrest, Booked-122.418537.80615
Out[6]:
'# of Rows in Dataframe: 587499'
Out[6]:
'Dataframe Size: 40.7 Mb'

Crime Over Time

Create a chart of crimes over time.


In [7]:
df_arrest_daily <- df_arrest %>%
                    mutate(Date = as.Date(Date, "%m/%d/%Y")) %>%
                    group_by(Date) %>% 
                    summarize(count = n()) %>%
                    arrange(Date)

df_arrest_daily %>% head(10)


Out[7]:
Datecount
12003-01-01172
22003-01-02144
32003-01-03191
42003-01-04123
52003-01-05161
62003-01-06184
72003-01-07181
82003-01-08233
92003-01-09183
102003-01-10135

In [8]:
plot <- ggplot(df_arrest_daily, aes(x = Date, y = count)) +
    geom_line(color = "#F2CA27", size = 0.1) +
    geom_smooth(color = "#1A1A1A") +
    fte_theme() +
    scale_x_date(breaks = date_breaks("2 years"), labels = date_format("%Y")) +
    labs(x = "Date of Arrest", y = "# of Police Arrests", title = "Daily Police Arrests in San Francisco from 2003 – 2015")

max_save(plot, "sf-arrest-when-1", "SF OpenData")


geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

Crime Time Heatmap

Aggregate counts of arrests by Day-of-Week and Time to create heat map. Fortunately, the Day-Of-Week part is pre-derived, but Hour is slightly harder.


In [9]:
# Returns the numeric hour component of a string formatted "HH:MM", e.g. "09:40" input returns 9
get_hour <- function(x) {
    return (as.numeric(strsplit(x,":")[[1]][1]))
}

df_arrest_time <- df_arrest %>%
                    mutate(Hour = sapply(Time, get_hour)) %>%
                    group_by(DayOfWeek, Hour) %>% 
                    summarize(count = n())

df_arrest_time %>% head(10)


Out[9]:
DayOfWeekHourcount
1Friday03670
2Friday12627
3Friday22277
4Friday31399
5Friday4986
6Friday5879
7Friday61294
8Friday72283
9Friday82873
10Friday93227

Reorder and format Factors.


In [10]:
dow_format <- c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday")
hour_format <- c(paste(c(12,1:11),"AM"), paste(c(12,1:11),"PM"))

df_arrest_time$DayOfWeek <- factor(df_arrest_time$DayOfWeek, level = rev(dow_format))
df_arrest_time$Hour <- factor(df_arrest_time$Hour, level = 0:23, label = hour_format)

df_arrest_time %>% head(10)


Out[10]:
DayOfWeekHourcount
1Friday12 AM3670
2Friday1 AM2627
3Friday2 AM2277
4Friday3 AM1399
5Friday4 AM986
6Friday5 AM879
7Friday6 AM1294
8Friday7 AM2283
9Friday8 AM2873
10Friday9 AM3227

In [11]:
plot <- ggplot(df_arrest_time, aes(x = Hour, y = DayOfWeek, fill = count)) +
    geom_tile() +
    fte_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.6), legend.title = element_blank(), legend.position="top", legend.direction="horizontal", legend.key.width=unit(2, "cm"), legend.key.height=unit(0.25, "cm"), legend.margin=unit(-0.5,"cm"), panel.margin=element_blank()) +
    labs(x = "Hour of Arrest (Local Time)", y = "Day of Week of Arrest", title = "# of Police Arrests in San Francisco from 2003 – 2015, by Time of Arrest") +
    scale_fill_gradient(low = "white", high = "#27AE60", labels = comma)

max_save(plot, "sf-arrest-when-2", "SF OpenData", w=6)

Hmm, why is there a surge on Wednesday afternoon, and at 4-5PM on all days? Let's look at subgroups to verify there isn't a latent factor.

Factor by Crime Category

Certain types of crime may be more time dependent. (i.e. more traffic violations when people leave work)


In [12]:
df_top_crimes <- df_arrest %>%
                    group_by(Category) %>% 
                    summarize(count = n()) %>%
                    arrange(desc(count))

df_top_crimes %>% head(20)


Out[12]:
Categorycount
1Other Offenses183156
2Drug/Narcotic98400
3Warrants81426
4Assault56934
5Larceny/Theft31369
6Prostitution14429
7Weapon Laws11674
8Burglary10449
9Trespass10308
10Non-Criminal10046
11Vandalism9280
12Robbery8168
13Stolen Property8042
14Drunkenness7202
15Secondary Codes6960
16Disorderly Conduct5769
17Fraud4849
18Driving Under The Influence4549
19Vehicle Theft4376
20Forgery/Counterfeiting4210

In [13]:
df_arrest_time_crime <- df_arrest %>%
                    filter(Category %in% df_top_crimes$Category[2:19]) %>%
                    mutate(Hour = sapply(Time, get_hour)) %>%
                    group_by(Category, DayOfWeek, Hour) %>% 
                    summarize(count = n())

df_arrest_time_crime$DayOfWeek <- factor(df_arrest_time_crime$DayOfWeek, level = rev(dow_format))
df_arrest_time_crime$Hour <- factor(df_arrest_time_crime$Hour, level = 0:23, label = hour_format)

df_arrest_time_crime %>% head(10)


Out[13]:
CategoryDayOfWeekHourcount
1AssaultFriday12 AM408
2AssaultFriday1 AM341
3AssaultFriday2 AM326
4AssaultFriday3 AM149
5AssaultFriday4 AM105
6AssaultFriday5 AM88
7AssaultFriday6 AM113
8AssaultFriday7 AM193
9AssaultFriday8 AM238
10AssaultFriday9 AM254

In [14]:
plot <- ggplot(df_arrest_time_crime, aes(x = Hour, y = DayOfWeek, fill = count)) +
    geom_tile() +
    fte_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.6, size = 4)) +
    labs(x = "Hour of Arrest (Local Time)", y = "Day of Week of Arrest", title = "# of Police Arrests in San Francisco from 2003 – 2015, by Category and Time of Arrest") +
    scale_fill_gradient(low = "white", high = "#2980B9") +
    facet_wrap(~ Category, nrow = 6)

max_save(plot, "sf-arrest-when-3", "SF OpenData", w = 6, h = 8, tall = T)

Good, but the gradients aren't helpful because they are not normalized. We need to normalize the range on each facet. (unfortunately, this makes the value of the gradient unhelpful)


In [15]:
df_arrest_time_crime <- df_arrest_time_crime %>%
                            group_by(Category) %>%
                            mutate(norm = count/sum(count))

df_arrest_time_crime %>% head(10)


Out[15]:
CategoryDayOfWeekHourcountnorm
1AssaultFriday12 AM4080.007166192
2AssaultFriday1 AM3410.005989391
3AssaultFriday2 AM3260.005725928
4AssaultFriday3 AM1490.002617065
5AssaultFriday4 AM1050.001844241
6AssaultFriday5 AM880.001545649
7AssaultFriday6 AM1130.001984754
8AssaultFriday7 AM1930.00338989
9AssaultFriday8 AM2380.004180279
10AssaultFriday9 AM2540.004461306

In [16]:
plot <- ggplot(df_arrest_time_crime, aes(x = Hour, y = DayOfWeek, fill = norm)) +
    geom_tile() +
    fte_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.6, size = 4)) +
    labs(x = "Hour of Arrest (Local Time)", y = "Day of Week of Arrest", title = "Police Arrests in San Francisco from 2003 – 2015 by Time of Arrest, Normalized by Type of Crime") +
    scale_fill_gradient(low = "white", high = "#2980B9") +
    facet_wrap(~ Category, nrow = 6)

max_save(plot, "sf-arrest-when-4", "SF OpenData", w = 6, h = 8, tall = T)

Much more helpful.

Factor by Police District

Same as above, but with a different facet.


In [17]:
df_arrest_time_district <- df_arrest %>%
                    mutate(Hour = sapply(Time, get_hour)) %>%
                    group_by(PdDistrict, DayOfWeek, Hour) %>% 
                    summarize(count = n()) %>%
                    group_by(PdDistrict) %>%
                    mutate(norm = count/sum(count))

df_arrest_time_district$DayOfWeek <- factor(df_arrest_time_district$DayOfWeek, level = rev(dow_format))
df_arrest_time_district$Hour <- factor(df_arrest_time_district$Hour, level = 0:23, label = hour_format)

df_arrest_time_district %>% head(10)


Out[17]:
PdDistrictDayOfWeekHourcountnorm
1BayviewFriday12 AM3470.005714474
2BayviewFriday1 AM1950.003211304
3BayviewFriday2 AM1510.002486702
4BayviewFriday3 AM900.00148214
5BayviewFriday4 AM1010.001663291
6BayviewFriday5 AM810.001333926
7BayviewFriday6 AM1000.001646822
8BayviewFriday7 AM2260.003721819
9BayviewFriday8 AM3130.005154554
10BayviewFriday9 AM3970.006537885

In [18]:
plot <- ggplot(df_arrest_time_district, aes(x = Hour, y = DayOfWeek, fill = norm)) +
    geom_tile() +
    fte_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.6, size = 4)) +
    labs(x = "Hour of Arrest (Local Time)", y = "Day of Week of Arrest", title = "Police Arrests in San Francisco from 2003 – 2015 by Time of Arrest, Normalized by Station") +
    scale_fill_gradient(low = "white", high = "#8E44AD") +
    facet_wrap(~ PdDistrict, nrow = 5)

max_save(plot, "sf-arrest-when-5", "SF OpenData", w = 6, h = 8, tall = T)

Not helpful either. Meh.

Factor by Month

If crime is tied to activities, the period at which activies end may impact.


In [19]:
df_arrest_time_month <- df_arrest %>%
                    mutate(Month = format(as.Date(Date, "%m/%d/%Y"), "%B"), Hour = sapply(Time, get_hour)) %>%
                    group_by(Month, DayOfWeek, Hour) %>% 
                    summarize(count = n()) %>%
                    group_by(Month) %>%
                    mutate(norm = count/sum(count))

df_arrest_time_month$DayOfWeek <- factor(df_arrest_time_month$DayOfWeek, level = rev(dow_format))
df_arrest_time_month$Hour <- factor(df_arrest_time_month$Hour, level = 0:23, label = hour_format)

df_arrest_time_month %>% head(10)


Out[19]:
MonthDayOfWeekHourcountnorm
1AprilFriday12 AM2920.005988884
2AprilFriday1 AM1870.003835347
3AprilFriday2 AM2090.004286564
4AprilFriday3 AM980.002009968
5AprilFriday4 AM1030.002112517
6AprilFriday5 AM530.001087023
7AprilFriday6 AM1070.002194557
8AprilFriday7 AM1900.003896876
9AprilFriday8 AM2160.004430133
10AprilFriday9 AM2840.005824805

In [20]:
# Set order of month facets by chronological order instead of alphabetical
df_arrest_time_month$Month <- factor(df_arrest_time_month$Month,
                                     level = c("January","February","March","April","May","June","July","August","September","October","November","December"))

plot <- ggplot(df_arrest_time_month, aes(x = Hour, y = DayOfWeek, fill = norm)) +
    geom_tile() +
    fte_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.6, size = 4)) +
    labs(x = "Hour of Arrest (Local Time)", y = "Day of Week of Arrest", title = "Police Arrests in San Francisco from 2003 – 2015 by Time of Arrest, Normalized by Month") +
    scale_fill_gradient(low = "white", high = "#E74C3C") +
    facet_wrap(~ Month, nrow = 4)

max_save(plot, "sf-arrest-when-6", "SF OpenData", w = 6, h = 6, tall = T)

That is not helpful either!

Factor By Year

Perhaps things changed overtime?


In [21]:
df_arrest_time_year <- df_arrest %>%
                    mutate(Year = format(as.Date(Date, "%m/%d/%Y"), "%Y"), Hour = sapply(Time, get_hour)) %>%
                    group_by(Year, DayOfWeek, Hour) %>% 
                    summarize(count = n()) %>%
                    group_by(Year) %>%
                    mutate(norm = count/sum(count))

df_arrest_time_year$DayOfWeek <- factor(df_arrest_time_year$DayOfWeek, level = rev(dow_format))
df_arrest_time_year$Hour <- factor(df_arrest_time_year$Hour, level = 0:23, label = hour_format)

df_arrest_time_year %>% head(10)


Out[21]:
YearDayOfWeekHourcountnorm
12003Friday12 AM2950.005575084
22003Friday1 AM1950.003685225
32003Friday2 AM1810.003420645
42003Friday3 AM1550.002929281
52003Friday4 AM610.001152814
62003Friday5 AM950.001795366
72003Friday6 AM1520.002872586
82003Friday7 AM2400.004535662
92003Friday8 AM2960.005593983
102003Friday9 AM3950.007464943

In [22]:
plot <- ggplot(df_arrest_time_year, aes(x = Hour, y = DayOfWeek, fill = norm)) +
    geom_tile() +
    fte_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.6, size = 4)) +
    labs(x = "Hour of Arrest (Local Time)", y = "Day of Week of Arrest", title = "Police Arrests in San Francisco from 2003 – 2015 by Time of Arrest, Normalized by Year") +
    scale_fill_gradient(low = "white", high = "#E67E22") +
    facet_wrap(~ Year, nrow = 6)

max_save(plot, "sf-arrest-when-7", "SF OpenData", w = 6, h = 6, tall = T)

Ack, not really.

Plot with ggmap

Let's try working with maps. (Ed. Note: Due to their size, the maps will not be embedded directly into the notebook, but they will be available in the repository.}

We can use the CSV output of the Bounding Box Tool to easily choose explicit bounds.


In [23]:
bbox = c(-122.516441,37.702072,-122.37276,37.811818)

# credit to /u/all_genes_considered for map setting suggestion
map <- get_map(location = bbox, source = "stamen", maptype = "toner-lite")


Map from URL : http://tile.stamen.com/toner-lite/13/1308/3165.png
Map from URL : http://tile.stamen.com/toner-lite/13/1309/3165.png
Map from URL : http://tile.stamen.com/toner-lite/13/1310/3165.png
Map from URL : http://tile.stamen.com/toner-lite/13/1311/3165.png
Map from URL : http://tile.stamen.com/toner-lite/13/1308/3166.png
Map from URL : http://tile.stamen.com/toner-lite/13/1309/3166.png
Map from URL : http://tile.stamen.com/toner-lite/13/1310/3166.png
Map from URL : http://tile.stamen.com/toner-lite/13/1311/3166.png
Map from URL : http://tile.stamen.com/toner-lite/13/1308/3167.png
Map from URL : http://tile.stamen.com/toner-lite/13/1309/3167.png
Map from URL : http://tile.stamen.com/toner-lite/13/1310/3167.png
Map from URL : http://tile.stamen.com/toner-lite/13/1311/3167.png
Map from URL : http://tile.stamen.com/toner-lite/13/1308/3168.png
Map from URL : http://tile.stamen.com/toner-lite/13/1309/3168.png
Map from URL : http://tile.stamen.com/toner-lite/13/1310/3168.png
Map from URL : http://tile.stamen.com/toner-lite/13/1311/3168.png

Test map download.


In [64]:
png("sf-arrest-where-0.png", w=900, h=900, res=300)
ggmap(map)
dev.off()


Out[64]:
pdf: 2

The "white space" issue noted in the bootstrap article is still present due to the fixed ratio of the ggmap. You will need to tweak chart dimensions accordingly.


In [24]:
plot <- ggmap(map) +
            geom_point(data = df_arrest, aes(x=X, y=Y), color = "#27AE60", size = 0.5, alpha = 0.01) +
            fte_theme() +
            theme(axis.text.x = element_blank(), axis.text.y = element_blank(), axis.title.x = element_blank(), axis.title.y = element_blank()) +
            theme(plot.margin = unit(c(0.3, 0.3, -0.25, 0), "cm")) +
            labs(title = "Locations of Police Arrests Made in San Francisco from 2003 – 2015")

max_save(plot, "sf-arrest-where-1", "SF OpenData", w = 3.8, h = 4)

We can facet the map by the Type of Crime using facet_wrap. (contrary to notes in the documentation, setting the ggplot as the base_layer is apparently not necessary, and imposes a performance penalty)


In [25]:
plot <- ggmap(map) +
            geom_point(data = df_arrest %>% filter(Category %in% df_top_crimes$Category[2:19]), aes(x=X, y=Y, color=Category), size=0.75, alpha=0.05) +
            fte_theme() +
            theme(axis.text.x = element_blank(), axis.text.y = element_blank(), axis.title.x = element_blank(), axis.title.y = element_blank()) +
            labs(title = "Locations of Police Arrests Made in San Francisco from 2003 – 2015, by Type of Crime") +
            facet_wrap(~ Category, nrow = 3)

max_save(plot, "sf-arrest-where-2", "SF OpenData", w = 14.2, h = 8, tall = T)

Now let's normalize the above plot for each facter, with Hex aggregation.


In [38]:
# Do not show hex if sum is below threshold
sum_thresh <- function(x, threshold = 10^-3) {
    if (sum(x) < threshold) {return (NA)}
    else {return (sum(x))}
}

plot <- ggmap(map) +
            stat_summary_hex(data = df_arrest %>% filter(Category %in% df_top_crimes$Category[2:19]) %>% group_by(Category) %>% mutate(w=1/n()), aes(x=X, y=Y, z=w), fun=sum_thresh, alpha = 0.8, color="#CCCCCC") +
            fte_theme() +
            theme(axis.text.x = element_blank(), axis.text.y = element_blank(), axis.title.x = element_blank(), axis.title.y = element_blank()) +
            scale_fill_gradient(low = "#DDDDDD", high = "#2980B9") +
            labs(title = "Locations of Police Arrests Made in San Francisco from 2003 – 2015, Normalized by Type of Crime") +
            facet_wrap(~ Category, nrow = 3)

max_save(plot, "sf-arrest-where-3", "SF OpenData", w = 14.2, h = 8, tall = T)

Facet by police districts.


In [56]:
plot <- ggmap(map) +
            stat_summary_hex(data = df_arrest %>% group_by(PdDistrict) %>% mutate(w=1/n()), aes(x=X, y=Y, z=w), fun=sum_thresh, alpha = 0.8, color="#CCCCCC") +
            fte_theme() +
            theme(axis.text.x = element_blank(), axis.text.y = element_blank(), axis.title.x = element_blank(), axis.title.y = element_blank()) +
            scale_fill_gradient(low = "#DDDDDD", high = "#8E44AD") +
            labs(title = "Locations of Police Arrests Made in San Francisco from 2003 – 2015, Normalized by Police District") +
            facet_wrap(~ PdDistrict, nrow = 2)

max_save(plot, "sf-arrest-where-4", "SF OpenData", w = 13, h = 6, tall = T)

Facet by months. (The raw month must be appended to the original df_arrest data frame now)


In [55]:
df_arrest <- df_arrest %>% mutate(Month=format(as.Date(Date, "%m/%d/%Y"), "%B"))
df_arrest$Month <- factor(df_arrest$Month,
                                     level = c("January","February","March","April","May","June","July","August","September","October","November","December"))

plot <- ggmap(map) +
            stat_summary_hex(data = df_arrest %>% group_by(Month) %>% mutate(w=1/n()), aes(x=X, y=Y, z=w), fun=sum_thresh, alpha = 0.8, color="#CCCCCC") +
            fte_theme() +
            theme(axis.text.x = element_blank(), axis.text.y = element_blank(), axis.title.x = element_blank(), axis.title.y = element_blank()) +
            scale_fill_gradient(low = "#DDDDDD", high = "#E74C3C") +
            labs(title = "Locations of Police Arrests Made in San Francisco from 2003 – 2015, Normalized by Month") +
            facet_wrap(~ Month, nrow=2)

max_save(plot, "sf-arrest-where-5", "SF OpenData", w=13, h=5, tall=T)

Facet by year.


In [54]:
df_arrest <- df_arrest %>% mutate(Year=format(as.Date(Date, "%m/%d/%Y"), "%Y"))

plot <- ggmap(map) +
            stat_summary_hex(data=df_arrest %>% group_by(Year) %>% mutate(w=1/n()), aes(x=X, y=Y, z=w), fun=sum_thresh, alpha = 0.8, color="#CCCCCC") +
            fte_theme() +
            theme(axis.text.x = element_blank(), axis.text.y = element_blank(), axis.title.x = element_blank(), axis.title.y = element_blank()) +
            scale_fill_gradient(low = "#DDDDDD", high = "#E67E22") +
            labs(title = "Locations of Police Arrests Made in San Francisco from 2003 – 2015, Normalized by Year") +
            facet_wrap(~ Year, nrow=2)

max_save(plot, "sf-arrest-where-6", "SF OpenData", w=10.5, h=4)

Facet by hour of day.


In [53]:
df_arrest <- df_arrest %>% mutate(Hour = sapply(Time, get_hour))
df_arrest$Hour <- factor(df_arrest$Hour, level = 0:23, label = hour_format)

plot <- ggmap(map) +
            stat_summary_hex(data=df_arrest %>% group_by(Hour) %>% mutate(w=1/n()), aes(x=X, y=Y, z=w), fun=sum_thresh, alpha = 0.8, color="#CCCCCC") +
            fte_theme() +
            theme(axis.text.x = element_blank(), axis.text.y = element_blank(), axis.title.x = element_blank(), axis.title.y = element_blank()) +
            scale_fill_gradient(low = "#DDDDDD", high = "#1ABC9C") +
            labs(title = "Locations of Police Arrests Made in San Francisco from 2003 – 2015, Normalized by Hour") +
            facet_wrap(~ Hour, nrow=4)

max_save(plot, "sf-arrest-where-7", "SF OpenData", w=10.5, h=8, tall=T)

Facet by Day of Week


In [40]:
df_arrest$DayOfWeek <- factor(df_arrest$DayOfWeek, level = dow_format)

plot <- ggmap(map) +
            stat_summary_hex(data=df_arrest %>% group_by(DayOfWeek) %>% mutate(w=1/n()), aes(x=X, y=Y, z=w), fun=sum_thresh, alpha = 0.8, color="#CCCCCC") +
            fte_theme() +
            theme(axis.text.x = element_blank(), axis.text.y = element_blank(), axis.title.x = element_blank(), axis.title.y = element_blank()) +
            scale_fill_gradient(low = "#DDDDDD", high = "#16A085") +
            labs(title = "Locations of Police Arrests Made in San Francisco from 2003 – 2015, Normalized by Day of Week") +
            facet_wrap(~ DayOfWeek, nrow=2)

max_save(plot, "sf-arrest-where-8", "SF OpenData", w=10.5, h=6, tall=T)

Bonus: Does Social Security payments lead to arrests?

Followup analysis to /u/NowProveIt's comment on Reddit suggesting that SSI payments lead to higher activity on Wednesday. Here's a code fragment to create a data frame of Wednesdays and their month-wise ordinals.


In [31]:
start_date <- "2003-01-01"
end_date <- "2015-11-15"

# Create a vector of all days between start and end date
days <- seq(as.Date(start_date), as.Date(end_date), "days")

df_dates <- tbl_df(data.frame(Date = days)) %>%
                mutate(weekday = format(Date, "%A"),
                       month = format(Date, "%B"),
                       year = format(Date, "%Y"))

df_dates %>% head(10)


Out[31]:
Dateweekdaymonthyear
12003-01-01WednesdayJanuary2003
22003-01-02ThursdayJanuary2003
32003-01-03FridayJanuary2003
42003-01-04SaturdayJanuary2003
52003-01-05SundayJanuary2003
62003-01-06MondayJanuary2003
72003-01-07TuesdayJanuary2003
82003-01-08WednesdayJanuary2003
92003-01-09ThursdayJanuary2003
102003-01-10FridayJanuary2003

Use window function shennanigans to get the ordinal ranks.


In [32]:
# Text values to replace numeric ordinals
ordinals <- c("First", "Second", "Third", "Fourth")

df_dates <- df_dates %>%
                filter(weekday == "Wednesday") %>%
                group_by(year, month) %>%
                mutate(rank = rank(Date)) %>%
                filter(rank <= 4) %>% # removes the rare 5th Wednesday
                mutate(Date = format(Date, format = "%m/%d/%Y"),  # needs to be proper format for merging
                       rank = factor(rank, levels = 1:4, labels = ordinals),
                       ordinal = paste(rank, weekday))

df_dates %>% head(10)


Out[32]:
Dateweekdaymonthyearrankordinal
101/01/2003WednesdayJanuary2003FirstFirst Wednesday
201/08/2003WednesdayJanuary2003SecondSecond Wednesday
301/15/2003WednesdayJanuary2003ThirdThird Wednesday
401/22/2003WednesdayJanuary2003FourthFourth Wednesday
502/05/2003WednesdayFebruary2003FirstFirst Wednesday
602/12/2003WednesdayFebruary2003SecondSecond Wednesday
702/19/2003WednesdayFebruary2003ThirdThird Wednesday
802/26/2003WednesdayFebruary2003FourthFourth Wednesday
903/05/2003WednesdayMarch2003FirstFirst Wednesday
1003/12/2003WednesdayMarch2003SecondSecond Wednesday

Combine with the arrest data frame.


In [49]:
df_arrest_wed <- df_arrest %>%
                filter(DayOfWeek == "Wednesday") %>%
                inner_join(df_dates) %>%
                select(Date, Time, X, Y, ordinal)

sprintf("NA values present from Merge: %s", sum(is.na(df_arrest_wed %>% select(ordinal))) > 0)
set.seed(42)
df_arrest_wed %>% sample_n(10)


Joining by: "Date"
Out[49]:
'NA values present from Merge: FALSE'
Out[49]:
DateTimeXYordinal
111/26/200311:15-122.410637.7825Fourth Wednesday
209/03/200322:17-122.389137.71968First Wednesday
309/14/201123:24-122.410837.78321Second Wednesday
411/10/200416:00-122.416237.76363Second Wednesday
506/06/200720:09-122.416437.7816First Wednesday
612/10/200815:52-122.41137.78414Second Wednesday
702/22/200610:55-122.408537.77376Fourth Wednesday
810/23/201323:30-122.429637.76775Fourth Wednesday
903/28/200710:25-122.446637.78225Fourth Wednesday
1008/02/200617:52-122.412437.783First Wednesday

In [50]:
df_arrests_ord <- df_arrest_wed %>%
                    mutate(Hour = sapply(Time, get_hour)) %>%
                    group_by(ordinal, Hour) %>%
                    summarize(count = n())

df_arrests_ord %>% head(10)


Out[50]:
ordinalHourcount
1First Wednesday0783
2First Wednesday1581
3First Wednesday2460
4First Wednesday3363
5First Wednesday4247
6First Wednesday5282
7First Wednesday6356
8First Wednesday7621
9First Wednesday8795
10First Wednesday9831

In [65]:
df_arrests_ord$ordinal <- factor(df_arrests_ord$ordinal, levels = c("First Wednesday", "Second Wednesday", "Third Wednesday", "Fourth Wednesday"))

plot <- ggplot(df_arrests_ord, aes(x = Hour, y = count, color = ordinal)) +
    geom_line() +
    fte_theme() +
    scale_x_continuous(breaks = c(0,4,8,12,16,20), labels = c("12 AM", "4 AM", "8 AM", "12 PM", "4 PM", "8 PM")) +
    scale_y_continuous(labels = comma) +
    theme(legend.title = element_blank(), legend.position="top", legend.direction="horizontal", legend.key.height=unit(0.25, "cm"), legend.margin=unit(-0.5,"cm")) +
    labs(x = "Hour of Arrest (Local Time)", y = "Total # of Arrests by Hour", title = "# of Police Arrests in San Francisco from 2003 – 2015 on Wednesdays, by Hour")

max_save(plot, "ssi-crime-1", "SF OpenData", w = 5)

Create a normallized map of the Wednesdays, because why not?


In [52]:
df_arrest_wed$ordinal <- factor(df_arrest_wed$ordinal, levels = c("First Wednesday", "Second Wednesday", "Third Wednesday", "Fourth Wednesday"))

plot <- ggmap(map) +
            stat_summary_hex(data=df_arrest_wed %>% group_by(ordinal) %>% mutate(w=1/n()), aes(x=X, y=Y, z=w), fun=sum_thresh, alpha = 0.8, color="#CCCCCC") +
            fte_theme() +
            theme(axis.text.x = element_blank(), axis.text.y = element_blank(), axis.title.x = element_blank(), axis.title.y = element_blank()) +
            scale_fill_gradient(low = "#DDDDDD", high = "#E74C3C") +
            labs(title = "Locations of Police Arrests Made in San Francisco from 2003 – 2015, Normalized by # Wednesday") +
            facet_wrap(~ ordinal, nrow=2)

max_save(plot, "ssi-crime-2", "SF OpenData", w = 5.5, h = 6, tall=T)

The MIT License (MIT)

Copyright (c) 2015 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.