What we will get: Basic statistics on retired and active officers, with a plot of their age v. years of service How will we get there: First, explore the data to get a sense of what is in here, then narrow in on retired officers and what that can tell us about active ones. Finally, graph it out! Note that for the final story, the numbers are used to paint a picture of the age and service of active members in the force, not for predictive purposes. In this script, we go into what totals we get if we engage on more educated guessing (based on PABF information).
In [ ]:
library(dplyr)
library(lubridate)
library(ggplot2)
In [2]:
data <- read.csv('audit.csv')
# rename df columns
data$termination <- data$Date.of.Termination
data$dohire <- data$Date.of.Hire.based.on.Service
data$rank <- data$Rank.Code
data$birth<-data$Year.of.Birth
In [ ]:
# number of NAs in termination date
sum(is.na(as.Date(data$termination,"%m/%d/%y")))
In [4]:
sort(table(format(as.Date(data$termination,"%m/%d/%y", na.rm = TRUE),"%Y")),decreasing=TRUE)
Out[4]:
In [5]:
# get frequencies by month of termination
sort(table(format(as.Date(data$termination,"%m/%d/%y", na.rm = TRUE),"%B")),decreasing=TRUE)
Out[5]:
In [9]:
# get frequencies for year of hire
yoh <- table(format(as.Date(data$dohire,"%m/%d/%y", na.rm = TRUE),"%Y"))
sort(yoh, decreasing=TRUE)
Out[9]:
In [ ]:
# breakdown of types in fund data
table(data$Type)
In [10]:
retired <- data[data$Type == 'Service Retirement Annuities', ] # n = 2,123
active <- data[data$Type == 'Active Officer', ] # n = 12,061
disability <- data[data$Type == 'Disability Benefit Recipient', ] # n = 306
inactive <- data[data$Type == 'Total Inactive Participants', ] # n = 237
In [13]:
nrow(active) # sanity check
Out[13]:
Let's analyze the data on the officers whose status is "Service Retirement Annuities." We do not include other types of benefits-receiving officers to be as conservative as possible for the next summary statistics.
In [ ]:
# Question: How many active officers had a start-dates by service in the last 5 years?
active_hired <- as.Date(active$Date.of.Hire.based.on.Service, '%m/%d/%y')
format(active_hired, "%Y")
table(format(active_hired, "%Y"))
In [ ]:
# Question: Of retired officers, what were their termination year?
retired_term <- as.Date(retired$Date.of.Termination, '%m/%d/%y')
format(retired_term, "%Y")
table(format(retired_hired, "%Y"))
In [ ]:
# Question: What is the average no of years of service for retired members? What is the ave year of birth?
retired$birth<-retired$Year.of.Birth
retired$age <- 2016 - retired$birth
mean(retired$age)
mean(retired$Service.Years, na.rm=TRUE)
summary(retired$birth)
In [ ]:
# Question: What is the number of active officers with 50+ years of age, who have served more than the ave. of retiree's total service
# To get how many officers have served >= 26 years, the average, skip the filtering by age.
# Similarly, for a raw count of officers who are eligible for "55 and out," calculate with the age column
# Number of active members, age 50 or older
active$birth<-active$Year.of.Birth
active$age <- 2016 - active$birth
active_over50<-active[active$age >=50,]
In [ ]:
nrow(active_over50)
In [ ]:
service26 <-active_over50[active_over50$Service.Years>=26,]
In [ ]:
nrow(service26)
In [ ]:
# Question: What is the number of active officers who are eligible to retire?
# Note that these estimates are chosen based on the presentation that the PABF gives all officers
# who qualify for retirement.
# 1) By having served 29+1d years
# 1a) and are also >= 50 years old
# 2) By having served 20+ years
# 2a) and are also >= 50 years old
In [ ]:
# 1) 29 years and + 1day
service29 <-active[active$Service.Years>=29,]
service29_plus_a_day <- service29[service29$Service.Days>=1,]
In [ ]:
# 1a)
nrow(service29_plus_a_day[service29_plus_a_day$age>=50,])
In [ ]:
# 2) 20 years
service20 <-active[active$Service.Years>=20,]
nrow(service20)
In [ ]:
# 2a)
nrow(service20[service20$age>=50,])
In [ ]:
# Born on or before 1955
nrow(active[active$birth<=1955,]) # n = 240
In [ ]:
# Born on or before 1955 AND service of 29+ years
nrow(service29_plus_a_day[service29_plus_a_day$birth<=1955,]) # 55
In [ ]:
# we add day, month and year to get a better estimate of time served
# lubridate is good for this
d <- days(active$Service.Days)
m <- months(active$Service.Months)
y <- years(active$Service.Years)
active$time_served <- y+m+d
active$duration <- as.duration(active$time_served) / dweeks(1)
active$yduration <- as.duration(active$time_served) / dyears(1)
In [ ]:
library(hexbin)
# hexbin is simple to use but not as straight-forward to manipulate axes and colors
hbin <- hexbin(active$yduration, active$age, xbins = 40)
plot(hbin)
In [ ]:
# For a more polished graph, use ggplot
In [ ]:
hex <- ggplot(active,aes(x=yduration,y=age)) + geom_hex(bins=30) +
geom_hline(yintercept=55) + geom_vline(xintercept = 26) +
theme_classic() + scale_fill_gradient(low = "white", high = "black")
In [ ]:
# format ticks and add highlight-lines where we want them
ticks_hex <- hex +
scale_x_continuous(breaks=seq(0,max(active$yduration, na.rm=TRUE),3))+
scale_y_continuous(breaks=seq(0,max(active$age, na.rm=TRUE),5))
In [ ]:
# export to .eps because Illustrator is iffy with svg
postscript("hex_bin.eps")
plot(ticks_hex, xlab = 'Years of Service', ylab = 'Age', color=TRUE)
dev.off()