Pension Fund Data analysis

What we will get: Basic statistics on retired and active officers, with a plot of their age v. years of service How will we get there: First, explore the data to get a sense of what is in here, then narrow in on retired officers and what that can tell us about active ones. Finally, graph it out! Note that for the final story, the numbers are used to paint a picture of the age and service of active members in the force, not for predictive purposes. In this script, we go into what totals we get if we engage on more educated guessing (based on PABF information).


In [ ]:
library(dplyr)
library(lubridate)
library(ggplot2)

In [2]:
data <- read.csv('audit.csv')

# rename df columns
data$termination <- data$Date.of.Termination
data$dohire <- data$Date.of.Hire.based.on.Service
data$rank <- data$Rank.Code
data$birth<-data$Year.of.Birth

In [ ]:
# number of NAs in termination date
sum(is.na(as.Date(data$termination,"%m/%d/%y")))

In [4]:
sort(table(format(as.Date(data$termination,"%m/%d/%y", na.rm = TRUE),"%Y")),decreasing=TRUE)


Out[4]:
2012 2011 2015 2013 2014 2010 2008 2005 2007 2009 1999 2002 1998 2006 1991 2001 
 576  483  435  403  388   11    9    7    7    7    4    4    3    3    2    2 
1989 1990 1992 1993 1994 2000 2004 
   1    1    1    1    1    1    1 

In [5]:
# get frequencies by month of termination
sort(table(format(as.Date(data$termination,"%m/%d/%y", na.rm = TRUE),"%B")),decreasing=TRUE)


Out[5]:
 November   January     April      June  February    August       May  December 
      325       286       225       190       186       185       180       179 
     July     March September   October 
      165       159       149       122 

In [9]:
# get frequencies for year of hire
yoh <- table(format(as.Date(data$dohire,"%m/%d/%y", na.rm = TRUE),"%Y"))
sort(yoh, decreasing=TRUE)


Out[9]:
1994 1995 1986 1991 1999 1998 2000 2006 2013 2001 1996 1997 1990 2002 2005 2004 
 790  785  782  782  754  700  658  644  576  535  526  493  490  466  447  442 
2015 2012 2003 1992 2007 1987 1993 2014 1985 1982 1988 2010 1989 2009 2008 1980 
 441  430  406  399  349  339  326  305  268  239  223  169  147  129  113   79 
1981 1983 1977 1973 2011 1976 1972 1970 1978 1971 1975 1969 2068 1979 1974 1984 
  72   67   65   61   45   44   34   30   28   18   10    6    5    4    3    1 
2066 2067 
   1    1 

In [ ]:
# breakdown of types in fund data
table(data$Type)

In [10]:
retired <- data[data$Type == 'Service Retirement Annuities', ] # n = 2,123
active <- data[data$Type == 'Active Officer', ] # n = 12,061
disability <- data[data$Type == 'Disability Benefit Recipient', ] # n = 306
inactive <- data[data$Type == 'Total Inactive Participants', ] # n = 237

In [13]:
nrow(active) # sanity check


Out[13]:
2123

Retirements

Let's analyze the data on the officers whose status is "Service Retirement Annuities." We do not include other types of benefits-receiving officers to be as conservative as possible for the next summary statistics.


In [ ]:
# Question: How many active officers had a start-dates by service in the last 5 years?
active_hired <- as.Date(active$Date.of.Hire.based.on.Service, '%m/%d/%y')
format(active_hired, "%Y")
table(format(active_hired, "%Y"))

In [ ]:
# Question: Of retired officers, what were their termination year?
retired_term <- as.Date(retired$Date.of.Termination, '%m/%d/%y')
format(retired_term, "%Y")
table(format(retired_hired, "%Y"))

In [ ]:
# Question: What is the average no of years of service for retired members? What is the ave year of birth?

retired$birth<-retired$Year.of.Birth
retired$age <- 2016 - retired$birth
mean(retired$age)
mean(retired$Service.Years, na.rm=TRUE)

summary(retired$birth)

Active Members


In [ ]:
# Question: What is the number of active officers with 50+ years of age, who have served more than the ave. of retiree's total service
# To get how many officers have served >= 26 years, the average, skip the filtering by age.
# Similarly, for a raw count of officers who are eligible for "55 and out," calculate with the age column

# Number of active members, age 50 or older
active$birth<-active$Year.of.Birth
active$age <- 2016 - active$birth
active_over50<-active[active$age >=50,]

In [ ]:
nrow(active_over50)

In [ ]:
service26 <-active_over50[active_over50$Service.Years>=26,]

In [ ]:
nrow(service26)

In [ ]:
# Question: What is the number of active officers who are eligible to retire?
# Note that these estimates are chosen based on the presentation that the PABF gives all officers
# who qualify for retirement. 

# 1) By having served 29+1d years
# 1a) and are also >= 50 years old
# 2) By having served 20+ years
# 2a) and are also >= 50 years old

In [ ]:
# 1) 29 years and + 1day 
service29 <-active[active$Service.Years>=29,]
service29_plus_a_day <- service29[service29$Service.Days>=1,]

In [ ]:
# 1a) 
nrow(service29_plus_a_day[service29_plus_a_day$age>=50,])

In [ ]:
# 2) 20 years
service20 <-active[active$Service.Years>=20,]
nrow(service20)

In [ ]:
# 2a) 
nrow(service20[service20$age>=50,])

In [ ]:
# Born on or before 1955
nrow(active[active$birth<=1955,]) # n = 240

In [ ]:
# Born on or before 1955 AND service of 29+ years
nrow(service29_plus_a_day[service29_plus_a_day$birth<=1955,]) # 55

Let's make a hexbin map


In [ ]:
# we add day, month and year to get a better estimate of time served
# lubridate is good for this
d <- days(active$Service.Days)
m <- months(active$Service.Months)
y <- years(active$Service.Years)

active$time_served <- y+m+d
active$duration <- as.duration(active$time_served) / dweeks(1)
active$yduration <- as.duration(active$time_served) / dyears(1)

In [ ]:
library(hexbin)
# hexbin is simple to use but not as straight-forward to manipulate axes and colors
hbin <- hexbin(active$yduration, active$age, xbins = 40)
plot(hbin)

In [ ]:
# For a more polished graph, use ggplot

In [ ]:
hex <- ggplot(active,aes(x=yduration,y=age)) + geom_hex(bins=30) +
geom_hline(yintercept=55) + geom_vline(xintercept = 26) +
theme_classic() + scale_fill_gradient(low = "white", high = "black")

In [ ]:
# format ticks and add highlight-lines where we want them
ticks_hex <- hex +
scale_x_continuous(breaks=seq(0,max(active$yduration, na.rm=TRUE),3))+
scale_y_continuous(breaks=seq(0,max(active$age, na.rm=TRUE),5))

In [ ]:
# export to .eps because Illustrator is iffy with svg
postscript("hex_bin.eps")
plot(ticks_hex, xlab = 'Years of Service', ylab = 'Age', color=TRUE)
dev.off()