In [ ]:
# Add required packages here
require('plyr')
require('dplyr')
require('tidyr')
require('lubridate')
require('stringr')
require("ggplot2") 

# Path of data used for analysis
RAW_PATH = "C:\\Users\\sahil.maheshwari\\Documents\\R Scripts\\eGov_DC\\data\\datadive\\coc.csv"
ROLLED_DATA_PATH = "C:\\Users\\sahil.maheshwari\\Documents\\R Scripts\\eGov_DC\\data\\datadive\\ddive_rolled_seq.csv"

In [ ]:
# Put TODOs here

Exploratory Data Analysis

The purpose of this exercise is to:

  • Explore the data in order to understand it
  • Find interesting patterns
  • Validate assumptions

The notebook is organized into different sections, based on the levels already present in the data.

Questions

To simplify the EDA process, let's start with some basic questions we can ask

  • How is the data spread? -> 5 Number Summary and Box Plots
  • Data Validation - For eg, number of days to resolve to complaint should be the difference between complaint date and resolution date.
  • Check for missing values in the data.
  • What is the complaint share across the wards? Is it uniform? Do some wards have a dispproportionate share of complaints?
  • Are some complaint types more frequently reported than some?
  • What days are the complaints filed? How is the pattern across the wards? How is the pattern across complaint types? What type of distribution do these patterns follow?
  • Trend Analysis: What is the number of complaints per month, per year? Is there an increasing trend? What type of trend (linear / non-linear) is it?
  • How are the complaints distributed in each year? Does it differ ward by ward, or by complaint type?
  • How are the complaints distributed in each month? Does it differ ward by ward, or by complaint type?
  • Are there more complaints in some months? What about in a particular ward, or for a complaint type?
  • What are the top k months in which most complaints are filed? What about in each ward?
  • What are the top l complaint types in a ward? Are these complaints filed in only a few months?
  • Based on the complaint type percentage (explore other factors), what wards are 'similar'? (Clustering exercise)
  • What are the top m complaint type(s) for each ward and when does it primarily occur? Is the time series distribution unimodal, bimodal, multimodal? During what months do they peak?
  • In which wards are the top 5 complaints(overall)?
  • Have the number of complaints decreased for a particular type over time? Is this because it is no longer a problem or was it solved? Compute Rolling resolution rate?

Since we will be building alerts and forecast models, it is essential to perform a more focussed EDA

  • How are the complaints auto-correlated? (ACF / PACF plots)
  • Data Continuity and necessary conditions for time series modelling: Is the time series stationary? What ward x complaint types can be modelled? (Come up with a rough metric to assess suitability for building time series models)
  • Are there any seasonal / cyclical trend in the overall complaints? (Decomposition)
  • Are there any seasonal / cyclical trend in each complaint type taken on its own? (Decomposition)
  • Put each Ward x Complaint Type into buckets, based on the complaint type's average / median resolution percentage rate, and average / median resolution time (may need to confer with eGov):
    • High Difficulty, High Volume
    • High Difficulty, Low Volume
    • Low Difficulty, High Volume
    • Low Difficulty, Low Volume
  • Is the complaint distribution per complaint normal at the city level? Is it another type of distribution?
  • Is the complaint distribution per complaint normal at the ward level? Is it another type of distribution?
  • Do outliers (in complaints at city level) consistently occur in a time period?
  • Do outliers (in complaints at ward level) consistently occur in a time period?

In [ ]:
Read data

In [ ]:
nas <- c(NA, 'NA', '', ' ', 'NULL')
rolled_data <- read.csv(ROLLED_DATA_PATH, stringsAsFactors = F,
                    na.strings = nas)

In [ ]:
glimpse(rolled_data)

In [ ]:
rolled_data$Complaint.Date <- ymd(rolled_data$Complaint.Date)
rolled_data$Complaint.Type <- as.factor(rolled_data$Complaint.Type)
rolled_data$Ward <- as.factor(rolled_data$Ward)

In [ ]:
glimpse(rolled_data)

How is the data spread?


In [ ]:
summary(rolled_data$Complaint.Date)

In [ ]:
rolled_data$Year <- year(rolled_data$Complaint.Date)
rolled_data$Month <- month(rolled_data$Complaint.Date, label=T)

In [ ]:
rolled_data %>% 
group_by(Year) %>% 
summarise(comp_count=sum(complaint_count, na.rm=T)) %>% 
mutate(count_percent=(comp_count/sum(comp_count))*100)

In [ ]:
rolled_data %>% 
group_by(Month) %>% 
summarise(comp_count=sum(complaint_count, na.rm=T)) %>% 
mutate(count_percent=(comp_count/sum(comp_count))*100)

In [ ]:
summary(rolled_data$complaint_count)

In [ ]:
comp_count_m1 <- rolled_data %>% dplyr::select(complaint_count) %>% dplyr::filter(complaint_count>1)

In [ ]:
bins <- c(0,2,4,6,8,10,15,20,25,30,40,50,60)
hist(comp_count_m1$complaint_count, breaks=bins, xlim=c(0,20))

In [ ]:
bins <- c(0, seq(5,55,5))
hist(comp_count_m1$complaint_count, breaks=bins, xlim=c(10,55), ylim=c(0,100), freq=T)

What is the complaint share across the wards? Is it uniform? Do some wards have a dispproportionate share of complaints?


In [ ]:
length(unique(rolled_data$Ward))

In [ ]:
count_by_ward<- rolled_data %>% 
                group_by(Ward) %>% 
                dplyr::filter(!is.na(Ward)) %>%
                summarise(comp_count=sum(complaint_count, na.rm=T)) %>% 
                mutate(count_percent=(comp_count/sum(comp_count))*100) %>%
                arrange(count_percent)

count_by_ward

In [ ]:
ggplot(count_by_ward, aes(reorder(Ward,count_percent), count_percent)) + 
geom_bar(stat="identity")

In [ ]:
summary(count_by_ward$count_percent)

In [ ]:
ward_count_high <- dplyr::filter(count_by_ward, count_percent>0.70050)
sum(ward_count_high$count_percent)

So, the complaint count across Wards is not evenly distributed. Top 25% Wards by count of complaints have ~47% of complaints

What days are the complaints filed? How is the pattern across the wards? How is the pattern across complaint types? What type of distribution do these patterns follow?


In [ ]:
rolled_data$WDay <- wday(rolled_data$Complaint.Date, label=T)

In [ ]:
rolled_data %>% 
group_by(WDay) %>% 
summarise(comp_count=sum(complaint_count, na.rm=T)) %>% 
mutate(count_percent=(comp_count/sum(comp_count))*100)

Almost uniformally distributed


In [ ]: