Study of Correlation Between Building Demolition and Associated Features

Capstone Project for Data Science at Scale on Coursera
Repo is located here



In [2]:

    
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Image
%matplotlib inline

Objective

Build a model that predicts blighted buildings from real data published at data.detroitmi.gov and provided by Coursera.

Demolishing blighted buildings is an important part of the city's effort to turn around and revive its economy, but it is no easy task. Accurate predictions can point to buildings at risk of blight and help avoid complications at an early stage.

Building List

The buildings were defined as follows:

Building sizes were estimated using parcel info downloaded here at data.detroitmi.gov. Details can be found in this notebook.
An event table was constructed from the 4 files (detroit-311.csv, detroit-blight-violations.csv, detroit-crime.csv, and detroit-demolition-permits.tsv) using their coordinates, as shown here.
Buildings were defined from these coordinates using an estimated building size (the median over all parcels). Each building was represented as a rectangle of the same size.
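
As a rough illustration of the last step, the sketch below groups event coordinates into equally sized rectangles. It is not the code from the linked notebooks: the `Building` class, the `assign_events` helper, the greedy grouping rule, and the parcel sizes are all hypothetical, and plain Python is used to keep the sketch dependency-free.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class Building:
    """A building modeled as a fixed-size rectangle centered on (lat, lon)."""
    lat: float
    lon: float
    half_height: float
    half_width: float

    def contains(self, lat, lon):
        return (abs(lat - self.lat) <= self.half_height
                and abs(lon - self.lon) <= self.half_width)


def assign_events(events, half_height, half_width):
    """Greedy grouping (an assumption): an event joins the first building
    whose rectangle contains it, otherwise it starts a new building."""
    buildings, membership = [], []
    for lat, lon in events:
        for idx, building in enumerate(buildings):
            if building.contains(lat, lon):
                membership.append(idx)
                break
        else:
            buildings.append(Building(lat, lon, half_height, half_width))
            membership.append(len(buildings) - 1)
    return buildings, membership


# Hypothetical parcel side lengths in degrees; the real values come from the parcel data.
parcel_sides = [0.0001, 0.0002, 0.0004]
half_side = statistics.median(parcel_sides) / 2

events = [(42.3300, -83.0500), (42.33005, -83.05004), (42.4000, -83.1000)]
buildings, membership = assign_events(events, half_side, half_side)
print(len(buildings), membership)
```

The first two events fall inside the same rectangle, so they are treated as one building, while the third starts a new one.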



In [3]:

    
# The resulting buildings:
Image("./data/buildings_distribution.png")









    Out[3]:

Features

Three kinds of incident counts (311 calls, blight violations, and crimes) and normalized coordinates were used in the end. I also tried to generate more features by differentiating each kind of crime or violation in this notebook. However, these differentiated features led to lower AUC scores.
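
A minimal sketch of how such a feature table could be assembled is given below. The report does not say which normalization was applied, so min-max scaling is an assumption here, and the `INCIDENT_KINDS` labels and the `build_features` helper are hypothetical names.

```python
from collections import Counter

# Hypothetical labels for the three incident kinds.
INCIDENT_KINDS = ("311", "blight_violation", "crime")


def min_max(values):
    """Scale values to [0, 1]; min-max scaling is an assumed choice."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def build_features(buildings, events):
    """buildings: list of (lat, lon); events: list of (building_index, kind).

    Returns one row per building:
    [normalized lat, normalized lon, 311 count, violation count, crime count].
    """
    counts = Counter(events)
    lats = min_max([lat for lat, _ in buildings])
    lons = min_max([lon for _, lon in buildings])
    return [
        [lats[idx], lons[idx]] + [counts[(idx, kind)] for kind in INCIDENT_KINDS]
        for idx in range(len(buildings))
    ]


rows = build_features(
    [(42.0, -83.0), (43.0, -84.0)],
    [(0, "crime"), (0, "crime"), (1, "311")],
)
print(rows)
```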

Data

The buildings were down-sampled so that the data contained equal numbers of blighted and non-blighted buildings.
The data were split into train and test sets at a ratio of 80:20.
During training with XGBoost, the train set was further split into train and evaluation sets at a ratio of 80:20 for monitoring.
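
The three steps above can be sketched in plain Python as follows. The helper names, the toy label counts, and the fixed random seed are illustrative assumptions rather than the values used in the project.

```python
import random


def downsample(labels, rng):
    """Return indices with equal numbers of blighted (1) and non-blighted (0) rows."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(pos), len(neg))
    return rng.sample(pos, n) + rng.sample(neg, n)


def split(indices, rng, first_percent=80):
    """Shuffle a copy of the indices and cut it into two parts (80:20 by default)."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    cut = len(shuffled) * first_percent // 100
    return shuffled[:cut], shuffled[cut:]


rng = random.Random(0)                      # fixed seed for reproducibility
labels = [1] * 50 + [0] * 450               # toy, imbalanced labels
balanced = downsample(labels, rng)          # 50 blighted + 50 non-blighted
train_all, test = split(balanced, rng)      # 80 / 20
train, evaluation = split(train_all, rng)   # 64 / 16
print(len(train), len(evaluation), len(test))
```

Note that this simple shuffle does not stratify by label, so the class balance in each part is only approximately equal.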

Model

A gradient boosted tree model trained with XGBoost achieved an AUC score of 0.85 on the evaluation data set:



In [4]:

    
Image('./data/train_process.png')









    Out[4]:
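
For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen blighted building receives a higher score than a randomly chosen non-blighted one. The sketch below computes it directly from that definition; `auc_score` is a hypothetical helper written for illustration, not the evaluation routine used during training.

```python
def auc_score(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Three of the four (positive, negative) pairs are ordered correctly.
print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration but slow on large data sets.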

This model resulted in an AUC score of 0.858 on test data. Feature importances are shown below:



In [5]:

    
Image('./data/feature_f_scores.png')









    Out[5]:

Locations were the most important features in this model. Although I tried using more features generated by differentiating the kinds of crimes or violations, the AUC scores did not improve.

Feature importance can also be viewed using a tree representation:



In [6]:

    
Image('./data/bst_tree.png')









    Out[6]:

Overfitting was observed during training. To reduce the variance of the model, I also included more non-blighted buildings by repeatedly resampling them with replacement (bagging).
This gave a final AUC score of 0.8625. The resulting ROC curve on test data is shown below:



In [7]:

    
Image('./data/ROC_Curve_combined.png')









    Out[7]:
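
The bagging step described above can be sketched as follows. Several details are assumptions: each round keeps all blighted rows and draws a fresh, equally sized sample of non-blighted rows, so the same building can reappear across rounds, and the per-round predictions are averaged. The `fit_threshold` function is only a toy stand-in for the boosted tree model, and all names are hypothetical.

```python
import random


def bagged_predict(pos_rows, neg_rows, fit, test_rows, n_rounds=5, seed=0):
    """Average predictions of models trained on all blighted rows plus
    a fresh sample of non-blighted rows in every round."""
    rng = random.Random(seed)
    totals = [0.0] * len(test_rows)
    for _ in range(n_rounds):
        neg_sample = rng.sample(neg_rows, len(pos_rows))
        rows = pos_rows + neg_sample
        labels = [1] * len(pos_rows) + [0] * len(neg_sample)
        predict = fit(rows, labels)
        for i, row in enumerate(test_rows):
            totals[i] += predict(row)
    return [total / n_rounds for total in totals]


def fit_threshold(rows, labels):
    """Toy stand-in model: threshold the first feature at the class-mean midpoint."""
    pos_mean = sum(r[0] for r, y in zip(rows, labels) if y == 1) / labels.count(1)
    neg_mean = sum(r[0] for r, y in zip(rows, labels) if y == 0) / labels.count(0)
    midpoint = (pos_mean + neg_mean) / 2
    return lambda row: 1.0 if row[0] > midpoint else 0.0


blighted = [[5.0], [6.0], [7.0]]
non_blighted = [[0.0], [1.0], [2.0], [1.5], [0.5]]
print(bagged_predict(blighted, non_blighted, fit_threshold, [[6.5], [0.2]]))
```

Averaging over several resampled training sets lets the model see more of the non-blighted buildings than a single balanced sample would, which is the motivation for bagging given in the report.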

Discussion

Several things are worth trying, given more time:

Using a neural net to study more features generated from differentiated crimes or violations.
Taking into account the possibility that a building might become blighted in the future.

Thanks for your time reading the report!


