Property Maintenance Fines

Predicting the probability that a set of blight tickets will be paid on time

Supervised Learning. Classification

Source: Applied Machine Learning in Python | Coursera. Solved with classical machine learning classifiers here

Data provided by Michigan Data Science Team (MDST), the Michigan Student Symposium for Interdisciplinary Statistical Sciences (MSSISS) and the City of Detroit Detroit Open Data Portal.

Each row of the dataset corresponds to a single blight ticket, and includes information about when, why, and to whom each ticket was issued. The target variable is compliance, which is True if the ticket was paid early, on time, or within one month of the hearing data, False if the ticket was paid after the hearing date or not at all, and Null if the violator was found not responsible.

Features

ticket_id - unique identifier for tickets
agency_name - Agency that issued the ticket
inspector_name - Name of inspector that issued the ticket
violator_name - Name of the person/organization that the ticket was issued to
violation_street_number, violation_street_name, violation_zip_code - Address where the violation occurred
mailing_address_str_number, mailing_address_str_name, city, state, zip_code, non_us_str_code, country - Mailing address of the violator
ticket_issued_date - Date and time the ticket was issued
hearing_date - Date and time the violator's hearing was scheduled
violation_code, violation_description - Type of violation
disposition - Judgment and judgement type
fine_amount - Violation fine amount, excluding fees
admin_fee - $20 fee assigned to responsible judgments

state_fee - $10 fee assigned to responsible judgments late_fee - 10% fee assigned to responsible judgments discount_amount - discount applied, if any clean_up_cost - DPW clean-up or graffiti removal cost judgment_amount - Sum of all fines and fees grafitti_status - Flag for graffiti violations

Labels

payment_amount - Amount paid, if any
payment_date - Date payment was made, if it was received
payment_status - Current payment status as of Feb 1 2017
balance_due - Fines and fees still owed
collection_status - Flag for payments in collections
compliance [target variable for prediction]
 Null = Not responsible
 0 = Responsible, non-compliant
 1 = Responsible, compliant
compliance_detail - More information on why each ticket was marked compliant or non-compliant

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import helper
import keras

helper.info_gpu()
#sns.set_palette("GnBu_d")
#helper.reproducible(seed=0) # Setup reproducible results from run to run using Keras

%matplotlib inline


Using TensorFlow backend.
/device:GPU:0
Keras		v2.1.3
TensorFlow	v1.4.1

1. Data Processing


In [2]:
data_path = 'data/property_maintenance_fines_data.csv'
target = ['compliance']

df_original = pd.read_csv(data_path, encoding='iso-8859-1', dtype='unicode')
print("{} rows \n{} columns \ntarget: {}".format(*df_original.shape, target))


250306 rows 
34 columns 
target: ['compliance']

Explore and Clean the target


In [3]:
print(df_original[target].squeeze().value_counts(dropna=False))


0.0    148283
NaN     90426
1.0     11597
Name: compliance, dtype: int64

In [4]:
# Remove rows with NULL targets

df_original = df_original.dropna(subset=target)

print(df_original[target].squeeze().value_counts())
print(df_original.shape)


0.0    148283
1.0     11597
Name: compliance, dtype: int64
(159880, 34)

Imbalanced target: the evaluation metric used in this problem is the Area Under the ROC Curve

Split original data into training and validation test set


In [5]:
from sklearn.model_selection import train_test_split

df, df_test = train_test_split(
    df_original, test_size=0.2, stratify=df_original[target], random_state=0)

To avoid data leakage, only the training dataframe, df, will be explored and processed here

Show the training data


In [6]:
df.head(2)


Out[6]:
ticket_id agency_name inspector_name violator_name violation_street_number violation_street_name violation_zip_code mailing_address_str_number mailing_address_str_name city ... clean_up_cost judgment_amount payment_amount balance_due payment_date payment_status collection_status grafitti_status compliance_detail compliance
131030 159232 Department of Public Works Montgomery-Coit, Kimberlye JOHNSON-GREENE, MARGUSIE F. 11645.0 LAKEPOINTE NaN 11645 LAKEPOINTE DETROIT ... 0.0 85.0 85.0 0.0 2008-09-12 00:00:00 PAID IN FULL NaN NaN non-compliant by late payment more than 1 month 0.0
29573 49039 Buildings, Safety Engineering & Env Department Watson, Jerry BAPT CHURCH, HOLY TABERNACLE 3184.0 CANFIELD NaN 3184 E. CANFIELD DETROIT ... 0.0 305.0 305.0 0.0 2007-06-26 00:00:00 PAID IN FULL NaN NaN non-compliant by late payment more than 1 month 0.0

2 rows × 34 columns

Missing values


In [7]:
helper.missing(df)


Transform Data

Remove irrelevant features


In [8]:
def remove_features(df):

    relevant_col = ['agency_name', 'violation_street_name', 'city', 'state', 'violator_name',
        'violation_code', 'late_fee', 'discount_amount', 'judgment_amount', 'disposition',
        'fine_amount', 'compliance']

    df = df[relevant_col]

    return df


df = remove_features(df)

print(df.shape)


(127904, 12)

Classify variables


In [9]:
num = ['late_fee', 'discount_amount', 'judgment_amount', 'fine_amount']

df = helper.classify_data(df, target, numerical=num)

pd.DataFrame(dict(df.dtypes), index=["Type"])[df.columns].head()  # show data types


numerical features:   4
categorical features: 7
target 'compliance': category
Out[9]:
late_fee discount_amount judgment_amount fine_amount agency_name violation_street_name city state violator_name violation_code disposition compliance
Type float32 float32 float32 float32 category category category category category category category category

Remove low-frequency categorical values


In [10]:
df, dict_categories = helper.remove_categories(df, target=target, ratio=0.001, show=False)

Fill missing values

Missing categorical values filled by 'Other' There are no numerical missing values


In [11]:
df = helper.fill_simple(df, target, missing_categorical='Other')


Missing categorical filled with label: "Other"

In [12]:
helper.missing(df);


No missing values found

Visualize the data

Categorical features


In [13]:
for i in ['state', 'disposition']:
    helper.show_categorical(df[[i]])


Target vs Categorical features


In [14]:
for i in ['state', 'disposition']:
    helper.show_target_vs_categorical(df[[i, target[0]]], target)


Numerical features


In [15]:
helper.show_numerical(df, kde=True)


Target vs Numerical features


In [16]:
helper.show_target_vs_numerical(df, target, point_size=10 ,jitter=0.3, fit_reg=True)
plt.ylim(ymin=-0.2, ymax=1.2)


Out[16]:
(-0.2, 1.2)

Correlation between numerical features and target


In [17]:
helper.show_correlation(df, target, figsize=(6,3))


2. Neural Network Model

Select the features


In [18]:
droplist = []  # features to drop

# For the model 'data' instead of 'df'
data = df.copy()
# del(df)
data.drop(droplist, axis='columns', inplace=True)
data.head(2)


Out[18]:
late_fee discount_amount judgment_amount fine_amount agency_name violation_street_name city state violator_name violation_code disposition compliance
131030 5.0 0.0 85.0 50.0 Department of Public Works LAKEPOINTE DETROIT MI Other 9-1-103(C) Responsible by Determination 0.0
29573 25.0 0.0 305.0 250.0 Buildings, Safety Engineering & Env Department Other DETROIT MI Other 9-1-36(a) Responsible by Default 0.0

Scale numerical variables

Shift and scale numerical variables to a standard normal distribution. The scaling factors are saved to be used for predictions.


In [19]:
data, scale_param = helper.scale(data)

Create dummy features

Replace categorical features (no target) with dummy features


In [20]:
data, dict_dummies = helper.replace_by_dummies(data, target)

model_features = [f for f in data if f not in target]  # sorted neural network inputs

data.head(3)


Out[20]:
late_fee discount_amount judgment_amount fine_amount compliance agency_name_Buildings, Safety Engineering & Env Department agency_name_Department of Public Works agency_name_Detroit Police Department agency_name_Health Department agency_name_Other ... violation_code_9-1-43(a) - (Dwellin violation_code_9-1-43(a) - (Structu violation_code_9-1-81(a) violation_code_9-1-82(d) - (Dwellin violation_code_Other disposition_Responsible (Fine Waived) by Deter disposition_Responsible by Admission disposition_Responsible by Default disposition_Responsible by Determination disposition_Other
131030 -0.422306 -0.049474 -0.451255 -0.453714 0.0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
29573 -0.126339 -0.049474 -0.154289 -0.156975 0.0 1 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
2657 -0.126339 -0.049474 -0.154289 -0.156975 0.0 1 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0

3 rows × 446 columns

Split the data into training and validation sets


In [21]:
val_size = 0.2
random_state = 0


def validation_split(data, val_size=0.25):

    train, test = train_test_split(
        data, test_size=val_size, random_state=random_state, stratify=data[target])

    # Separate the data into features and target (x=features, y=target)
    x_train, y_train = train.drop(target, axis=1).values, train[target].values
    x_val, y_val = test.drop(target, axis=1).values, test[target].values
    # _nc: non-categorical yet (needs one-hot encoding)

    return x_train, y_train, x_val, y_val


x_train, y_train, x_val, y_val = validation_split(data, val_size=val_size)

# x_train = x_train.astype(np.float16)
y_train = y_train.astype(np.float16)
# X_val = x_val.astype(np.float16)
y_val = y_val.astype(np.float16)

Encode the output


In [22]:
def one_hot_output(y_train, y_val):
    num_classes = len(np.unique(y_train))
    y_train = keras.utils.to_categorical(y_train, num_classes)
    y_val = keras.utils.to_categorical(y_val, num_classes)
    return y_train, y_val


y_train, y_val = one_hot_output(y_train, y_val)

In [23]:
print("train size \t X:{} \t Y:{}".format(x_train.shape, y_train.shape))
print("val size \t X:{} \t Y:{}".format(x_val.shape, y_val.shape))


train size 	 X:(102323, 445) 	 Y:(102323, 2)
val size 	 X:(25581, 445) 	 Y:(25581, 2)

Build a dummy classifier


In [24]:
from sklearn.dummy import DummyClassifier

clf = DummyClassifier(strategy='most_frequent').fit(x_train, np.ravel(y_train))
# The dummy 'most_frequent' classifier always predicts class 0
y_pred = clf.predict(x_val).reshape([-1, 1])

helper.binary_classification_scores(y_val[:, 1], y_pred);


Scores:
-----------
Log_Loss: 	2.5059
Accuracy: 	0.93
Precision: 	0.00
Recall: 	0.00
ROC AUC: 	0.00
F1-score: 	0.00

Confusion matrix: 
 [[23725     0]
 [ 1856     0]]

Build a random forest classifier (best of grid search)


In [25]:
from sklearn.ensemble import RandomForestClassifier


%time clf_random_forest_opt = RandomForestClassifier(n_estimators = 30, max_features=150, \
                                max_depth=13, class_weight='balanced', n_jobs=-1, \
                                   random_state=0).fit(x_train, np.ravel(y_train[:,1]))


CPU times: user 29 s, sys: 63.4 ms, total: 29.1 s
Wall time: 9.1 s

In [26]:
y_pred = clf_random_forest_opt.predict(x_val).reshape([-1, 1])
helper.binary_classification_scores(y_val[:, 1], y_pred);


Scores:
-----------
Log_Loss: 	4.3881
Accuracy: 	0.87
Precision: 	0.31
Recall: 	0.60
ROC AUC: 	0.75
F1-score: 	0.41

Confusion matrix: 
 [[21209  2516]
 [  734  1122]]

Build the Neural Network for Binary Classification


In [27]:
cw = helper.get_class_weight(y_train[:, 1])  # class weight (imbalanced target)

import keras
from keras.models import Sequential
from keras.layers.core import Dense, Dropout


def build_nn(input_size, output_size, summary=False):

    input_nodes = input_size // 8

    model = Sequential()
    model.add(Dense(input_nodes, input_dim=input_size, activation='relu'))

    model.add(Dense(output_size, activation='softmax'))

    if summary:
        model.summary()

    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    return model


model = build_nn(x_train.shape[1], y_train.shape[1], summary=True)


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 55)                24530     
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 112       
=================================================================
Total params: 24,642
Trainable params: 24,642
Non-trainable params: 0
_________________________________________________________________

Train the Neural Network


In [28]:
import os
from time import time
model_path = os.path.join("models", "detroit.h5")


def train_nn(model, x_train, y_train, validation_data=None, path=False, show=True):
    """ 
    Train the neural network model. If no validation_datais provided, a split for validation
    will be used
    """
    
    if show:
        print('Training ....')
    
    callbacks = [keras.callbacks.EarlyStopping(monitor='val_loss', patience=0, verbose=1)]
    t0 = time()

    history = model.fit(
        x_train,
        y_train,
        epochs=100,
        batch_size=2048,
        
        class_weight = cw,
        
        verbose=1,
        validation_split=0.3,
        validation_data = validation_data,
        callbacks=callbacks)

    if show:
        print("time: \t {:.1f} s".format(time() - t0))
        helper.show_training(history)

    if path:
        model.save(path)
        print("\nModel saved at", path)
    
    return history
        

model = None
model = build_nn(x_train.shape[1], y_train.shape[1], summary=False)
train_nn(model, x_train, y_train, path=None);


from sklearn.metrics import roc_auc_score

y_pred_train = model.predict(x_train, verbose=1)
print('\n\n ROC_AUC train:\t{:.2f} \n'.format(roc_auc_score(y_train, y_pred_train)))
y_pred_val = model.predict(x_val, verbose=1)
print('\n\n ROC_AUC val:\t{:.2f}'.format(roc_auc_score(y_val, y_pred_val)))


Training ....
Train on 71626 samples, validate on 30697 samples
Epoch 1/100
71626/71626 [==============================] - 6s 84us/step - loss: 0.6311 - acc: 0.7677 - val_loss: 0.5909 - val_acc: 0.7567
Epoch 2/100
71626/71626 [==============================] - 5s 76us/step - loss: 0.5471 - acc: 0.7890 - val_loss: 0.5538 - val_acc: 0.8086
Epoch 3/100
71626/71626 [==============================] - 5s 69us/step - loss: 0.5232 - acc: 0.8325 - val_loss: 0.5477 - val_acc: 0.8383
Epoch 4/100
71626/71626 [==============================] - 5s 65us/step - loss: 0.5144 - acc: 0.8364 - val_loss: 0.5460 - val_acc: 0.8433
Epoch 5/100
71626/71626 [==============================] - 5s 64us/step - loss: 0.5092 - acc: 0.8353 - val_loss: 0.5432 - val_acc: 0.8311
Epoch 6/100
71626/71626 [==============================] - 5s 64us/step - loss: 0.5048 - acc: 0.8282 - val_loss: 0.5427 - val_acc: 0.8280
Epoch 7/100
71626/71626 [==============================] - 5s 67us/step - loss: 0.5014 - acc: 0.8279 - val_loss: 0.5427 - val_acc: 0.8283
Epoch 00007: early stopping
time: 	 35.1 s
Training loss:  	0.5014
Validation loss: 	0.5427

Training accuracy: 	0.828
Validation accuracy:	0.828
102323/102323 [==============================] - 4s 39us/step


 ROC_AUC train:	0.82 

25581/25581 [==============================] - 1s 39us/step


 ROC_AUC val:	0.81

Validate the model (validation set)


In [29]:
helper.binary_classification_scores(y_val[:, 1], y_pred_val[:, 1]);


Scores:
-----------
Log_Loss: 	nan
Accuracy: 	0.83
Precision: 	0.24
Recall: 	0.63
ROC AUC: 	0.81
F1-score: 	0.35

Confusion matrix: 
 [[20094  3631]
 [  678  1178]]

Evaluate the final model (test set)


In [30]:
df_test.head(2)


Out[30]:
ticket_id agency_name inspector_name violator_name violation_street_number violation_street_name violation_zip_code mailing_address_str_number mailing_address_str_name city ... clean_up_cost judgment_amount payment_amount balance_due payment_date payment_status collection_status grafitti_status compliance_detail compliance
29156 48696 Department of Public Works Funchess, Mitchell CORP, CONTIMORTGAGE 8200.0 HEYDEN NaN 3815 WEST TEMPLE SALT LAKE CITY ... 0.0 140.0 0.0 140.0 NaN NO PAYMENT APPLIED IN COLLECTION NaN non-compliant by no payment 0.0
125262 152329 Buildings, Safety Engineering & Env Department Doetsch, James JACKSON, THEO 13821.0 GLENWOOD NaN 1464 PO BOX DETROIT ... 0.0 305.0 0.0 305.0 NaN NO PAYMENT APPLIED NaN NaN non-compliant by no payment 0.0

2 rows × 34 columns

Process test data with training set parameters (no data leakage)


In [31]:
df_test = remove_features(df_test)

df_test = helper.classify_data(df_test, target, numerical=num)

df_test, _ = helper.remove_categories(
    df_test, target=target, show=False, dict_categories=dict_categories)

df_test = helper.fill_simple(df_test, target, missing_categorical='Other') 

df_test, _ = helper.scale(df_test, scale_param)
df_test, _ = helper.replace_by_dummies(df_test, target, dict_dummies)
df_test = df_test[model_features+target] # sort columns to match training features order


numerical features:   4
categorical features: 7
target 'compliance': category
Missing categorical filled with label: "Other"

In [32]:
def separate_x_y(data):
    """ Separate the data into features and target (x=features, y=target) """

    x, y = data.drop(target, axis=1).values, data[target].values
    x = x.astype(np.float16)
    y = y.astype(np.float16)
   
    return x, y

x_test, y_test = separate_x_y(df_test)

y_test = keras.utils.to_categorical(y_test, 2)

Random Forest model


In [33]:
y_pred = clf_random_forest_opt.predict_proba(x_test)[:,1]
helper.binary_classification_scores(y_test[:,1], y_pred);


Scores:
-----------
Log_Loss: 	0.4576
Accuracy: 	0.87
Precision: 	0.30
Recall: 	0.59
ROC AUC: 	0.81
F1-score: 	0.40

Confusion matrix: 
 [[26488  3169]
 [  955  1364]]

In [34]:
helper.show_feature_importances(model_features, clf_random_forest_opt)


 Top contributing features:
 --------------------------
disposition_Responsible by Default          0.32
late_fee                                    0.26
judgment_amount                             0.09
disposition_Responsible by Admission        0.05
disposition_Responsible by Determination    0.04
fine_amount                                 0.03
discount_amount                             0.03
violation_code_9-1-36(a)                    0.01
violation_code_22-2-61                      0.01
violation_code_9-1-81(a)                    0.01
dtype: float64

Neural Network model


In [35]:
y_pred = model.predict(x_test, verbose=1)[:,1]
helper.binary_classification_scores(y_test[:,1], y_pred);


31976/31976 [==============================] - 1s 22us/step
Scores:
-----------
Log_Loss: 	nan
Accuracy: 	0.83
Precision: 	0.24
Recall: 	0.62
ROC AUC: 	0.80
F1-score: 	0.35

Confusion matrix: 
 [[25076  4581]
 [  875  1444]]

Compare with other non-neural ML models


In [36]:
helper.ml_classification(x_train, y_train[:, 1], x_test, y_test[:, 1])


Naive Bayes
AdaBoost
Decision Tree
Random Forest
Extremely Randomized Trees
Out[36]:
Time (s) Loss Accuracy Precision Recall ROC-AUC F1-score
AdaBoost 12.40 0.66 0.94 0.89 0.22 0.80 0.35
Random Forest 87.48 0.48 0.94 0.70 0.27 0.77 0.39
Extremely Randomized Trees 130.21 0.81 0.94 0.66 0.26 0.74 0.38
Decision Tree 18.39 1.47 0.93 0.59 0.28 0.68 0.38
Naive Bayes 1.88 20.79 0.36 0.09 0.83 0.59 0.16