Kaggle San Francisco Crime Classification

Berkeley MIDS W207 Final Project

Sarah Cha, Sam Goodgame, Kalvin Kao, Bryan Moore

This project ingests data about crimes committed in San Francisco and predicts their types. It is part of a Kaggle competition.

Problem Description

Our task in this project is to predict the type of a crime based on its component details. The component details are primarily related to time and location.

This is an interesting problem space because time and location are both high-dimensional variables. Used naively, such variables tend not to work well with machine learning models because they encourage overfitting and poor generalization.

Accordingly, our goal in this project is to generate an accurate, parsimonious model by working our way through the model selection triple:

  • Model selection
  • Feature engineering
  • Hyperparameter tuning (and calibration)

While the model selection triple may seem like a linear checklist, we did not approach it that way. In other words, we conducted model selection, feature engineering, and hyperparameter tuning in parallel, with different members of the team focusing on different aspects of the problem at the same time.

After transforming the data into a usable format, we set about engineering useful features. We focused primarily on enriching our data with features related to weather and schools.

We then prototyped the major model types using their default specifications, and then dug deeper into each model type by tuning hyperparameters and calibrating the models.

Note

This notebook only includes code relevant to the winning model. However, we tested and optimized many models (perhaps even tens of thousands of model specifications) in order to converge on our final model. To view our entire approach to this problem, please refer to the appropriate Jupyter notebook in this repository.

Dependencies


In [1]:
# Basic libraries
import time
import numpy as np
import pandas as pd

# Modeling libraries and estimators
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import log_loss
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Meta-estimators
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Calibration tools
from sklearn.calibration import CalibratedClassifierCV

# Set random seed and format print output
np.random.seed(0)
np.set_printoptions(precision=3)


/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/grid_search.py:43: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)

Initial Data Transformation

Our first step was to transform the data into a form that we could use for machine learning: a Numpy array with a single column of targets (in this case, the type of crime committed) and many columns of independent/predictor variables.

There are many ways to wrangle the data; we chose SQL for the first step:

Use the DDL below to construct a table for the data within a PostgreSQL database:

CREATE TABLE kaggle_sf_crime (
dates TIMESTAMP,                                
category VARCHAR,
descript VARCHAR,
dayofweek VARCHAR,
pd_district VARCHAR,
resolution VARCHAR,
addr VARCHAR,
X FLOAT,
Y FLOAT);

Move the training data, downloaded from Kaggle as a .csv file, into the PostgreSQL database:

\copy kaggle_sf_crime FROM '/Users/Goodgame/Desktop/MIDS/207/final/sf_crime_train.csv' DELIMITER ',' CSV HEADER;

Use SQL to transform the data into a usable form:

SELECT
  category,
  date_part('hour', dates) AS hour_of_day,
  CASE
    WHEN dayofweek = 'Monday' then 1
    WHEN dayofweek = 'Tuesday' THEN 2
    WHEN dayofweek = 'Wednesday' THEN 3
    WHEN dayofweek = 'Thursday' THEN 4
    WHEN dayofweek = 'Friday' THEN 5
    WHEN dayofweek = 'Saturday' THEN 6
    WHEN dayofweek = 'Sunday' THEN 7
  END AS dayofweek_numeric,
  X,
  Y,
  CASE
    WHEN pd_district = 'BAYVIEW' THEN 1
    ELSE 0
  END AS bayview_binary,
    CASE
    WHEN pd_district = 'INGLESIDE' THEN 1
    ELSE 0
  END AS ingleside_binary,
    CASE
    WHEN pd_district = 'NORTHERN' THEN 1
    ELSE 0
  END AS northern_binary,
    CASE
    WHEN pd_district = 'CENTRAL' THEN 1
    ELSE 0
  END AS central_binary,
    CASE
    WHEN pd_district = 'MISSION' THEN 1
    ELSE 0
  END AS mission_binary,
    CASE
    WHEN pd_district = 'SOUTHERN' THEN 1
    ELSE 0
  END AS southern_binary,
    CASE
    WHEN pd_district = 'TENDERLOIN' THEN 1
    ELSE 0
  END AS tenderloin_binary,
    CASE
    WHEN pd_district = 'PARK' THEN 1
    ELSE 0
  END AS park_binary,
    CASE
    WHEN pd_district = 'RICHMOND' THEN 1
    ELSE 0
  END AS richmond_binary,
    CASE
    WHEN pd_district = 'TARAVAL' THEN 1
    ELSE 0
  END AS taraval_binary
FROM kaggle_sf_crime;

Feature Engineering

One-Hot Encoding

Location One-Hot Encoding

Location information tends to be extremely high-dimensional, which presents difficulties for modeling. The original dataset's location information came in three major formats:

  1. X and Y coordinates
  2. Street address information
  3. Police department districts

After visualizing the data and conducting basic exploratory data analysis, we decided to use one-hot encoding to transform the police department district information into features. In other words, each police district becomes a feature, and a given observation receives a value of '1' if it occurred in the district described by the feature and a value of '0' otherwise. Approaching location in this way allowed us to preserve the parsimony of our model; it retains the majority of the important variance in the data without overfitting.
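The same transformation can be reproduced directly in pandas. A minimal sketch, assuming the raw Kaggle training file with its 'PdDistrict' column (the file path is a placeholder, and this is an illustration rather than the SQL pipeline we actually ran):

import pandas as pd

# Load the raw Kaggle training data (placeholder path)
raw = pd.read_csv('train.csv')

# One-hot encode the police district: each district becomes a 0/1 column,
# mirroring the CASE statements in the SQL above
district_dummies = pd.get_dummies(raw['PdDistrict'], prefix='district')
features = pd.concat([raw, district_dummies], axis=1).drop('PdDistrict', axis=1)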

Time One-Hot Encoding

We took the same approach to simplifying time data. Time is also high-dimensional; if we were to try to model crimes based on the provided timestamps (which include seconds), then eight weeks of data generate 4,838,400 possibilities. We used one-hot encoding to simplify those possibilities into one 7-dimensional day-of-week feature and an additional 24-dimensional hour-of-day feature. We believe that these features capture the most variation in the data and allow our models to generalize without risking overfitting.
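A similar sketch for the time features, assuming the Kaggle 'Dates' timestamp column; the indicators below correspond to the hour_of_day and dayofweek_numeric columns produced by the SQL:

import pandas as pd

raw = pd.read_csv('train.csv', parse_dates=['Dates'])  # 'Dates' holds the full timestamp

# Collapse second-level timestamps into 24 hour-of-day and 7 day-of-week indicators
hour_dummies = pd.get_dummies(raw['Dates'].dt.hour, prefix='hour')       # 24 columns
dow_dummies = pd.get_dummies(raw['Dates'].dt.dayofweek, prefix='dow')    # 7 columns
time_features = pd.concat([hour_dummies, dow_dummies], axis=1)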

Weather Data

We sought to add features to our models that improved performance with respect to our desired performance metric. Researchers have documented correlations between weather patterns and crime, and some experts even argue for a causal relationship between weather and crime [1]. More specifically, a 2013 paper published in Science demonstrated that higher temperatures and extreme rainfall led to large increases in conflict. In light of this research, we saw weather as a source of additional features to improve the performance of our classifiers. We gathered weather data from the National Oceanic and Atmospheric Administration's National Centers for Environmental Information.

We selected the following weather features, which we felt captured the overall state of the weather at the time of each example, and which also had enough datapoints to be useful:

  • 'HOURLYDRYBULBTEMPF' -- hourly dry-bulb temperature (°F)
  • 'HOURLYRelativeHumidity' -- hourly relative humidity (%)
  • 'HOURLYWindSpeed' -- hourly wind speed (mph)
  • 'HOURLYSeaLevelPressure' -- hourly sea level pressure (in Hg)
  • 'HOURLYVISIBILITY' -- hourly visibility (miles)

In addition, we created a variable, ‘Daylight’, a daylight indicator (1 if time of sunrise < timestamp < time of sunset, and 0 otherwise). We engineered this feature by comparing the timestamp of each crime example to the time of sunrise (‘DAILYSunrise’) and the time of sunset (‘DAILYSunset’) for the date of the example.

Extracting and creating these additional features was tricky since there were differences in formatting between the timestamp of each crime example, the timestamp of each weather timepoint, and the times listed for ‘DAILYSunrise’ and ‘DAILYSunset’.
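A rough sketch of how the hourly join and the ‘Daylight’ flag can be built with pandas, assuming the NOAA export stores its timestamp in a 'DATE' column and sunrise/sunset as HHMM integers (file paths and those format details are assumptions for illustration):

import pandas as pd

crimes = pd.read_csv('train.csv', parse_dates=['Dates']).sort_values('Dates')
weather = pd.read_csv('noaa_hourly_weather.csv', parse_dates=['DATE']).sort_values('DATE')

# Attach the most recent hourly weather reading at or before each crime timestamp
merged = pd.merge_asof(crimes, weather, left_on='Dates', right_on='DATE',
                       direction='backward')

# Daylight = 1 if the crime occurred between sunrise and sunset on that date.
# Convert the crime time to the same HHMM scale as DAILYSunrise/DAILYSunset first.
crime_hhmm = merged['Dates'].dt.hour * 100 + merged['Dates'].dt.minute
merged['Daylight'] = ((crime_hhmm >= merged['DAILYSunrise']) &
                      (crime_hhmm <= merged['DAILYSunset'])).astype(int)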

Additionally, we investigated using sky conditions (‘HOURLYSKYCONDITIONS’), weather type (‘HOURLYPRSENTWEATHERTYPE’), and total precipitation over the past hour (‘HOURLYPrecip’), but these variables were missing for some examples and would have required additional processing that we did not feel would substantially increase the information captured by the weather variables we had already selected. For example, ‘HOURLYPRSENTWEATHERTYPE’ was initially attractive because it explicitly states conditions such as fog and rain, but it is a categorical variable with over 100 possible values. The data for this variable was unavailable for many crime examples, and some weather timepoints contain multiple conflicting values (from multiple observers). Furthermore, binarizing this variable would have increased the dimensionality of our data by an order of magnitude.

School Data

We also found government data from data.gov that links each zip code in the country to longitude and latitude coordinates, in an effort to reduce the location dimensions from two to one. We used a custom distance function to map the longitude/latitude coordinates of each crime incident to the closest 5-digit zip code in the data.gov file. Unfortunately, even after narrowing the data to California zip codes (> 94000 and < 95000), we were unable to finish the processing in time to add the feature to the data set.

In addition, we found a data set offered by the California Department of Education (cde.ca.gov) with a list of all K-12 schools in California and accompanying characteristics of those schools, including active status, grades taught (elementary, middle, or high school), and address, including location coordinates. We created a distance function to match the longitude/latitude coordinates of each crime to the closest school and to pull other potentially interesting features, including: ‘closest_school’ -- the name of the closest school; ‘school_distance’ -- the Euclidean distance between the lat/long coordinates of the crime and those of the school; and ‘school_type’ -- whether it is an elementary school, high school, etc.
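A simplified sketch of the matching step, assuming the CDE file has been loaded with 'School', 'Latitude', and 'Longitude' columns (the file path and column names are placeholders for illustration):

import numpy as np
import pandas as pd

schools = pd.read_csv('ca_public_schools.csv')           # CDE download (placeholder path)
school_coords = schools[['Latitude', 'Longitude']].values

def closest_school(lat, lon):
    # Index of the school nearest to a crime's coordinates (brute-force Euclidean search)
    dists = np.sqrt((school_coords[:, 0] - lat) ** 2 + (school_coords[:, 1] - lon) ** 2)
    return np.argmin(dists)

# Example: one crime location (Y = latitude, X = longitude in the Kaggle data)
idx = closest_school(37.7749, -122.4194)
print(schools.loc[idx, 'School'])

Applying this brute-force search to every one of the several hundred thousand training examples is what drove the computational cost mentioned below.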

Ultimately, the technical challenges of incorporating these features amounted to an unexpectedly high computational cost that we were unable to meet in time for this project.

Data Wrangling

Once we generated our features, we transformed them into the appropriate formats. We then broke the data into subsets for training and development, and created smaller subsets for quick prototyping.

One point of note: we removed two categories from the data. The crimes labelled 'TREA' (treason) and 'PORNOGRAPHY/OBSCENE MAT' occurred so infrequently that they presented difficulties for modeling, and the best approach was simply to remove them. We assume the risk of incorrectly identifying these crimes in the test set, but doing so allows us to proceed with modeling unhindered by label-set mismatches in the log loss calculation.

The final CSV we use for modeling is here--to follow along with our code, download it and adjust the first line of the code below to reference wherever it's stored.


In [2]:
data_path = "/Users/Goodgame/Desktop/prototyping/x_data_3.csv"
df = pd.read_csv(data_path, header=0)
x_data = df.drop('category', 1)
y = df.category.as_matrix()

# Impute missing values with mean values:
x_complete = x_data.fillna(x_data.mean())
X_raw = x_complete.as_matrix()

# Scale the data between 0 and 1:
X = MinMaxScaler().fit_transform(X_raw)

# Shuffle data to remove any underlying pattern that may exist
np.random.seed(0)
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, y = X[shuffle], y[shuffle]

'''Due to difficulties with log loss and set(y_pred) needing to match set(labels), 
we will remove the extremely rare crimes from the data for quality issues.'''
X_minus_trea = X[np.where(y != 'TREA')]
y_minus_trea = y[np.where(y != 'TREA')]
X_final = X_minus_trea[np.where(y_minus_trea != 'PORNOGRAPHY/OBSCENE MAT')]
y_final = y_minus_trea[np.where(y_minus_trea != 'PORNOGRAPHY/OBSCENE MAT')]

# Separate training, dev, and test data:
all_train_data, all_train_labels = X, y
dev_data, dev_labels = X_final[700000:], y_final[700000:]
train_data, train_labels = X_final[100000:700000], y_final[100000:700000]
calibrate_data, calibrate_labels = X_final[:100000], y_final[:100000]

# Mini datasets for quick prototyping
mini_train_data, mini_train_labels = X_final[:20000], y_final[:20000]
mini_calibrate_data, mini_calibrate_labels = X_final[19000:28000], y_final[19000:28000]
mini_dev_data, mini_dev_labels = X_final[49000:60000], y_final[49000:60000]

# Create list of the crime type labels.  
# This will act as the "labels" parameter for the log loss functions that follow

crime_labels = list(set(y_final))
crime_labels_mini_train = list(set(mini_train_labels))
crime_labels_mini_dev = list(set(mini_dev_labels))
crime_labels_mini_calibrate = list(set(mini_calibrate_labels))

print("There should be 37 of each type of label -- otherwise, we'll have an issue later:")
print(len(crime_labels), len(crime_labels_mini_train), len(crime_labels_mini_dev),len(crime_labels_mini_calibrate))


There should be 37 of each type of label -- otherwise, we'll have an issue later:
37 37 37 37

Formatting Test Data

The test data needs the same transformations we applied to the training data.

Download the transformed test data from here.

Download the sample submission from here.

To follow along with the notebook, ensure the paths to both in the code below are correct.


In [3]:
# The Kaggle submission format requires listing the ID of each example.
# This is to remember the order of the IDs after shuffling
allIDs = np.array(list(df.axes[0]))
allIDs = allIDs[shuffle]

testIDs = allIDs[800000:]
devIDs = allIDs[700000:800000]
trainIDs = allIDs[:700000]

# Extract the column names for the required submission format
sampleSubmission_path = "/Users/Goodgame/Desktop/prototyping/sampleSubmission.csv"
sampleDF = pd.read_csv(sampleSubmission_path)
allColumns = list(sampleDF.columns)
featureColumns = allColumns[1:]

# Extracting the test data for a baseline submission
real_test_path = "/Users/Goodgame/Desktop/prototyping/test_data_transformed.csv"
testDF = pd.read_csv(real_test_path, header=0)
real_test_data = testDF

test_complete = real_test_data.fillna(real_test_data.mean())
Test_raw = test_complete.as_matrix()

test_data = MinMaxScaler().fit_transform(Test_raw)

# Remember the ID of each test data point, in case we shuffle the test data
testIDs = list(testDF.axes[0])

Note: if the random seed is not fixed, the code above will shuffle the data differently every time it's run, so model accuracies will vary accordingly.

Defining Performance Criteria

As determined by the Kaggle submission guidelines, the performance metric for the San Francisco Crime Classification competition is multi-class logarithmic loss (also known as cross-entropy). Various other performance metrics are appropriate for different domains: accuracy, F-score, lift, ROC area, average precision, precision/recall break-even point, and squared error.

Because the Kaggle competition guidelines use multi-class logarithmic loss to score submissions, we will use that metric to gauge the effectiveness of our models in development.
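A toy illustration of the metric using scikit-learn's log_loss (the categories and probabilities here are made up for the example): the columns of the probability matrix follow the alphabetical order of the labels, and confident wrong predictions are penalized heavily.

import numpy as np
from sklearn.metrics import log_loss

y_true = ['LARCENY/THEFT', 'ASSAULT', 'LARCENY/THEFT']
labels = ['ASSAULT', 'LARCENY/THEFT', 'VANDALISM']      # column order (alphabetical)
y_prob = np.array([[0.2, 0.7, 0.1],    # mostly right
                   [0.6, 0.3, 0.1],    # right
                   [0.8, 0.1, 0.1]])   # confidently wrong
print(log_loss(y_true, y_prob, labels=labels))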

Hyperparameter Tuning and Calibration

During the course of this project, we tested dozens of models and tens of thousands of model specifications:

1) Hyperparameter tuning

Each classifier has hyperparameters that we can tune to optimize performance beyond the default parameter values. The approach is specific to each model type.

2) Model calibration

After tuning hyperparameters, we calibrated the models via Platt Scaling or Isotonic Regression to attempt to improve their performance.

We used CalibratedClassifierCV to perform probability calibration with isotonic regression or sigmoid (Platt Scaling). The parameters within CalibratedClassifierCV allowed us to adjust the method ('sigmoid' or 'isotonic') and cv (cross-validation generator). Because we train our models before calibration, we only use cv = 'prefit'. Therefore, in practice, the cross-validation generator is not a modifiable parameter for our pipeline.
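A minimal sketch of this calibration step on the prototyping subsets defined earlier, comparing the two methods; the small forest here is only for illustration, not our tuned specification:

base = RandomForestClassifier(n_estimators=10).fit(mini_train_data, mini_train_labels)

for method in ('sigmoid', 'isotonic'):
    ccv = CalibratedClassifierCV(base, method=method, cv='prefit')
    ccv.fit(mini_calibrate_data, mini_calibrate_labels)
    probs = ccv.predict_proba(mini_dev_data)
    print(method, log_loss(y_true=mini_dev_labels, y_pred=probs, labels=crime_labels))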

Models we tuned and calibrated

To see more about our work with these models, please reference the additional Jupyter notebooks in this repository.

  1. Multinomial Naive Bayes
  2. Bernoulli Naive Bayes
  3. Gaussian Naive Bayes
  4. Logistic Regression
  5. Neural Networks (from Theano and MLPClassifier)
  6. Decision Trees
  7. K-Nearest Neighbors

Additionally, we examined three types of meta-estimators:

  1. AdaBoost Classifier
  2. Bagging Classifier
  3. Gradient Boosting Classifier

The Winning Model: Random Forest

For the Random Forest classifier, we optimized the following classifier parameters (an illustrative search grid is sketched after the list):

  1. n_estimators (the number of trees in the forest)
  2. max_features
  3. max_depth
  4. min_samples_leaf
  5. bootstrap (whether or not bootstrap samples are used when building trees)
  6. oob_score (whether or not out-of-bag samples are used to estimate the generalization accuracy)
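An illustrative (not exact) version of the grid we searched over; the specific values we swept are in the model-selection notebooks in this repository, and oob_score is only meaningful when bootstrap=True, so the invalid combination is left out here:

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_features': [10, 15, 'sqrt'],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 5, 10],
    'bootstrap': [True],
    'oob_score': [True, False],
}

grid = GridSearchCV(RandomForestClassifier(criterion='entropy'), param_grid,
                    scoring='neg_log_loss')
grid.fit(mini_train_data, mini_train_labels)
print(grid.best_params_, grid.best_score_)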

Parallelizing GridSearchCV with Spark-sklearn

To optimize the parameters, we used GridSearchCV -- with a slight wrinkle. Because we needed GridSearchCV to sort through an incredible number of model specifications with a very large amount of data, we decided to parallelize the process using Spark.

Fortunately, there is a PyPI library for doing just that: spark-sklearn. Check out the package here.

In order to run spark-sklearn, we took the following steps:

  • Create an AWS EC2 instance (in our case, a c3.8xlarge instance with an Ubuntu Linux operating system, with 32 vCPUs and 60GiB of memory)
  • Install: Java, Scala, Anaconda, pip, and relevant dependencies (key library: spark_sklearn)
  • Run GridSearchCV within a SparkContext

All of the code is exactly the same as a normal scikit-learn GridSearchCV, except for two lines:

from spark_sklearn import GridSearchCV

gs = GridSearchCV(sc, clf, param_grid)

In other words, the grid search takes SparkContext as an extra parameter. Because of that, the process can be parallelized across multiple cores, which saves a lot of time.
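A minimal sketch of the parallel version, assuming it is run inside the PySpark REPL (where sc already exists) and that the training arrays and the illustrative param_grid above have been loaded in that session:

from spark_sklearn import GridSearchCV

# Identical to the scikit-learn call except that the SparkContext comes first;
# each parameter combination is then fit on a separate Spark worker/core.
gs = GridSearchCV(sc, RandomForestClassifier(criterion='entropy'), param_grid)
gs.fit(mini_train_data, mini_train_labels)
print(gs.best_params_)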

For more information on parallelizing GridSearchCV using Spark, see this DataBricks tutorial and this AWS EC2 PySpark tutorial. Note: we ran the PySpark code in the PySpark REPL, rather than in a script. We hit issues with dependencies using Python scripts. We appear not to be alone in this issue; other data scientists have also hit a wall using scikit-learn with Spark.

Model Accuracy After Hyperparameter Tuning: 2.36593209104

Final evaluation on test data

During development, we were able to achieve a multi-class logarithmic loss of 2.36593209104 using the model specification and calibration below. The isotonic calibration yielded a slightly better score than the sigmoid calibration.

The code in the cell below will generate a .csv file with the predictions on test data. It can take about 25 minutes to run, depending on your environment and hardware.


In [8]:
def generate_predictions():

    random_forest_tuned = RandomForestClassifier(min_impurity_split=1, 
                                       n_estimators=100, 
                                       bootstrap= True,
                                       max_features=15,
                                       criterion='entropy',
                                       min_samples_leaf=10,
                                       max_depth=None
                                      ).fit(X_final, y_final)
    rf_isotonic = CalibratedClassifierCV(random_forest_tuned, method = 'isotonic', cv = 'prefit')
    rf_isotonic.fit(all_train_data, all_train_labels)
    return rf_isotonic.predict_proba(test_data)

predictions = generate_predictions()
resultDF = pd.DataFrame(predictions, columns=featureColumns)

# Add the IDs as a final column
resultDF.loc[:,'Id'] = pd.Series(testIDs,index=resultDF.index)

# Make the 'Id' column the first column, per the requirements
colnames = resultDF.columns.tolist()
colnames = colnames[-1:] + colnames[:-1]
resultDF = resultDF[colnames]

# Output to a .csv file
resultDF.to_csv('result.csv',index=False)

Dev Data Log Loss

Run this code to view our log loss on development data.

Warning: it takes approximately 25 minutes to run.


In [ ]:
tuned_DT_calibrate_isotonic = RandomForestClassifier(min_impurity_split=1, 
                                       n_estimators=100, 
                                       bootstrap= True,
                                       max_features=15,
                                       criterion='entropy',
                                       min_samples_leaf=10,
                                       max_depth=None
                                      ).fit(train_data, train_labels)

ccv_isotonic = CalibratedClassifierCV(tuned_DT_calibrate_isotonic, method = 'isotonic', cv = 'prefit')
ccv_isotonic.fit(calibrate_data, calibrate_labels)
ccv_prediction_probabilities_isotonic = ccv_isotonic.predict_proba(dev_data)
working_log_loss_isotonic = log_loss(y_true = dev_labels, 
                                     y_pred = ccv_prediction_probabilities_isotonic, 
                                     labels = crime_labels)

print("Multi-class Log Loss with Random Forest and isotonic calibration:", working_log_loss_isotonic)

Error Analysis

To keep the code in this notebook streamlined towards its goal -- predicting the types of crimes in test data -- we've conducted the error analysis in separate notebooks in this repository. Please view the error analysis notebook separately.

Having said that, you can view our confusion matrix in its entirety here.

References

1) Hsiang, Solomon M., Marshall Burke, and Edward Miguel. "Quantifying the Influence of Climate on Human Conflict." Science, Vol. 341, Issue 6151, 2013.

2) Huang, Cheng-Lung, and Chieh-Jen Wang. "A GA-based feature selection and parameters optimization for support vector machines." Expert Systems with Applications, Vol. 31, 2006, pp. 231-240.

3) https://gallery.cortanaintelligence.com/Experiment/Evaluating-and-Parameter-Tuning-a-Decision-Tree-Model-1