Kaggle San Francisco Crime Classification

Berkeley MIDS W207 Final Project

Sarah Cha, Sam Goodgame, Kalvin Kao, Bryan Moore

This project attempts to predict the type of crime commited in San Francisco based on known attributes of the crime. It is part of a Kaggle competition.

Problem Description

Our task in this project is to predict the type of a crime based on its component details. The component details are primarily related to time and location.

This is an interesting problem space because time and location are both high-dimensional variables. Such variables don't tend to work well with machine learning models, because they lead the models to overfit and generalize poorly.

Accordingly, our goal in this project is to generate an accurate, parsimonious model by working our way through the model selection triple:

  • Model selection
  • Feature Engineering
  • Hyperparameter tuning (and calibration)

Dependencies


In [1]:
# Basic libraries
import time
import numpy as np
import pandas as pd

# Modeling libraries and estimators
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import log_loss
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Meta-estimators
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Calibration tools
from sklearn.calibration import CalibratedClassifierCV

# Set random seed and format print output
np.random.seed(0)
np.set_printoptions(precision=3)


/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/grid_search.py:43: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)

Initial Data Transformation

Our first step was to transform the data into a form that we could use for machine learning: a Numpy array with a single column of targets (in this case, the type of crime committed) and many columns of independent/predictor variables.

There are many ways to wrangle the data; we chose SQL for the first step:

Use DDL below to construct a table for the data within a PostgreSQL database:

CREATE TABLE kaggle_sf_crime (
dates TIMESTAMP,                                
category VARCHAR,
descript VARCHAR,
dayofweek VARCHAR,
pd_district VARCHAR,
resolution VARCHAR,
addr VARCHAR,
X FLOAT,
Y FLOAT);

Move the training data, downloaded from Kaggle as a .csv file, into the PostgreSQL database:

\copy kaggle_sf_crime FROM '/Users/Goodgame/Desktop/MIDS/207/final/sf_crime_train.csv' DELIMITER ',' CSV HEADER;

Use SQL to transform the data into a usable form:

SELECT
  category,
  date_part('hour', dates) AS hour_of_day,
  CASE
    WHEN dayofweek = 'Monday' then 1
    WHEN dayofweek = 'Tuesday' THEN 2
    WHEN dayofweek = 'Wednesday' THEN 3
    WHEN dayofweek = 'Thursday' THEN 4
    WHEN dayofweek = 'Friday' THEN 5
    WHEN dayofweek = 'Saturday' THEN 6
    WHEN dayofweek = 'Sunday' THEN 7
  END AS dayofweek_numeric,
  X,
  Y,
  CASE
    WHEN pd_district = 'BAYVIEW' THEN 1
    ELSE 0
  END AS bayview_binary,
    CASE
    WHEN pd_district = 'INGLESIDE' THEN 1
    ELSE 0
  END AS ingleside_binary,
    CASE
    WHEN pd_district = 'NORTHERN' THEN 1
    ELSE 0
  END AS northern_binary,
    CASE
    WHEN pd_district = 'CENTRAL' THEN 1
    ELSE 0
  END AS central_binary,
    CASE
    WHEN pd_district = 'MISSION' THEN 1
    ELSE 0
  END AS mission_binary,
    CASE
    WHEN pd_district = 'SOUTHERN' THEN 1
    ELSE 0
  END AS southern_binary,
    CASE
    WHEN pd_district = 'TENDERLOIN' THEN 1
    ELSE 0
  END AS tenderloin_binary,
    CASE
    WHEN pd_district = 'PARK' THEN 1
    ELSE 0
  END AS park_binary,
    CASE
    WHEN pd_district = 'RICHMOND' THEN 1
    ELSE 0
  END AS richmond_binary,
    CASE
    WHEN pd_district = 'TARAVAL' THEN 1
    ELSE 0
  END AS taraval_binary
FROM kaggle_sf_crime;

Feature Engineering

One-Hot Encoding

Location One-Hot Encoding

Location information tends to be extremely high-dimensional, which presents difficulties for modeling. The original dataset's location information came in three major formats:

  1. X and Y coordinates
  2. Street address information
  3. Police department districts

After visualizing the data and conducting basic exploratory data analysis, we decided to use one-hot encoding the transform the police department location information into features. In other words, each police department becomes a feature, and a given observation receives a value of '1' if it occured in the police department described by the feature. Otherwise, it received a value of '0'. Approaching location in this way allowed us to preserve the parsimony of our model; it retains the majority of the important variance in the data without overfitting.

Time One-Hot Encoding

We took the same approach to simplifying time data. Time is also high-dimensional; if we were to try to model crimes based on the provided timestamps (which include seconds), then the eight weeks of data generate 4,838,400 possibilities. We used one-hot encoding to simplify those possibilities into one 7-dimensional day-of-week feature, and an addition 24-dimensional hour-of-day feature. We beleive that these features capture the most variation from the data and allow our models to generalize without risking overfitting.

Weather Data (Kalvin Add Here)

We sought to add features to our models that improved performance with respect to out desired performance metric. Scientists before us have documented correlations between weather patterns and crime; some experts even argue for a causal relationship between weather and crime [1]. More specifically, a 2013 paper published in Science demonstrated that higher temperatures and extreme rainfall led to large increases in conflict. Due to this research, we see weather as a source for additional features to improve the performance of our classifiers. We gathered weather data from the National Centers for Environmental Information --specifically, teh National Oceanic and Atmospheric Administration.

We created the following weather features: <<<>>>>

<<<>>>

Data Wrangling

Once we generated our features, we tranformed them into the appropriate formats. We also broke the data into subsets for training and development. We also created smaller subsets for prototyping quickly.

One point of note: we removed two categories from the data. The crimes labelled 'TREA' (treason) and 'PORNOGRAPHY/OBSCENE MAT' occurred so infrequently that they presented difficulties for modeling. The best approach was simply to remove them. We assume the risk of incorrectly identifying these crimes in the test set, but doing so allows us to proceed with modeling unhindered by off-by-one errors.

The final CSV we use for modeling is here--to follow along with our code, download it and adjust the first line of the code below to reference wherever it's stored.


In [3]:
data_path = "/Users/Goodgame/Desktop/prototyping/x_data_3.csv"
df = pd.read_csv(data_path, header=0)
x_data = df.drop('category', 1)
y = df.category.as_matrix()

# Impute missing values with mean values:
x_complete = x_data.fillna(x_data.mean())
X_raw = x_complete.as_matrix()

# Scale the data between 0 and 1:
X = MinMaxScaler().fit_transform(X_raw)

# Shuffle data to remove any underlying pattern that may exist
np.random.seed(0)
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, y = X[shuffle], y[shuffle]

'''Due to difficulties with log loss and set(y_pred) needing to match set(labels), 
we will remove the extremely rare crimes from the data for quality issues.'''
X_minus_trea = X[np.where(y != 'TREA')]
y_minus_trea = y[np.where(y != 'TREA')]
X_final = X_minus_trea[np.where(y_minus_trea != 'PORNOGRAPHY/OBSCENE MAT')]
y_final = y_minus_trea[np.where(y_minus_trea != 'PORNOGRAPHY/OBSCENE MAT')]

# Separate training, dev, and test data:
all_train_data, all_train_labels = X, y
dev_data, dev_labels = X_final[700000:], y_final[700000:]
train_data, train_labels = X_final[100000:700000], y_final[100000:700000]
calibrate_data, calibrate_labels = X_final[:100000], y_final[:100000]

# Mini datasets for quick prototyping
mini_train_data, mini_train_labels = X_final[:20000], y_final[:20000]
mini_calibrate_data, mini_calibrate_labels = X_final[19000:28000], y_final[19000:28000]
mini_dev_data, mini_dev_labels = X_final[49000:60000], y_final[49000:60000]

# Create list of the crime type labels.  
# This will act as the "labels" parameter for the log loss functions that follow

crime_labels = list(set(y_final))
crime_labels_mini_train = list(set(mini_train_labels))
crime_labels_mini_dev = list(set(mini_dev_labels))
crime_labels_mini_calibrate = list(set(mini_calibrate_labels))

print("There should be 37 of each type of label -- otherwise, we'll have an issue later:")
print(len(crime_labels), len(crime_labels_mini_train), len(crime_labels_mini_dev),len(crime_labels_mini_calibrate))


There should be 37 of each type of label -- otherwise, we'll have an issue later:
37 37 37 37

Formatting Test Data

The test data needs the same transformations we applied to the training data.

Download the transformed test data from here.

Download the sample submission from here.

To follow along with the notebook, ensure the paths to both in the code below are correct.


In [4]:
# The Kaggle submission format requires listing the ID of each example.
# This is to remember the order of the IDs after shuffling
allIDs = np.array(list(df.axes[0]))
allIDs = allIDs[shuffle]

testIDs = allIDs[800000:]
devIDs = allIDs[700000:800000]
trainIDs = allIDs[:700000]

# Extract the column names for the required submission format
sampleSubmission_path = "/Users/Goodgame/Desktop/prototyping/sampleSubmission.csv"
sampleDF = pd.read_csv(sampleSubmission_path)
allColumns = list(sampleDF.columns)
featureColumns = allColumns[1:]

# Extracting the test data for a baseline submission
real_test_path = "/Users/Goodgame/Desktop/prototyping/test_data_transformed.csv"
testDF = pd.read_csv(real_test_path, header=0)
real_test_data = testDF

test_complete = real_test_data.fillna(real_test_data.mean())
Test_raw = test_complete.as_matrix()

test_data = MinMaxScaler().fit_transform(Test_raw)

# Remember the ID of each test data point, in case we shuffle the test data
testIDs = list(testDF.axes[0])

Note: the code above will shuffle data differently every time it's run, so model accuracies will vary accordingly.

Defining Performance Criteria

As determined by the Kaggle submission guidelines, the performance criteria metric for the San Francisco Crime Classification competition is Multi-class Logarithmic Loss (also known as cross-entropy). There are various other performance metrics that are appropriate for different domains: accuracy, F-score, Lift, ROC Area, average precision, precision/recall break-even point, and squared error.

Because the Kaggle competition guidelines use multi-class logarithmic loss to score submissions, we will use that metric to gauge the effectiveness of our models in development.

Hyperparameter Tuning and Calibration

During the course of this project, we tested dozens of models and tens of thousands of model specifications:

1) Hyperparameter tuning

Each classifier has parameters that we can engineer to further optimize performance, as opposed to using the default parameter values. The approach is specific to each model type.

2) Model calibration

After tuning hyperparameters, we calibrated the models via Platt Scaling or Isotonic Regression to attempt to improve their performance.

We used CalibratedClassifierCV to perform probability calibration with isotonic regression or sigmoid (Platt Scaling). The parameters within CalibratedClassifierCV allowed us to adjust the method ('sigmoid' or 'isotonic') and cv (cross-validation generator). Because we train our models before calibration, we only use cv = 'prefit'. Therefore, in practice, the cross-validation generator is not a modifiable parameter for our pipeline.

Models we tuned and calibrated

To see more about our work with these models, please reference the additional Jupyter notebooks in this repository.

  1. Multinomial Naive Bayes
  2. Bernoulli Naive Bayes
  3. Gaussian Naive Bayes
  4. Logistic Regression
  5. Neural Networks (from Theano and MLPClassifier)
  6. Decision Trees
  7. K-Nearest Neighbors

Additionally, we examined three types of meta-estimators:

  1. AdaBoost Classifier
  2. Bagging Classifier
  3. Gradient Boosting Classifier

The Winning Model: Random Forest

For the Random Forest classifier, we optimized the following classifier parameters:

  1. n_estimators (the number of trees in the forsest)
  2. max_features
  3. max_depth
  4. min_samples_leaf
  5. bootstrap (whether or not bootstrap samples are used when building trees)
  6. oob_score (whether or not out-of-bag samples are used to estimate the generalization accuracy)

Parallelizing GridSearchCV with Spark-sklearn

To optimize the parameters, we used GridSearchCV -- with a slight wrinkle. Because we needed GridSearchCV to sort through an incredible number of model specifications with a very large amount of data, we decided to parallelize the process using Spark.

Fortunately, there is a PyPI library for doing just that: spark-sklearn. Check out the package here.

In order to run spark-sklearn, we took the following steps:

  • Create an AWS EC2 instance (in our case, a c3.8xlarge instance with an Ubuntu Linux operating system, with 32 vCPUs and 60GiB of memory)
  • Install: Java, Scala, Anaconda, pip, and relevant dependencies (key library: spark_sklearn)
  • Run GridSearchCV within a SparkContext

All of the code is the exact same as a normal GridSearchCV with scikit-learn, except for two lines:

$ from spark_sklearn import GridSearchCV

$ gs = GridSearchCV(sc, clf, param_grid)

In other words, the grid search takes SparkContext as an extra parameter. Because of that, the process can be parallelized across multiple cores, which saves a lot of time.

For more information on parallelizing GridSearchCV using Spark, see this DataBricks tutorial and this AWS EC2 PySpark tutorial. Note: we ran the PySpark code in the PySpark REPL, rather than in a script. We hit issues with dependencies using Python scripts. We appear not to be alone in this issue; other data scientists have also hit a wall using scikit-learn with Spark.

Model Accuracy After Hyperparameter Tuning:

Final evaluation on test data

During development, we were able to achieve a multi-class logarithmic loss of 2.36593209104 using the model specification and calibration below. The isotonic calibration yielded slightly more accuracy than the sigmoid calibration.


In [5]:
def generate_predictions():

    random_forest_tuned = RandomForestClassifier(min_impurity_split=1, 
                                       n_estimators=100, 
                                       bootstrap= True,
                                       max_features=15,
                                       criterion='entropy',
                                       min_samples_leaf=10,
                                       max_depth=None
                                      ).fit(all_train_data, all_train_labels)
    rf_isotonic = CalibratedClassifierCV(random_forest_tuned, method = 'isotonic', cv = 'prefit')
    rf_isotonic.fit(all_train_data, all_train_labels)
    return rf_isotonic.predict_proba(test_data)

predictions = generate_predictions()
resultDF = pd.DataFrame(predictions, columns=featureColumns)

# Add the IDs as a final column
resultDF.loc[:,'Id'] = pd.Series(testIDs,index=resultDF.index)

# Make the 'Id' column the first column, per the requirements
colnames = resultDF.columns.tolist()
colnames = colnames[-1:] + colnames[:-1]
resultDF = resultDF[colnames]

# Output to a .csv file
resultDF.to_csv('result.csv',index=False)

Error Analysis: Calibration (I haven't made this work yet. We actually only submit the prediction probabilities, not the category we predict -- so to get the actual prediction, we'd just have to take the highest probability from the predictions variable above)


In [ ]:
def error_calibration(buckets, clf_probabilities, clf_predictions, labels):
    """inputs:
    clf_probabilities = clf.predict_proba(dev_data)
    clf_predictions = clf.predict(dev_data)
    labels = dev_labels"""
    
    #buckets = [0.05, 0.15, 0.3, 0.5, 0.8]
    #buckets = [0.15, 0.25, 0.3, 1.0]
    correct = [0 for i in buckets]
    total = [0 for i in buckets]

    lLimit = 0
    uLimit = 0
    for i in range(len(buckets)):
        uLimit = buckets[i]
        for j in range(clf_probabilities.shape[0]):
            if (np.amax(clf_probabilities[j]) > lLimit) and (np.amax(clf_probabilities[j]) <= uLimit):
                if clf_predictions[j] == labels[j]:
                    correct[i] += 1
                total[i] += 1
        lLimit = uLimit
        
    print(sum(correct))
    print(sum(total))
    print(correct)
    print(total)

    # Report the classifier accuracy for each posterior probability bucket
    accuracies = []
    for k in range(len(buckets)):
        print(1.0*correct[k]/total[k])
        accuracies.append(1.0*correct[k]/total[k])
        print('p(pred) <= %.13f    total = %3d    correct = %3d    accuracy = %.3f' \
              %(buckets[k], total[k], correct[k], 1.0*correct[k]/total[k]))
    plt.plot(buckets,accuracies)
    plt.title("Calibration Analysis")
    plt.xlabel("Posterior Probability")
    plt.ylabel("Classifier Accuracy")
    
    return buckets, accuracies

pd.DataFrame(np.amax(bestLRPredictionProbabilities, axis=1)).hist()

buckets = [0.15, 0.25, 0.3, 1.0]
calibration_buckets, calibration_accuracies = error_calibration(buckets, clf_probabilities=bestLRPredictionProbabilities, 
                                                                         clf_predictions=bestLRPredictions,
                                                                         labels=mini_dev_labels)

In [16]:
# Each row in 'predictions' is the probability that that feature is the predicted category

print(len(predictions))
print(len(predictions[0]))
print(predictions[0])


884262
39
[  0.000e+00   3.788e-01   0.000e+00   0.000e+00   7.576e-08   0.000e+00
   0.000e+00   5.543e-09   7.487e-09   0.000e+00   0.000e+00   0.000e+00
   0.000e+00   0.000e+00   0.000e+00   0.000e+00   1.125e-02   0.000e+00
   0.000e+00   5.710e-02   1.093e-02   2.268e-02   0.000e+00   0.000e+00
   0.000e+00   1.232e-03   0.000e+00   4.726e-08   0.000e+00   0.000e+00
   0.000e+00   0.000e+00   2.835e-05   0.000e+00   0.000e+00   4.208e-01
   9.715e-02   1.550e-05   0.000e+00]

In [ ]:

Error Analysis: Classification Report (I haven't made this work yet)


In [ ]:
def classification_report(clf_predictions, labels):
    """inputs:
    clf_predictions = clf.predict(dev_data)
    labels = dev_labels"""
    print('Classification Report:')
    report = classification_report(labels, clf_predictions)
    print(report)
    return report


report = classification_report(clf_predictions=bestLRPredictions, labels=mini_dev_labels)

Error Analysis: Confusion Matrix (I haven't made this work yet)

To see the confusion matrix in its entirety, click here.


In [ ]:
def confusion_matrix(label_names, clf_predictions, labels):
    """inputs:
    clf_predictions = clf.predict(dev_data)
    labels = dev_labels"""
    cm = pd.DataFrame(confusion_matrix(labels, clf_predictions, labels=label_names))
    cm.columns=label_names
    cm.index=label_names
    cm.to_csv(path_or_buf="./confusion_matrix.csv")
    #print(cm)
    return cm

print(confustion_matrix(label_names=crime_labels_mini_dev, clf_predictions=bestLRPredictions,
                                                            labels=mini_dev_labels))

References

1) Hsiang, Solomon M. and Burke, Marshall and Miguel, Edward. "Quantifying the Influence of Climate on Human Conflict". Science, Vol 341, Issue 6151, 2013

2) Huang, Cheng-Lung. Wang, Chieh-Jen. "A GA-based feature selection and parameters optimization for support vector machines". Expert Systems with Applications, Vol 31, 2006, p 231-240

3) https://gallery.cortanaintelligence.com/Experiment/Evaluating-and-Parameter-Tuning-a-Decision-Tree-Model-1