This project attempts to predict the type of crime committed in San Francisco based on known attributes of the crime. It is part of a Kaggle competition.
Our task in this project is to predict the type of a crime based on its component details. The component details are primarily related to time and location.
This is an interesting problem space because time and location are both high-dimensional variables. Such variables don't tend to work well with machine learning models, because they lead the models to overfit and generalize poorly.
Accordingly, our goal in this project is to generate an accurate, parsimonious model by working our way through the model selection triple: feature engineering, model selection, and hyperparameter tuning.
In [1]:
# Basic libraries
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Modeling libraries and estimators
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import log_loss
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
# Meta-estimators
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier
# Calibration tools
from sklearn.calibration import CalibratedClassifierCV
# Set random seed and format print output
np.random.seed(0)
np.set_printoptions(precision=3)
Our first step was to transform the data into a form that we could use for machine learning: a Numpy array with a single column of targets (in this case, the type of crime committed) and many columns of independent/predictor variables.
There are many ways to wrangle the data; we chose SQL for the first step:
CREATE TABLE kaggle_sf_crime (
    dates TIMESTAMP,
    category VARCHAR,
    descript VARCHAR,
    dayofweek VARCHAR,
    pd_district VARCHAR,
    resolution VARCHAR,
    addr VARCHAR,
    X FLOAT,
    Y FLOAT);
\copy kaggle_sf_crime FROM '/Users/Goodgame/Desktop/MIDS/207/final/sf_crime_train.csv' DELIMITER ',' CSV HEADER;
SELECT
    category,
    date_part('hour', dates) AS hour_of_day,
    CASE
        WHEN dayofweek = 'Monday' THEN 1
        WHEN dayofweek = 'Tuesday' THEN 2
        WHEN dayofweek = 'Wednesday' THEN 3
        WHEN dayofweek = 'Thursday' THEN 4
        WHEN dayofweek = 'Friday' THEN 5
        WHEN dayofweek = 'Saturday' THEN 6
        WHEN dayofweek = 'Sunday' THEN 7
    END AS dayofweek_numeric,
    X,
    Y,
    CASE WHEN pd_district = 'BAYVIEW' THEN 1 ELSE 0 END AS bayview_binary,
    CASE WHEN pd_district = 'INGLESIDE' THEN 1 ELSE 0 END AS ingleside_binary,
    CASE WHEN pd_district = 'NORTHERN' THEN 1 ELSE 0 END AS northern_binary,
    CASE WHEN pd_district = 'CENTRAL' THEN 1 ELSE 0 END AS central_binary,
    CASE WHEN pd_district = 'MISSION' THEN 1 ELSE 0 END AS mission_binary,
    CASE WHEN pd_district = 'SOUTHERN' THEN 1 ELSE 0 END AS southern_binary,
    CASE WHEN pd_district = 'TENDERLOIN' THEN 1 ELSE 0 END AS tenderloin_binary,
    CASE WHEN pd_district = 'PARK' THEN 1 ELSE 0 END AS park_binary,
    CASE WHEN pd_district = 'RICHMOND' THEN 1 ELSE 0 END AS richmond_binary,
    CASE WHEN pd_district = 'TARAVAL' THEN 1 ELSE 0 END AS taraval_binary
FROM kaggle_sf_crime;
Location information tends to be extremely high-dimensional, which presents difficulties for modeling. The original dataset's location information came in three major formats: the street address, the X and Y coordinates, and the police department district.
After visualizing the data and conducting basic exploratory data analysis, we decided to use one-hot encoding to transform the police district information into features. In other words, each police district becomes a feature, and a given observation receives a value of '1' if the crime occurred in the district described by that feature and a value of '0' otherwise. Approaching location in this way preserves the parsimony of our model; it retains most of the important variance in the data without overfitting.
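For readers following along in pandas rather than SQL, the snippet below is a minimal sketch of the same one-hot encoding idea; the toy DataFrame and the `district_` column prefix are illustrative and not part of our actual pipeline.
In [ ]:
# Hypothetical pandas equivalent of the SQL CASE statements above:
# each police district becomes its own 0/1 indicator column.
import pandas as pd

toy = pd.DataFrame({'pd_district': ['BAYVIEW', 'MISSION', 'TENDERLOIN', 'MISSION']})
district_dummies = pd.get_dummies(toy['pd_district'], prefix='district')
toy_encoded = pd.concat([toy, district_dummies], axis=1)
print(toy_encoded)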
We took the same approach to simplifying time data. Time is also high-dimensional; if we were to try to model crimes based on the provided timestamps (which include seconds), the eight weeks of data would generate 4,838,400 possibilities. We used one-hot encoding to simplify those possibilities into a 7-dimensional day-of-week feature and an additional 24-dimensional hour-of-day feature. We believe these features capture the most important variation in the data and allow our models to generalize without overfitting.
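The time features can be sketched the same way. The code below is an illustration with made-up timestamps, assuming pandas is used instead of the SQL above; it derives the hour-of-day and day-of-week values and expands them into indicator columns.
In [ ]:
# Hypothetical illustration: derive hour-of-day and day-of-week from raw
# timestamps, then one-hot encode them (24 + 7 indicator columns).
import pandas as pd

toy_times = pd.DataFrame({'dates': pd.to_datetime(['2015-05-13 23:53:00',
                                                   '2015-05-14 08:15:00'])})
toy_times['hour_of_day'] = toy_times['dates'].dt.hour                   # 0-23
toy_times['dayofweek_numeric'] = toy_times['dates'].dt.dayofweek + 1    # 1 = Monday ... 7 = Sunday
hour_dummies = pd.get_dummies(toy_times['hour_of_day'], prefix='hour')
dow_dummies = pd.get_dummies(toy_times['dayofweek_numeric'], prefix='dow')
time_features = pd.concat([hour_dummies, dow_dummies], axis=1)
print(time_features)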
We sought to add features that improved performance with respect to our chosen performance metric. Scientists before us have documented correlations between weather patterns and crime; some experts even argue for a causal relationship between weather and crime [1]. More specifically, a 2013 paper published in Science demonstrated that higher temperatures and extreme rainfall led to large increases in conflict. Given this research, we saw weather as a source of additional features that might improve the performance of our classifiers. We gathered weather data from the National Oceanic and Atmospheric Administration -- specifically, its National Centers for Environmental Information.
We created a set of weather features from this data and added them to our feature matrix.
Once we generated our features, we transformed them into the appropriate formats and broke the data into subsets for training and development, along with smaller subsets for quick prototyping.
One point of note: we removed two categories from the data. The crimes labelled 'TREA' (treason) and 'PORNOGRAPHY/OBSCENE MAT' occurred so infrequently that they presented difficulties for modeling, and the best approach was simply to remove them. We accept the risk of misclassifying these crimes in the test set, but removing them lets us proceed with modeling without the label-set mismatch errors they caused when computing log loss.
The final CSV we use for modeling is here--to follow along with our code, download it and adjust the first line of the code below to reference wherever it's stored.
In [3]:
data_path = "/Users/Goodgame/Desktop/prototyping/x_data_3.csv"
df = pd.read_csv(data_path, header=0)
x_data = df.drop('category', axis=1)
y = df.category.values
# Impute missing values with mean values:
x_complete = x_data.fillna(x_data.mean())
X_raw = x_complete.values
# Scale the data between 0 and 1:
X = MinMaxScaler().fit_transform(X_raw)
# Shuffle data to remove any underlying pattern that may exist
np.random.seed(0)
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, y = X[shuffle], y[shuffle]
'''Due to difficulties with log loss and set(y_pred) needing to match set(labels),
we will remove the extremely rare crimes from the data for quality issues.'''
X_minus_trea = X[np.where(y != 'TREA')]
y_minus_trea = y[np.where(y != 'TREA')]
X_final = X_minus_trea[np.where(y_minus_trea != 'PORNOGRAPHY/OBSCENE MAT')]
y_final = y_minus_trea[np.where(y_minus_trea != 'PORNOGRAPHY/OBSCENE MAT')]
# Separate training, dev, and test data:
all_train_data, all_train_labels = X, y
dev_data, dev_labels = X_final[700000:], y_final[700000:]
train_data, train_labels = X_final[100000:700000], y_final[100000:700000]
calibrate_data, calibrate_labels = X_final[:100000], y_final[:100000]
# Mini datasets for quick prototyping
mini_train_data, mini_train_labels = X_final[:20000], y_final[:20000]
mini_calibrate_data, mini_calibrate_labels = X_final[19000:28000], y_final[19000:28000]
mini_dev_data, mini_dev_labels = X_final[49000:60000], y_final[49000:60000]
# Create list of the crime type labels.
# This will act as the "labels" parameter for the log loss functions that follow
crime_labels = list(set(y_final))
crime_labels_mini_train = list(set(mini_train_labels))
crime_labels_mini_dev = list(set(mini_dev_labels))
crime_labels_mini_calibrate = list(set(mini_calibrate_labels))
print("There should be 37 of each type of label -- otherwise, we'll have an issue later:")
print(len(crime_labels), len(crime_labels_mini_train), len(crime_labels_mini_dev),len(crime_labels_mini_calibrate))
In [4]:
# The Kaggle submission format requires listing the ID of each example.
# This is to remember the order of the IDs after shuffling
allIDs = np.array(list(df.axes[0]))
allIDs = allIDs[shuffle]
testIDs = allIDs[800000:]
devIDs = allIDs[700000:800000]
trainIDs = allIDs[:700000]
# Extract the column names for the required submission format
sampleSubmission_path = "/Users/Goodgame/Desktop/prototyping/sampleSubmission.csv"
sampleDF = pd.read_csv(sampleSubmission_path)
allColumns = list(sampleDF.columns)
featureColumns = allColumns[1:]
# Extracting the test data for a baseline submission
real_test_path = "/Users/Goodgame/Desktop/prototyping/test_data_transformed.csv"
testDF = pd.read_csv(real_test_path, header=0)
real_test_data = testDF
test_complete = real_test_data.fillna(real_test_data.mean())
Test_raw = test_complete.values
test_data = MinMaxScaler().fit_transform(Test_raw)
# Remember the ID of each test data point, in case we shuffle the test data
testIDs = list(testDF.axes[0])
Note: unless the random seed is fixed (as with np.random.seed(0) above), the code will shuffle the data differently every time it's run, and model accuracies will vary accordingly.
As determined by the Kaggle submission guidelines, the performance metric for the San Francisco Crime Classification competition is multi-class logarithmic loss (also known as cross-entropy). Various other performance metrics are appropriate for other domains: accuracy, F-score, lift, ROC area, average precision, precision/recall break-even point, and squared error.
Because the Kaggle competition guidelines use multi-class logarithmic loss to score submissions, we will use that metric to gauge the effectiveness of our models in development.
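As a minimal sketch of that evaluation loop -- assuming the data subsets and label lists defined above, and that every subset contains all 37 crime categories -- we can fit a quick baseline and score its predicted probabilities with log_loss, passing the full label list so the probability columns and the label set stay aligned:
In [ ]:
# Baseline fit on the mini training split, scored by multi-class log loss
# on the mini dev split. Assumes all 37 categories appear in every subset.
baseline = BernoulliNB().fit(mini_train_data, mini_train_labels)
baseline_probabilities = baseline.predict_proba(mini_dev_data)
print(log_loss(mini_dev_labels, baseline_probabilities, labels=crime_labels))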
Over the course of this project, we tested dozens of models and tens of thousands of model specifications. Each classifier has hyperparameters that we can tune to optimize performance beyond the default values; the tuning approach is specific to each model type.
After tuning hyperparameters, we calibrated the models via Platt Scaling or Isotonic Regression to attempt to improve their performance.
We used CalibratedClassifierCV to perform probability calibration with isotonic regression or sigmoid (Platt Scaling). The parameters within CalibratedClassifierCV allowed us to adjust the method ('sigmoid' or 'isotonic') and cv (cross-validation generator). Because we train our models before calibration, we only use cv = 'prefit'. Therefore, in practice, the cross-validation generator is not a modifiable parameter for our pipeline.
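The snippet below is a minimal sketch of that calibration workflow on the mini prototyping splits defined earlier; the base model and its settings are illustrative rather than our tuned configuration.
In [ ]:
# Fit a model, wrap it in CalibratedClassifierCV with cv='prefit', and
# compare Platt scaling ('sigmoid') against isotonic regression by log loss
# on the mini dev split.
base_clf = RandomForestClassifier(n_estimators=50).fit(mini_train_data, mini_train_labels)
for method in ['sigmoid', 'isotonic']:
    calibrated = CalibratedClassifierCV(base_clf, method=method, cv='prefit')
    calibrated.fit(mini_calibrate_data, mini_calibrate_labels)
    probabilities = calibrated.predict_proba(mini_dev_data)
    print(method, log_loss(mini_dev_labels, probabilities, labels=crime_labels))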
Additionally, we examined three types of meta-estimators: AdaBoost, Bagging, and Gradient Boosting. One example of the pattern is sketched below.
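The sketch wraps a decision tree in a bagging ensemble; the base estimator and ensemble size are illustrative, not our tuned configuration.
In [ ]:
# Illustrative meta-estimator: bag 20 decision trees and score the ensemble's
# averaged class probabilities with log loss on the mini dev split.
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=20)
bagged_trees.fit(mini_train_data, mini_train_labels)
bagged_probabilities = bagged_trees.predict_proba(mini_dev_data)
print(log_loss(mini_dev_labels, bagged_probabilities, labels=crime_labels))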
For the Random Forest classifier, we optimized the following parameters (the tuned values appear in the prediction code below): n_estimators, criterion, max_features, max_depth, min_samples_leaf, min_impurity_split, and bootstrap.
To optimize the parameters, we used GridSearchCV -- with a slight wrinkle. Because we needed GridSearchCV to sort through an incredible number of model specifications with a very large amount of data, we decided to parallelize the process using Spark.
Fortunately, there is a PyPI library for doing just that: spark-sklearn. Check out the package here.
In order to run spark-sklearn, we took the following steps:
All of the code is exactly the same as with a normal GridSearchCV in scikit-learn, except for two lines:
from spark_sklearn import GridSearchCV
gs = GridSearchCV(sc, clf, param_grid)
In other words, the grid search takes SparkContext as an extra parameter. Because of that, the process can be parallelized across multiple cores, which saves a lot of time.
For more information on parallelizing GridSearchCV using Spark, see this Databricks tutorial and this AWS EC2 PySpark tutorial. Note: we ran the PySpark code in the PySpark REPL rather than in a script, because we hit dependency issues with Python scripts. We appear not to be alone; other data scientists have also hit a wall using scikit-learn with Spark.
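Putting those two lines in context, the sketch below shows roughly what the parallelized random forest grid search looked like in the PySpark REPL. It assumes the REPL's SparkContext is available as sc, and the parameter grid shown is illustrative rather than the exact grid we searched.
In [ ]:
# Parallelized grid search with spark-sklearn: identical to scikit-learn's
# GridSearchCV except that the SparkContext is passed as the first argument.
from spark_sklearn import GridSearchCV

param_grid = {'n_estimators': [50, 100, 200],
              'max_features': [10, 15, 'sqrt'],
              'min_samples_leaf': [1, 5, 10]}
clf = RandomForestClassifier(criterion='entropy')
gs = GridSearchCV(sc, clf, param_grid)
gs.fit(mini_train_data, mini_train_labels)
print(gs.best_params_)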
In [5]:
def generate_predictions():
    random_forest_tuned = RandomForestClassifier(min_impurity_split=1,
                                                 n_estimators=100,
                                                 bootstrap=True,
                                                 max_features=15,
                                                 criterion='entropy',
                                                 min_samples_leaf=10,
                                                 max_depth=None).fit(all_train_data, all_train_labels)
    rf_isotonic = CalibratedClassifierCV(random_forest_tuned, method='isotonic', cv='prefit')
    rf_isotonic.fit(all_train_data, all_train_labels)
    return rf_isotonic.predict_proba(test_data)

predictions = generate_predictions()
resultDF = pd.DataFrame(predictions, columns=featureColumns)
# Add the IDs as a final column
resultDF.loc[:,'Id'] = pd.Series(testIDs,index=resultDF.index)
# Make the 'Id' column the first column, per the requirements
colnames = resultDF.columns.tolist()
colnames = colnames[-1:] + colnames[:-1]
resultDF = resultDF[colnames]
# Output to a .csv file
resultDF.to_csv('result.csv',index=False)
In [ ]:
def error_calibration(buckets, clf_probabilities, clf_predictions, labels):
    """inputs:
    clf_probabilities = clf.predict_proba(dev_data)
    clf_predictions = clf.predict(dev_data)
    labels = dev_labels"""
    # Example bucket boundaries:
    # buckets = [0.05, 0.15, 0.3, 0.5, 0.8]
    # buckets = [0.15, 0.25, 0.3, 1.0]
    correct = [0 for i in buckets]
    total = [0 for i in buckets]
    lLimit = 0
    uLimit = 0
    for i in range(len(buckets)):
        uLimit = buckets[i]
        for j in range(clf_probabilities.shape[0]):
            if (np.amax(clf_probabilities[j]) > lLimit) and (np.amax(clf_probabilities[j]) <= uLimit):
                if clf_predictions[j] == labels[j]:
                    correct[i] += 1
                total[i] += 1
        lLimit = uLimit
    print(sum(correct))
    print(sum(total))
    print(correct)
    print(total)
    # Report the classifier accuracy for each posterior probability bucket
    accuracies = []
    for k in range(len(buckets)):
        print(1.0*correct[k]/total[k])
        accuracies.append(1.0*correct[k]/total[k])
        print('p(pred) <= %.13f    total = %3d    correct = %3d    accuracy = %.3f'
              % (buckets[k], total[k], correct[k], 1.0*correct[k]/total[k]))
    plt.plot(buckets, accuracies)
    plt.title("Calibration Analysis")
    plt.xlabel("Posterior Probability")
    plt.ylabel("Classifier Accuracy")
    return buckets, accuracies

pd.DataFrame(np.amax(bestLRPredictionProbabilities, axis=1)).hist()

buckets = [0.15, 0.25, 0.3, 1.0]
calibration_buckets, calibration_accuracies = error_calibration(buckets,
                                                                clf_probabilities=bestLRPredictionProbabilities,
                                                                clf_predictions=bestLRPredictions,
                                                                labels=mini_dev_labels)
In [16]:
# Each row in 'predictions' gives the predicted probability of each crime category for one test example
print(len(predictions))
print(len(predictions[0]))
print(predictions[0])
In [ ]:
def print_classification_report(clf_predictions, labels):
    """inputs:
    clf_predictions = clf.predict(dev_data)
    labels = dev_labels"""
    print('Classification Report:')
    # Delegate to scikit-learn's classification_report (imported above)
    report = classification_report(labels, clf_predictions)
    print(report)
    return report

report = print_classification_report(clf_predictions=bestLRPredictions, labels=mini_dev_labels)
To see the confusion matrix in its entirety, click here.
In [ ]:
def build_confusion_matrix(label_names, clf_predictions, labels):
    """inputs:
    clf_predictions = clf.predict(dev_data)
    labels = dev_labels"""
    # Delegate to scikit-learn's confusion_matrix (imported above)
    cm = pd.DataFrame(confusion_matrix(labels, clf_predictions, labels=label_names))
    cm.columns = label_names
    cm.index = label_names
    cm.to_csv(path_or_buf="./confusion_matrix.csv")
    #print(cm)
    return cm

print(build_confusion_matrix(label_names=crime_labels_mini_dev, clf_predictions=bestLRPredictions,
                             labels=mini_dev_labels))
1) Hsiang, Solomon M., Marshall Burke, and Edward Miguel. "Quantifying the Influence of Climate on Human Conflict." Science, Vol. 341, Issue 6151, 2013.
2) Huang, Cheng-Lung, and Chieh-Jen Wang. "A GA-based feature selection and parameters optimization for support vector machines." Expert Systems with Applications, Vol. 31, 2006, pp. 231-240.