Kaggle San Francisco Crime Classification: Supporting Notebook

UCB MIDS W207 Final Project: Sam Goodgame, Sarah Cha, Kalvin Kao, Bryan Moore

Purpose

This notebook supports the final, refined notebook submitted for the Berkeley MIDS W207 Final Project. It contains the code and markdown that contributed to our final notebook, graphs, and formal presentation, and serves as a reference for the final notebook.

Throughout this notebook, there are local dependencies that require paths to local files. The corresponding code is suppressed for viewing comfort, so as not to overwhelm the notebook with errors.

Environment Engineering


In [2]:
# Import relevant libraries:
import time
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import log_loss
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Import Meta-estimators
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier

# Import Calibration tools
from sklearn.calibration import CalibratedClassifierCV

# Set random seed and format print output:
np.random.seed(0)
np.set_printoptions(precision=3)


/Users/Bryan/anaconda/lib/python3.6/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/Users/Bryan/anaconda/lib/python3.6/site-packages/sklearn/grid_search.py:43: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)

Data Cleaning

Here we include the data definition language that we wrote to bring our csv data into formatted tables:

CREATE TABLE kaggle_sf_crime (
dates TIMESTAMP,                                
category VARCHAR,
descript VARCHAR,
dayofweek VARCHAR,
pd_district VARCHAR,
resolution VARCHAR,
addr VARCHAR,
X FLOAT,
Y FLOAT);

Getting training data into a locally hosted PostgreSQL database:

\copy kaggle_sf_crime FROM '/Users/Goodgame/Desktop/MIDS/207/final/sf_crime_train.csv' DELIMITER ',' CSV HEADER;

SQL query used to perform our transformations:

SELECT
  category,
  date_part('hour', dates) AS hour_of_day,
  CASE
    WHEN dayofweek = 'Monday' then 1
    WHEN dayofweek = 'Tuesday' THEN 2
    WHEN dayofweek = 'Wednesday' THEN 3
    WHEN dayofweek = 'Thursday' THEN 4
    WHEN dayofweek = 'Friday' THEN 5
    WHEN dayofweek = 'Saturday' THEN 6
    WHEN dayofweek = 'Sunday' THEN 7
  END AS dayofweek_numeric,
  X,
  Y,
  CASE
    WHEN pd_district = 'BAYVIEW' THEN 1
    ELSE 0
  END AS bayview_binary,
    CASE
    WHEN pd_district = 'INGLESIDE' THEN 1
    ELSE 0
  END AS ingleside_binary,
    CASE
    WHEN pd_district = 'NORTHERN' THEN 1
    ELSE 0
  END AS northern_binary,
    CASE
    WHEN pd_district = 'CENTRAL' THEN 1
    ELSE 0
  END AS central_binary,
    CASE
    WHEN pd_district = 'BAYVIEW' THEN 1
    ELSE 0
  END AS pd_bayview_binary,
    CASE
    WHEN pd_district = 'MISSION' THEN 1
    ELSE 0
  END AS mission_binary,
    CASE
    WHEN pd_district = 'SOUTHERN' THEN 1
    ELSE 0
  END AS southern_binary,
    CASE
    WHEN pd_district = 'TENDERLOIN' THEN 1
    ELSE 0
  END AS tenderloin_binary,
    CASE
    WHEN pd_district = 'PARK' THEN 1
    ELSE 0
  END AS park_binary,
    CASE
    WHEN pd_district = 'RICHMOND' THEN 1
    ELSE 0
  END AS richmond_binary,
    CASE
    WHEN pd_district = 'TARAVAL' THEN 1
    ELSE 0
  END AS taraval_binary
FROM kaggle_sf_crime;

The initial source of our data was https://www.kaggle.com/c/sf-crime. We loaded the original training data, train.csv, then extracted the dates column and added it to the transformed data. Dates were converted to the number of seconds from the Unix epoch (the number of seconds that have elapsed since 00:00:00 Coordinated Universal Time (UTC) on Thursday, 1 January 1970, minus the leap seconds that have occurred since then). We then moved the dates to the first column.
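
A minimal sketch of the epoch conversion described above, assuming a pandas Series of timestamp strings in the Kaggle 'Dates' format (the sample timestamps are illustrative only):


In [ ]:
import calendar
import time
import pandas as pd

# Parse the timestamp strings, then convert each struct_time to seconds from the Unix epoch
dates = pd.Series(["2015-05-13 23:53:00", "2015-05-13 23:33:00"])
parsed = dates.apply(time.strptime, args=("%Y-%m-%d %H:%M:%S",))
seconds_from_epoch = parsed.apply(calendar.timegm)
print(seconds_from_epoch)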

Next, we one-hot encoded the location data. Location information tends to be extremely high-dimensional, which presents difficulties for modeling. The original dataset's location information came in three major formats:

  1. X and Y coordinates
  2. Street address information
  3. Police department districts

After visualizing the data and conducting basic exploratory data analysis, we decided to use one-hot encoding to transform the police department location information into features. In other words, each police department becomes a feature, and a given observation receives a value of '1' if it occurred in the police department jurisdiction described by the feature. Otherwise, it receives a value of '0'. Approaching location in this way allowed us to preserve the parsimony of our model; it retains the majority of the important variance in the data without overfitting.
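
The same one-hot encoding can be produced in pandas; a minimal sketch, assuming a DataFrame with a 'pd_district' column (the toy rows are made up for illustration):


In [ ]:
import pandas as pd

# Each district becomes its own 0/1 indicator column, mirroring the SQL CASE expressions above
df = pd.DataFrame({"pd_district": ["BAYVIEW", "MISSION", "PARK"]})
district_dummies = pd.get_dummies(df["pd_district"], prefix="pd").astype(int)
df = pd.concat([df, district_dummies], axis=1)
print(df)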

Feature Engineering: Weather Data, School Data, and Zipcode Data

We sought to add features to our models that would improve performance with respect to our desired performance metric of Multi-class Log Loss. There is evidence of a correlation between weather patterns and crime, with some experts even arguing for a causal relationship between weather and crime [1]. More specifically, a 2013 paper published in Science showed that higher temperatures and extreme rainfall led to large increases in conflict. Given this evidence that weather influences crime, we saw weather as a candidate source of additional features to improve the performance of our classifiers. Weather data was gathered from the National Centers for Environmental Information (specifically, the National Oceanic and Atmospheric Administration). Certain features from this data set were incorporated into the original crime data set in order to add features that were hypothesized to improve performance. These features included:

  • Hourly Dry Bulb Temperature
  • Hourly Relative Humidity
  • Hourly Wind Speed
  • Hourly Sea Level Pressure
  • Hourly Visibility
  • Daylight: a daylight indicator (1 if time of sunrise < timestamp < time of sunset, and 0 otherwise; see the sketch after this list)
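
A minimal sketch of the daylight indicator described in the last bullet, assuming the crime timestamp, sunrise, and sunset are all expressed in seconds from the epoch (the sample values are illustrative only):


In [ ]:
def daylight_indicator(timestamp, sunrise, sunset):
    # 1 if the incident occurred between sunrise and sunset, 0 otherwise
    return 1 if sunrise <= timestamp < sunset else 0

print(daylight_indicator(1430920000, 1430900000, 1430950000))  # 1 (daytime)
print(daylight_indicator(1430960000, 1430900000, 1430950000))  # 0 (after sunset)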

Details on School data and Zipcode data are below.

Loading The Data With Additional Features

There are local file dependencies here. For viewing comfort, we will suppress the code in order to not generate error messages. The code that follows was written by us to incorporate the weather data features into the original data from Kaggle, and to load and then subset the data into train, development, calibrate, and test subsets.

A separate portion of code follows that was used to load the data from the CSV file that is generated from the code here. Further details on data cleaning will be provided there.

Weather Data


In [ ]:
#data_path = "./data/train_transformed.csv"

#df = pd.read_csv(data_path, header=0)
#x_data = df.drop('category', 1)
#y = df.category.as_matrix()

########## Adding the date back into the data
#import csv
#import time
#import calendar
#data_path = "./data/train.csv"
#dataCSV = open(data_path, 'rt')
#csvData = list(csv.reader(dataCSV))
#csvFields = csvData[0] #['Dates', 'Category', 'Descript', 'DayOfWeek', 'PdDistrict', 'Resolution', 'Address', 'X', 'Y']
#allData = csvData[1:]
#dataCSV.close()

#df2 = pd.DataFrame(allData)
#df2.columns = csvFields
#dates = df2['Dates']
#dates = dates.apply(time.strptime, args=("%Y-%m-%d %H:%M:%S",))
#dates = dates.apply(calendar.timegm)
#print(dates.head())

#x_data['secondsFromEpoch'] = dates
#colnames = x_data.columns.tolist()
#colnames = colnames[-1:] + colnames[:-1]
#x_data = x_data[colnames]
##########

########## Adding the weather data into the original crime data
#weatherData1 = "./data/1027175.csv"
#weatherData2 = "./data/1027176.csv"
#dataCSV = open(weatherData1, 'rt')
#csvData = list(csv.reader(dataCSV))
#csvFields = csvData[0] #['Dates', 'Category', 'Descript', 'DayOfWeek', 'PdDistrict', 'Resolution', 'Address', 'X', 'Y']
#allWeatherData1 = csvData[1:]
#dataCSV.close()

#dataCSV = open(weatherData2, 'rt')
#csvData = list(csv.reader(dataCSV))
#csvFields = csvData[0] #['Dates', 'Category', 'Descript', 'DayOfWeek', 'PdDistrict', 'Resolution', 'Address', 'X', 'Y']
#allWeatherData2 = csvData[1:]
#dataCSV.close()

#weatherDF1 = pd.DataFrame(allWeatherData1)
#weatherDF1.columns = csvFields
#dates1 = weatherDF1['DATE']
#sunrise1 = weatherDF1['DAILYSunrise']
#sunset1 = weatherDF1['DAILYSunset']

#weatherDF2 = pd.DataFrame(allWeatherData2)
#weatherDF2.columns = csvFields
#dates2 = weatherDF2['DATE']
#sunrise2 = weatherDF2['DAILYSunrise']
#sunset2 = weatherDF2['DAILYSunset']

#functions for processing the sunrise and sunset times of each day
#def get_hour_and_minute(milTime):
 #   hour = int(milTime[:-2])
 #   minute = int(milTime[-2:])
 #   return [hour, minute]

#def get_date_only(date):
#    return time.struct_time(tuple([date[0], date[1], date[2], 0, 0, 0, date[6], date[7], date[8]]))

#def structure_sun_time(timeSeries, dateSeries):
#    sunTimes = timeSeries.copy()
#    for index in range(len(dateSeries)):
#        sunTimes[index] = time.struct_time(tuple([dateSeries[index][0], dateSeries[index][1], dateSeries[index][2], timeSeries[index][0], timeSeries[index][1], dateSeries[index][5], dateSeries[index][6], dateSeries[index][7], dateSeries[index][8]]))
#    return sunTimes

#dates1 = dates1.apply(time.strptime, args=("%Y-%m-%d %H:%M",))
#sunrise1 = sunrise1.apply(get_hour_and_minute)
#sunrise1 = structure_sun_time(sunrise1, dates1)
#sunrise1 = sunrise1.apply(calendar.timegm)
#sunset1 = sunset1.apply(get_hour_and_minute)
#sunset1 = structure_sun_time(sunset1, dates1)
#sunset1 = sunset1.apply(calendar.timegm)
#dates1 = dates1.apply(calendar.timegm)

#dates2 = dates2.apply(time.strptime, args=("%Y-%m-%d %H:%M",))
#sunrise2 = sunrise2.apply(get_hour_and_minute)
#sunrise2 = structure_sun_time(sunrise2, dates2)
#sunrise2 = sunrise2.apply(calendar.timegm)
#sunset2 = sunset2.apply(get_hour_and_minute)
#sunset2 = structure_sun_time(sunset2, dates2)
#sunset2 = sunset2.apply(calendar.timegm)
#dates2 = dates2.apply(calendar.timegm)

#weatherDF1['DATE'] = dates1
#weatherDF1['DAILYSunrise'] = sunrise1
#weatherDF1['DAILYSunset'] = sunset1
#weatherDF2['DATE'] = dates2
#weatherDF2['DAILYSunrise'] = sunrise2
#weatherDF2['DAILYSunset'] = sunset2

#weatherDF = pd.concat([weatherDF1,weatherDF2[32:]],ignore_index=True)

# Starting off with some of the easier features to work with-- more to come here . . . still in beta
#weatherMetrics = weatherDF[['DATE','HOURLYDRYBULBTEMPF','HOURLYRelativeHumidity', 'HOURLYWindSpeed', \
#                            'HOURLYSeaLevelPressure', 'HOURLYVISIBILITY', 'DAILYSunrise', 'DAILYSunset']]
#weatherMetrics = weatherMetrics.convert_objects(convert_numeric=True)
#weatherDates = weatherMetrics['DATE']
#'DATE','HOURLYDRYBULBTEMPF','HOURLYRelativeHumidity', 'HOURLYWindSpeed',
#'HOURLYSeaLevelPressure', 'HOURLYVISIBILITY'
#timeWindow = 10800 #3 hours
#hourlyDryBulbTemp = []
#hourlyRelativeHumidity = []
#hourlyWindSpeed = []
#hourlySeaLevelPressure = []
#hourlyVisibility = []
#dailySunrise = []
#dailySunset = []
#daylight = []
#test = 0
#for timePoint in dates:#dates is the epoch time from the kaggle data
#    relevantWeather = weatherMetrics[(weatherDates <= timePoint) & (weatherDates > timePoint - timeWindow)]
#    hourlyDryBulbTemp.append(relevantWeather['HOURLYDRYBULBTEMPF'].mean())
#    hourlyRelativeHumidity.append(relevantWeather['HOURLYRelativeHumidity'].mean())
#    hourlyWindSpeed.append(relevantWeather['HOURLYWindSpeed'].mean())
#    hourlySeaLevelPressure.append(relevantWeather['HOURLYSeaLevelPressure'].mean())
#    hourlyVisibility.append(relevantWeather['HOURLYVISIBILITY'].mean())
#    dailySunrise.append(relevantWeather['DAILYSunrise'].iloc[-1])
#    dailySunset.append(relevantWeather['DAILYSunset'].iloc[-1])
#    daylight.append(1.0*((timePoint >= relevantWeather['DAILYSunrise'].iloc[-1]) and (timePoint < relevantWeather['DAILYSunset'].iloc[-1])))
    #if timePoint < relevantWeather['DAILYSunset'][-1]:
        #daylight.append(1)
    #else:
        #daylight.append(0)
    
#    if test%100000 == 0:
#        print(relevantWeather)
#    test += 1

#hourlyDryBulbTemp = pd.Series.from_array(np.array(hourlyDryBulbTemp))
#hourlyRelativeHumidity = pd.Series.from_array(np.array(hourlyRelativeHumidity))
#hourlyWindSpeed = pd.Series.from_array(np.array(hourlyWindSpeed))
#hourlySeaLevelPressure = pd.Series.from_array(np.array(hourlySeaLevelPressure))
#hourlyVisibility = pd.Series.from_array(np.array(hourlyVisibility))
#dailySunrise = pd.Series.from_array(np.array(dailySunrise))
#dailySunset = pd.Series.from_array(np.array(dailySunset))
#daylight = pd.Series.from_array(np.array(daylight))

#x_data['HOURLYDRYBULBTEMPF'] = hourlyDryBulbTemp
#x_data['HOURLYRelativeHumidity'] = hourlyRelativeHumidity
#x_data['HOURLYWindSpeed'] = hourlyWindSpeed
#x_data['HOURLYSeaLevelPressure'] = hourlySeaLevelPressure
#x_data['HOURLYVISIBILITY'] = hourlyVisibility
#x_data['DAILYSunrise'] = dailySunrise
#x_data['DAILYSunset'] = dailySunset
#x_data['Daylight'] = daylight

#x_data.to_csv(path_or_buf="C:/MIDS/W207 final project/x_data.csv")
##########

# Impute missing values with mean values:
#x_complete = x_data.fillna(x_data.mean())
#X_raw = x_complete.as_matrix()

# Scale the data between 0 and 1:
#X = MinMaxScaler().fit_transform(X_raw)

# Shuffle data to remove any underlying pattern that may exist:
#shuffle = np.random.permutation(np.arange(X.shape[0]))
#X, y = X[shuffle], y[shuffle]

# Separate training, dev, and test data:
#test_data, test_labels = X[800000:], y[800000:]
#dev_data, dev_labels = X[700000:800000], y[700000:800000]
#train_data, train_labels = X[:700000], y[:700000]

#mini_train_data, mini_train_labels = X[:75000], y[:75000]
#mini_dev_data, mini_dev_labels = X[75000:100000], y[75000:100000]
#labels_set = set(mini_dev_labels)
#print(labels_set)
#print(len(labels_set))
#print(train_data[:10])

School Data

We hypothesized that incorporating data on school locations, as well as other details about schools, as features would increase the performance of our classifiers. We found a data set offered by the California Department of Education (cde.ca.gov) with a list of all schools (K-12) in California and accompanying characteristics of those schools, including activity status, grades taught (elementary vs. middle vs. high school), and address, including location coordinates. We created a distance function to match the latitude/longitude coordinates of each crime to the closest school and pulled other potentially interesting features, including:

  1. ‘Closest_school’ - Name of closest school
  2. ‘School_distance’ - Euclidean distance between latitude/longitude coordinates of crime and latitude/longitude coordinates of school
  3. ‘School_type’ - Elementary school vs Middle school vs High school

Ultimately, we were not able to get these features incorporated into our data, although the attempt was educational.


In [ ]:
### Read in school data
#data_path_schools = "./data/pubschls.csv"
#schools = pd.read_csv(data_path_schools,header=0, sep ='\t', usecols = ["CDSCode","StatusType", "School", "EILCode", "EILName", "Zip", "Latitude", "Longitude"], dtype ={'CDSCode': str, 'StatusType': str, 'School': str, 'EILCode': str,'EILName': str,'Zip': str, 'Latitude': float, 'Longitude': float})
#schools = schools[(schools["StatusType"] == 'Active')]

### Find the closest school
#def dist(lat1, long1, lat2, long2):
#    return np.sqrt((lat1-lat2)**2+(long1-long2)**2)

#def find_closest_school(lat, long):    
#    distances = schools.apply(lambda row: dist(lat, long, row["Latitude"], row["Longitude"]), axis=1)
#    return min(distances)
#x_data['closest_school'] = x_data_sub.apply(lambda row: find_closest_school(row['y'], row['x']), axis=1)

Zipcode Data

We found government data from data.gov that links each zip code in the country to longitude and latitude coordinates, in an effort to reduce the location dimensions from 2 to 1. We used a custom distance function to map the longitude/latitude coordinates of each crime incident to the closest 5-digit zip code in the data.gov file. Unfortunately, even after we narrowed the data to California zip codes (> 94000 and < 95000), we were unable to process the data in time to add the feature to the data set.


In [ ]:
### Read in zip code data
#data_path_zip = "./data/2016_zips.csv"
#zips = pd.read_csv(data_path_zip, header=0, sep ='\t', usecols = [0,5,6], names = ["GEOID", "INTPTLAT", "INTPTLONG"], dtype ={'GEOID': int, 'INTPTLAT': float, 'INTPTLONG': float})
#sf_zips = zips[(zips['GEOID'] > 94000) & (zips['GEOID'] < 94189)]

### Mapping longitude/latitude to zipcodes
#def dist(lat1, long1, lat2, long2):
#    return np.sqrt((lat1-lat2)**2+(long1-long2)**2)
#    return abs(lat1-lat2)+abs(long1-long2)
#def find_zipcode(lat, long):    
#    distances = sf_zips.apply(lambda row: dist(lat, long, row["INTPTLAT"], row["INTPTLONG"]), axis=1)
#    return sf_zips.loc[distances.idxmin(), "GEOID"]
#x_data['zipcode'] = 0
#for i in range(0, 1):
#    x_data['zipcode'][i] = x_data.apply(lambda row: find_zipcode(row['x'], row['y']), axis=1)
#x_data['zipcode']= x_data.apply(lambda row: find_zipcode(row['x'], row['y']), axis=1)

Below is the local, individual clean and load of the updated data set (with weather data integrated) generated by the previous cells.

Our final set of features was:

  1. hour_of_day
  2. dayofweek_numeric
  3. x: is this the longitude of the crime?
  4. y: is this the latitude of the crime?
  5. bayview_binary
  6. ingleside_binary
  7. northern_binary
  8. central_binary
  9. mission_binary
  10. southern_binary
  11. tenderloin_binary
  12. park_binary
  13. richmond_binary
  14. taraval_binary
  15. HOURLYDRYBULBTEMPF
  16. HOURLYRelativeHumidity
  17. HOURLYWindSpeed
  18. HOURLYSeaLevelPressure
  19. HOURLYVISIBILITY
  20. Daylight: 1 if ‘DAILYSunrise’ <= ‘DATE’ < ‘DAILYSunset’, and 0 otherwise

As you can see below, the data required further cleaning and transformation.

We imputed missing values with the feature's mean value. We then scaled features between 0 and 1. The data was shuffled to remove any underlying pattern that may exist.

Kaggle dictated that our performance metric was log loss. We did not submit this value directly; Kaggle computed it from a submitted CSV file containing a matrix of probability estimates for each class of crime for each data point. In our work, we sought to optimize the log loss directly by hyperparameter tuning, calibrating, and applying meta-estimation to our models, which meant that we would not be submitting our probability estimates each time we attempted to improve a model. This posed an operational difficulty, though: the log loss function in sklearn.metrics requires that the set of true labels match the set of classes represented by the columns of the predicted probabilities (unless the full class list is passed via the labels parameter).

Two of the crimes in our data set were extremely rare: trespassing on an industrial property (TREA) and possessing obscene material (PORNOGRAPHY/OBSCENE MAT), with 4 and 8 occurrences in approximately 1 million crimes, respectively. At such rare rates, those crimes often failed to show up when we randomly shuffled the data and then split it into the necessary train, development, calibration, and test subsets, which caused errors when generating the log loss. We therefore decided to sacrifice our performance on these 12 out of roughly 1 million data points in order to compute our performance metric more quickly.

To ensure that all the data subsets contained all of the remaining crimes as labels (all 37 of them), the data load sometimes required repeated iterations in order to secure a crime set of 37 across the board. This was due to the inability to set a seed when using the permutation method in numpy. The data subsets were sized such that it would not require more than a few attempts to get all 37 crime labels into each subset.
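
As a concrete illustration of the labels workaround mentioned above: if a rare class is absent from an evaluation subset, passing the full class list to sklearn's log_loss keeps the probability columns aligned instead of raising an error. A minimal sketch with toy data (the labels and probabilities below are purely illustrative):


In [ ]:
import numpy as np
from sklearn.metrics import log_loss

all_labels = ["ASSAULT", "LARCENY/THEFT", "VANDALISM"]
y_true = ["ASSAULT", "LARCENY/THEFT"]            # "VANDALISM" is absent from this subset
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])             # columns follow the order of all_labels
print(log_loss(y_true=y_true, y_pred=y_prob, labels=all_labels))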

There are local file dependencies here. For viewing comfort, we will suppress the code in order to not generate error messages. The code that follows was written by us to incorporate the weather data features into the original data from Kaggle, and to load and then subset the data into train, development, calibrate, and test subsets.


In [3]:
#data_path = "/Users/Bryan/Desktop/UC_Berkeley_MIDS_files/Courses/W207_Intro_To_Machine_Learning/Final_Project/x_data_3.csv"
#df = pd.read_csv(data_path, header=0)
#x_data = df.drop('category', 1)
#y = df.category.as_matrix()

# Impute missing values with mean values:
#x_complete = x_data.fillna(x_data.mean())
#X_raw = x_complete.as_matrix()

# Scale the data between 0 and 1:
#X = MinMaxScaler().fit_transform(X_raw)

# Shuffle data to remove any underlying pattern that may exist.  Must re-run random seed step each time:
#np.random.seed(0)
#shuffle = np.random.permutation(np.arange(X.shape[0]))
#X, y = X[shuffle], y[shuffle]

# Due to difficulties with log loss and set(y_pred) needing to match set(labels), we will remove the extremely rare
# crimes from the data for quality issues.
#X_minus_trea = X[np.where(y != 'TREA')]
#y_minus_trea = y[np.where(y != 'TREA')]
#X_final = X_minus_trea[np.where(y_minus_trea != 'PORNOGRAPHY/OBSCENE MAT')]
#y_final = y_minus_trea[np.where(y_minus_trea != 'PORNOGRAPHY/OBSCENE MAT')]

# Separate training, dev, and test data:
#test_data, test_labels = X_final[800000:], y_final[800000:]
#dev_data, dev_labels = X_final[700000:800000], y_final[700000:800000]
#train_data, train_labels = X_final[100000:700000], y_final[100000:700000]
#calibrate_data, calibrate_labels = X_final[:100000], y_final[:100000]

# Create mini versions of the above sets
#mini_train_data, mini_train_labels = X_final[:20000], y_final[:20000]
#mini_calibrate_data, mini_calibrate_labels = X_final[19000:28000], y_final[19000:28000]
#mini_dev_data, mini_dev_labels = X_final[49000:60000], y_final[49000:60000]

# Create list of the crime type labels.  This will act as the "labels" parameter for the log loss functions that follow
#crime_labels = list(set(y_final))
#crime_labels_mini_train = list(set(mini_train_labels))
#crime_labels_mini_dev = list(set(mini_dev_labels))
#crime_labels_mini_calibrate = list(set(mini_calibrate_labels))
#print(len(crime_labels), len(crime_labels_mini_train), len(crime_labels_mini_dev),len(crime_labels_mini_calibrate))

#print(len(train_data),len(train_labels))
#print(len(dev_data),len(dev_labels))
#print(len(mini_train_data),len(mini_train_labels))
#print(len(mini_dev_data),len(mini_dev_labels))
#print(len(test_data),len(test_labels))
#print(len(mini_calibrate_data),len(mini_calibrate_labels))
#print(len(calibrate_data),len(calibrate_labels))


37 37 37 37

Exploratory Analysis Of The Train Data

Plot distribution of classes of the training data


In [ ]:
import matplotlib.pyplot as plt  # plotting library; not imported in the environment cell above

yFrame = pd.DataFrame(y, columns=['category'])
yCounts = yFrame['category'].value_counts()

f = plt.figure(figsize=(15,8))
yCounts.plot(kind='bar')
plt.title('Distribution of Classes in the Training Data')
plt.ylabel('Frequency')
plt.show()

Generate table showing the fraction of the total variance explained by the first k components


In [ ]:
from sklearn.decomposition import PCA  # PCA is not imported in the environment cell above

pca1 = PCA(n_components=20)
pca1.fit(train_data)
explainedVariances = pca1.explained_variance_ratio_

totalExplained = []
for index in range(len(explainedVariances)):
    totalExplained.append(sum(explainedVariances[:index+1]))

pd.DataFrame(totalExplained)

Plot the fraction of the total variance explained by the first k components


In [ ]:
plt.figure(figsize=(15,8))
plt.plot(np.linspace(1,20,20),totalExplained,".")
plt.title("Fraction of Total Variance Explained by First k Principal Components")
plt.xlabel("Number of Components")
plt.ylabel("Fraction of Total Variance Explained")
plt.show()

Formatting Output for Kaggle

Kaggle required a very specific format for submitting our results. Submissions are evaluated using the multi-class logarithmic loss. Each incident must be labeled with one true class. For each incident, one must submit a set of predicted probabilities (one for every class). A CSV file with the incident id, all candidate class names, and a probability for each class is required for the final submission document.


In [86]:
# The Kaggle submission format requires listing the ID of each example.
# This is to remember the order of the IDs after shuffling
#allIDs = np.array(list(df.axes[0]))
#allIDs = allIDs[shuffle]

#testIDs = allIDs[800000:]
#devIDs = allIDs[700000:800000]
#trainIDs = allIDs[:700000]

# Extract the column names for the required submission format
#sampleSubmission_path = "./data/sampleSubmission.csv"
#sampleDF = pd.read_csv(sampleSubmission_path)
#allColumns = list(sampleDF.columns)
#featureColumns = allColumns[1:]

# Extracting the test data for a baseline submission
#real_test_path = "./data/test_transformed.csv"
#testDF = pd.read_csv(real_test_path, header=0)
#real_test_data = testDF

#test_complete = real_test_data.fillna(real_test_data.mean())
#Test_raw = test_complete.as_matrix()

#TestData = MinMaxScaler().fit_transform(Test_raw)

# Here we remember the ID of each test data point, in case we ever decide to shuffle the test data for some reason
#testIDs = list(testDF.axes[0])

Generating Baseline Prediction Probabilities from the MNB Classifier and Storing Them in a .csv File

This step was completed as a test run to ensure that everything up to this point was functioning properly and that we could generate a proper submission CSV file. We also provided an example of the model accuracy being generated, although this was not our formal metric for evaluation.


In [87]:
# Generate a baseline MNB classifier and make it return prediction probabilities for the actual test data
#def MNB():
#    mnb = MultinomialNB(alpha = 0.0000001)
#    mnb.fit(train_data, train_labels)
#    print("\n\nMultinomialNB accuracy on dev data:", mnb.score(dev_data, dev_labels))
#    return mnb.predict_proba(dev_data)
#MNB()

#baselinePredictionProbabilities = MNB()

# Place the resulting prediction probabilities in a .csv file in the required format
# First, turn the prediction probabilties into a data frame
#resultDF = pd.DataFrame(baselinePredictionProbabilities,columns=featureColumns)
# Add the IDs as a final column
#resultDF.loc[:,'Id'] = pd.Series(testIDs,index=resultDF.index)
# Make the 'Id' column the first column
#colnames = resultDF.columns.tolist()
#colnames = colnames[-1:] + colnames[:-1]
#resultDF = resultDF[colnames]
# Output to a .csv file
# resultDF.to_csv('result.csv',index=False)

Note: the code above will shuffle data differently every time it's run, so model accuracies will vary accordingly.

We later wrote a Python script to automatically generate a Kaggle-formatted CSV document for our results.


In [ ]:
# -*- coding: utf-8 -*-
"""
Created on Sat Aug 19 19:49:47 2017
@author: kalvi
"""

#required imports
import pandas as pd
import numpy as np

def make_kaggle_format(sample_submission_path, test_transformed_path, prediction_probabilities):
    """this function requires:
        sample_submission_path=(filepath for 'sampleSubmission.csv')
        test_transformed_path=(filepath for 'test_transformed.csv')
        prediction_probabilities=(output from clf.predict_proba(test_data))"""
    
    #this is for extracting the column names for the required submission format
    sampleDF = pd.read_csv(sample_submission_path)
    allColumns = list(sampleDF.columns)
    featureColumns = allColumns[1:]
    
    #this is for extracting the test data IDs for our baseline submission
    testDF = pd.read_csv(test_transformed_path, header=0)
    testIDs = list(testDF.axes[0])
    
    #first we turn the prediction probabilties into a data frame
    resultDF = pd.DataFrame(prediction_probabilities,columns=featureColumns)
    
    #this adds the IDs as a final column
    resultDF.loc[:,'Id'] = pd.Series(testIDs,index=resultDF.index)
    
    #the next few lines make the 'Id' column the first column
    colnames = resultDF.columns.tolist()
    colnames = colnames[-1:] + colnames[:-1]
    resultDF = resultDF[colnames]
    
    #output to .csv file
    resultDF.to_csv('result.csv',index=False)

In [26]:
## Data sub-setting quality check-point
print(train_data[:1])
print(train_labels[:1])


[[ 0.016  0.985  0.826  0.667  0.055  0.002  0.     0.     0.     1.     0.
   0.     0.     0.     0.     0.     0.     0.514  0.405  0.375  0.661  1.
   0.985  0.985  0.   ]]
['LARCENY/THEFT']

In [27]:
# Modeling quality check-point with MNB--fast model

def MNB():
    mnb = MultinomialNB(alpha = 0.0000001)
    mnb.fit(train_data, train_labels)
    print("\n\nMultinomialNB accuracy on dev data:", mnb.score(dev_data, dev_labels))
    
MNB()



MultinomialNB accuracy on dev data: 0.22347

Defining Performance Criteria

As determined by the Kaggle submission guidelines, the performance criteria metric for the San Francisco Crime Classification competition is Multi-class Logarithmic Loss (also known as cross-entropy). There are various other performance metrics that are appropriate for different domains: accuracy, F-score, Lift, ROC Area, average precision, precision/recall break-even point, and squared error.

  • Multi-class Log Loss

  • Accuracy

  • F-score

  • Lift

  • ROC Area

  • Average precision

  • Precision/Recall break-even point

  • Squared-error
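
For reference, a minimal NumPy sketch of the multi-class log loss computation (the toy labels and probabilities are purely illustrative):


In [ ]:
import numpy as np

def multiclass_log_loss(y_true, y_prob, labels, eps=1e-15):
    """Mean negative log of the probability assigned to each observation's true class."""
    label_to_col = {label: j for j, label in enumerate(labels)}
    p = np.clip(y_prob, eps, 1 - eps)
    p = p / p.sum(axis=1, keepdims=True)  # renormalize rows after clipping
    cols = np.array([label_to_col[label] for label in y_true])
    return -np.mean(np.log(p[np.arange(len(y_true)), cols]))

labels = ["ASSAULT", "LARCENY/THEFT", "VANDALISM"]
y_true = ["LARCENY/THEFT", "ASSAULT"]
y_prob = np.array([[0.2, 0.7, 0.1],
                   [0.6, 0.3, 0.1]])
print(multiclass_log_loss(y_true, y_prob, labels))  # ~0.43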

Model Prototyping With Default Hyperparameters

We started our classifier engineering process by looking at the performance of various classifiers with default parameter settings in predicting labels on the mini_dev_data:


In [46]:
def model_prototype(train_data, train_labels, eval_data, eval_labels):
    knn = KNeighborsClassifier(n_neighbors=5).fit(train_data, train_labels)
    bnb = BernoulliNB(alpha=1, binarize = 0.5).fit(train_data, train_labels)
    mnb = MultinomialNB().fit(train_data, train_labels)
    log_reg = LogisticRegression().fit(train_data, train_labels)
    neural_net = MLPClassifier().fit(train_data, train_labels)
    random_forest = RandomForestClassifier().fit(train_data, train_labels)
    decision_tree = DecisionTreeClassifier().fit(train_data, train_labels)
    support_vm_step_one = svm.SVC(probability = True)
    support_vm = support_vm_step_one.fit(train_data, train_labels)
    
    models = [knn, bnb, mnb, log_reg, neural_net, random_forest, decision_tree, support_vm]
    for model in models:
        eval_prediction_probabilities = model.predict_proba(eval_data)
        eval_predictions = model.predict(eval_data)
        print(model, "Multi-class Log Loss:", log_loss(y_true = eval_labels, y_pred = eval_prediction_probabilities, labels = crime_labels_mini_dev), "\n\n")

model_prototype(mini_train_data, mini_train_labels, mini_dev_data, mini_dev_labels)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform') Multi-class Log Loss: 21.0240643644 


BernoulliNB(alpha=1, binarize=0.5, class_prior=None, fit_prior=True) Multi-class Log Loss: 2.6947927812 


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True) Multi-class Log Loss: 2.60974496429 


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False) Multi-class Log Loss: 2.59547592791 


MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False) Multi-class Log Loss: 2.60265495281 


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False) Multi-class Log Loss: 15.5020995603 


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best') Multi-class Log Loss: 29.8634820265 


SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False) Multi-class Log Loss: 2.62373605675 


Hyperparameter Tuning and Model Calibration To Improve Prediction For Each Classifier

Here we sought to optimize the performance of our classifiers in a multi-step, dynamic engineering process.

1) Feature addition

We previously added components from the weather data into the original SF crime data as new features. We will not repeat the work done in our initial submission, where our training dataset did not include these features. For a comparison of how the added features improved our performance with respect to log loss, please refer back to our initial submission.

2) Hyperparameter tuning

Each classifier has hyperparameters that we can engineer to further optimize performance, as opposed to using the default parameter values as we did above in the model prototyping cell. This will be specific to each classifier as detailed below.

3) Model calibration

We can calibrate the models via Platt Scaling or Isotonic Regression to attempt to improve their performance.

  • Platt Scaling: fits a logistic (sigmoid) function to the classifier's outputs on a held-out calibration set, mapping raw scores to calibrated probabilities; it works best when the distortion in the predicted probabilities is roughly sigmoid-shaped.

  • Isotonic Regression: fits a non-parametric, non-decreasing (stepwise) function to the classifier's outputs on the calibration set; it is more flexible than Platt Scaling but can overfit when the calibration set is small.

For each classifier, we can use CalibratedClassifierCV to perform probability calibration with isotonic regression or sigmoid (Platt Scaling). The parameters within CalibratedClassifierCV that we can adjust are the method ('sigmoid' or 'isotonic') and cv (cross-validation generator). As we will already be training our models before calibration, we will only use cv = 'prefit'. Thus, in practice the cross-validation generator will not be a modifiable parameter for us.
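
A minimal sketch of that calibration workflow, assuming the data subsets defined earlier in this notebook (mini_train_*, mini_calibrate_*, mini_dev_*) are in memory:


In [ ]:
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import BernoulliNB

base_clf = BernoulliNB().fit(mini_train_data, mini_train_labels)       # train the base model first
ccv = CalibratedClassifierCV(base_clf, method='sigmoid', cv='prefit')  # 'sigmoid' = Platt Scaling
ccv.fit(mini_calibrate_data, mini_calibrate_labels)                    # calibrate on held-out data
calibrated_probabilities = ccv.predict_proba(mini_dev_data)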

4) Parallelizing GridSearchCV with Spark-sklearn

To optimize the parameters, we used GridSearchCV -- with a slight wrinkle. Because we needed GridSearchCV to sort through an incredible number of model specifications with a very large amount of data, we decided to parallelize the process using Spark. Fortunately, there is a PyPI library for doing just that: spark-sklearn. Check out the package here.

In order to run spark-sklearn, we took the following steps:

  • Create an AWS EC2 instance (in our case, a c3.8xlarge instance with an Ubuntu Linux operating system, with 32 vCPUs and 60GiB of memory)
  • Install: Java, Scala, Anaconda, pip, and relevant dependencies (key library: spark_sklearn)
  • Run GridSearchCV within a SparkContext
  • All of the code is the exact same as a normal GridSearchCV with scikit-learn, except for two lines:

"from spark_sklearn import GridSearchCV"

"gs = GridSearchCV(sc, clf, param_grid)"

In other words, the grid search takes SparkContext as an extra parameter. Because of that, the process can be parallelized across multiple cores, which saves a lot of time. For more information on parallelizing GridSearchCV using Spark, see this DataBricks tutorial and this AWS EC2 PySpark tutorial. Note: we ran the PySpark code in the PySpark REPL, rather than in a script. We hit issues with dependencies using Python scripts. We appear not to be alone in this issue; other data scientists have also hit a wall using scikit-learn with Spark.
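
A hedged sketch of the parallelized grid search, assuming a running SparkContext and an installed spark_sklearn package; aside from the sc argument, the call mirrors scikit-learn's GridSearchCV:


In [ ]:
from pyspark import SparkContext
from spark_sklearn import GridSearchCV
from sklearn.linear_model import LogisticRegression

sc = SparkContext.getOrCreate()
clf = LogisticRegression(penalty='l2')
param_grid = {'C': [0.01, 0.1, 1.0, 10.0]}
gs = GridSearchCV(sc, clf, param_grid)   # SparkContext passed as the extra first argument
gs.fit(train_data, train_labels)         # assumes train_data/train_labels from earlier cells
print(gs.best_params_)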

K-Nearest Neighbors

Hyperparameter tuning:

For the KNN classifier, we can seek to optimize the following classifier parameters: n_neighbors, weights, and the power parameter ('p').


In [28]:
list_for_ks = []
list_for_ws = []
list_for_ps = []
list_for_log_loss = []

def k_neighbors_tuned(k,w,p):
    tuned_KNN = KNeighborsClassifier(n_neighbors=k, weights=w, p=p).fit(mini_train_data, mini_train_labels)
    dev_prediction_probabilities = tuned_KNN.predict_proba(mini_dev_data)
    list_for_ks.append(this_k)
    list_for_ws.append(this_w)
    list_for_ps.append(this_p)
    working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = dev_prediction_probabilities, labels = crime_labels_mini_dev)
    list_for_log_loss.append(working_log_loss)
    #print("Multi-class Log Loss with KNN and k,w,p =", k,",",w,",", p, "is:", working_log_loss)

k_value_tuning = [i for i in range(1,5002,500)]
weight_tuning = ['uniform', 'distance']
power_parameter_tuning = [1,2]

start = time.clock()
for this_k in k_value_tuning:
    for this_w in weight_tuning:
        for this_p in power_parameter_tuning:
            k_neighbors_tuned(this_k, this_w, this_p)
            
index_best_logloss = np.argmin(list_for_log_loss)
print('For KNN the best log loss with hyperparameter tuning is',list_for_log_loss[index_best_logloss], 'with k =', list_for_ks[index_best_logloss], 'w =', list_for_ws[index_best_logloss], 'p =', list_for_ps[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')


For KNN the best log loss with hyperparameter tuning is 2.62923629844 with k = 2001 w = uniform p = 1
Computation time for this step is 351.40 seconds

Model calibration:

Here we will calibrate the KNN classifier with both Platt Scaling and with Isotonic Regression using CalibratedClassifierCV with various parameter settings. The "method" parameter can be set to "sigmoid" or to "isotonic", corresponding to Platt Scaling and to Isotonic Regression respectively.


In [11]:
list_for_ks = []
list_for_ws = []
list_for_ps = []
list_for_ms = []
list_for_log_loss = []

def knn_calibrated(k,w,p,m):
    tuned_KNN = KNeighborsClassifier(n_neighbors=k, weights=w, p=p).fit(mini_train_data, mini_train_labels)
    dev_prediction_probabilities = tuned_KNN.predict_proba(mini_dev_data)
    ccv = CalibratedClassifierCV(tuned_KNN, method = m, cv = 'prefit')
    ccv.fit(mini_calibrate_data, mini_calibrate_labels)
    ccv_prediction_probabilities = ccv.predict_proba(mini_dev_data)
    list_for_ks.append(this_k)
    list_for_ws.append(this_w)
    list_for_ps.append(this_p)
    list_for_ms.append(this_m)
    working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = ccv_prediction_probabilities, labels = crime_labels_mini_dev)
    list_for_log_loss.append(working_log_loss)
    print("Multi-class Log Loss with KNN and k,w,p =", k,",",w,",",p,",",m,"is:", working_log_loss)

k_value_tuning = ([i for i in range(1,21,1)] + [j for j in range(25,51,5)] + [k for k in range(55,22000,1000)])
weight_tuning = ['uniform', 'distance']
power_parameter_tuning = [1,2]
methods = ['sigmoid', 'isotonic']

start = time.clock()
for this_k in k_value_tuning:
    for this_w in weight_tuning:
        for this_p in power_parameter_tuning:
            for this_m in methods:
                knn_calibrated(this_k, this_w, this_p, this_m)
            
index_best_logloss = np.argmin(list_for_log_loss)
print('For KNN the best log loss with hyperparameter tuning and calibration is',list_for_log_loss[index_best_logloss], 'with k =', list_for_ks[index_best_logloss], 'w =', list_for_ws[index_best_logloss], 'p =', list_for_ps[index_best_logloss], 'm =', list_for_ms[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')


Multi-class Log Loss with KNN and k,w,p = 2 , uniform , 1 , sigmoid is: 2.70144987755
Multi-class Log Loss with KNN and k,w,p = 2 , uniform , 1 , isotonic is: 2.70118580885
Multi-class Log Loss with KNN and k,w,p = 2 , uniform , 2 , sigmoid is: 2.70079087512
Multi-class Log Loss with KNN and k,w,p = 2 , uniform , 2 , isotonic is: 2.70309343586
Multi-class Log Loss with KNN and k,w,p = 2 , distance , 1 , sigmoid is: 2.725631755
Multi-class Log Loss with KNN and k,w,p = 2 , distance , 1 , isotonic is: 3.54686417213
Multi-class Log Loss with KNN and k,w,p = 2 , distance , 2 , sigmoid is: 2.7248315732
Multi-class Log Loss with KNN and k,w,p = 2 , distance , 2 , isotonic is: 3.58297553553
Multi-class Log Loss with KNN and k,w,p = 3 , uniform , 1 , sigmoid is: 2.69314603155
Multi-class Log Loss with KNN and k,w,p = 3 , uniform , 1 , isotonic is: 2.69304318298
Multi-class Log Loss with KNN and k,w,p = 3 , uniform , 2 , sigmoid is: 2.68965954407
Multi-class Log Loss with KNN and k,w,p = 3 , uniform , 2 , isotonic is: 2.69144921228
Multi-class Log Loss with KNN and k,w,p = 3 , distance , 1 , sigmoid is: 2.72329504137
Multi-class Log Loss with KNN and k,w,p = 3 , distance , 1 , isotonic is: 3.61148466836
Multi-class Log Loss with KNN and k,w,p = 3 , distance , 2 , sigmoid is: 2.7190933662
Multi-class Log Loss with KNN and k,w,p = 3 , distance , 2 , isotonic is: 3.63170844548
Multi-class Log Loss with KNN and k,w,p = 4 , uniform , 1 , sigmoid is: 2.68775935384
Multi-class Log Loss with KNN and k,w,p = 4 , uniform , 1 , isotonic is: 2.68847986277
Multi-class Log Loss with KNN and k,w,p = 4 , uniform , 2 , sigmoid is: 2.68764051296
Multi-class Log Loss with KNN and k,w,p = 4 , uniform , 2 , isotonic is: 2.68922847263
Multi-class Log Loss with KNN and k,w,p = 4 , distance , 1 , sigmoid is: 2.71828761494
Multi-class Log Loss with KNN and k,w,p = 4 , distance , 1 , isotonic is: 3.59984232382
Multi-class Log Loss with KNN and k,w,p = 4 , distance , 2 , sigmoid is: 2.71789190227
Multi-class Log Loss with KNN and k,w,p = 4 , distance , 2 , isotonic is: 3.65851135959
Multi-class Log Loss with KNN and k,w,p = 5 , uniform , 1 , sigmoid is: 2.68412338935
Multi-class Log Loss with KNN and k,w,p = 5 , uniform , 1 , isotonic is: 2.68153191204
Multi-class Log Loss with KNN and k,w,p = 5 , uniform , 2 , sigmoid is: 2.68440609082
Multi-class Log Loss with KNN and k,w,p = 5 , uniform , 2 , isotonic is: 2.68651054445
Multi-class Log Loss with KNN and k,w,p = 5 , distance , 1 , sigmoid is: 2.71261820403
Multi-class Log Loss with KNN and k,w,p = 5 , distance , 1 , isotonic is: 3.52999779954
Multi-class Log Loss with KNN and k,w,p = 5 , distance , 2 , sigmoid is: 2.71291395097
Multi-class Log Loss with KNN and k,w,p = 5 , distance , 2 , isotonic is: 3.524233494
For KNN the best log loss with hyperparameter tuning and calibration is 2.68153191204 with k = 5 w = uniform p = 1 m = isotonic
Computation time for this step is 231.29 seconds

Multinomial, Bernoulli, and Gaussian Naive Bayes

Hyperparameter tuning: Bernoulli Naive Bayes

For the Bernoulli Naive Bayes classifier, we seek to optimize the alpha parameter (Laplace smoothing parameter) and the binarize parameter (the threshold for binarizing the sample features). Because our features are not binary/boolean, we try a range of arbitrary binarization thresholds.


In [44]:
list_for_as = []
list_for_bs = []
list_for_log_loss = []

def BNB_tuned(a,b):
    bnb_tuned = BernoulliNB(alpha = a, binarize = b).fit(mini_train_data, mini_train_labels)
    dev_prediction_probabilities = bnb_tuned.predict_proba(mini_dev_data)
    list_for_as.append(this_a)
    list_for_bs.append(this_b)
    working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = dev_prediction_probabilities, labels = crime_labels_mini_dev)
    list_for_log_loss.append(working_log_loss)
    #print("Multi-class Log Loss with BNB and a,b =", a,",",b,"is:", working_log_loss)

alpha_tuning = [0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 1.0, 1.1, 1.2, 1.4, 1.6, 1.8, 2.0, 10.0]
binarize_thresholds_tuning = [1e-20, 1e-19, 1e-18, 1e-17, 1e-16, 1e-15, 1e-14, 1e-13, 1e-12, 1e-11, 1e-10, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 0.999, 0.9999]

start = time.clock()
for this_a in alpha_tuning:
    for this_b in binarize_thresholds_tuning:
            BNB_tuned(this_a, this_b)
            
index_best_logloss = np.argmin(list_for_log_loss)
print('For BNB the best log loss with hyperparameter tuning is',list_for_log_loss[index_best_logloss], 'with alpha =', list_for_as[index_best_logloss], 'binarization threshold =', list_for_bs[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')


For BNB the best log loss with hyperparameter tuning is 2.6247750866 with alpha = 1.2 binarization threshold = 1e-20
Computation time for this step is 186.46 seconds

Model calibration: BernoulliNB

Here we will calibrate the BNB classifier with both Platt Scaling and with Isotonic Regression using CalibratedClassifierCV with various parameter settings. The "method" parameter can be set to "sigmoid" or to "isotonic", corresponding to Platt Scaling and to Isotonic Regression respectively.


In [8]:
list_for_as = []
list_for_bs = []
list_for_ms = []
list_for_log_loss = []

def BNB_calibrated(a,b,m):
    bnb_tuned = BernoulliNB(alpha = a, binarize = b).fit(mini_train_data, mini_train_labels)
    dev_prediction_probabilities = bnb_tuned.predict_proba(mini_dev_data)
    ccv = CalibratedClassifierCV(bnb_tuned, method = m, cv = 'prefit')
    ccv.fit(mini_calibrate_data, mini_calibrate_labels)
    ccv_prediction_probabilities = ccv.predict_proba(mini_dev_data)
    list_for_as.append(this_a)
    list_for_bs.append(this_b)
    list_for_ms.append(this_m)
    working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = ccv_prediction_probabilities, labels = crime_labels_mini_dev)
    list_for_log_loss.append(working_log_loss)
    #print("Multi-class Log Loss with BNB and a,b,m =", a,",", b,",", m, "is:", working_log_loss)

alpha_tuning = [0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 1.0, 1.1, 1.2, 1.4, 1.6, 1.8, 2.0, 10.0]
binarize_thresholds_tuning = [1e-20, 1e-19, 1e-18, 1e-17, 1e-16, 1e-15, 1e-14, 1e-13, 1e-12, 1e-11, 1e-10, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 0.999, 0.9999]
methods = ['sigmoid', 'isotonic']

start = time.clock()
for this_a in alpha_tuning:
    for this_b in binarize_thresholds_tuning:
            for this_m in methods:
                BNB_calibrated(this_a, this_b, this_m)
            
index_best_logloss = np.argmin(list_for_log_loss)
print('For BNB the best log loss with hyperparameter tuning and calibration is',list_for_log_loss[index_best_logloss], 'with alpha =', list_for_as[index_best_logloss], 'binarization threshold =', list_for_bs[index_best_logloss], 'method = ', list_for_ms[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')


For BNB the best log loss with hyperparameter tuning and calibration is 2.61370308039 with alpha = 1.0 binarization threshold = 0.5 method =  sigmoid
Computation time for this step is 1066.40 seconds

Hyperparameter tuning: Multinomial Naive Bayes

For the Multinomial Naive Bayes classifier, we seek to optimize the alpha parameter (Laplace smoothing parameter).


In [22]:
list_for_as = []
list_for_log_loss = []

def MNB_tuned(a):
    mnb_tuned = MultinomialNB(alpha = a).fit(mini_train_data, mini_train_labels)
    dev_prediction_probabilities =mnb_tuned.predict_proba(mini_dev_data)
    list_for_as.append(this_a)
    working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = dev_prediction_probabilities, labels = crime_labels_mini_dev)
    list_for_log_loss.append(working_log_loss)
    #print("Multi-class Log Loss with BNB and a =", a, "is:", working_log_loss)

alpha_tuning = [0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 1.0, 1.1, 1.2, 1.4, 1.6, 1.8, 2.0, 10.0]

start = time.clock()
for this_a in alpha_tuning:
            MNB_tuned(this_a)
            
index_best_logloss = np.argmin(list_for_log_loss)
print('For MNB the best log loss with hyperparameter tuning is',list_for_log_loss[index_best_logloss], 'with alpha =', list_for_as[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')


For MNB the best log loss with hyperparameter tuning is 2.60930490132 with alpha = 1.8
Computation time for this step is 5.96 seconds

Model calibration: MultinomialNB

Here we will calibrate the MNB classifier with both Platt Scaling and with Isotonic Regression using CalibratedClassifierCV with various parameter settings. The "method" parameter can be set to "sigmoid" or to "isotonic", corresponding to Platt Scaling and to Isotonic Regression respectively.


In [19]:
list_for_as = []
list_for_ms = []
list_for_log_loss = []

def MNB_calibrated(a,m):
    mnb_tuned = MultinomialNB(alpha = a).fit(mini_train_data, mini_train_labels)
    ccv = CalibratedClassifierCV(mnb_tuned, method = m, cv = 'prefit')
    ccv.fit(mini_calibrate_data, mini_calibrate_labels)
    ccv_prediction_probabilities = ccv.predict_proba(mini_dev_data)
    list_for_as.append(this_a)
    list_for_ms.append(this_m)
    working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = ccv_prediction_probabilities, labels = crime_labels_mini_dev)
    list_for_log_loss.append(working_log_loss)
    #print("Multi-class Log Loss with MNB and a =", a, "and m =", m, "is:", working_log_loss)

alpha_tuning = [0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 1.0, 1.1, 1.2, 1.4, 1.6, 1.8, 2.0, 10.0]
methods = ['sigmoid', 'isotonic']

start = time.clock()
for this_a in alpha_tuning:
    for this_m in methods:
        MNB_calibrated(this_a, this_m)
            
index_best_logloss = np.argmin(list_for_log_loss)
print('For MNB the best log loss with hyperparameter tuning and calibration is',list_for_log_loss[index_best_logloss], 'with alpha =', list_for_as[index_best_logloss], 'and method =', list_for_ms[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')


For MNB the best log loss with hyperparameter tuning and calibration is 2.6145055659 with alpha = 2.0 and method = sigmoid
Computation time for this step is 34.13 seconds

GridSearch for best parameters for Multinomial Naive Bayes (run separately from this notebook on PySpark)


In [ ]:
mnb_param_grid = {'alpha': [0.2, 0.4, 0.6, 0.8]}
MNB = GridSearchCV(MultinomialNB(), param_grid=mnb_param_grid, scoring = 'neg_log_loss')
MNB.fit(train_data, train_labels)
print("the best alpha value is:", str(MNB.best_params_['alpha']))

In [ ]:
mnb_param_grid = {'alpha': [0.3, 0.35, 0.4, 0.45, 0.5]}
MNB = GridSearchCV(MultinomialNB(), param_grid=mnb_param_grid, scoring = 'neg_log_loss')
MNB.fit(train_data, train_labels)
print("the best alpha value is:", str(MNB.best_params_['alpha']))

In [ ]:
mnb_param_grid = {'alpha': [0.31, 0.33, 0.35, 0.37, 0.39]}
MNB = GridSearchCV(MultinomialNB(), param_grid=mnb_param_grid, scoring = 'neg_log_loss')
MNB.fit(train_data, train_labels)
print("the best alpha value is:", str(MNB.best_params_['alpha']))

In [ ]:
# Generate the log loss from what we have tuned thus far
MNBPredictionProbabilities = MNB.predict_proba(dev_data)
print("Multi-class Log Loss:", log_loss(y_true = dev_labels, y_pred = MNBPredictionProbabilities, labels = crime_labels), "\n\n")

In [ ]:
mnb_param_grid = {'alpha': [0.340, 0.345, 0.35, 0.355, 0.360]}
MNB = GridSearchCV(MultinomialNB(), param_grid=mnb_param_grid, scoring = 'neg_log_loss')
MNB.fit(train_data, train_labels)
print("the best alpha value is:", str(MNB.best_params_['alpha']))

In [ ]:
# Generate the log loss from what we have tuned thus far
MNBPredictionProbabilities = MNB.predict_proba(dev_data)
print("Multi-class Log Loss:", log_loss(y_true = dev_labels, y_pred = MNBPredictionProbabilities, labels = crime_labels), "\n\n")

Tuning: Gaussian Naive Bayes

For the Gaussian Naive Bayes classifier there are no inherent parameters within the classifier function to optimize, but we will look at our log loss before and after adding noise to the data, which is hypothesized to give it a more normal (Gaussian) distribution, as assumed by the GNB classifier.


In [17]:
def GNB_pre_tune():
    gnb_pre_tuned = GaussianNB().fit(mini_train_data, mini_train_labels)
    dev_prediction_probabilities =gnb_pre_tuned.predict_proba(mini_dev_data)
    working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = dev_prediction_probabilities, labels = crime_labels_mini_dev)
    print("Multi-class Log Loss with pre-tuned GNB is:", working_log_loss)

GNB_pre_tune()
    
def GNB_post_tune():
    # Gaussian Naive Bayes requires the data to have a relative normal distribution. Sometimes
    # adding noise can improve performance by making the data more normal:
    mini_train_data_noise = np.random.rand(mini_train_data.shape[0],mini_train_data.shape[1])
    modified_mini_train_data = np.multiply(mini_train_data,mini_train_data_noise)    
    gnb_with_noise = GaussianNB().fit(modified_mini_train_data,mini_train_labels)
    dev_prediction_probabilities =gnb_with_noise.predict_proba(mini_dev_data)
    working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = dev_prediction_probabilities, labels = crime_labels_mini_dev)
    print("Multi-class Log Loss with tuned GNB via addition of noise to normalize the data's distribution is:", working_log_loss)
    
GNB_post_tune()


Multi-class Log Loss with pre-tuned GNB is: 34.1076504549
Multi-class Log Loss with tuned GNB via addition of noise to normalize the data's distribution is: 31.8040829494

Model calibration: GaussianNB

Here we will calibrate the GNB classifier with both Platt Scaling and with Isotonic Regression using CalibratedClassifierCV with various parameter settings. The "method" parameter can be set to "sigmoid" or to "isotonic", corresponding to Platt Scaling and to Isotonic Regression respectively.


In [21]:
list_for_ms = []
list_for_log_loss = []

def GNB_calibrated(m):
    # Gaussian Naive Bayes requires the data to have a relative normal distribution. Sometimes
    # adding noise can improve performance by making the data more normal:
    mini_train_data_noise = np.random.rand(mini_train_data.shape[0],mini_train_data.shape[1])
    modified_mini_train_data = np.multiply(mini_train_data,mini_train_data_noise)    
    gnb_with_noise = GaussianNB().fit(modified_mini_train_data,mini_train_labels)
    ccv = CalibratedClassifierCV(gnb_with_noise, method = m, cv = 'prefit')
    ccv.fit(mini_calibrate_data, mini_calibrate_labels)
    ccv_prediction_probabilities = ccv.predict_proba(mini_dev_data)
    list_for_ms.append(this_m)
    working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = ccv_prediction_probabilities, labels = crime_labels_mini_dev)
    list_for_log_loss.append(working_log_loss)
    #print("Multi-class Log Loss with tuned GNB via addition of noise to normalize the data's distribution and after calibration is:", working_log_loss, 'with calibration method =', m)
    
methods = ['sigmoid', 'isotonic']

start = time.clock()
for this_m in methods:
    GNB_calibrated(this_m)
            
index_best_logloss = np.argmin(list_for_log_loss)
print('For GNB the best log loss with tuning and calibration is',list_for_log_loss[index_best_logloss], 'with method =', list_for_ms[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')


For GNB the best log loss with tuning and calibration is 2.67904020299 with method = sigmoid
Computation time for this step is 1.36 seconds

Logistic Regression

Hyperparameter tuning:

For the Logistic Regression classifier, we can seek to optimize the following classifier parameters: penalty (l1 or l2), C (inverse of regularization strength), and solver ('newton-cg', 'lbfgs', 'liblinear', or 'sag').


In [ ]:
cValsL1 = [7.5, 10.0, 12.5, 20.0]
methods = ['sigmoid', 'isotonic']
cv = 2
tol = 0.01
for c in cValsL1:
    for m in methods:
        ccvL1 = CalibratedClassifierCV(LogisticRegression(penalty='l1', C=c, tol=tol), method=m, cv=cv)
        ccvL1.fit(mini_train_data, mini_train_labels)
        print(ccvL1.get_params())
        ccvL1_prediction_probabilities = ccvL1.predict_proba(mini_dev_data)
        ccvL1_predictions = ccvL1.predict(mini_dev_data)
        print("L1 Multi-class Log Loss:", log_loss(y_true = mini_dev_labels, y_pred = ccvL1_prediction_probabilities, labels = crime_labels_mini_dev), "\n\n")
        print()

Model calibration


In [ ]:
cValsL1 = [15.0, 20.0, 25.0, 50.0]
method = 'sigmoid'
cv = 2
tol = 0.01
for c in cValsL1:
    ccvL1 = CalibratedClassifierCV(LogisticRegression(penalty='l1', C=c, tol=tol), method=method, cv=cv)
    ccvL1.fit(mini_train_data, mini_train_labels)
    print(ccvL1.get_params())
    ccvL1_prediction_probabilities = ccvL1.predict_proba(mini_dev_data)
    ccvL1_predictions = ccvL1.predict(mini_dev_data)
    print("L1 Multi-class Log Loss:", log_loss(y_true = mini_dev_labels, y_pred = ccvL1_prediction_probabilities, labels = crime_labels_mini_dev), "\n\n")
    print()

GridSearch for best parameters for Logistic Regression (run separate from this notebook on PySpark)


In [ ]:
#L1
import matplotlib.pyplot as plt  # used for the coefficient plots below (may already be imported earlier in the notebook)
#lr_param_grid_1 = {'C': [0, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 5.0, 10.0]}
lr_param_grid_1 = {'C': [7.5, 10.0, 12.5, 15.0, 20.0, 30.0, 50.0, 100.0]}
LR_l1 = GridSearchCV(LogisticRegression(penalty='l1'), param_grid=lr_param_grid_1, scoring='neg_log_loss')
LR_l1.fit(train_data, train_labels)

print('L1: best C value:', str(LR_l1.best_params_['C']))

LR_l1_prediction_probabilities = LR_l1.predict_proba(dev_data)
LR_l1_predictions = LR_l1.predict(dev_data)
print("L1 Multi-class Log Loss:", log_loss(y_true = dev_labels, y_pred = LR_l1_prediction_probabilities, labels = crime_labels), "\n\n")

#create an LR-L1 classifier with the best params
bestL1 = LogisticRegression(penalty='l1', C=LR_l1.best_params_['C'])
bestL1.fit(train_data, train_labels)
#L1weights = bestL1.coef_

columns = ['hour_of_day','dayofweek',\
          'x','y','bayview','ingleside','northern',\
          'central','mission','southern','tenderloin',\
          'park','richmond','taraval','HOURLYDRYBULBTEMPF',\
          'HOURLYRelativeHumidity','HOURLYWindSpeed',\
          'HOURLYSeaLevelPressure','HOURLYVISIBILITY',\
          'Daylight']

allCoefsL1 = pd.DataFrame(index=columns)
for a in range(len(bestL1.coef_)):
    allCoefsL1[crime_labels[a]] = bestL1.coef_[a]

allCoefsL1

f = plt.figure(figsize=(15,8))
allCoefsL1.plot(kind='bar', figsize=(15,8))
plt.legend(loc='center left', bbox_to_anchor=(1.0,0.5))
plt.show()


#L2
lr_param_grid_2 = {'C': [0.0001, 0.01, 1.0, 10.0, 50.0, 100.0, 150.0, 250.0, 500.0], \
                 'solver':['liblinear','newton-cg','lbfgs', 'sag']}
LR_l2 = GridSearchCV(LogisticRegression(penalty='l2'), param_grid=lr_param_grid_2, scoring='neg_log_loss')
LR_l2.fit(train_data, train_labels)

print('L2: best C value:', str(LR_l2.best_params_['C']))
print('L2: best solver:', str(LR_l2.best_params_['solver']))

LR_l2_prediction_probabilities = LR_l2.predict_proba(dev_data)
LR_l2_predictions = LR_l2.predict(dev_data)
print("L2 Multi-class Log Loss:", log_loss(y_true = dev_labels, y_pred = LR_l2_prediction_probabilities, labels = crime_labels), "\n\n")

#create an LR-L2 classifier with the best params
bestL2 = LogisticRegression(penalty='l2', solver=LR_l2.best_params_['solver'], C=LR_l2.best_params_['C'])
bestL2.fit(train_data, train_labels)
#L2weights = bestL2.coef_

columns = ['hour_of_day','dayofweek',\
          'x','y','bayview','ingleside','northern',\
          'central','mission','southern','tenderloin',\
          'park','richmond','taraval','HOURLYDRYBULBTEMPF',\
          'HOURLYRelativeHumidity','HOURLYWindSpeed',\
          'HOURLYSeaLevelPressure','HOURLYVISIBILITY',\
          'Daylight']

allCoefsL2 = pd.DataFrame(index=columns)
for a in range(len(bestL2.coef_)):
    allCoefsL2[crime_labels[a]] = bestL2.coef_[a]

allCoefsL2

f = plt.figure(figsize=(15,8))
allCoefsL2.plot(kind='bar', figsize=(15,8))
plt.legend(loc='center left', bbox_to_anchor=(1.0,0.5))
plt.show()

Decision Tree

Hyperparameter tuning:

For the Decision Tree classifier, we can seek to optimize the following classifier parameters:

  • criterion: The function to measure the quality of a split; can be either Gini impurity "gini" or information gain "entropy"
  • splitter: The strategy used to choose the split at each node; can be either "best" to choose the best split or "random" to choose the best random split
  • min_samples_leaf: The minimum number of samples required to be at a leaf node
  • max_depth: The maximum depth of the tree. If the default "None" is used, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples
  • class_weight: The weights associated with classes; can be "None" giving all classes weight of one, or can be "balanced", which uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data
  • max_features: The number of features to consider when looking for the best split; can be "int", "float" (percent), "auto", "sqrt", or "None"

Other adjustable parameters include:

  • min_samples_split: The minimum number of samples required to split an internal node; can be an integer, or a float interpreted as a fraction (ceil(fraction * n_samples) is then the minimum number of samples for each split)
  • min_weight_fraction_leaf: The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node; default = 0
  • max_leaf_nodes: Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined by their relative reduction in impurity. If "None", an unlimited number of leaf nodes is used.
  • min_impurity_decrease: A node will be split if this split induces a decrease of the impurity greater than or equal to the min_impurity_decrease value. Default is zero.

Setting min_samples_leaf to approximately 1% of the data points can stop the tree from fitting outliers inappropriately, which can help to improve accuracy (though it is unclear whether it significantly improves the multi-class log loss) [].
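
As a point of reference, a minimal sketch of that heuristic is shown below (assuming mini_train_data and mini_train_labels as defined earlier in this notebook; the 1% value is illustrative only):

import math

# Hypothetical illustration: set min_samples_leaf to roughly 1% of the training rows
one_percent_leaf = max(1, math.ceil(0.01 * len(mini_train_data)))
dt_one_percent = DecisionTreeClassifier(criterion='entropy', min_samples_leaf=one_percent_leaf)
dt_one_percent.fit(mini_train_data, mini_train_labels)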


In [46]:
list_for_cs = []
list_for_ss = []
list_for_mds = []
list_for_mss = []
list_for_cws = []
list_for_fs = []
list_for_log_loss = []

def DT_tuned(c,s,md,ms,cw,f):
    tuned_DT = DecisionTreeClassifier(criterion=c, splitter=s, max_depth=md, min_samples_leaf=ms, max_features=f, class_weight=cw).fit(mini_train_data, mini_train_labels)
    dev_prediction_probabilities = tuned_DT.predict_proba(mini_dev_data)
    list_for_cs.append(c)
    list_for_ss.append(s)
    list_for_mds.append(md)
    list_for_mss.append(ms)
    list_for_cws.append(cw)
    list_for_fs.append(f)
    working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = dev_prediction_probabilities, labels = crime_labels_mini_dev)
    list_for_log_loss.append(working_log_loss)
    #print("Multi-class Log Loss with DT and c,s,md,ms,cw,f =", c,",",s,",", md,",",ms,",",cw,",",f,"is:", working_log_loss)

criterion_tuning = ['gini', 'entropy']
splitter_tuning = ['best', 'random']
max_depth_tuning = ([None,6,7,8,9,10,11,12,13,14,15,16,17,18,19])
min_samples_leaf_tuning = [x + 1 for x in [i for i in range(0,int(0.091*len(mini_train_data)),100)]]
class_weight_tuning = [None, 'balanced']
max_features_tuning = ['auto', 2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]

start = time.clock()
for this_c in criterion_tuning:
    for this_s in splitter_tuning:
        for this_md in max_depth_tuning:
            for this_ms in min_samples_leaf_tuning:
                for this_cw in class_weight_tuning:
                    for this_f in max_features_tuning:
                        DT_tuned(this_c, this_s, this_md, this_ms, this_cw, this_f)
            
index_best_logloss = np.argmin(list_for_log_loss)
print('For DT the best log loss with hyperparameter tuning is',list_for_log_loss[index_best_logloss], 'with criterion =', list_for_cs[index_best_logloss], 'splitter =', list_for_ss[index_best_logloss], 'max_depth =', list_for_mds[index_best_logloss], 'min_samples_leaf =', list_for_mss[index_best_logloss], 'class_weight =', list_for_cws[index_best_logloss], 'max_features =', list_for_fs[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')


For DT the best log loss with hyperparameter tuning is 2.62645819033 with criterion = entropy splitter = best max_depth = 15 min_samples_leaf = 1201 class_weight = None max_features = 9
Computation time for this step is 1424.48 seconds
Model calibration:

In [48]:
list_for_cs = []
list_for_ss = []
list_for_mds = []
list_for_mss = []
list_for_cws = []
list_for_fs = []
list_for_cms = []
list_for_log_loss = []

def DT_calibrated(c,s,md,ms,cw,f,cm):
    tuned_DT = DecisionTreeClassifier(criterion=c, splitter=s, max_depth=md, min_samples_leaf=ms, max_features=f, class_weight=cw).fit(mini_train_data, mini_train_labels)
    ccv = CalibratedClassifierCV(tuned_DT, method = cm, cv = 'prefit')
    ccv.fit(mini_calibrate_data, mini_calibrate_labels)
    ccv_prediction_probabilities = ccv.predict_proba(mini_dev_data)
    list_for_cs.append(c)
    list_for_ss.append(s)
    list_for_mds.append(md)
    list_for_mss.append(ms)
    list_for_cws.append(cw)
    list_for_fs.append(f)
    list_for_cms.append(cm)
    working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = ccv_prediction_probabilities, labels = crime_labels_mini_dev)
    list_for_log_loss.append(working_log_loss)
    print("Multi-class Log Loss with DT and c,s,md,ms,cw,f =", c,",",s,",", md,",",ms,",",cw,",",f,",",cm,"is:", working_log_loss)

criterion_tuning = ['gini', 'entropy']
splitter_tuning = ['best', 'random']
max_depth_tuning = ([None,6,7,8,9,10,11,12,13,14,15,16,17,18,19])
min_samples_leaf_tuning = [x + 1 for x in [i for i in range(0,int(0.091*len(mini_train_data)),100)]]
class_weight_tuning = [None, 'balanced']
max_features_tuning = ['auto', 2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]
methods = ['sigmoid', 'isotonic']

start = time.clock()
for this_c in criterion_tuning:
    for this_s in splitter_tuning:
        for this_md in max_depth_tuning:
            for this_ms in min_samples_leaf_tuning:
                for this_cw in class_weight_tuning:
                    for this_f in max_features_tuning:
                        for this_cm in methods:
                            DT_calibrated(this_c, this_s, this_md, this_ms, this_cw, this_f, this_cm)
            
index_best_logloss = np.argmin(list_for_log_loss)
print('For DT the best log loss with hyperparameter tuning and calibration is',list_for_log_loss[index_best_logloss], 'with criterion =', list_for_cs[index_best_logloss], 'splitter =', list_for_ss[index_best_logloss], 'max_depth =', list_for_mds[index_best_logloss], 'min_samples_leaf =', list_for_mss[index_best_logloss], 'class_weight =', list_for_cws[index_best_logloss], 'max_features =', list_for_fs[index_best_logloss], 'and calibration method =', list_for_cms[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')


Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , auto , sigmoid is: 2.70888798421
Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , auto , isotonic is: 2.85635450179
Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , 2 , sigmoid is: 2.70872615427
Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , 2 , isotonic is: 2.80355451969
Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , 3 , sigmoid is: 2.69815374851
Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , 3 , isotonic is: 2.82892176447
Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , 4 , sigmoid is: 2.70032530894
Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , 4 , isotonic is: 2.79782811709
Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , 5 , sigmoid is: 2.70315008561
Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , 5 , isotonic is: 2.8022793479
Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , 6 , sigmoid is: 2.69781719577
Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , 6 , isotonic is: 2.77490607908
Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , 7 , sigmoid is: 2.69993921872
Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , 7 , isotonic is: 2.76253067697
[KeyboardInterrupt: the full calibration grid was interrupted manually partway through, inside CalibratedClassifierCV.fit during sigmoid calibration; the traceback is truncated here.]

Support Vector Machines

Hyperparameter tuning:

For the SVM classifier, we can seek to optimize the following classifier parameters: C (penalty parameter C of the error term), kernel ('linear', 'poly', 'rbf', 'sigmoid', or 'precomputed')

See source [2] for parameter optimization in SVM.


In [ ]:
from sklearn import svm
from sklearn.metrics import log_loss

# Note: set() ordering is arbitrary, but it is consistent within a single session
class_labels = list(set(train_labels))

def enumerateY(data):
    """Map each string crime label to its integer index in class_labels."""
    enumerated_data = np.zeros(data.size)
    for j in range(0, data.size):
        label = data[j]
        i = class_labels.index(label)
        enumerated_data[j] += i
    return np.array(enumerated_data)


train_labels_enum = enumerateY(mini_train_labels)
dev_labels_enum = enumerateY(mini_dev_labels)

In [ ]:
# GridSearchCV to find optimal parameters

parameters = {'C': (0.001, 0.01, 0.1, 0.5, 1, 5, 10, 25, 100), 
             'decision_function_shape': ('ovo', 'ovr'),
             'kernel': ('linear', 'sigmoid', 'rbf', 'poly')}

clf = GridSearchCV(svm.SVC(tol = 0.01, probability = True), parameters, scoring = "log_loss")
clf.fit(mini_train_data, mini_train_labels).predict_proba(mini_dev_data)
print("optimal parameters:", clf.best_params_)

In [ ]:
# The best parameters returned a score of 2.62651014795
clf_svc = svm.SVC(kernel='rbf', C=100, decision_function_shape='ovr', probability = True).fit(mini_train_data, train_labels_enum)
predicted = clf_svc.predict_proba(mini_dev_data)


print(log_loss(y_true = dev_labels_enum, y_pred = predicted, labels = dev_labels_enum))

Neural Nets

Hyperparameter tuning:

The benefit of neural nets is that they can learn a non-linear function for classification. Relative to logistic regression, there are one or more "hidden" layers between the input and output layers.

The disadvantages, however, are that 1) the loss function is non-convex, so there can be more than one local minimum; depending on where the weights initialize, you can end up with very different results; and 2) the number of hyperparameters to tune is high, which requires a lot of computing power.

MLP trains using backpropagation: it trains with gradient descent and, for classification, minimizes the cross-entropy loss function, giving a vector of probability estimates.

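For reference, the multi-class cross-entropy that MLPClassifier minimizes is the same quantity as the Kaggle multi-class log loss reported throughout this notebook:

$$\mathrm{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\,\log(p_{ik})$$

where $y_{ik}$ is 1 if observation $i$ belongs to class $k$ (and 0 otherwise), and $p_{ik}$ is the predicted probability that observation $i$ belongs to class $k$.
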
For the neural network MLP classifier, the following classifier parameters are subject to optimization: 1) hidden_layer_sizes, which covers both the number of hidden layers and the number of hidden nodes; 2) activation function: 'identity', 'logistic', 'tanh', or 'relu'; 3) alpha, the L2 regularization term used to avoid overfitting; 4) learning_rate: 'constant', 'invscaling', or 'adaptive'; 5) solver: 'lbfgs', 'sgd', or 'adam'; and 6) batch_size.

Given the limitations on compute power, we took a targeted approach to training our MLP model. We ran GridSearchCV several times over subsets of parameters, which is obviously not ideal given the number of parameters. We also attempted to test the potential impact of the scikit-neuralnetwork (sknn) classifier, which can also incorporate dropout.


In [ ]:
# One of the iterations of gridSearchCV we ran

import itertools

# Build all candidate hidden-layer architectures with 2, 3, and 4 hidden layers
hidden_layer_sizes = []
hidden_layers = [2, 3, 4]
for i in hidden_layers:
    hidden_layer_sizes += [x for x in itertools.product((4, 8, 12, 16, 20), repeat=i)]

parameters = {'activation' : ('logistic', 'tanh', 'relu'), 
              'batch_size': (50, 100, 200), 
              'alpha': (0.001, 0.01, 0.1, 0.25), 
             'hidden_layer_sizes': hidden_layer_sizes}

clf = GridSearchCV(MLPClassifier(tol = 0.01), parameters, scoring = "log_loss")
clf.fit(mini_train_data, mini_train_labels).predict_proba(mini_dev_data)
print("optimal parameters: ",clf.best_params_)

In [ ]:
# Our best parameter return
clf = MLPClassifier(tol=0.01, activation = 'tanh', hidden_layer_sizes = (20,), batch_size = 10, alpha = 0.01).fit(mini_train_data, mini_train_labels)
predicted = clf.predict(mini_dev_data)
predicted_prob = clf.predict_proba(mini_dev_data)
score = log_loss(mini_dev_labels, predicted_prob)
print(score)

We take the best-performing parameters from the tuning procedures above and add the dropout method to illustrate its potential impact. Dropout would not have materially improved the score had we used it.


In [ ]:
from sknn.mlp import Classifier, Layer

nn = Classifier(
    layers=[
        Layer("Tanh", units = 20, dropout = .25),
        Layer("Softmax")], batch_size = 10, learning_rate = .001)

In [ ]:
clf = nn.fit(mini_train_data, mini_train_labels)
predicted = clf.predict(mini_dev_data)
predicted_prob = clf.predict_proba(mini_dev_data)
score = log_loss(mini_dev_labels, predicted_prob)
print(score)

In [ ]:
##We attempted to run a more comprehensive hyperparameter tuning procedure including number of iterations and ranges for dropout method. It was too computationally intensive. Even on a remote AWS instance on pyspark it kept stalling out. 
from sklearn.grid_search import GridSearchCV

gs = GridSearchCV(nn, param_grid={
    'learning_rate': [0.05, 0.01, 0.005, 0.001],
    'n_iter': [25, 50, 100, 200],
    'hidden0__units': [4, 8, 12, 16, 20],
    'hidden0__type': ["Rectifier", "Sigmoid", "Tanh"],
    'hidden0__dropout':[0.2, 0.3, 0.4],
    'hidden1__units': [4, 8, 12, 16, 20],
    'hidden1__type': ["Rectifier", "Sigmoid", "Tanh"],
    'hidden1__dropout':[0.2, 0.3, 0.4],
    'hidden2__units': [4, 8, 12, 16, 20],
    'hidden2__type': ["Rectifier", "Sigmoid", "Tanh"],
    'hidden2__dropout':[0.2, 0.3, 0.4]})
gs.fit(mini_train_data, mini_train_labels)
predicted = gs.predict(mini_dev_data)
predicted_prob = gs.predict_proba(mini_dev_data)
score = log_loss(mini_dev_labels, predicted_prob)

Model calibration:


In [ ]:
# Further calibration using CalibratedClassifierCV
list_for_log_loss = []

def NN_calibrated(m):
    tuned_NN = MLPClassifier(tol=0.01, activation = 'tanh', hidden_layer_sizes = (20,), batch_size = 10, alpha = 0.001).fit(mini_train_data, mini_train_labels)
    ccv = CalibratedClassifierCV(tuned_NN, method = m, cv = 'prefit')
    ccv.fit(mini_calibrate_data, mini_calibrate_labels)
    ccv_prediction_probabilities = ccv.predict_proba(mini_dev_data)
    working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = ccv_prediction_probabilities, labels = crime_labels_mini_dev)
    list_for_log_loss.append(working_log_loss)
    print("Multi-class Log Loss with NN is:", working_log_loss)

methods = ['sigmoid', 'isotonic']

for m in methods:
    NN_calibrated(m)
            
index_best_logloss = np.argmin(list_for_log_loss)
print('For NN the best log loss with hyperparameter tuning and calibration is',list_for_log_loss[index_best_logloss])

Random Forest

Hyperparameter tuning:

For the Random Forest classifier, we can seek to optimize the following classifier parameters: n_estimators (the number of trees in the forest), max_features, max_depth, min_samples_leaf, bootstrap (whether or not bootstrap samples are used when building trees), oob_score (whether or not out-of-bag samples are used to estimate the generalization accuracy)

These tuning runs are included only in our Final Notebook; please refer to it.

Model calibration:

Please see the Final Notebook. A minimal sketch of the tune-then-calibrate approach follows below.
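
The following is only a sketch of that approach (not the tuned run from the Final Notebook), assuming the mini_train / mini_calibrate / mini_dev splits and crime_labels_mini_dev defined earlier in this notebook; the grid values are placeholders:

# Hypothetical illustration: tune a small RF grid, then calibrate the best model
rf_param_grid = {'n_estimators': [50, 100],
                 'max_features': ['sqrt', 0.75],
                 'min_samples_leaf': [1, 10]}
rf_search = GridSearchCV(RandomForestClassifier(), param_grid=rf_param_grid, scoring='neg_log_loss')
rf_search.fit(mini_train_data, mini_train_labels)

rf_calibrated = CalibratedClassifierCV(rf_search.best_estimator_, method='isotonic', cv='prefit')
rf_calibrated.fit(mini_calibrate_data, mini_calibrate_labels)
rf_probs = rf_calibrated.predict_proba(mini_dev_data)
print("RF multi-class log loss (sketch):",
      log_loss(y_true=mini_dev_labels, y_pred=rf_probs, labels=crime_labels_mini_dev))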

Gradient Boosting Classifier

Hyperparameter tuning:

For the Gradient Boosting classifier, we can seek to optimize the following classifier parameters: n_estimators (the number of boosting stages), max_depth, min_samples_leaf, and max_features


In [ ]:
#learning_rate : float, optional (default=0.1)
#n_estimators : int (default=100)
#max_depth : integer, optional (default=3)
#criterion : string, optional (default=”friedman_mse”)
#min_samples_split : int, float, optional (default=2)
#min_samples_leaf : int, float, optional (default=1)
#min_weight_fraction_leaf : float, optional (default=0.)
#subsample : float, optional (default=1.0)
#max_features : int, float, string or None, optional (default=None)
#max_leaf_nodes : int or None, optional (default=None)

nList = [75, 125]
depthList = [1, 5]
leafList = [2, 7]
featuresList = [8, 17]

In [ ]:
#gb_param_grid = {'n_estimators':[25, 75, 250, 750], 'max_depth': [1, 5, 9], \
#                 'min_samples_leaf': [2, 7, 12], 'max_features': [3, 8, 17]}
#GB = GridSearchCV(GradientBoostingClassifier(), param_grid=gb_param_grid, scoring='neg_log_loss')
#GB.fit(train_data, train_labels)

for n_estimators in nList:
    for max_depth in depthList:
        for min_samples_leaf in leafList:
            for max_features in featuresList:
                gbTest = GradientBoostingClassifier(n_estimators=n_estimators, max_depth=max_depth, \
                                                    min_samples_leaf=min_samples_leaf, max_features=max_features)
                gbTest.fit(mini_train_data, mini_train_labels)
                gbTestPredictionProbabilities = gbTest.predict_proba(mini_dev_data)
                print("Parameters:")
                print("n_estimators:", str(n_estimators)+";", " max_depth:", str(max_depth)+";", \
                      " min_samples_leaf:", str(min_samples_leaf)+";", " max_features:", str(max_features))
                print("Multi-class Log Loss:", log_loss(y_true = mini_dev_labels, y_pred = gbTestPredictionProbabilities, \
                                                        labels = crime_labels_mini_dev), "\n\n")
                print()

GridSearch for best parameters for Gradient Boosting (run separate from this notebook on PySpark)


In [ ]:
gb_param_grid = {'n_estimators':[25, 75, 250, 750], 'max_depth': [1, 5, 9], \
                 'min_samples_leaf': [2, 7, 12], 'max_features': [3, 8, 17]}
GB = GridSearchCV(GradientBoostingClassifier(), param_grid=gb_param_grid, scoring='neg_log_loss')
GB.fit(train_data, train_labels)

In [ ]:
GB.best_params_
GBPredictionProbabilities = GB.predict_proba(dev_data)
print("Multi-class Log Loss:", log_loss(y_true = dev_labels, y_pred = GBPredictionProbabilities, labels = crime_labels), "\n\n")

Meta-estimators

AdaBoost Classifier

Hyperparameter tuning:

For the AdaBoostClassifier, one can seek to optimize the following meta-estimator parameters: base_estimator (the classifier on which the Adaptive Boosting meta-estimator is built), n_estimators (the maximum number of estimators at which boosting is terminated), and algorithm ('SAMME' or 'SAMME.R').
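
As an illustration of that interface only (we did not adopt AdaBoost in our final models, per the discussion below), a minimal sketch using a shallow decision tree base estimator and the mini splits defined earlier:

# Hypothetical illustration: AdaBoost over a shallow decision tree (not used in our final models)
ada_clf = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=5),
                             n_estimators=100,
                             algorithm='SAMME.R')
ada_clf.fit(mini_train_data, mini_train_labels)
ada_probs = ada_clf.predict_proba(mini_dev_data)
print("AdaBoost multi-class log loss (sketch):",
      log_loss(y_true=mini_dev_labels, y_pred=ada_probs, labels=crime_labels_mini_dev))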

Adaboosting each classifier:

We considered running the AdaBoostClassifier on each different classifier from above, using the classifier settings with optimized Multi-class Log Loss after hyperparameter tuning and calibration. However, with further research, we discovered that Adaptive Boosting in its current state can either increase or decrease performance compared to the initial, non-boosted model. Further detail on this research with respect to KNN follows. As such, we decided to use the Bagging meta-estimator.

Example: Adaboosting KNN:

We will not seek to apply Adaptive Boosting (AdaBoost) to our K-nearest Neighbors classifier. It has been shown that AdaBoost actually reduces accuracy and other performance metrics when used with nearest neighbors classifiers [4]. This is because KNN is a stable algorithm with low variance, which leads to correlated errors during each iteration of AdaBoost. There is, however, on-going research in ways to modify the AdaBoost algorithm for KNN classifiers to increase accuracy and other performance metrics. Due to the aforementioned, we will omit Adaptive Boosting of our hyperparameter tuned and calibrated KNN classifier.

Bagging Classifier

Hyperparameter tuning:

For the Bagging meta-classifier, we can seek to optimize the following parameters: n_estimators (the number of base estimators in the ensemble), max_samples, max_features, bootstrap (whether or not samples are drawn with replacement), bootstrap_features (whether features are drawn with replacement), and oob_score (whether or not out-of-bag samples are used to estimate the generalization accuracy)

Bagging each classifier:

We attempted to run the BaggingClassifier on our best model after performing the previous steps, which was the hyperparameter-tuned and calibrated Random Forest. However, we were not able to find a way to run this in a distributed fashion, and after more than 18 hours of runtime we stopped the attempt.


In [ ]:
RF_tuned = RandomForestClassifier(min_impurity_split=1, n_estimators=100, bootstrap= True, max_features=0.75, criterion='entropy', min_samples_leaf=10, max_depth=None).fit(train_data, train_labels)
#RF_calibrated_and_tuned_pre_fit = CalibratedClassifierCV(RF_tuned, method = 'isotonic', cv = 'prefit')
#RF_calibrated_and_tuned_fitted = RF_calibrated_and_tuned_pre_fit.fit(mini_calibrate_data,mini_calibrate_labels)

list_for_ns = []
list_for_mss = []
list_for_mfs = []
list_for_bs = []
list_for_bfs = []
list_for_os = []
list_for_log_loss = []

def Bagging_RF(n,ms,mf,b,bf,o):
    bagging_clf = BaggingClassifier(base_estimator = RF_tuned, n_estimators = n, max_samples = ms, max_features = mf, bootstrap = b, bootstrap_features = bf, oob_score = o)
    bagging_clf_fitted = bagging_clf.fit(train_data,train_labels)
    bagging_prediction_probabilities = bagging_clf_fitted.predict_proba(dev_data)
    list_for_ns.append(n)
    list_for_mss.append(ms)
    list_for_mfs.append(mf)
    list_for_bs.append(b)
    list_for_bfs.append(bf)
    list_for_os.append(o)
    working_log_loss = log_loss(y_true = dev_labels, y_pred = bagging_prediction_probabilities, labels = crime_labels)
    list_for_log_loss.append(working_log_loss)
    #print("Multi-class Log Loss with RF tuned/calibrated/bagged and n,ms,mf,b,bf,o =", n,",",ms,",", mf,",",b,",",bf,",",o, "is:", working_log_loss)

n_estimators_list = [i for i in range(1,1002,100)]
max_samples_list = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
max_features_list = [0.1, 0.2, 0.4, 0.5, 0.75, 1.0]
bootstrap_list =[True]
bootstrap_features_list = [True, False]
oob_score_list = [False]

start = time.clock()
for this_ns in n_estimators_list:
    for this_mss in max_samples_list:
        for this_mfs in max_features_list:
            for this_bs in bootstrap_list:
                for this_bfs in bootstrap_features_list:
                    for this_os in oob_score_list:
                        Bagging_RF(this_ns, this_mss, this_mfs, this_bs, this_bfs, this_os)
            
index_best_logloss = np.argmin(list_for_log_loss)
print('For RF the best log loss with hyperparameter tuning, calibration, and bagging is',list_for_log_loss[index_best_logloss], 'with n_estimators =', list_for_ns[index_best_logloss], 'max_samples =', list_for_mss[index_best_logloss], 'max_features =', list_for_mfs[index_best_logloss], 'bootstrap =', list_for_bs[index_best_logloss], 'bootstrap_features =', list_for_bfs[index_best_logloss], 'oob_score =', list_for_os[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')

Error Analysis On Our Best Classifier: Random Forest

tuned_DT_calibrate_isotonic = RandomForestClassifier(min_impurity_split=1, n_estimators=100, bootstrap= True, max_features=15, criterion='entropy', min_samples_leaf=10, max_depth=None ).fit(train_data, train_labels)
ccv_isotonic = CalibratedClassifierCV(tuned_DT_calibrate_isotonic, method = 'isotonic', cv = 'prefit')
ccv_isotonic.fit(calibrate_data, calibrate_labels)
ccv_predictions = ccv_isotonic.predict(dev_data)
ccv_prediction_probabilities_isotonic = ccv_isotonic.predict_proba(dev_data)
working_log_loss_isotonic = log_loss(y_true = dev_labels, y_pred = ccv_prediction_probabilities_isotonic, labels = crime_labels)
print("Multi-class Log Loss with RF and calibration with isotonic is:", working_log_loss_isotonic)


In [ ]:
pd.DataFrame(np.amax(ccv_prediction_probabilities_isotonic, axis=1)).hist()

Error Analysis: Calibration


In [ ]:
#clf_probabilities, clf_predictions, labels
def error_analysis_calibration(buckets, clf_probabilities, clf_predictions, labels):
    """inputs:
    clf_probabilities = clf.predict_proba(dev_data)
    clf_predictions = clf.predict(dev_data)
    labels = dev_labels"""
    
    #buckets = [0.05, 0.15, 0.3, 0.5, 0.8]
    #buckets = [0.15, 0.25, 0.3, 1.0]
    correct = [0 for i in buckets]
    total = [0 for i in buckets]

    lLimit = 0
    uLimit = 0
    for i in range(len(buckets)):
        uLimit = buckets[i]
        for j in range(clf_probabilities.shape[0]):
            if (np.amax(clf_probabilities[j]) > lLimit) and (np.amax(clf_probabilities[j]) <= uLimit):
                if clf_predictions[j] == labels[j]:
                    correct[i] += 1
                total[i] += 1
        lLimit = uLimit
        
    print(sum(correct))
    print(sum(total))
    print(correct)
    print(total)

    #here we report the classifier accuracy for each posterior probability bucket
    accuracies = []
    for k in range(len(buckets)):
        print(1.0*correct[k]/total[k])
        accuracies.append(1.0*correct[k]/total[k])
        print('p(pred) <= %.13f    total = %3d    correct = %3d    accuracy = %.3f' \
              %(buckets[k], total[k], correct[k], 1.0*correct[k]/total[k]))
    plt.plot(buckets,accuracies)
    plt.title("Calibration Analysis")
    plt.xlabel("Posterior Probability")
    plt.ylabel("Classifier Accuracy")
    
    return buckets, accuracies

In [ ]:
# Look at how the posteriors are distributed in order to set the best bins in 'buckets'
pd.DataFrame(np.amax(bestLRPredictionProbabilities, axis=1)).hist()

In [ ]:
buckets = [0.15, 0.25, 0.3, 1.0]
calibration_buckets, calibration_accuracies = error_analysis_calibration(buckets, clf_probabilities=bestLRPredictionProbabilities, \
                                                                         clf_predictions=bestLRPredictions, \
                                                                         labels=mini_dev_labels)

Error Analysis: Classification Report


In [1]:
def error_analysis_classification_report(clf_predictions, labels):
    """inputs:
    clf_predictions = clf.predict(dev_data)
    labels = dev_labels"""
    print('Classification Report:')
    report = classification_report(labels, clf_predictions)
    print(report)
    return report

In [ ]:
# Store the report under a new name to avoid shadowing sklearn's classification_report function
lr_classification_report = error_analysis_classification_report(clf_predictions=bestLRPredictions, \
                                                            labels=mini_dev_labels)

Error Analysis: Confusion Matrix


In [ ]:
from sklearn.metrics import confusion_matrix  # may already be imported earlier in the notebook

def error_analysis_confusion_matrix(label_names, clf_predictions, labels):
    """inputs:
    clf_predictions = clf.predict(dev_data)
    labels = dev_labels"""
    cm = pd.DataFrame(confusion_matrix(labels, clf_predictions, labels=label_names))
    cm.columns=label_names
    cm.index=label_names
    cm.to_csv(path_or_buf="./confusion_matrix.csv")
    #print(cm)
    return cm

In [ ]:
error_analysis_confusion_matrix(label_names=crime_labels_mini_dev, clf_predictions=bestLRPredictions, \
                                                            labels=mini_dev_labels)

References

1) Hsiang, Solomon M. and Burke, Marshall and Miguel, Edward. "Quantifying the Influence of Climate on Human Conflict". Science, Vol 341, Issue 6151, 2013

2) Huang, Cheng-Lung and Wang, Chieh-Jen. "A GA-based feature selection and parameters optimization for support vector machines". Expert Systems with Applications, Vol 31, 2006, pp. 231-240

3) https://gallery.cortanaintelligence.com/Experiment/Evaluating-and-Parameter-Tuning-a-Decision-Tree-Model-1

4) Neo, Toh Koon Charlie and Ventura, Dan. "A directed boosting algorithm for the k-nearest neighbor classifier via local warping of the distance metric". Pattern Recognition Letters, Vol 33, 2012, pp. 92-102