This notebook supports the final, refined notebook submitted for the Berkeley MIDS W207 Final Project. It contains the code and markdown that contributed to our final notebook, graphs, and formal presentation, and it serves as a reference for the final notebook.
Throughout this notebook there are local dependencies that require paths to local files. These cells are suppressed for viewing comfort, so as not to overwhelm the notebook with errors.
In [2]:
# Import relevant libraries:
import time
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import log_loss
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
# Import Meta-estimators
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
# Import Calibration tools
from sklearn.calibration import CalibratedClassifierCV
# Set random seed and format print output:
np.random.seed(0)
np.set_printoptions(precision=3)
CREATE TABLE kaggle_sf_crime (
dates TIMESTAMP,
category VARCHAR,
descript VARCHAR,
dayofweek VARCHAR,
pd_district VARCHAR,
resolution VARCHAR,
addr VARCHAR,
X FLOAT,
Y FLOAT);
\copy kaggle_sf_crime FROM '/Users/Goodgame/Desktop/MIDS/207/final/sf_crime_train.csv' DELIMITER ',' CSV HEADER;
SELECT
category,
date_part('hour', dates) AS hour_of_day,
CASE
WHEN dayofweek = 'Monday' then 1
WHEN dayofweek = 'Tuesday' THEN 2
WHEN dayofweek = 'Wednesday' THEN 3
WHEN dayofweek = 'Thursday' THEN 4
WHEN dayofweek = 'Friday' THEN 5
WHEN dayofweek = 'Saturday' THEN 6
WHEN dayofweek = 'Sunday' THEN 7
END AS dayofweek_numeric,
X,
Y,
CASE
WHEN pd_district = 'BAYVIEW' THEN 1
ELSE 0
END AS bayview_binary,
CASE
WHEN pd_district = 'INGLESIDE' THEN 1
ELSE 0
END AS ingleside_binary,
CASE
WHEN pd_district = 'NORTHERN' THEN 1
ELSE 0
END AS northern_binary,
CASE
WHEN pd_district = 'CENTRAL' THEN 1
ELSE 0
END AS central_binary,
CASE
WHEN pd_district = 'BAYVIEW' THEN 1
ELSE 0
END AS pd_bayview_binary,
CASE
WHEN pd_district = 'MISSION' THEN 1
ELSE 0
END AS mission_binary,
CASE
WHEN pd_district = 'SOUTHERN' THEN 1
ELSE 0
END AS southern_binary,
CASE
WHEN pd_district = 'TENDERLOIN' THEN 1
ELSE 0
END AS tenderloin_binary,
CASE
WHEN pd_district = 'PARK' THEN 1
ELSE 0
END AS park_binary,
CASE
WHEN pd_district = 'RICHMOND' THEN 1
ELSE 0
END AS richmond_binary,
CASE
WHEN pd_district = 'TARAVAL' THEN 1
ELSE 0
END AS taraval_binary
FROM kaggle_sf_crime;
The initial source of our data was at https://www.kaggle.com/c/sf-crime. We loaded the original training data, train.csv, then extracted the dates column and added it to the transformed data. Dates were converted to number of seconds from the Unix epoch (defined as the number of seconds that have elapsed since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970, minus the number of leap seconds that have taken place since then). We then moved the dates to the first column.
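A minimal pandas-only sketch of this conversion is shown below; it is illustrative only, and the code we actually ran appears in the commented cell later in this section. The file path matches the one used elsewhere in this notebook.
In [ ]:
# Minimal sketch: convert the 'Dates' column to seconds since the Unix epoch
# using pandas alone. The commented cell below shows the code we actually ran.
import pandas as pd

df = pd.read_csv("./data/train.csv")
df['secondsFromEpoch'] = pd.to_datetime(df['Dates']).astype('int64') // 10**9
# Move the new column to the front, mirroring the transformation described above.
cols = df.columns.tolist()
df = df[cols[-1:] + cols[:-1]]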
Next, we one-hot encoded the location data. Location information tends to be extremely high-dimensional, which presents difficulties for modeling. The original dataset's location information came in three major formats: the police department district (PdDistrict), the street address, and longitude/latitude coordinates (X and Y).
After visualizing the data and conducting basic exploratory data analysis, we decided to use one-hot encoding to transform the police department location information into features. In other words, each police department district becomes a feature, and a given observation receives a value of '1' if it occurred in the jurisdiction described by that feature; otherwise, it receives a value of '0'. Approaching location in this way allowed us to preserve the parsimony of our model: it retains the majority of the important variance in the data without overfitting.
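For reference, here is a minimal pandas sketch of the same encoding (assuming a DataFrame df loaded from train.csv with a PdDistrict column); in practice we produced these binary columns with the SQL query shown above.
In [ ]:
# Minimal pandas sketch of the one-hot encoding performed in SQL above.
import pandas as pd

df = pd.read_csv("./data/train.csv")
district_dummies = pd.get_dummies(df['PdDistrict'], prefix='pd')
df = pd.concat([df.drop('PdDistrict', axis=1), district_dummies], axis=1)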
We sought to add features to our models that would improve performance with respect to our desired performance metric of multi-class log loss. There is evidence of a correlation between weather patterns and crime, with some experts even arguing for a causal relationship between weather and crime [1]. More specifically, a 2013 paper published in Science showed that higher temperatures and extreme rainfall led to large increases in conflict. Given this evidence that weather influences crime, we saw weather data as a candidate source of additional features to improve the performance of our classifiers. Weather data was gathered from the National Centers for Environmental Information (specifically, the National Oceanic and Atmospheric Administration). Certain features from this data set were incorporated into the original crime data set because they were hypothesized to improve performance. As the code below shows, these features included hourly dry bulb temperature, relative humidity, wind speed, sea level pressure, visibility, daily sunrise and sunset times, and a derived daylight indicator.
Details on School data and Zipcode data are below.
In [ ]:
#data_path = "./data/train_transformed.csv"
#df = pd.read_csv(data_path, header=0)
#x_data = df.drop('category', 1)
#y = df.category.as_matrix()
########## Adding the date back into the data
#import csv
#import time
#import calendar
#data_path = "./data/train.csv"
#dataCSV = open(data_path, 'rt')
#csvData = list(csv.reader(dataCSV))
#csvFields = csvData[0] #['Dates', 'Category', 'Descript', 'DayOfWeek', 'PdDistrict', 'Resolution', 'Address', 'X', 'Y']
#allData = csvData[1:]
#dataCSV.close()
#df2 = pd.DataFrame(allData)
#df2.columns = csvFields
#dates = df2['Dates']
#dates = dates.apply(time.strptime, args=("%Y-%m-%d %H:%M:%S",))
#dates = dates.apply(calendar.timegm)
#print(dates.head())
#x_data['secondsFromEpoch'] = dates
#colnames = x_data.columns.tolist()
#colnames = colnames[-1:] + colnames[:-1]
#x_data = x_data[colnames]
##########
########## Adding the weather data into the original crime data
#weatherData1 = "./data/1027175.csv"
#weatherData2 = "./data/1027176.csv"
#dataCSV = open(weatherData1, 'rt')
#csvData = list(csv.reader(dataCSV))
#csvFields = csvData[0] #['Dates', 'Category', 'Descript', 'DayOfWeek', 'PdDistrict', 'Resolution', 'Address', 'X', 'Y']
#allWeatherData1 = csvData[1:]
#dataCSV.close()
#dataCSV = open(weatherData2, 'rt')
#csvData = list(csv.reader(dataCSV))
#csvFields = csvData[0] #['Dates', 'Category', 'Descript', 'DayOfWeek', 'PdDistrict', 'Resolution', 'Address', 'X', 'Y']
#allWeatherData2 = csvData[1:]
#dataCSV.close()
#weatherDF1 = pd.DataFrame(allWeatherData1)
#weatherDF1.columns = csvFields
#dates1 = weatherDF1['DATE']
#sunrise1 = weatherDF1['DAILYSunrise']
#sunset1 = weatherDF1['DAILYSunset']
#weatherDF2 = pd.DataFrame(allWeatherData2)
#weatherDF2.columns = csvFields
#dates2 = weatherDF2['DATE']
#sunrise2 = weatherDF2['DAILYSunrise']
#sunset2 = weatherDF2['DAILYSunset']
#functions for processing the sunrise and sunset times of each day
#def get_hour_and_minute(milTime):
# hour = int(milTime[:-2])
# minute = int(milTime[-2:])
# return [hour, minute]
#def get_date_only(date):
# return time.struct_time(tuple([date[0], date[1], date[2], 0, 0, 0, date[6], date[7], date[8]]))
#def structure_sun_time(timeSeries, dateSeries):
# sunTimes = timeSeries.copy()
# for index in range(len(dateSeries)):
# sunTimes[index] = time.struct_time(tuple([dateSeries[index][0], dateSeries[index][1], dateSeries[index][2], timeSeries[index][0], timeSeries[index][1], dateSeries[index][5], dateSeries[index][6], dateSeries[index][7], dateSeries[index][8]]))
# return sunTimes
#dates1 = dates1.apply(time.strptime, args=("%Y-%m-%d %H:%M",))
#sunrise1 = sunrise1.apply(get_hour_and_minute)
#sunrise1 = structure_sun_time(sunrise1, dates1)
#sunrise1 = sunrise1.apply(calendar.timegm)
#sunset1 = sunset1.apply(get_hour_and_minute)
#sunset1 = structure_sun_time(sunset1, dates1)
#sunset1 = sunset1.apply(calendar.timegm)
#dates1 = dates1.apply(calendar.timegm)
#dates2 = dates2.apply(time.strptime, args=("%Y-%m-%d %H:%M",))
#sunrise2 = sunrise2.apply(get_hour_and_minute)
#sunrise2 = structure_sun_time(sunrise2, dates2)
#sunrise2 = sunrise2.apply(calendar.timegm)
#sunset2 = sunset2.apply(get_hour_and_minute)
#sunset2 = structure_sun_time(sunset2, dates2)
#sunset2 = sunset2.apply(calendar.timegm)
#dates2 = dates2.apply(calendar.timegm)
#weatherDF1['DATE'] = dates1
#weatherDF1['DAILYSunrise'] = sunrise1
#weatherDF1['DAILYSunset'] = sunset1
#weatherDF2['DATE'] = dates2
#weatherDF2['DAILYSunrise'] = sunrise2
#weatherDF2['DAILYSunset'] = sunset2
#weatherDF = pd.concat([weatherDF1,weatherDF2[32:]],ignore_index=True)
# Starting off with some of the easier features to work with-- more to come here . . . still in beta
#weatherMetrics = weatherDF[['DATE','HOURLYDRYBULBTEMPF','HOURLYRelativeHumidity', 'HOURLYWindSpeed', \
# 'HOURLYSeaLevelPressure', 'HOURLYVISIBILITY', 'DAILYSunrise', 'DAILYSunset']]
#weatherMetrics = weatherMetrics.convert_objects(convert_numeric=True)
#weatherDates = weatherMetrics['DATE']
#'DATE','HOURLYDRYBULBTEMPF','HOURLYRelativeHumidity', 'HOURLYWindSpeed',
#'HOURLYSeaLevelPressure', 'HOURLYVISIBILITY'
#timeWindow = 10800 #3 hours
#hourlyDryBulbTemp = []
#hourlyRelativeHumidity = []
#hourlyWindSpeed = []
#hourlySeaLevelPressure = []
#hourlyVisibility = []
#dailySunrise = []
#dailySunset = []
#daylight = []
#test = 0
#for timePoint in dates:#dates is the epoch time from the kaggle data
# relevantWeather = weatherMetrics[(weatherDates <= timePoint) & (weatherDates > timePoint - timeWindow)]
# hourlyDryBulbTemp.append(relevantWeather['HOURLYDRYBULBTEMPF'].mean())
# hourlyRelativeHumidity.append(relevantWeather['HOURLYRelativeHumidity'].mean())
# hourlyWindSpeed.append(relevantWeather['HOURLYWindSpeed'].mean())
# hourlySeaLevelPressure.append(relevantWeather['HOURLYSeaLevelPressure'].mean())
# hourlyVisibility.append(relevantWeather['HOURLYVISIBILITY'].mean())
# dailySunrise.append(relevantWeather['DAILYSunrise'].iloc[-1])
# dailySunset.append(relevantWeather['DAILYSunset'].iloc[-1])
# daylight.append(1.0*((timePoint >= relevantWeather['DAILYSunrise'].iloc[-1]) and (timePoint < relevantWeather['DAILYSunset'].iloc[-1])))
#if timePoint < relevantWeather['DAILYSunset'][-1]:
#daylight.append(1)
#else:
#daylight.append(0)
# if test%100000 == 0:
# print(relevantWeather)
# test += 1
#hourlyDryBulbTemp = pd.Series.from_array(np.array(hourlyDryBulbTemp))
#hourlyRelativeHumidity = pd.Series.from_array(np.array(hourlyRelativeHumidity))
#hourlyWindSpeed = pd.Series.from_array(np.array(hourlyWindSpeed))
#hourlySeaLevelPressure = pd.Series.from_array(np.array(hourlySeaLevelPressure))
#hourlyVisibility = pd.Series.from_array(np.array(hourlyVisibility))
#dailySunrise = pd.Series.from_array(np.array(dailySunrise))
#dailySunset = pd.Series.from_array(np.array(dailySunset))
#daylight = pd.Series.from_array(np.array(daylight))
#x_data['HOURLYDRYBULBTEMPF'] = hourlyDryBulbTemp
#x_data['HOURLYRelativeHumidity'] = hourlyRelativeHumidity
#x_data['HOURLYWindSpeed'] = hourlyWindSpeed
#x_data['HOURLYSeaLevelPressure'] = hourlySeaLevelPressure
#x_data['HOURLYVISIBILITY'] = hourlyVisibility
#x_data['DAILYSunrise'] = dailySunrise
#x_data['DAILYSunset'] = dailySunset
#x_data['Daylight'] = daylight
#x_data.to_csv(path_or_buf="C:/MIDS/W207 final project/x_data.csv")
##########
# Impute missing values with mean values:
#x_complete = x_data.fillna(x_data.mean())
#X_raw = x_complete.as_matrix()
# Scale the data between 0 and 1:
#X = MinMaxScaler().fit_transform(X_raw)
# Shuffle data to remove any underlying pattern that may exist:
#shuffle = np.random.permutation(np.arange(X.shape[0]))
#X, y = X[shuffle], y[shuffle]
# Separate training, dev, and test data:
#test_data, test_labels = X[800000:], y[800000:]
#dev_data, dev_labels = X[700000:800000], y[700000:800000]
#train_data, train_labels = X[:700000], y[:700000]
#mini_train_data, mini_train_labels = X[:75000], y[:75000]
#mini_dev_data, mini_dev_labels = X[75000:100000], y[75000:100000]
#labels_set = set(mini_dev_labels)
#print(labels_set)
#print(len(labels_set))
#print(train_data[:10])
We hypothesized that incorporating data on school locations, as well as other details on schools, as features would increase the performance of our classifiers. We found a data set offered by the California Department of Education (cde.ca.gov) with a list of all K-12 schools in California and accompanying characteristics of those schools, including activity status, grades taught (elementary vs. middle vs. high school), and address, including location coordinates. We created a distance function to match the longitude/latitude coordinates of each crime to the closest school and pulled other potentially interesting features, including (as read in the code below) the school's name, activity status, educational instruction level (EILCode/EILName), zip code, and latitude/longitude coordinates.
Ultimately, we were not able to get these features incorporated into our data, although the attempt was educational.
In [ ]:
### Read in school data
#data_path_schools = "./data/pubschls.csv"
#schools = pd.read_csv(data_path_schools,header=0, sep ='\t', usecols = ["CDSCode","StatusType", "School", "EILCode", "EILName", "Zip", "Latitude", "Longitude"], dtype ={'CDSCode': str, 'StatusType': str, 'School': str, 'EILCode': str,'EILName': str,'Zip': str, 'Latitude': float, 'Longitude': float})
#schools = schools[(schools["StatusType"] == 'Active')]
### Find the closest school
#def dist(lat1, long1, lat2, long2):
# return np.sqrt((lat1-lat2)**2+(long1-long2)**2)
#def find_closest_school(lat, long):
# distances = schools.apply(lambda row: dist(lat, long, row["Latitude"], row["Longitude"]), axis=1)
# return min(distances)
#x_data['closest_school'] = x_data_sub.apply(lambda row: find_closest_school(row['y'], row['x']), axis=1)
We found government data from data.gov that links each zip code in the country to longitude and latitude coordinates, in an effort to reduce the location dimensions from 2 to 1. We used a custom distance function to map the longitude/latitude coordinates of each crime incident to the closest 5-digit zip code in the data.gov file. Unfortunately, even after we narrowed the data to California zip codes (> 94000 and < 95000), we were unable to run the computation in time to add the feature to the data set.
In [ ]:
### Read in zip code data
#data_path_zip = "./data/2016_zips.csv"
#zips = pd.read_csv(data_path_zip, header=0, sep ='\t', usecols = [0,5,6], names = ["GEOID", "INTPTLAT", "INTPTLONG"], dtype ={'GEOID': int, 'INTPTLAT': float, 'INTPTLONG': float})
#sf_zips = zips[(zips['GEOID'] > 94000) & (zips['GEOID'] < 94189)]
### Mapping longitude/latitude to zipcodes
#def dist(lat1, long1, lat2, long2):
# return np.sqrt((lat1-lat2)**2+(long1-long2)**2)
# return abs(lat1-lat2)+abs(long1-long2)
#def find_zipcode(lat, long):
# distances = sf_zips.apply(lambda row: dist(lat, long, row["INTPTLAT"], row["INTPTLONG"]), axis=1)
# return sf_zips.loc[distances.idxmin(), "GEOID"]
#x_data['zipcode'] = 0
#for i in range(0, 1):
# x_data['zipcode'][i] = x_data.apply(lambda row: find_zipcode(row['x'], row['y']), axis=1)
#x_data['zipcode']= x_data.apply(lambda row: find_zipcode(row['x'], row['y']), axis=1)
Our final set of features was: hour of day, day of week, the X/Y coordinates, the one-hot-encoded police district indicators, the hourly weather metrics (dry bulb temperature, relative humidity, wind speed, sea level pressure, and visibility), daily sunrise and sunset times, and the daylight indicator.
As you can see below, the data required further cleaning and transformation.
We imputed missing values with the feature's mean value. We then scaled features between 0 and 1. The data was shuffled to remove any underlying pattern that may exist.
Kaggle dictated that our performance metric was log loss. We did not submit this metric directly; it was computed from a submitted CSV file containing a matrix of probability estimates for each class of crime for each data point. In our work, we sought to directly optimize the log loss by hyperparameter tuning, calibrating, and applying meta-estimation to our models. This meant that we would not be submitting our probability estimates each time we attempted to improve the performance of a model. This posed an operational difficulty, though, because the log loss function in sklearn.metrics requires that the set of correct labels for the data exactly match the set of labels represented in the predicted probabilities.
Two of the crimes in our data set were extremely rare: trespassing on an industrial property (TREA) and possessing obscene material (PORNOGRAPHY/OBSCENE MAT), at 4 and 8 occurrences, respectively, in approximately 1 million crimes. At such rare rates, those crimes were often absent from one or more subsets after randomly shuffling and then splitting the data into the necessary train, development, calibration, and test subsets. This caused errors when generating the log loss. Therefore, we decided to sacrifice our performance on these 12 out of roughly 1 million data points in order to determine our performance metric more quickly.
To ensure that all the data subsets contained all of the remaining crimes as labels (all 37 of them), the data load sometimes required repeated iterations in order to secure a crime set of 37 across the board. This was due to the inability to set a seed when using the permutation method in numpy. The data subsets were sized such that it would not require more than a few attempts to get all 37 crime labels into each subset.
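To illustrate the label-matching requirement described above, here is a small toy example (hypothetical labels and probabilities, not our data) showing how the labels argument of sklearn's log_loss lets the probability matrix carry a column for a class that happens not to appear in the evaluation labels.
In [ ]:
# Toy illustration of why we pass an explicit `labels` list to log_loss.
# The probability matrix has one column per class the model knows about,
# even if a rare class never appears in the evaluation labels.
import numpy as np
from sklearn.metrics import log_loss

toy_labels = ['ARSON', 'ASSAULT', 'BURGLARY']   # full (toy) class list
y_true = ['ASSAULT', 'BURGLARY', 'ASSAULT']     # 'ARSON' is absent here
y_prob = np.array([[0.1, 0.7, 0.2],             # columns follow sorted toy_labels
                   [0.2, 0.2, 0.6],
                   [0.1, 0.8, 0.1]])

# Without labels=, log_loss would infer only two classes from y_true and the
# three probability columns would not line up.
print(log_loss(y_true, y_prob, labels=toy_labels))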
In [3]:
#data_path = "/Users/Bryan/Desktop/UC_Berkeley_MIDS_files/Courses/W207_Intro_To_Machine_Learning/Final_Project/x_data_3.csv"
#df = pd.read_csv(data_path, header=0)
#x_data = df.drop('category', 1)
#y = df.category.as_matrix()
# Impute missing values with mean values:
#x_complete = x_data.fillna(x_data.mean())
#X_raw = x_complete.as_matrix()
# Scale the data between 0 and 1:
#X = MinMaxScaler().fit_transform(X_raw)
# Shuffle data to remove any underlying pattern that may exist. Must re-run random seed step each time:
#np.random.seed(0)
#shuffle = np.random.permutation(np.arange(X.shape[0]))
#X, y = X[shuffle], y[shuffle]
# Due to difficulties with log loss and set(y_pred) needing to match set(labels), we will remove the extremely rare
# crimes from the data for quality issues.
#X_minus_trea = X[np.where(y != 'TREA')]
#y_minus_trea = y[np.where(y != 'TREA')]
#X_final = X_minus_trea[np.where(y_minus_trea != 'PORNOGRAPHY/OBSCENE MAT')]
#y_final = y_minus_trea[np.where(y_minus_trea != 'PORNOGRAPHY/OBSCENE MAT')]
# Separate training, dev, and test data:
#test_data, test_labels = X_final[800000:], y_final[800000:]
#dev_data, dev_labels = X_final[700000:800000], y_final[700000:800000]
#train_data, train_labels = X_final[100000:700000], y_final[100000:700000]
#calibrate_data, calibrate_labels = X_final[:100000], y_final[:100000]
# Create mini versions of the above sets
#mini_train_data, mini_train_labels = X_final[:20000], y_final[:20000]
#mini_calibrate_data, mini_calibrate_labels = X_final[19000:28000], y_final[19000:28000]
#mini_dev_data, mini_dev_labels = X_final[49000:60000], y_final[49000:60000]
# Create list of the crime type labels. This will act as the "labels" parameter for the log loss functions that follow
#crime_labels = list(set(y_final))
#crime_labels_mini_train = list(set(mini_train_labels))
#crime_labels_mini_dev = list(set(mini_dev_labels))
#crime_labels_mini_calibrate = list(set(mini_calibrate_labels))
#print(len(crime_labels), len(crime_labels_mini_train), len(crime_labels_mini_dev),len(crime_labels_mini_calibrate))
#print(len(train_data),len(train_labels))
#print(len(dev_data),len(dev_labels))
#print(len(mini_train_data),len(mini_train_labels))
#print(len(mini_dev_data),len(mini_dev_labels))
#print(len(test_data),len(test_labels))
#print(len(mini_calibrate_data),len(mini_calibrate_labels))
#print(len(calibrate_data),len(calibrate_labels))
In [ ]:
# matplotlib is needed for the plots in this cell
import matplotlib.pyplot as plt

yFrame = pd.DataFrame(y, columns=['category'])
yCounts = yFrame['category'].value_counts()
f = plt.figure(figsize=(15,8))
yCounts.plot(kind='bar')
plt.title('Distribution of Classes in the Training Data')
plt.ylabel('Frequency')
In [ ]:
# PCA lives in sklearn.decomposition
from sklearn.decomposition import PCA

pca1 = PCA(n_components=20)
pca1.fit(train_data)
explainedVariances = pca1.explained_variance_ratio_
totalExplained = []
for index in range(len(explainedVariances)):
totalExplained.append(sum(explainedVariances[:index+1]))
pd.DataFrame(totalExplained)
In [ ]:
plt.figure(figsize=(15,8))
plt.plot(np.linspace(1,20,20),totalExplained,".")
plt.title("Fraction of Total Variance Explained by First k Principal Components")
plt.xlabel("Number of Components")
plt.ylabel("Fraction of Total Variance Explained")
plt.show()
Kaggle required a very specific format for submitting our results. Submissions are evaluated using the multi-class logarithmic loss. Each incident must be labeled with one true class. For each incident, one must submit a set of predicted probabilities (one for every class). A CSV file with the incident id, all candidate class names, and a probability for each class is required for the final submission document.
In [86]:
# The Kaggle submission format requires listing the ID of each example.
# This is to remember the order of the IDs after shuffling
#allIDs = np.array(list(df.axes[0]))
#allIDs = allIDs[shuffle]
#testIDs = allIDs[800000:]
#devIDs = allIDs[700000:800000]
#trainIDs = allIDs[:700000]
# Extract the column names for the required submission format
#sampleSubmission_path = "./data/sampleSubmission.csv"
#sampleDF = pd.read_csv(sampleSubmission_path)
#allColumns = list(sampleDF.columns)
#featureColumns = allColumns[1:]
# Extracting the test data for a baseline submission
#real_test_path = "./data/test_transformed.csv"
#testDF = pd.read_csv(real_test_path, header=0)
#real_test_data = testDF
#test_complete = real_test_data.fillna(real_test_data.mean())
#Test_raw = test_complete.as_matrix()
#TestData = MinMaxScaler().fit_transform(Test_raw)
# Here we remember the ID of each test data point, in case we ever decide to shuffle the test data for some reason
#testIDs = list(testDF.axes[0])
This step was completed as a test run to ensure that everything up to this point was functioning properly and that we could generate a proper submission CSV file. We also provide an example of the model accuracy being generated, although accuracy was not our formal evaluation metric.
In [87]:
# Generate a baseline MNB classifier and make it return prediction probabilities for the actual test data
#def MNB():
# mnb = MultinomialNB(alpha = 0.0000001)
# mnb.fit(train_data, train_labels)
# print("\n\nMultinomialNB accuracy on dev data:", mnb.score(dev_data, dev_labels))
# return mnb.predict_proba(dev_data)
#MNB()
#baselinePredictionProbabilities = MNB()
# Place the resulting prediction probabilities in a .csv file in the required format
# First, turn the prediction probabilities into a data frame
#resultDF = pd.DataFrame(baselinePredictionProbabilities,columns=featureColumns)
# Add the IDs as a final column
#resultDF.loc[:,'Id'] = pd.Series(testIDs,index=resultDF.index)
# Make the 'Id' column the first column
#colnames = resultDF.columns.tolist()
#colnames = colnames[-1:] + colnames[:-1]
#resultDF = resultDF[colnames]
# Output to a .csv file
# resultDF.to_csv('result.csv',index=False)
Note: the code above will shuffle data differently every time it's run, so model accuracies will vary accordingly.
In [ ]:
# -*- coding: utf-8 -*-
"""
Created on Sat Aug 19 19:49:47 2017
@author: kalvi
"""
#required imports
import pandas as pd
import numpy as np
def make_kaggle_format(sample_submission_path, test_transformed_path, prediction_probabilities):
"""this function requires:
sample_submission_path=(filepath for 'sampleSubmission.csv')
test_transformed_path=(filepath for 'test_transformed.csv')
prediction_probabilities=(output from clf.predict_proba(test_data))"""
#this is for extracting the column names for the required submission format
sampleDF = pd.read_csv(sample_submission_path)
allColumns = list(sampleDF.columns)
featureColumns = allColumns[1:]
#this is for extracting the test data IDs for our baseline submission
testDF = pd.read_csv(test_transformed_path, header=0)
testIDs = list(testDF.axes[0])
#first we turn the prediction probabilities into a data frame
resultDF = pd.DataFrame(prediction_probabilities,columns=featureColumns)
#this adds the IDs as a final column
resultDF.loc[:,'Id'] = pd.Series(testIDs,index=resultDF.index)
#the next few lines make the 'Id' column the first column
colnames = resultDF.columns.tolist()
colnames = colnames[-1:] + colnames[:-1]
resultDF = resultDF[colnames]
#output to .csv file
resultDF.to_csv('result.csv',index=False)
In [26]:
## Data sub-setting quality check-point
print(train_data[:1])
print(train_labels[:1])
In [27]:
# Modeling quality check-point with MNB--fast model
def MNB():
mnb = MultinomialNB(alpha = 0.0000001)
mnb.fit(train_data, train_labels)
print("\n\nMultinomialNB accuracy on dev data:", mnb.score(dev_data, dev_labels))
MNB()
As determined by the Kaggle submission guidelines, the performance metric for the San Francisco Crime Classification competition is Multi-class Logarithmic Loss (also known as cross-entropy). Various other performance metrics are appropriate for different domains: accuracy, F-score, Lift, ROC Area, average precision, precision/recall break-even point, and squared error.
Multi-class Log Loss
Accuracy
F-score
Lift
ROC Area
Average precision
Precision/Recall break-even point
Squared-error
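For reference, the multi-class logarithmic loss that Kaggle computes over $N$ incidents and $M$ candidate classes is

$$\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\log\bigl(p_{ij}\bigr),$$

where $y_{ij}$ is 1 if incident $i$ belongs to class $j$ (and 0 otherwise) and $p_{ij}$ is the submitted probability that incident $i$ belongs to class $j$, clipped away from 0 and 1 before the logarithm is taken.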
In [46]:
def model_prototype(train_data, train_labels, eval_data, eval_labels):
knn = KNeighborsClassifier(n_neighbors=5).fit(train_data, train_labels)
bnb = BernoulliNB(alpha=1, binarize = 0.5).fit(train_data, train_labels)
mnb = MultinomialNB().fit(train_data, train_labels)
log_reg = LogisticRegression().fit(train_data, train_labels)
neural_net = MLPClassifier().fit(train_data, train_labels)
random_forest = RandomForestClassifier().fit(train_data, train_labels)
decision_tree = DecisionTreeClassifier().fit(train_data, train_labels)
support_vm_step_one = svm.SVC(probability = True)
support_vm = support_vm_step_one.fit(train_data, train_labels)
models = [knn, bnb, mnb, log_reg, neural_net, random_forest, decision_tree, support_vm]
for model in models:
eval_prediction_probabilities = model.predict_proba(eval_data)
eval_predictions = model.predict(eval_data)
print(model, "Multi-class Log Loss:", log_loss(y_true = eval_labels, y_pred = eval_prediction_probabilities, labels = crime_labels_mini_dev), "\n\n")
model_prototype(mini_train_data, mini_train_labels, mini_dev_data, mini_dev_labels)
Here we sought to optimize the performance of our classifiers in a three-step, dynamic engineering process.
We previously added components from the weather data into the original SF crime data as new features. We will not repeat the work done in our initial submission, where our training dataset did not include these features. For a comparison of how the added features improved our performance with respect to log loss, please refer back to our initial submission.
Each classifier has hyperparameters that we can engineer to further optimize performance, as opposed to using the default parameter values as we did above in the model prototyping cell. This will be specific to each classifier as detailed below.
We can calibrate the models via Platt Scaling or Isotonic Regression to attempt to improve their performance.
Platt Scaling: fits a logistic (sigmoid) function to the classifier's outputs on a held-out calibration set, mapping raw scores to calibrated probabilities; it works well when the calibration distortion is roughly sigmoid-shaped and when relatively little calibration data is available.
Isotonic Regression: fits a non-parametric, monotonically non-decreasing step function to the classifier's outputs; it is more flexible than Platt Scaling but can overfit when the calibration set is small.
For each classifier, we can use CalibratedClassifierCV to perform probability calibration with isotonic regression or sigmoid (Platt Scaling). The parameters within CalibratedClassifierCV that we can adjust are the method ('sigmoid' or 'isotonic') and cv (cross-validation generator). As we will already be training our models before calibration, we will only use cv = 'prefit'. Thus, in practice the cross-validation generator will not be a modifiable parameter for us.
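As a minimal sketch of this pattern (using the mini training, calibration, and dev splits defined earlier, with a Bernoulli Naive Bayes base model purely for illustration):
In [ ]:
# Minimal sketch of probability calibration with cv='prefit': fit a base model
# on the training split, fit the calibrator on the held-out calibration split,
# then evaluate log loss on the dev split. The base model here is illustrative.
from sklearn.naive_bayes import BernoulliNB
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

base_clf = BernoulliNB().fit(mini_train_data, mini_train_labels)
for method in ['sigmoid', 'isotonic']:   # Platt Scaling vs. Isotonic Regression
    ccv = CalibratedClassifierCV(base_clf, method=method, cv='prefit')
    ccv.fit(mini_calibrate_data, mini_calibrate_labels)
    probs = ccv.predict_proba(mini_dev_data)
    print(method, log_loss(y_true=mini_dev_labels, y_pred=probs,
                           labels=crime_labels_mini_dev))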
To optimize the parameters, we used GridSearchCV -- with a slight wrinkle. Because we needed GridSearchCV to sort through an incredible number of model specifications with a very large amount of data, we decided to parallelize the process using Spark. Fortunately, there is a PyPI library for doing just that: spark-sklearn. Check out the package here.
In order to run spark-sklearn, we took the following steps:
"from spark_sklearn import GridSearchCV"
"gs = GridSearchCV(sc, clf, param_grid)"
In other words, the grid search takes SparkContext as an extra parameter. Because of that, the process can be parallelized across multiple cores, which saves a lot of time. For more information on parallelizing GridSearchCV using Spark, see this DataBricks tutorial and this AWS EC2 PySpark tutorial. Note: we ran the PySpark code in the PySpark REPL, rather than in a script. We hit issues with dependencies using Python scripts. We appear not to be alone in this issue; other data scientists have also hit a wall using scikit-learn with Spark.
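A minimal sketch of this pattern is shown below; it assumes it is run inside the PySpark REPL, where sc is the live SparkContext, and the estimator and grid shown are illustrative rather than the full search we ran.
In [ ]:
# Minimal sketch of a spark-sklearn grid search, run in the PySpark REPL where
# `sc` (the SparkContext) already exists. The estimator and grid are illustrative.
from spark_sklearn import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

param_grid = {'alpha': [0.001, 0.01, 0.1, 1.0]}
gs = GridSearchCV(sc, MultinomialNB(), param_grid, scoring='neg_log_loss')
gs.fit(train_data, train_labels)
print(gs.best_params_)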
In [28]:
list_for_ks = []
list_for_ws = []
list_for_ps = []
list_for_log_loss = []
def k_neighbors_tuned(k,w,p):
tuned_KNN = KNeighborsClassifier(n_neighbors=k, weights=w, p=p).fit(mini_train_data, mini_train_labels)
dev_prediction_probabilities = tuned_KNN.predict_proba(mini_dev_data)
list_for_ks.append(this_k)
list_for_ws.append(this_w)
list_for_ps.append(this_p)
working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = dev_prediction_probabilities, labels = crime_labels_mini_dev)
list_for_log_loss.append(working_log_loss)
#print("Multi-class Log Loss with KNN and k,w,p =", k,",",w,",", p, "is:", working_log_loss)
k_value_tuning = [i for i in range(1,5002,500)]
weight_tuning = ['uniform', 'distance']
power_parameter_tuning = [1,2]
start = time.clock()
for this_k in k_value_tuning:
for this_w in weight_tuning:
for this_p in power_parameter_tuning:
k_neighbors_tuned(this_k, this_w, this_p)
index_best_logloss = np.argmin(list_for_log_loss)
print('For KNN the best log loss with hyperparameter tuning is',list_for_log_loss[index_best_logloss], 'with k =', list_for_ks[index_best_logloss], 'w =', list_for_ws[index_best_logloss], 'p =', list_for_ps[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')
Here we will calibrate the KNN classifier with both Platt Scaling and with Isotonic Regression using CalibratedClassifierCV with various parameter settings. The "method" parameter can be set to "sigmoid" or to "isotonic", corresponding to Platt Scaling and to Isotonic Regression respectively.
In [11]:
list_for_ks = []
list_for_ws = []
list_for_ps = []
list_for_ms = []
list_for_log_loss = []
def knn_calibrated(k,w,p,m):
tuned_KNN = KNeighborsClassifier(n_neighbors=k, weights=w, p=p).fit(mini_train_data, mini_train_labels)
dev_prediction_probabilities = tuned_KNN.predict_proba(mini_dev_data)
ccv = CalibratedClassifierCV(tuned_KNN, method = m, cv = 'prefit')
ccv.fit(mini_calibrate_data, mini_calibrate_labels)
ccv_prediction_probabilities = ccv.predict_proba(mini_dev_data)
list_for_ks.append(this_k)
list_for_ws.append(this_w)
list_for_ps.append(this_p)
list_for_ms.append(this_m)
working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = ccv_prediction_probabilities, labels = crime_labels_mini_dev)
list_for_log_loss.append(working_log_loss)
print("Multi-class Log Loss with KNN and k,w,p =", k,",",w,",",p,",",m,"is:", working_log_loss)
k_value_tuning = ([i for i in range(1,21,1)] + [j for j in range(25,51,5)] + [k for k in range(55,22000,1000)])
weight_tuning = ['uniform', 'distance']
power_parameter_tuning = [1,2]
methods = ['sigmoid', 'isotonic']
start = time.clock()
for this_k in k_value_tuning:
for this_w in weight_tuning:
for this_p in power_parameter_tuning:
for this_m in methods:
knn_calibrated(this_k, this_w, this_p, this_m)
index_best_logloss = np.argmin(list_for_log_loss)
print('For KNN the best log loss with hyperparameter tuning and calibration is',list_for_log_loss[index_best_logloss], 'with k =', list_for_ks[index_best_logloss], 'w =', list_for_ws[index_best_logloss], 'p =', list_for_ps[index_best_logloss], 'm =', list_for_ms[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')
For the Bernoulli Naive Bayes classifier, we seek to optimize the alpha parameter (the Laplace smoothing parameter) and the binarize parameter (the threshold for binarizing the sample features). For the binarize parameter, we will create arbitrary thresholds over which our features, which are not binary/boolean features, will be binarized.
In [44]:
list_for_as = []
list_for_bs = []
list_for_log_loss = []
def BNB_tuned(a,b):
bnb_tuned = BernoulliNB(alpha = a, binarize = b).fit(mini_train_data, mini_train_labels)
dev_prediction_probabilities = bnb_tuned.predict_proba(mini_dev_data)
list_for_as.append(this_a)
list_for_bs.append(this_b)
working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = dev_prediction_probabilities, labels = crime_labels_mini_dev)
list_for_log_loss.append(working_log_loss)
#print("Multi-class Log Loss with BNB and a,b =", a,",",b,"is:", working_log_loss)
alpha_tuning = [0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 1.0, 1.1, 1.2, 1.4, 1.6, 1.8, 2.0, 10.0]
binarize_thresholds_tuning = [1e-20, 1e-19, 1e-18, 1e-17, 1e-16, 1e-15, 1e-14, 1e-13, 1e-12, 1e-11, 1e-10, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 0.999, 0.9999]
start = time.clock()
for this_a in alpha_tuning:
for this_b in binarize_thresholds_tuning:
BNB_tuned(this_a, this_b)
index_best_logloss = np.argmin(list_for_log_loss)
print('For BNB the best log loss with hyperparameter tuning is',list_for_log_loss[index_best_logloss], 'with alpha =', list_for_as[index_best_logloss], 'binarization threshold =', list_for_bs[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')
Here we will calibrate the BNB classifier with both Platt Scaling and with Isotonic Regression using CalibratedClassifierCV with various parameter settings. The "method" parameter can be set to "sigmoid" or to "isotonic", corresponding to Platt Scaling and to Isotonic Regression respectively.
In [8]:
list_for_as = []
list_for_bs = []
list_for_ms = []
list_for_log_loss = []
def BNB_calibrated(a,b,m):
bnb_tuned = BernoulliNB(alpha = a, binarize = b).fit(mini_train_data, mini_train_labels)
dev_prediction_probabilities = bnb_tuned.predict_proba(mini_dev_data)
ccv = CalibratedClassifierCV(bnb_tuned, method = m, cv = 'prefit')
ccv.fit(mini_calibrate_data, mini_calibrate_labels)
ccv_prediction_probabilities = ccv.predict_proba(mini_dev_data)
list_for_as.append(this_a)
list_for_bs.append(this_b)
list_for_ms.append(this_m)
working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = ccv_prediction_probabilities, labels = crime_labels_mini_dev)
list_for_log_loss.append(working_log_loss)
#print("Multi-class Log Loss with BNB and a,b,m =", a,",", b,",", m, "is:", working_log_loss)
alpha_tuning = [0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 1.0, 1.1, 1.2, 1.4, 1.6, 1.8, 2.0, 10.0]
binarize_thresholds_tuning = [1e-20, 1e-19, 1e-18, 1e-17, 1e-16, 1e-15, 1e-14, 1e-13, 1e-12, 1e-11, 1e-10, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 0.999, 0.9999]
methods = ['sigmoid', 'isotonic']
start = time.clock()
for this_a in alpha_tuning:
for this_b in binarize_thresholds_tuning:
for this_m in methods:
BNB_calibrated(this_a, this_b, this_m)
index_best_logloss = np.argmin(list_for_log_loss)
print('For BNB the best log loss with hyperparameter tuning and calibration is',list_for_log_loss[index_best_logloss], 'with alpha =', list_for_as[index_best_logloss], 'binarization threshold =', list_for_bs[index_best_logloss], 'method = ', list_for_ms[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')
In [22]:
list_for_as = []
list_for_log_loss = []
def MNB_tuned(a):
mnb_tuned = MultinomialNB(alpha = a).fit(mini_train_data, mini_train_labels)
dev_prediction_probabilities =mnb_tuned.predict_proba(mini_dev_data)
list_for_as.append(this_a)
working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = dev_prediction_probabilities, labels = crime_labels_mini_dev)
list_for_log_loss.append(working_log_loss)
#print("Multi-class Log Loss with BNB and a =", a, "is:", working_log_loss)
alpha_tuning = [0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 1.0, 1.1, 1.2, 1.4, 1.6, 1.8, 2.0, 10.0]
start = time.clock()
for this_a in alpha_tuning:
MNB_tuned(this_a)
index_best_logloss = np.argmin(list_for_log_loss)
print('For MNB the best log loss with hyperparameter tuning is',list_for_log_loss[index_best_logloss], 'with alpha =', list_for_as[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')
Here we will calibrate the MNB classifier with both Platt Scaling and with Isotonic Regression using CalibratedClassifierCV with various parameter settings. The "method" parameter can be set to "sigmoid" or to "isotonic", corresponding to Platt Scaling and to Isotonic Regression respectively.
In [19]:
list_for_as = []
list_for_ms = []
list_for_log_loss = []
def MNB_calibrated(a,m):
mnb_tuned = MultinomialNB(alpha = a).fit(mini_train_data, mini_train_labels)
ccv = CalibratedClassifierCV(mnb_tuned, method = m, cv = 'prefit')
ccv.fit(mini_calibrate_data, mini_calibrate_labels)
ccv_prediction_probabilities = ccv.predict_proba(mini_dev_data)
list_for_as.append(this_a)
list_for_ms.append(this_m)
working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = ccv_prediction_probabilities, labels = crime_labels_mini_dev)
list_for_log_loss.append(working_log_loss)
#print("Multi-class Log Loss with MNB and a =", a, "and m =", m, "is:", working_log_loss)
alpha_tuning = [0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 1.0, 1.1, 1.2, 1.4, 1.6, 1.8, 2.0, 10.0]
methods = ['sigmoid', 'isotonic']
start = time.clock()
for this_a in alpha_tuning:
for this_m in methods:
MNB_calibrated(this_a, this_m)
index_best_logloss = np.argmin(list_for_log_loss)
print('For MNB the best log loss with hyperparameter tuning and calibration is',list_for_log_loss[index_best_logloss], 'with alpha =', list_for_as[index_best_logloss], 'and method =', list_for_ms[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')
In [ ]:
mnb_param_grid = {'alpha': [0.2, 0.4, 0.6, 0.8]}
MNB = GridSearchCV(MultinomialNB(), param_grid=mnb_param_grid, scoring = 'neg_log_loss')
MNB.fit(train_data, train_labels)
print("the best alpha value is:", str(MNB.best_params_['alpha']))
In [ ]:
mnb_param_grid = {'alpha': [0.3, 0.35, 0.4, 0.45, 0.5]}
MNB = GridSearchCV(MultinomialNB(), param_grid=mnb_param_grid, scoring = 'neg_log_loss')
MNB.fit(train_data, train_labels)
print("the best alpha value is:", str(MNB.best_params_['alpha']))
In [ ]:
mnb_param_grid = {'alpha': [0.31, 0.33, 0.35, 0.37, 0.39]}
MNB = GridSearchCV(MultinomialNB(), param_grid=mnb_param_grid, scoring = 'neg_log_loss')
MNB.fit(train_data, train_labels)
print("the best alpha value is:", str(MNB.best_params_['alpha']))
In [ ]:
# Generate the log loss from what we have tuned thus far
MNBPredictionProbabilities = MNB.predict_proba(dev_data)
print("Multi-class Log Loss:", log_loss(y_true = dev_labels, y_pred = MNBPredictionProbabilities, labels = crime_labels), "\n\n")
In [ ]:
mnb_param_grid = {'alpha': [0.340, 0.345, 0.35, 0.355, 0.360]}
MNB = GridSearchCV(MultinomialNB(), param_grid=mnb_param_grid, scoring = 'neg_log_loss')
MNB.fit(train_data, train_labels)
print("the best alpha value is:", str(MNB.best_params_['alpha']))
In [ ]:
# Generate the log loss from what we have tuned thus far
MNBPredictionProbabilities = MNB.predict_proba(dev_data)
print("Multi-class Log Loss:", log_loss(y_true = dev_labels, y_pred = MNBPredictionProbabilities, labels = crime_labels), "\n\n")
For the Gaussian Naive Bayes classifier there are no inherent parameters within the classifier function to optimize, but we will compare our log loss before and after adding noise to the data, which is hypothesized to give it a more normal (Gaussian) distribution, as assumed by the GNB classifier.
In [17]:
def GNB_pre_tune():
gnb_pre_tuned = GaussianNB().fit(mini_train_data, mini_train_labels)
dev_prediction_probabilities =gnb_pre_tuned.predict_proba(mini_dev_data)
working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = dev_prediction_probabilities, labels = crime_labels_mini_dev)
print("Multi-class Log Loss with pre-tuned GNB is:", working_log_loss)
GNB_pre_tune()
def GNB_post_tune():
# Gaussian Naive Bayes requires the data to have a relatively normal distribution. Sometimes
# adding noise can improve performance by making the data more normal:
mini_train_data_noise = np.random.rand(mini_train_data.shape[0],mini_train_data.shape[1])
modified_mini_train_data = np.multiply(mini_train_data,mini_train_data_noise)
gnb_with_noise = GaussianNB().fit(modified_mini_train_data,mini_train_labels)
dev_prediction_probabilities =gnb_with_noise.predict_proba(mini_dev_data)
working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = dev_prediction_probabilities, labels = crime_labels_mini_dev)
print("Multi-class Log Loss with tuned GNB via addition of noise to normalize the data's distribution is:", working_log_loss)
GNB_post_tune()
Here we will calibrate the GNB classifier with both Platt Scaling and with Isotonic Regression using CalibratedClassifierCV with various parameter settings. The "method" parameter can be set to "sigmoid" or to "isotonic", corresponding to Platt Scaling and to Isotonic Regression respectively.
In [21]:
list_for_ms = []
list_for_log_loss = []
def GNB_calibrated(m):
# Gaussian Naive Bayes requires the data to have a relatively normal distribution. Sometimes
# adding noise can improve performance by making the data more normal:
mini_train_data_noise = np.random.rand(mini_train_data.shape[0],mini_train_data.shape[1])
modified_mini_train_data = np.multiply(mini_train_data,mini_train_data_noise)
gnb_with_noise = GaussianNB().fit(modified_mini_train_data,mini_train_labels)
ccv = CalibratedClassifierCV(gnb_with_noise, method = m, cv = 'prefit')
ccv.fit(mini_calibrate_data, mini_calibrate_labels)
ccv_prediction_probabilities = ccv.predict_proba(mini_dev_data)
list_for_ms.append(this_m)
working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = ccv_prediction_probabilities, labels = crime_labels_mini_dev)
list_for_log_loss.append(working_log_loss)
#print("Multi-class Log Loss with tuned GNB via addition of noise to normalize the data's distribution and after calibration is:", working_log_loss, 'with calibration method =', m)
methods = ['sigmoid', 'isotonic']
start = time.clock()
for this_m in methods:
GNB_calibrated(this_m)
index_best_logloss = np.argmin(list_for_log_loss)
print('For GNB the best log loss with tuning and calibration is',list_for_log_loss[index_best_logloss], 'with method =', list_for_ms[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')
In [ ]:
cValsL1 = [7.5, 10.0, 12.5, 20.0]
methods = ['sigmoid', 'isotonic']
cv = 2
tol = 0.01
for c in cValsL1:
for m in methods:
ccvL1 = CalibratedClassifierCV(LogisticRegression(penalty='l1', C=c, tol=tol), method=m, cv=cv)
ccvL1.fit(mini_train_data, mini_train_labels)
print(ccvL1.get_params)
ccvL1_prediction_probabilities = ccvL1.predict_proba(mini_dev_data)
ccvL1_predictions = ccvL1.predict(mini_dev_data)
print("L1 Multi-class Log Loss:", log_loss(y_true = mini_dev_labels, y_pred = ccvL1_prediction_probabilities, labels = crime_labels_mini_dev), "\n\n")
print()
In [ ]:
cValsL1 = [15.0, 20.0, 25.0, 50.0]
method = 'sigmoid'
cv = 2
tol = 0.01
for c in cValsL1:
ccvL1 = CalibratedClassifierCV(LogisticRegression(penalty='l1', C=c, tol=tol), method=method, cv=cv)
ccvL1.fit(mini_train_data, mini_train_labels)
print(ccvL1.get_params)
ccvL1_prediction_probabilities = ccvL1.predict_proba(mini_dev_data)
ccvL1_predictions = ccvL1.predict(mini_dev_data)
print("L1 Multi-class Log Loss:", log_loss(y_true = mini_dev_labels, y_pred = ccvL1_prediction_probabilities, labels = crime_labels_mini_dev), "\n\n")
print()
In [ ]:
#L1
#lr_param_grid_1 = {'C': [0, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 5.0, 10.0]}
lr_param_grid_1 = {'C': [7.5, 10.0, 12.5, 15.0, 20.0, 30.0, 50.0, 100.0]}
LR_l1 = GridSearchCV(LogisticRegression(penalty='l1'), param_grid=lr_param_grid_1, scoring='neg_log_loss')
LR_l1.fit(train_data, train_labels)
print('L1: best C value:', str(LR_l1.best_params_['C']))
LR_l1_prediction_probabilities = LR_l1.predict_proba(dev_data)
LR_l1_predictions = LR_l1.predict(dev_data)
print("L1 Multi-class Log Loss:", log_loss(y_true = dev_labels, y_pred = LR_l1_prediction_probabilities, labels = crime_labels), "\n\n")
#create an LR-L1 classifier with the best params
bestL1 = LogisticRegression(penalty='l1', C=LR_l1.best_params_['C'])
bestL1.fit(train_data, train_labels)
#L1weights = bestL1.coef_
columns = ['hour_of_day','dayofweek',\
'x','y','bayview','ingleside','northern',\
'central','mission','southern','tenderloin',\
'park','richmond','taraval','HOURLYDRYBULBTEMPF',\
'HOURLYRelativeHumidity','HOURLYWindSpeed',\
'HOURLYSeaLevelPressure','HOURLYVISIBILITY',\
'Daylight']
allCoefsL1 = pd.DataFrame(index=columns)
for a in range(len(bestL1.coef_)):
allCoefsL1[crime_labels[a]] = bestL1.coef_[a]
allCoefsL1
f = plt.figure(figsize=(15,8))
allCoefsL1.plot(kind='bar', figsize=(15,8))
plt.legend(loc='center left', bbox_to_anchor=(1.0,0.5))
plt.show()
#L2
lr_param_grid_2 = {'C': [0.0001, 0.01, 1.0, 10.0, 50.0, 100.0, 150.0, 250.0, 500.0], \
'solver':['liblinear','newton-cg','lbfgs', 'sag']}
LR_l2 = GridSearchCV(LogisticRegression(penalty='l2'), param_grid=lr_param_grid_2, scoring='neg_log_loss')
LR_l2.fit(train_data, train_labels)
print('L2: best C value:', str(LR_l2.best_params_['C']))
print('L2: best solver:', str(LR_l2.best_params_['solver']))
LR_l2_prediction_probabilities = LR_l2.predict_proba(dev_data)
LR_l2_predictions = LR_l2.predict(dev_data)
print("L2 Multi-class Log Loss:", log_loss(y_true = dev_labels, y_pred = LR_l2_prediction_probabilities, labels = crime_labels), "\n\n")
#create an LR-L2 classifier with the best params
bestL2 = LogisticRegression(penalty='l2', solver=LR_l2.best_params_['solver'], C=LR_l2.best_params_['C'])
bestL2.fit(train_data, train_labels)
#L2weights = bestL2.coef_
columns = ['hour_of_day','dayofweek',\
'x','y','bayview','ingleside','northern',\
'central','mission','southern','tenderloin',\
'park','richmond','taraval','HOURLYDRYBULBTEMPF',\
'HOURLYRelativeHumidity','HOURLYWindSpeed',\
'HOURLYSeaLevelPressure','HOURLYVISIBILITY',\
'Daylight']
allCoefsL2 = pd.DataFrame(index=columns)
for a in range(len(bestL2.coef_)):
allCoefsL2[crime_labels[a]] = bestL2.coef_[a]
allCoefsL2
f = plt.figure(figsize=(15,8))
allCoefsL2.plot(kind='bar', figsize=(15,8))
plt.legend(loc='center left', bbox_to_anchor=(1.0,0.5))
plt.show()
For the Decision Tree classifier, we sought to optimize the following classifier parameters (as shown in the tuning code below): criterion, splitter, max_depth, min_samples_leaf, class_weight, and max_features.
Other adjustable parameters include min_samples_split, min_weight_fraction_leaf, and max_leaf_nodes.
Setting min_samples_leaf to approximately 1% of the data points can stop the tree from inappropriately classifying outliers, which can help improve accuracy (it is unclear whether it significantly improves multi-class log loss) [].
In [46]:
list_for_cs = []
list_for_ss = []
list_for_mds = []
list_for_mss = []
list_for_cws = []
list_for_fs = []
list_for_log_loss = []
def DT_tuned(c,s,md,ms,cw,f):
tuned_DT = DecisionTreeClassifier(criterion=c, splitter=s, max_depth=md, min_samples_leaf=ms, max_features=f, class_weight=cw).fit(mini_train_data, mini_train_labels)
dev_prediction_probabilities = tuned_DT.predict_proba(mini_dev_data)
list_for_cs.append(this_c)
list_for_ss.append(this_s)
list_for_mds.append(this_md)
list_for_mss.append(this_ms)
list_for_cws.append(this_cw)
list_for_fs.append(this_f)
working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = dev_prediction_probabilities, labels = crime_labels_mini_dev)
list_for_log_loss.append(working_log_loss)
#print("Multi-class Log Loss with DT and c,s,md,ms,cw,f =", c,",",s,",", md,",",ms,",",cw,",",f,"is:", working_log_loss)
criterion_tuning = ['gini', 'entropy']
splitter_tuning = ['best', 'random']
max_depth_tuning = ([None,6,7,8,9,10,11,12,13,14,15,16,17,18,19])
min_samples_leaf_tuning = [x + 1 for x in [i for i in range(0,int(0.091*len(mini_train_data)),100)]]
class_weight_tuning = [None, 'balanced']
max_features_tuning = ['auto', 2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]
start = time.clock()
for this_c in criterion_tuning:
for this_s in splitter_tuning:
for this_md in max_depth_tuning:
for this_ms in min_samples_leaf_tuning:
for this_cw in class_weight_tuning:
for this_f in max_features_tuning:
DT_tuned(this_c, this_s, this_md, this_ms, this_cw, this_f)
index_best_logloss = np.argmin(list_for_log_loss)
print('For DT the best log loss with hyperparameter tuning is',list_for_log_loss[index_best_logloss], 'with criterion =', list_for_cs[index_best_logloss], 'splitter =', list_for_ss[index_best_logloss], 'max_depth =', list_for_mds[index_best_logloss], 'min_samples_leaf =', list_for_mss[index_best_logloss], 'class_weight =', list_for_cws[index_best_logloss], 'max_features =', list_for_fs[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')
In [48]:
list_for_cs = []
list_for_ss = []
list_for_mds = []
list_for_mss = []
list_for_cws = []
list_for_fs = []
list_for_cms = []
list_for_log_loss = []
def DT_calibrated(c,s,md,ms,cw,f,cm):
tuned_DT = DecisionTreeClassifier(criterion=c, splitter=s, max_depth=md, min_samples_leaf=ms, max_features=f, class_weight=cw).fit(mini_train_data, mini_train_labels)
ccv = CalibratedClassifierCV(tuned_DT, method = cm, cv = 'prefit')
ccv.fit(mini_calibrate_data, mini_calibrate_labels)
ccv_prediction_probabilities = ccv.predict_proba(mini_dev_data)
list_for_cs.append(this_c)
list_for_ss.append(this_s)
list_for_mds.append(this_md)
list_for_mss.append(this_ms)
list_for_cws.append(this_cw)
list_for_fs.append(this_f)
list_for_cms.append(this_cm)
working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = ccv_prediction_probabilities, labels = crime_labels_mini_dev)
list_for_log_loss.append(working_log_loss)
print("Multi-class Log Loss with DT and c,s,md,ms,cw,f =", c,",",s,",", md,",",ms,",",cw,",",f,",",cm,"is:", working_log_loss)
criterion_tuning = ['gini', 'entropy']
splitter_tuning = ['best', 'random']
max_depth_tuning = ([None,6,7,8,9,10,11,12,13,14,15,16,17,18,19])
min_samples_leaf_tuning = [x + 1 for x in [i for i in range(0,int(0.091*len(mini_train_data)),100)]]
class_weight_tuning = [None, 'balanced']
max_features_tuning = ['auto', 2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]
methods = ['sigmoid', 'isotonic']
start = time.clock()
for this_c in criterion_tuning:
for this_s in splitter_tuning:
for this_md in max_depth_tuning:
for this_ms in min_samples_leaf_tuning:
for this_cw in class_weight_tuning:
for this_f in max_features_tuning:
for this_cm in methods:
DT_calibrated(this_c, this_s, this_md, this_ms, this_cw, this_f, this_cm)
index_best_logloss = np.argmin(list_for_log_loss)
print('For DT the best log loss with hyperparameter tuning and calibration is',list_for_log_loss[index_best_logloss], 'with criterion =', list_for_cs[index_best_logloss], 'splitter =', list_for_ss[index_best_logloss], 'max_depth =', list_for_mds[index_best_logloss], 'min_samples_leaf =', list_for_mss[index_best_logloss], 'class_weight =', list_for_cws[index_best_logloss], 'max_features =', list_for_fs[index_best_logloss], 'and calibration method =', list_for_cms[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')
In [ ]:
from sklearn import svm
from sklearn.metrics import log_loss
class_labels = list(set(train_labels))
def enumerateY(data):
enumerated_data = np.zeros(data.size)
for j in range(0,data.size):
feature = data[j]
i = class_labels.index(feature)
enumerated_data[j]+= i
return (np.array(enumerated_data))
train_labels_enum = enumerateY(mini_train_labels)
dev_labels_enum = enumerateY(mini_dev_labels)
In [ ]:
# GridSearchCV to find optimal parameters
parameters = {'C': (0.001, 0.01, 0.1, 0.5, 1, 5, 10, 25, 100),
'decision_function_shape': ('ovo', 'ovr'),
'kernel': ('linear', 'sigmoid', 'rbf', 'poly')}
clf = GridSearchCV(svm.SVC(tol = 0.01, probability = True), parameters, scoring = "log_loss")
clf.fit(mini_train_data, mini_train_labels).predict_proba(mini_dev_data)
print("optimal parameters:", clf.best_params_)
In [ ]:
# Most optimal parameters returned score of 2.62651014795
clf_svc = svm.SVC(kernel='rbf', C=100, decision_function_shape='ovr', probability = True).fit(mini_train_data, train_labels_enum)
predicted = clf_svc.fit(mini_train_data, train_labels_enum).predict_proba(mini_dev_data)
print(log_loss(y_true = dev_labels_enum, y_pred = predicted, labels = dev_labels_enum))
The benefit of neural networks is that they can learn a non-linear function for classification. Relative to logistic regression, there are one or more "hidden" layers between the input and output layers.
The disadvantages, however, are that (1) the loss function is non-convex, so there can be more than one local minimum, meaning that results can vary considerably depending on where the weights are initialized, and (2) the number of hyperparameters to tune is high, which requires a lot of computing power.
MLP trains using backpropagation: it trains with gradient descent and, for classification, minimizes the cross-entropy loss function, producing a vector of probability estimates.
For the Neural Networks MLP classifier, the following classifier parameters are subject to optimization: 1) hidden_layer_sizes, which includes the number of hidden layers as well as the number of hidden nodes; 2) activation function: 'identity', 'logistic', 'tanh', or 'relu'; 3) alpha, the L2 regularization term used to avoid overfitting; 4) learning_rate: 'constant', 'invscaling', or 'adaptive'; 5) solver: 'lbfgs', 'sgd', or 'adam'; and 6) batch_size.
Given the limitations on compute power, we took a targeted approach to training our MLP model. We ran GridSearchCV several times over subsets of parameters, which is obviously not ideal given the number of parameters. We also attempted to test the potential impact of the scikit-neuralnetwork (sknn) classifier, which can also incorporate dropout.
In [ ]:
# One of the iterations of gridSearchCV we ran
import itertools
hidden_layers = [2, 3, 4]
for i in hidden_layers:
hidden_layer_sizes = [x for x in itertools.product((4, 8, 12, 16, 20),repeat=i)]
parameters = {'activation' : ('logistic', 'tanh', 'relu'),
'batch_size': (50, 100, 200),
'alpha': (0.001, 0.01, 0.1, 0.25),
'hidden_layer_sizes': hidden_layer_sizes}
clf = GridSearchCV(MLPClassifier(tol = 0.01), parameters, scoring = "log_loss")
clf.fit(mini_train_data, mini_train_labels).predict_proba(mini_dev_data)
print("optimal parameters: ",clf.best_params_)
In [ ]:
# Our best parameter return
clf = MLPClassifier(tol=0.01, activation = 'tanh', hidden_layer_sizes = (20,), batch_size = 10, alpha = 0.01).fit(mini_train_data, mini_train_labels)
predicted = clf.predict(mini_dev_data)
predicted_prob = clf.predict_proba(mini_dev_data)
score = log_loss(mini_dev_labels, predicted_prob)
print(score)
We take the best performing parameters from the tuning procedures above and add dropout method to illustrate the potential impact there. Dropout method would not have materially improved the score had we used it.
In [ ]:
from sknn.mlp import Classifier, Layer
nn = Classifier(
layers=[
Layer("Tanh", units = 20, dropout = .25),
Layer("Softmax")], batch_size = 10, learning_rate = .001)
In [ ]:
clf = nn.fit(mini_train_data, mini_train_labels)
predicted = clf.predict(mini_dev_data)
predicted_prob = clf.predict_proba(mini_dev_data)
score = log_loss(mini_dev_labels, predicted_prob)
print(score)
In [ ]:
##We attempted to run a more comprehensive hyperparameter tuning procedure including number of iterations and ranges for dropout method. It was too computationally intensive. Even on a remote AWS instance on pyspark it kept stalling out.
from sklearn.grid_search import GridSearchCV
gs = GridSearchCV(nn, param_grid={
'learning_rate': [0.05, 0.01, 0.005, 0.001],
'n_iter': [25, 50, 100, 200],
'hidden0__units': [4, 8, 12, 16, 20],
'hidden0__type': ["Rectifier", "Sigmoid", "Tanh"],
'hidden0__dropout':[0.2, 0.3, 0.4],
'hidden1__units': [4, 8, 12, 16, 20],
'hidden1__type': ["Rectifier", "Sigmoid", "Tanh"],
'hidden1__dropout':[0.2, 0.3, 0.4],
'hidden2__units': [4, 8, 12, 16, 20],
'hidden2__type': ["Rectifier", "Sigmoid", "Tanh"],
'hidden2__dropout':[0.2, 0.3, 0.4]})
gs.fit(mini_train_data, mini_train_labels)
predicted = gs.predict(mini_dev_data)
predicted_prob = gs.predict_proba(mini_dev_data)
score = log_loss(mini_dev_labels, predicted_prob)
In [ ]:
# Further calibration using CalibratedClassifierCV
list_for_log_loss = []

def NN_calibrated(m):
    # Fit the tuned MLP, then calibrate it on the held-out calibration set
    tuned_NN = MLPClassifier(tol=0.01, activation='tanh', hidden_layer_sizes=(20,),
                             batch_size=10, alpha=0.001).fit(mini_train_data, mini_train_labels)
    ccv = CalibratedClassifierCV(tuned_NN, method=m, cv='prefit')
    ccv.fit(mini_calibrate_data, mini_calibrate_labels)
    ccv_prediction_probabilities = ccv.predict_proba(mini_dev_data)
    working_log_loss = log_loss(y_true=mini_dev_labels, y_pred=ccv_prediction_probabilities,
                                labels=crime_labels_mini_dev)
    list_for_log_loss.append(working_log_loss)
    print("Multi-class Log Loss with NN is:", working_log_loss)

methods = ['sigmoid', 'isotonic']
for m in methods:
    NN_calibrated(m)

index_best_logloss = np.argmin(list_for_log_loss)
print('For NN the best log loss with hyperparameter tuning and calibration is', list_for_log_loss[index_best_logloss])
For the Random Forest classifier, we can seek to optimize the following classifier parameters: n_estimators (the number of trees in the forest), max_features, max_depth, min_samples_leaf, bootstrap (whether or not bootstrap samples are used when building trees), and oob_score (whether or not out-of-bag samples are used to estimate the generalization accuracy).
The Random Forest tuning and calibration results are included only in our Final Notebook; please refer to it. An illustrative sketch of how such a search might be set up appears below.
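The cell below is a minimal sketch only, assuming the mini_train_data/mini_dev_data split and the crime_labels_mini_dev labels used elsewhere in this notebook; the grid values shown are hypothetical and are not the settings tuned in the Final Notebook.
In [ ]:
# Illustrative only: a grid search over the Random Forest parameters listed above.
# The grid values here are hypothetical; the actual tuned settings live in the Final Notebook.
rf_param_grid = {'n_estimators': [100, 300],
                 'max_features': [0.5, 'sqrt'],
                 'max_depth': [None, 10],
                 'min_samples_leaf': [1, 10]}
RF_search = GridSearchCV(RandomForestClassifier(bootstrap=True), param_grid=rf_param_grid, scoring='neg_log_loss')
RF_search.fit(mini_train_data, mini_train_labels)
print("optimal parameters: ", RF_search.best_params_)
RFPredictionProbabilities = RF_search.predict_proba(mini_dev_data)
print("Multi-class Log Loss:", log_loss(y_true=mini_dev_labels, y_pred=RFPredictionProbabilities,
                                        labels=crime_labels_mini_dev))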
In [ ]:
#learning_rate : float, optional (default=0.1)
#n_estimators : int (default=100)
#max_depth : integer, optional (default=3)
#criterion : string, optional (default="friedman_mse")
#min_samples_split : int, float, optional (default=2)
#min_samples_leaf : int, float, optional (default=1)
#min_weight_fraction_leaf : float, optional (default=0.)
#subsample : float, optional (default=1.0)
#max_features : int, float, string or None, optional (default=None)
#max_leaf_nodes : int or None, optional (default=None)
nList = [75, 125]
depthList = [1, 5]
leafList = [2, 7]
featuresList = [8, 17]
In [ ]:
#gb_param_grid = {'n_estimators':[25, 75, 250, 750], 'max_depth': [1, 5, 9], \
#                 'min_samples_leaf': [2, 7, 12], 'max_features': [3, 8, 17]}
#GB = GridSearchCV(GradientBoostingClassifier(), param_grid=gb_param_grid, scoring='neg_log_loss')
#GB.fit(train_data, train_labels)
for n_estimators in nList:
    for max_depth in depthList:
        for min_samples_leaf in leafList:
            for max_features in featuresList:
                gbTest = GradientBoostingClassifier(n_estimators=n_estimators, max_depth=max_depth, \
                                                    min_samples_leaf=min_samples_leaf, max_features=max_features)
                gbTest.fit(mini_train_data, mini_train_labels)
                gbTestPredictionProbabilities = gbTest.predict_proba(mini_dev_data)
                print("Parameters:")
                print("n_estimators:", str(n_estimators)+";", " max_depth:", str(max_depth)+";", \
                      " min_samples_leaf:", str(min_samples_leaf)+";", " max_features:", str(max_features))
                print("Multi-class Log Loss:", log_loss(y_true = mini_dev_labels, y_pred = gbTestPredictionProbabilities, \
                                                        labels = crime_labels_mini_dev), "\n\n")
                print()
In [ ]:
gb_param_grid = {'n_estimators':[25, 75, 250, 750], 'max_depth': [1, 5, 9], \
'min_samples_leaf': [2, 7, 12], 'max_features': [3, 8, 17]}
GB = GridSearchCV(GradientBoostingClassifier(), param_grid=gb_param_grid, scoring='neg_log_loss')
GB.fit(train_data, train_labels)
In [ ]:
GB.best_params_
GBPredictionProbabilities = GB.predict_proba(dev_data)
print("Multi-class Log Loss:", log_loss(y_true = dev_labels, y_pred = GBPredictionProbabilities, labels = crime_labels), "\n\n")
For the AdaBoostClassifier, one can seek to optimize the following meta-estimator parameters: base_estimator (the possible classifiers over which you are providing a meta-estimation with Adaptive Boosting), n_estimators (the max number of estimators at which boosting is terminated), algorithm ('SAMME' or 'SAMME.R').
We considered running the AdaBoostClassifier on each different classifier from above, using the classifier settings with optimized Multi-class Log Loss after hyperparameter tuning and calibration. However, with further research, we discovered that Adaptive Boosting in its current state can either increase or decrease performance compared to the initial, non-boosted model. Further detail on this research with respect to KNN follows. As such, we decided to use the Bagging meta-estimator.
We will not seek to apply Adaptive Boosting (AdaBoost) to our K-Nearest Neighbors classifier. It has been shown that AdaBoost actually reduces accuracy and other performance metrics when used with nearest-neighbor classifiers [4]. This is because KNN is a stable algorithm with low variance, which leads to correlated errors across AdaBoost iterations. There is, however, ongoing research into ways of modifying the AdaBoost algorithm so that it improves KNN classifiers. For these reasons, we omit Adaptive Boosting of our hyperparameter-tuned and calibrated KNN classifier.
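For concreteness only, the cell below sketches how AdaBoost could be wired around a base classifier using the meta-estimator parameters named above (base_estimator, n_estimators, algorithm). We did not run this as part of our pipeline; the base estimator and values shown are assumptions for illustration.
In [ ]:
# Illustrative only -- we ultimately did not use AdaBoost (see the discussion above).
# base_estimator, n_estimators, and algorithm are the meta-estimator parameters described earlier.
ada_clf = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3),
                             n_estimators=100,
                             algorithm='SAMME.R')
ada_clf.fit(mini_train_data, mini_train_labels)
adaPredictionProbabilities = ada_clf.predict_proba(mini_dev_data)
print("Multi-class Log Loss:", log_loss(y_true=mini_dev_labels, y_pred=adaPredictionProbabilities,
                                        labels=crime_labels_mini_dev))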
For the Bagging meta-classifier, we can seek to optimize the following parameters: n_estimators (the number of base estimators in the ensemble), max_samples, max_features, bootstrap (whether or not samples are drawn with replacement), bootstrap_features (whether features are drawn with replacement), and oob_score (whether or not out-of-bag samples are used to estimate the generalization accuracy).
We attempted to run the BaggingClassifier on our best model after performing the previous steps, which was the hyperparameter-tuned and calibrated Random Forest. However, we were not able to find a way to run this in a distributed fashion, and after more than 18 hours of runtime we stopped the attempt.
In [ ]:
RF_tuned = RandomForestClassifier(min_impurity_split=1, n_estimators=100, bootstrap=True,
                                  max_features=0.75, criterion='entropy', min_samples_leaf=10,
                                  max_depth=None).fit(train_data, train_labels)
#RF_calibrated_and_tuned_pre_fit = CalibratedClassifierCV(RF_tuned, method = 'isotonic', cv = 'prefit')
#RF_calibrated_and_tuned_fitted = RF_calibrated_and_tuned_pre_fit.fit(mini_calibrate_data,mini_calibrate_labels)

list_for_ns = []
list_for_mss = []
list_for_mfs = []
list_for_bs = []
list_for_bfs = []
list_for_os = []
list_for_log_loss = []

def Bagging_RF(n, ms, mf, b, bf, o):
    # Bag the tuned Random Forest with the given meta-estimator parameters and record the log loss
    bagging_clf = BaggingClassifier(base_estimator=RF_tuned, n_estimators=n, max_samples=ms,
                                    max_features=mf, bootstrap=b, bootstrap_features=bf, oob_score=o)
    bagging_clf_fitted = bagging_clf.fit(train_data, train_labels)
    bagging_prediction_probabilities = bagging_clf_fitted.predict_proba(dev_data)
    list_for_ns.append(n)
    list_for_mss.append(ms)
    list_for_mfs.append(mf)
    list_for_bs.append(b)
    list_for_bfs.append(bf)
    list_for_os.append(o)
    working_log_loss = log_loss(y_true=dev_labels, y_pred=bagging_prediction_probabilities, labels=crime_labels)
    list_for_log_loss.append(working_log_loss)
    #print("Multi-class Log Loss with RF tuned/calibrated/bagged and n,ms,mf,b,bf,o =", n,",",ms,",",mf,",",b,",",bf,",",o, "is:", working_log_loss)

n_estimators_list = [i for i in range(1, 1002, 100)]
max_samples_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
max_features_list = [0.1, 0.2, 0.4, 0.5, 0.75, 1.0]
bootstrap_list = [True]
bootstrap_features_list = [True, False]
oob_score_list = [False]

start = time.clock()
for this_ns in n_estimators_list:
    for this_mss in max_samples_list:
        for this_mfs in max_features_list:
            for this_bs in bootstrap_list:
                for this_bfs in bootstrap_features_list:
                    for this_os in oob_score_list:
                        Bagging_RF(this_ns, this_mss, this_mfs, this_bs, this_bfs, this_os)

index_best_logloss = np.argmin(list_for_log_loss)
print('For RF the best log loss with hyperparameter tuning, calibration, and bagging is',
      list_for_log_loss[index_best_logloss], 'with n_estimators =', list_for_ns[index_best_logloss],
      'max_samples =', list_for_mss[index_best_logloss], 'max_features =', list_for_mfs[index_best_logloss],
      'bootstrap =', list_for_bs[index_best_logloss], 'bootstrap_features =', list_for_bfs[index_best_logloss],
      'oob_score =', list_for_os[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')
In [ ]:
tuned_DT_calibrate_isotonic = RandomForestClassifier(min_impurity_split=1, n_estimators=100, bootstrap=True,
                                                     max_features=15, criterion='entropy', min_samples_leaf=10,
                                                     max_depth=None).fit(train_data, train_labels)
ccv_isotonic = CalibratedClassifierCV(tuned_DT_calibrate_isotonic, method = 'isotonic', cv = 'prefit')
ccv_isotonic.fit(calibrate_data, calibrate_labels)
ccv_predictions = ccv_isotonic.predict(dev_data)
ccv_prediction_probabilities_isotonic = ccv_isotonic.predict_proba(dev_data)
working_log_loss_isotonic = log_loss(y_true = dev_labels, y_pred = ccv_prediction_probabilities_isotonic, labels = crime_labels)
print("Multi-class Log Loss with RF and calibration with isotonic is:", working_log_loss_isotonic)
In [ ]:
pd.DataFrame(np.amax(ccv_prediction_probabilities_isotonic, axis=1)).hist()
In [ ]:
#clf_probabilities, clf_predictions, labels
import matplotlib.pyplot as plt

def error_analysis_calibration(buckets, clf_probabilities, clf_predictions, labels):
    """inputs:
    clf_probabilities = clf.predict_proba(dev_data)
    clf_predictions = clf.predict(dev_data)
    labels = dev_labels"""
    #buckets = [0.05, 0.15, 0.3, 0.5, 0.8]
    #buckets = [0.15, 0.25, 0.3, 1.0]
    correct = [0 for i in buckets]
    total = [0 for i in buckets]
    lLimit = 0
    uLimit = 0
    for i in range(len(buckets)):
        uLimit = buckets[i]
        # Count predictions whose maximum posterior probability falls in this bucket
        for j in range(clf_probabilities.shape[0]):
            if (np.amax(clf_probabilities[j]) > lLimit) and (np.amax(clf_probabilities[j]) <= uLimit):
                if clf_predictions[j] == labels[j]:
                    correct[i] += 1
                total[i] += 1
        lLimit = uLimit
    print(sum(correct))
    print(sum(total))
    print(correct)
    print(total)
    #here we report the classifier accuracy for each posterior probability bucket
    accuracies = []
    for k in range(len(buckets)):
        print(1.0*correct[k]/total[k])
        accuracies.append(1.0*correct[k]/total[k])
        print('p(pred) <= %.13f    total = %3d    correct = %3d    accuracy = %.3f' \
              %(buckets[k], total[k], correct[k], 1.0*correct[k]/total[k]))
    plt.plot(buckets, accuracies)
    plt.title("Calibration Analysis")
    plt.xlabel("Posterior Probability")
    plt.ylabel("Classifier Accuracy")
    return buckets, accuracies
In [ ]:
# Look at how the posteriors are distributed in order to set the best bins in 'buckets'
pd.DataFrame(np.amax(bestLRPredictionProbabilities, axis=1)).hist()
In [ ]:
buckets = [0.15, 0.25, 0.3, 1.0]
calibration_buckets, calibration_accuracies = error_analysis_calibration(buckets, clf_probabilities=bestLRPredictionProbabilities, \
clf_predictions=bestLRPredictions, \
labels=mini_dev_labels)
In [1]:
def error_analysis_classification_report(clf_predictions, labels):
    """inputs:
    clf_predictions = clf.predict(dev_data)
    labels = dev_labels"""
    print('Classification Report:')
    report = classification_report(labels, clf_predictions)
    print(report)
    return report
In [ ]:
# Use a distinct variable name so we do not shadow sklearn's classification_report function
lr_classification_report = error_analysis_classification_report(clf_predictions=bestLRPredictions, \
                                                                 labels=mini_dev_labels)
In [ ]:
from sklearn.metrics import confusion_matrix

def error_analysis_confusion_matrix(label_names, clf_predictions, labels):
    """inputs:
    clf_predictions = clf.predict(dev_data)
    labels = dev_labels"""
    cm = pd.DataFrame(confusion_matrix(labels, clf_predictions, labels=label_names))
    cm.columns = label_names
    cm.index = label_names
    cm.to_csv(path_or_buf="./confusion_matrix.csv")
    #print(cm)
    return cm
In [ ]:
error_analysis_confusion_matrix(label_names=crime_labels_mini_dev, clf_predictions=bestLRPredictions, \
labels=mini_dev_labels)
1) Hsiang, Solomon M. and Burke, Marshall and Miguel, Edward. "Quantifying the Influence of Climate on Human Conflict". Science, Vol 341, Issue 6151, 2013
2) Huang, Cheng-Lung and Wang, Chieh-Jen. "A GA-based feature selection and parameters optimization for support vector machines". Expert Systems with Applications, Vol 31, 2006, p 231-240
4) Neo, Toh Koon Charlie and Ventura, Dan. "A direct boosting algorithm for the k-nearest neighbor classifier via local warping of the distance metric". Pattern Recognition Letters, Vol 33, 2012, p 92-102