Kaggle San Francisco Crime Classification

Berkeley MIDS W207 Final Project: Sam Goodgame, Sarah Cha, Kalvin Kao, Bryan Moore

Environment and Data


In [2]:
# Import relevant libraries:
import time
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.grid_search import GridSearchCV  # legacy module; moved to sklearn.model_selection in 0.18
from sklearn.metrics import classification_report
from sklearn.metrics import log_loss
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
# Import Meta-estimators
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier
# Import Calibration tools
from sklearn.calibration import CalibratedClassifierCV

# Set random seed and format print output:
np.random.seed(0)
np.set_printoptions(precision=3)

DDL to construct table for SQL transformations:

CREATE TABLE kaggle_sf_crime (
  dates TIMESTAMP,
  category VARCHAR,
  descript VARCHAR,
  dayofweek VARCHAR,
  pd_district VARCHAR,
  resolution VARCHAR,
  addr VARCHAR,
  X FLOAT,
  Y FLOAT
);

Getting training data into a locally hosted PostgreSQL database:

\copy kaggle_sf_crime FROM '/Users/Goodgame/Desktop/MIDS/207/final/sf_crime_train.csv' DELIMITER ',' CSV HEADER;

SQL Query used for transformations:

SELECT
  category,
  date_part('hour', dates) AS hour_of_day,
  CASE
    WHEN dayofweek = 'Monday' then 1
    WHEN dayofweek = 'Tuesday' THEN 2
    WHEN dayofweek = 'Wednesday' THEN 3
    WHEN dayofweek = 'Thursday' THEN 4
    WHEN dayofweek = 'Friday' THEN 5
    WHEN dayofweek = 'Saturday' THEN 6
    WHEN dayofweek = 'Sunday' THEN 7
  END AS dayofweek_numeric,
  X,
  Y,
  CASE
    WHEN pd_district = 'BAYVIEW' THEN 1
    ELSE 0
  END AS bayview_binary,
    CASE
    WHEN pd_district = 'INGLESIDE' THEN 1
    ELSE 0
  END AS ingleside_binary,
    CASE
    WHEN pd_district = 'NORTHERN' THEN 1
    ELSE 0
  END AS northern_binary,
    CASE
    WHEN pd_district = 'CENTRAL' THEN 1
    ELSE 0
  END AS central_binary,
    CASE
    WHEN pd_district = 'MISSION' THEN 1
    ELSE 0
  END AS mission_binary,
    CASE
    WHEN pd_district = 'SOUTHERN' THEN 1
    ELSE 0
  END AS southern_binary,
    CASE
    WHEN pd_district = 'TENDERLOIN' THEN 1
    ELSE 0
  END AS tenderloin_binary,
    CASE
    WHEN pd_district = 'PARK' THEN 1
    ELSE 0
  END AS park_binary,
    CASE
    WHEN pd_district = 'RICHMOND' THEN 1
    ELSE 0
  END AS richmond_binary,
    CASE
    WHEN pd_district = 'TARAVAL' THEN 1
    ELSE 0
  END AS taraval_binary
FROM kaggle_sf_crime;

Loading the data, version 2, with weather features to improve performance. (Commented out for now, since running it locally would create file-dependency issues for everyone; Isabell will run it in the final notebook with the files she needs.)

We seek to add features to our models that will improve performance with respect to our desired performance metric. There is evidence of a correlation between weather patterns and crime, with some experts even arguing for a causal relationship between weather and crime [1]. More specifically, a 2013 paper published in Science showed that higher temperatures and extreme rainfall led to large increases in conflict. Given this strong evidence that weather influences crime, we see weather data as a candidate source of additional features to improve the performance of our classifiers. Weather data was gathered from (insert source). Certain features from this data set were incorporated into the original crime data set because they were hypothesized to improve performance. These features included (insert what we eventually include).


In [ ]:
#data_path = "./data/train_transformed.csv"

#df = pd.read_csv(data_path, header=0)
#x_data = df.drop('category', 1)
#y = df.category.as_matrix()

########## Adding the date back into the data
#import csv
#import time
#import calendar
#data_path = "./data/train.csv"
#dataCSV = open(data_path, 'rt')
#csvData = list(csv.reader(dataCSV))
#csvFields = csvData[0] #['Dates', 'Category', 'Descript', 'DayOfWeek', 'PdDistrict', 'Resolution', 'Address', 'X', 'Y']
#allData = csvData[1:]
#dataCSV.close()

#df2 = pd.DataFrame(allData)
#df2.columns = csvFields
#dates = df2['Dates']
#dates = dates.apply(time.strptime, args=("%Y-%m-%d %H:%M:%S",))
#dates = dates.apply(calendar.timegm)
#print(dates.head())

#x_data['secondsFromEpoch'] = dates
#colnames = x_data.columns.tolist()
#colnames = colnames[-1:] + colnames[:-1]
#x_data = x_data[colnames]
##########

########## Adding the weather data into the original crime data
#weatherData1 = "./data/1027175.csv"
#weatherData2 = "./data/1027176.csv"
#dataCSV = open(weatherData1, 'rt')
#csvData = list(csv.reader(dataCSV))
#csvFields = csvData[0] #['Dates', 'Category', 'Descript', 'DayOfWeek', 'PdDistrict', 'Resolution', 'Address', 'X', 'Y']
#allWeatherData1 = csvData[1:]
#dataCSV.close()

#dataCSV = open(weatherData2, 'rt')
#csvData = list(csv.reader(dataCSV))
#csvFields = csvData[0] #['Dates', 'Category', 'Descript', 'DayOfWeek', 'PdDistrict', 'Resolution', 'Address', 'X', 'Y']
#allWeatherData2 = csvData[1:]
#dataCSV.close()

#weatherDF1 = pd.DataFrame(allWeatherData1)
#weatherDF1.columns = csvFields
#dates1 = weatherDF1['DATE']
#sunrise1 = weatherDF1['DAILYSunrise']
#sunset1 = weatherDF1['DAILYSunset']

#weatherDF2 = pd.DataFrame(allWeatherData2)
#weatherDF2.columns = csvFields
#dates2 = weatherDF2['DATE']
#sunrise2 = weatherDF2['DAILYSunrise']
#sunset2 = weatherDF2['DAILYSunset']

#functions for processing the sunrise and sunset times of each day
#def get_hour_and_minute(milTime):
 #   hour = int(milTime[:-2])
 #   minute = int(milTime[-2:])
 #   return [hour, minute]

#def get_date_only(date):
#    return time.struct_time(tuple([date[0], date[1], date[2], 0, 0, 0, date[6], date[7], date[8]]))

#def structure_sun_time(timeSeries, dateSeries):
#    sunTimes = timeSeries.copy()
#    for index in range(len(dateSeries)):
#        sunTimes[index] = time.struct_time(tuple([dateSeries[index][0], dateSeries[index][1], dateSeries[index][2], timeSeries[index][0], timeSeries[index][1], dateSeries[index][5], dateSeries[index][6], dateSeries[index][7], dateSeries[index][8]]))
#    return sunTimes

#dates1 = dates1.apply(time.strptime, args=("%Y-%m-%d %H:%M",))
#sunrise1 = sunrise1.apply(get_hour_and_minute)
#sunrise1 = structure_sun_time(sunrise1, dates1)
#sunrise1 = sunrise1.apply(calendar.timegm)
#sunset1 = sunset1.apply(get_hour_and_minute)
#sunset1 = structure_sun_time(sunset1, dates1)
#sunset1 = sunset1.apply(calendar.timegm)
#dates1 = dates1.apply(calendar.timegm)

#dates2 = dates2.apply(time.strptime, args=("%Y-%m-%d %H:%M",))
#sunrise2 = sunrise2.apply(get_hour_and_minute)
#sunrise2 = structure_sun_time(sunrise2, dates2)
#sunrise2 = sunrise2.apply(calendar.timegm)
#sunset2 = sunset2.apply(get_hour_and_minute)
#sunset2 = structure_sun_time(sunset2, dates2)
#sunset2 = sunset2.apply(calendar.timegm)
#dates2 = dates2.apply(calendar.timegm)

#weatherDF1['DATE'] = dates1
#weatherDF1['DAILYSunrise'] = sunrise1
#weatherDF1['DAILYSunset'] = sunset1
#weatherDF2['DATE'] = dates2
#weatherDF2['DAILYSunrise'] = sunrise2
#weatherDF2['DAILYSunset'] = sunset2

#weatherDF = pd.concat([weatherDF1,weatherDF2[32:]],ignore_index=True)

# Starting off with some of the easier features to work with-- more to come here . . . still in beta
#weatherMetrics = weatherDF[['DATE','HOURLYDRYBULBTEMPF','HOURLYRelativeHumidity', 'HOURLYWindSpeed', \
#                            'HOURLYSeaLevelPressure', 'HOURLYVISIBILITY', 'DAILYSunrise', 'DAILYSunset']]
#weatherMetrics = weatherMetrics.convert_objects(convert_numeric=True)
#weatherDates = weatherMetrics['DATE']
#'DATE','HOURLYDRYBULBTEMPF','HOURLYRelativeHumidity', 'HOURLYWindSpeed',
#'HOURLYSeaLevelPressure', 'HOURLYVISIBILITY'
#timeWindow = 10800 #3 hours
#hourlyDryBulbTemp = []
#hourlyRelativeHumidity = []
#hourlyWindSpeed = []
#hourlySeaLevelPressure = []
#hourlyVisibility = []
#dailySunrise = []
#dailySunset = []
#daylight = []
#test = 0
#for timePoint in dates:#dates is the epoch time from the kaggle data
#    relevantWeather = weatherMetrics[(weatherDates <= timePoint) & (weatherDates > timePoint - timeWindow)]
#    hourlyDryBulbTemp.append(relevantWeather['HOURLYDRYBULBTEMPF'].mean())
#    hourlyRelativeHumidity.append(relevantWeather['HOURLYRelativeHumidity'].mean())
#    hourlyWindSpeed.append(relevantWeather['HOURLYWindSpeed'].mean())
#    hourlySeaLevelPressure.append(relevantWeather['HOURLYSeaLevelPressure'].mean())
#    hourlyVisibility.append(relevantWeather['HOURLYVISIBILITY'].mean())
#    dailySunrise.append(relevantWeather['DAILYSunrise'].iloc[-1])
#    dailySunset.append(relevantWeather['DAILYSunset'].iloc[-1])
#    daylight.append(1.0*((timePoint >= relevantWeather['DAILYSunrise'].iloc[-1]) and (timePoint < relevantWeather['DAILYSunset'].iloc[-1])))
    #if timePoint < relevantWeather['DAILYSunset'][-1]:
        #daylight.append(1)
    #else:
        #daylight.append(0)
    
#    if test%100000 == 0:
#        print(relevantWeather)
#    test += 1

#hourlyDryBulbTemp = pd.Series.from_array(np.array(hourlyDryBulbTemp))
#hourlyRelativeHumidity = pd.Series.from_array(np.array(hourlyRelativeHumidity))
#hourlyWindSpeed = pd.Series.from_array(np.array(hourlyWindSpeed))
#hourlySeaLevelPressure = pd.Series.from_array(np.array(hourlySeaLevelPressure))
#hourlyVisibility = pd.Series.from_array(np.array(hourlyVisibility))
#dailySunrise = pd.Series.from_array(np.array(dailySunrise))
#dailySunset = pd.Series.from_array(np.array(dailySunset))
#daylight = pd.Series.from_array(np.array(daylight))

#x_data['HOURLYDRYBULBTEMPF'] = hourlyDryBulbTemp
#x_data['HOURLYRelativeHumidity'] = hourlyRelativeHumidity
#x_data['HOURLYWindSpeed'] = hourlyWindSpeed
#x_data['HOURLYSeaLevelPressure'] = hourlySeaLevelPressure
#x_data['HOURLYVISIBILITY'] = hourlyVisibility
#x_data['DAILYSunrise'] = dailySunrise
#x_data['DAILYSunset'] = dailySunset
#x_data['Daylight'] = daylight

#x_data.to_csv(path_or_buf="C:/MIDS/W207 final project/x_data.csv")
##########

# Impute missing values with mean values:
#x_complete = x_data.fillna(x_data.mean())
#X_raw = x_complete.as_matrix()

# Scale the data between 0 and 1:
#X = MinMaxScaler().fit_transform(X_raw)

# Shuffle data to remove any underlying pattern that may exist:
#shuffle = np.random.permutation(np.arange(X.shape[0]))
#X, y = X[shuffle], y[shuffle]

# Separate training, dev, and test data:
#test_data, test_labels = X[800000:], y[800000:]
#dev_data, dev_labels = X[700000:800000], y[700000:800000]
#train_data, train_labels = X[:700000], y[:700000]

#mini_train_data, mini_train_labels = X[:75000], y[:75000]
#mini_dev_data, mini_dev_labels = X[75000:100000], y[75000:100000]
#labels_set = set(mini_dev_labels)
#print(labels_set)
#print(len(labels_set))
#print(train_data[:10])

Local, individual load of updated data set (with weather data integrated) into training, development, and test subsets.


In [4]:
data_path = "/Users/Goodgame/Desktop/prototyping/x_data_3.csv"
df = pd.read_csv(data_path, header=0)
x_data = df.drop('category', 1)
y = df.category.as_matrix()

# Impute missing values with mean values:
x_complete = x_data.fillna(x_data.mean())
X_raw = x_complete.as_matrix()

# Scale the data between 0 and 1:
X = MinMaxScaler().fit_transform(X_raw)

# Shuffle data to remove any underlying pattern that may exist.  Must re-run random seed step each time:
np.random.seed(0)
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, y = X[shuffle], y[shuffle]

# Log loss requires set(y_pred) to match set(labels), so we remove the two extremely
# rare crime categories ('TREA' and 'PORNOGRAPHY/OBSCENE MAT') from the data:
X_minus_trea = X[np.where(y != 'TREA')]
y_minus_trea = y[np.where(y != 'TREA')]
X_final = X_minus_trea[np.where(y_minus_trea != 'PORNOGRAPHY/OBSCENE MAT')]
y_final = y_minus_trea[np.where(y_minus_trea != 'PORNOGRAPHY/OBSCENE MAT')]

# Separate training, dev, and test data:
test_data, test_labels = X_final[800000:], y_final[800000:]
dev_data, dev_labels = X_final[700000:800000], y_final[700000:800000]
train_data, train_labels = X_final[100000:700000], y_final[100000:700000]
calibrate_data, calibrate_labels = X_final[:100000], y_final[:100000]

# Create mini versions of the above sets
mini_train_data, mini_train_labels = X_final[:20000], y_final[:20000]
mini_calibrate_data, mini_calibrate_labels = X_final[19000:28000], y_final[19000:28000]
mini_dev_data, mini_dev_labels = X_final[49000:60000], y_final[49000:60000]

# Create list of the crime type labels.  This will act as the "labels" parameter for the log loss functions that follow
crime_labels = list(set(y_final))
crime_labels_mini_train = list(set(mini_train_labels))
crime_labels_mini_dev = list(set(mini_dev_labels))
crime_labels_mini_calibrate = list(set(mini_calibrate_labels))
print(len(crime_labels), len(crime_labels_mini_train), len(crime_labels_mini_dev),len(crime_labels_mini_calibrate))

#print(len(train_data),len(train_labels))
#print(len(dev_data),len(dev_labels))
#print(len(mini_train_data),len(mini_train_labels))
#print(len(mini_dev_data),len(mini_dev_labels))
#print(len(test_data),len(test_labels))
#print(len(mini_calibrate_data),len(mini_calibrate_labels))
#print(len(calibrate_data),len(calibrate_labels))


37 37 37 37

Sarah's school data that we may still get working as features. (Commented out for now, since running it locally would create file-dependency issues for everyone; Isabell will run it in the final notebook with the files she needs.)


In [ ]:
### Read in zip code data
#data_path_zip = "./data/2016_zips.csv"
#zips = pd.read_csv(data_path_zip, header=0, sep ='\t', usecols = [0,5,6], names = ["GEOID", "INTPTLAT", "INTPTLONG"], dtype ={'GEOID': int, 'INTPTLAT': float, 'INTPTLONG': float})
#sf_zips = zips[(zips['GEOID'] > 94000) & (zips['GEOID'] < 94189)]

### Mapping longitude/latitude to zipcodes
#def dist(lat1, long1, lat2, long2):
#    return np.sqrt((lat1-lat2)**2+(long1-long2)**2)
#    return abs(lat1-lat2)+abs(long1-long2)
#def find_zipcode(lat, long):    
#    distances = sf_zips.apply(lambda row: dist(lat, long, row["INTPTLAT"], row["INTPTLONG"]), axis=1)
#    return sf_zips.loc[distances.idxmin(), "GEOID"]
#x_data['zipcode'] = 0
#for i in range(0, 1):
#    x_data['zipcode'][i] = x_data.apply(lambda row: find_zipcode(row['x'], row['y']), axis=1)
#x_data['zipcode']= x_data.apply(lambda row: find_zipcode(row['x'], row['y']), axis=1)


### Read in school data
#data_path_schools = "./data/pubschls.csv"
#schools = pd.read_csv(data_path_schools,header=0, sep ='\t', usecols = ["CDSCode","StatusType", "School", "EILCode", "EILName", "Zip", "Latitude", "Longitude"], dtype ={'CDSCode': str, 'StatusType': str, 'School': str, 'EILCode': str,'EILName': str,'Zip': str, 'Latitude': float, 'Longitude': float})
#schools = schools[(schools["StatusType"] == 'Active')]

### Find the closest school
#def dist(lat1, long1, lat2, long2):
#    return np.sqrt((lat1-lat2)**2+(long1-long2)**2)

#def find_closest_school(lat, long):    
#    distances = schools.apply(lambda row: dist(lat, long, row["Latitude"], row["Longitude"]), axis=1)
#    return min(distances)
#x_data['closest_school'] = x_data_sub.apply(lambda row: find_closest_school(row['y'], row['x']), axis=1)

Formatting to meet Kaggle submission standards. (Commented out for now, since running it locally would create file-dependency issues for everyone; Isabell will run it in the final notebook with the files she needs.)


In [86]:
# The Kaggle submission format requires listing the ID of each example.
# This is to remember the order of the IDs after shuffling
#allIDs = np.array(list(df.axes[0]))
#allIDs = allIDs[shuffle]

#testIDs = allIDs[800000:]
#devIDs = allIDs[700000:800000]
#trainIDs = allIDs[:700000]

# Extract the column names for the required submission format
#sampleSubmission_path = "./data/sampleSubmission.csv"
#sampleDF = pd.read_csv(sampleSubmission_path)
#allColumns = list(sampleDF.columns)
#featureColumns = allColumns[1:]

# Extracting the test data for a baseline submission
#real_test_path = "./data/test_transformed.csv"
#testDF = pd.read_csv(real_test_path, header=0)
#real_test_data = testDF

#test_complete = real_test_data.fillna(real_test_data.mean())
#Test_raw = test_complete.as_matrix()

#TestData = MinMaxScaler().fit_transform(Test_raw)

# Here we remember the ID of each test data point, in case we ever decide to shuffle the test data for some reason
#testIDs = list(testDF.axes[0])

Generate baseline prediction probabilities from the MNB classifier and store them in a .csv file. (Commented out for now, since running it locally would create file-dependency issues for everyone; Isabell will run it in the final notebook with the files she needs.)


In [87]:
# Generate a baseline MNB classifier and make it return prediction probabilities for the actual test data
#def MNB():
#    mnb = MultinomialNB(alpha = 0.0000001)
#    mnb.fit(train_data, train_labels)
#    print("\n\nMultinomialNB accuracy on dev data:", mnb.score(dev_data, dev_labels))
#    return mnb.predict_proba(dev_data)
#MNB()

#baselinePredictionProbabilities = MNB()

# Place the resulting prediction probabilities in a .csv file in the required format
# First, turn the prediction probabilties into a data frame
#resultDF = pd.DataFrame(baselinePredictionProbabilities,columns=featureColumns)
# Add the IDs as a final column
#resultDF.loc[:,'Id'] = pd.Series(testIDs,index=resultDF.index)
# Make the 'Id' column the first column
#colnames = resultDF.columns.tolist()
#colnames = colnames[-1:] + colnames[:-1]
#resultDF = resultDF[colnames]
# Output to a .csv file
# resultDF.to_csv('result.csv',index=False)

Note: unless the random seed is reset beforehand, the code above will shuffle the data differently each time it runs, so model accuracies will vary accordingly.


In [26]:
## Data sub-setting quality check-point
print(train_data[:1])
print(train_labels[:1])


[[ 0.016  0.985  0.826  0.667  0.055  0.002  0.     0.     0.     1.     0.
   0.     0.     0.     0.     0.     0.     0.514  0.405  0.375  0.661  1.
   0.985  0.985  0.   ]]
['LARCENY/THEFT']

In [27]:
# Modeling quality check-point with MNB--fast model

def MNB():
    mnb = MultinomialNB(alpha = 0.0000001)
    mnb.fit(train_data, train_labels)
    print("\n\nMultinomialNB accuracy on dev data:", mnb.score(dev_data, dev_labels))
    
MNB()



MultinomialNB accuracy on dev data: 0.22347

Defining Performance Criteria

As determined by the Kaggle submission guidelines, the performance metric for the San Francisco Crime Classification competition is Multi-class Logarithmic Loss (also known as cross-entropy). Various other performance metrics are appropriate in other domains: accuracy, F-score, lift, ROC area, average precision, precision/recall break-even point, and squared error. Each is described briefly below.

  • Multi-class Log Loss: the average negative log-probability assigned to the true class; it heavily penalizes confident but wrong predictions, which makes it the natural choice when calibrated probabilities matter, as in this competition (see the sketch after this list).

  • Accuracy: the fraction of predictions that are exactly correct; intuitive and widely reported, but it ignores prediction confidence and can be misleading on imbalanced classes.

  • F-score: the harmonic mean of precision and recall; preferred in retrieval-style tasks where both false positives and false negatives are costly.

  • Lift: how much better the model performs than random selection within a targeted subset; common in marketing and direct-response targeting.

  • ROC Area: the probability that the classifier ranks a random positive example above a random negative one; threshold-independent and robust to class imbalance.

  • Average precision: precision averaged across recall levels, summarizing the precision/recall curve; standard in information retrieval.

  • Precision/Recall break-even point: the point at which precision equals recall; a single-number summary of the precision/recall trade-off.

  • Squared error: the mean squared difference between predicted probabilities and actual outcomes (the Brier score for classification); like log loss, it rewards well-calibrated probabilities.
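
For concreteness, here is a minimal sketch of how multi-class log loss is computed, using a tiny made-up probability matrix (the numbers below are illustrative, not from our data); it reproduces sklearn's log_loss:


In [ ]:
# Minimal sketch of multi-class log loss (cross-entropy) on toy data.
# Assumes each row of y_prob sums to 1 and y_true holds integer class indices.
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 2, 1])           # integer-encoded true labels
y_prob = np.array([[0.7, 0.2, 0.1],    # one row per example,
                   [0.1, 0.3, 0.6],    # one column per class
                   [0.2, 0.5, 0.3]])

# Clip probabilities to avoid log(0), as sklearn does internally:
eps = 1e-15
clipped = np.clip(y_prob, eps, 1 - eps)
manual_loss = -np.mean(np.log(clipped[np.arange(len(y_true)), y_true]))

print(manual_loss)                                  # ~0.520
print(log_loss(y_true, y_prob, labels=[0, 1, 2]))   # matches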

Model Prototyping

We will start our classifier and feature engineering process by looking at the performance of various classifiers, with default parameter settings, when predicting labels on mini_dev_data:


In [46]:
def model_prototype(train_data, train_labels, eval_data, eval_labels):
    knn = KNeighborsClassifier(n_neighbors=5).fit(train_data, train_labels)
    bnb = BernoulliNB(alpha=1, binarize = 0.5).fit(train_data, train_labels)
    mnb = MultinomialNB().fit(train_data, train_labels)
    log_reg = LogisticRegression().fit(train_data, train_labels)
    neural_net = MLPClassifier().fit(train_data, train_labels)
    random_forest = RandomForestClassifier().fit(train_data, train_labels)
    decision_tree = DecisionTreeClassifier().fit(train_data, train_labels)
    support_vm_step_one = svm.SVC(probability = True)
    support_vm = support_vm_step_one.fit(train_data, train_labels)
    
    models = [knn, bnb, mnb, log_reg, neural_net, random_forest, decision_tree, support_vm]
    for model in models:
        eval_prediction_probabilities = model.predict_proba(eval_data)
        print(model, "Multi-class Log Loss:", log_loss(y_true = eval_labels, y_pred = eval_prediction_probabilities, labels = crime_labels_mini_dev), "\n\n")

model_prototype(mini_train_data, mini_train_labels, mini_dev_data, mini_dev_labels)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform') Multi-class Log Loss: 21.0240643644 


BernoulliNB(alpha=1, binarize=0.5, class_prior=None, fit_prior=True) Multi-class Log Loss: 2.6947927812 


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True) Multi-class Log Loss: 2.60974496429 


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False) Multi-class Log Loss: 2.59547592791 


MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False) Multi-class Log Loss: 2.60265495281 


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False) Multi-class Log Loss: 15.5020995603 


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best') Multi-class Log Loss: 29.8634820265 


SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False) Multi-class Log Loss: 2.62373605675 


Adding Features, Hyperparameter Tuning, and Model Calibration To Improve Prediction For Each Classifier

Here we seek to optimize the performance of our classifiers in a three-step, dynamic engineering process.

1) Feature addition

We previously added components from the weather data to the original SF crime data as new features. We will not repeat the work done in our initial submission, where our training dataset did not include these features. For a comparison of how the added features improved our performance with respect to log loss, please refer back to our initial submission.

We can have Kalvin expand on exactly what he did here.

2) Hyperparameter tuning

Each classifier has parameters that we can engineer to further optimize performance, as opposed to using the default parameter values as we did above in the model prototyping cell. This will be specific to each classifier as detailed below.

3) Model calibration

We can calibrate the models via Platt Scaling or Isotonic Regression to attempt to improve their performance.

  • Platt Scaling: fits a logistic (sigmoid) function to the classifier's raw scores on held-out data, mapping them to calibrated probabilities; it works well when the miscalibration is sigmoid-shaped and the calibration set is small.

  • Isotonic Regression: fits a non-parametric, non-decreasing step function to the scores; it is more flexible than Platt Scaling but can overfit small calibration sets.

For each classifier, we can use CalibratedClassifierCV to perform probability calibration with isotonic regression or sigmoid (Platt Scaling). The parameters within CalibratedClassifierCV that we can adjust are the method ('sigmoid' or 'isotonic') and cv (cross-validation generator). As we will already be training our models before calibration, we will only use cv = 'prefit'. Thus, in practice the cross-validation generator will not be a modifiable parameter for us.
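
As a preview of the pattern used repeatedly below, here is a minimal sketch of prefit calibration, assuming the mini training/calibration/dev splits defined above:


In [ ]:
# Minimal sketch of the prefit calibration pattern: fit a base model on the
# training split, fit the calibrator on the held-out calibration split,
# then evaluate calibrated probabilities on the dev split.
bnb_base = BernoulliNB().fit(mini_train_data, mini_train_labels)
ccv = CalibratedClassifierCV(bnb_base, method='sigmoid', cv='prefit')
ccv.fit(mini_calibrate_data, mini_calibrate_labels)
ccv_probs = ccv.predict_proba(mini_dev_data)
print("Calibrated BNB Multi-class Log Loss:",
      log_loss(y_true=mini_dev_labels, y_pred=ccv_probs, labels=crime_labels_mini_dev))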

K-Nearest Neighbors

Hyperparameter tuning:

For the KNN classifier, we can seek to optimize the following classifier parameters: n_neighbors, weights, and the power parameter p.


In [28]:
list_for_ks = []
list_for_ws = []
list_for_ps = []
list_for_log_loss = []

def k_neighbors_tuned(k,w,p):
    tuned_KNN = KNeighborsClassifier(n_neighbors=k, weights=w, p=p).fit(mini_train_data, mini_train_labels)
    dev_prediction_probabilities = tuned_KNN.predict_proba(mini_dev_data)
    list_for_ks.append(k)
    list_for_ws.append(w)
    list_for_ps.append(p)
    working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = dev_prediction_probabilities, labels = crime_labels_mini_dev)
    list_for_log_loss.append(working_log_loss)
    #print("Multi-class Log Loss with KNN and k,w,p =", k,",",w,",", p, "is:", working_log_loss)

k_value_tuning = [i for i in range(1,5002,500)]
weight_tuning = ['uniform', 'distance']
power_parameter_tuning = [1,2]

start = time.clock()
for this_k in k_value_tuning:
    for this_w in weight_tuning:
        for this_p in power_parameter_tuning:
            k_neighbors_tuned(this_k, this_w, this_p)
            
index_best_logloss = np.argmin(list_for_log_loss)
print('For KNN the best log loss with hyperparameter tuning is',list_for_log_loss[index_best_logloss], 'with k =', list_for_ks[index_best_logloss], 'w =', list_for_ws[index_best_logloss], 'p =', list_for_ps[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')


For KNN the best log loss with hyperparameter tuning is 2.62923629844 with k = 2001 w = uniform p = 1
Computation time for this step is 351.40 seconds
Model calibration:

Here we will calibrate the KNN classifier with both Platt Scaling and Isotonic Regression, using CalibratedClassifierCV with various parameter settings. The "method" parameter can be set to "sigmoid" or "isotonic", corresponding to Platt Scaling and Isotonic Regression respectively.


In [33]:
list_for_ks = []
list_for_ws = []
list_for_ps = []
list_for_ms = []
list_for_log_loss = []

def knn_calibrated(k,w,p,m):
    tuned_KNN = KNeighborsClassifier(n_neighbors=k, weights=w, p=p).fit(mini_train_data, mini_train_labels)
    ccv = CalibratedClassifierCV(tuned_KNN, method = m, cv = 'prefit')
    ccv.fit(mini_calibrate_data, mini_calibrate_labels)
    ccv_prediction_probabilities = ccv.predict_proba(mini_dev_data)
    list_for_ks.append(k)
    list_for_ws.append(w)
    list_for_ps.append(p)
    list_for_ms.append(m)
    working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = ccv_prediction_probabilities, labels = crime_labels_mini_dev)
    list_for_log_loss.append(working_log_loss)
    print("Multi-class Log Loss with KNN and k,w,p =", k,",",w,",",p,",",m,"is:", working_log_loss)

k_value_tuning = list(range(1, 21)) + list(range(25, 51, 5)) + list(range(55, 22000, 1000))
weight_tuning = ['uniform', 'distance']
power_parameter_tuning = [1,2]
methods = ['sigmoid', 'isotonic']

start = time.clock()
for this_k in k_value_tuning:
    for this_w in weight_tuning:
        for this_p in power_parameter_tuning:
            for this_m in methods:
                knn_calibrated(this_k, this_w, this_p, this_m)
            
index_best_logloss = np.argmin(list_for_log_loss)
print('For KNN the best log loss with hyperparameter tuning and calibration is',list_for_log_loss[index_best_logloss], 'with k =', list_for_ks[index_best_logloss], 'w =', list_for_ws[index_best_logloss], 'p =', list_for_ps[index_best_logloss], 'm =', list_for_ms[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')


Multi-class Log Loss with KNN and k,w,p = 1 , uniform , 1 , sigmoid is: 2.71372469963
Multi-class Log Loss with KNN and k,w,p = 1 , uniform , 1 , isotonic is: 2.71505449357
Multi-class Log Loss with KNN and k,w,p = 1 , uniform , 2 , sigmoid is: 2.71707486418
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-33-e0828cdad7c7> in <module>()
     29         for this_p in power_parameter_tuning:
     30             for this_m in methods:
---> 31                 knn_calibrated(this_k, this_w, this_p, this_m)
     32 
     33 index_best_logloss = np.argmin(list_for_log_loss)

<ipython-input-33-e0828cdad7c7> in knn_calibrated(k, w, p, m)
      9     dev_prediction_probabilities = tuned_KNN.predict_proba(mini_dev_data)
     10     ccv = CalibratedClassifierCV(tuned_KNN, method = m, cv = 'prefit')
---> 11     ccv.fit(mini_calibrate_data, mini_calibrate_labels)
     12     ccv_prediction_probabilities = ccv.predict_proba(mini_dev_data)
     13     list_for_ks.append(this_k)

/Users/Bryan/anaconda/lib/python3.6/site-packages/sklearn/calibration.py in fit(self, X, y, sample_weight)
    155                 calibrated_classifier.fit(X, y, sample_weight)
    156             else:
--> 157                 calibrated_classifier.fit(X, y)
    158             self.calibrated_classifiers_.append(calibrated_classifier)
    159         else:

/Users/Bryan/anaconda/lib/python3.6/site-packages/sklearn/calibration.py in fit(self, X, y, sample_weight)
    328 
    329         self.classes_ = self.label_encoder_.classes_
--> 330         Y = label_binarize(y, self.classes_)
    331 
    332         df, idx_pos_class = self._preproc(X)

/Users/Bryan/anaconda/lib/python3.6/site-packages/sklearn/preprocessing/label.py in label_binarize(y, classes, neg_label, pos_label, sparse_output)
    498 
    499         # pick out the known labels from y
--> 500         y_in_classes = in1d(y, classes)
    501         y_seen = y[y_in_classes]
    502         indices = np.searchsorted(sorted_class, y_seen)

/Users/Bryan/anaconda/lib/python3.6/site-packages/numpy/lib/arraysetops.py in in1d(ar1, ar2, assume_unique, invert)
    393             mask = np.zeros(len(ar1), dtype=np.bool)
    394             for a in ar2:
--> 395                 mask |= (ar1 == a)
    396         return mask
    397 

KeyboardInterrupt: 
Comments on results for hyperparameter tuning and calibration for KNN:

We see that the best log loss we achieve for KNN without calibration is approximately 2.629, with 2001 neighbors, uniform weights, and power parameter p = 1.

When we add in calibration, the sweep above was interrupted before completing, so we cannot yet report a best calibrated configuration; the partial results (log loss ≈ 2.71 at k = 1) do not improve on the uncalibrated optimum.

(Further explanation here?)

Multinomial, Bernoulli, and Gaussian Naive Bayes

Hyperparameter tuning: Bernoulli Naive Bayes

For the Bernoulli Naive Bayes classifier, we seek to optimize the alpha parameter (the Laplace smoothing parameter) and the binarize parameter (the threshold for binarizing the sample features). For the binarize parameter, we will create arbitrary thresholds over which our features, which are not binary/boolean, will be binarized.


In [44]:
list_for_as = []
list_for_bs = []
list_for_log_loss = []

def BNB_tuned(a,b):
    bnb_tuned = BernoulliNB(alpha = a, binarize = b).fit(mini_train_data, mini_train_labels)
    dev_prediction_probabilities = bnb_tuned.predict_proba(mini_dev_data)
    list_for_as.append(a)
    list_for_bs.append(b)
    working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = dev_prediction_probabilities, labels = crime_labels_mini_dev)
    list_for_log_loss.append(working_log_loss)
    #print("Multi-class Log Loss with BNB and a,b =", a,",",b,"is:", working_log_loss)

alpha_tuning = [0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 1.0, 1.1, 1.2, 1.4, 1.6, 1.8, 2.0, 10.0]
binarize_thresholds_tuning = [1e-20, 1e-19, 1e-18, 1e-17, 1e-16, 1e-15, 1e-14, 1e-13, 1e-12, 1e-11, 1e-10, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 0.999, 0.9999]

start = time.clock()
for this_a in alpha_tuning:
    for this_b in binarize_thresholds_tuning:
            BNB_tuned(this_a, this_b)
            
index_best_logloss = np.argmin(list_for_log_loss)
print('For BNB the best log loss with hyperparameter tuning is',list_for_log_loss[index_best_logloss], 'with alpha =', list_for_as[index_best_logloss], 'binarization threshold =', list_for_bs[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')


For BNB the best log loss with hyperparameter tuning is 2.6247750866 with alpha = 1.2 binarization threshold = 1e-20
Computation time for this step is 186.46 seconds
Model calibration: BernoulliNB

Here we will calibrate the BNB classifier with both Platt Scaling and Isotonic Regression, using CalibratedClassifierCV with various parameter settings. The "method" parameter can be set to "sigmoid" or "isotonic", corresponding to Platt Scaling and Isotonic Regression respectively.


In [8]:
list_for_as = []
list_for_bs = []
list_for_ms = []
list_for_log_loss = []

def BNB_calibrated(a,b,m):
    bnb_tuned = BernoulliNB(alpha = a, binarize = b).fit(mini_train_data, mini_train_labels)
    ccv = CalibratedClassifierCV(bnb_tuned, method = m, cv = 'prefit')
    ccv.fit(mini_calibrate_data, mini_calibrate_labels)
    ccv_prediction_probabilities = ccv.predict_proba(mini_dev_data)
    list_for_as.append(a)
    list_for_bs.append(b)
    list_for_ms.append(m)
    working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = ccv_prediction_probabilities, labels = crime_labels_mini_dev)
    list_for_log_loss.append(working_log_loss)
    #print("Multi-class Log Loss with BNB and a,b,m =", a,",", b,",", m, "is:", working_log_loss)

alpha_tuning = [0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 1.0, 1.1, 1.2, 1.4, 1.6, 1.8, 2.0, 10.0]
binarize_thresholds_tuning = [1e-20, 1e-19, 1e-18, 1e-17, 1e-16, 1e-15, 1e-14, 1e-13, 1e-12, 1e-11, 1e-10, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 0.999, 0.9999]
methods = ['sigmoid', 'isotonic']

start = time.clock()
for this_a in alpha_tuning:
    for this_b in binarize_thresholds_tuning:
            for this_m in methods:
                BNB_calibrated(this_a, this_b, this_m)
            
index_best_logloss = np.argmin(list_for_log_loss)
print('For BNB the best log loss with hyperparameter tuning and calibration is',list_for_log_loss[index_best_logloss], 'with alpha =', list_for_as[index_best_logloss], 'binarization threshold =', list_for_bs[index_best_logloss], 'method = ', list_for_ms[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')


For BNB the best log loss with hyperparameter tuning and calibration is 2.61370308039 with alpha = 1.0 binarization threshold = 0.5 method =  sigmoid
Computation time for this step is 1066.40 seconds
Hyperparameter tuning: Multinomial Naive Bayes

For the Multinomial Naive Bayes classifier, we seek to optimize the alpha parameter (the Laplace smoothing parameter).


In [22]:
list_for_as = []
list_for_log_loss = []

def MNB_tuned(a):
    mnb_tuned = MultinomialNB(alpha = a).fit(mini_train_data, mini_train_labels)
    dev_prediction_probabilities =mnb_tuned.predict_proba(mini_dev_data)
    list_for_as.append(a)
    working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = dev_prediction_probabilities, labels = crime_labels_mini_dev)
    list_for_log_loss.append(working_log_loss)
    #print("Multi-class Log Loss with BNB and a =", a, "is:", working_log_loss)

alpha_tuning = [0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 1.0, 1.1, 1.2, 1.4, 1.6, 1.8, 2.0, 10.0]

start = time.clock()
for this_a in alpha_tuning:
            MNB_tuned(this_a)
            
index_best_logloss = np.argmin(list_for_log_loss)
print('For MNB the best log loss with hyperparameter tuning is',list_for_log_loss[index_best_logloss], 'with alpha =', list_for_as[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')


For MNB the best log loss with hyperparameter tuning is 2.60930490132 with alpha = 1.8
Computation time for this step is 5.96 seconds
Model calibration: MultinomialNB

Here we will calibrate the MNB classifier with both Platt Scaling and Isotonic Regression, using CalibratedClassifierCV with various parameter settings. The "method" parameter can be set to "sigmoid" or "isotonic", corresponding to Platt Scaling and Isotonic Regression respectively.


In [19]:
list_for_as = []
list_for_ms = []
list_for_log_loss = []

def MNB_calibrated(a,m):
    mnb_tuned = MultinomialNB(alpha = a).fit(mini_train_data, mini_train_labels)
    ccv = CalibratedClassifierCV(mnb_tuned, method = m, cv = 'prefit')
    ccv.fit(mini_calibrate_data, mini_calibrate_labels)
    ccv_prediction_probabilities = ccv.predict_proba(mini_dev_data)
    list_for_as.append(a)
    list_for_ms.append(m)
    working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = ccv_prediction_probabilities, labels = crime_labels_mini_dev)
    list_for_log_loss.append(working_log_loss)
    #print("Multi-class Log Loss with MNB and a =", a, "and m =", m, "is:", working_log_loss)

alpha_tuning = [0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 1.0, 1.1, 1.2, 1.4, 1.6, 1.8, 2.0, 10.0]
methods = ['sigmoid', 'isotonic']

start = time.clock()
for this_a in alpha_tuning:
    for this_m in methods:
        MNB_calibrated(this_a, this_m)
            
index_best_logloss = np.argmin(list_for_log_loss)
print('For MNB the best log loss with hyperparameter tuning and calibration is',list_for_log_loss[index_best_logloss], 'with alpha =', list_for_as[index_best_logloss], 'and method =', list_for_ms[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')


For MNB the best log loss with hyperparameter tuning and calibration is 2.6145055659 with alpha = 2.0 and method = sigmoid
Computation time for this step is 34.13 seconds

In [ ]:
#KK:
#Since the pyspark-sklearn integration never got working, I did a manual hyperparameter optimization:
mnb_param_grid = {'alpha': [0.340, 0.345, 0.35, 0.355, 0.360]}
MNB = GridSearchCV(MultinomialNB(), param_grid=mnb_param_grid, scoring = 'neg_log_loss')
MNB.fit(train_data, train_labels)
print("the best alpha value is:", str(MNB.best_params_['alpha']))

MNBPredictionProbabilities = MNB.predict_proba(dev_data)
print("Multi-class Log Loss:", log_loss(y_true = dev_labels, y_pred = MNBPredictionProbabilities, labels = crime_labels), "\n\n")

#the results from my analysis, which subsetted train_data differently in order to avoid having to delete the 2 rarest crimes
#the best alpha value is: 0.345
#Multi-class Log Loss: 2.60028093653
Tuning: Gaussian Naive Bayes

For the Gaussian Naive Bayes classifier there are no inherent parameters within the classifier function to optimize. Instead, we will look at our log loss before and after adding noise to the data; the noise is hypothesized to give the data a more normal (Gaussian) distribution, which the GNB classifier assumes.


In [17]:
def GNB_pre_tune():
    gnb_pre_tuned = GaussianNB().fit(mini_train_data, mini_train_labels)
    dev_prediction_probabilities =gnb_pre_tuned.predict_proba(mini_dev_data)
    working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = dev_prediction_probabilities, labels = crime_labels_mini_dev)
    print("Multi-class Log Loss with pre-tuned GNB is:", working_log_loss)

GNB_pre_tune()
    
def GNB_post_tune():
    # Gaussian Naive Bayes assumes each feature is normally distributed. Sometimes
    # adding noise can improve performance by making the data more normal:
    mini_train_data_noise = np.random.rand(mini_train_data.shape[0],mini_train_data.shape[1])
    modified_mini_train_data = np.multiply(mini_train_data,mini_train_data_noise)    
    gnb_with_noise = GaussianNB().fit(modified_mini_train_data,mini_train_labels)
    dev_prediction_probabilities =gnb_with_noise.predict_proba(mini_dev_data)
    working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = dev_prediction_probabilities, labels = crime_labels_mini_dev)
    print("Multi-class Log Loss with tuned GNB via addition of noise to normalize the data's distribution is:", working_log_loss)
    
GNB_post_tune()


Multi-class Log Loss with pre-tuned GNB is: 34.1076504549
Multi-class Log Loss with tuned GNB via addition of noise to normalize the data's distribution is: 31.8040829494
Model calibration: GaussianNB

Here we will calibrate the GNB classifier with both Platt Scaling and Isotonic Regression, using CalibratedClassifierCV with various parameter settings. The "method" parameter can be set to "sigmoid" or "isotonic", corresponding to Platt Scaling and Isotonic Regression respectively.


In [21]:
list_for_ms = []
list_for_log_loss = []

def GNB_calibrated(m):
    # Gaussian Naive Bayes assumes each feature is normally distributed. Sometimes
    # adding noise can improve performance by making the data more normal:
    mini_train_data_noise = np.random.rand(mini_train_data.shape[0],mini_train_data.shape[1])
    modified_mini_train_data = np.multiply(mini_train_data,mini_train_data_noise)    
    gnb_with_noise = GaussianNB().fit(modified_mini_train_data,mini_train_labels)
    ccv = CalibratedClassifierCV(gnb_with_noise, method = m, cv = 'prefit')
    ccv.fit(mini_calibrate_data, mini_calibrate_labels)
    ccv_prediction_probabilities = ccv.predict_proba(mini_dev_data)
    list_for_ms.append(m)
    working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = ccv_prediction_probabilities, labels = crime_labels_mini_dev)
    list_for_log_loss.append(working_log_loss)
    #print("Multi-class Log Loss with tuned GNB via addition of noise to normalize the data's distribution and after calibration is:", working_log_loss, 'with calibration method =', m)
    
methods = ['sigmoid', 'isotonic']

start = time.clock()
for this_m in methods:
    GNB_calibrated(this_m)
            
index_best_logloss = np.argmin(list_for_log_loss)
print('For GNB the best log loss with tuning and calibration is',list_for_log_loss[index_best_logloss], 'with method =', list_for_ms[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')


For GNB the best log loss with tuning and calibration is 2.67904020299 with method = sigmoid
Computation time for this step is 1.36 seconds

Logistic Regression

Hyperparameter tuning:

For the Logistic Regression classifier, we can seek to optimize the following classifier parameters: penalty ('l1' or 'l2'), C (the inverse of regularization strength), and solver ('newton-cg', 'lbfgs', 'liblinear', or 'sag').
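
A hedged sketch of such a search follows, using the GridSearchCV already imported; the C grid is our assumption, and each solver is paired only with a penalty it supports (liblinear handles l1, while newton-cg, lbfgs, and sag are l2-only):


In [ ]:
# Sketch of a logistic regression search over penalty, C, and solver.
# param_grid is a list of dicts so incompatible penalty/solver pairs are skipped.
lr_param_grid = [
    {'penalty': ['l1'], 'solver': ['liblinear'], 'C': [0.1, 1.0, 10.0]},
    {'penalty': ['l2'], 'solver': ['newton-cg', 'lbfgs', 'sag'], 'C': [0.1, 1.0, 10.0]},
]
lr_search = GridSearchCV(LogisticRegression(), param_grid=lr_param_grid,
                         scoring='neg_log_loss')
lr_search.fit(mini_train_data, mini_train_labels)
print("Best parameters:", lr_search.best_params_)
print("LR Multi-class Log Loss:",
      log_loss(y_true=mini_dev_labels,
               y_pred=lr_search.predict_proba(mini_dev_data),
               labels=crime_labels_mini_dev))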

Model calibration:

See above

Manual Hyperparameter Tuning for a Logistic Regression Classifier with an L1-Penalty

  • The previous iterations of parameter searches are omitted from this final notebook, but are located in the logistic regression optimization notebook.

In [ ]:
##KK:
##the results of running this cell will be different because I subset my training data differently
cValsL1 = [15.0, 20.0, 25.0, 50.0]
method = 'sigmoid'
cv = 2
tol = 0.01
for c in cValsL1:
    ccvL1 = CalibratedClassifierCV(LogisticRegression(penalty='l1', C=c, tol=tol), method=method, cv=cv)
    ccvL1.fit(mini_train_data, mini_train_labels)
    print(ccvL1.get_params)
    ccvL1_prediction_probabilities = ccvL1.predict_proba(mini_dev_data)
    print("L1 Multi-class Log Loss:", log_loss(y_true = mini_dev_labels, y_pred = ccvL1_prediction_probabilities, labels = crime_labels_mini_dev), "\n\n")
    print()
Parameters/Results:
  • Starting Parameters:
    • CalibratedClassifierCV: cv=2
    • LogisticRegression: penalty='l1', tol=0.01
  • Optimized Parameters:
    • CalibratedClassifierCV: method='sigmoid'
    • Logistic Regression: solver='liblinear', C=20.0
  • Result:
    • Multi-class Log Loss: 2.59346681891

Manual Hyperparameter Tuning for a Logistic Regression Classifier with an L2-Penalty

  • The previous iterations of parameter searches are omitted from this final notebook, but are located in the logistic regression optimization notebook.

In [ ]:
##KK:
##the results of running this cell will be different because I subset my training data differently
cValsL2 = [400.0, 500.0, 750.0, 1000.0]
method = 'isotonic'
cv = 2
tol = 0.01
for c in cValsL2:
    ccvL2 = CalibratedClassifierCV(LogisticRegression(penalty='l2', solver='newton-cg', C=c, tol=tol), method=method, cv=cv)
    ccvL2.fit(mini_train_data, mini_train_labels)
    print(ccvL2.get_params)
    ccvL2_prediction_probabilities = ccvL2.predict_proba(mini_dev_data)
    print("L2 Multi-class Log Loss:", log_loss(y_true = mini_dev_labels, y_pred = ccvL2_prediction_probabilities, labels = crime_labels_mini_dev), "\n\n")
    print()
Parameters/Results:
  • Starting Parameters:
    • CalibratedClassifierCV: cv=2
    • LogisticRegression: penalty='l2', tol=0.01
  • Optimized Parameters:
    • CalibratedClassifierCV: method='isotonic'
    • Logistic Regression: solver='newton-cg', C=500.0
  • Result:
    • Multi-class Log Loss: 2.59107616746

Decision Tree

Hyperparameter tuning:

For the Decision Tree classifier, we can seek to optimize the following classifier parameters:

  • criterion: The function to measure the quality of a split; can be either Gini impurity "gini" or information gain "entropy"
  • splitter: The strategy used to choose the split at each node; can be either "best" to choose the best split or "random" to choose the best random split
  • min_samples_leaf: The minimum number of samples required to be at a leaf node
  • max_depth: The maximum depth of trees. If default "None" then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples
  • class_weight: The weights associated with classes; can be "None" giving all classes weight of one, or can be "balanced", which uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data
  • max_features: The number of features to consider when looking for the best split; can be "int", "float" (percent), "auto", "sqrt", or "None"

Other adjustable parameters include:

  • min_samples_split: The minimum number of samples required to split an internal node; can be an integer or a float (percentage and ceil as the minimum number of samples for each node)
  • min_weight_fraction_leaf: The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node; default = 0
  • max_leaf_nodes: Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined by relative reduction in impurity. If "None", an unlimited number of leaf nodes is used.
  • min_impurity_decrease: A node will be split if this split induces a decrease of the impurity greater than or equal to the min_impurity_decrease value. Default is zero.

Setting min_samples_leaf to approximately 1% of the data points can stop the tree from inappropriately classifying outliers, which can help to improve accuracy (we are unsure whether it significantly improves MCLL) [].


In [46]:
list_for_cs = []
list_for_ss = []
list_for_mds = []
list_for_mss = []
list_for_cws = []
list_for_fs = []
list_for_log_loss = []

def DT_tuned(c,s,md,ms,cw,f):
    tuned_DT = DecisionTreeClassifier(criterion=c, splitter=s, max_depth=md, min_samples_leaf=ms, max_features=f, class_weight=cw).fit(mini_train_data, mini_train_labels)
    dev_prediction_probabilities = tuned_DT.predict_proba(mini_dev_data)
    list_for_cs.append(c)
    list_for_ss.append(s)
    list_for_mds.append(md)
    list_for_mss.append(ms)
    list_for_cws.append(cw)
    list_for_fs.append(f)
    working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = dev_prediction_probabilities, labels = crime_labels_mini_dev)
    list_for_log_loss.append(working_log_loss)
    #print("Multi-class Log Loss with DT and c,s,md,ms,cw,f =", c,",",s,",", md,",",ms,",",cw,",",f,"is:", working_log_loss)

criterion_tuning = ['gini', 'entropy']
splitter_tuning = ['best', 'random']
max_depth_tuning = ([None,6,7,8,9,10,11,12,13,14,15,16,17,18,19])
min_samples_leaf_tuning = [i + 1 for i in range(0, int(0.091*len(mini_train_data)), 100)]
class_weight_tuning = [None, 'balanced']
max_features_tuning = ['auto', 2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]

start = time.clock()
for this_c in criterion_tuning:
    for this_s in splitter_tuning:
        for this_md in max_depth_tuning:
            for this_ms in min_samples_leaf_tuning:
                for this_cw in class_weight_tuning:
                    for this_f in max_features_tuning:
                        DT_tuned(this_c, this_s, this_md, this_ms, this_cw, this_f)
            
index_best_logloss = np.argmin(list_for_log_loss)
print('For DT the best log loss with hyperparameter tuning is',list_for_log_loss[index_best_logloss], 'with criterion =', list_for_cs[index_best_logloss], 'splitter =', list_for_ss[index_best_logloss], 'max_depth =', list_for_mds[index_best_logloss], 'min_samples_leaf =', list_for_mss[index_best_logloss], 'class_weight =', list_for_cws[index_best_logloss], 'max_features =', list_for_fs[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')


For DT the best log loss with hyperparameter tuning is 2.62645819033 with criterion = entropy splitter = best max_depth = 15 min_samples_leaf = 1201 class_weight = None max_features = 9
Computation time for this step is 1424.48 seconds
Model calibration:

See above


In [48]:
list_for_cs = []
list_for_ss = []
list_for_mds = []
list_for_mss = []
list_for_cws = []
list_for_fs = []
list_for_cms = []
list_for_log_loss = []

def DT_calibrated(c,s,md,ms,cw,f,cm):
    tuned_DT = DecisionTreeClassifier(criterion=c, splitter=s, max_depth=md, min_samples_leaf=ms, max_features=f, class_weight=cw).fit(mini_train_data, mini_train_labels)
    ccv = CalibratedClassifierCV(tuned_DT, method = cm, cv = 'prefit')
    ccv.fit(mini_calibrate_data, mini_calibrate_labels)
    ccv_prediction_probabilities = ccv.predict_proba(mini_dev_data)
    list_for_cs.append(c)
    list_for_ss.append(s)
    list_for_mds.append(md)
    list_for_mss.append(ms)
    list_for_cws.append(cw)
    list_for_fs.append(f)
    list_for_cms.append(cm)
    working_log_loss = log_loss(y_true = mini_dev_labels, y_pred = ccv_prediction_probabilities, labels = crime_labels_mini_dev)
    list_for_log_loss.append(working_log_loss)
    print("Multi-class Log Loss with DT and c,s,md,ms,cw,f =", c,",",s,",", md,",",ms,",",cw,",",f,",",cm,"is:", working_log_loss)

criterion_tuning = ['gini', 'entropy']
splitter_tuning = ['best', 'random']
max_depth_tuning = ([None,6,7,8,9,10,11,12,13,14,15,16,17,18,19])
min_samples_leaf_tuning = [i + 1 for i in range(0, int(0.091*len(mini_train_data)), 100)]
class_weight_tuning = [None, 'balanced']
max_features_tuning = ['auto', 2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]
methods = ['sigmoid', 'isotonic']

start = time.clock()
for this_c in criterion_tuning:
    for this_s in splitter_tuning:
        for this_md in max_depth_tuning:
            for this_ms in min_samples_leaf_tuning:
                for this_cw in class_weight_tuning:
                    for this_f in max_features_tuning:
                        for this_cm in methods:
                            DT_calibrated(this_c, this_s, this_md, this_ms, this_cw, this_f, this_cm)
            
index_best_logloss = np.argmin(list_for_log_loss)
print('For DT the best log loss with hyperparameter tuning and calibration is',list_for_log_loss[index_best_logloss], 'with criterion =', list_for_cs[index_best_logloss], 'splitter =', list_for_ss[index_best_logloss], 'max_depth =', list_for_mds[index_best_logloss], 'min_samples_leaf =', list_for_mss[index_best_logloss], 'class_weight =', list_for_cws[index_best_logloss], 'max_features =', list_for_fs[index_best_logloss], 'and calibration method =', list_for_cms[index_best_logloss])
end = time.clock()
print("Computation time for this step is %.2f" % (end-start), 'seconds')


Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , auto , sigmoid is: 2.70888798421
Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , auto , isotonic is: 2.85635450179
Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , 2 , sigmoid is: 2.70872615427
Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , 2 , isotonic is: 2.80355451969
Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , 3 , sigmoid is: 2.69815374851
Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , 3 , isotonic is: 2.82892176447
Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , 4 , sigmoid is: 2.70032530894
Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , 4 , isotonic is: 2.79782811709
Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , 5 , sigmoid is: 2.70315008561
Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , 5 , isotonic is: 2.8022793479
Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , 6 , sigmoid is: 2.69781719577
Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , 6 , isotonic is: 2.77490607908
Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , 7 , sigmoid is: 2.69993921872
Multi-class Log Loss with DT and c,s,md,ms,cw,f = gini , best , None , 1 , None , 7 , isotonic is: 2.76253067697
---------------------------------------------------------------------------
KeyboardInterrupt: the tuning loop above was stopped manually partway through
a sigmoid-calibration fit (CalibratedClassifierCV -> scipy fmin_bfgs), so only
the parameter combinations printed above were evaluated.

Support Vector Machines (Kalvin)

Hyperparameter tuning:

For the SVM classifier, we can seek to optimize the following classifier parameters: C (the penalty parameter of the error term) and kernel ('linear', 'poly', 'rbf', 'sigmoid', or 'precomputed').

See source [2] for an approach to parameter optimization in SVMs.
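
As a starting point, here is a minimal grid-search sketch (not executed here), assuming the mini_train_data/mini_train_labels splits used throughout and a scikit-learn version that accepts scoring='neg_log_loss' (older versions name it 'log_loss'); the grid values are illustrative, not tuned:

In [ ]:
# Hedged sketch: tune SVC's C and kernel by multi-class log loss.
# probability=True is required so predict_proba exists for log-loss scoring.
from sklearn import svm
from sklearn.grid_search import GridSearchCV

svm_param_grid = {'C': [0.1, 1, 10, 100],
                  'kernel': ['linear', 'poly', 'rbf', 'sigmoid']}
svm_grid = GridSearchCV(svm.SVC(probability=True), svm_param_grid,
                        scoring='neg_log_loss', cv=3)
svm_grid.fit(mini_train_data, mini_train_labels)
print(svm_grid.best_params_, svm_grid.best_score_)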

Model calibration:

See above. A sketch of calibrating the tuned SVM specifically follows.
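
A minimal sketch, mirroring the decision-tree calibration above and assuming the mini_calibrate_* and mini_dev_* splits plus the crime_labels_mini_dev list used elsewhere in this notebook; best_svm is a hypothetical stand-in for whatever parameters the tuning above selects:

In [ ]:
# Hedged sketch: calibrate a tuned SVM the same way the decision tree is calibrated above.
best_svm = svm.SVC(C=1, kernel='rbf', probability=True)  # hypothetical tuned parameters
best_svm.fit(mini_train_data, mini_train_labels)
svm_ccv = CalibratedClassifierCV(best_svm, method='sigmoid', cv='prefit')
svm_ccv.fit(mini_calibrate_data, mini_calibrate_labels)
svm_probabilities = svm_ccv.predict_proba(mini_dev_data)
print("Multi-class Log Loss:", log_loss(y_true=mini_dev_labels, y_pred=svm_probabilities,
                                        labels=crime_labels_mini_dev))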

Neural Nets (Sarah)

Hyperparameter tuning:

For the Neural Networks MLP classifier, we can seek to optimize the following classifier parameters: hidden_layer_sizes, activation ('identity', 'logistic', 'tanh', 'relu'), solver ('lbfgs', 'sgd', 'adam'), alpha, and learning_rate ('constant', 'invscaling', 'adaptive').
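
A minimal grid-search sketch for the MLP (illustrative grid, not executed here), assuming the same mini_train_* splits and neg_log_loss scoring as above; note that learning_rate only takes effect when solver='sgd':

In [ ]:
# Hedged sketch: tune MLPClassifier by multi-class log loss.
mlp_param_grid = {'hidden_layer_sizes': [(50,), (50, 30)],
                  'activation': ['logistic', 'tanh', 'relu'],
                  'solver': ['adam', 'sgd'],
                  'alpha': [1e-4, 1e-2],
                  'learning_rate': ['constant', 'adaptive']}
mlp_grid = GridSearchCV(MLPClassifier(max_iter=200), mlp_param_grid,
                        scoring='neg_log_loss', cv=3)
mlp_grid.fit(mini_train_data, mini_train_labels)
print(mlp_grid.best_params_, mlp_grid.best_score_)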


In [99]:
### All the work from Sarah's notebook:

import theano
from theano import tensor as T
from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
print(theano.config.device)  # We're using CPUs (for now)
print(theano.config.floatX)  # Should be float64 for CPUs

np.random.seed(0)

from IPython.display import display, clear_output


---------------------------------------------------------------------------
ModuleNotFoundError: No module named 'theano' -- theano is not installed in
this environment, so the theano cells below did not execute here.

In [92]:
numFeatures = train_data[1].size
numTrainExamples = train_data.shape[0]
numTestExamples = test_data.shape[0]
print ('Features = %d' %(numFeatures))
print ('Train set = %d' %(numTrainExamples))
print ('Test set = %d' %(numTestExamples))

class_labels = list(set(train_labels))
print(class_labels)
numClasses = len(class_labels)


Features = 25
Train set = 700000
Test set = 78049
['DRUG/NARCOTIC', 'RUNAWAY', 'DRUNKENNESS', 'LOITERING', 'STOLEN PROPERTY', 'MISSING PERSON', 'ARSON', 'FRAUD', 'SEX OFFENSES NON FORCIBLE', 'NON-CRIMINAL', 'WEAPON LAWS', 'RECOVERED VEHICLE', 'ASSAULT', 'TRESPASS', 'GAMBLING', 'SUSPICIOUS OCC', 'TREA', 'BAD CHECKS', 'VANDALISM', 'FAMILY OFFENSES', 'DRIVING UNDER THE INFLUENCE', 'WARRANTS', 'PROSTITUTION', 'SEX OFFENSES FORCIBLE', 'DISORDERLY CONDUCT', 'LIQUOR LAWS', 'ROBBERY', 'FORGERY/COUNTERFEITING', 'OTHER OFFENSES', 'EXTORTION', 'VEHICLE THEFT', 'SUICIDE', 'PORNOGRAPHY/OBSCENE MAT', 'LARCENY/THEFT', 'BRIBERY', 'EMBEZZLEMENT', 'SECONDARY CODES', 'KIDNAPPING', 'BURGLARY']

In [93]:
### Binarize the class labels

def binarizeY(data):
    # One-hot encode each label at the index of its class in class_labels
    binarized_data = np.zeros((data.size, len(class_labels)))
    for j in range(0, data.size):
        feature = data[j]
        i = class_labels.index(feature)
        binarized_data[j, i] = 1
    return binarized_data

train_labels_b = binarizeY(train_labels)
test_labels_b = binarizeY(test_labels)
numClasses = train_labels_b[1].size

print ('Classes = %d' %(numClasses))
print ('\n', train_labels_b[:5, :], '\n')
print (train_labels[:10], '\n')


Classes = 39

 [[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.
   0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.]] 

['BURGLARY' 'LARCENY/THEFT' 'OTHER OFFENSES' 'OTHER OFFENSES'
 'SUSPICIOUS OCC' 'VANDALISM' 'DRUG/NARCOTIC' 'MISSING PERSON'
 'LARCENY/THEFT' 'OTHER OFFENSES'] 
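Note that binarizeY is equivalent to scikit-learn's LabelBinarizer, except that LabelBinarizer orders columns by sorted class label while class_labels above is in arbitrary set order:

In [ ]:
# Equivalent one-hot encoding with scikit-learn (column order differs: classes are sorted).
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
train_labels_b_alt = lb.fit_transform(train_labels)
test_labels_b_alt = lb.transform(test_labels)
print(lb.classes_[:5])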


In [94]:
###1) Parameters
numFeatures = train_data.shape[1]

numHiddenNodeslayer1 = 50
numHiddenNodeslayer2 = 30

w_1 = theano.shared(np.asarray((np.random.randn(*(numFeatures, numHiddenNodeslayer1))*0.01)))
w_2 = theano.shared(np.asarray((np.random.randn(*(numHiddenNodeslayer1, numHiddenNodeslayer2))*0.01)))
w_3 = theano.shared(np.asarray((np.random.randn(*(numHiddenNodeslayer2, numClasses))*0.01)))
params = [w_1, w_2, w_3]


###2) Model
X = T.matrix()
Y = T.matrix()

srng = RandomStreams()
def dropout(X, p=0.):
    # Randomly zero units with probability p, rescaling survivors by 1/(1-p)
    if p > 0:
        X *= srng.binomial(X.shape, p=1 - p)
        X /= 1 - p
    return X

def model(X, w_1, w_2, w_3, p_1, p_2, p_3):
    # Two sigmoid hidden layers with dropout, followed by a softmax output layer
    return T.nnet.softmax(T.dot(dropout(T.nnet.sigmoid(T.dot(dropout(T.nnet.sigmoid(T.dot(dropout(X, p_1), w_1)), p_2), w_2)), p_3), w_3))
y_hat_train = model(X, w_1, w_2, w_3, 0.2, 0.5, 0.5)
y_hat_predict = model(X, w_1, w_2, w_3, 0., 0., 0.)

### (3) Cost function: categorical cross-entropy between predictions and one-hot labels
cost = T.mean(T.nnet.categorical_crossentropy(y_hat_train, Y))


---------------------------------------------------------------------------
NameError: name 'theano' is not defined -- a consequence of the failed theano
import above.

In [14]:
### (4) Objective (and solver)

alpha = 0.01
def backprop(cost, w):
    grads = T.grad(cost=cost, wrt=w)
    updates = []
    for wi, grad in zip(w, grads):
        updates.append([wi, wi - grad * alpha])
    return updates

update = backprop(cost, params)
train = theano.function(inputs=[X, Y], outputs=cost, updates=update, allow_input_downcast=True)
y_pred = T.argmax(y_hat_predict, axis=1)
predict = theano.function(inputs=[X], outputs=y_pred, allow_input_downcast=True)

miniBatchSize = 10 

def gradientDescent(epochs):
    for i in range(epochs):
        for start, end in zip(range(0, len(train_data), miniBatchSize), range(miniBatchSize, len(train_data), miniBatchSize)):
            cc = train(train_data[start:end], train_labels_b[start:end])
        clear_output(wait=True)
        print ('%d) accuracy = %.4f' %(i+1, np.mean(np.argmax(test_labels_b, axis=1) == predict(test_data))) )

gradientDescent(50)

### How do we decide how many epochs to use? An epoch here is one full pass over the training data.
### Plot the cost for each of the 50 epochs and see how much it declines: if it is still clearly
### decreasing, run more epochs; if the curve looks like it is flattening out, you can stop.
### (A sketch of this diagnostic appears after this cell.)


---------------------------------------------------------------------------
NameError: name 'cost' is not defined -- the cost function is defined in the
cell above, which also failed to run without theano.
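
To act on the epoch question raised in the comments above, here is a minimal sketch (runnable only once the theano cells above succeed) that records the mean mini-batch cost per epoch and plots it; stop adding epochs once the curve flattens:

In [ ]:
# Hedged sketch: plot the per-epoch training cost to choose the number of epochs.
import matplotlib.pyplot as plt

def gradientDescentWithCosts(epochs):
    epoch_costs = []
    for i in range(epochs):
        batch_costs = []
        for start, end in zip(range(0, len(train_data), miniBatchSize),
                              range(miniBatchSize, len(train_data), miniBatchSize)):
            batch_costs.append(train(train_data[start:end], train_labels_b[start:end]))
        epoch_costs.append(np.mean(batch_costs))  # average cost across this epoch's batches
    plt.plot(range(1, epochs + 1), epoch_costs)
    plt.xlabel('epoch')
    plt.ylabel('mean mini-batch cost')
    plt.show()
    return epoch_costs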
Model calibration:

See above


Random Forest (Sam, possibly in AWS)

Hyperparameter tuning:

For the Random Forest classifier, we can seek to optimize the following classifier parameters: n_estimators (the number of trees in the forest), max_features, max_depth, min_samples_leaf, bootstrap (whether bootstrap samples are used when building trees), and oob_score (whether out-of-bag samples are used to estimate the generalization accuracy).
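
A minimal grid-search sketch (illustrative values, likely run on a subset or in AWS given the forest sizes), assuming the mini_train_* splits and neg_log_loss scoring as above:

In [ ]:
# Hedged sketch: tune the random forest by multi-class log loss.
rf_param_grid = {'n_estimators': [10, 50, 100],
                 'max_features': ['sqrt', 'log2'],
                 'max_depth': [None, 10, 20],
                 'min_samples_leaf': [1, 5]}
rf_grid = GridSearchCV(RandomForestClassifier(bootstrap=True, oob_score=True),
                       rf_param_grid, scoring='neg_log_loss', cv=3)
rf_grid.fit(mini_train_data, mini_train_labels)
print(rf_grid.best_params_, rf_grid.best_score_)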

Model calibration:

See above

Meta-estimators

AdaBoost Classifier

Hyperparameter tuning:

We do not plan to change the AdaBoostClassifier's default parameter values in any major way.

Adaboosting each classifier:

We will run the AdaBoostClassifier on each different classifier from above, using the classifier settings with optimized Multi-class Log Loss after hyperparameter tuning and calibration.

AdaBoost Classifier Test with the Best Logistic Regression Classifier

  • The default AdaBoost Classifier was also tested, but had the same result.

In [ ]:
##KK:
##the results of running this cell will be different because I subset my training data differently
bestLR = LogisticRegression(penalty='l2', solver='newton-cg', C=500, tol=0.01)
lrAdaBoost = AdaBoostClassifier(base_estimator=bestLR)
lrAdaBoost.fit(mini_train_data, mini_train_labels)
lrPredictionProbabilities = lrAdaBoost.predict_proba(mini_dev_data)
print("Multi-class Log Loss:", log_loss(y_true = mini_dev_labels, y_pred = lrPredictionProbabilities, \
                                        labels = crime_labels_mini_dev), "\n\n")
Parameters/Results:
  • Starting Parameters:
    • LogisticRegression: penalty='l2', solver='newton-cg', C=500, tol=0.01
    • AdaBoostClassifier: default
  • Optimized Parameters:
    • N/A
  • Result:
    • Multi-class Log Loss: 3.58335097496

Bagging Classifier

Hyperparameter tuning:

For the Bagging meta-classifier, we can seek to optimize the following classifier parameters: n_estimators (the number of base estimators in the ensemble), max_samples, max_features, bootstrap (whether samples are drawn with replacement), bootstrap_features (whether features are drawn with replacement), and oob_score (whether out-of-bag samples are used to estimate the generalization accuracy).

Bagging each classifier:

We will run the BaggingClassifier on each different classifier from above, using the classifier settings with optimized Multi-class Log Loss after hyperparameter tuning and calibration.
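
As one concrete instance, here is a minimal sketch bagging the tuned logistic regression from the AdaBoost section (the bagging parameter values are illustrative, not tuned):

In [ ]:
# Hedged sketch: bag the tuned logistic regression and score by multi-class log loss.
baggedLR = BaggingClassifier(
    base_estimator=LogisticRegression(penalty='l2', solver='newton-cg', C=500, tol=0.01),
    n_estimators=10, max_samples=0.5, max_features=0.5,
    bootstrap=True, oob_score=True)
baggedLR.fit(mini_train_data, mini_train_labels)
baggedLRProbabilities = baggedLR.predict_proba(mini_dev_data)
print("Multi-class Log Loss:", log_loss(y_true=mini_dev_labels, y_pred=baggedLRProbabilities,
                                        labels=crime_labels_mini_dev))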

Gradient Boosting Classifier

Hyperparameter tuning:

For the Gradient Boosting meta-classifier, we can seek to optimize the following classifier parameters: n_estimators (the number of boosting stages), max_depth, min_samples_leaf, and max_features.

Gradient Boosting:

We will run the GradientBoostingClassifier with loss = 'deviance' (loss = 'exponential' recovers the AdaBoost algorithm and supports only binary problems). Note that, unlike AdaBoost and Bagging, GradientBoostingClassifier always boosts regression trees internally and does not accept an arbitrary base estimator, so rather than wrapping the classifiers from above we tune its tree parameters directly and compare its optimized Multi-class Log Loss against theirs.
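
A minimal grid-search sketch (illustrative grid), assuming the mini_train_* splits and neg_log_loss scoring as above:

In [ ]:
# Hedged sketch: tune GradientBoostingClassifier's tree parameters by multi-class log loss.
gb_param_grid = {'n_estimators': [50, 100],
                 'max_depth': [3, 5],
                 'min_samples_leaf': [1, 5],
                 'max_features': [None, 'sqrt']}
gb_grid = GridSearchCV(GradientBoostingClassifier(loss='deviance'), gb_param_grid,
                       scoring='neg_log_loss', cv=3)
gb_grid.fit(mini_train_data, mini_train_labels)
print(gb_grid.best_params_, gb_grid.best_score_)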


In [ ]:
##KK: currently running a GridSearchCV, but this is the best classifier so far

Final evaluation on test data


In [ ]:
# Here we will likely use Pipeline and GridSearchCV to find the overall classifier
# with the best Multi-class Log Loss. This will be the last step, run after all
# attempts at feature addition, hyperparameter tuning, and calibration are completed
# and the corresponding performance metrics are gathered.
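
A hedged sketch of what that final search could look like (shown for logistic regression only; in practice each candidate classifier would get its own grid), assuming the full train_data/train_labels/test_data splits defined earlier:

In [ ]:
# Hedged sketch: scale features and grid-search one candidate pipeline end to end.
from sklearn.pipeline import Pipeline

final_pipeline = Pipeline([('scale', StandardScaler()),
                           ('clf', LogisticRegression(penalty='l2', solver='newton-cg'))])
final_param_grid = {'clf__C': [100, 500], 'clf__tol': [0.01, 0.001]}
final_grid = GridSearchCV(final_pipeline, final_param_grid, scoring='neg_log_loss', cv=3)
final_grid.fit(train_data, train_labels)
final_probabilities = final_grid.predict_proba(test_data)  # probabilities for submission
print(final_grid.best_params_)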

References

1) Hsiang, Solomon M., Marshall Burke, and Edward Miguel. "Quantifying the Influence of Climate on Human Conflict." Science, Vol. 341, Issue 6151, 2013.

2) Huang, Cheng-Lung, and Chieh-Jen Wang. "A GA-based feature selection and parameters optimization for support vector machines." Expert Systems with Applications, Vol. 31, 2006, pp. 231-240.

3) https://gallery.cortanaintelligence.com/Experiment/Evaluating-and-Parameter-Tuning-a-Decision-Tree-Model-1


In [31]:
A = [n for n in range(1,21,1)]+([i for i in range(25,50,5)])
print(A)


[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40, 45]

In [35]:
A = int(float(1.1))
print(A)


1

In [38]:
print([i for i in range(0,int(0.031*len(mini_train_data)),100)])


[0, 100, 200, 300, 400, 500, 600]

In [40]:
A = 1+[1,2,3]
print(A)


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-40-87b62dcba226> in <module>()
----> 1 A = 1+[1,2,3]
      2 print(A)

TypeError: unsupported operand type(s) for +: 'int' and 'list'

In [ ]: