From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz.
Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay.
From Sunset to SOMA, and Marina to Excelsior, this competition's dataset provides nearly 12 years of crime reports from across all of San Francisco's neighborhoods. Given time and location, you must predict the category of crime that occurred.
In [16]:
# Step 1 - importing classes we plan to use
import csv
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
import seaborn as sns
# show plots inline
%matplotlib inline
In [34]:
#
# Preparing the data
#
data = pd.read_csv('../input/train.csv', parse_dates=['Dates'], dtype={"X": np.float64, "Y": np.float64})
# Add a column containing the day of week as an integer
dow = {
    'Monday': 0,
    'Tuesday': 1,
    'Wednesday': 2,
    'Thursday': 3,
    'Friday': 4,
    'Saturday': 5,
    'Sunday': 6,
}
data['DOW'] = data.DayOfWeek.map(dow)
# Add a column containing the hour of day (Dates was already parsed by read_csv)
data['Hour'] = data.Dates.dt.hour
# display the first 5 rows
data.head()
Out[34]:
In [35]:
# Retrieve the sorted list of categories
cats = np.sort(data.Category.unique())
#
# First, take a look at the total of all categories
#
plt.figure(1, figsize=(8, 4))
plt.hist2d(
    data.Hour.values,
    data.DOW.values,
    bins=[24, 7],
    range=[[-0.5, 23.5], [-0.5, 6.5]],
    cmap=plt.cm.rainbow
)
plt.xticks(np.arange(0,24,6))
plt.xlabel('Time of Day')
plt.yticks(np.arange(0,7),['Mon','Tue','Wed','Thu','Fri','Sat','Sun'])
plt.ylabel('Day of Week')
plt.gca().invert_yaxis()
plt.title('Occurrence by Time and Day - All Categories')
Out[35]:
In [11]:
#
# Now look into each category
#
plt.figure(2, figsize=(16, 9))
plt.subplots_adjust(hspace=0.5)
for i in np.arange(1, cats.size + 1):
    ax = plt.subplot(5, 8, i)
    ax.set_title(cats[i - 1], fontsize=10)
    ax.axes.get_xaxis().set_visible(False)
    ax.axes.get_yaxis().set_visible(False)
    subset = data[data.Category == cats[i - 1]]
    plt.hist2d(
        subset.Hour.values,
        subset.DOW.values,
        bins=[24, 7],
        range=[[-0.5, 23.5], [-0.5, 6.5]],
        cmap=plt.cm.rainbow
    )
    plt.gca().invert_yaxis()
Step 1: Import the class you plan to use
Step 2: "Instantiate" the "estimator"
Step 3: Fit the model with data (aka "model training")
Step 4: Predict the response for a new observation
OK, so what are we actually trying to do? Given location, you must predict the category of crime that occurred. The four steps are sketched below before we apply them to this dataset.
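As a minimal sketch of the four-step pattern (on toy data made up for illustration, not this competition's dataset):
# Step 1: import the estimator class
from sklearn.neighbors import KNeighborsClassifier
# Step 2: instantiate the estimator with its hyperparameters
knn_demo = KNeighborsClassifier(n_neighbors=3)
# Step 3: fit the model with training data
X_toy = [[0, 0], [1, 1], [2, 2], [3, 3]]
y_toy = ['a', 'a', 'b', 'b']
knn_demo.fit(X_toy, y_toy)
# Step 4: predict the response for a new observation
print(knn_demo.predict([[1.2, 1.2]]))  # -> ['a']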
In [18]:
# Separate test and train sets out of the original train set.
msk = np.random.rand(len(data)) < 0.8
knn_train = data[msk]
knn_test = data[~msk]
n = len(knn_test)
print("Original size: %s" % len(data))
print("Train set: %s" % len(knn_train))
print("Test set: %s" % len(knn_test))
# Prepare data sets
x = knn_train[['X', 'Y']]
y = knn_train['Category'].astype('category')
actual = knn_test['Category'].astype('category')
# Log-loss helpers
def llfun1(act, pred):
    """Log loss for binary probabilities (kept for reference; not used below)."""
    epsilon = 1e-15
    pred = np.maximum(epsilon, pred)
    pred = np.minimum(1 - epsilon, pred)
    ll = sum(act * np.log(pred) + (1 - act) * np.log(1 - pred))
    ll = ll * -1.0 / len(act)
    return ll
def llfun(act, pred):
    """Log loss for hard 1/0 predictions: every wrong label costs -log(1e-15)."""
    return (-(~(act == pred)).astype(int) * math.log(1e-15)).sum() / len(act)
logloss = []
for k in range(1, 50):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x, y)
    # Predict on the held-out set
    outcome = knn.predict(knn_test[['X', 'Y']])
    # Log loss
    logloss.append(llfun(actual, outcome))
https://www.kaggle.com/wiki/LogarithmicLoss
Log loss is the negative logarithm of the likelihood function for a Bernoulli random variable.
In plain English, this error metric is used where contestants have to predict whether something is true or false, with a probability (likelihood) ranging from definitely true (1) through complete uncertainty (0.5) to definitely false (0).
Taking the log of the error provides extreme punishment for being both confident and wrong. In the worst possible case, a single prediction that something is definitely true (1) when it is actually false will add an infinite penalty to your error score and make every other entry pointless. In Kaggle competitions, predictions are bounded away from the extremes by a small value in order to prevent this.
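For a multi-class problem like this one, the metric generalizes to logloss = -(1/N) * sum_i sum_j y_ij * log(p_ij), where p_ij is the predicted probability of class j for observation i. The llfun above approximates this with hard 0/1 predictions; a sketch of the probabilistic version, assuming sklearn's predict_proba and log_loss (not what this notebook actually submits):
from sklearn.metrics import log_loss

# Sketch (assumption, not the notebook's approach): score probabilistic
# predictions on the held-out rows with the competition's metric.
proba = knn.predict_proba(knn_test[['X', 'Y']])   # one column per class
score = log_loss(actual, proba, labels=knn.classes_)
print("multi-class log loss: %.4f" % score)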
Let's plot the log loss as a function of k for our nearest-neighbor classifier.
In [19]:
plt.plot(range(1, 50), logloss)  # x-axis is k itself, not the list index
plt.xlabel('n_neighbors (k)')
plt.ylabel('log loss')
plt.savefig('n_neighbors_vs_logloss.png')
Based on the log loss, we can see that a k of around 40 is optimal. Now let's predict using the test data.
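As an aside, the same search could be run with cross-validation instead of a single random split; a sketch using GridSearchCV with the neg_log_loss scorer (my suggestion, not this notebook's approach, and slow on a dataset this size):
from sklearn.model_selection import GridSearchCV

# Sketch: 3-fold CV over a coarse grid of k, scored by negative log loss.
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': [10, 20, 30, 40, 50]},
    scoring='neg_log_loss',
    cv=3,
)
grid.fit(x, y)
print(grid.best_params_, -grid.best_score_)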
In [20]:
# Submit for K=40
knn = KNeighborsClassifier(n_neighbors=40)
knn.fit(x, y)
# predict from our test set
test = pd.read_csv('../input/test.csv', parse_dates=['Dates'], dtype={"X": np.float64, "Y": np.float64})
x_test = test[['X', 'Y']]
outcomes = knn.predict(x_test)
submit = pd.DataFrame({'Id': test.Id.tolist()})
for category in y.cat.categories:
submit[category] = np.where(outcomes == category, 1, 0)
submit.to_csv('k_nearest_neigbour.csv', index = False)
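Because the metric is log loss, hard 0/1 columns are punished severely whenever the single predicted class is wrong. A sketch of a probabilistic submission instead, using predict_proba (an alternative, not what this notebook writes; the output file name is hypothetical):
# Sketch: submit class probabilities rather than 0/1 flags.
proba = knn.predict_proba(x_test)                        # one column per class
submit_proba = pd.DataFrame(proba, columns=knn.classes_)
submit_proba.insert(0, 'Id', test.Id)
submit_proba.to_csv('k_nearest_neighbour_proba.csv', index=False)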
In [36]:
# Map PdDistrict to an integer id
unique_pd_district = data["PdDistrict"].unique()
pd_district_mapping = {c: i for i, c in enumerate(unique_pd_district)}
data['PdDistrictId'] = data.PdDistrict.map(pd_district_mapping)
print(data.describe())
data.tail()
Out[36]:
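The same mapping can be built in one step with pandas' factorize (a suggested equivalent, not what the notebook uses); keeping the uniques around lets us apply an identical mapping to the test set later:
# Sketch: pd.factorize assigns integer codes in order of first appearance,
# matching the dict comprehension above.
codes, uniques = pd.factorize(data['PdDistrict'])
data['PdDistrictId'] = codes
pd_district_mapping = {c: i for i, c in enumerate(uniques)}  # reuse on the test set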
In [37]:
# store feature matrix in "X"
X = data[['Hour','DOW','X','Y','PdDistrictId']]
# store response vector in "y"
y = data['Category'].astype('category')
# Fit for K=40
knn = KNeighborsClassifier(n_neighbors=40)
knn.fit(X, y)
Out[37]:
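One caveat worth flagging: k-NN is distance-based, and these features sit on very different scales (Hour is 0-23, the X coordinate is around -122, district ids are roughly 0-9), so the largest-magnitude feature dominates the distance. A sketch of standardizing first with a Pipeline (my suggestion, not part of the original notebook):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Sketch: scale each feature to zero mean / unit variance before
# the neighbor search.
model = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=40)),
])
model.fit(X, y)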
In [38]:
test = pd.read_csv('../input/test.csv', parse_dates=['Dates'], dtype={"X": np.float64, "Y": np.float64})
In [39]:
# clean up test set
test['DOW'] = test.DayOfWeek.map(dow)
test['Hour'] = test.Dates.dt.hour  # Dates already parsed by read_csv
test['PdDistrictId'] = test.PdDistrict.map(pd_district_mapping)
test.tail()
Out[39]:
In [40]:
# Predictions for the test set
X_test = test[['Hour','DOW','X','Y','PdDistrictId']]
outcomes = knn.predict(X_test)
submit = pd.DataFrame({'Id': test.Id.tolist()})
for category in y.cat.categories:
submit[category] = np.where(outcomes == category, 1, 0)
submit.to_csv('k_nearest_neigbour_2.csv', index = False)
In [41]:
# Let's see how strongly DOW, Hour, and PdDistrictId relate to Category
# (pairplot creates its own figure, so no plt.figure() is needed)
g = sns.pairplot(data=data[["Category", "DOW", "Hour", "PdDistrictId"]],
                 hue="Category", dropna=True)
Out[41]:
In [42]:
g.savefig("seaborn_pair_plot.png")
In [46]:
# Final standalone run: train k-NN on Hour/DOW/X/Y and write the submission
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
# Map day of week to an integer
dow = {
    'Monday': 0,
    'Tuesday': 1,
    'Wednesday': 2,
    'Thursday': 3,
    'Friday': 4,
    'Saturday': 5,
    'Sunday': 6,
}
data = pd.read_csv('../input/train.csv', parse_dates=['Dates'], dtype={"X": np.float64, "Y": np.float64})
data['DOW'] = data.DayOfWeek.map(dow)
data['Hour'] = data.Dates.dt.hour
X = data[['Hour', 'DOW', 'X', 'Y']]
y = data['Category'].astype('category')
knn = KNeighborsClassifier(n_neighbors=39)  # k near the optimum found above
knn.fit(X, y)
test = pd.read_csv('../input/test.csv', parse_dates=['Dates'], dtype={"X": np.float64, "Y": np.float64})
test['DOW'] = test.DayOfWeek.map(dow)
test['Hour'] = test.Dates.dt.hour
X_test = test[['Hour', 'DOW', 'X', 'Y']]
outcomes = knn.predict(X_test)
submit = pd.DataFrame({'Id': test.Id.tolist()})
for category in y.cat.categories:
    submit[category] = np.where(outcomes == category, 1, 0)
submit.to_csv('k_nearest_neigbour3.csv', index=False)
In [ ]: