import csv
import numpy as np
from scipy.spatial import distance
from sklearn.cluster import DBSCAN, KMeans

def clusters_to_csv(labels, types, coords):
    '''
    Helper function to turn scikit-learn clusters into abbreviated CSVs.
    Prints one cluster_id,incident_type,lat,lng line per incident.
    '''
    for k in set(labels):
        # Indices of every incident assigned to cluster k
        class_members = [index[0] for index in np.argwhere(labels == k)]
        for index in class_members:
            print('%s,%s,%s' % (int(k), types[index], '{0},{1}'.format(*coords[index])))

## Preparing the data

After we import our CSV of crime data, we need to do a couple of things to get it ready for clustering: extract the coordinate pairs that we want to cluster, and pull together some simple labels so we know which incident each point refers to.
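
The cells below assume `data` already holds the CSV rows as a list of dictionaries keyed by column name. Here is a minimal sketch of one way to build it with the `csv` module imported above. `load_incidents` and the file name in the comment are hypothetical, and the two sample rows are made up for illustration:

```python
import csv
import io

def load_incidents(fileobj):
    """Read incident rows into a list of dicts keyed by column name."""
    return list(csv.DictReader(fileobj))

# Tiny in-memory stand-in for the real file (illustrative rows only).
sample = io.StringIO(
    "ExtNatureDisplayName,lat,lng\n"
    "TRAFFIC,38.9379,-92.3343\n"
    "DWI,,\n"
)
data = load_incidents(sample)

# With a real file (hypothetical name) this would look like:
# with open('crime_data.csv') as f:
#     data = load_incidents(f)
```

Every value comes back as a string, which is why the coordinate step converts `lat` and `lng` with `float()` and skips rows where `lat` is empty.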

# This part just splits out the latitude and longitude coordinate fields for each incident, which we need for mapping.
coords = [(float(d['lat']), float(d['lng'])) for d in data if len(d['lat']) > 0]
print(coords[:10])

[(38.9379, -92.3343), (38.9515, -92.3265), (38.93819, -92.35111), (38.96017, -92.32459), (38.99221, -92.31791), (38.9526, -92.3265), (38.9505, -92.3276), (38.95723, -92.34396), (38.9499, -92.3276), (38.96531, -92.34368)]

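# And this creates a matching array of incident types. We skip the same rows we
# skipped for coords, so types[i] always describes coords[i].
types = [d['ExtNatureDisplayName'] for d in data if len(d['lat']) > 0]
print(types[:10])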

## K-means clustering

Here we'll review the idea of k-means clustering you discussed last week and see how it applies to our crime data. We'll start with three clusters.

number_of_clusters = 3
kmeans = KMeans(n_clusters=number_of_clusters)
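kmeans.fit(coords)

# Print each incident along with the cluster it was assigned to
clusters_to_csv(kmeans.labels_, types, coords)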

0,TRAFFIC,38.9379,-92.3343
0,DWI,38.9515,-92.3265
0,CHECK SUBJECT,38.93819,-92.35111
0,TRAFFIC STOP,38.96017,-92.32459
0,TRESPASSING,38.99221,-92.31791
0,TRAFFIC STOP,38.9526,-92.3265
0,LEAVING THE SCENE ACCIDENT,38.9505,-92.3276
0,DISTURBANCE,38.95723,-92.34396
0,DWI,38.9499,-92.3276
0,LAW ALARM,38.96531,-92.34368
0,LARCENY MV,38.96297,-92.32342
0,TRAFFIC,38.9381,-92.3593
0,TRAFFIC STOP,38.96474,-92.33143
0,CHECK SUBJECT,38.94987,-92.32676
0,SUSPICIOUS VEHICLE,38.9602,-92.32905
0,ASSIST CITIZEN,38.9633,-92.3387
0,911 CHECKS,38.954167,-92.33414
0,DISTURBANCE,38.9633,-92.3398
0,SUSPICIOUS INCIDENT,38.9585,-92.33271
0,LARCENY,39.00124,-92.31707
0,SUSPICIOUS PERSON,38.96419,-92.37769
0,DISTURBANCE,38.98455,-92.36872
0,911 CHECKS,38.993,-92.3174
0,VANDALISM,39.0061,-92.31822
0,ASSAULT,38.98646,-92.32027
0,911 CHECKS,38.95544,-92.36135
0,LAW ALARM,38.96772,-92.31883
0,MISCHIEF,38.96164,-92.32737
0,CHECK SUBJECT,38.96429,-92.32096
0,TRAFFIC,38.9564,-92.3598
0,TRESPASSING,38.99595,-92.31625
0,TRAFFIC STOP,38.9101,-92.3332
0,ANIMAL COMPLAINT,38.95202,-92.3988
0,C&I DRIVING,38.9693,-92.3802
0,SPECIAL ASSIGNMENT,38.9525,-92.3305
0,TRAFFIC,38.9525,-92.3217
0,LAW ALARM,38.95165,-92.34255
0,SPECIAL ASSIGNMENT,38.95248,-92.33047
0,LAW ALARM,38.95166,-92.35094
0,911 CHECKS,38.90054,-92.35527
1,DISTURBANCE,0.0,0.0
1,911 CHECKS,0.0,0.0
1,SPECIAL ASSIGNMENT,0.0,0.0
1,911 CHECKS,0.0,0.0
1,911 CHECKS,0.0,0.0
1,LAW ALARM,0.0,0.0
1,911 CHECKS,0.0,0.0
1,911 CHECKS,0.0,0.0
1,911 CHECKS,0.0,0.0
1,TRAFFIC STOP,0.0,0.0
1,CIVIL MATTER,0.0,0.0
1,ACCIDENT,0.0,0.0
1,ACCIDENT,0.0,0.0
1,WARRANT,0.0,0.0
1,ACCIDENT,0.0,0.0
1,LAW ALARM,0.0,0.0
2,LAW ALARM,38.97468,-92.28922

The data comes out in the format of cluster_id,incident_type,lat,lng. If we save it to a CSV file, we can load it into Google's simple map viewer tool to see how it looks.
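
Saving the output can be done by copying the printed lines into a file, or by writing them directly. The sketch below takes the second route with the `csv` module. `write_clusters_csv` and the file name in the comment are hypothetical, the header row is an addition that the printed output does not include, and the two sample incidents are for illustration only:

```python
import csv
import io

def write_clusters_csv(fileobj, labels, types, coords):
    """Write one cluster_id,incident_type,lat,lng row per incident."""
    writer = csv.writer(fileobj, lineterminator='\n')
    writer.writerow(['cluster_id', 'incident_type', 'lat', 'lng'])
    # Sort by cluster id so each cluster's rows sit together, as in the printed output
    rows = sorted(zip(labels, types, coords), key=lambda row: row[0])
    for label, incident_type, (lat, lng) in rows:
        writer.writerow([int(label), incident_type, lat, lng])

# Illustrative values; in the notebook these would be kmeans.labels_, types and coords.
out = io.StringIO()
write_clusters_csv(out, [1, 0], ['DISTURBANCE', 'TRAFFIC'],
                   [(0.0, 0.0), (38.9379, -92.3343)])
csv_text = out.getvalue()

# To write a real file instead (hypothetical file name):
# with open('clusters.csv', 'w') as f:
#     write_clusters_csv(f, kmeans.labels_, types, coords)
```

Using `csv.writer` rather than string formatting means an incident type that happens to contain a comma is quoted correctly.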

As you can see, segmenting the data into only three clusters doesn't give us anything useful: in the output above, one whole cluster is taken up by the incidents recorded at 0.0,0.0. Let's try a bigger number.
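
Before mapping a clustering, it can help to count how many incidents landed in each cluster, since very uneven counts are a hint that the clusters are not telling us much. A small sketch, where `cluster_sizes` is a hypothetical helper and the labels are made up for illustration:

```python
from collections import Counter

def cluster_sizes(labels):
    """Return {cluster_id: number_of_incidents} for a sequence of cluster labels."""
    return dict(Counter(int(label) for label in labels))

# Made-up labels; in the notebook you would pass kmeans.labels_ instead.
sizes = cluster_sizes([0, 0, 0, 1, 1, 2])
print(sizes)
```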

