In [19]:
from __future__ import division
import numpy as np
import pandas as pd
import itertools
import os
from sklearn import cluster
import json
In [4]:
data_path = '../../SFPD_Incidents_-_from_1_January_2003.csv'
data = pd.read_csv(data_path)
Then we want to filter the data set.
We do this by keeping only the rows with the category PROSTITUTION and removing the rows with an invalid Y coordinate (a latitude of exactly 90).
In [5]:
mask = (data.Category == 'PROSTITUTION') & (data.Y != 90)
filterByCat = data[mask]
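As a quick sanity check (a sketch, not part of the original notebook), the boolean mask can be summed to see how many incidents survive the filter:
In [ ]:
# mask is a boolean Series; summing it counts the matching rows.
print('%d of %d incidents remain after filtering' % (mask.sum(), len(data)))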
To reduce the amount of data we need to load on the page, we only extract the columns that we need.
In this case that is the district, longitude and latitude.
If this file were written to disk at this point, its size would be around 700KB (i.e. very small); a rough check of this is sketched after the next cell.
In [6]:
reducted = filterByCat[['PdDistrict','X','Y']]
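One rough way to verify the size claim (a sketch, not part of the original notebook): calling to_csv() without a path returns the CSV as a string, whose length approximates the file size in bytes.
In [ ]:
# Estimate the on-disk size of the reduced data set without writing a file.
csv_string = reducted.to_csv()
print('approx. size: %.0f KB' % (len(csv_string) / 1024.0))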
Then we define a function that we use to compute the cluster assignments, as well as the centroids.
In [7]:
X = data.loc[mask][['X','Y']]
centers = {}

def knn(k):
    # Despite its name, this runs K-means clustering (not k-nearest neighbours):
    # fit k clusters on the coordinates, then return each row's cluster label
    # together with the k cluster centres.
    md = cluster.KMeans(n_clusters=k).fit(X)
    return md.predict(X), md.cluster_centers_
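To illustrate what the function returns (a sketch, not part of the original notebook): for k = 2 it yields one cluster label per incident and a (2, 2) array of centroid coordinates.
In [ ]:
labels, cents = knn(2)
print(labels[:5])   # first five cluster labels, e.g. [0 1 0 0 1]
print(cents.shape)  # (2, 2): one (X, Y), i.e. (longitude, latitude), pair per cluster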
Now we compute the K-means clustering for every k from 2 to 6.
In [8]:
for i in range(2, 7):
    reducted['K' + str(i)], centers[i] = knn(i)
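Since the Out cells are not reproduced here, a small sketch (not part of the original notebook) of what centers now holds: for each k there is a (k, 2) array of centroid coordinates.
In [ ]:
for k in sorted(centers):
    print('k=%d -> centroid array of shape %s' % (k, str(centers[k].shape)))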
In [9]:
centers
Out[9]:
Here is a preview of our data, now enriched with the cluster labels in the K2..K6 columns.
In [10]:
reducted.head()
Out[10]:
In [11]:
reducted.to_csv('week_8_vis_1.csv', sep=',')
Below, the centroids are printed.
In [23]:
centers
Out[23]:
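The json module imported at the top is one natural way to make these centroids available to the page alongside the CSV. A possible export (a sketch; the output filename is hypothetical and not part of the original notebook):
In [ ]:
# Convert the numpy centroid arrays to plain lists so they are JSON-serialisable.
centers_as_lists = {k: c.tolist() for k, c in centers.items()}
with open('week_8_vis_1_centers.json', 'w') as f:
    json.dump(centers_as_lists, f)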