In [19]:
from __future__ import division
import numpy as np
import pandas as pd
import itertools
import os
from sklearn import cluster
import json
In [4]:
data_path = '../../SFPD_Incidents_-_from_1_January_2003.csv'
data = pd.read_csv(data_path)
Then we want to filter the data set.
We do this by keeping only the rows with the category PROSTITUTION and removing the rows with an invalid Y coordinate (a latitude of exactly 90).
In [5]:
mask = (data.Category == 'PROSTITUTION') & (data.Y != 90)
filterByCat = data[mask]
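As a quick sanity check (a sketch, not part of the original notebook), the boolean mask can be summed to see how many incidents survive the filter:
In [ ]:
# mask is a boolean Series; summing it counts the matching rows.
print('%d of %d incidents remain after filtering' % (mask.sum(), len(data)))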
To reduce the amount of data we need to load on the page, we only extract the columns that we need.
In this case that is the district, longitude and latitude.
If this file were written to disk at this point, its size would be around 700KB (i.e. very small); a rough check of this is sketched after the next cell.
In [6]:
reducted = filterByCat[['PdDistrict','X','Y']]
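One rough way to verify the size claim (a sketch, not part of the original notebook): calling to_csv() without a path returns the CSV as a string, whose length approximates the file size in bytes.
In [ ]:
# Estimate the on-disk size of the reduced data set without writing a file.
csv_string = reducted.to_csv()
print('approx. size: %.0f KB' % (len(csv_string) / 1024.0))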
Then we define a function that we use to compute the cluster assignments, as well as the centroids.
In [7]:
X = data.loc[mask][['X','Y']]
centers = {}

def knn(k):
    # Despite its name, this runs K-means clustering (not k-nearest neighbours):
    # fit k clusters on the coordinates, then return each row's cluster label
    # together with the k cluster centres.
    md = cluster.KMeans(n_clusters=k).fit(X)
    return md.predict(X), md.cluster_centers_
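To illustrate what the function returns (a sketch, not part of the original notebook): for k = 2 it yields one cluster label per incident and a (2, 2) array of centroid coordinates.
In [ ]:
labels, cents = knn(2)
print(labels[:5])   # first five cluster labels, e.g. [0 1 0 0 1]
print(cents.shape)  # (2, 2): one (X, Y), i.e. (longitude, latitude), pair per cluster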
Now we compute the K-means clustering for every k from 2 to 6.
In [8]:
for i in range(2, 7):
    reducted['K' + str(i)], centers[i] = knn(i)
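Since the Out cells are not reproduced here, a small sketch (not part of the original notebook) of what centers now holds: for each k there is a (k, 2) array of centroid coordinates.
In [ ]:
for k in sorted(centers):
    print('k=%d -> centroid array of shape %s' % (k, str(centers[k].shape)))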
In [9]:
centers
Out[9]:
Here is a preview of our data, now enriched with the cluster labels in the K2..K6 columns.
In [10]:
reducted.head()
Out[10]:
In [11]:
reducted.to_csv('week_8_vis_1.csv', sep=',')
Below, the centroids are printed.
In [23]:
centers
Out[23]:
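The json module imported at the top is one natural way to make these centroids available to the page alongside the CSV. A possible export (a sketch; the output filename is hypothetical and not part of the original notebook):
In [ ]:
# Convert the numpy centroid arrays to plain lists so they are JSON-serialisable.
centers_as_lists = {k: c.tolist() for k, c in centers.items()}
with open('week_8_vis_1_centers.json', 'w') as f:
    json.dump(centers_as_lists, f)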