In [19]:
from __future__ import division  # __future__ imports must come before all others

import itertools
import json
import os

import numpy as np
import pandas as pd
from sklearn import cluster

Creating datasets for 2D

We begin by reading the CSV file into a data frame, which makes the datasets easier to create.


In [4]:
data_path = '../../SFPD_Incidents_-_from_1_January_2003.csv'

data = pd.read_csv(data_path)

Next we filter the data set: we keep only the rows whose category is PROSTITUTION and drop the rows with an invalid Y coordinate (Y equal to 90).


In [5]:
mask = (data.Category == 'PROSTITUTION') & (data.Y != 90)

filterByCat = data[mask]
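The filtering step above can be tried on a small stand-in table. This is only a sketch: the `toy` frame and its rows are made up for illustration and are not taken from the SFPD file; only the two conditions mirror the mask used above.

```python
import pandas as pd

# Hypothetical stand-in for the SFPD data; these rows are invented.
toy = pd.DataFrame({
    'Category': ['PROSTITUTION', 'ASSAULT', 'PROSTITUTION', 'PROSTITUTION'],
    'X': [-122.41, -122.42, -120.50, -122.40],
    'Y': [37.78, 37.77, 90.0, 37.76],
})

# Same pattern as above: a row is kept only if both conditions hold.
toy_mask = (toy.Category == 'PROSTITUTION') & (toy.Y != 90)
toy_filtered = toy[toy_mask]

# Row 1 has the wrong category and row 2 has the placeholder Y value,
# so only rows 0 and 3 remain.
print(toy_filtered.index.tolist())
```

Note that the parentheses around each comparison are required, because `&` binds more tightly than `==` and `!=` in Python.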

To reduce the amount of data the page has to load, we extract only the columns we need: the district (PdDistrict), the longitude (X) and the latitude (Y). If this file were written to disk at this point, its size would be around 700 KB (i.e. very small).


In [6]:
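# .copy() gives an independent frame, so adding columns later does not
# trigger pandas' SettingWithCopyWarning.
reducted = filterByCat[['PdDistrict','X','Y']].copy()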
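One way to check a size estimate like the one above without touching the disk is to serialize to a string and measure it. The sketch below assumes a tiny made-up frame (`toy`), so the number it prints says nothing about the real file; it only shows the approach.

```python
import pandas as pd

# Hypothetical rows, invented for illustration.
toy = pd.DataFrame({
    'Category': ['PROSTITUTION', 'PROSTITUTION'],
    'PdDistrict': ['TENDERLOIN', 'SOUTHERN'],
    'X': [-122.406402, -122.400916],
    'Y': [37.786614, 37.785457],
})

# Keep only the columns needed on the page.
toy_reduced = toy[['PdDistrict', 'X', 'Y']]

# to_csv() without a path returns the CSV text instead of writing a file.
csv_text = toy_reduced.to_csv()
size_kb = len(csv_text.encode('utf-8')) / 1024.0

print('%.3f KB' % size_kb)
```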

Then we define a function that calculates the cluster assignments as well as the centroids.


In [7]:
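# Coordinates (longitude, latitude) of the filtered rows.
X = data.loc[mask, ['X', 'Y']]

# Maps each k to the array of cluster centroids found for that k.
centers = {}

# Note: despite its name, this function runs k-means clustering,
# not k-nearest neighbours.
def knn(k):
    md = cluster.KMeans(n_clusters=k).fit(X)
    # Return the cluster label of every point and the k centroids.
    return md.predict(X), md.cluster_centers_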
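To see what the two return values look like, here is a minimal sketch on synthetic data. The two blobs, the variable names and the `n_init` and `random_state` settings are assumptions made for this example; they are not part of the notebook's own call.

```python
import numpy as np
from sklearn import cluster

# Two well-separated synthetic blobs of 50 points each (made-up data).
rng = np.random.RandomState(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.1, size=(50, 2))
points = np.vstack([blob_a, blob_b])

# Fit k-means with k=2; random_state is fixed only to make the sketch repeatable.
model = cluster.KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

labels = model.predict(points)        # one cluster index per point
centroids = model.cluster_centers_    # array of shape (k, 2)

print(labels.shape, centroids.shape)
```

Each entry of `labels` is an index into `centroids`, so `centroids[labels[i]]` is the centre of the cluster that point `i` was assigned to.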

Now we calculate the k-means clusterings for k = 2..6 (the helper is named knn, but it performs k-means clustering rather than k-nearest neighbours).


In [8]:
for i in range(2,7):
    reducted['K'+str(i)], centers[i] = knn(i)
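The cluster indices are arbitrary labels: the same group of points can receive a different number for a different k, or on a different run, because k-means starts from random initial centroids. If repeatable output is wanted, one option is to fix the seed. The sketch below assumes made-up points and an arbitrary seed of 42; the notebook itself does not set `random_state`.

```python
import numpy as np
from sklearn import cluster

# Made-up 2D points, for illustration only.
rng = np.random.RandomState(1)
points = rng.uniform(low=0.0, high=1.0, size=(60, 2))

def fit_labels(k, seed):
    """Fit k-means with a fixed seed and return the labels and centroids."""
    model = cluster.KMeans(n_clusters=k, n_init=10, random_state=seed).fit(points)
    return model.labels_, model.cluster_centers_

# Two runs with the same seed give identical labels and centroids.
labels_a, centers_a = fit_labels(3, seed=42)
labels_b, centers_b = fit_labels(3, seed=42)

print(np.array_equal(labels_a, labels_b))
```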




In [9]:
centers


Out[9]:
{2: array([[-122.41829076,   37.76059012],
        [-122.41771519,   37.78740293]]),
 3: array([[-122.41771969,   37.78759671],
        [-122.41565475,   37.76154985],
        [-122.47831658,   37.74518815]]),
 4: array([[-122.41771794,   37.78759954],
        [-122.41562085,   37.76167551],
        [-122.48642315,   37.75842179],
        [-122.45730222,   37.71961699]]),
 5: array([[-122.41873268,   37.78764484],
        [-122.41583586,   37.76146322],
        [-122.48642315,   37.75842179],
        [-122.45730222,   37.71961699],
        [-122.4052264 ,   37.78511642]]),
 6: array([[-122.41599101,   37.76173894],
        [-122.41873463,   37.78764477],
        [-122.4621254 ,   37.72049778],
        [-122.4052441 ,   37.78512395],
        [-122.40458723,   37.72870227],
        [-122.48647474,   37.75851639]])}

Here is a preview of our data, now enriched with one cluster-label column (K2 to K6) for each value of k.


In [10]:
reducted.head()


Out[10]:
      PdDistrict            X          Y  K2  K3  K4  K5  K6
670   TENDERLOIN  -122.406402  37.786614   1   0   0   4   3
724   TENDERLOIN  -122.408202  37.786885   1   0   0   4   3
727   TENDERLOIN  -122.408231  37.787359   1   0   0   4   3
1105    SOUTHERN  -122.400916  37.785457   1   0   0   4   3
1106    SOUTHERN  -122.400916  37.785457   1   0   0   4   3

Write our result

Lastly, we write the result to disk so that we can use it on our page.


In [11]:
reducted.to_csv('week_8_vis_1.csv', sep=',')
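The centroids could be made available to the page in a similar way. The notebook does not show how this was done, so the following is only a sketch: it assumes JSON as the format, and `centers_example` holds just the k = 2 centroids copied from the printed output above. NumPy arrays are not JSON-serializable directly, so they are converted to nested lists first, and the integer keys become strings.

```python
import json
import numpy as np

# The k = 2 centroids as printed above; the other values of k are omitted here.
centers_example = {
    2: np.array([[-122.41829076, 37.76059012],
                 [-122.41771519, 37.78740293]]),
}

# Convert each array to a nested list and each key to a string.
serializable = {str(k): v.tolist() for k, v in centers_example.items()}
text = json.dumps(serializable, sort_keys=True)

# Round trip to confirm that nothing was lost.
restored = json.loads(text)
print(restored['2'][0])
```

The resulting `text` could then be written to a file with the ordinary `open(...).write(text)` pattern, under whatever file name the page expects.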

Below are the centroids, printed for each value of k.


In [23]:
centers


Out[23]:
{2: array([[-122.41829076,   37.76059012],
        [-122.41771519,   37.78740293]]),
 3: array([[-122.41771969,   37.78759671],
        [-122.41565475,   37.76154985],
        [-122.47831658,   37.74518815]]),
 4: array([[-122.41771794,   37.78759954],
        [-122.41562085,   37.76167551],
        [-122.48642315,   37.75842179],
        [-122.45730222,   37.71961699]]),
 5: array([[-122.41873268,   37.78764484],
        [-122.41583586,   37.76146322],
        [-122.48642315,   37.75842179],
        [-122.45730222,   37.71961699],
        [-122.4052264 ,   37.78511642]]),
 6: array([[-122.41599101,   37.76173894],
        [-122.41873463,   37.78764477],
        [-122.4621254 ,   37.72049778],
        [-122.4052441 ,   37.78512395],
        [-122.40458723,   37.72870227],
        [-122.48647474,   37.75851639]])}