We need to choose the time and location ranges to use for our joins. In this notebook, I explore several ways of grouping the 911 reports and try to pick a grouping whose groups contain neither too many nor too few reports.
Note that I don't remove duplicates here, because we are mapping (time_range, location_range) to {0, 1}, where 0 means no crime happened in that time_range and location_range, and 1 means one or more crimes happened. Duplicate rows therefore do not change the result.
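As a minimal sketch of this mapping, the cell below builds the {0, 1} indicator for a toy set of reports and checks that dropping duplicates leaves it unchanged. The column names `time_bin` and `loc_bin`, the toy data, and the helper `crime_indicator` are illustrative assumptions, not part of this notebook's dataset.

```python
import pandas as pd

# Hypothetical toy data: each row is one report already assigned to a
# time bin and a location bin (names are illustrative only).
reports = pd.DataFrame({
    "time_bin": [0, 0, 0, 1, 2, 2],
    "loc_bin":  ["a", "a", "b", "a", "b", "b"],
})

def crime_indicator(frame, all_time_bins, all_loc_bins):
    """Return 1 for every (time_bin, loc_bin) cell with at least one report, else 0."""
    counts = frame.groupby(["time_bin", "loc_bin"]).size()
    # Enumerate every possible cell so that empty cells show up as 0.
    full_index = pd.MultiIndex.from_product(
        [all_time_bins, all_loc_bins], names=["time_bin", "loc_bin"])
    return (counts.reindex(full_index, fill_value=0) > 0).astype(int)

labels = crime_indicator(reports, [0, 1, 2], ["a", "b"])
labels_dedup = crime_indicator(reports.drop_duplicates(), [0, 1, 2], ["a", "b"])
print(labels)
print(labels.equals(labels_dedup))
```

Because the indicator only asks whether a cell's count is above zero, repeated rows can raise a count but cannot flip a label.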
In [182]:
from pylab import *
%matplotlib inline
import pandas as pd
# The first row of the file is a copy of the labels; header=0 tells pandas to
# discard it and use the names given here instead.
df = pd.read_csv("data/sfpd_incident_2014.csv",
                 names=['IncidntNum', 'Category', 'Descript', 'DayOfWeek', 'Date', 'Time',
                        'PdDistrict', 'Resolution', 'Location', 'X', 'Y'],
                 header=0,
                 na_values=['-'])
In [183]:
# Work with float Series so that min()/max() return plain numbers for "%f".
x = df['X'].astype(float)
y = df['Y'].astype(float)
min_x_loc, max_x_loc = x.min(), x.max()
min_y_loc, max_y_loc = y.min(), y.max()
print("min X location: %f" % min_x_loc)
print("max X location: %f" % max_x_loc)
print("min Y location: %f" % min_y_loc)
print("max Y location: %f" % max_y_loc)
range_x = max_x_loc - min_x_loc
range_y = max_y_loc - min_y_loc
print("range X: %f" % range_x)
print("range Y: %f" % range_y)
In [184]:
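import numpy as np
num_loc_bins = 10  # bins per axis; the location grid has num_loc_bins ** 2 cells
# Lower edges of the bins. linspace with endpoint=False yields exactly
# num_loc_bins edges, whereas arange with a float step can produce an extra one.
x_bins = np.linspace(float(min_x_loc), float(max_x_loc), num_loc_bins, endpoint=False)
y_bins = np.linspace(float(min_y_loc), float(max_y_loc), num_loc_bins, endpoint=False)
print(x_bins)
print(y_bins)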
In [185]:
# Map each possible x or y location to a bin. A bin is identified by its lower
# edge, i.e. it covers [edge, edge + bin width).
def round_down_x(xloc):
    # Walk the edges from largest to smallest and return the first one <= xloc.
    # Returns None if xloc lies below the lowest edge.
    for edge in x_bins[::-1]:
        if xloc >= edge:
            return edge

def round_down_y(yloc):
    # Same as round_down_x, but for the y edges.
    for edge in y_bins[::-1]:
        if yloc >= edge:
            return edge
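The loop above handles one value at a time. As a possible alternative, here is a sketch of the same round-down rule applied to a whole array at once with `np.searchsorted`. The function name `round_down_to_bin` and the example edges are assumptions made for illustration; unlike the loop, values below the lowest edge map to NaN rather than None.

```python
import numpy as np

def round_down_to_bin(values, bin_edges):
    """Map each value to the largest edge that is <= the value (NaN if none)."""
    values = np.asarray(values, dtype=float)
    bin_edges = np.asarray(bin_edges, dtype=float)  # must be sorted ascending
    # side="right" puts a value equal to an edge into the bin starting at that edge.
    idx = np.searchsorted(bin_edges, values, side="right") - 1
    # idx == -1 means the value is below the lowest edge.
    return np.where(idx >= 0, bin_edges[np.clip(idx, 0, None)], np.nan)

edges = np.array([0.0, 1.0, 2.0])  # illustrative edges, not the notebook's bins
binned = round_down_to_bin([0.0, 0.5, 1.0, 2.7, -1.0], edges)
print(binned)
```

On the full dataset this would avoid calling a Python function once per row, which may matter once the binning is applied to every report rather than to `df.head()`.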
In [214]:
dff = df.head().copy()
# Tag each row with its grid cell, identified by the (x, y) lower bin edges.
dff['XY'] = [(round_down_x(float(x)), round_down_y(float(y)))
             for x, y in zip(dff['X'], dff['Y'])]
# Keep the descriptive columns plus the new grid cell.
dff = dff[['Category', 'Descript', 'DayOfWeek', 'Date', 'Time', 'PdDistrict', 'Resolution', 'XY']]
dff
Out[214]:
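Once every report carries a grid cell, one way to judge whether a grid is too coarse or too fine is to look at how many reports land in each cell. The sketch below does this with `np.histogram2d` for a few grid sizes. The coordinates are randomly generated stand-ins for the X and Y columns, and `cell_counts` is a hypothetical helper, so the printed numbers say nothing about the real data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical coordinates standing in for the X and Y columns.
xs = rng.uniform(-122.52, -122.36, size=1000)
ys = rng.uniform(37.70, 37.82, size=1000)

def cell_counts(xs, ys, num_loc_bins):
    """Count the points in each cell of a num_loc_bins x num_loc_bins grid."""
    counts, _x_edges, _y_edges = np.histogram2d(xs, ys, bins=num_loc_bins)
    return counts.astype(int)

# Columns: grid size, fewest reports in a cell, median, most, number of empty cells.
for n in (5, 10, 20):
    counts = cell_counts(xs, ys, n)
    print(n, counts.min(), int(np.median(counts)), counts.max(), int((counts == 0).sum()))
```

A grid with many empty cells is probably too fine, while one where a handful of cells hold most of the reports is probably too coarse.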
In [110]:
dff = df.head()
# Print the raw coordinates of the first few rows as a sanity check.
for _, row in dff.iterrows():
    print(row['X'], row['Y'])
dff
Out[110]: