This exploratory analysis looks at how long it takes for a given open case to be closed.
In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import bokeh
from bokeh.io import output_notebook
output_notebook()
import os
In [2]:
DATA_STREETLIGHT_CASES_URL = 'https://data.sfgov.org/api/views/c53t-rr3f/rows.json?accessType=DOWNLOAD'
DATA_STREETLIGHT_CASES_LOCAL = 'DATA_STREETLIGHT_CASES.json'
data_path = DATA_STREETLIGHT_CASES_URL
if os.path.isfile(DATA_STREETLIGHT_CASES_LOCAL):
    data_path = DATA_STREETLIGHT_CASES_LOCAL
In [3]:
import json
from urllib.request import urlopen

def _load_data(url):
    # Socrata JSON exports keep the schema under meta.view.columns and the rows under data.
    response = urlopen(url)
    raw_data = json.loads(response.read())
    columns = [col['name'] for col in raw_data['meta']['view']['columns']]
    rows = raw_data['data']
    return pd.DataFrame(data=rows, columns=columns)
In [4]:
df = _load_data(data_path)
In [5]:
df.columns = [col.lower().replace(' ', '_') for col in df.columns]
In [6]:
df.columns
Out[6]:
In [7]:
df['opened'] = pd.to_datetime(df.opened)
df['opened_dayofweek'] = df.opened.dt.dayofweek
df['opened_month'] = df.opened.dt.month
df['opened_year'] = df.opened.dt.year
df['opened_dayofmonth'] = df.opened.dt.day
df['closed'] = pd.to_datetime(df.closed)
df['closed_dayofweek'] = df.closed.dt.dayofweek
df['closed_month'] = df.closed.dt.month
df['closed_year'] = df.closed.dt.year
df['closed_dayofmonth'] = df.closed.dt.day
In [8]:
df['delta'] = (df.closed - df.opened).dt.days
df['is_open'] = pd.isnull(df.closed)
df['opened_weekend'] = df.opened_dayofweek >= 5
df['closed_weekend'] = df.closed_dayofweek >= 5
df['target'] = df.delta <= 2  # True if the case was closed within two days
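Since the whole analysis revolves around resolution time, it is worth sanity-checking the distribution of delta before engineering more features. A minimal sketch (the 60-day clip is an arbitrary readability choice, not part of the original analysis):
closed_cases = df[~df.is_open]
print(closed_cases.delta.describe())             # summary statistics of days-to-close
closed_cases.delta.clip(upper=60).hist(bins=60)  # clip the long tail for readability
plt.xlabel('days to close (clipped at 60)')
plt.ylabel('number of cases')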
In [9]:
from geopy.distance import geodesic  # vincenty was removed in geopy 2.x; geodesic is its replacement
df['latitude'] = df.point.apply(lambda e: float(e[1]))
df['longitude'] = df.point.apply(lambda e: float(e[2]))
Converting the coordinates into grid distances is useful for applying unsupervised learning (clustering).
WARNING: to make this transformation more accurate, we should use a grid anchored at the maximum and minimum coordinates of San Francisco itself. For the sake of simplicity, we'll use the leftmost and lowermost case instead (see the sketch after the next cell).
In [10]:
min_lat, max_lat = min(df.latitude), max(df.latitude)
min_lng, max_lng = min(df.longitude), max(df.longitude)
def grid(lat, lng):
    # Miles east of the westernmost case and miles north of the southernmost case.
    x = geodesic((lat, min_lng), (lat, lng)).miles
    y = geodesic((min_lat, lng), (lat, lng)).miles
    return x, y
xy = [grid(lat, lng) for lat, lng in zip(df.latitude.values, df.longitude.values)]
df['loc_x'] = np.array(xy)[:,0]
df['loc_y'] = np.array(xy)[:,1]
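As the warning above notes, a more robust variant would anchor the grid to a fixed bounding box for San Francisco instead of the extreme cases in the data. A minimal sketch, assuming approximate corner coordinates (not taken from the dataset):
SF_MIN_LAT, SF_MIN_LNG = 37.70, -122.52  # assumed south-west corner of San Francisco

def grid_sf(lat, lng):
    # Same idea as grid(), but measured from the fixed city corner
    # rather than from the data extremes.
    x = geodesic((lat, SF_MIN_LNG), (lat, lng)).miles
    y = geodesic((SF_MIN_LAT, lng), (lat, lng)).miles
    return x, y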
The neighborhood might be important because, even if a given case $C_1$ is closer to a case $C_2$, it might be more affected by a case $C_3$ that is farther away but sits in the same neighborhood (due to shared policies, strategic location, etc.).
Interesting neighborhood-level metrics could be added in the future; one example is sketched below.
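For instance, each neighborhood's historical median resolution time (a hypothetical feature, not computed in the original notebook):
# Hypothetical feature: the neighborhood's historical median days-to-close.
neigh_median = df.groupby('neighborhood').delta.median()
df['neigh_median_delta'] = df.neighborhood.map(neigh_median)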
In [11]:
df.neighborhood.unique()
Out[11]:
To avoid the dummy-variable trap (and the curse of dimensionality), we should drop one of the columns when applying one-hot encoding, depending on the classification method to be used. However, since we are not using a LogisticRegression-like classifier, we'll keep all the dummy features.
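If a regression-like classifier were used instead, dropping one level per variable is a one-flag change (shown only for reference, not used below):
# drop_first=True removes one dummy per variable, avoiding perfect collinearity
dummies_reg = pd.get_dummies(df.neighborhood.str.replace(' ', '_').str.lower(),
                             prefix='neigh', drop_first=True)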
In [12]:
dummies = pd.get_dummies(df.neighborhood.str.replace(' ', '_').str.lower(), prefix='neigh', drop_first=False)  # prefix_sep already adds the underscore
dummies.head()
Out[12]:
In [13]:
#del df['neighborhood']
df[dummies.columns] = dummies
In [14]:
df.category.unique()
Out[14]:
In [15]:
dummies = pd.get_dummies(df.category.str.replace(' ', '_').str.lower(), prefix='cat', drop_first=False)
dummies.head()
Out[15]:
In [16]:
#del df['category']
df[dummies.columns] = dummies
The request type is highly associated with the category: for instance, the category 'Abandoned Vehicle' always co-occurs with request types such as 'Abandoned Vehicle - Car2door' and 'Car4door'.
A minor exception is the distinction between 'routine' and 'emergency'.
For the sake of simplicity, we decided to remove request_type.
In [17]:
df.request_type.unique()
Out[17]:
In [18]:
tmp = df[['request_type', 'category', 'delta', 'target']]
tmp = tmp.dropna()
vc = tmp.request_type.value_counts()
tmp.loc[vc[tmp.request_type].values < 50, 'request_type'] = 'Others'
pivot = tmp.pivot_table(index='request_type', columns='category', values='target',
                        aggfunc=sum, fill_value=0)
plt.figure(figsize=(10,6))
sns.heatmap(pivot.astype(int), annot=True, fmt="d", linewidths=.5)
Out[18]:
In [19]:
dummies = pd.get_dummies(df.source.str.replace(' ', '_').str.lower(), prefix='source', drop_first=False)
df[dummies.columns] = dummies
In [20]:
df['status'] = df.status == 'Closed'
In [21]:
original_columns = ['sid', 'id', 'position', 'created_at', 'created_meta',
                    'updated_at', 'updated_meta', 'meta', 'caseid', 'opened',
                    'closed', 'status', 'responsible_agency', 'address', 'category',
                    'request_type', 'request_details', 'source', 'supervisor_district',
                    'neighborhood', 'updated', 'point']
In [22]:
df = df.drop(columns=['sid', 'id', 'position', 'created_at', 'created_meta',
                      'updated_at', 'meta', 'caseid', 'address',
                      'responsible_agency', 'request_details', 'request_type',
                      'status', 'updated', 'supervisor_district', 'point'])
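With the cleanup done, most remaining columns are numeric features. As a closing illustration, here is a hedged sketch of how they could feed a tree-based classifier to predict target; the model choice, the split, and the leakage guard are assumptions, since the notebook itself stops short of modeling:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Only closed cases have a defined delta/target.
train = df[~df.is_open]
# Drop raw timestamps and string columns, plus anything derived from the
# closing date, which would leak the label.
leaky = ['delta', 'target', 'is_open', 'closed_dayofweek', 'closed_month',
         'closed_year', 'closed_dayofmonth', 'closed_weekend']
X = train.drop(columns=['opened', 'closed', 'neighborhood', 'category', 'source'] + leaky)
X = X.select_dtypes(include=[np.number, bool]).fillna(0)  # guard against residual NaNs
y = train.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on held-out cases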