This exploratory analysis looks at how long it takes for a given open case to be closed.
In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import bokeh
from bokeh.io import output_notebook
output_notebook()
import os
In [2]:
DATA_STREETLIGHT_CASES_URL = 'https://data.sfgov.org/api/views/c53t-rr3f/rows.json?accessType=DOWNLOAD'
DATA_STREETLIGHT_CASES_LOCAL = 'DATA_STREETLIGHT_CASES.json'
data_path = DATA_STREETLIGHT_CASES_URL
if os.path.isfile(DATA_STREETLIGHT_CASES_LOCAL):
    data_path = DATA_STREETLIGHT_CASES_LOCAL
In [3]:
import json
from urllib.request import urlopen

def _load_data(url):
    # Socrata JSON exports keep the schema under meta.view.columns and the rows under data.
    response = urlopen(url)
    raw_data = json.loads(response.read())
    columns = [col['name'] for col in raw_data['meta']['view']['columns']]
    rows = raw_data['data']
    return pd.DataFrame(data=rows, columns=columns)
In [4]:
df = _load_data(data_path)
In [5]:
df.columns = [col.lower().replace(' ', '_') for col in df.columns]
In [6]:
df.columns
Out[6]:
In [7]:
df['opened'] = pd.to_datetime(df.opened)
df['opened_dayofweek'] = df.opened.dt.dayofweek
df['opened_month'] = df.opened.dt.month
df['opened_year'] = df.opened.dt.year
df['opened_dayofmonth'] = df.opened.dt.day
df['closed'] = pd.to_datetime(df.closed)
df['closed_dayofweek'] = df.closed.dt.dayofweek
df['closed_month'] = df.closed.dt.month
df['closed_year'] = df.closed.dt.year
df['closed_dayofmonth'] = df.closed.dt.day
In [8]:
df['delta'] = (df.closed - df.opened).dt.days
df['is_open'] = pd.isnull(df.closed)
df['opened_weekend'] = df.opened_dayofweek >= 5
df['closed_weekend'] = df.closed_dayofweek >= 5
df['target'] = df.delta <= 2  # True if the case was closed within two days
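Since the whole analysis revolves around resolution time, it is worth sanity-checking the distribution of delta before engineering more features. A minimal sketch (the 60-day clip is an arbitrary readability choice, not part of the original analysis):
closed_cases = df[~df.is_open]
print(closed_cases.delta.describe())             # summary statistics of days-to-close
closed_cases.delta.clip(upper=60).hist(bins=60)  # clip the long tail for readability
plt.xlabel('days to close (clipped at 60)')
plt.ylabel('number of cases')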
In [9]:
from geopy.distance import geodesic  # vincenty was removed in geopy 2.x; geodesic is its replacement
df['latitude'] = df.point.apply(lambda e: float(e[1]))
df['longitude'] = df.point.apply(lambda e: float(e[2]))
Converting the coordinates into grid distances is useful for applying unsupervised learning (clustering).
WARNING: to make this transformation more accurate, we should use a grid anchored at the maximum and minimum coordinates of San Francisco itself. For the sake of simplicity, we'll use the leftmost and lowermost case instead (see the sketch after the next cell).
In [10]:
min_lat, max_lat = min(df.latitude), max(df.latitude)
min_lng, max_lng = min(df.longitude), max(df.longitude)
def grid(lat, lng):
    # Miles east of the westernmost case and miles north of the southernmost case.
    x = geodesic((lat, min_lng), (lat, lng)).miles
    y = geodesic((min_lat, lng), (lat, lng)).miles
    return x, y
xy = [grid(lat, lng) for lat, lng in zip(df.latitude.values, df.longitude.values)]
df['loc_x'] = np.array(xy)[:,0]
df['loc_y'] = np.array(xy)[:,1]
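As the warning above notes, a more robust variant would anchor the grid to a fixed bounding box for San Francisco instead of the extreme cases in the data. A minimal sketch, assuming approximate corner coordinates (not taken from the dataset):
SF_MIN_LAT, SF_MIN_LNG = 37.70, -122.52  # assumed south-west corner of San Francisco

def grid_sf(lat, lng):
    # Same idea as grid(), but measured from the fixed city corner
    # rather than from the data extremes.
    x = geodesic((lat, SF_MIN_LNG), (lat, lng)).miles
    y = geodesic((SF_MIN_LAT, lng), (lat, lng)).miles
    return x, y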
The neighborhood might be important because, even if a given case $C_1$ is closer to a case $C_2$, it might be more affected by a case $C_3$ that is farther away but sits in the same neighborhood (due to shared policies, strategic location, etc.).
Interesting neighborhood-level metrics could be added in the future; one example is sketched below.
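For instance, each neighborhood's historical median resolution time (a hypothetical feature, not computed in the original notebook):
# Hypothetical feature: the neighborhood's historical median days-to-close.
neigh_median = df.groupby('neighborhood').delta.median()
df['neigh_median_delta'] = df.neighborhood.map(neigh_median)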
In [11]:
df.neighborhood.unique()
Out[11]:
To avoid the dummy-variable trap (and the curse of dimensionality), we should drop one of the columns when applying one-hot encoding, depending on the classification method to be used. However, since we are not using a LogisticRegression-like classifier, we'll keep all the dummy features.
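If a regression-like classifier were used instead, dropping one level per variable is a one-flag change (shown only for reference, not used below):
# drop_first=True removes one dummy per variable, avoiding perfect collinearity
dummies_reg = pd.get_dummies(df.neighborhood.str.replace(' ', '_').str.lower(),
                             prefix='neigh', drop_first=True)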
In [12]:
dummies = pd.get_dummies(df.neighborhood.str.replace(' ', '_').str.lower(), prefix='neigh', drop_first=False)  # prefix_sep already adds the underscore
dummies.head()
Out[12]:
In [13]:
#del df['neighborhood']
df[dummies.columns] = dummies
In [14]:
df.category.unique()
Out[14]:
In [15]:
dummies = pd.get_dummies(df.category.str.replace(' ', '_').str.lower(), prefix='cat', drop_first=False)
dummies.head()
Out[15]:
In [16]:
#del df['category']
df[dummies.columns] = dummies
The request type is highly associated with the category: for instance, the category 'Abandoned Vehicle' always co-occurs with request types such as 'Abandoned Vehicle - Car2door' and 'Car4door'.
A minor exception is the distinction between 'routine' and 'emergency'.
For the sake of simplicity, we decided to remove request_type.
In [17]:
df.request_type.unique()
Out[17]:
In [18]:
tmp = df[['request_type', 'category', 'delta', 'target']]
tmp = tmp.dropna()
vc = tmp.request_type.value_counts()
tmp.loc[vc[tmp.request_type].values < 50, 'request_type'] = 'Others'
pivot = tmp.pivot_table(index='request_type', columns='category', values='target',
                        aggfunc=sum, fill_value=0)
plt.figure(figsize=(10,6))
sns.heatmap(pivot.astype(int), annot=True, fmt="d", linewidths=.5)
Out[18]:
In [19]:
dummies = pd.get_dummies(df.source.str.replace(' ', '_').str.lower(), prefix='source', drop_first=False)
df[dummies.columns] = dummies
In [20]:
df['status'] = df.status == 'Closed'
In [21]:
original_columns = ['sid', 'id', 'position', 'created_at', 'created_meta',
                    'updated_at', 'updated_meta', 'meta', 'caseid', 'opened',
                    'closed', 'status', 'responsible_agency', 'address', 'category',
                    'request_type', 'request_details', 'source', 'supervisor_district',
                    'neighborhood', 'updated', 'point']
In [22]:
df = df.drop(columns=['sid', 'id', 'position', 'created_at', 'created_meta',
                      'updated_at', 'meta', 'caseid', 'address',
                      'responsible_agency', 'request_details', 'request_type',
                      'status', 'updated', 'supervisor_district', 'point'])
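With the cleanup done, most remaining columns are numeric features. As a closing illustration, here is a hedged sketch of how they could feed a tree-based classifier to predict target; the model choice, the split, and the leakage guard are assumptions, since the notebook itself stops short of modeling:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Only closed cases have a defined delta/target.
train = df[~df.is_open]
# Drop raw timestamps and string columns, plus anything derived from the
# closing date, which would leak the label.
leaky = ['delta', 'target', 'is_open', 'closed_dayofweek', 'closed_month',
         'closed_year', 'closed_dayofmonth', 'closed_weekend']
X = train.drop(columns=['opened', 'closed', 'neighborhood', 'category', 'source'] + leaky)
X = X.select_dtypes(include=[np.number, bool]).fillna(0)  # guard against residual NaNs
y = train.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on held-out cases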