Data Transformation

This exploratory analysis takes into account how long does it take to a given open case to be closed.


In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import bokeh
from bokeh.io import output_notebook
output_notebook()

import os


Loading BokehJS ...

In [2]:
DATA_STREETLIGHT_CASES_URL = 'https://data.sfgov.org/api/views/c53t-rr3f/rows.json?accessType=DOWNLOAD'
DATA_STREETLIGHT_CASES_LOCAL = 'DATA_STREETLIGHT_CASES.json'

data_path = DATA_STREETLIGHT_CASES_URL
if os.path.isfile(DATA_STREETLIGHT_CASES_LOCAL):
    data_path = DATA_STREETLIGHT_CASES_LOCAL

In [3]:
import urllib, json

def _load_data(url):
    response = urllib.urlopen(url)
    raw_data = json.loads(response.read())
    columns = [col['name'] for col in raw_data['meta']['view']['columns']]
    rows = raw_data['data']
    return pd.DataFrame(data=rows, columns=columns)

In [4]:
df = _load_data(data_path)

Transforming Column Names


In [5]:
df.columns = [col.lower().replace(' ', '_') for col in df.columns]

In [6]:
df.columns


Out[6]:
Index([u'sid', u'id', u'position', u'created_at', u'created_meta',
       u'updated_at', u'updated_meta', u'meta', u'caseid', u'opened',
       u'closed', u'status', u'responsible_agency', u'address', u'category',
       u'request_type', u'request_details', u'source', u'supervisor_district',
       u'neighborhood', u'updated', u'point'],
      dtype='object')

Transforming Open and Close Dates

Changing Granularity


In [7]:
df['opened'] = pd.to_datetime(df.opened)
df['opened_dayofweek'] = df.opened.dt.dayofweek
df['opened_month'] = df.opened.dt.month
df['opened_year'] = df.opened.dt.year
df['opened_dayofmonth'] = df.opened.dt.day

df['closed'] = pd.to_datetime(df.closed)
df['closed_dayofweek'] = df.closed.dt.dayofweek
df['closed_month'] = df.closed.dt.month
df['closed_year'] = df.closed.dt.year
df['closed_dayofmonth'] = df.closed.dt.day

Adding new features

  • delta: int, time (in days) taken for a given case to be closed
  • is_open: boolean, defines if a given case is still opened
  • opened_weekend: boolean, whether a given case was opened on a weekend
  • closed_weekend: boolean, whether a given case was closed on a weekend

In [8]:
df['delta'] = (df.closed - df.opened).dt.days
df['is_open'] = pd.isnull(df.closed)
df['opened_weekend'] = df.opened_dayofweek >= 5
df['closed_weekend'] = df.closed_dayofweek >= 5

df['target'] = df.delta <= 2

Transforming Localization

Latitude and Longitude


In [9]:
from geopy.distance import vincenty

df['latitude'] = df.point.apply(lambda e: float(e[1]))
df['longitude'] = df.point.apply(lambda e: float(e[2]))

Localization Grid (in Km)

Useful for applying unsupervised learning (clustering).

  • loc_x: horizontal distance to the leftmost case
  • loc_y: vertical distance to the lowermost case

WARNING :In order to keep this transformation more appropriate, we should use the grid of maximum and minimum coordinates in San Francisco. For the sake of simplicity, we'll use the leftmost and lowermost case.


In [10]:
min_lat, max_lat = min(df.latitude), max(df.latitude)
min_lng, max_lng = min(df.longitude), max(df.longitude)

def grid(lat, lng):
    x = vincenty((lat, min_lng), (lat, lng)).miles
    y = vincenty((min_lat, lng), (lat, lng)).miles
    return x, y

xy = [grid(lat, lng) for lat, lng in zip(df.latitude.values, df.longitude.values)]

df['loc_x'] = np.array(xy)[:,0]
df['loc_y'] = np.array(xy)[:,1]

Neighborhood

The neighborhood might be important because, even if a given case $C_1$ is closer to a case $C_2$, it might be more affected by a case $C_3$ that is further away but it is in the same neighborhood (duo policies, strategic location ...).

Interesting metrics that could be added in the future:

  • distance to center: how distant is the case from the center of the neighborhood in which it is located.

In [11]:
df.neighborhood.unique()


Out[11]:
array([u'Silver Terrace', u'McLaren Park', u'Potrero Hill', u'Sunnydale',
       u'Produce Market', u'Bayview', None, u'Bret Harte', u'Mission',
       u'South of Market', u'Visitacion Valley', u'Crocker Amazon',
       u'Hunters Point', u'Central Waterfront', u'Excelsior', u'Portola',
       u'Dogpatch', u'Little Hollywood', u'Showplace Square',
       u'India Basin', u'Apparel City', u'Mission Bay',
       u'Candlestick Point SRA'], dtype=object)

To avoid the dummy-trap (and curse of dimensionality). We should drop one of the columns when applying one-hot-encoding depending on the classification method to be used. However, since we are not using a LogisticRegression-like classifier, we'll add all features.


In [12]:
dummies = pd.get_dummies(df.neighborhood.str.replace(' ', '_').str.lower(), prefix='neigh_', drop_first=False)
dummies.head()


Out[12]:
neigh__apparel_city neigh__bayview neigh__bret_harte neigh__candlestick_point_sra neigh__central_waterfront neigh__crocker_amazon neigh__dogpatch neigh__excelsior neigh__hunters_point neigh__india_basin ... neigh__mission neigh__mission_bay neigh__portola neigh__potrero_hill neigh__produce_market neigh__showplace_square neigh__silver_terrace neigh__south_of_market neigh__sunnydale neigh__visitacion_valley
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0

5 rows × 22 columns


In [13]:
#del df['neighborhood']
df[dummies.columns] = dummies

Category


In [14]:
df.category.unique()


Out[14]:
array([u'Street and Sidewalk Cleaning', u'Rec and Park Requests',
       u'SFHA Requests', u'Sidewalk or Curb', u'Tree Maintenance',
       u'Sewer Issues', u'Abandoned Vehicle', u'Graffiti Private Property',
       u'Street Defects', u'Streetlights', u'General Requests',
       u'Catch Basin Maintenance', u'Graffiti Public Property',
       u'Litter Receptacles', u'Sign Repair', u'Illegal Postings',
       u'311 External Request', u'DPW Volunteer Programs',
       u'Blocked Street or SideWalk', u'Residential Building Request',
       u'Damaged Property', u'Temporary Sign Request', u'MUNI Feedback',
       u'Color Curb'], dtype=object)

In [15]:
dummies = pd.get_dummies(df.category.str.replace(' ', '_').str.lower(), prefix='cat_', drop_first=False)
dummies.head()


Out[15]:
cat__311_external_request cat__abandoned_vehicle cat__blocked_street_or_sidewalk cat__catch_basin_maintenance cat__color_curb cat__damaged_property cat__dpw_volunteer_programs cat__general_requests cat__graffiti_private_property cat__graffiti_public_property ... cat__residential_building_request cat__sewer_issues cat__sfha_requests cat__sidewalk_or_curb cat__sign_repair cat__street_and_sidewalk_cleaning cat__street_defects cat__streetlights cat__temporary_sign_request cat__tree_maintenance
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0

5 rows × 24 columns


In [16]:
#del df['category']
df[dummies.columns] = dummies

Request Type

The Request Type is highly associated to the category, meaning that category 'Abandoned Vehicle' is always correlated with request types 'Abandoned Vehicle - Car2door' and 'Car4door' ...

A minor exception is the difference between 'routine' and 'emergency'.

For the sake of simplicity, we decided to remove request_type


In [17]:
df.request_type.unique()


Out[17]:
array([u'General Cleaning', u'Park - UrbanForestry', u'Bulky Items',
       u'SFHA Priority - Emergency', u'SFHA Priority - Urgent',
       u'Park - Neighborhood_Services_Area', u'Sidewalk_Defect',
       u'Trees - Damaging_Property', u'Illegal_Dumping',
       u'Sidewalk_Cleaning', u'Park - Structural_Maintenance',
       u'Sewage_back_up', u'Abandoned Vehicle - Car2door',
       u'Not_Offensive Graffiti on Private Property', u'Pavement_Defect',
       u'Trees - Tree_Other', u'Streetlight - Light_Burnt_Out',
       u'Street_Cleaning', u'dpw - bses - customer_callback',
       u'tt_collector - tt_collector - customer_callback',
       u'fire_department - support_services - request_for_service',
       u'Abandoned Vehicle - SUV', u'Water_leak',
       u'Abandoned Vehicle - PickupTruck', u'Abandoned Vehicle - Other',
       u'Maintenance', u'Abandoned Vehicle - Car4door',
       u'building_inspection - building_inspection - request_for_service',
       u'dph - environmental_health - request_for_service', u'Flooding',
       u'Catch_Basin_Other', u'dpw - bsm - request_for_service',
       u'Trees - Damaged_Tree',
       u'building_inspection - building_inspection - complaint',
       u'Streetlight - Light_Flickering_On_Off',
       u'Offensive Graffiti on Public Property', u'City_Can_Other',
       u'dpw - buf - request_for_service',
       u'puc - water - request_for_service',
       u'puc - sewer - customer_callback', u'Cans_Left_Out_24x7',
       u'Abandoned Vehicle - ', u'Offensive Graffiti on Private Property',
       u'Utility_Excavation',
       u'tt_collector - tt_collector - request_for_service',
       u'puc - billing - customer_callback',
       u'Abandoned Vehicle - Trailer',
       u'Not_Offensive Graffiti on Public Property',
       u'animal_care_control - animal_care_control - request_for_service',
       u'dpw - bses - complaint', u'mta - mta_other - customer_callback',
       u'Sign Repair - Other', u'puc - puco - customer_callback',
       u'dpw - bses - request_for_service',
       u'mta - traffic_engineering - customer_callback',
       u'Missed_Route_Mechanical_Sweeping', u'Odor',
       u'mta - mta_other - request_for_service',
       u'Streetlight - Light_Dim', u'Illegal Posting - No_Posting_Date',
       u'Illegal Posting - Affixed_Improperly',
       u'Illegal Posting - Posted_on_Traffic_Light',
       u'Illegal Postings - Posted_on_Traffic_Light', u'Graffiti',
       u'Illegal Postings - Affixed_Improperly', u'Community_Clean_Team',
       u'rpd - rpd_other - request_for_service',
       u'dph - environmental_health - complaint', u'Manhole_Cover_Off',
       u'Blocked_Sidewalk', u'dph - environmental_health - compliment',
       u'Building - Plumbing_Broken_leaking',
       u'tt_collector - tt_collector - followup_request',
       u'Building - Second_Hand_Smoke',
       u'tt_collector - tt_collector - mailing_request',
       u'Damaged_City_Can',
       u'mta - parking_traffic_other - request_for_service',
       u'Damaged Traffic_Signal', u'mta - traffic_engineering - complaint',
       u'Sign Repair - On_Ground', u'Trees',
       u'city_attorney - city_attorney - mailing_request',
       u'mta - mta_other - complaint', u'Damaged_Property',
       u'Human/Animal Waste', u'Other', u'Illegal Encampment',
       u'Trees - Landscaping', u'Sign - On_Ground',
       u'dpw - bsm - complaint', u'Automotive', u'Graffiti_Watch',
       u'puc - sewer - request_for_service', u'SFHA Priority - Routine',
       u'building_inspection - plumbing_inspection - customer_callback',
       u'building_inspection - electrical_inspection - customer_callback',
       u'dph - environmental_health - customer_callback',
       u'dpw - buf - customer_callback', u'Trees - Overgrown_Tree',
       u'Removal_From_Location',
       u'Overflowing_City_Receptacle_or_Dumpster', u'Hazardous Materials',
       u'SFHA Priority - Preventive',
       u'animal_care_control - animal_care_control - customer_callback',
       u'mta - parking_enforcement - request_for_service',
       u'Blocked_Parking_Space_or_Strip',
       u'Temporary Sign Request for Moving_residential_property',
       u'dpw - buf - complaint', u'Sign Repair - Bent',
       u'Abandoned Vehicle -', u'mta - paint_shop - request_for_service',
       u'Streetlight - Light_On_all_the_time',
       u'dpw - bssr - request_for_service', u'Park - Park_Patrol',
       u'Illegal Posting - Multiple_Postings',
       u'mta - parking_enforcement - complaint', u'Sign - Loose',
       u'Damaged Parking_Meter', u'dpw - bssr - customer_callback',
       u'dph - dph_other - request_for_service',
       u'dpw - dpw_other - request_for_service',
       u'assessor_recorder - real_property - customer_callback',
       u'assessor_recorder - real_property - mailing_request',
       u'port_authority - port_authority - request_for_service',
       u'dph - environmental_health_tobacco - complaint',
       u'Streetlight - Pole_Damaged', u'Sign - Defaced',
       u'dpw - dpw_other - complaint', u'dpw - bssr - complaint',
       u'Curb_or_Curb_Ramp_Defect',
       u'mta - color_curb - request_for_service', u'Sign - Missing',
       u'Sign - Other', u'Medical_waste', u'Missing_Side_Sewer_Vent_Cover',
       u'puc - electric - customer_callback',
       u'tt_collector - treasurer_tax_collector_k2c - customer_callback',
       u'Temporary Sign Request for Other_Event_Type',
       u'assessor_recorder - assessor - customer_callback',
       u'dpw - bsm - customer_callback',
       u'mta - traffic_engineering - request_for_service',
       u'building_inspection - building_inspection - customer_callback',
       u'New_Litter_Receptacle', u'Construction_Plate_Shifted',
       u'port_authority - port_authority - customer_callback',
       u'Streetlight - Pole_Knockdown',
       u'dph - dph_other - customer_callback',
       u'Building - Garbage_Receptacles', u'Damaged_Side_Sewer_Vent_Cover',
       u'building_inspection - plan_review - request_for_service',
       u'Abandoned Vehicle - Motorcycle',
       u'mta - parking_sfpark - customer_callback',
       u'puc - puco - request_for_service',
       u'Streetlight - Pole_Wires_Exposed', u'puc - water - complaint',
       u'Building - Illegal_Construction_No_Permit_Exceeds_Permit_Scope',
       u'dph - environmental_health_tobacco - request_for_service',
       u'mta - parking_enforcement - customer_callback',
       u'Temporary Sign Request for Corporate_Events',
       u'dph - environmental_health - followup_request',
       u'rpd - neighborhood_services - complaint',
       u'assessor_recorder - real_property - request_for_service',
       u'mta - sign_shop - request_for_service',
       u'rpd - park_patrol - complaint', u'MUNI - Services_Miscellaneous',
       u'Streetlight - Other', u'Sign - Painted_Over',
       u'Post_Abatement_Inspection Graffiti on Private Property',
       u'Streetlight - Other_Request_New_Streetlight',
       u'mta - residential_parking_permit - compliment',
       u'Building - Heat_Lack_of_Heat',
       u'building_inspection - inspection_services - complaint',
       u'MUNI - Services_Service_Delivery_Facilities',
       u'Damaged News_Rack',
       u'Illegal Posting - Posting_Too_Large_in_Size',
       u'mta - bicycle - customer_callback',
       u'assessor_recorder - recorder - customer_callback',
       u'building_inspection - dbi_other - request_for_service',
       u'puc - puco - complaint', u'tt_collector - tt_collector - other',
       u'City_Can_Removed',
       u'building_inspection - building_inspection - compliment',
       u'dpw - bses - other', u'dpw - dpw_other - customer_callback',
       u'mta - parking_traffic_other - customer_callback',
       u'Building - Infestation_Rodent_Insect', u'mons - mons - complaint',
       u'sfpd - sfpd - request_for_service',
       u'Temporary Sign Request for Moving_commercial_property',
       u'Streetlight - Other_Request_Light_Shield',
       u'building_inspection - permit_services - complaint',
       u'Sign - Turned', u'Building - Ventilation_Inadequate_or_None',
       u'puc - sewer - followup_request', u'mta - color_curb - complaint',
       u'Damaged Benches_on_Sidewalk', u'mons - mons - customer_callback',
       u'Illegal Postings - Multiple_Postings',
       u'environment - environment - complaint',
       u'dph - environmental_health - mailing_request', u'Sign - Faded',
       u'Blue Color Curb Request',
       u'animal_care_control - animal_care_control - complaint',
       u'dtis - dtis - request_for_service',
       u'Red_NoFee Color Curb Request',
       u'building_inspection - dbi_other - complaint',
       u'dpw - bses - followup_request', u'Streetlight - Pole_Leaning',
       u'central_shops - central_shops - complaint',
       u'assessor_recorder - assessor_recorder_re_watchdog_program - customer_callback',
       u'Garbage',
       u'art_commission - art_commission - request_for_service',
       u'building_inspection - dbi_other - followup_request',
       u'puc - water - customer_callback',
       u'MUNI - Services_Service_Planning', u'Damaged Kiosk_Public_Toilet',
       u'Streetlight - Light_Glass_Cover_Missing',
       u'fire_department - sffd_other - request_for_service',
       u'dpw - buf - compliment', u'sfpd - sfpd - customer_callback',
       u'Building - Electrical_Hazardous_Condition',
       u'animal_care_control - animal_care_control - followup_request',
       u'mta - parking_enforcement - compliment',
       u'dpw - bses - compliment', u'Building - Deck_Stairs_Handrails',
       u'dpw - bses - mailing_request',
       u'rpd - urban_forestry - request_for_service',
       u'Sign Repair - Turned',
       u'city_attorney - city_attorney - request_for_service',
       u'Temporary Sign Request for Street_Fairs',
       u'Temporary Sign Request for Funerals',
       u'mta - construction_permit - complaint',
       u'Red Color Curb Request for', u'mta - traffic_engineering - other',
       u'dpw - bbr - request_for_service',
       u'Temporary Sign Request for City_Sponsored_Celebrations',
       u'puc - electric - request_for_service'], dtype=object)

In [18]:
tmp = df[['request_type', 'category', 'delta', 'target']]
tmp = tmp.dropna()
vc = tmp.request_type.value_counts()
tmp.loc[vc[tmp.request_type].values < 50, 'request_type'] = 'Others'

pivot = tmp.pivot_table(index='request_type', columns='category', values='target', 
                        aggfunc=sum, fill_value=0)
plt.figure(figsize=(10,6))
sns.heatmap(pivot.astype(int), annot=True, fmt="d", linewidths=.5)


Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3f9a5288d0>

Source


In [19]:
dummies = pd.get_dummies(df.source.str.replace(' ', '_').str.lower(), prefix='source_', drop_first=False)
df[dummies.columns] = dummies

In [20]:
df['status'] = df.status == 'Closed'

In [21]:
original_columns = [u'sid', u'id', u'position', u'created_at', u'created_meta',
       u'updated_at', u'updated_meta', u'meta', u'caseid', u'opened',
       u'closed', u'status', u'responsible_agency', u'address', u'category',
       u'request_type', u'request_details', u'source', u'supervisor_district',
       u'neighborhood', u'updated', u'point']

In [22]:
del df['sid']
del df['id']
del df['position']
del df['created_at']
del df['created_meta']
del df['updated_at']
del df['meta']
del df['caseid']
del df['address']
del df['responsible_agency']
del df['request_details']
del df['request_type']
del df['status']
del df['updated']
del df['supervisor_district']
del df['point']

In [ ]: