Traffic Incident Reports in San Francisco


In [1]:
from cartoframes.auth import set_default_credentials, Credentials
from cartoframes.viz import Map, Layer, Source
import pandas as pd
import geopandas as gpd

If you have a CARTO account, you can set your credentials in the following cell. This allows you to upload the dataset and share the final visualization through your account.


In [2]:
set_default_credentials('creds.json')

Load incident reports

Using pandas, we can read an external data source, which is converted to a dataframe. Let's see which columns we have:


In [3]:
incident_reports_df = pd.read_csv('http://data.sfgov.org/resource/wg3w-h783.csv')
incident_reports_df.head()


Out[3]:
incident_datetime incident_date incident_time incident_year incident_day_of_week report_datetime row_id incident_id incident_number cad_number ... :@computed_region_6qbp_sg9q :@computed_region_qgnn_b9vv :@computed_region_26cr_cadq :@computed_region_ajp5_b2md :@computed_region_nqbw_i6c3 :@computed_region_2dwj_jsy4 :@computed_region_h4ep_8xdi :@computed_region_y6ts_4iup :@computed_region_jg9y_a9du :@computed_region_6pnf_4xz7
0 2020-02-03T14:45:00.000 2020-02-03T00:00:00.000 14:45 2020 Monday 2020-02-03T17:50:00.000 89881675000 898816 200085557 200342870.0 ... 41.0 10.0 8.0 16.0 NaN NaN NaN NaN NaN 2.0
1 2020-02-03T03:45:00.000 2020-02-03T00:00:00.000 03:45 2020 Monday 2020-02-03T03:45:00.000 89860711012 898607 200083749 200340316.0 ... 53.0 3.0 2.0 20.0 3.0 NaN NaN NaN NaN 2.0
2 2020-02-03T10:00:00.000 2020-02-03T00:00:00.000 10:00 2020 Monday 2020-02-03T10:06:00.000 89867264015 898672 200084060 200340808.0 ... 19.0 5.0 3.0 8.0 NaN 35.0 NaN NaN NaN 2.0
3 2020-01-19T17:12:00.000 2020-01-19T00:00:00.000 17:12 2020 Sunday 2020-02-01T13:01:00.000 89863571000 898635 206024187 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2020-01-05T00:00:00.000 2020-01-05T00:00:00.000 00:00 2020 Sunday 2020-02-03T16:09:00.000 89877368020 898773 200085193 200342341.0 ... 103.0 4.0 6.0 30.0 NaN NaN NaN NaN NaN 1.0

5 rows × 36 columns


In [4]:
incident_reports_df.columns


Out[4]:
Index(['incident_datetime', 'incident_date', 'incident_time', 'incident_year',
       'incident_day_of_week', 'report_datetime', 'row_id', 'incident_id',
       'incident_number', 'cad_number', 'report_type_code',
       'report_type_description', 'filed_online', 'incident_code',
       'incident_category', 'incident_subcategory', 'incident_description',
       'resolution', 'intersection', 'cnn', 'police_district',
       'analysis_neighborhood', 'supervisor_district', 'latitude', 'longitude',
       'point', ':@computed_region_6qbp_sg9q', ':@computed_region_qgnn_b9vv',
       ':@computed_region_26cr_cadq', ':@computed_region_ajp5_b2md',
       ':@computed_region_nqbw_i6c3', ':@computed_region_2dwj_jsy4',
       ':@computed_region_h4ep_8xdi', ':@computed_region_y6ts_4iup',
       ':@computed_region_jg9y_a9du', ':@computed_region_6pnf_4xz7'],
      dtype='object')

Some of the latitude and longitude values are NaN, in the next step we get rid of them. After that, we create a dataset from the dataframe and use it in a Layer to visualize the data:


In [5]:
incident_reports_df = incident_reports_df[incident_reports_df.longitude == incident_reports_df.longitude]
incident_reports_df = incident_reports_df[incident_reports_df.latitude == incident_reports_df.latitude]

incident_reports_gdf = gpd.GeoDataFrame(
    incident_reports_df,
    geometry=gpd.points_from_xy(incident_reports_df.longitude, incident_reports_df.latitude)
)

incident_reports_gdf.head()


Out[5]:
incident_datetime incident_date incident_time incident_year incident_day_of_week report_datetime row_id incident_id incident_number cad_number ... :@computed_region_qgnn_b9vv :@computed_region_26cr_cadq :@computed_region_ajp5_b2md :@computed_region_nqbw_i6c3 :@computed_region_2dwj_jsy4 :@computed_region_h4ep_8xdi :@computed_region_y6ts_4iup :@computed_region_jg9y_a9du :@computed_region_6pnf_4xz7 geometry
0 2020-02-03T14:45:00.000 2020-02-03T00:00:00.000 14:45 2020 Monday 2020-02-03T17:50:00.000 89881675000 898816 200085557 200342870.0 ... 10.0 8.0 16.0 NaN NaN NaN NaN NaN 2.0 POINT (-122.47604 37.72695)
1 2020-02-03T03:45:00.000 2020-02-03T00:00:00.000 03:45 2020 Monday 2020-02-03T03:45:00.000 89860711012 898607 200083749 200340316.0 ... 3.0 2.0 20.0 3.0 NaN NaN NaN NaN 2.0 POINT (-122.41517 37.75244)
2 2020-02-03T10:00:00.000 2020-02-03T00:00:00.000 10:00 2020 Monday 2020-02-03T10:06:00.000 89867264015 898672 200084060 200340808.0 ... 5.0 3.0 8.0 NaN 35.0 NaN NaN NaN 2.0 POINT (-122.40734 37.78456)
4 2020-01-05T00:00:00.000 2020-01-05T00:00:00.000 00:00 2020 Sunday 2020-02-03T16:09:00.000 89877368020 898773 200085193 200342341.0 ... 4.0 6.0 30.0 NaN NaN NaN NaN NaN 1.0 POINT (-122.44025 37.78711)
5 2020-02-03T08:36:00.000 2020-02-03T00:00:00.000 08:36 2020 Monday 2020-02-03T08:36:00.000 89876268020 898762 200083909 200340826.0 ... 6.0 3.0 8.0 NaN NaN NaN NaN NaN 1.0 POINT (-122.39951 37.79693)

5 rows × 37 columns


In [7]:
Layer(incident_reports_gdf)


Out[7]:
:
StackTrace
    ">

    Now, we are going to use a helper method to color by category, and the category is 'Day of Week' (incident_day_of_week)

    
    
    In [20]:
    from cartoframes.viz import Layer, color_category_style
    
    Layer(incident_reports_gdf, color_category_style('incident_day_of_week', top=7), title='Day of Week')
    
    
    
    
    Out[20]:
    :
    StackTrace
      ">

      As we can see in the legend, the days are sorted by frequency, which means that there're less incidents on Thursdays and More on Tuesdays. Since our purpose is not to visualize the frequency and we want to see the days properly sorted from Monday to Sunday in the legend, we can modify the helper and set the categories we want to visualize in the desired position:

      
      
      In [21]:
      from cartoframes.viz import color_category_style
      
      
      Layer(
          incident_reports_gdf, 
          color_category_style(
              'incident_day_of_week',
              cat=[
                  'Monday',
                  'Tuesday',
                  'Wednesday',
                  'Thursday',
                  'Friday',
                  'Saturday',
                  'Sunday'
              ]
          ),
          title='Day of Week'
      )
      
      
      
      
      Out[21]:
      :
      StackTrace
        ">

        Now, we want to look for traffic incidents, and then use these categories to visualize those incidents:

        
        
        In [10]:
        incident_reports_df.incident_category.unique()
        
        
        
        
        Out[10]:
        array(['Missing Person', 'Stolen Property', 'Non-Criminal',
               'Miscellaneous Investigation',
               'Offences Against The Family And Children', 'Other Miscellaneous',
               'Lost Property', 'Larceny Theft', 'Warrant', 'Assault', 'Fraud',
               'Burglary', 'Traffic Violation Arrest', 'Weapons Carrying Etc',
               'Malicious Mischief', 'Motor Vehicle Theft', 'Drug Offense',
               'Other', 'Other Offenses', 'Recovered Vehicle', 'Robbery',
               'Suspicious Occ', 'Disorderly Conduct', 'Weapons Offense',
               'Vandalism', 'Embezzlement', 'Courtesy Report', 'Sex Offense',
               'Drug Violation', 'Traffic Collision', 'Prostitution',
               'Forgery And Counterfeiting', 'Case Closure', 'Homicide', 'Arson',
               'Suicide', 'Vehicle Impounded', 'Liquor Laws'], dtype=object)
        
        
        In [22]:
        from cartoframes.viz import Layer, size_category_style
        
        Layer(
            incident_reports_gdf,
            size_category_style(
                'incident_category',
                cat=['Traffic Collision', 'Traffic Violation Arrest']
            ),
            title='Traffic Incidents'
        )
        
        
        
        
        Out[22]:
        :
        StackTrace
          ">

          In CARTO we have a dataset we can use for the next step, named 'sfcta_congestion_roads'. We are going to set the Credentials for this dataset. To have more control over this dataset, if you have a CARTO account you can import it to have everything together, and it won't be needed to create a different source for this Dataset.

          Once we've the data source created, we're going to combine two helper methods. The first one uses the Source with the roads data from CARTO, and the second one the traffic incident reports.

          
          
          In [23]:
          from cartoframes.viz import Layer, color_continuous_style, size_category_style
          
          sfcta_congestion_roads_source=Source(
              'sfcta_congestion_roads',
              Credentials(
                  base_url='https://cartovl.carto.com',
                  api_key='default_public'
              )
          )
          
          Map([
              Layer(
                  sfcta_congestion_roads_source,
                  color_continuous_style('auto_speed'),
                  title='Recorded vehicle speeds'
              ),
              Layer(
                  incident_reports_gdf,
                  size_category_style(
                      'incident_category',
                      cat=['Traffic Collision', 'Traffic Violation Arrest']
                  ),
                  title='Traffic Incidents'
              )
          ])
          
          
          
          
          Out[23]:
          :
          StackTrace
            ">

            We are going to add information about traffic signals, by getting data from a different source:

            
            
            In [30]:
            traffic_signals_df = pd.read_csv('http://data.sfgov.org/resource/c8ue-f4py.csv')
            traffic_signals_df.head()
            
            
            
            
            Out[30]:
            objectid cnn code cnn_1 street1 street2 street3 street4 detection sup_dist ... percent_po point point_address point_city point_state point_zip :@computed_region_6qbp_sg9q :@computed_region_qgnn_b9vv :@computed_region_26cr_cadq :@computed_region_ajp5_b2md
            0 2400 26392000 NaN 26392000 MASONIC TURK NaN NaN NaN 1,2,5 ... NaN POINT (-122.44702658724947 37.778617364507106) NaN NaN NaN NaN 97.0 7.0 11.0 18.0
            1 2737 20421000 NaN 20421000 GENEVA SANTOS NaN NaN NaN 10 ... NaN POINT (-122.42006985840429 37.70830632733309) NaN NaN NaN NaN NaN NaN 9.0 NaN
            2 2981 24018000 NaN 24018000 16TH ST UTAH NaN NaN NaN 10 ... NaN POINT (-122.40653961377903 37.76583732660274) NaN NaN NaN NaN 54.0 3.0 9.0 20.0
            3 2934 9941000 NaN 9941000 OLYMPIA MIDBLOCK AT DELLBROOK (N) DELLBROOK (S) NaN 7 ... NaN POINT (-122.45364260943721 37.75120299724448) NaN NaN NaN NaN 48.0 7.0 8.0 38.0
            4 2922 26932000 NaN 26932000 BEAUMONT GEARY NaN NaN NaN 1,2 ... NaN POINT (-122.4552675286903 37.781453718876065) NaN NaN NaN NaN 12.0 8.0 4.0 18.0

            5 rows × 44 columns

            
            
            In [31]:
            traffic_signals_df.columns
            
            
            
            
            Out[31]:
            Index(['objectid', 'cnn', 'code', 'cnn_1', 'street1', 'street2', 'street3',
                   'street4', 'detection', 'sup_dist', 'veh_actuat', 'aps', 'ped_signal',
                   'ped_actuat', 'tbc', 'preempt_pr', 'd_ate2070', 'project_ne',
                   'project_ol', 'upgraded', 'yr_of_cont', 'last_upgra', 'new_signal',
                   'mod_projec', 'full_upgra', 'beacon_fla', 'funding', 'rlcam',
                   'startyear', 'caltrans_r', 'caltrans', 'percent_c', 'sf', 'percent_sf',
                   'percent_po', 'point', 'point_address', 'point_city', 'point_state',
                   'point_zip', ':@computed_region_6qbp_sg9q',
                   ':@computed_region_qgnn_b9vv', ':@computed_region_26cr_cadq',
                   ':@computed_region_ajp5_b2md'],
                  dtype='object')
            
            
            In [32]:
            traffic_signals_df.code.unique()
            
            
            
            
            Out[32]:
            array([nan])

            Since there is no latitude and longitude columns, we can use the point column to create a GeoDataFrame.

            
            
            In [33]:
            from shapely import wkt
            
            traffic_signals_df['point'] = traffic_signals_df['point'].apply(wkt.loads)
            traffic_signals_df = traffic_signals_df.rename(columns={'point': 'geometry'}).set_geometry('geometry')
            trafic_signals_gdf = gpd.GeoDataFrame(traffic_signals_df, geometry='geometry')
            
            
            
            In [34]:
            Map(Layer(trafic_signals_gdf))
            
            
            
            
            Out[34]:
            :
            StackTrace
              ">
              
              
              In [36]:
              from cartoframes.viz import Layer, color_continuous_style, size_category_style, basic_style
              
              Map([
                  Layer(
                      sfcta_congestion_roads_source,
                      color_continuous_style('auto_speed'),
                      title='Recorded vehicle speeds'
                  ),
                  Layer(
                      incident_reports_gdf,
                      size_category_style(
                          'incident_category',
                          cat=['Traffic Collision', 'Traffic Violation Arrest']
                      ),
                      title='Traffic Incidents'
                  ),
                  Layer(
                      trafic_signals_gdf,
                      basic_style(color='blue', size=1),
                      title='Traffic Signals'
                  )
              ],
              layer_selector=True)
              
              
              
              
              Out[36]:
              :
              StackTrace
                ">