In [1]:

    
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import geocoder
import sklearn
from IPython.core.display import HTML
css = open('style-table.css').read() + open('style-notebook.css').read()
HTML('<style>{}</style>'.format(css))









    Out[1]:

We want to build a machine learning model that looks at various intersections in Toronto and predicts the collision rates expected from cyclists at each point. Using the accident history from 1986-2010, along with information such as the intersection's Bike traffic, control type, and classification (local, major arterial etc.) we should be able to use a supervised learning approach, training our algorithm on a subset of the collisions available, and the validating on other subsets. Once we are satisfied with the performance (and at some point we'll need to define where our cut-off for prediction should be), we will compare the output by applying it to accident data scrapped from a twitterbot.

Note that we do not include whether or not an intersection has a bike lane in our list of features. While one can download a dataset which includes all locations of cycle tracks and bike lanes by street sections, many are fairly new. With the exception of certain cycle tracks which were built after 2014, there doesn't seem to be any easy way to date various bike lane additions. As such it wouldn't make sense to label our intersections with this information, as most bike lanes and cycle tracks are fairly new, and were not around during the period of time our dataset is built from.

Since we don't have access to cycle app or bikeshare data to get accurate bike counts for a variety of locations, we'll need to make due with a limited dataset of city manual and automatic bike counts. These span only a few points in Toronto, and more importantly, are focused on Downtown.

From the bike counts above, we compared to vehicle and pedestrian counts from a sepearate data set. These are far more wide ranging, and if a relationship could be found, one could make predictions on bike traffic in all the missing points.

While there was unfortunately no obvious direct relationship between traffic numbers (something like a constant ratio, or a linearly increasing one), we do get an acceptable power fit when comparing the bike-to-pedestrian traffic ratio, once we filter out a few areas. The University of Toronto campus has a much higher B/P ratio than any other avaialble bike count locations. This isn't suprising, as a student there I can confirm it's the most popular place to see cyclists in the city. The other area is the waterfront right by financial and entertainment districts. These points had a far lower traffic ratio than given by this fit. So we'll treat them as seperate special cases, mapping the available bike counts directly to any nearby collisions, as long as they fit within a 'box' that we'll use to define those areas.

Note, done in openoffice spreadsheet. Should transfer over into Python for a prettier graph. Will perhaps change the fit equation a bit

Next we needed to clean and filter our dataset. Cleaning mostly invovled searching for NaNs or mixed up entries, as it prevents us from easily manipulating our dataframe. Once we have things roughly cleaned up, we want to filter our collisions. The bike count to pedestrian count relationship we had was based on points that were overwhelminghly found in central Toronto; downtown and the surrounding neighborhoods. So we need some way to do this



In [2]:

    
col_df = pd.DataFrame.from_csv('collisions_clean.csv', index_col='ID')
col_df=col_df.sort_index()
col_df.head()









    Out[2]:






  
    
      
      INJURY
      SAFETY EQUIP.
      ROAD CLASS
      CYCLIST CRASH TYPE
      AGE OF CYCLIST
      STNAME1
      STET 1 TYPE
      STNAME 2
      STET 2 TYPE
      LONG
      ...
      ACC TIME
      ACC TIME2
      TRAFFIC CONTROL
      ROAD SURFACE2
      DATE
      DATE2
      DRIVER ACTION
      DRIVER CONDITION
      YEAR
      INTERSECTION
    
    
      ID
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1
      Minimal
      Unrecorded
      Minor Arterial
      Unrecorded
      34.0
      BIRCHMOUNT
      RD
      HIGHVIEW
      AV
      -79.26539
      ...
      1000.0
      10:00:00
      stop sign
      wet
      1986/01/12
      1986/01/12
      Failed to yield right-of-way
      Normal
      1986
      Birchmount Rd & Highview Ave, Toronto, Ontario...
    
    
      2
      Minimal
      Unrecorded
      Major Arterial
      Unrecorded
      54.0
      LAKE SHORE
      BLVD
      THIRTY-FIFTH
      ST
      -79.53500
      ...
      1000.0
      10:00:00
      no control
      wet
      1986/01/19
      1986/01/19
      Improper turn
      Normal
      1986
      Lake Shore Blvd W & 35th St, Toronto, Ontario,...
    
    
      3
      Minor
      Unrecorded
      Major Arterial
      Unrecorded
      19.0
      LAWRENCE
      AV
      FORTUNE
      GATE
      -79.21800
      ...
      1500.0
      15:00:00
      no control
      dry
      1986/01/23
      1986/01/23
      Failed to yield right-of-way
      Normal
      1986
      Lawrence Ave E & Fortune Gate, Toronto, Ontari...
    
    
      4
      Minimal
      Unrecorded
      Local
      Unrecorded
      34.0
      EUCLID
      AV
      ULSTER
      ST
      -79.41330
      ...
      800.0
      08:00:00
      no control
      loose snow
      1986/01/14
      1986/01/14
      Lost control
      Normal
      1986
      Euclid Ave & Harbord St, Toronto, Ontario, M6G
    
    
      5
      Minimal
      Unrecorded
      Major Arterial
      Unrecorded
      34.0
      AVENUE
      AV
      DRAYTON
      AV
      -79.32041
      ...
      1700.0
      17:00:00
      traffic signal
      dry
      1986/01/30
      1986/01/30
      Other
      Normal
      1986
      Danforth Ave & Drayton Ave, Toronto, Ontario, M4C
    
  

5 rows × 21 columns

The above collisions were all geocoded based on the crosstreets listed, rather than the GPS coordinates, since there were many coordinates that just were defaulted to the center of Toronto. Nonetheless, there were still a few "impossible" intersections, that were then geocded in a second pass using their coordinates.

We also use a list of ALL the intersections in the city, and geocode those, using the GPS coordinates. These also required a second pass using a different geocoder on about 10 or so entries. Later we'll cross reference with the collisions database and drop all the duplicates. This way we can keep track of the "zeroes". That is, intersections which never had a collision in those 25 years.

Below is an example of how this was done, for the entire list of intersections in the city. We don't recommened running this, as it'll take a few hours.

Skip cell below!



In [ ]:

    
#our datasetfrom the Toronto website of all intersections
all_real_df = pd.DataFrame.from_csv('all_intersections_real.csv', index_col='int_id') 
#Get the coordinates out, put them as a list, rather than iterating over the dataframe itself.
coords2 = all_real_df[['latitude','longitude']]
coords2 = coords2.replace(np.nan,0)
coordinate_list = coords2.values.tolist() 
import geocoder
import csv
geo = []
idx = 0
for pair in coordinate_list:
    g = geocoder.arcgis(pair, method='reverse')
    geo.append(g.address)
    print(idx, '', geo[idx])
    idx += 1

myfile = open('centerline_coords_int.csv', 'w')
wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
wr.writerow(geo)

all_real_df['arcgis_int'] = geo



In [3]:

    
import geocoder
#Example of what geocoder output looks like, for two different locations. The API calls return information on the locality in json format
latlng = [43.6532,-79.3828]
g = geocoder.arcgis('Bay St & Bloor St, Toronto', method='geocode')
f= geocoder.arcgis(latlng, method='reverse')
print(g.address)
print(f.address)









    



Bay St & Bloor St W, Toronto, Ontario, M5R
Bay St & Albert St, Toronto, Ontario, M5G

Once this is done, we then add this "intersection" column to our dataframe, and selected all collisions that had a postal code beginning with one of those associated with the areas roughly east of the Junction, West of carlaw, and south of Dupont, with the neigbourhoods near Exhibition place also thrown out. This isn't a perfect system by any means, but we have no reason to believe that our nice power-fit relationship will hold outside the areas we used. In fact, outside central Toronto, it seems likely that it wouldn't hold, as most cyclists are typically found closer to the city center.

The SQl query to do this is:

SELECT *

FROM kit_farfan.centreline_reversegeocoded

WHERE "arcgis_int" LIKE '%M7A%' OR "arcgis_int" LIKE '%M6G%' OR "arcgis_int" LIKE '%M6H%' OR

"arcgis_int" LIKE '%M6P%' OR

"arcgis_int" LIKE '%M5A%' OR "arcgis_int" LIKE '%M5B%' OR "arcgis_int" LIKE '%M5C%' OR

"arcgis_int" LIKE '%M5E%' OR "arcgis_int" LIKE '%M5G%' OR "arcgis_int" LIKE '%M5H%' OR

"arcgis_int" LIKE '%M5J%' OR "arcgis_int" LIKE '%M5K%' OR "arcgis_int" LIKE '%M5L%' OR

"arcgis_int" LIKE '%M5S%' OR "arcgis_int" LIKE '%M5T%' OR

"arcgis_int" LIKE '%M5V%' OR "arcgis_int" LIKE '%M5W%' OR "arcgis_int" LIKE '%M5X%' OR

"arcgis_int" LIKE '%M4Y%' OR "arcgis_int" LIKE '%M4X%' OR "arcgis_int" LIKE '%M4W%' OR

"arcgis_int" LIKE '%M4X%' OR "arcgis_int" LIKE '%M4W%'

Did it in SQL originally since exploratory data analysis was done using Mode Analytic's platform.



In [4]:

    
central_realintersections_df = pd.DataFrame.from_csv('central_real_intersections.csv')
central_realintersections_df= central_realintersections_df.rename(columns = {'arcgis_int':'intersection'}) 

col_df = pd.DataFrame.from_csv('toronto_cycling_central_mode.csv', index_col='id')
col_df = col_df.sort_index()

central_col_df = col_df
from numpy import random
from scipy.spatial import distance
import matplotlib.path as mplPath
import numpy as np

#We want to follow a standard intersection convention to make our life easier. 
#Unfortunately the free geocoder with unlimited request numbers per day doesn't see mto follow one of it's own. 
#So we will get the intersections, strip out the two streets, and order them alphabetically.
st2 = col_df['intersection'].str.split('&').str.get(1)
st2 =st2.str.split(', Toronto').str[0].str.strip()

post = col_df['intersection'].str.split('&').str.get(1)
post = post.str.partition(', ')[2].str.strip()

st1 = col_df['intersection'].str.split('&').str[0]
st1 = st1.str.strip()

intersection_list = []
streets = pd.concat([st1,st2,post],axis=1)
streets = streets.values.tolist()
streets
for pair in streets:
    if isinstance(pair[0],str):
        if pair[1] <= pair[0]:
            temp = pair[0]
            pair[0] = pair[1]
            pair[1] = temp
for pair2 in streets:
    intersection = str(pair2[0]) + ' & ' + str(pair2[1] + ', ' +str(pair2[2]))
    intersection_list.append(intersection)

col_df['intersection'] = intersection_list
    
st2 = central_realintersections_df['intersection'].str.split('&').str.get(1)
st2 =st2.str.split(', Toronto').str[0].str.strip()

post =  central_realintersections_df['intersection'].str.split('&').str.get(1)
post = post.str.partition(', ')[2].str.strip()

st1 =  central_realintersections_df['intersection'].str.split('&').str[0]
st1 = st1.str.strip()

intersection_list = []
streets = pd.concat([st1,st2,post],axis=1)
streets = streets.values.tolist()
streets
for pair in streets:
    if isinstance(pair[0],str):
        if pair[1] <= pair[0]:
            temp = pair[0]
            pair[0] = pair[1]
            pair[1] = temp
for pair2 in streets:
    intersection = str(pair2[0]) + ' & ' + str(pair2[1] + ', ' +str(pair2[2]))
    intersection_list.append(intersection)
    
central_realintersections_df['intersection'] = intersection_list
central_realintersections_df.head()









    Out[4]:






  
    
      
      intersec5
      elev_id
      classifi6
      classifi7
      num_elev
      elevatio9
      elevatio10
      elev_level
      elevation
      elevatio13
      height_r14
      height_r15
      x_coord
      y_coord
      longitude
      latitude
      objectid
      intersection
    
    
      int_id
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      13468990.0
      Lakeshore Ave / Third St
      27090.0
      MNRSL
      Minor-Single Level
      1.0
      501300.0
      Minor
      0.0
      0.0
      NaN
      0.0
      NaN
      316596.462
      4832242.321
      -79.353614
      43.632026
      84477.0
      3rd St & Lakeshore Ave, Toronto, Ontario, M5J
    
    
      13466511.0
      Princess St / Abbey Lane
      16736.0
      MNRSL
      Minor-Single Level
      1.0
      501300.0
      Minor
      0.0
      0.0
      NaN
      0.0
      NaN
      315523.959
      4834358.610
      -79.366863
      43.651092
      29444.0
      Abbey Ln & Princess St, Toronto, Ontario, M5A
    
    
      14014045.0
      Abbott Ave / Ln W Dundas N Glenlake
      21746.0
      LSRSL
      Lesser-Single Level
      1.0
      501700.0
      Laneway
      0.0
      0.0
      NaN
      0.0
      NaN
      308440.142
      4835412.777
      -79.454671
      43.660649
      5606.0
      Abbott Ave & Dundas St W, Toronto, Ontario, M6P
    
    
      14014044.0
      Abbott Ave / Ln W Dundas N Abbott
      26354.0
      LSRSL
      Lesser-Single Level
      1.0
      501700.0
      Laneway
      0.0
      0.0
      NaN
      0.0
      NaN
      308429.281
      4835409.765
      -79.454806
      43.660622
      5604.0
      Abbott Ave & Dundas St W, Toronto, Ontario, M6P
    
    
      14013942.0
      Abbott Ave / Ln W Indian Rd N Abbott
      28279.0
      LSRSL
      Lesser-Single Level
      1.0
      501700.0
      Laneway
      0.0
      0.0
      NaN
      0.0
      NaN
      308179.207
      4835371.356
      -79.457907
      43.660277
      5591.0
      Abbott Ave & Indian Rd, Toronto, Ontario, M6P

Once we have our naming convention sorted, we can start by trying to predict the Collision rates naively, without any machine learning. What do the raw numbers tell us?

We proceed by mapping the pedestrian data to our collisions and intersections, so we have a Bike traffic estimate. Rememebering that we have a few areas that we're treating apart, we have our boxes so we can check if points fall inside those areas.



In [5]:

    
ped_counts_df = pd.DataFrame.from_csv('Vehicle and Pedestrian Counts/TrafficPedestrianVolumes_2011.csv')

#Using the Power fit for the Bike/Pedestrian ratio, we get a function that predicts the bike numbers at any one 
#intersection.

ped_counts_df['bike_prediction'] = (500.2146799711*ped_counts_df['8HrPedVol']**(-0.8950759596))*ped_counts_df['8HrPedVol']
ped_coords = ped_counts_df[['Latitude','Longitude']]
ped_coords = ped_coords.replace(np.nan,0)
ped_coordinate_list = ped_coords.values.tolist()
ped_counts_df['coordinates'] = ped_counts_df[['Latitude','Longitude']].apply(tuple, axis=1)
ped_counts_df.head()









    Out[5]:






  
    
      
      Main
      Midblock Route
      Side 1 Route
      Side 2 Route
      Activation Date
      Latitude
      Longitude
      Count Date
      8HrPedVol
      8HrVehVol
      24HrPedVol
      24HrVehVol
      bike_prediction
      coordinates
    
    
      PX2
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      2
      JARVIS ST
      NaN
      FRONT ST E
      NaN
      1948/11/15
      43.649450
      -79.371410
      2011/09/08
      17008
      19335
      34016
      38670
      1390.124396
      (43.64945, -79.37141)
    
    
      3
      KING ST E
      NaN
      JARVIS ST
      NaN
      1950/08/23
      43.650461
      -79.371924
      2011/09/07
      37719
      17665
      75438
      35330
      1511.289334
      (43.6504606, -79.3719239)
    
    
      4
      JARVIS ST
      NaN
      ADELAIDE ST E
      NaN
      1958/09/12
      43.651534
      -79.372360
      2008/06/16
      1991
      19726
      3982
      39452
      1109.962532
      (43.6515337, -79.37236)
    
    
      5
      JARVIS ST
      NaN
      RICHMOND ST E
      NaN
      1962/04/21
      43.652718
      -79.372824
      2009/07/30
      2696
      24842
      5392
      49684
      1145.833262
      (43.6527176, -79.372824)
    
    
      6
      JARVIS ST
      NaN
      QUEEN ST E
      NaN
      1928/08/24
      43.653704
      -79.373238
      2011/05/18
      3622
      19772
      7244
      39544
      1181.886259
      (43.653704, -79.373238)



In [6]:

    
ped_dict = ped_counts_df.set_index('coordinates').to_dict()['bike_prediction']
col_df['coordinates'] = col_df[['lat','long']].apply(tuple, axis=1)
col_df.head()
central_realintersections_df['coordinates'] = central_realintersections_df[['latitude','longitude']].apply(tuple, axis=1)
equiv = {'Laneway':3,'Minor':2,'Major':1}
central_realintersections_df['num_class'] = central_realintersections_df['elevatio10'].map(equiv)
central_realintersections_df= central_realintersections_df.sort_values(by=['intersection','num_class'])
#take the lowest numerical class, drop the other duplicates. 
central_realintersections_df = central_realintersections_df.drop_duplicates(subset='intersection', keep='first')



In [7]:

    
waterfront_Path = mplPath.Path(np.array([[43.635497, -79.398156],
[43.639000, -79.400725],
[43.640822, -79.401427],
[43.646984, -79.376977],
[43.649889, -79.370343],
[43.651614, -79.362725],
[43.648090, -79.361191],
[43.646451, -79.361937],
[43.641209, -79.376739],
[43.639969, -79.379965],
[43.637698, -79.391847],
[43.635666, -79.398368],
[43.636489, -79.399603]]))

campus_Path = mplPath.Path(np.array([[43.659838, -79.399772],
[43.661388, -79.401006],
[43.665592, -79.402705],
[43.666768, -79.401354],
[43.668213, -79.393958],
[43.663141, -79.392719],
[43.659264, -79.394100],
[43.658329, -79.398204]]
))

castleFrank_Path = mplPath.Path(np.array([[43.672105, -79.376696],
[43.671562, -79.370962],
[43.674418, -79.366821],
[43.676086, -79.358731],
[43.677056, -79.354021],
[43.677040, -79.355126],
[43.677622, -79.358516],
[43.676194, -79.359503],
[43.675170, -79.364760],
[43.674580, -79.367539],
[43.672019, -79.371112],
[43.672710, -79.376927]]
))

campus = [[43.657946,-79.39993],
          [43.663502,-79.40005],
          [43.663051,-79.402196],
          [43.665429,-79.398975]
         ]

campus_dict = {(43.657946,-79.39993):4495.8,
               (43.663502,-79.400050):2653,
               (43.663051,-79.402196):3574,
               (43.665429,-79.398975):2304
              }

waterfront = [[43.648208,-79.370923],
              [43.642711,-79.394043],
              [43.639944,-79.387032],
              [43.640625,-79.3932],
              [43.640093,-79.380152]
             ]

waterfront_dict = {(43.648208,-79.370923):330,
              (43.642711,-79.394043):745,
              (43.639944,-79.387032):128,
              (43.640625,-79.3932):154,
              (43.640093,-79.380152):235
             }          
castleFrank = [[43.673792,-79.368187]]

castleFrank_dict = {(43.673792,-79.368187):3413}

Here we have defined our boxes, and our pedestrian traffic dictionary. Now we go through the data sets, check if the point belongs in one of our boxes. If it does, we'll map the closest traffic count in the box. Otherwise, we find the closest intersection in the Pedestrian dataframe, and use the predicted bike traffic number from our fit



In [8]:

    
import csv
closest_traffic_point = []
bike_traffic = []
i = 0
for i in range(0,len(central_col_df)):
    point = central_col_df['coordinates'].iloc[i]
    if waterfront_Path.contains_point(point):
        closest = waterfront[distance.cdist([point], waterfront).argmin()]
        closest_traffic_point.append(tuple(closest))
        bike_traffic.append(waterfront_dict[tuple(closest)])
    elif campus_Path.contains_point(point):
        closest = campus[distance.cdist([point], campus).argmin()]
        closest_traffic_point.append(tuple(closest))
        bike_traffic.append(campus_dict[tuple(closest)])
    elif castleFrank_Path.contains_point(point):
        closest = castleFrank[distance.cdist([point], castleFrank).argmin()]
        closest_traffic_point.append(tuple(closest))
        bike_traffic.append(castleFrank_dict[tuple(closest)])
    else:
        closest = ped_coordinate_list[distance.cdist([point], ped_coordinate_list).argmin()]
        closest_traffic_point.append(tuple(closest))
        bike_traffic.append(ped_dict[tuple(closest)])
        
myfile3 = open('closest_intersection.csv', 'w')
wr = csv.writer(myfile3)
wr.writerow(closest_traffic_point)
myfile3.close()

myfile4 = open('closest_int_bike_predictions.csv', 'w')
wr = csv.writer(myfile4)
wr.writerow(bike_traffic)
myfile4.close()



In [9]:

    
central_col_df['closest_traffic'] = tuple(closest_traffic_point)
central_col_df['traffic_count'] = bike_traffic

central_col_df.rename(columns={'closest_traffic': 'closest_ped_count', 'traffic_count':'predicted_bike_count'})
central_col_df.head()









    Out[9]:






  
    
      
      injury
      safety_equip
      road_class
      cyclist_crash_type
      age_of_cyclist
      stname1
      stet_1_type
      stname_2
      stet_2_type
      long
      ...
      date
      date2
      driver_action
      driver_condition
      year
      intersection
      coords
      coordinates
      closest_traffic
      traffic_count
    
    
      id
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      4.0
      Minimal
      Unrecorded
      Local
      Unrecorded
      34.0
      EUCLID
      AV
      ULSTER
      ST
      -79.413300
      ...
      1986-01-14 00:00:00
      1986-01-14 00:00:00
      Lost control
      Normal
      1986.0
      Euclid Ave & Ulster St, Toronto, Ontario, M6G
      [43.65906013500046, -79.41272077799965]
      (43.6599, -79.4133)
      (43.66055, -79.41466000000001)
      1145.966977
    
    
      7.0
      Minimal
      Unrecorded
      Major Arterial
      Unrecorded
      34.0
      SALEM
      AV
      FERNBANK
      AV
      -79.433801
      ...
      1986-01-28 00:00:00
      1986-01-28 00:00:00
      Driving properly
      Normal
      1986.0
      Fernbank Ave & Salem Ave, Toronto, Ontario, M6H
      [43.66582101600045, -79.43379108299968]
      (43.665832, -79.433801)
      (43.66717180000001, -79.43169820000001)
      1043.663994
    
    
      8.0
      Minor
      Unrecorded
      Major Arterial
      Unrecorded
      24.0
      DUNDAS
      ST
      BEVERLEY
      ST
      -79.394060
      ...
      1986-01-16 00:00:00
      1986-01-16 00:00:00
      Failed to yield right-of-way
      Normal
      1986.0
      Beverley St & Dundas St W, Toronto, Ontario, M5T
      [43.65382045300049, -79.39385107199968]
      (43.65377, -79.39406)
      (43.65381040000001, -79.39383199999999)
      1235.708176
    
    
      9.0
      Minimal
      Unrecorded
      Major Arterial
      Unrecorded
      24.0
      BLOOR
      ST
      MARGUERETTA
      ST
      -79.440590
      ...
      1986-02-19 00:00:00
      1986-02-19 00:00:00
      Failed to yield right-of-way
      Normal
      1986.0
      Bloor St W & Margueretta, Toronto, Ontario, M6H
      [43.658975,-79.439718]
      (43.6588, -79.44059)
      (43.6591, -79.43918000000002)
      1083.287193
    
    
      10.0
      Minor
      Unrecorded
      Minor Arterial
      Unrecorded
      54.0
      AVENUE
      RD
      SHANLY
      ST
      -79.430720
      ...
      1986-03-01 00:00:00
      1986-03-01 00:00:00
      Driving properly
      Normal
      1986.0
      Dovercourt Rd & Shanly St, Toronto, Ontario, M6H
      [43.664316,-79.430499]
      (43.66425, -79.43072)
      (43.66717180000001, -79.43169820000001)
      1043.663994
    
  

5 rows × 25 columns



In [10]:

    
import csv
closest_traffic_point = []
bike_traffic = []
i = 0
for i in range(0,len(central_realintersections_df)):
    point = central_realintersections_df['coordinates'].iloc[i]
    if waterfront_Path.contains_point(point):
        closest = waterfront[distance.cdist([point], waterfront).argmin()]
        closest_traffic_point.append(tuple(closest))
        bike_traffic.append(waterfront_dict[tuple(closest)])
    elif campus_Path.contains_point(point):
        closest = campus[distance.cdist([point], campus).argmin()]
        closest_traffic_point.append(tuple(closest))
        bike_traffic.append(campus_dict[tuple(closest)])
    elif castleFrank_Path.contains_point(point):
        closest = castleFrank[distance.cdist([point], castleFrank).argmin()]
        closest_traffic_point.append(tuple(closest))
        bike_traffic.append(castleFrank_dict[tuple(closest)])
    else:
        closest = ped_coordinate_list[distance.cdist([point], ped_coordinate_list).argmin()]
        closest_traffic_point.append(tuple(closest))
        bike_traffic.append(ped_dict[tuple(closest)])
        
myfile5 = open('closest_intersection_centreline.csv', 'w')
wr = csv.writer(myfile5)
wr.writerow(closest_traffic_point)
myfile5.close()

myfile6 = open('closest_int_bike_predictions_centreline.csv', 'w')
wr = csv.writer(myfile6)
wr.writerow(bike_traffic)
myfile6.close()

central_realintersections_df['closest_traffic'] = tuple(closest_traffic_point)
central_realintersections_df['traffic_count'] = bike_traffic

central_realintersections_df.rename(columns={'closest_traffic': 'closest_ped_count', 'traffic_count':'predicted_bike_count'})
central_realintersections_df.sort_values(by='traffic_count').tail()









    Out[10]:






  
    
      
      intersec5
      elev_id
      classifi6
      classifi7
      num_elev
      elevatio9
      elevatio10
      elev_level
      elevation
      elevatio13
      ...
      x_coord
      y_coord
      longitude
      latitude
      objectid
      intersection
      coordinates
      num_class
      closest_traffic
      traffic_count
    
    
      int_id
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      13462813.0
      Castle Frank Rd / Bloor St E
      33572.0
      MJRSL
      Major-Single Level
      1.0
      501200.0
      Major
      0.0
      0.0
      NaN
      ...
      315411.781
      4836878.500
      -79.368204
      43.673775
      12108.0
      Bloor St E & Castle Frank Rd, Toronto, Ontario...
      (43.673775492, -79.368204105)
      1.0
      (43.673792, -79.368187)
      3413.0
    
    
      14021473.0
      Glen Morris St / Ln E Spadina N Harbord / Ln E...
      16644.0
      LSRSL
      Lesser-Single Level
      1.0
      501700.0
      Laneway
      0.0
      0.0
      NaN
      ...
      312720.746
      4835791.622
      -79.401591
      43.664026
      44814.0
      Glen Morris St & Spadina Ave, Toronto, Ontario...
      (43.664026103, -79.401591237)
      3.0
      (43.663051, -79.402196)
      3574.0
    
    
      13465137.0
      Galbraith Rd / St George St
      5198.0
      MNRSL
      Minor-Single Level
      1.0
      501300.0
      Minor
      0.0
      0.0
      NaN
      ...
      313107.457
      4835358.550
      -79.396803
      43.660124
      32167.0
      Galbraith Rd & St George St, Toronto, Ontario,...
      (43.660123609, -79.396803266)
      2.0
      (43.657946, -79.39993)
      4495.8
    
    
      13465081.0
      St George St / Russell St
      28558.0
      MNRSL
      Minor-Single Level
      1.0
      501300.0
      Minor
      0.0
      0.0
      NaN
      ...
      313096.575
      4835394.417
      -79.396938
      43.660447
      15644.0
      Russell St & St George St, Toronto, Ontario, M5S
      (43.660446592, -79.396937609)
      2.0
      (43.657946, -79.39993)
      4495.8
    
    
      13465147.0
      Russell St / Huron St
      17776.0
      MNRSL
      Minor-Single Level
      1.0
      501300.0
      Minor
      0.0
      0.0
      NaN
      ...
      312953.652
      4835353.068
      -79.398710
      43.660076
      15652.0
      Huron St & Russell St, Toronto, Ontario, M5S
      (43.660076003, -79.398710203)
      2.0
      (43.657946, -79.39993)
      4495.8
    
  

5 rows × 22 columns

So now we have traffic numbers for our entire intersection dataframe, as well as our dataframe of collisions. Now let's count up our collisions (including the zeroes), and order them.

To make things cleaner, we'll strip away the columns that aren't helpful to us at this time. Once we group the collision dataframe by intersection, we can concat that with the dataframe of all intersections (now call zeroes_df).

Once it's sorted by our normalized accident rate, we can drop the non-zero intersections from our merged dataframe by just dropping the duplicated indexes. Then we should be left with our collisions, properly grouped, and the "zero" intersections.



In [11]:

    
central_realint_df = central_realintersections_df[['intersection','coordinates','closest_traffic','traffic_count','elevatio10']]
central_realint_df.sort_values(by='traffic_count')

result = pd.merge(central_col_df,central_realint_df, how ='outer',on='intersection')
result = result.sort_values(by='intersection')
traff2 =result['traffic_count_y']
result['traffic_count_x'] =result['traffic_count_x'].fillna(value = traff2 )



In [12]:

    
result.to_csv('merged_data.csv')
intersection_df = central_col_df.groupby('intersection').agg({'traffic_count':[np.size,np.mean]}).sort_values(by=[('traffic_count','size')],ascending=False)
intersection_df.columns = intersection_df.columns.set_levels(['traffic_count'], level=0)
zeroes_df = central_realint_df.groupby('intersection').agg({'traffic_count':[np.size,np.mean]}).sort_values(by=[('traffic_count','size')],ascending=False)
zeroes_df['traffic_count','size'] = 0
#Let's use 360 as the days in the year. It's a nice round number. And since weekends are less busy it's probably still an overestimate.
zeroes_df['traffic_count','normalized_accident_rate'] = zeroes_df['traffic_count','size']/(25*360*zeroes_df['traffic_count','mean'])
zeroes_df = zeroes_df.sort_values(by=[('traffic_count','normalized_accident_rate')],ascending=False)
intersection_df['traffic_count','normalized_accident_rate'] = intersection_df['traffic_count','size']/(25*360*intersection_df['traffic_count','mean'])
intersection_df = intersection_df.sort_values(by=[('traffic_count','normalized_accident_rate')],ascending=False)
intersection_df.head(20)

totals_df = pd.concat([intersection_df,zeroes_df])
df_test = totals_df['traffic_count','normalized_accident_rate']
df = df_test.to_frame(name='normalized yearly accident rate')
df['total collisions'] = totals_df['traffic_count','size']
df['traffic estimate'] = totals_df['traffic_count','mean']
#df = df.dropna()
#Here we make the index a column, drop the duplicates from this intersection column, then make it the index again. 
df = df.reset_index().drop_duplicates(subset='intersection', keep='first').set_index('intersection')
len(df)









    Out[12]:





2078

Now we can take a look at the dataframe we've created. The one we'll use for our machine learning approach will be roughly similar, but instead we'll have carried over the information on the type of intersection, and the type of control (stop sign, yield, no control, traffic lights etc). These will be part of the feature set we use.

For now though, we'll just proceed via Emperical Baysian analysis. We want to account for the fact that collisions are very rare events. Without a very long period of time, it would be unwise to assume that our sample distribution properly matches the actual probability of collisions.

In fact we can see that there is a strong "break" away from roughly linear spread when we plot the accident yearly rate against the total collisions.

Should I be concerned about this linear trend? I half expected a more uniform spread, which I believe is what one sees in the case of automobile collisions. Obviously, the power-fit that I used to estimate traffic numbers does mean that I'm imposing a shape on this distribution. So that makes sense. Accepting that, we can obviously still spot the outliers.



In [13]:

    
pd.set_option('display.float_format', '{:.4g}'.format)
df.sort_values(by='total collisions', ascending = False).head()









    Out[13]:






  
    
      
      normalized yearly accident rate
      total collisions
      traffic estimate
    
    
      intersection
      
      
      
    
  
  
    
      Queen St W & Spadina Ave, Toronto, Ontario, M5V
      6.256e-06
      74
      1314
    
    
      Bay St & Dundas St W, Toronto, Ontario, M5G
      5.393e-06
      73
      1504
    
    
      Queen St W & University Ave, Toronto, Ontario, M5H
      5.315e-06
      64
      1338
    
    
      Bloor St E & Yonge St, Toronto, Ontario, M4W
      4.861e-06
      63
      1440
    
    
      Bathurst St & Queen St W, Toronto, Ontario, M5T
      5.334e-06
      61
      1271



In [14]:

    
df.loc['Jarvis St & King St E, Toronto, Ontario, M5A']









    Out[14]:





normalized yearly accident rate   7.352e-08
total collisions                          1
traffic estimate                       1511
Name: Jarvis St & King St E, Toronto, Ontario, M5A, dtype: float64



In [15]:

    
sns.set_context("notebook", font_scale=1.1)
scatter = sns.jointplot(x='total collisions', y = 'normalized yearly accident rate', data = df, ylim= (0,0.00002))

So we'll want to create a prior estimate on what the distribution actually is, using the samples that we have available. In a sense, we're using the data twice. First to estimate our prior. Then we'll use it along with this prior to create a posterior distribution.

There is a lot of discussion revolving around Full Bayesian analysis, versus Empirical Bayesian Analysis. In some sense, if you know enough about the system you are studying, you should be able to develop a good prior beforehand, rather than using your limited dataset to do so. On the otherhand, imposing beliefs about your system beforehand can be equally controversial to depending on the data twice.



In [16]:

    
df.to_csv('totals_test.csv')
import scipy.stats as stats
sns.distplot(df["traffic estimate"], bins = 100)









    



/home/christian/anaconda3/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j






    Out[16]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fac259a47f0>



In [17]:

    
sns.distplot(df["total collisions"], bins = 15)









    



/home/christian/anaconda3/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j






    Out[17]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fac22fb0be0>



In [18]:

    
sns.distplot(df["total collisions"], fit=stats.gamma,  bins=range(0, 75, 1), kde=False)









    Out[18]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fac22dc8710>

A gamma distribution looks like a good choice to fit our prior. We know that we are dealing with sums of poission distributions where each intersection is basically a poisson process, with the mean accidents for a year being the parameter. So the above histogram is really summing up the number of accidents each year, over these poisson processes. While we could try and fit each intersection individually, this is a whole lot of work. And it's not clear how one would use these estimated parameters to compute the prior distribution for the ensemble.

A gamma distribution looks nice, and intuitively makes sense since the prior for a poisson distribution is a gamma. Trying to read up on the intuition for how a ensemble of Poisson processes, each with their own paramater (not merely the same), has lead me to some very involved work on Bayesian inference for traffic problems, without directly using machine learning. We won't go that deep in this case though. Suffice to say, there are extensions to the above Poisson model, which attempt to account for the heterogenity of various intersections. The same features one would choose to use in a machine learning approach would show up as parameters in the distribution one tries to fit, and the poisson parameters for each intersection is assumed to be a random variablle, drawn from a chosen distribution

An overview of 3 models which extend our basic method is found in (Miranda-Mereno et al 2005): http://www.civil.uwaterloo.ca/itss/papers%5C2005-2%20(Alternative%20risk%20models%20for%20ranking%20sites).pdf



In [19]:

    
stats.gamma.fit(df['total collisions'], loc =0)









    Out[19]:





(0.43930468201111406, -7.7697069386935107e-32, 15.102836399931292)



In [20]:

    
import matplotlib.pylab as plt

#plot normed histogram
plt.hist(df['total collisions'], normed=True,  bins=range(0, 75, 1))

# find minimum and maximum of xticks, so we know
# where we should compute theoretical distribution
xt = plt.xticks()[0]  
xmin, xmax = 0.1, 75
lnspc = np.linspace(xmin, xmax, len(df))

# exactly same as above
ag,bg,cg = stats.gamma.fit(df['total collisions'], loc=0)  
pdf_gamma = stats.gamma.pdf(lnspc,ag, bg,cg)  
plt.plot(lnspc, pdf_gamma, label="Gamma")

plt.axis([0,20,0,0.6])
plt.show()



In [21]:

    
#plot normed histogram
plt.hist(df['total collisions'], normed=True,  bins=range(0, 75, 1))

# find minimum and maximum of xticks, so we know
# where we should compute theoretical distribution
xt = plt.xticks()[0]  
xmin, xmax = 0.1, 75
lnspc = np.linspace(xmin, xmax, len(df))

# exactly same as above
ag,bg,cg = stats.gamma.fit(df['total collisions'])  
pdf_gamma = stats.gamma.pdf(lnspc,ag, bg,cg)  
plt.plot(lnspc, pdf_gamma, label="Gamma")
plt.xlabel('total collisions')
plt.ylabel('normalized count')
plt.axis([0,70,0,0.08])
plt.show()

That's a pretty good looking fit. The histogram shows the normalized counts of integer collision totals for our merged dataframe. We can now use the fit paramters of our model, and use this to adjust our estimate of the accident rate.

The mathematics of why the calculation below works to update our estiamte is straight forward to derive from Bayesian probablity theory, but intuitively it just looks likes a weighted average of the prior and sample estimates (when expaneded), where more weight is given to the sample data when the time period is larger, or when the traffic numbers are greater. This makes sense, as in those cases we are MORE confident in our data providing us with a good idea of probability distibution for that intersection.



In [22]:

    
beta = 1/cg
df['posterior mean'] = (ag + df['total collisions'])/(beta+25*360*df['traffic estimate']) 
df['posterior STD'] = np.sqrt((ag + df['total collisions'])/((beta+25*360*df['traffic estimate'])**2)) 
pd.set_option('display.float_format', '{:.5g}'.format)
df.sort_values(by='posterior mean',ascending = False).head(10)









    Out[22]:






  
    
      
      normalized yearly accident rate
      total collisions
      traffic estimate
      posterior mean
      posterior STD
    
    
      intersection
      
      
      
      
      
    
  
  
    
      Front St W & York St, Toronto, Ontario, M5J
      1.1649e-05
      29
      276.61
      1.19e-05
      2.1863e-06
    
    
      Queens Quay W & Rees St, Toronto, Ontario, M5J
      7.1485e-06
      9
      139.89
      7.6438e-06
      2.464e-06
    
    
      Bay St & Queens Quay W, Toronto, Ontario, M5J
      7.3792e-06
      19
      286.09
      7.6214e-06
      1.7205e-06
    
    
      Station St & York St, Toronto, Ontario, M5J
      7.0922e-06
      15
      235
      7.387e-06
      1.8689e-06
    
    
      Queens Quay W & York St, Toronto, Ontario, M5J
      6.9128e-06
      18
      289.32
      7.1523e-06
      1.6574e-06
    
    
      The Esplanade & Yonge St, Toronto, Ontario, M5E
      6.734e-06
      20
      330
      6.944e-06
      1.5291e-06
    
    
      Lake Shore Blvd W & York St, Toronto, Ontario, M5J
      6.6194e-06
      14
      235
      6.9142e-06
      1.8081e-06
    
    
      Front St W & John St, Toronto, Ontario, M5V
      6.469e-06
      19
      326.34
      6.6813e-06
      1.5083e-06
    
    
      Lake Shore Blvd E & Yonge St, Toronto, Ontario, M5E
      6.1466e-06
      13
      235
      6.4414e-06
      1.7452e-06
    
    
      Queen St W & Spadina Ave, Toronto, Ontario, M5V
      6.2559e-06
      74
      1314.3
      6.3087e-06
      7.303e-07



In [23]:

    
df = df.sort_values(by='total collisions', ascending=False)



In [24]:

    
df.head()









    Out[24]:






  
    
      
      normalized yearly accident rate
      total collisions
      traffic estimate
      posterior mean
      posterior STD
    
    
      intersection
      
      
      
      
      
    
  
  
    
      Queen St W & Spadina Ave, Toronto, Ontario, M5V
      6.2559e-06
      74
      1314.3
      6.3087e-06
      7.303e-07
    
    
      Bay St & Dundas St W, Toronto, Ontario, M5G
      5.393e-06
      73
      1504
      5.4391e-06
      6.339e-07
    
    
      Queen St W & University Ave, Toronto, Ontario, M5H
      5.3153e-06
      64
      1337.9
      5.3671e-06
      6.6764e-07
    
    
      Bloor St E & Yonge St, Toronto, Ontario, M4W
      4.8614e-06
      63
      1439.9
      4.9095e-06
      6.1551e-07
    
    
      Bathurst St & Queen St W, Toronto, Ontario, M5T
      5.3338e-06
      61
      1270.7
      5.3883e-06
      6.864e-07



In [25]:

    
df.to_csv('eba_normalizedcollisions.csv')

So now we have our dataframe with posterior yearly accident rate, as well as a posterior standard deviation. If one looks deeper throughout the data set we see that the order changes slightly, but nothing too crazy. Since we're using so many years of data, the "prior" dominates. The biggest changes are for those intersections that have a very low traffic count, comparatively. Those with high traffic counts have smaller changes.



In [26]:

    
scatter = sns.jointplot(x='total collisions', y = 'posterior mean', data = df, ylim= (0,0.00002))

It is at this point that we may start wondering whether using ALL 25 years of data makes sense. If the predictions are to be for the past year, then surely data closer to this time frame is more applicable. We saw in our exploratory data analysis that there was a total decline in accidents year by year up until 1996. There was a sharp decline in minimal and minor injuries that year, with the slack partly picked up by more "No injuries". Whether this was due to a change in how accidents were classified, or due to the legislation mandating minors to use helmets is hard to say. Likely it could be both!

After this year however, cycling accidents pick up again. and start looking fairly consistent. Expecially as compared to the first dozen years.

It may be that predictive power would improve if we limited the data set to 1998 onwards. This would roughly half our number of starting collisions, but those remaining may be more indicative of current cycling trends. It also avoids that obvious decreasing trend and sudden jump in 96-97.

Moving forward, we can compare the predicted collision numbers to those observed over the past year, from the Twitterbot. We will have a trained model to also compare prediction numbers. One hopes that the increased sophistication of the machine learning approach, using the intersection features which are available, will get closer to the observed data.



In [27]:

    
full_df = pd.DataFrame.from_csv('collisions_clean.csv', index_col='ID')



In [28]:

    
fullgeocoded_df = pd.DataFrame.from_csv('collisions_geocode.csv', index_col ='ID')



In [29]:

    
fullgeocoded_df.head()









    Out[29]:






  
    
      
      INJURY
      SAFETY EQUIP.
      ROAD CLASS
      CYCLIST CRASH TYPE
      AGE OF CYCLIST
      STNAME1
      STET 1 TYPE
      STNAME 2
      STET 2 TYPE
      LONG
      ...
      ACC TIME2
      TRAFFIC CONTROL
      ROAD SURFACE2
      DATE
      DATE2
      DRIVER ACTION
      DRIVER CONDITION
      YEAR
      INTERSECTION
      coords
    
    
      ID
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      3558
      Minimal
      No equipment available
      Local
      Unrecorded
      11
      ELEVENTH
      ST
      LAKE SHORE
      BLVD
      -79.509
      ...
      17:00:00
      no control
      unrecorded
      1988/06/27
      1988/06/27
      Driving properly
      Normal
      1988
      11th St & Lake Shore Blvd W, Etobicoke, Ontari...
      [43.59980079900049, -79.50971068699965]
    
    
      499
      Minimal
      Unrecorded
      Major Arterial
      Unrecorded
      19
      ELEVENTH
      ST
      LAKE SHORE
      BLVD
      -79.51
      ...
      15:00:00
      stop sign
      dry
      1986/06/26
      1986/06/26
      Driving properly
      Normal
      1986
      11th St & Lake Shore Blvd W, Etobicoke, Ontari...
      [43.59980079900049, -79.50971068699965]
    
    
      1529
      Minor
      Unrecorded
      Major Arterial
      Unrecorded
      19
      LAKE SHORE
      BLVD
      ELEVENTH
      ST
      -79.509
      ...
      18:00:00
      no control
      dry
      1987/04/07
      1987/04/07
      Failed to yield right-of-way
      Normal
      1987
      11th St & Lake Shore Blvd W, Etobicoke, Ontari...
      [43.59980079900049, -79.50971068699965]
    
    
      7880
      Minor
      No equipment available
      Major Arterial
      Unrecorded
      42
      LAKE SHORE
      BLVD
      ELEVENTH
      ST
      -79.509
      ...
      15:00:00
      no control
      unrecorded
      1991/06/11
      1991/06/11
      Driving properly
      Normal
      1991
      11th St & Lake Shore Blvd W, Etobicoke, Ontari...
      [43.59980079900049, -79.50971068699965]
    
    
      10049
      Minor
      Equipment not used but available
      Local
      Unrecorded
      18
      MORRISON
      ST
      ELEVENTH
      ST
      -79.509
      ...
      21:40:00
      no control
      dry
      1992/09/04
      1992/09/04
      Driving properly
      Normal
      1992
      11th St & Morrison St, Etobicoke, Ontario, M8V
      [43.59743007200046, -79.50873080099967]
    
  

5 rows × 22 columns



In [30]:

    
fullgeocoded_df= fullgeocoded_df.rename(columns = {'INTERSECTION':'intersection'}) 

col_df = fullgeocoded_df.sort_index()
all_realintersections_df = pd.DataFrame.from_csv('real_reversegeocoded.csv')
all_realintersections_df= all_realintersections_df.rename(columns = {'arcgis_int':'intersection'}) 

from numpy import random
from scipy.spatial import distance
import matplotlib.path as mplPath
import numpy as np

#We want to follow a standard intersection convention to make our life easier. 
#Unfortunately the free geocoder with unlimited request numbers per day doesn't see mto follow one of it's own. 
#So we will get the intersections, strip out the two streets, and order them alphabetically.
st2 = col_df['intersection'].str.split('&').str.get(1)
st2 =st2.str.split(', Toronto').str[0].str.strip()

post = col_df['intersection'].str.split('&').str.get(1)
post = post.str.partition(', ')[2].str.strip()

st1 = col_df['intersection'].str.split('&').str[0]
st1 = st1.str.strip()

intersection_list = []
streets = pd.concat([st1,st2,post],axis=1)
streets = streets.values.tolist()
streets
for pair in streets:
    if isinstance(pair[0],str) and isinstance(pair[1],str):
        if pair[1] <= pair[0]:
            temp = pair[0]
            pair[0] = pair[1]
            pair[1] = temp
for pair2 in streets:
    intersection = str(pair2[0]) + ' & ' + str(pair2[1]) + ', ' +str(pair2[2])
    intersection_list.append(intersection)

col_df['intersection'] = intersection_list
    
st2 = all_realintersections_df['intersection'].str.split('&').str.get(1)
st2 =st2.str.split(', Toronto').str[0].str.strip()

post =  all_realintersections_df['intersection'].str.split('&').str.get(1)
post = post.str.partition(', ')[2].str.strip()

st1 =  all_realintersections_df['intersection'].str.split('&').str[0]
st1 = st1.str.strip()

intersection_list = []
streets = pd.concat([st1,st2,post],axis=1)
streets = streets.values.tolist()
streets
for pair in streets:
    if isinstance(pair[0],str):
        if pair[1] <= pair[0]:
            temp = pair[0]
            pair[0] = pair[1]
            pair[1] = temp
for pair2 in streets:
    intersection = str(pair2[0]) + ' & ' + str(pair2[1] + ', ' +str(pair2[2]))
    intersection_list.append(intersection)
    
all_realintersections_df['intersection'] = intersection_list
all_realintersections_df.head()









    Out[30]:






  
    
      
      intersec5
      elev_id
      classifi6
      classifi7
      num_elev
      elevatio9
      elevatio10
      elev_level
      elevation
      elevatio13
      height_r14
      height_r15
      x_coord
      y_coord
      longitude
      latitude
      objectid
      intersection
    
    
      int_id
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      20041183
      Allingham Gdns / Ln N Wilson W Collinson
      29125
      LSRML
      Lesser-Multi Level
      2
      501700
      Laneway
      0
      0
      NaN
      0
      NaN
      3.094e+05
      4.8438e+06
      -79.443
      43.736
      32028
      Allingham Gdns & King High Ave, Toronto, Ontar...
    
    
      14045679
      Ln N Annette E Beresford / Ln S St Johns W Run...
      17094
      LSRML
      Lesser-Multi Level
      2
      501700
      Laneway
      0
      0
      NaN
      0
      NaN
      3.0633e+05
      4.8354e+06
      -79.481
      43.66
      44986
      Annette St & Runnymede Rd, Toronto, Ontario, M6S
    
    
      14665974
      Byng Ave / Ln 2 N Newport W Byng
      22281
      LSRML
      Lesser-Multi Level
      1
      501700
      Laneway
      0
      0
      NaN
      0
      NaN
      3.2236e+05
      4.8395e+06
      -79.282
      43.697
      51026
      Byng Ave & Prairie Dr, Toronto, Ontario, M1L
    
    
      13465351
      Durie St / Ln E Durie S St Johns / Ln E Winder...
      7655
      LSRML
      Lesser-Multi Level
      2
      501700
      Laneway
      0
      0
      NaN
      0
      NaN
      3.0618e+05
      4.8353e+06
      -79.483
      43.66
      32868
      Annette St & Durie St, Toronto, Ontario, M6S
    
    
      13459936
      Luttrell Ave / Ln 2 S Danforth E Luttrell / Ln...
      25170
      LSRML
      Lesser-Multi Level
      2
      501700
      Laneway
      0
      0
      NaN
      0
      NaN
      3.2145e+05
      4.8385e+06
      -79.293
      43.689
      85370
      Drummondville Ln & Flagman St, Toronto, Ontari...



In [31]:

    
central_realintersections_df = pd.DataFrame.from_csv('central_real_intersections.csv')
central_realintersections_df= all_realintersections_df.rename(columns = {'arcgis_int':'intersection'}) 
central_col_df = pd.DataFrame.from_csv('toronto_cycling_central_mode.csv', index_col='id')
from numpy import random
from scipy.spatial import distance
import matplotlib.path as mplPath
import numpy as np

#We want to follow a standard intersection convention to make our life easier. 
#Unfortunately the free geocoder with unlimited request numbers per day doesn't see mto follow one of it's own. 
#So we will get the intersections, strip out the two streets, and order them alphabetically.
st2 = central_col_df['intersection'].str.split('&').str.get(1)
st2 =st2.str.split(', Toronto').str[0].str.strip()

post = central_col_df['intersection'].str.split('&').str.get(1)
post = post.str.partition(', ')[2].str.strip()

st1 = central_col_df['intersection'].str.split('&').str[0]
st1 = st1.str.strip()

intersection_list = []
streets = pd.concat([st1,st2,post],axis=1)
streets = streets.values.tolist()
streets
for pair in streets:
    if isinstance(pair[0],str) and isinstance(pair[1],str):
        if pair[1] <= pair[0]:
            temp = pair[0]
            pair[0] = pair[1]
            pair[1] = temp
for pair2 in streets:
    intersection = str(pair2[0]) + ' & ' + str(pair2[1]) + ', ' +str(pair2[2])
    intersection_list.append(intersection)

central_col_df['intersection'] = intersection_list
    
st2 = central_realintersections_df['intersection'].str.split('&').str.get(1)
st2 =st2.str.split(', Toronto').str[0].str.strip()

post =  central_realintersections_df['intersection'].str.split('&').str.get(1)
post = post.str.partition(', ')[2].str.strip()

st1 =  central_realintersections_df['intersection'].str.split('&').str[0]
st1 = st1.str.strip()

intersection_list = []
streets = pd.concat([st1,st2,post],axis=1)
streets = streets.values.tolist()
streets
for pair in streets:
    if isinstance(pair[0],str):
        if pair[1] <= pair[0]:
            temp = pair[0]
            pair[0] = pair[1]
            pair[1] = temp
for pair2 in streets:
    intersection = str(pair2[0]) + ' & ' + str(pair2[1] + ', ' +str(pair2[2]))
    intersection_list.append(intersection)
    
central_realintersections_df['intersection'] = intersection_list
central_realintersections_df.head()









    Out[31]:






  
    
      
      intersec5
      elev_id
      classifi6
      classifi7
      num_elev
      elevatio9
      elevatio10
      elev_level
      elevation
      elevatio13
      height_r14
      height_r15
      x_coord
      y_coord
      longitude
      latitude
      objectid
      intersection
    
    
      int_id
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      20041183
      Allingham Gdns / Ln N Wilson W Collinson
      29125
      LSRML
      Lesser-Multi Level
      2
      501700
      Laneway
      0
      0
      NaN
      0
      NaN
      3.094e+05
      4.8438e+06
      -79.443
      43.736
      32028
      Allingham Gdns & King High Ave, Toronto, Ontar...
    
    
      14045679
      Ln N Annette E Beresford / Ln S St Johns W Run...
      17094
      LSRML
      Lesser-Multi Level
      2
      501700
      Laneway
      0
      0
      NaN
      0
      NaN
      3.0633e+05
      4.8354e+06
      -79.481
      43.66
      44986
      Annette St & Runnymede Rd, Toronto, Ontario, M6S
    
    
      14665974
      Byng Ave / Ln 2 N Newport W Byng
      22281
      LSRML
      Lesser-Multi Level
      1
      501700
      Laneway
      0
      0
      NaN
      0
      NaN
      3.2236e+05
      4.8395e+06
      -79.282
      43.697
      51026
      Byng Ave & Prairie Dr, Toronto, Ontario, M1L
    
    
      13465351
      Durie St / Ln E Durie S St Johns / Ln E Winder...
      7655
      LSRML
      Lesser-Multi Level
      2
      501700
      Laneway
      0
      0
      NaN
      0
      NaN
      3.0618e+05
      4.8353e+06
      -79.483
      43.66
      32868
      Annette St & Durie St, Toronto, Ontario, M6S
    
    
      13459936
      Luttrell Ave / Ln 2 S Danforth E Luttrell / Ln...
      25170
      LSRML
      Lesser-Multi Level
      2
      501700
      Laneway
      0
      0
      NaN
      0
      NaN
      3.2145e+05
      4.8385e+06
      -79.293
      43.689
      85370
      Drummondville Ln & Flagman St, Toronto, Ontari...



In [32]:

    
ped_counts_df = pd.DataFrame.from_csv('Vehicle and Pedestrian Counts/TrafficPedestrianVolumes_2011.csv')

#Using the Power fit for the Bike/Pedestrian ratio, we get a function that predicts the bike numbers at any one intersection.

ped_counts_df['bike_prediction'] = (500.2146799711*ped_counts_df['8HrPedVol']**(-0.8950759596))*ped_counts_df['8HrPedVol']
ped_coords = ped_counts_df[['Latitude','Longitude']]
ped_coords = ped_coords.replace(np.nan,0)
ped_coordinate_list = ped_coords.values.tolist()
ped_counts_df['coordinates'] = ped_counts_df[['Latitude','Longitude']].apply(tuple, axis=1)
ped_counts_df.head()









    Out[32]:






  
    
      
      Main
      Midblock Route
      Side 1 Route
      Side 2 Route
      Activation Date
      Latitude
      Longitude
      Count Date
      8HrPedVol
      8HrVehVol
      24HrPedVol
      24HrVehVol
      bike_prediction
      coordinates
    
    
      PX2
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      2
      JARVIS ST
      NaN
      FRONT ST E
      NaN
      1948/11/15
      43.649
      -79.371
      2011/09/08
      17008
      19335
      34016
      38670
      1390.1
      (43.64945, -79.37141)
    
    
      3
      KING ST E
      NaN
      JARVIS ST
      NaN
      1950/08/23
      43.65
      -79.372
      2011/09/07
      37719
      17665
      75438
      35330
      1511.3
      (43.6504606, -79.3719239)
    
    
      4
      JARVIS ST
      NaN
      ADELAIDE ST E
      NaN
      1958/09/12
      43.652
      -79.372
      2008/06/16
      1991
      19726
      3982
      39452
      1110
      (43.6515337, -79.37236)
    
    
      5
      JARVIS ST
      NaN
      RICHMOND ST E
      NaN
      1962/04/21
      43.653
      -79.373
      2009/07/30
      2696
      24842
      5392
      49684
      1145.8
      (43.6527176, -79.372824)
    
    
      6
      JARVIS ST
      NaN
      QUEEN ST E
      NaN
      1928/08/24
      43.654
      -79.373
      2011/05/18
      3622
      19772
      7244
      39544
      1181.9
      (43.653704, -79.373238)



In [33]:

    
bike_dict = ped_counts_df.set_index('coordinates').to_dict()['bike_prediction']
ped_dict = ped_counts_df.set_index('coordinates').to_dict()['8HrPedVol']
veh_dict = ped_counts_df.set_index('coordinates').to_dict()['8HrVehVol']
col_df['coordinates'] = col_df[['LAT','LONG']].apply(tuple, axis=1)
col_df.head()
all_realintersections_df['coordinates'] = all_realintersections_df[['latitude','longitude']].apply(tuple, axis=1)
equiv = {'Laneway':3,'Minor':2,'Major':1}
all_realintersections_df['num_class'] = all_realintersections_df['elevatio10'].map(equiv)
all_realintersections_df= all_realintersections_df.sort_values(by=['intersection','num_class'])
#take the lowest numerical class, drop the other duplicates. 
all_realintersections_df = all_realintersections_df.drop_duplicates(subset='intersection', keep='first')

central_col_df['coordinates'] = central_col_df[['lat','long']].apply(tuple, axis=1)
central_col_df.head()
central_realintersections_df['coordinates'] = central_realintersections_df[['latitude','longitude']].apply(tuple, axis=1)
equiv = {'Laneway':3,'Minor':2,'Major':1}
central_realintersections_df['num_class'] = central_realintersections_df['elevatio10'].map(equiv)
central_realintersections_df= central_realintersections_df.sort_values(by=['intersection','num_class'])
#take the lowest numerical class, drop the other duplicates. 
central_realintersections_df = central_realintersections_df.drop_duplicates(subset='intersection', keep='first')



In [34]:

    
import csv
closest_traffic_point = []
bike_traffic = []
veh_traffic = []
ped_traffic = []
i = 0
for i in range(0,len(col_df)):
    point = col_df['coordinates'].iloc[i]
    if waterfront_Path.contains_point(point):
        closest = waterfront[distance.cdist([point], waterfront).argmin()]
        closest_traffic_point.append(tuple(closest))
        bike_traffic.append(waterfront_dict[tuple(closest)])
    elif campus_Path.contains_point(point):
        closest = campus[distance.cdist([point], campus).argmin()]
        closest_traffic_point.append(tuple(closest))
        bike_traffic.append(campus_dict[tuple(closest)])
    elif castleFrank_Path.contains_point(point):
        closest = castleFrank[distance.cdist([point], castleFrank).argmin()]
        closest_traffic_point.append(tuple(closest))
        bike_traffic.append(castleFrank_dict[tuple(closest)])
    else:
        closest = ped_coordinate_list[distance.cdist([point], ped_coordinate_list).argmin()]
        closest_traffic_point.append(tuple(closest))
        bike_traffic.append(bike_dict[tuple(closest)])
    closest = ped_coordinate_list[distance.cdist([point], ped_coordinate_list).argmin()]
    veh_traffic.append(veh_dict[tuple(closest)])
    ped_traffic.append(ped_dict[tuple(closest)])
    
myfile3 = open('closest_intersection.csv', 'w')
wr = csv.writer(myfile3)
wr.writerow(closest_traffic_point)
myfile3.close()

myfile4 = open('closest_int_bike_predictions.csv', 'w')
wr = csv.writer(myfile4)
wr.writerow(bike_traffic)
myfile4.close()

col_df['closest_traffic'] = tuple(closest_traffic_point)
col_df['traffic_count'] = bike_traffic
col_df['vehicle_traffic'] = veh_traffic
col_df['pedestrian_traffic'] = ped_traffic


col_df= col_df.rename(columns={'closest_traffic': 'closest_ped_count', 'traffic_count': 'predicted_bike_count'})



In [35]:

    
import csv
closest_traffic_point = []
bike_traffic = []
veh_traffic = []
ped_traffic = []
i = 0
for i in range(0,len(central_col_df)):
    point = central_col_df['coordinates'].iloc[i]
    if waterfront_Path.contains_point(point):
        closest = waterfront[distance.cdist([point], waterfront).argmin()]
        closest_traffic_point.append(tuple(closest))
        bike_traffic.append(waterfront_dict[tuple(closest)])
    elif campus_Path.contains_point(point):
        closest = campus[distance.cdist([point], campus).argmin()]
        closest_traffic_point.append(tuple(closest))
        bike_traffic.append(campus_dict[tuple(closest)])
    elif castleFrank_Path.contains_point(point):
        closest = castleFrank[distance.cdist([point], castleFrank).argmin()]
        closest_traffic_point.append(tuple(closest))
        bike_traffic.append(castleFrank_dict[tuple(closest)])
    else:
        closest = ped_coordinate_list[distance.cdist([point], ped_coordinate_list).argmin()]
        closest_traffic_point.append(tuple(closest))
        bike_traffic.append(bike_dict[tuple(closest)])
    closest = ped_coordinate_list[distance.cdist([point], ped_coordinate_list).argmin()]
    veh_traffic.append(veh_dict[tuple(closest)])
    ped_traffic.append(ped_dict[tuple(closest)])
myfile3 = open('central_closest_intersection.csv', 'w')
wr = csv.writer(myfile3)
wr.writerow(closest_traffic_point)
myfile3.close()

myfile4 = open('central_closest_int_bike_predictions.csv', 'w')
wr = csv.writer(myfile4)
wr.writerow(bike_traffic)
myfile4.close()

central_col_df['closest_traffic'] = tuple(closest_traffic_point)
central_col_df['traffic_count'] = bike_traffic
central_col_df['vehicle_traffic'] = veh_traffic
central_col_df['pedestrian_traffic'] = ped_traffic


central_col_df= central_col_df.rename(columns={'closest_traffic': 'closest_ped_count', 'traffic_count':'predicted_bike_count'})



In [36]:

    
central_col_df.columns.values









    Out[36]:





array(['injury', 'safety_equip', 'road_class', 'cyclist_crash_type',
       'age_of_cyclist', 'stname1', 'stet_1_type', 'stname_2',
       'stet_2_type', 'long', 'lat', 'acc_time', 'acc_time2',
       'traffic_control', 'road_surface2', 'date', 'date2',
       'driver_action', 'driver_condition', 'year', 'intersection',
       'coords', 'coordinates', 'closest_ped_count',
       'predicted_bike_count', 'vehicle_traffic', 'pedestrian_traffic'], dtype=object)



In [37]:

    
import csv
closest_traffic_point = []
bike_traffic = []
veh_traffic = []
ped_traffic = []
i = 0
for i in range(0,len(all_realintersections_df)):
    point = all_realintersections_df['coordinates'].iloc[i]
    if waterfront_Path.contains_point(point):
        closest = waterfront[distance.cdist([point], waterfront).argmin()]
        closest_traffic_point.append(tuple(closest))
        bike_traffic.append(waterfront_dict[tuple(closest)])
    elif campus_Path.contains_point(point):
        closest = campus[distance.cdist([point], campus).argmin()]
        closest_traffic_point.append(tuple(closest))
        bike_traffic.append(campus_dict[tuple(closest)])
    elif castleFrank_Path.contains_point(point):
        closest = castleFrank[distance.cdist([point], castleFrank).argmin()]
        closest_traffic_point.append(tuple(closest))
        bike_traffic.append(castleFrank_dict[tuple(closest)])
    else:
        closest = ped_coordinate_list[distance.cdist([point], ped_coordinate_list).argmin()]
        closest_traffic_point.append(tuple(closest))
        bike_traffic.append(bike_dict[tuple(closest)])
    closest = ped_coordinate_list[distance.cdist([point], ped_coordinate_list).argmin()]
    veh_traffic.append(veh_dict[tuple(closest)])
    ped_traffic.append(ped_dict[tuple(closest)])   
myfile5 = open('closest_intersection_centreline.csv', 'w')
wr = csv.writer(myfile5)
wr.writerow(closest_traffic_point)
myfile5.close()

myfile6 = open('closest_int_bike_predictions_centreline.csv', 'w')
wr = csv.writer(myfile6)
wr.writerow(bike_traffic)
myfile6.close()

all_realintersections_df['closest_traffic'] = tuple(closest_traffic_point)
all_realintersections_df['traffic_count'] = bike_traffic
all_realintersections_df['vehicle_traffic'] = veh_traffic
all_realintersections_df['pedestrian_traffic'] = ped_traffic

all_realintersections_df = all_realintersections_df.rename(columns={'closest_traffic': 'closest_ped_count', 'traffic_count':'predicted_bike_count'})

Adding traffic values for our list of intersections above. Cell below does the same for only those which have "central" Postal Code values.



In [38]:

    
import csv
closest_traffic_point = []
bike_traffic = []
veh_traffic = []
ped_traffic = []
i = 0
for i in range(0,len(central_realintersections_df)):
    point = central_realintersections_df['coordinates'].iloc[i]
    if waterfront_Path.contains_point(point):
        closest = waterfront[distance.cdist([point], waterfront).argmin()]
        closest_traffic_point.append(tuple(closest))
        bike_traffic.append(waterfront_dict[tuple(closest)])
    elif campus_Path.contains_point(point):
        closest = campus[distance.cdist([point], campus).argmin()]
        closest_traffic_point.append(tuple(closest))
        bike_traffic.append(campus_dict[tuple(closest)])
    elif castleFrank_Path.contains_point(point):
        closest = castleFrank[distance.cdist([point], castleFrank).argmin()]
        closest_traffic_point.append(tuple(closest))
        bike_traffic.append(castleFrank_dict[tuple(closest)])
    else:
        closest = ped_coordinate_list[distance.cdist([point], ped_coordinate_list).argmin()]
        closest_traffic_point.append(tuple(closest))
        bike_traffic.append(bike_dict[tuple(closest)])
    closest = ped_coordinate_list[distance.cdist([point], ped_coordinate_list).argmin()]
    veh_traffic.append(veh_dict[tuple(closest)])
    ped_traffic.append(ped_dict[tuple(closest)])   
myfile5 = open('central_closest_intersection_centreline.csv', 'w')
wr = csv.writer(myfile5)
wr.writerow(closest_traffic_point)
myfile5.close()

myfile6 = open('central_closest_int_bike_predictions_centreline.csv', 'w')
wr = csv.writer(myfile6)
wr.writerow(bike_traffic)
myfile6.close()

central_realintersections_df['closest_traffic'] = tuple(closest_traffic_point)
central_realintersections_df['traffic_count'] = bike_traffic
central_realintersections_df['vehicle_traffic'] = veh_traffic
central_realintersections_df['pedestrian_traffic'] = ped_traffic

central_realintersections_df = central_realintersections_df.rename(columns={'closest_traffic': 'closest_ped_count', 'traffic_count':'predicted_bike_count'})



In [39]:

    
all_realint_df = all_realintersections_df[['intersection','coordinates','closest_ped_count','predicted_bike_count','vehicle_traffic','pedestrian_traffic','elevatio10']]
#all_realint_df.sort_values(by='predicted_bike_count')

result = pd.merge(col_df,all_realint_df, how ='outer',on='intersection')
result = result.sort_values(by='intersection')
traff2 =result['predicted_bike_count_y']
traff2.fillna(1,inplace=True)
result['predicted_bike_count_x'] =result['predicted_bike_count_x'].fillna(value = traff2 )



In [40]:

    
central_realint_df = central_realintersections_df[['intersection','coordinates','closest_ped_count','predicted_bike_count','vehicle_traffic','pedestrian_traffic','elevatio10']]
#all_realint_df.sort_values(by='predicted_bike_count')

result_central = pd.merge(central_col_df,central_realint_df, how ='outer',on='intersection')
result_central = result_central.sort_values(by='intersection')
traff_c =result_central['predicted_bike_count_y']
result_central['predicted_bike_count_x'] =result_central['predicted_bike_count_x'].fillna(value = traff_c )

Now let's take a look at the total types of injuries after 1997



In [47]:

    
injury_df = full_df[full_df['YEAR'] >1997].groupby('INJURY').size()
total = injury_df.sum()
d = {'injury': pd.Series([total/31,total/543,total/5575,total/(7171+2039)], index=['Fatal','Major','Minor','Minimal'])}
inverse_df = pd.DataFrame(d)
inverse_df = inverse_df/inverse_df.loc['Minimal']
inverse_df



In [42]:

    
injury_df.plot(kind='bar')









    Out[42]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fac25b25588>



In [43]:

    
inverse_df.plot(kind='bar')









    Out[43]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fac22b0aa58>

Filtering past 1997 got ridd of the "unrecorded" entries, so that's a helpful. Now combined 'minimal' and 'none' into one category. This is because we are going to estimate a "risk factor" for each intersection, where major accidents are worth more than a minor accident, which are worth more than a minor etc. In treating a 'none' it seems simplest to compine this with 'minimal' accidents, so we can merely just invert our data to get our weighting factors.

It can be difficult and subjective to make this sort of call. How is the loss of a human life (for fatal accidents), numerically compared to that in which a minor accident occurs? Obviously more serious accidents should be percieved as more dangerous. We are letting the data provide this estimate for us. We will merely compare the relative sizes of each.

This may not seem neccessary, as working off the raw collision numbers and normalizing those may feel more intuitive to some. However, I believe that when we run our algorithm against new data (Nov 2015 to Nov 2016), an algorithm which weighs these different accident types will make better predictions. Unfortunately it will not be an apples to apples comparison. To do this, one would like to know the severity of accidents in our test data as well. The twitter bot does not make this distinction clear. But it is merely retweeting collisions reported by the Toronto police twitter app, which is not the same as having access to the filed police reports of the Toronto Police database. One would then believe that our new data is made up of more serious collisions, as the police were immediately informed.



In [44]:

    
injury_dict = inverse_df.to_dict()



In [45]:

    
injury_dict['None']=1.0
injury_dict = injury_dict
col_df['injury_num'] = col_df['INJURY'].map(injury_dict) 
col_df['injury_num'] = col_df['injury_num'].fillna(value=0)
central_col_df['injury_num'] = central_col_df['injury'].map(injury_dict) 
central_col_df['injury_num'] = central_col_df['injury_num'].fillna(value=0)

Here we produce two resulting dataframes of intersections, featuring the total numbers of collisions, along with the features of that intersection.



In [46]:

    
col_df.columns.values









    Out[46]:





array(['INJURY', 'SAFETY EQUIP.', 'ROAD CLASS', 'CYCLIST CRASH TYPE',
       'AGE OF CYCLIST', 'STNAME1', 'STET 1 TYPE', 'STNAME 2',
       'STET 2 TYPE', 'LONG', 'LAT', 'ACC TIME', 'ACC TIME2',
       'TRAFFIC CONTROL', 'ROAD SURFACE2', 'DATE', 'DATE2',
       'DRIVER ACTION', 'DRIVER CONDITION', 'YEAR', 'intersection',
       'coords', 'coordinates', 'closest_ped_count',
       'predicted_bike_count', 'vehicle_traffic', 'pedestrian_traffic',
       'injury_num'], dtype=object)



In [47]:

    
#result.to_csv('merged_data.csv')
new_col = col_df[col_df['YEAR']>1997]

result = pd.merge(new_col,all_realint_df, how ='outer',on='intersection')
result = result.sort_values(by='intersection')

severity_df = pd.get_dummies(new_col.INJURY, dummy_na=False)
new_col = pd.concat([new_col,severity_df],axis=1)
new_col['Minimal/None'] = new_col['Minimal'] + new_col['None']
intersection_df = new_col.groupby('intersection').agg({'predicted_bike_count':[np.size,np.mean],'vehicle_traffic':[np.mean],'pedestrian_traffic':[np.mean],'injury_num':[np.sum],'Fatal':[np.sum],'Major':[np.sum],'Minor':[np.sum],'Minimal/None':[np.sum]}).sort_values(by=[('predicted_bike_count','size')],ascending=False)



In [48]:

    
#result.to_csv('merged_data.csv')
new_col = col_df[col_df['YEAR']>1997]

result = pd.merge(new_col,all_realint_df, how ='outer',on='intersection')
result = result.sort_values(by='intersection')

severity_df = pd.get_dummies(new_col.INJURY, dummy_na=False)
new_col = pd.concat([new_col,severity_df],axis=1)
new_col['Minimal/None'] = new_col['Minimal'] + new_col['None']
intersection_df = new_col.groupby('intersection').agg({'predicted_bike_count':[np.size,np.mean],'vehicle_traffic':[np.mean],'pedestrian_traffic':[np.mean],'injury_num':[np.sum],'Fatal':[np.sum],'Major':[np.sum],'Minor':[np.sum],'Minimal/None':[np.sum]}).sort_values(by=[('predicted_bike_count','size')],ascending=False)
#intersection_df.columns = intersection_df.columns.set_levels(['minor','predicted_bike_count','injury_num','pedestrian_traffic','fatal','minimal/none','major','vehicle_traffic'], level=0)
zeroes_df = all_realint_df.groupby('intersection').agg({'predicted_bike_count':[np.size,np.mean],'vehicle_traffic':[np.mean],'pedestrian_traffic':[np.mean]}).sort_values(by=[('predicted_bike_count','size')],ascending=False)
zeroes_df['predicted_bike_count','size'] = 0
zeroes_df['injury_num','sum'] = 0
#Let's use 360 as the days in the year. It's a nice round number. And since weekends are less busy it's probably still an overestimate.
zeroes_df['Rates','normalized_accident_rate'] = zeroes_df['predicted_bike_count','size']/(13*360*zeroes_df['predicted_bike_count','mean'])
zeroes_df = zeroes_df.sort_values(by=[('Rates','normalized_accident_rate')],ascending=False)
intersection_df['Rates','normalized_accident_rate'] = intersection_df['predicted_bike_count','size']/(13*360*intersection_df['predicted_bike_count','mean'])
intersection_df = intersection_df.sort_values(by=[('Rates','normalized_accident_rate')],ascending=False)

zeroes_df['Rates','normalized_risk_factor'] = zeroes_df['injury_num','sum']/(13*360*zeroes_df['predicted_bike_count','mean'])
zeroes_df = zeroes_df.sort_values(by=[('Rates','normalized_risk_factor')],ascending=False)
intersection_df['Rates','normalized_risk_factor'] = intersection_df['injury_num','sum']/(13*360*intersection_df['predicted_bike_count','mean'])
intersection_df = intersection_df.sort_values(by=[('Rates','normalized_risk_factor')],ascending=False)

totals_df = pd.concat([intersection_df,zeroes_df])
df_test = totals_df['Rates','normalized_accident_rate']
df_data2 = df_test.to_frame(name='normalized yearly accident rate')
df_data2['total collisions'] = totals_df['predicted_bike_count','size']
df_data2['bike traffic estimate'] = totals_df['predicted_bike_count','mean']
df_data2['scaled collisions'] = totals_df['injury_num','sum']
df_data2['normalized_risk_factor'] = totals_df['Rates','normalized_risk_factor']
df_data2['vehicle traffic'] = totals_df['vehicle_traffic']
df_data2['pedestrian traffic'] = totals_df['pedestrian_traffic']
df_data2['fatal'] = totals_df['Fatal']
df_data2['major'] = totals_df.Major
df_data2['minor'] = totals_df.Minor
df_data2['minimal/none'] = totals_df['Minimal/None']
df_data2['normalized yearly accident rate'].fillna(0,inplace=True)
df_data2['normalized_risk_factor'].fillna(0,inplace=True)
df_data2.fillna(0,inplace=True)
df_data2.dropna(inplace=True)
#Here we make the index a column, drop the duplicates from this intersection column, then make it the index again. 
df_data2 = df_data2.reset_index().drop_duplicates(subset='intersection', keep='first').set_index('intersection')
df_data2 = df_data2.sort_values(by='total collisions',ascending=False)
len(df_data2)









    Out[48]:





20371



In [49]:

    
df_data2.head()









    Out[49]:






  
    
      
      normalized yearly accident rate
      total collisions
      bike traffic estimate
      scaled collisions
      normalized_risk_factor
      vehicle traffic
      pedestrian traffic
      fatal
      major
      minor
      minimal/none
    
    
      intersection
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Queen St W & Spadina Ave, Toronto, Ontario, M5V
      7.0407e-06
      43
      1305
      4
      6.5495e-07
      22206
      9791.5
      0
      2
      15
      26
    
    
      Bathurst St & Bloor St W, Toronto, Ontario, M5R
      6.5778e-06
      41
      1331.9
      7
      1.123e-06
      22976
      11309
      0
      1
      20
      20
    
    
      Bathurst St & Queen St W, Toronto, Ontario, M5T
      6.7557e-06
      40
      1265.2
      6
      1.0134e-06
      17330
      7338.1
      0
      0
      18
      22
    
    
      Bay St & Dundas St W, Toronto, Ontario, M5G
      5.264e-06
      37
      1501.9
      4
      5.6909e-07
      20558
      37431
      0
      0
      13
      24
    
    
      Queen St W & University Ave, Toronto, Ontario, M5H
      5.8533e-06
      37
      1350.7
      6
      9.4918e-07
      29023
      14508
      0
      1
      7
      29



In [50]:

    
#result.to_csv('merged_data.csv')
new_col_central = central_col_df[central_col_df['year']>1997]

result_central = pd.merge(new_col_central,central_realint_df, how ='outer',on='intersection')
result_central = result_central.sort_values(by='intersection')

severity_df = pd.get_dummies(new_col_central.injury, dummy_na=False)
new_col_central = pd.concat([new_col_central,severity_df],axis=1)
new_col_central['Minimal/None'] = new_col_central['Minimal'] + new_col_central['None']
intersection_df = new_col_central.groupby('intersection').agg({'predicted_bike_count':[np.size,np.mean],'vehicle_traffic':[np.mean],'pedestrian_traffic':[np.mean],'injury_num':[np.sum],'Fatal':[np.sum],'Major':[np.sum],'Minor':[np.sum],'Minimal/None':[np.sum]}).sort_values(by=[('predicted_bike_count','size')],ascending=False)
#intersection_df.columns = intersection_df.columns.set_levels(['minor','predicted_bike_count','injury_num','pedestrian_traffic','fatal','minimal/none','major','vehicle_traffic'], level=0)
zeroes_df = central_realint_df.groupby('intersection').agg({'predicted_bike_count':[np.size,np.mean],'vehicle_traffic':[np.mean],'pedestrian_traffic':[np.mean]}).sort_values(by=[('predicted_bike_count','size')],ascending=False)
zeroes_df['predicted_bike_count','size'] = 0
zeroes_df['injury_num','sum'] = 0
#Let's use 360 as the days in the year. It's a nice round number. And since weekends are less busy it's probably still an overestimate.
zeroes_df['Rates','normalized_accident_rate'] = zeroes_df['predicted_bike_count','size']/(13*360*zeroes_df['predicted_bike_count','mean'])
zeroes_df = zeroes_df.sort_values(by=[('Rates','normalized_accident_rate')],ascending=False)
intersection_df['Rates','normalized_accident_rate'] = intersection_df['predicted_bike_count','size']/(13*360*intersection_df['predicted_bike_count','mean'])
intersection_df = intersection_df.sort_values(by=[('Rates','normalized_accident_rate')],ascending=False)

zeroes_df['Rates','normalized_risk_factor'] = zeroes_df['injury_num','sum']/(13*360*zeroes_df['predicted_bike_count','mean'])
zeroes_df = zeroes_df.sort_values(by=[('Rates','normalized_risk_factor')],ascending=False)
intersection_df['Rates','normalized_risk_factor'] = intersection_df['injury_num','sum']/(13*360*intersection_df['predicted_bike_count','mean'])
intersection_df = intersection_df.sort_values(by=[('Rates','normalized_risk_factor')],ascending=False)

totals_df = pd.concat([intersection_df,zeroes_df])
df_test = totals_df['Rates','normalized_accident_rate']
df_data_central = df_test.to_frame(name='normalized yearly accident rate')
df_data_central['total collisions'] = totals_df['predicted_bike_count','size']
df_data_central['bike traffic estimate'] = totals_df['predicted_bike_count','mean']
df_data_central['scaled collisions'] = totals_df['injury_num','sum']
df_data_central['normalized_risk_factor'] = totals_df['Rates','normalized_risk_factor']
df_data_central['vehicle traffic'] = totals_df['vehicle_traffic']
df_data_central['pedestrian traffic'] = totals_df['pedestrian_traffic']
df_data_central['fatal'] = totals_df['Fatal']
df_data_central['major'] = totals_df.Major
df_data_central['minor'] = totals_df.Minor
df_data_central['minimal/none'] = totals_df['Minimal/None']
df_data_central['normalized yearly accident rate'].fillna(0,inplace=True)
df_data_central['normalized_risk_factor'].fillna(0,inplace=True)
df_data_central.fillna(0,inplace=True)
df_data_central.dropna(inplace=True)
#Here we make the index a column, drop the duplicates from this intersection column, then make it the index again. 
df_data_central = df_data_central.reset_index().drop_duplicates(subset='intersection', keep='first').set_index('intersection')
df_data_central = df_data_central.sort_values(by='total collisions',ascending=False)
len(df_data_central)









    Out[50]:





18693



In [51]:

    
df_data2.to_csv('all_totals.csv')
df_data_central.to_csv('central_totals.csv')

Now we have our two main datasets. A collection of intersections in the central areas of the city with their associated accident numbers for cycling collisions from 1998 onwards, and one for the entire city of Toronto. To gain any predicted power however, we'll need to throw in some features of the intersections. We'll use the road class types from the collisions data set, using the "busiest" class of the two streets to define the intersection. We'll also use the intersection control type as well. The cells merely add these columns to our two dataframes.



In [52]:

    
temp = result
temp = temp.replace(np.nan,-1)
roadclass_df =temp.groupby(['intersection']+['ROAD CLASS']) \
            .size().to_frame('count') \
            .reset_index().sort_values('intersection',ascending=False)
                
#roadclass_df['road_class'] =roadclass_df['road_class'].fillna(value =elevatio_df)
roadclass_df = roadclass_df.sort_values('count', ascending=False)
roadclass_df = roadclass_df.replace(np.nan,-1)     
roadclass_dict = {'Collector':1, 'Expressway':1,'Expressway Ramp':1,'Major Arterial':2,'Minor Arterial':3,'Local':4, 'Laneway':4, 'Private Property':4}
roadclass_df['roadclass_num'] = roadclass_df['ROAD CLASS'].map(roadclass_dict)
roadclass_df = roadclass_df.sort_values(by = ['intersection','roadclass_num'], ascending = True)
roadclass_df = roadclass_df.drop_duplicates(subset='intersection', keep='first')
roadclass_df = roadclass_df.sort_values(by='intersection', ascending= False)

temp = result
temp['elevatio10'].replace(np.nan,'unrecorded', inplace =True)
elevatio_df =temp.groupby(['intersection']+['elevatio10']) \
            .size().to_frame('count') \
            .reset_index().sort_values('intersection',ascending=False)
        
elevatio10_dict = {'Major':'Major Arterial', 'Minor':'Minor Arterial','Laneway':'Local'}
elevatio_df['intersection_type']= elevatio_df['elevatio10'].map(elevatio10_dict)

elevatio_df = elevatio_df.sort_values(by = ['intersection','intersection_type'], ascending = True)
elevatio_df = elevatio_df.drop_duplicates(subset='intersection', keep='first')
elevatio_df = elevatio_df.sort_values(by='intersection', ascending= False)
elevatio_df.reset_index(inplace = True)
elevatio10 = elevatio_df['intersection_type']

roadclass_df['ROAD CLASS'] =roadclass_df['ROAD CLASS'].replace(to_replace = -1,value = np.nan)
roadclass_df.reset_index(inplace=True)
roadclass_df['ROAD CLASS'] =roadclass_df['ROAD CLASS'].fillna(value = elevatio10)

roadclass_df['roadclass_num'] = roadclass_df['ROAD CLASS'].map(roadclass_dict)
roadclass_df = roadclass_df.sort_values(by = ['intersection','roadclass_num'], ascending = True)
roadclass_df = roadclass_df.drop_duplicates(subset='intersection', keep='first')
roadclass_df = roadclass_df.sort_values(by='intersection', ascending= False)



In [53]:

    
roadclass_df.groupby(['ROAD CLASS']).size()









    Out[53]:





ROAD CLASS
Collector                791
Expressway                23
Expressway Ramp           49
Hydro Line                 1
Laneway                   10
Local                   1104
Major Arterial          2818
Major Arterial Ramp        3
Minor Arterial         15408
Minor Arterial Ramp        1
Other                      4
Private Property         158
River                      1
dtype: int64



In [54]:

    
result['TRAFFIC CONTROL'].replace(np.nan,'unrecorded', inplace=True)
result['ROAD CLASS'].replace(['Expressway Ramp', 'Hydro Line','Laneway','Major Arterial Ramp','Minor Arterial Ramp','Other','Private Property', 'River'],['Expressway','Local','Local','Major Arterial','Minor Arterial','Local','Local','Local'],inplace=True)
temp = result
trafficcontrol_df =temp.groupby(['intersection']+['TRAFFIC CONTROL']) \
            .size().to_frame('count') \
            .reset_index().sort_values('count',ascending=False)
        
trafficcontrol_dict = {'traffic signal':1, 'traffic gate':1,'traffic controller':1,'stop sign':2, 'yield sign':2, 'pedestrian crossover':3, 'no control':5, 'other':6, 'unrecorded':6,'police control':6,'school bus':6,'school guard':3}
trafficcontrol_df['trafficcontrol_num'] = trafficcontrol_df['TRAFFIC CONTROL'].map(trafficcontrol_dict)
trafficcontrol_df = trafficcontrol_df.sort_values(by = ['intersection','trafficcontrol_num'], ascending = True)
trafficcontrol_df = trafficcontrol_df.drop_duplicates(subset='intersection', keep='first')
trafficcontrol_df = trafficcontrol_df.sort_values(by='intersection', ascending= False)



In [55]:

    
trafficcontrol_df.to_csv('trafficcontrol_df.csv')



In [56]:

    
reverse_dict = {1:'traffic signal/gate',2:'stop/yield sign', 3:'pedestrian crossover',5:'no control', 6:'other/unrecorded'}
trafficcontrol_df['control category'] = trafficcontrol_df['trafficcontrol_num'].map(reverse_dict)

trafficcontrol_df.head()









    Out[56]:






  
    
      
      intersection
      TRAFFIC CONTROL
      count
      trafficcontrol_num
      control category
    
  
  
    
      22146
      bpNichol Ln & bpNichol Ln, Toronto, Ontario, M5S
      no control
      1
      5
      no control
    
    
      22145
      Zachary Ct & Zachary Ct, Toronto, Ontario, M6A
      unrecorded
      1
      6
      other/unrecorded
    
    
      22144
      York Ridge Rd & York Ridge Rd, Toronto, Ontari...
      unrecorded
      1
      6
      other/unrecorded
    
    
      22143
      York Mills Rd & York Ridge Rd, Toronto, Ontari...
      unrecorded
      1
      6
      other/unrecorded
    
    
      22142
      York Mills Rd & York Mills Rd, Toronto, Ontari...
      unrecorded
      1
      6
      other/unrecorded



In [57]:

    
trafficcontrol_df.groupby(['control category']).size()









    Out[57]:





control category
no control               2172
other/unrecorded        14693
pedestrian crossover       93
stop/yield sign          1702
traffic signal/gate      1711
dtype: int64



In [58]:

    
reverseROAD_dict = {1:'Collector/Expressway',2:'Major Arterial', 3:'Minor Arterial',4:'Local/Laneway'}
roadclass_df['road_class_cat'] = roadclass_df['roadclass_num'].map(reverseROAD_dict)

roadclass_df.head()









    Out[58]:






  
    
      
      index
      intersection
      ROAD CLASS
      count
      roadclass_num
      road_class_cat
    
  
  
    
      0
      21209
      bpNichol Ln & bpNichol Ln, Toronto, Ontario, M5S
      Major Arterial
      1
      2
      Major Arterial
    
    
      1
      21208
      Zachary Ct & Zachary Ct, Toronto, Ontario, M6A
      Local
      1
      4
      Local/Laneway
    
    
      2
      21207
      York Ridge Rd & York Ridge Rd, Toronto, Ontari...
      Minor Arterial
      1
      3
      Minor Arterial
    
    
      3
      21206
      York Mills Rd & York Ridge Rd, Toronto, Ontari...
      Minor Arterial
      1
      3
      Minor Arterial
    
    
      4
      21205
      York Mills Rd & York Mills Rd, Toronto, Ontari...
      Minor Arterial
      1
      3
      Minor Arterial



In [59]:

    
roadclass_df.groupby(['road_class_cat']).size()









    Out[59]:





road_class_cat
Collector/Expressway      863
Local/Laneway            1272
Major Arterial           2818
Minor Arterial          15408
dtype: int64



In [60]:

    
df_data2 = df_data2.sort_index(ascending = False)
df_data2['road_class'] = roadclass_df['road_class_cat'].values
df_data2['control type'] = trafficcontrol_df['control category'].values
df_data2[['fatal','major','minor','minimal/none']] = df_data2[['fatal','major','minor','minimal/none']].fillna(0)

traffdummy_df = pd.get_dummies(df_data2['control type'], dummy_na=False)
df_data2 = pd.concat([df_data2, traffdummy_df ],axis=1)

rdclass_dummy = pd.get_dummies(df_data2['road_class'])
df_data2 = pd.concat([df_data2, rdclass_dummy],axis=1)

We had to do a bit of work to map certain obvious groups together (such as minor arterial ramp grouping with arterial ramp). The same process was performed for traffic control types. Part of the thinking in the grouping was trying to use common sense, traffic gates would likely be the most restrictive intersections, but due to their very small number, they are combined with the next restrictive in a "traffic signal/gate" groups.

We note that the greatest number of of control categories are other or unrecorded. This is an interesting characterisitc of our data, as any bias in the REASON for an intersection's traffic control being identified as unrecorded can skew the predictive power of our algorithms.



In [61]:

    
df_data2['bike traffic estimate'].replace(0,1,inplace=True)
df_data2['pedestrian traffic'].replace(0,1,inplace=True)

In order to deal with duplicate entries, we'll assign a numerical value to the roat types, so as to order them by how large the expected traffic flow is. We can then keep the most repeated, highest traffic flow option when we ask Pandas to drop duplicate entries in the "intersection" index.

We're also mapping the intersection types found in the intersection list dataframe to correspond to those in the cycling collisions one.



In [62]:

    
temp = result_central
temp = temp.replace(np.nan,-1)
roadclass_df =temp.groupby(['intersection']+['road_class']) \
            .size().to_frame('count') \
            .reset_index().sort_values('intersection',ascending=False)
                
#roadclass_df['road_class'] =roadclass_df['road_class'].fillna(value =elevatio_df)
roadclass_df = roadclass_df.sort_values('count', ascending=False)
roadclass_df = roadclass_df.replace(np.nan,-1)     
roadclass_dict = {'Collector':1, 'Expressway':1,'Expressway Ramp':1,'Major Arterial':2,'Minor Arterial':3,'Local':4, 'Laneway':4, 'Private Property':4}
roadclass_df['roadclass_num'] = roadclass_df['road_class'].map(roadclass_dict)
roadclass_df = roadclass_df.sort_values(by = ['intersection','roadclass_num'], ascending = True)
roadclass_df = roadclass_df.drop_duplicates(subset='intersection', keep='first')
roadclass_df = roadclass_df.sort_values(by='intersection', ascending= False)

temp = result_central
temp['elevatio10'].replace(np.nan,'unrecorded', inplace =True)
elevatio_df =temp.groupby(['intersection']+['elevatio10']) \
            .size().to_frame('count') \
            .reset_index().sort_values('intersection',ascending=False)
        
elevatio10_dict = {'Major':'Major Arterial', 'Minor':'Minor Arterial','Laneway':'Local'}
elevatio_df['intersection_type']= elevatio_df['elevatio10'].map(elevatio10_dict)

elevatio_df = elevatio_df.sort_values(by = ['intersection','intersection_type'], ascending = True)
elevatio_df = elevatio_df.drop_duplicates(subset='intersection', keep='first')
elevatio_df = elevatio_df.sort_values(by='intersection', ascending= False)
elevatio_df.reset_index(inplace = True)
elevatio10 = elevatio_df['intersection_type']

roadclass_df['ROAD CLASS'] =roadclass_df['road_class'].replace(to_replace = -1,value = np.nan)
roadclass_df.reset_index(inplace=True)
roadclass_df['ROAD CLASS'] =roadclass_df['ROAD CLASS'].fillna(value = elevatio10)

roadclass_df['roadclass_num'] = roadclass_df['ROAD CLASS'].map(roadclass_dict)
roadclass_df = roadclass_df.sort_values(by = ['intersection','roadclass_num'], ascending = True)
roadclass_df = roadclass_df.drop_duplicates(subset='intersection', keep='first')
roadclass_df = roadclass_df.sort_values(by='intersection', ascending= False)

roadclass_df.groupby(['ROAD CLASS']).size()









    Out[62]:





ROAD CLASS
Collector             169
Expressway              6
Expressway Ramp         5
Laneway                 3
Local                 469
Major Arterial        761
Minor Arterial      17275
Private Property        5
dtype: int64



In [63]:

    
result_central['traffic_control'].replace(np.nan,'unrecorded', inplace=True)
result_central['road_class'].replace(['Expressway Ramp', 'Hydro Line','Laneway','Major Arterial Ramp','Minor Arterial Ramp','Other','Private Property', 'River'],['Expressway','Local','Local','Major Arterial','Minor Arterial','Local','Local','Local'],inplace=True)
temp = result_central
trafficcontrol_df =temp.groupby(['intersection']+['traffic_control']) \
            .size().to_frame('count') \
            .reset_index().sort_values('count',ascending=False)
        
trafficcontrol_dict = {'traffic signal':1, 'traffic gate':1,'traffic controller':1,'stop sign':2, 'yield sign':2, 'pedestrian crossover':3, 'no control':5, 'other':6, 'unrecorded':6,'police control':6,'school bus':6,'school guard':3}
trafficcontrol_df['trafficcontrol_num'] = trafficcontrol_df['traffic_control'].map(trafficcontrol_dict)
trafficcontrol_df = trafficcontrol_df.sort_values(by = ['intersection','trafficcontrol_num'], ascending = True)
trafficcontrol_df = trafficcontrol_df.drop_duplicates(subset='intersection', keep='first')
trafficcontrol_df = trafficcontrol_df.sort_values(by='intersection', ascending= False)



In [64]:

    
reverse_dict = {1:'traffic signal/gate',2:'stop/yield sign', 3:'pedestrian crossover',5:'no control', 6:'other/unrecorded'}
trafficcontrol_df['control category'] = trafficcontrol_df['trafficcontrol_num'].map(reverse_dict)

reverseROAD_dict = {1:'Collector/Expressway',2:'Major Arterial', 3:'Minor Arterial',4:'Local/Laneway'}
roadclass_df['road_class_cat'] = roadclass_df['roadclass_num'].map(reverseROAD_dict)

df_data_central = df_data_central.sort_index(ascending = False)
df_data_central['road_class'] = roadclass_df['road_class_cat'].values
df_data_central['control type'] = trafficcontrol_df['control category'].values
df_data_central[['fatal','major','minor','minimal/none']] = df_data_central[['fatal','major','minor','minimal/none']].fillna(0)

traffdummy_df = pd.get_dummies(df_data_central['control type'], dummy_na=False)
df_data_central = pd.concat([df_data_central, traffdummy_df ],axis=1)

rdclass_dummy = pd.get_dummies(df_data_central['road_class'])
df_data_central = pd.concat([df_data_central, rdclass_dummy],axis=1)

We've now added our features, as well as corresponding dummy variables for each class. These are neccessary to perform our linear regressions in a little bit.

We're going to fit posterior means via empirical bayesian estimation as before. These won't be used for prediction, but it will be interesting to see if any of our predictive approaches perform better than the estimates from the posterior rates when we take a look at data from this past year scrapped from twitter.



In [65]:

    
ag,bg,cg = stats.gamma.fit(df_data2['total collisions'])  
beta = 1/cg
df_data2['posterior mean rate'] = (ag + df_data2['total collisions'])/(beta+13*360*df_data2['bike traffic estimate']) 
df_data2['posterior STD'] = np.sqrt((ag + df_data2['total collisions'])/((beta+13*360*df_data2['bike traffic estimate'])**2)) 
#pd.set_option('display.float_format', '{:.5g}'.format)
df_data2.sort_values(by='posterior mean rate',ascending = False).head()









    Out[65]:






  
    
      
      normalized yearly accident rate
      total collisions
      bike traffic estimate
      scaled collisions
      normalized_risk_factor
      vehicle traffic
      pedestrian traffic
      fatal
      major
      minor
      ...
      other/unrecorded
      pedestrian crossover
      stop/yield sign
      traffic signal/gate
      Collector/Expressway
      Local/Laneway
      Major Arterial
      Minor Arterial
      posterior mean rate
      posterior STD
    
    
      intersection
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      MacDonald-Cartier Fwy & Victoria Park Ave, Toronto, Ontario, M1R
      0
      9
      1
      1
      0
      35267
      1
      0
      0
      4
      ...
      0
      0
      0
      1
      1
      0
      0
      0
      0.0019472
      0.00064499
    
    
      Reesor Rd & Steeles Ave E, Scarborough, Ontario, M1X, Scarborough, Ontario, M1X
      0
      3
      1
      0
      0
      1081
      1
      0
      0
      1
      ...
      0
      0
      1
      0
      0
      0
      0
      1
      0.00066532
      0.00037702
    
    
      HWY-401, Mississauga, Ontario, L4W & Renforth Dr, Mississauga, Ontario, L4W
      0
      2
      1
      0
      0
      24049
      1
      0
      2
      0
      ...
      0
      0
      0
      0
      0
      0
      1
      0
      0.00045168
      0.00031064
    
    
      Highway of Heroes & Victoria Park Ave, Toronto, Ontario, M1R
      0
      2
      1
      0
      0
      35267
      1
      0
      0
      2
      ...
      0
      0
      0
      0
      1
      0
      0
      0
      0.00045168
      0.00031064
    
    
      Blackhurst Ct & Morningview Trl, Scarborough, Ontario, M1B, Scarborough, Ontario, M1B
      0
      1
      1
      0
      0
      1081
      1
      0
      1
      0
      ...
      0
      0
      0
      0
      0
      1
      0
      0
      0.00023803
      0.00022551
    
  

5 rows × 24 columns



In [66]:

    
ag,bg,cg = stats.gamma.fit(df_data_central['total collisions'])  
beta = 1/cg
df_data_central['posterior mean rate'] = (ag + df_data_central['total collisions'])/(beta+13*360*df_data_central['bike traffic estimate']) 
df_data_central['posterior STD'] = np.sqrt((ag + df_data_central['total collisions'])/((beta+13*360*df_data_central['bike traffic estimate'])**2)) 
#pd.set_option('display.float_format', '{:.5g}'.format)
df_data_central.sort_values(by='posterior mean rate',ascending = False).head()









    Out[66]:






  
    
      
      normalized yearly accident rate
      total collisions
      bike traffic estimate
      scaled collisions
      normalized_risk_factor
      vehicle traffic
      pedestrian traffic
      fatal
      major
      minor
      ...
      other/unrecorded
      pedestrian crossover
      stop/yield sign
      traffic signal/gate
      Collector/Expressway
      Local/Laneway
      Major Arterial
      Minor Arterial
      posterior mean rate
      posterior STD
    
    
      intersection
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Farm Greenway & Victoria Park Ave, Toronto, Ontario, M1R
      0
      0
      0
      0
      0
      35267
      0
      0
      0
      0
      ...
      1
      0
      0
      0
      0
      0
      0
      1
      0.65002
      2.7456
    
    
      Bingley Rd & Valley Centre Dr, Toronto, Ontario, M1X
      0
      0
      0
      0
      0
      1081
      0
      0
      0
      0
      ...
      1
      0
      0
      0
      0
      0
      0
      1
      0.65002
      2.7456
    
    
      Lake Shore Blvd W & Lakeshore British Columbia Rd, Toronto, Ontario, M6K
      0
      0
      0
      0
      0
      27298
      0
      0
      0
      0
      ...
      1
      0
      0
      0
      0
      0
      0
      1
      0.65002
      2.7456
    
    
      Pharmacy Ave & Seabury Gate, Toronto, Ontario, M1T
      0
      0
      0
      0
      0
      35267
      0
      0
      0
      0
      ...
      1
      0
      0
      0
      0
      0
      0
      1
      0.65002
      2.7456
    
    
      Foxcote Cres & Rangoon Rd, Toronto, Ontario, M9C
      0
      0
      0
      0
      0
      16435
      0
      0
      0
      0
      ...
      1
      0
      0
      0
      0
      0
      0
      1
      0.65002
      2.7456
    
  

5 rows × 24 columns



In [67]:

    
df_data2.to_csv('preinput_data2.csv')
df_data_central.to_csv('preinput_data_central.csv')



In [68]:

    
df_data = pd.DataFrame.from_csv('preinput_data2.csv')
df_data_central = pd.DataFrame.from_csv('preinput_data_central.csv')



In [69]:

    
#df_data['injury'] = df_data.fatal + df_data.major + df_data.minor + df_data['minimal/none']
df_data['fatal/major'] = df_data.fatal + df_data.major

#df_data_central['injury'] = df_data_central.fatal + df_data_central.major + df_data_central.minor + df_data_central['minimal/none']
df_data_central['fatal/major'] = df_data_central.fatal + df_data_central.major



In [70]:

    
df_data.columns.values









    Out[70]:





array(['normalized yearly accident rate', 'total collisions',
       'bike traffic estimate', 'scaled collisions',
       'normalized_risk_factor', 'vehicle traffic', 'pedestrian traffic',
       'fatal', 'major', 'minor', 'minimal/none', 'road_class',
       'control type', 'no control', 'other/unrecorded',
       'pedestrian crossover', 'stop/yield sign', 'traffic signal/gate',
       'Collector/Expressway', 'Local/Laneway', 'Major Arterial',
       'Minor Arterial', 'posterior mean rate', 'posterior STD',
       'fatal/major'], dtype=object)

Training and testing linear models.

First we break set up our features and target data for both the full, and central-only data sets.

The barious linear regression algorithms available in sci-kit learn expect features which are gaussian. Otherwise they may perform very badly. We use the preproceessing library to scale the various traffic counts.



In [71]:

    
X= df_data.drop(['normalized yearly accident rate','posterior mean rate','posterior STD','total collisions','normalized_risk_factor','road_class','control type','minor','minimal/none','scaled collisions','fatal','major'],axis=1)
Y_priorTotals = df_data['total collisions']
Y_priorRate = df_data['total collisions']/13
Y_postRate = df_data['posterior mean rate']
Y_priorRiskRate = df_data['normalized_risk_factor']

Xc= df_data_central.drop(['normalized yearly accident rate','posterior mean rate','posterior STD','total collisions','normalized_risk_factor','road_class','control type','minor','minimal/none','scaled collisions','fatal','major'],axis=1)
Yc_priorTotals = df_data_central['total collisions']
Yc_priorRate = df_data_central['total collisions']/13
Yc_postRate = df_data_central['posterior mean rate']
Yc_priorRiskRate = df_data_central['normalized_risk_factor']



In [72]:

    
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
raw_bike = X['bike traffic estimate'].reshape(-1,1)
raw_veh = X['vehicle traffic'].reshape(-1,1)
raw_ped = X['pedestrian traffic'].reshape(-1,1)
bike_scaler = preprocessing.StandardScaler().fit(raw_bike)
veh_scaler = preprocessing.StandardScaler().fit(raw_veh)
ped_scaler = preprocessing.StandardScaler().fit(raw_ped)
X_scaled_bike = bike_scaler.transform(raw_bike)
X_scaled_veh = veh_scaler.transform(raw_veh)
X_scaled_ped = ped_scaler.transform(raw_ped)
X['bike traffic estimate'] = X_scaled_bike
X['vehicle traffic'] = X_scaled_veh
X['ped_scaler'] = X_scaled_ped



In [73]:

    
raw_bike = Xc['bike traffic estimate'].reshape(-1,1)
raw_veh = Xc['vehicle traffic'].reshape(-1,1)
raw_ped = Xc['pedestrian traffic'].reshape(-1,1)
bike_scaler_c = preprocessing.StandardScaler().fit(raw_bike)
veh_scaler_c = preprocessing.StandardScaler().fit(raw_veh)
ped_scaler_c = preprocessing.StandardScaler().fit(raw_ped)
Xc_scaled_bike = bike_scaler_c.transform(raw_bike)
Xc_scaled_veh = veh_scaler_c.transform(raw_veh)
Xc_scaled_ped = ped_scaler_c.transform(raw_ped)
Xc['bike traffic estimate'] = Xc_scaled_bike
Xc['vehicle traffic'] = Xc_scaled_veh
Xc['ped_scaler'] = Xc_scaled_ped

First we train a basic linear regression. Rather than splitting the data manually, or coding up a cross validation method ourselves (as was required in the linear regression exercise for the course), we'll use the included cross_val_score and _predict methods.



In [74]:

    
from sklearn import linear_model as lm
cycling_linear = lm.LinearRegression()
cycling_lasso = lm.Lasso()
cycling_ridge = lm.Ridge(alpha =10)



In [75]:

    
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
predicted = cross_val_predict(cycling_linear,X,Y_priorRate,cv=5)
linear_score = cross_val_score(cycling_linear,X,Y_priorRate,cv=5)
print("linear score: ", np.mean(linear_score))
print("predicted: ", predicted)









    



linear score:  0.508928284915
predicted:  [ 0.17652133 -0.01308693 -0.02239764 ...,  0.12101306  0.01118716
  0.01118716]

We first notice that the score isn't terribly good. There is a lot of room for improvement here. How do these predictions look visually when plotted against the actual data? Let's take a look



In [76]:

    
fig, ax = plt.subplots()
Y = Y_priorRate
r2 = linear_score
ax.scatter(Y,predicted)
ax.plot([Y.min(), Y.max()], [Y.min(), Y.max()], 'k--',label ="r^2="+str(r2), lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()



In [77]:

    
plt.scatter(predicted, predicted- Y, c='g', s=40, alpha=0.5)
plt.hlines(y = 0, xmin=0, xmax = 1.5)
plt.title('Residual Plot of OLS regression for all postal codes, using cross-validation')
plt.ylabel('Residuals')









    Out[77]:





<matplotlib.text.Text at 0x7fac25aa90f0>



In [78]:

    
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score

predicted = cross_val_predict(cycling_linear,Xc,Yc_priorRate,cv=5)
linear_score = cross_val_score(cycling_linear,Xc,Yc_priorRate,cv=5)
print("linear score: ", np.mean(linear_score))
print("predicted: ", predicted)

fig, ax = plt.subplots()
Y = Yc_priorRate
r2 = linear_score
ax.scatter(Y,predicted)
ax.plot([Y.min(), Y.max()], [Y.min(), Y.max()], 'k--',label ="r^2="+str(r2), lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()









    



linear score:  0.60266120383
predicted:  [ 0.1582576  -0.02152201 -0.00971354 ...,  0.002264    0.002264    0.002264  ]



In [79]:

    
plt.scatter(predicted, predicted- Y, c='g', s=40, alpha=0.5)
plt.hlines(y = 0, xmin=0, xmax = 1.5)
plt.title('Residual Plot of OLS regression for central postal codes, using cross-validation')
plt.ylabel('Residuals')









    Out[79]:





<matplotlib.text.Text at 0x7fac1c12a4e0>

So it's very obvious to us that there is a very large tendency to understimate collision numbers, particularily for mid, to high risk intersections. We see that restricting ourselves to intersections which are in central Toronto gives a pretty sizable imporvement in $R^2$. From 0.5 to about 0.6. The residuals show an improvement in the underestimation as well.

Let's try a more sophisticated model. Lasso regression yields a sparse model. It is useful for reducing the dimensionality of a problem. In this case, we probably are going to get pretty bad results, since we actually have very few features to work with already.

We'll let the use the LassoCV method to fit an alpha, and then make our prediction from there.



In [80]:

    
lassomodel = lm.LassoCV(cv=5).fit(X,Y_priorRate)
predicted = lassomodel.predict(X)
lasso_score = lassomodel.score(X,Y_priorRate) 

print("chosen alpha: ",lassomodel.alpha_)
print("lasso score: ", np.mean(lasso_score))
print("predicted: ", predicted)









    



chosen alpha:  0.159143550513
lasso score:  0.133455369252
predicted:  [ 0.13981183  0.07347667  0.02680618 ...,  0.05670033  0.05670033
  0.05670033]



In [81]:

    
fig, ax = plt.subplots()
Y = Y_priorRate
r2= linear_score
ax.scatter(Y,predicted)
ax.plot([Y.min(), Y.max()], [Y.min(), Y.max()], 'k--',label ="r^2="+str(r2), lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()



In [82]:

    
plt.scatter(predicted, predicted- Y, c='g', s=40, alpha=0.5)
plt.hlines(y = 0, xmin=0, xmax = 1.5)
plt.title('Residual Plot using cross-validated data')
plt.ylabel('Residuals')









    Out[82]:





<matplotlib.text.Text at 0x7fac1c0ff2b0>

So a rather poor performance overall. Worse than regular least squares regression, just as we predicted. It doesn't make sense to continue with this model by checking the performance with the central-only dataframe, as this isn't the kidn of problem it's designed for. Now what about ridge regression?



In [83]:

    
ridgemodel = lm.RidgeCV(cv=5).fit(X,Y_priorRate)
predicted = ridgemodel.predict(X)
ridge_score = ridgemodel.score(X,Y_priorRate) 

print("chosen alpha: ",ridgemodel.alpha_)
print("ridge score: ", np.mean(ridge_score))
print("predicted: ", predicted)









    



chosen alpha:  10.0
ridge score:  0.519202385901
predicted:  [ 0.17457528 -0.00871101 -0.01980315 ...,  0.12200059  0.01135503
  0.01135503]



In [84]:

    
fig, ax = plt.subplots()
Y = Y_priorRate
r2 = linear_score
ax.scatter(Y,predicted)
ax.plot([Y.min(), Y.max()], [Y.min(), Y.max()], 'k--', label ="r^2="+str(r2),lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()



In [85]:

    
plt.scatter(predicted, predicted- Y, c='g', s=40, alpha=0.5)
plt.hlines(y = 0, xmin=0, xmax = 1.5)
plt.title('Residual Plot using cross-validated data')
plt.ylabel('Residuals')









    Out[85]:





<matplotlib.text.Text at 0x7fac22a142e8>



In [86]:

    
ridgemodel = lm.RidgeCV(cv=5).fit(Xc,Yc_priorRate)
predicted = ridgemodel.predict(Xc)
ridge_score = ridgemodel.score(Xc,Yc_priorRate) 

print("chosen alpha: ",ridgemodel.alpha_)
print("ridge score: ", np.mean(ridge_score))
print("predicted: ", predicted)

fig, ax = plt.subplots()
Y = Yc_priorRate
r2 = linear_score
ax.scatter(Y,predicted)
ax.plot([Y.min(), Y.max()], [Y.min(), Y.max()], 'k--',label ="r^2="+str(r2), lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()









    



chosen alpha:  10.0
ridge score:  0.620701229935
predicted:  [ 0.159844   -0.01657434 -0.00851878 ...,  0.00251754  0.00251754
  0.00251754]



In [87]:

    
plt.scatter(predicted, predicted- Y, c='g', s=40, alpha=0.5)
plt.hlines(y = 0, xmin=0, xmax = 1.5)
plt.title('Residual Plot using cross-validated data')
plt.ylabel('Residuals')









    Out[87]:





<matplotlib.text.Text at 0x7fac22978eb8>

We have a slight improvement over our least squares model. Unfortunately still not very good. And we still underestimate systematically the high risk intersections. What can we accomplish by trying an ensemble method? And perhaps including another feature, like postal codes?

Not every postal code will be working to add information. But a few will be ranked highly when we look at the feature importance output for a random forrest.



In [88]:

    
Xrf = df_data[['bike traffic estimate','vehicle traffic','pedestrian traffic','road_class','control type']]
Yrf_priorTotals = df_data['total collisions']
Yrf_priorRate = df_data['total collisions']/13
Yrf_postRate = df_data['posterior mean rate']
Yrf_RiskRate = df_data['scaled collisions']/13

Xrf_c = df_data_central[['bike traffic estimate','vehicle traffic','pedestrian traffic','road_class','control type']]
Yrf_priorTotals_c = df_data_central['total collisions']
Yrf_priorRate_c = df_data_central['total collisions']/13
Yrf_postRate_c = df_data_central['posterior mean rate']
Yrf_RiskRate_c = df_data_central['scaled collisions']/13



In [89]:

    
Xrf.fillna('Minor Arterial',inplace= True)









    



/home/christian/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py:2756: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  downcast=downcast, **kwargs)



In [90]:

    
control_dict = {'traffic signal/gate':1,'stop/yield sign':2, 'pedestrian crossover':3,'no control':5, 'other/unrecorded':4}
Xrf['control_type_num'] = Xrf['control type'].map(control_dict)
Xrf_c['control_type_num'] = Xrf_c['control type'].map(control_dict)









    



/home/christian/anaconda3/lib/python3.5/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
/home/christian/anaconda3/lib/python3.5/site-packages/ipykernel/__main__.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()



In [91]:

    
roadclass_dict = {'Collector/Expressway':4,'Major Arterial':3,'Minor Arterial':2,'Local/Laneway':1}
Xrf['road_class_num'] = Xrf['road_class'].map(roadclass_dict)

Xrf_c.loc[:,'road_class_num'] = Xrf_c.loc[:,'road_class'].map(roadclass_dict)









    



/home/christian/anaconda3/lib/python3.5/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
/home/christian/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:284: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
/home/christian/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:461: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s



In [92]:

    
Xrf = Xrf.drop(['road_class','control type'] , axis=1)
Xrf_c = Xrf_c.drop(['road_class','control type'],axis =1)



In [93]:

    
from sklearn.model_selection import train_test_split
Xc_train, Xc_test, Yc_train, Yc_test = train_test_split(Xrf_c, Yrf_priorRate_c, test_size=0.25, random_state = 5)
print (Xc_train.shape)
print (Xc_test.shape)
print (Yc_train.shape)
print (Yc_test.shape)

X_train, X_test, Y_train, Y_test = train_test_split(Xrf, Yrf_priorRate, test_size=0.25, random_state = 5)
print (X_train.shape)
print (X_test.shape)
print (Y_train.shape)
print (Y_test.shape)









    



(14019, 5)
(4674, 5)
(14019,)
(4674,)
(15278, 5)
(5093, 5)
(15278,)
(5093,)



In [95]:

    
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
rf_c= RandomForestRegressor(n_estimators = 200)
prediction_c = cross_val_predict(rf_c,Xrf_c,Yrf_priorRate_c, cv=6)
r2_c = np.mean(cross_val_score(rf_c,Xrf_c,Yrf_priorRate_c,cv=6))
mse = np.mean((Yrf_priorRate_c-prediction_c)**2)
print('MSE : ',mse)

print("R^2 : ", r2_c)

plt.scatter(Yrf_priorRate_c,prediction_c)
plt.plot(np.arange(0,4),np.arange(0,4), label ="r^2="+str(r2_c), c="r")
plt.legend(loc="lower right")
plt.xlabel('Measured')
plt.ylabel('Predicted')









    



MSE :  0.007115678508870524
R^2 :  0.608950817341






    Out[95]:





<matplotlib.text.Text at 0x7fac0141d7b8>



In [96]:

    
rf= RandomForestRegressor(n_estimators = 200)
prediction = cross_val_predict(rf,Xrf,Yrf_priorRate, cv=5)
r2 = np.mean(cross_val_score(rf,Xrf,Yrf_priorRate,cv=5))
mse = np.mean((Yrf_priorRate-prediction)**2)
print('MSE : ',mse)

print("R^2 : ", r2)

plt.scatter(Yrf_priorRate,prediction)
plt.plot(np.arange(0,4),np.arange(0,4), label ="r^2="+str(r2), c="r")
plt.legend(loc="lower right")
plt.xlabel('Measured')
plt.ylabel('Predicted')









    



MSE :  0.01319947702209104
R^2 :  0.530617278464






    Out[96]:





<matplotlib.text.Text at 0x7fac00543fd0>



In [97]:

    
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
rf= RandomForestRegressor(n_estimators = 200)
rf.fit(X_train,Y_train)
r2 = r2_score(Y_test, rf.predict(X_test))
mse = np.mean((Y_test-rf.predict(X_test))**2)
print('MSE : ',mse)

print("R^2 : ", r2)

plt.scatter(Y_test,rf.predict(X_test))
plt.plot(np.arange(0,4),np.arange(0,4), label ="r^2="+str(r2), c="r")
plt.legend(loc="lower right")
plt.xlabel('Measured')
plt.ylabel('Predicted')









    



MSE :  0.012051314399154295
R^2 :  0.562351570052






    Out[97]:





<matplotlib.text.Text at 0x7fac00454a58>



In [98]:

    
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
rf_c= RandomForestRegressor(n_estimators = 200)
rf_c.fit(Xc_train,Yc_train)
r2_c = r2_score(Yc_test, rf_c.predict(Xc_test))
mse = np.mean((Yc_test-rf_c.predict(Xc_test))**2)
print('MSE : ',mse)

print("R^2 : ", r2_c)

plt.scatter(Yc_test,rf_c.predict(Xc_test))
plt.plot(np.arange(0,3),np.arange(0,3), label ="r^2="+str(r2_c), c="r")
plt.legend(loc="lower right")
plt.xlabel('Measured')
plt.ylabel('Predicted')









    



MSE :  0.005105772686015189
R^2 :  0.693184021549






    Out[98]:





<matplotlib.text.Text at 0x7fac003d9668>



In [99]:

    
names = list(Xrf.columns.values)
print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), rf.feature_importances_), names), 
             reverse=True))









    



Features sorted by their score:
[(0.42970000000000003, 'control_type_num'), (0.23400000000000001, 'pedestrian traffic'), (0.161, 'vehicle traffic'), (0.14460000000000001, 'bike traffic estimate'), (0.030599999999999999, 'road_class_num')]

Let's drop our bike traffic estimates and see how performance changes



In [100]:

    
X2 = Xrf.drop(['bike traffic estimate'],axis=1)
Xc2 = Xrf_c.drop(['bike traffic estimate'],axis=1)



In [101]:

    
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
rf= RandomForestRegressor(n_estimators = 200)
prediction = cross_val_predict(rf,X2,Yrf_priorRate)
r2 = np.mean(cross_val_score(rf,X2,Yrf_priorRate))
mse = np.mean((Yrf_priorRate-prediction)**2)
print('MSE : ',mse)

print("R^2 : ", r2)

plt.scatter(Yrf_priorRate,prediction)
plt.plot(np.arange(0,4),np.arange(0,4), label ="r^2="+str(r2), c="r")
plt.legend(loc="lower right")
plt.xlabel('Measured')
plt.ylabel('Predicted')









    



MSE :  0.013898510190349437
R^2 :  0.511604858674






    Out[101]:





<matplotlib.text.Text at 0x7fac00446908>



In [102]:

    
from sklearn.ensemble import RandomForestRegressor
rf= RandomForestRegressor(n_estimators = 200)
prediction = cross_val_predict(rf,Xc2,Yrf_priorRate_c)
r2 = np.mean(cross_val_score(rf,Xc2,Yrf_priorRate_c))
mse = np.mean((Yrf_priorRate_c-prediction)**2)
print('MSE : ',mse)

print("R^2 : ", r2)

plt.scatter(Yrf_priorRate_c,prediction)
plt.plot(np.arange(0,4),np.arange(0,4), label ="r^2="+str(r2), c="r")
plt.legend(loc="lower right")
plt.xlabel('Measured')
plt.ylabel('Predicted')









    



MSE :  0.007752838442441518
R^2 :  0.577092103529






    Out[102]:





<matplotlib.text.Text at 0x7fac00311ef0>

Testing performance is worse without the bike estimates.

Compared to our singular linear regression approaches, this ensemble of decision trees gives a significant improvement in predictive power.



In [106]:

    
plt.scatter(rf.predict(X_train), rf.predict(X_train) - Y_train, c='b', s=40, alpha=0.5)
plt.scatter(rf.predict(X_test), rf.predict(X_test) - Y_test, c='g', s=40)
plt.hlines(y = 0, xmin=0, xmax = 4)
plt.title('Residual Plot using training (blue) and test (green) data')
plt.ylabel('Residuals')









    Out[106]:





<matplotlib.text.Text at 0x7faf5d03d9e8>



In [107]:

    
plt.scatter(rf_c.predict(X2c_train), rf_c.predict(X2c_train) - Yc_train, c='b', s=40, alpha=0.5)
plt.scatter(rf_c.predict(X2c_test), rf_c.predict(X2c_test) - Yc_test, c='g', s=40)
plt.hlines(y = 0, xmin=0, xmax = 3)
plt.title('Residual Plot using training (blue) and test (green) data')
plt.ylabel('Residuals')









    Out[107]:





<matplotlib.text.Text at 0x7faf5cfd5630>

Defintely an improvement there.



In [103]:

    
X4 = Xrf.copy()
Xc4 = Xrf_c.copy()



In [104]:

    
X4.reset_index(inplace=True)
st1 = X4['intersection'].str.split('Ontario, ').str.get(1)

Xc4.reset_index(inplace=True)
st1_c = Xc4['intersection'].str.split('Ontario, ').str.get(1)

st2 = st1.str.split(' ').str.get(0)
st2.to_csv('postal_codes.csv')
st2_c = st1_c.str.split(' ').str.get(0)
st2_c.to_csv('postal_codes_c.csv')



In [105]:

    
st2 = pd.DataFrame.from_csv('postal_codes2.csv',header=None)

st2_c = pd.DataFrame.from_csv('postal_codes_c2.csv',header=None)



In [106]:

    
X4.head()









    Out[106]:






  
    
      
      intersection
      bike traffic estimate
      vehicle traffic
      pedestrian traffic
      control_type_num
      road_class_num
    
  
  
    
      0
      bpNichol Ln & bpNichol Ln, Toronto, Ontario, M5S
      1216.4
      11213
      4767
      5
      3
    
    
      1
      Zachary Ct & Zachary Ct, Toronto, Ontario, M6A
      1111.1
      12798
      2011
      4
      1
    
    
      2
      York Ridge Rd & York Ridge Rd, Toronto, Ontari...
      783.49
      13638
      72
      4
      2
    
    
      3
      York Mills Rd & York Ridge Rd, Toronto, Ontari...
      783.49
      13638
      72
      4
      2
    
    
      4
      York Mills Rd & York Mills Rd, Toronto, Ontari...
      1106.1
      25718
      1926
      4
      2



In [107]:

    
X4.set_index('intersection', inplace= True)
Xc4.set_index('intersection', inplace= True)
#X4.drop('index',inplace = True, axis=1)



In [108]:

    
X4['postal'] = st2[1].values
Xc4['postal'] = st2_c[1].values



In [109]:

    
postal_df = pd.get_dummies(X4.postal, dummy_na = False)
X4 = pd.concat([X4,postal_df],axis=1)

postal_df = pd.get_dummies(Xc4.postal,dummy_na=False)
Xc4 = pd.concat([Xc4,postal_df],axis=1)



In [110]:

    
X4.drop('postal',inplace = True, axis = 1)
Xc4.drop('postal',inplace = True, axis = 1)



In [111]:

    
Xc_train, Xc_test, Yc_train, Yc_test = train_test_split(Xc4, Yrf_priorRate_c, test_size=0.25, random_state = 5)
print (Xc_train.shape)
print (Xc_test.shape)
print (Yc_train.shape)
print (Yc_test.shape)

X_train, X_test, Y_train, Y_test = train_test_split(X4, Yrf_priorRate, test_size=0.25, random_state = 5)
print (X_train.shape)
print (X_test.shape)
print (Y_train.shape)
print (Y_test.shape)

rf= RandomForestRegressor(n_estimators = 200)

rf.fit(X_train,Y_train)
r2 = r2_score(Y_test, rf.predict(X_test))
mse = np.mean((Y_test-rf.predict(X_test))**2)
print('MSE for all data: ', mse)

rf_c= RandomForestRegressor(n_estimators = 200)
rf_c.fit(Xc_train,Yc_train)
r2_c = r2_score(Yc_test, rf_c.predict(Xc_test))
mse = np.mean((Yc_test-rf_c.predict(Xc_test))**2)
print("MSE for central data: ", mse)

names = list(X4.columns.values)
print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), rf.feature_importances_), names), 
             reverse=True))

names2 = list(Xc4.columns.values)
print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), rf_c.feature_importances_), names2), 
             reverse=True))









    



(14019, 115)
(4674, 115)
(14019,)
(4674,)
(15278, 119)
(5093, 119)
(15278,)
(5093,)
MSE for all data:  0.010668205012677612
MSE for central data:  0.004902613878013121
Features sorted by their score:
[(0.43099999999999999, 'control_type_num'), (0.1971, 'pedestrian traffic'), (0.1042, 'bike traffic estimate'), (0.1026, 'vehicle traffic'), (0.024, 'road_class_num'), (0.012699999999999999, 'M6G'), (0.010699999999999999, 'M6H'), (0.0099000000000000008, 'M5T'), (0.0086, 'M6J'), (0.0083000000000000001, 'M5A'), (0.0064999999999999997, 'M5V'), (0.0064000000000000003, 'M4M'), (0.0057999999999999996, 'M5B'), (0.0040000000000000001, 'M4L'), (0.0038, 'M5S'), (0.0037000000000000002, 'M4W'), (0.0030000000000000001, 'M4K'), (0.0028, 'M5R'), (0.0025000000000000001, 'M6K'), (0.0023, 'M5H'), (0.0022000000000000001, 'M6P'), (0.002, 'N0A'), (0.0019, 'M2R'), (0.0018, 'M5G'), (0.0018, 'M4Y'), (0.0018, 'M4J'), (0.0018, 'M1K'), (0.0016999999999999999, 'M5E'), (0.0016999999999999999, 'M1S'), (0.0015, 'M5C'), (0.0014, 'M5J'), (0.0012999999999999999, 'M4V'), (0.0012999999999999999, 'M3H'), (0.0011999999999999999, 'M9A'), (0.0011999999999999999, 'M4C'), (0.0011999999999999999, 'M1V'), (0.0011000000000000001, 'M6S'), (0.0011000000000000001, 'M1E'), (0.00089999999999999998, 'M3C'), (0.00089999999999999998, 'M2N'), (0.00080000000000000004, 'M3J'), (0.00080000000000000004, 'M1R'), (0.00069999999999999999, 'M9M'), (0.00069999999999999999, 'M4R'), (0.00069999999999999999, 'M3L'), (0.00069999999999999999, 'M1W'), (0.00059999999999999995, 'M9N'), (0.00059999999999999995, 'M8Z'), (0.00059999999999999995, 'M6M'), (0.00059999999999999995, 'M4X'), (0.00059999999999999995, 'M1L'), (0.00050000000000000001, 'M9W'), (0.00050000000000000001, 'M6R'), (0.00050000000000000001, 'M6N'), (0.00050000000000000001, 'M6E'), (0.00040000000000000002, 'M9V'), (0.00040000000000000002, 'M8V'), (0.00040000000000000002, 'M4S'), (0.00040000000000000002, 'M4P'), (0.00040000000000000002, 'M4E'), (0.00040000000000000002, 'M3N'), (0.00040000000000000002, 'M3B'), (0.00040000000000000002, 'M2J'), (0.00040000000000000002, 'M1T'), (0.00040000000000000002, 'M1H'), (0.00029999999999999997, 'M9R'), (0.00029999999999999997, 'M9C'), (0.00029999999999999997, 'M8W'), (0.00029999999999999997, 'M6B'), (0.00029999999999999997, 'M4T'), (0.00029999999999999997, 'M4N'), (0.00029999999999999997, 'M2M'), (0.00029999999999999997, 'M2H'), (0.00029999999999999997, 'M1P'), (0.00029999999999999997, 'M1J'), (0.00029999999999999997, 'M1B'), (0.00020000000000000001, 'M9P'), (0.00020000000000000001, 'M9B'), (0.00020000000000000001, 'M8Y'), (0.00020000000000000001, 'M8X'), (0.00020000000000000001, 'M6L'), (0.00020000000000000001, 'M6C'), (0.00020000000000000001, 'M6A'), (0.00020000000000000001, 'M5P'), (0.00020000000000000001, 'M5M'), (0.00020000000000000001, 'M4G'), (0.00020000000000000001, 'M4B'), (0.00020000000000000001, 'M3M'), (0.00020000000000000001, 'M3A'), (0.00020000000000000001, 'M2K'), (0.00020000000000000001, 'M1G'), (0.00020000000000000001, 'L3R'), (0.0001, 'M7A'), (0.0001, 'M5N'), (0.0001, 'M4H'), (0.0001, 'M4A'), (0.0001, 'M2L'), (0.0001, 'M1N'), (0.0001, 'M1C'), (0.0001, 'L3T'), (0.0, 'M9L'), (0.0, 'M5X'), (0.0, 'M5L'), (0.0, 'M3K'), (0.0, 'M2P'), (0.0, 'M1X'), (0.0, 'M1M'), (0.0, 'L6T'), (0.0, 'L5V'), (0.0, 'L5E'), (0.0, 'L4X'), (0.0, 'L4W'), (0.0, 'L4V'), (0.0, 'L4T'), (0.0, 'L4L'), (0.0, 'L4K'), (0.0, 'L4J'), (0.0, 'L4H'), (0.0, 'L3S')]
Features sorted by their score:
[(0.53610000000000002, 'control_type_num'), (0.1293, 'bike traffic estimate'), (0.12130000000000001, 'vehicle traffic'), (0.099000000000000005, 'pedestrian traffic'), (0.028799999999999999, 'road_class_num'), (0.023, 'M5T'), (0.0109, 'M5V'), (0.0067000000000000002, 'M5S'), (0.0061999999999999998, 'M5B'), (0.0061999999999999998, 'M5A'), (0.0057000000000000002, 'M6G'), (0.0040000000000000001, 'M6H'), (0.0038, 'M4W'), (0.0033999999999999998, 'M5H'), (0.0033, 'M5G'), (0.0033, 'M5C'), (0.0023999999999999998, 'M4Y'), (0.0019, 'M5E'), (0.0018, 'M5J'), (0.0012999999999999999, 'M4X'), (0.0011000000000000001, 'M6P'), (0.00050000000000000001, 'M7A'), (0.0, 'M9W'), (0.0, 'M9V'), (0.0, 'M9R'), (0.0, 'M9P'), (0.0, 'M9N'), (0.0, 'M9M'), (0.0, 'M9L'), (0.0, 'M9C'), (0.0, 'M9B'), (0.0, 'M9A'), (0.0, 'M8Z'), (0.0, 'M8Y'), (0.0, 'M8X'), (0.0, 'M8W'), (0.0, 'M8V'), (0.0, 'M6S'), (0.0, 'M6R'), (0.0, 'M6N'), (0.0, 'M6M'), (0.0, 'M6L'), (0.0, 'M6K'), (0.0, 'M6J'), (0.0, 'M6E'), (0.0, 'M6C'), (0.0, 'M6B'), (0.0, 'M6A'), (0.0, 'M5X'), (0.0, 'M5R'), (0.0, 'M5P'), (0.0, 'M5N'), (0.0, 'M5M'), (0.0, 'M5L'), (0.0, 'M4V'), (0.0, 'M4T'), (0.0, 'M4S'), (0.0, 'M4R'), (0.0, 'M4P'), (0.0, 'M4N'), (0.0, 'M4M'), (0.0, 'M4L'), (0.0, 'M4K'), (0.0, 'M4J'), (0.0, 'M4H'), (0.0, 'M4G'), (0.0, 'M4E'), (0.0, 'M4C'), (0.0, 'M4B'), (0.0, 'M4A'), (0.0, 'M3N'), (0.0, 'M3M'), (0.0, 'M3L'), (0.0, 'M3K'), (0.0, 'M3J'), (0.0, 'M3H'), (0.0, 'M3C'), (0.0, 'M3B'), (0.0, 'M3A'), (0.0, 'M2R'), (0.0, 'M2P'), (0.0, 'M2N'), (0.0, 'M2M'), (0.0, 'M2L'), (0.0, 'M2K'), (0.0, 'M2J'), (0.0, 'M2H'), (0.0, 'M1X'), (0.0, 'M1W'), (0.0, 'M1V'), (0.0, 'M1T'), (0.0, 'M1S'), (0.0, 'M1R'), (0.0, 'M1P'), (0.0, 'M1N'), (0.0, 'M1M'), (0.0, 'M1L'), (0.0, 'M1K'), (0.0, 'M1J'), (0.0, 'M1H'), (0.0, 'M1G'), (0.0, 'M1E'), (0.0, 'M1C'), (0.0, 'M1B'), (0.0, 'L5E'), (0.0, 'L4W'), (0.0, 'L4V'), (0.0, 'L4T'), (0.0, 'L4L'), (0.0, 'L4K'), (0.0, 'L4J'), (0.0, 'L4H'), (0.0, 'L3T'), (0.0, 'L3S'), (0.0, 'L3R')]



In [112]:

    
plt.scatter(Y_test,rf.predict(X_test))
plt.plot(np.arange(0,3.5),np.arange(0,3.5), label ="r^2="+str(r2), c="r")
plt.legend(loc="lower right")
plt.xlabel('Measured')
plt.ylabel('Predicted')









    Out[112]:





<matplotlib.text.Text at 0x7fac002444e0>



In [113]:

    
plt.scatter(Yc_test,rf_c.predict(Xc_test))
plt.plot(np.arange(0,3.5),np.arange(0,3.5), label ="r^2="+str(r2_c), c="r")
plt.legend(loc="lower right")
plt.xlabel('Measured')
plt.ylabel('Predicted')









    Out[113]:





<matplotlib.text.Text at 0x7fac00230908>

The fit from our random forest is starting to look pretty good, particularily when we restrict ourselves to the central intersections only, which still contain the bulk of our data. Looking at the output in terms of feature importance, we see that for BOTH datasets, most of the postal codes don't add any information. We could use the SelectFromModel meta-transformer to specify a threshold (manually or as some multiple of the mean), to remove features which are irrelevant. sci-kit lear ndoes this using mean decrease impurity. This is how much the variance is decreased by each feature on average.

Another feature selection method is to measure how much each feature impacts the accuracy of our model. By permuating the values of each feature, we can check how much it affets our prediction. For important variables, there should be a significant decrease. The cell below calculates the mean decrease accuracty for our model



In [114]:

    
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import r2_score
from collections import defaultdict
 
X = X4.values
Y = Yrf_priorRate

rf_alt = RandomForestRegressor()
scores = defaultdict(list)
forest_split = ShuffleSplit(n_splits=10,test_size = .25,random_state=1) 
#crossvalidate the scores on a number of different random splits of the data
for train_idx, test_idx in forest_split.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    Y_train, Y_test = Y[train_idx], Y[test_idx]
    r = rf_alt.fit(X_train, Y_train)
    acc = r2_score(Y_test, rf.predict(X_test))
    for i in range(X.shape[1]):
        X_t = X_test.copy()
        np.random.shuffle(X_t[:, i])
        shuff_acc = r2_score(Y_test, rf.predict(X_t))
        scores[names[i]].append((acc-shuff_acc)/acc)
print ("Features sorted by their score:")
print (sorted([(round(np.mean(score), 4), feat) for
              feat, score in scores.items()], reverse=True))









    



Features sorted by their score:
[(1.2586999999999999, 'control_type_num'), (0.31430000000000002, 'pedestrian traffic'), (0.17749999999999999, 'vehicle traffic'), (0.17710000000000001, 'bike traffic estimate'), (0.060699999999999997, 'road_class_num'), (0.025100000000000001, 'M6J'), (0.025000000000000001, 'M6G'), (0.023099999999999999, 'M5A'), (0.019099999999999999, 'M5T'), (0.0149, 'M4M'), (0.0144, 'M6H'), (0.0101, 'M5B'), (0.0077000000000000002, 'M5V'), (0.0073000000000000001, 'M4L'), (0.0064000000000000003, 'M6P'), (0.0057000000000000002, 'M5R'), (0.0053, 'M1K'), (0.0048999999999999998, 'M5S'), (0.0041999999999999997, 'M4W'), (0.0041999999999999997, 'M4J'), (0.0041000000000000003, 'M4K'), (0.0040000000000000001, 'N0A'), (0.0035000000000000001, 'M6K'), (0.0032000000000000002, 'M1S'), (0.0027000000000000001, 'M2R'), (0.0025999999999999999, 'M4C'), (0.002, 'M4V'), (0.0016999999999999999, 'M1W'), (0.0016000000000000001, 'M5H'), (0.0016000000000000001, 'M2N'), (0.0016000000000000001, 'M1V'), (0.0015, 'M1E'), (0.0014, 'M6S'), (0.0014, 'M4Y'), (0.0014, 'M3J'), (0.0014, 'M3H'), (0.0012999999999999999, 'M5G'), (0.0011999999999999999, 'M9A'), (0.0011000000000000001, 'M5E'), (0.001, 'M3C'), (0.00089999999999999998, 'M6R'), (0.00089999999999999998, 'M5J'), (0.00080000000000000004, 'M9M'), (0.00080000000000000004, 'M8V'), (0.00080000000000000004, 'M6N'), (0.00069999999999999999, 'M8Z'), (0.00069999999999999999, 'M1L'), (0.00059999999999999995, 'M4X'), (0.00050000000000000001, 'M9W'), (0.00050000000000000001, 'M9N'), (0.00050000000000000001, 'M9C'), (0.00050000000000000001, 'M9B'), (0.00050000000000000001, 'M6M'), (0.00050000000000000001, 'M5C'), (0.00050000000000000001, 'M4P'), (0.00050000000000000001, 'M4N'), (0.00050000000000000001, 'M4E'), (0.00050000000000000001, 'M1T'), (0.00050000000000000001, 'M1R'), (0.00040000000000000002, 'M6E'), (0.00040000000000000002, 'M4R'), (0.00040000000000000002, 'M3L'), (0.00040000000000000002, 'M1H'), (0.00029999999999999997, 'M6B'), (0.00029999999999999997, 'M5M'), (0.00029999999999999997, 'M3N'), (0.00029999999999999997, 'M2K'), (0.00029999999999999997, 'M2J'), (0.00020000000000000001, 'M9V'), (0.00020000000000000001, 'M9R'), (0.00020000000000000001, 'M9P'), (0.00020000000000000001, 'M8Y'), (0.00020000000000000001, 'M8W'), (0.00020000000000000001, 'M5P'), (0.00020000000000000001, 'M4S'), (0.00020000000000000001, 'M4B'), (0.00020000000000000001, 'M3B'), (0.00020000000000000001, 'M3A'), (0.00020000000000000001, 'M2H'), (0.00020000000000000001, 'M1P'), (0.00020000000000000001, 'M1J'), (0.00020000000000000001, 'M1B'), (0.0001, 'M8X'), (0.0001, 'M7A'), (0.0001, 'M6L'), (0.0001, 'M6C'), (0.0001, 'M6A'), (0.0001, 'M5N'), (0.0001, 'M4T'), (0.0001, 'M4H'), (0.0001, 'M4G'), (0.0001, 'M4A'), (0.0001, 'M3M'), (0.0001, 'M2L'), (0.0001, 'M1N'), (0.0001, 'M1G'), (0.0001, 'M1C'), (0.0001, 'L3R'), (0.0, 'M9L'), (0.0, 'M5X'), (0.0, 'M5L'), (0.0, 'M3K'), (0.0, 'M2P'), (0.0, 'M1X'), (0.0, 'M1M'), (0.0, 'L6T'), (0.0, 'L5V'), (0.0, 'L5E'), (0.0, 'L4X'), (0.0, 'L4W'), (0.0, 'L4V'), (0.0, 'L4T'), (0.0, 'L4L'), (0.0, 'L4K'), (0.0, 'L4J'), (0.0, 'L4H'), (0.0, 'L3T'), (0.0, 'L3S'), (-0.0001, 'M2M')]

This gives similar results as before. Let's reduce our postal code features, leaving only those which are scoring above half the mean by the decrease in mean variance.



In [115]:

    
from sklearn.feature_selection import SelectFromModel

rf= RandomForestRegressor(n_estimators = 20)


rf_c= RandomForestRegressor(n_estimators = 20)

model = SelectFromModel(rf,threshold = '0.25*mean')
model.fit(X4,Yrf_priorRate)

X5 = model.transform(X4)

model_c = SelectFromModel(rf_c,threshold = '0.25*mean')
model_c.fit(Xc4,Yrf_priorRate_c)
Xc5 = model_c.transform(Xc4)



In [116]:

    
X5.shape









    Out[116]:





(20371, 22)



In [117]:

    
Xc5.shape









    Out[117]:





(18693, 18)

So we are taking the first 23 features for our full data set, and the first 17 for our central only data set. This corresponds to 18 postal codes and 12 postal codes, respectively. Looking at the individual postal codes, there is strong overlap between the two lists.



In [118]:

    
Xc_train, Xc_test, Yc_train, Yc_test = train_test_split(Xc5, Yrf_priorRate_c, test_size=0.25, random_state = 5)
print (Xc_train.shape)
print (Xc_test.shape)
print (Yc_train.shape)
print (Yc_test.shape)

X_train, X_test, Y_train, Y_test = train_test_split(X5, Yrf_priorRate, test_size=0.25, random_state = 5)
print (X_train.shape)
print (X_test.shape)
print (Y_train.shape)
print (Y_test.shape)

rf= RandomForestRegressor(n_estimators = 200)

rf.fit(X_train,Y_train)
r2 = r2_score(Y_test, rf.predict(X_test))
mse = np.mean((Y_test-rf.predict(X_test))**2)
print('MSE for all data: ', mse)

rf_c= RandomForestRegressor(n_estimators = 200)
rf_c.fit(Xc_train,Yc_train)
r2_c = r2_score(Yc_test, rf_c.predict(Xc_test))
mse = np.mean((Yc_test-rf_c.predict(Xc_test))**2)
print("MSE for central data: ", mse)









    



(14019, 18)
(4674, 18)
(14019,)
(4674,)
(15278, 22)
(5093, 22)
(15278,)
(5093,)
MSE for all data:  0.010912322966155756
MSE for central data:  0.0049367048615294356



In [119]:

    
plt.scatter(Y_test,rf.predict(X_test))
plt.plot(np.arange(0,3.5),np.arange(0,3.5), label ="r^2="+str(r2), c="r")
plt.legend(loc="lower right")
plt.xlabel('Measured')
plt.ylabel('Predicted')









    Out[119]:





<matplotlib.text.Text at 0x7fac00069400>



In [120]:

    
plt.scatter(Yc_test,rf_c.predict(Xc_test))
plt.plot(np.arange(0,3.5),np.arange(0,3.5), label ="r^2="+str(r2_c), c="r")
plt.legend(loc="lower right")
plt.xlabel('Measured')
plt.ylabel('Predicted')









    Out[120]:





<matplotlib.text.Text at 0x7fabe3f0c978>

Our predictive power actually decreased in the full dataset, from about $R^2 = 0.61$ to $R^2=0.6$. However there was a slight improvement for the central only data set. This indicates that we should not use the same theshold in both cases, even if it's scaled by the mean score and not a constant value.

Ultimately, we have managed to build a predictive random forest, which using only traffic numbers (and an exponentially fit bike traffic estimate), along with the road type of the busiest road, and the traffic control type, performs a reasonable estimate of the yearly collision rates (accidents per year), at intersections for cyclists in the City of Toronto.

The scores are patricularily good when we restrict ourselves to the central post codes. This is partly attributed to the fact that the bike traffic estimates were fit with a few dozen points, all in the central neighbourhoods of Toronto. Had we better bike counts throughout the whole city, the predictive power would no doubt improve.

To output a final result, we'd lke to see how our random forest performs on the entire dataset, by cross validating so that every single point is contained in the test set at some point. Up to this point we were using the random state of our training splits to compare changes to the same test sets each time. While a random forest should not be prone to overfitting, thus not requiring cross_validation in the vast majority of cases, in order to gain a final answer on how well our forest performs, we'd like to generalize it's testing as much as possible.



In [121]:

    
rf= RandomForestRegressor(n_estimators = 200)
prediction = cross_val_predict(rf,X5,Yrf_priorRate)
r2 = np.mean(cross_val_score(rf,X5,Yrf_priorRate))
mse = np.mean((Yrf_priorRate-prediction)**2)
print('MSE : ',mse)

print("R^2 : ", r2)

plt.scatter(Yrf_priorRate,prediction)
plt.plot(np.arange(0,4),np.arange(0,4), label ="r^2="+str(r2), c="r")
plt.legend(loc="lower right")
plt.xlabel('Measured')
plt.ylabel('Predicted')









    



MSE :  0.012183449988187371
R^2 :  0.572186920283






    Out[121]:





<matplotlib.text.Text at 0x7fac003ae320>



In [122]:

    
rf_c= RandomForestRegressor(n_estimators = 200)
prediction_c = cross_val_predict(rf_c,Xc5,Yrf_priorRate_c)
r2_c = np.mean(cross_val_score(rf_c,Xc5,Yrf_priorRate_c))
mse = np.mean((Yrf_priorRate_c-prediction_c)**2)
print('MSE : ',mse)

print("R^2 : ", r2_c)

plt.scatter(Yrf_priorRate_c,prediction_c)
plt.plot(np.arange(0,4),np.arange(0,4), label ="r^2="+str(r2_c), c="r")
plt.legend(loc="lower right")
plt.xlabel('Measured')
plt.ylabel('Predicted')









    



MSE :  0.006811909988309021
R^2 :  0.628877742952






    Out[122]:





<matplotlib.text.Text at 0x7fabe3ff2390>

We see here that our choice of random state was fortuitous, and it so happened to give us a training/test split that lent itself to high scores.

While our score is lowered significantly, we still perform quite will, being able to explain over 63 percent of the variance in our data set with our very barebones model. Further improvements could be made by including distances to bike posts/parking lots, and whether or not a bike lane is present at the intersection. While this would make the project more involved, combining these extra features with more data from the years since 2010 should result in a model that is able to closely predict real life collision rates.



In [ ]:



In [ ]:

Bonus:

If we were to really try and ring out the best performance possible out of our forrest, the correct approach would be to run a randomized search over the various possible parameters, so as to select the forest type which performs best. One could even then re-do the search, wich narrower parameter ranges, to fine tune things furter.

The code below performs this search for the central only database. It will run for a very long time however, and the results will be updated into the notebook at a later date, since for our relatively sparsely featured and small data set, reducing the max features, max depth, or other parametes will not improve our score, or change the computational cost much. Nonetheless, it's a good exercise to carry out.



In [ ]:

    
import numpy as np

from time import time
from scipy.stats import randint as sp_randint

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

# build our random forest
clf = RandomForestRegressor(n_estimators=200)


# Utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")


# specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1,115),
              "min_samples_split": sp_randint(1, 20),
              "min_samples_leaf": sp_randint(1, 20),
              "bootstrap": [True, False],
              "criterion": ["mse", "mae"]} ##mean squared error, and mean absolute error

# run randomized search
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search)

start = time()
random_search.fit(Xc5,Yrf_priorRate_c)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time() - start), n_iter_search))
report(random_search.cv_results_)

	INJURY	SAFETY EQUIP.	ROAD CLASS	CYCLIST CRASH TYPE	AGE OF CYCLIST	STNAME1	STET 1 TYPE	STNAME 2	STET 2 TYPE	LONG	...	ACC TIME	ACC TIME2	TRAFFIC CONTROL	ROAD SURFACE2	DATE	DATE2	DRIVER ACTION	DRIVER CONDITION	YEAR	INTERSECTION
ID
1	Minimal	Unrecorded	Minor Arterial	Unrecorded	34.0	BIRCHMOUNT	RD	HIGHVIEW	AV	-79.26539	...	1000.0	10:00:00	stop sign	wet	1986/01/12	1986/01/12	Failed to yield right-of-way	Normal	1986	Birchmount Rd & Highview Ave, Toronto, Ontario...
2	Minimal	Unrecorded	Major Arterial	Unrecorded	54.0	LAKE SHORE	BLVD	THIRTY-FIFTH	ST	-79.53500	...	1000.0	10:00:00	no control	wet	1986/01/19	1986/01/19	Improper turn	Normal	1986	Lake Shore Blvd W & 35th St, Toronto, Ontario,...
3	Minor	Unrecorded	Major Arterial	Unrecorded	19.0	LAWRENCE	AV	FORTUNE	GATE	-79.21800	...	1500.0	15:00:00	no control	dry	1986/01/23	1986/01/23	Failed to yield right-of-way	Normal	1986	Lawrence Ave E & Fortune Gate, Toronto, Ontari...
4	Minimal	Unrecorded	Local	Unrecorded	34.0	EUCLID	AV	ULSTER	ST	-79.41330	...	800.0	08:00:00	no control	loose snow	1986/01/14	1986/01/14	Lost control	Normal	1986	Euclid Ave & Harbord St, Toronto, Ontario, M6G
5	Minimal	Unrecorded	Major Arterial	Unrecorded	34.0	AVENUE	AV	DRAYTON	AV	-79.32041	...	1700.0	17:00:00	traffic signal	dry	1986/01/30	1986/01/30	Other	Normal	1986	Danforth Ave & Drayton Ave, Toronto, Ontario, M4C

	intersec5	elev_id	classifi6	classifi7	num_elev	elevatio9	elevatio10	elev_level	elevation	elevatio13	height_r14	height_r15	x_coord	y_coord	longitude	latitude	objectid	intersection
int_id
13468990.0	Lakeshore Ave / Third St	27090.0	MNRSL	Minor-Single Level	1.0	501300.0	Minor	0.0	0.0	NaN	0.0	NaN	316596.462	4832242.321	-79.353614	43.632026	84477.0	3rd St & Lakeshore Ave, Toronto, Ontario, M5J
13466511.0	Princess St / Abbey Lane	16736.0	MNRSL	Minor-Single Level	1.0	501300.0	Minor	0.0	0.0	NaN	0.0	NaN	315523.959	4834358.610	-79.366863	43.651092	29444.0	Abbey Ln & Princess St, Toronto, Ontario, M5A
14014045.0	Abbott Ave / Ln W Dundas N Glenlake	21746.0	LSRSL	Lesser-Single Level	1.0	501700.0	Laneway	0.0	0.0	NaN	0.0	NaN	308440.142	4835412.777	-79.454671	43.660649	5606.0	Abbott Ave & Dundas St W, Toronto, Ontario, M6P
14014044.0	Abbott Ave / Ln W Dundas N Abbott	26354.0	LSRSL	Lesser-Single Level	1.0	501700.0	Laneway	0.0	0.0	NaN	0.0	NaN	308429.281	4835409.765	-79.454806	43.660622	5604.0	Abbott Ave & Dundas St W, Toronto, Ontario, M6P
14013942.0	Abbott Ave / Ln W Indian Rd N Abbott	28279.0	LSRSL	Lesser-Single Level	1.0	501700.0	Laneway	0.0	0.0	NaN	0.0	NaN	308179.207	4835371.356	-79.457907	43.660277	5591.0	Abbott Ave & Indian Rd, Toronto, Ontario, M6P

	Main	Midblock Route	Side 1 Route	Side 2 Route	Activation Date	Latitude	Longitude	Count Date	8HrPedVol	8HrVehVol	24HrPedVol	24HrVehVol	bike_prediction	coordinates
PX2
2	JARVIS ST	NaN	FRONT ST E	NaN	1948/11/15	43.649450	-79.371410	2011/09/08	17008	19335	34016	38670	1390.124396	(43.64945, -79.37141)
3	KING ST E	NaN	JARVIS ST	NaN	1950/08/23	43.650461	-79.371924	2011/09/07	37719	17665	75438	35330	1511.289334	(43.6504606, -79.3719239)
4	JARVIS ST	NaN	ADELAIDE ST E	NaN	1958/09/12	43.651534	-79.372360	2008/06/16	1991	19726	3982	39452	1109.962532	(43.6515337, -79.37236)
5	JARVIS ST	NaN	RICHMOND ST E	NaN	1962/04/21	43.652718	-79.372824	2009/07/30	2696	24842	5392	49684	1145.833262	(43.6527176, -79.372824)
6	JARVIS ST	NaN	QUEEN ST E	NaN	1928/08/24	43.653704	-79.373238	2011/05/18	3622	19772	7244	39544	1181.886259	(43.653704, -79.373238)

	intersec5	elev_id	classifi6	classifi7	num_elev	elevatio9	elevatio10	elev_level	elevation	elevatio13	...	x_coord	y_coord	longitude	latitude	objectid	intersection	coordinates	num_class	closest_traffic	traffic_count
int_id
13462813.0	Castle Frank Rd / Bloor St E	33572.0	MJRSL	Major-Single Level	1.0	501200.0	Major	0.0	0.0	NaN	...	315411.781	4836878.500	-79.368204	43.673775	12108.0	Bloor St E & Castle Frank Rd, Toronto, Ontario...	(43.673775492, -79.368204105)	1.0	(43.673792, -79.368187)	3413.0
14021473.0	Glen Morris St / Ln E Spadina N Harbord / Ln E...	16644.0	LSRSL	Lesser-Single Level	1.0	501700.0	Laneway	0.0	0.0	NaN	...	312720.746	4835791.622	-79.401591	43.664026	44814.0	Glen Morris St & Spadina Ave, Toronto, Ontario...	(43.664026103, -79.401591237)	3.0	(43.663051, -79.402196)	3574.0
13465137.0	Galbraith Rd / St George St	5198.0	MNRSL	Minor-Single Level	1.0	501300.0	Minor	0.0	0.0	NaN	...	313107.457	4835358.550	-79.396803	43.660124	32167.0	Galbraith Rd & St George St, Toronto, Ontario,...	(43.660123609, -79.396803266)	2.0	(43.657946, -79.39993)	4495.8
13465081.0	St George St / Russell St	28558.0	MNRSL	Minor-Single Level	1.0	501300.0	Minor	0.0	0.0	NaN	...	313096.575	4835394.417	-79.396938	43.660447	15644.0	Russell St & St George St, Toronto, Ontario, M5S	(43.660446592, -79.396937609)	2.0	(43.657946, -79.39993)	4495.8
13465147.0	Russell St / Huron St	17776.0	MNRSL	Minor-Single Level	1.0	501300.0	Minor	0.0	0.0	NaN	...	312953.652	4835353.068	-79.398710	43.660076	15652.0	Huron St & Russell St, Toronto, Ontario, M5S	(43.660076003, -79.398710203)	2.0	(43.657946, -79.39993)	4495.8

	normalized yearly accident rate	total collisions	traffic estimate
intersection
Queen St W & Spadina Ave, Toronto, Ontario, M5V	6.256e-06	74	1314
Bay St & Dundas St W, Toronto, Ontario, M5G	5.393e-06	73	1504
Queen St W & University Ave, Toronto, Ontario, M5H	5.315e-06	64	1338
Bloor St E & Yonge St, Toronto, Ontario, M4W	4.861e-06	63	1440
Bathurst St & Queen St W, Toronto, Ontario, M5T	5.334e-06	61	1271

	normalized yearly accident rate	total collisions	traffic estimate	posterior mean	posterior STD
intersection
Front St W & York St, Toronto, Ontario, M5J	1.1649e-05	29	276.61	1.19e-05	2.1863e-06
Queens Quay W & Rees St, Toronto, Ontario, M5J	7.1485e-06	9	139.89	7.6438e-06	2.464e-06
Bay St & Queens Quay W, Toronto, Ontario, M5J	7.3792e-06	19	286.09	7.6214e-06	1.7205e-06
Station St & York St, Toronto, Ontario, M5J	7.0922e-06	15	235	7.387e-06	1.8689e-06
Queens Quay W & York St, Toronto, Ontario, M5J	6.9128e-06	18	289.32	7.1523e-06	1.6574e-06
The Esplanade & Yonge St, Toronto, Ontario, M5E	6.734e-06	20	330	6.944e-06	1.5291e-06
Lake Shore Blvd W & York St, Toronto, Ontario, M5J	6.6194e-06	14	235	6.9142e-06	1.8081e-06
Front St W & John St, Toronto, Ontario, M5V	6.469e-06	19	326.34	6.6813e-06	1.5083e-06
Lake Shore Blvd E & Yonge St, Toronto, Ontario, M5E	6.1466e-06	13	235	6.4414e-06	1.7452e-06
Queen St W & Spadina Ave, Toronto, Ontario, M5V	6.2559e-06	74	1314.3	6.3087e-06	7.303e-07

	INJURY	SAFETY EQUIP.	ROAD CLASS	CYCLIST CRASH TYPE	AGE OF CYCLIST	STNAME1	STET 1 TYPE	STNAME 2	STET 2 TYPE	LONG	...	ACC TIME2	TRAFFIC CONTROL	ROAD SURFACE2	DATE	DATE2	DRIVER ACTION	DRIVER CONDITION	YEAR	INTERSECTION	coords
ID
3558	Minimal	No equipment available	Local	Unrecorded	11	ELEVENTH	ST	LAKE SHORE	BLVD	-79.509	...	17:00:00	no control	unrecorded	1988/06/27	1988/06/27	Driving properly	Normal	1988	11th St & Lake Shore Blvd W, Etobicoke, Ontari...	[43.59980079900049, -79.50971068699965]
499	Minimal	Unrecorded	Major Arterial	Unrecorded	19	ELEVENTH	ST	LAKE SHORE	BLVD	-79.51	...	15:00:00	stop sign	dry	1986/06/26	1986/06/26	Driving properly	Normal	1986	11th St & Lake Shore Blvd W, Etobicoke, Ontari...	[43.59980079900049, -79.50971068699965]
1529	Minor	Unrecorded	Major Arterial	Unrecorded	19	LAKE SHORE	BLVD	ELEVENTH	ST	-79.509	...	18:00:00	no control	dry	1987/04/07	1987/04/07	Failed to yield right-of-way	Normal	1987	11th St & Lake Shore Blvd W, Etobicoke, Ontari...	[43.59980079900049, -79.50971068699965]
7880	Minor	No equipment available	Major Arterial	Unrecorded	42	LAKE SHORE	BLVD	ELEVENTH	ST	-79.509	...	15:00:00	no control	unrecorded	1991/06/11	1991/06/11	Driving properly	Normal	1991	11th St & Lake Shore Blvd W, Etobicoke, Ontari...	[43.59980079900049, -79.50971068699965]
10049	Minor	Equipment not used but available	Local	Unrecorded	18	MORRISON	ST	ELEVENTH	ST	-79.509	...	21:40:00	no control	dry	1992/09/04	1992/09/04	Driving properly	Normal	1992	11th St & Morrison St, Etobicoke, Ontario, M8V	[43.59743007200046, -79.50873080099967]

	intersec5	elev_id	classifi6	classifi7	num_elev	elevatio9	elevatio10	elev_level	elevation	elevatio13	height_r14	height_r15	x_coord	y_coord	longitude	latitude	objectid	intersection
int_id
20041183	Allingham Gdns / Ln N Wilson W Collinson	29125	LSRML	Lesser-Multi Level	2	501700	Laneway	0	0	NaN	0	NaN	3.094e+05	4.8438e+06	-79.443	43.736	32028	Allingham Gdns & King High Ave, Toronto, Ontar...
14045679	Ln N Annette E Beresford / Ln S St Johns W Run...	17094	LSRML	Lesser-Multi Level	2	501700	Laneway	0	0	NaN	0	NaN	3.0633e+05	4.8354e+06	-79.481	43.66	44986	Annette St & Runnymede Rd, Toronto, Ontario, M6S
14665974	Byng Ave / Ln 2 N Newport W Byng	22281	LSRML	Lesser-Multi Level	1	501700	Laneway	0	0	NaN	0	NaN	3.2236e+05	4.8395e+06	-79.282	43.697	51026	Byng Ave & Prairie Dr, Toronto, Ontario, M1L
13465351	Durie St / Ln E Durie S St Johns / Ln E Winder...	7655	LSRML	Lesser-Multi Level	2	501700	Laneway	0	0	NaN	0	NaN	3.0618e+05	4.8353e+06	-79.483	43.66	32868	Annette St & Durie St, Toronto, Ontario, M6S
13459936	Luttrell Ave / Ln 2 S Danforth E Luttrell / Ln...	25170	LSRML	Lesser-Multi Level	2	501700	Laneway	0	0	NaN	0	NaN	3.2145e+05	4.8385e+06	-79.293	43.689	85370	Drummondville Ln & Flagman St, Toronto, Ontari...

	normalized yearly accident rate	total collisions	bike traffic estimate	scaled collisions	normalized_risk_factor	vehicle traffic	pedestrian traffic	fatal	major	minor	minimal/none
intersection
Queen St W & Spadina Ave, Toronto, Ontario, M5V	7.0407e-06	43	1305	4	6.5495e-07	22206	9791.5	0	2	15	26
Bathurst St & Bloor St W, Toronto, Ontario, M5R	6.5778e-06	41	1331.9	7	1.123e-06	22976	11309	0	1	20	20
Bathurst St & Queen St W, Toronto, Ontario, M5T	6.7557e-06	40	1265.2	6	1.0134e-06	17330	7338.1	0	0	18	22
Bay St & Dundas St W, Toronto, Ontario, M5G	5.264e-06	37	1501.9	4	5.6909e-07	20558	37431	0	0	13	24
Queen St W & University Ave, Toronto, Ontario, M5H	5.8533e-06	37	1350.7	6	9.4918e-07	29023	14508	0	1	7	29

	intersection	TRAFFIC CONTROL	count	trafficcontrol_num	control category
22146	bpNichol Ln & bpNichol Ln, Toronto, Ontario, M5S	no control	1	5	no control
22145	Zachary Ct & Zachary Ct, Toronto, Ontario, M6A	unrecorded	1	6	other/unrecorded
22144	York Ridge Rd & York Ridge Rd, Toronto, Ontari...	unrecorded	1	6	other/unrecorded
22143	York Mills Rd & York Ridge Rd, Toronto, Ontari...	unrecorded	1	6	other/unrecorded
22142	York Mills Rd & York Mills Rd, Toronto, Ontari...	unrecorded	1	6	other/unrecorded

	index	intersection	ROAD CLASS	count	roadclass_num	road_class_cat
0	21209	bpNichol Ln & bpNichol Ln, Toronto, Ontario, M5S	Major Arterial	1	2	Major Arterial
1	21208	Zachary Ct & Zachary Ct, Toronto, Ontario, M6A	Local	1	4	Local/Laneway
2	21207	York Ridge Rd & York Ridge Rd, Toronto, Ontari...	Minor Arterial	1	3	Minor Arterial
3	21206	York Mills Rd & York Ridge Rd, Toronto, Ontari...	Minor Arterial	1	3	Minor Arterial
4	21205	York Mills Rd & York Mills Rd, Toronto, Ontari...	Minor Arterial	1	3	Minor Arterial

	normalized yearly accident rate	total collisions	bike traffic estimate	scaled collisions	normalized_risk_factor	vehicle traffic	pedestrian traffic	fatal	major	minor	...	other/unrecorded	pedestrian crossover	stop/yield sign	traffic signal/gate	Collector/Expressway	Local/Laneway	Major Arterial	Minor Arterial	posterior mean rate	posterior STD
intersection
MacDonald-Cartier Fwy & Victoria Park Ave, Toronto, Ontario, M1R	0	9	1	1	0	35267	1	0	0	4	...	0	0	0	1	1	0	0	0	0.0019472	0.00064499
Reesor Rd & Steeles Ave E, Scarborough, Ontario, M1X, Scarborough, Ontario, M1X	0	3	1	0	0	1081	1	0	0	1	...	0	0	1	0	0	0	0	1	0.00066532	0.00037702
HWY-401, Mississauga, Ontario, L4W & Renforth Dr, Mississauga, Ontario, L4W	0	2	1	0	0	24049	1	0	2	0	...	0	0	0	0	0	0	1	0	0.00045168	0.00031064
Highway of Heroes & Victoria Park Ave, Toronto, Ontario, M1R	0	2	1	0	0	35267	1	0	0	2	...	0	0	0	0	1	0	0	0	0.00045168	0.00031064
Blackhurst Ct & Morningview Trl, Scarborough, Ontario, M1B, Scarborough, Ontario, M1B	0	1	1	0	0	1081	1	0	1	0	...	0	0	0	0	0	1	0	0	0.00023803	0.00022551

	normalized yearly accident rate	total collisions	bike traffic estimate	scaled collisions	normalized_risk_factor	vehicle traffic	pedestrian traffic	fatal	major	minor	...	other/unrecorded	pedestrian crossover	stop/yield sign	traffic signal/gate	Collector/Expressway	Local/Laneway	Major Arterial	Minor Arterial	posterior mean rate	posterior STD
intersection
Farm Greenway & Victoria Park Ave, Toronto, Ontario, M1R	0	0	0	0	0	35267	0	0	0	0	...	1	0	0	0	0	0	0	1	0.65002	2.7456
Bingley Rd & Valley Centre Dr, Toronto, Ontario, M1X	0	0	0	0	0	1081	0	0	0	0	...	1	0	0	0	0	0	0	1	0.65002	2.7456
Lake Shore Blvd W & Lakeshore British Columbia Rd, Toronto, Ontario, M6K	0	0	0	0	0	27298	0	0	0	0	...	1	0	0	0	0	0	0	1	0.65002	2.7456
Pharmacy Ave & Seabury Gate, Toronto, Ontario, M1T	0	0	0	0	0	35267	0	0	0	0	...	1	0	0	0	0	0	0	1	0.65002	2.7456
Foxcote Cres & Rangoon Rd, Toronto, Ontario, M9C	0	0	0	0	0	16435	0	0	0	0	...	1	0	0	0	0	0	0	1	0.65002	2.7456

	intersection	bike traffic estimate	vehicle traffic	pedestrian traffic	control_type_num	road_class_num
0	bpNichol Ln & bpNichol Ln, Toronto, Ontario, M5S	1216.4	11213	4767	5	3
1	Zachary Ct & Zachary Ct, Toronto, Ontario, M6A	1111.1	12798	2011	4	1
2	York Ridge Rd & York Ridge Rd, Toronto, Ontari...	783.49	13638	72	4	2
3	York Mills Rd & York Ridge Rd, Toronto, Ontari...	783.49	13638	72	4	2
4	York Mills Rd & York Mills Rd, Toronto, Ontari...	1106.1	25718	1926	4	2