In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import geocoder
import sklearn
from IPython.core.display import HTML
css = open('style-table.css').read() + open('style-notebook.css').read()
HTML('<style>{}</style>'.format(css))
Out[1]:
Note that we do not include whether or not an intersection has a bike lane in our list of features. While one can download a dataset which includes all locations of cycle tracks and bike lanes by street sections, many are fairly new. With the exception of certain cycle tracks which were built after 2014, there doesn't seem to be any easy way to date various bike lane additions. As such it wouldn't make sense to label our intersections with this information, as most bike lanes and cycle tracks are fairly new, and were not around during the period of time our dataset is built from.
From the bike counts above, we compared to vehicle and pedestrian counts from a sepearate data set. These are far more wide ranging, and if a relationship could be found, one could make predictions on bike traffic in all the missing points.
While there was unfortunately no obvious direct relationship between traffic numbers (something like a constant ratio, or a linearly increasing one), we do get an acceptable power fit when comparing the bike-to-pedestrian traffic ratio, once we filter out a few areas. The University of Toronto campus has a much higher B/P ratio than any other avaialble bike count locations. This isn't suprising, as a student there I can confirm it's the most popular place to see cyclists in the city. The other area is the waterfront right by financial and entertainment districts. These points had a far lower traffic ratio than given by this fit. So we'll treat them as seperate special cases, mapping the available bike counts directly to any nearby collisions, as long as they fit within a 'box' that we'll use to define those areas.
Next we needed to clean and filter our dataset. Cleaning mostly invovled searching for NaNs or mixed up entries, as it prevents us from easily manipulating our dataframe. Once we have things roughly cleaned up, we want to filter our collisions. The bike count to pedestrian count relationship we had was based on points that were overwhelminghly found in central Toronto; downtown and the surrounding neighborhoods. So we need some way to do this
In [2]:
col_df = pd.DataFrame.from_csv('collisions_clean.csv', index_col='ID')
col_df=col_df.sort_index()
col_df.head()
Out[2]:
The above collisions were all geocoded based on the crosstreets listed, rather than the GPS coordinates, since there were many coordinates that just were defaulted to the center of Toronto. Nonetheless, there were still a few "impossible" intersections, that were then geocded in a second pass using their coordinates.
We also use a list of ALL the intersections in the city, and geocode those, using the GPS coordinates. These also required a second pass using a different geocoder on about 10 or so entries. Later we'll cross reference with the collisions database and drop all the duplicates. This way we can keep track of the "zeroes". That is, intersections which never had a collision in those 25 years.
Below is an example of how this was done, for the entire list of intersections in the city. We don't recommened running this, as it'll take a few hours.
In [ ]:
#our datasetfrom the Toronto website of all intersections
all_real_df = pd.DataFrame.from_csv('all_intersections_real.csv', index_col='int_id')
#Get the coordinates out, put them as a list, rather than iterating over the dataframe itself.
coords2 = all_real_df[['latitude','longitude']]
coords2 = coords2.replace(np.nan,0)
coordinate_list = coords2.values.tolist()
import geocoder
import csv
geo = []
idx = 0
for pair in coordinate_list:
g = geocoder.arcgis(pair, method='reverse')
geo.append(g.address)
print(idx, '', geo[idx])
idx += 1
myfile = open('centerline_coords_int.csv', 'w')
wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
wr.writerow(geo)
all_real_df['arcgis_int'] = geo
In [3]:
import geocoder
#Example of what geocoder output looks like, for two different locations. The API calls return information on the locality in json format
latlng = [43.6532,-79.3828]
g = geocoder.arcgis('Bay St & Bloor St, Toronto', method='geocode')
f= geocoder.arcgis(latlng, method='reverse')
print(g.address)
print(f.address)
Once this is done, we then add this "intersection" column to our dataframe, and selected all collisions that had a postal code beginning with one of those associated with the areas roughly east of the Junction, West of carlaw, and south of Dupont, with the neigbourhoods near Exhibition place also thrown out. This isn't a perfect system by any means, but we have no reason to believe that our nice power-fit relationship will hold outside the areas we used. In fact, outside central Toronto, it seems likely that it wouldn't hold, as most cyclists are typically found closer to the city center.
The SQl query to do this is:
Did it in SQL originally since exploratory data analysis was done using Mode Analytic's platform.
In [4]:
central_realintersections_df = pd.DataFrame.from_csv('central_real_intersections.csv')
central_realintersections_df= central_realintersections_df.rename(columns = {'arcgis_int':'intersection'})
col_df = pd.DataFrame.from_csv('toronto_cycling_central_mode.csv', index_col='id')
col_df = col_df.sort_index()
central_col_df = col_df
from numpy import random
from scipy.spatial import distance
import matplotlib.path as mplPath
import numpy as np
#We want to follow a standard intersection convention to make our life easier.
#Unfortunately the free geocoder with unlimited request numbers per day doesn't see mto follow one of it's own.
#So we will get the intersections, strip out the two streets, and order them alphabetically.
st2 = col_df['intersection'].str.split('&').str.get(1)
st2 =st2.str.split(', Toronto').str[0].str.strip()
post = col_df['intersection'].str.split('&').str.get(1)
post = post.str.partition(', ')[2].str.strip()
st1 = col_df['intersection'].str.split('&').str[0]
st1 = st1.str.strip()
intersection_list = []
streets = pd.concat([st1,st2,post],axis=1)
streets = streets.values.tolist()
streets
for pair in streets:
if isinstance(pair[0],str):
if pair[1] <= pair[0]:
temp = pair[0]
pair[0] = pair[1]
pair[1] = temp
for pair2 in streets:
intersection = str(pair2[0]) + ' & ' + str(pair2[1] + ', ' +str(pair2[2]))
intersection_list.append(intersection)
col_df['intersection'] = intersection_list
st2 = central_realintersections_df['intersection'].str.split('&').str.get(1)
st2 =st2.str.split(', Toronto').str[0].str.strip()
post = central_realintersections_df['intersection'].str.split('&').str.get(1)
post = post.str.partition(', ')[2].str.strip()
st1 = central_realintersections_df['intersection'].str.split('&').str[0]
st1 = st1.str.strip()
intersection_list = []
streets = pd.concat([st1,st2,post],axis=1)
streets = streets.values.tolist()
streets
for pair in streets:
if isinstance(pair[0],str):
if pair[1] <= pair[0]:
temp = pair[0]
pair[0] = pair[1]
pair[1] = temp
for pair2 in streets:
intersection = str(pair2[0]) + ' & ' + str(pair2[1] + ', ' +str(pair2[2]))
intersection_list.append(intersection)
central_realintersections_df['intersection'] = intersection_list
central_realintersections_df.head()
Out[4]:
Once we have our naming convention sorted, we can start by trying to predict the Collision rates naively, without any machine learning. What do the raw numbers tell us?
We proceed by mapping the pedestrian data to our collisions and intersections, so we have a Bike traffic estimate. Rememebering that we have a few areas that we're treating apart, we have our boxes so we can check if points fall inside those areas.
In [5]:
ped_counts_df = pd.DataFrame.from_csv('Vehicle and Pedestrian Counts/TrafficPedestrianVolumes_2011.csv')
#Using the Power fit for the Bike/Pedestrian ratio, we get a function that predicts the bike numbers at any one
#intersection.
ped_counts_df['bike_prediction'] = (500.2146799711*ped_counts_df['8HrPedVol']**(-0.8950759596))*ped_counts_df['8HrPedVol']
ped_coords = ped_counts_df[['Latitude','Longitude']]
ped_coords = ped_coords.replace(np.nan,0)
ped_coordinate_list = ped_coords.values.tolist()
ped_counts_df['coordinates'] = ped_counts_df[['Latitude','Longitude']].apply(tuple, axis=1)
ped_counts_df.head()
Out[5]:
In [6]:
ped_dict = ped_counts_df.set_index('coordinates').to_dict()['bike_prediction']
col_df['coordinates'] = col_df[['lat','long']].apply(tuple, axis=1)
col_df.head()
central_realintersections_df['coordinates'] = central_realintersections_df[['latitude','longitude']].apply(tuple, axis=1)
equiv = {'Laneway':3,'Minor':2,'Major':1}
central_realintersections_df['num_class'] = central_realintersections_df['elevatio10'].map(equiv)
central_realintersections_df= central_realintersections_df.sort_values(by=['intersection','num_class'])
#take the lowest numerical class, drop the other duplicates.
central_realintersections_df = central_realintersections_df.drop_duplicates(subset='intersection', keep='first')
In [7]:
waterfront_Path = mplPath.Path(np.array([[43.635497, -79.398156],
[43.639000, -79.400725],
[43.640822, -79.401427],
[43.646984, -79.376977],
[43.649889, -79.370343],
[43.651614, -79.362725],
[43.648090, -79.361191],
[43.646451, -79.361937],
[43.641209, -79.376739],
[43.639969, -79.379965],
[43.637698, -79.391847],
[43.635666, -79.398368],
[43.636489, -79.399603]]))
campus_Path = mplPath.Path(np.array([[43.659838, -79.399772],
[43.661388, -79.401006],
[43.665592, -79.402705],
[43.666768, -79.401354],
[43.668213, -79.393958],
[43.663141, -79.392719],
[43.659264, -79.394100],
[43.658329, -79.398204]]
))
castleFrank_Path = mplPath.Path(np.array([[43.672105, -79.376696],
[43.671562, -79.370962],
[43.674418, -79.366821],
[43.676086, -79.358731],
[43.677056, -79.354021],
[43.677040, -79.355126],
[43.677622, -79.358516],
[43.676194, -79.359503],
[43.675170, -79.364760],
[43.674580, -79.367539],
[43.672019, -79.371112],
[43.672710, -79.376927]]
))
campus = [[43.657946,-79.39993],
[43.663502,-79.40005],
[43.663051,-79.402196],
[43.665429,-79.398975]
]
campus_dict = {(43.657946,-79.39993):4495.8,
(43.663502,-79.400050):2653,
(43.663051,-79.402196):3574,
(43.665429,-79.398975):2304
}
waterfront = [[43.648208,-79.370923],
[43.642711,-79.394043],
[43.639944,-79.387032],
[43.640625,-79.3932],
[43.640093,-79.380152]
]
waterfront_dict = {(43.648208,-79.370923):330,
(43.642711,-79.394043):745,
(43.639944,-79.387032):128,
(43.640625,-79.3932):154,
(43.640093,-79.380152):235
}
castleFrank = [[43.673792,-79.368187]]
castleFrank_dict = {(43.673792,-79.368187):3413}
Here we have defined our boxes, and our pedestrian traffic dictionary. Now we go through the data sets, check if the point belongs in one of our boxes. If it does, we'll map the closest traffic count in the box. Otherwise, we find the closest intersection in the Pedestrian dataframe, and use the predicted bike traffic number from our fit
In [8]:
import csv
closest_traffic_point = []
bike_traffic = []
i = 0
for i in range(0,len(central_col_df)):
point = central_col_df['coordinates'].iloc[i]
if waterfront_Path.contains_point(point):
closest = waterfront[distance.cdist([point], waterfront).argmin()]
closest_traffic_point.append(tuple(closest))
bike_traffic.append(waterfront_dict[tuple(closest)])
elif campus_Path.contains_point(point):
closest = campus[distance.cdist([point], campus).argmin()]
closest_traffic_point.append(tuple(closest))
bike_traffic.append(campus_dict[tuple(closest)])
elif castleFrank_Path.contains_point(point):
closest = castleFrank[distance.cdist([point], castleFrank).argmin()]
closest_traffic_point.append(tuple(closest))
bike_traffic.append(castleFrank_dict[tuple(closest)])
else:
closest = ped_coordinate_list[distance.cdist([point], ped_coordinate_list).argmin()]
closest_traffic_point.append(tuple(closest))
bike_traffic.append(ped_dict[tuple(closest)])
myfile3 = open('closest_intersection.csv', 'w')
wr = csv.writer(myfile3)
wr.writerow(closest_traffic_point)
myfile3.close()
myfile4 = open('closest_int_bike_predictions.csv', 'w')
wr = csv.writer(myfile4)
wr.writerow(bike_traffic)
myfile4.close()
In [9]:
central_col_df['closest_traffic'] = tuple(closest_traffic_point)
central_col_df['traffic_count'] = bike_traffic
central_col_df.rename(columns={'closest_traffic': 'closest_ped_count', 'traffic_count':'predicted_bike_count'})
central_col_df.head()
Out[9]:
In [10]:
import csv
closest_traffic_point = []
bike_traffic = []
i = 0
for i in range(0,len(central_realintersections_df)):
point = central_realintersections_df['coordinates'].iloc[i]
if waterfront_Path.contains_point(point):
closest = waterfront[distance.cdist([point], waterfront).argmin()]
closest_traffic_point.append(tuple(closest))
bike_traffic.append(waterfront_dict[tuple(closest)])
elif campus_Path.contains_point(point):
closest = campus[distance.cdist([point], campus).argmin()]
closest_traffic_point.append(tuple(closest))
bike_traffic.append(campus_dict[tuple(closest)])
elif castleFrank_Path.contains_point(point):
closest = castleFrank[distance.cdist([point], castleFrank).argmin()]
closest_traffic_point.append(tuple(closest))
bike_traffic.append(castleFrank_dict[tuple(closest)])
else:
closest = ped_coordinate_list[distance.cdist([point], ped_coordinate_list).argmin()]
closest_traffic_point.append(tuple(closest))
bike_traffic.append(ped_dict[tuple(closest)])
myfile5 = open('closest_intersection_centreline.csv', 'w')
wr = csv.writer(myfile5)
wr.writerow(closest_traffic_point)
myfile5.close()
myfile6 = open('closest_int_bike_predictions_centreline.csv', 'w')
wr = csv.writer(myfile6)
wr.writerow(bike_traffic)
myfile6.close()
central_realintersections_df['closest_traffic'] = tuple(closest_traffic_point)
central_realintersections_df['traffic_count'] = bike_traffic
central_realintersections_df.rename(columns={'closest_traffic': 'closest_ped_count', 'traffic_count':'predicted_bike_count'})
central_realintersections_df.sort_values(by='traffic_count').tail()
Out[10]:
So now we have traffic numbers for our entire intersection dataframe, as well as our dataframe of collisions. Now let's count up our collisions (including the zeroes), and order them.
To make things cleaner, we'll strip away the columns that aren't helpful to us at this time. Once we group the collision dataframe by intersection, we can concat that with the dataframe of all intersections (now call zeroes_df).
Once it's sorted by our normalized accident rate, we can drop the non-zero intersections from our merged dataframe by just dropping the duplicated indexes. Then we should be left with our collisions, properly grouped, and the "zero" intersections.
In [11]:
central_realint_df = central_realintersections_df[['intersection','coordinates','closest_traffic','traffic_count','elevatio10']]
central_realint_df.sort_values(by='traffic_count')
result = pd.merge(central_col_df,central_realint_df, how ='outer',on='intersection')
result = result.sort_values(by='intersection')
traff2 =result['traffic_count_y']
result['traffic_count_x'] =result['traffic_count_x'].fillna(value = traff2 )
In [12]:
result.to_csv('merged_data.csv')
intersection_df = central_col_df.groupby('intersection').agg({'traffic_count':[np.size,np.mean]}).sort_values(by=[('traffic_count','size')],ascending=False)
intersection_df.columns = intersection_df.columns.set_levels(['traffic_count'], level=0)
zeroes_df = central_realint_df.groupby('intersection').agg({'traffic_count':[np.size,np.mean]}).sort_values(by=[('traffic_count','size')],ascending=False)
zeroes_df['traffic_count','size'] = 0
#Let's use 360 as the days in the year. It's a nice round number. And since weekends are less busy it's probably still an overestimate.
zeroes_df['traffic_count','normalized_accident_rate'] = zeroes_df['traffic_count','size']/(25*360*zeroes_df['traffic_count','mean'])
zeroes_df = zeroes_df.sort_values(by=[('traffic_count','normalized_accident_rate')],ascending=False)
intersection_df['traffic_count','normalized_accident_rate'] = intersection_df['traffic_count','size']/(25*360*intersection_df['traffic_count','mean'])
intersection_df = intersection_df.sort_values(by=[('traffic_count','normalized_accident_rate')],ascending=False)
intersection_df.head(20)
totals_df = pd.concat([intersection_df,zeroes_df])
df_test = totals_df['traffic_count','normalized_accident_rate']
df = df_test.to_frame(name='normalized yearly accident rate')
df['total collisions'] = totals_df['traffic_count','size']
df['traffic estimate'] = totals_df['traffic_count','mean']
#df = df.dropna()
#Here we make the index a column, drop the duplicates from this intersection column, then make it the index again.
df = df.reset_index().drop_duplicates(subset='intersection', keep='first').set_index('intersection')
len(df)
Out[12]:
Now we can take a look at the dataframe we've created. The one we'll use for our machine learning approach will be roughly similar, but instead we'll have carried over the information on the type of intersection, and the type of control (stop sign, yield, no control, traffic lights etc). These will be part of the feature set we use.
For now though, we'll just proceed via Emperical Baysian analysis. We want to account for the fact that collisions are very rare events. Without a very long period of time, it would be unwise to assume that our sample distribution properly matches the actual probability of collisions.
In fact we can see that there is a strong "break" away from roughly linear spread when we plot the accident yearly rate against the total collisions.
In [13]:
pd.set_option('display.float_format', '{:.4g}'.format)
df.sort_values(by='total collisions', ascending = False).head()
Out[13]:
In [14]:
df.loc['Jarvis St & King St E, Toronto, Ontario, M5A']
Out[14]:
In [15]:
sns.set_context("notebook", font_scale=1.1)
scatter = sns.jointplot(x='total collisions', y = 'normalized yearly accident rate', data = df, ylim= (0,0.00002))
So we'll want to create a prior estimate on what the distribution actually is, using the samples that we have available. In a sense, we're using the data twice. First to estimate our prior. Then we'll use it along with this prior to create a posterior distribution.
There is a lot of discussion revolving around Full Bayesian analysis, versus Empirical Bayesian Analysis. In some sense, if you know enough about the system you are studying, you should be able to develop a good prior beforehand, rather than using your limited dataset to do so. On the otherhand, imposing beliefs about your system beforehand can be equally controversial to depending on the data twice.
In [16]:
df.to_csv('totals_test.csv')
import scipy.stats as stats
sns.distplot(df["traffic estimate"], bins = 100)
Out[16]:
In [17]:
sns.distplot(df["total collisions"], bins = 15)
Out[17]:
In [18]:
sns.distplot(df["total collisions"], fit=stats.gamma, bins=range(0, 75, 1), kde=False)
Out[18]:
A gamma distribution looks like a good choice to fit our prior. We know that we are dealing with sums of poission distributions where each intersection is basically a poisson process, with the mean accidents for a year being the parameter. So the above histogram is really summing up the number of accidents each year, over these poisson processes. While we could try and fit each intersection individually, this is a whole lot of work. And it's not clear how one would use these estimated parameters to compute the prior distribution for the ensemble.
A gamma distribution looks nice, and intuitively makes sense since the prior for a poisson distribution is a gamma. Trying to read up on the intuition for how a ensemble of Poisson processes, each with their own paramater (not merely the same), has lead me to some very involved work on Bayesian inference for traffic problems, without directly using machine learning. We won't go that deep in this case though. Suffice to say, there are extensions to the above Poisson model, which attempt to account for the heterogenity of various intersections. The same features one would choose to use in a machine learning approach would show up as parameters in the distribution one tries to fit, and the poisson parameters for each intersection is assumed to be a random variablle, drawn from a chosen distribution
An overview of 3 models which extend our basic method is found in (Miranda-Mereno et al 2005): http://www.civil.uwaterloo.ca/itss/papers%5C2005-2%20(Alternative%20risk%20models%20for%20ranking%20sites).pdf
In [19]:
stats.gamma.fit(df['total collisions'], loc =0)
Out[19]:
In [20]:
import matplotlib.pylab as plt
#plot normed histogram
plt.hist(df['total collisions'], normed=True, bins=range(0, 75, 1))
# find minimum and maximum of xticks, so we know
# where we should compute theoretical distribution
xt = plt.xticks()[0]
xmin, xmax = 0.1, 75
lnspc = np.linspace(xmin, xmax, len(df))
# exactly same as above
ag,bg,cg = stats.gamma.fit(df['total collisions'], loc=0)
pdf_gamma = stats.gamma.pdf(lnspc,ag, bg,cg)
plt.plot(lnspc, pdf_gamma, label="Gamma")
plt.axis([0,20,0,0.6])
plt.show()
In [21]:
#plot normed histogram
plt.hist(df['total collisions'], normed=True, bins=range(0, 75, 1))
# find minimum and maximum of xticks, so we know
# where we should compute theoretical distribution
xt = plt.xticks()[0]
xmin, xmax = 0.1, 75
lnspc = np.linspace(xmin, xmax, len(df))
# exactly same as above
ag,bg,cg = stats.gamma.fit(df['total collisions'])
pdf_gamma = stats.gamma.pdf(lnspc,ag, bg,cg)
plt.plot(lnspc, pdf_gamma, label="Gamma")
plt.xlabel('total collisions')
plt.ylabel('normalized count')
plt.axis([0,70,0,0.08])
plt.show()
That's a pretty good looking fit. The histogram shows the normalized counts of integer collision totals for our merged dataframe. We can now use the fit paramters of our model, and use this to adjust our estimate of the accident rate.
The mathematics of why the calculation below works to update our estiamte is straight forward to derive from Bayesian probablity theory, but intuitively it just looks likes a weighted average of the prior and sample estimates (when expaneded), where more weight is given to the sample data when the time period is larger, or when the traffic numbers are greater. This makes sense, as in those cases we are MORE confident in our data providing us with a good idea of probability distibution for that intersection.
In [22]:
beta = 1/cg
df['posterior mean'] = (ag + df['total collisions'])/(beta+25*360*df['traffic estimate'])
df['posterior STD'] = np.sqrt((ag + df['total collisions'])/((beta+25*360*df['traffic estimate'])**2))
pd.set_option('display.float_format', '{:.5g}'.format)
df.sort_values(by='posterior mean',ascending = False).head(10)
Out[22]:
In [23]:
df = df.sort_values(by='total collisions', ascending=False)
In [24]:
df.head()
Out[24]:
In [25]:
df.to_csv('eba_normalizedcollisions.csv')
So now we have our dataframe with posterior yearly accident rate, as well as a posterior standard deviation. If one looks deeper throughout the data set we see that the order changes slightly, but nothing too crazy. Since we're using so many years of data, the "prior" dominates. The biggest changes are for those intersections that have a very low traffic count, comparatively. Those with high traffic counts have smaller changes.
In [26]:
scatter = sns.jointplot(x='total collisions', y = 'posterior mean', data = df, ylim= (0,0.00002))
It is at this point that we may start wondering whether using ALL 25 years of data makes sense. If the predictions are to be for the past year, then surely data closer to this time frame is more applicable. We saw in our exploratory data analysis that there was a total decline in accidents year by year up until 1996. There was a sharp decline in minimal and minor injuries that year, with the slack partly picked up by more "No injuries". Whether this was due to a change in how accidents were classified, or due to the legislation mandating minors to use helmets is hard to say. Likely it could be both!
After this year however, cycling accidents pick up again. and start looking fairly consistent. Expecially as compared to the first dozen years.
Moving forward, we can compare the predicted collision numbers to those observed over the past year, from the Twitterbot. We will have a trained model to also compare prediction numbers. One hopes that the increased sophistication of the machine learning approach, using the intersection features which are available, will get closer to the observed data.
In [27]:
full_df = pd.DataFrame.from_csv('collisions_clean.csv', index_col='ID')
In [28]:
fullgeocoded_df = pd.DataFrame.from_csv('collisions_geocode.csv', index_col ='ID')
In [29]:
fullgeocoded_df.head()
Out[29]:
In [30]:
fullgeocoded_df= fullgeocoded_df.rename(columns = {'INTERSECTION':'intersection'})
col_df = fullgeocoded_df.sort_index()
all_realintersections_df = pd.DataFrame.from_csv('real_reversegeocoded.csv')
all_realintersections_df= all_realintersections_df.rename(columns = {'arcgis_int':'intersection'})
from numpy import random
from scipy.spatial import distance
import matplotlib.path as mplPath
import numpy as np
#We want to follow a standard intersection convention to make our life easier.
#Unfortunately the free geocoder with unlimited request numbers per day doesn't see mto follow one of it's own.
#So we will get the intersections, strip out the two streets, and order them alphabetically.
st2 = col_df['intersection'].str.split('&').str.get(1)
st2 =st2.str.split(', Toronto').str[0].str.strip()
post = col_df['intersection'].str.split('&').str.get(1)
post = post.str.partition(', ')[2].str.strip()
st1 = col_df['intersection'].str.split('&').str[0]
st1 = st1.str.strip()
intersection_list = []
streets = pd.concat([st1,st2,post],axis=1)
streets = streets.values.tolist()
streets
for pair in streets:
if isinstance(pair[0],str) and isinstance(pair[1],str):
if pair[1] <= pair[0]:
temp = pair[0]
pair[0] = pair[1]
pair[1] = temp
for pair2 in streets:
intersection = str(pair2[0]) + ' & ' + str(pair2[1]) + ', ' +str(pair2[2])
intersection_list.append(intersection)
col_df['intersection'] = intersection_list
st2 = all_realintersections_df['intersection'].str.split('&').str.get(1)
st2 =st2.str.split(', Toronto').str[0].str.strip()
post = all_realintersections_df['intersection'].str.split('&').str.get(1)
post = post.str.partition(', ')[2].str.strip()
st1 = all_realintersections_df['intersection'].str.split('&').str[0]
st1 = st1.str.strip()
intersection_list = []
streets = pd.concat([st1,st2,post],axis=1)
streets = streets.values.tolist()
streets
for pair in streets:
if isinstance(pair[0],str):
if pair[1] <= pair[0]:
temp = pair[0]
pair[0] = pair[1]
pair[1] = temp
for pair2 in streets:
intersection = str(pair2[0]) + ' & ' + str(pair2[1] + ', ' +str(pair2[2]))
intersection_list.append(intersection)
all_realintersections_df['intersection'] = intersection_list
all_realintersections_df.head()
Out[30]:
In [31]:
central_realintersections_df = pd.DataFrame.from_csv('central_real_intersections.csv')
central_realintersections_df= all_realintersections_df.rename(columns = {'arcgis_int':'intersection'})
central_col_df = pd.DataFrame.from_csv('toronto_cycling_central_mode.csv', index_col='id')
from numpy import random
from scipy.spatial import distance
import matplotlib.path as mplPath
import numpy as np
#We want to follow a standard intersection convention to make our life easier.
#Unfortunately the free geocoder with unlimited request numbers per day doesn't see mto follow one of it's own.
#So we will get the intersections, strip out the two streets, and order them alphabetically.
st2 = central_col_df['intersection'].str.split('&').str.get(1)
st2 =st2.str.split(', Toronto').str[0].str.strip()
post = central_col_df['intersection'].str.split('&').str.get(1)
post = post.str.partition(', ')[2].str.strip()
st1 = central_col_df['intersection'].str.split('&').str[0]
st1 = st1.str.strip()
intersection_list = []
streets = pd.concat([st1,st2,post],axis=1)
streets = streets.values.tolist()
streets
for pair in streets:
if isinstance(pair[0],str) and isinstance(pair[1],str):
if pair[1] <= pair[0]:
temp = pair[0]
pair[0] = pair[1]
pair[1] = temp
for pair2 in streets:
intersection = str(pair2[0]) + ' & ' + str(pair2[1]) + ', ' +str(pair2[2])
intersection_list.append(intersection)
central_col_df['intersection'] = intersection_list
st2 = central_realintersections_df['intersection'].str.split('&').str.get(1)
st2 =st2.str.split(', Toronto').str[0].str.strip()
post = central_realintersections_df['intersection'].str.split('&').str.get(1)
post = post.str.partition(', ')[2].str.strip()
st1 = central_realintersections_df['intersection'].str.split('&').str[0]
st1 = st1.str.strip()
intersection_list = []
streets = pd.concat([st1,st2,post],axis=1)
streets = streets.values.tolist()
streets
for pair in streets:
if isinstance(pair[0],str):
if pair[1] <= pair[0]:
temp = pair[0]
pair[0] = pair[1]
pair[1] = temp
for pair2 in streets:
intersection = str(pair2[0]) + ' & ' + str(pair2[1] + ', ' +str(pair2[2]))
intersection_list.append(intersection)
central_realintersections_df['intersection'] = intersection_list
central_realintersections_df.head()
Out[31]:
In [32]:
ped_counts_df = pd.DataFrame.from_csv('Vehicle and Pedestrian Counts/TrafficPedestrianVolumes_2011.csv')
#Using the Power fit for the Bike/Pedestrian ratio, we get a function that predicts the bike numbers at any one intersection.
ped_counts_df['bike_prediction'] = (500.2146799711*ped_counts_df['8HrPedVol']**(-0.8950759596))*ped_counts_df['8HrPedVol']
ped_coords = ped_counts_df[['Latitude','Longitude']]
ped_coords = ped_coords.replace(np.nan,0)
ped_coordinate_list = ped_coords.values.tolist()
ped_counts_df['coordinates'] = ped_counts_df[['Latitude','Longitude']].apply(tuple, axis=1)
ped_counts_df.head()
Out[32]:
In [33]:
bike_dict = ped_counts_df.set_index('coordinates').to_dict()['bike_prediction']
ped_dict = ped_counts_df.set_index('coordinates').to_dict()['8HrPedVol']
veh_dict = ped_counts_df.set_index('coordinates').to_dict()['8HrVehVol']
col_df['coordinates'] = col_df[['LAT','LONG']].apply(tuple, axis=1)
col_df.head()
all_realintersections_df['coordinates'] = all_realintersections_df[['latitude','longitude']].apply(tuple, axis=1)
equiv = {'Laneway':3,'Minor':2,'Major':1}
all_realintersections_df['num_class'] = all_realintersections_df['elevatio10'].map(equiv)
all_realintersections_df= all_realintersections_df.sort_values(by=['intersection','num_class'])
#take the lowest numerical class, drop the other duplicates.
all_realintersections_df = all_realintersections_df.drop_duplicates(subset='intersection', keep='first')
central_col_df['coordinates'] = central_col_df[['lat','long']].apply(tuple, axis=1)
central_col_df.head()
central_realintersections_df['coordinates'] = central_realintersections_df[['latitude','longitude']].apply(tuple, axis=1)
equiv = {'Laneway':3,'Minor':2,'Major':1}
central_realintersections_df['num_class'] = central_realintersections_df['elevatio10'].map(equiv)
central_realintersections_df= central_realintersections_df.sort_values(by=['intersection','num_class'])
#take the lowest numerical class, drop the other duplicates.
central_realintersections_df = central_realintersections_df.drop_duplicates(subset='intersection', keep='first')
In [34]:
import csv
closest_traffic_point = []
bike_traffic = []
veh_traffic = []
ped_traffic = []
i = 0
for i in range(0,len(col_df)):
point = col_df['coordinates'].iloc[i]
if waterfront_Path.contains_point(point):
closest = waterfront[distance.cdist([point], waterfront).argmin()]
closest_traffic_point.append(tuple(closest))
bike_traffic.append(waterfront_dict[tuple(closest)])
elif campus_Path.contains_point(point):
closest = campus[distance.cdist([point], campus).argmin()]
closest_traffic_point.append(tuple(closest))
bike_traffic.append(campus_dict[tuple(closest)])
elif castleFrank_Path.contains_point(point):
closest = castleFrank[distance.cdist([point], castleFrank).argmin()]
closest_traffic_point.append(tuple(closest))
bike_traffic.append(castleFrank_dict[tuple(closest)])
else:
closest = ped_coordinate_list[distance.cdist([point], ped_coordinate_list).argmin()]
closest_traffic_point.append(tuple(closest))
bike_traffic.append(bike_dict[tuple(closest)])
closest = ped_coordinate_list[distance.cdist([point], ped_coordinate_list).argmin()]
veh_traffic.append(veh_dict[tuple(closest)])
ped_traffic.append(ped_dict[tuple(closest)])
myfile3 = open('closest_intersection.csv', 'w')
wr = csv.writer(myfile3)
wr.writerow(closest_traffic_point)
myfile3.close()
myfile4 = open('closest_int_bike_predictions.csv', 'w')
wr = csv.writer(myfile4)
wr.writerow(bike_traffic)
myfile4.close()
col_df['closest_traffic'] = tuple(closest_traffic_point)
col_df['traffic_count'] = bike_traffic
col_df['vehicle_traffic'] = veh_traffic
col_df['pedestrian_traffic'] = ped_traffic
col_df= col_df.rename(columns={'closest_traffic': 'closest_ped_count', 'traffic_count': 'predicted_bike_count'})
In [35]:
import csv
closest_traffic_point = []
bike_traffic = []
veh_traffic = []
ped_traffic = []
i = 0
for i in range(0,len(central_col_df)):
point = central_col_df['coordinates'].iloc[i]
if waterfront_Path.contains_point(point):
closest = waterfront[distance.cdist([point], waterfront).argmin()]
closest_traffic_point.append(tuple(closest))
bike_traffic.append(waterfront_dict[tuple(closest)])
elif campus_Path.contains_point(point):
closest = campus[distance.cdist([point], campus).argmin()]
closest_traffic_point.append(tuple(closest))
bike_traffic.append(campus_dict[tuple(closest)])
elif castleFrank_Path.contains_point(point):
closest = castleFrank[distance.cdist([point], castleFrank).argmin()]
closest_traffic_point.append(tuple(closest))
bike_traffic.append(castleFrank_dict[tuple(closest)])
else:
closest = ped_coordinate_list[distance.cdist([point], ped_coordinate_list).argmin()]
closest_traffic_point.append(tuple(closest))
bike_traffic.append(bike_dict[tuple(closest)])
closest = ped_coordinate_list[distance.cdist([point], ped_coordinate_list).argmin()]
veh_traffic.append(veh_dict[tuple(closest)])
ped_traffic.append(ped_dict[tuple(closest)])
myfile3 = open('central_closest_intersection.csv', 'w')
wr = csv.writer(myfile3)
wr.writerow(closest_traffic_point)
myfile3.close()
myfile4 = open('central_closest_int_bike_predictions.csv', 'w')
wr = csv.writer(myfile4)
wr.writerow(bike_traffic)
myfile4.close()
central_col_df['closest_traffic'] = tuple(closest_traffic_point)
central_col_df['traffic_count'] = bike_traffic
central_col_df['vehicle_traffic'] = veh_traffic
central_col_df['pedestrian_traffic'] = ped_traffic
central_col_df= central_col_df.rename(columns={'closest_traffic': 'closest_ped_count', 'traffic_count':'predicted_bike_count'})
In [36]:
central_col_df.columns.values
Out[36]:
In [37]:
import csv
closest_traffic_point = []
bike_traffic = []
veh_traffic = []
ped_traffic = []
i = 0
for i in range(0,len(all_realintersections_df)):
point = all_realintersections_df['coordinates'].iloc[i]
if waterfront_Path.contains_point(point):
closest = waterfront[distance.cdist([point], waterfront).argmin()]
closest_traffic_point.append(tuple(closest))
bike_traffic.append(waterfront_dict[tuple(closest)])
elif campus_Path.contains_point(point):
closest = campus[distance.cdist([point], campus).argmin()]
closest_traffic_point.append(tuple(closest))
bike_traffic.append(campus_dict[tuple(closest)])
elif castleFrank_Path.contains_point(point):
closest = castleFrank[distance.cdist([point], castleFrank).argmin()]
closest_traffic_point.append(tuple(closest))
bike_traffic.append(castleFrank_dict[tuple(closest)])
else:
closest = ped_coordinate_list[distance.cdist([point], ped_coordinate_list).argmin()]
closest_traffic_point.append(tuple(closest))
bike_traffic.append(bike_dict[tuple(closest)])
closest = ped_coordinate_list[distance.cdist([point], ped_coordinate_list).argmin()]
veh_traffic.append(veh_dict[tuple(closest)])
ped_traffic.append(ped_dict[tuple(closest)])
myfile5 = open('closest_intersection_centreline.csv', 'w')
wr = csv.writer(myfile5)
wr.writerow(closest_traffic_point)
myfile5.close()
myfile6 = open('closest_int_bike_predictions_centreline.csv', 'w')
wr = csv.writer(myfile6)
wr.writerow(bike_traffic)
myfile6.close()
all_realintersections_df['closest_traffic'] = tuple(closest_traffic_point)
all_realintersections_df['traffic_count'] = bike_traffic
all_realintersections_df['vehicle_traffic'] = veh_traffic
all_realintersections_df['pedestrian_traffic'] = ped_traffic
all_realintersections_df = all_realintersections_df.rename(columns={'closest_traffic': 'closest_ped_count', 'traffic_count':'predicted_bike_count'})
Adding traffic values for our list of intersections above. Cell below does the same for only those which have "central" Postal Code values.
In [38]:
import csv
closest_traffic_point = []
bike_traffic = []
veh_traffic = []
ped_traffic = []
i = 0
for i in range(0,len(central_realintersections_df)):
point = central_realintersections_df['coordinates'].iloc[i]
if waterfront_Path.contains_point(point):
closest = waterfront[distance.cdist([point], waterfront).argmin()]
closest_traffic_point.append(tuple(closest))
bike_traffic.append(waterfront_dict[tuple(closest)])
elif campus_Path.contains_point(point):
closest = campus[distance.cdist([point], campus).argmin()]
closest_traffic_point.append(tuple(closest))
bike_traffic.append(campus_dict[tuple(closest)])
elif castleFrank_Path.contains_point(point):
closest = castleFrank[distance.cdist([point], castleFrank).argmin()]
closest_traffic_point.append(tuple(closest))
bike_traffic.append(castleFrank_dict[tuple(closest)])
else:
closest = ped_coordinate_list[distance.cdist([point], ped_coordinate_list).argmin()]
closest_traffic_point.append(tuple(closest))
bike_traffic.append(bike_dict[tuple(closest)])
closest = ped_coordinate_list[distance.cdist([point], ped_coordinate_list).argmin()]
veh_traffic.append(veh_dict[tuple(closest)])
ped_traffic.append(ped_dict[tuple(closest)])
myfile5 = open('central_closest_intersection_centreline.csv', 'w')
wr = csv.writer(myfile5)
wr.writerow(closest_traffic_point)
myfile5.close()
myfile6 = open('central_closest_int_bike_predictions_centreline.csv', 'w')
wr = csv.writer(myfile6)
wr.writerow(bike_traffic)
myfile6.close()
central_realintersections_df['closest_traffic'] = tuple(closest_traffic_point)
central_realintersections_df['traffic_count'] = bike_traffic
central_realintersections_df['vehicle_traffic'] = veh_traffic
central_realintersections_df['pedestrian_traffic'] = ped_traffic
central_realintersections_df = central_realintersections_df.rename(columns={'closest_traffic': 'closest_ped_count', 'traffic_count':'predicted_bike_count'})
In [39]:
all_realint_df = all_realintersections_df[['intersection','coordinates','closest_ped_count','predicted_bike_count','vehicle_traffic','pedestrian_traffic','elevatio10']]
#all_realint_df.sort_values(by='predicted_bike_count')
result = pd.merge(col_df,all_realint_df, how ='outer',on='intersection')
result = result.sort_values(by='intersection')
traff2 =result['predicted_bike_count_y']
traff2.fillna(1,inplace=True)
result['predicted_bike_count_x'] =result['predicted_bike_count_x'].fillna(value = traff2 )
In [40]:
central_realint_df = central_realintersections_df[['intersection','coordinates','closest_ped_count','predicted_bike_count','vehicle_traffic','pedestrian_traffic','elevatio10']]
#all_realint_df.sort_values(by='predicted_bike_count')
result_central = pd.merge(central_col_df,central_realint_df, how ='outer',on='intersection')
result_central = result_central.sort_values(by='intersection')
traff_c =result_central['predicted_bike_count_y']
result_central['predicted_bike_count_x'] =result_central['predicted_bike_count_x'].fillna(value = traff_c )
Now let's take a look at the total types of injuries after 1997
In [47]:
injury_df = full_df[full_df['YEAR'] >1997].groupby('INJURY').size()
total = injury_df.sum()
d = {'injury': pd.Series([total/31,total/543,total/5575,total/(7171+2039)], index=['Fatal','Major','Minor','Minimal'])}
inverse_df = pd.DataFrame(d)
inverse_df = inverse_df/inverse_df.loc['Minimal']
inverse_df
Out[47]:
In [42]:
injury_df.plot(kind='bar')
Out[42]:
In [43]:
inverse_df.plot(kind='bar')
Out[43]:
Filtering past 1997 got ridd of the "unrecorded" entries, so that's a helpful. Now combined 'minimal' and 'none' into one category. This is because we are going to estimate a "risk factor" for each intersection, where major accidents are worth more than a minor accident, which are worth more than a minor etc. In treating a 'none' it seems simplest to compine this with 'minimal' accidents, so we can merely just invert our data to get our weighting factors.
It can be difficult and subjective to make this sort of call. How is the loss of a human life (for fatal accidents), numerically compared to that in which a minor accident occurs? Obviously more serious accidents should be percieved as more dangerous. We are letting the data provide this estimate for us. We will merely compare the relative sizes of each.
This may not seem neccessary, as working off the raw collision numbers and normalizing those may feel more intuitive to some. However, I believe that when we run our algorithm against new data (Nov 2015 to Nov 2016), an algorithm which weighs these different accident types will make better predictions. Unfortunately it will not be an apples to apples comparison. To do this, one would like to know the severity of accidents in our test data as well. The twitter bot does not make this distinction clear. But it is merely retweeting collisions reported by the Toronto police twitter app, which is not the same as having access to the filed police reports of the Toronto Police database. One would then believe that our new data is made up of more serious collisions, as the police were immediately informed.
In [44]:
injury_dict = inverse_df.to_dict()
In [45]:
injury_dict['None']=1.0
injury_dict = injury_dict
col_df['injury_num'] = col_df['INJURY'].map(injury_dict)
col_df['injury_num'] = col_df['injury_num'].fillna(value=0)
central_col_df['injury_num'] = central_col_df['injury'].map(injury_dict)
central_col_df['injury_num'] = central_col_df['injury_num'].fillna(value=0)
Here we produce two resulting dataframes of intersections, featuring the total numbers of collisions, along with the features of that intersection.
In [46]:
col_df.columns.values
Out[46]:
In [47]:
#result.to_csv('merged_data.csv')
new_col = col_df[col_df['YEAR']>1997]
result = pd.merge(new_col,all_realint_df, how ='outer',on='intersection')
result = result.sort_values(by='intersection')
severity_df = pd.get_dummies(new_col.INJURY, dummy_na=False)
new_col = pd.concat([new_col,severity_df],axis=1)
new_col['Minimal/None'] = new_col['Minimal'] + new_col['None']
intersection_df = new_col.groupby('intersection').agg({'predicted_bike_count':[np.size,np.mean],'vehicle_traffic':[np.mean],'pedestrian_traffic':[np.mean],'injury_num':[np.sum],'Fatal':[np.sum],'Major':[np.sum],'Minor':[np.sum],'Minimal/None':[np.sum]}).sort_values(by=[('predicted_bike_count','size')],ascending=False)
In [48]:
#result.to_csv('merged_data.csv')
new_col = col_df[col_df['YEAR']>1997]
result = pd.merge(new_col,all_realint_df, how ='outer',on='intersection')
result = result.sort_values(by='intersection')
severity_df = pd.get_dummies(new_col.INJURY, dummy_na=False)
new_col = pd.concat([new_col,severity_df],axis=1)
new_col['Minimal/None'] = new_col['Minimal'] + new_col['None']
intersection_df = new_col.groupby('intersection').agg({'predicted_bike_count':[np.size,np.mean],'vehicle_traffic':[np.mean],'pedestrian_traffic':[np.mean],'injury_num':[np.sum],'Fatal':[np.sum],'Major':[np.sum],'Minor':[np.sum],'Minimal/None':[np.sum]}).sort_values(by=[('predicted_bike_count','size')],ascending=False)
#intersection_df.columns = intersection_df.columns.set_levels(['minor','predicted_bike_count','injury_num','pedestrian_traffic','fatal','minimal/none','major','vehicle_traffic'], level=0)
zeroes_df = all_realint_df.groupby('intersection').agg({'predicted_bike_count':[np.size,np.mean],'vehicle_traffic':[np.mean],'pedestrian_traffic':[np.mean]}).sort_values(by=[('predicted_bike_count','size')],ascending=False)
zeroes_df['predicted_bike_count','size'] = 0
zeroes_df['injury_num','sum'] = 0
#Let's use 360 as the days in the year. It's a nice round number. And since weekends are less busy it's probably still an overestimate.
zeroes_df['Rates','normalized_accident_rate'] = zeroes_df['predicted_bike_count','size']/(13*360*zeroes_df['predicted_bike_count','mean'])
zeroes_df = zeroes_df.sort_values(by=[('Rates','normalized_accident_rate')],ascending=False)
intersection_df['Rates','normalized_accident_rate'] = intersection_df['predicted_bike_count','size']/(13*360*intersection_df['predicted_bike_count','mean'])
intersection_df = intersection_df.sort_values(by=[('Rates','normalized_accident_rate')],ascending=False)
zeroes_df['Rates','normalized_risk_factor'] = zeroes_df['injury_num','sum']/(13*360*zeroes_df['predicted_bike_count','mean'])
zeroes_df = zeroes_df.sort_values(by=[('Rates','normalized_risk_factor')],ascending=False)
intersection_df['Rates','normalized_risk_factor'] = intersection_df['injury_num','sum']/(13*360*intersection_df['predicted_bike_count','mean'])
intersection_df = intersection_df.sort_values(by=[('Rates','normalized_risk_factor')],ascending=False)
totals_df = pd.concat([intersection_df,zeroes_df])
df_test = totals_df['Rates','normalized_accident_rate']
df_data2 = df_test.to_frame(name='normalized yearly accident rate')
df_data2['total collisions'] = totals_df['predicted_bike_count','size']
df_data2['bike traffic estimate'] = totals_df['predicted_bike_count','mean']
df_data2['scaled collisions'] = totals_df['injury_num','sum']
df_data2['normalized_risk_factor'] = totals_df['Rates','normalized_risk_factor']
df_data2['vehicle traffic'] = totals_df['vehicle_traffic']
df_data2['pedestrian traffic'] = totals_df['pedestrian_traffic']
df_data2['fatal'] = totals_df['Fatal']
df_data2['major'] = totals_df.Major
df_data2['minor'] = totals_df.Minor
df_data2['minimal/none'] = totals_df['Minimal/None']
df_data2['normalized yearly accident rate'].fillna(0,inplace=True)
df_data2['normalized_risk_factor'].fillna(0,inplace=True)
df_data2.fillna(0,inplace=True)
df_data2.dropna(inplace=True)
#Here we make the index a column, drop the duplicates from this intersection column, then make it the index again.
df_data2 = df_data2.reset_index().drop_duplicates(subset='intersection', keep='first').set_index('intersection')
df_data2 = df_data2.sort_values(by='total collisions',ascending=False)
len(df_data2)
Out[48]:
In [49]:
df_data2.head()
Out[49]:
In [50]:
#result.to_csv('merged_data.csv')
new_col_central = central_col_df[central_col_df['year']>1997]
result_central = pd.merge(new_col_central,central_realint_df, how ='outer',on='intersection')
result_central = result_central.sort_values(by='intersection')
severity_df = pd.get_dummies(new_col_central.injury, dummy_na=False)
new_col_central = pd.concat([new_col_central,severity_df],axis=1)
new_col_central['Minimal/None'] = new_col_central['Minimal'] + new_col_central['None']
intersection_df = new_col_central.groupby('intersection').agg({'predicted_bike_count':[np.size,np.mean],'vehicle_traffic':[np.mean],'pedestrian_traffic':[np.mean],'injury_num':[np.sum],'Fatal':[np.sum],'Major':[np.sum],'Minor':[np.sum],'Minimal/None':[np.sum]}).sort_values(by=[('predicted_bike_count','size')],ascending=False)
#intersection_df.columns = intersection_df.columns.set_levels(['minor','predicted_bike_count','injury_num','pedestrian_traffic','fatal','minimal/none','major','vehicle_traffic'], level=0)
zeroes_df = central_realint_df.groupby('intersection').agg({'predicted_bike_count':[np.size,np.mean],'vehicle_traffic':[np.mean],'pedestrian_traffic':[np.mean]}).sort_values(by=[('predicted_bike_count','size')],ascending=False)
zeroes_df['predicted_bike_count','size'] = 0
zeroes_df['injury_num','sum'] = 0
#Let's use 360 as the days in the year. It's a nice round number. And since weekends are less busy it's probably still an overestimate.
zeroes_df['Rates','normalized_accident_rate'] = zeroes_df['predicted_bike_count','size']/(13*360*zeroes_df['predicted_bike_count','mean'])
zeroes_df = zeroes_df.sort_values(by=[('Rates','normalized_accident_rate')],ascending=False)
intersection_df['Rates','normalized_accident_rate'] = intersection_df['predicted_bike_count','size']/(13*360*intersection_df['predicted_bike_count','mean'])
intersection_df = intersection_df.sort_values(by=[('Rates','normalized_accident_rate')],ascending=False)
zeroes_df['Rates','normalized_risk_factor'] = zeroes_df['injury_num','sum']/(13*360*zeroes_df['predicted_bike_count','mean'])
zeroes_df = zeroes_df.sort_values(by=[('Rates','normalized_risk_factor')],ascending=False)
intersection_df['Rates','normalized_risk_factor'] = intersection_df['injury_num','sum']/(13*360*intersection_df['predicted_bike_count','mean'])
intersection_df = intersection_df.sort_values(by=[('Rates','normalized_risk_factor')],ascending=False)
totals_df = pd.concat([intersection_df,zeroes_df])
df_test = totals_df['Rates','normalized_accident_rate']
df_data_central = df_test.to_frame(name='normalized yearly accident rate')
df_data_central['total collisions'] = totals_df['predicted_bike_count','size']
df_data_central['bike traffic estimate'] = totals_df['predicted_bike_count','mean']
df_data_central['scaled collisions'] = totals_df['injury_num','sum']
df_data_central['normalized_risk_factor'] = totals_df['Rates','normalized_risk_factor']
df_data_central['vehicle traffic'] = totals_df['vehicle_traffic']
df_data_central['pedestrian traffic'] = totals_df['pedestrian_traffic']
df_data_central['fatal'] = totals_df['Fatal']
df_data_central['major'] = totals_df.Major
df_data_central['minor'] = totals_df.Minor
df_data_central['minimal/none'] = totals_df['Minimal/None']
df_data_central['normalized yearly accident rate'].fillna(0,inplace=True)
df_data_central['normalized_risk_factor'].fillna(0,inplace=True)
df_data_central.fillna(0,inplace=True)
df_data_central.dropna(inplace=True)
#Here we make the index a column, drop the duplicates from this intersection column, then make it the index again.
df_data_central = df_data_central.reset_index().drop_duplicates(subset='intersection', keep='first').set_index('intersection')
df_data_central = df_data_central.sort_values(by='total collisions',ascending=False)
len(df_data_central)
Out[50]:
In [51]:
df_data2.to_csv('all_totals.csv')
df_data_central.to_csv('central_totals.csv')
Now we have our two main datasets. A collection of intersections in the central areas of the city with their associated accident numbers for cycling collisions from 1998 onwards, and one for the entire city of Toronto. To gain any predicted power however, we'll need to throw in some features of the intersections. We'll use the road class types from the collisions data set, using the "busiest" class of the two streets to define the intersection. We'll also use the intersection control type as well. The cells merely add these columns to our two dataframes.
In [52]:
temp = result
temp = temp.replace(np.nan,-1)
roadclass_df =temp.groupby(['intersection']+['ROAD CLASS']) \
.size().to_frame('count') \
.reset_index().sort_values('intersection',ascending=False)
#roadclass_df['road_class'] =roadclass_df['road_class'].fillna(value =elevatio_df)
roadclass_df = roadclass_df.sort_values('count', ascending=False)
roadclass_df = roadclass_df.replace(np.nan,-1)
roadclass_dict = {'Collector':1, 'Expressway':1,'Expressway Ramp':1,'Major Arterial':2,'Minor Arterial':3,'Local':4, 'Laneway':4, 'Private Property':4}
roadclass_df['roadclass_num'] = roadclass_df['ROAD CLASS'].map(roadclass_dict)
roadclass_df = roadclass_df.sort_values(by = ['intersection','roadclass_num'], ascending = True)
roadclass_df = roadclass_df.drop_duplicates(subset='intersection', keep='first')
roadclass_df = roadclass_df.sort_values(by='intersection', ascending= False)
temp = result
temp['elevatio10'].replace(np.nan,'unrecorded', inplace =True)
elevatio_df =temp.groupby(['intersection']+['elevatio10']) \
.size().to_frame('count') \
.reset_index().sort_values('intersection',ascending=False)
elevatio10_dict = {'Major':'Major Arterial', 'Minor':'Minor Arterial','Laneway':'Local'}
elevatio_df['intersection_type']= elevatio_df['elevatio10'].map(elevatio10_dict)
elevatio_df = elevatio_df.sort_values(by = ['intersection','intersection_type'], ascending = True)
elevatio_df = elevatio_df.drop_duplicates(subset='intersection', keep='first')
elevatio_df = elevatio_df.sort_values(by='intersection', ascending= False)
elevatio_df.reset_index(inplace = True)
elevatio10 = elevatio_df['intersection_type']
roadclass_df['ROAD CLASS'] =roadclass_df['ROAD CLASS'].replace(to_replace = -1,value = np.nan)
roadclass_df.reset_index(inplace=True)
roadclass_df['ROAD CLASS'] =roadclass_df['ROAD CLASS'].fillna(value = elevatio10)
roadclass_df['roadclass_num'] = roadclass_df['ROAD CLASS'].map(roadclass_dict)
roadclass_df = roadclass_df.sort_values(by = ['intersection','roadclass_num'], ascending = True)
roadclass_df = roadclass_df.drop_duplicates(subset='intersection', keep='first')
roadclass_df = roadclass_df.sort_values(by='intersection', ascending= False)
In [53]:
roadclass_df.groupby(['ROAD CLASS']).size()
Out[53]:
In [54]:
result['TRAFFIC CONTROL'].replace(np.nan,'unrecorded', inplace=True)
result['ROAD CLASS'].replace(['Expressway Ramp', 'Hydro Line','Laneway','Major Arterial Ramp','Minor Arterial Ramp','Other','Private Property', 'River'],['Expressway','Local','Local','Major Arterial','Minor Arterial','Local','Local','Local'],inplace=True)
temp = result
trafficcontrol_df =temp.groupby(['intersection']+['TRAFFIC CONTROL']) \
.size().to_frame('count') \
.reset_index().sort_values('count',ascending=False)
trafficcontrol_dict = {'traffic signal':1, 'traffic gate':1,'traffic controller':1,'stop sign':2, 'yield sign':2, 'pedestrian crossover':3, 'no control':5, 'other':6, 'unrecorded':6,'police control':6,'school bus':6,'school guard':3}
trafficcontrol_df['trafficcontrol_num'] = trafficcontrol_df['TRAFFIC CONTROL'].map(trafficcontrol_dict)
trafficcontrol_df = trafficcontrol_df.sort_values(by = ['intersection','trafficcontrol_num'], ascending = True)
trafficcontrol_df = trafficcontrol_df.drop_duplicates(subset='intersection', keep='first')
trafficcontrol_df = trafficcontrol_df.sort_values(by='intersection', ascending= False)
In [55]:
trafficcontrol_df.to_csv('trafficcontrol_df.csv')
In [56]:
reverse_dict = {1:'traffic signal/gate',2:'stop/yield sign', 3:'pedestrian crossover',5:'no control', 6:'other/unrecorded'}
trafficcontrol_df['control category'] = trafficcontrol_df['trafficcontrol_num'].map(reverse_dict)
trafficcontrol_df.head()
Out[56]:
In [57]:
trafficcontrol_df.groupby(['control category']).size()
Out[57]:
In [58]:
reverseROAD_dict = {1:'Collector/Expressway',2:'Major Arterial', 3:'Minor Arterial',4:'Local/Laneway'}
roadclass_df['road_class_cat'] = roadclass_df['roadclass_num'].map(reverseROAD_dict)
roadclass_df.head()
Out[58]:
In [59]:
roadclass_df.groupby(['road_class_cat']).size()
Out[59]:
In [60]:
df_data2 = df_data2.sort_index(ascending = False)
df_data2['road_class'] = roadclass_df['road_class_cat'].values
df_data2['control type'] = trafficcontrol_df['control category'].values
df_data2[['fatal','major','minor','minimal/none']] = df_data2[['fatal','major','minor','minimal/none']].fillna(0)
traffdummy_df = pd.get_dummies(df_data2['control type'], dummy_na=False)
df_data2 = pd.concat([df_data2, traffdummy_df ],axis=1)
rdclass_dummy = pd.get_dummies(df_data2['road_class'])
df_data2 = pd.concat([df_data2, rdclass_dummy],axis=1)
We had to do a bit of work to map certain obvious groups together (such as minor arterial ramp grouping with arterial ramp). The same process was performed for traffic control types. Part of the thinking in the grouping was trying to use common sense, traffic gates would likely be the most restrictive intersections, but due to their very small number, they are combined with the next restrictive in a "traffic signal/gate" groups.
In [61]:
df_data2['bike traffic estimate'].replace(0,1,inplace=True)
df_data2['pedestrian traffic'].replace(0,1,inplace=True)
We're also mapping the intersection types found in the intersection list dataframe to correspond to those in the cycling collisions one.
In [62]:
temp = result_central
temp = temp.replace(np.nan,-1)
roadclass_df =temp.groupby(['intersection']+['road_class']) \
.size().to_frame('count') \
.reset_index().sort_values('intersection',ascending=False)
#roadclass_df['road_class'] =roadclass_df['road_class'].fillna(value =elevatio_df)
roadclass_df = roadclass_df.sort_values('count', ascending=False)
roadclass_df = roadclass_df.replace(np.nan,-1)
roadclass_dict = {'Collector':1, 'Expressway':1,'Expressway Ramp':1,'Major Arterial':2,'Minor Arterial':3,'Local':4, 'Laneway':4, 'Private Property':4}
roadclass_df['roadclass_num'] = roadclass_df['road_class'].map(roadclass_dict)
roadclass_df = roadclass_df.sort_values(by = ['intersection','roadclass_num'], ascending = True)
roadclass_df = roadclass_df.drop_duplicates(subset='intersection', keep='first')
roadclass_df = roadclass_df.sort_values(by='intersection', ascending= False)
temp = result_central
temp['elevatio10'].replace(np.nan,'unrecorded', inplace =True)
elevatio_df =temp.groupby(['intersection']+['elevatio10']) \
.size().to_frame('count') \
.reset_index().sort_values('intersection',ascending=False)
elevatio10_dict = {'Major':'Major Arterial', 'Minor':'Minor Arterial','Laneway':'Local'}
elevatio_df['intersection_type']= elevatio_df['elevatio10'].map(elevatio10_dict)
elevatio_df = elevatio_df.sort_values(by = ['intersection','intersection_type'], ascending = True)
elevatio_df = elevatio_df.drop_duplicates(subset='intersection', keep='first')
elevatio_df = elevatio_df.sort_values(by='intersection', ascending= False)
elevatio_df.reset_index(inplace = True)
elevatio10 = elevatio_df['intersection_type']
roadclass_df['ROAD CLASS'] =roadclass_df['road_class'].replace(to_replace = -1,value = np.nan)
roadclass_df.reset_index(inplace=True)
roadclass_df['ROAD CLASS'] =roadclass_df['ROAD CLASS'].fillna(value = elevatio10)
roadclass_df['roadclass_num'] = roadclass_df['ROAD CLASS'].map(roadclass_dict)
roadclass_df = roadclass_df.sort_values(by = ['intersection','roadclass_num'], ascending = True)
roadclass_df = roadclass_df.drop_duplicates(subset='intersection', keep='first')
roadclass_df = roadclass_df.sort_values(by='intersection', ascending= False)
roadclass_df.groupby(['ROAD CLASS']).size()
Out[62]:
In [63]:
result_central['traffic_control'].replace(np.nan,'unrecorded', inplace=True)
result_central['road_class'].replace(['Expressway Ramp', 'Hydro Line','Laneway','Major Arterial Ramp','Minor Arterial Ramp','Other','Private Property', 'River'],['Expressway','Local','Local','Major Arterial','Minor Arterial','Local','Local','Local'],inplace=True)
temp = result_central
trafficcontrol_df =temp.groupby(['intersection']+['traffic_control']) \
.size().to_frame('count') \
.reset_index().sort_values('count',ascending=False)
trafficcontrol_dict = {'traffic signal':1, 'traffic gate':1,'traffic controller':1,'stop sign':2, 'yield sign':2, 'pedestrian crossover':3, 'no control':5, 'other':6, 'unrecorded':6,'police control':6,'school bus':6,'school guard':3}
trafficcontrol_df['trafficcontrol_num'] = trafficcontrol_df['traffic_control'].map(trafficcontrol_dict)
trafficcontrol_df = trafficcontrol_df.sort_values(by = ['intersection','trafficcontrol_num'], ascending = True)
trafficcontrol_df = trafficcontrol_df.drop_duplicates(subset='intersection', keep='first')
trafficcontrol_df = trafficcontrol_df.sort_values(by='intersection', ascending= False)
In [64]:
reverse_dict = {1:'traffic signal/gate',2:'stop/yield sign', 3:'pedestrian crossover',5:'no control', 6:'other/unrecorded'}
trafficcontrol_df['control category'] = trafficcontrol_df['trafficcontrol_num'].map(reverse_dict)
reverseROAD_dict = {1:'Collector/Expressway',2:'Major Arterial', 3:'Minor Arterial',4:'Local/Laneway'}
roadclass_df['road_class_cat'] = roadclass_df['roadclass_num'].map(reverseROAD_dict)
df_data_central = df_data_central.sort_index(ascending = False)
df_data_central['road_class'] = roadclass_df['road_class_cat'].values
df_data_central['control type'] = trafficcontrol_df['control category'].values
df_data_central[['fatal','major','minor','minimal/none']] = df_data_central[['fatal','major','minor','minimal/none']].fillna(0)
traffdummy_df = pd.get_dummies(df_data_central['control type'], dummy_na=False)
df_data_central = pd.concat([df_data_central, traffdummy_df ],axis=1)
rdclass_dummy = pd.get_dummies(df_data_central['road_class'])
df_data_central = pd.concat([df_data_central, rdclass_dummy],axis=1)
We're going to fit posterior means via empirical bayesian estimation as before. These won't be used for prediction, but it will be interesting to see if any of our predictive approaches perform better than the estimates from the posterior rates when we take a look at data from this past year scrapped from twitter.
In [65]:
ag,bg,cg = stats.gamma.fit(df_data2['total collisions'])
beta = 1/cg
df_data2['posterior mean rate'] = (ag + df_data2['total collisions'])/(beta+13*360*df_data2['bike traffic estimate'])
df_data2['posterior STD'] = np.sqrt((ag + df_data2['total collisions'])/((beta+13*360*df_data2['bike traffic estimate'])**2))
#pd.set_option('display.float_format', '{:.5g}'.format)
df_data2.sort_values(by='posterior mean rate',ascending = False).head()
Out[65]:
In [66]:
ag,bg,cg = stats.gamma.fit(df_data_central['total collisions'])
beta = 1/cg
df_data_central['posterior mean rate'] = (ag + df_data_central['total collisions'])/(beta+13*360*df_data_central['bike traffic estimate'])
df_data_central['posterior STD'] = np.sqrt((ag + df_data_central['total collisions'])/((beta+13*360*df_data_central['bike traffic estimate'])**2))
#pd.set_option('display.float_format', '{:.5g}'.format)
df_data_central.sort_values(by='posterior mean rate',ascending = False).head()
Out[66]:
In [67]:
df_data2.to_csv('preinput_data2.csv')
df_data_central.to_csv('preinput_data_central.csv')
In [68]:
df_data = pd.DataFrame.from_csv('preinput_data2.csv')
df_data_central = pd.DataFrame.from_csv('preinput_data_central.csv')
In [69]:
#df_data['injury'] = df_data.fatal + df_data.major + df_data.minor + df_data['minimal/none']
df_data['fatal/major'] = df_data.fatal + df_data.major
#df_data_central['injury'] = df_data_central.fatal + df_data_central.major + df_data_central.minor + df_data_central['minimal/none']
df_data_central['fatal/major'] = df_data_central.fatal + df_data_central.major
In [70]:
df_data.columns.values
Out[70]:
In [71]:
X= df_data.drop(['normalized yearly accident rate','posterior mean rate','posterior STD','total collisions','normalized_risk_factor','road_class','control type','minor','minimal/none','scaled collisions','fatal','major'],axis=1)
Y_priorTotals = df_data['total collisions']
Y_priorRate = df_data['total collisions']/13
Y_postRate = df_data['posterior mean rate']
Y_priorRiskRate = df_data['normalized_risk_factor']
Xc= df_data_central.drop(['normalized yearly accident rate','posterior mean rate','posterior STD','total collisions','normalized_risk_factor','road_class','control type','minor','minimal/none','scaled collisions','fatal','major'],axis=1)
Yc_priorTotals = df_data_central['total collisions']
Yc_priorRate = df_data_central['total collisions']/13
Yc_postRate = df_data_central['posterior mean rate']
Yc_priorRiskRate = df_data_central['normalized_risk_factor']
In [72]:
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
raw_bike = X['bike traffic estimate'].reshape(-1,1)
raw_veh = X['vehicle traffic'].reshape(-1,1)
raw_ped = X['pedestrian traffic'].reshape(-1,1)
bike_scaler = preprocessing.StandardScaler().fit(raw_bike)
veh_scaler = preprocessing.StandardScaler().fit(raw_veh)
ped_scaler = preprocessing.StandardScaler().fit(raw_ped)
X_scaled_bike = bike_scaler.transform(raw_bike)
X_scaled_veh = veh_scaler.transform(raw_veh)
X_scaled_ped = ped_scaler.transform(raw_ped)
X['bike traffic estimate'] = X_scaled_bike
X['vehicle traffic'] = X_scaled_veh
X['ped_scaler'] = X_scaled_ped
In [73]:
raw_bike = Xc['bike traffic estimate'].reshape(-1,1)
raw_veh = Xc['vehicle traffic'].reshape(-1,1)
raw_ped = Xc['pedestrian traffic'].reshape(-1,1)
bike_scaler_c = preprocessing.StandardScaler().fit(raw_bike)
veh_scaler_c = preprocessing.StandardScaler().fit(raw_veh)
ped_scaler_c = preprocessing.StandardScaler().fit(raw_ped)
Xc_scaled_bike = bike_scaler_c.transform(raw_bike)
Xc_scaled_veh = veh_scaler_c.transform(raw_veh)
Xc_scaled_ped = ped_scaler_c.transform(raw_ped)
Xc['bike traffic estimate'] = Xc_scaled_bike
Xc['vehicle traffic'] = Xc_scaled_veh
Xc['ped_scaler'] = Xc_scaled_ped
First we train a basic linear regression. Rather than splitting the data manually, or coding up a cross validation method ourselves (as was required in the linear regression exercise for the course), we'll use the included cross_val_score and _predict methods.
In [74]:
from sklearn import linear_model as lm
cycling_linear = lm.LinearRegression()
cycling_lasso = lm.Lasso()
cycling_ridge = lm.Ridge(alpha =10)
In [75]:
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
predicted = cross_val_predict(cycling_linear,X,Y_priorRate,cv=5)
linear_score = cross_val_score(cycling_linear,X,Y_priorRate,cv=5)
print("linear score: ", np.mean(linear_score))
print("predicted: ", predicted)
We first notice that the score isn't terribly good. There is a lot of room for improvement here. How do these predictions look visually when plotted against the actual data? Let's take a look
In [76]:
fig, ax = plt.subplots()
Y = Y_priorRate
r2 = linear_score
ax.scatter(Y,predicted)
ax.plot([Y.min(), Y.max()], [Y.min(), Y.max()], 'k--',label ="r^2="+str(r2), lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()
In [77]:
plt.scatter(predicted, predicted- Y, c='g', s=40, alpha=0.5)
plt.hlines(y = 0, xmin=0, xmax = 1.5)
plt.title('Residual Plot of OLS regression for all postal codes, using cross-validation')
plt.ylabel('Residuals')
Out[77]:
In [78]:
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
predicted = cross_val_predict(cycling_linear,Xc,Yc_priorRate,cv=5)
linear_score = cross_val_score(cycling_linear,Xc,Yc_priorRate,cv=5)
print("linear score: ", np.mean(linear_score))
print("predicted: ", predicted)
fig, ax = plt.subplots()
Y = Yc_priorRate
r2 = linear_score
ax.scatter(Y,predicted)
ax.plot([Y.min(), Y.max()], [Y.min(), Y.max()], 'k--',label ="r^2="+str(r2), lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()
In [79]:
plt.scatter(predicted, predicted- Y, c='g', s=40, alpha=0.5)
plt.hlines(y = 0, xmin=0, xmax = 1.5)
plt.title('Residual Plot of OLS regression for central postal codes, using cross-validation')
plt.ylabel('Residuals')
Out[79]:
So it's very obvious to us that there is a very large tendency to understimate collision numbers, particularily for mid, to high risk intersections. We see that restricting ourselves to intersections which are in central Toronto gives a pretty sizable imporvement in $R^2$. From 0.5 to about 0.6. The residuals show an improvement in the underestimation as well.
We'll let the use the LassoCV method to fit an alpha, and then make our prediction from there.
In [80]:
lassomodel = lm.LassoCV(cv=5).fit(X,Y_priorRate)
predicted = lassomodel.predict(X)
lasso_score = lassomodel.score(X,Y_priorRate)
print("chosen alpha: ",lassomodel.alpha_)
print("lasso score: ", np.mean(lasso_score))
print("predicted: ", predicted)
In [81]:
fig, ax = plt.subplots()
Y = Y_priorRate
r2= linear_score
ax.scatter(Y,predicted)
ax.plot([Y.min(), Y.max()], [Y.min(), Y.max()], 'k--',label ="r^2="+str(r2), lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()
In [82]:
plt.scatter(predicted, predicted- Y, c='g', s=40, alpha=0.5)
plt.hlines(y = 0, xmin=0, xmax = 1.5)
plt.title('Residual Plot using cross-validated data')
plt.ylabel('Residuals')
Out[82]:
So a rather poor performance overall. Worse than regular least squares regression, just as we predicted. It doesn't make sense to continue with this model by checking the performance with the central-only dataframe, as this isn't the kidn of problem it's designed for. Now what about ridge regression?
In [83]:
ridgemodel = lm.RidgeCV(cv=5).fit(X,Y_priorRate)
predicted = ridgemodel.predict(X)
ridge_score = ridgemodel.score(X,Y_priorRate)
print("chosen alpha: ",ridgemodel.alpha_)
print("ridge score: ", np.mean(ridge_score))
print("predicted: ", predicted)
In [84]:
fig, ax = plt.subplots()
Y = Y_priorRate
r2 = linear_score
ax.scatter(Y,predicted)
ax.plot([Y.min(), Y.max()], [Y.min(), Y.max()], 'k--', label ="r^2="+str(r2),lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()
In [85]:
plt.scatter(predicted, predicted- Y, c='g', s=40, alpha=0.5)
plt.hlines(y = 0, xmin=0, xmax = 1.5)
plt.title('Residual Plot using cross-validated data')
plt.ylabel('Residuals')
Out[85]:
In [86]:
ridgemodel = lm.RidgeCV(cv=5).fit(Xc,Yc_priorRate)
predicted = ridgemodel.predict(Xc)
ridge_score = ridgemodel.score(Xc,Yc_priorRate)
print("chosen alpha: ",ridgemodel.alpha_)
print("ridge score: ", np.mean(ridge_score))
print("predicted: ", predicted)
fig, ax = plt.subplots()
Y = Yc_priorRate
r2 = linear_score
ax.scatter(Y,predicted)
ax.plot([Y.min(), Y.max()], [Y.min(), Y.max()], 'k--',label ="r^2="+str(r2), lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()
In [87]:
plt.scatter(predicted, predicted- Y, c='g', s=40, alpha=0.5)
plt.hlines(y = 0, xmin=0, xmax = 1.5)
plt.title('Residual Plot using cross-validated data')
plt.ylabel('Residuals')
Out[87]:
We have a slight improvement over our least squares model. Unfortunately still not very good. And we still underestimate systematically the high risk intersections. What can we accomplish by trying an ensemble method? And perhaps including another feature, like postal codes?
Not every postal code will be working to add information. But a few will be ranked highly when we look at the feature importance output for a random forrest.
In [88]:
Xrf = df_data[['bike traffic estimate','vehicle traffic','pedestrian traffic','road_class','control type']]
Yrf_priorTotals = df_data['total collisions']
Yrf_priorRate = df_data['total collisions']/13
Yrf_postRate = df_data['posterior mean rate']
Yrf_RiskRate = df_data['scaled collisions']/13
Xrf_c = df_data_central[['bike traffic estimate','vehicle traffic','pedestrian traffic','road_class','control type']]
Yrf_priorTotals_c = df_data_central['total collisions']
Yrf_priorRate_c = df_data_central['total collisions']/13
Yrf_postRate_c = df_data_central['posterior mean rate']
Yrf_RiskRate_c = df_data_central['scaled collisions']/13
In [89]:
Xrf.fillna('Minor Arterial',inplace= True)
In [90]:
control_dict = {'traffic signal/gate':1,'stop/yield sign':2, 'pedestrian crossover':3,'no control':5, 'other/unrecorded':4}
Xrf['control_type_num'] = Xrf['control type'].map(control_dict)
Xrf_c['control_type_num'] = Xrf_c['control type'].map(control_dict)
In [91]:
roadclass_dict = {'Collector/Expressway':4,'Major Arterial':3,'Minor Arterial':2,'Local/Laneway':1}
Xrf['road_class_num'] = Xrf['road_class'].map(roadclass_dict)
Xrf_c.loc[:,'road_class_num'] = Xrf_c.loc[:,'road_class'].map(roadclass_dict)
In [92]:
Xrf = Xrf.drop(['road_class','control type'] , axis=1)
Xrf_c = Xrf_c.drop(['road_class','control type'],axis =1)
In [93]:
from sklearn.model_selection import train_test_split
Xc_train, Xc_test, Yc_train, Yc_test = train_test_split(Xrf_c, Yrf_priorRate_c, test_size=0.25, random_state = 5)
print (Xc_train.shape)
print (Xc_test.shape)
print (Yc_train.shape)
print (Yc_test.shape)
X_train, X_test, Y_train, Y_test = train_test_split(Xrf, Yrf_priorRate, test_size=0.25, random_state = 5)
print (X_train.shape)
print (X_test.shape)
print (Y_train.shape)
print (Y_test.shape)
In [95]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
rf_c= RandomForestRegressor(n_estimators = 200)
prediction_c = cross_val_predict(rf_c,Xrf_c,Yrf_priorRate_c, cv=6)
r2_c = np.mean(cross_val_score(rf_c,Xrf_c,Yrf_priorRate_c,cv=6))
mse = np.mean((Yrf_priorRate_c-prediction_c)**2)
print('MSE : ',mse)
print("R^2 : ", r2_c)
plt.scatter(Yrf_priorRate_c,prediction_c)
plt.plot(np.arange(0,4),np.arange(0,4), label ="r^2="+str(r2_c), c="r")
plt.legend(loc="lower right")
plt.xlabel('Measured')
plt.ylabel('Predicted')
Out[95]:
In [96]:
rf= RandomForestRegressor(n_estimators = 200)
prediction = cross_val_predict(rf,Xrf,Yrf_priorRate, cv=5)
r2 = np.mean(cross_val_score(rf,Xrf,Yrf_priorRate,cv=5))
mse = np.mean((Yrf_priorRate-prediction)**2)
print('MSE : ',mse)
print("R^2 : ", r2)
plt.scatter(Yrf_priorRate,prediction)
plt.plot(np.arange(0,4),np.arange(0,4), label ="r^2="+str(r2), c="r")
plt.legend(loc="lower right")
plt.xlabel('Measured')
plt.ylabel('Predicted')
Out[96]:
In [97]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
rf= RandomForestRegressor(n_estimators = 200)
rf.fit(X_train,Y_train)
r2 = r2_score(Y_test, rf.predict(X_test))
mse = np.mean((Y_test-rf.predict(X_test))**2)
print('MSE : ',mse)
print("R^2 : ", r2)
plt.scatter(Y_test,rf.predict(X_test))
plt.plot(np.arange(0,4),np.arange(0,4), label ="r^2="+str(r2), c="r")
plt.legend(loc="lower right")
plt.xlabel('Measured')
plt.ylabel('Predicted')
Out[97]:
In [98]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
rf_c= RandomForestRegressor(n_estimators = 200)
rf_c.fit(Xc_train,Yc_train)
r2_c = r2_score(Yc_test, rf_c.predict(Xc_test))
mse = np.mean((Yc_test-rf_c.predict(Xc_test))**2)
print('MSE : ',mse)
print("R^2 : ", r2_c)
plt.scatter(Yc_test,rf_c.predict(Xc_test))
plt.plot(np.arange(0,3),np.arange(0,3), label ="r^2="+str(r2_c), c="r")
plt.legend(loc="lower right")
plt.xlabel('Measured')
plt.ylabel('Predicted')
Out[98]:
In [99]:
names = list(Xrf.columns.values)
print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), rf.feature_importances_), names),
reverse=True))
In [100]:
X2 = Xrf.drop(['bike traffic estimate'],axis=1)
Xc2 = Xrf_c.drop(['bike traffic estimate'],axis=1)
In [101]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
rf= RandomForestRegressor(n_estimators = 200)
prediction = cross_val_predict(rf,X2,Yrf_priorRate)
r2 = np.mean(cross_val_score(rf,X2,Yrf_priorRate))
mse = np.mean((Yrf_priorRate-prediction)**2)
print('MSE : ',mse)
print("R^2 : ", r2)
plt.scatter(Yrf_priorRate,prediction)
plt.plot(np.arange(0,4),np.arange(0,4), label ="r^2="+str(r2), c="r")
plt.legend(loc="lower right")
plt.xlabel('Measured')
plt.ylabel('Predicted')
Out[101]:
In [102]:
from sklearn.ensemble import RandomForestRegressor
rf= RandomForestRegressor(n_estimators = 200)
prediction = cross_val_predict(rf,Xc2,Yrf_priorRate_c)
r2 = np.mean(cross_val_score(rf,Xc2,Yrf_priorRate_c))
mse = np.mean((Yrf_priorRate_c-prediction)**2)
print('MSE : ',mse)
print("R^2 : ", r2)
plt.scatter(Yrf_priorRate_c,prediction)
plt.plot(np.arange(0,4),np.arange(0,4), label ="r^2="+str(r2), c="r")
plt.legend(loc="lower right")
plt.xlabel('Measured')
plt.ylabel('Predicted')
Out[102]:
In [106]:
plt.scatter(rf.predict(X_train), rf.predict(X_train) - Y_train, c='b', s=40, alpha=0.5)
plt.scatter(rf.predict(X_test), rf.predict(X_test) - Y_test, c='g', s=40)
plt.hlines(y = 0, xmin=0, xmax = 4)
plt.title('Residual Plot using training (blue) and test (green) data')
plt.ylabel('Residuals')
Out[106]:
In [107]:
plt.scatter(rf_c.predict(X2c_train), rf_c.predict(X2c_train) - Yc_train, c='b', s=40, alpha=0.5)
plt.scatter(rf_c.predict(X2c_test), rf_c.predict(X2c_test) - Yc_test, c='g', s=40)
plt.hlines(y = 0, xmin=0, xmax = 3)
plt.title('Residual Plot using training (blue) and test (green) data')
plt.ylabel('Residuals')
Out[107]:
In [103]:
X4 = Xrf.copy()
Xc4 = Xrf_c.copy()
In [104]:
X4.reset_index(inplace=True)
st1 = X4['intersection'].str.split('Ontario, ').str.get(1)
Xc4.reset_index(inplace=True)
st1_c = Xc4['intersection'].str.split('Ontario, ').str.get(1)
st2 = st1.str.split(' ').str.get(0)
st2.to_csv('postal_codes.csv')
st2_c = st1_c.str.split(' ').str.get(0)
st2_c.to_csv('postal_codes_c.csv')
In [105]:
st2 = pd.DataFrame.from_csv('postal_codes2.csv',header=None)
st2_c = pd.DataFrame.from_csv('postal_codes_c2.csv',header=None)
In [106]:
X4.head()
Out[106]:
In [107]:
X4.set_index('intersection', inplace= True)
Xc4.set_index('intersection', inplace= True)
#X4.drop('index',inplace = True, axis=1)
In [108]:
X4['postal'] = st2[1].values
Xc4['postal'] = st2_c[1].values
In [109]:
postal_df = pd.get_dummies(X4.postal, dummy_na = False)
X4 = pd.concat([X4,postal_df],axis=1)
postal_df = pd.get_dummies(Xc4.postal,dummy_na=False)
Xc4 = pd.concat([Xc4,postal_df],axis=1)
In [110]:
X4.drop('postal',inplace = True, axis = 1)
Xc4.drop('postal',inplace = True, axis = 1)
In [111]:
Xc_train, Xc_test, Yc_train, Yc_test = train_test_split(Xc4, Yrf_priorRate_c, test_size=0.25, random_state = 5)
print (Xc_train.shape)
print (Xc_test.shape)
print (Yc_train.shape)
print (Yc_test.shape)
X_train, X_test, Y_train, Y_test = train_test_split(X4, Yrf_priorRate, test_size=0.25, random_state = 5)
print (X_train.shape)
print (X_test.shape)
print (Y_train.shape)
print (Y_test.shape)
rf= RandomForestRegressor(n_estimators = 200)
rf.fit(X_train,Y_train)
r2 = r2_score(Y_test, rf.predict(X_test))
mse = np.mean((Y_test-rf.predict(X_test))**2)
print('MSE for all data: ', mse)
rf_c= RandomForestRegressor(n_estimators = 200)
rf_c.fit(Xc_train,Yc_train)
r2_c = r2_score(Yc_test, rf_c.predict(Xc_test))
mse = np.mean((Yc_test-rf_c.predict(Xc_test))**2)
print("MSE for central data: ", mse)
names = list(X4.columns.values)
print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), rf.feature_importances_), names),
reverse=True))
names2 = list(Xc4.columns.values)
print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), rf_c.feature_importances_), names2),
reverse=True))
In [112]:
plt.scatter(Y_test,rf.predict(X_test))
plt.plot(np.arange(0,3.5),np.arange(0,3.5), label ="r^2="+str(r2), c="r")
plt.legend(loc="lower right")
plt.xlabel('Measured')
plt.ylabel('Predicted')
Out[112]:
In [113]:
plt.scatter(Yc_test,rf_c.predict(Xc_test))
plt.plot(np.arange(0,3.5),np.arange(0,3.5), label ="r^2="+str(r2_c), c="r")
plt.legend(loc="lower right")
plt.xlabel('Measured')
plt.ylabel('Predicted')
Out[113]:
The fit from our random forest is starting to look pretty good, particularily when we restrict ourselves to the central intersections only, which still contain the bulk of our data. Looking at the output in terms of feature importance, we see that for BOTH datasets, most of the postal codes don't add any information. We could use the SelectFromModel meta-transformer to specify a threshold (manually or as some multiple of the mean), to remove features which are irrelevant. sci-kit lear ndoes this using mean decrease impurity. This is how much the variance is decreased by each feature on average.
Another feature selection method is to measure how much each feature impacts the accuracy of our model. By permuating the values of each feature, we can check how much it affets our prediction. For important variables, there should be a significant decrease. The cell below calculates the mean decrease accuracty for our model
In [114]:
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import r2_score
from collections import defaultdict
X = X4.values
Y = Yrf_priorRate
rf_alt = RandomForestRegressor()
scores = defaultdict(list)
forest_split = ShuffleSplit(n_splits=10,test_size = .25,random_state=1)
#crossvalidate the scores on a number of different random splits of the data
for train_idx, test_idx in forest_split.split(X):
X_train, X_test = X[train_idx], X[test_idx]
Y_train, Y_test = Y[train_idx], Y[test_idx]
r = rf_alt.fit(X_train, Y_train)
acc = r2_score(Y_test, rf.predict(X_test))
for i in range(X.shape[1]):
X_t = X_test.copy()
np.random.shuffle(X_t[:, i])
shuff_acc = r2_score(Y_test, rf.predict(X_t))
scores[names[i]].append((acc-shuff_acc)/acc)
print ("Features sorted by their score:")
print (sorted([(round(np.mean(score), 4), feat) for
feat, score in scores.items()], reverse=True))
This gives similar results as before. Let's reduce our postal code features, leaving only those which are scoring above half the mean by the decrease in mean variance.
In [115]:
from sklearn.feature_selection import SelectFromModel
rf= RandomForestRegressor(n_estimators = 20)
rf_c= RandomForestRegressor(n_estimators = 20)
model = SelectFromModel(rf,threshold = '0.25*mean')
model.fit(X4,Yrf_priorRate)
X5 = model.transform(X4)
model_c = SelectFromModel(rf_c,threshold = '0.25*mean')
model_c.fit(Xc4,Yrf_priorRate_c)
Xc5 = model_c.transform(Xc4)
In [116]:
X5.shape
Out[116]:
In [117]:
Xc5.shape
Out[117]:
So we are taking the first 23 features for our full data set, and the first 17 for our central only data set. This corresponds to 18 postal codes and 12 postal codes, respectively. Looking at the individual postal codes, there is strong overlap between the two lists.
In [118]:
Xc_train, Xc_test, Yc_train, Yc_test = train_test_split(Xc5, Yrf_priorRate_c, test_size=0.25, random_state = 5)
print (Xc_train.shape)
print (Xc_test.shape)
print (Yc_train.shape)
print (Yc_test.shape)
X_train, X_test, Y_train, Y_test = train_test_split(X5, Yrf_priorRate, test_size=0.25, random_state = 5)
print (X_train.shape)
print (X_test.shape)
print (Y_train.shape)
print (Y_test.shape)
rf= RandomForestRegressor(n_estimators = 200)
rf.fit(X_train,Y_train)
r2 = r2_score(Y_test, rf.predict(X_test))
mse = np.mean((Y_test-rf.predict(X_test))**2)
print('MSE for all data: ', mse)
rf_c= RandomForestRegressor(n_estimators = 200)
rf_c.fit(Xc_train,Yc_train)
r2_c = r2_score(Yc_test, rf_c.predict(Xc_test))
mse = np.mean((Yc_test-rf_c.predict(Xc_test))**2)
print("MSE for central data: ", mse)
In [119]:
plt.scatter(Y_test,rf.predict(X_test))
plt.plot(np.arange(0,3.5),np.arange(0,3.5), label ="r^2="+str(r2), c="r")
plt.legend(loc="lower right")
plt.xlabel('Measured')
plt.ylabel('Predicted')
Out[119]:
In [120]:
plt.scatter(Yc_test,rf_c.predict(Xc_test))
plt.plot(np.arange(0,3.5),np.arange(0,3.5), label ="r^2="+str(r2_c), c="r")
plt.legend(loc="lower right")
plt.xlabel('Measured')
plt.ylabel('Predicted')
Out[120]:
Our predictive power actually decreased in the full dataset, from about $R^2 = 0.61$ to $R^2=0.6$. However there was a slight improvement for the central only data set. This indicates that we should not use the same theshold in both cases, even if it's scaled by the mean score and not a constant value.
Ultimately, we have managed to build a predictive random forest, which using only traffic numbers (and an exponentially fit bike traffic estimate), along with the road type of the busiest road, and the traffic control type, performs a reasonable estimate of the yearly collision rates (accidents per year), at intersections for cyclists in the City of Toronto.
The scores are patricularily good when we restrict ourselves to the central post codes. This is partly attributed to the fact that the bike traffic estimates were fit with a few dozen points, all in the central neighbourhoods of Toronto. Had we better bike counts throughout the whole city, the predictive power would no doubt improve.
To output a final result, we'd lke to see how our random forest performs on the entire dataset, by cross validating so that every single point is contained in the test set at some point. Up to this point we were using the random state of our training splits to compare changes to the same test sets each time. While a random forest should not be prone to overfitting, thus not requiring cross_validation in the vast majority of cases, in order to gain a final answer on how well our forest performs, we'd like to generalize it's testing as much as possible.
In [121]:
rf= RandomForestRegressor(n_estimators = 200)
prediction = cross_val_predict(rf,X5,Yrf_priorRate)
r2 = np.mean(cross_val_score(rf,X5,Yrf_priorRate))
mse = np.mean((Yrf_priorRate-prediction)**2)
print('MSE : ',mse)
print("R^2 : ", r2)
plt.scatter(Yrf_priorRate,prediction)
plt.plot(np.arange(0,4),np.arange(0,4), label ="r^2="+str(r2), c="r")
plt.legend(loc="lower right")
plt.xlabel('Measured')
plt.ylabel('Predicted')
Out[121]:
In [122]:
rf_c= RandomForestRegressor(n_estimators = 200)
prediction_c = cross_val_predict(rf_c,Xc5,Yrf_priorRate_c)
r2_c = np.mean(cross_val_score(rf_c,Xc5,Yrf_priorRate_c))
mse = np.mean((Yrf_priorRate_c-prediction_c)**2)
print('MSE : ',mse)
print("R^2 : ", r2_c)
plt.scatter(Yrf_priorRate_c,prediction_c)
plt.plot(np.arange(0,4),np.arange(0,4), label ="r^2="+str(r2_c), c="r")
plt.legend(loc="lower right")
plt.xlabel('Measured')
plt.ylabel('Predicted')
Out[122]:
While our score is lowered significantly, we still perform quite will, being able to explain over 63 percent of the variance in our data set with our very barebones model. Further improvements could be made by including distances to bike posts/parking lots, and whether or not a bike lane is present at the intersection. While this would make the project more involved, combining these extra features with more data from the years since 2010 should result in a model that is able to closely predict real life collision rates.
In [ ]:
In [ ]:
If we were to really try and ring out the best performance possible out of our forrest, the correct approach would be to run a randomized search over the various possible parameters, so as to select the forest type which performs best. One could even then re-do the search, wich narrower parameter ranges, to fine tune things furter.
The code below performs this search for the central only database. It will run for a very long time however, and the results will be updated into the notebook at a later date, since for our relatively sparsely featured and small data set, reducing the max features, max depth, or other parametes will not improve our score, or change the computational cost much. Nonetheless, it's a good exercise to carry out.
In [ ]:
import numpy as np
from time import time
from scipy.stats import randint as sp_randint
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
# build our random forest
clf = RandomForestRegressor(n_estimators=200)
# Utility function to report best scores
def report(results, n_top=3):
for i in range(1, n_top + 1):
candidates = np.flatnonzero(results['rank_test_score'] == i)
for candidate in candidates:
print("Model with rank: {0}".format(i))
print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
results['mean_test_score'][candidate],
results['std_test_score'][candidate]))
print("Parameters: {0}".format(results['params'][candidate]))
print("")
# specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
"max_features": sp_randint(1,115),
"min_samples_split": sp_randint(1, 20),
"min_samples_leaf": sp_randint(1, 20),
"bootstrap": [True, False],
"criterion": ["mse", "mae"]} ##mean squared error, and mean absolute error
# run randomized search
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
n_iter=n_iter_search)
start = time()
random_search.fit(Xc5,Yrf_priorRate_c)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
" parameter settings." % ((time() - start), n_iter_search))
report(random_search.cv_results_)