The motor vehicle collision database includes the date and time, location (borough, street names, zip code, and latitude/longitude coordinates), injuries and fatalities, the number and types of vehicles involved, and contributing factors for all collisions in New York City during 2015 and 2016.
The vehicle collision data was collected by the NYPD and published by NYC OpenData.
The second dataset used was the NYC Yellow Taxi Trips. The yellow taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.
The collision dataset was released just four months ago, so it is quite interesting to get an insight into the latest traffic collision patterns in NYC, and it contains all the features we need for the analysis.
From the second dataset, the yellow taxi trip records, we mainly used the fields capturing pick-up and drop-off dates/times and locations, in order to compare the main traffic locations with those of the accidents.
Our goal is to give the user an overview of the basic statistics of traffic collisions, e.g. the number of accidents, deaths and injuries per contributing factor, as well as the hour of the day at which accidents occur. We want to inform users about the most dangerous combinations and help them stay safer. Additionally, we would like to find out, if possible, why accidents happen, which could lead to ways of preventing them. Using machine learning techniques we also let the user estimate the most likely borough for an accident given the time, contributing factor and type of vehicle. Using clustering we computed the centroids of both the accidents and the taxi traffic for 2, 3 and 4 clusters, which are presented on our website. This way we can find out whether there are areas where accidents happen even though there is not much traffic.
The first thing we investigated is which causes of accidents occur most often, as well as the total number of incidents. The most common factor is driver inattention, with a huge gap to the runner-up cause, failure to yield.
Secondly, the number of deaths and injuries per category (PERSONS, PEDESTRIANS, CYCLISTS, MOTORISTS) was counted. The PERSONS group suffered the most deaths and injuries. What is interesting is how much smaller the gap to the second category, motorists, becomes. Another interesting fact is that while more pedestrians are killed than motorists, more motorists are injured than pedestrians. There are two reasons for this. First, it is very easy and common for a motorist to simply lose balance and fall, causing an injury and thus inflating the number of injured motorists. Second, pedestrians are very vulnerable: if they are hit by a car, the chances that they will be killed are much higher.
Thirdly, we investigated how accidents are distributed over the day, namely at what time most accidents take place.
We can see an obvious peak at around 16:00-17:00 and a smaller one at around 8:00. The reason is that more people are on the move at those times, commuting to and from work. It also makes sense that there are more accidents at 16:00, as people are tired and anxious to get back home. There is also a peak at 13:00 that we cannot really explain, which is quite interesting.
If we compare this with the corresponding graph of the taxi traffic, we can see the same peak in the morning, but in the evening the traffic peak seems to happen a bit later and to last longer. This suggests that while there is a correlation, it is not a strict one; a quick numerical check follows the taxi plot below.
In [2]:
# Importing the necessary libraries for the analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import csv
import random
import json
import warnings
from collections import Counter
from IPython.display import Image

import geoplotlib
from geoplotlib.utils import BoundingBox
from geoplotlib.colors import ColorMap

import pydot
from sklearn import tree
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.externals.six import StringIO
from sklearn.cross_validation import train_test_split # deprecated module, matching the sklearn version used here

warnings.filterwarnings("ignore", category=DeprecationWarning)
%matplotlib inline

# Reading the collision csv into a dataframe.
data = pd.read_csv("database.csv")
In [3]:
# Database Column names
print data.columns
In [4]:
# Calculating the number of incidents.
dimensions=data.shape
print "The number of incidents are:",dimensions[0]
In [5]:
# Count the incidents per factor of accident.
FactorData=data.loc[(data['VEHICLE 1 FACTOR'] !="UNSPECIFIED")] # removing unspecified factors from data.
FactorData=FactorData['VEHICLE 1 FACTOR'].value_counts()
In [6]:
# Plotting the factors of accidents and the number of incidents per factor.
fig=plt.figure(figsize=(12,7))
plt.suptitle('Factor of accident occurrences',fontsize=20)
plt.bar(range(len(FactorData)), FactorData, align='center', color="green", alpha=0.5)
plt.xticks(range(len(FactorData)), FactorData.index, rotation='vertical')
plt.margins(0.01) # tighten the margins around the bars
plt.ylabel("Accidents Count")
plt.xlabel("Factor of Accident")
plt.savefig('foo.png',bbox_inches='tight')
plt.show()
In [7]:
# Counting the number of deaths and injuries per category.
Fatality=data.filter(items=['PERSONS INJURED', 'PERSONS KILLED','PEDESTRIANS INJURED','PEDESTRIANS KILLED', 'CYCLISTS INJURED',
'CYCLISTS KILLED','MOTORISTS INJURED','MOTORISTS KILLED']).sum()
print Fatality.sort_values()
f=Fatality.sort_values()
f=f.tolist() # converting the Series to a plain list
In [8]:
# Plotting the results.
plt.figure(figsize=(8,4))
plt.bar(range(4),f[0:4], align='center', color="green", alpha=0.5)
plt.suptitle('Number of deaths',fontsize=18)
plt.xticks(range(4), ['CYCLISTS KILLED', 'MOTORISTS KILLED','PEDESTRIANS KILLED','PERSONS KILLED'],rotation='vertical')
plt.margins(0.01) # tighten the margins around the bars
plt.ylabel("Total Number")
plt.show()
In [41]:
# Plotting the results.
plt.figure(figsize=(8,4))
plt.bar(range(4),f[4:], align='center', color="green", alpha=0.5)
plt.suptitle('Number of injured',fontsize=18)
plt.xticks(range(4), ['CYCLISTS INJURED','PEDESTRIANS INJURED','MOTORISTS INJURED','PERSONS INJURED'],rotation='vertical')
plt.margins(0.01) # tighten the margins around the bars
plt.ylabel("Total Number")
plt.show()
In [98]:
# A helper that extracts the hour from a "HH:MM" time string.
def getHour(s):
    return int(s.split(':')[0])

# Now create a new column named Hour holding the hour of each accident.
data['Hour'] = data['TIME'].apply(getHour)
In [99]:
#create a dataframe with the accident occurrences per hour
dfHours = data.filter(items=['Hour'])
dfHours=dfHours.apply(pd.value_counts)
dfHours.sort_index()
Out[99]:
In [100]:
# Plotting the accidents count per Hour of the day.
fig=plt.figure(figsize=(12,7))
plt.suptitle('Accident occurrences during the day ',fontsize=20)
plt.bar(dfHours.index, dfHours["Hour"], align='center', color="green", alpha=0.5)
plt.xticks(dfHours.index,dfHours.index)
plt.margins(0.01) # tighten the margins around the bars
plt.ylabel("Accidents Count")
plt.xlabel("Hours")
plt.show()
In [75]:
dataJanuary=pd.read_csv("1january.csv") #reading the taxi csv into a dataframe
In [76]:
# Filtering the taxi data: keeping trips with valid NYC coordinates.
# Note: `series is not None` is always True, so notnull() is used instead.
cleanJanuary=dataJanuary.loc[(dataJanuary['pickup_longitude'] < -72) &
                             (dataJanuary['pickup_longitude'] > -75) &
                             (dataJanuary['pickup_longitude'].notnull()) &
                             (dataJanuary['pickup_latitude'].notnull()) &
                             (dataJanuary['pickup_latitude'] != 0.0) &
                             (dataJanuary['dropoff_longitude'] < -72) &
                             (dataJanuary['dropoff_longitude'] > -75) &
                             (dataJanuary['dropoff_longitude'].notnull()) &
                             (dataJanuary['dropoff_latitude'].notnull()) &
                             (dataJanuary['dropoff_latitude'] != 0.0)].copy() # copy so we can add columns later without warnings
In [101]:
# A helper that extracts the hour from a "YYYY-MM-DD HH:MM:SS" timestamp.
def getHourTaxi(s):
    return int((s.split(':')[0]).split(' ')[1])

# Now create a new column named Hour holding the pick-up hour of each trip.
cleanJanuary['Hour'] = cleanJanuary['pickup_datetime'].apply(getHourTaxi)
# Create a dataframe with the taxi pick-up occurrences per hour.
dfHoursTaxi = cleanJanuary.filter(items=['Hour'])
dfHoursTaxi = dfHoursTaxi.apply(pd.value_counts)
dfHoursTaxi.sort_index()
Out[101]:
In [102]:
# Plotting the taxi traffic count per Hour of the day.
fig=plt.figure(figsize=(12,7))
plt.suptitle('Taxi pick up occurrences during the day ',fontsize=20)
plt.bar(dfHoursTaxi.index, dfHoursTaxi["Hour"], align='center', color="green", alpha=0.5)
plt.xticks(dfHoursTaxi.index,dfHoursTaxi.index)
plt.margins(0.01) # tighten the margins around the bars
plt.ylabel("Pick ups Count")
plt.xlabel("Hours")
plt.show()
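To quantify the comparison made above between the two hourly curves, the short sketch below is our own addition; it assumes the dfHours and dfHoursTaxi tables from the previous cells cover hours 0-23, and computes the Pearson correlation between accidents and taxi pick-ups per hour.
In [ ]:
# Sketch: correlate accidents per hour with taxi pick-ups per hour.
# Assumes dfHours and dfHoursTaxi from the cells above, both indexed by hour.
hourly = pd.concat([dfHours.sort_index()["Hour"], dfHoursTaxi.sort_index()["Hour"]], axis=1)
hourly.columns = ["accidents", "pickups"]
print "Pearson correlation of the two hourly curves:", hourly["accidents"].corr(hourly["pickups"])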
In this section we created six different maps depicting the coordinates where serious accidents happen: the deaths and injuries for each category (PEDESTRIANS, CYCLISTS, MOTORISTS). What we would like to see is whether they lie relatively close together. If they do not, different areas are dangerous for different groups of people; if they do, there are areas that are dangerous in general.
As we can see from the maps below, the centers of pedestrian and cyclist injuries coincide, in the center of the city, while the center of motorist injuries does not. That is because more people cycle and walk in the center, where the distances are shorter.
We can also observe which roads are dangerous to be on.
Furthermore, if we compare the map of all accidents with that of the traffic, we observe that while the city center is very active in both maps, there are many other parts of the map that are far more active when it comes to accidents. Thus, while there is a correlation between traffic and accidents, some areas seem to be dangerous for no apparent reason. This could be due to speeding, hard turns or unsignalized intersections. A rough numerical check of this claim is sketched after the traffic map below.
In [50]:
# Cleaning the data for plotting: removing outliers and NaN coordinates.
# Note: `series is not None` is always True, so notnull() is used instead.
dataForGeo=data.loc[(data['LONGITUDE'] < -72) &
                    (data['LONGITUDE'] > -75) &
                    (data['LONGITUDE'].notnull()) &
                    (data['LATITUDE'].notnull()) &
                    (data['LATITUDE'] != 0.0)]
In [51]:
east=max(dataForGeo["LONGITUDE"])
west=min(dataForGeo["LONGITUDE"])
north=max(dataForGeo["LATITUDE"]) # north is the maximum latitude
south=min(dataForGeo["LATITUDE"])
In [52]:
geo_data_for_plotting = {"lat": dataForGeo["LATITUDE"],
"lon": dataForGeo["LONGITUDE"]}
In [53]:
# Map of all incidents
geoplotlib.kde(geo_data_for_plotting,1)
bbox = BoundingBox(north=north, west=west, south=south, east=east)
geoplotlib.set_bbox(bbox)
geoplotlib.tiles_provider('toner-lite')
geoplotlib.inline()
geoplotlib.show()
In [165]:
geo_data_for_taxi = {"lat": list(cleanJanuary["pickup_latitude"])+list( cleanJanuary["dropoff_latitude"]) ,
"lon": list(cleanJanuary["pickup_longitude"])+list( cleanJanuary["dropoff_longitude"])}
In [166]:
# Map of taxi traffic
geoplotlib.kde(geo_data_for_taxi,1)
bbox = BoundingBox(north=north, west=west, south=south, east=east)
geoplotlib.set_bbox(bbox)
geoplotlib.tiles_provider('toner-lite')
geoplotlib.inline()
geoplotlib.show()
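As a rough numerical version of the map comparison described earlier, the sketch below is our own addition (assuming the cleaned dataForGeo frame and the geo_data_for_taxi dictionary from the previous cells); it bins both datasets on a coarse latitude/longitude grid and looks for cells with many accidents relative to taxi traffic.
In [ ]:
# Sketch: accident density vs. taxi-traffic density on a coarse grid.
# Cells with many accidents but little traffic hint at spots that are
# dangerous beyond what exposure alone would explain.
lat_edges = np.linspace(dataForGeo["LATITUDE"].min(), dataForGeo["LATITUDE"].max(), 51)
lon_edges = np.linspace(dataForGeo["LONGITUDE"].min(), dataForGeo["LONGITUDE"].max(), 51)
acc, _, _ = np.histogram2d(dataForGeo["LATITUDE"], dataForGeo["LONGITUDE"],
                           bins=[lat_edges, lon_edges])
taxi, _, _ = np.histogram2d(geo_data_for_taxi["lat"], geo_data_for_taxi["lon"],
                            bins=[lat_edges, lon_edges])
ratio = np.where(taxi > 100, acc / (taxi + 1.0), 0.0) # ignore nearly empty cells
iy, ix = np.unravel_index(ratio.argmax(), ratio.shape)
print "Highest accident-to-traffic cell is around:", lat_edges[iy], lon_edges[ix]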
In [54]:
dataforgeokilled=dataForGeo.loc[(dataForGeo['PEDESTRIANS KILLED'] > 0)]#at least 1 killed
geo_data_for_plotting1 = {"lat": dataforgeokilled["LATITUDE"],
"lon": dataforgeokilled["LONGITUDE"]}
In [55]:
# Map of PEDESTRIANS KILLED
geoplotlib.kde(geo_data_for_plotting1,1)
bbox = BoundingBox(north=north, west=west, south=south, east=east)
geoplotlib.set_bbox(bbox)
geoplotlib.tiles_provider('toner-lite')
geoplotlib.inline()
geoplotlib.show()
In [56]:
dataforgeoPedestriansInjured=dataForGeo.loc[(dataForGeo['PEDESTRIANS INJURED'] > 0)]#at least 1 injured
In [57]:
geo_data_for_plotting2 = {"lat": dataforgeoPedestriansInjured["LATITUDE"],
"lon": dataforgeoPedestriansInjured["LONGITUDE"]}
In [58]:
#Map of INJURED PEDESTRIANS
geoplotlib.kde(geo_data_for_plotting2,1)
bbox = BoundingBox(north=north, west=west, south=south, east=east)
geoplotlib.set_bbox(bbox)
geoplotlib.tiles_provider('toner-lite')
geoplotlib.inline()
geoplotlib.show()
In [59]:
dataforCycklistskilled=dataForGeo.loc[(dataForGeo['CYCLISTS KILLED'] > 0)]#at least 1 killed
geo_data_for_plotting3 = {"lat": dataforCycklistskilled["LATITUDE"],
"lon": dataforCycklistskilled["LONGITUDE"]}
In [60]:
#Map of CYCLISTS KILLED
geoplotlib.kde(geo_data_for_plotting3,1)
bbox = BoundingBox(north=north, west=west, south=south, east=east)
geoplotlib.set_bbox(bbox)
geoplotlib.tiles_provider('toner-lite')
geoplotlib.inline()
geoplotlib.show()
In [63]:
dataforCycklistsinjured=dataForGeo.loc[(dataForGeo['CYCLISTS INJURED'] > 0)]#at least 1 injured
geo_data_for_plotting4 = {"lat": dataforCycklistsinjured["LATITUDE"],
"lon": dataforCycklistsinjured["LONGITUDE"]}
In [64]:
# Map of CYCLISTS INJURED
geoplotlib.kde(geo_data_for_plotting4,1, cmap="hot")
bbox = BoundingBox(north=north, west=west, south=south, east=east)
geoplotlib.set_bbox(bbox)
geoplotlib.tiles_provider('toner-lite')
geoplotlib.inline()
geoplotlib.show()
In [65]:
dataformotoristsinjured=dataForGeo.loc[(dataForGeo['MOTORISTS INJURED'] > 0)]#at least 1 injured
geo_data_for_plotting5 = {"lat": dataformotoristsinjured["LATITUDE"],
"lon": dataformotoristsinjured["LONGITUDE"]}
In [66]:
#Map of MOTORISTS INJURED
geoplotlib.kde(geo_data_for_plotting5,1)
bbox = BoundingBox(north=north, west=west, south=south, east=east)
geoplotlib.set_bbox(bbox)
geoplotlib.tiles_provider('toner-lite')
geoplotlib.inline()
geoplotlib.show()
In [67]:
dataformotoristskilled=dataForGeo.loc[(dataForGeo['MOTORISTS KILLED'] > 0)]#at least 1 killed
geo_data_for_plotting6 = {"lat": dataformotoristskilled["LATITUDE"],
"lon": dataformotoristskilled["LONGITUDE"]}
In [68]:
# Map of MOTORISTS KILLED
geoplotlib.kde(geo_data_for_plotting6,1)
bbox = BoundingBox(north=north, west=west, south=south, east=east)
geoplotlib.set_bbox(bbox)
geoplotlib.tiles_provider('toner-lite')
geoplotlib.inline()
geoplotlib.show()
We are using two machine learning techniques: decision trees and k-means clustering.
We use decision tree classification to predict where accidents may occur. We believe this is very important, as people will try to avoid areas that might be dangerous for them, leading to safer lives.
Furthermore, with k-means clustering we try to identify the centers of the traffic accidents as well as the centers of the taxi traffic. That will enable us to gain insight into the areas where accidents happen versus where most of the traffic is observed, so we can tell whether there is a strong correlation between the two. If there is none, we can conclude that some areas are simply more dangerous to travel within; the two sets of centroids are also compared numerically after the clustering cells below.
The features we initially fed to the decision tree model were the time and the contributing factor of the accident. The data were randomly split into 90% for training and 10% for testing; cross-validation was not used.
The success ratio was 35.26%. Next we added one more feature, the type of vehicle, to see whether the model's prediction rate would improve. It increased to 37.69%. We had expected that with three features the model accuracy would be higher; the modest improvement suggests that vehicle type is not very indicative. For context, a majority-class baseline is sketched after the accuracy cells below.
In [142]:
# Filtering the data: dropping missing boroughs, unspecified factors and unknown vehicle types.
dataForTree=data.loc[(data['BOROUGH'] >"")&
                     (data['VEHICLE 1 FACTOR'] >"")&
                     (data['VEHICLE 1 FACTOR'] !="UNSPECIFIED")&
                     (data['VEHICLE 1 TYPE'] !="UNKNOWN")&
                     (data['VEHICLE 1 TYPE'] >"")&
                     (data['Hour'] >-1)&
                     (data['LATITUDE'] != 0.0)]
In [143]:
boroughs=dataForTree.filter(items=['BOROUGH'])
hours=dataForTree.filter(items=['Hour'])
factors=dataForTree.filter(items=['VEHICLE 1 FACTOR'])
vehicle=dataForTree.filter(items=['VEHICLE 1 TYPE'])
# Transforming dataframes to 1-d arrays
boroughs=np.ravel(boroughs)
hours=np.ravel(hours)
factors=np.ravel(factors)
vehicle=np.ravel(vehicle)
In [144]:
#Encoding the labels because the decision tree library does not accept strings
le=preprocessing.LabelEncoder()
le.fit(boroughs)
EnBoroughs=le.transform(boroughs)
In [145]:
#Encoding the labels because the decision tree library does not accept strings
le=preprocessing.LabelEncoder()
le.fit(factors)
EnFactor=le.transform(factors)
In [146]:
#Encoding the labels because the decision tree library does not accept strings
le=preprocessing.LabelEncoder()
le.fit(vehicle)
EnVehicle=le.transform(vehicle)
In [147]:
#Stack the arrays of hours and encoded Factor.
EnHoFa=np.stack((hours, EnFactor), axis=-1)
In [148]:
# Creating the test and training sets, 90% as training.
X_train, X_test, y_train, y_test = train_test_split(EnHoFa,EnBoroughs, test_size=0.1, random_state=42)
In [149]:
#Training the model
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)
In [150]:
results=clf.score(X_test,y_test)
print "Percentage of correct predictions is:", str(results*100)[:5],"%"
In [151]:
#Stack the arrays of hours, encoded Factor and Encoded Vehicle.
EnHoFaVe=np.stack((hours, EnFactor,EnVehicle), axis=-1)
In [152]:
# Creating the test and training sets, 90% as training.
X_train, X_test, y_train, y_test = train_test_split(EnHoFaVe,EnBoroughs, test_size=0.1, random_state=42)
In [153]:
#Training the model
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)
In [154]:
results=clf.score(X_test,y_test)
print "Percentage of correct predictions is:", str(results*100)[:5],"%"
In [155]:
# Create a mapping between the original labels and their encodings and pretty-print it.
import operator
from prettytable import PrettyTable
MapBorough={}
MapFactor={}
MapVehicle={}
for i,item in enumerate(boroughs):
    MapBorough[item]=EnBoroughs[i]
for i,item in enumerate(factors):
    MapFactor[item]=EnFactor[i]
for i,item in enumerate(vehicle):
    MapVehicle[item]=EnVehicle[i]
sorted_MapBorough = sorted(MapBorough.items(), key=operator.itemgetter(1))
sorted_MapFactor = sorted(MapFactor.items(), key=operator.itemgetter(1))
sorted_MapVehicle = sorted(MapVehicle.items(), key=operator.itemgetter(1))
x = PrettyTable()
x.field_names = ["Borough","Encoding"]
for i in sorted_MapBorough:
    x.add_row([i[0],i[1]])
x1= PrettyTable()
x1.field_names = ["Factor","Encoding"]
for i in sorted_MapFactor:
    x1.add_row([i[0],i[1]])
x3= PrettyTable()
x3.field_names = ["Vehicle Type","Encoding"]
for i in sorted_MapVehicle:
    x3.add_row([i[0],i[1]])
print x
print x1
print x3
In [156]:
incident=[16,9,8] # at 16:00, factor = driver's inexperience, vehicle = passenger vehicle
District=clf.predict_proba([incident]) # predict_proba expects a 2-d array of samples
SortDistrict=sorted(District[0])
print SortDistrict # sort the probabilities to find the three largest ones
print District[0] # match the positions with the class encodings below to find the boroughs
print clf.classes_
The three most likely boroughs are MANHATTAN with 31.9%, QUEENS with 28.2% and BROOKLYN with 22.7%.
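Reading the probabilities off the printed arrays by hand is error-prone; the helper below is a small sketch of our own (assuming the boroughs array and the trained clf from the cells above) that decodes the top predictions directly into borough names.
In [ ]:
# Sketch: decode predict_proba output straight into borough names.
leBorough = preprocessing.LabelEncoder().fit(boroughs)
def topBoroughs(clf, sample, k=3):
    probs = clf.predict_proba([sample])[0]
    names = leBorough.inverse_transform(clf.classes_)
    return sorted(zip(names, probs), key=lambda t: t[1], reverse=True)[:k]

print topBoroughs(clf, [16, 9, 8])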
In [158]:
#Plotting the decision tree (truncated at depth 3)
dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data,max_depth = 3)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
Out[158]:
In [161]:
DataKmeans=dataForGeo.filter(items=['LATITUDE', 'LONGITUDE'])#keeping only the coordinate columns
DataKmeans.head()
Out[161]:
In [162]:
DataKmeansArray=np.array(DataKmeans)#turn into an array
DataKmeansArray
Out[162]:
In [164]:
# Plot how the points are distributed amongst the clusters, for 2, 3 and 4 clusters.
for i in range(2,5):
    kmeans = KMeans(n_clusters=i, random_state=0).fit_predict(DataKmeansArray)
    plt.scatter(DataKmeansArray[:, 0], DataKmeansArray[:, 1], c=kmeans)
    plt.show()
In [163]:
# Calculating the coordinates of the accident cluster centroids for 2, 3 and 4 clusters.
for i in range(2,5):
    kmeans2 = KMeans(n_clusters=i, random_state=0).fit(DataKmeansArray)
    print kmeans2.cluster_centers_
In [167]:
#Keeping both pick up and drop off coordinates as indicators of traffic
KMeansTaxiPickup=cleanJanuary.filter(items=['pickup_latitude', 'pickup_longitude'])
KMeansTaxiDropOff=cleanJanuary.filter(items=['dropoff_latitude', 'dropoff_longitude'])
In [168]:
#rename the columns and concat the coordinates data
KMeansTaxiDropOff=KMeansTaxiDropOff.rename(columns = {"dropoff_latitude":'pickup_latitude',"dropoff_longitude":"pickup_longitude"})
KMeansTaxi= pd.concat([KMeansTaxiPickup,KMeansTaxiDropOff],axis=0)
In [169]:
KMeansTaxiArray=np.array(KMeansTaxi) #turn into array
In [170]:
#KMeans expects a 2-d array of coordinate pairs
KMeansTaxiPickupArray=np.array(KMeansTaxiPickup)
KMeansTaxiPickupArray
Out[170]:
In [172]:
# Finding the traffic centroids for 2, 3 and 4 clusters.
for i in range(2,5):
    kmeans4 = KMeans(n_clusters=i, random_state=0).fit(KMeansTaxiArray)
    print kmeans4.cluster_centers_
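To compare the two sets of centroids numerically rather than by eye, the sketch below is our own addition; it assumes kmeans2 and kmeans4 still hold the 4-cluster fits from the loops above, and pairs each accident centroid with its nearest traffic centroid.
In [ ]:
# Sketch: distance from each accident centroid to the nearest traffic centroid.
# Small gaps suggest accidents simply follow traffic; large gaps point at
# areas that are dangerous independently of traffic volume.
for c in kmeans2.cluster_centers_:
    dists = np.sqrt(((kmeans4.cluster_centers_ - c) ** 2).sum(axis=1))
    print c, "-> nearest traffic centroid at", round(dists.min(), 4), "degrees"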
In [ ]:
# If one wants to run this, BE AWARE: HUGE DATASET, it will take time.
# See to which cluster each point belongs, for 2, 3 and 4 clusters.
#for i in range(2,5):
#    kmeansTraffic2 = KMeans(n_clusters=i, random_state=0).fit_predict(KMeansTaxiPickupArray)
#    plt.scatter(KMeansTaxiPickupArray[:, 0], KMeansTaxiPickupArray[:, 1], c=kmeansTraffic2)
#    plt.show()
As far as the visualizations are concerned, we chose an interactive d3 scatterplot that gives the user an overview of the factors of accidents. The two axes correspond to the total number of accidents and the total number of deaths or injuries. When the user hovers over a specific factor, more detailed information appears: the borough with the most deaths and injuries, and their exact count in that borough.
The second visualization is a geoplot of the NY map where the user can see the centroids of the clustering, both for accidents and for traffic. The user can toggle between the different numbers of clusters by clicking or hovering over the predefined buttons.
The scatter plot lets the user see the relationship between the number of accidents and the number of deaths and injuries. For example, if a factor has a large number of deaths and injuries but few accidents in total, that factor is very threatening. On the other hand, if a factor has a handful of deaths and injuries despite many occurrences, it is not very dangerous per incident. This per-accident ratio is also computed in a sketch after the data-preparation cells below.
The purpose of the geoplot is to give the user the ability to directly compare the centroids of traffic and accidents, so the user can tell whether there is any correlation between traffic and accidents.
In [88]:
# Retrieving data for scatter Plot
# removing unspecified factors from data and NaN values from boroughs.
DataScatter=data.loc[(data['VEHICLE 1 FACTOR'] !="UNSPECIFIED")&
(data['BOROUGH'] >"")]
DataScatter1=DataScatter.filter(items=['VEHICLE 1 FACTOR','BOROUGH','PERSONS INJURED', 'PERSONS KILLED','PEDESTRIANS INJURED','PEDESTRIANS KILLED', 'CYCLISTS INJURED',
'CYCLISTS KILLED','MOTORISTS INJURED','MOTORISTS KILLED'])
In [89]:
# A helper that returns the sum of all injuries and deaths in a row.
def getTotalInjuries(injuries):
    return sum(injuries)

# Adding a new Total column with the sum of all injuries and deaths.
DataScatter1['Total'] = DataScatter1[['PERSONS INJURED', 'PERSONS KILLED','PEDESTRIANS INJURED','PEDESTRIANS KILLED', 'CYCLISTS INJURED',
                                      'CYCLISTS KILLED','MOTORISTS INJURED','MOTORISTS KILLED']].apply(getTotalInjuries, axis=1)
# Keep the three columns we need.
DataScatter2=DataScatter1.filter(items=["VEHICLE 1 FACTOR",'BOROUGH',"Total"])
# Group the dataframe: total deaths and injuries per factor and borough, and per factor overall.
DataScatter3=DataScatter2.groupby(["VEHICLE 1 FACTOR","BOROUGH"])["Total"].sum().reset_index(name="total").sort_values(by="total",ascending=0)
DataScatter4=DataScatter2.groupby(["VEHICLE 1 FACTOR"])["Total"].sum().reset_index(name="total").sort_values(by="total",ascending=0)
In [90]:
# For each factor, keep the borough with the most deaths and injuries
# (DataScatter3 is sorted, so drop_duplicates keeps the top borough per factor).
# Resetting the index and renaming the columns, getting it ready for concat.
BoroughTotal=DataScatter3.drop_duplicates("VEHICLE 1 FACTOR")
BoroughTotal=BoroughTotal.reset_index(drop=True)
BoroughTotal=BoroughTotal.sort_values("VEHICLE 1 FACTOR").reset_index(drop=True)
BoroughTotal.rename(columns={"VEHICLE 1 FACTOR":"Vehicle factor","total":"boroughTotal"},inplace=True)
In [91]:
# Cleaning the dataframe with the total deaths and injuries per factor and
# building the accident counts per factor; resetting the indices and
# renaming the columns, getting them ready for concat.
DataScatter4=DataScatter4.reset_index(drop=True) #reseting index
DataScatter4=DataScatter4.sort_values("VEHICLE 1 FACTOR").reset_index(drop=True)
TotalIncidents=DataScatter1['VEHICLE 1 FACTOR'].value_counts().reset_index()
TotalIncidents=pd.DataFrame(TotalIncidents) # Series to dataframe
TotalIncidents.rename(index=str, columns={"index": "VEHICLE FACTOR", "VEHICLE 1 FACTOR": "NumberAccidents"},inplace=True)
TotalIncidents=TotalIncidents.sort_values("VEHICLE FACTOR").reset_index(drop=True)
In [92]:
#Concatenating the dataframes so we end up with one containing: the vehicle factor,
#total incidents, total deaths & injuries, the borough with the most deaths & injuries,
#and the number of deaths & injuries in that borough.
#Exporting the dataframe to a csv file.
frames=[DataScatter4,TotalIncidents,BoroughTotal]
FinalDataScatter=pd.concat(frames,axis=1)
FinalDataScatter.drop("VEHICLE FACTOR",1, inplace=True)
FinalDataScatter.drop("Vehicle factor",1, inplace=True)
FinalDataScatter.rename(columns={"total":"totalDeathsAndInjuries"},inplace=True)
FinalDataScatter=FinalDataScatter.sort_values("NumberAccidents",ascending=0).reset_index(drop=True)
FinalDataScatter.to_csv('ScatterData.csv')
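The "how threatening is a factor" reading of the scatter plot can also be computed directly; the sketch below is our own addition, using the FinalDataScatter frame built above to rank factors by deaths and injuries per recorded accident.
In [ ]:
# Sketch: deaths and injuries per recorded accident, per factor. A high
# ratio combined with few accidents marks an especially threatening factor.
severity = FinalDataScatter.copy()
severity["perAccident"] = severity["totalDeathsAndInjuries"] / severity["NumberAccidents"].astype(float)
print severity.sort_values("perAccident", ascending=0).head(10)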
The exploration of the data was quite successful. We were able to better understand why accidents occur, how they are spread over the day and across New York, relate accidents to traffic, count the number of accidents, and measure how threatening each factor is.
While we gained good insight into why accidents happen, we still lack data. It would be useful to have a dataset describing the roads of New York: where the intersections and traffic lights are, where people speed, and more. That way we could make more suggestions on how to prevent accidents.
Another thing that could be improved is the decision tree predictions. We would like to feed the model more features in order to obtain better results.
Finally, it would also be interesting to process the taxi data for more months, which would give us a better understanding of the traffic centers.