1. Motivation

The Datasets

The motor vehicle collision database includes the date and time, location (as borough, street names, zip code and latitude and longitude coordinates), injuries and fatalities, vehicle number and types, and related factors for all collisions in New York City during 2015 and 2016.
The vehicle collision data was collected by the NYPD and published by NYC OpenData.

The second dataset used was the NYC Yellow Taxi Trips. The yellow taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.

Why this dataset?

The dataset was released only four months ago, so it is quite interesting to get an insight into the latest traffic collision patterns in N.Y., and it contains all the features required for our analysis.

Furthermore, we included a second dataset, the yellow taxi trip records. We used the fields capturing pick-up and drop-off dates/times and locations in order to compare the main traffic locations with those of the accidents.

What was your goal for the end user's experience?

Our goal is to give the user an overview of the basic statistics of traffic collisions, e.g. the number of accidents, deaths and injuries per contributing factor, as well as the hour of the day at which accidents occur. We want to inform users about the most dangerous combinations and help them stay safer. Additionally, we would like to find out, where possible, why accidents happen, which can point to ways of preventing them. Using machine learning techniques, we also let the user estimate the most likely borough for an accident given the time, contributing factor and vehicle type. Using clustering, we computed the centroids of the accidents and of the taxi traffic for 2, 3 and 4 clusters respectively, which are presented on our website. Comparing the two lets us find out whether there are areas where accidents happen even though there is not much traffic.
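The clustering step described above can be sketched as follows; this is a minimal illustration, with synthetic coordinates standing in for the real accident and taxi locations:

```python
# Sketch: k-means centroids for k = 2, 3 and 4, as used on the website.
# The coordinates below are synthetic stand-ins for the real data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
coords = np.column_stack([
    40.7 + 0.1 * rng.randn(200),   # latitudes around NYC
    -74.0 + 0.1 * rng.randn(200),  # longitudes around NYC
])

# Fit one model per cluster count; cluster_centers_ holds the centroids.
centroids = {}
for k in (2, 3, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(coords)
    centroids[k] = km.cluster_centers_

print(centroids[3])  # three (lat, lon) centers
```

The same procedure is run once on the accident coordinates and once on the taxi pick-up coordinates, and the resulting centroids are compared.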

2. Basic stats. Let's understand the dataset better

Write about your choices in data cleaning and preprocessing

The data cleaning procedure included removing outliers, null values and zero values. Outliers would raise errors when plotting the maps, while null and zero values would make our results incorrect or misleading.
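A minimal sketch of this cleaning logic, on a made-up dataframe with the collision data's LATITUDE/LONGITUDE columns; note that pandas needs element-wise `.notnull()` here, since `series is not None` evaluates once for the whole Series:

```python
# Sketch of the cleaning step: drop NaN, zero and out-of-range coordinates.
# The dataframe below is made up for illustration.
import pandas as pd

df = pd.DataFrame({
    "LATITUDE":  [40.71, None, 0.0, 40.85, 40.62],
    "LONGITUDE": [-74.0, -73.9, 0.0, -73.95, -60.0],  # last row is an outlier
})

clean = df.loc[df["LONGITUDE"].notnull() &
               df["LATITUDE"].notnull() &
               (df["LATITUDE"] != 0.0) &
               (df["LONGITUDE"] > -75) &
               (df["LONGITUDE"] < -72)]
print(len(clean))  # 2 rows survive
```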

Write a short section that discusses the dataset stats

The first thing we investigated is which causes of accidents occur most often, as well as the total number of incidents. The most common factor is driver inattention, by a huge margin over the runner-up cause, failure to yield.
Secondly, the number of deaths and injuries per category (PERSONS, PEDESTRIANS, CYCLISTS, MOTORISTS) was counted. The persons group suffered the most deaths and injuries. What is interesting here is how much smaller the difference to the second category, motorists, becomes. Another interesting fact is that while more pedestrians are killed than motorists, more motorists are injured than pedestrians. There are two reasons for this. First, it is very easy and common for a motorist to lose balance and fall, causing an injury and thus increasing the number of injured motorists. Second, pedestrians are very vulnerable: if they are hit by a car, the chance that they are killed is much higher.
Thirdly, we investigated how the number of accidents is distributed over the day, namely at what time most accidents take place. There is an obvious peak at around 16:00-17:00 and a smaller one at around 8:00. The reason is that more people are on the move at those times, commuting to and from work. It also makes sense that there are more accidents at 16:00, as people are tired and anxious to get back home. There is an interesting peak at 13:00 that we cannot really explain. If we compare this with the corresponding graph of the taxi traffic, we can see the same peak in the morning, but in the evening the traffic peak seems to happen a bit later and to last longer. This means that while there is a correlation, it is not a strict one.
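The loose correlation described above can be quantified. The sketch below uses illustrative hourly curves, not the real counts, with peaks placed roughly where the plots show them:

```python
# Sketch: correlating two hourly series (accidents vs. taxi pick-ups).
# The curves are illustrative shapes, not the real data: accidents peak
# around 16-17h, taxi pick-ups a bit later and more broadly.
import numpy as np

hours = np.arange(24)
accidents = 15000 + 20000 * np.exp(-((hours - 16.5) ** 2) / 20.0)
taxis = 300000 + 350000 * np.exp(-((hours - 18.5) ** 2) / 30.0)

# Pearson correlation of the two hourly series.
r = np.corrcoef(accidents, taxis)[0, 1]
print(round(r, 2))  # strongly positive, but clearly below 1
```

With the real series, a coefficient well below 1 would support the observation that traffic and accidents are correlated but not strictly tied.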


In [2]:
# Importing the necessary libraries for the analysis.
import csv
import json
import random
import warnings
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import geoplotlib
from geoplotlib.utils import BoundingBox
from geoplotlib.colors import ColorMap
import pydot
from IPython.display import Image
from sklearn import preprocessing, tree
from sklearn.cluster import KMeans
from sklearn.externals.six import StringIO
from sklearn.model_selection import train_test_split  # replaces the deprecated sklearn.cross_validation

warnings.filterwarnings("ignore", category=DeprecationWarning)
%matplotlib inline

data = pd.read_csv("database.csv")  # read the collision csv into a dataframe

In [3]:
# Database Column names
print data.columns


Index([u'UNIQUE KEY', u'DATE', u'TIME', u'BOROUGH', u'ZIP CODE', u'LATITUDE',
       u'LONGITUDE', u'LOCATION', u'ON STREET NAME', u'CROSS STREET NAME',
       u'OFF STREET NAME', u'PERSONS INJURED', u'PERSONS KILLED',
       u'PEDESTRIANS INJURED', u'PEDESTRIANS KILLED', u'CYCLISTS INJURED',
       u'CYCLISTS KILLED', u'MOTORISTS INJURED', u'MOTORISTS KILLED',
       u'VEHICLE 1 TYPE', u'VEHICLE 2 TYPE', u'VEHICLE 3 TYPE',
       u'VEHICLE 4 TYPE', u'VEHICLE 5 TYPE', u'VEHICLE 1 FACTOR',
       u'VEHICLE 2 FACTOR', u'VEHICLE 3 FACTOR', u'VEHICLE 4 FACTOR',
       u'VEHICLE 5 FACTOR'],
      dtype='object')

In [4]:
# Calculating the number of incidents.
dimensions=data.shape
print "The number of incidents is:",dimensions[0]


The number of incidents is: 477732

In [5]:
# Count the incidents per factor of accident.
FactorData=data.loc[(data['VEHICLE 1 FACTOR'] !="UNSPECIFIED")] # removing unspecified factors from data.
FactorData=FactorData['VEHICLE 1 FACTOR'].value_counts()

In [6]:
# plotting the factor of accidents and the number of incidents.
fig=plt.figure(figsize=(12,7))
plt.suptitle('Factor of accident occurrences ',fontsize=20)
plt.bar(range(len(FactorData)), FactorData, align='center', color="green", alpha=0.5)
plt.xticks(range(len(FactorData)), FactorData.index, rotation='vertical')
plt.margins(0.01) # tighten the x-margins so the bars fill the axes
plt.ylabel("Accidents Count")
plt.xlabel("Factor of Accident")
plt.savefig('foo.png',bbox_inches='tight')
plt.show()



In [7]:
# Counting the number of deaths and injuries per category. 
Fatality=data.filter(items=['PERSONS INJURED', 'PERSONS KILLED','PEDESTRIANS INJURED','PEDESTRIANS KILLED', 'CYCLISTS INJURED',
       'CYCLISTS KILLED','MOTORISTS INJURED','MOTORISTS KILLED']).sum()
print Fatality.sort_values()
f=Fatality.sort_values()
f=f.tolist() # creating a list from the np ndarray


CYCLISTS KILLED            37
MOTORISTS KILLED          180
PEDESTRIANS KILLED        296
PERSONS KILLED            506
CYCLISTS INJURED        10414
PEDESTRIANS INJURED     24984
MOTORISTS INJURED       95828
PERSONS INJURED        118418
dtype: int64

In [8]:
# Plotting the results.
plt.figure(figsize=(8,4))
plt.bar(range(4),f[0:4], align='center', color="green", alpha=0.5)
plt.suptitle('Number of deaths',fontsize=18)
plt.xticks(range(4), ['CYCLISTS KILLED', 'MOTORISTS KILLED','PEDESTRIANS KILLED','PERSONS KILLED'],rotation='vertical')
plt.margins(0.01) # tighten the x-margins so the bars fill the axes
plt.ylabel("Total Number")
plt.show()



In [41]:
# Plotting the results.
plt.figure(figsize=(8,4))
plt.bar(range(4),f[4:], align='center', color="green", alpha=0.5)
plt.suptitle('Number of injured',fontsize=18)
plt.xticks(range(4), ['CYCLISTS INJURED','PEDESTRIANS INJURED','MOTORISTS INJURED','PERSONS INJURED'],rotation='vertical')
plt.margins(0.01) # tighten the x-margins so the bars fill the axes
plt.ylabel("Total Number")
plt.show()



In [98]:
# Create a function that returns the time
def getHour(s):
    return int(s.split(':')[0])

# Now create a new column named hour and store the values of the above function
data['Hour']= data['TIME'].apply(lambda x: getHour(x))

In [99]:
#create a dataframe with the accident occurencies per hour
dfHours = data.filter(items=['Hour']) 
dfHours=dfHours.apply(pd.value_counts)
dfHours.sort_index()


Out[99]:
Hour
0 13525
1 7547
2 5807
3 4857
4 5730
5 6241
6 10024
7 13715
8 27092
9 26498
10 24395
11 25427
12 26653
13 27927
14 32750
15 29731
16 35532
17 34508
18 30598
19 24531
20 20299
21 16759
22 15372
23 12214

In [100]:
# Plotting the accidents count per Hour of the day.
fig=plt.figure(figsize=(12,7))
plt.suptitle('Accident occurrences during the day ',fontsize=20)

plt.bar(dfHours.index, dfHours["Hour"], align='center', color="green", alpha=0.5)
plt.xticks(dfHours.index,dfHours.index)
plt.margins(0.01) # tighten the x-margins so the bars fill the axes
plt.ylabel("Accidents Count")
plt.xlabel("Hours")
plt.show()



In [75]:
dataJanuary=pd.read_csv("1january.csv") #reading the csv into dataframe

In [76]:
# Filtering the data
cleanJanuary=dataJanuary.loc[(dataJanuary['pickup_longitude'] < -72) &
                    (dataJanuary['pickup_longitude'] > -75) &
                    (dataJanuary['pickup_longitude'].notnull())&
                    (dataJanuary['pickup_latitude'].notnull())&
                    (dataJanuary['pickup_latitude'] != 0.0)&
                    (dataJanuary['dropoff_longitude'] < -72) &
                    (dataJanuary['dropoff_longitude'] > -75) &
                    (dataJanuary['dropoff_longitude'].notnull())&
                    (dataJanuary['dropoff_latitude'].notnull())&
                    (dataJanuary['dropoff_latitude'] != 0.0)].copy() # .notnull() works element-wise, unlike `is not None`; .copy() avoids SettingWithCopyWarning later

In [101]:
# Create a function that returns the time
def getHourTaxi(s):
    return int((s.split(':')[0]).split(' ')[1])

# Now create a new column named hour and store the values of the above function
cleanJanuary['Hour']= cleanJanuary['pickup_datetime'].apply(lambda x: getHourTaxi(x))


#create a dataframe with the accident occurencies per hour
dfHoursTaxi = cleanJanuary.filter(items=['Hour']) 
dfHoursTaxi=dfHoursTaxi.apply(pd.value_counts)
dfHoursTaxi.sort_index()


Out[101]:
Hour
0 389241
1 295037
2 225190
3 166065
4 123183
5 109407
6 231806
7 397300
8 483198
9 483825
10 472759
11 494212
12 526348
13 522837
14 553216
15 554235
16 502260
17 581974
18 680478
19 671466
20 613864
21 597885
22 571414
23 473417

In [102]:
# Plotting the taxi traffic count per Hour of the day.
fig=plt.figure(figsize=(12,7))
plt.suptitle('Taxi pick up occurrences during the day ',fontsize=20)

plt.bar(dfHoursTaxi.index, dfHoursTaxi["Hour"], align='center', color="green", alpha=0.5)
plt.xticks(dfHoursTaxi.index,dfHoursTaxi.index)
plt.margins(0.01) # tighten the x-margins so the bars fill the axes
plt.ylabel("Pick ups Count")
plt.xlabel("Hours")
plt.show()


Geoplotting

In this section we created 6 different maps depicting the coordinates where serious accidents happen, namely the deaths and injuries for each category (PEDESTRIANS, CYCLISTS, MOTORISTS). What we would like to see is whether they are relatively close. If they are not, that would mean that different areas are dangerous for different groups; if they are, that would mean that some areas are generally dangerous.
As we can see from the maps below, while the injury hotspots for pedestrians and cyclists coincide in the city center, those for motorists do not. That is because more people cycle and walk in the center, where distances are shorter.
We can also observe which roads are dangerous to travel on.
Furthermore, if we compare the map of all accidents with that of the traffic, we observe that while the city center is very active on both maps, there are many other parts of the map that are far more active when it comes to accidents. Thus, while there is a correlation between traffic and accidents, some areas seem to be dangerous with no apparent reason. This could be due to speeding, hard turns or unsignalized intersections.


In [50]:
# Cleaning the data for plotting. Outliers and NaN values. 
dataForGeo=data.loc[(data['LONGITUDE'] < -72) &
                    (data['LONGITUDE'] > -75) &
                    (data['LONGITUDE'].notnull())&
                    (data['LATITUDE'].notnull())&
                    (data['LATITUDE'] != 0.0)] # .notnull() works element-wise, unlike `is not None`

In [51]:
east=max(dataForGeo["LONGITUDE"])
west=min(dataForGeo["LONGITUDE"])
north=max(dataForGeo["LATITUDE"]) # north is the largest latitude
south=min(dataForGeo["LATITUDE"])

In [52]:
geo_data_for_plotting = {"lat": dataForGeo["LATITUDE"],
                         "lon": dataForGeo["LONGITUDE"]}

In [53]:
# Map of all incidents
geoplotlib.kde(geo_data_for_plotting,1)
bbox = BoundingBox(north=north, west=west, south=south, east=east)
geoplotlib.set_bbox(bbox)
geoplotlib.tiles_provider('toner-lite')
geoplotlib.inline()
geoplotlib.show()


('smallest non-zero count', 5.3731908254315777e-08)
('max count:', 53.423341864672629)