New York City is a beautiful city with terrible traffic. The drivers are impatient and aggressive, which results in many traffic collisions every day, some of which, quite unfortunately, lead to injuries and death.
For our project, we wanted to understand NYC's most collision-prone areas and try to predict collisions based on many factors, such as location, vehicle type, whether it's a weekday, etc.
The NYPD Motor Vehicle Collisions dataset we found was surprisingly feature-rich and large. Its size also meant that there were many collision records with missing data.
In [1]:
import os
from os import path
from bs4 import BeautifulSoup
import collections
import numpy as np
import pandas as pd
import re
import requests
import scipy
from scipy.cluster.vq import kmeans, kmeans2, vq
from scipy.spatial.distance import cdist
from IPython.core.display import display, HTML
from colour import Color
from sklearn import linear_model
import webbrowser
from geopy.distance import distance
# Plotting
import matplotlib
matplotlib.use('svg')
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns
from PIL import Image
# For scikit-learn to chill out about deprecation
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
raw_data_file_name = 'NYPD_Motor_Vehicle_Collisions.csv'
bicycle_lanes_page_url = 'http://www.nyc.gov/html/dot/html/bicyclists/lane-list.shtml'
prime_colors = ['Red', 'Blue', 'Purple', 'Green', 'Yellow', 'Cyan', 'Magenta', 'Black', 'White']
current_working_dir = os.getcwd()
mapbox_access_token = 'pk.eyJ1IjoibWFwYm94IiwiYSI6ImNpandmbXliNDBjZWd2M2x6bDk3c2ZtOTkifQ._QA7i5Mpkd_m30IGElHziw'
def get_map_output_url(output_file_name):
return 'file://%s' % path.join(current_working_dir, output_file_name + '.html')
def create_map_html_file(output_file_name):
"""
Creates a webpage that will hold the map where the points of each cluster are plotted. This page
shows an instance of Leaflet (http://leafletjs.com/).
Parameters:
title (string): the name of the webpage that will appear on the browser
script (string): the location of the file that contains the plotting information (must be a
javascript file)
Returns:
The HTML code that forms the webpage.
"""
page = """
<!DOCTYPE html>
<html>
<head>
<title>TITLE</title>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="stylesheet" href="https://unpkg.com/leaflet@1.0.2/dist/leaflet.css" />
<script src="https://unpkg.com/leaflet@1.0.2/dist/leaflet.js"></script>
</head>
<body>
<div id="mapid" style="min-width: 800px; min-height: 600px;"></div>
<script src="SCRIPT"></script>
</body>
</html>
"""
page = page.replace("TITLE", output_file_name)
page = page.replace("SCRIPT", output_file_name + '.js')
with open(output_file_name + '.html', "w") as map_file:
map_file.write(page)
def create_map_script_file(output_file_name, lat, lon, zoom):
"""
Creates a javascript file with the basic content to display a map in a webpage.
Parameters:
output_file_name (string): the name (or directory) of the javascript file to be created
lat (float): the latitude where the map will be centered
lon (float): the longitude where the map will be centered
zoom (int): the zoom level in which the map starts (can be changed in browser)
Returns:
Nothing, the file is written onto disk
"""
script = """
var mymap = L.map('mapid').setView([LAT, LON], ZOOM);
L.tileLayer('https://api.tiles.mapbox.com/v4/{id}/{z}/{x}/{y}.png?access_token=ACC_TOKEN', {
maxZoom: 18,
attribution: 'Map data © <a href="http://openstreetmap.org">OpenStreetMap</a> contributors, ' +
'<a href="http://creativecommons.org/licenses/by-sa/2.0/">CC-BY-SA</a>, ' +
'Imagery © <a href="http://mapbox.com">Mapbox</a>',
id: 'mapbox.streets'
}).addTo(mymap);
"""
script = script.replace("LAT", str(lat))
script = script.replace("LON", str(lon))
script = script.replace("ZOOM", str(zoom))
script = script.replace("ACC_TOKEN", mapbox_access_token)
with open(output_file_name + '.js', "w") as script_file:
script_file.write(script)
def append_points_to_script(output_file_name, points, color, size):
"""
Appends to a javascript file all the points to be plotted with a given color and size
Parameters:
output_file_name (string): the name (or directory) of the javascript file to be used
points (list): the list of points to be plotted, as a DataFrame
color (Color): the color in which the points will be plotted
size (int): the size of each point
Returns:
Nothing, the file is written onto disk
"""
point_js = """
L.circle([LAT, LON], SIZE, {
color: 'COLOR',
fillColor: 'COLOR',
fillOpacity: 0.5
}).addTo(mymap);
"""
point_js = point_js.replace('SIZE', str(size))
point_js = point_js.replace('COLOR', str(color.hex))
if type(points) == pd.DataFrame:
points = points.loc[:, ['LATITUDE', 'LONGITUDE']].values
with open(output_file_name + '.js', "a") as script_file:
for lat, lon in points:
new_point_js = point_js.replace('LAT', str(lat))
new_point_js = new_point_js.replace('LON', str(lon))
script_file.write(new_point_js)
Note: the utility functions defined above are mainly used to plot the data employed in this notebook on a map. Doing so requires generating custom HTML and JavaScript that load a Leaflet map and draw the points on it.
We started by cleaning up rows with empty and 'UNKNOWN' values, throwing away about a third of the original dataset. We then dropped columns that were either unnecessary (like the reason for the collision, since that sort of information cannot be known before the collision happens, e.g. the driver being distracted) or too detailed (like the vehicle subtypes).
In [2]:
def load_data(file_name):
collision = pd.read_csv(file_name,
na_filter=False,
parse_dates={'DATE_COMBINED' : ['DATE', 'TIME']},
infer_datetime_format=True)
# Remove rows that don't have the necessary data
    columns_to_check_for_empty = ['LOCATION', 'LATITUDE', 'LONGITUDE', 'BOROUGH',
                                  'ZIP CODE', 'ON STREET NAME', 'CROSS STREET NAME',
                                  'VEHICLE TYPE CODE 1']
for column in columns_to_check_for_empty:
collision = collision[collision[column] != '']
collision = collision[collision[column] != 'UNKNOWN']
# Drop unneeded columns
columns_to_drop = ['CONTRIBUTING FACTOR VEHICLE 1', 'CONTRIBUTING FACTOR VEHICLE 2',
'CONTRIBUTING FACTOR VEHICLE 3', 'CONTRIBUTING FACTOR VEHICLE 4',
'CONTRIBUTING FACTOR VEHICLE 5', 'LOCATION', 'OFF STREET NAME',
'VEHICLE TYPE CODE 2', 'VEHICLE TYPE CODE 3', 'VEHICLE TYPE CODE 4',
'VEHICLE TYPE CODE 5']
for column in columns_to_drop:
collision = collision.drop(column, axis=1)
# Set column types
collision['ZIP CODE'] = collision['ZIP CODE'].astype(int)
collision['LATITUDE'] = collision['LATITUDE'].astype(float)
collision['LONGITUDE'] = collision['LONGITUDE'].astype(float)
# Rename date column to just 'DATE'
collision = collision.rename(columns={'DATE_COMBINED':'DATE'})
# Eliminate duplicates
collision = collision.drop_duplicates()
# Reset index
collision = collision.reset_index(drop=True)
return collision
We proceeded to create two temporal features: the time period of the day in which the collision occurred, and whether it occurred on a weekday or on the weekend.
The 'time period' requires some explanation: we all know that the morning and evening rush hours can be especially collision-prone compared to daytime (around noon) and nighttime (after the evening rush and before the morning rush). We decided to create four bins with the following time ranges:
Morning Rush: 07:00-10:59
Day Time: 11:00-15:59
Evening Rush: 16:00-19:59
Night Time: 20:00-06:59
Together with the weekday/weekend feature, this allowed us to represent the date in a way that is meaningful to a machine learning algorithm. We also assumed that, as long as it's a weekday, the exact day (whether it's a Monday or a Thursday) is not significant.
In [3]:
def create_temporal_features(collision):
collision = _create_time_features(collision)
collision = _create_date_features(collision)
# We're done with date, we can drop it
collision = collision.drop('DATE', axis=1)
return collision
def _create_time_features(collision):
# Create one-hot time of day representation
## Date part is unimportant
morning_rush_begin = pd.datetime(2000, 01, 01, 07, 00, 00).time()
morning_rush_end = pd.datetime(2000, 01, 01, 11, 00, 00).time()
evening_rush_begin = pd.datetime(2000, 01, 01, 16, 00, 00).time()
evening_rush_end = pd.datetime(2000, 01, 01, 20, 00, 00).time()
collision_time = collision['DATE'].dt.time
## Night Time 00:00-06:59 & 20:00-00:00
night_time = (collision_time >= evening_rush_end) | (collision_time < morning_rush_begin)
night_time_onehot = pd.get_dummies(night_time).loc[:, True].astype(int)
collision = collision.assign(NIGHT_TIME = night_time_onehot.values)
## Morning Rush 07:00-10:59
morning_rush = (collision_time >= morning_rush_begin) & (collision_time < morning_rush_end)
morning_rush_onehot = pd.get_dummies(morning_rush).loc[:, True].astype(int)
collision = collision.assign(MORNING_RUSH = morning_rush_onehot.values)
    ## Day Time 11:00-15:59
day_time = (collision_time >= morning_rush_end) & (collision_time < evening_rush_begin)
day_time_onehot = pd.get_dummies(day_time).loc[:, True].astype(int)
collision = collision.assign(DAY_TIME = day_time_onehot.values)
## Evening Rush 16:00-19:59
evening_rush = (collision_time >= evening_rush_begin) & (collision_time < evening_rush_end)
evening_rush_onehot = pd.get_dummies(evening_rush).loc[:, True].astype(int)
collision = collision.assign(EVENING_RUSH = evening_rush_onehot.values)
return collision
def _create_date_features(collision):
# Create one-hot weekday/weekend representation
collision_day = collision['DATE'].dt.dayofweek
## Weekday 0, 1, 2, 3, 4
## Weekend 5, 6
is_weekday = (collision_day <= 4)
is_weekday_onehot = pd.get_dummies(is_weekday).astype(int)
## Weekday
weekday_onehot = is_weekday_onehot.loc[:, True]
collision = collision.assign(WEEKDAY = weekday_onehot.values)
## Weekend
weekend_onehot = is_weekday_onehot.loc[:, False]
collision = collision.assign(WEEKEND = weekend_onehot.values)
return collision
We further proceeded to create a one-hot encoding of the vehicle types available in the dataset. The NYPD has put almost half of the vehicles in a general group called 'PASSENGER VEHICLE', which seems to represent the common sedan type of vehicle. Other types, like SUVs, motorcycles, and small and large commercial vehicles, have their own categories.
We removed vehicle types with very few collisions (like collisions in which the vehicle was an 'AMBULANCE'). This further cut about 5% of the dataset.
In [4]:
def create_vehicle_features(collision):
# Create one-hot vehicle type representation
vehicle_types_onehot = pd.get_dummies(collision.loc[:, 'VEHICLE TYPE CODE 1']).astype(int)
# Merge Motorcycle & Scooter columns
motorcycle = vehicle_types_onehot.loc[:, 'MOTORCYCLE'] + vehicle_types_onehot.loc[:, 'SCOOTER']
vehicle_types_onehot = vehicle_types_onehot.drop('MOTORCYCLE', axis=1)
vehicle_types_onehot = vehicle_types_onehot.drop('SCOOTER', axis=1)
vehicle_types_onehot = vehicle_types_onehot.assign(MOTORCYCLE = motorcycle.values)
    # Concatenate one-hot with collisions
collision = pd.concat([collision, vehicle_types_onehot], axis=1)
# Drop unneeded collisions
vehicles_to_drop = ['OTHER', 'AMBULANCE', 'PEDICAB', 'FIRE TRUCK', 'LIVERY VEHICLE']
collisions_to_drop = vehicle_types_onehot.loc[:, vehicles_to_drop[0]]
for i in xrange(1, len(vehicles_to_drop)): # Start from 1 since we already have OTHER
collisions_to_drop += vehicle_types_onehot.loc[:, vehicles_to_drop[i]]
collisions_to_keep = (collisions_to_drop == 0)
collision = collision[collisions_to_keep]
collision = collision.reset_index(drop=True) # Reset index due to dropped rows
# Drop unneeded vehicle columns
for column in vehicles_to_drop:
collision = collision.drop(column, axis=1)
# Drop vehicle type column
collision = collision.drop('VEHICLE TYPE CODE 1', axis=1)
return collision
We proceeded by assigning a 'severity score' to each collision. Each collision has a default severity score of 1; the score increases by 2 for each injury and by 10 for each death. For example, a collision with two injuries and one death gets a score of 1 + 2*2 + 10 = 15. These are arbitrary values we came up with and are not meant as a statement about the cost of the lives affected by these collisions.
In [5]:
# Default score values
INJURY_SCORE = 2
DEATH_SCORE = 10
def create_severity_score_feature(collision):
# Give all collisions a score of 1 by default
collision['SEVERITY SCORE'] = 1
# For each injury, add to the score of the collision
collision['SEVERITY SCORE'] += (collision.loc[:, 'NUMBER OF PERSONS INJURED'] * INJURY_SCORE)
# For each death, add to the score of the collision
collision['SEVERITY SCORE'] += (collision.loc[:, 'NUMBER OF PERSONS KILLED'] * DEATH_SCORE)
return collision
We then load the data:
In [6]:
# Load the collision data
collision = load_data(raw_data_file_name)
collision = create_temporal_features(collision)
collision = create_vehicle_features(collision)
collision = create_severity_score_feature(collision)
print collision.head()
print collision.dtypes
print len(collision)
We proceed by parsing NYC's bike lane list web page to add another feature to our collision data: whether the street the collision occurred on has a bicycle lane. We do that by parsing the bicycle lanes page and matching the lanes to the street names of the collisions.
In [7]:
def get_bicycle_lanes_from_page(page_html):
"""
Parse the table contained in the bicycle lanes webpage.
Args:
page_html (string): String of HTML corresponding to the data source webpage.
Returns:
a dictionary that contains mappings from a category to the list containing the data.
These categories are: street, begins, ends, and borough.
"""
soup = BeautifulSoup(page_html, 'html.parser')
bicycle_lanes_list = {
'street': [],
'begins': [],
'ends': [],
'borough': []
}
table = soup.findChildren('tbody')[0]
rows = table.findChildren(['tr'])
for row in rows:
cells = row.findChildren('td')
content = cells[0].text
m = re.search(r'^([a-zA-Z\s0-9\-\.,\(\)]+) (from)*(between)* '
r'([a-zA-Z\s0-9\-\.,\(\)]+) (to)*(and)* '
r'([a-zA-Z\s0-9\-\.,\(\)]+)$', content)
# Content that does not follow this syntax is discarded because
# it refers to landscapes or parks.
if m is not None:
bicycle_lanes_list['street'].append(m.group(1).upper())
bicycle_lanes_list['begins'].append(m.group(4).upper())
bicycle_lanes_list['ends'].append(m.group(7).upper())
bicycle_lanes_list['borough'].append(cells[2].text.upper())
return bicycle_lanes_list
def extract_bicycle_lanes(url):
"""
Retrieve all of the bicycle lane information for the city of New York.
Parameters:
url (string): page URL corresponding to the listing of bicycle lane information.
Returns:
bicycle_lanes (Pandas DataFrame): list of dictionaries containing extracted lane information.
"""
bicycle_lanes_page = requests.get(url).text
bicycle_lanes_list = get_bicycle_lanes_from_page(bicycle_lanes_page)
bicycle_lanes = pd.DataFrame(bicycle_lanes_list)
bicycle_lanes.loc[bicycle_lanes.borough == 'THE BRONX', 'borough'] = 'BRONX'
columns_to_rename = {'borough': 'BOROUGH', 'begins': 'BEGINS', 'ends': 'ENDS', 'street': 'ON STREET NAME'}
bicycle_lanes = bicycle_lanes.rename(columns=columns_to_rename)
return bicycle_lanes
In [8]:
bicycle_lanes = extract_bicycle_lanes(bicycle_lanes_page_url)
print bicycle_lanes.head()
print len(bicycle_lanes)
The bicycle lanes are defined by four features:
ON STREET NAME (the street that carries the bike lane)
BEGINS (the cross street where the lane begins)
ENDS (the cross street where the lane ends)
BOROUGH (the borough the lane is in)
By using these four values, we'll match them against the location of each collision and add a 'HAS_BIKE_LANE' column to the collisions data frame, indicating whether the collision occurred on a street that has a bike lane.
In [9]:
def merge_bicycle_lanes(df1, df2, columns):
# Populate collisions with bicycle lane data by matching on 'columns'
result = pd.merge(df1, df2, how='left', on=columns)
    # Create a 'HAS_BIKE_LANE' column to indicate the presence of a bike lane
    result = result.assign(HAS_BIKE_LANE = [0 if pd.isnull(x) else 1 for x in result.loc[:, 'BEGINS']])
# Drop unneeded columns
result = result.drop('BEGINS', axis=1)
result = result.drop('ENDS', axis=1)
# Multiple bike lanes can be on the same street, disregard that fact
result = result.drop_duplicates()
# Reset index to account for duplicates
result = result.reset_index(drop=True)
return result
In [10]:
collision = merge_bicycle_lanes(collision, bicycle_lanes, ['ON STREET NAME', 'BOROUGH'])
print collision.head()
print collision.dtypes
print len(collision)
We now need to find areas in which collisions occur frequently. This will let us identify collision-prone areas in NYC. All the collisions we have right now have their latitude & longitude information, which is too granular to identify areas. We also have the borough and Zip codes of the locations, but those are not granular enough. What we need to do is to define 'clusters' from the locations of collisions, which we can use to determine areas. We can use the k-means algorithm to achieve this.
Usually, we try to optimize the number of means to give us a good balance of error and generalizability. While we could do that here as well, we can instead choose the number of means manually to control how granular our zoning is: selecting a single mean will, naturally, cover the entire NYC metropolitan area but won't provide any additional information. By visually inspecting the results for different numbers of means, we can find a number of zones that translates logically to the layout of the NYC metropolitan area.
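If we did want a more systematic starting point, one quick option is an 'elbow' check of the distortion that scipy's kmeans reports for several values of k. This is only a sketch, not something the rest of the analysis relies on, and the candidate k values below are arbitrary:
# Sketch: elbow-style check of the k-means distortion for a few candidate values of k.
# kmeans() returns the mean distance of each observation to its closest centroid.
location_data = collision.loc[:, ['LATITUDE', 'LONGITUDE']].values
candidate_ks = [1, 5, 10, 15, 20, 25]  # arbitrary illustrative values
distortions = []
for k in candidate_ks:
    _, distortion = kmeans(location_data, k)
    distortions.append(distortion)
plt.plot(candidate_ks, distortions, marker='o')
plt.xlabel('Number of means (k)')
plt.ylabel('Average distance to closest centroid')
The value of k where this curve flattens out is a reasonable choice; here we simply pick the number of zones by eye instead.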
Let's start with some helper code that will find those centers:
In [11]:
def get_collision_centers(collision, num_centers):
"""
Runs K-means, find the centers that correspond to collisions and returns them, along with
each collision's label (which center it belongs to).
"""
location_data = collision.loc[:, ['LATITUDE', 'LONGITUDE']]
centroids, labels = kmeans2(location_data, num_centers)
return centroids, labels
We will also use a function that takes collisions and plots them, along with some of their features (such as the centers and the cluster membership of the collisions). Effectively visualizing this kind of data is of paramount importance, since the evaluation depends on observing collision hotspots.
The function takes a list of collisions and plots them on an interactive map of New York City. We tell the function which collisions we would like to plot, whether we want to show their severity, whether we want to show their cluster membership (obtained through kmeans), and so on.
In [12]:
def plot_collisions(collision, filter_features=[], show_severity=False, centroids=None, labels_of_collisions=None,
color_collisions_based_on_labels=False, only_show_num_collisions=None,
open_map_after_plot=False, output_file_name='map', lat=40.77, lon=-74.009725, zoom=12):
# Create the output files
create_map_html_file(output_file_name)
create_map_script_file(output_file_name, lat, lon, zoom)
collisions_to_plot = collision.copy().reset_index(drop=True)
if labels_of_collisions is not None:
# Append the labels
collisions_to_plot = collisions_to_plot.assign(LABELS = labels_of_collisions)
# Filter the collisions based on the list of features that were given
for feature in filter_features:
collisions_to_plot = collisions_to_plot.loc[collisions_to_plot[feature] == 1, :].reset_index(drop=True)
# After filtering, throw away the unneeded features
if labels_of_collisions is not None:
collisions_to_plot = collisions_to_plot.loc[:, ['LATITUDE', 'LONGITUDE', 'SEVERITY SCORE', 'LABELS']]
else:
collisions_to_plot = collisions_to_plot.loc[:, ['LATITUDE', 'LONGITUDE', 'SEVERITY SCORE']]
# Randomly pick some number of collisions if specified
num_collisions_total = collisions_to_plot.shape[0]
if only_show_num_collisions and only_show_num_collisions < num_collisions_total:
collisions_shuffled = collisions_to_plot.loc[np.random.choice(num_collisions_total,
only_show_num_collisions,
replace=False), :]
collisions_to_plot = collisions_shuffled[:only_show_num_collisions].reset_index(drop=True)
if show_severity: # Plot with heatmap
# Get the color range
unique_severity_scores = sorted(collisions_to_plot.loc[:, 'SEVERITY SCORE'].unique())
num_unique_severity_scores = len(unique_severity_scores)
colors = list(Color('Yellow').range_to('Red', num_unique_severity_scores))
# Plot each severity level with its own color
for index, score in enumerate(unique_severity_scores):
collisions_with_score = collisions_to_plot.loc[collisions_to_plot['SEVERITY SCORE'] == score, :]
append_points_to_script(output_file_name, collisions_with_score, colors[index], 50)
elif labels_of_collisions is not None: # Plot with color code
# Plot each center's points
color_to_use = Color('Blue')
for center in xrange(len(centroids)):
if color_collisions_based_on_labels and len(centroids) <= len(prime_colors):
color_to_use = Color(prime_colors[center])
collisions_with_label = collisions_to_plot.loc[collisions_to_plot.loc[:, 'LABELS'] == center, :]
append_points_to_script(output_file_name, collisions_with_label, color_to_use, 50)
else: # Plot simple
append_points_to_script(output_file_name, collisions_to_plot, Color('Blue'), 50)
if centroids is not None:
# Plot the centroids
append_points_to_script(output_file_name, centroids, Color('Orange'), 150)
# Generate map URL
map_url = get_map_output_url(output_file_name)
# Optionally open it in a browser window
if open_map_after_plot:
webbrowser.open(map_url)
return map_url
With this function, we started combining features, looking for visible patterns. During this process, we decided to focus on collisions that occur on streets with bike lanes. We know that the amount of traffic changes between weekdays and the weekend, so we decided to start with nighttime collisions involving taxis in the Manhattan borough. Let's see if we find something interesting:
In [13]:
# Only consider the collisions that took place in Manhattan
collisions_in_manhattan = collision.loc[collision['BOROUGH'] == 'MANHATTAN', :]
# Define features to filter the collisions by
filter_features = ['NIGHT_TIME', 'TAXI', 'HAS_BIKE_LANE']
# Plot the specified collisions and print its URL
print plot_collisions(collisions_in_manhattan, # The list of collisions to plot
filter_features=filter_features + ['WEEKDAY'], # Which features to filter the collisions by
show_severity=True, # Show the severity of collisions
output_file_name='TAXI_WEEKDAY_BIKE_NIGHT', # Name of the output HTML & JS map files
open_map_after_plot=True) # Open the map in a browser window
# Plot the specified collisions and print its URL
print plot_collisions(collisions_in_manhattan, # The list of collisions to plot
filter_features=filter_features + ['WEEKEND'], # Which features to filter the collisions by
show_severity=True, # Show the severity of collisions
output_file_name='TAXI_WEEKEND_BIKE_NIGHT', # Name of the output HTML & JS map files
open_map_after_plot=True) # Open the map in a browser window
The results above show that the severity of nighttime collisions in places with bike lanes tends to be greater during the week than on the weekend. This could simply reflect the higher volume of traffic generated by jobs in the area. One noticeable thing is an apparent increase, on the weekends, in the severity of collisions close to the Time Warner Center at the south-west corner of Central Park.
Let's look at the same features, but at daytime rather than nighttime:
In [14]:
filter_features = ['DAY_TIME', 'TAXI', 'HAS_BIKE_LANE']
print plot_collisions(collisions_in_manhattan,
filter_features=filter_features + ['WEEKDAY'],
show_severity=True,
output_file_name='TAXI_WEEKDAY_BIKE_DAY',
open_map_after_plot=True)
print plot_collisions(collisions_in_manhattan,
filter_features=filter_features + ['WEEKEND'],
show_severity=True,
output_file_name='TAXI_WEEKEND_BIKE_DAY',
open_map_after_plot=True)
Similar to the results above, the severity of the collisions is greater on weekdays and more concentrated towards southern Manhattan; on the weekend, that part of the borough shows the opposite pattern. The south-west corner of Central Park remains a problem area.
The results shown here suggest that collisions are more likely to happen during the night and close to Central Park or to the southern end of the borough. Based on this data alone, we cannot determine what causes these hotspots. It may be that they are favorite places to ride a bike (like Central Park) or important commercial or business centers. However, it is clear that collisions in these parts of town are frequent during certain parts of the day and of the week.
Let's see if buses follow a similar pattern, since they are a popular means of transportation in many places.
In [15]:
filter_features = ['DAY_TIME', 'BUS', 'HAS_BIKE_LANE']
print plot_collisions(collisions_in_manhattan,
filter_features=filter_features + ['WEEKDAY'],
show_severity=True,
output_file_name='BUS_WEEKDAY_BIKE_DAY',
open_map_after_plot=True)
print plot_collisions(collisions_in_manhattan,
filter_features=filter_features + ['WEEKEND'],
show_severity=True,
output_file_name='BUS_WEEKEND_BIKE_DAY',
open_map_after_plot=True)
From the maps above, we see the obvious difference in the number of collisions overall. We can also observe an increase in the severity of the collisions during the week, compared to the weekend.
As we have seen throughout this walkthrough, it is quite easy to get interesting information out of collision data through the use of collision features. We have observed a relationship between the area of Manhattan, the time of day, the day of the week, and the severity of collisions.
Using our collision data, we can also try to infer the areas where collisions concentrate. This information could allow emergency response units to position themselves according to historical collision data, which is useful for several reasons.
Let's start by identifying some collision hotspots in the Manhattan borough:
In [16]:
# Run kmeans with 15 centroids, find 15 'collision hotspots'
centroids, labels = get_collision_centers(collisions_in_manhattan, 15)
# Plot the collisions, along with the centroids
print plot_collisions(collisions_in_manhattan, # The list of collisions to plot
centroids=centroids, # The centroids of collisions, chosen through kmeans
labels_of_collisions=labels, # The list of which collisions belong to which centroid
only_show_num_collisions=2000, # Randomly pick only some collisions to show, for a cleaner map
open_map_after_plot=True) # Open the generated map in a browser window
You can observe that a small area between 25th and 60th Streets, the Midtown Manhattan area, has as many collision hotspots as the entire northern part of Manhattan (including Uptown), and twice as many as Lower Manhattan. This is an interesting result that suggests just how many more collisions happen in Midtown compared to the rest of Manhattan.
We can focus on parts of New York City to get a more granular understanding of problematic areas. Let's take Midtown Manhattan, for example. We can run kmeans only on the collisions that occur in Midtown Manhattan and plot them. To (roughly) get the collisions in Midtown Manhattan, we can define and use a function to return the collisions that are in a circle:
In [17]:
def get_collisions_in_circle(collisions, latitude, longitude, radius_miles):
location_data = collisions.loc[:, ['LATITUDE', 'LONGITUDE']]
center = np.array([latitude, longitude])
# Calculate distances of each collision to the center
distances = location_data.apply(lambda x: distance((x['LATITUDE'], x['LONGITUDE']), center).miles, axis = 1)
# Return the collisions inside the circle
    return collisions[distances <= radius_miles].reset_index(drop=True)
We can now define a center and radius that (roughly) encircles Midtown Manhattan area, run kmeans on it to get the collision hotspots in that area...
In [18]:
# The circle that roughly corresponds to Midtown Manhattan
center_lat = 40.7564028 # 40°45'23.05"N
center_lon = -73.9844611 # 73°59'04.06"W
radius = 1.08 # miles
# Get the collisions in midtown
collisions_in_midtown_manhattan = get_collisions_in_circle(collisions_in_manhattan, center_lat, center_lon, radius)
...and plot all of them:
In [19]:
# Run kmeans with 15 centroids, find 15 'collision hotspots'
centroids, labels = get_collision_centers(collisions_in_midtown_manhattan, 15)
# Plot these collisions
print plot_collisions(collisions_in_midtown_manhattan,
centroids=centroids,
labels_of_collisions=labels,
only_show_num_collisions=500,
lat=center_lat, lon=center_lon, # Center the map to this latitude & longitude
zoom=14, # Zoom in to the map straight away
open_map_after_plot=True)
Perhaps unsurprisingly, we can see that a collision hotspot appears dead centered on Times Square (the hotspot in the middle of the map).
This seems to confirm something that most people who have seen Times Square and its traffic already suspect: it is a total mess that results in many collisions.
One can also observe a collision hotspot at the south-west corner of Central Park, which also came up during the severity analysis above as an area with higher-than-average collision severity.
Through this kmeans analysis, we can get a good insight into the areas of New York City that result in a higher number of collisions than the rest.
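As a small illustration of how such hotspots could inform where response units wait (purely a sketch, not part of the original analysis; the helper name and the example coordinates below are made up), a reported location can be matched to its nearest centroid with the geopy distance helper imported earlier:
# Sketch: find the collision hotspot (kmeans centroid) closest to a reported location.
# 'centroids' holds the (latitude, longitude) rows from the kmeans run above.
def nearest_hotspot(lat, lon, centroids):
    distances_to_centroids = [distance((lat, lon), (c_lat, c_lon)).miles
                              for c_lat, c_lon in centroids]
    closest = int(np.argmin(distances_to_centroids))
    return closest, distances_to_centroids[closest]

# Made-up example location, roughly around Times Square
hotspot_index, miles_away = nearest_hotspot(40.7580, -73.9855, centroids)
print 'Closest hotspot:', hotspot_index, '(%.2f miles away)' % miles_away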
We have spent time and effort in extracting a bunch of features for all the collisions. The question remains, however: how good are all these features in describing a collision's severity? In other words, how well can we predict the severity of a collision through the use of these features?
Let's start by describing the baseline: Most collisions in the data set do not have any injuries or deaths. If we only return a severity of '1', we will correctly predict the severity of ~80% of collisions:
In [20]:
score = sum(collision['SEVERITY SCORE'] == 1) / float(len(collision))
accuracy = '{:.4f}%'.format(score * 100)
print 'Baseline:', accuracy
Let's try to see how well some individual features can predict the severity in order to compare it to the combined features' effectiveness. Let's try the time of day.
It might seem reasonable to expect the most severe collisions to happen during rush hours. Therefore, the time of day might also seem like a good predictor of the severity of a collision. We can test this hypothesis by randomly choosing a subset of our collisions, training a logistic regression on it using just the 'time of day' features, and testing it on a randomly chosen validation set.
In [21]:
def get_training_sets(collision):
"""Gets a list of collisions and returns the set of training features and
labels from them. The severity score of each collision are their respective
label.
Parameters:
X (Pandas DataFrame): a dataframe that contains collisions
Returns:
X, Y (Pandas DataFrames): tuple of DataFrames that have features and labels of
collisions, respectively
"""
X = collision
Y = collision['SEVERITY SCORE']
# Convert BOROUGH to one-hot encoding
if 'BOROUGH' in X.columns:
borough_onehot = pd.get_dummies(X['BOROUGH']).astype(int)
X = pd.concat([X, borough_onehot], axis=1)
# Drop unneeded columns
columns_to_drop = ['BOROUGH', 'ZIP CODE', 'LATITUDE', 'LONGITUDE',
'ON STREET NAME', 'CROSS STREET NAME', 'SEVERITY SCORE',
'UNIQUE KEY', 'NUMBER OF PERSONS INJURED',
'NUMBER OF PERSONS KILLED', 'NUMBER OF PEDESTRIANS INJURED',
'NUMBER OF PEDESTRIANS KILLED', 'NUMBER OF CYCLIST INJURED',
'NUMBER OF CYCLIST KILLED', 'NUMBER OF MOTORIST INJURED',
'NUMBER OF MOTORIST KILLED']
for column in columns_to_drop:
if column in X.columns:
X = X.drop(column, axis=1)
return X, Y
# The number of collisions we have
num_collisions_total = collision.shape[0]
# Number of collisions we want in our training set
num_collisions_training = 40000
# Number of collisions we want in our test/validation set
num_collisions_test = 4000
# Get a random choice of collisions
collisions_shuffled = collision.loc[np.random.choice(num_collisions_total,
(num_collisions_training + num_collisions_test),
replace=False), :]
training_collisions = collisions_shuffled[:num_collisions_training]
test_collisions = collisions_shuffled[num_collisions_training:]
X_tr, Y_tr = get_training_sets(training_collisions)
X_te, Y_te = get_training_sets(test_collisions)
In [22]:
# Throw out all the features but the time of day
time_of_day_features = ['NIGHT_TIME', 'MORNING_RUSH', 'DAY_TIME', 'EVENING_RUSH']
X_tr_time = X_tr.loc[:, time_of_day_features]
X_te_time = X_te.loc[:, time_of_day_features]
# Train the classifier on the 'time of day' features
classifier = linear_model.LogisticRegressionCV()
classifier.fit(X_tr_time, Y_tr)
score = classifier.score(X_te_time, Y_te)
# Assess the score of the classifier through the validation set
accuracy = '{:.4f}%'.format(score * 100)
print 'Just time of day:', accuracy
As you can observe, the time of day wasn't much better at predicting the severity of collisions than the baseline.
Let's try another feature: the borough. We will see if the general area is a good indicator of the severity of collisions:
In [23]:
# Throw out all the features but the borough
borough_features = ['BRONX', 'BROOKLYN', 'MANHATTAN', 'QUEENS', 'STATEN ISLAND']
X_tr_borough = X_tr.loc[:, borough_features]
X_te_borough = X_te.loc[:, borough_features]
# Train the classifier on the 'borough' features
classifier = linear_model.LogisticRegressionCV()
classifier.fit(X_tr_borough, Y_tr)
score = classifier.score(X_te_borough, Y_te)
# Assess the score of the classifier through the validation set
accuracy = '{:.4f}%'.format(score * 100)
print 'Just borough:', accuracy
As you can observe, the borough wasn't much better at predicting the severity of collisions than the baseline either. It's as bad as just using the time of day.
Now, let's use all of our features and see how good they are at predicting the severity of collisions:
In [24]:
# Train the classifier on all the features
classifier = linear_model.LogisticRegressionCV()
classifier.fit(X_tr, Y_tr)
score = classifier.score(X_te, Y_te)
# Assess the score of the classifier through the validation set
accuracy = '{:.4f}%'.format(score * 100)
print 'All the features:', accuracy
It seems these features weren't enough to predict the severity of collisions: we can't do better than the baseline by using just these features. There are two possible conclusions: either the features we extracted don't capture what determines a collision's severity, or the severity simply cannot be predicted from the kind of information that is available before the collision happens.
To understand which is the case, we could extract more features from this data, for example by using ZIP codes instead of boroughs, or by combining it with datasets from other sources. Ruling out the second case might not be as easy as it seems.
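As a rough sketch of the first option (we did not run this as part of the analysis; the helper name and the wiring are our own), ZIP codes could be one-hot encoded and added to the feature matrix before retraining, reusing get_training_sets from above:
# Sketch: add one-hot ZIP code features (much finer-grained than the borough) and retrain.
def add_zip_code_features(collisions):
    zip_onehot = pd.get_dummies(collisions['ZIP CODE'], prefix='ZIP').astype(int)
    return pd.concat([collisions.reset_index(drop=True),
                      zip_onehot.reset_index(drop=True)], axis=1)

X_tr_zip, Y_tr_zip = get_training_sets(add_zip_code_features(training_collisions))
X_te_zip, Y_te_zip = get_training_sets(add_zip_code_features(test_collisions))
# Keep only the columns that appear in both splits so the layouts match
common_columns = [column for column in X_tr_zip.columns if column in set(X_te_zip.columns)]
zip_classifier = linear_model.LogisticRegressionCV()
zip_classifier.fit(X_tr_zip[common_columns], Y_tr_zip)
print 'With ZIP code features:', '{:.4f}%'.format(zip_classifier.score(X_te_zip[common_columns], Y_te_zip) * 100)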
Accurate severity prediction could give first-response units proactive information on the severity of collisions and allow them to respond with the appropriate amount of resources. For example, when the emergency services are notified of a collision, they can use the location and time information and the types of vehicles involved in the collision to predict its severity before they even arrive at the scene. They could then route two ambulances (instead of just one) to the area before seeing the severity themselves, potentially saving lives.
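To make that concrete, here is a minimal sketch of scoring a hypothetical incoming report with the all-features classifier trained above; the row is copied from the test set purely to get the right column layout, and the overwritten values are made up:
# Sketch: predict the severity score of a hypothetical incoming collision report.
# Start from an existing feature row (to keep the column layout the classifier expects)
# and overwrite the temporal fields as if the report came in at night on a weekday.
incoming_report = X_te.iloc[[0]].copy()
for column, value in [('NIGHT_TIME', 1), ('MORNING_RUSH', 0), ('DAY_TIME', 0), ('EVENING_RUSH', 0),
                      ('WEEKDAY', 1), ('WEEKEND', 0)]:
    incoming_report[column] = value
print 'Predicted severity score:', classifier.predict(incoming_report)[0]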