# Final Project - Dangerous Driving
Submitted by Candace Grant
In this project, I analyzed nationwide driving fatalities and other measures of dangerous driving.
Datasets:
- FARS 2015 national accident data from the NHTSA
- "Worst Drivers by State" 2015 rankings, including fatality rate per 100 million vehicle miles traveled
In [3]:
%matplotlib inline
import pandas as pd
import numpy as np
pd.options.display.max_columns = 55
In [5]:
# Dataset accessed from here: ftp://ftp.nhtsa.dot.gov/fars/2015/National/FARS2015NationalCSV.zip
df = pd.read_csv('https://raw.githubusercontent.com/paisely65/IS-362-Final-Project/master/accident.csv')
In [3]:
df.info()
In [6]:
df.head()
Out[6]:
In [8]:
# Convert the columns of Day, Month, Year, Hour, Minute into a single datetime column
# Requirements: Project includes at least one data transformation operation.
datetime = df[['YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTE']]
datetime = datetime.rename(columns={'YEAR': 'year', 'MONTH': 'month', 'DAY': 'day',
'HOUR': 'hour', 'MINUTE': 'minute'})
df['DATETIME'] = pd.to_datetime(datetime)
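The FARS file codes an unknown hour or minute as 99, and assembling datetimes from those rows can silently roll the timestamp into a later day. As a hedged sketch (assuming those sentinel codes appear in this file), one way to handle them is to mask the affected rows before conversion:
In [ ]:
# Hedged sketch: treat FARS "unknown" time codes (99) as missing instead of
# letting them shift the assembled timestamp into later days.
valid_time = (datetime['hour'] <= 23) & (datetime['minute'] <= 59)
df.loc[valid_time, 'DATETIME_CLEAN'] = pd.to_datetime(datetime[valid_time])
print((~valid_time).sum(), 'rows have an unknown hour or minute')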
In [9]:
# Replace STATE column with appropriate state names
state_codes = {1: 'Alabama', 2: 'Alaska', 4: 'Arizona', 5: 'Arkansas', 6: 'California',
8: 'Colorado', 9: 'Connecticut', 10: 'Delaware', 11: 'District of Columbia', 12: 'Florida',
13: 'Georgia', 15: 'Hawaii', 16: 'Idaho', 17: 'Illinois', 18: 'Indiana',
19: 'Iowa', 20: 'Kansas', 21: 'Kentucky', 22: 'Louisiana', 23: 'Maine',
24: 'Maryland', 25: 'Massachusetts', 26: 'Michigan', 27: 'Minnesota', 28: 'Mississippi',
29: 'Missouri', 30: 'Montana', 31: 'Nebraska', 32: 'Nevada', 33: 'New Hampshire',
34: 'New Jersey', 35: 'New Mexico', 36: 'New York', 37: 'North Carolina', 38: 'North Dakota',
39: 'Ohio', 40: 'Oklahoma', 41: 'Oregon', 42: 'Pennsylvania', 43: 'Puerto Rico',
44: 'Rhode Island', 45: 'South Carolina', 46: 'South Dakota', 47: 'Tennessee', 48: 'Texas',
49: 'Utah', 50: 'Vermont', 51: 'Virginia', 52: 'Virgin Islands', 53: 'Washington',
54: 'West Virginia', 55: 'Wisconsin', 56: 'Wyoming'}
df['STATE_NAME'] = df['STATE'].replace(state_codes)
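A quick hedged check (not part of the original analysis): `.replace()` leaves any code that is missing from the dictionary as a plain number, so it is worth confirming that every STATE value was actually mapped.
In [ ]:
# Hedged check: any STATE codes absent from state_codes remain numeric after .replace()
unmapped = df.loc[~df['STATE'].isin(list(state_codes)), 'STATE'].unique()
print('Unmapped state codes:', unmapped)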
In [7]:
# Sum the number of incidents and fatalities in each state
# Requirements: Project includes at least one grouping or aggregation.
print('Top 10 States with most fatal incidents')
print(df.groupby('STATE_NAME')['ST_CASE'].count().nlargest(10))
print()
print('Top 10 States with the most fatalities')
print(df.groupby('STATE_NAME')['FATALS'].sum().nlargest(10))
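The two aggregations above can also be combined into a single summary table. The sketch below is one hedged way to do that with a dictionary-style `.agg()`, keeping the incident counts and fatality totals side by side.
In [ ]:
# Hedged sketch: incident counts and fatality totals in one table
state_summary = df.groupby('STATE_NAME').agg({'ST_CASE': 'count', 'FATALS': 'sum'})
state_summary = state_summary.rename(columns={'ST_CASE': 'INCIDENTS', 'FATALS': 'FATALITIES'})
print(state_summary.nlargest(10, 'FATALITIES'))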
In [10]:
# Statistical analysis of the number of people involved in fatal car crash incidents
# Requirements: Project includes at least one statistical analysis and at
# least one graphics that describes or validates your data.
print('Mean number of people involved in fatal car crash:', df['PERSONS'].mean())
print('Median number of people involved in fatal car crash:', df['PERSONS'].median())
print('Standard deviation of number of people involved in fatal car crash:', df['PERSONS'].std())
df.groupby('STATE_NAME')['PERSONS'].mean().plot(kind='bar', figsize=(14,6),
title='Mean number of persons involved in fatal car crash incidents by State')
Out[10]:
In [11]:
# Visualization of the number of fatal car crash incidents nationwide on a daily basis
df.index = df['DATETIME']
df['incident_count'] = 1
df['incident_count'].resample('1D').sum().plot(figsize=(14,6), title='Number of fatal car crash incidents per day')
Out[11]:
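The daily series is noisy, partly from day-of-week effects. As a hedged follow-up sketch, a 7-day rolling mean smooths the daily counts and makes the seasonal pattern easier to see.
In [ ]:
# Hedged sketch: smooth the daily incident counts with a 7-day rolling mean
daily_incidents = df['incident_count'].resample('1D').sum()
daily_incidents.rolling(window=7).mean().plot(figsize=(14,6),
    title='7-day rolling mean of daily fatal car crash incidents')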
In [12]:
# We try to predict the number of fatalities based on the number of persons involved, the hour of day, and whether
# the incident was classified as a drunk driving incident. We use a random forest regressor and score its
# predictions with the r-squared metric. The r-squared turns out to be very low, so with only these three
# columns we can't make very good predictions about the number of people killed in a crash.
# Requirements: Project includes at least one feature that we did not cover in class!
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
X = df[['PERSONS', 'HOUR', 'DRUNK_DR']]
y = df['FATALS']
X_train, X_test, y_train, y_test = train_test_split(X, y)
rfclf = RandomForestRegressor(n_estimators=100)
rfclf.fit(X_train, y_train)
y_pred = rfclf.predict(X_test)
r2_score(y_test, y_pred)
Out[12]:
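As a hedged follow-up, a fitted random forest exposes feature_importances_, which shows how much each of the three predictors contributed to the (weak) fit.
In [ ]:
# Hedged follow-up: relative importance of each feature in the fitted random forest
for name, importance in zip(X.columns, rfclf.feature_importances_):
    print(name, round(importance, 3))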
In [15]:
# Let's start to analyze the second dataset
df2 = pd.read_csv('https://raw.githubusercontent.com/paisely65/IS-362-Final-Project/master/Copy%20of%20Worst%20Drivers%20By%20State%202015.csv')
In [16]:
df2.info()
In [17]:
df2.head()
Out[17]:
In [20]:
print('Top 10 States by fatality rate per 100 million vehicle miles traveled')
print(df2.sort_values('Fatalities Rate per 100 Million Vehicle Miles Traveled',
                      ascending=False)[['State']].head(10))
print()
print()
print('Top 10 States with most fatal incidents')
print(df.groupby('STATE_NAME')['ST_CASE'].count().nlargest(10))
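A hedged sketch to make the comparison in the write-up below concrete: compute the overlap between the top ten states by fatality rate and the top ten by absolute incident count.
In [ ]:
# Hedged sketch: overlap between the two top-10 lists printed above
top_by_rate = set(df2.sort_values('Fatalities Rate per 100 Million Vehicle Miles Traveled',
                                  ascending=False)['State'].head(10))
top_by_count = set(df.groupby('STATE_NAME')['ST_CASE'].count().nlargest(10).index)
print('States in both top-10 lists:', sorted(top_by_rate & top_by_count))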
For my final project, I wanted to look more closely at driving fatalities and other bad-driving statistics, in part because, with new machine learning technology and advances in computer vision algorithms, self-driving cars will soon become commonplace. Until presumably safer self-driving cars arrive, however, we need to study car crashes so we can make policies and develop technology that make driving safer.

In this analysis, I performed some exploratory data analysis, highlighting the states with the most fatal car crashes. Texas, California, and Florida had the most fatalities in absolute numbers; however, it is important to note that I did not normalize the data for population size. A quick Google search indicates that Texas, California, and Florida are in fact the three most populous states in the US, so it is not terribly surprising that they also have the most car crashes. In a future analysis, I could normalize by population size to determine which states have the most crashes on a relative basis.

I also built a very simple machine learning model to predict the number of fatalities in a car crash incident using just three variables: hour of day, number of persons involved, and whether it was a drunk driving incident. The model ultimately did not perform very well (R-squared value of .035), but it does open the door to building a more robust model with many more variables.

In the last part of the analysis, I included a dataset known as the "Worst Drivers by State." This dataset ranks each state according to several criteria, including the number of drunk driving incidents, speeding incidents, and fatal car crashes, among others. The ten worst-ranked states for driving fatalities differ from the ten states with the largest absolute number of driving fatalities. Why is this the case? In the second dataset, states are ranked by fatalities per 100 million vehicle miles traveled; this unit of measurement normalizes the data and clearly affects the results. It serves as an important reminder that data should always be normalized before comparing it.
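To sketch the per-capita normalization mentioned above: the population figures below are hypothetical placeholders rather than real Census numbers, but the mechanics would look roughly like this.
In [ ]:
# Hedged sketch of per-capita normalization; the populations below are
# hypothetical placeholders, not real Census figures.
state_population = pd.Series({'Texas': 27.5e6, 'California': 39.0e6, 'Florida': 20.0e6})
fatals_by_state = df.groupby('STATE_NAME')['FATALS'].sum()
fatals_per_100k = (fatals_by_state / state_population * 100000).dropna()
print(fatals_per_100k.sort_values(ascending=False))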