We will start by deciding which features we want to use to predict, with machine learning, the severity of the accident for each user. We'll also do a final clean-up of the features to remove bad entries. The new dataframe with the target and the wanted features will be stored in its entirety in a MySQL database to be used later on in the rest of Section 3.
The creation of the training and testing samples from the dataset stored in the MySQL database will be detailed at the end of this section. This last process, creating the samples, will be implemented in the file CreateTrainAndTestSamples.py, to be called in the following part of Section 3.
The target data is the severity column in the Users dataframe.
The features useful for predicting the severity of the accident will be selected from the users, characteristics, vehicles and locations tables loaded below.
In [1]:
import pandas as pd
import numpy as np
# Provides better color palettes
import seaborn as sns
from pandas import DataFrame,Series
import matplotlib as mpl
import matplotlib.pyplot as plt
# Display the plots inline in the IPython notebook
%matplotlib inline
import matplotlib.patches as mpatches
mpl.style.use('seaborn-whitegrid')
plt.style.use('seaborn-talk')
# Extract the list of colors from this style for later use
cycl = mpl.rcParams['axes.prop_cycle']
colors = cycl.by_key()['color']
In [2]:
from CSVtoSQLconverter import load_sql_engine
sqlEngine = load_sql_engine()
In [3]:
users_df = pd.read_sql_query('SELECT * FROM safer_roads.users',
sqlEngine)
users_df.head()
Out[3]:
We keep the original number of entries in this dataframe to know how much we lose overall during the preparation of the dataset.
In [4]:
original_nb_entries = users_df.shape[0]
In [5]:
pd.options.display.float_format = '{:20,.2f}'.format
def check_columns(dataf,columns):
    # Count the zero, negative and NaN entries in each of the requested columns
    zeroes_sr = (dataf[columns] == 0).astype(int).sum(axis=0)
    neg_sr = (dataf[columns] < 0).astype(int).sum(axis=0)
    nans_sr = dataf[columns].isnull().sum()
    nbent = dataf.shape[0]
    rows = []
    rows.append(DataFrame(zeroes_sr,columns=['nb zeroes']))
    rows.append(DataFrame((100*zeroes_sr/nbent),columns=['% zeroes']))
    rows.append(DataFrame(neg_sr,columns=['nb neg. vals']))
    rows.append(DataFrame((100*neg_sr/nbent),columns=['% neg. vals']))
    rows.append(DataFrame(nans_sr,columns=['nb nans']))
    rows.append(DataFrame((100*nans_sr/nbent),columns=['% nans']))
    checks_df = pd.concat(rows,axis=1)
    print(' - Total number of entries: {}\n'.format(dataf.shape[0]))
    print('List of NaN, zero and negative entries:\n{}\n\n'.format(checks_df))
In [6]:
check_columns(users_df, ['location in vehicle', 'user type', 'age', 'sex',
'journey type', 'safety gear worn','safety gear type',
'pedestrian action', 'pedestrian location','severity'])
A few different corrections are needed:
In [7]:
# 1. Replace NaN with zeros
users_df['location in vehicle'].fillna(0, inplace=True)
# 2. Remove rows with NaN in age column
users_df.dropna(subset=['age'], inplace=True)
# 3. Replace NaN with "other" or "unknown" categories
users_df['journey type'].fillna(9, inplace=True)
users_df['safety gear type'].fillna(9, inplace=True)
users_df['safety gear worn'].fillna(3, inplace=True)
users_df.replace(to_replace={'journey type':{0:9}}, inplace=True)
In [8]:
check_columns(users_df, ['location in vehicle', 'user type', 'age', 'sex',
'journey type', 'safety gear worn','safety gear type',
'severity'])
In [9]:
feat_target_df = users_df[['accident id','vehicle id','severity','location in vehicle', 'user type',
'age', 'sex', 'journey type', 'safety gear worn','safety gear type']]
feat_target_df.head()
Out[9]:
In [10]:
charact_df = pd.read_sql_query('SELECT * FROM safer_roads.characteristics',
sqlEngine)
charact_df.head()
Out[10]:
In [11]:
wanted_cols = ['luminosity', 'in city', 'intersect type', 'weather', 'collision type']
check_columns(charact_df, wanted_cols)
The intersection type has some entries with value 0, which does not correspond to any category, but they concern very few instances overall, so we can drop them.
On the other hand, some weather and collision type entries contain NaN. Category 9 in weather and category 6 in collision type both correspond to "other". Since very few entries have NaN, we can move them into this vague category.
In [12]:
charact_df['weather'].fillna(9,inplace=True)
charact_df['collision type'].fillna(6,inplace=True)
charact_df = charact_df[charact_df['intersect type'] != 0]
In [13]:
check_columns(charact_df, wanted_cols)
In [14]:
wanted_cols.append('accident id')
feat_target_df = feat_target_df.merge(charact_df[wanted_cols],
on=['accident id'],how='inner')
print(' -> Number of entries in the dataset: {}\n'.format(feat_target_df.shape[0]))
feat_target_df.head()
Out[14]:
In [15]:
vehicles_df = pd.read_sql_query('SELECT * FROM safer_roads.vehicles',
sqlEngine)
vehicles_df.head()
Out[15]:
In [16]:
wanted_cols = ['vehicle type', 'fixed obj hit', 'moving obj hit',
'impact location', 'maneuver']
check_columns(vehicles_df, wanted_cols)
Vehicle type is perfectly fine. The other four columns have a very small number of NaN entries and a large number of zeros; the zeros represent either missing information or, for the object-hit columns, the fact that no object, fixed or moving, was hit. So we can replace the NaNs with zeros.
In [17]:
for col in wanted_cols[1:]:
    vehicles_df[col].fillna(0,inplace=True)
In [18]:
wanted_cols.extend(['accident id','vehicle id'])
feat_target_df = feat_target_df.merge(vehicles_df[wanted_cols],
on=['accident id','vehicle id'],how='inner')
feat_target_df.head()
Out[18]:
In [19]:
locations_df = pd.read_sql_query('SELECT * FROM safer_roads.locations',
sqlEngine)
locations_df.head()
Out[19]:
In [20]:
wanted_cols = ['road type', 'traffic mode', 'nb lanes',
'road profil', 'road alignment','road surface',
'road width', 'installations', 'location']
check_columns(locations_df, wanted_cols)
For most columns we can again simply replace NaN with 0, which corresponds to an "unknown" category.
The columns "road width" and "installations" have respectively 35% and 90% of zeros, so it is preferable to drop them.
In [21]:
wanted_cols.remove('road width')
wanted_cols.remove('installations')
In [22]:
for col in wanted_cols[1:]:
    locations_df[col].fillna(0,inplace=True)
In [23]:
wanted_cols.extend(['accident id'])
feat_target_df = feat_target_df.merge(locations_df[wanted_cols],
on=['accident id'],how='inner')
feat_target_df.head()
Out[23]:
From the existing features we can derive new ones to improve the predictive power.
The relative size of the vehicles involved in an accident has a direct impact on the gravity of the accident. A complementary piece of information would be the speed, from which we could deduce the momentum; however we do not have any data on that.
When going through the dataset user by user, the vehicle information associated with each driver or passenger ('user type' = 1 or 2) corresponds to their own vehicle, so the row carries no information on the other vehicles in the accident. If the entry is for a pedestrian ('user type' = 3 or 4), then the associated vehicle is the one that hit the pedestrian.
We will create a new column for the weight differential taking into account the two cases: for drivers and passengers, the difference between the weight of their own vehicle and that of the heaviest vehicle involved in the accident; for pedestrians, simply the weight of the vehicle that hit them.
The mapping from 'vehicle type' to a crude estimate of the average vehicle weight in kilograms is stored in the Mapper.py script.
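As a purely illustrative sketch, the Vehicle_Weights dictionary in Mapper.py could look like the following; the category codes and weight values shown here are placeholders, and the actual mapping is the one defined in that script.
# Hypothetical illustration of the Vehicle_Weights mapping from Mapper.py.
# Each 'vehicle type' category code is mapped to a crude average weight in
# kilograms; the codes and values below are examples only.
Vehicle_Weights = {
    1: 20,       # e.g. bicycle
    2: 100,      # e.g. moped
    7: 1300,     # e.g. passenger car
    10: 1900,    # e.g. light utility vehicle
    17: 19000,   # e.g. truck with trailer
}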
In [24]:
from Mapper import Vehicle_Weights
# First we map all the vehicle types to the average weight
feat_target_df['weight diff'] = feat_target_df['vehicle type'].map(Vehicle_Weights)
# Then calculate the differential for drivers and passengers
mask = feat_target_df['user type'].isin([1,2])
feat_target_df.loc[mask,'weight diff'] = feat_target_df.groupby('accident id')['weight diff']\
                                                       .transform(lambda x: x - x.max())
In [25]:
feat_target_df.head()
Out[25]:
In [26]:
# Shuffle the entries so that contiguous slices are random samples
feat_target_df = feat_target_df.reindex(np.random.permutation(feat_target_df.index))
feat_target_df.head()
Out[26]:
In [27]:
feat_target_df.drop(['accident id','vehicle id'],axis=1,inplace=True)
feat_target_df.head()
Out[27]:
Most of the categorical variables are encoded as integers; however, their columns in the dataframe have a float type. Converting them back to integers matters both to save disk space when we store the dataset and to convert the categorical variables into binary columns later on.
In [28]:
# List the columns stored as floats
float_col = feat_target_df.loc[:, feat_target_df.dtypes == float].columns
float_col
Out[28]:
Of all the columns stored as floats, the age is the only one that should remain a float.
In [29]:
float_col = float_col.drop('age')
for col in float_col:
    feat_target_df[col] = feat_target_df[col].astype(int)
In [30]:
nb_diff = original_nb_entries - feat_target_df.shape[0]
print(' We lost {} events during clean up, which represents {:.2f}% of the data.'.format(
    nb_diff, (100.*nb_diff/original_nb_entries)))
In [31]:
# chunksize is needed for this big dataframe otherwise the transfer will fail
# (unless you change your settings in MariaDB)
feat_target_df.to_sql(name='ml_dataset', con=sqlEngine, if_exists = 'replace',
index=False, chunksize = 100)
In [32]:
# mldata_df = pd.read_sql_query('SELECT * FROM ml_dataset',sqlEngine)
mldata_df = feat_target_df
mldata_df.head()
Out[32]:
In [33]:
def plot_severity(df):
    severity_sr = df['severity'].map({1:'Indemn',2:'Dead',3:'Hospitalized injured',4:'Lightly injured'})
    sns.countplot(x='severity', data=DataFrame(severity_sr),
                  order=['Indemn','Lightly injured','Hospitalized injured','Dead']);
In [34]:
plot_severity(mldata_df)
The classes are heavily imbalanced, which could be a big problem for the training of the algorithms: they would predict 'Indemn' and 'Lightly injured' much more often, and that alone would mechanically increase the accuracy. One solution is to create a training set that contains a balanced mix of the 4 classes.
In [35]:
RatioDead = 100. * mldata_df[mldata_df['severity'] == 2].shape[0] / mldata_df.shape[0]
print('The number of road users who died in their accident represents about {:.1f}% of the total number of recorded users'.format(RatioDead))
Since we already shuffled the entries in the dataset prior to storing them in the SQL database, we can simply take a contiguous block of entries to obtain a totally random sample for testing. However, for the training sample we need to correct for the class imbalance first.
We will use part of the original dataset to create our training sample.
In [36]:
raw_training_df = mldata_df.head(100000)
n_sev2 = raw_training_df[raw_training_df.severity==2].shape[0]
print("We have {} entries with severity 2. We need the same amount for the other classes.".format(n_sev2))
In [37]:
list_severities = []
for sev_id in range(1,5):
    list_severities.append(raw_training_df[raw_training_df.severity==sev_id].head(n_sev2))
training_df = pd.concat(list_severities)
plot_severity(training_df)
Now our training sample is totally balanced.
In order to process the dataset through machine learning algorithms we must first convert each categorical variable into binary columns. Pretty much all the columns are categorical except for the age and the weight differential, so we'll remove those two from the list of columns that need to be converted.
In [38]:
all_col = training_df.columns
categ_col = all_col.drop(['severity','age','weight diff'])
training_df = pd.get_dummies(training_df,prefix=categ_col, columns=categ_col)
training_df.head()
Out[38]:
In [39]:
training_df[['age','weight diff']].describe().loc[['min','max']]
Out[39]:
The age and the weight differential are the only continuous features; we rescale them to the [0, 1] range, using the extrema of the training sample, so that they are on a footing comparable with the binary columns.
In [40]:
max_age = training_df['age'].max()
training_df['age'] = training_df['age'] / max_age
In [41]:
max_weight_diff = training_df['weight diff'].max()
min_weight_diff = training_df['weight diff'].min()
training_df['weight diff'] = (training_df['weight diff'] - min_weight_diff) / (max_weight_diff - min_weight_diff)
In [42]:
training_df[['age','weight diff']].describe().loc[['min','max']]
Out[42]:
In [43]:
# Take the 20,000 entries that follow the block used for the training sample
testing_df = mldata_df.head(120000).tail(20000)
In [44]:
all_col = testing_df.columns
categ_col = all_col.drop(['severity','age','weight diff'])
testing_df = pd.get_dummies(testing_df,prefix=categ_col, columns=categ_col)
testing_df.head()
Out[44]:
In [45]:
testing_df['age'] = testing_df['age'] / max_age
testing_df['weight diff'] = (testing_df['weight diff'] - min_weight_diff) / (max_weight_diff - min_weight_diff)
testing_df[['age','weight diff']].describe().loc[['min','max']]
Out[45]:
The weight differential here does not span exactly 0.0 to 1.0 because the testing sample happens to have a different minimum and maximum; however, it is on the same scale as the training sample.
Both the training and the testing samples are now ready to be used in a machine learning algorithm.
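As announced at the start of this section, the steps above (slicing the shuffled dataset, balancing the classes, one-hot encoding the categorical columns and rescaling the continuous ones) are meant to live in CreateTrainAndTestSamples.py. Below is a minimal sketch of how such a helper could be organised; the function name, signature and default sample sizes are assumptions for illustration, not the actual content of the script.
# Hypothetical sketch of CreateTrainAndTestSamples.py; the real script may
# differ in naming and details.
import pandas as pd

def create_train_and_test_samples(sql_engine, n_train=100000, n_test=20000):
    """Build a balanced training sample and a random testing sample
    from the ml_dataset table prepared above."""
    mldata_df = pd.read_sql_query('SELECT * FROM ml_dataset', sql_engine)

    # The dataset was shuffled before being stored, so contiguous slices
    # are already random samples.
    raw_training_df = mldata_df.head(n_train)
    testing_df = mldata_df.head(n_train + n_test).tail(n_test)

    # Balance the four severity classes on the size of the rarest one (severity 2)
    n_sev2 = raw_training_df[raw_training_df.severity == 2].shape[0]
    parts = [raw_training_df[raw_training_df.severity == sev].head(n_sev2)
             for sev in range(1, 5)]
    training_df = pd.concat(parts)

    # One-hot encode every categorical column (everything except the target
    # and the two continuous features)
    categ_col = training_df.columns.drop(['severity', 'age', 'weight diff'])
    training_df = pd.get_dummies(training_df, prefix=categ_col, columns=categ_col)
    testing_df = pd.get_dummies(testing_df, prefix=categ_col, columns=categ_col)

    # Rescale the continuous features with the extrema of the training sample
    max_age = training_df['age'].max()
    min_wd, max_wd = training_df['weight diff'].min(), training_df['weight diff'].max()
    for df in (training_df, testing_df):
        df['age'] = df['age'] / max_age
        df['weight diff'] = (df['weight diff'] - min_wd) / (max_wd - min_wd)

    return training_df, testing_df
Note that applying get_dummies separately to the two samples assumes every category value appears in both; in practice the resulting columns may need to be aligned (for instance with DataFrame.align) before feeding them to a model.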