In [1]:
import pandas as pd
import numpy as np
# Provides better color palettes
import seaborn as sns
from pandas import DataFrame,Series
import matplotlib as mpl
import matplotlib.pyplot as plt
# Command to display the plots in the iPython Notebook
%matplotlib inline
import matplotlib.patches as mpatches
mpl.style.use('seaborn-whitegrid')
plt.style.use('seaborn-talk')
# Extract the list of colors from this style for later use
cycl = mpl.rcParams['axes.prop_cycle']
colors = cycl.by_key()['color']
In [2]:
from CSVtoSQLconverter import load_sql_engine
sqlEngine = load_sql_engine()
In [3]:
users_df = pd.read_sql_query('''SELECT * FROM safer_roads.users
WHERE `accident id` > 2014E+8 AND `accident id` < 2015E+8''',
sqlEngine)
users_df.head()
Out[3]:
The best is to have one function drawing the same plots for different input dataframes, i.e. for different conditions.
In [4]:
def men_women_dist(in_df, title=''):
"""Plot the age distribution for men and women, and the ratio."""
# Extract age for each gender
women_sr = in_df[in_df['sex']==2]['age'].dropna()
men_sr = in_df[in_df['sex']==1]['age'].dropna()
# Bin the ages in bins with 2 years of width
bin_range = range(0,102,2)
women_bins = pd.cut(women_sr, bin_range,right=False).value_counts().sort_index()
men_bins = pd.cut(men_sr, bin_range,right=False).value_counts().sort_index()
# Calculate the ratio between men and women
ratio_bins = men_bins/women_bins
error_bins = ratio_bins * np.sqrt((men_bins+women_bins)/(men_bins*women_bins))
# Plot the stacked distribution of age for men and women
plt.subplots(figsize=(14,8))
plt.subplot(211)
plt.hist([women_sr,men_sr],bins=50,range=(0,100) , stacked=True, normed = False, histtype='bar',
rwidth=1.0, color=[colors[1],colors[0]]);
plt.title(title)
plt.ylabel('counts')
leg_men = mpatches.Patch(color=colors[0], label='Men')
leg_women = mpatches.Patch(color=colors[1], label='Women')
plt.legend(handles=[leg_men,leg_women],frameon=True)
# Plot the ratio men/women
plt.subplot(212)
plt.errorbar(range(1,100,2),ratio_bins,xerr=1.0,yerr=error_bins, fmt='none')
plt.xlim([0,100])
plt.xlabel('age in years')
plt.ylabel('ratio men/women')
In [5]:
men_women_dist(users_df, 'All accidents, all users')
We can see that men are a lot more involved in accidents than women. Let's try to understand this distribution by looking at the distribution for pedestrians, drivers and passagers separately.
Pedestrians are recorded in two categories in the user type column:
We will combine them here because we are not interested in the different cases
In [6]:
men_women_dist(users_df[(users_df['user type'] == 3) | (users_df['user type'] == 4)] , 'All accidents, pedestrians')
The ratio men/women displays some unexpected variations below 40. This can maybe be explained by different activities and lifestyles at different ages for men and women. The fact that the ratio is less than one above 60 could be due to women living longer than men in average. Numbers on the population by gender of France would be useful here.
The user type is 1 for drivers
In [7]:
men_women_dist(users_df[users_df['user type'] == 1] , 'All accidents, drivers')
The dataframe contains all vehicle types, including bicycles which is why we have accidents involving drivers under the age of 16. This analysis of the drivers will be easier to interpret if we separate also by vehicle type. However the vehicle type is not stored in the Users dataframe. We need to import the vehicle type from the Vehicles dataframe.
In [8]:
vehicles_df = pd.read_sql_query('''SELECT * FROM safer_roads.vehicles
WHERE `accident id` > 2014E+8 AND `accident id` < 2015E+8''',
sqlEngine)
users_vt_df = users_df.merge(vehicles_df[['vehicle type','accident id','vehicle id']],
on=['accident id','vehicle id'],how='inner')
users_vt_df.head()
Out[8]:
Cars are identified with the number 7 in the vehicle type column.
In [9]:
men_women_dist(users_df[(users_df['user type'] == 1) & (users_vt_df['vehicle type'] == 7)] ,
'All accidents, drivers of cars')
In [10]:
men_women_dist(users_df[(users_df['user type'] == 2) & (users_vt_df['vehicle type'] == 7)] ,
'All accidents, passengers of cars')
Below 20 the ratio men/women is at 1 which makes sense since the children driven by their parents are distributed equally between the two genders. The slow decrease of the ratio is probably due the tendency of men driving more often than women in most couples.