Section 1

This section is dedicated to the preparation of the dataframes used in the subsequent sections. We present here what is actually performed in CSVtoSQLconverter.py where the CSV files are read into Pandas dataframes, the dataframes are prepared and finally saved in a SQL database. This notebook explains the preparation of the dataframes using the 2014 data.

Characteristics dataframe

The first csv file contains the overall characteristics of each recorded accident such as the location, the luminosity, the type of road, and the precise date and time of the accident.

Python imports



In [1]:

    
import pandas as pd
import numpy as np

from pandas import DataFrame,Series
%matplotlib inline

First look at the data structure



In [2]:

    
caract_df = pd.read_csv("data/2014_France_caracteristiques.csv",sep=',', encoding='latin-1')



In [3]:

    
caract_df.head()









    Out[3]:






  
    
      
      Num_Acc
      an
      mois
      jour
      hrmn
      lum
      agg
      int
      atm
      col
      com
      adr
      gps
      lat
      long
      dep
    
  
  
    
      0
      201400000001
      14
      5
      7
      2015
      1
      2
      1
      1.0
      3
      11
      route de don
      M
      0.0
      0.0
      590
    
    
      1
      201400000002
      14
      5
      31
      430
      1
      2
      1
      1.0
      6
      11
      106 ROUTE DE DON
      M
      0.0
      0.0
      590
    
    
      2
      201400000003
      14
      8
      23
      1800
      1
      2
      9
      1.0
      3
      52
      75 bis rue jean jaures
      M
      0.0
      0.0
      590
    
    
      3
      201400000004
      14
      6
      12
      1700
      1
      2
      1
      1.0
      1
      25
      rue des Sablonnieres D41
      M
      0.0
      0.0
      590
    
    
      4
      201400000005
      14
      6
      23
      500
      2
      1
      1
      1.0
      1
      25
      NaN
      M
      0.0
      0.0
      590

The date is saved in many columns which is not necessarily practical. Also the format for the time is not too handy. The column labels are quite cryptic and not just because they are in French.

Convert time and date to datetime



In [4]:

    
import datetime as dt

The time of the accident is stored in an integer in 24H format such that 5:20pm becomes 1720. However some of the strings are "1" or "15" or "29". One possible explanation is that the recorded number was "029" for 29 minutes past midnight but once converted to an integer it became "29". We can verify this by plotting the two distributions of values for hrmn > 100 (the hour after midnight) and hrmn >= 2300 (the hour before midnight), and compare them.



In [5]:

    
# The hour after midnight
caract_df[caract_df['hrmn'] < 100]['hrmn'].plot.hist(bins=range(1,60),figsize=(12,6))









    Out[5]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f8f3ad4bb38>



In [6]:

    
# The hour before midnight
caract_df[caract_df['hrmn'] >= 2300]['hrmn'].plot.hist(bins=range(2300,2360),figsize=(12,6))









    Out[6]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f8f3ac27400>

We can now confidently convert integers < 100 to strings 4-character long and starting with 0.



In [7]:

    
def columns_to_datetime(sr):
    stringall = "%d-%d-%d %04d" % (sr['an'], sr['mois'], sr['jour'], sr['hrmn'])
    sr = sr.append(pd.Series(dt.datetime.strptime(stringall, "%y-%m-%d %H%M"), index=["datetime"]))
    return sr



In [8]:

    
clean_df = caract_df.apply(columns_to_datetime,axis=1)



In [9]:

    
clean_df.head()









    Out[9]:






  
    
      
      Num_Acc
      an
      mois
      jour
      hrmn
      lum
      agg
      int
      atm
      col
      com
      adr
      gps
      lat
      long
      dep
      datetime
    
  
  
    
      0
      201400000001
      14
      5
      7
      2015
      1
      2
      1
      1.0
      3
      11
      route de don
      M
      0.0
      0.0
      590
      2014-05-07 20:15:00
    
    
      1
      201400000002
      14
      5
      31
      430
      1
      2
      1
      1.0
      6
      11
      106 ROUTE DE DON
      M
      0.0
      0.0
      590
      2014-05-31 04:30:00
    
    
      2
      201400000003
      14
      8
      23
      1800
      1
      2
      9
      1.0
      3
      52
      75 bis rue jean jaures
      M
      0.0
      0.0
      590
      2014-08-23 18:00:00
    
    
      3
      201400000004
      14
      6
      12
      1700
      1
      2
      1
      1.0
      1
      25
      rue des Sablonnieres D41
      M
      0.0
      0.0
      590
      2014-06-12 17:00:00
    
    
      4
      201400000005
      14
      6
      23
      500
      2
      1
      1
      1.0
      1
      25
      NaN
      M
      0.0
      0.0
      590
      2014-06-23 05:00:00



In [10]:

    
# Let's check that our conversion worked properly
clean_df[clean_df['hrmn'] == 29]









    Out[10]:






  
    
      
      Num_Acc
      an
      mois
      jour
      hrmn
      lum
      agg
      int
      atm
      col
      com
      adr
      gps
      lat
      long
      dep
      datetime
    
  
  
    
      20746
      201400020747
      14
      7
      4
      29
      5
      2
      2
      1.0
      3
      231
      MARECHAL LYAUTEY (AVENUE
      NaN
      NaN
      NaN
      210
      2014-07-04 00:29:00



In [11]:

    
# We can remove the original time and date columns
clean_df.drop(['an','mois','jour','hrmn'],axis=1, inplace=True)
clean_df.head()









    Out[11]:






  
    
      
      Num_Acc
      lum
      agg
      int
      atm
      col
      com
      adr
      gps
      lat
      long
      dep
      datetime
    
  
  
    
      0
      201400000001
      1
      2
      1
      1.0
      3
      11
      route de don
      M
      0.0
      0.0
      590
      2014-05-07 20:15:00
    
    
      1
      201400000002
      1
      2
      1
      1.0
      6
      11
      106 ROUTE DE DON
      M
      0.0
      0.0
      590
      2014-05-31 04:30:00
    
    
      2
      201400000003
      1
      2
      9
      1.0
      3
      52
      75 bis rue jean jaures
      M
      0.0
      0.0
      590
      2014-08-23 18:00:00
    
    
      3
      201400000004
      1
      2
      1
      1.0
      1
      25
      rue des Sablonnieres D41
      M
      0.0
      0.0
      590
      2014-06-12 17:00:00
    
    
      4
      201400000005
      2
      1
      1
      1.0
      1
      25
      NaN
      M
      0.0
      0.0
      590
      2014-06-23 05:00:00

Remove unreliable information

The precise location of the accident has not been reliably recorded so we won't be able for example to get a reliable map of the location of the accidents.



In [12]:

    
clean_df.isnull().sum()









    Out[12]:





Num_Acc         0
lum             0
agg             0
int             0
atm            16
col             0
com             0
adr          9602
gps         28894
lat         29273
long        29273
dep             0
datetime        0
dtype: int64



In [13]:

    
clean_df.shape









    Out[13]:





(59854, 13)

Half of the longitude, latitude and the gps variable are missing. Many of the adresses are also missing and we don't have any use for a very precise location so we can drop them. We still have the area code ("dep") for a rough location in France.



In [14]:

    
clean_df.drop(['adr','gps','lat','long'],axis=1, inplace=True)
clean_df.head()









    Out[14]:






  
    
      
      Num_Acc
      lum
      agg
      int
      atm
      col
      com
      dep
      datetime
    
  
  
    
      0
      201400000001
      1
      2
      1
      1.0
      3
      11
      590
      2014-05-07 20:15:00
    
    
      1
      201400000002
      1
      2
      1
      1.0
      6
      11
      590
      2014-05-31 04:30:00
    
    
      2
      201400000003
      1
      2
      9
      1.0
      3
      52
      590
      2014-08-23 18:00:00
    
    
      3
      201400000004
      1
      2
      1
      1.0
      1
      25
      590
      2014-06-12 17:00:00
    
    
      4
      201400000005
      2
      1
      1
      1.0
      1
      25
      590
      2014-06-23 05:00:00

Rename the columns

Some of the column names are a little cryptic such as "col". Let's clarify them.



In [15]:

    
clean_df.rename(columns={'Num_Acc':'accident id','lum':'luminosity','agg':'in city','int':'intersect type',
                         'atm':'weather','col':'collision type','com':'city id','dep':'area id'},
                inplace=True)
clean_df.head()









    Out[15]:






  
    
      
      accident id
      luminosity
      in city
      intersect type
      weather
      collision type
      city id
      area id
      datetime
    
  
  
    
      0
      201400000001
      1
      2
      1
      1.0
      3
      11
      590
      2014-05-07 20:15:00
    
    
      1
      201400000002
      1
      2
      1
      1.0
      6
      11
      590
      2014-05-31 04:30:00
    
    
      2
      201400000003
      1
      2
      9
      1.0
      3
      52
      590
      2014-08-23 18:00:00
    
    
      3
      201400000004
      1
      2
      1
      1.0
      1
      25
      590
      2014-06-12 17:00:00
    
    
      4
      201400000005
      2
      1
      1
      1.0
      1
      25
      590
      2014-06-23 05:00:00

Area ID correction

For some reason most area IDs have an added 0 except for the overseas areas 971, 972, 973, 974, and 976. Corsica is split into the areas 2A and 2B, which are labelled 201 and 202 in this dataframe. We will keep 201 and 202 to keep the column filled with integers. So we will divide by 10 only the values that are multiples of 10.



In [16]:

    
clean_df['area id'].unique()









    Out[16]:





array([590, 620, 800,  20, 600, 570, 760, 270, 350, 220, 290, 560, 140,
       500, 940,  10, 610, 440, 490, 530, 720, 850, 330,  40, 240, 400,
       470, 640, 860, 160, 170, 790, 870, 190, 230, 310,  90, 120, 320,
       650, 460, 810, 690, 820,  70, 260, 380, 420, 730, 740, 630,  30,
       150, 430, 540, 550, 880, 670, 680, 510,  80, 100, 520, 210, 580,
       710, 890, 250, 390, 700, 900, 130,  50,  60, 830, 840, 340, 110,
       300, 480, 660, 201, 202, 780, 910, 950, 770, 450, 180, 280, 360,
       370, 410, 920, 930, 750, 972, 971, 973, 974, 976])

Let's correct those for clarity sake.



In [17]:

    
def correct_area_id(sr):
    if ( (sr['area id'] %10) == 0):
        sr['area id'] = int(sr['area id']/10)
    return sr
clean_df = clean_df.apply(correct_area_id,axis=1)



In [18]:

    
clean_df['area id'].unique()









    Out[18]:





array([ 59,  62,  80,   2,  60,  57,  76,  27,  35,  22,  29,  56,  14,
        50,  94,   1,  61,  44,  49,  53,  72,  85,  33,   4,  24,  40,
        47,  64,  86,  16,  17,  79,  87,  19,  23,  31,   9,  12,  32,
        65,  46,  81,  69,  82,   7,  26,  38,  42,  73,  74,  63,   3,
        15,  43,  54,  55,  88,  67,  68,  51,   8,  10,  52,  21,  58,
        71,  89,  25,  39,  70,  90,  13,   5,   6,  83,  84,  34,  11,
        30,  48,  66, 201, 202,  78,  91,  95,  77,  45,  18,  28,  36,
        37,  41,  92,  93,  75, 972, 971, 973, 974, 976])



In [19]:

    
## This is done in the CSVtoSQLconverter.py script
# clean_df.to_sql(name='characteristics', con=sqlEngine, if_exists = 'replace', index=False)

Locations dataframe

The second csv file contains the details of the accident's location.



In [20]:

    
lieux_df = pd.read_csv("data/2014_France_lieux.csv",sep=',', encoding='latin-1')



In [21]:

    
lieux_df.head()









    Out[21]:






  
    
      
      Num_Acc
      catr
      voie
      v1
      v2
      circ
      nbv
      pr
      pr1
      vosp
      prof
      plan
      lartpc
      larrout
      surf
      infra
      situ
      env1
    
  
  
    
      0
      201400000001
      3
      41.0
      NaN
      NaN
      2.0
      0.0
      0.0
      0.0
      0.0
      1.0
      1.0
      0.0
      60.0
      1.0
      0.0
      1.0
      0.0
    
    
      1
      201400000002
      3
      41.0
      NaN
      NaN
      2.0
      2.0
      0.0
      0.0
      0.0
      1.0
      2.0
      0.0
      62.0
      1.0
      0.0
      4.0
      99.0
    
    
      2
      201400000003
      3
      39.0
      NaN
      NaN
      2.0
      0.0
      0.0
      0.0
      0.0
      1.0
      1.0
      0.0
      60.0
      1.0
      0.0
      1.0
      99.0
    
    
      3
      201400000004
      3
      41.0
      NaN
      NaN
      2.0
      2.0
      0.0
      800.0
      0.0
      0.0
      1.0
      0.0
      60.0
      1.0
      0.0
      1.0
      99.0
    
    
      4
      201400000005
      3
      141.0
      NaN
      NaN
      2.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2.0
      0.0
      0.0
      2.0
      0.0
      0.0
      0.0

Remove useless information

Let's check which information we can drop.



In [22]:

    
lieux_df.isnull().sum()









    Out[22]:





Num_Acc        0
catr           0
voie        3860
v1         59610
v2         57994
circ          79
nbv          142
pr         32641
pr1        32828
vosp         188
prof         144
plan         143
lartpc      1670
larrout     1530
surf         147
infra        112
situ          99
env1         123
dtype: int64



In [23]:

    
lieux_df.shape









    Out[23]:





(59854, 18)

Some of the most unreliable columns are fortunately also not very useful since they pertain to the address of the accident which we have already decided to drop.



In [24]:

    
clean_df = lieux_df.drop(['voie','v1','v2','pr','pr1'],axis=1)
clean_df.head()









    Out[24]:






  
    
      
      Num_Acc
      catr
      circ
      nbv
      vosp
      prof
      plan
      lartpc
      larrout
      surf
      infra
      situ
      env1
    
  
  
    
      0
      201400000001
      3
      2.0
      0.0
      0.0
      1.0
      1.0
      0.0
      60.0
      1.0
      0.0
      1.0
      0.0
    
    
      1
      201400000002
      3
      2.0
      2.0
      0.0
      1.0
      2.0
      0.0
      62.0
      1.0
      0.0
      4.0
      99.0
    
    
      2
      201400000003
      3
      2.0
      0.0
      0.0
      1.0
      1.0
      0.0
      60.0
      1.0
      0.0
      1.0
      99.0
    
    
      3
      201400000004
      3
      2.0
      2.0
      0.0
      0.0
      1.0
      0.0
      60.0
      1.0
      0.0
      1.0
      99.0
    
    
      4
      201400000005
      3
      2.0
      0.0
      0.0
      0.0
      2.0
      0.0
      0.0
      2.0
      0.0
      0.0
      0.0

Rename columns

Again work is needed to clarify the meaning of each column.



In [25]:

    
clean_df.rename(columns={'Num_Acc':'accident id','catr':'road type','circ':'traffic mode',
                         'nbv':'nb lanes','vosp':'reserved lane','prof':'road profil',
                         'plan':'road alignment','lartpc':'central reservation',
                         'larrout':'road width','surf':'road surface','infra':'installations',
                         'situ':'location','env1':'school distance'},
                inplace=True)
clean_df.head()









    Out[25]:






  
    
      
      accident id
      road type
      traffic mode
      nb lanes
      reserved lane
      road profil
      road alignment
      central reservation
      road width
      road surface
      installations
      location
      school distance
    
  
  
    
      0
      201400000001
      3
      2.0
      0.0
      0.0
      1.0
      1.0
      0.0
      60.0
      1.0
      0.0
      1.0
      0.0
    
    
      1
      201400000002
      3
      2.0
      2.0
      0.0
      1.0
      2.0
      0.0
      62.0
      1.0
      0.0
      4.0
      99.0
    
    
      2
      201400000003
      3
      2.0
      0.0
      0.0
      1.0
      1.0
      0.0
      60.0
      1.0
      0.0
      1.0
      99.0
    
    
      3
      201400000004
      3
      2.0
      2.0
      0.0
      0.0
      1.0
      0.0
      60.0
      1.0
      0.0
      1.0
      99.0
    
    
      4
      201400000005
      3
      2.0
      0.0
      0.0
      0.0
      2.0
      0.0
      0.0
      2.0
      0.0
      0.0
      0.0



In [26]:

    
## This is done in the CSVtoSQLconverter.py script
# clean_df.to_sql(name='locations', con=sqlEngine, if_exists = 'replace', index=False)

Vehicle dataframe

All the details of the vehicles involved in the accident are listed in this dataframe. This time the same accident ID can appear in more than one entry since each entry is one of the vehicles involved in the accident.



In [27]:

    
vehicules_df = pd.read_csv("data/2014_France_vehicules.csv",sep=',', encoding='latin-1')



In [28]:

    
vehicules_df.head()









    Out[28]:






  
    
      
      Num_Acc
      senc
      catv
      occutc
      obs
      obsm
      choc
      manv
      num_veh
    
  
  
    
      0
      201400000001
      0
      33
      0
      0.0
      2.0
      1.0
      1.0
      A01
    
    
      1
      201400000001
      0
      7
      0
      0.0
      0.0
      6.0
      15.0
      B02
    
    
      2
      201400000002
      0
      7
      0
      1.0
      0.0
      7.0
      13.0
      A01
    
    
      3
      201400000003
      0
      2
      0
      0.0
      2.0
      1.0
      1.0
      A01
    
    
      4
      201400000003
      0
      7
      0
      0.0
      2.0
      7.0
      15.0
      B02

Remove columns

Most columns are really useful except for the traffic sense which has mostly zeroes meaning that the sense wasn't recorded. We can remove it.



In [29]:

    
vehicules_df['senc'].value_counts().head()









    Out[29]:





0    98441
1     2862
2      459
Name: senc, dtype: int64



In [30]:

    
clean_df = vehicules_df.drop('senc',axis=1)



In [31]:

    
clean_df.head()









    Out[31]:






  
    
      
      Num_Acc
      catv
      occutc
      obs
      obsm
      choc
      manv
      num_veh
    
  
  
    
      0
      201400000001
      33
      0
      0.0
      2.0
      1.0
      1.0
      A01
    
    
      1
      201400000001
      7
      0
      0.0
      0.0
      6.0
      15.0
      B02
    
    
      2
      201400000002
      7
      0
      1.0
      0.0
      7.0
      13.0
      A01
    
    
      3
      201400000003
      2
      0
      0.0
      2.0
      1.0
      1.0
      A01
    
    
      4
      201400000003
      7
      0
      0.0
      2.0
      7.0
      15.0
      B02

Reorganize vehicle type categories

The "catv" column contains vehicle types with values between 1 and 99. However only 24 categories are used as we can see here:



In [32]:

    
clean_df['catv'].value_counts().sort_index()









    Out[32]:





1      4684
2      3844
3       444
7     62778
10     5604
13      343
14     1042
15      941
16       35
17      662
20       88
21      218
30     4543
31     2486
32     2295
33     8533
34     1445
35       32
36      158
37      738
38      211
39       33
40      149
99      456
Name: catv, dtype: int64

It is more convenient to reclassify the types using categories from 1 to 24:



In [33]:

    
newMap = {}
for i,I in enumerate(clean_df['catv'].value_counts().sort_index().index):
    newMap[I] = i+1
clean_df['catv'] = clean_df['catv'].map(newMap)
clean_df['catv'].value_counts().sort_index()









    Out[33]:





1      4684
2      3844
3       444
4     62778
5      5604
6       343
7      1042
8       941
9        35
10      662
11       88
12      218
13     4543
14     2486
15     2295
16     8533
17     1445
18       32
19      158
20      738
21      211
22       33
23      149
24      456
Name: catv, dtype: int64

Rename columns

And now the renaming of the columns.



In [34]:

    
clean_df.rename(columns={'Num_Acc':'accident id', 'catv':'vehicle type','obs':'fixed obj hit',
                         'obsm':'moving obj hit', 'choc':'impact location','manv':'maneuver',
                         'occutc':'nb occupants public transit', 'num_veh':'vehicle id'},inplace=True)



In [35]:

    
clean_df.head()









    Out[35]:






  
    
      
      accident id
      vehicle type
      nb occupants public transit
      fixed obj hit
      moving obj hit
      impact location
      maneuver
      vehicle id
    
  
  
    
      0
      201400000001
      16
      0
      0.0
      2.0
      1.0
      1.0
      A01
    
    
      1
      201400000001
      4
      0
      0.0
      0.0
      6.0
      15.0
      B02
    
    
      2
      201400000002
      4
      0
      1.0
      0.0
      7.0
      13.0
      A01
    
    
      3
      201400000003
      2
      0
      0.0
      2.0
      1.0
      1.0
      A01
    
    
      4
      201400000003
      4
      0
      0.0
      2.0
      7.0
      15.0
      B02



In [36]:

    
## This is done in the CSVtoSQLconverter.py script
# clean_df.to_sql(name='vehicles', con=sqlEngine, if_exists = 'replace', index=False)

Users dataframe

This final dataframe contains the information relative to the people or road users involved in the accident. The dataframe contains both personal information on the users such as their birthdate or their sex, but it also has for example information on the safety gear worn, the location and direction of the pedestrians hit.



In [37]:

    
usagers_df = pd.read_csv("data/2014_France_usagers.csv",sep=',', encoding='latin-1')



In [38]:

    
usagers_df.head()









    Out[38]:






  
    
      
      Num_Acc
      place
      catu
      grav
      sexe
      trajet
      secu
      locp
      actp
      etatp
      an_nais
      num_veh
    
  
  
    
      0
      201400000001
      1.0
      1
      3
      1
      5.0
      21.0
      0.0
      0.0
      0.0
      1971.0
      A01
    
    
      1
      201400000001
      1.0
      1
      1
      1
      5.0
      11.0
      0.0
      0.0
      0.0
      1992.0
      B02
    
    
      2
      201400000002
      1.0
      1
      4
      1
      5.0
      11.0
      0.0
      0.0
      0.0
      1983.0
      A01
    
    
      3
      201400000002
      2.0
      2
      3
      1
      0.0
      11.0
      0.0
      0.0
      0.0
      1990.0
      A01
    
    
      4
      201400000003
      1.0
      1
      3
      1
      5.0
      21.0
      0.0
      0.0
      0.0
      1974.0
      A01

We will keep all the columns of this dataframe since they all contain some interesting data.

Reformat columns

The 'secu' and 'an_nais' columns represent respectively the safety gear used (or not used) and the birth year of each user. The 'secu' is not too practical to use as it is since it store the type of safety gear in the first digit, and if the safety gear was used or not in the second digit of an integer. It will be simpler to have 2 columns for these two items.



In [39]:

    
usagers_df['secu'].value_counts(dropna=False)









    Out[39]:





 11.0    72958
 21.0    23671
 93.0    11646
 13.0    11179
NaN       4035
 23.0     2645
 12.0     1911
 22.0     1281
 31.0      996
 91.0      546
 92.0      429
 1.0       301
 41.0      222
 43.0      120
 42.0      118
 32.0       47
 33.0       46
 2.0        19
 3.0        16
Name: secu, dtype: int64

We should always have the first digit since the value 9 represent 'others' for non-defined types.



In [40]:

    
def safety_to_twocolumns(sr):
    if not np.isnan(sr['secu']):
        safetychar = "%2d" % (sr['secu'])
        if safetychar[0] != ' ':
            sr = sr.append(pd.Series(int(safetychar[0]), index=["safety gear type"]))
        sr = sr.append(pd.Series(int(safetychar[1]), index=["safety gear worn"]))
    return sr



In [41]:

    
clean_df = usagers_df.apply(safety_to_twocolumns,axis=1)



In [42]:

    
clean_df.head()









    Out[42]:






  
    
      
      Num_Acc
      actp
      an_nais
      catu
      etatp
      grav
      locp
      num_veh
      place
      safety gear type
      safety gear worn
      secu
      sexe
      trajet
    
  
  
    
      0
      201400000001
      0.0
      1971.0
      1
      0.0
      3
      0.0
      A01
      1.0
      2.0
      1.0
      21.0
      1
      5.0
    
    
      1
      201400000001
      0.0
      1992.0
      1
      0.0
      1
      0.0
      B02
      1.0
      1.0
      1.0
      11.0
      1
      5.0
    
    
      2
      201400000002
      0.0
      1983.0
      1
      0.0
      4
      0.0
      A01
      1.0
      1.0
      1.0
      11.0
      1
      5.0
    
    
      3
      201400000002
      0.0
      1990.0
      2
      0.0
      3
      0.0
      A01
      2.0
      1.0
      1.0
      11.0
      1
      0.0
    
    
      4
      201400000003
      0.0
      1974.0
      1
      0.0
      3
      0.0
      A01
      1.0
      2.0
      1.0
      21.0
      1
      5.0



In [43]:

    
clean_df.drop('secu',inplace=True,axis=1)
clean_df.head()









    Out[43]:






  
    
      
      Num_Acc
      actp
      an_nais
      catu
      etatp
      grav
      locp
      num_veh
      place
      safety gear type
      safety gear worn
      sexe
      trajet
    
  
  
    
      0
      201400000001
      0.0
      1971.0
      1
      0.0
      3
      0.0
      A01
      1.0
      2.0
      1.0
      1
      5.0
    
    
      1
      201400000001
      0.0
      1992.0
      1
      0.0
      1
      0.0
      B02
      1.0
      1.0
      1.0
      1
      5.0
    
    
      2
      201400000002
      0.0
      1983.0
      1
      0.0
      4
      0.0
      A01
      1.0
      1.0
      1.0
      1
      5.0
    
    
      3
      201400000002
      0.0
      1990.0
      2
      0.0
      3
      0.0
      A01
      2.0
      1.0
      1.0
      1
      0.0
    
    
      4
      201400000003
      0.0
      1974.0
      1
      0.0
      3
      0.0
      A01
      1.0
      2.0
      1.0
      1
      5.0

The birth year is not that interesting by itself. The age of the users seems more interesting. We can replace the birth year by the age.



In [44]:

    
clean_df['age'] = 2014 - clean_df['an_nais']



In [45]:

    
clean_df.drop('an_nais',inplace=True,axis=1)
clean_df.head()









    Out[45]:






  
    
      
      Num_Acc
      actp
      catu
      etatp
      grav
      locp
      num_veh
      place
      safety gear type
      safety gear worn
      sexe
      trajet
      age
    
  
  
    
      0
      201400000001
      0.0
      1
      0.0
      3
      0.0
      A01
      1.0
      2.0
      1.0
      1
      5.0
      43.0
    
    
      1
      201400000001
      0.0
      1
      0.0
      1
      0.0
      B02
      1.0
      1.0
      1.0
      1
      5.0
      22.0
    
    
      2
      201400000002
      0.0
      1
      0.0
      4
      0.0
      A01
      1.0
      1.0
      1.0
      1
      5.0
      31.0
    
    
      3
      201400000002
      0.0
      2
      0.0
      3
      0.0
      A01
      2.0
      1.0
      1.0
      1
      0.0
      24.0
    
    
      4
      201400000003
      0.0
      1
      0.0
      3
      0.0
      A01
      1.0
      2.0
      1.0
      1
      5.0
      40.0

Rename the columns



In [46]:

    
clean_df.rename(columns={'Num_Acc':'accident id', 'num_veh':'vehicle id',
                         'place':'location in vehicle','catu':'user type',
                         'grav':'severity','sexe':'sex','trajet':'journey type',
                         'locp':'pedestrian location','actp':'pedestrian action',
                         'etatp':'pedestrian company'},inplace=True)



In [47]:

    
clean_df.head()









    Out[47]:






  
    
      
      accident id
      pedestrian action
      user type
      pedestrian company
      severity
      pedestrian location
      vehicle id
      location in vehicle
      safety gear type
      safety gear worn
      sex
      journey type
      age
    
  
  
    
      0
      201400000001
      0.0
      1
      0.0
      3
      0.0
      A01
      1.0
      2.0
      1.0
      1
      5.0
      43.0
    
    
      1
      201400000001
      0.0
      1
      0.0
      1
      0.0
      B02
      1.0
      1.0
      1.0
      1
      5.0
      22.0
    
    
      2
      201400000002
      0.0
      1
      0.0
      4
      0.0
      A01
      1.0
      1.0
      1.0
      1
      5.0
      31.0
    
    
      3
      201400000002
      0.0
      2
      0.0
      3
      0.0
      A01
      2.0
      1.0
      1.0
      1
      0.0
      24.0
    
    
      4
      201400000003
      0.0
      1
      0.0
      3
      0.0
      A01
      1.0
      2.0
      1.0
      1
      5.0
      40.0



In [48]:

    
## This is done in the CSVtoSQLconverter.py script
# clean_df.to_sql(name='users', con=sqlEngine, if_exists = 'replace', index=False)

	Num_Acc	an	mois	jour	hrmn	lum	agg	int	atm	col	com	adr	gps	dep
0	201400000001	14	5	7	2015	1	2	1	1.0	3	11	route de don	M	590
1	201400000002	14	5	31	430	1	2	1	1.0	6	11	106 ROUTE DE DON	M	590
2	201400000003	14	8	23	1800	1	2	9	1.0	3	52	75 bis rue jean jaures	M	590
3	201400000004	14	6	12	1700	1	2	1	1.0	1	25	rue des Sablonnieres D41	M	590
4	201400000005	14	6	23	500	2	1	1	1.0	1	25	NaN	M	590

	Num_Acc	lum	agg	int	atm	col	com	adr	gps	dep	datetime
0	201400000001	1	2	1	1.0	3	11	route de don	M	590	2014-05-07 20:15:00
1	201400000002	1	2	1	1.0	6	11	106 ROUTE DE DON	M	590	2014-05-31 04:30:00
2	201400000003	1	2	9	1.0	3	52	75 bis rue jean jaures	M	590	2014-08-23 18:00:00
3	201400000004	1	2	1	1.0	1	25	rue des Sablonnieres D41	M	590	2014-06-12 17:00:00
4	201400000005	2	1	1	1.0	1	25	NaN	M	590	2014-06-23 05:00:00

	Num_Acc	catr	voie	v1	v2	circ	nbv	pr1	prof	plan	larrout	surf	situ	env1
0	201400000001	3	41.0	NaN	NaN	2.0	0.0	0.0	1.0	1.0	60.0	1.0	1.0	0.0
1	201400000002	3	41.0	NaN	NaN	2.0	2.0	0.0	1.0	2.0	62.0	1.0	4.0	99.0
2	201400000003	3	39.0	NaN	NaN	2.0	0.0	0.0	1.0	1.0	60.0	1.0	1.0	99.0
3	201400000004	3	41.0	NaN	NaN	2.0	2.0	800.0	0.0	1.0	60.0	1.0	1.0	99.0
4	201400000005	3	141.0	NaN	NaN	2.0	0.0	0.0	0.0	2.0	0.0	2.0	0.0	0.0

	Num_Acc	catv	obs	obsm	choc	manv	num_veh
0	201400000001	33	0.0	2.0	1.0	1.0	A01
1	201400000001	7	0.0	0.0	6.0	15.0	B02
2	201400000002	7	1.0	0.0	7.0	13.0	A01
3	201400000003	2	0.0	2.0	1.0	1.0	A01
4	201400000003	7	0.0	2.0	7.0	15.0	B02

	Num_Acc	place	catu	grav	sexe	trajet	secu	an_nais	num_veh
0	201400000001	1.0	1	3	1	5.0	21.0	1971.0	A01
1	201400000001	1.0	1	1	1	5.0	11.0	1992.0	B02
2	201400000002	1.0	1	4	1	5.0	11.0	1983.0	A01
3	201400000002	2.0	2	3	1	0.0	11.0	1990.0	A01
4	201400000003	1.0	1	3	1	5.0	21.0	1974.0	A01

	Num_Acc	catu	grav	num_veh	place	safety gear type	safety gear worn	sexe	trajet	age
0	201400000001	1	3	A01	1.0	2.0	1.0	1	5.0	43.0
1	201400000001	1	1	B02	1.0	1.0	1.0	1	5.0	22.0
2	201400000002	1	4	A01	1.0	1.0	1.0	1	5.0	31.0
3	201400000002	2	3	A01	2.0	1.0	1.0	1	0.0	24.0
4	201400000003	1	3	A01	1.0	2.0	1.0	1	5.0	40.0