Capstone Project - Flight Delays

Does weather events have impact the delay of flights (Brazil)?

It is important to see this notebook with the step-by-step of the dataset cleaning process:

https://github.com/davicsilva/dsintensive/blob/master/notebooks/flightDelayPrepData_v2.ipynb



In [1]:

    
from datetime import datetime

# Pandas and NumPy
import pandas as pd
import numpy as np

# Matplotlib for additional customization
from matplotlib import pyplot as plt
%matplotlib inline

# Seaborn for plotting and styling
import seaborn as sns



In [2]:

    
# 1. Flight delay: any flight with (real_departure - planned_departure >= 15 minutes)
# 2. The Brazilian Federal Agency for Civil Aviation (ANAC) does not define exactly what is a "flight delay" (in minutes)
# 3. Anyway, the ANAC has a resolution for this subject: https://goo.gl/YBwbMy (last access: nov, 15th, 2017)
# ---
# DELAY, for this analysis, is defined as greater than 15 minutes (local flights only)
DELAY = 15

1 - Local flights dataset. For now, only flights from January to September, 2017

A note about date columns on this dataset

In the original dataset (CSV file from ANAC), the date was not in ISO8601 format (e.g. '2017-10-31 09:03:00')
To fix this I used regex (regular expression) to transform this column directly on CSV file
The original date was "31/10/2017 09:03" (october, 31, 2017 09:03)



In [3]:

    
#[flights] dataset_01 => all "Active Regular Flights" from 2017, from january to september
#source: http://www.anac.gov.br/assuntos/dados-e-estatisticas/historico-de-voos
#Last access this website: nov, 14th, 2017
flights = pd.read_csv('data/arf2017ISO.csv', sep = ';', dtype = str)



In [4]:

    
flights['departure-est'] = flights[['departure-est']].apply(lambda row: row.str.replace("(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4}) (?P<HOUR>\d{2}):(?P<MIN>\d{2})", "\g<year>/\g<month>/\g<day> \g<HOUR>:\g<MIN>:00"), axis=1)



In [5]:

    
flights['departure-real'] = flights[['departure-real']].apply(lambda row: row.str.replace("(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4}) (?P<HOUR>\d{2}):(?P<MIN>\d{2})", "\g<year>/\g<month>/\g<day> \g<HOUR>:\g<MIN>:00"), axis=1)
flights['arrival-est'] = flights[['arrival-est']].apply(lambda row: row.str.replace("(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4}) (?P<HOUR>\d{2}):(?P<MIN>\d{2})", "\g<year>/\g<month>/\g<day> \g<HOUR>:\g<MIN>:00"), axis=1)
flights['arrival-real'] = flights[['arrival-real']].apply(lambda row: row.str.replace("(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4}) (?P<HOUR>\d{2}):(?P<MIN>\d{2})", "\g<year>/\g<month>/\g<day> \g<HOUR>:\g<MIN>:00"), axis=1)



In [6]:

    
# Departure and Arrival columns: from 'object' to 'date' format
flights['departure-est'] = pd.to_datetime(flights['departure-est'], errors='ignore')
flights['departure-real'] = pd.to_datetime(flights['departure-real'], errors='ignore')
flights['arrival-est'] = pd.to_datetime(flights['arrival-est'], errors='ignore')
flights['arrival-real'] = pd.to_datetime(flights['arrival-real'], errors='ignore')



In [7]:

    
# translate the flight status from portuguese to english
flights['flight-status'] = flights[['flight-status']].apply(lambda row: row.str.replace("REALIZADO", "ACCOMPLISHED"), axis=1)
flights['flight-status'] = flights[['flight-status']].apply(lambda row: row.str.replace("CANCELADO", "CANCELED"), axis=1)



In [8]:

    
flights.head()









    Out[8]:







  
    
      
      airline
      flight-code
      auth-code
      line-code
      airport-A
      airport-B
      departure-est
      departure-real
      arrival-est
      arrival-real
      flight-status
      delay-code
    
  
  
    
      0
      AAL
      213
      0
      I
      KMIA
      SBBR
      2017-01-02 01:20:00
      2017-01-02 01:20:00
      2017-01-02 09:00:00
      2017-01-02 09:00:00
      ACCOMPLISHED
      NaN
    
    
      1
      AAL
      213
      0
      I
      KMIA
      SBBR
      2017-01-24 01:20:00
      2017-01-24 01:20:00
      2017-01-24 09:00:00
      2017-01-24 09:00:00
      ACCOMPLISHED
      NaN
    
    
      2
      AAL
      213
      0
      I
      KMIA
      SBBR
      2017-01-29 01:20:00
      2017-01-29 01:20:00
      2017-01-29 09:00:00
      2017-01-29 09:00:00
      ACCOMPLISHED
      NaN
    
    
      3
      AAL
      213
      0
      I
      KMIA
      SBBR
      2017-01-23 01:20:00
      2017-01-23 01:20:00
      2017-01-23 09:00:00
      2017-01-23 09:00:00
      ACCOMPLISHED
      NaN
    
    
      4
      AAL
      213
      0
      I
      KMIA
      SBBR
      2017-01-25 01:20:00
      2017-01-25 01:20:00
      2017-01-25 09:00:00
      2017-01-25 09:00:00
      ACCOMPLISHED
      NaN



In [9]:

    
flights.size









    Out[9]:





9851496



In [10]:

    
flights.to_csv("flights_csv.csv")

Some EDA's tasks



In [11]:

    
# See: https://stackoverflow.com/questions/37287938/sort-pandas-dataframe-by-value 
#
df_departures = flights.groupby(['airport-A']).size().reset_index(name='number_departures')



In [12]:

    
df_departures.sort_values(by=['number_departures'], ascending=False, inplace=True)



In [13]:

    
df_departures









    Out[13]:







  
    
      
      airport-A
      number_departures
    
  
  
    
      118
      SBGR
      97836
    
    
      168
      SBSP
      69381
    
    
      91
      SBBR
      49294
    
    
      133
      SBKP
      49279
    
    
      116
      SBGL
      45262
    
    
      96
      SBCF
      41174
    
    
      160
      SBRJ
      37799
    
    
      103
      SBCT
      27883
    
    
      148
      SBPA
      27682
    
    
      159
      SBRF
      25382
    
    
      170
      SBSV
      23431
    
    
      115
      SBFZ
      16851
    
    
      79
      SAEZ
      13857
    
    
      113
      SBFL
      12500
    
    
      110
      SBEG
      12494
    
    
      89
      SBBE
      12054
    
    
      117
      SBGO
      11727
    
    
      105
      SBCY
      11631
    
    
      185
      SBVT
      11186
    
    
      40
      KMIA
      8129
    
    
      162
      SBSG
      7671
    
    
      112
      SBFI
      7217
    
    
      189
      SCEL
      7143
    
    
      161
      SBRP
      7124
    
    
      164
      SBSL
      6380
    
    
      141
      SBMO
      6348
    
    
      65
      MPTO
      6032
    
    
      154
      SBPS
      5535
    
    
      145
      SBNF
      5443
    
    
      97
      SBCG
      5289
    
    
      ...
      ...
      ...
    
    
      220
      SVBI
      2
    
    
      232
      SWMW
      2
    
    
      82
      SAZN
      2
    
    
      187
      SCCF
      2
    
    
      42
      KPBI
      2
    
    
      53
      LMML
      2
    
    
      202
      SLCB
      2
    
    
      199
      SKBQ
      2
    
    
      143
      SBMS
      1
    
    
      20
      GCRR
      1
    
    
      241
      TTPP
      1
    
    
      29
      KBOS
      1
    
    
      43
      KVCV
      1
    
    
      46
      LEZL
      1
    
    
      4
      EBBR
      1
    
    
      24
      GQNN
      1
    
    
      233
      SWNK
      1
    
    
      21
      GCTS
      1
    
    
      126
      SBJD
      1
    
    
      209
      SNPD
      1
    
    
      48
      LFBT
      1
    
    
      7
      EDDS
      1
    
    
      224
      SWBR
      1
    
    
      14
      EHBK
      1
    
    
      147
      SBNT
      1
    
    
      8
      EDFH
      1
    
    
      208
      SNGI
      1
    
    
      219
      SVBC
      1
    
    
      9
      EDHI
      1
    
    
      69
      MYNN
      1
    
  

243 rows × 2 columns

2 - Local airports (list with all the ~600 brazilian public airports)

Source: https://goo.gl/mNFuPt (a XLS spreadsheet in portuguese; last access on nov, 15th, 2017)



In [14]:

    
# Airports dataset: all brazilian public airports (updated until october, 2017)
airports = pd.read_csv('data/brazilianPublicAirports-out2017.csv', sep = ';', dtype= str)



In [15]:

    
airports.head()









    Out[15]:







  
    
      
      airport
      airportname
      city
      state
      latitude
      longitude
      altitude
      operation
    
  
  
    
      0
      SBAA
      CONCEIÇÃO DO ARAGUAIA
      CONCEIÇÃO DO ARAGUAIA
      PA
      8° 20' 55'' S
      49° 18' 11'' W
      199 m
      VFR Diurno/Noturno e IFR Diurno
    
    
      1
      SBAE
      BAURU/AREALVA
      BAURU E AREALVA
      SP
      22° 9' 28'' S
      49° 4' 6'' W
      594 m
      VFR Diurno/Noturno e IFR Diurno/Noturno
    
    
      2
      SBAM
      AMAPÁ
      AMAPÁ
      AP
      2° 4' 22'' N
      50° 51' 45'' W
      14 m
      VFR Diurno e IFR Diurno
    
    
      3
      SBAQ
      BARTOLOMEU DE GUSMÃO
      ARARAQUARA
      SP
      21° 48' 16'' S
      48° 8' 25'' W
      711 m
      VFR Diurno/Noturno e IFR Diurno/Noturno
    
    
      4
      SBAR
      SANTA MARIA
      ARACAJU
      SE
      10° 59' 7'' S
      37° 4' 24'' W
      7 m
      VFR Diurno/Noturno e IFR Diurno/Noturno



In [16]:

    
# Merge "flights" dataset with "airports" in order to identify 
#   local flights (origin and destination are in Brazil)
flights = pd.merge(flights, airports, left_on="airport-A", right_on="airport", how='left')
flights = pd.merge(flights, airports, left_on="airport-B", right_on="airport", how='left')



In [17]:

    
flights.tail()









    Out[17]:







  
    
      
      airline
      flight-code
      auth-code
      line-code
      airport-A
      airport-B
      departure-est
      departure-real
      arrival-est
      arrival-real
      ...
      altitude_x
      operation_x
      airport_y
      airportname_y
      city_y
      state_y
      latitude_y
      longitude_y
      altitude_y
      operation_y
    
  
  
    
      820953
      UPS
      417
      0
      G
      SBKP
      SKBO
      2017-09-12 00:15:00
      NaT
      2017-09-12 05:57:00
      NaT
      ...
      661 m
      VFR Diurno/Noturno e IFR Diurno/Noturno
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      820954
      UPS
      417
      0
      G
      SBKP
      SKBO
      2017-09-19 00:15:00
      2017-09-19 00:15:00
      2017-09-19 05:57:00
      2017-09-19 05:57:00
      ...
      661 m
      VFR Diurno/Noturno e IFR Diurno/Noturno
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      820955
      UPS
      417
      0
      G
      SKBO
      KMIA
      2017-09-26 07:08:00
      2017-09-26 07:08:00
      2017-09-26 10:40:00
      2017-09-26 10:40:00
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      820956
      UPS
      417
      0
      G
      SBKP
      SKBO
      2017-09-05 00:15:00
      2017-09-05 00:15:00
      2017-09-05 05:57:00
      2017-09-05 05:57:00
      ...
      661 m
      VFR Diurno/Noturno e IFR Diurno/Noturno
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      820957
      UPS
      417
      0
      G
      SKBO
      KMIA
      2017-09-05 07:08:00
      2017-09-05 07:08:00
      2017-09-05 10:40:00
      2017-09-05 10:40:00
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
  

5 rows × 28 columns

3 - List of codes (two letters) used when there was a flight delay (departure)

I have found two lists that define two-letter codes used by the aircraft crew to justify the delay of the flights: a short and a long one.

Source: https://goo.gl/vUC8BX (last access: nov, 15th, 2017)



In [18]:

    
# ------------------------------------------------------------------
# List of codes (two letters) used to justify a delay on the flight
# - delayCodesShortlist.csv: list with YYY codes
# - delayCodesLongList.csv: list with XXX codes
# ------------------------------------------------------------------
delaycodes = pd.read_csv('data/delayCodesShortlist.csv', sep = ';', dtype = str)
delaycodesLongList = pd.read_csv('data/delayCodesLonglist.csv', sep = ';', dtype = str)



In [19]:

    
delaycodes.head()









    Out[19]:







  
    
      
      delaycode
      delayOrigin
      duetoweather
      autorizationdelay
      delaydescription
    
  
  
    
      0
      AA
      B
      0
      0
      ATRASO AEROPORTO DE ALTERNATIVA de ORDEM TÉCNICA
    
    
      1
      AF
      A
      0
      0
      FACILIDADES DO AEROPORTO - RESTRIÇÕES DE APOIO
    
    
      2
      AG
      A
      0
      0
      MIGRAÇÃO/ALFÂNDEGA/SAÚDE
    
    
      3
      AI
      A
      0
      0
      AEROPORTO DE ORIGEM INTERDITADO
    
    
      4
      AJ
      B
      0
      0
      AEROPORTO DE DESTINO INTERDITADO

4 - The Weather data from https://www.wunderground.com/history

From this website I captured a sample data from local airport (Campinas, SP, Brazil): January to September, 2017.

The website presents data like this (see https://goo.gl/oKwzyH):



In [20]:

    
# Weather sample: load the CSV with weather historical data (from Campinas, SP, Brazil, 2017)
weather = pd.read_csv('data/DataScience-Intensive-weatherAtCampinasAirport-2017-Campinas_Airport_2017Weather.csv', \
                      sep = ',', dtype = str)



In [21]:

    
weather["date"] = weather["year"].map(str) + "-" + weather["month"].map(str) + "-" + weather["day"].map(str) 
weather["date"] = pd.to_datetime(weather['date'],errors='ignore')



In [22]:

    
weather.head()









    Out[22]:







  
    
      
      year
      month
      day
      weather-event
      tempC-high
      tempC-avg
      tempC-low
      dewpoint-high
      dewpoint-avg
      dewpoint-low
      ...
      sealevelpress-avg
      sealevelpress-low
      visibilty-km-high
      visibilty-km-avg
      visibilty-km-low
      wind-km-h-high
      wind-km-h-avg
      precipitation-mm-high
      precipitation-mm-sum
      date
    
  
  
    
      0
      2017
      1
      1
      Rain , Thunderstorm
      31
      26
      21
      20
      14
      2
      ...
      1016
      1015
      10
      9
      2
      27
      10
      55
      0.00
      2017-01-01
    
    
      1
      2017
      1
      2
      Thunderstorm
      29
      24
      21
      21
      16
      9
      ...
      1015
      1013
      10
      10
      10
      23
      10
      -
      0.00
      2017-01-02
    
    
      2
      2017
      1
      3
      Rain , Thunderstorm
      34
      20
      7
      18
      12
      -1
      ...
      1014
      1011
      10
      9
      2
      39
      19
      34
      0.00
      2017-01-03
    
    
      3
      2017
      1
      4
      Rain , Thunderstorm
      30
      26
      21
      19
      14
      7
      ...
      1015
      1013
      10
      10
      10
      24
      11
      -
      0.00
      2017-01-04
    
    
      4
      2017
      1
      5
      Rain
      30
      26
      22
      21
      14
      7
      ...
      1017
      1015
      10
      10
      7
      39
      8
      61
      0.00
      2017-01-05
    
  

5 rows × 24 columns



In [ ]:

	airline	flight-code	line-code	airport-A	airport-B	departure-est	departure-real	arrival-est	arrival-real	flight-status	delay-code
0	AAL	213	I	KMIA	SBBR	2017-01-02 01:20:00	2017-01-02 01:20:00	2017-01-02 09:00:00	2017-01-02 09:00:00	ACCOMPLISHED	NaN
1	AAL	213	I	KMIA	SBBR	2017-01-24 01:20:00	2017-01-24 01:20:00	2017-01-24 09:00:00	2017-01-24 09:00:00	ACCOMPLISHED	NaN
2	AAL	213	I	KMIA	SBBR	2017-01-29 01:20:00	2017-01-29 01:20:00	2017-01-29 09:00:00	2017-01-29 09:00:00	ACCOMPLISHED	NaN
3	AAL	213	I	KMIA	SBBR	2017-01-23 01:20:00	2017-01-23 01:20:00	2017-01-23 09:00:00	2017-01-23 09:00:00	ACCOMPLISHED	NaN
4	AAL	213	I	KMIA	SBBR	2017-01-25 01:20:00	2017-01-25 01:20:00	2017-01-25 09:00:00	2017-01-25 09:00:00	ACCOMPLISHED	NaN

	airport-A	number_departures
118	SBGR	97836
168	SBSP	69381
91	SBBR	49294
133	SBKP	49279
116	SBGL	45262
96	SBCF	41174
160	SBRJ	37799
103	SBCT	27883
148	SBPA	27682
159	SBRF	25382
170	SBSV	23431
115	SBFZ	16851
79	SAEZ	13857
113	SBFL	12500
110	SBEG	12494
89	SBBE	12054
117	SBGO	11727
105	SBCY	11631
185	SBVT	11186
40	KMIA	8129
162	SBSG	7671
112	SBFI	7217
189	SCEL	7143
161	SBRP	7124
164	SBSL	6380
141	SBMO	6348
65	MPTO	6032
154	SBPS	5535
145	SBNF	5443
97	SBCG	5289
...	...	...
220	SVBI	2
232	SWMW	2
82	SAZN	2
187	SCCF	2
42	KPBI	2
53	LMML	2
202	SLCB	2
199	SKBQ	2
143	SBMS	1
20	GCRR	1
241	TTPP	1
29	KBOS	1
43	KVCV	1
46	LEZL	1
4	EBBR	1
24	GQNN	1
233	SWNK	1
21	GCTS	1
126	SBJD	1
209	SNPD	1
48	LFBT	1
7	EDDS	1
224	SWBR	1
14	EHBK	1
147	SBNT	1
8	EDFH	1
208	SNGI	1
219	SVBC	1
9	EDHI	1
69	MYNN	1

	airport	airportname	city	state	latitude	longitude	altitude	operation
0	SBAA	CONCEIÇÃO DO ARAGUAIA	CONCEIÇÃO DO ARAGUAIA	PA	8° 20' 55'' S	49° 18' 11'' W	199 m	VFR Diurno/Noturno e IFR Diurno
1	SBAE	BAURU/AREALVA	BAURU E AREALVA	SP	22° 9' 28'' S	49° 4' 6'' W	594 m	VFR Diurno/Noturno e IFR Diurno/Noturno
2	SBAM	AMAPÁ	AMAPÁ	AP	2° 4' 22'' N	50° 51' 45'' W	14 m	VFR Diurno e IFR Diurno
3	SBAQ	BARTOLOMEU DE GUSMÃO	ARARAQUARA	SP	21° 48' 16'' S	48° 8' 25'' W	711 m	VFR Diurno/Noturno e IFR Diurno/Noturno
4	SBAR	SANTA MARIA	ARACAJU	SE	10° 59' 7'' S	37° 4' 24'' W	7 m	VFR Diurno/Noturno e IFR Diurno/Noturno

	airline	flight-code	line-code	airport-A	airport-B	departure-est	departure-real	arrival-est	arrival-real	...	altitude_x	operation_x	airport_y	airportname_y	city_y	state_y	latitude_y	longitude_y	altitude_y	operation_y
820953	UPS	417	G	SBKP	SKBO	2017-09-12 00:15:00	NaT	2017-09-12 05:57:00	NaT	...	661 m	VFR Diurno/Noturno e IFR Diurno/Noturno	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
820954	UPS	417	G	SBKP	SKBO	2017-09-19 00:15:00	2017-09-19 00:15:00	2017-09-19 05:57:00	2017-09-19 05:57:00	...	661 m	VFR Diurno/Noturno e IFR Diurno/Noturno	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
820955	UPS	417	G	SKBO	KMIA	2017-09-26 07:08:00	2017-09-26 07:08:00	2017-09-26 10:40:00	2017-09-26 10:40:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
820956	UPS	417	G	SBKP	SKBO	2017-09-05 00:15:00	2017-09-05 00:15:00	2017-09-05 05:57:00	2017-09-05 05:57:00	...	661 m	VFR Diurno/Noturno e IFR Diurno/Noturno	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
820957	UPS	417	G	SKBO	KMIA	2017-09-05 07:08:00	2017-09-05 07:08:00	2017-09-05 10:40:00	2017-09-05 10:40:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

	delaycode	delayOrigin	delaydescription
0	AA	B	ATRASO AEROPORTO DE ALTERNATIVA de ORDEM TÉCNICA
1	AF	A	FACILIDADES DO AEROPORTO - RESTRIÇÕES DE APOIO
2	AG	A	MIGRAÇÃO/ALFÂNDEGA/SAÚDE
3	AI	A	AEROPORTO DE ORIGEM INTERDITADO
4	AJ	B	AEROPORTO DE DESTINO INTERDITADO

	year	month	day	weather-event	tempC-high	tempC-avg	tempC-low	dewpoint-high	dewpoint-avg	dewpoint-low	...	sealevelpress-avg	sealevelpress-low	visibilty-km-high	visibilty-km-avg	visibilty-km-low	wind-km-h-high	wind-km-h-avg	precipitation-mm-high	date
0	2017	1	1	Rain , Thunderstorm	31	26	21	20	14	2	...	1016	1015	10	9	2	27	10	55	2017-01-01
1	2017	1	2	Thunderstorm	29	24	21	21	16	9	...	1015	1013	10	10	10	23	10	-	2017-01-02
2	2017	1	3	Rain , Thunderstorm	34	20	7	18	12	-1	...	1014	1011	10	9	2	39	19	34	2017-01-03
3	2017	1	4	Rain , Thunderstorm	30	26	21	19	14	7	...	1015	1013	10	10	10	24	11	-	2017-01-04
4	2017	1	5	Rain	30	26	22	21	14	7	...	1017	1015	10	10	7	39	8	61	2017-01-05