Capstone Project - Flight Delays

Does weather events have impact the delay of flights (Brazil)?

It is important to see this notebook with the step-by-step of the dataset cleaning process:

https://github.com/davicsilva/dsintensive/blob/master/notebooks/flightDelayPrepData_v2.ipynb


In [1]:
from datetime import datetime

# Pandas and NumPy
import pandas as pd
import numpy as np

# Matplotlib for additional customization
from matplotlib import pyplot as plt
%matplotlib inline

# Seaborn for plotting and styling
import seaborn as sns

In [2]:
# 1. Flight delay: any flight with (real_departure - planned_departure >= 15 minutes)
# 2. The Brazilian Federal Agency for Civil Aviation (ANAC) does not define exactly what is a "flight delay" (in minutes)
# 3. Anyway, the ANAC has a resolution for this subject: https://goo.gl/YBwbMy (last access: nov, 15th, 2017)
# ---
# DELAY, for this analysis, is defined as greater than 15 minutes (local flights only)
DELAY = 15

1 - Local flights dataset. For now, only flights from January to September, 2017

A note about date columns on this dataset

  • In the original dataset (CSV file from ANAC), the date was not in ISO8601 format (e.g. '2017-10-31 09:03:00')
  • To fix this I used regex (regular expression) to transform this column directly on CSV file
  • The original date was "31/10/2017 09:03" (october, 31, 2017 09:03)

In [3]:
#[flights] dataset_01 => all "Active Regular Flights" from 2017, from january to september
#source: http://www.anac.gov.br/assuntos/dados-e-estatisticas/historico-de-voos
#Last access this website: nov, 14th, 2017
flights = pd.read_csv('data/arf2017ISO.csv', sep = ';', dtype = str)

In [4]:
flights['departure-est'] = flights[['departure-est']].apply(lambda row: row.str.replace("(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4}) (?P<HOUR>\d{2}):(?P<MIN>\d{2})", "\g<year>/\g<month>/\g<day> \g<HOUR>:\g<MIN>:00"), axis=1)

In [5]:
flights['departure-real'] = flights[['departure-real']].apply(lambda row: row.str.replace("(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4}) (?P<HOUR>\d{2}):(?P<MIN>\d{2})", "\g<year>/\g<month>/\g<day> \g<HOUR>:\g<MIN>:00"), axis=1)
flights['arrival-est'] = flights[['arrival-est']].apply(lambda row: row.str.replace("(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4}) (?P<HOUR>\d{2}):(?P<MIN>\d{2})", "\g<year>/\g<month>/\g<day> \g<HOUR>:\g<MIN>:00"), axis=1)
flights['arrival-real'] = flights[['arrival-real']].apply(lambda row: row.str.replace("(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4}) (?P<HOUR>\d{2}):(?P<MIN>\d{2})", "\g<year>/\g<month>/\g<day> \g<HOUR>:\g<MIN>:00"), axis=1)

In [6]:
# Departure and Arrival columns: from 'object' to 'date' format
flights['departure-est'] = pd.to_datetime(flights['departure-est'], errors='ignore')
flights['departure-real'] = pd.to_datetime(flights['departure-real'], errors='ignore')
flights['arrival-est'] = pd.to_datetime(flights['arrival-est'], errors='ignore')
flights['arrival-real'] = pd.to_datetime(flights['arrival-real'], errors='ignore')

In [7]:
# translate the flight status from portuguese to english
flights['flight-status'] = flights[['flight-status']].apply(lambda row: row.str.replace("REALIZADO", "ACCOMPLISHED"), axis=1)
flights['flight-status'] = flights[['flight-status']].apply(lambda row: row.str.replace("CANCELADO", "CANCELED"), axis=1)

In [8]:
flights.head()


Out[8]:
airline flight-code auth-code line-code airport-A airport-B departure-est departure-real arrival-est arrival-real flight-status delay-code
0 AAL 213 0 I KMIA SBBR 2017-01-02 01:20:00 2017-01-02 01:20:00 2017-01-02 09:00:00 2017-01-02 09:00:00 ACCOMPLISHED NaN
1 AAL 213 0 I KMIA SBBR 2017-01-24 01:20:00 2017-01-24 01:20:00 2017-01-24 09:00:00 2017-01-24 09:00:00 ACCOMPLISHED NaN
2 AAL 213 0 I KMIA SBBR 2017-01-29 01:20:00 2017-01-29 01:20:00 2017-01-29 09:00:00 2017-01-29 09:00:00 ACCOMPLISHED NaN
3 AAL 213 0 I KMIA SBBR 2017-01-23 01:20:00 2017-01-23 01:20:00 2017-01-23 09:00:00 2017-01-23 09:00:00 ACCOMPLISHED NaN
4 AAL 213 0 I KMIA SBBR 2017-01-25 01:20:00 2017-01-25 01:20:00 2017-01-25 09:00:00 2017-01-25 09:00:00 ACCOMPLISHED NaN

In [9]:
flights.size


Out[9]:
9851496

In [10]:
flights.to_csv("flights_csv.csv")

Some EDA's tasks


In [11]:
# See: https://stackoverflow.com/questions/37287938/sort-pandas-dataframe-by-value 
#
df_departures = flights.groupby(['airport-A']).size().reset_index(name='number_departures')

In [12]:
df_departures.sort_values(by=['number_departures'], ascending=False, inplace=True)

In [13]:
df_departures


Out[13]:
airport-A number_departures
118 SBGR 97836
168 SBSP 69381
91 SBBR 49294
133 SBKP 49279
116 SBGL 45262
96 SBCF 41174
160 SBRJ 37799
103 SBCT 27883
148 SBPA 27682
159 SBRF 25382
170 SBSV 23431
115 SBFZ 16851
79 SAEZ 13857
113 SBFL 12500
110 SBEG 12494
89 SBBE 12054
117 SBGO 11727
105 SBCY 11631
185 SBVT 11186
40 KMIA 8129
162 SBSG 7671
112 SBFI 7217
189 SCEL 7143
161 SBRP 7124
164 SBSL 6380
141 SBMO 6348
65 MPTO 6032
154 SBPS 5535
145 SBNF 5443
97 SBCG 5289
... ... ...
220 SVBI 2
232 SWMW 2
82 SAZN 2
187 SCCF 2
42 KPBI 2
53 LMML 2
202 SLCB 2
199 SKBQ 2
143 SBMS 1
20 GCRR 1
241 TTPP 1
29 KBOS 1
43 KVCV 1
46 LEZL 1
4 EBBR 1
24 GQNN 1
233 SWNK 1
21 GCTS 1
126 SBJD 1
209 SNPD 1
48 LFBT 1
7 EDDS 1
224 SWBR 1
14 EHBK 1
147 SBNT 1
8 EDFH 1
208 SNGI 1
219 SVBC 1
9 EDHI 1
69 MYNN 1

243 rows × 2 columns

2 - Local airports (list with all the ~600 brazilian public airports)

Source: https://goo.gl/mNFuPt (a XLS spreadsheet in portuguese; last access on nov, 15th, 2017)


In [14]:
# Airports dataset: all brazilian public airports (updated until october, 2017)
airports = pd.read_csv('data/brazilianPublicAirports-out2017.csv', sep = ';', dtype= str)

In [15]:
airports.head()


Out[15]:
airport airportname city state latitude longitude altitude operation
0 SBAA CONCEIÇÃO DO ARAGUAIA CONCEIÇÃO DO ARAGUAIA PA 8° 20' 55'' S 49° 18' 11'' W 199 m VFR Diurno/Noturno e IFR Diurno
1 SBAE BAURU/AREALVA BAURU E AREALVA SP 22° 9' 28'' S 49° 4' 6'' W 594 m VFR Diurno/Noturno e IFR Diurno/Noturno
2 SBAM AMAPÁ AMAPÁ AP 2° 4' 22'' N 50° 51' 45'' W 14 m VFR Diurno e IFR Diurno
3 SBAQ BARTOLOMEU DE GUSMÃO ARARAQUARA SP 21° 48' 16'' S 48° 8' 25'' W 711 m VFR Diurno/Noturno e IFR Diurno/Noturno
4 SBAR SANTA MARIA ARACAJU SE 10° 59' 7'' S 37° 4' 24'' W 7 m VFR Diurno/Noturno e IFR Diurno/Noturno

In [16]:
# Merge "flights" dataset with "airports" in order to identify 
#   local flights (origin and destination are in Brazil)
flights = pd.merge(flights, airports, left_on="airport-A", right_on="airport", how='left')
flights = pd.merge(flights, airports, left_on="airport-B", right_on="airport", how='left')

In [17]:
flights.tail()


Out[17]:
airline flight-code auth-code line-code airport-A airport-B departure-est departure-real arrival-est arrival-real ... altitude_x operation_x airport_y airportname_y city_y state_y latitude_y longitude_y altitude_y operation_y
820953 UPS 417 0 G SBKP SKBO 2017-09-12 00:15:00 NaT 2017-09-12 05:57:00 NaT ... 661 m VFR Diurno/Noturno e IFR Diurno/Noturno NaN NaN NaN NaN NaN NaN NaN NaN
820954 UPS 417 0 G SBKP SKBO 2017-09-19 00:15:00 2017-09-19 00:15:00 2017-09-19 05:57:00 2017-09-19 05:57:00 ... 661 m VFR Diurno/Noturno e IFR Diurno/Noturno NaN NaN NaN NaN NaN NaN NaN NaN
820955 UPS 417 0 G SKBO KMIA 2017-09-26 07:08:00 2017-09-26 07:08:00 2017-09-26 10:40:00 2017-09-26 10:40:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
820956 UPS 417 0 G SBKP SKBO 2017-09-05 00:15:00 2017-09-05 00:15:00 2017-09-05 05:57:00 2017-09-05 05:57:00 ... 661 m VFR Diurno/Noturno e IFR Diurno/Noturno NaN NaN NaN NaN NaN NaN NaN NaN
820957 UPS 417 0 G SKBO KMIA 2017-09-05 07:08:00 2017-09-05 07:08:00 2017-09-05 10:40:00 2017-09-05 10:40:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 28 columns

3 - List of codes (two letters) used when there was a flight delay (departure)

I have found two lists that define two-letter codes used by the aircraft crew to justify the delay of the flights: a short and a long one.

Source: https://goo.gl/vUC8BX (last access: nov, 15th, 2017)


In [18]:
# ------------------------------------------------------------------
# List of codes (two letters) used to justify a delay on the flight
# - delayCodesShortlist.csv: list with YYY codes
# - delayCodesLongList.csv: list with XXX codes
# ------------------------------------------------------------------
delaycodes = pd.read_csv('data/delayCodesShortlist.csv', sep = ';', dtype = str)
delaycodesLongList = pd.read_csv('data/delayCodesLonglist.csv', sep = ';', dtype = str)

In [19]:
delaycodes.head()


Out[19]:
delaycode delayOrigin duetoweather autorizationdelay delaydescription
0 AA B 0 0 ATRASO AEROPORTO DE ALTERNATIVA de ORDEM TÉCNICA
1 AF A 0 0 FACILIDADES DO AEROPORTO - RESTRIÇÕES DE APOIO
2 AG A 0 0 MIGRAÇÃO/ALFÂNDEGA/SAÚDE
3 AI A 0 0 AEROPORTO DE ORIGEM INTERDITADO
4 AJ B 0 0 AEROPORTO DE DESTINO INTERDITADO

4 - The Weather data from https://www.wunderground.com/history

From this website I captured a sample data from local airport (Campinas, SP, Brazil): January to September, 2017.

The website presents data like this (see https://goo.gl/oKwzyH):


In [20]:
# Weather sample: load the CSV with weather historical data (from Campinas, SP, Brazil, 2017)
weather = pd.read_csv('data/DataScience-Intensive-weatherAtCampinasAirport-2017-Campinas_Airport_2017Weather.csv', \
                      sep = ',', dtype = str)

In [21]:
weather["date"] = weather["year"].map(str) + "-" + weather["month"].map(str) + "-" + weather["day"].map(str) 
weather["date"] = pd.to_datetime(weather['date'],errors='ignore')

In [22]:
weather.head()


Out[22]:
year month day weather-event tempC-high tempC-avg tempC-low dewpoint-high dewpoint-avg dewpoint-low ... sealevelpress-avg sealevelpress-low visibilty-km-high visibilty-km-avg visibilty-km-low wind-km-h-high wind-km-h-avg precipitation-mm-high precipitation-mm-sum date
0 2017 1 1 Rain , Thunderstorm 31 26 21 20 14 2 ... 1016 1015 10 9 2 27 10 55 0.00 2017-01-01
1 2017 1 2 Thunderstorm 29 24 21 21 16 9 ... 1015 1013 10 10 10 23 10 - 0.00 2017-01-02
2 2017 1 3 Rain , Thunderstorm 34 20 7 18 12 -1 ... 1014 1011 10 9 2 39 19 34 0.00 2017-01-03
3 2017 1 4 Rain , Thunderstorm 30 26 21 19 14 7 ... 1015 1013 10 10 10 24 11 - 0.00 2017-01-04
4 2017 1 5 Rain 30 26 22 21 14 7 ... 1017 1015 10 10 7 39 8 61 0.00 2017-01-05

5 rows × 24 columns


In [ ]: