Capstone Project - Flight Delays

Does weather events have impact the delay of flights (Brazil)?

It is important to see this notebook with the step-by-step of the dataset cleaning process:

https://github.com/davicsilva/dsintensive/blob/master/notebooks/flightDelayPrepData_v2.ipynb



In [1]:

    
from datetime import datetime

# Pandas and NumPy
import pandas as pd
import numpy as np

# Matplotlib for additional customization
from matplotlib import pyplot as plt
%matplotlib inline

# Seaborn for plotting and styling
import seaborn as sns



In [2]:

    
# 1. Flight delay: any flight with (real_departure - planned_departure >= 15 minutes)
# 2. The Brazilian Federal Agency for Civil Aviation (ANAC) does not define exactly what is a "flight delay" (in minutes)
# 3. Anyway, the ANAC has a resolution for this subject: https://goo.gl/YBwbMy (last access: nov, 15th, 2017)
# ---
# DELAY, for this analysis, is defined as greater than 15 minutes (local flights only)
DELAY = 15

1 - Dataset: sample of local flights dataset (700 records)



In [3]:

    
# Dataset: a sample with 700 records with flights that occurred in the beginning of 2017 (January to June)
flights = pd.read_csv('data/arf2017ISO-sample.csv', sep = ';', dtype = str)



In [4]:

    
# Using "lambda" to format "date" columns to ISO format: YYYY-MM-DD hh:mm:ss
flights['departure-est'] = flights[['departure-est']].apply(lambda row: row.str.replace("(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4}) (?P<HOUR>\d{2}):(?P<MIN>\d{2})", "\g<year>/\g<month>/\g<day> \g<HOUR>:\g<MIN>:00"), axis=1)
flights['departure-real'] = flights[['departure-real']].apply(lambda row: row.str.replace("(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4}) (?P<HOUR>\d{2}):(?P<MIN>\d{2})", "\g<year>/\g<month>/\g<day> \g<HOUR>:\g<MIN>:00"), axis=1)
flights['arrival-est'] = flights[['arrival-est']].apply(lambda row: row.str.replace("(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4}) (?P<HOUR>\d{2}):(?P<MIN>\d{2})", "\g<year>/\g<month>/\g<day> \g<HOUR>:\g<MIN>:00"), axis=1)
flights['arrival-real'] = flights[['arrival-real']].apply(lambda row: row.str.replace("(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4}) (?P<HOUR>\d{2}):(?P<MIN>\d{2})", "\g<year>/\g<month>/\g<day> \g<HOUR>:\g<MIN>:00"), axis=1)



In [5]:

    
# Departure and Arrival columns: from 'object' to 'date' format
flights['departure-est'] = pd.to_datetime(flights['departure-est'],errors='ignore')
flights['departure-real'] = pd.to_datetime(flights['departure-real'],errors='ignore')
flights['arrival-est'] = pd.to_datetime(flights['arrival-est'],errors='ignore')
flights['arrival-real'] = pd.to_datetime(flights['arrival-real'],errors='ignore')



In [6]:

    
# translate the flight status from portuguese to english
flights['flight-status'] = flights[['flight-status']].apply(lambda row: row.str.replace("REALIZADO", "ACCOMPLISHED"), axis=1)
flights['flight-status'] = flights[['flight-status']].apply(lambda row: row.str.replace("CANCELADO", "CANCELED"), axis=1)



In [7]:

    
flights[["departure-est", "departure-real", "arrival-est", "arrival-real", "flight-status"]].head(3)









    Out[7]:







  
    
      
      departure-est
      departure-real
      arrival-est
      arrival-real
      flight-status
    
  
  
    
      0
      2017-01-02 01:20:00
      2017-01-02 01:20:00
      2017-01-02 09:00:00
      2017-01-02 09:00:00
      ACCOMPLISHED
    
    
      1
      2017-01-24 01:20:00
      2017-01-24 01:20:00
      2017-01-24 09:00:00
      2017-01-24 09:00:00
      ACCOMPLISHED
    
    
      2
      2017-01-29 01:20:00
      2017-01-29 01:20:00
      2017-01-29 09:00:00
      2017-01-29 09:00:00
      ACCOMPLISHED

Some EDA's tasks

Flights/airports



In [8]:

    
# See: https://stackoverflow.com/questions/37287938/sort-pandas-dataframe-by-value 
# or   https://stackoverflow.com/questions/17618981/how-to-sort-pandas-data-frame-using-values-from-several-columns
df_departures = flights.groupby(['airport-A']).size().reset_index(name='number_departures')



In [9]:

    
df_departures.sort_values(by=['number_departures'], ascending=False, inplace=True)



In [10]:

    
df_departures.head()









    Out[10]:







  
    
      
      airport-A
      number_departures
    
  
  
    
      38
      SBGR
      104
    
    
      13
      KMIA
      42
    
    
      25
      SAEZ
      41
    
    
      54
      SBVT
      41
    
    
      42
      SBKP
      36



In [11]:

    
with sns.axes_style('white'):
    g = sns.factorplot("airport-A", data=flights, aspect=2,
                       kind="count", color='steelblue')
    g.set_xticklabels(step=4)



In [12]:

    
g = sns.factorplot(x="airport-A", y="number_departures", data=df_departures, size=4, aspect=4)



In [13]:

    
# Get the hour of 'real depature' datetime
# See .dt acessor (http://pandas.pydata.org/pandas-docs/stable/basics.html#basics-dt-accessors)
flights['hour-departure-est'] = flights['departure-est'].dt.hour
flights['hour-departure-real'] = flights['departure-real'].dt.hour
flights['hour-arrival-est'] = flights['arrival-est'].dt.hour
flights['hour-arrival-real'] = flights['arrival-real'].dt.hour



In [14]:

    
flights.groupby(['airport-A', 'hour-departure-real']).size().groupby(level=1).max()









    Out[14]:





hour-departure-real
0.0     10
1.0     10
2.0     17
3.0      6
4.0      2
5.0      1
6.0     12
7.0      5
8.0      4
9.0     17
10.0     7
11.0     9
12.0     4
13.0     9
14.0     6
15.0     5
16.0     4
17.0    17
18.0     9
19.0    20
20.0    11
21.0     1
22.0    12
23.0    13
dtype: int64



In [15]:

    
df_busy_hours = flights.groupby(['hour-departure-real']).size().reset_index(name='flight_by_hour')



In [16]:

    
df_busy_hours.sort_values(by=['flight_by_hour'], ascending=False, inplace=True)



In [17]:

    
df_busy_hours.head()









    Out[17]:







  
    
      
      hour-departure-real
      flight_by_hour
    
  
  
    
      19
      19.0
      76
    
    
      18
      18.0
      48
    
    
      17
      17.0
      46
    
    
      2
      2.0
      34
    
    
      9
      9.0
      33



In [18]:

    
g = sns.factorplot(x="hour-departure-real", y="flight_by_hour", data=df_busy_hours, size=6, aspect=1.8)



In [19]:

    
with sns.axes_style('white'):
    g = sns.factorplot("hour-departure-real", data=flights, size=5, aspect=3,
                       kind="count", color='steelblue')
    g.set_xticklabels(step=1)



In [20]:

    
(flights['departure-real'] - flights['departure-est']).dt.components.head()









    Out[20]:







  
    
      
      days
      hours
      minutes
      seconds
      milliseconds
      microseconds
      nanoseconds
    
  
  
    
      0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
    
    
      1
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
    
    
      2
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
    
    
      3
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
    
    
      4
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0



In [21]:

    
# Get the delta time (real - estimated) for depature and arrival times
# More about 'DateTime' at https://docs.scipy.org/doc/numpy-dev/reference/arrays.datetime.html#datetime-and-timedelta-arithmetic 
flights['delay-departure']=(flights['departure-real'] - flights['departure-est'])
flights['delay-arrival']=(flights['arrival-real'] - flights['arrival-est'])

# Delay (departure and arrival) in minutes (using 'numpy.timedelta64')
# See this link: https://stackoverflow.com/questions/14920903/time-difference-in-seconds-from-numpy-timedelta64
flights['delay-departure']=np.divide(flights[['delay-departure']], np.timedelta64(1, 'm'))
flights['delay-arrival']=np.divide(flights[['delay-arrival']], np.timedelta64(1, 'm'))



In [22]:

    
sns.kdeplot(flights['delay-departure'] > 0, label='departures(delay)', shade=True)
plt.xlabel('delay (in minutes)');

2 - Local airports (list with all the ~600 brazilian public airports)

Source: https://goo.gl/mNFuPt (a XLS spreadsheet in portuguese; last access on nov, 15th, 2017)



In [23]:

    
# Airports dataset: all brazilian public airports (updated until october, 2017)
airports = pd.read_csv('data/brazilianPublicAirports-out2017.csv', sep = ';', dtype= str)



In [24]:

    
airports.head()









    Out[24]:







  
    
      
      airport
      airportname
      city
      state
      latitude
      longitude
      altitude
      operation
    
  
  
    
      0
      SBAA
      CONCEIÇÃO DO ARAGUAIA
      CONCEIÇÃO DO ARAGUAIA
      PA
      8° 20' 55'' S
      49° 18' 11'' W
      199 m
      VFR Diurno/Noturno e IFR Diurno
    
    
      1
      SBAE
      BAURU/AREALVA
      BAURU E AREALVA
      SP
      22° 9' 28'' S
      49° 4' 6'' W
      594 m
      VFR Diurno/Noturno e IFR Diurno/Noturno
    
    
      2
      SBAM
      AMAPÁ
      AMAPÁ
      AP
      2° 4' 22'' N
      50° 51' 45'' W
      14 m
      VFR Diurno e IFR Diurno
    
    
      3
      SBAQ
      BARTOLOMEU DE GUSMÃO
      ARARAQUARA
      SP
      21° 48' 16'' S
      48° 8' 25'' W
      711 m
      VFR Diurno/Noturno e IFR Diurno/Noturno
    
    
      4
      SBAR
      SANTA MARIA
      ARACAJU
      SE
      10° 59' 7'' S
      37° 4' 24'' W
      7 m
      VFR Diurno/Noturno e IFR Diurno/Noturno



In [25]:

    
# Merge "flights" dataset with "airports" in order to identify 
#   local flights (origin and destination are in Brazil)
flights = pd.merge(flights, airports, left_on="airport-A", right_on="airport", how='left')
flights = pd.merge(flights, airports, left_on="airport-B", right_on="airport", how='left')



In [26]:

    
with sns.axes_style('white'):
    g = sns.factorplot("airport-A", data=flights, aspect=2,
                       kind="count", color='steelblue')
    g.set_xticklabels(step=5)

3 - List of codes (two letters) used when there was a flight delay (departure)

I have found two lists that define two-letter codes used by the aircraft crew to justify the delay of the flights: a short and a long one.

Source: https://goo.gl/vUC8BX (last access: nov, 15th, 2017)



In [27]:

    
# ------------------------------------------------------------------
# List of codes (two letters) used to justify a delay on the flight
# - delayCodesShortlist.csv: list with YYY codes
# - delayCodesLongList.csv: list with XXX codes
# ------------------------------------------------------------------
delaycodes = pd.read_csv('data/delayCodesShortlist.csv', sep = ';', dtype = str)
delaycodesLongList = pd.read_csv('data/delayCodesLonglist.csv', sep = ';', dtype = str)



In [28]:

    
delaycodes.head()









    Out[28]:







  
    
      
      delaycode
      delayOrigin
      duetoweather
      autorizationdelay
      delaydescription
    
  
  
    
      0
      AA
      B
      0
      0
      ATRASO AEROPORTO DE ALTERNATIVA de ORDEM TÉCNICA
    
    
      1
      AF
      A
      0
      0
      FACILIDADES DO AEROPORTO - RESTRIÇÕES DE APOIO
    
    
      2
      AG
      A
      0
      0
      MIGRAÇÃO/ALFÂNDEGA/SAÚDE
    
    
      3
      AI
      A
      0
      0
      AEROPORTO DE ORIGEM INTERDITADO
    
    
      4
      AJ
      B
      0
      0
      AEROPORTO DE DESTINO INTERDITADO

4 - The Weather data from https://www.wunderground.com/history

From this website I captured a sample data from local airport (Campinas, SP, Brazil): January to September, 2017.

The website presents data like this (see https://goo.gl/oKwzyH):



In [29]:

    
# Weather sample: load the CSV with weather historical data (from Campinas, SP, Brazil, 2017)
weather = pd.read_csv('data/DataScience-Intensive-weatherAtCampinasAirport-2017-Campinas_Airport_2017Weather.csv', \
                      sep = ',', dtype = str)



In [30]:

    
weather["date"] = weather["year"].map(str) + "-" + weather["month"].map(str) + "-" + weather["day"].map(str) 
weather["date"] = pd.to_datetime(weather['date'],errors='ignore')



In [31]:

    
weather.head()









    Out[31]:







  
    
      
      year
      month
      day
      weather-event
      tempC-high
      tempC-avg
      tempC-low
      dewpoint-high
      dewpoint-avg
      dewpoint-low
      ...
      sealevelpress-avg
      sealevelpress-low
      visibilty-km-high
      visibilty-km-avg
      visibilty-km-low
      wind-km-h-high
      wind-km-h-avg
      precipitation-mm-high
      precipitation-mm-sum
      date
    
  
  
    
      0
      2017
      1
      1
      Rain , Thunderstorm
      31
      26
      21
      20
      14
      2
      ...
      1016
      1015
      10
      9
      2
      27
      10
      55
      0.00
      2017-01-01
    
    
      1
      2017
      1
      2
      Thunderstorm
      29
      24
      21
      21
      16
      9
      ...
      1015
      1013
      10
      10
      10
      23
      10
      -
      0.00
      2017-01-02
    
    
      2
      2017
      1
      3
      Rain , Thunderstorm
      34
      20
      7
      18
      12
      -1
      ...
      1014
      1011
      10
      9
      2
      39
      19
      34
      0.00
      2017-01-03
    
    
      3
      2017
      1
      4
      Rain , Thunderstorm
      30
      26
      21
      19
      14
      7
      ...
      1015
      1013
      10
      10
      10
      24
      11
      -
      0.00
      2017-01-04
    
    
      4
      2017
      1
      5
      Rain
      30
      26
      22
      21
      14
      7
      ...
      1017
      1015
      10
      10
      7
      39
      8
      61
      0.00
      2017-01-05
    
  

5 rows × 24 columns



In [ ]:

	departure-est	departure-real	arrival-est	arrival-real	flight-status
0	2017-01-02 01:20:00	2017-01-02 01:20:00	2017-01-02 09:00:00	2017-01-02 09:00:00	ACCOMPLISHED
1	2017-01-24 01:20:00	2017-01-24 01:20:00	2017-01-24 09:00:00	2017-01-24 09:00:00	ACCOMPLISHED
2	2017-01-29 01:20:00	2017-01-29 01:20:00	2017-01-29 09:00:00	2017-01-29 09:00:00	ACCOMPLISHED

	days	hours	minutes	seconds	milliseconds	microseconds	nanoseconds
0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
1	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2	0.0	0.0	0.0	0.0	0.0	0.0	0.0
3	0.0	0.0	0.0	0.0	0.0	0.0	0.0
4	0.0	0.0	0.0	0.0	0.0	0.0	0.0

	airport	airportname	city	state	latitude	longitude	altitude	operation
0	SBAA	CONCEIÇÃO DO ARAGUAIA	CONCEIÇÃO DO ARAGUAIA	PA	8° 20' 55'' S	49° 18' 11'' W	199 m	VFR Diurno/Noturno e IFR Diurno
1	SBAE	BAURU/AREALVA	BAURU E AREALVA	SP	22° 9' 28'' S	49° 4' 6'' W	594 m	VFR Diurno/Noturno e IFR Diurno/Noturno
2	SBAM	AMAPÁ	AMAPÁ	AP	2° 4' 22'' N	50° 51' 45'' W	14 m	VFR Diurno e IFR Diurno
3	SBAQ	BARTOLOMEU DE GUSMÃO	ARARAQUARA	SP	21° 48' 16'' S	48° 8' 25'' W	711 m	VFR Diurno/Noturno e IFR Diurno/Noturno
4	SBAR	SANTA MARIA	ARACAJU	SE	10° 59' 7'' S	37° 4' 24'' W	7 m	VFR Diurno/Noturno e IFR Diurno/Noturno

	delaycode	delayOrigin	delaydescription
0	AA	B	ATRASO AEROPORTO DE ALTERNATIVA de ORDEM TÉCNICA
1	AF	A	FACILIDADES DO AEROPORTO - RESTRIÇÕES DE APOIO
2	AG	A	MIGRAÇÃO/ALFÂNDEGA/SAÚDE
3	AI	A	AEROPORTO DE ORIGEM INTERDITADO
4	AJ	B	AEROPORTO DE DESTINO INTERDITADO

	year	month	day	weather-event	tempC-high	tempC-avg	tempC-low	dewpoint-high	dewpoint-avg	dewpoint-low	...	sealevelpress-avg	sealevelpress-low	visibilty-km-high	visibilty-km-avg	visibilty-km-low	wind-km-h-high	wind-km-h-avg	precipitation-mm-high	date
0	2017	1	1	Rain , Thunderstorm	31	26	21	20	14	2	...	1016	1015	10	9	2	27	10	55	2017-01-01
1	2017	1	2	Thunderstorm	29	24	21	21	16	9	...	1015	1013	10	10	10	23	10	-	2017-01-02
2	2017	1	3	Rain , Thunderstorm	34	20	7	18	12	-1	...	1014	1011	10	9	2	39	19	34	2017-01-03
3	2017	1	4	Rain , Thunderstorm	30	26	21	19	14	7	...	1015	1013	10	10	10	24	11	-	2017-01-04
4	2017	1	5	Rain	30	26	22	21	14	7	...	1017	1015	10	10	7	39	8	61	2017-01-05

	days	hours	minutes	seconds	milliseconds	microseconds	nanoseconds
0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
1	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2	0.0	0.0	0.0	0.0	0.0	0.0	0.0
3	0.0	0.0	0.0	0.0	0.0	0.0	0.0
4	0.0	0.0	0.0	0.0	0.0	0.0	0.0

	days	hours	minutes	seconds	milliseconds	microseconds	nanoseconds
0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
1	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2	0.0	0.0	0.0	0.0	0.0	0.0	0.0
3	0.0	0.0	0.0	0.0	0.0	0.0	0.0
4	0.0	0.0	0.0	0.0	0.0	0.0	0.0