Data Science Intensive Course

Capstone project: Flight Delays (Brazil)

The microdata from ANAC (National Agency for Civil Aviation) contains all flights that departure from brazilian airports. The flights are grouped by year-month, from January, 2000 to October, 2017 (last access on November, 2017).

According to this local publication, in Brazil, it is estimated that 20% of the 700K flights/year have some kind of delay: http://infograficos.oglobo.globo.com/economia/raio-x-dos-atrasos-dos-voos.html



In [34]:
# Pandas, Numpy, Matplotlib 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## plt.style.use('ggplot')

The raw data (monthly microdata from ANAC)

I will use the microdata that contains flights that occured on aug, 2017:

  • Flights: 94,185
  • Flights with delay: 16,317 (17,32%)

The questions:

1. It is possible to forecast flights with delay based on past delays causes?

2. Which model to choose?



In [24]:
# Flights (aug, 2017)
flightsAug17 = 'data/flightAug2017.csv'
flights = pd.read_csv(flightsAug17, sep = ';', parse_dates = True)

In [25]:
flights.describe()


Out[25]:
FlightCode
count 94185.000000
mean 3759.887689
std 2430.857874
min 1.000000
25% 1761.000000
50% 3349.000000
75% 5283.000000
max 9999.000000

In [26]:
flights.head(5)


Out[26]:
Airline FlightCode AuthCode TypeFlight AirportDep AirportArriv EstimDepart RealDepart EstimArriv RealArriv FlightStatus DelayMotivationCode
0 AAL 213 0 I KMIA SBBR 30/08/2017 23:25 30/08/2017 23:25 31/08/2017 07:00 31/08/2017 07:00 REALIZADO NaN
1 AAL 213 0 I KMIA SBBR 11/08/2017 23:25 11/08/2017 23:25 12/08/2017 07:00 12/08/2017 07:00 REALIZADO NaN
2 AAL 213 0 I KMIA SBBR 08/08/2017 23:25 08/08/2017 23:25 09/08/2017 07:00 09/08/2017 07:00 REALIZADO NaN
3 AAL 213 0 I KMIA SBBR 04/08/2017 23:25 04/08/2017 23:25 05/08/2017 07:00 05/08/2017 07:00 REALIZADO NaN
4 AAL 213 0 I KMIA SBBR 10/08/2017 23:25 10/08/2017 23:25 11/08/2017 07:00 11/08/2017 07:00 REALIZADO NaN

In [27]:
flights[flights['DelayMotivationCode'] >'NaN'].describe()


Out[27]:
FlightCode
count 16317.000000
mean 3854.016731
std 2047.877120
min 4.000000
25% 2192.000000
50% 3667.000000
75% 5260.000000
max 9990.000000

In [28]:
flights[flights['DelayMotivationCode'] >'NaN'].head(5)


Out[28]:
Airline FlightCode AuthCode TypeFlight AirportDep AirportArriv EstimDepart RealDepart EstimArriv RealArriv FlightStatus DelayMotivationCode
6 AAL 213 0 I KMIA SBBR 17/08/2017 23:25 18/08/2017 00:32 18/08/2017 07:00 18/08/2017 07:48 REALIZADO RA
15 AAL 213 0 I KMIA SBBR 23/08/2017 23:25 24/08/2017 00:11 24/08/2017 07:00 24/08/2017 09:34 REALIZADO TD
25 AAL 213 0 I KMIA SBBR 29/08/2017 23:25 29/08/2017 23:57 30/08/2017 07:00 30/08/2017 07:30 REALIZADO TD
26 AAL 213 0 I KMIA SBBR 24/08/2017 23:25 25/08/2017 00:33 25/08/2017 07:00 25/08/2017 08:13 REALIZADO TD
27 AAL 213 0 I KMIA SBBR 16/08/2017 23:25 17/08/2017 00:42 17/08/2017 07:00 17/08/2017 07:57 REALIZADO TD

In [29]:
# Total of delay per motivation code
flights['DelayMotivationCode'].value_counts()


Out[29]:
HD    10307
XB     7439
AR     2686
AT     2563
XN     2242
MX     1750
RA     1711
RI     1443
TD     1315
RM      690
MA      578
AS      485
TC      374
AI      260
VR      189
XT      162
WO      156
XO      145
AF      143
WA      118
WT      112
AJ      110
XM       91
HI       75
GF       62
FP       61
AG       48
XJ       35
XI       29
DF       29
XS       28
WR       25
HA       16
DG       15
VE       11
HB        9
AA        7
IR        3
AM        1
WS        1
WI        1
Name: DelayMotivationCode, dtype: int64

In [ ]: