Flight delays (capstone project)

Summary of data wrangling process

1. Original files (raw data)

1.1 - CSV files with flight data: monthly departures and arrivals

These raw data files are available in CSV format at the website of the Brazilian National Civil Aviation Agency (ANAC): http://www.anac.gov.br/assuntos/dados-e-estatisticas/historico-de-voos. Last access: November 2017.

These files contain the following information: airline, flight code, authorization code, line code, origin and destination airports (airport-A, airport-B), estimated and actual departure and arrival date/times, flight status, and delay code.

1.2 - Local airports (list of all ~600 Brazilian public airports)

An XLS spreadsheet is also available at ANAC's website: http://www.anac.gov.br/assuntos/setor-regulado/aerodromos/cadastro-de-aerodromos/aerodromos-cadastrados/aerodromospublicos-12.xls. Last access: November 15th, 2017 ("aerodromospublicos-12.xls").

I opened this file in MS Excel and saved it as CSV. The columns of this file are: airport code, airport name, city, state, latitude, longitude, altitude, and operation.

1.3 - Weather data (historical data)

I could not find any local dataset with historical weather data in an open, easy-to-retrieve format, so I grabbed the historical weather data from the Weather Underground website.

An example with 2016 weather data from Campinas, SP, Brazil (my city): https://www.wunderground.com/history/airport/SBKP/2016/1/1/CustomHistory.html?dayend=31&monthend=12&yearend=2016&req_city=&req_state=&req_statename=&reqdb.zip=&reqdb.magic=&reqdb.wmo=

I copied (Ctrl-C) the data section and pasted it (Ctrl-V) into a Google spreadsheet. I moved the last column, which shows the "Weather Events" (Rain, Thunderstorms, etc.), and the data are presented like this:

1.4 - List of two-letter codes used when there was a flight delay (departure)

An XLS file with the two-letter justification codes used by the aviation crew to "explain" the delay of a flight:

Link from ANAC's website: http://www.anac.gov.br/assuntos/dados-e-estatisticas/vra/glossario_de_justificativas.xls. Last access: November 20th, 2017.

These codes appear in the flights CSV (see section 1.1, column "delay-code").
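Although the join itself is not shown in this notebook, a minimal sketch of how these justification codes could be attached to the flights dataset (using the 'flights' DataFrame loaded in section 2.1; the column names 'code' and 'description' and the local file path are assumptions, not the actual XLS layout):

import pandas as pd

# Hypothetical sketch: attach the delay-code descriptions to the flight records.
# 'code', 'description' and the file path are assumed names, not the real XLS headers.
delay_codes = pd.read_excel('data/glossario_de_justificativas.xls', dtype=str)
delay_codes.columns = ['code', 'description']   # assumed two-column layout
flights = pd.merge(flights, delay_codes,
                   left_on='delay-code', right_on='code', how='left')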

2. Adjusting some data

In order to start the analysis, it was necessary to adjust some data from these files.

  • In the flights dataset, for instance, the date/time columns are in a non-ISO format (dd/mm/yyyy HH:MM) and need to be converted to ISO (yyyy/mm/dd HH:MM:SS);
  • The analysis will consider only local flights, i.e., both airport_A and airport_B in Brazil. It is therefore necessary to identify which airports (A or B) are in Brazil. A new column 'local_flight' will be added to the flights dataset with value '1' (local flight) or '0' (A or B is overseas). This information will come from a merge of the flights dataset with the Brazilian airports dataset: if both 'airport_A' and 'airport_B' are found in the Brazilian airports dataset, 'local_flight' = '1', otherwise '0' (see section 2.2);
  • There was relevant information in Portuguese, so, for clarity, I translated some of it;
  • I translated the file header (column names) to English.

2.1 - Flights

Adjust the date format to ISO, translate the flight status, and identify local (Brazilian) airports.

In order to demonstrate this, I created a sample file from the original dataset with 20 records: "arfsample-date.csv".


In [1]:
from datetime import datetime

# Pandas and NumPy
import pandas as pd
import numpy as np

In [2]:
# Read the CSV with flight records (separator = ";")
flights = pd.read_csv('data/arfsample-date.csv', sep = ';', dtype = str)

In [3]:
flights.head()


Out[3]:
airline flight-code auth-code line-code airport-A airport-B departure-est departure-real arrival-est arrival-real flight-status delay-code
0 AAL 213 0 I KMIA SBBR 02/01/2017 01:20 02/01/2017 01:20 02/01/2017 09:00 02/01/2017 09:00 REALIZADO NaN
1 AAL 213 0 I KMIA SBBR 24/01/2017 01:20 24/01/2017 01:20 24/01/2017 09:00 24/01/2017 09:00 REALIZADO NaN
2 TAM 8113 0 I MMMX SBGR 04/05/2017 20:25 NaN 05/05/2017 06:00 NaN CANCELADO XN
3 TAM 8113 0 I MMMX SBGR 31/05/2017 20:25 31/05/2017 20:25 01/06/2017 06:00 01/06/2017 06:00 REALIZADO NaN
4 TAM 8113 0 I MMMX SBGR 16/05/2017 20:25 16/05/2017 20:25 17/05/2017 06:00 17/05/2017 06:00 REALIZADO NaN

In [4]:
# Convert the date columns from 'dd/mm/yyyy HH:MM' to the ISO-like format
# 'yyyy/mm/dd HH:MM:SS' using a regular expression with named groups.
# (In newer pandas versions, pass regex=True explicitly to str.replace.)
# The same string-replacement approach is used in In [8] below to translate
# 'flight-status': 'REALIZADO' -> 'ACCOMPLISHED', 'CANCELADO' -> 'CANCELED'.
date_pattern = r"(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4}) (?P<HOUR>\d{2}):(?P<MIN>\d{2})"
iso_replacement = r"\g<year>/\g<month>/\g<day> \g<HOUR>:\g<MIN>:00"

for col in ['departure-est', 'departure-real', 'arrival-est', 'arrival-real']:
    flights[col] = flights[col].str.replace(date_pattern, iso_replacement)

In [5]:
flights.dtypes


Out[5]:
airline           object
flight-code       object
auth-code         object
line-code         object
airport-A         object
airport-B         object
departure-est     object
departure-real    object
arrival-est       object
arrival-real      object
flight-status     object
delay-code        object
dtype: object

Now it is time to convert the date columns from 'object' to datetime format.


In [6]:
# Departure and arrival columns: convert from 'object' to datetime64.
# With errors='ignore', a column that cannot be parsed is returned unchanged.
flights['departure-est'] = pd.to_datetime(flights['departure-est'], errors='ignore')
flights['departure-real'] = pd.to_datetime(flights['departure-real'], errors='ignore')
flights['arrival-est'] = pd.to_datetime(flights['arrival-est'], errors='ignore')
flights['arrival-real'] = pd.to_datetime(flights['arrival-real'], errors='ignore')

In [7]:
flights.dtypes


Out[7]:
airline                   object
flight-code               object
auth-code                 object
line-code                 object
airport-A                 object
airport-B                 object
departure-est     datetime64[ns]
departure-real    datetime64[ns]
arrival-est       datetime64[ns]
arrival-real      datetime64[ns]
flight-status             object
delay-code                object
dtype: object
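
For reference, a more direct alternative (a sketch, not what this notebook does) would skip the regex rewrite in In [4] and parse the original 'dd/mm/yyyy HH:MM' strings in a single step with an explicit format:

# Alternative sketch (not used above): parse the original strings directly.
# errors='coerce' turns missing values (e.g. NaN for cancelled flights) into NaT.
for col in ['departure-est', 'departure-real', 'arrival-est', 'arrival-real']:
    flights[col] = pd.to_datetime(flights[col], format='%d/%m/%Y %H:%M', errors='coerce')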

Flight status: the content of the 'flight-status' column will be translated to English:


In [8]:
# Translate the flight status from Portuguese to English
flights['flight-status'] = flights['flight-status'].str.replace("REALIZADO", "ACCOMPLISHED")
flights['flight-status'] = flights['flight-status'].str.replace("CANCELADO", "CANCELED")

The result, so far:


In [9]:
flights[['departure-est', 'departure-real', 'arrival-est', 'arrival-real', 'flight-status']].head()


Out[9]:
departure-est departure-real arrival-est arrival-real flight-status
0 2017-01-02 01:20:00 2017-01-02 01:20:00 2017-01-02 09:00:00 2017-01-02 09:00:00 ACCOMPLISHED
1 2017-01-24 01:20:00 2017-01-24 01:20:00 2017-01-24 09:00:00 2017-01-24 09:00:00 ACCOMPLISHED
2 2017-05-04 20:25:00 NaT 2017-05-05 06:00:00 NaT CANCELED
3 2017-05-31 20:25:00 2017-05-31 20:25:00 2017-06-01 06:00:00 2017-06-01 06:00:00 ACCOMPLISHED
4 2017-05-16 20:25:00 2017-05-16 20:25:00 2017-05-17 06:00:00 2017-05-17 06:00:00 ACCOMPLISHED

Some quick EDA (sanity tests)


In [32]:
flights.groupby(['airport-A']).size().reset_index(name='number_departures')


Out[32]:
airport-A number_departures
0 KMIA 2
1 MMMX 5
2 SBFZ 2
3 SBGR 5
4 SBSP 6


2.2 - Merge flights with the local airports dataset (list of all ~600 Brazilian public airports) and identify the local flights

The local airports dataset is not a sample: this CSV file contains all 589 Brazilian public airports. There are other airports in Brazil but, in this analysis, only the public airports will be considered.


In [10]:
# Airports dataset: all Brazilian public airports (updated to October 2017)
airports = pd.read_csv('data/brazilianPublicAirports-out2017.csv', sep = ';')

In [11]:
airports.head()


Out[11]:
airport airportname city state latitude longitude altitude operation
0 SBAA CONCEIÇÃO DO ARAGUAIA CONCEIÇÃO DO ARAGUAIA PA 8° 20' 55'' S 49° 18' 11'' W 199 m VFR Diurno/Noturno e IFR Diurno
1 SBAE BAURU/AREALVA BAURU E AREALVA SP 22° 9' 28'' S 49° 4' 6'' W 594 m VFR Diurno/Noturno e IFR Diurno/Noturno
2 SBAM AMAPÁ AMAPÁ AP 2° 4' 22'' N 50° 51' 45'' W 14 m VFR Diurno e IFR Diurno
3 SBAQ BARTOLOMEU DE GUSMÃO ARARAQUARA SP 21° 48' 16'' S 48° 8' 25'' W 711 m VFR Diurno/Noturno e IFR Diurno/Noturno
4 SBAR SANTA MARIA ARACAJU SE 10° 59' 7'' S 37° 4' 24'' W 7 m VFR Diurno/Noturno e IFR Diurno/Noturno

In [12]:
# Merge the "flights" dataset with "airports" twice in order to identify
#   local flights (origin and destination both in Brazil).
# Overlapping columns get pandas' default merge suffixes: '_x' marks the
#   origin-side (airport-A) match and '_y' the destination-side (airport-B) match.
flights = pd.merge(flights, airports, left_on="airport-A", right_on="airport", how='left')
flights = pd.merge(flights, airports, left_on="airport-B", right_on="airport", how='left')

In [13]:
flights.head()


Out[13]:
airline flight-code auth-code line-code airport-A airport-B departure-est departure-real arrival-est arrival-real ... altitude_x operation_x airport_y airportname_y city_y state_y latitude_y longitude_y altitude_y operation_y
0 AAL 213 0 I KMIA SBBR 2017-01-02 01:20:00 2017-01-02 01:20:00 2017-01-02 09:00:00 2017-01-02 09:00:00 ... NaN NaN SBBR PRESIDENTE JUSCELINO KUBITSCHEK BRASÍLIA DF 15° 52' 16'' S 47° 55' 7'' W 1066 m VFR Diurno/Noturno e IFR Diurno/Noturno
1 AAL 213 0 I KMIA SBBR 2017-01-24 01:20:00 2017-01-24 01:20:00 2017-01-24 09:00:00 2017-01-24 09:00:00 ... NaN NaN SBBR PRESIDENTE JUSCELINO KUBITSCHEK BRASÍLIA DF 15° 52' 16'' S 47° 55' 7'' W 1066 m VFR Diurno/Noturno e IFR Diurno/Noturno
2 TAM 8113 0 I MMMX SBGR 2017-05-04 20:25:00 NaT 2017-05-05 06:00:00 NaT ... NaN NaN SBGR GUARULHOS - GOVERNADOR ANDRÉ FRANCO MONTORO GUARULHOS SP 23° 26' 8'' S 46° 28' 23'' W 750 m VFR Diurno/Noturno e IFR Diurno/Noturno
3 TAM 8113 0 I MMMX SBGR 2017-05-31 20:25:00 2017-05-31 20:25:00 2017-06-01 06:00:00 2017-06-01 06:00:00 ... NaN NaN SBGR GUARULHOS - GOVERNADOR ANDRÉ FRANCO MONTORO GUARULHOS SP 23° 26' 8'' S 46° 28' 23'' W 750 m VFR Diurno/Noturno e IFR Diurno/Noturno
4 TAM 8113 0 I MMMX SBGR 2017-05-16 20:25:00 2017-05-16 20:25:00 2017-05-17 06:00:00 2017-05-17 06:00:00 ... NaN NaN SBGR GUARULHOS - GOVERNADOR ANDRÉ FRANCO MONTORO GUARULHOS SP 23° 26' 8'' S 46° 28' 23'' W 750 m VFR Diurno/Noturno e IFR Diurno/Noturno

5 rows × 28 columns

--- Here I got stuck ---:

1) After the merge of the 'flights' and 'airports' datasets,

2) I created a new column: 'local_flight';

3) I tried to assign a value (int or boolean) to this column based on the content of the columns 'airport_x' (departure) and 'airport_y' (arrival): if there is information in both columns, the record is a local flight (A and B are Brazilian airports).


In [14]:
# Create a new column 'local_flight' with '0' value as default
flights["local_flight"] = 0

In [15]:
# Airports not found in the Brazilian list are NaN after the merge; replace with empty strings
flights.airport_x = flights.airport_x.fillna('')
flights.airport_y = flights.airport_y.fillna('')

In [16]:
flights[["airport_x", "airport_y", "local_flight"]]


Out[16]:
airport_x airport_y local_flight
0 SBBR 0
1 SBBR 0
2 SBGR 0
3 SBGR 0
4 SBGR 0
5 SBGR 0
6 SBGR 0
7 SBSP SBPA 0
8 SBSP SBPA 0
9 SBSP SBPA 0
10 SBSP SBPA 0
11 SBSP SBPA 0
12 SBSP SBPA 0
13 SBGR 0
14 SBGR 0
15 SBGR 0
16 SBGR 0
17 SBGR 0
18 SBFZ 0
19 SBFZ 0

In [17]:
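# NOTE: this is where the notebook gets stuck. The assignment inside the loop
# targets the whole 'local_flight' column rather than the current row, so as soon
# as any row satisfies the condition every row ends up with local_flight = 1
# (see Out[18] below).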
for index,row in flights.iterrows():
    if ((row['airport_x'] != '') & (row['airport_y'] != '')):
        flights['local_flight'] = 1

In [18]:
flights[['airport_x', 'airport_y', 'local_flight']]


Out[18]:
airport_x airport_y local_flight
0 SBBR 1
1 SBBR 1
2 SBGR 1
3 SBGR 1
4 SBGR 1
5 SBGR 1
6 SBGR 1
7 SBSP SBPA 1
8 SBSP SBPA 1
9 SBSP SBPA 1
10 SBSP SBPA 1
11 SBSP SBPA 1
12 SBSP SBPA 1
13 SBGR 1
14 SBGR 1
15 SBGR 1
16 SBGR 1
17 SBGR 1
18 SBFZ 1
19 SBFZ 1
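
One way to get the intended per-row result (a sketch using the 'airport_x' / 'airport_y' columns produced by the merge above, not the notebook's final solution) is a vectorized assignment instead of the row loop:

# Vectorized sketch: local_flight = 1 only where BOTH merge columns are non-empty,
# i.e. both the origin and the destination airport were found in the Brazilian list.
flights['local_flight'] = ((flights['airport_x'] != '') &
                           (flights['airport_y'] != '')).astype(int)

With this, the rows above that have an empty 'airport_x' or 'airport_y' (overseas origin or destination) would get 0 instead of 1.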

Moving on...

2.3 - The Weather data from https://www.wunderground.com/history

From this website I captured sample data from my local airport (Campinas, SP, Brazil) for the year 2016.

The website presents data like this (see https://goo.gl/SqKfmP):

I performed the following steps in a Google spreadsheet:

  • I copied (Ctrl+C) this data and created a Google spreadsheet (Ctrl+V) with it (see section 1.3 above);
  • I moved the column that shows the special weather events (thunderstorms, rain, etc.) and positioned it just before the "Temperature" column;
  • In order to handle the date information, I added the correct year and month to each row.

The result is this CSV (see https://docs.google.com/spreadsheets/d/1-AYWwt4coeJHhq4RSKwYTIXdov4MHxraQb7-ayPCaK4/edit?usp=sharing):


In [19]:
# Weather sample: load the CSV with weather historical data (from Campinas, SP, Brazil, 2016)
weather = pd.read_csv('data/DataScience-Intensive-weatherAtCampinasAirport-2016-Campinas_Airport_2016Weather.csv', \
                      sep = ',', dtype = str)

In [20]:
weather.head()


Out[20]:
year month day weather-event tempC-high tempC-avg tempC-low dewpoint-high dewpoint-avg dewpoint-low ... sealevelpress-high sealevelpress-avg sealevelpress-low visibilty-km-high visibilty-km-avg visibilty-km-low wind-km-h-high wind-km-h-avg precipitation-mm-high precipitation-mm-sum
0 2016 1 1 Rain , Thunderstorm 30 26 21 22 21 20 ... 1016 1014 1012 10 10 8 21 10 - 0.00
1 2016 1 2 Rain , Thunderstorm 28 24 22 23 21 20 ... 1017 1015 1013 10 9 3 24 10 45 0.00
2 2016 1 3 Rain 26 23 22 22 20 19 ... 1017 1015 1013 10 10 8 26 14 - 0.00
3 2016 1 4 NaN 28 24 20 19 18 17 ... 1017 1016 1013 10 10 10 32 19 - 0.00
4 2016 1 5 NaN 31 24 19 19 17 14 ... 1020 1017 1014 10 10 10 29 16 - 0.00

5 rows × 23 columns


In [21]:
weather.dtypes


Out[21]:
year                     object
month                    object
day                      object
weather-event            object
tempC-high               object
tempC-avg                object
tempC-low                object
dewpoint-high            object
dewpoint-avg             object
dewpoint-low             object
humidity-perc-high       object
humidity-perc-avg        object
humidity-perc-low        object
sealevelpress-high       object
sealevelpress-avg        object
sealevelpress-low        object
visibilty-km-high        object
visibilty-km-avg         object
visibilty-km-low         object
wind-km-h-high           object
wind-km-h-avg            object
precipitation-mm-high    object
precipitation-mm-sum     object
dtype: object

In [22]:
# Build a 'date' string ('year-month-day') from the separate year/month/day columns
weather["date"] = weather["year"].map(str) + "-" + weather["month"].map(str) + "-" + weather["day"].map(str)

In [23]:
# Convert the 'date' string column to datetime
weather["date"] = pd.to_datetime(weather['date'], errors='ignore')

In [24]:
weather.dtypes


Out[24]:
year                             object
month                            object
day                              object
weather-event                    object
tempC-high                       object
tempC-avg                        object
tempC-low                        object
dewpoint-high                    object
dewpoint-avg                     object
dewpoint-low                     object
humidity-perc-high               object
humidity-perc-avg                object
humidity-perc-low                object
sealevelpress-high               object
sealevelpress-avg                object
sealevelpress-low                object
visibilty-km-high                object
visibilty-km-avg                 object
visibilty-km-low                 object
wind-km-h-high                   object
wind-km-h-avg                    object
precipitation-mm-high            object
precipitation-mm-sum             object
date                     datetime64[ns]
dtype: object
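
The remaining weather columns are still strings, because the CSV was read with dtype=str and some cells contain placeholders such as '-'. A possible next wrangling step, sketched here rather than taken from the notebook, is converting the measurement columns to numbers:

# Sketch (not part of the original notebook): convert the measurement columns to numeric.
# errors='coerce' turns placeholders like '-' into NaN.
numeric_cols = [c for c in weather.columns
                if c not in ['year', 'month', 'day', 'weather-event', 'date']]
for col in numeric_cols:
    weather[col] = pd.to_numeric(weather[col], errors='coerce')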
