These raw data files are available in CSV format at website of brazilian Federal Agency for Civil Aviation (ANAC): http://www.anac.gov.br/assuntos/dados-e-estatisticas/historico-de-voos. Last access: nov, 2017.
These files has theses information:
A XLSX spreadsheet is available, also, at ANAC's website: Link: http://www.anac.gov.br/assuntos/setor-regulado/aerodromos/cadastro-de-aerodromos/aerodromos-cadastrados/aerodromospublicos-12.xls. Last access on nov, 15th, 2017 ("aerodromospublicos-12.xls")
I opened this file with MS Excel and saved it as CSV. The columns of this file:
I could not find any local dataset with easy-to-retrieve and open format weather historical data. So I grabbed the weather historical data from Weather Underground website.
A example with 2016 weather data from Campinas, SP, Brazil (my city): https://www.wunderground.com/history/airport/SBKP/2016/1/1/CustomHistory.html?dayend=31&monthend=12&yearend=2016&req_city=&req_state=&req_statename=&reqdb.zip=&reqdb.magic=&reqdb.wmo=
I Ctrl-C the data section and Ctrl-V on a Google spreadsheet. I moved the last column that shows the "Weather Events" (Rain, Thunderstorms etc.) and these data are presented like this:
A XLS file with two-letters justification code that is used by the aviation crew to "explain" the delay of a flight:
Link from ANAC's website: http://www.anac.gov.br/assuntos/dados-e-estatisticas/vra/glossario_de_justificativas.xls. Last access: nov, 20th, 2017.
These codes are used in the CSV with data of the flights (see section 1.1, column "Delay Code").
In order to start the analysis it was necessary to adjust some data from these files.
In [1]:
from datetime import datetime
# Pandas and NumPy
import pandas as pd
import numpy as np
In [2]:
# Read the CSV with flights records (separation = ";")
flights = pd.read_csv('data/arfsample-date.csv', sep = ';', dtype = str)
In [3]:
flights.head()
Out[3]:
In [4]:
# Lambda function
# 1 - Used to adjust date columns to ISO format
# flights['column-with-date'] = flights[['column-with-date']].apply(lambda row: \
# row.str.replace("(?P<day>\d{2})/\
# (?P<month>\d{2})/\
# (?P<year>\d{4}) \
# (?P<HOUR>\d{2}):\
# (?P<MIN>\d{2})", \
# "\g<year>/\g<month>/\g<day> \g<HOUR>:\g<MIN>:00"), axis=1)
#
# 2- Used to translate to english the 'flight-status':
# - 'REALIZADO'-> 'ACCOMPLISHED'
# - 'CANCELADO'-> 'CANCELED'
# ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
flights['departure-est'] = flights[['departure-est']].apply(lambda row: row.str.replace("(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4}) (?P<HOUR>\d{2}):(?P<MIN>\d{2})", "\g<year>/\g<month>/\g<day> \g<HOUR>:\g<MIN>:00"), axis=1)
flights['departure-real'] = flights[['departure-real']].apply(lambda row: row.str.replace("(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4}) (?P<HOUR>\d{2}):(?P<MIN>\d{2})", "\g<year>/\g<month>/\g<day> \g<HOUR>:\g<MIN>:00"), axis=1)
flights['arrival-est'] = flights[['arrival-est']].apply(lambda row: row.str.replace("(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4}) (?P<HOUR>\d{2}):(?P<MIN>\d{2})", "\g<year>/\g<month>/\g<day> \g<HOUR>:\g<MIN>:00"), axis=1)
flights['arrival-real'] = flights[['arrival-real']].apply(lambda row: row.str.replace("(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4}) (?P<HOUR>\d{2}):(?P<MIN>\d{2})", "\g<year>/\g<month>/\g<day> \g<HOUR>:\g<MIN>:00"), axis=1)
In [5]:
flights.dtypes
Out[5]:
In [6]:
# Departure and Arrival columns: from 'object' to 'date' format
flights['departure-est'] = pd.to_datetime(flights['departure-est'],errors='ignore')
flights['departure-real'] = pd.to_datetime(flights['departure-real'],errors='ignore')
flights['arrival-est'] = pd.to_datetime(flights['arrival-est'],errors='ignore')
flights['arrival-real'] = pd.to_datetime(flights['arrival-real'],errors='ignore')
In [7]:
flights.dtypes
Out[7]:
In [8]:
# translate the flight status from portuguese to english
flights['flight-status'] = flights[['flight-status']].apply(lambda row: row.str.replace("REALIZADO", "ACCOMPLISHED"), axis=1)
flights['flight-status'] = flights[['flight-status']].apply(lambda row: row.str.replace("CANCELADO", "CANCELED"), axis=1)
In [9]:
flights[['departure-est', 'departure-real', 'arrival-est', 'arrival-real', 'flight-status']].head()
Out[9]:
In [32]:
flights.groupby(['airport-A']).size().reset_index(name='number_departures')
Out[32]:
In [ ]:
The local airports dataset is not a sample. This CSV file contains all 589 brazilian public airports. There are other airports in Brazil but, in this analysis, only the public airports will be considered.
In [10]:
# Airports dataset: all brazilian public airports (updated until october, 2017)
airports = pd.read_csv('data/brazilianPublicAirports-out2017.csv', sep = ';')
In [11]:
airports.head()
Out[11]:
In [12]:
# Merge "flights" dataset with "airports" in order to identify
# local flights (origin and destination are in Brazil)
flights = pd.merge(flights, airports, left_on="airport-A", right_on="airport", how='left')
flights = pd.merge(flights, airports, left_on="airport-B", right_on="airport", how='left')
In [13]:
flights.head()
Out[13]:
In [14]:
# Create a new column 'local_flight' with '0' value as default
flights["local_flight"] = 0
In [15]:
flights.airport_x = flights.airport_x.fillna('')
flights.airport_y = flights.airport_y.fillna('')
In [16]:
flights[["airport_x", "airport_y", "local_flight"]]
Out[16]:
In [17]:
for index,row in flights.iterrows():
if ((row['airport_x'] != '') & (row['airport_y'] != '')):
flights['local_flight'] = 1
In [18]:
flights[['airport_x', 'airport_y', 'local_flight']]
Out[18]:
From this website I captured a sample data from local airport (Campinas, SP, Brazil) during the year of 2016.
The website presents data like this (see https://goo.gl/SqKfmP):
The result is this CSV (see https://docs.google.com/spreadsheets/d/1-AYWwt4coeJHhq4RSKwYTIXdov4MHxraQb7-ayPCaK4/edit?usp=sharing):
In [19]:
# Weather sample: load the CSV with weather historical data (from Campinas, SP, Brazil, 2016)
weather = pd.read_csv('data/DataScience-Intensive-weatherAtCampinasAirport-2016-Campinas_Airport_2016Weather.csv', \
sep = ',', dtype = str)
In [20]:
weather.head()
Out[20]:
In [21]:
weather.dtypes
Out[21]:
In [22]:
weather["date"] = weather["year"].map(str) + "-" + weather["month"].map(str) + "-" + weather["day"].map(str)
In [23]:
weather["date"] = pd.to_datetime(weather['date'],errors='ignore')
In [24]:
weather.dtypes
Out[24]:
In [ ]: