https://github.com/davicsilva/dsintensive/blob/master/notebooks/flightDelayPrepData_v2.ipynb
In [1]:
from datetime import datetime
# Pandas and NumPy
import pandas as pd
import numpy as np
# Matplotlib for additional customization
from matplotlib import pyplot as plt
%matplotlib inline
# Seaborn for plotting and styling
import seaborn as sns
In [2]:
# 1. Flight delay: any flight with (real_departure - planned_departure >= 15 minutes)
# 2. The Brazilian Federal Agency for Civil Aviation (ANAC) does not define exactly what is a "flight delay" (in minutes)
# 3. Anyway, the ANAC has a resolution for this subject: https://goo.gl/YBwbMy (last access: nov, 15th, 2017)
# ---
# DELAY, for this analysis, is defined as greater than 15 minutes (local flights only)
DELAY = 15
A note about date columns on this dataset
In [3]:
#[flights] dataset_01 => all "Active Regular Flights" from 2017, from january to september
#source: http://www.anac.gov.br/assuntos/dados-e-estatisticas/historico-de-voos
#Last access this website: nov, 14th, 2017
flights = pd.read_csv('data/arf2017ISO.csv', sep = ';', dtype = str)
In [4]:
flights['departure-est'] = flights[['departure-est']].apply(lambda row: row.str.replace("(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4}) (?P<HOUR>\d{2}):(?P<MIN>\d{2})", "\g<year>/\g<month>/\g<day> \g<HOUR>:\g<MIN>:00"), axis=1)
In [5]:
flights['departure-real'] = flights[['departure-real']].apply(lambda row: row.str.replace("(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4}) (?P<HOUR>\d{2}):(?P<MIN>\d{2})", "\g<year>/\g<month>/\g<day> \g<HOUR>:\g<MIN>:00"), axis=1)
flights['arrival-est'] = flights[['arrival-est']].apply(lambda row: row.str.replace("(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4}) (?P<HOUR>\d{2}):(?P<MIN>\d{2})", "\g<year>/\g<month>/\g<day> \g<HOUR>:\g<MIN>:00"), axis=1)
flights['arrival-real'] = flights[['arrival-real']].apply(lambda row: row.str.replace("(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4}) (?P<HOUR>\d{2}):(?P<MIN>\d{2})", "\g<year>/\g<month>/\g<day> \g<HOUR>:\g<MIN>:00"), axis=1)
In [6]:
# Departure and Arrival columns: from 'object' to 'date' format
flights['departure-est'] = pd.to_datetime(flights['departure-est'], errors='ignore')
flights['departure-real'] = pd.to_datetime(flights['departure-real'], errors='ignore')
flights['arrival-est'] = pd.to_datetime(flights['arrival-est'], errors='ignore')
flights['arrival-real'] = pd.to_datetime(flights['arrival-real'], errors='ignore')
In [7]:
# translate the flight status from portuguese to english
flights['flight-status'] = flights[['flight-status']].apply(lambda row: row.str.replace("REALIZADO", "ACCOMPLISHED"), axis=1)
flights['flight-status'] = flights[['flight-status']].apply(lambda row: row.str.replace("CANCELADO", "CANCELED"), axis=1)
In [8]:
flights.head()
Out[8]:
In [9]:
flights.size
Out[9]:
In [10]:
flights.to_csv("flights_csv.csv")
In [11]:
# See: https://stackoverflow.com/questions/37287938/sort-pandas-dataframe-by-value
#
df_departures = flights.groupby(['airport-A']).size().reset_index(name='number_departures')
In [12]:
df_departures.sort_values(by=['number_departures'], ascending=False, inplace=True)
In [13]:
df_departures
Out[13]:
Source: https://goo.gl/mNFuPt (a XLS spreadsheet in portuguese; last access on nov, 15th, 2017)
In [14]:
# Airports dataset: all brazilian public airports (updated until october, 2017)
airports = pd.read_csv('data/brazilianPublicAirports-out2017.csv', sep = ';', dtype= str)
In [15]:
airports.head()
Out[15]:
In [16]:
# Merge "flights" dataset with "airports" in order to identify
# local flights (origin and destination are in Brazil)
flights = pd.merge(flights, airports, left_on="airport-A", right_on="airport", how='left')
flights = pd.merge(flights, airports, left_on="airport-B", right_on="airport", how='left')
In [17]:
flights.tail()
Out[17]:
I have found two lists that define two-letter codes used by the aircraft crew to justify the delay of the flights: a short and a long one.
Source: https://goo.gl/vUC8BX (last access: nov, 15th, 2017)
In [18]:
# ------------------------------------------------------------------
# List of codes (two letters) used to justify a delay on the flight
# - delayCodesShortlist.csv: list with YYY codes
# - delayCodesLongList.csv: list with XXX codes
# ------------------------------------------------------------------
delaycodes = pd.read_csv('data/delayCodesShortlist.csv', sep = ';', dtype = str)
delaycodesLongList = pd.read_csv('data/delayCodesLonglist.csv', sep = ';', dtype = str)
In [19]:
delaycodes.head()
Out[19]:
From this website I captured a sample data from local airport (Campinas, SP, Brazil): January to September, 2017.
The website presents data like this (see https://goo.gl/oKwzyH):
In [20]:
# Weather sample: load the CSV with weather historical data (from Campinas, SP, Brazil, 2017)
weather = pd.read_csv('data/DataScience-Intensive-weatherAtCampinasAirport-2017-Campinas_Airport_2017Weather.csv', \
sep = ',', dtype = str)
In [21]:
weather["date"] = weather["year"].map(str) + "-" + weather["month"].map(str) + "-" + weather["day"].map(str)
weather["date"] = pd.to_datetime(weather['date'],errors='ignore')
In [22]:
weather.head()
Out[22]:
In [ ]: