This is our second process notebook. It describes our data acquisition process (sources and characteristics), the data preparation, and finally the creation of two specific subsets for special tasks of our analysis.
The following sections outline our data acquisition process and describe the most important characteristics and features of the datasets.
- The official flight database for every domestic flight in the US
- Historical weather data
Our main source of data was the Bureau of Transportation Statistics (BTS), a statistical agency of the US Department of Transportation (http://www.transtats.bts.gov/). Luckily, the BTS publishes detailed data for every domestic flight in the US (http://www.transtats.bts.gov/Tables.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data&DB_Short_Name=On-Time). However, it is not possible to download the data over an arbitrary period of time; one can only download the data for one month of a given year at a time. To retrieve the data automatically, we developed a scraper tool in Python that performs the requests and downloads the files. The scripts addressing this task can be found in the src folder. Scraping the data for ~25 years takes around 2-3 hours, as requests are processed slowly on the server side. Furthermore, the BTS provides lookup tables for airline codes, which we downloaded manually.
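For illustration, here is a minimal sketch of how the month-by-month download could look. The endpoint and form parameters below are placeholders, not the exact ones the BTS page expects; the actual scraper in the src folder handles those details.

import requests

# Placeholder endpoint -- an assumption for illustration only; the real scraper
# posts the exact form fields the BTS download page expects.
BTS_URL = 'http://www.transtats.bts.gov/DownLoad_Table.asp'

def download_month(year, month):
    # one request per (year, month), cached as cache/YYYYMM.zip
    payload = {'year': year, 'month': month}  # hypothetical form fields
    response = requests.post(BTS_URL, data=payload, stream=True)
    response.raise_for_status()
    with open('cache/%d%02d.zip' % (year, month), 'wb') as f:
        for chunk in response.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)

for month in range(1, 13):
    download_month(2014, month)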
Since the uncompressed data for each month is around 250-300MB (comma-separated values), we needed to filter this dataset. A first step to do so is restricting the features. A description of all available columns can be found at http://www.transtats.bts.gov/TableInfo.asp?Table_ID=236&DB_Short_Name=On-Time&Info_Only=0. In our analysis we use a subset of roughly 30 features that we identified as relevant for the purpose of our analysis.
Many of us have experienced it before: a flight is delayed because of last-minute repairs or other problems with the aircraft. We are curious whether the manufacturer or the age of the aircraft influences the probability of delays. Therefore, we need more detailed data about each aircraft. From the delay dataset mentioned in 2.1 we get the tail number of the aircraft for every single flight. The tail number is comparable to a car's license plate and helps us to identify the manufacturer and age of the airplane.
We can get this information from the Federal Aviation Administration (FAA). This institution maintains a database of tail numbers (so-called N-Numbers) for each aircraft in the US and also publishes other datasets with information about the respective aircraft (e.g. manufacturer of turbines, owner, etc.). We downloaded the database from the following website: http://www.faa.gov/licenses_certificates/aircraft_certification/aircraft_registry/releasable_aircraft_download/.
In addition, we need the exact geolocations of the airports in the dataset, for example to create good visualizations with Tableau Public or other maps. Furthermore, airport names are helpful for interpreting the data (the original dataset only contains the short IATA abbreviations). This data can easily be found online as CSV files (see http://openflights.org/data.html) and is stored in the data folder.
A natural cause of many delays seems to be the weather, so we decided to additionally include weather data in our analysis. For this we wanted historical weather information for the airports. Unfortunately, we couldn't find any publicly available sources that provide a suitable (free) dataset. However, Wunderground provides a web interface that can be queried with the IATA / ICAO code of an airport (see e.g. http://www.wunderground.com/history/airport/EDDF/2005/10/3/DailyHistory.html?req_city=Frankfurt+%2F+Main&req_state=&req_statename=Germany&reqdb.zip=00000&reqdb.magic=5&reqdb.wmo=10637). Writing a script allowed us to get historical data for individual airports. The scraped data contains the following features per day:
| Feature | Description |
| --- | --- |
| **events** | a list containing strings of weather events, e.g. "Rain", "Fog", "Snow" |
| **humidity** | humidity measured in percent |
| **precipitation** | precipitation measured in inches |
| **sealevelpressure** | pressure at sea level in inches |
| **snowdepth** | snow depth in inches |
| **snowfall** | snowfall in inches |
| **temperature** | temperature in degrees Fahrenheit |
| **visibility** | visibility in miles |
| **windspeed** | wind speed in miles per hour |
One drawback of this method is similar to getting the data from the BTS: requests are processed slowly on the server side, and a large number of requests is necessary to match the huge BTS dataset (ca. 15 min for a single year and a single airport). Thus, we decided to focus only on the weather at John F. Kennedy International Airport (New York City) and Boston Logan International Airport (Boston). Although we just used these two airports, we could gain some very valuable insights about the weather's effect on delays (see the process notebook for the exploratory analysis). The detailed code for the web scraper can be found in the src folder. It has not been included in this notebook, as it is not part of our actual analysis and would affect the readability of this notebook. The scraper creates a file weather_data.json in the data folder that can be used for further analysis.
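To make the interface between the scraper and this notebook explicit, the following sketch shows the structure that the loading code further down assumes for weather_data.json: a dictionary keyed by airport code, with one record per day containing a date string and a dict of the weather features listed above. The concrete values here are made up for illustration.

# assumed structure of data/weather_data.json (illustrative values only)
example_weather = {
    "JFK": [
        {
            "date": "20140101",          # YYYYMMDD as a string
            "data": {
                "events": "Snow",        # weather events of that day
                "temperature": 28.0,     # degrees Fahrenheit
                "windspeed": 12.0,       # miles per hour
                "precipitation": 0.15    # inches
            }
        }
        # ... one record per day of 2014
    ],
    "BOS": [
        # same structure for Boston Logan
    ]
}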
In [2]:
# import required modules for data preparation tasks
import requests, zipfile, StringIO
import pandas as pd
import random
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
import re
import json
import os
First, we want to open and combine the zipped data files for each month of the delay data that were downloaded according to the process outlined in the data acquisition section above. As we have more than 400,000 recorded flights each month, the dataset is extremely large. We therefore decided to focus on the subset of all flights in 2014 to get the most relevant flight information without missing certain months (this could be important when investigating seasonality effects).
In [3]:
# read all predefined months for a year and merge them into one data frame
rawData = pd.DataFrame()
months = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
for m in months:
z = zipfile.ZipFile('cache/{y}{mo}.zip'.format(y=str(2014), mo = m))
rawData = rawData.append(pd.read_csv(z.open(z.namelist()[0])))
print "Downloaded", m
# reset index of complete dataset for delays to prepare merging in next step
rawData.reset_index(inplace=True)
The columns we now have in the dataset are:
In [4]:
rawData.columns
Out[4]:
However, we just need a subset of these columns for our analysis:
In [5]:
selectedColumns = [u'index', u'FL_DATE', u'UNIQUE_CARRIER', u'TAIL_NUM', u'FL_NUM',
u'ORIGIN', u'DEST', u'CRS_DEP_TIME', u'DEP_TIME', u'DEP_DELAY', u'TAXI_OUT',
u'WHEELS_OFF', u'WHEELS_ON', u'TAXI_IN', u'CRS_ARR_TIME', u'ARR_TIME', u'ARR_DELAY',
u'CANCELLED', u'DIVERTED', u'CANCELLATION_CODE', u'AIR_TIME', u'DISTANCE',
u'CARRIER_DELAY', u'WEATHER_DELAY', u'NAS_DELAY', u'SECURITY_DELAY', u'LATE_AIRCRAFT_DELAY',
u'ORIGIN_CITY_NAME', u'DEST_CITY_NAME']
rawData = rawData[selectedColumns]
We also have two tables containing information about the aircraft and their manufacturers, available as comma-separated text files in the data folder as outlined in the section above. Both files will be loaded.
In [6]:
z = zipfile.ZipFile('externalData/AircraftInformation.zip')
# master table containing tail numbers of aircraft
df_master = pd.DataFrame.from_csv(z.open('MASTER.txt'))
# detailed table containing information about manufacturer, age, etc.
df_aircrafts = pd.DataFrame.from_csv(z.open('ACFTREF.txt'))
We can now join these two tables based on their common ID that is saved in the column MFR MDL CODE
of the master table and in the index of the aircraft table respectively.
In [7]:
master = df_master[['MFR MDL CODE', 'YEAR MFR']].reset_index()
aircrafts = df_aircrafts['MFR'].reset_index()
master.columns = ['TAIL_NUM', 'CODE', 'YEAR']
aircrafts.columns = ['CODE', 'MFR']
joined = pd.merge(master, aircrafts, how='left', on='CODE')
We now join this aircraft information with our delay data and extend the original dataset with the two new features: The year in which the aircraft was built (to determine the age) and the manufacturer.
In [8]:
delayFinal = rawData[['TAIL_NUM', 'UNIQUE_CARRIER']].copy()
# the FAA master table stores N-Numbers without the leading 'N', so strip only that prefix
delayFinal.TAIL_NUM = delayFinal.TAIL_NUM.str.lstrip('N')
delaymfr = pd.merge(delayFinal, joined, how='left', on=['TAIL_NUM'])
rawData['AIRCRAFT_YEAR'] = delaymfr.YEAR
rawData['AIRCRAFT_MFR'] = delaymfr.MFR
Now we load an external dataset that contains the geolocations for each commercial airport in the world. We filter this to get only the airports in the US and then assign the respective geocode of the origin airport to our original delay dataset by merging both tables.
In [9]:
airportLocation = pd.DataFrame.from_csv('externalData/airport_codes_with_geo_name_ids_and_nl_names-2008-04-14.csv', header=None)
usAirports = airportLocation[airportLocation[4]=='US'].reset_index()
# we just need a subset of the columns (origin, latitude and longitude)
usAirports = usAirports[[0, 5, 6]]
usAirports.columns = ['ORIGIN', 'LAT', 'LONG']
complete2014Data = pd.merge(rawData, usAirports, how='left', on='ORIGIN')
In [10]:
1.0*np.sum(complete2014Data.LAT.isnull())/complete2014Data.shape[0]
Out[10]:
Just 0.7% of all flight origins could not be located, so the merge was quite successful.
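If one wants to see which airports are affected, a quick optional check such as the following lists the most frequent unmatched origin codes (using only the columns defined above).

# optional check: which origin airports could not be matched to a geolocation?
unmatched = complete2014Data[complete2014Data.LAT.isnull()]
print unmatched.ORIGIN.value_counts().head(10)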
The resulting dataframe complete2014Data will be stored locally as a CSV file.
In [11]:
complete2014Data.to_csv('cache/complete2014Data.csv')
As outlined in the previous section, we also scraped historical weather data for major US airports from the web. This data can be used as additional features for each flight, giving information about the weather conditions at the airport of departure. The script assumes that the weather_data.json file is in the data folder and that it contains the respective weather information for the JFK airport in New York and the BOS airport in Boston for each day in 2014.
In [13]:
# load the weather file
weatherFile = os.path.join('data', 'weather_data.json')
with open(weatherFile) as infile:
weatherDict = json.load(infile)
In [14]:
# extract the weather data for New York and Boston from the JSON file and store it in separate dataframes
frames = []
# create df for weather in New York
for datapoint in weatherDict['JFK']:
    date = datapoint['date']
    frames.append(pd.DataFrame(datapoint['data'], index=['%s-%s-%s' % (date[0:4], date[4:6], date[6:8])]))
weather_jfk = pd.concat(frames).reset_index()
# create df for weather in Boston (reset frames so the JFK rows are not included again)
frames = []
for datapoint in weatherDict['BOS']:
    date = datapoint['date']
    frames.append(pd.DataFrame(datapoint['data'], index=['%s-%s-%s' % (date[0:4], date[4:6], date[6:8])]))
weather_bos = pd.concat(frames).reset_index()
# get just the departures from John F. Kennedy airport in New York City and Logan airport in Boston
jfk_delays = complete2014Data[complete2014Data.ORIGIN=='JFK']
bos_delays = complete2014Data[complete2014Data.ORIGIN=='BOS']
# merge the delays with the weather dataframes created above
jfk_delayWeather = pd.merge(jfk_delays, weather_jfk, how='left', left_on='FL_DATE', right_on='index')
bos_delayWeather = pd.merge(bos_delays, weather_bos, how='left', left_on='FL_DATE', right_on='index')
jfk_bos_comparison = pd.concat([jfk_delayWeather, bos_delayWeather]).reset_index()
In [15]:
# save everything in a csv
jfk_bos_comparison.to_csv('cache/jfk_bos_weather_2014.csv', encoding='UTF-8')
Before evaluating any models on the data, we have to clean it a bit.
When cleaning the data set, we remove flights that were cancelled or diverted, drop the columns that are not needed for the analysis (listed below), and finally drop the remaining rows with missing values.
Note that the data has to be cleaned in this order: most flights have empty entries for the 'diverted' columns, so dropping rows with missing values first would remove almost the entire dataset.
In [92]:
# columns to be dropped for the analysis
flight_data_dropped = ['QUARTER', 'DAY_OF_MONTH', 'AIRLINE_ID', 'CARRIER', 'FL_NUM', 'TAIL_NUM']
location_data_dropped = ['ORIGIN_STATE_FIPS', 'ORIGIN_STATE_NM',\
'ORIGIN_WAC', 'DEST_STATE_FIPS', \
'DEST_STATE_NM', 'DEST_WAC']
departure_data_dropped = ['DEP_TIME', 'DEP_DELAY', 'DEP_DELAY_NEW', 'DEP_DEL15', 'DEP_DELAY_GROUP',\
'DEP_TIME_BLK', 'TAXI_OUT', 'WHEELS_OFF']
arrival_data_dropped = ['WHEELS_ON', 'TAXI_IN', 'ARR_TIME', 'ARR_DELAY_NEW',\
'ARR_DELAY_GROUP', 'ARR_TIME_BLK']
cancel_data_dropped = ['CANCELLED','CANCELLATION_CODE', 'DIVERTED']
summaries_dropped = ['CRS_ELAPSED_TIME', 'AIR_TIME', 'FLIGHTS']
cause_delay_dropped = ['CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY']
gate_return_dropped = ['FIRST_DEP_TIME', 'TOTAL_ADD_GTIME', 'LONGEST_ADD_GTIME']
diverted_data_dropped = ['DIV_AIRPORT_LANDINGS', 'DIV_REACHED_DEST', 'DIV_ACTUAL_ELAPSED_TIME', \
'DIV_ARR_DELAY', 'DIV_DISTANCE', 'DIV1_AIRPORT', 'DIV1_WHEELS_ON', \
'DIV1_TOTAL_GTIME', 'DIV1_LONGEST_GTIME', 'DIV1_WHEELS_OFF', \
'DIV1_TAIL_NUM', 'DIV2_AIRPORT', 'DIV2_WHEELS_ON', \
'DIV2_TOTAL_GTIME', 'DIV2_LONGEST_GTIME', 'DIV2_WHEELS_OFF', \
'DIV2_TAIL_NUM', 'DIV3_AIRPORT', 'DIV3_WHEELS_ON', \
'DIV3_TOTAL_GTIME', 'DIV3_LONGEST_GTIME', 'DIV3_WHEELS_OFF', 'DIV3_TAIL_NUM', \
'DIV4_AIRPORT', 'DIV4_WHEELS_ON', 'DIV4_TOTAL_GTIME', 'DIV4_LONGEST_GTIME', \
'DIV4_WHEELS_OFF', 'DIV4_TAIL_NUM', 'DIV5_AIRPORT', 'DIV5_WHEELS_ON', \
'DIV5_TOTAL_GTIME', 'DIV5_LONGEST_GTIME', 'DIV5_WHEELS_OFF', 'DIV5_TAIL_NUM']
other_dropped = ['Unnamed: 93']
columns_dropped = flight_data_dropped + location_data_dropped + departure_data_dropped + arrival_data_dropped \
+ cancel_data_dropped + summaries_dropped + cause_delay_dropped + gate_return_dropped + diverted_data_dropped \
+ other_dropped
In [93]:
def clean(data, list_col):
    '''
    Creates a dataset by excluding undesirable columns and invalid rows
    Parameters:
    -----------
    data: pandas.DataFrame
        Flight dataframe
    list_col: <list 'string'>
        Columns to exclude from the data set
    '''
    # security check to drop only columns that actually exist
    list_col = list(set(list_col) & set(data.columns))
    # keep only flights that were neither cancelled nor diverted
    res = data[(data.CANCELLED == 0) & (data.DIVERTED == 0)].copy()
    res.drop(list_col, axis=1, inplace=True)
    # drop the remaining rows with missing values
    res.dropna(axis=0, inplace=True)
    return res
In [95]:
%%time
data2014 = clean(complete2014Data, columns_dropped)
print data2014.columns
Finally, we restrict the dataset to airlines that are still active today, using the list of active carriers stored in the data folder.
In [124]:
df_active_airlines = pd.read_csv('data/cur_airlines.txt', header=None)
df_active_airlines.columns = ['UNIQUE_CARRIER']
df_active_airlines.head()
Out[124]:
In [125]:
filteredData2014 = data2014.merge(df_active_airlines, on='UNIQUE_CARRIER', how='inner')
A quick check reveals that the filtering was not (really) necessary, as all airlines in the dataset are still active today.
In [126]:
filteredData2014.count()[0], data2014.count()[0]
Out[126]:
In [127]:
# save data to cache
filteredData2014.to_csv('cache/linear_model_data.csv')