Data Acquisition and Preparation Process

This is our second process notebook. It describes our data acquisition process (sources and characteristics), the data preparation, and finally the creation of two specific subsets for particular tasks in our analysis.

1 Data Acquisition Process

The following sections outline our data acquisition process and describe the most important characteristics and features of the datasets.

We used four different data sources:

  1. The official flight database for every domestic flight in the US

  2. Historical weather data

  3. Airport information with geodata and names (e.g. for visualization and interpretation of results)
  4. Information about aircraft models

1.1 Flight Delay Central Database

Our main source of data was the Bureau of Transportation Statistics (BTS), a statistical agency of the US Department of Transportation (http://www.transtats.bts.gov/). Luckily, the BTS publishes detailed data for every domestic flight in the US (http://www.transtats.bts.gov/Tables.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data&DB_Short_Name=On-Time). However, it is not possible to download the data for an arbitrary period of time; one can only download the data for a single month of a given year. We therefore developed a scraper tool in Python that performs the requests and downloads the data automatically. The files addressing this task can be found in the src folder. Scraping the data for ~25 years takes around 2-3 hours, as requests are processed slowly on the server side. Furthermore, the BTS provides lookup tables for airline codes, which we downloaded manually.
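The scraper itself lives in the src folder and is not part of this notebook. The sketch below only illustrates the caching layout that the preparation steps in section 2 rely on (one zip archive per month, named cache/YYYYMM.zip); download_month is a hypothetical placeholder for the actual BTS request logic.

import os

def download_month(year, month):
    """Hypothetical placeholder: perform the BTS request for one month
    and return the zipped CSV as a byte string."""
    raise NotImplementedError('see the scraper in the src folder')

def scrape_years(years, cache_dir='cache'):
    # store one zip archive per month, e.g. cache/201401.zip
    if not os.path.exists(cache_dir):
        os.makedirs(cache_dir)
    for year in years:
        for month in range(1, 13):
            target = os.path.join(cache_dir, '{y}{m:02d}.zip'.format(y=year, m=month))
            if os.path.exists(target):
                continue  # already cached, skip the slow server round trip
            with open(target, 'wb') as f:
                f.write(download_month(year, month))

# e.g. scrape_years(range(1990, 2015)) for roughly 25 years of data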

Since the uncompressed data for each month is around 250-300 MB (comma-separated), we needed to filter this dataset. A first step is restricting the features. A description of all available columns can be found at http://www.transtats.bts.gov/TableInfo.asp?Table_ID=236&DB_Short_Name=On-Time&Info_Only=0. For our analysis we use a subset of 30 features that we identified as relevant.
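In the preparation section below we load all columns first and subset afterwards. If memory becomes an issue, the restriction could also be applied already at parse time via pandas' usecols argument; the column list in this sketch is just an illustrative subset of the 30 features we keep later.

import zipfile
import pandas as pd

# illustrative subset of the relevant columns; the full selection is defined in section 2.1
relevant = ['FL_DATE', 'UNIQUE_CARRIER', 'TAIL_NUM', 'ORIGIN', 'DEST',
            'CRS_DEP_TIME', 'DEP_DELAY', 'ARR_DELAY', 'CANCELLED', 'DIVERTED']

z = zipfile.ZipFile('cache/201401.zip')
january = pd.read_csv(z.open(z.namelist()[0]), usecols=relevant)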

1.2 Aircraft Information Data

Many of us have experienced it before: a flight is delayed because of some last-minute repairs or other problems with the aircraft. We are curious whether the manufacturer or the age of the aircraft influences the probability of delays. Therefore, we need more detailed data about the aircraft operating each flight. From the delay dataset described in 1.1 we get the tail number of the aircraft for every single flight. This tail number is comparable to a car license plate and allows us to identify the manufacturer and age of the airplane.

We can get this information from the Federal Aviation Administration (FAA). The FAA maintains a database of tail numbers (so-called N-Numbers) for every aircraft registered in the US and also publishes further datasets with information about the respective aircraft (e.g. turbine manufacturer, owner, etc.). We downloaded the database from the following website: http://www.faa.gov/licenses_certificates/aircraft_certification/aircraft_registry/releasable_aircraft_download/.

1.3 Airport Information

In addition, we need the exact geolocations of the airports in the dataset, for example to create good visualizations using Tableau Public or other maps. Furthermore, airport names are helpful for interpreting the data (the original dataset only contains the short IATA abbreviations). This data is readily available online as CSV files (see http://openflights.org/data.html) and is stored in the externalData folder.

1.4 Weather Data

The weather is a natural cause of many delays, so we decided to additionally include weather data in our analysis. For this, we wanted historical weather information for the airports. Unfortunately, we could not find any publicly available source that provides a suitable (free) dataset. However, Wunderground provides a web interface that can be queried with the IATA/ICAO code of an airport (see e.g. http://www.wunderground.com/history/airport/EDDF/2005/10/3/DailyHistory.html?req_city=Frankfurt+%2F+Main&req_state=&req_statename=Germany&reqdb.zip=00000&reqdb.magic=5&reqdb.wmo=10637). A small script allowed us to scrape historical data for individual airports. For each airport and day we extract the following fields:

**events**: a list of strings describing weather events, e.g. "Rain", "Fog", "Snow"
**humidity**: humidity, measured in percent
**precipitation**: precipitation, measured in inches
**sealevelpressure**: pressure at sea level, in inches
**snowdepth**: snow depth, in inches
**snowfall**: snowfall, in inches
**temperature**: temperature, in degrees Fahrenheit
**visibility**: visibility, in miles
**windspeed**: wind speed, in miles per hour

One drawback of this method is similar to getting the data from the BTS: requests are processed slowly on the server side, and a large number of requests is necessary to match the huge BTS dataset (ca. 15 minutes for a single year and a single airport). Thus, we decided to focus only on the weather at John F. Kennedy International Airport (New York City) and Boston Logan International Airport. Although we used just these two airports, we gained some very valuable insights into the effect of weather on delays (see the process notebook for the exploratory analysis). The detailed code for the web scraper can be found in the src folder; it has not been included in this notebook, as it is not part of our actual analysis and would affect readability. The scraper creates a file weather_data.json in the data folder that is used for the further analysis.
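The scraper itself is not reproduced here, but the layout of the weather_data.json file it produces matters for the weather subset created in section 3.1 below. The following is only a minimal sketch of how such a file could be assembled; fetch_daily_history is a hypothetical placeholder for the actual Wunderground request and parsing logic in the src folder.

import json

def fetch_daily_history(airport, date):
    """Hypothetical placeholder: return a dict with the fields listed above
    (events, humidity, precipitation, ...) for one airport and one day."""
    raise NotImplementedError('see the scraper in the src folder')

def scrape_weather(airports, dates, outfile='data/weather_data.json'):
    # one list of daily records per airport code; dates are given as 'YYYYMMDD' strings
    weather = {}
    for airport in airports:
        weather[airport] = [{'date': d, 'data': fetch_daily_history(airport, d)}
                            for d in dates]
    with open(outfile, 'w') as f:
        json.dump(weather, f)

# e.g. scrape_weather(['JFK', 'BOS'], all_days_of_2014)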

2 Data Preparation Process


In [2]:
# import required modules for data preparation tasks
import requests, zipfile, StringIO
import pandas as pd
import random
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
import re
import json
import os

2.1 Get the Main Delay Data for 2014 from Downloaded zip Files

First, we want to open and combine the zipped data files for each month of the delay data that have been downloaded according to the process outlined in the data acquisition section above. With more than 400,000 recorded flights each month, the dataset is extremely large. We therefore decided to focus on the subset of all flights in 2014 to get the most relevant flight information without missing certain months (which could be important when investigating seasonality effects).


In [3]:
# read all months of 2014 and merge them into one data frame
rawData = pd.DataFrame()
months = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
for m in months:
    z = zipfile.ZipFile('cache/{y}{mo}.zip'.format(y=str(2014), mo = m))
    rawData = rawData.append(pd.read_csv(z.open(z.namelist()[0])))
    print "Downloaded", m
# reset index of complete dataset for delays to prepare merging in next step
rawData.reset_index(inplace=True)


Downloaded 01
Downloaded 02
Downloaded
/Users/David/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py:1170: DtypeWarning: Columns (69,74) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)
/Users/David/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py:1170: DtypeWarning: Columns (63,68,69,74) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)
 03
Downloaded 04
Downloaded 05
Downloaded
/Users/David/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py:1170: DtypeWarning: Columns (42,69,74) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)
/Users/David/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py:1170: DtypeWarning: Columns (68,69,74) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)
 06
Downloaded 07
Downloaded 08
Downloaded 09
Downloaded 10
Downloaded 11
Downloaded 12
/Users/David/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py:1170: DtypeWarning: Columns (42,63,68,69,74) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)

The columns we now have in the dataset are:


In [4]:
rawData.columns


Out[4]:
Index([u'index', u'YEAR', u'QUARTER', u'MONTH', u'DAY_OF_MONTH',
       u'DAY_OF_WEEK', u'FL_DATE', u'UNIQUE_CARRIER', u'AIRLINE_ID',
       u'CARRIER', u'TAIL_NUM', u'FL_NUM', u'ORIGIN', u'ORIGIN_CITY_NAME',
       u'ORIGIN_STATE_ABR', u'ORIGIN_STATE_FIPS', u'ORIGIN_STATE_NM',
       u'ORIGIN_WAC', u'DEST', u'DEST_CITY_NAME', u'DEST_STATE_ABR',
       u'DEST_STATE_FIPS', u'DEST_STATE_NM', u'DEST_WAC', u'CRS_DEP_TIME',
       u'DEP_TIME', u'DEP_DELAY', u'DEP_DELAY_NEW', u'DEP_DEL15',
       u'DEP_DELAY_GROUP', u'DEP_TIME_BLK', u'TAXI_OUT', u'WHEELS_OFF',
       u'WHEELS_ON', u'TAXI_IN', u'CRS_ARR_TIME', u'ARR_TIME', u'ARR_DELAY',
       u'ARR_DELAY_NEW', u'ARR_DEL15', u'ARR_DELAY_GROUP', u'ARR_TIME_BLK',
       u'CANCELLED', u'CANCELLATION_CODE', u'DIVERTED', u'CRS_ELAPSED_TIME',
       u'ACTUAL_ELAPSED_TIME', u'AIR_TIME', u'FLIGHTS', u'DISTANCE',
       u'DISTANCE_GROUP', u'CARRIER_DELAY', u'WEATHER_DELAY', u'NAS_DELAY',
       u'SECURITY_DELAY', u'LATE_AIRCRAFT_DELAY', u'FIRST_DEP_TIME',
       u'TOTAL_ADD_GTIME', u'LONGEST_ADD_GTIME', u'DIV_AIRPORT_LANDINGS',
       u'DIV_REACHED_DEST', u'DIV_ACTUAL_ELAPSED_TIME', u'DIV_ARR_DELAY',
       u'DIV_DISTANCE', u'DIV1_AIRPORT', u'DIV1_WHEELS_ON',
       u'DIV1_TOTAL_GTIME', u'DIV1_LONGEST_GTIME', u'DIV1_WHEELS_OFF',
       u'DIV1_TAIL_NUM', u'DIV2_AIRPORT', u'DIV2_WHEELS_ON',
       u'DIV2_TOTAL_GTIME', u'DIV2_LONGEST_GTIME', u'DIV2_WHEELS_OFF',
       u'DIV2_TAIL_NUM', u'DIV3_AIRPORT', u'DIV3_WHEELS_ON',
       u'DIV3_TOTAL_GTIME', u'DIV3_LONGEST_GTIME', u'DIV3_WHEELS_OFF',
       u'DIV3_TAIL_NUM', u'DIV4_AIRPORT', u'DIV4_WHEELS_ON',
       u'DIV4_TOTAL_GTIME', u'DIV4_LONGEST_GTIME', u'DIV4_WHEELS_OFF',
       u'DIV4_TAIL_NUM', u'DIV5_AIRPORT', u'DIV5_WHEELS_ON',
       u'DIV5_TOTAL_GTIME', u'DIV5_LONGEST_GTIME', u'DIV5_WHEELS_OFF',
       u'DIV5_TAIL_NUM', u'Unnamed: 93'],
      dtype='object')

However, we just need a subset of these columns for our analysis:


In [5]:
selectedColumns = [u'index', u'FL_DATE', u'UNIQUE_CARRIER', u'TAIL_NUM', u'FL_NUM', 
                   u'ORIGIN', u'DEST', u'CRS_DEP_TIME', u'DEP_TIME', u'DEP_DELAY', u'TAXI_OUT', 
                   u'WHEELS_OFF', u'WHEELS_ON', u'TAXI_IN', u'CRS_ARR_TIME', u'ARR_TIME', u'ARR_DELAY', 
                   u'CANCELLED', u'DIVERTED', u'CANCELLATION_CODE', u'AIR_TIME', u'DISTANCE', 
                   u'CARRIER_DELAY', u'WEATHER_DELAY', u'NAS_DELAY', u'SECURITY_DELAY', u'LATE_AIRCRAFT_DELAY',
                   u'ORIGIN_CITY_NAME', u'DEST_CITY_NAME']
rawData = rawData[selectedColumns]

2.2 Combine Data with External Aircraft Data

We also have two tables containing information about the aircraft and their manufacturers, available as comma-separated text files inside the AircraftInformation.zip archive in the externalData folder, as outlined in the acquisition section above. Both files are loaded below.


In [6]:
z = zipfile.ZipFile('externalData/AircraftInformation.zip')
# master table containing tail numbers of aircraft
df_master  = pd.DataFrame.from_csv(z.open('MASTER.txt'))
# detailed table containing information about manufacturer, age, etc.
df_aircrafts  = pd.DataFrame.from_csv(z.open('ACFTREF.txt'))

We can now join these two tables based on their common ID, which is stored in the column MFR MDL CODE of the master table and in the index of the aircraft table, respectively.


In [7]:
master = df_master[['MFR MDL CODE', 'YEAR MFR']].reset_index()
aircrafts = df_aircrafts['MFR'].reset_index()
master.columns = ['TAIL_NUM', 'CODE', 'YEAR']
aircrafts.columns = ['CODE', 'MFR']
joined = pd.merge(master, aircrafts, how='left', on='CODE')

We now join this aircraft information with our delay data and extend the original dataset with two new features: the year in which the aircraft was built (to determine its age) and the manufacturer.


In [8]:
delayFinal = rawData[['TAIL_NUM','UNIQUE_CARRIER']]
delayFinal.TAIL_NUM = delayFinal.TAIL_NUM.str.strip('N')
delaymfr = pd.merge(delayFinal, joined, how='left', on=['TAIL_NUM'])
rawData['AIRCRAFT_YEAR'] = delaymfr.YEAR
rawData['AIRCRAFT_MFR'] = delaymfr.MFR


/Users/David/anaconda/lib/python2.7/site-packages/pandas/core/generic.py:2177: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value

2.3 Combine Data with External Airport Location Data

Now we load an external dataset that contains the geolocations of each commercial airport in the world. We filter it to keep only the airports in the US and then assign the respective geocode of the origin airport to our delay dataset by merging both tables.


In [9]:
airportLocation = pd.DataFrame.from_csv('externalData/airport_codes_with_geo_name_ids_and_nl_names-2008-04-14.csv', header=None)
usAirports = airportLocation[airportLocation[4]=='US'].reset_index()
# we just need a subset of the columns (origin, latitude and longitude)
usAirports = usAirports[[0, 5, 6]]
usAirports.columns = ['ORIGIN', 'LAT', 'LONG']
complete2014Data = pd.merge(rawData, usAirports, how='left', on='ORIGIN')

In [10]:
# fraction of flights whose origin airport could not be matched to a geolocation
1.0*np.sum(complete2014Data.LAT.isnull())/complete2014Data.shape[0]


Out[10]:
0.007050916258277116

Just 0.7% of all flight origins could not be located, so the merge was quite successful.
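As an additional sanity check (not part of the original notebook run), one could look at which origin codes failed to match; the snippet below only uses columns already present in complete2014Data.

# origin airports without a matched geolocation, by number of flights
complete2014Data[complete2014Data.LAT.isnull()].ORIGIN.value_counts().head()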

2.4 Save the Main Final Dataset

The resulting dataframe complete2014Data is stored locally as a CSV file.


In [11]:
complete2014Data.to_csv('cache/complete2014Data.csv')

3 Creation of Data Subsets for Weather Analysis and Predictive Models

3.1 Create a Subset with External Weather Data for Selected Airports

As outlined in the previous section, we also scraped historical weather data for selected US airports from the web. This data can be used as additional features for each flight, providing information about the weather conditions at the airport of departure on the day of the flight. The following code assumes that the file weather_data.json exists in the data folder and that it contains the daily weather information for JFK (New York City) and BOS (Boston) for every day of 2014.


In [13]:
# load the weather file
weatherFile = os.path.join('data', 'weather_data.json')
with open(weatherFile) as infile:
    weatherDict = json.load(infile)

In [14]:
# extract the weather data for New York and Boston from the JSON file
frames = []

# create a data frame for the weather in New York (JFK)
for datapoint in weatherDict['JFK']:
    date = datapoint['date']
    frames.append(pd.DataFrame(datapoint['data'], index=['%s-%s-%s' % (date[0:4], date[4:6], date[6:8])]))
weather_jfk = pd.concat(frames).reset_index()

# create a data frame for the weather in Boston (BOS)
frames = []  # reset, otherwise the JFK rows would end up in the Boston frame as well
for datapoint in weatherDict['BOS']:
    date = datapoint['date']
    frames.append(pd.DataFrame(datapoint['data'], index=['%s-%s-%s' % (date[0:4], date[4:6], date[6:8])]))
weather_bos = pd.concat(frames).reset_index()

# keep just the departures from John F. Kennedy airport in New York City and Logan airport in Boston
jfk_delays = complete2014Data[complete2014Data.ORIGIN=='JFK']
bos_delays = complete2014Data[complete2014Data.ORIGIN=='BOS']

# merge the delays with the weather data frames created above
jfk_delayWeather = pd.merge(jfk_delays, weather_jfk, how='left', left_on='FL_DATE', right_on='index')
bos_delayWeather = pd.merge(bos_delays, weather_bos, how='left', left_on='FL_DATE', right_on='index')

jfk_bos_comparison = pd.concat([jfk_delayWeather, bos_delayWeather]).reset_index()

In [15]:
# save everything in a csv
jfk_bos_comparison.to_csv('cache/jfk_bos_weather_2014.csv', encoding='UTF-8')

3.2 Creation of the Prediction Datasets

Before evaluating any models on the data, we have to clean it a bit.

Pruning the data

When cleaning the data set, we have to remove the following entries:

  • flights that have been cancelled or diverted. We focus on predicting the delay. As a result, we also remove the columns associated with diverted flights.
  • columns that give away the answer. This is the case for many columns related to the arrival of the plane
  • rows where a value is missing

Note that data points have to be cleaned in this order because most flights have empty entries for the 'diverted' columns.


In [92]:
# columns to be dropped for the prediction dataset
flight_data_dropped = ['QUARTER', 'DAY_OF_MONTH', 'AIRLINE_ID', 'CARRIER', 'FL_NUM', 'TAIL_NUM']

location_data_dropped = ['ORIGIN_STATE_FIPS', 'ORIGIN_STATE_NM',\
                    'ORIGIN_WAC', 'DEST_STATE_FIPS', \
                    'DEST_STATE_NM', 'DEST_WAC']

departure_data_dropped = ['DEP_TIME', 'DEP_DELAY', 'DEP_DELAY_NEW', 'DEP_DEL15', 'DEP_DELAY_GROUP',\
                          'DEP_TIME_BLK', 'TAXI_OUT', 'WHEELS_OFF']

arrival_data_dropped = ['WHEELS_ON', 'TAXI_IN', 'ARR_TIME', 'ARR_DELAY_NEW',\
                       'ARR_DELAY_GROUP', 'ARR_TIME_BLK']

cancel_data_dropped = ['CANCELLED','CANCELLATION_CODE', 'DIVERTED']

summaries_dropped = ['CRS_ELAPSED_TIME', 'AIR_TIME', 'FLIGHTS']

cause_delay_dropped = ['CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY']

gate_return_dropped = ['FIRST_DEP_TIME', 'TOTAL_ADD_GTIME', 'LONGEST_ADD_GTIME']

diverted_data_dropped = ['DIV_AIRPORT_LANDINGS', 'DIV_REACHED_DEST', 'DIV_ACTUAL_ELAPSED_TIME',  \
                         'DIV_ARR_DELAY', 'DIV_DISTANCE', 'DIV1_AIRPORT', 'DIV1_WHEELS_ON', \
                         'DIV1_TOTAL_GTIME', 'DIV1_LONGEST_GTIME', 'DIV1_WHEELS_OFF', \
                         'DIV1_TAIL_NUM', 'DIV2_AIRPORT', 'DIV2_WHEELS_ON', \
                         'DIV2_TOTAL_GTIME', 'DIV2_LONGEST_GTIME', 'DIV2_WHEELS_OFF', \
                         'DIV2_TAIL_NUM', 'DIV3_AIRPORT', 'DIV3_WHEELS_ON', \
                         'DIV3_TOTAL_GTIME', 'DIV3_LONGEST_GTIME', 'DIV3_WHEELS_OFF', 'DIV3_TAIL_NUM', \
                         'DIV4_AIRPORT', 'DIV4_WHEELS_ON', 'DIV4_TOTAL_GTIME', 'DIV4_LONGEST_GTIME', \
                         'DIV4_WHEELS_OFF', 'DIV4_TAIL_NUM', 'DIV5_AIRPORT', 'DIV5_WHEELS_ON', \
                         'DIV5_TOTAL_GTIME', 'DIV5_LONGEST_GTIME', 'DIV5_WHEELS_OFF', 'DIV5_TAIL_NUM']

other_dropped = ['Unnamed: 93']

columns_dropped = flight_data_dropped + location_data_dropped + departure_data_dropped + arrival_data_dropped \
    + cancel_data_dropped + summaries_dropped + cause_delay_dropped + gate_return_dropped + diverted_data_dropped \
    + other_dropped

In [93]:
def clean(data, list_col):
    ''' 
    Creates a dataset by removing cancelled and diverted flights,
    excluding undesirable columns and dropping rows with missing values.

    Parameters:
    -----------

    data: pandas.DataFrame
       Flight dataframe  

    list_col: <list 'string'>
        Columns to exclude from the data set
    '''

    # security check to drop only columns that exist
    list_col = list(set(list_col) & set(data.columns))
    
    res = data[(data.CANCELLED == 0) & (data.DIVERTED == 0)]
    res.drop(list_col, axis=1, inplace=True)
    res.dropna(axis = 0, inplace = True)
    return res

In [95]:
%%time
data2014 = clean(complete2014Data, columns_dropped)
print data2014.columns


/usr/local/lib/python2.7/site-packages/ipykernel/__main__.py:19: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/usr/local/lib/python2.7/site-packages/ipykernel/__main__.py:20: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Index([u'index', u'FL_DATE', u'UNIQUE_CARRIER', u'ORIGIN', u'DEST',
       u'CRS_DEP_TIME', u'CRS_ARR_TIME', u'ARR_DELAY', u'DISTANCE',
       u'ORIGIN_CITY_NAME', u'DEST_CITY_NAME', u'AIRCRAFT_YEAR',
       u'AIRCRAFT_MFR', u'LAT', u'LONG'],
      dtype='object')
CPU times: user 4.48 s, sys: 1.53 s, total: 6.01 s
Wall time: 6.21 s

Filtering the data for active airlines only

As we want to predict delay times for future flights, we remove all flights operated by airlines that have since shut down.


In [124]:
df_active_airlines = pd.read_csv('data/cur_airlines.txt', header=None)
df_active_airlines.columns = ['UNIQUE_CARRIER']
df_active_airlines.head()


Out[124]:
UNIQUE_CARRIER
0 M3
1 JP
2 A3
3 EI
4 NG

In [125]:
filteredData2014 = data2014.merge(df_active_airlines, on='UNIQUE_CARRIER', how='inner')

A quick check reveals that the filtering was not actually necessary, as all airlines in the dataset are still active today.


In [126]:
filteredData2014.count()[0], data2014.count()[0]


Out[126]:
(4103597, 4103597)

In [127]:
# save data to cache
filteredData2014.to_csv('cache/linear_model_data.csv')