In this project, we aim to determine the “most punctual” commercial airline for domestic flights, based on 2015 flight data from the U.S. DOT.
Motivation: As the old saying goes, time is money. Most people prefer to save as much time as possible and to avoid delays when traveling. However, the everyday consumer has no control over avoiding a flight delay--unless he or she strategically chooses flights that are unlikely to be delayed in the first place. Although some people may be loyal to a certain airline for the quality of amenities offered, we believe that punctuality is the leading factor in determining a “good” airline. Through our analysis, we will reveal which airline one should pick to minimize the chance of a delayed flight.
In [1]:
import sys # system module
import pandas as pd # data package
import matplotlib.pyplot as plt # graphics module
import datetime as dt # date and time module
import numpy as np
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')
sns.set_palette("pastel") # set_palette applies the palette; color_palette only returns it
# check versions, make sure Python is running
print('Python version:', sys.version)
print('Pandas version: ', pd.__version__)
print('Today: ', dt.date.today())
Data Source: To access and use the data, it is easiest to download the files directly to your computer and import them into Jupyter Notebook from their location on your computer.
First, we access the 3 data files from the local file paths and save them to DataFrames:
[df].head() helps us see the data and variables we are dealing with in each file.
In [2]:
path = 'C:/Users/Ziqi/Desktop/Data Bootcamp/Project/airports.csv'
airports = pd.read_csv(path)
airports.head()
Out[2]:
In [3]:
airlines = pd.read_csv('C:/Users/Ziqi/Desktop/Data Bootcamp/Project/airlines.csv')
airlines.head()
Out[3]:
In [4]:
flights = pd.read_csv('C:/Users/Ziqi/Desktop/Data Bootcamp/Project/flights.csv', low_memory=False) # (this is a big data file)
flights.head()
Out[4]:
In [5]:
# number of rows and columns of each DataFrame
print('airports:',airports.shape)
print('airlines:',airlines.shape)
print('flights:',flights.shape)
We see that the data contain 322 airports, 14 airlines, and 5,819,079 flights.
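The file paths used above are absolute and specific to one machine. A minimal sketch of a more portable loader follows; `DATA_DIR` and `load_table` are hypothetical names introduced here for illustration, and the folder location is an assumption.

```python
from pathlib import Path

import pandas as pd

# DATA_DIR is a hypothetical location; point it at the folder holding the CSV files.
DATA_DIR = Path("data")


def load_table(name, data_dir=DATA_DIR, **read_csv_kwargs):
    """Read <data_dir>/<name>.csv into a DataFrame.

    Extra keyword arguments are passed straight through to pd.read_csv.
    """
    return pd.read_csv(Path(data_dir) / f"{name}.csv", **read_csv_kwargs)
```

With the three files in that folder, a call such as `load_table('flights', low_memory=False)` would stand in for the corresponding `pd.read_csv` call.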
In [6]:
# list of column names and datatypes in flights
flights.info()
In [7]:
flights.index
Out[7]:
In [8]:
# create new DataFrame with relevant variables
columns=['YEAR',
'MONTH',
'DAY',
'DAY_OF_WEEK',
'AIRLINE',
'FLIGHT_NUMBER',
'ORIGIN_AIRPORT',
'DESTINATION_AIRPORT',
'DEPARTURE_DELAY',
'ARRIVAL_DELAY',
'DIVERTED',
'CANCELLED',
'AIR_SYSTEM_DELAY',
'SECURITY_DELAY',
'AIRLINE_DELAY',
'LATE_AIRCRAFT_DELAY',
'WEATHER_DELAY']
flights2 = pd.DataFrame(flights, columns=columns)
flights2.head()
Out[8]:
In [9]:
# for later convenience, we will replace the airline codes with each airline's full name, using a dictionary
airlines_dictionary = dict(zip(airlines['IATA_CODE'], airlines['AIRLINE']))
flights2['AIRLINE'] = flights2['AIRLINE'].apply(lambda x: airlines_dictionary[x])
flights2.head()
Out[9]:
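The code-to-name replacement can be tried on a small hand-made table first. This is a sketch with illustrative values (the code `XX` is deliberately made up), using `Series.map` in place of `apply` with a lambda; the difference is that a code missing from the dictionary becomes NaN rather than raising a `KeyError`, which makes unmatched codes easy to list.

```python
import pandas as pd

# Toy stand-ins for the airlines and flights tables (values are illustrative).
airlines_toy = pd.DataFrame({
    "IATA_CODE": ["AS", "DL"],
    "AIRLINE": ["Alaska Airlines Inc.", "Delta Air Lines Inc."],
})
flights_toy = pd.DataFrame({"AIRLINE": ["DL", "AS", "DL", "XX"]})

# Same dictionary construction as in the cell above.
code_to_name = dict(zip(airlines_toy["IATA_CODE"], airlines_toy["AIRLINE"]))

# Series.map leaves codes that are not in the dictionary as NaN.
mapped = flights_toy["AIRLINE"].map(code_to_name)

# List the codes that could not be translated.
unknown_codes = flights_toy.loc[mapped.isna(), "AIRLINE"].unique()
print(unknown_codes)
```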
The DataFrame flights2 will serve as the foundation for our analysis on US domestic flight delays in 2015. We can further examine the data to determine which airline is the "most punctual".
First, we will rank the airlines by average arrival delay. We are mainly concerned about arrival delay because regardless of whether a flight departs on time, what matters most to the passenger is whether he or she arrives at the final destination on time. Of course, a significant departure delay may result in an arrival delay. However, airlines may include a buffer in the scheduled arrival time to ensure that passengers reach their destination at the promised time.
In [12]:
# create DataFrame with airlines and arrival delays
delays = flights2[['AIRLINE','DEPARTURE_DELAY','ARRIVAL_DELAY']]
# if we hadn't used a dictionary to change the airline names, this is the code we would have used to produce the same result:
#flights4 = pd.merge(airlines, flights3, left_on='IATA_CODE', right_on='AIRLINE', how='left')
#flights4.drop('IATA_CODE', axis=1, inplace=True)
#flights4.drop('AIRLINE_y', axis=1, inplace=True)
#flights4.rename(columns={'AIRLINE_x': 'AIRLINE'}, inplace=True)
delays.head()
Out[12]:
In [13]:
# group data by airline name, calculate average arrival delay for each airline in 2015
airline_av_delay = delays.groupby(['AIRLINE']).mean()
airline_av_delay
Out[13]:
In [153]:
# create bar graph of average delay time for each airline
airline_av_delay.sort_values('ARRIVAL_DELAY', ascending=True, inplace=True) # DataFrame.sort was removed from pandas
sns.set()
fig, ax = plt.subplots()
airline_av_delay.plot(ax=ax,
kind='bar',
title='Average Delay (mins)')
ax.set_ylabel('Average Minutes Delayed')
ax.set_xlabel('Airline')
plt.show()
The bar graph shows that Alaska Airlines has the shortest average arrival delay; in fact, the average Alaska Airlines flight arrives before its scheduled arrival time. At the other end, Spirit Airlines has the longest average arrival delay. Interestingly, none of the average arrival delays exceeds 15 minutes, so on the whole US domestic flights were fairly punctual in 2015.
Additionally, almost all of the airlines have an average departure delay greater than their average arrival delay (the exception being Hawaiian Airlines). This makes sense, since a departure delay can stem from a variety of factors at the departure airport, such as security, late passengers, or the late arrival of other flights. Despite a greater average departure delay, most airlines seem to make up time in the air, resulting in a shorter average arrival delay.
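The gap between departure and arrival delay can be made explicit as a column of its own. The sketch below uses made-up numbers, not the 2015 data, and the column name `MINUTES_RECOVERED` is introduced here only for illustration.

```python
import pandas as pd

# Illustrative numbers only, not taken from the 2015 data.
toy = pd.DataFrame({
    "AIRLINE": ["A", "A", "B", "B"],
    "DEPARTURE_DELAY": [10.0, 20.0, 5.0, 15.0],
    "ARRIVAL_DELAY": [4.0, 16.0, 8.0, 14.0],
})

# Average both delay columns per airline.
avg = toy.groupby("AIRLINE")[["DEPARTURE_DELAY", "ARRIVAL_DELAY"]].mean()

# Positive values mean the average flight arrived less late than it departed.
avg["MINUTES_RECOVERED"] = avg["DEPARTURE_DELAY"] - avg["ARRIVAL_DELAY"]
print(avg)
```

In this toy table airline A recovers time on average, while airline B loses a little.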
Now that we know how the airlines rank in terms of arrival delay, we can look at how many of each airline's flights were cancelled or diverted. We can then calculate a delay rate for each airline, i.e. the share of its total flights that were delayed in 2015, to determine which airlines are more likely to be delayed.
In [71]:
# new DataFrame with relevant variables
diverted_cancelled = flights2[['AIRLINE','DIVERTED', 'CANCELLED']]
diverted_cancelled.head()
Out[71]:
In [73]:
diverted_cancelled = diverted_cancelled.groupby(['AIRLINE']).sum()
In [52]:
# total number of flights scheduled by each airline in 2015
total_flights = flights2[['AIRLINE', 'FLIGHT_NUMBER']].groupby(['AIRLINE']).count()
total_flights.rename(columns={'FLIGHT_NUMBER': 'TOTAL_FLIGHTS'}, inplace=True)
total_flights
Out[52]:
In [154]:
# Tangent: for fun, we can see which airlines were dominant in the number of domestic flights
total_flights['TOTAL_FLIGHTS'].plot.pie(figsize=(12,12), rot=45, autopct='%1.0f%%', title='Market Share of Domestic Flights in 2015 by Airline')
Out[154]:
It appears that the three airlines with the largest shares of domestic flights in 2015 were Southwest (22%), Delta (15%), and American Airlines (12%).
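The shares drawn on the pie chart can also be obtained as numbers. A minimal sketch on a made-up list of airline codes, using `value_counts(normalize=True)`:

```python
import pandas as pd

# One entry per flight; the codes and counts are illustrative.
toy_airlines = pd.Series(["WN", "WN", "DL", "AA"], name="AIRLINE")

# normalize=True returns each airline's fraction of all rows.
share = toy_airlines.value_counts(normalize=True)

# Express the fractions as percentages.
share_pct = share * 100
print(share_pct)
```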
In [74]:
#resetting the index to merge the two DataFrames
total_flights2 = total_flights.reset_index()
diverted_cancelled2 = diverted_cancelled.reset_index()
In [81]:
# check
total_flights2
diverted_cancelled2
Out[81]:
In [89]:
# calculate diversion and cancellation rates (as fractions of total flights) for each airline
dc_rates = pd.merge(diverted_cancelled2, total_flights2, on='AIRLINE')
dc_rates['DIVERTION_RATE'] = dc_rates['DIVERTED']/dc_rates['TOTAL_FLIGHTS']
dc_rates['CANCELLATION_RATE'] = dc_rates['CANCELLED']/dc_rates['TOTAL_FLIGHTS']
dc_rates = dc_rates.set_index(['AIRLINE'])
dc_rates
Out[89]:
In [151]:
dc_rates[['DIVERTION_RATE','CANCELLATION_RATE']].plot.bar(legend=True, figsize=(13,11),rot=45)
Out[151]:
Overall, the chance of cancellation or diversion is very low, with the diversion rate almost nonexistent. Flights are rarely diverted, and only in extreme situations due to plane safety failures, attacks, or natural disasters. We could use the flight diversion rate as a proxy for the safety of flying in 2015, and are happy to see this rate well below 0.01 (that is, under 1% of flights). American Airlines and its partner American Eagle Airlines were the most likely to cancel a flight in 2015, while Hawaiian Airlines and Alaska Airlines were the least likely. (It is interesting to note that the two airlines based in the two states outside the contiguous U.S. are the least likely to cancel, despite having to travel the greatest distances.)
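The same rates can be computed in a single step. This sketch assumes that DIVERTED and CANCELLED are 0/1 indicator columns with no missing values, in which case the per-airline mean of each column equals the count of flagged flights divided by the total number of flights. The data are made up.

```python
import pandas as pd

# Made-up flights: airline A has four flights, airline B has two.
toy = pd.DataFrame({
    "AIRLINE": ["A", "A", "A", "A", "B", "B"],
    "DIVERTED": [0, 0, 0, 1, 0, 0],
    "CANCELLED": [1, 0, 0, 0, 0, 1],
})

# Mean of a 0/1 column = fraction of rows flagged (assumes no missing values).
rates = toy.groupby("AIRLINE")[["DIVERTED", "CANCELLED"]].mean()
print(rates)
```

The sum-then-merge approach used above gives the same fractions; the one-step version simply avoids the intermediate DataFrames.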
In [98]:
# create a DataFrame with all flights that had a positive arrival delay time
# (strictly greater than 0, so flights arriving exactly on time are not counted as delayed)
delayed = flights2['ARRIVAL_DELAY'] > 0
pos_delay = flights2[delayed]
pos_delay.head()
Out[98]:
In [99]:
# groupby function to determine how many flights had delayed arrival for each airline
pos_delay = pos_delay[['AIRLINE','ARRIVAL_DELAY']].groupby(['AIRLINE']).count()
In [100]:
pos_delay2 = pos_delay.reset_index()
In [133]:
# merge with total_flights to calculate percentage of flights that were delayed for each airline
delay_rates = pd.merge(pos_delay2, total_flights2, on='AIRLINE')
delay_rates['DELAY_RATE'] = delay_rates['ARRIVAL_DELAY']/delay_rates['TOTAL_FLIGHTS']
delay_rates = delay_rates.set_index(['AIRLINE'])
delay_rates.sort_values('DELAY_RATE', ascending=True, inplace=True) # DataFrame.sort was removed from pandas
delay_rates.reset_index()
Out[133]:
In [150]:
delay_rates[['DELAY_RATE']].plot.bar(legend=True, figsize=(13,11),rot=45)
Out[150]:
Spirit Airlines flights are the most likely to arrive late, and Delta Airlines flights are the least likely.
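The delay rate is the share of flights with a positive arrival delay, which can be computed as the mean of a boolean column. The sketch below uses made-up numbers and assumes that ARRIVAL_DELAY is missing (NaN) for flights that never arrived; it shows how the choice of denominator changes the rate.

```python
import numpy as np
import pandas as pd

# Made-up data: the NaN stands for a flight with no recorded arrival.
toy = pd.DataFrame({
    "AIRLINE": ["A", "A", "A", "A", "B", "B"],
    "ARRIVAL_DELAY": [12.0, -3.0, 0.0, np.nan, 5.0, -1.0],
})

# NaN > 0 evaluates to False, so the flight without an arrival counts as "not delayed".
is_late = toy["ARRIVAL_DELAY"] > 0

# Denominator: every scheduled flight.
rate_all = is_late.groupby(toy["AIRLINE"]).mean()

# Denominator: only flights that have an arrival delay recorded.
completed = toy.dropna(subset=["ARRIVAL_DELAY"])
rate_completed = (completed["ARRIVAL_DELAY"] > 0).groupby(completed["AIRLINE"]).mean()

print(rate_all)
print(rate_completed)
```

Dividing by all scheduled flights, as the analysis here does, gives a slightly lower rate for an airline with many flights lacking an arrival record than dividing by completed flights only.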
When we combine the diversion, cancellation, and delay rates, we see that delays account for the majority of flights that did not operate as scheduled, for every airline.
In [117]:
# combining the two into one DataFrame
all_rates = pd.merge(dc_rates.reset_index(), delay_rates.reset_index()).set_index(['AIRLINE'])
all_rates
Out[117]:
In [149]:
all_rates[['DIVERTION_RATE','CANCELLATION_RATE','DELAY_RATE']].plot.bar(legend=True, figsize=(13,10),rot=45)
Out[149]:
Clearly, delays are far more prevalent than diverted or cancelled flights. In conclusion, it appears that Delta Airlines was the most punctual domestic airline in 2015: it had the lowest delay rate upon arrival and the third-lowest cancellation rate. If punctuality is a passenger's top priority when flying, we therefore recommend Delta Airlines. The airline with the highest delay rate was Spirit Airlines, followed closely by Frontier. (While these airlines' flights are more likely to arrive late, they are known as two of the cheapest airlines to fly in the U.S. We do not examine ticket prices or affordability here, but the trade-off is still worth noting.)