In this report we analyze on-time performance data for domestic flights in the USA in December 2014. Delays in airline traffic can be attributed to many factors, such as weather, security, scheduling inefficiencies, imbalance between demand and capacity at airports, and the propagation of late arrivals and departures between connecting flights. Our goal is to reveal patterns, or the lack thereof, in flight delays with respect to airport characteristics, carrier, and date and time of travel. More involved modelling of the different possible effects mentioned above is beyond the scope of this report.
There are four sections in the report. Since the iPython notebook format is used, the source code implementing the described computations is also presented in each section. Section I describes the steps for loading, merging and cleaning the data sets at hand. An exploratory analysis of selected attributes and their relation to on-time performance is given in Section II. Section III describes a logistic regression model for the estimation of delay probability. Finally, Section IV summarizes the report and provides some future directions.
Let us first import the modules that will be used:
In [1]:
from mpl_toolkits.basemap import Basemap
import numpy as np
import matplotlib.pyplot as plt
import csv
import xlrd
%matplotlib inline
In [2]:
book = xlrd.open_workbook('airports_new.xlt')
sheet = book.sheet_by_index(0)
airport_data = [[sheet.cell_value(i,j) for j in range(sheet.ncols)] for i in range(sheet.nrows)]
#convert to dictionary for easy look-up
airport_dict = {}
for j in range(len(airport_data[0])):
    key = airport_data[0][j]
    airport_dict[key] = [airport_data[i][j] for i in range(1,len(airport_data))]
book = xlrd.open_workbook('carriers.xls')
sheet = book.sheet_by_index(0)
#every other row in the 'carriers.xls' sheet is empty
carrier_data = [[sheet.cell_value(i,j) for j in range(sheet.ncols)]
                for i in range(0,sheet.nrows,2)]
#convert to dictionary for easy look-up
carrier_dict = {}
for j in range(len(carrier_data[0])):
    key = carrier_data[0][j]
    carrier_dict[key] = [carrier_data[i][j] for i in range(1,len(carrier_data))]
print('Fields in the additional carrier data set:')
print('-----------------------------------------')
for key in carrier_dict.keys():
    print(key)
print('')
print('Fields in the additional airport data set:')
print('-----------------------------------------')
for key in airport_dict.keys():
    print(key)
We downloaded the on-time performance data from the Bureau of Transportation Statistics for December 2014.
In [3]:
delay_data = []
f = open('532747144_T_ONTIME.csv', 'r')
reader = csv.reader(f)
delay_data_header = next(reader, None)
for row in reader:
    delay_data.append(row)
f.close()
In [4]:
for i,s in enumerate(delay_data_header):
    print(str(i) + ': ' + s)
The last column is empty, so let's remove it from our data.
In [5]:
delay_data = [d[:-1] for d in delay_data]
delay_data_header = delay_data_header[:-1]
We are concerned only with flights that actually took place. Therefore, let us remove the cancelled flights from the data.
In [6]:
#remove cancelled flights
delay_data = [d for d in delay_data if d[16] != '1.00']
Now a quick glance at some of the rows reveals that some flights have missing information. We remove them from the data for the sake of completeness. Note that columns 20:24 (the delay-cause fields) are empty when the arrival delay is <= 0.
In [7]:
#determine the rows with missing data:
rows_with_missing_data = []
for i in range(len(delay_data)):
    for j in range(20):
        if len(delay_data[i][j]) == 0:
            rows_with_missing_data.append(i)
            break
For example, observe that the flight below is missing its arrival delay and air time information, possibly because that particular flight was diverted.
In [8]:
i = rows_with_missing_data[0]
print('Example row in the data with missing entries:\n')
for j in range(len(delay_data[i])):
    print(delay_data_header[j] + ': ' + str(delay_data[i][j]))
In [9]:
#remove rows with missing entries:
delay_data = [delay_data[i] for i in range(len(delay_data)) if i not in rows_with_missing_data]
Now let's convert the fields with numerical values to float. Also note that the delay-cause fields (columns 20:24) are empty when the arrival delay is <= 0; we will fill those empty cells with zeros.
In [10]:
float_index = set([11,12,13,15,17,18,19,20,21,22,23,24])
for i in range(len(delay_data)):
    for j in float_index:
        if len(delay_data[i][j]) > 0:
            delay_data[i][j] = float(delay_data[i][j])
        else:
            #delay type fields
            delay_data[i][j] = 0.0
int_index = set([1,2])
for i in range(len(delay_data)):
    for j in int_index:
        delay_data[i][j] = int(delay_data[i][j])
Now, we assume that the dynamics of busy airports might differ significantly from those of less busy ones. We would like to discard flights to and from smaller airports so that the delay dynamics are roughly comparable across data points. To this end we sort all airports by their total number of incoming and outgoing flights in December 2014. We set the number of airports to be investigated to 50 via visual inspection of the flight counts at the busiest airports:
In [11]:
#get the sets of unique carriers and airports:
carrier_ID = set()
airport_ID = set()
for d in delay_data:
    carrier_ID.add(d[3])
    airport_ID.add(d[4])
    airport_ID.add(d[7])
#count total arrivals and departures for each airport
flight_count_dict = {iata: 0 for iata in airport_ID}
for d in delay_data:
    flight_count_dict[d[4]] += 1
    flight_count_dict[d[7]] += 1
pairs = []
for key, value in flight_count_dict.items():
    pairs.append((key, value))
#sort airports by total number of flights, busiest first
pairs.sort(key = lambda x: x[1], reverse = True)
In [12]:
c = [c for a,c in pairs]
a = [a for a,c in pairs]
plt.figure(figsize = (20,4))
N = 60
plt.plot(c[:N])
plt.xticks(range(N), a[:N], fontsize = 8)
plt.ylabel('Total Number of Flights')
plt.xlabel('Airport IATA')
plt.grid()
plt.axvline(49, color = 'r')
plt.show()
print('\n'+'Use data from the 50 busiest airports according to total number of incoming and outgoing domestic flights')
In [13]:
airports_to_keep = [a for a,c in pairs[:50]]
delay_data2 = [d for d in delay_data if (d[4] in airports_to_keep and d[7] in airports_to_keep)]
print('Size of the dataset is reduced from ' + str(len(delay_data)) + ' to ' + str(len(delay_data2)))
#replace the full dataset with the reduced one
delay_data = delay_data2
Now let's merge information of carriers and airports into two dictionaries 'carrier_info' and 'airport_info' for easy access during analysis.
In [14]:
#find out carrier names from carrier_data
carrier_info = {}
for code in carrier_ID:
    k = carrier_dict['Code'].index(code)
    carrier_info[code] = carrier_dict['Description'][k]
In [15]:
airport_info = {}
for iata in airports_to_keep:
    k = airport_dict['iata'].index(iata)
    airport_info[iata] = {key: airport_dict[key][k] for key in airport_dict.keys()}
The example above also reveals that some departure delays are ridiculously high. We can consider them outliers, as they were most probably caused by some irrelevant incident beyond the scope of this investigation. Let's plot the histogram of departure delays and determine a cut-off point for outliers. Note that early arrivals and departures are given as negative values. Alternatively, we could cut at the 95th percentile. Let's investigate:
In [16]:
dep_delay_time_vector = [d[11] for d in delay_data]
arr_delay_time_vector = [d[15] for d in delay_data]
print('Departure Delay Stats in minutes:')
print('--------------------------------')
print('95th percentile: ' + str(np.percentile(dep_delay_time_vector, 95)))
print('75th percentile: ' + str(np.percentile(dep_delay_time_vector, 75)))
print('5th percentile : ' + str(np.percentile(dep_delay_time_vector, 5)))
print('median : ' + str(np.median(dep_delay_time_vector)))
print('mean : ' + str(np.mean(dep_delay_time_vector)))
print('std : ' + str(np.std(dep_delay_time_vector)))
print('')
print('Arrival Delay Stats in minutes:')
print('--------------------------------')
print('95th percentile: ' + str(np.percentile(arr_delay_time_vector, 95)))
print('75th percentile: ' + str(np.percentile(arr_delay_time_vector, 75)))
print('5th percentile : ' + str(np.percentile(arr_delay_time_vector, 5)))
print('median : ' + str(np.median(arr_delay_time_vector)))
print('mean : ' + str(np.mean(arr_delay_time_vector)))
print('std : ' + str(np.std(arr_delay_time_vector)))
Let's plot histograms of departure and arrival delays in December 2014, as well as a scatter plot of departure versus arrival delays. Note that we restrict the range of data points to the [5th, 95th] percentile interval for the arrival and departure delay histograms.
In [17]:
arr_5th = np.percentile(arr_delay_time_vector, 5)
arr_95th = np.percentile(arr_delay_time_vector, 95)
dep_5th = np.percentile(dep_delay_time_vector, 5)
dep_95th = np.percentile(dep_delay_time_vector, 95)
fig = plt.figure(figsize = (16,3))
ax1 = plt.subplot(141)
ax2 = plt.subplot(142)
ax3 = plt.subplot(143)
ax4 = plt.subplot(144)
_,_,_ = ax1.hist(dep_delay_time_vector, bins = 30, range = [dep_5th, dep_95th])
ax1.set_xlabel('delay [min]')
ax1.set_ylabel('number of flights')
ax1.set_title('Departure Delay Histogram')
_,_,_ = ax2.hist(arr_delay_time_vector, bins = 30, range = [arr_5th, arr_95th])
ax2.set_xlabel('delay [min]')
ax2.set_title('Arrival Delay Histogram')
ax2.set_ylabel('number of flights')
_,_,_ = ax3.hist([a-b for a,b in zip(arr_delay_time_vector,dep_delay_time_vector)], bins = 30)
ax3.set_xlabel('delay [min]')
ax3.set_title('Arrival-Departure Delay Histogram')
ax3.set_ylabel('number of flights')
corr_coef = np.corrcoef(dep_delay_time_vector,arr_delay_time_vector)[0,1]
ax4.scatter(dep_delay_time_vector,arr_delay_time_vector)
ax4.set_xlim([-50,1500])
ax4.set_ylim([-50,1500])
ax4.set_title('correlation coefficient: %2.2f' %(corr_coef) )
ax4.set_xlabel('departure delay [min]')
ax4.set_ylabel('arrival delay [min]')
plt.tight_layout()
plt.show()
As expected, departure delays and arrival delays are highly correlated. Let us first remove outliers in terms of departure delay. The 95th percentile corresponds to a departure delay of 69 minutes, which is not too drastic, so we remove flights with departure delays larger than 69 minutes. Note the very large departure delay times in the scatter plot; we assume those extreme values were governed by unusual events, such as storms or erupting volcanoes, and should be removed from our data.
In [18]:
N = len(dep_delay_time_vector)
delay_data = [delay_data[i] for i in range(N) if dep_delay_time_vector[i] < 69]
Next, let us see if we have outliers in the arrival delays after the removal of departure delay outliers.
In [19]:
dep_delay_time_vector = [d[11] for d in delay_data]
arr_delay_time_vector = [d[15] for d in delay_data]
print('Departure Delay Stats in minutes:')
print('--------------------------------')
print('95th percentile: ' + str(np.percentile(dep_delay_time_vector, 95)))
print('75th percentile: ' + str(np.percentile(dep_delay_time_vector, 75)))
print('5th percentile : ' + str(np.percentile(dep_delay_time_vector, 5)))
print('median : ' + str(np.median(dep_delay_time_vector)))
print('mean : ' + str(np.mean(dep_delay_time_vector)))
print('std : ' + str(np.std(dep_delay_time_vector)))
print('')
print('Arrival Delay Stats in minutes:')
print('--------------------------------')
print('95th percentile: ' + str(np.percentile(arr_delay_time_vector, 95)))
print('75th percentile: ' + str(np.percentile(arr_delay_time_vector, 75)))
print('5th percentile : ' + str(np.percentile(arr_delay_time_vector, 5)))
print('median : ' + str(np.median(arr_delay_time_vector)))
print('mean : ' + str(np.mean(arr_delay_time_vector)))
print('std : ' + str(np.std(arr_delay_time_vector)))
arr_5th = np.percentile(arr_delay_time_vector, 5)
arr_95th = np.percentile(arr_delay_time_vector, 95)
dep_5th = np.percentile(dep_delay_time_vector, 5)
dep_95th = np.percentile(dep_delay_time_vector, 95)
fig = plt.figure(figsize = (16,3))
ax1 = plt.subplot(141)
ax2 = plt.subplot(142)
ax3 = plt.subplot(143)
ax4 = plt.subplot(144)
ax1.boxplot(arr_delay_time_vector)
ax1.set_ylabel('arrival delay [min]')
ax1.set_title('Arrival Delay Box Plot')
_,_,_ = ax2.hist(arr_delay_time_vector, bins = 30, range = [arr_5th, arr_95th])
ax2.set_xlabel('delay [min]')
ax2.set_title('Arrival Delay Histogram')
ax2.set_ylabel('number of flights')
_,_,_ = ax3.hist([a-b for a,b in zip(arr_delay_time_vector,dep_delay_time_vector)], bins = 30)
ax3.set_xlabel('delay [min]')
ax3.set_title('Arrival-Departure Delay Histogram')
ax3.set_ylabel('number of flights')
corr_coef = np.corrcoef(dep_delay_time_vector,arr_delay_time_vector)[0,1]
ax4.scatter(dep_delay_time_vector,arr_delay_time_vector)
ax4.set_xlim([-20,100])
ax4.set_ylim([-50,300])
ax4.set_title('correlation coefficient: %2.2f' %(corr_coef) )
ax4.set_xlabel('departure delay [min]')
ax4.set_ylabel('arrival delay [min]')
plt.tight_layout()
plt.show()
Notice that the correlation between departure delay and arrival delay is reduced to 0.75. The distribution of the difference between arrival and departure delays is sharply peaked, with most points in the [-50, 50] minute range. The scatter plot also reveals that points with arrival delay greater than ~125 minutes lie somewhat outside the main cluster. Based on these observations, we treat arrival delays greater than 125 minutes as outliers. It would have been interesting to investigate the causes of these large delays; however, we are concerned with the common patterns in the on-time performance of airline traffic.
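As a side note, a more automatic cut-off could be derived with a robust rule such as the median absolute deviation (MAD). The following is only an illustrative sketch on toy numbers, not part of this notebook's pipeline:

```python
import numpy as np

# Sketch of a MAD-based outlier rule as an alternative to the visual
# 125-minute cut-off (illustrative delay values, not the notebook's data).
delays = np.array([-10, -5, 0, 3, 7, 12, 20, 35, 60, 400])

med = np.median(delays)
mad = np.median(np.abs(delays - med))
# flag points more than ~5 scaled MADs above the median
cutoff = med + 5 * 1.4826 * mad
outliers = delays[delays > cutoff]
print('cutoff: %.1f, outliers: %s' % (cutoff, outliers))
```

The scaling constant 1.4826 makes the MAD consistent with the standard deviation for normally distributed data, so the rule is far less sensitive to extreme values than a mean-and-std rule.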
In [20]:
N = len(arr_delay_time_vector)
delay_data = [delay_data[i] for i in range(N) if arr_delay_time_vector[i] < 125]
In [21]:
delay_data_dict = {}
for j in range(len(delay_data_header)):
    key = delay_data_header[j]
    delay_data_dict[key] = [delay_data[i][j] for i in range(len(delay_data))]
In [22]:
#times are in hhmm format; approximate arrival and departure times by the nearest hour
#(note that late-night times such as 2355 round up to 24)
delay_data_dict['ARR_TIME'] = [round( float(v)*1e-2 ) for v in delay_data_dict['ARR_TIME']]
delay_data_dict['DEP_TIME'] = [round( float(v)*1e-2 ) for v in delay_data_dict['DEP_TIME']]
The dictionary 'airport_info' is indexed by the 'iata' code. We remind the reader that only the 50 busiest US airports were kept in the data set. Each airport entry also carries information on its location. Let's look at Boston's Logan Airport as an example:
In [23]:
print("Example: Info on Logan Airport: \n")
for key,value in airport_info['BOS'].items():
    print(key + ': ' + str(value))
The dictionary 'carrier_info' pairs carrier codes with airline names:
In [24]:
for key,value in carrier_info.items():
    print(key + ': ' + value)
In [25]:
#the description for 'US' includes pre-2007 carrier history; relabel it simply as US Airways
carrier_info['US'] = 'US Airways Inc.'
The main data set, 'delay_data_dict', is also a dictionary, where the keys are the fields and each field holds all the samples for that feature. Here are the fields one more time for reference. Note that 'UNIQUE_CARRIER' corresponds to the carrier codes in the carrier_info dictionary, whereas the DEST and ORIGIN fields are the 'iata' ids in the airport_info dictionary.
In [26]:
for key in delay_data_dict.keys():
    print(key)
As a sanity check, let's make sure that delay_data_dict does not contain flight information for carriers that are unknown to us:
In [27]:
s1 = set(delay_data_dict['UNIQUE_CARRIER'])
s2 = set(carrier_info.keys())
print(list(s1-s2))
print(list(s2-s1))
Let's look at the distribution of delay causes among all delays in 12/2014:
In [28]:
delays = [sum(delay_data_dict['CARRIER_DELAY']),
          sum(delay_data_dict['WEATHER_DELAY']),
          sum(delay_data_dict['NAS_DELAY']),
          sum(delay_data_dict['SECURITY_DELAY']),
          sum(delay_data_dict['LATE_AIRCRAFT_DELAY'])]
total = sum(delays)
delays = [100*d/total for d in delays]
print('Delay Cause Percentages:')
print('-----------------------')
print('Carrier delay : ' + str(delays[0]))
print('Weather delay : ' + str(delays[1]))
print('NAS delay : ' + str(delays[2]))
print('Security delay : ' + str(delays[3]))
print('Late Aircraft : ' + str(delays[4]))
One can say that most delays are caused by 'relative congestion' at the airports, since more than 98% of delay minutes fall under carrier, NAS and late-aircraft causes. Weather also seems to affect on-time performance. Please follow this link for the definitions of the delay types.
The airline traffic network is extremely complex, with interactions among many variables and propagation of delays during the day. Therefore, we need to be careful in our definition of late flights. We already established that departure delays are highly correlated with arrival delays.
We will use the following definitions of delay at airports: a departure is on time if its departure delay is at most 15 minutes, and an arrival is on time if the delay added after departure (arrival delay minus departure delay) is at most 15 minutes.
For carriers, we consider only late departures.
In [29]:
N = len(delay_data_dict['ORIGIN']) # N: sample size
carrier_performance = {}
airport_performance = {}
#airport on-time performance
for airport in airport_info.keys():
    #departures:
    ind = [i for i in range(N) if delay_data_dict['ORIGIN'][i] == airport]
    total_flights = len(ind)
    on_time_flights = sum( [delay_data_dict['DEP_DELAY'][i] <= 15 for i in ind] )
    #arrivals (count only the delay added after departure):
    ind = [i for i in range(N) if delay_data_dict['DEST'][i] == airport]
    total_flights += len(ind)
    on_time_flights += sum( [delay_data_dict['ARR_DELAY'][i] - delay_data_dict['DEP_DELAY'][i] <= 15 for i in ind] )
    if total_flights > 0:
        airport_performance[airport] = {'total_flights': total_flights,
                                        'on_time_flights': on_time_flights,
                                        'on_time_ratio': on_time_flights/total_flights}
#carrier on-time performance
for carrier in carrier_info.keys():
    #departures:
    ind = [i for i in range(N) if delay_data_dict['UNIQUE_CARRIER'][i] == carrier]
    total_flights = len(ind)
    on_time_flights = sum( [delay_data_dict['DEP_DELAY'][i] <= 15 for i in ind] )
    if total_flights > 0:
        carrier_performance[carrier] = {'total_flights': total_flights,
                                        'on_time_flights': on_time_flights,
                                        'on_time_ratio': on_time_flights/total_flights}
In [30]:
name = []
code = []
on_time = []
flights = []
for key in carrier_performance.keys():
    code.append(key)
    name.append(carrier_info[key])
    on_time.append(carrier_performance[key]['on_time_ratio'])
    flights.append(carrier_performance[key]['total_flights'])
name, code, on_time, flights = zip( *sorted( zip(name, code, on_time, flights), key = lambda x: x[3], reverse = True ) )
fig = plt.figure(figsize = (15,3))
width = .6
ax1 = plt.subplot(121)
ax1.bar(range(len(on_time)), [1- v for v in on_time], width = width)
ax1.set_xticks(np.arange(len(on_time)) + width/2)
ax1.set_xticklabels(name, rotation = 90)
ax1.set_title('On-time Performance of Carriers in 12/2014')
ax1.set_ylabel('delay ratio')
ax2 = plt.subplot(122)
ax2.bar(range(len(on_time)), flights, width = width)
ax2.set_xticks(np.arange(len(on_time)) + width/2)
ax2.set_xticklabels(name, rotation = 90)
ax2.set_ylabel('total #of flights')
ax2.set_title('#of Flights in 12/2014')
plt.show()
fig = plt.figure(figsize=(3,3))
plt.scatter([1- v for v in on_time], flights)
plt.xlabel('delay ratio')
plt.ylabel('total #of flights')
plt.grid()
plt.show()
The upper-left plot shows that overall on-time performance varies quite a bit from carrier to carrier, whereas no correlation is observed between a carrier's flight volume and its on-time performance. We decide to divide the carriers into performance categories according to their overall delay ratios as follows:
In [31]:
#find the airlines within each category:
no_unless_its_really_cheap = []
not_bad = []
way_to_go = []
for c,v in zip(code, on_time):
    r = 1-v
    if r > 0.20:
        no_unless_its_really_cheap.append(c)
    elif r <= 0.15:
        way_to_go.append(c)
    else:
        not_bad.append(c)
print('way_to_go carriers:')
print('------------------')
for c in way_to_go:
    print(carrier_info[c])
Let's visualize the traffic and on-time performance of all airports on a map of the USA.
In [32]:
lat = []
lon = []
name = []
on_time = []
flights = []
for key in airport_performance.keys():
    name.append(airport_info[key]['airport'])
    lat.append(airport_info[key]['lat'])
    lon.append(airport_info[key]['long'])
    on_time.append(airport_performance[key]['on_time_ratio'])
    flights.append(airport_performance[key]['total_flights'])
In [33]:
fig = plt.figure(figsize=[12,10])
m = Basemap(llcrnrlon=-119,llcrnrlat=22,urcrnrlon=-64,urcrnrlat=49,
            projection='lcc',lat_1=33,lat_2=45,lon_0=-95)
m.drawcoastlines(linewidth=1)
m.fillcontinents(color = 'green', lake_color = 'blue', alpha = 0.2)
m.drawcountries(linewidth=1)
x,y = m(lon, lat)
im = m.scatter(x,y, marker = 'o', s = np.array(flights)/10, c = on_time,
               cmap = 'autumn')
cb = m.colorbar(im,'bottom')
cb.set_label('on time percentage', fontsize = '14')
plt.show()
In the map above, airport locations are shown as circles color-coded according to on-time performance, with the area of each circle proportional to the total number of flights at that airport. As with the carriers, we observe no immediate relationship between flight volume and on-time percentage. One interesting question is whether there is a relationship between delays and an airport's closeness to either coast (east or west). Since longitude lines are nearly parallel to the east-west extent of the map of the US, we can measure an airport's closeness to a coast by its longitudinal distance from the middle of the map. The scatter plot and the delay-ratio-versus-distance plot below investigate this possibility.
In [34]:
middle_of_map = (min(lon)+max(lon))/2.0
distance_from_coasts = abs(np.array(lon)-np.array(middle_of_map))
fig = plt.figure(figsize = (14,5))
ax1 = plt.subplot(121)
im = ax1.scatter(flights, [1-v for v in on_time], marker = 'o', s = np.array(flights)/100, c = distance_from_coasts)
cb = plt.colorbar(im)
cb.set_label('distance from coast [longitude]', fontsize = '14')
ax1.set_xlabel('number of flights', fontsize = '14')
ax1.set_ylabel('delay ratio', fontsize = '14')
x,y = zip(*sorted(zip(distance_from_coasts, [1- v for v in on_time]), key = lambda x: x[0]))
fit = np.polyfit(x,y,1)
fit_fn = np.poly1d(fit)
ax2 = plt.subplot(122)
ax2.plot(x, fit_fn(x), '--k', label = 'linear fit')
ax2.plot(x,y,'o-', label = 'data')
ax2.legend()
ax2.set_xlabel('distance from coast [longitude]', fontsize = '14')
ax2.set_ylabel('delay ratio', fontsize = '14')
plt.show()
Observing the plots above, we can say that coastal distance and delay ratio are negatively correlated. Although longitude is a bit crude, and a more precise computation of coastal distance is possible, we choose to use it as a continuous variable (predictor) in our model.
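The claimed negative relationship can also be quantified numerically. Here is a minimal sketch using illustrative toy arrays in place of the notebook's distance and delay-ratio vectors:

```python
import numpy as np

# Toy stand-ins for distance_from_coasts and the per-airport delay ratios;
# the real notebook arrays would be used in practice.
distance = np.array([2.0, 5.0, 9.0, 14.0, 20.0, 27.0])     # |longitude - middle|
delay_ratio = np.array([0.24, 0.22, 0.20, 0.18, 0.17, 0.15])

corr = np.corrcoef(distance, delay_ratio)[0, 1]            # Pearson correlation
slope, intercept = np.polyfit(distance, delay_ratio, 1)    # linear-fit slope

print('correlation: %.2f, slope per degree of longitude: %.4f' % (corr, slope))
```

A negative correlation coefficient and a negative fitted slope would both confirm the visual trend in the plots.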
Finally in this section, we list the ten busiest airports in the US and their on-time performance.
In [35]:
name, on_time, flights = zip( *sorted( zip(name, on_time, flights), key = lambda x: x[2], reverse = True ) )
fig = plt.figure(figsize = (10,3))
width = .6
ax1 = plt.subplot(121)
ax1.bar(range(10), [1-v for v in on_time[:10]], width = width)
ax1.set_xticks(np.arange(10) + width/2)
ax1.set_xticklabels(name[:10], rotation = 90)
ax1.set_title('On-time Performance of Airports in 12/2014')
ax1.set_ylabel('delay ratio')
ax2 = plt.subplot(122)
ax2.bar(range(10), flights[:10], width = width)
ax2.set_xticks(np.arange(10) + width/2)
ax2.set_xticklabels(name[:10], rotation = 90)
ax2.set_title('#of Flights in 12/2014')
plt.show()
When considering delays, what passengers are really concerned about is the arrival delay. Considering the correlation between departure and arrival delays, as well as the possible accumulation of delays during the day, we count only arrival delays when analyzing daily trends.
In [36]:
total_flights_month = [0]*32
on_time_flights_month = [0]*32
avg_delay_month = [0]*32
total_flights_day = [0]*8
on_time_flights_day = [0]*8
avg_delay_day = [0]*8
total_flights_time = [0]*25
on_time_flights_time = [0]*25
avg_delay_time = [0]*25
N = len(delay_data_dict['ARR_DELAY']) #sample size
day_dict = {1:'mon',2:'tue',3:'wed',4:'thu',5:'fri',6:'sat',7:'sun'}
days = ['']*32
for i in range(N):
    j = delay_data_dict['DAY_OF_MONTH'][i]
    day = delay_data_dict['DAY_OF_WEEK'][i]
    t = delay_data_dict['ARR_TIME'][i]
    days[j] = day_dict[day] # keep list of days for indexing purposes
    delay = delay_data_dict['ARR_DELAY'][i]
    total_flights_month[j] += 1
    total_flights_day[day] += 1
    total_flights_time[t] += 1
    if delay <= 15:
        on_time_flights_month[j] += 1
        on_time_flights_day[day] += 1
        on_time_flights_time[t] += 1
    avg_delay_month[j] += delay
    avg_delay_day[day] += delay
    avg_delay_time[t] += delay
#fold the hour-0 bin into hour 24 so the [1:] slices below do not drop it
avg_delay_time[24] += avg_delay_time[0]
on_time_flights_time[24] += on_time_flights_time[0]
total_flights_time[24] += total_flights_time[0]
avg_delay_month = np.array(avg_delay_month[1:]) / np.array(total_flights_month[1:])
avg_delay_day = np.array(avg_delay_day[1:]) / np.array(total_flights_day[1:])
avg_delay_time = np.array(avg_delay_time[1:]) / np.array(total_flights_time[1:])
delay_ratio_month = 1.0 - np.array(on_time_flights_month[1:]) / np.array(total_flights_month[1:])
delay_ratio_day = 1.0 - np.array(on_time_flights_day[1:]) / np.array(total_flights_day[1:])
delay_ratio_time = 1.0 - np.array(on_time_flights_time[1:]) / np.array(total_flights_time[1:])
day = days[1:]
In [37]:
fig = plt.figure(figsize=(12,3))
plt.subplot(121)
plt.plot(avg_delay_day, 'o-')
plt.xticks(range(0,8), [day_dict[i] for i in range(1,8)])
plt.title('Average Delay (Early Arrivals Are Accounted)')
plt.ylabel('delay [min]')
plt.grid()
plt.subplot(122)
plt.plot(delay_ratio_day, 'o-')
plt.xticks(range(0,8), [day_dict[i] for i in range(1,8)])
plt.title('Delay Ratio')
plt.grid()
plt.show()
fig = plt.figure(figsize=(3,3))
plt.scatter(delay_ratio_day,avg_delay_day)
plt.xticks([0.14, 0.16, 0.20, 0.26])
plt.xlabel('delay ratio')
plt.ylabel('average delay [min]')
plt.grid()
plt.show()
It is interesting to observe that Tuesday is the day with the highest probability of delay. Note that in 2014, Christmas day fell on a Thursday, so the behaviour above may be due to congestion two days before Christmas. We investigate this possibility below when we analyze the daily patterns within the month. The scatter plot above also shows the correlation between average delay in minutes and delay ratio; similar behaviour is observed in the monthly and hourly patterns.
In [38]:
fig = plt.figure(figsize=(20,6))
plt.subplot(221)
plt.plot(avg_delay_month, 'o-')
plt.xticks(range(5,32,7), days[6::7])
plt.title('Average Delay (Early Arrivals Are Accounted)')
plt.ylabel('delay [min]')
plt.grid()
plt.subplot(222)
plt.plot(delay_ratio_month, 'o-')
plt.xticks(range(0,31),range(1,32))
plt.ylabel('delay ratio')
plt.title('Delay Ratio for the day of month')
plt.axhline(y = 0.20, color = 'r')
plt.axhline(y = 0.15, color = 'r')
plt.grid()
plt.subplot(224)
plt.plot(total_flights_month[1:], 'o-')
plt.xticks(range(5,32,7), days[6::7])
plt.title('Total Number of Flights')
plt.grid()
fig.tight_layout()
plt.show()
We notice the weekly periodicity of delay times and ratios, where Tuesday-Friday have higher delay ratios than Saturday-Monday. More interestingly, we also notice how the cycle breaks exactly one week before Christmas, on Thursday, December 18th. The first Tuesday and Wednesday also deviate from the general pattern, perhaps due to their closeness to Thanksgiving. Also notice the two peaks on December 23rd and 30th, both Tuesdays: even though the numbers of flights are similar to those of the preceding Tuesdays, the delay ratios are approximately doubled. Finally, we note that the day-of-month analysis is more informative than the day-of-week analysis, as the holiday season tends to disrupt the daily patterns.
Due to the several interactions between holidays and weekly patterns, we decide to simply categorize the days of the month as {good_day, bad_day, very_bad_day} according to the following definitions:
In [39]:
#find the days within each category:
very_bad_days = []
bad_days = []
good_days = []
for k in range(31):
    r = delay_ratio_month[k]
    if r > 0.20:
        very_bad_days.append(k+1)
    elif r <= 0.15:
        good_days.append(k+1)
    else:
        bad_days.append(k+1)
print('very_bad_days:')
print('-------------')
print(very_bad_days)
Finally, let's investigate hourly patterns.
In [40]:
fig = plt.figure(figsize=(20/31*24,6))
plt.subplot(221)
plt.plot(avg_delay_time, 'o-')
plt.xticks(range(0,24),range(0,24))
plt.title('Average Delay (Early Arrivals Are Accounted)')
plt.ylabel('delay [min]')
plt.xlabel('hour')
plt.grid()
plt.subplot(222)
plt.plot(delay_ratio_time, 'o-')
plt.axvline(x = 3, color = 'r')
plt.axvline(x = 16, color = 'r')
plt.axvline(x = 23, color = 'r')
plt.xticks(range(0,24),range(0,24))
plt.title('Delay Ratio for time of the day')
plt.xlabel('hour')
plt.ylabel('delay ratio')
plt.grid()
plt.subplot(224)
plt.plot(total_flights_time[1:], 'o-')
plt.xticks(range(0,24),range(0,24))
plt.title('Total number of Flights')
plt.xlabel('hour')
fig.tight_layout()
plt.grid()
plt.show()
Inspired by the plots above we define the following categories for the hour of the day:
In [41]:
#categorize the hours of the day:
morning = range(3,13)
afternoon = range(13,17)
evening = range(17,23)
night = [23,24,0,1,2]
In this section, a logistic regression model for the estimation of delay probability is described and implemented. Let us start with a summary of the variables identified in Section II: the carrier performance category, the day-of-month category, the arrival time-of-day category, and the coastal distance of the destination airport.
We stick to the definition of delay (using only the arrival delay) that we used when computing the date- and time-related patterns in Section II. Therefore, we define the target vector $Y$ with components equal to zero for on-time flights and one for delayed flights.
In [79]:
#create the data set dictionary and target vector Y
from sklearn.feature_extraction import DictVectorizer
training_set = []
Y = []
N = len(delay_data_dict['ARR_DELAY'])
for i in range(N):
    lon = airport_info[delay_data_dict['DEST'][i]]['long']
    coastal_dist = abs(lon - middle_of_map)
    arr_time = delay_data_dict['ARR_TIME'][i]
    if arr_time in morning:
        arr_time = 'morning'
    elif arr_time in afternoon:
        arr_time = 'afternoon'
    elif arr_time in evening:
        arr_time = 'evening'
    else:
        arr_time = 'night'
    arr_day = delay_data_dict['DAY_OF_MONTH'][i]
    if arr_day in good_days:
        arr_day = 'good_days'
    elif arr_day in bad_days:
        arr_day = 'bad_days'
    else:
        arr_day = 'very_bad_days'
    carrier = delay_data_dict['UNIQUE_CARRIER'][i]
    if carrier in no_unless_its_really_cheap:
        carrier = 'no_unless_its_really_cheap'
    elif carrier in not_bad:
        carrier = 'not_bad'
    else:
        carrier = 'way_to_go'
    training_set.append({'bias': 1.0, 'coastal_dist': coastal_dist,
                         'arr_time': arr_time, 'arr_day': arr_day, 'carrier': carrier})
    Y.append(int(delay_data_dict['ARR_DELAY'][i] > 15))
vec = DictVectorizer()
X = vec.fit_transform(training_set).toarray()
In [80]:
#train our logistic regression model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_score
Y = np.array(Y)
ratio0 = len(Y[Y==0])/len(Y)
model = LogisticRegression(fit_intercept = False)
model = model.fit(X, Y)
train_accuracy = model.score(X, Y)
print('Ratio of the on-time flights in the data-set: {}'.format(ratio0))
print('\nTraining score: {}'.format(train_accuracy))
print('\nClassification report on training data:\n')
Y_pred = model.predict(X)
print(classification_report(Y, Y_pred))
cv = StratifiedKFold(n_splits = 5)
cv_score = cross_val_score(model, X, Y, cv = cv)
print('\n5-fold cross validation score: {}'.format(np.mean(cv_score)))
It may help to take a look at the ROC curve to get more insight into the choice of the classification threshold:
In [63]:
from sklearn.metrics import roc_curve, auc
probas_ = model.predict_proba(X)
fpr, tpr, thresholds = roc_curve(Y, probas_[:,1])
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr)
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.plot([0, 1], [0, 1], '--k')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve (AUC = %.2f)' % roc_auc)
plt.show()
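One common way to turn the ROC curve into a concrete threshold choice (not something done in this notebook; the data below are synthetic) is to pick the threshold that maximizes Youden's J statistic, TPR - FPR:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic scores: "delayed" flights (label 1) tend to get higher
# predicted probabilities than "on-time" flights (label 0).
rng = np.random.RandomState(0)
y_true = np.array([0]*500 + [1]*500)
scores = np.concatenate([rng.beta(2, 5, 500), rng.beta(5, 2, 500)])

fpr, tpr, thresholds = roc_curve(y_true, scores)
j = tpr - fpr                 # Youden's J at each candidate threshold
best = np.argmax(j)
print('best threshold: %.2f (TPR=%.2f, FPR=%.2f)'
      % (thresholds[best], tpr[best], fpr[best]))
```

With the notebook's model, the same recipe would apply to the `thresholds` array returned above, trading false positives against false negatives explicitly rather than defaulting to 0.5.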
We have been fairly crude in our design of features, and logistic regression did not work well with the current model.
Let's also try a random forest classifier, although we believe the main issue lies in the feature engineering.
In [85]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
model_rf = RandomForestClassifier()
model_rf.fit(X, Y)
train_accuracy = model_rf.score(X, Y)
print('\nTraining score - Random Forest Classifier: {}'.format(train_accuracy))
print('\nClassification report on training data:\n')
Y_pred_rf = model_rf.predict(X)
print(classification_report(Y, Y_pred_rf))
cv = StratifiedKFold(n_splits = 5)
cv_score = cross_val_score(model_rf, X, Y, cv = cv)
print('\n5-fold cross validation score: {}'.format(np.mean(cv_score)))
The cross-validation score reported above is close to the training score, which increases our confidence that the model is not overfitting. Finally, let us report the coefficients of the logistic regression model:
In [86]:
features = vec.get_feature_names()
coeffs = model.coef_[0]
print('%34s %20s' %('Feature:', 'Coefficient:'))
print('%34s %20s' %('-'*34, '-'*20))
for f,c in zip(features, coeffs):
    print(('%34s %20.4f' %(f, c)))
An immediate comment on the coefficients of the categorical variables in our model: as we observed in Section II, flying in the morning of a 'good day' with an airline that has a good overall track record greatly reduces the risk of delay.
We conducted an analysis of the on-time performance data of US domestic flights in December 2014. We identified key aspects that could be used as features in a simple linear probability model of delays, and reported on a logistic regression method for estimating delay probability using the identified features.
Here we omit a more detailed analysis of the coefficients of each feature. That type of analysis would definitely be useful for assessing the importance of the different features in predicting delay probabilities in the context of our logistic regression model.
We also tried a random forest classifier on the training set. Neither classification method performed particularly well, most probably due to the poor feature engineering. We do not delve further into that issue here, as the main goal is an exploratory analysis of the data at hand.
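If class imbalance contributes to the weak performance (most flights in the data are on time), one plausible remedy is to re-weight the classes. The sketch below uses synthetic data and the standard scikit-learn class_weight option, not the notebook's variables:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic imbalanced data: roughly 20% "delayed" (1) vs 80% "on-time" (0),
# loosely mimicking the imbalance in the flight data.
rng = np.random.RandomState(1)
X = rng.randn(1000, 4)
y = (X[:, 0] + 0.5 * rng.randn(1000) > 0.85).astype(int)

plain = LogisticRegression().fit(X, y)
balanced = LogisticRegression(class_weight='balanced').fit(X, y)

# Re-weighting trades some overall accuracy for better recall on the
# rare (delayed) class.
r_plain = recall_score(y, plain.predict(X))
r_bal = recall_score(y, balanced.predict(X))
print('recall (plain): %.2f, recall (balanced): %.2f' % (r_plain, r_bal))
```

The same `class_weight='balanced'` option could be passed to the notebook's LogisticRegression and RandomForestClassifier as a cheap first experiment before any deeper feature engineering.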
Some possibilities to enhance the model and its performance can be listed as: