Exploration of Airline On-Time Performance

In this notebook, we explore a sample of data from the U.S. Department of Transportation (US-DOT) Research and Innovative Technology Administration (RITA) Bureau of Transportation Statistics (BTS). The data comes from the On-Time Performance table:

This table contains on-time arrival data for non-stop domestic flights by major air carriers, and provides such additional items as departure and arrival delays, origin and destination airports, flight numbers, scheduled and actual departure and arrival times, cancelled or diverted flights, taxi-out and taxi-in times, air time, and non-stop distance.

Questions

For the purposes of this notebook, I have captured a subset of the table in a Cloudant database. We will start by connecting to the database and simply looking at the available data. Once we understand the content, we will try to answer the following questions about flights during the month of June, 2014:

What is the distribution of departure delays of at least 15 minutes by state? Arrival delays?
Is there a tendency of flights from one state to another to experience a delay of 15 minutes or more on the arriving end?
How did arrival delay in minutes vary day-by-day?

Connect to Cloudant

To get to the data, we can use a Cloudant client for Python. We'll can install the official client by shelling out to bash and running a pip command right here.



In [1]:

    
!pip install cloudant









    



Requirement already satisfied (use --upgrade to upgrade): cloudant in /usr/local/lib/python2.7/dist-packages
Requirement already satisfied (use --upgrade to upgrade): requests-futures==0.9.4 in /usr/local/lib/python2.7/dist-packages (from cloudant)
Requirement already satisfied (use --upgrade to upgrade): requests>=1.2.0 in /usr/lib/python2.7/dist-packages (from requests-futures==0.9.4->cloudant)
Requirement already satisfied (use --upgrade to upgrade): futures>=2.1.3 in /usr/local/lib/python2.7/dist-packages (from requests-futures==0.9.4->cloudant)
Cleaning up...

Now we'll import the cloudant package we just installed and use it to connect to the read-only rita_transtats_2014_06 database in the parente user account.



In [2]:

    
import cloudant



In [3]:

    
account = cloudant.Account('parente')



In [4]:

    
database = account.database('rita_transtats_2014_06')

The cloudant package builds on the popular Python requests package. Almost every object that comes back from the API is a subclass of a requests class. This means we can perform a HTTP GET against the database and get the JSON body of the response with a couple method calls.



In [5]:

    
database.get().json()









    Out[5]:





{u'compact_running': False,
 u'db_name': u'rita_transtats_2014_06',
 u'disk_format_version': 5,
 u'disk_size': 160707408,
 u'doc_count': 502500,
 u'doc_del_count': 0,
 u'instance_start_time': u'0',
 u'other': {u'data_size': 265247850},
 u'purge_seq': 0,
 u'update_seq': u'502516-g1AAAADveJzLYWBgYMlgTmGQT0lKzi9KdUhJMtJLykxPyilN1UvOyS9NScwr0ctLLckBKmRKZEiy____f1YSA-PrcKJ1JTkAyaR6qMZX84nWmMcCJBkagBRQ736w5mgSNR-AaIbYvDgLACDLUQs'}

From the above, we can see the database contains roughly 500,000 documents. We can grab a couple from the database to inspect locally.



In [6]:

    
items = []
for i, item in enumerate(database.all_docs(params={'include_docs' : True})):
    if i > 1: break
    items.append(item)
print items









    



[{u'value': {u'rev': u'1-40fc15f608f95f0428e2e9d6b468e483'}, u'id': u'04da0d01eb0f15d5c56eb1399a000a35', u'key': u'04da0d01eb0f15d5c56eb1399a000a35', u'doc': {u'DISTANCE': 1947.0, u'DEST_AIRPORT_ID': 12892, u'ARR_DEL15': 0.0, u'ORIGIN_STATE_ABR': u'GA', u'_rev': u'1-40fc15f608f95f0428e2e9d6b468e483', u'ARR_DELAY_NEW': 0.0, u'UNIQUE_CARRIER': u'DL', u'ORIGIN_AIRPORT_ID': 10397, u'DISTANCE_GROUP': 8, u'DEP_DEL15': 0.0, u'_id': u'04da0d01eb0f15d5c56eb1399a000a35', u'DEST_STATE_ABR': u'CA', u'DEP_DELAY_NEW': 0.0, u'FL_DATE': u'2014-06-30'}}, {u'value': {u'rev': u'1-4dd24d56dc537210f49fe327c7773178'}, u'id': u'04da0d01eb0f15d5c56eb1399a001198', u'key': u'04da0d01eb0f15d5c56eb1399a001198', u'doc': {u'DISTANCE': 282.0, u'DEST_AIRPORT_ID': 13487, u'ARR_DEL15': 0.0, u'ORIGIN_STATE_ABR': u'NE', u'_rev': u'1-4dd24d56dc537210f49fe327c7773178', u'ARR_DELAY_NEW': 0.0, u'UNIQUE_CARRIER': u'DL', u'ORIGIN_AIRPORT_ID': 13871, u'DISTANCE_GROUP': 2, u'DEP_DEL15': 0.0, u'_id': u'04da0d01eb0f15d5c56eb1399a001198', u'DEST_STATE_ABR': u'MN', u'DEP_DELAY_NEW': 0.0, u'FL_DATE': u'2014-06-30'}}]

The dictionary format is hard to read and contains metadata from Cloudant that we don't care about. Let's use the pandas package to put the data in a tabular, HTML format instead.



In [7]:

    
import pandas



In [8]:

    
pandas.DataFrame([item['doc'] for item in items])









    Out[8]:






  
    
      
      ARR_DEL15
      ARR_DELAY_NEW
      DEP_DEL15
      DEP_DELAY_NEW
      DEST_AIRPORT_ID
      DEST_STATE_ABR
      DISTANCE
      DISTANCE_GROUP
      FL_DATE
      ORIGIN_AIRPORT_ID
      ORIGIN_STATE_ABR
      UNIQUE_CARRIER
      _id
      _rev
    
  
  
    
      0
       0
       0
       0
       0
       12892
       CA
       1947
       8
       2014-06-30
       10397
       GA
       DL
       04da0d01eb0f15d5c56eb1399a000a35
       1-40fc15f608f95f0428e2e9d6b468e483
    
    
      1
       0
       0
       0
       0
       13487
       MN
        282
       2
       2014-06-30
       13871
       NE
       DL
       04da0d01eb0f15d5c56eb1399a001198
       1-4dd24d56dc537210f49fe327c7773178

Returning to the source of the data, we can get definitions for each of these fields.

ARR_DEL15: Arrival Delay Indicator, 15 Minutes or More (1=Yes)
ARR_DEL15: Arrival Delay Indicator, 15 Minutes or More (1=Yes)
ARR_DELAY_NEW: Difference in minutes between scheduled and actual arrival time. Early arrivals set to 0.
DEP_DEL15: Departure Delay Indicator, 15 Minutes or More (1=Yes)
DEP_DELAY_NEW: Difference in minutes between scheduled and actual departure time. Early departures set to 0.
DEST_AIRPORT_ID: Destination Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused.
DEST_STATE_ABR: Destination Airport, State Code
DISTANCE: Distance between airports (miles)
DISTANCE_GROUP: Distance Intervals, every 250 Miles, for Flight Segment
FL_DATE: Flight Date (yyyymmdd)
ORIGIN_AIRPORT_ID: Origin Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused.
ORIGIN_STATE_ABR: Origin Airport, State Code
UNIQUE_CARRIER: Unique Carrier Code. When the same code has been used by multiple carriers, a numeric suffix is used for earlier users, for example, PA, PA(1), PA(2). Use this field for analysis across a range of years.

For the purposes of the specific questions stated at the top of this notebook, we only need a subset of the available columns, namely delay metrics, origin and destination states, and the flight date. We'll ignore the other fields.



In [9]:

    
columns = [u'FL_DATE', u'ORIGIN_STATE_ABR', u'DEST_STATE_ABR', u'ARR_DEL15', u'ARR_DELAY_NEW', u'DEP_DEL15', u'DEP_DELAY_NEW', u'DISTANCE', u'DISTANCE_GROUP',]

Moving forward, we'll assume we only have 1 GB of RAM total. Since we're only dealing with half a million records here, we can probably pull the entire contents of the database into local memory. If the data proves too large, we can rely on the map/reduce and search capabilities in Cloudant to work with the data instead.

Being optimistic, we write a little loop that gets up to 20,000 docs at a time from the database. It stores the 20,000 in a simple Python list. Once the buffer reaches the threshold, we create a DataFrame from the buffer which reduces the data to just the fields we want. We do this chunking because appending to a DataFrame one row at a time is much slower.



In [10]:

    
%%time
dfs = []
buff = []
for i, item in enumerate(database.all_docs(params={'include_docs' : True})):
    buff.append(item['doc'])
    if i > 0 and i % 20000 == 0:
        print 'Processed #{}'.format(i)
        df = pandas.DataFrame(buff, columns=columns)
        dfs.append(df)
        buff = []
# don't forget the leftovers
df = pandas.DataFrame(buff, columns=columns)
dfs.append(df)









    



Processed #20000
Processed #40000
Processed #60000
Processed #80000
Processed #100000
Processed #120000
Processed #140000
Processed #160000
Processed #180000
Processed #200000
Processed #220000
Processed #240000
Processed #260000
Processed #280000
Processed #300000
Processed #320000
Processed #340000
Processed #360000
Processed #380000
Processed #400000
Processed #420000
Processed #440000
Processed #460000
Processed #480000
Processed #500000
CPU times: user 29 s, sys: 1.92 s, total: 30.9 s
Wall time: 43.5 s

Now we can build one DataFrame by quickly concatenating all the subframes we built in the loop above.



In [11]:

    
df = pandas.concat(dfs)

At this point, we have two copies of all the data in memory, which is undesirable. Before we delete the temporary buffer and subframes to free up some RAM, let's ensure the DataFrame row count matches the document count in the database.



In [12]:

    
assert len(df) == database.get().json()['doc_count']



In [13]:

    
del dfs
del buff



In [14]:

    
!free -m









    



             total       used       free     shared    buffers     cached
Mem:           989        700        289          1         73        178
-/+ buffers/cache:        449        540
Swap:         2047         29       2017

As one last step, we reset the index on the DataFrame so that it is a unique, monotonically increasing integer. As it stands, we have dupes in our index because of our chunking (i.e., each chunk starts at index 0). This reset will prove important in some of our later plots where the index must be unique per row.



In [15]:

    
df = df.reset_index(drop=True)

Distribution of Delay Counts by State

What is the distribution of departure delays of at least 15 minutes by state? Arrival delays?

Let's look at some basic information about delays to start and work up to delays grouped by state. Because the question asks about delays 15 minutes or longer, we'll focus on the DEP_DEL15 and ARR_DEL15 columns.



In [16]:

    
df.DEP_DEL15.value_counts() / len(df)









    Out[16]:





0    0.731817
1    0.248993
dtype: float64



In [17]:

    
df.ARR_DEL15.value_counts() / len(df)









    Out[17]:





0    0.718265
1    0.258054
dtype: float64

Roughly a quarter of all departures and a quarter of all arrivals have delays. We can look at the distribution more closely once we enable and configure plotting with matplotlib and seaborn.



In [18]:

    
%matplotlib inline



In [19]:

    
import matplotlib.pyplot as plt



In [20]:

    
import seaborn as sns
sns.set_palette("deep", desat=.6)
colors = sns.color_palette("deep")
sns.set_context(rc={"figure.figsize": (18, 5)})

Let's group by state now and sum the number of delays. We'll do it for both departure and arrival delays.



In [21]:

    
by_origin_state = df.groupby('ORIGIN_STATE_ABR')
departure_delay_counts = by_origin_state.DEP_DEL15.sum()



In [22]:

    
by_dest_state = df.groupby('DEST_STATE_ABR')
arrival_delay_counts = by_dest_state.ARR_DEL15.sum()

To plot, we'll put both series in a DataFrame so we can view the arrival and departure delays for each state side-by-side.



In [23]:

    
delay_df = pandas.DataFrame([departure_delay_counts, arrival_delay_counts]).T



In [24]:

    
delay_df.sort('DEP_DEL15', ascending=False).plot(kind='bar', title='Number of delayed flights by state')









    Out[24]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f67ac978990>

Big states with big airports appear to be in the top five. But we haven't accounted for how many total flights these states service. We should plot the percentage of flights that are delayed.



In [25]:

    
pct_departure_delay = departure_delay_counts / df.ORIGIN_STATE_ABR.value_counts()
pct_arrival_delay = arrival_delay_counts / df.DEST_STATE_ABR.value_counts()

Ranking states of origin by their percentage of departures tells a different story than the plot above. For example, here we see Illinois and Arkansas at the top of the list whereas IL was third in total departure delay counts and AR was ranked 25th or so. California, which is #2 in the number of total departure delays is only #17 in percentage of departures delayed. Not bad.



In [26]:

    
pct_departure_delay.order(ascending=False).plot(kind='bar', title='% flights with departure delays by origin state')









    Out[26]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f67abb67e90>

Similarly, when we look at destination states ranked by percentage of arrivals delayed, we see some new states at the head of the list. For instance, Delaware, second to last in the total number of delays overall, has the highest percentage of arrival delays for inbound flights. Iowa and Kansas are also new entries near the top.



In [27]:

    
pct_arrival_delay.order(ascending=False).plot(kind='bar', color=colors[1], title='% flights with arrival delay by destination state')









    Out[27]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f67ad3fec10>

We can get a sense of the difference between the two percentages for each state by plotting them on the same axes. In the plot below, we find that most states see more arrival delays than departure delays. The disparity seems greatest for smaller, less populated states that don't have huge airports (e.g., DE, IA, RI). We can't say for sure without studying more data or perhaps correlating the disparity with the state's ranking in terms of the total number of flights it serviced. We'll leave that as an exercise for the future.



In [28]:

    
pct_delay_df = pandas.DataFrame([pct_departure_delay, pct_arrival_delay], index=['PCT_DEP_DEL15', 'PCT_ARR_DEL15']).T



In [29]:

    
pct_delay_df.sort('PCT_ARR_DEL15', ascending=False).plot(kind='bar', title='Overlapping % delay plots for comparison')









    Out[29]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f67a96854d0>

Delay Tendency

Is there a tendency of flights from one state to another to experience a delay of 15 minutes or more on the arriving end?

While there are many ways to answer this question, we'll look at visualizations of two metrics:

How many times does a delay occur for an (origin → destination) state pair over all flights during the time period? (This is the support of a simple association rule.)
What percentage of total flights from an origin to a destination are delayed during the time period?

First, we'll compute the support. We do so by grouping all of the arrival delay counts by the origin and destination.



In [30]:

    
from __future__ import division



In [31]:

    
delay_counts_df = df[['ORIGIN_STATE_ABR', 'DEST_STATE_ABR', 'ARR_DEL15']].groupby(['ORIGIN_STATE_ABR', 'DEST_STATE_ABR']).sum()
delay_counts_df.head()









    Out[31]:






  
    
      
      
      ARR_DEL15
    
    
      ORIGIN_STATE_ABR
      DEST_STATE_ABR
      
    
  
  
    
      AK
      AK
       351
    
    
      AZ
         5
    
    
      CA
        11
    
    
      CO
        21
    
    
      GA
         3

We divide each (origin → destination) delay count by the total number of flights during the period.



In [32]:

    
support = (delay_counts_df / len(df))
support.head()









    Out[32]:






  
    
      
      
      ARR_DEL15
    
    
      ORIGIN_STATE_ABR
      DEST_STATE_ABR
      
    
  
  
    
      AK
      AK
       0.000699
    
    
      AZ
       0.000010
    
    
      CA
       0.000022
    
    
      CO
       0.000042
    
    
      GA
       0.000006

We unstack the multiple indices so that we have a N x N matrix with origins as rows and destinations as columns.



In [33]:

    
support = support.unstack()
support.head()









    Out[33]:






  
    
      
      ARR_DEL15
    
    
      DEST_STATE_ABR
      AK
      AL
      AR
      AZ
      CA
      CO
      CT
      DE
      FL
      GA
      ...
      TT
      TX
      UT
      VA
      VI
      VT
      WA
      WI
      WV
      WY
    
    
      ORIGIN_STATE_ABR
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      AK
       0.000699
      NaN
            NaN
       0.000010
       0.000022
       0.000042
            NaN
      NaN
            NaN
       0.000006
      ...
      NaN
       0.000050
       0.000004
            NaN
      NaN
      NaN
       0.000209
            NaN
      NaN
      NaN
    
    
      AL
            NaN
      NaN
            NaN
            NaN
            NaN
       0.000064
            NaN
      NaN
       0.000080
       0.000364
      ...
      NaN
       0.000631
            NaN
       0.000018
      NaN
      NaN
            NaN
            NaN
      NaN
      NaN
    
    
      AR
            NaN
      NaN
            NaN
       0.000008
       0.000036
       0.000098
            NaN
      NaN
            NaN
       0.000163
      ...
      NaN
       0.000826
       0.000002
            NaN
      NaN
      NaN
            NaN
            NaN
      NaN
      NaN
    
    
      AZ
       0.000026
      NaN
       0.000008
       0.000129
       0.002559
       0.000352
            NaN
      NaN
       0.000086
       0.000113
      ...
      NaN
       0.000722
       0.000239
       0.000072
      NaN
      NaN
       0.000291
       0.000062
      NaN
      NaN
    
    
      CA
       0.000056
      NaN
       0.000008
       0.001847
       0.011355
       0.001409
       0.000014
      NaN
       0.000302
       0.000344
      ...
      NaN
       0.002113
       0.000557
       0.000406
      NaN
      NaN
       0.001423
       0.000068
      NaN
        0
    
  

5 rows × 53 columns

Unfortunately, we now have a multilevel set of columns. One way to remove the outer level is to rotate the matrix, reset the outer index to drop it, and then rotate it back.

In the resulting matrix, each cell represents the proportion of total, system-wide flights that were delayed between an (origin → destination) pair.



In [34]:

    
support = support.T.reset_index(level=0, drop=True).T
support.head()









    Out[34]:






  
    
      DEST_STATE_ABR
      AK
      AL
      AR
      AZ
      CA
      CO
      CT
      DE
      FL
      GA
      ...
      TT
      TX
      UT
      VA
      VI
      VT
      WA
      WI
      WV
      WY
    
    
      ORIGIN_STATE_ABR
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      AK
       0.000699
      NaN
            NaN
       0.000010
       0.000022
       0.000042
            NaN
      NaN
            NaN
       0.000006
      ...
      NaN
       0.000050
       0.000004
            NaN
      NaN
      NaN
       0.000209
            NaN
      NaN
      NaN
    
    
      AL
            NaN
      NaN
            NaN
            NaN
            NaN
       0.000064
            NaN
      NaN
       0.000080
       0.000364
      ...
      NaN
       0.000631
            NaN
       0.000018
      NaN
      NaN
            NaN
            NaN
      NaN
      NaN
    
    
      AR
            NaN
      NaN
            NaN
       0.000008
       0.000036
       0.000098
            NaN
      NaN
            NaN
       0.000163
      ...
      NaN
       0.000826
       0.000002
            NaN
      NaN
      NaN
            NaN
            NaN
      NaN
      NaN
    
    
      AZ
       0.000026
      NaN
       0.000008
       0.000129
       0.002559
       0.000352
            NaN
      NaN
       0.000086
       0.000113
      ...
      NaN
       0.000722
       0.000239
       0.000072
      NaN
      NaN
       0.000291
       0.000062
      NaN
      NaN
    
    
      CA
       0.000056
      NaN
       0.000008
       0.001847
       0.011355
       0.001409
       0.000014
      NaN
       0.000302
       0.000344
      ...
      NaN
       0.002113
       0.000557
       0.000406
      NaN
      NaN
       0.001423
       0.000068
      NaN
        0
    
  

5 rows × 53 columns

At this point, we have a DataFrame that we can query but no clear idea of where to start looking. A visualization of the entire DataFrame can help us find interesting pairs. We borrow and slightly modify some code from seaborn to plot our asymmetric matrix as a heatmap.



In [37]:

    
import numpy as np



In [38]:

    
def asymmatplot(plotmat, names=None, cmap="Greys", cmap_range=None, ax=None, **kwargs):
    '''
    Plot an asymmetric matrix with colormap and statistic values. A modification of the
    symmatplot() function in Seaborn to show the upper-half of the matrix.
    
    See https://github.com/mwaskom/seaborn/blob/master/seaborn/linearmodels.py for the original.
    '''
    if ax is None:
        ax = plt.gca()

    nvars = len(plotmat)

    if cmap_range is None:
        vmax = np.nanmax(plotmat) * 1.15
        vmin = np.nanmin(plotmat) * 1.15
    elif len(cmap_range) == 2:
        vmin, vmax = cmap_range
    else:
        raise ValueError("cmap_range argument not understood")

    mat_img = ax.matshow(plotmat, cmap=cmap, vmin=vmin, vmax=vmax, **kwargs)

    plt.colorbar(mat_img, shrink=.75)
 
    ax.xaxis.set_ticks_position("bottom")
    ax.set_xticklabels(names, rotation=90)
    ax.set_yticklabels(names)

    minor_ticks = np.linspace(-.5, nvars - 1.5, nvars)
    ax.set_xticks(minor_ticks, True)
    ax.set_yticks(minor_ticks, True)
    major_ticks = np.linspace(0, nvars - 1, nvars)
    ax.set_xticks(major_ticks)
    ax.set_yticks(major_ticks)
    ax.grid(False, which="major")
    ax.grid(True, which="minor", linestyle="-")

    return ax

In the plot, gray boxes represent cases where there are no flights between the origin (row) and destination (column) states. We see mostly light yellowish boxes representing state pairs where delays occur, but in tiny numbers compared to the total number of system-wide flights.

We do see a couple very hot spots, namely in the (CA → CA) and (TX → TX) cells. We also see a few moderately hot spots, for example (TX → CA) and (FL → NY). We can interpret these cells as indicators of where delays tend to occur most across the entire set of flights.



In [39]:

    
fig, ax = plt.subplots(figsize=(18,18))
asymmatplot(support, names=support.columns, ax=ax, cmap='OrRd')









    Out[39]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f679b621890>

To understand what the percentage of flights for a particular (origin → destination) state pair that are delayed, we can look at a second metric using the same visualization. Here we compute the total number of flights for each (origin → destination) pair.



In [40]:

    
trip_counts_df = df[['ORIGIN_STATE_ABR', 'DEST_STATE_ABR', 'FL_DATE']].groupby(['ORIGIN_STATE_ABR', 'DEST_STATE_ABR']).count()

To put trip counts DataFrame and our earlier delay_counts DataFrame on the same axes, we rename the columns to COUNTS in both cases.



In [41]:

    
delay_counts_df = delay_counts_df.rename_axis({'ARR_DEL15' : 'COUNTS'}, axis=1)
trip_counts_df = trip_counts_df.rename_axis({'FL_DATE' : 'COUNTS'}, axis=1)

Now we divide the delay counts by the total trip counts and perform the same transforms we did previous to produce the N by N matrix. In this case, each cell represents the proportion of flights between each (origin → destination) that were delayed.



In [42]:

    
mat = (delay_counts_df / trip_counts_df).unstack().T.reset_index(level=0, drop=True).T

In this second heatmap plot, the (CA → CA) and (TX → TX) hotspots from the firt visualization no longer stand out. Though there are many in-state delays for these two states, there are even more flights, keeping the percentage of delayed flights for these in-state trips lower than other routes.

To the contrary, we see some cases where all flights from one state to another had arrival delays:

(AR → UT)
(MT → NY)
(CO → RI) and (RI → CO)

We can also see some other moderately hot spots, such as (AK → NJ) and (OK → MN), which seem to have a higher percentage of delays than other state pairs.

One "crosshair" jumps out in the visualization: the row and column representing Illinois are nearly both filled with non-gray cells. On closer inspection, we see Illinois sends flights to and receives flights from every other state except one: TT, the state code abbreviation for U.S. Pacific Trust Territories and Possessions. And though it is difficult to make accurate relative value judgments from this visualization, it appears the run of cells in the row and column for Illinois are darker than most other row or column runs (e.g., GA).



In [43]:

    
fig, ax = plt.subplots(figsize=(18,18))
asymmatplot(mat, names=mat.columns, ax=ax, cmap='OrRd', cmap_range=(0., 1.0))









    Out[43]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f679b051950>

Of course, this plot only shows proportions of delayed flights between two states and doesn't depict the number of flights between the two nor the magnitude of the delay. So while all flights between Rhode Island and Colorado were delayed, we need to keep in mind ...



In [44]:

    
print delay_counts_df.loc['RI', 'CO']
print trip_counts_df.loc['RI', 'CO']









    



COUNTS    3
Name: (RI, CO), dtype: float64
COUNTS    3
Name: (RI, CO), dtype: int64

there were only three of them!

A visualization that captures both the proportion of (origin → destination) flights delayed as well as the proportion of total flights represented by that state pair is yet another exercise for the future.

Delay Distribution by Date

How did arrival delay in minutes vary day-by-day?

To address this question, we can group the ARR_DELAY_NEW column by date and look at their descriptive stats. A Tukey box plot by day is a reasonable way for us to start.



In [45]:

    
fig, ax = plt.subplots(figsize=(18,10))
sns.boxplot(df.ARR_DELAY_NEW, df.FL_DATE, ax=ax)
fig.autofmt_xdate()

Oh my, outliers! If the data are to be trusted, there's at least one flight every day that is over 500 minutes (8 hours) late in arriving. In the most extreme case, a flight scheduled on 2014-06-19 appears to have arrived 1800 minutes (30 hours) late.

Whether this information is accurate or not requires some fact checking against other historical sources. For now, we can turn off the fliers to get a better view of the interquartile range.



In [46]:

    
fig, ax = plt.subplots(figsize=(18,10))
sns.boxplot(df.ARR_DELAY_NEW, df.FL_DATE, ax=ax, showfliers=False)
fig.autofmt_xdate()

We see that the median arrival delay on most days is zero (or better than zero: this data column counts early arrivals as zeros). Some days see greater skew toward longer delays than others, particularly over the five day periods of Sunday, 2014-06-08 through Friday, 2014-06-13 and Monday, 2014-06-23 through Friday, 2014-06-27. There's also another period of elongation from Wednesday, 2014-06-18 through Thursday, 2014-06-19. These might relate to weather patterns during those weeks or an increase in number of passengers (e.g., summer vacations). Further study and sources of data are required to test these hypotheses.



In [47]:

    
!cal 6 2014









    



     June 2014        
Su Mo Tu We Th Fr Sa  
 1  2  3  4  5  6  7  
 8  9 10 11 12 13 14  
15 16 17 18 19 20 21  
22 23 24 25 26 27 28  
29 30

Going Further

If you wish to take this exploration further, here are some questions you might consider addressing with additional thinking and data.

How accurately can a model predict if a flight will be delayed or not using simple features like origin, destination, day of the week, etc.?
What factors (e.g., U.S. weather) help explain the greater median arrival delay and dispersion from 2014-06-08 to 2014-06-13?
How do the results above contrast with the results from applying the same analyses to data from June, 2001? June, 2002?

	ARR_DEL15	ARR_DELAY_NEW	DEP_DEL15	DEP_DELAY_NEW	DEST_AIRPORT_ID	DEST_STATE_ABR	DISTANCE	DISTANCE_GROUP	FL_DATE	ORIGIN_AIRPORT_ID	ORIGIN_STATE_ABR	UNIQUE_CARRIER	_id	_rev
0	0	0	0	0	12892	CA	1947	8	2014-06-30	10397	GA	DL	04da0d01eb0f15d5c56eb1399a000a35	1-40fc15f608f95f0428e2e9d6b468e483
1	0	0	0	0	13487	MN	282	2	2014-06-30	13871	NE	DL	04da0d01eb0f15d5c56eb1399a001198	1-4dd24d56dc537210f49fe327c7773178

		ARR_DEL15
ORIGIN_STATE_ABR	DEST_STATE_ABR
AK	AK	0.000699
	AZ	0.000010
	CA	0.000022
	CO	0.000042
	GA	0.000006