In this notebook, we explore a sample of data from the U.S. Department of Transportation (US-DOT) Research and Innovative Technology Administration (RITA) Bureau of Transportation Statistics (BTS). The data comes from the On-Time Performance table:
This table contains on-time arrival data for non-stop domestic flights by major air carriers, and provides such additional items as departure and arrival delays, origin and destination airports, flight numbers, scheduled and actual departure and arrival times, cancelled or diverted flights, taxi-out and taxi-in times, air time, and non-stop distance.
For the purposes of this notebook, I have captured a subset of the table in a Cloudant database. We will start by connecting to the database and simply looking at the available data. Once we understand the content, we will try to answer the following questions about flights during the month of June, 2014:
To get to the data, we can use a Cloudant client for Python. We'll can install the official client by shelling out to bash
and running a pip
command right here.
In [1]:
!pip install cloudant
Now we'll import the cloudant
package we just installed and use it to connect to the read-only rita_transtats_2014_06
database in the parente
user account.
In [2]:
import cloudant
In [3]:
account = cloudant.Account('parente')
In [4]:
database = account.database('rita_transtats_2014_06')
The cloudant
package builds on the popular Python requests package. Almost every object that comes back from the API is a subclass of a requests
class. This means we can perform a HTTP GET against the database and get the JSON body of the response with a couple method calls.
In [5]:
database.get().json()
Out[5]:
From the above, we can see the database contains roughly 500,000 documents. We can grab a couple from the database to inspect locally.
In [6]:
items = []
for i, item in enumerate(database.all_docs(params={'include_docs' : True})):
if i > 1: break
items.append(item)
print items
The dictionary format is hard to read and contains metadata from Cloudant that we don't care about. Let's use the pandas
package to put the data in a tabular, HTML format instead.
In [7]:
import pandas
In [8]:
pandas.DataFrame([item['doc'] for item in items])
Out[8]:
Returning to the source of the data, we can get definitions for each of these fields.
For the purposes of the specific questions stated at the top of this notebook, we only need a subset of the available columns, namely delay metrics, origin and destination states, and the flight date. We'll ignore the other fields.
In [9]:
columns = [u'FL_DATE', u'ORIGIN_STATE_ABR', u'DEST_STATE_ABR', u'ARR_DEL15', u'ARR_DELAY_NEW', u'DEP_DEL15', u'DEP_DELAY_NEW', u'DISTANCE', u'DISTANCE_GROUP',]
Moving forward, we'll assume we only have 1 GB of RAM total. Since we're only dealing with half a million records here, we can probably pull the entire contents of the database into local memory. If the data proves too large, we can rely on the map/reduce and search capabilities in Cloudant to work with the data instead.
Being optimistic, we write a little loop that gets up to 20,000 docs at a time from the database. It stores the 20,000 in a simple Python list. Once the buffer reaches the threshold, we create a DataFrame from the buffer which reduces the data to just the fields we want. We do this chunking because appending to a DataFrame one row at a time is much slower.
In [10]:
%%time
dfs = []
buff = []
for i, item in enumerate(database.all_docs(params={'include_docs' : True})):
buff.append(item['doc'])
if i > 0 and i % 20000 == 0:
print 'Processed #{}'.format(i)
df = pandas.DataFrame(buff, columns=columns)
dfs.append(df)
buff = []
# don't forget the leftovers
df = pandas.DataFrame(buff, columns=columns)
dfs.append(df)
Now we can build one DataFrame by quickly concatenating all the subframes we built in the loop above.
In [11]:
df = pandas.concat(dfs)
At this point, we have two copies of all the data in memory, which is undesirable. Before we delete the temporary buffer and subframes to free up some RAM, let's ensure the DataFrame row count matches the document count in the database.
In [12]:
assert len(df) == database.get().json()['doc_count']
In [13]:
del dfs
del buff
In [14]:
!free -m
As one last step, we reset the index on the DataFrame so that it is a unique, monotonically increasing integer. As it stands, we have dupes in our index because of our chunking (i.e., each chunk starts at index 0). This reset will prove important in some of our later plots where the index must be unique per row.
In [15]:
df = df.reset_index(drop=True)
What is the distribution of departure delays of at least 15 minutes by state? Arrival delays?
Let's look at some basic information about delays to start and work up to delays grouped by state. Because the question asks about delays 15 minutes or longer, we'll focus on the DEP_DEL15
and ARR_DEL15
columns.
In [16]:
df.DEP_DEL15.value_counts() / len(df)
Out[16]:
In [17]:
df.ARR_DEL15.value_counts() / len(df)
Out[17]:
Roughly a quarter of all departures and a quarter of all arrivals have delays. We can look at the distribution more closely once we enable and configure plotting with matplotlib and seaborn.
In [18]:
%matplotlib inline
In [19]:
import matplotlib.pyplot as plt
In [20]:
import seaborn as sns
sns.set_palette("deep", desat=.6)
colors = sns.color_palette("deep")
sns.set_context(rc={"figure.figsize": (18, 5)})
Let's group by state now and sum the number of delays. We'll do it for both departure and arrival delays.
In [21]:
by_origin_state = df.groupby('ORIGIN_STATE_ABR')
departure_delay_counts = by_origin_state.DEP_DEL15.sum()
In [22]:
by_dest_state = df.groupby('DEST_STATE_ABR')
arrival_delay_counts = by_dest_state.ARR_DEL15.sum()
To plot, we'll put both series in a DataFrame so we can view the arrival and departure delays for each state side-by-side.
In [23]:
delay_df = pandas.DataFrame([departure_delay_counts, arrival_delay_counts]).T
In [24]:
delay_df.sort('DEP_DEL15', ascending=False).plot(kind='bar', title='Number of delayed flights by state')
Out[24]:
Big states with big airports appear to be in the top five. But we haven't accounted for how many total flights these states service. We should plot the percentage of flights that are delayed.
In [25]:
pct_departure_delay = departure_delay_counts / df.ORIGIN_STATE_ABR.value_counts()
pct_arrival_delay = arrival_delay_counts / df.DEST_STATE_ABR.value_counts()
Ranking states of origin by their percentage of departures tells a different story than the plot above. For example, here we see Illinois and Arkansas at the top of the list whereas IL was third in total departure delay counts and AR was ranked 25th or so. California, which is #2 in the number of total departure delays is only #17 in percentage of departures delayed. Not bad.
In [26]:
pct_departure_delay.order(ascending=False).plot(kind='bar', title='% flights with departure delays by origin state')
Out[26]:
Similarly, when we look at destination states ranked by percentage of arrivals delayed, we see some new states at the head of the list. For instance, Delaware, second to last in the total number of delays overall, has the highest percentage of arrival delays for inbound flights. Iowa and Kansas are also new entries near the top.
In [27]:
pct_arrival_delay.order(ascending=False).plot(kind='bar', color=colors[1], title='% flights with arrival delay by destination state')
Out[27]:
We can get a sense of the difference between the two percentages for each state by plotting them on the same axes. In the plot below, we find that most states see more arrival delays than departure delays. The disparity seems greatest for smaller, less populated states that don't have huge airports (e.g., DE, IA, RI). We can't say for sure without studying more data or perhaps correlating the disparity with the state's ranking in terms of the total number of flights it serviced. We'll leave that as an exercise for the future.
In [28]:
pct_delay_df = pandas.DataFrame([pct_departure_delay, pct_arrival_delay], index=['PCT_DEP_DEL15', 'PCT_ARR_DEL15']).T
In [29]:
pct_delay_df.sort('PCT_ARR_DEL15', ascending=False).plot(kind='bar', title='Overlapping % delay plots for comparison')
Out[29]:
Is there a tendency of flights from one state to another to experience a delay of 15 minutes or more on the arriving end?
While there are many ways to answer this question, we'll look at visualizations of two metrics:
First, we'll compute the support. We do so by grouping all of the arrival delay counts by the origin and destination.
In [30]:
from __future__ import division
In [31]:
delay_counts_df = df[['ORIGIN_STATE_ABR', 'DEST_STATE_ABR', 'ARR_DEL15']].groupby(['ORIGIN_STATE_ABR', 'DEST_STATE_ABR']).sum()
delay_counts_df.head()
Out[31]:
We divide each (origin → destination) delay count by the total number of flights during the period.
In [32]:
support = (delay_counts_df / len(df))
support.head()
Out[32]:
We unstack the multiple indices so that we have a N x N matrix with origins as rows and destinations as columns.
In [33]:
support = support.unstack()
support.head()
Out[33]:
Unfortunately, we now have a multilevel set of columns. One way to remove the outer level is to rotate the matrix, reset the outer index to drop it, and then rotate it back.
In the resulting matrix, each cell represents the proportion of total, system-wide flights that were delayed between an (origin → destination) pair.
In [34]:
support = support.T.reset_index(level=0, drop=True).T
support.head()
Out[34]:
At this point, we have a DataFrame that we can query but no clear idea of where to start looking. A visualization of the entire DataFrame can help us find interesting pairs. We borrow and slightly modify some code from seaborn
to plot our asymmetric matrix as a heatmap.
In [37]:
import numpy as np
In [38]:
def asymmatplot(plotmat, names=None, cmap="Greys", cmap_range=None, ax=None, **kwargs):
'''
Plot an asymmetric matrix with colormap and statistic values. A modification of the
symmatplot() function in Seaborn to show the upper-half of the matrix.
See https://github.com/mwaskom/seaborn/blob/master/seaborn/linearmodels.py for the original.
'''
if ax is None:
ax = plt.gca()
nvars = len(plotmat)
if cmap_range is None:
vmax = np.nanmax(plotmat) * 1.15
vmin = np.nanmin(plotmat) * 1.15
elif len(cmap_range) == 2:
vmin, vmax = cmap_range
else:
raise ValueError("cmap_range argument not understood")
mat_img = ax.matshow(plotmat, cmap=cmap, vmin=vmin, vmax=vmax, **kwargs)
plt.colorbar(mat_img, shrink=.75)
ax.xaxis.set_ticks_position("bottom")
ax.set_xticklabels(names, rotation=90)
ax.set_yticklabels(names)
minor_ticks = np.linspace(-.5, nvars - 1.5, nvars)
ax.set_xticks(minor_ticks, True)
ax.set_yticks(minor_ticks, True)
major_ticks = np.linspace(0, nvars - 1, nvars)
ax.set_xticks(major_ticks)
ax.set_yticks(major_ticks)
ax.grid(False, which="major")
ax.grid(True, which="minor", linestyle="-")
return ax
In the plot, gray boxes represent cases where there are no flights between the origin (row) and destination (column) states. We see mostly light yellowish boxes representing state pairs where delays occur, but in tiny numbers compared to the total number of system-wide flights.
We do see a couple very hot spots, namely in the (CA → CA) and (TX → TX) cells. We also see a few moderately hot spots, for example (TX → CA) and (FL → NY). We can interpret these cells as indicators of where delays tend to occur most across the entire set of flights.
In [39]:
fig, ax = plt.subplots(figsize=(18,18))
asymmatplot(support, names=support.columns, ax=ax, cmap='OrRd')
Out[39]:
To understand what the percentage of flights for a particular (origin → destination) state pair that are delayed, we can look at a second metric using the same visualization. Here we compute the total number of flights for each (origin → destination) pair.
In [40]:
trip_counts_df = df[['ORIGIN_STATE_ABR', 'DEST_STATE_ABR', 'FL_DATE']].groupby(['ORIGIN_STATE_ABR', 'DEST_STATE_ABR']).count()
To put trip counts DataFrame and our earlier delay_counts
DataFrame on the same axes, we rename the columns to COUNTS
in both cases.
In [41]:
delay_counts_df = delay_counts_df.rename_axis({'ARR_DEL15' : 'COUNTS'}, axis=1)
trip_counts_df = trip_counts_df.rename_axis({'FL_DATE' : 'COUNTS'}, axis=1)
Now we divide the delay counts by the total trip counts and perform the same transforms we did previous to produce the N by N matrix. In this case, each cell represents the proportion of flights between each (origin → destination) that were delayed.
In [42]:
mat = (delay_counts_df / trip_counts_df).unstack().T.reset_index(level=0, drop=True).T
In this second heatmap plot, the (CA → CA) and (TX → TX) hotspots from the firt visualization no longer stand out. Though there are many in-state delays for these two states, there are even more flights, keeping the percentage of delayed flights for these in-state trips lower than other routes.
To the contrary, we see some cases where all flights from one state to another had arrival delays:
We can also see some other moderately hot spots, such as (AK → NJ) and (OK → MN), which seem to have a higher percentage of delays than other state pairs.
One "crosshair" jumps out in the visualization: the row and column representing Illinois are nearly both filled with non-gray cells. On closer inspection, we see Illinois sends flights to and receives flights from every other state except one: TT, the state code abbreviation for U.S. Pacific Trust Territories and Possessions. And though it is difficult to make accurate relative value judgments from this visualization, it appears the run of cells in the row and column for Illinois are darker than most other row or column runs (e.g., GA).
In [43]:
fig, ax = plt.subplots(figsize=(18,18))
asymmatplot(mat, names=mat.columns, ax=ax, cmap='OrRd', cmap_range=(0., 1.0))
Out[43]:
Of course, this plot only shows proportions of delayed flights between two states and doesn't depict the number of flights between the two nor the magnitude of the delay. So while all flights between Rhode Island and Colorado were delayed, we need to keep in mind ...
In [44]:
print delay_counts_df.loc['RI', 'CO']
print trip_counts_df.loc['RI', 'CO']
there were only three of them!
A visualization that captures both the proportion of (origin → destination) flights delayed as well as the proportion of total flights represented by that state pair is yet another exercise for the future.
In [45]:
fig, ax = plt.subplots(figsize=(18,10))
sns.boxplot(df.ARR_DELAY_NEW, df.FL_DATE, ax=ax)
fig.autofmt_xdate()
Oh my, outliers! If the data are to be trusted, there's at least one flight every day that is over 500 minutes (8 hours) late in arriving. In the most extreme case, a flight scheduled on 2014-06-19 appears to have arrived 1800 minutes (30 hours) late.
Whether this information is accurate or not requires some fact checking against other historical sources. For now, we can turn off the fliers to get a better view of the interquartile range.
In [46]:
fig, ax = plt.subplots(figsize=(18,10))
sns.boxplot(df.ARR_DELAY_NEW, df.FL_DATE, ax=ax, showfliers=False)
fig.autofmt_xdate()
We see that the median arrival delay on most days is zero (or better than zero: this data column counts early arrivals as zeros). Some days see greater skew toward longer delays than others, particularly over the five day periods of Sunday, 2014-06-08 through Friday, 2014-06-13 and Monday, 2014-06-23 through Friday, 2014-06-27. There's also another period of elongation from Wednesday, 2014-06-18 through Thursday, 2014-06-19. These might relate to weather patterns during those weeks or an increase in number of passengers (e.g., summer vacations). Further study and sources of data are required to test these hypotheses.
In [47]:
!cal 6 2014
If you wish to take this exploration further, here are some questions you might consider addressing with additional thinking and data.