In [39]:
import pandas as pd
%matplotlib inline
Simple stuff. We're loading in a CSV here, and we'll run the describe function over it to get the lay of the land.
In [44]:
df = pd.read_csv('data/ontime_reports_may_2015_ny.csv')
In [45]:
df.describe()
Out[45]:
In journalism, we're primarily concerned with using data analysis for two purposes:
We'll spend a little time looking at the first before we move on to the second.
Let's start with the longest delays:
In [5]:
df.sort_values('ARR_DELAY', ascending=False).head(1)
Out[5]:
One record isn't super useful, so we'll do 10:
In [47]:
df.sort_values('ARR_DELAY', ascending=False).head(10)
Out[47]:
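As an aside, pandas also has a shortcut for "give me the top N rows by this column." Here's a minimal sketch on a made-up four-row table (the carrier codes and delay values are invented for illustration, not taken from our CSV):

```python
import pandas as pd

# Toy stand-in for the flights table; these values are invented for illustration.
toy = pd.DataFrame({
    'CARRIER': ['AA', 'DL', 'UA', 'B6'],
    'ARR_DELAY': [12.0, 340.0, -5.0, 95.0],
})

# nlargest returns the n rows with the largest ARR_DELAY, sorted descending.
worst = toy.nlargest(2, 'ARR_DELAY')
print(worst)
```

On the real data, the equivalent would presumably be df.nlargest(10, 'ARR_DELAY').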
If we want, we can keep drilling down. Maybe we should also limit our inquiry to, say, flights departing from LaGuardia.
In [48]:
la_guardia_flights = df[df['ORIGIN'] == 'LGA']
la_guardia_flights.sort_values('ARR_DELAY', ascending=False).head(10)
Out[48]:
Huh, does LGA struggle more than usual to get its planes to Atlanta on time? Let's live dangerously and make a boxplot.
(Spoiler alert: JFK is marginally worse)
In [19]:
# All flights bound for Atlanta, from every origin airport in the sample
flights_to_atl = df[df['DEST'] == 'ATL']
flights_to_atl.boxplot('ACTUAL_ELAPSED_TIME', by='ORIGIN')
Out[19]:
And so on.
Of course, data journalists are also in the business of finding trends, so let's do some of that.
As good, accountability-minded reporters, we might be interested in each airline's on-time performance throughout our sample. Here's one way to check that:
In [49]:
df.groupby('CARRIER')['ARR_DELAY'].median()
Out[49]:
Huh. Looks like the median flight from most of these carriers tends to show up pretty early. How does that change when we look at the mean?
In [50]:
df.groupby('CARRIER')['ARR_DELAY'].mean()
Out[50]:
A little less generous. We can spend some time debating which portrayal is more fair, but the large difference between the two is still worth noting.
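To see why the two can disagree so much, here's a small sketch with two hypothetical carriers ('XX' and 'YY', with invented delay values). A single very late flight is enough to drag a carrier's mean upward while leaving its median untouched:

```python
import pandas as pd

# Invented data: carrier XX has one extreme delay, carrier YY has none.
toy = pd.DataFrame({
    'CARRIER': ['XX'] * 5 + ['YY'] * 5,
    'ARR_DELAY': [-10, -5, -3, 0, 300, -4, -2, 0, 2, 4],
})

# Compute both summaries side by side for each carrier.
summary = toy.groupby('CARRIER')['ARR_DELAY'].agg(['median', 'mean'])
print(summary)
```

For XX, the median flight still arrives early even though the mean is close to an hour late. That's the kind of gap a few outliers can produce.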
We can, of course, also drill down by origin airport:
In [51]:
df.groupby(['CARRIER', 'ORIGIN'])['ARR_DELAY'].median()
Out[51]:
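A two-level result like this can get long. One optional reshaping step is to pivot the inner level into columns so each carrier becomes a row and each airport a column. This is only a sketch on invented numbers, not the notebook's actual output:

```python
import pandas as pd

# Invented data for two carriers and two origin airports.
toy = pd.DataFrame({
    'CARRIER': ['AA', 'AA', 'AA', 'AA', 'DL', 'DL', 'DL'],
    'ORIGIN': ['JFK', 'JFK', 'LGA', 'LGA', 'JFK', 'JFK', 'LGA'],
    'ARR_DELAY': [10, 20, 0, 4, -5, 5, 30],
})

# Median delay per (carrier, origin) pair, then move ORIGIN into the columns.
medians = toy.groupby(['CARRIER', 'ORIGIN'])['ARR_DELAY'].median()
table = medians.unstack('ORIGIN')
print(table)
```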
And if we want a more user-friendly display ...
In [57]:
df.boxplot('ARR_DELAY', by='CARRIER')
Out[57]:
Up until now, we've spent a lot of time seeing how variables act in isolation -- mainly focusing on arrival delays. But sometimes we might also want to see how two variables interact. That's where correlation comes into play.
For example, let's test one of my personal suspicions that longer flights (measured in distance) tend to experience longer delays.
In [53]:
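# Restrict to numeric columns; text columns such as CARRIER can't be correlated
df.select_dtypes('number').corr()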
Out[53]:
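The full matrix compares every numeric column with every other one, which is more than we need to check a single hunch. A narrower sketch is to correlate just the two columns in question. The toy values below are invented, and the column name DISTANCE is an assumption about how the CSV labels flight distance:

```python
import pandas as pd

# Invented data; DISTANCE is an assumed column name for flight distance in miles.
toy = pd.DataFrame({
    'DISTANCE': [200, 500, 800, 1100, 1400],
    'ARR_DELAY': [5, 3, 12, 8, 20],
})

# Pearson correlation between the two columns: a single number between -1 and 1.
r = toy['DISTANCE'].corr(toy['ARR_DELAY'])
print(r)
```

A value near 0 would suggest little linear relationship, while a value near 1 or -1 would suggest a strong one. Either way, correlation alone doesn't tell us that distance causes the delays.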
And now we'll make a crude visualization, just to show off:
In [56]:
import matplotlib.pyplot as plt
# Plot the correlation matrix of the numeric columns as a color grid
plt.matshow(df.select_dtypes('number').corr())
Out[56]: