Author's Note
Good news everyone, Wes has announced a 2nd edition of Python for Data Analysis.
Dear Twitter friends: I have started the process of writing the 2nd edition of Python for Data Analysis -- coming late 2016 or early 2017!
— Wes McKinney (@wesmckinn) March 30, 2016
Part of the reason I started this series was because...
Method chaining is in style at the moment.
Over the past several releases, we've added methods that enable method chaining:

- assign (0.16.0): for adding new columns to a DataFrame in a method chain (inspired by dplyr's mutate)
- pipe (0.16.2): for including user-defined methods in method chains
- window methods (0.18.0): deprecated the pd.rolling_* and pd.expanding_* functions and made them NDFrame methods with a groupby-like API

My scripts will typically start off with a large-ish chain at the start getting things into a manageable state. It's nice to have all that munging done with, to get your data into a nice and tidy state where you can start to do science.
In [2]:
import numpy as np
import pandas as pd


def read(fp):
    df = (pd.read_csv(fp)
            .rename(columns=str.lower)
            .drop('unnamed: 36', axis=1)
            .pipe(extract_city_name)
            .pipe(time_to_datetime,
                  ['dep_time', 'arr_time', 'crs_arr_time', 'crs_dep_time'])
            .assign(fl_date=lambda x: pd.to_datetime(x['fl_date']),
                    dest=lambda x: pd.Categorical(x['dest']),
                    origin=lambda x: pd.Categorical(x['origin']),
                    tail_num=lambda x: pd.Categorical(x['tail_num']),
                    unique_carrier=lambda x: pd.Categorical(x['unique_carrier']),
                    cancellation_code=lambda x: pd.Categorical(x['cancellation_code'])))
    return df


def extract_city_name(df):
    '''
    Chicago, IL -> Chicago for origin_city_name and dest_city_name
    '''
    cols = ['origin_city_name', 'dest_city_name']
    city = df[cols].apply(lambda x: x.str.extract(r"(.*), \w{2}", expand=False))
    df = df.copy()
    df[cols] = city
    return df


def time_to_datetime(df, columns):
    '''
    Combine all time items into datetimes.

    2014-01-01,0914 -> 2014-01-01 09:14:00
    '''
    df = df.copy()

    def converter(col):
        timepart = (col.astype(str)
                       .str.replace(r'\.0$', '')  # NaNs force float dtype
                       .str.pad(4, fillchar='0'))
        return pd.to_datetime(df['fl_date'] + ' ' +
                              timepart.str.slice(0, 2) + ':' +
                              timepart.str.slice(2, 4),
                              errors='coerce')

    df[columns] = df[columns].apply(converter)
    return df
df = read("878167309_T_ONTIME.csv")
df.info()
I find them readable; some people don't. Both the code and the flow of execution read from top to bottom, unlike heavily nested function calls.
My favorite example demonstrating this comes from Hadley Wickham. Compare these two ways of telling the same story:
tumble_after(
broke(
fell_down(
fetch(went_up(jack_jill, "hill"), "water"),
jack),
"crown"),
"jill"
)
and
jack_jill %>%
went_up("hill") %>%
fetch("water") %>%
fell_down("jack") %>%
broke("crown") %>%
tumble_after("jill")
Even if you weren't aware that in R %>% (pronounced pipe) calls the function on the right with the thing on the left as its first argument, you can still make out what's going on. Compare that with the first style, where you need to unravel the code to figure out the order of execution and which arguments are being passed where.
Admittedly, you probably wouldn't write the first one. It'd be something like
on_hill = went_up(jack_jill, 'hill')
with_water = fetch(on_hill, 'water')
fallen = fell_down(with_water, 'jack')
broken = broke(fallen, 'crown')
after = tumble_after(broken, 'jill')
I don't like this version because I spend more time than I'd like to admit coming up with appropriate names for variables. That's bothersome when we don't really care about the on_hill variable. We're just passing it along to the next step.
There's a fourth way of writing the same thing. Suppose you owned a JackAndJill object and could define these methods on it. Then you'd have something like R's %>% example:
jack_jill = JackAndJill()
(jack_jill.went_up('hill')
.fetch('water')
.fell_down('jack')
.broke('crown')
.tumble_after('jill')
)
But the problem is you don't own the ndarray or DataFrame or DataArray, and the exact method you want may not exist.
Monkeypatching your own methods onto those classes is fragile.
It's not easy to correctly subclass pandas' DataFrame to extend it with your own methods.
Composition, where you create a class that holds onto a DataFrame internally, may be fine for your own code, but it won't interact well with the rest of the ecosystem.
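To see why, here's a minimal sketch of the composition approach (MyFrame and its method are hypothetical, purely for illustration):

import pandas as pd

class MyFrame:
    '''Wrap a DataFrame so we can define our own chainable methods.'''
    def __init__(self, df):
        self._df = df

    def went_up(self, where):
        # each custom step returns a new wrapper, so chaining works...
        return MyFrame(self._df.assign(location=where))

jack_jill = MyFrame(pd.DataFrame({'name': ['jack', 'jill']}))
result = jack_jill.went_up('hill')
# ...but third-party functions expect a plain DataFrame, so your code
# fills up with unwrapping and rewrapping: other_library(result._df)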
Perhaps you could submit a pull request to pandas implementing your method. But then you'd need to convince the maintainers that it's broadly useful enough to merit inclusion (and worth their time to maintain it). And DataFrame already has something like 250+ methods, so we're reluctant to add more.
Enter DataFrame.pipe. All the benefits of having your specific function as a method on the DataFrame, without us having to maintain it. A win-win.
jack_jill = pd.DataFrame()
(jack_jill.pipe(went_up, 'hill')
.pipe(fetch, 'water')
.pipe(fell_down, 'jack')
.pipe(broke, 'crown')
.pipe(tumble_after, 'jill')
)
This really is just left-to-right function application. The first argument to pipe, a callable, is called with the DataFrame on the left and any additional arguments you specify.
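To make that concrete, here's a minimal sketch (add_total is a made-up user-defined function):

import pandas as pd

def add_total(df, a, b):
    # hypothetical user-defined step: derive a new column from two others
    return df.assign(total=df[a] + df[b])

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# these two calls are equivalent
assert df.pipe(add_total, 'a', 'b').equals(add_total(df, 'a', 'b'))

If the function expects the DataFrame somewhere other than the first argument, you can pass a (callable, keyword_name) tuple instead; that's how the .pipe((sns.boxplot, 'data'), ...) call further down works.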
One drawback to excessively long chains is that debugging can be harder. If something looks wrong at the end, you don't have intermediate values to inspect. There's a close parallel here to Python's generators. Generators are great for keeping memory consumption down, but they can be hard to debug since values are consumed.
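One workaround is to temporarily drop an inspection step into the chain; a small sketch (peek is a hypothetical helper, not part of pandas):

def peek(df, note=''):
    # print a quick summary, then return the frame unchanged
    # so the chain keeps flowing
    print(note, df.shape)
    return df

flights = (df.dropna()
             .pipe(peek, 'after dropna')   # inspect the intermediate value
             .sort_values('dep_time'))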
For my typical exploratory workflow, this isn't really a big problem. I'm working with a single dataset that isn't being updated, and the path from raw data to usable data isn't that long.
For very large workflows, you'll probably want to move away from pandas to something more structured, like Airflow or Luigi.
When writing medium-sized ETL jobs in Python that will be run repeatedly, I'll use decorators to inspect and log properties about the DataFrames at each step of the process.
from functools import wraps
import logging
def log_shape(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        logging.info("%s,%s" % (func.__name__, result.shape))
        return result
    return wrapper


def log_dtypes(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        logging.info("%s,%s" % (func.__name__, result.dtypes))
        return result
    return wrapper
@log_shape
@log_dtypes
def load(fp):
    df = pd.read_csv(fp, index_col=0, parse_dates=True)
    return df  # the decorators log the returned DataFrame


@log_shape
@log_dtypes
def update_events(df, new_events):
    df.loc[new_events.index, 'foo'] = new_events
    return df
This plays nicely with engarde, a little library I wrote to validate data as it flows through the pipeline.
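Roughly, you declare expectations as decorators and they're checked on each function's output. A sketch, assuming engarde's decorator API (ed.none_missing and ed.is_shape, with -1 as a wildcard dimension) works as I remember it:

import engarde.decorators as ed

@ed.is_shape((-1, 10))   # assumed API: result must have 10 columns, any rows
@ed.none_missing()       # assumed API: result must have no missing values
def load(fp):
    return pd.read_csv(fp, index_col=0, parse_dates=True)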
Most pandas methods have an inplace keyword that's False by default.
In general, you shouldn't do inplace operations.
First, if you like method chains then you simply can't use inplace since the return value is None, terminating the chain.
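For example:

df = pd.DataFrame({'A': [1, 2]})

out = df.rename(columns=str.lower)                # returns a new DataFrame
ret = df.rename(columns=str.lower, inplace=True)  # mutates df, returns None
print(ret)   # None -- there's nothing left to chain on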
Second, I suspect people have a mental model of inplace operations happening, you know, inplace. That is, extra memory doesn't need to be allocated for the result. But that might not actually be true. There are many methods in pandas that look like
def dataframe_method(self, inplace=False):
    data = self if inplace else self.copy()
    # result = ...  (computed from `data`)
    if inplace:
        self._update_inplace(result)
    else:
        return result
There's a lot of defensive copying in pandas.
Part of this comes down to pandas being built on top of NumPy, and not having full control over how memory is handled and shared.
We saw it above when we defined our own functions extract_city_name and time_to_datetime.
Without the copy, adding the columns would modify the input DataFrame, which just isn't polite.
Finally, inplace operations don't make sense in projects like ibis or dask, where the expression you're writing isn't immediately executed.
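In those libraries, a chain builds up an expression graph and nothing executes until you explicitly ask for a result. A rough sketch with dask.dataframe, reusing the flight data file from above:

import dask.dataframe as dd

ddf = dd.read_csv("878167309_T_ONTIME.csv")   # lazy: nothing is read yet
expr = (ddf.dropna(subset=['dep_delay'])
           .groupby('unique_carrier')
           .dep_delay.mean())                 # still just an expression
result = expr.compute()                       # execution happens here

An inplace=True in the middle of that chain would have nothing to operate on, since no concrete result exists yet.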
I feel like we haven't done much actual coding so far. Let's fix that.
In [132]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style='white', context='talk')
Does a plane with multiple flights on the same day get backed up, causing later flights to be delayed more?
In [162]:
flights = (df[['fl_date', 'tail_num', 'dep_time', 'dep_delay', 'distance']]
           .dropna()
           .sort_values('dep_time')
           .assign(hour=lambda x: x.dep_time.dt.hour,  # used by the hourly plots below
                   turn=lambda x:
                       x.groupby(['fl_date', 'tail_num'])
                        .dep_time
                        .transform('rank').astype(int)))
fig, ax = plt.subplots(figsize=(15, 5))
sns.boxplot(x='turn', y='dep_delay', data=flights, ax=ax)
sns.despine()
plt.savefig('../content/images/mc_turn.svg', transparent=True)
Do flights later in the day have longer delays?
In [180]:
plt.figure(figsize=(15, 5))
(df[['fl_date', 'tail_num', 'dep_time', 'dep_delay', 'distance']]
    .dropna()
    .assign(hour=lambda x: x.dep_time.dt.hour)
    .query('5 < dep_delay < 600')
    .pipe((sns.boxplot, 'data'), 'hour', 'dep_delay'))
sns.despine()
plt.savefig('../content/images/delay_by_hour.svg', transparent=True)
Let's call anything above 10 hours (600 minutes) an outlier and throw it out.
In [164]:
fig, ax = plt.subplots(figsize=(15, 5))
sns.boxplot(x='hour', y='dep_delay', data=flights[flights.dep_delay < 600], ax=ax)
sns.despine()
plt.savefig('../content/images/mc_no_days.svg', transparent=True)
Let's condition on there being a delay at all.
In [166]:
fig, ax = plt.subplots(figsize=(15, 5))
sns.boxplot(x='hour', y='dep_delay', data=flights.query('5 < dep_delay < 600'), ax=ax)
sns.despine()
plt.savefig('../content/images/mc_delays.svg', transparent=True)
Which routes are the worst?
In [175]:
# GroupBy.agg accepts a dict of {column: {output_name: agg_func}}
air = (df.groupby(['origin', 'dest'])
         .agg({'dep_delay': {'dep_mean': 'mean', 'dep_count': 'count'},
               'arr_delay': {'arr_mean': 'mean', 'arr_count': 'count'}}))
air.columns = air.columns.droplevel()
In [171]:
air[air.arr_count > 50].sort_values('dep_mean', ascending=False).head(10)
Which airlines are the worst?
In [174]:
airlines = df.groupby('unique_carrier').dep_delay.agg(['mean', 'count'])
airlines['mean'].sort_values().plot.barh()
sns.despine()
B6 is JetBlue.
I wanted to try out scikit-learn's new Gaussian Process module, so here's a pretty picture.
In [190]:
planes = df.assign(year=df.fl_date.dt.year).groupby("tail_num")
delay = (planes.agg({"year": "count",
                     "distance": "mean",
                     "arr_delay": "mean"})
               .rename(columns={"distance": "dist",
                                "arr_delay": "delay",
                                "year": "count"})
               .query("count > 20 & dist < 2000"))
delay.head()
In [253]:
X = delay['dist']
y = delay['delay']
In [254]:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

kernel = (1.0 * RBF(length_scale=10.0, length_scale_bounds=(1e2, 1e4))
          + WhiteKernel(noise_level=.5, noise_level_bounds=(1e-1, 1e+5)))
gp = GaussianProcessRegressor(kernel=kernel,
                              alpha=0.0).fit(X.values.reshape(-1, 1), y)

X_ = np.linspace(X.min(), X.max(), 1000)
y_mean, y_cov = gp.predict(X_[:, np.newaxis], return_cov=True)
In [255]:
%matplotlib inline
sns.set(style='white', context='talk')
In [256]:
ax = delay.plot(kind='scatter', x='dist', y='delay', figsize=(12, 6),
                color='k', alpha=.25, s=delay['count'] / 10)
ax.plot(X_, y_mean, lw=2, zorder=9)
ax.fill_between(X_, y_mean - np.sqrt(np.diag(y_cov)),
                y_mean + np.sqrt(np.diag(y_cov)),
                alpha=0.25)

sizes = (delay['count'] / 10).round(0)
for area in np.linspace(sizes.min(), sizes.max(), 3).astype(int):
    plt.scatter([], [], c='k', alpha=0.7, s=area,
                label=str(area * 10) + ' flights')
plt.legend(scatterpoints=1, frameon=False, labelspacing=1)
ax.set_xlim(0, 2100)
ax.set_ylim(-20, 65)
sns.despine()
plt.tight_layout()
plt.savefig("../content/images/mc_flights.svg", transparent=True)
plt.savefig("../content/images/mc_flights.png")
Until next time.