In [225]:
import zipfile

import pandas as pd
import requests  # not a dependency of engarde
import engarde.checks as ck

%matplotlib inline
pd.options.display.max_rows = 20
Just a couple quick examples.
In [213]:
# This will take a few minutes
r = requests.get("http://www.transtats.bts.gov/Download/On_Time_On_Time_Performance_2015_1.zip",
                 stream=True)
with open("otp-1.zip", "wb") as f:
    for chunk in r.iter_content(chunk_size=1024):
        f.write(chunk)
        f.flush()
r.close()
z = zipfile.ZipFile("otp-1.zip")
fp = z.extract('On_Time_On_Time_Performance_2015_1.csv')
In [144]:
columns = ['FlightDate', 'Carrier', 'TailNum', 'FlightNum',
           'Origin', 'OriginCityName', 'OriginStateName',
           'Dest', 'DestCityName', 'DestStateName',
           'DepTime', 'DepDelay', 'TaxiOut', 'WheelsOff',
           'WheelsOn', 'TaxiIn', 'ArrTime', 'ArrDelay',
           'Cancelled', 'Diverted', 'ActualElapsedTime',
           'AirTime', 'Distance', 'CarrierDelay', 'WeatherDelay',
           'NASDelay', 'SecurityDelay', 'LateAircraftDelay']

df = pd.read_csv('On_Time_On_Time_Performance_2015_1.csv', usecols=columns,
                 dtype={'DepTime': str})
# DepTime is an HHMM string; left-pad it to four digits so it parses cleanly
dep_time = df.DepTime.fillna('').str.pad(4, side='left', fillchar='0')
df['ts'] = pd.to_datetime(df.FlightDate + 'T' + dep_time,
                          format='%Y-%m-%dT%H%M')
df = df.drop(['FlightDate', 'DepTime'], axis=1)
Let's suppose that down the road our program can only handle certain carriers; an update to the data that adds a new carrier would violate an assumption we hold. We'll use the within_set check to verify our assumption.
In [226]:
carriers = ['AA', 'AS', 'B6', 'DL', 'US', 'VX', 'WN', 'UA', 'NK', 'MQ', 'OO',
            'EV', 'HA', 'F9']
df.pipe(ck.within_set, items={'Carrier': carriers}).Carrier.value_counts().head()
Out[226]:
Great, our assumption was true (at least for now).
Surely, we can't count on each flight having a Carrier, TailNum and FlightNum, right?
In [217]:
df.pipe(ck.none_missing, columns=['Carrier', 'TailNum', 'FlightNum'])
Note: this isn't too user-friendly yet. I'm planning to make the error messages more informative; I just haven't gotten around to it.
That said, you wouldn't really use engarde to discover whether those columns contain nulls. Instead, you might find that every flight in January has all of those fields, assume that holds generally, and only be surprised when next month's data includes a flight without a tail number.
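To make that failure mode concrete, here's a throwaway example (the data below is made up): a frame where one flight is missing its TailNum, so the check raises an AssertionError instead of silently passing the bad data along.
In [ ]:
# Made-up data to illustrate the failure mode described above:
# one flight is missing its TailNum, so the check raises.
bad = pd.DataFrame({'Carrier': ['AA', 'DL'],
                    'TailNum': ['N787AA', None],
                    'FlightNum': [1, 2]})
try:
    bad.pipe(ck.none_missing, columns=['TailNum'])
except AssertionError as e:
    print('caught:', e)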
Each of your checks can also be written as a decorator on a function that returns a DataFrame.
I really like how slick this is.
Let's do a nonsense example. Suppose we want to show the counts for each Carrier, but our UI designer worries that if things are too spread out the bar graph will look weird (again, a silly example). We'll assert that the counts are within a comfortable range and that each count is within 3 standard deviations of the mean.
In [266]:
import engarde.decorators as ed
@ed.within_range({'Counts!': (4000, 110000)})
@ed.within_n_std(3)
def pretty_counts(df):
    return df.Carrier.value_counts().to_frame(name='Counts!')
In [268]:
pretty_counts(df)
Out[268]:
No AssertionError was raised so we're all good.
I'll typically find myself with a handful of functions that define my ETL process from a flat file to whatever the end result is. Each function takes and returns a DataFrame, which layers nicely into a pipeline. Using decorators lets you make the checks without breaking up the flow of the pipeline.
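Here's a rough sketch of what that can look like. The function names, thresholds, and stages are hypothetical, and I'm assuming the decorators take the same arguments as the corresponding checks used above; the point is just that each stage declares its assumptions right where it's defined.
In [ ]:
# A hypothetical pipeline sketch -- the function names and thresholds are
# made up for illustration; each stage takes and returns a DataFrame.
@ed.none_missing(columns=['Carrier', 'TailNum', 'FlightNum'])
def load(path):
    return pd.read_csv(path, usecols=columns, dtype={'DepTime': str})

@ed.within_set({'Carrier': carriers})
def known_carriers(df):
    return df[df.Carrier.isin(carriers)]

@ed.within_n_std(4)
def summarize(df):
    return df.groupby('Carrier')[['DepDelay', 'ArrDelay']].mean()

# result = summarize(known_carriers(load(fp)))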