We'll work with a dataset of customer preferences on trains, available here. This is a static dataset and isn't being updated, but you could imagine that each month the Dutch authorities upload a new month's worth of data.

We can start by making some very basic assertions: that the dataset is the correct shape, and that a few columns have the correct dtypes. Assertions are written as decorators on functions that return a DataFrame.


In [5]:
import pandas as pd
import engarde.decorators as ed

pd.set_option('display.max_rows', 10)

dtypes = dict(
    price1=int,
    price2=int,
    time1=int,
    time2=int,
    change1=int,
    change2=int,
    comfort1=int,
    comfort2=int
)

@ed.is_shape((-1, 11))
@ed.has_dtypes(items=dtypes)
def unload():
    url = "http://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Train.csv"
    trains = pd.read_csv(url, index_col=0)
    return trains

In [6]:
df = unload()
df.head()


Out[6]:
   id  choiceid   choice  price1  time1  change1  comfort1  price2  time2  change2  comfort2
1   1         1  choice1    2400    150        0         1    4000    150        0         1
2   1         2  choice1    2400    150        0         1    3200    130        0         1
3   1         3  choice1    2400    115        0         1    4000    115        0         0
4   1         4  choice2    4000    130        0         1    3200    150        0         0
5   1         5  choice2    2400    150        0         1    3200    150        0         0

Notice two things: we only specified the dtypes for some of the columns, and we don't care about the length of the DataFrame (just its width), so we passed -1 for the first dimension of the shape.
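
To make the mechanics concrete, here is a minimal sketch of how decorators like these could be written. It is not engarde's actual source: `check_shape`, `check_dtypes`, and `load_toy` are hypothetical names, and the toy rows are made up. The sketch only assumes that -1 acts as a wildcard for a dimension and that columns not listed in `items` are left unchecked.

```python
from functools import wraps

import pandas as pd


def check_shape(shape):
    """Hypothetical decorator: assert the returned DataFrame's shape (-1 = any)."""
    def decorate(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            for expected, actual in zip(shape, result.shape):
                # -1 means "any size is fine for this dimension"
                assert expected == -1 or expected == actual, (
                    "expected shape {}, got {}".format(shape, result.shape))
            return result
        return wrapper
    return decorate


def check_dtypes(items):
    """Hypothetical decorator: assert dtypes only for the columns in `items`."""
    def decorate(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            for column, dtype in items.items():
                assert result[column].dtype == dtype, (
                    "{} has dtype {}".format(column, result[column].dtype))
            return result
        return wrapper
    return decorate


@check_shape((-1, 3))                 # any number of rows, exactly 3 columns
@check_dtypes({"price1": "int64"})    # choice and time1 are not checked
def load_toy():
    # Made-up rows standing in for the real CSV.
    return pd.DataFrame({
        "choice": ["choice1", "choice2"],
        "price1": [2400, 4000],
        "time1": [150.0, 130.0],
    })


toy = load_toy()
```

Because each wrapper returns the DataFrame unchanged, the decorators can be stacked in any order without affecting the function's result.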

Since people are rational, their first choice is surely going to be better in at least one way than their second choice. This is fundamental to our analysis later on, so we'll explicitly state it in our code, and check it in our data.


In [7]:
def rational(df):
    """
    Check that at least one criterion is better.
    """
    r = ((df.price1 < df.price2) | (df.time1 < df.time2) |
         (df.change1 < df.change2) | (df.comfort1 > df.comfort2))
    return r

@ed.is_shape((-1, 11))
@ed.has_dtypes(items=dtypes)
@ed.verify_all(rational)
def unload():
    url = "http://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Train.csv"
    trains = pd.read_csv(url, index_col=0)
    return trains

In [8]:
df = unload()


---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-8-b108f050ce4e> in <module>()
----> 1 df = unload()

/Users/tom.augspurger/sandbox/engarde/engarde/decorators.py in wrapper(*args, **kwargs)
     22         @wraps(func)
     23         def wrapper(*args, **kwargs):
---> 24             result = func(*args, **kwargs)
     25             ck.is_shape(result, shape)
     26             return result

/Users/tom.augspurger/sandbox/engarde/engarde/decorators.py in wrapper(*args, **kwargs)
    115         @wraps(func)
    116         def wrapper(*args, **kwargs):
--> 117             result = func(*args, **kwargs)
    118             ck.has_dtypes(result, items)
    119             return result

/Users/tom.augspurger/sandbox/engarde/engarde/decorators.py in wrapper(*operation_args, **operation_kwargs)
    147         def wrapper(*operation_args, **operation_kwargs):
    148             result = operation_func(*operation_args, **operation_kwargs)
--> 149             vfunc(result, func, *args, **kwargs)
    150             return result
    151         return wrapper

/Users/tom.augspurger/sandbox/engarde/engarde/generic.py in verify_all(df, check, *args, **kwargs)
     40     result = check(df, *args, **kwargs)
     41     try:
---> 42         assert np.all(result)
     43     except AssertionError as e:
     44         msg = "{} not true for all".format(check.__name__)

AssertionError: ('rational not true for all',        id  choiceid   choice  price1  time1  change1  comfort1  price2  time2  \
13      2         3  choice2    2450    121        0         0    2450     93   
18      2         8  choice2    2975    108        0         0    2450    108   
27      3         6  choice2    1920    106        0         0    1440     96   
28      3         7  choice1    1920    106        0         0    1920     96   
33      4         1  choice2     545    105        1         1     545     85   
...   ...       ...      ...     ...    ...      ...       ...     ...    ...   
2899  233        10  choice1    1350    110        0         0    1350     95   
2900  234         1  choice2    4400     85        1         1    3300     85   
2907  234         8  choice2    3300     95        1         0    3300     85   
2914  235         1  choice2    3000     75        2         1    3000     65   
2916  235         3  choice2    2550     75        1         0    2100     55   

      change2  comfort2  
13          0         1  
18          0         1  
27          0         1  
28          0         1  
33          1         1  
...       ...       ...  
2899        0         1  
2900        0         1  
2907        0         1  
2914        1         1  
2916        1         1  

[467 rows x 11 columns])

OK, so apparently people aren't rational... We'll fix this problem by ignoring those people (why change your mind when you can change the data?).
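
Before dropping anything, it may be worth looking at how many rows fail the check. The following is a small sketch on a two-row frame: the first row copies the values of row 1 from the `head()` output and the second copies row 13 from the error message. The names `toy`, `mask`, `failing`, and `share_failing` are only illustrative.

```python
import pandas as pd


def rational(df):
    """Same check as above: option 1 beats option 2 on at least one criterion."""
    return ((df.price1 < df.price2) | (df.time1 < df.time2) |
            (df.change1 < df.change2) | (df.comfort1 > df.comfort2))


toy = pd.DataFrame({
    "price1": [2400, 2450], "price2": [4000, 2450],
    "time1": [150, 121], "time2": [150, 93],
    "change1": [0, 0], "change2": [0, 0],
    "comfort1": [1, 0], "comfort2": [1, 1],
})

mask = rational(toy)              # boolean Series, one entry per row
failing = toy[~mask]              # rows where no criterion is better
share_failing = 1 - mask.mean()   # fraction of rows that fail
```

On the real data the same two lines, `df[~rational(df)]` and `1 - rational(df).mean()`, would give the failing rows and their share.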


In [11]:
@ed.verify_all(rational)
def drop_silly_people(df):
    # Keep only the rows that pass the rational check defined earlier.
    return df[rational(df)]


@ed.is_shape((-1, 11))
@ed.has_dtypes(items=dtypes)
def unload():
    url = "http://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Train.csv"
    trains = pd.read_csv(url, index_col=0)
    return trains

def main():
    df = (unload()
          .pipe(drop_silly_people)
          )
    return df

In [12]:
df = main()

There are a couple of things to notice here. The checks are always performed on the result of a function, never on its input. That's why our ed.verify_all(rational) works now: drop_silly_people returns only the rows that pass the check. I also like how the assertions don't clutter the logic of the code.
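
The "checked on the result" behaviour can be illustrated with a small stand-in decorator. This is a sketch under stated assumptions, not engarde's implementation: `verify_all_sketch`, `cheaper`, `load`, and `keep_cheaper` are hypothetical names, and the two-row frame is made up. The same check fails on a function that returns the raw data and passes on one that filters first.

```python
from functools import wraps

import pandas as pd


def verify_all_sketch(check):
    """Hypothetical stand-in for a verify_all-style decorator."""
    def decorate(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            # The check sees the function's return value, never its input.
            assert check(result).all(), (
                "{} not true for all".format(check.__name__))
            return result
        return wrapper
    return decorate


def cheaper(df):
    return df.price1 < df.price2


raw = pd.DataFrame({"price1": [2400, 4000], "price2": [3200, 3200]})


@verify_all_sketch(cheaper)
def load():
    return raw                  # second row violates the check


@verify_all_sketch(cheaper)
def keep_cheaper(df):
    return df[cheaper(df)]      # violating rows are gone before the check runs


try:
    load()
    load_passed = True
except AssertionError:
    load_passed = False

filtered = keep_cheaper(raw)    # passes: only the first row remains
```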