We'll work with a dataset of customer preferences on trains, available here. This is a static dataset and isn't being updated, but you could imagine that each month the Dutch authorities upload a new month's worth of data.
We can start by making some very basic assertions: that the dataset is the correct shape, and that a few columns have the correct dtypes. Assertions are made as decorators on functions that return a DataFrame.
In [5]:
import pandas as pd
import engarde.decorators as ed
pd.set_option('display.max_rows', 10)
dtypes = dict(
    price1=int,
    price2=int,
    time1=int,
    time2=int,
    change1=int,
    change2=int,
    comfort1=int,
    comfort2=int
)

@ed.is_shape((-1, 11))
@ed.has_dtypes(items=dtypes)
def unload():
    url = "http://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Train.csv"
    trains = pd.read_csv(url, index_col=0)
    return trains
In [6]:
df = unload()
df.head()
Out[6]:
Notice two things: we only specified the dtypes for some of the columns, and we don't care about the length of the DataFrame (just its width), so we passed -1 for the first dimension of the shape.
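To make those two points concrete, here is a minimal sketch of what a shape check with a -1 wildcard and a partial dtype check amount to in plain pandas. The helpers `shape_matches` and `dtypes_match` are hypothetical names written for this illustration, not engarde's implementation, and the toy values are made up.

```python
import pandas as pd


def shape_matches(df, expected):
    """True if df.shape matches `expected`; -1 means 'any size' on that axis."""
    if len(expected) != len(df.shape):
        return False
    return all(want == -1 or want == got
               for want, got in zip(expected, df.shape))


def dtypes_match(df, items):
    """True if every column named in `items` has the given dtype.

    Columns that are not listed in `items` are simply ignored.
    """
    return all(df[col].dtype == dtype for col, dtype in items.items())


# Made-up values for illustration only; not taken from the Train dataset.
toy = pd.DataFrame({
    "price1": [2400, 1750],
    "time1": [150, 121],
    "note": ["a", "b"],
})

shape_ok = shape_matches(toy, (-1, 3))    # any number of rows, 3 columns
shape_bad = shape_matches(toy, (-1, 11))  # toy has 3 columns, not 11
dtypes_ok = dtypes_match(toy, {"price1": "int64", "time1": "int64"})
```

Only `price1` and `time1` are checked for dtype here; the `note` column is free to be anything, just as the columns left out of `dtypes` are above.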
Since people are rational, their first choice is surely going to be better in at least one way than their second choice. This is fundamental to our analysis later on, so we'll explicitly state it in our code, and check it in our data.
In [7]:
def rational(df):
    """
    Check that at least one criterion is better.
    """
    r = ((df.price1 < df.price2) | (df.time1 < df.time2) |
         (df.change1 < df.change2) | (df.comfort1 > df.comfort2))
    return r

@ed.is_shape((-1, 11))
@ed.has_dtypes(items=dtypes)
@ed.verify_all(rational)
def unload():
    url = "http://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Train.csv"
    trains = pd.read_csv(url, index_col=0)
    return trains
In [8]:
df = unload()
The check fails with an AssertionError. OK, so apparently people aren't rational... We'll fix this problem by ignoring those people (why change your mind when you can change the data?).
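Before dropping anything, it can help to look at which rows break the rule. The sketch below applies the same row-wise condition to a tiny hand-made frame and uses the negated mask to pull out the offending rows. The values are invented for illustration and are not taken from the Train dataset.

```python
import pandas as pd


def rational(df):
    """Row-wise check: option 1 beats option 2 on at least one criterion."""
    return ((df.price1 < df.price2) | (df.time1 < df.time2) |
            (df.change1 < df.change2) | (df.comfort1 > df.comfort2))


# Made-up values for illustration only; not taken from the Train dataset.
toy = pd.DataFrame({
    "price1": [2400, 4000], "price2": [4000, 3200],
    "time1": [150, 150], "time2": [150, 130],
    "change1": [0, 0], "change2": [0, 0],
    "comfort1": [1, 1], "comfort2": [1, 1],
})

mask = rational(toy)
violations = toy[~mask]  # rows where option 1 is better on no criterion
print(f"{len(violations)} of {len(toy)} rows fail the check")
```

The same `~mask` indexing would work on the full DataFrame, assuming it is loaded without the failing decorator.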
In [11]:
@ed.verify_all(rational)
def drop_silly_people(df):
    r = ((df.price1 < df.price2) | (df.time1 < df.time2) |
         (df.change1 < df.change2) | (df.comfort1 > df.comfort2))
    return df[r]

@ed.is_shape((-1, 11))
@ed.has_dtypes(items=dtypes)
def unload():
    url = "http://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Train.csv"
    trains = pd.read_csv(url, index_col=0)
    return trains

def main():
    df = (unload()
          .pipe(drop_silly_people)
          )
    return df
In [12]:
df = main()
There are a couple of things to notice here. The checks are always performed on the result of a function, which is why ed.verify_all(rational) passes now: it runs on the filtered output of drop_silly_people, not on the raw data. I also like how the assertions don't clutter the logic of the code.
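The "checked on the result" behaviour can be sketched with a small hand-rolled decorator. This is a simplified stand-in written for illustration: `verify_all_result`, `first_is_cheaper`, `keep_cheaper`, and `passthrough` are hypothetical names, and the real engarde decorators may differ in their details.

```python
import functools

import pandas as pd


def verify_all_result(check):
    """Hypothetical decorator factory: assert `check` holds for every row
    of the DataFrame that the wrapped function returns."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            # The check sees the return value, never the arguments.
            assert check(result).all(), f"{func.__name__}: check failed"
            return result
        return wrapper
    return decorator


def first_is_cheaper(df):
    return df.price1 < df.price2


@verify_all_result(first_is_cheaper)
def keep_cheaper(df):
    # The input may violate the check; the filtered output does not.
    return df[df.price1 < df.price2]


@verify_all_result(first_is_cheaper)
def passthrough(df):
    return df


# Made-up values for illustration only.
toy = pd.DataFrame({"price1": [1, 5], "price2": [3, 2]})

cleaned = keep_cheaper(toy)  # passes: only the first row is returned

try:
    passthrough(toy)         # fails: the second row is still present
    raised = False
except AssertionError:
    raised = True
```

Because the assertion lives in the decorator, the body of `keep_cheaper` contains nothing but the filtering logic.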