In [1]:
from vaquero import Vaquero, callables_from
In [2]:
lines = ["1, 1.0",  # An errant float
         "1, $",    # A bad number
         "1,-1",    # A good line
         "10"]      # Missing the second value
In [3]:
def extract_pairs(s):
    return s.split(",")

def to_int(items):
    return [int(item) for item in items]

def sum_pair(items):
    return items[0], items[1]
First, create a Vaquero instance. Here, I've set the maximum number of failures allowed to 5. After that many failures, the Vaquero object raises a VaqueroException. Generally, you want the limit large enough to collect a good sample of unexpected failures, but not so large that you exhaust memory. Finding the right value is an iterative process.
Also, as a tip, always instantiate the Vaquero object in its own cell. This way, you get to inspect it in your notebook even if it raises a VaqueroException.
I also registered all functions (well, callables) in this notebook with vaquero. The error capturing machinery only operates on the registered functions. And, it always ignores a KeyboardInterrupt.
In [4]:
vaquero = Vaquero(max_failures=5)
vaquero.register_targets(callables_from(globals()))
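To make the failure cap concrete, here is a minimal sketch of the idea in plain Python. It is not Vaquero's implementation: the `FailureCollector` class and the `FailureLimitExceeded` exception are hypothetical names I made up for illustration. It only borrows the shape of `on_input` and `max_failures` from the cell above.

```python
class FailureLimitExceeded(Exception):
    """Raised once too many failures were collected (hypothetical name)."""


class FailureCollector:
    """Hypothetical stand-in for the idea behind Vaquero, not its code."""

    def __init__(self, max_failures):
        self.max_failures = max_failures
        self.failures = []           # (input, exception) pairs captured so far
        self._current_input = None

    def on_input(self, item):
        self._current_input = item   # remember what we were processing
        return self                  # the collector is its own context manager

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            return False             # nothing went wrong
        if not issubclass(exc_type, Exception):
            return False             # let KeyboardInterrupt and friends through
        self.failures.append((self._current_input, exc))
        if len(self.failures) >= self.max_failures:
            raise FailureLimitExceeded(
                "{} failures collected".format(len(self.failures)))
        return True                  # swallow the error and keep going


collector = FailureCollector(max_failures=5)
parsed = []
for text in ["1", "x", "3"]:
    with collector.on_input(text):
        parsed.append(int(text))
```

After the loop, `parsed` holds the two good values and `collector.failures` holds the one bad input with its exception. Because every failure is stored, a larger cap costs more memory, which is the trade-off described above.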
Just to be sure, I'll check the registered functions. Vaquero matches targets by name, which is a bit naive but surprisingly robust given typical vaquero usage patterns. In the list, you can see some things that don't belong. But, again, it mostly works well.
In [5]:
vaquero.target_funcs
Out[5]:
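As a rough illustration of why extra names show up, here is a sketch of collecting callables from a namespace by name. I am assuming that `callables_from` does something along these lines; `callable_names` is a hypothetical helper, not part of vaquero.

```python
from collections import Counter


def callable_names(namespace):
    """Return the names bound to callables in a namespace (hypothetical helper)."""
    return {name for name, obj in namespace.items() if callable(obj)}


def extract_pairs(s):
    return s.split(",")


# A tiny stand-in for globals(): one function, one list, one imported class.
namespace = {
    "extract_pairs": extract_pairs,
    "lines": ["1,-1"],
    "Counter": Counter,   # classes are callable too, so this one sneaks in
}

names = callable_names(namespace)
```

Here `names` contains both `extract_pairs` and `Counter`. Anything callable in the notebook's globals, including imported classes, would be picked up the same way.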
Now, run my trivial examples over the initial implementation.
In [6]:
results = []
for s in lines:
    with vaquero.on_input(s):
        results.append(sum_pair(to_int(extract_pairs(s))))
It was not successful.
In [7]:
vaquero.was_successful
Out[7]:
So, look at the failures. There were two functions, and both had failures.
In [8]:
vaquero.stats()
Out[8]:
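To see where the failures come from without vaquero, the same loop can be sketched with a plain try/except that tallies exception types. This is only an illustration in standard Python. It does not reproduce the output of `vaquero.stats()`.

```python
from collections import Counter

lines = ["1, 1.0", "1, $", "1,-1", "10"]


def extract_pairs(s):
    return s.split(",")


def to_int(items):
    return [int(item) for item in items]


def sum_pair(items):
    return items[0], items[1]


failure_types = Counter()
results = []
for s in lines:
    try:
        results.append(sum_pair(to_int(extract_pairs(s))))
    except Exception as exc:
        # Count by exception type only; no details are kept here.
        failure_types[type(exc).__name__] += 1

# failure_types: ValueError twice (" 1.0" and " $"), IndexError once ("10")
```

Only `"1,-1"` makes it through. The two `ValueError`s come from `int()` rejecting `" 1.0"` and `" $"`, and the `IndexError` comes from indexing a one-element list.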
To get a sense of what happened, examine the failing functions.
You can do this by calling examine with the name of the function (or the function object). It returns the captured invocations and errors.
Here you can see that the to_int function from cell In [3] failed with a ValueError exception.
In [9]:
vaquero.examine('to_int')
Out[9]:
Often, though, we want to query only parts of the capture for a specific function. To do so, you can use JMESPath, passing the selector as the second argument to examine. You can also pass as_set=True to reduce the selected result to a set (assuming the values are hashable), which simplifies things.
In [10]:
vaquero.examine('to_int', '[*].exc_value', as_set=True)
Out[10]:
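If JMESPath is new to you, the selector `[*].exc_value` is a projection: it takes the `exc_value` field from every element of a list and drops missing values. Here is a plain-Python sketch of the same idea. The capture records are hypothetical and hand-written; the real structure returned by examine may differ, and only the `exc_value` field name comes from the selector above.

```python
# Hypothetical capture records, written by hand for illustration only.
captures = [
    {"exc_type": "ValueError", "exc_value": "bad literal ' 1.0'"},
    {"exc_type": "ValueError", "exc_value": "bad literal ' $'"},
    {"exc_type": "ValueError", "exc_value": "bad literal ' $'"},
    {"exc_type": "ValueError"},   # no exc_value: a projection skips it
]

# Rough equivalent of the JMESPath projection '[*].exc_value'.
exc_values = [c["exc_value"] for c in captures
              if c.get("exc_value") is not None]

# Rough equivalent of as_set=True: collapse duplicates (values must be hashable).
unique_values = set(exc_values)
```

The list keeps three values, while the set keeps two, which is why the set view is handy when the same error repeats over many inputs.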
And, for sum_pair.
In [11]:
vaquero.examine('sum_pair')
Out[11]:
We now know that some ints are encoded as floats. But we know from our data source that each value can only be an int. So, in to_int, let's parse each string as a float first, then create an int from it. This handles both "1" and "1.0".
Also, we know that some lines don't have two components. Those are just bad lines. Let's assert that there are two parts as a postcondition of extract_pairs.
Finally, after a bit of digging, we found that $ means NA. After cursing for a minute because that's crazy (although crazy is common in dirty data), we decide to ignore those entries. Instead of adding this check to an existing function, we write a separate no_missing_data function.
In [12]:
def no_missing_data(s):
    assert '$' not in s, "'{}' has missing data".format(s)

def extract_pairs(s):
    parts = s.split(",")
    assert len(parts) == 2, "'{}' not in 2 parts".format(s)
    return tuple(parts)

def to_int(items):
    return [int(float(item)) for item in items]

def sum_pair(items):
    assert len(items) == 2, "Line is improperly formatted"
    return items[0] + items[1]
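One caveat about parsing through float is worth a quick sketch: `int(float(...))` accepts `" 1.0"`, but it also silently truncates a value like `"1.9"`. If the data source could ever produce real fractions, a stricter variant might be safer. `parse_int_strict` below is a hypothetical helper, not something from this notebook or from vaquero.

```python
def parse_int(text):
    """Parse an int that may be written as a float, e.g. ' 1.0'."""
    return int(float(text))


def parse_int_strict(text):
    """Like parse_int, but reject a fractional part (hypothetical helper)."""
    value = float(text)
    if not value.is_integer():
        raise ValueError("'{}' is not a whole number".format(text))
    return int(value)


from_float_text = parse_int(" 1.0")   # float() tolerates the leading space
truncated = parse_int("1.9")          # the fraction is dropped without warning

try:
    parse_int_strict("1.9")
    strict_rejected = False
except ValueError:
    strict_rejected = True
```

Both versions still raise a `ValueError` on `"$"`, so missing-data markers need their own check either way.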
In [13]:
vaquero.reset()
vaquero.register_targets(callables_from(globals()))
In [14]:
results = []
for s in lines:
    with vaquero.on_input(s):
        no_missing_data(s)
        results.append(sum_pair(to_int(extract_pairs(s))))
Now, we have one more success, but still two failures.
In [15]:
vaquero.stats()
Out[15]:
Let's quickly examine.
In [16]:
vaquero.examine('extract_pairs')
Out[16]:
In [17]:
vaquero.examine('no_missing_data')
Out[17]:
Both of these exceptions come from bad data, so we want to ignore them.
In [18]:
vaquero.stats_ignoring('AssertionError')
Out[18]:
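The idea of ignoring a class of failures can be sketched in plain Python: tally failures by function and exception type, then filter out the types we consider expected. This is an illustration of the concept only, not how `stats_ignoring` is implemented. The `ignoring` helper is a hypothetical name.

```python
import traceback
from collections import Counter

lines = ["1, 1.0", "1, $", "1,-1", "10"]


def no_missing_data(s):
    assert '$' not in s, "'{}' has missing data".format(s)


def extract_pairs(s):
    parts = s.split(",")
    assert len(parts) == 2, "'{}' not in 2 parts".format(s)
    return tuple(parts)


def to_int(items):
    return [int(float(item)) for item in items]


def sum_pair(items):
    return items[0] + items[1]


failures = Counter()
results = []
for s in lines:
    try:
        no_missing_data(s)
        results.append(sum_pair(to_int(extract_pairs(s))))
    except Exception as exc:
        # The innermost traceback frame names the function that raised.
        func_name = traceback.extract_tb(exc.__traceback__)[-1].name
        failures[(func_name, type(exc).__name__)] += 1


def ignoring(tally, *exc_names):
    """Drop tally entries whose exception type is in exc_names."""
    return Counter({key: n for key, n in tally.items()
                    if key[1] not in exc_names})


remaining = ignoring(failures, "AssertionError")
```

With the four sample lines, both failures are `AssertionError`s, so `remaining` ends up empty and `results` is `[2, 0]`.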
Looking at the results accumulated,
In [19]:
results
Out[19]:
Things look good.
Now that we have something that works, we can use Vaquero in a more production-oriented mode: we allow unlimited errors but don't capture anything. Each failure is still noted, but its details are discarded since we won't be post-processing.
In [20]:
vaquero.reset(turn_off_error_capturing=True)
# Or, Vaquero(capture_error_invocations=False)
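The difference between capturing and merely counting can be sketched like this. `FailureCounter` is a hypothetical class for illustration, not Vaquero's implementation. It notes each failure in a counter and keeps no reference to the input or the exception, so memory use stays flat no matter how many inputs fail.

```python
from collections import Counter


class FailureCounter:
    """Hypothetical count-only collector: notes failures, stores no captures."""

    def __init__(self):
        self.counts = Counter()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None or not issubclass(exc_type, Exception):
            return False          # success, or something like KeyboardInterrupt
        self.counts[exc_type.__name__] += 1
        return True               # swallow the error; nothing else is kept


counter = FailureCounter()
total = 0
for text in ["1", "x", "y", "4"]:
    with counter:
        total += int(text)
```

After the loop, `total` is 5 and `counter.counts` records two `ValueError`s, with no limit on how many failures can accumulate.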
In [21]:
results = []
for s in lines:
    with vaquero.on_input(s):
        no_missing_data(s)
        results.append(sum_pair(to_int(extract_pairs(s))))
results
Out[21]:
They still show up as failures, but it doesn't waste memory storing the captures.
In [22]:
vaquero.stats()
Out[22]: