This notebook demonstrates vaquero, both as a library and as a data-cleaning pattern.



In [1]:

    
from vaquero import Vaquero, callables_from

Task

Say you think you have pairs of numbers serialized as comma-separated values in a file. You want to extract the pair from each line, then add its two numbers together (one sum per line).

Sample Data



In [2]:

    
lines = ["1, 1.0",  # An errant float
         "1, $",    # A bad number
         "1,-1",    # A good line
         "10"]      # Missing the second value

Initial Implementation



In [3]:

    
def extract_pairs(s):
    return s.split(",")
    
    
def to_int(items):
    return [int(item) for item in items]


def sum_pair(items):
    return items[0] + items[1]

Iteration 1

First, create a Vaquero instance. Here, I've set the maximum number of failures allowed to 5. After that many failures, the Vaquero object raises a VaqueroException. Generally, you want the limit large enough to collect a good sample of unexpected failures, but not so large that you exhaust memory. This is an iterative process.

Also, as a tip, always instantiate the Vaquero object in its own cell. This way, you get to inspect it in your notebook even if it raises a VaqueroException.

I also register all functions (well, callables) in this notebook with vaquero. The error-capturing machinery only operates on registered functions, and it always ignores a KeyboardInterrupt.



In [4]:

    
vaquero = Vaquero(max_failures=5)
vaquero.register_targets(callables_from(globals()))
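
To make the pattern concrete, here is a minimal sketch of a failure-collecting context manager in plain Python. It is not vaquero's implementation: `FailureCollector` and `TooManyFailures` are hypothetical names, and unlike vaquero this sketch captures errors from any code in the block rather than only from registered functions.

```python
class TooManyFailures(Exception):
    """Hypothetical stand-in for VaqueroException."""


class FailureCollector:
    """Illustrative only: record exceptions per input, up to a limit."""

    def __init__(self, max_failures):
        self.max_failures = max_failures
        self.failures = []  # (input, exception type name, message)
        self.successes = 0
        self._current = None

    def on_input(self, item):
        # Remember which input is being processed, then act as the context manager.
        self._current = item
        return self

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self.successes += 1
            return False
        if not issubclass(exc_type, Exception):
            return False  # let KeyboardInterrupt and similar propagate
        self.failures.append((self._current, exc_type.__name__, str(exc)))
        if len(self.failures) >= self.max_failures:
            raise TooManyFailures("limit of {} reached".format(self.max_failures))
        return True  # swallow the error and move on to the next input


collector = FailureCollector(max_failures=5)
parsed = []
for s in ["1", "x", "3"]:
    with collector.on_input(s):
        parsed.append(int(s))
```

The loop finishes even though `"x"` is not a number; the bad input ends up in `collector.failures` for later inspection instead of stopping the run.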

Just to be sure, I'll check the registered functions. Vaquero matches by name, which is a bit naive but surprisingly robust given typical vaquero usage patterns. In the output, you can see some names that don't belong, such as exit and quit. But, again, it mostly works well.



In [5]:

    
vaquero.target_funcs









    Out[5]:





{'Vaquero',
 'callables_from',
 'exit',
 'extract_pairs',
 'get_ipython',
 'quit',
 'sum_pair',
 'to_int'}
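
The extra names appear because a notebook's global namespace holds more callables than the ones you defined. As a rough illustration of the general idea (an assumption, not the actual source of `callables_from`), collecting every name bound to a callable looks like the following; `names_of_callables` is a hypothetical helper.

```python
def names_of_callables(namespace):
    """Return the names in a namespace dict that are bound to callables."""
    return {name for name, obj in namespace.items() if callable(obj)}


def extract_pairs(s):
    return s.split(",")


# A tiny stand-in for globals(): one user function, one builtin, one plain value.
namespace = {"extract_pairs": extract_pairs, "print": print, "lines": ["1,2"]}
found = names_of_callables(namespace)
```

Here `found` contains both `extract_pairs` and `print`, but not `lines`, which mirrors how unrelated callables end up in the registered set.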

Now, run my trivial examples over the initial implementation.



In [6]:

    
results = []

for s in lines:
    with vaquero.on_input(s):
        results.append(sum_pair(to_int(extract_pairs(s))))

It was not successful.



In [7]:

    
vaquero.was_successful









    Out[7]:





False

So, look at the failures. Two functions failed: to_int twice and sum_pair once.



In [8]:

    
vaquero.stats()









    Out[8]:





{'failures': 3,
 'failures_by': {'sum_pair': 1, 'to_int': 2},
 'ignored': 0,
 'successes': 1}

To get a sense of what happened, examine the failing functions.

You can do this by calling examine with the name of the function (or the function object). It returns the captured invocations and errors.

Here you can see that the to_int function from cell In [3] failed with a ValueError exception.



In [9]:

    
vaquero.examine('to_int')









    Out[9]:





[{'call_args': [['1', ' 1.0']],
  'exc_type': 'ValueError',
  'exc_value': "invalid literal for int() with base 10: ' 1.0'",
  'filename': 'In [3]',
  'lineno': 6,
  'name': 'to_int'},
 {'call_args': [['1', ' $']],
  'exc_type': 'ValueError',
  'exc_value': "invalid literal for int() with base 10: ' $'",
  'filename': 'In [3]',
  'lineno': 6,
  'name': 'to_int'}]

Often, though, we want to query only parts of the capture for a specific function. To do so, you can use JMESPath, passing the selector as an argument to examine. To simplify things further, you can also ask for the selected result as a set (assuming its items are hashable), which removes duplicates.



In [10]:

    
vaquero.examine('to_int', '[*].exc_value', as_set=True)









    Out[10]:





{"invalid literal for int() with base 10: ' $'",
 "invalid literal for int() with base 10: ' 1.0'"}

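For readers new to JMESPath, the selector `[*].exc_value` projects the `exc_value` field out of every captured record. A plain-Python equivalent, run on records copied from the `to_int` captures, is sketched here. It only illustrates what the selector computes, not how `examine` is implemented.

```python
captures = [
    {"call_args": [["1", " 1.0"]], "exc_type": "ValueError",
     "exc_value": "invalid literal for int() with base 10: ' 1.0'"},
    {"call_args": [["1", " $"]], "exc_type": "ValueError",
     "exc_value": "invalid literal for int() with base 10: ' $'"},
]

# Equivalent of the JMESPath projection '[*].exc_value'
messages = [record["exc_value"] for record in captures]

# Equivalent of as_set=True: collapse duplicates (values must be hashable)
unique_messages = set(messages)
```
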
And, for sum_pair.



In [11]:

    
vaquero.examine('sum_pair')









    Out[11]:





[{'call_args': [[10]],
  'exc_type': 'IndexError',
  'exc_value': 'list index out of range',
  'filename': 'In [3]',
  'lineno': 10,
  'name': 'sum_pair'}]

Iteration 2

We now know that some ints are encoded as floats. But we also know from our data source that every value can only be an int. So, in to_int, let's parse each string as a float first, then create an int from it. That handles both encodings.

Also, we know that some lines don't have two components. Those are just bad lines. Let's assert there are two parts as a postcondition of extract_pairs.

Finally, after a bit of digging, we found that $ means NA. After cursing for a minute (because that's crazy), we decide to ignore those entries. Instead of adding this check to an existing function, we write a separate no_missing_data function.



In [12]:

    
def no_missing_data(s):
    assert '$' not in s, "'{}' has missing data".format(s)
    
    
def extract_pairs(s):
    parts = s.split(",")
    assert len(parts) == 2, "'{}' not in 2 parts".format(s)
    return tuple(parts)
    
    
def to_int(items):
    return [int(float(item)) for item in items]


def sum_pair(items):
    assert len(items) == 2, "Line is improperly formatted"
    return items[0] + items[1]
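
One caveat with `int(float(item))` is that it silently truncates values such as `1.7`. If that matters for your data, a stricter variant is sketched below; `to_int_strict` is a hypothetical name and is not part of this notebook's pipeline.

```python
def to_int_strict(items):
    """Parse strings as ints, accepting floats only when they are whole numbers."""
    result = []
    for item in items:
        value = float(item)  # tolerates surrounding whitespace and forms like '1.0'
        assert value.is_integer(), "'{}' is not a whole number".format(item)
        result.append(int(value))
    return result
```

Because it fails with an AssertionError, a value like `1.7` would be reported in the same way as the other bad-data checks in this iteration.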



In [13]:

    
vaquero.reset()  # Clear logged errors, mostly.
vaquero.register_targets(callables_from(globals()))



In [14]:

    
results = []

for s in lines:
    with vaquero.on_input(s):
        no_missing_data(s)
        results.append(sum_pair(to_int(extract_pairs(s))))

Now, we have one more success, but still two failures.



In [15]:

    
vaquero.stats()









    Out[15]:





{'failures': 2,
 'failures_by': {'extract_pairs': 1, 'no_missing_data': 1},
 'ignored': 0,
 'successes': 2}

Let's quickly examine.



In [16]:

    
vaquero.examine('extract_pairs')









    Out[16]:





[{'call_args': ['10'],
  'exc_type': 'AssertionError',
  'exc_value': "'10' not in 2 parts",
  'filename': 'In [12]',
  'lineno': 7,
  'name': 'extract_pairs'}]



In [17]:

    
vaquero.examine('no_missing_data')









    Out[17]:





[{'call_args': ['1, $'],
  'exc_type': 'AssertionError',
  'exc_value': "'1, $' has missing data",
  'filename': 'In [12]',
  'lineno': 2,
  'name': 'no_missing_data'}]

Both of these exceptions come from bad data rather than bad code, so we want to ignore them.



In [18]:

    
vaquero.stats_ignoring('AssertionError')









    Out[18]:





{'failures': 0, 'failures_by': {}, 'ignored': 2, 'successes': 2}
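
Conceptually, ignoring an exception type just moves the matching failures into a separate bucket. The sketch below reproduces that arithmetic on simplified records; `stats_ignoring_sketch` is a hypothetical function written for illustration, not vaquero's code.

```python
from collections import Counter

# Failure records shaped like the ones examine() returned, reduced to two fields
failures = [
    {"name": "extract_pairs", "exc_type": "AssertionError"},
    {"name": "no_missing_data", "exc_type": "AssertionError"},
]
successes = 2


def stats_ignoring_sketch(failures, successes, ignored_type):
    """Recount failures, moving those of `ignored_type` into an 'ignored' bucket."""
    kept = [f for f in failures if f["exc_type"] != ignored_type]
    return {
        "failures": len(kept),
        "failures_by": dict(Counter(f["name"] for f in kept)),
        "ignored": len(failures) - len(kept),
        "successes": successes,
    }


summary = stats_ignoring_sketch(failures, successes, "AssertionError")
```
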

Looking at the results accumulated,



In [19]:

    
results









    Out[19]:





[2, 0]

Things look good.

Now that we have something that works, we can use Vaquero in a more production-oriented mode: we allow unlimited errors, but we don't capture anything. Each failure is still counted, but its details are discarded, since we won't be post-processing them.



In [20]:

    
vaquero.reset(turn_off_error_capturing=True) 
# Or, Vaquero(capture_error_invocations=False)



In [21]:

    
results = []

for s in lines:
    with vaquero.on_input(s):
        no_missing_data(s)
        results.append(sum_pair(to_int(extract_pairs(s))))
        
results









    Out[21]:





[2, 0]

The bad lines still show up as failures, but vaquero doesn't waste memory storing the captures.



In [22]:

    
vaquero.stats()









    Out[22]:





{'failures': 2, 'failures_by': {}, 'ignored': 0, 'successes': 2}
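
The count-only idea can be sketched without vaquero as a small context manager that tallies outcomes and keeps no details. This is an illustration under simplified assumptions; `count_only` and `counts` are hypothetical names.

```python
from contextlib import contextmanager

counts = {"successes": 0, "failures": 0}


@contextmanager
def count_only():
    """Swallow ordinary exceptions, keeping a tally but no details."""
    try:
        yield
    except Exception:
        counts["failures"] += 1
    else:
        counts["successes"] += 1


results = []
for s in ["1,1", "oops", "2,3"]:
    with count_only():
        a, b = s.split(",")  # raises ValueError when there are not two parts
        results.append(int(a) + int(b))
```

Memory use stays constant no matter how many lines fail, which is the trade-off this mode makes: you learn how many inputs were bad, but not why.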