This notebook demonstrates vaquero.

Let's say you are processing some HTML files for users. Someone on your team has already used CSS selectors to extract a list of attribute dicts that looks like:


In [1]:
data = [{'user_name': "Jack", 'user_age': "42.0"},
        {'user_name': "Jill", 'user_age': 64},
        {'user_name': "Jane", 'user_age': "lamp"}]

You create a pipeline in a file named username_pipeline.py with the following contents:


In [2]:
!cat username_pipeline.py


from vaquero.transformations import sstrip


def extract_username(src_d, dst_d):
    # Copy the user's name, then normalize it.
    dst_d['name'] = sstrip(src_d['user_name']).lower()

def _robust_int(s):
    # Try to convert s into an int.
    try:
        return int(s)
    except ValueError:
        return int(float(s))

def extract_age(src_d, dst_d):
    # Extract the age as an int.
    dst_d['age'] = _robust_int(src_d['user_age'])

In [3]:
from vaquero import ModulePipeline, Vaquero
import username_pipeline

After importing the necessities, you:

  • create a Vaquero object, which gathers the results of your pipeline's applications
  • create a module pipeline, which wraps and parses the Python module
  • register the targets in the pipeline, so vaquero knows what to observe.

In [4]:
vaq = Vaquero()
pipeline = ModulePipeline(username_pipeline)
vaq.register_targets(pipeline)
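
If you are wondering what "wraps and parses the Python module" could mean, here is a rough sketch of the general idea. It is not vaquero's implementation: the selection rule (public functions, run in source order), the `public_functions` and `run_pipeline` helpers, and the `demo_pipeline` module name are all assumptions made up for illustration, and `str.strip` stands in for `sstrip`.

```python
import types

# Source for a throwaway module that mirrors username_pipeline.py.
SOURCE = '''
def extract_username(src_d, dst_d):
    dst_d['name'] = src_d['user_name'].strip().lower()

def _robust_int(s):
    try:
        return int(s)
    except ValueError:
        return int(float(s))

def extract_age(src_d, dst_d):
    dst_d['age'] = _robust_int(src_d['user_age'])
'''

# Build the module in memory so the sketch is self-contained.
module = types.ModuleType("demo_pipeline")  # hypothetical module name
exec(SOURCE, module.__dict__)


def public_functions(mod):
    """Return the module's public functions, ordered as they appear in the source."""
    funcs = [obj for name, obj in vars(mod).items()
             if isinstance(obj, types.FunctionType) and not name.startswith("_")]
    return sorted(funcs, key=lambda f: f.__code__.co_firstlineno)


def run_pipeline(mod, src_d, dst_d):
    """Apply each public function to the (source, destination) pair."""
    for func in public_functions(mod):
        func(src_d, dst_d)


result = {}
run_pipeline(module, {'user_name': " Jack ", 'user_age': "42.0"}, result)
print(result)  # {'name': 'jack', 'age': 42}
```

Under this assumed rule, helpers such as `_robust_int` are not pipeline steps themselves; they are only reached through the functions that call them.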

Now you can run your pipeline over the data, one document at a time. I usually reset the vaq object at the top of the processing cell so that I'm not accidentally looking at stale errors from an earlier run, which happens a lot.


In [5]:
vaq.reset()

clean = []
for doc in data:
    with vaq:  # Capture exceptions.
        d = {}
        pipeline(doc, d)
        clean.append(d)
        
vaq.stats()


Out[5]:
{'failures': 1,
 'failures_by': {'_robust_int': 1},
 'ignored': 0,
 'successes': 2}
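
The `with vaq:` block is what keeps one bad document from stopping the loop. As a minimal sketch of that general pattern (a hypothetical `ErrorCollector` class, not vaquero's actual code), a context manager can record the exception and return True from `__exit__` to suppress it:

```python
class ErrorCollector:
    """Hypothetical stand-in for an exception-capturing context manager."""

    def __init__(self):
        self.successes = 0
        self.failures = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        if exc_type is None:
            self.successes += 1
            return False
        if not issubclass(exc_type, Exception):
            return False  # Let KeyboardInterrupt and friends propagate.
        self.failures.append((exc_type.__name__, str(exc_value)))
        return True  # Returning True suppresses the exception.


def to_int(s):
    # Same logic as _robust_int in the pipeline file.
    try:
        return int(s)
    except ValueError:
        return int(float(s))


collector = ErrorCollector()
clean = []
for raw in ["42.0", 64, "lamp"]:
    with collector:  # A failure here skips the append and moves on.
        clean.append(to_int(raw))

print(clean)                # [42, 64]
print(collector.successes)  # 2
print(collector.failures)   # [('ValueError', "could not convert string to float: 'lamp'")]
```

Because the exception is raised before `append` runs, the failing document never reaches the output list, which matches the two successes and one failure reported above.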

The stats show one failure, raised inside _robust_int. You can examine the entire set of errors captured for an offending function with:


In [6]:
vaq.examine('_robust_int')


Out[6]:
[{'call_args': ['lamp'],
  'exc_type': 'ValueError',
  'exc_value': "could not convert string to float: 'lamp'",
  'filename': 'username_pipeline.py',
  'lineno': 13,
  'name': '_robust_int'}]

More often than not, though, the exception values alone are sufficient:


In [7]:
vaq.examine('_robust_int', '[*].exc_value')


Out[7]:
["could not convert string to float: 'lamp'"]
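
The query string `'[*].exc_value'` reads like a JMESPath-style projection: take every record in the list and pull out one field. Whether vaquero evaluates it with JMESPath is an assumption here; the sketch below only shows the plain-Python equivalent, using the record from Out[6].

```python
from collections import Counter

# The error records as returned by the examine call in cell 6.
errors = [{'call_args': ['lamp'],
           'exc_type': 'ValueError',
           'exc_value': "could not convert string to float: 'lamp'",
           'filename': 'username_pipeline.py',
           'lineno': 13,
           'name': '_robust_int'}]

# Equivalent of projecting '[*].exc_value' over the list.
exc_values = [record['exc_value'] for record in errors]
print(exc_values)  # ["could not convert string to float: 'lamp'"]

# With many errors, counting by exception type is a handy first look.
by_type = Counter(record['exc_type'] for record in errors)
print(by_type)  # Counter({'ValueError': 1})
```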

Perhaps you see a bug in your code. Fix it in the pipeline's Python file, then run:


In [8]:
pipeline.reload()

And try again (edit username_pipeline.py, reload, and rerun the processing cell).
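
The edit-and-reload loop can be tried outside vaquero with the standard library's `importlib.reload`. Whether `pipeline.reload()` works this way internally is an assumption; the `demo_pipeline` module, the `to_int` function, and the bug being fixed are all made up for this sketch.

```python
import importlib
import sys
import tempfile
from pathlib import Path

BUGGY = "def to_int(s):\n    return int(s)\n"
FIXED = "def to_int(s):\n    return int(float(s))\n"


def demo_reload():
    """Import a throwaway module, edit it on disk, and reload it."""
    old_flag = sys.dont_write_bytecode
    sys.dont_write_bytecode = True  # Always recompile from source in this demo.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "demo_pipeline.py"  # hypothetical module name
        path.write_text(BUGGY)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module("demo_pipeline")
            try:
                module.to_int("42.0")
                failed_before = False
            except ValueError:
                failed_before = True

            path.write_text(FIXED)             # "Fix the bug" on disk.
            module = importlib.reload(module)  # Pick up the edited source.
            return failed_before, module.to_int("42.0")
        finally:
            # Undo the global changes made for the demo.
            sys.path.remove(tmp)
            sys.modules.pop("demo_pipeline", None)
            sys.dont_write_bytecode = old_flag


print(demo_reload())  # (True, 42)
```

The point is that the running session keeps its state (your data, your loop) while only the edited module is re-read.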

In the end, you have clean data and a semi-decent code base.


In [9]:
clean


Out[9]:
[{'age': 42, 'name': 'jack'}, {'age': 64, 'name': 'jill'}]
