This notebook demonstrates vaquero.

Let's say you are processing some HTML files for users. Someone on your team already used CSS selectors to extract the attributes into a list of dicts that looks like:



In [1]:

    
data = [{'user_name': "Jack", 'user_age': "42.0"},
        {'user_name': "Jill", 'user_age': 64},
        {'user_name': "Jane", 'user_age': "lamp"}]

You create a pipeline in a file named username_pipeline.py with contents:



In [2]:

    
!cat username_pipeline.py









    



from vaquero.transformations import sstrip


def extract_username(src_d, dst_d):
    # Copy the user's name, then normalize it.
    dst_d['name'] = sstrip(src_d['user_name']).lower()

def _robust_int(s):
    # Try to convert s into an int.
    try:
        return int(s)
    except ValueError:
        return int(float(s))

def extract_age(src_d, dst_d):
    # Extract the age as an int.
    dst_d['age'] = _robust_int(src_d['user_age'])
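
Before involving vaquero, it can help to see what the age helper does with the three sample values. The following is a minimal sketch that copies `_robust_int` from the file above and runs it outside any pipeline; the `results` dict and the loop are illustrative only and are not part of the notebook's pipeline.

```python
def _robust_int(s):
    # Same helper as in username_pipeline.py.
    try:
        return int(s)
    except ValueError:
        return int(float(s))


# The three user_age values from the sample data.
results = {}
for raw in ["42.0", 64, "lamp"]:
    try:
        results[raw] = _robust_int(raw)
    except ValueError as exc:
        # Record the message instead of stopping the loop.
        results[raw] = "ValueError: {}".format(exc)

print(results)
```

The string "42.0" only succeeds through the `float` fallback, while "lamp" fails in both branches, so the `ValueError` escapes from the `except` block.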



In [3]:

    
from vaquero import ModulePipeline, Vaquero
import username_pipeline

After importing the necessities, you:

1. create a Vaquero object, which gathers the results of your pipeline's applications;
2. create a module pipeline, which wraps and parses the Python module;
3. register the targets in the pipeline, so vaquero knows what to observe.



In [4]:

    
vaq = Vaquero()
pipeline = ModulePipeline(username_pipeline)
vaq.register_targets(pipeline)

Now you can run your pipeline over the data, piece by piece. I usually reset the vaq object at the top of the processing cell. Otherwise, it is easy to end up looking at stale errors from an earlier run, which happens a lot.



In [5]:

    
vaq.reset() 

clean = []
for doc in data:
    with vaq:  # Capture exceptions.
        d = {}
        pipeline(doc, d)
        clean.append(d)
        
vaq.stats()









    Out[5]:





{'failures': 1,
 'failures_by': {'_robust_int': 1},
 'ignored': 0,
 'successes': 2}
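
The `with vaq:` block above keeps the loop running when one document fails. As a rough illustration of that general pattern, and not of vaquero's actual implementation, here is a sketch of a context manager that swallows exceptions and tallies them. The `ErrorCollector` class and its attributes are hypothetical names invented for this example.

```python
class ErrorCollector:
    """Hypothetical stand-in for exception capture; not vaquero's code."""

    def __init__(self):
        self.successes = 0
        self.failures = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        if exc_type is None:
            self.successes += 1
            return False
        self.failures.append({"exc_type": exc_type.__name__,
                              "exc_value": str(exc_value)})
        # Suppress ordinary errors only; let KeyboardInterrupt etc. propagate.
        return issubclass(exc_type, Exception)


collector = ErrorCollector()
kept = []
for raw in ["42.0", 64, "lamp"]:
    with collector:
        age = int(float(raw))
        kept.append(age)  # Skipped when the line above raises.

print(collector.successes, len(collector.failures), kept)
```

Because the append sits inside the `with` block, a failing item never reaches the output list. That matches what the notebook shows: the record that raises is counted as a failure and is absent from the collected results.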

The stats show one failure, attributed to _robust_int. You can examine the entire set of errors for an offending function with:



In [6]:

    
vaq.examine('_robust_int')









    Out[6]:





[{'call_args': ['lamp'],
  'exc_type': 'ValueError',
  'exc_value': "could not convert string to float: 'lamp'",
  'filename': 'username_pipeline.py',
  'lineno': 13,
  'name': '_robust_int'}]

But, more often than not, the exception values are sufficient:



In [7]:

    
vaq.examine('_robust_int', '[*].exc_value')









    Out[7]:





["could not convert string to float: 'lamp'"]
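
The second argument reads like a JMESPath-style projection that takes the `exc_value` field of every error record. That reading is an assumption, since the notebook does not name the query language. Under that assumption, the same selection in plain Python over the records shown in Out[6] would look like this:

```python
# Error records copied from the Out[6] cell above.
errors = [{'call_args': ['lamp'],
           'exc_type': 'ValueError',
           'exc_value': "could not convert string to float: 'lamp'",
           'filename': 'username_pipeline.py',
           'lineno': 13,
           'name': '_robust_int'}]

# Plain-Python equivalent of projecting one field out of each record.
exc_values = [record['exc_value'] for record in errors]
print(exc_values)
```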

Perhaps you see a bug in your code. Fix it in the pipeline's Python file, then run:



In [8]:

    
pipeline.reload()

And try again.

(Edit username_pipeline.py)
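
The edit-and-reload cycle can be tried without vaquero. The sketch below assumes that reloading a pipeline is comparable to `importlib.reload` from the standard library, which the notebook does not confirm. It writes a throwaway module to a temporary directory, imports it, "fixes" it on disk, and reloads it. The module name `demo_age_module` and the function `parse_age` are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path

MODULE_NAME = "demo_age_module"  # Hypothetical module used only in this sketch.

with tempfile.TemporaryDirectory() as tmp_dir:
    module_file = Path(tmp_dir) / (MODULE_NAME + ".py")
    # First version: int() alone rejects strings such as "42.0".
    module_file.write_text("def parse_age(s):\n    return int(s)\n")
    sys.path.insert(0, tmp_dir)
    try:
        importlib.invalidate_caches()
        module = importlib.import_module(MODULE_NAME)
        try:
            module.parse_age("42.0")
            first_error = None
        except ValueError as exc:
            first_error = str(exc)

        # "Edit" the file on disk: the fixed version goes through float first.
        module_file.write_text(
            "def parse_age(s):\n    return int(float(s))\n")
        importlib.invalidate_caches()
        module = importlib.reload(module)
        fixed_value = module.parse_age("42.0")
    finally:
        # Leave the interpreter state as it was found.
        sys.path.remove(tmp_dir)
        sys.modules.pop(MODULE_NAME, None)

print(first_error)
print(fixed_value)
```

The point of reloading is that the running session picks up the edited source without a kernel restart, so previously loaded data stays in memory.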

In the end, you have clean data and a semi-decent code base. Note that the record whose age could not be parsed ("lamp") does not appear in the result.



In [9]:

    
clean









    Out[9]:





[{'age': 42, 'name': 'jack'}, {'age': 64, 'name': 'jill'}]


