Let's say you are processing some HTML files for users. Someone on your team already used CSS selectors to extract a list of attribute dicts that looks like:
In [1]:
data = [{'user_name': "Jack", 'user_age': "42.0"},
        {'user_name': "Jill", 'user_age': 64},
        {'user_name': "Jane", 'user_age': "lamp"}]
You create a pipeline in a file named username_pipeline.py with contents:
In [2]:
!cat username_pipeline.py
In [3]:
from vaquero import ModulePipeline, Vaquero
import username_pipeline
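After importing the necessities, you create a Vaquero object, wrap the module in a ModulePipeline, and register the pipeline with it: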
In [4]:
vaq = Vaquero()
pipeline = ModulePipeline(username_pipeline)
vaq.register_targets(pipeline)
Now you can run your pipeline over the data, piece by piece. I usually reset the vaq object at the top of the processing cell. That way, I'm not accidentally looking at stale errors, which happens a lot.
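The processing cell wraps each document in `with vaq:` so that one bad record does not stop the loop. The sketch below shows the general idea using a plain context manager. It is not Vaquero's implementation; the class `ErrorCollector` and its fields are hypothetical stand-ins.

```python
class ErrorCollector:
    """Hypothetical stand-in: records exceptions raised inside a with block."""

    def __init__(self):
        self.errors = []

    def reset(self):
        # Drop errors from earlier runs so they are not mistaken for new ones.
        self.errors = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        if exc_type is not None and issubclass(exc_type, Exception):
            self.errors.append({'exc_type': exc_type.__name__,
                                'exc_value': str(exc_value)})
            return True   # Suppress the exception so the loop continues.
        return False      # Nothing raised, or not an ordinary Exception.


collector = ErrorCollector()
collector.reset()

results = []
for raw in ["42.0", 64, "lamp"]:
    with collector:
        # If the conversion raises, the append is skipped for this item.
        results.append(int(float(raw)))
```

After the loop, `results` holds only the values that converted cleanly and `collector.errors` holds one record per failure. Resetting at the top of the cell is what keeps a rerun from mixing old failures with new ones.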
In [5]:
vaq.reset()
clean = []
for doc in data:
    with vaq:  # Capture exceptions.
        d = {}
        pipeline(doc, d)
        clean.append(d)
vaq.stats()
Out[5]:
The stats show one error. You can examine the entire set of errors for some offending function with:
In [6]:
vaq.examine('_robust_int')
Out[6]:
But, more often than not, the exception values are sufficient:
In [7]:
vaq.examine('_robust_int', '[*].exc_value')
Out[7]:
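The query string `'[*].exc_value'` reads like a JMESPath-style projection: for every error record, keep only its `exc_value` field. Assuming that reading, and assuming the records behave like dicts, the plain-Python equivalent is a list comprehension. The record layout below is hypothetical; the fields Vaquero actually stores may differ.

```python
# Hypothetical error records, shaped like a list of dicts.
records = [
    {'name': '_robust_int',
     'exc_type': 'ValueError',
     'exc_value': "could not convert string to float: 'lamp'"},
    {'name': '_robust_int',
     'exc_type': 'TypeError',
     'exc_value': "float() argument must be a string or a real number"},
]

# Equivalent of the projection '[*].exc_value': one field from every record.
exc_values = [record['exc_value'] for record in records]
```

Scanning just the messages is usually enough to spot a pattern, such as one malformed input format causing most of the failures.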
Perhaps you spot a bug in your code. Fix it in the pipeline's Python file, then reload the module:
In [8]:
pipeline.reload()
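How `pipeline.reload()` works internally is not shown here. In plain Python, the standard way to pick up edits to an already imported module is `importlib.reload`, and the sketch below demonstrates that mechanism on a throwaway module. The module name `demo_pipeline` and the function `parse_age` are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path

sys.dont_write_bytecode = True  # Avoid stale bytecode caches in this demo.

tmp_dir = tempfile.mkdtemp()
module_path = Path(tmp_dir) / "demo_pipeline.py"  # Hypothetical module.

# First version on disk: a converter that cannot handle "42.0".
module_path.write_text("def parse_age(s):\n    return int(s)\n")
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()
demo_pipeline = importlib.import_module("demo_pipeline")

try:
    demo_pipeline.parse_age("42.0")
    first_attempt = "ok"
except ValueError:
    first_attempt = "failed"

# Fix the file on disk, then reload instead of restarting the interpreter.
module_path.write_text("def parse_age(s):\n    return int(float(s))\n")
importlib.reload(demo_pipeline)
second_attempt = demo_pipeline.parse_age("42.0")
```

The point of reloading in a notebook is that the data and any captured errors stay in memory while only the code changes.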
And try again.
(Edit username_pipeline.py)
In the end, you have clean data and a semi-decent code base.
In [9]:
clean
Out[9]: