In [1]:

    
import sys



In [2]:

    
x









    



---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-2-401b30e3b8b5> in <module>()
----> 1 x

NameError: name 'x' is not defined



In [ ]:

I want to write code -- or, find an existing module! -- that reloads:

functions
functions that reference other functions
classes
classes that reference other functions or classes
modules

Basically, given your map_func and collector_func, reload all the code so the next iteration is clean.



In [5]:

    
%%html 

<div align="center"><blockquote class="twitter-tweet" lang="en"><p lang="en" dir="ltr">When I’m trying to fix a tiny bug. <a href="https://t.co/nml6ZS5quW">pic.twitter.com/nml6ZS5quW</a></p>&mdash; Mike Bostock (@mbostock) <a href="https://twitter.com/mbostock/status/661650359069208576">November 3, 2015</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script></div>









    





When I’m trying to fix a tiny bug. pic.twitter.com/nml6ZS5quW
— Mike Bostock (@mbostock) November 3, 2015

Add D3 Visualization showing MCMC over two different methods

Add PS (I'm looking for a job in NYC)

WHEREAMI?

Add and test function reloading
rename collector_func and mapper_func to mapper and collector
classes don't work, unless they take no arguments for initialization

General Strategy

Most scraping projects map a set of raw documents to clean ones. It sounds easy, but in practice their are inconsistencies that break code. For a large corpus of documents, this is frustrating. The cycle becomes: Run your code against the data; discover an inconsistency (i.e. your code breaks); alter some code; start running all over again.

Said more simply, cleaning scraped data is an iterative process. But, the tools tend to be bad for iterating. Write -> Compile -> Run is frictionless by comparison, at least for large datasets. Vaquero is a tool for iterative data cleaning.

Also

Use generators. On a failure, you can restart from where you left off. The failset should be a partial guard. Satisfactory flag.
Examples are your fixtures.
Assertions are self documenting
Grab only what you need now. This is iterative, remember.
Unobtrusive (the function doesn't need to be in Vaquero for production, although the asserts will slow things down. But, you can also use a PrePost wrapper so your production code is faster and the assertions are all in one place

Simplest Implementation

Failsets

Why would you disable dup checking?

Common data structures -- the kind you will probably convert your document into -- are mutable. As such, they are not hashable, so I can't use proper set data structures. Maintaining set semantics requires comparison over all elements in the collection, an O(n) operation. As the size of the collection grows, this may become time prohibitive. If the cost of running your processor over a fail set document is less than the cost of equality checking over the entire collection, this is useful.



In [1]:

    
from __future__ import print_function

import sys
sys.path.append('.')
from vaquero import *

Resuming Iterators

I assume that you have created a collection of documents. Or, more commonly, you created some sort of generator that yields documents. The resuming iterator fits well with the general processor pattern. It wraps the iterable, saving the state. Given an exception, the current point in the iteration persists.

Why is this useful? Remember that the processor saves anything that throws an exception in the failset. Prior to continuing iteration, you update your map_func so that the new failing case -- and all others in the failset -- pass. After which point, you move on to "green" documents. It is possible your alteration introduced a regression in the documents already visited but not in the failset. However, in my experience, it is more probable that the later cases require more refinement than the ones already seen.



In [2]:

    
items = (i for i in range(20) if i % 2 == 1)



In [3]:

    
mylist = ResumingIterator(items)



In [4]:

    
for i in mylist:
    assert i != 3
    print(i)









    



1






    



---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-4-74cf6d657fc7> in <module>()
      1 for i in mylist:
----> 2     assert i != 3
      3     print(i)

AssertionError:



In [5]:

    
for i in mylist:
    assert i != 2
    print(i)

Processing with processor(example)



In [6]:

    
f = Processor(int, print, PicklingFailSet("int.pickle"))



In [7]:

    
f("10")



In [8]:

    
f("10.0")









    



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-03b6f415cf73> in <module>()
----> 1 f("10.0")

/Users/generativist/Projects/vaquero/vaquero/__init__.py in __call__(self, document)
     71     def __call__(self, document):
     72         try:
---> 73             self.collector_func(self.map_func(document))
     74         except KeyboardInterrupt:
     75             # This is user requested, not a failure.

ValueError: invalid literal for int() with base 10: '10.0'



In [9]:

    
f("20.0")









    



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-b8eb98b7afeb> in <module>()
----> 1 f("20.0")

/Users/generativist/Projects/vaquero/vaquero/__init__.py in __call__(self, document)
     71     def __call__(self, document):
     72         try:
---> 73             self.collector_func(self.map_func(document))
     74         except KeyboardInterrupt:
     75             # This is user requested, not a failure.

ValueError: invalid literal for int() with base 10: '20.0'



In [12]:

    
f.fail_set.examples()









    Out[12]:





['10.0', '20.0']



In [104]:

    
examples = ['10', 20, 20.0, '20.0', '10.0']
for example in examples:
    f(example)









    



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-104-985df5c0afe9> in <module>()
      1 examples = ['10', 20, 20.0, '20.0', '10.0']
      2 for example in examples:
----> 3     f(example)

<ipython-input-95-36b40891743a> in __call__(self, document)
     51     def __call__(self, document):
     52         try:
---> 53             self.map_func(document)
     54         except KeyboardInterrupt:
     55             raise  # This is user requested, not a failure.

ValueError: invalid literal for int() with base 10: '20.0'



In [105]:

    
f.failing_examples()









    Out[105]:





['10.0', '20.0']

Processing with processor.work_through(iter)



In [94]:

    
x = [1,2,3]
i = iter(x)
next(i)









    Out[94]:





1



In [64]:

    
next(i)









    Out[64]:





2



In [ ]:

Next