In [5]:
%%html
<div align="center"><blockquote class="twitter-tweet" lang="en"><p lang="en" dir="ltr">When I’m trying to fix a tiny bug. <a href="https://t.co/nml6ZS5quW">pic.twitter.com/nml6ZS5quW</a></p>— Mike Bostock (@mbostock) <a href="https://twitter.com/mbostock/status/661650359069208576">November 3, 2015</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script></div>
Most scraping projects map a set of raw documents to clean ones. It sounds easy, but in practice there are inconsistencies that break code. For a large corpus of documents, this is frustrating. The cycle becomes: run your code against the data; discover an inconsistency (i.e. your code breaks); alter some code; start running all over again.
Put more simply, cleaning scraped data is an iterative process, but the tools tend to be bad for iterating. At least for large datasets, the usual Write -> Compile -> Run cycle is frictionless by comparison. Vaquero is a tool for iterative data cleaning.
Common data structures -- the kind you will probably convert your documents into -- are mutable. As such, they are not hashable, so I can't use a proper set data structure. Maintaining set semantics instead requires an equality comparison against every element in the collection, an O(n) operation. As the collection grows, this may become prohibitively slow. If the cost of running your processor over a fail set document is less than the cost of equality checking over the entire collection, this approach is useful.
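To make that concrete, here is a minimal sketch of list-backed set semantics for unhashable documents. This is not Vaquero's PicklingFailSet; the class name and methods are made up for illustration. Membership is decided by scanning the list with ==, so every add is O(n).
In [ ]:
class ListBackedFailSet(object):
    """Sketch only: set-like semantics for mutable, unhashable documents."""
    def __init__(self):
        self._docs = []

    def add(self, doc):
        # `in` performs an O(n) equality scan, because dicts and lists
        # cannot go into a real set.
        if doc not in self._docs:
            self._docs.append(doc)

    def examples(self):
        return list(self._docs)

fs = ListBackedFailSet()
fs.add({'raw': '10.0'})
fs.add({'raw': '10.0'})  # duplicate; the equality scan filters it out
len(fs.examples())       # 1
A pickling version would add persistence on top of this; the O(n) membership cost is the same either way.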
In [1]:
from __future__ import print_function
import sys
sys.path.append('.')
from vaquero import *
I assume you have created a collection of documents or, more commonly, some sort of generator that yields documents. The resuming iterator fits well with the general processor pattern. It wraps the iterable and saves its state: if an exception escapes, the current point in the iteration persists.
Why is this useful? Remember that the processor saves anything that throws an exception in the fail set. Before resuming iteration, you update your map_func so that the new failing case -- and all the others in the fail set -- pass. Only then do you move on to "green" documents. It is possible that your change introduced a regression in documents already visited but not in the fail set. In my experience, though, it is more likely that the later cases require more refinement than the ones already seen.
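The cells below exercise the real ResumingIterator. As a rough sketch of the idea only -- not Vaquero's implementation -- the wrapper below keeps the underlying iterator and the in-flight item on the object, and it assumes the item being processed when the exception escaped is re-yielded on resume, so nothing is silently skipped.
In [ ]:
class SketchResumingIterator(object):
    """Sketch only: iteration state lives on the object, so a new
    for-loop continues where the previous one stopped."""
    def __init__(self, iterable):
        self._it = iter(iterable)  # shared across all for-loops
        self._pending = None       # item the last loop body may have failed on
        self._finished_pending = True

    def __iter__(self):
        while True:
            if self._finished_pending:
                try:
                    self._pending = next(self._it)
                except StopIteration:
                    return
                self._finished_pending = False
            yield self._pending
            # Reached only when the loop body completed without raising,
            # so the pending item is marked done. (A `break` would also
            # leave the item pending; good enough for a sketch.)
            self._finished_pending = True
Under this sketch, the second loop below would see 3 again (the item that tripped the assert) before moving on to 5, 7, and so on.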
In [2]:
items = (i for i in range(20) if i % 2 == 1)
In [3]:
mylist = ResumingIterator(items)
In [4]:
for i in mylist:
    assert i != 3
    print(i)
In [5]:
for i in mylist:
    assert i != 2
    print(i)
In [6]:
f = Processor(int, print, PicklingFailSet("int.pickle"))
In [7]:
f("10")
In [8]:
f("10.0")
In [9]:
f("20.0")
In [12]:
f.fail_set.examples()
Out[12]:
In [104]:
examples = ['10', 20, 20.0, '20.0', '10.0']
for example in examples:
    f(example)
In [105]:
f.failing_examples()
Out[105]:
In [94]:
x = [1,2,3]
i = iter(x)
next(i)
Out[94]:
In [64]:
next(i)
Out[64]:
In [ ]: