Scraping and downloading data from the internet is often the first step of an experiment. Here is a simple Page class with a handful of helper methods that make this kind of work much simpler.
Async and multi-process crawling is much faster. I initially wrote the Engadget crawler as a single-threaded class. Because the Python requests library is synchronous, the crawler spent virtually all of its time waiting for each GET request to finish.
It could be made a *lot* faster by parallelizing the crawl, or by using a proper async pattern.
This occurred to me fairly late, during the second crawl, so I did not implement it here. A parallel/async crawler is on the todo list for future work; a rough sketch of the idea follows.
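As a minimal sketch of the parallel approach (not part of the Page class), a thread pool on top of requests is enough here, since the work is I/O bound and threads can overlap the waiting on GET requests. The `download` helper and the `urls` list are hypothetical stand-ins for the anchors that the Page class collects below.
In [ ]:
import os
from concurrent.futures import ThreadPoolExecutor

import requests

def download(url, dest_dir='./data'):
    """Stream one file to disk and return its local path."""
    fname = os.path.join(dest_dir, url.rsplit('/', 1)[-1])
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(fname, 'wb') as f:
            for chunk in r.iter_content(chunk_size=4096 * 2**4):
                f.write(chunk)
    return fname

urls = []  # hypothetical: fill with the .zip links scraped from the index page
with ThreadPoolExecutor(max_workers=8) as pool:
    for path in pool.map(download, urls):
        print('saved', path)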
In [ ]:
%load_ext autoreload
%autoreload 2
In [ ]:
%%bash
mkdir -p data
In [ ]:
from download_links import Page

# collect every .zip link on the index page, then download each archive
p = Page('http://www.manythings.org/anki/', debug=True)
p.set_mask('(.*).zip')
p.request()
for m, n in p.get_anchors():
    n_p = Page(p.url + m[0], debug=True)
    n_p.download('./data/' + m[0], 'wb', chunk_size=4096 * 2**4)
In [96]:
!jupyter nbconvert --to script "Language Pair Scraper.ipynb"
In [97]:
%%bash
source activate deep-learning
rm -f crawl.out
ipython 'Language Pair Scraper.py' > crawl.out
This doesn't work.
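One possible culprit (an assumption, since the failure isn't recorded here) is that a %%bash cell blocks until its command exits, so a long-running crawl never returns control to the notebook. A sketch of an alternative, assuming the script name produced by the nbconvert step above: launch the crawl as a background subprocess from Python, redirecting its output to crawl.out so the next cell can follow the log while it runs.
In [ ]:
import subprocess

# Launch the crawler as a background process; stdout/stderr go to crawl.out
# so `tail -f crawl.out` in the next cell can follow its progress.
with open('crawl.out', 'wb') as log:
    proc = subprocess.Popen(['python', 'Language Pair Scraper.py'],
                            stdout=log, stderr=subprocess.STDOUT)
print('crawler started, pid:', proc.pid)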
In [ ]:
%%bash
# follow the crawl log (interrupt the kernel to stop following)
tail -f crawl.out