Simple Downloader

Scraping and downloading data from the internet is often the first step of an experiment. Here is a simple Page class with a bunch of helper methods that make this kind of work much simpler.

Async and multi-process crawling is much faster. I initially wrote the Engadget crawler as a single-threaded class. Because the Python requests library is synchronous, the crawler spent virtually all of its time waiting for GET requests to return.

This could be made a *lot* faster by parallelizing the crawling or by using a proper async pattern.

This occurred to me fairly late during the second crawl, so I did not implement it. For future work, a parallel/async crawler is on the todo list.


TODO

  • [ ] use an async pattern for the requests, so that we don't spend 90% of the time waiting for GET requests to finish (see the sketch after this list).
  • [ ] use multiple threads to crawl.
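
A minimal sketch of both ideas follows, assuming the aiohttp and requests libraries are installed. fetch_all and fetch_all_threaded are hypothetical helpers for illustration, not part of the Page class.

import asyncio
import aiohttp
import requests
from concurrent.futures import ThreadPoolExecutor

async def fetch(session, url):
    # The GET yields control to the event loop instead of blocking.
    async with session.get(url) as resp:
        return await resp.read()

async def fetch_all(urls):
    # Run every GET concurrently; total time is roughly the slowest request.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

def fetch_all_threaded(urls, workers=8):
    # Thread-based alternative: the GIL is released during network I/O,
    # so the requests overlap even though requests itself is synchronous.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: requests.get(u).content, urls))

Either approach removes most of the idle time: the crawler issues all GETs up front and processes responses as they arrive, instead of waiting on each one in turn.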

In [ ]:
%load_ext autoreload
%autoreload 2

In [ ]:
%%bash 
mkdir -p data

In [ ]:
from download_links import Page

# Fetch the index page and collect the anchors that match the .zip mask.
p = Page('http://www.manythings.org/anki/', debug=True)
p.set_mask('(.*).zip')
p.request()

# Download each matched archive into ./data/, streaming in 64 KB chunks.
for m, n in p.get_anchors():
    n_p = Page(p.url + m[0], debug=True)
    n_p.download('./data/' + m[0], 'wb', chunk_size=4096*2**4)
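
For reference, here is a rough sketch of what the helpers used above could look like internally, assuming the requests library; the actual download_links.Page implementation may differ.

import re
import requests

class Page:
    def __init__(self, url, debug=False):
        self.url = url
        self.debug = debug
        self.mask = None
        self.text = None

    def set_mask(self, pattern):
        # Regex used to filter anchor hrefs, e.g. '(.*).zip'.
        self.mask = re.compile(pattern)

    def request(self):
        # Synchronous GET; this is where a single-threaded crawler stalls.
        self.text = requests.get(self.url).text

    def get_anchors(self):
        # Yield (match, href) pairs for hrefs that fit the mask.
        for href in re.findall(r'href="([^"]+)"', self.text or ''):
            match = self.mask.match(href) if self.mask else None
            if match:
                yield match, href

    def download(self, path, mode='wb', chunk_size=4096):
        # Stream the response to disk in chunks to keep memory usage flat.
        with requests.get(self.url, stream=True) as r, open(path, mode) as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)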

In [96]:
!jupyter nbconvert --to script "Language Pair Scraper.ipynb"


[NbConvertApp] Converting notebook Language Pair Scraper.ipynb to script
[NbConvertApp] Writing 1632 bytes to Language Pair Scraper.py

In [97]:
%%bash
source activate deep-learning
rm -f crawl.out
ipython 'Language Pair Scraper.py' > crawl.out


Process is interrupted.

This doesn't work: the cell blocks until the script finishes, so the run had to be interrupted.
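
One workaround is to launch the script as a background process from Python and then follow the log with the tail cell below. This is a sketch of the idea, not what was actually run here.

import subprocess

# Launch the crawler without blocking the notebook; output goes to crawl.out.
log = open('crawl.out', 'w')
proc = subprocess.Popen(
    ['ipython', 'Language Pair Scraper.py'],
    stdout=log,
    stderr=subprocess.STDOUT,
)
print('crawler pid:', proc.pid)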


In [ ]:
%%bash
tail -f crawl.out
