Scraping and downloading data from the internet is often the first step of an experiment. Here is a simple Page class with a handful of helper methods that make this kind of work much simpler.
Async and multi-process crawling is much faster. I initially wrote the Engadget crawler as a single-threaded class. Because the Python requests library is synchronous, the crawler spent virtually all of its time waiting for each GET request to finish.
It could be made a *lot* faster by parallelizing the crawl, or by using a proper async pattern.
This occurred to me fairly late, during the second crawl, so I did not implement it here. A parallel/async crawler is on the todo list for future work; a rough sketch of the idea follows.
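As a minimal sketch of the parallel approach (not part of the Page class), a thread pool on top of requests is enough here, since the work is I/O bound and threads can overlap the waiting on GET requests. The `download` helper and the `urls` list are hypothetical stand-ins for the anchors that the Page class collects below.
In [ ]:
import os
from concurrent.futures import ThreadPoolExecutor

import requests

def download(url, dest_dir='./data'):
    """Stream one file to disk and return its local path."""
    fname = os.path.join(dest_dir, url.rsplit('/', 1)[-1])
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(fname, 'wb') as f:
            for chunk in r.iter_content(chunk_size=4096 * 2**4):
                f.write(chunk)
    return fname

urls = []  # hypothetical: fill with the .zip links scraped from the index page
with ThreadPoolExecutor(max_workers=8) as pool:
    for path in pool.map(download, urls):
        print('saved', path)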
In [ ]:
%load_ext autoreload
%autoreload 2
In [ ]:
%%bash
mkdir -p data
In [ ]:
from download_links import Page

# collect every .zip link on the index page, then download each archive
p = Page('http://www.manythings.org/anki/', debug=True)
p.set_mask('(.*).zip')
p.request()
for m, n in p.get_anchors():
    n_p = Page(p.url + m[0], debug=True)
    n_p.download('./data/' + m[0], 'wb', chunk_size=4096 * 2**4)
In [96]:
!jupyter nbconvert --to script "Language Pair Scraper.ipynb"
In [97]:
%%bash
source activate deep-learning
rm -f crawl.out
ipython 'Language Pair Scraper.py' > crawl.out
This doesn't work.
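One possible culprit (an assumption, since the failure isn't recorded here) is that a %%bash cell blocks until its command exits, so a long-running crawl never returns control to the notebook. A sketch of an alternative, assuming the script name produced by the nbconvert step above: launch the crawl as a background subprocess from Python, redirecting its output to crawl.out so the next cell can follow the log while it runs.
In [ ]:
import subprocess

# Launch the crawler as a background process; stdout/stderr go to crawl.out
# so `tail -f crawl.out` in the next cell can follow its progress.
with open('crawl.out', 'wb') as log:
    proc = subprocess.Popen(['python', 'Language Pair Scraper.py'],
                            stdout=log, stderr=subprocess.STDOUT)
print('crawler started, pid:', proc.pid)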
In [ ]:
%%bash
# follow the crawl log (interrupt the kernel to stop following)
tail -f crawl.out