Scraping and downloading data from the internet is commonly the first step of an experiment. Here is a simple Page class with a bunch of helper methods that make this type of work much simpler.

Async and multi-process crawling is much faster. I initially wrote the Engadget crawler as a single-threaded class. Because the Python requests library is synchronous, the crawler spent virtually all of its time waiting for GET requests.

This could be made a *lot* faster by parallelizing the crawl or by using a proper async pattern.

This thought came to me pretty late, during the second crawl, so I did not implement it. For future work, a parallel and async crawler is on the todo list.


  • [ ] use an async pattern for the requests, so that we don't spend 90% of the time waiting for GET requests to finish.
  • [ ] use multiple threads to crawl.
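
As a starting point for the multi-threaded item, here is a minimal sketch using only the standard library. It is not part of the Page class: `fetch` and `crawl_parallel` are hypothetical names, and the blocking fetch function is passed in so that a requests-based one could be swapped in. Because the work is I/O-bound, a thread pool lets the GET requests overlap instead of running one after another.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Blocking GET (hypothetical helper); returns the response body as bytes."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def crawl_parallel(urls, fetch_fn=fetch, max_workers=8):
    """Fetch all URLs on a thread pool.

    Returns a dict mapping each URL to its body, or to the exception
    raised while fetching it, so one bad link does not stop the crawl.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything up front; the pool caps concurrency at max_workers.
        futures = {pool.submit(fetch_fn, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep crawling, record the failure
                results[url] = exc
    return results
```

With `max_workers=8`, up to eight requests are in flight at once. A sensible value depends on the target site and on how polite the crawler should be.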

%load_ext autoreload
%autoreload 2

mkdir data

from download_links import Page

p = Page('', debug=True)

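for m, n in p.get_anchors():
    n_p = Page(p.url + m[0], debug=True)
    # ASSUMPTION: the method name was lost from this line; `download` is a hypothetical stand-in.
    n_p.download('./data/' + m[0], 'wb', chunk_size=4096*2**4)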
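
The internals of Page are not shown here, so the following is only a sketch of what a chunked download to disk might look like, using the standard library. `save_stream` and `download` are hypothetical names. The point of the `chunk_size` argument is that the file is written piece by piece rather than being held in memory all at once.

```python
import shutil
from urllib.request import urlopen

CHUNK_SIZE = 4096 * 2**4  # 64 KiB, the same value used in the loop above


def save_stream(src, path, chunk_size=CHUNK_SIZE):
    """Copy a binary file-like object to `path` in fixed-size chunks."""
    with open(path, 'wb') as dst:
        shutil.copyfileobj(src, dst, chunk_size)


def download(url, path, chunk_size=CHUNK_SIZE):
    """Stream the body of `url` to `path` without loading it all into memory."""
    with urlopen(url) as resp:
        save_stream(resp, path, chunk_size)
```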

!jupyter nbconvert --to script "Language Pair Scraper.ipynb"

source activate deep-learning
rm crawl.out
ipython 'Language Pair' > crawl.out

