ScrapyDo is a crochet-based blocking API for Scrapy. It allows using Scrapy as a library, and is mainly aimed at spider prototyping and data exploration in IPython notebooks.
In this notebook we are going to show how to use scrapydo and how it helps to rapidly crawl and explore data. Our main premise is that we want to crawl the internet as a means to analyze data, not as an end in itself.
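ScrapyDo is distributed on PyPI, so if it is not installed yet a pip install should be enough (package name assumed to be scrapydo):
In [ ]:
# Install scrapydo; it builds on crochet and Scrapy.
!pip install scrapydo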
The setup function must be called before calling any other scrapydo function.
In [1]:
import scrapydo
scrapydo.setup()
The fetch function returns a Scrapy Response object for the given URL.
In [2]:
response = scrapydo.fetch("http://httpbin.org/get?show_env=1")
response
Out[2]:
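The returned object is a regular Scrapy Response, so the usual attributes and selector methods are available for inspection:
In [ ]:
# Standard Response attributes: status code, final URL and a header lookup.
print(response.status)
print(response.url)
print(response.headers.get('Content-Type'))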
The highlight function is a helper to highlight text content using the pygments module. It is very useful to inspect text content.
In [3]:
from scrapydo.utils import highlight
highlight(response.body, 'json')
Out[3]:
In [4]:
response = scrapydo.fetch("http://httpbin.org")
highlight(response.body[:300])
Out[4]:
In [5]:
highlight(response.css('p').extract())
Out[5]:
In [6]:
highlight(response.headers, 'python')
Out[6]:
Here we are going to show how to crawl a URL without defining a spider class, using only callback functions. This is very useful for quick crawling and data exploration.
In [7]:
# Some additional imports for our data exploration.
%matplotlib inline
import matplotlib.pylab as plt
import pandas as pd
import seaborn as sns
sns.set(context='poster', style='ticks')
We replicate the example on scrapy.org by defining two callback functions to crawl the website http://blog.scrapinghub.com.
The function parse_blog(response) is going to extract the listing URLs and the function parse_titles(response) is going to extract the post titles from each listing page.
In [8]:
import scrapy

def parse_blog(response):
    # Follow the archive listing pages matching /YYYY/MM/.
    for url in response.css('ul li a::attr("href")').re(r'/\d\d\d\d/\d\d/$'):
        yield scrapy.Request(response.urljoin(url), parse_titles)

def parse_titles(response):
    # Extract the post titles from each listing page.
    for post_title in response.css('div.entries > ul > li a::text').extract():
        yield {'title': post_title}
Once we have our callback functions for our target website, we simply call scrapydo.crawl:
In [9]:
items = scrapydo.crawl('http://blog.scrapinghub.com', parse_blog)
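A quick way to sanity-check the crawl before loading the items into pandas is to count them and peek at a few, reusing the highlight helper:
In [ ]:
print(len(items))  # number of scraped items
# Each scraped item is a plain dict with a 'title' key.
highlight(items[:3], 'python')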
Now that we have our data, we can start the fun part! Here we show the post title length distribution.
In [10]:
df = pd.DataFrame(items)
df['length'] = df['title'].apply(len)
df[:5]
Out[10]:
In [11]:
ax = df['length'].plot(kind='hist', bins=11)
ax2 = df['length'].plot(kind='kde', secondary_y=True, ax=ax)
ax2.set(ylabel="density")
ax.set(title="Title length distribution", xlim=(10, 80), ylabel="posts", xlabel="length");
The previous section showed how to do quick crawls to retrieve data. In this section we are going to show how to run spiders from existing scrapy projects, which can be useful for rapid spider prototyping as well as analysing the crawled data from a given spider.
We use a modified dirbot project, which is already accessible through the PYTHONPATH.
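If the project were not already on the path, its directory could be added manually (the path below is only a placeholder):
In [ ]:
import sys
# Placeholder path; point this at your local dirbot checkout.
sys.path.insert(0, '/path/to/dirbot')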
In [12]:
import os
os.environ['SCRAPY_SETTINGS_MODULE'] = 'dirbot.settings'
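As a quick sanity check, Scrapy's get_project_settings reads SCRAPY_SETTINGS_MODULE, so it should now return the dirbot project settings:
In [ ]:
from scrapy.utils.project import get_project_settings
# BOT_NAME comes from dirbot.settings if the module was picked up correctly.
get_project_settings().get('BOT_NAME')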
We want to see the logging output, just as the scrapy crawl command would do. Hence we set the log level to INFO.
In [13]:
import logging
logging.root.setLevel(logging.INFO)
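If the full INFO output is too chatty, individual Scrapy loggers (named after their module paths) can be turned down with the standard logging API:
In [ ]:
import logging
# Example: keep INFO globally but silence the engine's per-request messages.
logging.getLogger('scrapy.core.engine').setLevel(logging.WARNING)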
The run_spider function allows running any spider class with custom settings.
In [14]:
from dirbot.spiders import dmoz
items = scrapydo.run_spider(dmoz.DmozSpider, settings={'CLOSESPIDER_ITEMCOUNT': 500})
In this way, there is less friction in using Scrapy to mine data from the web, and we can quickly start exploring our data.
In [15]:
highlight(items[:3], 'python')
Out[15]:
In [16]:
from urlparse import urlparse
dmoz_items = pd.DataFrame(items)
dmoz_items['domain'] = dmoz_items['url'].apply(lambda url: urlparse(url).netloc.replace('www.', ''))
ax = dmoz_items.groupby('domain').apply(len).sort(inplace=False)[-10:].plot(kind='bar')
ax.set(title="Top 10 domains")
plt.setp(ax.xaxis.get_majorticklabels(), rotation=30);