Scrapy 2: using the Spider

Scrapy is a powerful Python web scraping framework. We will experience its power today by scraping quotes.toscrape.com. To understand how Scrapy works, we first need to create a Scrapy project. For that purpose, open the command prompt, navigate to your usual working directory (e.g. the Data_scraping folder) and run the following command:

scrapy startproject Quotes
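
This command typically generates a project skeleton along the following lines (the comments here are added for orientation; minor details may differ between Scrapy versions):

Quotes/
    scrapy.cfg            # deploy configuration file
    Quotes/               # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # the folder where your spiders will live
            __init__.py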

This command will generate a new folder titled Quotes with several files and subfolders. What you should be interested in for now is the folder called spiders, which sits inside another folder again titled Quotes. This folder holds the scrapers that you use (none for now). As the scrapers usually get data by crawling the web, they are called spiders. To generate our first spider, change the command prompt's directory to the newly created project folder using the following command:

cd Quotes

Afterwards, run the following command to generate a spider based on the default basic template.

scrapy genspider QuoteScraper quotes.toscrape.com

The third argument is the name of the scraper, while the last argument provides the overall domain you are allowed to scrape pages from. Once the command has run, a QuoteScraper.py file will appear in the spiders folder mentioned above. Open the file and start editing. The initial file already includes the general structure: the allowed_domains and start_urls variables are filled in based on the domain provided above (the 4th argument). (Note that in the code below the class and its name attribute appear as QuoteSpider and "quote"; that name is what you will later pass to the crawl command.) Yet, there is nothing inside the defined parse() function. Let's fill it in.


In [1]:
# -*- coding: utf-8 -*-
import scrapy


class QuoteSpider(scrapy.Spider):
    name = "quote"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass

What we want is to get the data (response.body) saved into an HTML file; thus, we add two lines of code at the end. Note that response.body is a bytes object, so the file has to be opened in binary mode ('wb').


In [2]:
# -*- coding: utf-8 -*-
import scrapy


class QuoteSpider(scrapy.Spider):
    name = "quote"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # response.body is bytes, so open the file in binary mode
        with open('scraped.html', 'wb') as f:
            f.write(response.body)
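
To try the spider out, go to the project folder in the command prompt and run the crawl command with the spider's name (the name attribute defined in the class, not the file name):

scrapy crawl quote

Once the spider finishes, a scraped.html file with the page source should appear in the folder you ran the command from.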

What if you want to scrape two pages and save each body as an HTML file (with a proper filename)? That's again easy; one just needs to do two things:

  1. add both URLs to the start_urls list as done below,
  2. create a filename variable which takes the second-to-last character of the page URL (the page number: 1 or 2 in our case) and inserts it into the filename, as illustrated right after this list.
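
As a quick illustration of that slicing (plain Python, independent of Scrapy):

url = 'http://quotes.toscrape.com/page/2/'
url[-2:-1]    # returns '2', so the resulting file will be named quotes2.html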

In [3]:
# -*- coding: utf-8 -*-
import scrapy

class QuoteSpider(scrapy.Spider):
    name = "quote"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/page/1/',
                  'http://quotes.toscrape.com/page/2/']

    def parse(self, response):
        # take the page number (second-to-last character of the URL) and build the filename
        filename = "quotes" + response.url[-2:-1] + ".html"
        # response.body is bytes, so write in binary mode
        with open(filename, 'wb') as f:
            f.write(response.body)

That's cool, but it is not much of a scraping job yet. We get the page, but not the data of interest. Let's assume one is interested in getting the following data: the quote, its author and the keyword tags. The following spider will do the job (everything is the same except the parse() function):


In [4]:
# -*- coding: utf-8 -*-
import scrapy


class QuoteSpider(scrapy.Spider):
    name = "quote"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/page/1/',
                  'http://quotes.toscrape.com/page/2/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
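
Each iteration of the loop above yields a plain Python dictionary. Its shape is sketched below with placeholder values (not real scraped data): text and author are single strings, while tags is a list of strings.

{
    'text': '<the quote text>',
    'author': '<the author name>',
    'tags': ['<tag 1>', '<tag 2>'],
}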

The yield keyword above is similar to return, yet instead of ending the function with a single value, it produces one item at a time and then continues from where it left off, without keeping everything in memory (which is computationally efficient). This is helpful when you want to write values into a file one by one and then forget about them, and that is exactly how Scrapy consumes them.
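
As a minimal, Scrapy-independent sketch of that behaviour (the function below is made up purely for illustration), a function containing yield becomes a generator that hands out its values one by one:

def numbers():
    for i in range(3):
        yield i          # produce one value, then pause here until the next one is requested

for n in numbers():
    print(n)             # prints 0, 1 and 2, one value at a time

Once the parse() function yields items like this, you can write the scraped values into a JSON file by simply running the following command in the command prompt: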

scrapy crawl quote -o quotes.json

The output will be a JSON file with the scraped data. If you are interested in getting a JSON Lines document instead, just change the file extension from .json to .jl.
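
For example, the following command would produce a JSON Lines file instead:

scrapy crawl quote -o quotes.jl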

It is important to note that all those changes happened in the QuoteScraper.py file, while there are some other files also generated by Scrapy. One of them is titled settings.py, which contains the project settings one can change. The most important components are probably:

  • BOT_NAME = 'quotes' - the name of the bot, which websites being scraped can use to recognize it,
  • ROBOTSTXT_OBEY = True - tells the spider to obey the robots.txt file, i.e. not to scrape pages where it is not allowed to,
  • DOWNLOAD_DELAY = 3 - the number of seconds the spider waits between consecutive requests.
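
Inside settings.py the entries above look roughly as follows (in a freshly generated project the DOWNLOAD_DELAY line is usually commented out, so you may need to uncomment or add it):

# settings.py (excerpt)
BOT_NAME = 'quotes'

ROBOTSTXT_OBEY = True

DOWNLOAD_DELAY = 3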