Intro to Scrapy

Scrapy is a Python framework for data scraping. In short, it combines almost everything we have learnt so far: requests, CSS selectors (BeautifulSoup), XPath (lxml), regex (re), and even checking robots.txt or putting the scraper to sleep.

Because Scrapy is a framework, its code is normally not written inside a Jupyter Notebook. To mimic Scrapy's behavior inside the Notebook, we need a few additional imports that would not be required otherwise.

Key points:

  • response - the object that holds the page source as a Scrapy object ready to be scraped,
  • response.css() - the CSS approach to scraping (as with BeautifulSoup),
  • response.xpath() - the XPath approach to scraping (as with lxml),
  • extract() - extracts all elements that match the selector (returns a list of strings),
  • extract_first() - extracts the first element that matches the selector (returns a single string, or None if nothing matches),
  • response.css("a::text").extract_first() - returns the text of the first link matched (CSS),
  • response.xpath("//a/text()").extract_first() - returns the text of the first link matched (XPath),
  • response.css('a::attr(href)').extract_first() - returns the href attribute (URL) of the first link matched (CSS),
  • response.xpath("//a/@href").extract_first() - returns the href attribute (URL) of the first link matched (XPath).

In [1]:
import requests
from scrapy.http import TextResponse

In [2]:
url = "http://quotes.toscrape.com/"
r = requests.get(url)
#wrap the downloaded HTML in a Scrapy response so that .css() and .xpath() become available
response = TextResponse(r.url, body=r.text, encoding="utf-8")

In [3]:
response


Out[3]:
<200 http://quotes.toscrape.com/>

In [10]:
#get heading (it is the first link on the page)-css
response.css("a").extract_first()


Out[10]:
'<a href="/" style="text-decoration: none">Quotes to Scrape</a>'

In [13]:
#get heading (it is the first link on the page)-xpath
response.xpath("//a").extract_first()


Out[13]:
'<a href="/" style="text-decoration: none">Quotes to Scrape</a>'

In [16]:
#get authors-css
response.css("small::text").extract()


Out[16]:
['Albert Einstein',
 'J.K. Rowling',
 'Albert Einstein',
 'Jane Austen',
 'Marilyn Monroe',
 'Albert Einstein',
 'André Gide',
 'Thomas A. Edison',
 'Eleanor Roosevelt',
 'Steve Martin']

In [17]:
#get authors-xpath
response.xpath("//small/text()").extract()


Out[17]:
['Albert Einstein',
 'J.K. Rowling',
 'Albert Einstein',
 'Jane Austen',
 'Marilyn Monroe',
 'Albert Einstein',
 'André Gide',
 'Thomas A. Edison',
 'Eleanor Roosevelt',
 'Steve Martin']

In [19]:
#heading-css
response.css('a[style="text-decoration: none"]').extract()


Out[19]:
['<a href="/" style="text-decoration: none">Quotes to Scrape</a>']

In [20]:
#heading-css text only
response.css('a[style="text-decoration: none"]::text').extract()


Out[20]:
['Quotes to Scrape']

In [21]:
#heading-css href only
response.css('a[style="text-decoration: none"]::attr(href)').extract()


Out[21]:
['/']

In [23]:
#tag text css
response.css("a[class='tag']::text").extract()


Out[23]:
['change',
 'deep-thoughts',
 'thinking',
 'world',
 'abilities',
 'choices',
 'inspirational',
 'life',
 'live',
 'miracle',
 'miracles',
 'aliteracy',
 'books',
 'classic',
 'humor',
 'be-yourself',
 'inspirational',
 'adulthood',
 'success',
 'value',
 'life',
 'love',
 'edison',
 'failure',
 'inspirational',
 'paraphrased',
 'misattributed-eleanor-roosevelt',
 'humor',
 'obvious',
 'simile',
 'love',
 'inspirational',
 'life',
 'humor',
 'books',
 'reading',
 'friendship',
 'friends',
 'truth',
 'simile']

In [24]:
#tag url css
response.css("a[class='tag']::attr(href)").extract()


Out[24]:
['/tag/change/page/1/',
 '/tag/deep-thoughts/page/1/',
 '/tag/thinking/page/1/',
 '/tag/world/page/1/',
 '/tag/abilities/page/1/',
 '/tag/choices/page/1/',
 '/tag/inspirational/page/1/',
 '/tag/life/page/1/',
 '/tag/live/page/1/',
 '/tag/miracle/page/1/',
 '/tag/miracles/page/1/',
 '/tag/aliteracy/page/1/',
 '/tag/books/page/1/',
 '/tag/classic/page/1/',
 '/tag/humor/page/1/',
 '/tag/be-yourself/page/1/',
 '/tag/inspirational/page/1/',
 '/tag/adulthood/page/1/',
 '/tag/success/page/1/',
 '/tag/value/page/1/',
 '/tag/life/page/1/',
 '/tag/love/page/1/',
 '/tag/edison/page/1/',
 '/tag/failure/page/1/',
 '/tag/inspirational/page/1/',
 '/tag/paraphrased/page/1/',
 '/tag/misattributed-eleanor-roosevelt/page/1/',
 '/tag/humor/page/1/',
 '/tag/obvious/page/1/',
 '/tag/simile/page/1/',
 '/tag/love/',
 '/tag/inspirational/',
 '/tag/life/',
 '/tag/humor/',
 '/tag/books/',
 '/tag/reading/',
 '/tag/friendship/',
 '/tag/friends/',
 '/tag/truth/',
 '/tag/simile/']

In [28]:
#tag text xpath
response.xpath("//a[@class='tag']/text()").extract()


Out[28]:
['change',
 'deep-thoughts',
 'thinking',
 'world',
 'abilities',
 'choices',
 'inspirational',
 'life',
 'live',
 'miracle',
 'miracles',
 'aliteracy',
 'books',
 'classic',
 'humor',
 'be-yourself',
 'inspirational',
 'adulthood',
 'success',
 'value',
 'life',
 'love',
 'edison',
 'failure',
 'inspirational',
 'paraphrased',
 'misattributed-eleanor-roosevelt',
 'humor',
 'obvious',
 'simile',
 'love',
 'inspirational',
 'life',
 'humor',
 'books',
 'reading',
 'friendship',
 'friends',
 'truth',
 'simile']

In [30]:
#tag url xpath
response.xpath("//a[@class='tag']/@href").extract()


Out[30]:
['/tag/change/page/1/',
 '/tag/deep-thoughts/page/1/',
 '/tag/thinking/page/1/',
 '/tag/world/page/1/',
 '/tag/abilities/page/1/',
 '/tag/choices/page/1/',
 '/tag/inspirational/page/1/',
 '/tag/life/page/1/',
 '/tag/live/page/1/',
 '/tag/miracle/page/1/',
 '/tag/miracles/page/1/',
 '/tag/aliteracy/page/1/',
 '/tag/books/page/1/',
 '/tag/classic/page/1/',
 '/tag/humor/page/1/',
 '/tag/be-yourself/page/1/',
 '/tag/inspirational/page/1/',
 '/tag/adulthood/page/1/',
 '/tag/success/page/1/',
 '/tag/value/page/1/',
 '/tag/life/page/1/',
 '/tag/love/page/1/',
 '/tag/edison/page/1/',
 '/tag/failure/page/1/',
 '/tag/inspirational/page/1/',
 '/tag/paraphrased/page/1/',
 '/tag/misattributed-eleanor-roosevelt/page/1/',
 '/tag/humor/page/1/',
 '/tag/obvious/page/1/',
 '/tag/simile/page/1/',
 '/tag/love/',
 '/tag/inspirational/',
 '/tag/life/',
 '/tag/humor/',
 '/tag/books/',
 '/tag/reading/',
 '/tag/friendship/',
 '/tag/friends/',
 '/tag/truth/',
 '/tag/simile/']

In [7]:
#get page title (with tags)-css
response.css("title").extract_first()


Out[7]:
'<title>Quotes to Scrape</title>'

In [9]:
#the regex runs on the element's HTML string, so it matches both the opening and the closing tag
response.css("title").re("title")


Out[9]:
['title', 'title']

In [17]:
#regex to get text between tags
response.css("title").re('.+>(.+)<.+')


Out[17]:
['Quotes to Scrape']
