Intro to Scrapy

Scrapy is a Python framework for web scraping. In short, it combines almost everything we have learnt so far: requests, CSS selectors (BeautifulSoup), XPath (lxml), regex (re), and even checking robots.txt or putting the scraper to sleep.
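
Two of the habits mentioned above, checking robots.txt and pausing between requests, need nothing beyond the standard library. The sketch below is only an illustration: the robots.txt content is hypothetical and inlined as a string so the code runs offline, and the example.com URLs are placeholders. In a real Scrapy project, the ROBOTSTXT_OBEY and DOWNLOAD_DELAY settings cover the same ground.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 1
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def polite_urls(urls, user_agent="*"):
    """Return only the URLs that robots.txt allows for user_agent."""
    return [u for u in urls if parser.can_fetch(user_agent, u)]


urls = ["http://example.com/", "http://example.com/private/data"]
allowed = polite_urls(urls)
delay = parser.crawl_delay("*")  # None when no Crawl-delay is declared

for url in allowed:
    # a real scraper would download `url` here
    time.sleep(0.01)  # shortened for the demo; sleep `delay` seconds in practice

print(allowed, delay)
```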

Because Scrapy is a framework, one does not normally write Scrapy code inside a Jupyter Notebook. To mimic Scrapy's behavior inside the Notebook, we need a few additional imports that would not be required otherwise.

Key points:

response - the object that holds the page source as a Scrapy element ready to be scraped,
response.css() - the CSS approach to scraping (as with BeautifulSoup),
response.xpath() - the XPath approach to scraping (as with lxml),
extract() - extracts all elements satisfying some condition (returns a list),
extract_first() - extracts the first element satisfying some condition (returns a single element),
response.css("a::text").extract_first() - returns the text of the first link matched (CSS),
response.xpath("//a/text()").extract_first() - returns the text of the first link matched (XPath),
response.css('a::attr(href)').extract_first() - returns the href attribute (URL) of the first link matched (CSS),
response.xpath("//a/@href").extract_first() - returns the href attribute (URL) of the first link matched (XPath).



In [1]:

    
import requests
from scrapy.http import TextResponse



In [2]:

    
url = "http://quotes.toscrape.com/"
r = requests.get(url)
# wrap the downloaded HTML so that Scrapy's .css() and .xpath() can be used on it
response = TextResponse(r.url, body=r.text, encoding="utf-8")



In [3]:

    
response









    Out[3]:





<200 http://quotes.toscrape.com/>



In [10]:

    
# get heading (the first link on the page) - CSS
response.css("a").extract_first()









    Out[10]:





'<a href="/" style="text-decoration: none">Quotes to Scrape</a>'



In [13]:

    
# get heading (the first link on the page) - XPath
response.xpath("//a").extract_first()









    Out[13]:





'<a href="/" style="text-decoration: none">Quotes to Scrape</a>'



In [16]:

    
#get authors-css
response.css("small::text").extract()









    Out[16]:





['Albert Einstein',
 'J.K. Rowling',
 'Albert Einstein',
 'Jane Austen',
 'Marilyn Monroe',
 'Albert Einstein',
 'André Gide',
 'Thomas A. Edison',
 'Eleanor Roosevelt',
 'Steve Martin']



In [17]:

    
#authors-xpath
response.xpath("//small/text()").extract()









    Out[17]:





['Albert Einstein',
 'J.K. Rowling',
 'Albert Einstein',
 'Jane Austen',
 'Marilyn Monroe',
 'Albert Einstein',
 'André Gide',
 'Thomas A. Edison',
 'Eleanor Roosevelt',
 'Steve Martin']



In [19]:

    
#heading-css
response.css('a[style="text-decoration: none"]').extract()









    Out[19]:





['<a href="/" style="text-decoration: none">Quotes to Scrape</a>']



In [20]:

    
#heading-css text only
response.css('a[style="text-decoration: none"]::text').extract()









    Out[20]:





['Quotes to Scrape']



In [21]:

    
#heading-css href only
response.css('a[style="text-decoration: none"]::attr(href)').extract()









    Out[21]:





['/']



In [23]:

    
#tag text css
response.css("a[class='tag']::text").extract()









    Out[23]:





['change',
 'deep-thoughts',
 'thinking',
 'world',
 'abilities',
 'choices',
 'inspirational',
 'life',
 'live',
 'miracle',
 'miracles',
 'aliteracy',
 'books',
 'classic',
 'humor',
 'be-yourself',
 'inspirational',
 'adulthood',
 'success',
 'value',
 'life',
 'love',
 'edison',
 'failure',
 'inspirational',
 'paraphrased',
 'misattributed-eleanor-roosevelt',
 'humor',
 'obvious',
 'simile',
 'love',
 'inspirational',
 'life',
 'humor',
 'books',
 'reading',
 'friendship',
 'friends',
 'truth',
 'simile']



In [24]:

    
#tag url css
response.css("a[class='tag']::attr(href)").extract()









    Out[24]:





['/tag/change/page/1/',
 '/tag/deep-thoughts/page/1/',
 '/tag/thinking/page/1/',
 '/tag/world/page/1/',
 '/tag/abilities/page/1/',
 '/tag/choices/page/1/',
 '/tag/inspirational/page/1/',
 '/tag/life/page/1/',
 '/tag/live/page/1/',
 '/tag/miracle/page/1/',
 '/tag/miracles/page/1/',
 '/tag/aliteracy/page/1/',
 '/tag/books/page/1/',
 '/tag/classic/page/1/',
 '/tag/humor/page/1/',
 '/tag/be-yourself/page/1/',
 '/tag/inspirational/page/1/',
 '/tag/adulthood/page/1/',
 '/tag/success/page/1/',
 '/tag/value/page/1/',
 '/tag/life/page/1/',
 '/tag/love/page/1/',
 '/tag/edison/page/1/',
 '/tag/failure/page/1/',
 '/tag/inspirational/page/1/',
 '/tag/paraphrased/page/1/',
 '/tag/misattributed-eleanor-roosevelt/page/1/',
 '/tag/humor/page/1/',
 '/tag/obvious/page/1/',
 '/tag/simile/page/1/',
 '/tag/love/',
 '/tag/inspirational/',
 '/tag/life/',
 '/tag/humor/',
 '/tag/books/',
 '/tag/reading/',
 '/tag/friendship/',
 '/tag/friends/',
 '/tag/truth/',
 '/tag/simile/']
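
The href values above are relative paths, so they have to be joined with the site's address before they can be requested. A minimal sketch using the standard library's urljoin, with a few of the paths from the output above:

```python
from urllib.parse import urljoin

base_url = "http://quotes.toscrape.com/"
relative_links = ["/tag/change/page/1/", "/tag/love/", "/"]

# resolve every relative path against the page it was found on
absolute_links = [urljoin(base_url, href) for href in relative_links]
print(absolute_links)
```

Scrapy responses also provide a urljoin() method that resolves a link against the response's own URL, so response.urljoin("/tag/love/") should give the same kind of result.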



In [28]:

    
#tag text xpath
response.xpath("//a[@class='tag']/text()").extract()









    Out[28]:





['change',
 'deep-thoughts',
 'thinking',
 'world',
 'abilities',
 'choices',
 'inspirational',
 'life',
 'live',
 'miracle',
 'miracles',
 'aliteracy',
 'books',
 'classic',
 'humor',
 'be-yourself',
 'inspirational',
 'adulthood',
 'success',
 'value',
 'life',
 'love',
 'edison',
 'failure',
 'inspirational',
 'paraphrased',
 'misattributed-eleanor-roosevelt',
 'humor',
 'obvious',
 'simile',
 'love',
 'inspirational',
 'life',
 'humor',
 'books',
 'reading',
 'friendship',
 'friends',
 'truth',
 'simile']



In [30]:

    
#tag url xpath
response.xpath("//a[@class='tag']/@href").extract()









    Out[30]:





['/tag/change/page/1/',
 '/tag/deep-thoughts/page/1/',
 '/tag/thinking/page/1/',
 '/tag/world/page/1/',
 '/tag/abilities/page/1/',
 '/tag/choices/page/1/',
 '/tag/inspirational/page/1/',
 '/tag/life/page/1/',
 '/tag/live/page/1/',
 '/tag/miracle/page/1/',
 '/tag/miracles/page/1/',
 '/tag/aliteracy/page/1/',
 '/tag/books/page/1/',
 '/tag/classic/page/1/',
 '/tag/humor/page/1/',
 '/tag/be-yourself/page/1/',
 '/tag/inspirational/page/1/',
 '/tag/adulthood/page/1/',
 '/tag/success/page/1/',
 '/tag/value/page/1/',
 '/tag/life/page/1/',
 '/tag/love/page/1/',
 '/tag/edison/page/1/',
 '/tag/failure/page/1/',
 '/tag/inspirational/page/1/',
 '/tag/paraphrased/page/1/',
 '/tag/misattributed-eleanor-roosevelt/page/1/',
 '/tag/humor/page/1/',
 '/tag/obvious/page/1/',
 '/tag/simile/page/1/',
 '/tag/love/',
 '/tag/inspirational/',
 '/tag/life/',
 '/tag/humor/',
 '/tag/books/',
 '/tag/reading/',
 '/tag/friendship/',
 '/tag/friends/',
 '/tag/truth/',
 '/tag/simile/']



In [7]:

    
# full <title> element, tags included
response.css("title").extract_first()









    Out[7]:





'<title>Quotes to Scrape</title>'



In [9]:

    
# regex without a capture group: returns every match of "title" in the element's HTML
response.css("title").re("title")









    Out[9]:





['title', 'title']



In [17]:

    
# regex with a capture group: returns only the text between the tags
response.css("title").re('.+>(.+)<.+')









    Out[17]:





['Quotes to Scrape']
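
The two .re() results above follow the usual rules of Python's re module: without a capture group every match of the pattern is returned, and with a group only the captured part is returned. The sketch below reproduces both results with re.findall on the title string, and adds a more specific pattern as an illustrative alternative.

```python
import re

title_html = "<title>Quotes to Scrape</title>"

# no capture group: every occurrence of the pattern (opening and closing tag)
no_group = re.findall("title", title_html)

# one capture group: only the text captured between ">" and "<"
with_group = re.findall(".+>(.+)<.+", title_html)

# a more specific, non-greedy pattern (illustrative alternative)
specific = re.findall(r"<title>(.*?)</title>", title_html)

print(no_group, with_group, specific)
```

When only the text is needed, response.css("title::text").extract_first() avoids the regex altogether.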


