Scrapy is a Python framework for web scraping which, in short, combines almost everything we have learned so far: requests, CSS selectors (BeautifulSoup), XPath (lxml), regular expressions (re), and even checking robots.txt or putting the scraper to sleep.
Because Scrapy is a framework, one does not normally code inside a Jupyter Notebook. To mimic Scrapy's behavior inside the Notebook, we need a couple of additional imports that would not be required otherwise.
Key points:
- wrapping a requests response in scrapy.http.TextResponse gives us Scrapy's selector API inside the Notebook,
- response.css() and response.xpath() are two interchangeable ways to select the same elements,
- ::text and ::attr(name) in CSS correspond to /text() and /@name in XPath,
- .extract() returns all matches as a list, while .extract_first() returns only the first one,
- .re() applies a regular expression on top of a selector.
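For reference, outside the Notebook these selectors would live inside a spider class and be run from the command line rather than from a Python session. Below is a minimal sketch of such a spider; the spider name, file name and yielded field are our own illustrative choices, not something produced in this notebook.

# quotes_spider.py - minimal spider sketch (names are illustrative)
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"                                # identifier Scrapy uses to run the spider
    start_urls = ["http://quotes.toscrape.com/"]   # Scrapy downloads these pages itself

    def parse(self, response):
        # this `response` is the same kind of object we construct manually
        # below, so the .css()/.xpath() calls from this notebook carry over
        for author in response.css("small::text").extract():
            yield {"author": author}

Such a file would be run with something like scrapy runspider quotes_spider.py; inside Jupyter we instead build the response object by hand, as the next cells show.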
In [1]:
import requests
from scrapy.http import TextResponse
In [2]:
url = "http://quotes.toscrape.com/"
r = requests.get(url)
response = TextResponse(r.url,body=r.text,encoding="utf-8")
In [3]:
response
Out[3]:
In [10]:
#get heading-css
response.css("a").extract_first()
Out[10]:
In [13]:
#get heading-xpath
response.xpath("//a").extract_first()
Out[13]:
In [16]:
#get authors-css
response.css("small::text").extract()
Out[16]:
In [17]:
#authors-xpath
response.xpath("//small/text()").extract()
Out[17]:
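Authors are more useful when paired with their quotes. A common pattern is to select each quote's container first and then run relative selectors inside it. The sketch below assumes, as on quotes.toscrape.com, that each quote sits in a div with class "quote" containing a span.text and a small.author; those class names are an assumption about the markup, not something extracted above.

#pair each quote's text with its author (div.quote structure assumed)
for quote in response.css("div.quote"):
    text = quote.css("span.text::text").extract_first()
    author = quote.css("small.author::text").extract_first()
    print(author, "-", text)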
In [19]:
#heading-css
response.css('a[style="text-decoration: none"]').extract()
Out[19]:
In [20]:
#heading-css text only
response.css('a[style="text-decoration: none"]::text').extract()
Out[20]:
In [21]:
#heading-css href only
response.css('a[style="text-decoration: none"]::attr(href)').extract()
Out[21]:
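The href values extracted this way are relative to the page. Scrapy response objects provide a urljoin() method for turning them into absolute URLs; a small sketch:

#make a relative href absolute against the page URL
href = response.css('a[style="text-decoration: none"]::attr(href)').extract_first()
response.urljoin(href)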
In [23]:
#tag text css
response.css("a[class='tag']::text").extract()
Out[23]:
In [24]:
#tag url css
response.css("a[class='tag']::attr(href)").extract()
Out[24]:
In [28]:
#tag text xpath
response.xpath("//a[@class='tag']/text()").extract()
Out[28]:
In [30]:
#tag url xpath
response.xpath("//a[@class='tag']/@href").extract()
Out[30]:
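All of the cells so far work on a single page. To bring in the remaining ingredients from the opening paragraph (repeated requests and putting the scraper to sleep), here is a sketch that follows the site's "Next" link from page to page; the li.next selector is an assumption about the site's pagination markup.

import time
import requests
from scrapy.http import TextResponse

url = "http://quotes.toscrape.com/"
authors = []
while url:
    r = requests.get(url)
    response = TextResponse(r.url, body=r.text, encoding="utf-8")
    authors.extend(response.css("small::text").extract())
    #assumed markup: the pagination link sits inside <li class="next">
    next_href = response.css("li.next a::attr(href)").extract_first()
    url = response.urljoin(next_href) if next_href else None
    time.sleep(1)  #be polite: pause between requests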
In [7]:
response.css("title").extract_first()
Out[7]:
In [9]:
response.css("title").re("title")
Out[9]:
In [17]:
#regex to get text between tags
response.css("title").re('.+>(.+)<.+')
Out[17]:
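The regex is one way to strip the tags; the same text can also be reached with the ::text pseudo-element used earlier, and .re_first() is available when only the first regex match is wanted:

#two equivalent ways to get just the title text
response.css("title::text").extract_first()
response.css("title").re_first('.+>(.+)<.+')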