Scraping Quotes with lxml (XPath approach)

Until now we have been using CSS selectors to find HTML elements. This notebook uses another powerful approach, XPath (XML Path Language), to match elements. We will use the lxml library to complete the task.

Key points:

  • lxml - ugly, useful to get attributes and nested values, faster
  • bs (BeautifulSoup) - beautiful, useful to get structured values, slower
  • text_content() - method from lxml to get the text out of a tag,
  • xpath() - method from lxml to match elements,
  • tostring() - function from lxml to convert an HTML element to a byte string (both tag and text)

In [1]:
import requests
from lxml import html
from lxml.etree import tostring

In [3]:
url = "http://quotes.toscrape.com/"

In [5]:
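# headers pass extra information to the server; here the User-Agent identifies who is sending the request
response = requests.get(url, headers={"user-agent": "hdavtyan@aua.am"})
page = response.content  # raw bytes of the response body
tree = html.document_fromstring(page)  # parse the bytes into an lxml element tree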

In [12]:
tostring(tree)  # serialize the tree back to bytes to see the HTML source


Out[12]:
b'<html lang="en">\n<head>\n\t<meta charset="UTF-8"/>\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css"/>\n    <link rel="stylesheet" href="/static/main.css"/>\n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">&#8220;The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.&#8221;</span>\n        <span>by <small class="author" itemprop="author">Albert Einstein</small>\n        <a href="/author/Albert-Einstein">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world"/> \n            \n            <a class="tag" href="/tag/change/page/1/">change</a>\n            \n            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>\n            \n            <a class="tag" href="/tag/thinking/page/1/">thinking</a>\n            \n            <a class="tag" href="/tag/world/page/1/">world</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">&#8220;It is our choices, Harry, that show what we truly are, far more than our abilities.&#8221;</span>\n        <span>by <small class="author" itemprop="author">J.K. 
Rowling</small>\n        <a href="/author/J-K-Rowling">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="abilities,choices"/> \n            \n            <a class="tag" href="/tag/abilities/page/1/">abilities</a>\n            \n            <a class="tag" href="/tag/choices/page/1/">choices</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">&#8220;There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.&#8221;</span>\n        <span>by <small class="author" itemprop="author">Albert Einstein</small>\n        <a href="/author/Albert-Einstein">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="inspirational,life,live,miracle,miracles"/> \n            \n            <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>\n            \n            <a class="tag" href="/tag/life/page/1/">life</a>\n            \n            <a class="tag" href="/tag/live/page/1/">live</a>\n            \n            <a class="tag" href="/tag/miracle/page/1/">miracle</a>\n            \n            <a class="tag" href="/tag/miracles/page/1/">miracles</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">&#8220;The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.&#8221;</span>\n        <span>by <small class="author" itemprop="author">Jane Austen</small>\n        <a href="/author/Jane-Austen">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" 
content="aliteracy,books,classic,humor"/> \n            \n            <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>\n            \n            <a class="tag" href="/tag/books/page/1/">books</a>\n            \n            <a class="tag" href="/tag/classic/page/1/">classic</a>\n            \n            <a class="tag" href="/tag/humor/page/1/">humor</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">&#8220;Imperfection is beauty, madness is genius and it\'s better to be absolutely ridiculous than absolutely boring.&#8221;</span>\n        <span>by <small class="author" itemprop="author">Marilyn Monroe</small>\n        <a href="/author/Marilyn-Monroe">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="be-yourself,inspirational"/> \n            \n            <a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a>\n            \n            <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">&#8220;Try not to become a man of success. 
Rather become a man of value.&#8221;</span>\n        <span>by <small class="author" itemprop="author">Albert Einstein</small>\n        <a href="/author/Albert-Einstein">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="adulthood,success,value"/> \n            \n            <a class="tag" href="/tag/adulthood/page/1/">adulthood</a>\n            \n            <a class="tag" href="/tag/success/page/1/">success</a>\n            \n            <a class="tag" href="/tag/value/page/1/">value</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">&#8220;It is better to be hated for what you are than to be loved for what you are not.&#8221;</span>\n        <span>by <small class="author" itemprop="author">Andr&#233; Gide</small>\n        <a href="/author/Andre-Gide">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="life,love"/> \n            \n            <a class="tag" href="/tag/life/page/1/">life</a>\n            \n            <a class="tag" href="/tag/love/page/1/">love</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">&#8220;I have not failed. I\'ve just found 10,000 ways that won\'t work.&#8221;</span>\n        <span>by <small class="author" itemprop="author">Thomas A. 
Edison</small>\n        <a href="/author/Thomas-A-Edison">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="edison,failure,inspirational,paraphrased"/> \n            \n            <a class="tag" href="/tag/edison/page/1/">edison</a>\n            \n            <a class="tag" href="/tag/failure/page/1/">failure</a>\n            \n            <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>\n            \n            <a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">&#8220;A woman is like a tea bag; you never know how strong it is until it\'s in hot water.&#8221;</span>\n        <span>by <small class="author" itemprop="author">Eleanor Roosevelt</small>\n        <a href="/author/Eleanor-Roosevelt">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="misattributed-eleanor-roosevelt"/> \n            \n            <a class="tag" href="/tag/misattributed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">&#8220;A day without sunshine is like, you know, night.&#8221;</span>\n        <span>by <small class="author" itemprop="author">Steve Martin</small>\n        <a href="/author/Steve-Martin">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="humor,obvious,simile"/> \n            \n            <a class="tag" href="/tag/humor/page/1/">humor</a>\n            \n            <a class="tag" href="/tag/obvious/page/1/">obvious</a>\n            \n  
          <a class="tag" href="/tag/simile/page/1/">simile</a>\n            \n        </div>\n    </div>\n\n    <nav>\n        <ul class="pager">\n            \n            \n            <li class="next">\n                <a href="/page/2/">Next <span aria-hidden="true">&#8594;</span></a>\n            </li>\n            \n        </ul>\n    </nav>\n    </div>\n    <div class="col-md-4 tags-box">\n        \n            <h2>Top Ten tags</h2>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 28px" href="/tag/love/">love</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 26px" href="/tag/inspirational/">inspirational</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 26px" href="/tag/life/">life</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 24px" href="/tag/humor/">humor</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 22px" href="/tag/books/">books</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 14px" href="/tag/reading/">reading</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 10px" href="/tag/friendship/">friendship</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 8px" href="/tag/friends/">friends</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 8px" href="/tag/truth/">truth</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a>\n            
</span>\n            \n        \n    </div>\n</div>\n\n    </div>\n    <footer class="footer">\n        <div class="container">\n            <p class="text-muted">\n                Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>\n            </p>\n            <p class="copyright">\n                Made with <span class="sh-red">&#10084;</span> by <a href="https://scrapinghub.com">Scrapinghub</a>\n            </p>\n        </div>\n    </footer>\n</body>\n</html>'

In [17]:
# find author names using xpath
[i.text_content() for i in tree.xpath("//small")]


Out[17]:
['Albert Einstein',
 'J.K. Rowling',
 'Albert Einstein',
 'Jane Austen',
 'Marilyn Monroe',
 'Albert Einstein',
 'André Gide',
 'Thomas A. Edison',
 'Eleanor Roosevelt',
 'Steve Martin']
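
The expression "//small" works here because author names are the only <small> elements on the page. A more defensive variant, sketched below on a made-up fragment, adds an attribute predicate so that unrelated <small> elements are ignored. The sample HTML and the names sample_tree, all_small and authors_only are assumptions for illustration.

```python
from lxml import html

# Hypothetical fragment: one author credit plus an unrelated <small>.
sample = """
<html><body>
  <div class="quote">
    <span>by <small class="author" itemprop="author">Jane Austen</small></span>
  </div>
  <p><small>fine print</small></p>
</body></html>
"""
sample_tree = html.document_fromstring(sample)

# Every <small> in the document, whatever its class.
all_small = [el.text_content() for el in sample_tree.xpath("//small")]

# Only <small class="author">; text() returns the strings directly.
authors_only = sample_tree.xpath("//small[@class='author']/text()")

print(all_small)
print(authors_only)
```

Ending the expression with /text() returns strings instead of elements, so no text_content() call is needed for simple leaf nodes like these.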

In [18]:
# find tags using xpath: <a> elements that are direct children of a <div>
[i.text_content() for i in tree.xpath("//div/a")]


Out[18]:
['change',
 'deep-thoughts',
 'thinking',
 'world',
 'abilities',
 'choices',
 'inspirational',
 'life',
 'live',
 'miracle',
 'miracles',
 'aliteracy',
 'books',
 'classic',
 'humor',
 'be-yourself',
 'inspirational',
 'adulthood',
 'success',
 'value',
 'life',
 'love',
 'edison',
 'failure',
 'inspirational',
 'paraphrased',
 'misattributed-eleanor-roosevelt',
 'humor',
 'obvious',
 'simile']
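
The flat list above loses the link between a tag and its quote. One way to keep it, shown here as a sketch on a made-up fragment, is to select each quote block first and then run relative XPath expressions (starting with ".") inside it. The sample markup and the names sample_tree and tags_by_author are hypothetical.

```python
from lxml import html

# Hypothetical fragment with two quote blocks, shaped like the real page.
sample = """
<html><body>
  <div class="quote">
    <small class="author">Jane Austen</small>
    <div class="tags">
      <a class="tag" href="/tag/books/page/1/">books</a>
      <a class="tag" href="/tag/humor/page/1/">humor</a>
    </div>
  </div>
  <div class="quote">
    <small class="author">Steve Martin</small>
    <div class="tags">
      <a class="tag" href="/tag/simile/page/1/">simile</a>
    </div>
  </div>
</body></html>
"""
sample_tree = html.document_fromstring(sample)

tags_by_author = []
for quote in sample_tree.xpath("//div[@class='quote']"):
    # The leading "." limits the search to the current quote element.
    author = quote.xpath(".//small[@class='author']/text()")[0]
    tags = quote.xpath(".//a[@class='tag']/text()")
    tags_by_author.append((author, tags))

print(tags_by_author)
```

Without the leading ".", an expression such as "//a" would search the whole document again, even when called on a single quote element.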

In [29]:
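# match <a> elements by attribute value; this also matches the 10 "Top Ten tags" sidebar links, hence 40 results
tree.xpath("//a[@class='tag']")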


Out[29]:
[<Element a at 0x1923db9d098>,
 <Element a at 0x1923db9d188>,
 <Element a at 0x1923db9d138>,
 <Element a at 0x1923db9d1d8>,
 <Element a at 0x1923db9d228>,
 <Element a at 0x1923db9d278>,
 <Element a at 0x1923db9d318>,
 <Element a at 0x1923db9d2c8>,
 <Element a at 0x1923db9d368>,
 <Element a at 0x1923db9d3b8>,
 <Element a at 0x1923db9d408>,
 <Element a at 0x1923db9d458>,
 <Element a at 0x1923db9d4a8>,
 <Element a at 0x1923db9d4f8>,
 <Element a at 0x1923db9d548>,
 <Element a at 0x1923db9d5e8>,
 <Element a at 0x1923db9d638>,
 <Element a at 0x1923db9d598>,
 <Element a at 0x1923db9d688>,
 <Element a at 0x1923db9d6d8>,
 <Element a at 0x1923db9d728>,
 <Element a at 0x1923db9d778>,
 <Element a at 0x1923db9d7c8>,
 <Element a at 0x1923db9d818>,
 <Element a at 0x1923db9d8b8>,
 <Element a at 0x1923db9d908>,
 <Element a at 0x1923db9d868>,
 <Element a at 0x1923db9d958>,
 <Element a at 0x1923db9d9a8>,
 <Element a at 0x1923db9da48>,
 <Element a at 0x1923db9da98>,
 <Element a at 0x1923db9dae8>,
 <Element a at 0x1923db9db38>,
 <Element a at 0x1923db9db88>,
 <Element a at 0x1923db9dbd8>,
 <Element a at 0x1923db9dc28>,
 <Element a at 0x1923db9dc78>,
 <Element a at 0x1923db9dcc8>,
 <Element a at 0x1923db9dd68>,
 <Element a at 0x1923db9ddb8>]
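
The output above is a list of Element objects, not text. A possible next step, sketched on a made-up fragment, is to read the text and the href attribute from each element, and to narrow the expression so that sidebar links with the same class are skipped. The sample markup and the names sample_tree, pairs and hrefs are assumptions for illustration.

```python
from lxml import html

# Hypothetical fragment: two quote tags and one sidebar tag with the same class.
sample = """
<html><body>
  <div class="tags">
    <a class="tag" href="/tag/life/page/1/">life</a>
    <a class="tag" href="/tag/love/page/1/">love</a>
  </div>
  <span class="tag-item"><a class="tag" href="/tag/love/">love</a></span>
</body></html>
"""
sample_tree = html.document_fromstring(sample)

# Elements expose their text via text_content() and attributes via get().
links = sample_tree.xpath("//a[@class='tag']")
pairs = [(a.text_content(), a.get("href")) for a in links]

# Requiring a <div class="tags"> parent drops the sidebar link;
# ending with /@href returns the attribute values as strings.
hrefs = sample_tree.xpath("//div[@class='tags']/a[@class='tag']/@href")

print(pairs)
print(hrefs)
```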