In [1]:
import requests
from lxml import html
We used the "requests" library last time to get Twitter data through its REST API. Here we introduce a new library, "lxml", for parsing HTML and extracting elements and attributes.
Hacker News (HN) is a community-driven news website with an emphasis on technology-related content. Let's grab the articles that are currently at the top of the HN list.
In [2]:
response = requests.get('http://news.ycombinator.com/')
response
Out[2]:
<Response [200]>
In [3]:
response.content
Out[3]:
'<html op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0"><link rel="stylesheet" type="text/css" href="news.css?WwbEhbljl4NoDa7axYx5">\n <link rel="shortcut icon" href="favicon.ico">\n <link rel="alternate" type="application/rss+xml" title="RSS" href="rss">\n <title>Hacker News</title>\n </head><body><center><table id="hnmain" border="0" cellpadding="0" cellspacing="0" width="85%" bgcolor="#f6f6ef">\n <tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" width="100%" style="padding:2px"><tr><td style="width:18px;padding-right:4px"><a href="http://www.ycombinator.com"><img src="y18.gif" width="18" height="18" style="border:1px white solid;"></a></td>\n <td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>\n <a href="newest">new</a> | <a href="newcomments">comments</a> | <a href="show">show</a> | <a href="ask">ask</a> | <a href="jobs">jobs</a> | <a href="submit">submit</a> </span></td><td style="text-align:right;padding-right:4px;"><span class="pagetop">\n <a href="login?goto=news">login</a>\n </span></td>\n </tr></table></td></tr>\n<tr style="height:10px"></tr><tr><td><table border="0" cellpadding="0" cellspacing="0" class="itemlist">\n <tr class=\'athing\' id=\'12768768\'>\n <td align="right" valign="top" class="title"><span class="rank">1.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12768768\' href=\'vote?id=12768768&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="https://www.brainpickings.org/2014/10/13/kierkegaard-diary-bullying-trolling-haters/" class="storylink">Why Haters Hate: Kierkegaard Explains the Psychology of Trolling in 1847</a><span class="sitebit comhead"> (<a href="from?site=brainpickings.org"><span class="sitestr">brainpickings.org</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n 
<span class="score" id="score_12768768">74 points</span> by <a href="user?id=DyslexicAtheist" class="hnuser">DyslexicAtheist</a> <span class="age"><a href="item?id=12768768">3 hours ago</a></span> <span id="unv_12768768"></span> | <a href="hide?id=12768768&goto=news">hide</a> | <a href="item?id=12768768">63 comments</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12769261\'>\n <td align="right" valign="top" class="title"><span class="rank">2.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12769261\' href=\'vote?id=12769261&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="http://libvmi.com/" class="storylink">LibVMI: virtual machine introspection</a><span class="sitebit comhead"> (<a href="from?site=libvmi.com"><span class="sitestr">libvmi.com</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" id="score_12769261">25 points</span> by <a href="user?id=ingve" class="hnuser">ingve</a> <span class="age"><a href="item?id=12769261">1 hour ago</a></span> <span id="unv_12769261"></span> | <a href="hide?id=12769261&goto=news">hide</a> | <a href="item?id=12769261">2 comments</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12768881\'>\n <td align="right" valign="top" class="title"><span class="rank">3.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12768881\' href=\'vote?id=12768881&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="https://about.gitlab.com/2016/10/22/gitlab-8-13-released/" class="storylink">GitLab 8.13 Released with Multiple Issue Boards and Merge Conflict Editor</a><span class="sitebit comhead"> (<a href="from?site=gitlab.com"><span class="sitestr">gitlab.com</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" id="score_12768881">90 
points</span> by <a href="user?id=Smibu" class="hnuser">Smibu</a> <span class="age"><a href="item?id=12768881">2 hours ago</a></span> <span id="unv_12768881"></span> | <a href="hide?id=12768881&goto=news">hide</a> | <a href="item?id=12768881">14 comments</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12768319\'>\n <td align="right" valign="top" class="title"><span class="rank">4.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12768319\' href=\'vote?id=12768319&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="http://www.bloomberg.com/news/articles/2016-09-22/the-professor-who-was-right-about-index-funds-all-along" class="storylink">A Professor Who Was Right About Index Funds All Along</a><span class="sitebit comhead"> (<a href="from?site=bloomberg.com"><span class="sitestr">bloomberg.com</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" id="score_12768319">128 points</span> by <a href="user?id=carlosgg" class="hnuser">carlosgg</a> <span class="age"><a href="item?id=12768319">5 hours ago</a></span> <span id="unv_12768319"></span> | <a href="hide?id=12768319&goto=news">hide</a> | <a href="item?id=12768319">111 comments</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12768425\'>\n <td align="right" valign="top" class="title"><span class="rank">5.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12768425\' href=\'vote?id=12768425&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="http://www.nongnu.org/lzip/xz_inadequate.html" class="storylink">Xz format inadequate for long-term archiving</a><span class="sitebit comhead"> (<a href="from?site=nongnu.org"><span class="sitestr">nongnu.org</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" 
id="score_12768425">106 points</span> by <a href="user?id=martianh" class="hnuser">martianh</a> <span class="age"><a href="item?id=12768425">4 hours ago</a></span> <span id="unv_12768425"></span> | <a href="hide?id=12768425&goto=news">hide</a> | <a href="item?id=12768425">48 comments</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12768719\'>\n <td align="right" valign="top" class="title"><span class="rank">6.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12768719\' href=\'vote?id=12768719&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="https://en.wikipedia.org/wiki/ZMODEM" class="storylink">ZMODEM</a><span class="sitebit comhead"> (<a href="from?site=wikipedia.org"><span class="sitestr">wikipedia.org</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" id="score_12768719">48 points</span> by <a href="user?id=turrini" class="hnuser">turrini</a> <span class="age"><a href="item?id=12768719">3 hours ago</a></span> <span id="unv_12768719"></span> | <a href="hide?id=12768719&goto=news">hide</a> | <a href="item?id=12768719">29 comments</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12767821\'>\n <td align="right" valign="top" class="title"><span class="rank">7.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12767821\' href=\'vote?id=12767821&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="https://www.youtube.com/watch?v=hyry8mgXiTk" class="storylink">1177 BC \xe2\x80\x93 The Year Civilization Collapsed [video]</a><span class="sitebit comhead"> (<a href="from?site=youtube.com"><span class="sitestr">youtube.com</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" id="score_12767821">207 points</span> by <a href="user?id=dmlhllnd" 
class="hnuser">dmlhllnd</a> <span class="age"><a href="item?id=12767821">8 hours ago</a></span> <span id="unv_12767821"></span> | <a href="hide?id=12767821&goto=news">hide</a> | <a href="item?id=12767821">43 comments</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12769178\'>\n <td align="right" valign="top" class="title"><span class="rank">8.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12769178\' href=\'vote?id=12769178&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="https://tech.slashdot.org/story/16/10/22/008216/google-has-quietly-dropped-ban-on-personally-identifiable-web-tracking?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+Slashdot%2Fslashdot+%28Slashdot%29" class="storylink">Google Has Dropped Ban on Personally Identifiable Web Tracking</a><span class="sitebit comhead"> (<a href="from?site=slashdot.org"><span class="sitestr">slashdot.org</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" id="score_12769178">50 points</span> by <a href="user?id=pcunite" class="hnuser">pcunite</a> <span class="age"><a href="item?id=12769178">1 hour ago</a></span> <span id="unv_12769178"></span> | <a href="hide?id=12769178&goto=news">hide</a> | <a href="item?id=12769178">8 comments</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12767560\'>\n <td align="right" valign="top" class="title"><span class="rank">9.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12767560\' href=\'vote?id=12767560&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="https://vuejs.org/guide/comparison.html" class="storylink">Comparison with Other Frameworks</a><span class="sitebit comhead"> (<a href="from?site=vuejs.org"><span class="sitestr">vuejs.org</span></a>)</span></td></tr><tr><td 
colspan="2"></td><td class="subtext">\n <span class="score" id="score_12767560">192 points</span> by <a href="user?id=wanderer42" class="hnuser">wanderer42</a> <span class="age"><a href="item?id=12767560">10 hours ago</a></span> <span id="unv_12767560"></span> | <a href="hide?id=12767560&goto=news">hide</a> | <a href="item?id=12767560">97 comments</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12766493\'>\n <td align="right" valign="top" class="title"><span class="rank">10.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12766493\' href=\'vote?id=12766493&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="https://www.infoq.com/presentations/category-theory-propositions-principle" class="storylink">Category Theory for the Working Hacker [video]</a><span class="sitebit comhead"> (<a href="from?site=infoq.com"><span class="sitestr">infoq.com</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" id="score_12766493">45 points</span> by <a href="user?id=louthy" class="hnuser">louthy</a> <span class="age"><a href="item?id=12766493">5 hours ago</a></span> <span id="unv_12766493"></span> | <a href="hide?id=12766493&goto=news">hide</a> | <a href="item?id=12766493">7 comments</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12769105\'>\n <td align="right" valign="top" class="title"><span class="rank">11.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12769105\' href=\'vote?id=12769105&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="http://www.juliabloggers.com/optimizing-details-of-vectorization-and-metaprogramming/?utm_source=ReviveOldPost&utm_medium=social&utm_campaign=ReviveOldPost" class="storylink">Optimizing .*: Details of Vectorization and Metaprogramming \xe2\x80\x93 
Juliabloggers.com</a><span class="sitebit comhead"> (<a href="from?site=juliabloggers.com"><span class="sitestr">juliabloggers.com</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" id="score_12769105">9 points</span> by <a href="user?id=leephillips" class="hnuser">leephillips</a> <span class="age"><a href="item?id=12769105">1 hour ago</a></span> <span id="unv_12769105"></span> | <a href="hide?id=12769105&goto=news">hide</a> | <a href="item?id=12769105">1 comment</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12766847\'>\n <td align="right" valign="top" class="title"><span class="rank">12.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12766847\' href=\'vote?id=12766847&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="https://github.com/SamyPesse/How-to-Make-a-Computer-Operating-System" class="storylink">How to Make a Computer Operating System</a><span class="sitebit comhead"> (<a href="from?site=github.com"><span class="sitestr">github.com</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" id="score_12766847">24 points</span> by <a href="user?id=hitr" class="hnuser">hitr</a> <span class="age"><a href="item?id=12766847">4 hours ago</a></span> <span id="unv_12766847"></span> | <a href="hide?id=12766847&goto=news">hide</a> | <a href="item?id=12766847">discuss</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12767038\'>\n <td align="right" valign="top" class="title"><span class="rank">13.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12767038\' href=\'vote?id=12767038&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="http://www.blikstein.com/paulo/projects/project_water.html" class="storylink">Programmable Water (2003)</a><span 
class="sitebit comhead"> (<a href="from?site=blikstein.com"><span class="sitestr">blikstein.com</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" id="score_12767038">31 points</span> by <a href="user?id=Phithagoras" class="hnuser">Phithagoras</a> <span class="age"><a href="item?id=12767038">6 hours ago</a></span> <span id="unv_12767038"></span> | <a href="hide?id=12767038&goto=news">hide</a> | <a href="item?id=12767038">6 comments</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12767747\'>\n <td align="right" valign="top" class="title"><span class="rank">14.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12767747\' href=\'vote?id=12767747&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="http://www.bbc.co.uk/news/resources/idt-150d11df-c541-44a9-9332-560a19828c47" class="storylink">Aberfan: The mistake that cost a village its children</a><span class="sitebit comhead"> (<a href="from?site=bbc.co.uk"><span class="sitestr">bbc.co.uk</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" id="score_12767747">84 points</span> by <a href="user?id=Patient0" class="hnuser">Patient0</a> <span class="age"><a href="item?id=12767747">9 hours ago</a></span> <span id="unv_12767747"></span> | <a href="hide?id=12767747&goto=news">hide</a> | <a href="item?id=12767747">35 comments</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12769264\'>\n <td align="right" valign="top" class="title"><span class="rank">15.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12769264\' href=\'vote?id=12769264&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="http://www.popularmechanics.com/science/energy/news/a23490/iceland-3-mile-hole-magma/" class="storylink">Iceland Is 
Drilling a 3-Mile Hole to Tap Magma Power</a><span class="sitebit comhead"> (<a href="from?site=popularmechanics.com"><span class="sitestr">popularmechanics.com</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" id="score_12769264">24 points</span> by <a href="user?id=jonbaer" class="hnuser">jonbaer</a> <span class="age"><a href="item?id=12769264">1 hour ago</a></span> <span id="unv_12769264"></span> | <a href="hide?id=12769264&goto=news">hide</a> | <a href="item?id=12769264">5 comments</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12769196\'>\n <td align="right" valign="top" class="title"><span class="rank">16.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12769196\' href=\'vote?id=12769196&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="item?id=12769196" class="storylink">Ask HN: How did Dyn fail to fend off DDOS?</a></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" id="score_12769196">18 points</span> by <a href="user?id=ruler88" class="hnuser">ruler88</a> <span class="age"><a href="item?id=12769196">1 hour ago</a></span> <span id="unv_12769196"></span> | <a href="hide?id=12769196&goto=news">hide</a> | <a href="item?id=12769196">12 comments</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12768782\'>\n <td align="right" valign="top" class="title"><span class="rank">17.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12768782\' href=\'vote?id=12768782&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="https://github.com/okTurtles/dnschain" class="storylink" rel="nofollow">OkTurtles/dnschain: A blockchain-based DNS and HTTP server</a><span class="sitebit comhead"> (<a href="from?site=github.com"><span 
class="sitestr">github.com</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" id="score_12768782">9 points</span> by <a href="user?id=callaars" class="hnuser">callaars</a> <span class="age"><a href="item?id=12768782">2 hours ago</a></span> <span id="unv_12768782"></span> | <a href="hide?id=12768782&goto=news">hide</a> | <a href="item?id=12768782">1 comment</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12766846\'>\n <td align="right" valign="top" class="title"><span class="rank">18.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12766846\' href=\'vote?id=12766846&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="http://www.jayconrod.com/posts/52/a-tour-of-v8-object-representation" class="storylink">A tour of V8: object representation (2013)</a><span class="sitebit comhead"> (<a href="from?site=jayconrod.com"><span class="sitestr">jayconrod.com</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" id="score_12766846">23 points</span> by <a href="user?id=tambourine_man" class="hnuser">tambourine_man</a> <span class="age"><a href="item?id=12766846">6 hours ago</a></span> <span id="unv_12766846"></span> | <a href="hide?id=12766846&goto=news">hide</a> | <a href="item?id=12766846">7 comments</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12766839\'>\n <td align="right" valign="top" class="title"><span class="rank">19.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12766839\' href=\'vote?id=12766839&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="http://www.theparisreview.org/interviews/3605/the-art-of-fiction-no-64-kurt-vonnegut" class="storylink" rel="nofollow">Vonnegut: the art of fiction</a><span class="sitebit comhead"> (<a 
href="from?site=theparisreview.org"><span class="sitestr">theparisreview.org</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" id="score_12766839">9 points</span> by <a href="user?id=kapitza" class="hnuser">kapitza</a> <span class="age"><a href="item?id=12766839">3 hours ago</a></span> <span id="unv_12766839"></span> | <a href="hide?id=12766839&goto=news">hide</a> | <a href="item?id=12766839">discuss</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12766174\'>\n <td align="right" valign="top" class="title"><span class="rank">20.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12766174\' href=\'vote?id=12766174&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="https://medium.com/@fagnerbrack/how-to-accept-over-engineering-for-what-it-really-is-6fca9a919263" class="storylink">How to Accept Over-Engineering for What It Really Is</a><span class="sitebit comhead"> (<a href="from?site=medium.com"><span class="sitestr">medium.com</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" id="score_12766174">142 points</span> by <a href="user?id=fagnerbrack" class="hnuser">fagnerbrack</a> <span class="age"><a href="item?id=12766174">14 hours ago</a></span> <span id="unv_12766174"></span> | <a href="hide?id=12766174&goto=news">hide</a> | <a href="item?id=12766174">88 comments</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12766458\'>\n <td align="right" valign="top" class="title"><span class="rank">21.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12766458\' href=\'vote?id=12766458&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="http://www.nextplatform.com/2016/09/01/cpu-gpu-put-deep-learning-framework-test/" class="storylink">CPU, GPU Put to Deep 
Learning Framework Test</a><span class="sitebit comhead"> (<a href="from?site=nextplatform.com"><span class="sitestr">nextplatform.com</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" id="score_12766458">22 points</span> by <a href="user?id=adamnemecek" class="hnuser">adamnemecek</a> <span class="age"><a href="item?id=12766458">6 hours ago</a></span> <span id="unv_12766458"></span> | <a href="hide?id=12766458&goto=news">hide</a> | <a href="item?id=12766458">11 comments</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12766691\'>\n <td align="right" valign="top" class="title"><span class="rank">22.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12766691\' href=\'vote?id=12766691&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="http://blog.adacore.com/how-to-prevent-drone-crashes-using-spark" class="storylink">How to avoid runtime errors on drones using SPARK (2015)</a><span class="sitebit comhead"> (<a href="from?site=adacore.com"><span class="sitestr">adacore.com</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" id="score_12766691">49 points</span> by <a href="user?id=0srv" class="hnuser">0srv</a> <span class="age"><a href="item?id=12766691">14 hours ago</a></span> <span id="unv_12766691"></span> | <a href="hide?id=12766691&goto=news">hide</a> | <a href="item?id=12766691">20 comments</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12748863\'>\n <td align="right" valign="top" class="title"><span class="rank">23.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12748863\' href=\'vote?id=12748863&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="https://www.tesla.com/blog/all-tesla-cars-being-produced-now-have-full-self-driving-hardware" 
class="storylink">All Tesla Cars Being Produced Now Have Full Self-Driving Hardware</a><span class="sitebit comhead"> (<a href="from?site=tesla.com"><span class="sitestr">tesla.com</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" id="score_12748863">1423 points</span> by <a href="user?id=impish19" class="hnuser">impish19</a> <span class="age"><a href="item?id=12748863">2 days ago</a></span> <span id="unv_12748863"></span> | <a href="hide?id=12748863&goto=news">hide</a> | <a href="item?id=12748863">1069 comments</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12766613\'>\n <td align="right" valign="top" class="title"><span class="rank">24.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12766613\' href=\'vote?id=12766613&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="http://internetcensus2012.bitbucket.org/paper.html" class="storylink" rel="nofollow">Internet Census: Port scanning /0 using insecure embedded devices (2012)</a><span class="sitebit comhead"> (<a href="from?site=bitbucket.org"><span class="sitestr">bitbucket.org</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" id="score_12766613">12 points</span> by <a href="user?id=bootload" class="hnuser">bootload</a> <span class="age"><a href="item?id=12766613">5 hours ago</a></span> <span id="unv_12766613"></span> | <a href="hide?id=12766613&goto=news">hide</a> | <a href="item?id=12766613">discuss</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12766419\'>\n <td align="right" valign="top" class="title"><span class="rank">25.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12766419\' href=\'vote?id=12766419&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a 
href="http://www.japantimes.co.jp/community/2016/06/22/issues/japans-koseki-system-dull-uncaring-terribly-efficient/#.WAq-taOZNP0" class="storylink">Japan\xe2\x80\x99s koseki system: dull, uncaring but efficient</a><span class="sitebit comhead"> (<a href="from?site=japantimes.co.jp"><span class="sitestr">japantimes.co.jp</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" id="score_12766419">204 points</span> by <a href="user?id=Thevet" class="hnuser">Thevet</a> <span class="age"><a href="item?id=12766419">16 hours ago</a></span> <span id="unv_12766419"></span> | <a href="hide?id=12766419&goto=news">hide</a> | <a href="item?id=12766419">109 comments</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12760235\'>\n <td align="right" valign="top" class="title"><span class="rank">26.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12760235\' href=\'vote?id=12760235&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="http://www.bbc.com/news/technology-37713939" class="storylink">Samsung \'blocks\' exploding Note 7 parody videos</a><span class="sitebit comhead"> (<a href="from?site=bbc.com"><span class="sitestr">bbc.com</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" id="score_12760235">584 points</span> by <a href="user?id=Lio" class="hnuser">Lio</a> <span class="age"><a href="item?id=12760235">1 day ago</a></span> <span id="unv_12760235"></span> | <a href="hide?id=12760235&goto=news">hide</a> | <a href="item?id=12760235">206 comments</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12767214\'>\n <td align="right" valign="top" class="title"><span class="rank">27.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12767214\' href=\'vote?id=12767214&how=up&goto=news\'><div class=\'votearrow\' 
title=\'upvote\'></div></a></center></td><td class="title"><a href="http://news.stanford.edu/2016/10/20/stanford-researchers-create-new-special-purpose-computer/" class="storylink">Researchers create new computer combining optical and electronic technology</a><span class="sitebit comhead"> (<a href="from?site=stanford.edu"><span class="sitestr">stanford.edu</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" id="score_12767214">25 points</span> by <a href="user?id=hn-user" class="hnuser">hn-user</a> <span class="age"><a href="item?id=12767214">12 hours ago</a></span> <span id="unv_12767214"></span> | <a href="hide?id=12767214&goto=news">hide</a> | <a href="item?id=12767214">5 comments</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12762462\'>\n <td align="right" valign="top" class="title"><span class="rank">28.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12762462\' href=\'vote?id=12762462&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="http://www.esa.int/Our_Activities/Space_Science/ExoMars/Mars_Reconnaissance_Orbiter_views_Schiaparelli_landing_site" class="storylink">Mars Reconnaissance Orbiter views Schiaparelli landing site</a><span class="sitebit comhead"> (<a href="from?site=esa.int"><span class="sitestr">esa.int</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" id="score_12762462">265 points</span> by <a href="user?id=okket" class="hnuser">okket</a> <span class="age"><a href="item?id=12762462">1 day ago</a></span> <span id="unv_12762462"></span> | <a href="hide?id=12762462&goto=news">hide</a> | <a href="item?id=12762462">108 comments</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12759697\'>\n <td align="right" valign="top" class="title"><span class="rank">29.</span></td> <td valign="top" 
class="votelinks"><center><a id=\'up_12759697\' href=\'vote?id=12759697&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="https://www.dynstatus.com/incidents/nlr4yrr162t8" class="storylink">DDoS Attack Against Dyn Managed DNS</a><span class="sitebit comhead"> (<a href="from?site=dynstatus.com"><span class="sitestr">dynstatus.com</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" id="score_12759697">1538 points</span> by <a href="user?id=owenwil" class="hnuser">owenwil</a> <span class="age"><a href="item?id=12759697">1 day ago</a></span> <span id="unv_12759697"></span> | <a href="hide?id=12759697&goto=news">hide</a> | <a href="item?id=12759697">658 comments</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class=\'athing\' id=\'12766123\'>\n <td align="right" valign="top" class="title"><span class="rank">30.</span></td> <td valign="top" class="votelinks"><center><a id=\'up_12766123\' href=\'vote?id=12766123&how=up&goto=news\'><div class=\'votearrow\' title=\'upvote\'></div></a></center></td><td class="title"><a href="https://www.flashpoint-intel.com/mirai-botnet-linked-dyn-dns-ddos-attacks/" class="storylink">Mirai Botnet Linked to Dyn DNS DDoS Attacks</a><span class="sitebit comhead"> (<a href="from?site=flashpoint-intel.com"><span class="sitestr">flashpoint-intel.com</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">\n <span class="score" id="score_12766123">111 points</span> by <a href="user?id=ashitlerferad" class="hnuser">ashitlerferad</a> <span class="age"><a href="item?id=12766123">17 hours ago</a></span> <span id="unv_12766123"></span> | <a href="hide?id=12766123&goto=news">hide</a> | <a href="item?id=12766123">93 comments</a> </td></tr>\n <tr class="spacer" style="height:5px"></tr>\n <tr class="morespace" style="height:10px"></tr><tr><td colspan="2"></td><td class="title"><a href="news?p=2" class="morelink" 
rel="nofollow">More</a></td></tr>\n </table>\n</td></tr>\n<tr><td><img src="s.gif" height="10" width="0"><table width="100%" cellspacing="0" cellpadding="1"><tr><td bgcolor="#ff6600"></td></tr></table><br><center><span class="yclinks"><a href="newsguidelines.html">Guidelines</a>\n | <a href="newsfaq.html">FAQ</a>\n | <a href="mailto:hn@ycombinator.com">Support</a>\n | <a href="https://github.com/HackerNews/API">API</a>\n | <a href="security.html">Security</a>\n | <a href="lists">Lists</a>\n | <a href="bookmarklet.html">Bookmarklet</a>\n | <a href="dmca.html">DMCA</a>\n | <a href="http://www.ycombinator.com/apply/">Apply to YC</a>\n | <a href="mailto:hn@ycombinator.com">Contact</a></span><br><br><form method="get" action="//hn.algolia.com/">Search:\n <input type="text" name="q" value="" size="17" autocorrect="off" spellcheck="false" autocapitalize="off" autocomplete="false"></form>\n </center></td></tr> </table></center></body><script type=\'text/javascript\' src=\'hn.js?WwbEhbljl4NoDa7axYx5\'></script></html>\n'
We will now use lxml to parse the HackerNews page so that we can access its content programmatically.
In [5]:
page = html.fromstring(response.content)
page
Out[5]:
<Element html at 0x103fc0578>
In [7]:
posts = page.cssselect('.title')
In [8]:
len(posts)
Out[8]:
61
Details on how to use CSS selectors can be found on the W3Schools site. The same elements can also be selected with an XPath expression:
In [9]:
posts = page.xpath('//td[contains(@class, "title")]')
In [10]:
len(posts)
Out[10]:
61
We are only interested in those "td" tags that directly contain an anchor link to the referenced article.
In [11]:
posts = page.xpath('//td[contains(@class, "title")]/a')
In [12]:
len(posts)
Out[12]:
31
So only about half of the "td" tags with class "title" (31 of 61) contain the links that we are interested in. Let's take a look at the first such post.
In [13]:
first_post = posts[0]
first_post.text
Out[13]:
'Why Haters Hate: Kierkegaard Explains the Psychology of Trolling in 1847'
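To see why the count drops from 61 to 31, it can help to run the same XPath expressions on a tiny page. The sketch below uses a hand-written SAMPLE string (made up for illustration and only loosely modeled on the HN markup shown earlier). It also tries a stricter selector, //a[@class="storylink"], which is an assumption based on the class names visible in the raw HTML and may stop working if HN changes its markup.

```python
from lxml import html

# Hypothetical miniature page: one story row plus the "More" pagination row.
SAMPLE = """
<table>
  <tr class="athing">
    <td class="title"><span class="rank">1.</span></td>
    <td class="title"><a href="http://libvmi.com/" class="storylink">LibVMI: virtual machine introspection</a></td>
  </tr>
  <tr>
    <td class="title"><a href="news?p=2" class="morelink">More</a></td>
  </tr>
</table>
"""

sample_page = html.fromstring(SAMPLE)

# Every <td> whose class attribute contains "title" (rank cells included).
title_cells = sample_page.xpath('//td[contains(@class, "title")]')

# Only the <a> elements that are direct children of such cells.
title_links = sample_page.xpath('//td[contains(@class, "title")]/a')

# Stricter (assumed) selector: anchors whose class is exactly "storylink".
story_links = sample_page.xpath('//a[@class="storylink"]')

print(len(title_cells), len(title_links), len(story_links))
```

On this sample the three counts should come out as 3, 2 and 1: every "title" cell, the cells that directly wrap a link, and the story links without the "More" link.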
The anchor tag's attributes carry useful "content" as well.
In [14]:
first_post.attrib
Out[14]:
{'href': 'https://www.brainpickings.org/2014/10/13/kierkegaard-diary-bullying-trolling-haters/', 'class': 'storylink'}
In [15]:
first_post.attrib["href"]
Out[15]:
'https://www.brainpickings.org/2014/10/13/kierkegaard-diary-bullying-trolling-haters/'
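Indexing attrib with a key raises a KeyError when the attribute is missing. A minimal sketch of a more forgiving lookup, using a one-element sample (the link variable below is hypothetical and not part of the notebook), is to call the element's get method, which returns None or a default of your choice instead:

```python
from lxml import html

# Hypothetical single-anchor fragment, modeled on one HN story link.
link = html.fromstring(
    '<a href="http://libvmi.com/" class="storylink">'
    'LibVMI: virtual machine introspection</a>'
)

print(link.tag)                   # element name
print(link.text)                  # text between the opening and closing tags
print(dict(link.attrib))          # all attributes as a plain dict
print(link.get("href"))           # same value as link.attrib["href"]
print(link.get("rel", "(none)"))  # missing attribute: falls back to the default
```

This matters when a selector also matches anchors that lack the attribute you expect.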
In [16]:
all_links = []
for p in posts:
    # store each post as a (title, URL) pair
    all_links.append((p.text, p.attrib["href"]))
In [17]:
all_links
Out[17]:
[('Why Haters Hate: Kierkegaard Explains the Psychology of Trolling in 1847',
'https://www.brainpickings.org/2014/10/13/kierkegaard-diary-bullying-trolling-haters/'),
('LibVMI: virtual machine introspection', 'http://libvmi.com/'),
('GitLab 8.13 Released with Multiple Issue Boards and Merge Conflict Editor',
'https://about.gitlab.com/2016/10/22/gitlab-8-13-released/'),
('A Professor Who Was Right About Index Funds All Along',
'http://www.bloomberg.com/news/articles/2016-09-22/the-professor-who-was-right-about-index-funds-all-along'),
('Xz format inadequate for long-term archiving',
'http://www.nongnu.org/lzip/xz_inadequate.html'),
('ZMODEM', 'https://en.wikipedia.org/wiki/ZMODEM'),
(u'1177 BC \xe2\x80\x93 The Year Civilization Collapsed [video]',
'https://www.youtube.com/watch?v=hyry8mgXiTk'),
('Google Has Dropped Ban on Personally Identifiable Web Tracking',
'https://tech.slashdot.org/story/16/10/22/008216/google-has-quietly-dropped-ban-on-personally-identifiable-web-tracking?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+Slashdot%2Fslashdot+%28Slashdot%29'),
('Comparison with Other Frameworks',
'https://vuejs.org/guide/comparison.html'),
('Category Theory for the Working Hacker [video]',
'https://www.infoq.com/presentations/category-theory-propositions-principle'),
(u'Optimizing .*: Details of Vectorization and Metaprogramming \xe2\x80\x93 Juliabloggers.com',
'http://www.juliabloggers.com/optimizing-details-of-vectorization-and-metaprogramming/?utm_source=ReviveOldPost&utm_medium=social&utm_campaign=ReviveOldPost'),
('How to Make a Computer Operating System',
'https://github.com/SamyPesse/How-to-Make-a-Computer-Operating-System'),
('Programmable Water (2003)',
'http://www.blikstein.com/paulo/projects/project_water.html'),
('Aberfan: The mistake that cost a village its children',
'http://www.bbc.co.uk/news/resources/idt-150d11df-c541-44a9-9332-560a19828c47'),
('Iceland Is Drilling a 3-Mile Hole to Tap Magma Power',
'http://www.popularmechanics.com/science/energy/news/a23490/iceland-3-mile-hole-magma/'),
('Ask HN: How did Dyn fail to fend off DDOS?', 'item?id=12769196'),
('OkTurtles/dnschain: A blockchain-based DNS and HTTP server',
'https://github.com/okTurtles/dnschain'),
('A tour of V8: object representation (2013)',
'http://www.jayconrod.com/posts/52/a-tour-of-v8-object-representation'),
('Vonnegut: the art of fiction',
'http://www.theparisreview.org/interviews/3605/the-art-of-fiction-no-64-kurt-vonnegut'),
('How to Accept Over-Engineering for What It Really Is',
'https://medium.com/@fagnerbrack/how-to-accept-over-engineering-for-what-it-really-is-6fca9a919263'),
('CPU, GPU Put to Deep Learning Framework Test',
'http://www.nextplatform.com/2016/09/01/cpu-gpu-put-deep-learning-framework-test/'),
('How to avoid runtime errors on drones using SPARK (2015)',
'http://blog.adacore.com/how-to-prevent-drone-crashes-using-spark'),
('All Tesla Cars Being Produced Now Have Full Self-Driving Hardware',
'https://www.tesla.com/blog/all-tesla-cars-being-produced-now-have-full-self-driving-hardware'),
('Internet Census: Port scanning /0 using insecure embedded devices (2012)',
'http://internetcensus2012.bitbucket.org/paper.html'),
(u'Japan\xe2\x80\x99s koseki system: dull, uncaring but efficient',
'http://www.japantimes.co.jp/community/2016/06/22/issues/japans-koseki-system-dull-uncaring-terribly-efficient/#.WAq-taOZNP0'),
("Samsung 'blocks' exploding Note 7 parody videos",
'http://www.bbc.com/news/technology-37713939'),
('Researchers create new computer combining optical and electronic technology',
'http://news.stanford.edu/2016/10/20/stanford-researchers-create-new-special-purpose-computer/'),
('Mars Reconnaissance Orbiter views Schiaparelli landing site',
'http://www.esa.int/Our_Activities/Space_Science/ExoMars/Mars_Reconnaissance_Orbiter_views_Schiaparelli_landing_site'),
('DDoS Attack Against Dyn Managed DNS',
'https://www.dynstatus.com/incidents/nlr4yrr162t8'),
('Mirai Botnet Linked to Dyn DNS DDoS Attacks',
'https://www.flashpoint-intel.com/mirai-botnet-linked-dyn-dns-ddos-attacks/'),
('More', 'news?p=2')]
Great! When you rerun the code above (starting from the HTTP request), this list of top posts will change over time.
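Two entries in the list deserve a second look: the "Ask HN" post has a relative URL ('item?id=12769196'), and the final ('More', 'news?p=2') pair is the pagination link rather than a post. The sketch below shows one possible cleanup using only the standard library. It is written for Python 3 (the notebook output suggests Python 2, where urljoin lives in the urlparse module), and clean_links, BASE_URL and sample_links are names introduced here for illustration.

```python
from urllib.parse import urljoin

# Base address used in the original request.
BASE_URL = "http://news.ycombinator.com/"

# A few (title, href) pairs copied from the output above.
sample_links = [
    ("LibVMI: virtual machine introspection", "http://libvmi.com/"),
    ("Ask HN: How did Dyn fail to fend off DDOS?", "item?id=12769196"),
    ("More", "news?p=2"),
]

def clean_links(links, base_url=BASE_URL):
    """Drop the pagination link and make every remaining URL absolute."""
    cleaned = []
    for title, href in links:
        # Assumption: pagination links look like "news?p=<page number>".
        if href.startswith("news?p="):
            continue
        # urljoin leaves absolute URLs alone and resolves relative ones.
        cleaned.append((title, urljoin(base_url, href)))
    return cleaned

stories = clean_links(sample_links)
for title, url in stories:
    print(title, "->", url)
```

Filtering on the href prefix is a heuristic; selecting only anchors with class "storylink" in the XPath expression would be an alternative, under the assumption that HN keeps that class name.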
More details on how to use XPath can be found on the W3Schools site.
Content source: philmui/datascience2016fall