In [1]:
import requests
from lxml import html
We used the "requests" library last time to fetch Twitter data from its REST API. Here we introduce the new "lxml" library for parsing HTML and extracting elements and attributes.
HackerNews is a community-contributed news website with an emphasis on technology-related content. Let's grab the set of articles at the top of the HN front page.
In [ ]:
response = requests.get('https://news.ycombinator.com/')
response
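Before parsing anything, it is worth a quick sanity check that the request actually succeeded (a status code of 200 means OK):
In [ ]:
# A quick sanity check: 200 / True means the request succeeded.
response.status_code, response.ok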
In [ ]:
response.content
We will now use lxml to get programmatic access to the HackerNews content.
In [ ]:
page = html.fromstring(response.content)
page
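If you want to peek at what was parsed (purely illustrative), lxml can serialize the tree back to HTML:
In [ ]:
# Re-serialize the parsed tree and look at the first 300 bytes.
html.tostring(page)[:300]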
In [80]:
posts = page.cssselect('.title')
In [ ]:
len(posts)
Details of how to use CSS selectors can be found on the W3Schools site.
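As a quick illustration (note that cssselect support in lxml requires the separate "cssselect" package, which the cell above already relies on), a more specific selector can narrow the match to the anchors inside title cells; this is a sketch that assumes HN keeps the "title" class on its "td" elements:
In [ ]:
# Select only the anchor tags inside "title" cells.
title_links = page.cssselect('td.title a')
len(title_links)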
In [61]:
posts = page.xpath('//td[contains(@class, "title")]')
In [ ]:
len(posts)
We are only interested in those "td" tags that contain the anchor link to the article itself.
In [84]:
posts = page.xpath('//td[contains(@class, "title")]/a')
In [ ]:
len(posts)
So only half of those "td" tags with class "title" contain posts we are interested in; the other half are the rank-number cells, which carry the same class. Let's take a look at the first such post.
In [ ]:
first_post = posts[0]
first_post.text
There is a lot of "content" in the anchor tag's attributes as well.
In [ ]:
first_post.attrib
In [ ]:
first_post.attrib["href"]
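One caveat: self-posts such as "Ask HN" items have relative hrefs like "item?id=...". A small sketch of normalizing these with the standard library's urljoin (absolute hrefs pass through unchanged):
In [ ]:
from urllib.parse import urljoin

# Resolve a relative href against the site root; absolute URLs are returned as-is.
urljoin('https://news.ycombinator.com/', first_post.attrib["href"])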
In [90]:
all_links = []
for p in posts:
    all_links.append((p.text, p.attrib["href"]))
In [ ]:
all_links
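For readability, here is a small sketch that prints the first few (title, link) pairs:
In [ ]:
# Print the first five (title, link) pairs, one per line.
for title, link in all_links[:5]:
    print(title, '->', link)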
Great: when you re-run the code above (starting from the HTTP request), this list of top stories will change from time to time.
More details on how to use XPath can be found on the W3Schools site.
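For example (a sketch reusing the same selector as above), XPath can pull attribute values out directly, without a Python loop:
In [ ]:
# An XPath ending in /@href returns the attribute values as a list of strings.
hrefs = page.xpath('//td[contains(@class, "title")]/a/@href')
hrefs[:5]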