Import needed libraries


In [1]:
import requests
from lxml import html

We used the library "request" last time in getting Twitter data (REST-ful). We are introducing the new "lxml" library for analyzing & extracting HTML elements and attributes here.

Use Requests to get HackerNews content

HackerNews is a community contributed news website with an emphasis on technology related content. Let's grab the set of articles that are at the top of the HN list.


In [ ]:
response = requests.get('http://news.ycombinator.com/')
response

In [ ]:
response.content

We will now use lxml to create a programmatic access to the content from HackerNews.

Analyzing HTML Content


In [ ]:
page = html.fromstring(response.content)
page

CSS Selectors

For those of you who are web designers, you are likely very familiar with Cascading Stylesheets (CSS). Here is an example for how to use CSS selector for finding specific HTML elements


In [80]:
posts = page.cssselect('.title')

In [ ]:
len(posts)

Details of how to use CSS selectors can be found in the w3 schools site:

http://www.w3schools.com/cssref/css_selectors.asp

XPath

Alternatively, we can use a standard called "XPath" to find specific content in the HTML.


In [61]:
posts = page.xpath('//td[contains(@class, "title")]')

In [ ]:
len(posts)

We are only interested in those "td" tags that contain an anchor link to the referred article.


In [84]:
posts = page.xpath('//td[contains(@class, "title")]/a')

In [ ]:
len(posts)

So, only half of those "td" tags with "title" contain posts that we are interested in. Let's take a look at the first such post.


In [ ]:
first_post = posts[0]
first_post.text

There is a lot of "content" in the td tag's attributes.


In [ ]:
first_post.attrib

In [ ]:
first_post.attrib["href"]

In [90]:
all_links = []
for p in posts:
    all_links.append((p.text, p.attrib["href"]))

In [ ]:
all_links

Great: when you run the code above (starting from the HTTP request), this list of top content should change from time to time.

More details on how to use XPath can be found in the w3 schools site:

http://www.w3schools.com/xsl/xpath_syntax.asp


In [ ]: