Now You Code 0: Extracting Hyperlinks from a Webpage

Let's write a program that extracts the hyperlinks from a webpage. There are three webpage files you can try with this program:

file = 'NYC0-httpbin-org.html'
file = 'NYC0-ischool-website.html'
file = 'NYC0-wikipedia-President-of-the-United-States.html'

Here's the basic strategy: look for the token <a href=" in the text of the file. When you find this token, the link is every character that follows it, up to but not including the next " character.

For example, in <a href="http://ist256.github.io"> the hyperlink is http://ist256.github.io
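The strategy above can be sketched on a single string as follows. This is only one possible approach, built on str.find; the names TOKEN and extract_first_link are illustrative and not part of the assignment.

```python
# A minimal sketch of the token strategy, assuming the link is always
# wrapped in double quotes directly after <a href=
TOKEN = '<a href="'  # illustrative name for the token we search for


def extract_first_link(text):
    """Return the first hyperlink found in text, or None if there is none."""
    start = text.find(TOKEN)
    if start == -1:
        return None                  # token not present
    start += len(TOKEN)              # skip past the token to the link's first character
    end = text.find('"', start)      # the next " character closes the link
    if end == -1:
        return None                  # no closing quote, so no complete link
    return text[start:end]


html = '<a href="http://ist256.github.io">'
print(extract_first_link(html))  # http://ist256.github.io
```

Returning None when the token is missing keeps the function safe to call on text that contains no links.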

We will use the problem simplification technique to solve this problem. It works by solving a simpler problem first, then applying what you've learned to solve a more complicated one.

  1. Try to extract http://ist256.github.io from the string <a href="http://ist256.github.io"> in Python code
  2. Then read the 'NYC0-httpbin-org.html' text file into a variable, and use the same code from step 1 to extract the first hyperlink in the text. For file = 'NYC0-httpbin-org.html' it will be https://github.com/requests/httpbin
  3. Finally, figure out how to use a loop to extract all the URLs in the text you read from the file. For file = 'NYC0-httpbin-org.html' you should get these hyperlinks: https://github.com/requests/httpbin, https://kennethreitz.org, mailto:me@kennethreitz.org, /forms/post
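One way the loop in step 3 might look is sketched below. The function name extract_all_links and the sample string are assumptions made for illustration; the sample is a hand-written stand-in that reuses the four expected links, not the real contents of NYC0-httpbin-org.html.

```python
# A sketch of step 3: repeat the find-and-slice idea until no token is left.
TOKEN = '<a href="'  # illustrative name for the token we search for


def extract_all_links(text):
    """Return a list of every hyperlink that follows TOKEN in text."""
    links = []
    position = 0                           # where the next search begins
    while True:
        start = text.find(TOKEN, position)
        if start == -1:
            break                          # no more tokens in the text
        start += len(TOKEN)                # first character of the link
        end = text.find('"', start)        # closing quote of the link
        if end == -1:
            break                          # unterminated link, stop searching
        links.append(text[start:end])
        position = end + 1                 # continue after this link
    return links


# Hand-written stand-in for the file text (NOT the actual file contents).
sample = (
    '<p><a href="https://github.com/requests/httpbin">httpbin</a></p>'
    '<p><a href="https://kennethreitz.org">site</a> '
    '<a href="mailto:me@kennethreitz.org">email</a></p>'
    '<a href="/forms/post">form</a>'
)

for link in extract_all_links(sample):
    print(link)
```

To run this against one of the files, the text could be read first with something like `with open(file, encoding='utf-8') as f: text = f.read()` and then passed to the function. The encoding shown is an assumption about how the files were saved.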

A little HTML primer - there are 4 types of hyperlinks you will see:

  • those which begin with http are external links to other pages off site.
  • those which begin with mailto are links to email addresses
  • those which begin with a / are internal links to other pages on the same website
  • those which begin with a # are internal links to places on the same page

Anything else is not a link and you're probably extracting incorrectly.
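As a sanity check on extracted links, the four categories above could be tested with str.startswith. This is a hypothetical helper, not something the assignment requires; the name classify_link and the category labels are made up for illustration.

```python
# A sketch that labels a link by its prefix, following the primer above.
def classify_link(link):
    """Return a short label describing the kind of hyperlink."""
    if link.startswith('http'):
        return 'external'            # another page off site
    if link.startswith('mailto'):
        return 'email'               # an email address
    if link.startswith('/'):
        return 'internal page'       # another page on the same website
    if link.startswith('#'):
        return 'same-page anchor'    # a place on the same page
    return 'unknown'                 # probably an extraction mistake


for link in ['https://kennethreitz.org', 'mailto:me@kennethreitz.org', '/forms/post', '#top']:
    print(link, '->', classify_link(link))
```

If any extracted link comes back as 'unknown', that is a hint to recheck where the slice starts and ends.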

Step 1: Problem Analysis

You should write three algorithms, one for each iteration of the problem simplification technique.

Inputs:

Outputs:

Algorithm (Steps in Program):


In [29]:
## Step 2: write code here

file = 'NYC0-httpbin-org.html'
#file = 'NYC0-ischool-website.html'
#file = 'NYC0-wikipedia-President-of-the-United-States.html'