Now You Code 0: Extracting Hyperlinks from a Webpage

Let's write a program that extracts the hyperlinks from a webpage. There are three webpage files you can try with this program:

file = 'NYC0-httpbin-org.html'
file = 'NYC0-ischool-website.html'
file = 'NYC0-wikipedia-President-of-the-United-States.html'

Here's the basic strategy: look for the token <a href=" in the text of the file. When you find this token, the link is every character that follows it, up to (but not including) the next " character.

For example: in <a href="http://ist256.github.io"> the hyperlink would be http://ist256.github.io
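The strategy above can be sketched in a few lines of Python. This is only one possible approach, built on the string method str.find; the variable names text, token, start, end and link are illustrative choices, not part of the assignment.

```python
# Assumption: the tag from the example is stored in a variable named `text`.
text = '<a href="http://ist256.github.io">'
token = '<a href="'

link = None
start = text.find(token)             # index where the token begins, or -1 if it is absent
if start != -1:
    start += len(token)              # skip past the token to the first character of the link
    end = text.find('"', start)      # index of the next " character after the token
    if end != -1:
        link = text[start:end]       # everything between the token and the closing quote

print(link)
```

Slicing with text[start:end] leaves out the closing quote because a Python slice stops just before the end index.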

Recommended approach (the problem simplification technique):

We will use the problem simplification technique to solve this problem. It works by solving a simpler problem first, then applying what you've learned to a more complicated one.

1. Try to extract http://ist256.github.io from the string <a href="http://ist256.github.io"> in Python code.
2. Then read the text file file = 'NYC0-httpbin-org.html' into a variable, and use the same code from step 1 to extract the first hyperlink in the text. For file = 'NYC0-httpbin-org.html' it will be https://github.com/requests/httpbin
3. Finally, figure out how to use a loop to extract all the URLs from the text you read from the file. For file = 'NYC0-httpbin-org.html' you should get these hyperlinks: https://github.com/requests/httpbin, https://kennethreitz.org, mailto:me@kennethreitz.org, /forms/post
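As a rough sketch of where these steps lead, the loop below keeps searching for the token from wherever the previous link ended. The function names extract_links and read_links are hypothetical, and the sample string is a made-up stand-in for file contents, not the actual text of any of the three files.

```python
def extract_links(text):
    """Return every value that follows the token <a href=" in text."""
    token = '<a href="'
    links = []
    pos = text.find(token)
    while pos != -1:                       # find() returns -1 when no token is left
        start = pos + len(token)           # first character of the link
        end = text.find('"', start)        # closing quote of the link
        if end == -1:                      # unterminated attribute: stop searching
            break
        links.append(text[start:end])
        pos = text.find(token, end)        # resume the search after this link
    return links


def read_links(filename):
    """Read a whole text file into one string, then extract its links."""
    with open(filename, 'r', encoding='utf-8') as f:
        return extract_links(f.read())


# Made-up sample text (an assumption for illustration only).
sample = '<p><a href="https://github.com/requests/httpbin">repo</a> and <a href="/forms/post">form</a></p>'
print(extract_links(sample))
```

With one of the assignment files in the working directory, calling read_links(file) would apply the same loop to the file's contents.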

A little HTML primer - there are 4 types of hyperlinks you will see:

those which begin with http are external links to other pages off site.
those which begin with mailto are links to email addresses.
those which begin with a / are internal links to other pages on the same website.
those which begin with a # are internal links to places on the same page.

Anything else is not a link, and it probably means you are extracting incorrectly.
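One way to check your extracted text against the four types above is a small helper like the following. It is only a sketch: the function name link_type and the labels it returns are invented for illustration.

```python
def link_type(link):
    """Classify a link by its prefix, following the four types in the primer."""
    if link.startswith('http'):
        return 'external'        # another page, off site
    if link.startswith('mailto'):
        return 'email'           # an email address
    if link.startswith('/'):
        return 'internal page'   # another page on the same website
    if link.startswith('#'):
        return 'same page'       # a place on the same page
    return 'unknown'             # likely a sign the extraction went wrong


for link in ['https://kennethreitz.org', 'mailto:me@kennethreitz.org', '/forms/post', '#top']:
    print(link, '->', link_type(link))
```

A result of 'unknown' is a hint to recheck where your start and end positions fall in the text.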

Step 1: Problem Analysis