Let's write a program that extracts the hyperlinks from a webpage. There are 3 webpage files you can try with this program:
file = 'NYC0-httpbin-org.html'
file = 'NYC0-ischool-website.html'
file = 'NYC0-wikipedia-President-of-the-United-States.html'
Here's the basic strategy: look for the token <a href=" in the text of the file. When you find it, the link is every character after the token up to the next " character (the second " in the tag).
For example, in <a href="http://ist256.github.io"> the hyperlink would be http://ist256.github.io.
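The strategy above can be sketched on the example string. This is only one possible approach, assuming the token appears exactly as written; the variable names `tag`, `token`, `start`, `end`, and `link` are illustrative and not part of the assignment.

```python
# Sketch: pull one hyperlink out of a single tag using str.find and slicing.
tag = '<a href="http://ist256.github.io">'
token = '<a href="'

start = tag.find(token)           # index where the token begins, or -1 if absent
if start != -1:
    start += len(token)           # skip past the token to the link's first character
    end = tag.find('"', start)    # the closing quote, i.e. the second " in the tag
    link = tag[start:end]
else:
    link = None                   # token not found, so there is nothing to extract

print(link)  # http://ist256.github.io
```

Searching for the closing quote from `start` rather than from the beginning of the string is what keeps the first quote (the one inside the token) from being matched again.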
We will use the problem simplification technique to solve this problem. It works by solving a simpler problem then taking what you've learned to solve a more complicated problem.
1. Extract http://ist256.github.io from the string <a href="http://ist256.github.io"> in Python code.
2. Read the file = 'NYC0-httpbin-org.html' text file into a variable, and use the same code from step 1 to extract the first hyperlink in the text. For file = 'NYC0-httpbin-org.html' it will be https://github.com/requests/httpbin
3. Extract every hyperlink in the text. For file = 'NYC0-httpbin-org.html' you should get these hyperlinks: https://github.com/requests/httpbin, https://kennethreitz.org, mailto:me@kennethreitz.org, /forms/post

A little HTML primer: there are 4 types of hyperlinks you will see:
- http are external links to other pages off site.
- mailto are links to email addresses.
- / are internal links to other pages on the same website.
- # are internal links to places on the same page.

Anything else is not a link, and you're probably extracting incorrectly.
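As a rough sketch of how the single-link code generalizes, the loop below keeps searching for the token until no more are found, then labels each result with one of the four link types. The function names `extract_links`, `link_type`, and `links_in_file`, the type labels, and the `sample` HTML are all made up for illustration; only the four expected links come from the text above. It assumes every link is written exactly as <a href=" with no other attributes before href.

```python
TOKEN = '<a href="'

def extract_links(text):
    """Return every href value that follows TOKEN, in order of appearance."""
    links = []
    pos = text.find(TOKEN)
    while pos != -1:
        start = pos + len(TOKEN)
        end = text.find('"', start)       # closing quote of this href
        if end == -1:
            break                         # unterminated attribute: stop searching
        links.append(text[start:end])
        pos = text.find(TOKEN, end)       # look for the next token after this link
    return links

def link_type(link):
    """Classify a link by its prefix (labels are illustrative)."""
    if link.startswith('http'):
        return 'external'
    if link.startswith('mailto'):
        return 'email'
    if link.startswith('/'):
        return 'internal page'
    if link.startswith('#'):
        return 'same page'
    return 'not a link'                   # probably an extraction mistake

def links_in_file(path):
    """Read a whole HTML file into a string and extract its links."""
    with open(path, encoding='utf-8') as f:
        return extract_links(f.read())

# Made-up sample text, not the real contents of any of the three files.
sample = ('<p><a href="https://github.com/requests/httpbin">repo</a> '
          '<a href="https://kennethreitz.org">site</a> '
          '<a href="mailto:me@kennethreitz.org">email</a> '
          '<a href="/forms/post">form</a></p>')

for link in extract_links(sample):
    print(link_type(link), link)
```

With one of the real files in the working directory, `links_in_file(file)` would apply the same loop to the file's text. Any result classified as 'not a link' is a hint that the start or end index is off.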
## Step 2: write code here
file = 'NYC0-httpbin-org.html'
#file = 'NYC0-ischool-website.html'
#file = 'NYC0-wikipedia-President-of-the-United-States.html'