In [1]:
html_doc = """
<html><head><title>My new page</title></head>
<body>
<p class="title"><b>Cool my new page</b></p>
<p class="story">I have written the following articles:
<a href="http://foo.bar/A1" class="sister" id="link1">A1</a>,
<a href="http://foo.bar/A2" class="sister" id="link2">A2</a>
<a href="http://foo.bar/A3" class="sister" id="link3">A3</a>;
</p>
<p class="story">...</p>
"""
Let's parse an HTML document. Here we already have the page as a string.
In [2]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
Or you could load the content with urllib.
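import urllib.request
from bs4 import BeautifulSoup

url = 'https://foo.bar/'
# Some servers reject urllib's default User-Agent, so send a custom one.
req = urllib.request.Request(url, headers={'User-Agent': 'Magic Browser'})
con = urllib.request.urlopen(req)
# read() returns bytes; BeautifulSoup accepts bytes and detects the encoding.
html_doc = con.read()
soup = BeautifulSoup(html_doc, 'html.parser')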
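The download step can be wrapped in a small helper so the connection is closed reliably and a slow server cannot hang the notebook. This is only a sketch: the function name fetch_soup and the 10 second timeout are assumptions made for illustration, not part of the original example.

```python
import urllib.request
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it parsed (hypothetical helper)."""
    # Custom User-Agent, as in the cell above.
    req = urllib.request.Request(url, headers={'User-Agent': 'Magic Browser'})
    # The with-block closes the connection even if read() fails.
    with urllib.request.urlopen(req, timeout=timeout) as con:
        html_doc = con.read()
    return BeautifulSoup(html_doc, 'html.parser')
```

Calling fetch_soup('https://foo.bar/') would then give the same kind of soup object as above. urlopen raises urllib.error.URLError when the request fails, so real code may want to catch that.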
We can get the title from the page
In [3]:
soup.title.string
Out[3]:
Or we can list all the links in the HTML document.
In [4]:
for link in soup.find_all('a'):
    print(link.get('href'))
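Instead of printing, the links can also be collected into a list. The sketch below uses a shortened copy of the sample page (the variable name sample and the extra anchor without an href are made up for illustration) and shows two ways of filtering: href=True in find_all, and a CSS selector with select.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample page, plus one <a> tag without an href.
sample = """
<p class="story">
<a href="http://foo.bar/A1" class="sister" id="link1">A1</a>,
<a href="http://foo.bar/A2" class="sister" id="link2">A2</a>
<a name="anchor-only">no href here</a>
</p>
"""

soup = BeautifulSoup(sample, 'html.parser')

# href=True keeps only the <a> tags that actually have an href attribute.
links = [(a.get_text(), a['href']) for a in soup.find_all('a', href=True)]

# select() takes a CSS selector; here: <a> tags with class "sister".
sister_ids = [a['id'] for a in soup.select('a.sister')]

print(links)
print(sister_ids)
```

Note that link.get('href') returns None for a tag without that attribute, while a['href'] raises a KeyError, which is why the filter is applied first.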
Of course, it is also very easy to get the plain text of the document, without the HTML tags.
In [5]:
soup.text
Out[5]:
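The raw text usually keeps the line breaks and spacing of the HTML source. If a cleaner result is wanted, get_text accepts a separator and a strip flag. A small sketch on a made-up snippet (the variable name sample is an assumption):

```python
from bs4 import BeautifulSoup

sample = (
    '<p class="title"><b>Cool my new page</b></p>\n'
    '<p class="story">I have written '
    '<a href="http://foo.bar/A1">A1</a> and '
    '<a href="http://foo.bar/A2">A2</a></p>'
)
soup = BeautifulSoup(sample, 'html.parser')

# strip=True trims each text fragment and drops the empty ones;
# separator=' ' joins the remaining fragments with a single space.
clean = soup.get_text(separator=' ', strip=True)

# stripped_strings yields the same fragments one by one.
fragments = list(soup.stripped_strings)

print(clean)
print(fragments)
```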