Extract the text from an HTML document with Beautiful Soup

We start with the following HTML document.


In [1]:
html_doc = """
<html><head><title>My new page</title></head>
<body>
<p class="title"><b>Cool my new page</b></p>

<p class="story">I have written the following articles:
<a href="http://foo.bar/A1" class="sister" id="link1">A1</a>,
<a href="http://foo.bar/A2" class="sister" id="link2">A2</a>
<a href="http://foo.bar/A3" class="sister" id="link3">A3</a>;
</p>

<p class="story">...</p>
"""

Let's parse the HTML document. Here we already have the page as a string.


In [2]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

Or you could load the content with urllib.

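import urllib.request
from bs4 import BeautifulSoup

url = 'https://foo.bar/'
# Some servers reject requests that lack a browser-like User-Agent header.
req = urllib.request.Request(url, headers={'User-Agent': 'Magic Browser'})
# The context manager closes the connection once the body has been read.
with urllib.request.urlopen(req) as con:
    html_doc = con.read()  # raw bytes; Beautiful Soup detects the encoding

soup = BeautifulSoup(html_doc, 'html.parser')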
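
A page fetched with urllib arrives as bytes, not as a string. Beautiful Soup normally guesses the encoding on its own, but if you already know it you can pass it along with `from_encoding`. The following is only a small sketch: the byte string `raw` and its Latin-1 encoding are made up for illustration and stand in for a real download.

```python
from bs4 import BeautifulSoup

# Hypothetical downloaded body: bytes in Latin-1, not a decoded string.
raw = "<html><head><title>Café</title></head></html>".encode("latin-1")

# from_encoding tells Beautiful Soup which encoding to try first
# instead of relying on its own detection.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

title = soup.title.string
print(title)
```

If you leave out `from_encoding`, Beautiful Soup falls back to its own detection, which is usually good enough for pages that declare their charset.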

We can get the title from the page


In [3]:
soup.title.string


Out[3]:
'My new page'

or all the links in the HTML document.


In [4]:
for link in soup.find_all('a'):
    print(link.get('href'))


http://foo.bar/A1
http://foo.bar/A2
http://foo.bar/A3
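
The same `find_all` call can be narrowed down with a CSS class, and a single tag can be looked up by its `id`. Here is a minimal sketch on a shortened copy of the story paragraph; the variable names `snippet`, `links` and `second` are chosen only for this example.

```python
from bs4 import BeautifulSoup

# Shortened copy of the story paragraph, so the example runs on its own.
snippet = """
<p class="story">
<a href="http://foo.bar/A1" class="sister" id="link1">A1</a>,
<a href="http://foo.bar/A2" class="sister" id="link2">A2</a>
</p>
"""
soup = BeautifulSoup(snippet, "html.parser")

# class_ (with trailing underscore) filters by CSS class,
# because "class" is a reserved word in Python.
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a", class_="sister")]

# find() returns the first match, or None if nothing matches.
second = soup.find(id="link2")

print(links)
print(second["href"])
```

Using `a.get("href")` instead of `a["href"]` returns `None` for anchors without an `href` attribute rather than raising a `KeyError`.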

Of course, it is also very easy to get the plain text from the document without the HTML tags.


In [5]:
soup.text


Out[5]:
'\nMy new page\n\nCool my new page\nI have written the following articles:\nA1,\nA2\nA3;\n\n...\n'
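
The text above still contains all the line breaks from the HTML source. If you prefer a cleaner string, `get_text()` accepts a `separator` and a `strip` argument, and `stripped_strings` yields the individual text fragments. A small sketch, again on a shortened snippet that is made up for this example:

```python
from bs4 import BeautifulSoup

# Shortened paragraph, so the example runs on its own.
snippet = (
    '<p class="story">I have written:\n'
    '<a href="http://foo.bar/A1">A1</a>,\n'
    '<a href="http://foo.bar/A2">A2</a>;\n'
    '</p>'
)
soup = BeautifulSoup(snippet, "html.parser")

# strip=True trims every text fragment and drops the empty ones;
# separator is placed between the remaining fragments.
clean_text = soup.get_text(separator=" ", strip=True)

# stripped_strings gives the same fragments one by one.
fragments = list(soup.stripped_strings)

print(clean_text)
print(fragments)
```

Note that the separator is inserted between every fragment, so punctuation that follows a link ends up separated from it by a space.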