Extract the text from an HTML document with Beautiful Soup

We start with the following HTML document.

In [1]:

html_doc = """
<html><head><title>My new page</title></head>
<body>
<p class="title"><b>Cool my new page</b></p>
<p class="story">I have written the following articles:
<a href="http://foo.bar/A1" class="sister" id="link1">A1</a>,
<a href="http://foo.bar/A2" class="sister" id="link2">A2</a>
<a href="http://foo.bar/A3" class="sister" id="link3">A3</a>;
</p>
<p class="story">...</p>
"""

Let's parse the HTML document. Here the page is already available as a string.

In [2]:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

Alternatively, you could fetch the page with urllib.

import urllib.request
from bs4 import BeautifulSoup

url = 'https://foo.bar/'
# Some servers reject the default urllib User-Agent, so send a custom one.
req = urllib.request.Request(url, headers={'User-Agent': 'Magic Browser'})
# The context manager closes the connection once the body has been read.
with urllib.request.urlopen(req) as con:
    html_doc = con.read()

soup = BeautifulSoup(html_doc, 'html.parser')

We can get the title of the page:

In [3]:

soup.title.string

Out[3]:

'My new page'
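
Not every page has a `<title>` tag, in which case `soup.title` is `None` and `soup.title.string` raises an `AttributeError`. The following is a minimal sketch of a defensive lookup; the helper name `page_title` and its `default` parameter are made up for this illustration and are not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

def page_title(markup, default=None):
    """Return the stripped <title> text, or `default` if there is none."""
    soup = BeautifulSoup(markup, 'html.parser')
    # soup.title is None when the document contains no <title> tag.
    if soup.title is None:
        return default
    # An empty <title></title> yields '', so fall back to the default too.
    return soup.title.get_text(strip=True) or default

print(page_title('<html><head><title> My new page </title></head></html>'))
print(page_title('<p>No title here</p>', default='(untitled)'))
```

The first call prints `My new page` and the second prints `(untitled)`.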

Or we can list all the links in the HTML document:

In [4]:

for link in soup.find_all('a'):
    print(link.get('href'))

http://foo.bar/A1
http://foo.bar/A2
http://foo.bar/A3
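
Real pages often contain relative links and anchors without an `href`. As a rough sketch, the link loop can be wrapped in a helper that returns the link text together with an absolute URL. The function name `extract_links`, the `base_url` parameter and the sample markup are assumptions for this example; `urljoin` comes from the standard library.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_links(markup, base_url=''):
    """Return (link text, URL) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(markup, 'html.parser')
    links = []
    # href=True skips anchors that have no href attribute at all.
    for a in soup.find_all('a', href=True):
        # urljoin resolves relative hrefs against base_url.
        links.append((a.get_text(strip=True), urljoin(base_url, a['href'])))
    return links

sample = (
    '<p><a href="/A1" class="sister">A1</a> '
    '<a name="top">no href</a> '
    '<a href="http://foo.bar/A2">A2</a></p>'
)
for text, href in extract_links(sample, base_url='http://foo.bar/'):
    print(text, href)
```

With the sample above this prints `A1 http://foo.bar/A1` and `A2 http://foo.bar/A2`; the anchor without an `href` is skipped.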

It is also easy to get the plain text of the document without the HTML tags.

In [5]:

soup.text

Out[5]:

'\nMy new page\n\nCool my new page\nI have written the following articles:\nA1,\nA2\nA3;\n\n...\n'
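
The raw text keeps every newline that sat between the tags. If a more compact string is wanted, `get_text()` accepts a `separator` and a `strip` argument. The snippet below is a small sketch of that; the helper name `clean_text` and the shortened sample markup are assumptions made for this example.

```python
from bs4 import BeautifulSoup

def clean_text(markup):
    """Return the text fragments, each stripped, joined by single spaces."""
    soup = BeautifulSoup(markup, 'html.parser')
    # strip=True trims each text fragment and drops whitespace-only ones;
    # the separator is placed between the remaining fragments.
    return soup.get_text(separator=' ', strip=True)

sample = (
    '<html><head><title>My new page</title></head>\n'
    '<body>\n'
    '<p><b>Cool my new page</b></p>\n'
    '<p>...</p>\n'
    '</body></html>'
)
print(clean_text(sample))
```

For this sample the result is `My new page Cool my new page ...`. Note that the page title is part of the text, just as it is in `soup.text`.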