Introduction reference

Beautiful Soup turns every element of a document into a Python object and connects it to a bunch of other Python objects. If you only need a subset of the document, this is really slow. But you can pass in a SoupStrainer as the parse_only argument to the soup constructor. Beautiful Soup checks each element against the SoupStrainer, and only if it matches is the element turned into a Tag or NavigableText, and added to the tree.

If an element is added to to the tree, then so are its children—even if they wouldn't have matched the SoupStrainer on their own. This lets you parse only the chunks of a document that contain the data you want.


In [11]:
from bs4 import BeautifulSoup,SoupStrainer
import re

In [6]:
doc = '''Bob reports <a href="http://www.bob.com/">success</a>
with his plasma breeding <a
href="http://www.bob.com/plasma">experiments</a>. <i>Don't get any on
us, Bob!</i>

<br><br>Ever hear of annular fusion? The folks at <a
href="http://www.boogabooga.net/">BoogaBooga</a> sure seem obsessed
with it. Secret project, or <b>WEB MADNESS?</b> You decide!'''

parse only a tag


In [19]:
links = SoupStrainer('a')
[tag for tag in BeautifulSoup(doc,"lxml",parse_only=links)]


Out[19]:
[<a href="http://www.bob.com/">success</a>,
 <a href="http://www.bob.com/plasma">experiments</a>,
 <a href="http://www.boogabooga.net/">BoogaBooga</a>]

pare a tags with specified contecnt inside it


In [18]:
linksToBob = SoupStrainer('a', href=re.compile('bob.com/'))
[tag for tag in BeautifulSoup(doc,"lxml", parse_only=linksToBob)]


Out[18]:
[<a href="http://www.bob.com/">success</a>,
 <a href="http://www.bob.com/plasma">experiments</a>]

In [23]:
mentionsOfBob = SoupStrainer(text=re.compile("Bob"))
[text for text in BeautifulSoup(doc,"lxml", parse_only=mentionsOfBob)]


Out[23]:
['Bob reports ', "Don't get any on\nus, Bob!"]

specify criterion for tags


In [24]:
allCaps = SoupStrainer(text=lambda t:t.upper()==t)
[text for text in BeautifulSoup(doc,"lxml", parse_only=allCaps)]


Out[24]:
['. ', '\n', 'WEB MADNESS?']

In [ ]: