An applied example of scraping the Handbook of the Birds of the World (HBW) to list the subspecies of a given bird species.



In [1]:

    
#Import modules
import requests
from bs4 import BeautifulSoup



In [2]:

    
#Example URL
theURL = "https://www.hbw.com/species/brown-wood-owl-strix-leptogrammica"



In [3]:

    
#Get content of the species web page
response = requests.get(theURL)
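
The request above assumes the server answers promptly and successfully. As a minimal sketch (the function name fetch_html and the 10-second timeout are assumptions for illustration, not part of the original notebook), the call can be wrapped so that a stalled connection or an HTTP error status raises an exception instead of passing an error page on to the parser:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML text at `url` (hypothetical helper).

    Raises requests.exceptions.HTTPError for 4xx/5xx responses and a
    timeout exception if the server takes longer than `timeout` seconds.
    """
    response = requests.get(url, timeout=timeout)
    # Turn 4xx/5xx status codes into exceptions
    response.raise_for_status()
    return response.text
```

With this helper, fetch_html(theURL) could supply the text that is handed to BeautifulSoup in the next cell.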



In [4]:

    
#Convert to a "soup" object, which BS4 is designed to work with
soup = BeautifulSoup(response.text,'lxml')

Inspecting the source HTML of the species web page reveals that the subspecies listings fall within a section (a div, in HTML lingo) marked up as <div class="ds-ssp_comp">. So we'll search the 'soup' for this section, which returns a list containing one object, and then extract that object into a variable named section.

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class



In [5]:

    
#Find all sections with the CSS class 'ds-ssp_comp' and get the first (only) item found
div = soup.find_all('div',class_='ds-ssp_comp')
section = div[0]
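
Indexing the result with [0] raises an IndexError if the page contains no matching div, for example after a site redesign. The sketch below shows one possible guard using find(), which returns None when nothing matches. The sample HTML is a hypothetical, simplified stand-in for the real page, the helper name find_subspecies_section is an assumption, and the built-in 'html.parser' is used only so the example runs without lxml installed:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the species page markup
sample_html = """
<div class="ds-ssp_comp">
  <p><em>S. l. newarensis</em></p>
  <p><em>S. l. indranee</em></p>
</div>
"""


def find_subspecies_section(soup):
    """Return the first div with class 'ds-ssp_comp' or raise ValueError."""
    section = soup.find('div', class_='ds-ssp_comp')
    if section is None:
        raise ValueError("No div with class 'ds-ssp_comp' found; "
                         "the page layout may have changed")
    return section


sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_section = find_subspecies_section(sample_soup)
```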

All the entries with the tag <em> are the subspecies entries.



In [6]:

    
#Find all elements in the section with the tag 'em'
subSpecies = section.find_all('em')
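
The two searches (first the div, then the em tags inside it) can also be written as a single CSS selector with Beautiful Soup's select() method. This is an alternative sketch, not the approach used in the notebook, and the sample HTML is again a hypothetical, simplified stand-in for the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page is far larger
sample_html = """
<p><em>emphasis outside the subspecies section</em></p>
<div class="ds-ssp_comp">
  <p><em>S. l. newarensis</em></p>
  <p><em>S. l. ticehursti</em></p>
</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# 'div.ds-ssp_comp em' matches every <em> nested inside the div,
# so the <em> in the first paragraph is not selected
matches = sample_soup.select('div.ds-ssp_comp em')
names = [em.get_text() for em in matches]
print(names)
```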

We can loop through the subspecies found and print the name of each one:



In [7]:

    
#Print the name of each subspecies
for subSpp in subSpecies:
    print(subSpp.get_text())

S. l. newarensis
S. l. ticehursti
S. l. caligata
S. l. laotiana
S. l. indranee
S. l. ochrogenys
S. l. maingayi
S. l. myrtha
S. l. nyctiphasma
S. l. chaseni
S. l. vaga
S. l. leptogrammica
S. l. niasensis
S. l. bartelsi
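
Printing is convenient for inspection, but a list is easier to reuse. The steps above can be gathered into one function that returns the names instead of printing them. This is a sketch under stated assumptions: the function name subspecies_names and the sample HTML are hypothetical, and 'html.parser' stands in for 'lxml' so the example has no extra dependency:

```python
from bs4 import BeautifulSoup


def subspecies_names(html):
    """Return the text of every <em> inside the 'ds-ssp_comp' div.

    Returns an empty list if the div is not present in `html`.
    """
    soup = BeautifulSoup(html, 'html.parser')
    section = soup.find('div', class_='ds-ssp_comp')
    if section is None:
        return []
    # strip=True trims the whitespace around each name
    return [em.get_text(strip=True) for em in section.find_all('em')]


# Hypothetical, simplified stand-in for the species page markup
sample_html = """
<div class="ds-ssp_comp">
  <p><em> S. l. vaga </em></p>
  <p><em>S. l. leptogrammica</em></p>
</div>
"""

names = subspecies_names(sample_html)
print(names)
```

For the live page, the same function could be called on response.text.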