Webscraping with Beautiful Soup


Intro

In this tutorial, we'll be scraping information on the state senators of Illinois, available at http://www.ilga.gov/senate/default.asp, as well as the list of bills each senator has sponsored (e.g., http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=1911).


In [1]:
# import required modules
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import time
import re
import sys

Part 1: Using Beautiful Soup


1.1 Make a GET Request and Read in HTML

We use the requests library to:

  1. make a GET request to the page
  2. read in the HTML of the page

In [8]:
# make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp')
# read the content of the server’s response
src = req.text
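
Before parsing, it's worth checking that the request actually succeeded. A small sketch (raise_for_status() raises an exception on 4xx/5xx responses):


In [ ]:
# confirm the request succeeded before parsing
req.raise_for_status()
print(req.status_code)  # 200 means OK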

1.2 Soup it

Now we use the BeautifulSoup function to parse the response into an HTML tree. This returns an object (called a soup object) which contains all of the HTML in the original document.


In [9]:
# parse the response into an HTML tree
soup = BeautifulSoup(src, 'lxml')
# take a look
print(soup.prettify()[:1000])


<html lang="en">
 <head>
  <title>
   Illinois General Assembly - Senate Members
  </title>
  <link href="/style/lis.css" rel="stylesheet" type="text/css"/>
  <link href="/style/print.css" media="print" rel="stylesheet" type="text/css"/>
  <link href="http://info.er.usgs.gov/public/gils/gilsexec.html" rel="GILS"/>
  <link href="/LISlogo1.ico" rel="Shortcut Icon"/>
  <script language="JavaScript" type="text/javascript">
   <!--

if(window.event + "" == "undefined") event = null;
function HM_f_PopUp(){return false};
function HM_f_PopDown(){return false};
popUp = HM_f_PopUp;
popDown = HM_f_PopDown;

//-->
  </script>
  <!--
    option explicit
  -->
  <meta content='(PICS-1.1 "http://www.weburbia.com/safe/ratings.htm" l r (s 0))' http-equiv="PICS-Label"/>
  <meta content="Government" name="classification"/>
  <meta content="Global" name="distribution"/>
  <meta content="General" name="rating"/>
  <meta content="IL" name="contactState"/>
  <meta content="Illinois General Assembly" name="

1.3 Find Elements

BeautifulSoup has a number of functions to find things on a page. Like other webscraping tools, Beautiful Soup lets you find elements by their:

  1. HTML tags
  2. HTML Attributes
  3. CSS Selectors

Let's search first for HTML tags.

The function find_all searches the soup tree to find all the elements with a particular HTML tag, and returns all of those elements.

What does the example below do?


In [10]:
# find all elements with the 'a' tag
# (commented out because the output is very long)
# soup.find_all("a")

NB: Because find_all() is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object as though it were a function, then it’s the same as calling find_all() on that object.

These two lines of code are equivalent:


In [11]:
# soup.find_all("a")
# soup("a")

That's a lot! Many elements on a page will have the same HTML tag. For instance, if you search for everything with the a tag, you're likely to get a lot of stuff, much of which you don't want. What if we wanted to search ONLY for HTML tags with certain attributes, like particular CSS classes?

We can do this by adding an additional argument to find_all.

In the example below, we find all the a tags, and then filter for those with class="sidemenu".


In [12]:
# Get only the 'a' tags in 'sidemenu' class
soup("a", class_="sidemenu")


Out[12]:
[<a class="sidemenu" href="/senate/default.asp">  Members  </a>,
 <a class="sidemenu" href="/senate/committees/default.asp">  Committees  </a>,
 <a class="sidemenu" href="/senate/schedules/default.asp">  Schedules  </a>,
 <a class="sidemenu" href="/senate/journals/default.asp">  Journals  </a>,
 <a class="sidemenu" href="/senate/transcripts/default.asp">  Transcripts  </a>,
 <a class="sidemenu" href="/senate/rules.asp">  Rules  </a>,
 <a class="sidemenu" href="/senate/audvid.asp">  Live Audio/Video  </a>]

Oftentimes a more efficient way to search and find things on a website is by CSS selector. For this we have to use a different method, select(). Just pass a string into .select() and it returns all elements matching that CSS selector.

In the example above, we can use "a.sidemenu" as a CSS selector, which returns all a tags with class sidemenu.


In [13]:
# get elements with "a.sidemenu" CSS Selector.
soup.select("a.sidemenu")


Out[13]:
[<a class="sidemenu" href="/senate/default.asp">  Members  </a>,
 <a class="sidemenu" href="/senate/committees/default.asp">  Committees  </a>,
 <a class="sidemenu" href="/senate/schedules/default.asp">  Schedules  </a>,
 <a class="sidemenu" href="/senate/journals/default.asp">  Journals  </a>,
 <a class="sidemenu" href="/senate/transcripts/default.asp">  Transcripts  </a>,
 <a class="sidemenu" href="/senate/rules.asp">  Rules  </a>,
 <a class="sidemenu" href="/senate/audvid.asp">  Live Audio/Video  </a>]

Challenge 1

Find all the <a> elements with class mainmenu.


In [16]:
# SOLUTION
soup.select("a.mainmenu")


Out[16]:
[<a class="mainmenu" href="/">Home</a>,
 <a class="mainmenu" href="/legislation/" onblur="HM_f_PopDown('elMenu1')" onfocus="HM_f_PopUp('elMenu1',event)" onmouseout="HM_f_PopDown('elMenu1')" onmouseover="HM_f_PopUp('elMenu1',event)">Legislation &amp; Laws</a>,
 <a class="mainmenu" href="/senate/" onblur="HM_f_PopDown('elMenu3')" onfocus="HM_f_PopUp('elMenu3',event)" onmouseout="HM_f_PopDown('elMenu3')" onmouseover="HM_f_PopUp('elMenu3',event)">Senate</a>,
 <a class="mainmenu" href="/house/" onblur="HM_f_PopDown('elMenu2')" onfocus="HM_f_PopUp('elMenu2',event)" onmouseout="HM_f_PopDown('elMenu2')" onmouseover="HM_f_PopUp('elMenu2',event)">House</a>,
 <a class="mainmenu" href="/mylegislation/" onblur="HM_f_PopDown('elMenu4')" onfocus="HM_f_PopUp('elMenu4',event)" onmouseout="HM_f_PopDown('elMenu4')" onmouseover="HM_f_PopUp('elMenu4',event)">My Legislation</a>,
 <a class="mainmenu" href="/sitemap.asp">Site Map</a>]

1.4 Get Attributes and Text of Elements

Once we identify elements, we want to access the information in each of them. Usually, this means two things:

  1. Text
  2. Attributes

Getting the text inside an element is easy. All we have to do is use the text member of a tag object:


In [8]:
# this is a list
soup.select("a.sidemenu")

# we first want to get an individual tag object
first_link = soup.select("a.sidemenu")[0]

# check out its class
type(first_link)


Out[8]:
bs4.element.Tag

It's a tag! That means it has a text member:


In [9]:
print(first_link.text)


  Members  
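
Notice the padding around "Members". If you want the text without the surrounding whitespace, get_text() can strip it for you:


In [ ]:
# strip surrounding whitespace from the element's text
print(first_link.get_text(strip=True))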

Sometimes we want the value of certain attributes. This is particularly relevant for a tags, or links, where the href attribute tells us where the link goes.

You can access a tag’s attributes by treating the tag like a dictionary:


In [10]:
print(first_link['href'])


/senate/default.asp
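
Treating a tag like a dictionary raises a KeyError if the attribute is missing. Tag.get() returns None instead, which is safer when a page is inconsistent:


In [ ]:
# returns None instead of raising KeyError if 'href' is absent
print(first_link.get('href'))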

Challenge 2

Find all the href attributes (URLs) from the mainmenu links.


In [21]:
# SOLUTION
[link['href'] for link in soup.select("a.mainmenu")]


Out[21]:
['/',
 '/legislation/',
 '/senate/',
 '/house/',
 '/mylegislation/',
 '/sitemap.asp']
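
These are relative URLs. A sketch of one way to make them absolute with the standard library (plain string concatenation against the site root would also work):


In [ ]:
# resolve relative hrefs against the page's URL
from urllib.parse import urljoin

base = 'http://www.ilga.gov/senate/default.asp'
[urljoin(base, link['href']) for link in soup.select("a.mainmenu")]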

Part 2


Believe it or not, that's all you need to scrape a website. Let's apply these skills to scrape http://www.ilga.gov/senate/default.asp?GA=98

NB: we're just going to scrape the 98th General Assembly.

Our goal is to scrape information on each senator, including their:

- name
- district
- party

2.1 First, make the GET request and soup it.


In [12]:
# make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')
# read the content of the server’s response
src = req.text
# soup it
soup = BeautifulSoup(src, "lxml")

2.2 Find the right elements and text.

Now let's try to get a list of rows in the table of senators. Remember that table rows are identified by the tr tag.


In [13]:
# get all tr elements
rows = soup.find_all("tr")
len(rows)


Out[13]:
73

But remember, find_all gets all the elements with the tr tag. We can use more precise CSS selectors to get only the rows we want. Because the site nests tables inside tables, the descendant selector 'tr tr tr' matches only rows that sit inside two ancestor rows, which is where the senator data lives.
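
To see how descendant selectors behave, here's a toy example on some hypothetical HTML (purely for illustration):


In [ ]:
# hypothetical nested tables: 'tr tr tr' matches only the innermost row
toy = BeautifulSoup(
    "<table><tr><td><table><tr><td>"
    "<table><tr><td>inner</td></tr></table>"
    "</td></tr></table></td></tr></table>", "lxml")
print(toy.select("tr tr tr"))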


In [14]:
# returns every element matching the 'tr tr tr' CSS selector
rows = soup.select('tr tr tr')
print(rows[2].prettify())


<tr>
 <td bgcolor="white" class="detail" width="40%">
  <a href="/senate/Senator.asp?GA=98&amp;MemberID=1911">
   Pamela J. Althoff
  </a>
 </td>
 <td align="center" bgcolor="white" class="detail" width="15%">
  <a href="SenatorBills.asp?GA=98&amp;MemberID=1911">
   Bills
  </a>
 </td>
 <td align="center" bgcolor="white" class="detail" width="15%">
  <a href="SenCommittees.asp?GA=98&amp;MemberID=1911">
   Committees
  </a>
 </td>
 <td align="center" bgcolor="white" class="detail" width="15%">
  32
 </td>
 <td align="center" bgcolor="white" class="detail" width="15%">
  R
 </td>
</tr>

We can use the select method on any Tag object, not just the whole soup. Let's say we want to find everything with the CSS selector td.detail within one of the rows from the list we created above.


In [15]:
# select only those 'td' tags with class 'detail'
row = rows[2]
detailCells = row.select('td.detail')
detailCells


Out[15]:
[<td bgcolor="white" class="detail" width="40%"><a href="/senate/Senator.asp?GA=98&amp;MemberID=1911">Pamela J. Althoff</a></td>,
 <td align="center" bgcolor="white" class="detail" width="15%"><a href="SenatorBills.asp?GA=98&amp;MemberID=1911">Bills</a></td>,
 <td align="center" bgcolor="white" class="detail" width="15%"><a href="SenCommittees.asp?GA=98&amp;MemberID=1911">Committees</a></td>,
 <td align="center" bgcolor="white" class="detail" width="15%">32</td>,
 <td align="center" bgcolor="white" class="detail" width="15%">R</td>]

Most of the time, we're interested in the actual text of a website, not its tags. Remember, to get the text of an HTML element, use the text member.


In [16]:
# Keep only the text in each of those cells
rowData = [cell.text for cell in detailCells]

Now we can combine the Beautiful Soup tools with our basic Python skills to scrape an entire web page.


In [17]:
# check em out
print(rowData[0]) # Name
print(rowData[3]) # district
print(rowData[4]) # party


Pamela J. Althoff
32
R

2.3 Loop it all together

Let's use a for loop to get 'em all!


In [18]:
# make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')

# read the content of the server’s response
src = req.text

# soup it
soup = BeautifulSoup(src, "lxml")

# Create empty list to store our data
members = []

# returns every element matching the 'tr tr tr' CSS selector
rows = soup.select('tr tr tr')

# loop through all rows
for row in rows:
    # select only those 'td' tags with class 'detail'
    detailCells = row.select('td.detail')
    
    # get rid of junk rows
    if len(detailCells) != 5:
        continue
        
    # Keep only the text in each of those cells
    rowData = [cell.text for cell in detailCells]
    
    # Collect information
    name = rowData[0]
    district = int(rowData[3])
    party = rowData[4]
    
    # Store in a tuple
    tup = (name,district,party)
    
    # Append to list
    members.append(tup)

In [19]:
len(members)


Out[19]:
61

Challenge 3: Get the HREF element pointing to members' bills.

The code above retrieves information on:

- the senator's name
- their district number
- and their party

We now want to retrieve the URL for each senator's list of bills. The format for the list of bills for a given senator is:

http://www.ilga.gov/senate/SenatorBills.asp + ? + GA=98 + &MemberID=memberID + &Primary=True

to get something like:

http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True

You should be able to see that, unfortunately, memberID is not currently something pulled out in our scraping code.

Your task is to modify the code above so that, for each member, we also retrieve the full URL pointing to their page of primary-sponsored bills, and return it along with their name, district, and party.

Tips:

  • To do this, you will want to get the appropriate anchor element (<a>) in each legislator's row of the table. You can again use the .select() method on the row object in the loop to do this — similar to the command that finds all of the td.detail cells in the row. Remember that we only want the link to the legislator's bills, not the committees or the legislator's profile page.
  • The anchor elements' HTML will look like <a href="SenatorBills.asp?...">Bills</a>. The string in the href attribute contains the relative link we are after. You can access an attribute of a BeautifulSoup Tag object the same way you access a Python dictionary: anchor['attributeName']. (See the documentation for more details.)
  • NOTE: There are a lot of different ways to use BeautifulSoup to get things done; whatever you need to do to pull that HREF out is fine (one possible approach is sketched below). Posting on the etherpad is recommended for discussing different strategies.
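
For instance, a hypothetical sketch (not the only way; the solution below uses positional indexing instead) that picks out the anchor whose text is "Bills":


In [ ]:
# hypothetical sketch: grab the anchor labeled 'Bills' from a senator's row
row = soup.select('tr tr tr')[2]   # e.g. the row inspected earlier
bills_link = next(a for a in row.select('a') if a.text.strip() == 'Bills')
print(bills_link['href'])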

In [20]:
# SOLUTION

# make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')

# read the content of the server’s response
src = req.text

# soup it
soup = BeautifulSoup(src, "lxml")

# Create empty list to store our data
members = []

# returns every ‘tr tr tr’ css selector in the page
rows = soup.select('tr tr tr')

# loop through all rows
for row in rows:
    # select only those 'td' tags with class 'detail'
    detailCells = row.select('td.detail')
    
    # get rid of junk rows
    if len(detailCells) != 5:
        continue
        
    # Keep only the text in each of those cells
    rowData = [cell.text for cell in detailCells]
    
    # Collect information
    name = rowData[0]
    district = int(rowData[3])
    party = rowData[4]
    
    # add href
    href = row.select('a')[1]['href']
    
    # add full path
    full_path = "http://www.ilga.gov/senate/" + href + "&Primary=True"
    
    # Store in a tuple
    tup = (name,district,party, full_path)
    
    # Append to list
    members.append(tup)

In [21]:
members[:5]


Out[21]:
[('Pamela J. Althoff',
  32,
  'R',
  'http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=1911&Primary=True'),
 ('Jason A. Barickman',
  53,
  'R',
  'http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=2018&Primary=True'),
 ('Scott M Bennett',
  52,
  'D',
  'http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=2272&Primary=True'),
 ('Jennifer Bertino-Tarrant',
  49,
  'D',
  'http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=2022&Primary=True'),
 ('Daniel Biss',
  9,
  'D',
  'http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=2020&Primary=True')]

Challenge 4: Make a function

Turn the code above into a function that accepts a URL, scrapes the URL for its senators, and returns a list of tuples containing information about each senator.


In [22]:
# SOLUTION
def get_members(url):
    src = requests.get(url).text
    soup = BeautifulSoup(src, "lxml")
    rows = soup.select('tr')
    members = []
    for row in rows:
        detailCells = row.select('td.detail')
        if len(detailCells) != 5:
            continue
        rowData = [cell.text for cell in detailCells]
        name = rowData[0]
        district = int(rowData[3])
        party = rowData[4]
        href = row.select('a')[1]['href']
        full_path = "http://www.ilga.gov/senate/" + href + "&Primary=True"
        tup = (name,district,party,full_path)
        members.append(tup)
    return members

In [23]:
# Test your code!
senateMembers = get_members('http://www.ilga.gov/senate/default.asp?GA=98')
len(senateMembers)


Out[23]:
61

Part 3: Scrape Bills


3.1 Writing a Scraper Function

Now we want to scrape the webpages corresponding to the bills sponsored by each senator.

Write a function called get_bills(url) to parse a given Bills URL. This will involve:

  • requesting the URL using the `requests` library
  • using the features of the BeautifulSoup library to find all of the <td> elements with the class billlist
  • returning a list of tuples, each with:
    • the bill ID (1st column)
    • description (2nd column)
    • chamber (S or H) (3rd column)
    • the last action (4th column)
    • the last action date (5th column)

I've started the function for you. Fill in the rest.


In [25]:
# SOLUTION
def get_bills(url):
    src = requests.get(url).text
    soup = BeautifulSoup(src, "lxml")
    rows = soup.select('tr tr tr')
    bills = []
    rowData = []
    for row in rows:
        detailCells = row.select('td.billlist')
        if len(detailCells) is not 5:
            continue
        rowData = [cell.text for cell in row]
        bill_id = rowData[0]
        description = rowData[2]
        champber = rowData[3]
        last_action = rowData[4]
        last_action_date = rowData[5] 
        tup = (bill_id,description,champber,last_action,last_action_date)
        bills.append(tup)
    return(bills)

In [30]:
# test your code:
test_url = senateMembers[0][3]
get_bills(test_url)[0:5]


Out[30]:
[('SB27', 'MEDICAID BUDGET NOTE ACT', 'S', 'Session Sine Die', '1/13/2015'),
 ('SB28',
  'HOMELESS VETERANS SHELTER ACT',
  'S',
  'Session Sine Die',
  '1/13/2015'),
 ('SB29', 'ROAD FUND-NO TRANSFERS', 'S', 'Session Sine Die', '1/13/2015'),
 ('SB33',
  'EPA-RULES-DOCUMENT SUBMISSION',
  'S',
  'Public Act . . . . . . . . . 98-0072',
  '7/15/2013'),
 ('SB104',
  'MIN WAGE-OVERTIME-ALTERN SHIFT',
  'S',
  'Session Sine Die',
  '1/13/2015')]

3.2 Get all the bills

Finally, create a dictionary bills_dict which maps a district number (the key) onto a list of bills (the value) emanating from that district. You can do this by looping over the senate members in senateMembers and calling get_bills() for each of their associated bill URLs.

NOTE: please call the function time.sleep(0.5) for each iteration of the loop, so that we don't destroy the state's web site.


In [ ]:
# SOLUTION
bills_dict = {}
# loop over just the first 5 senators here, to keep the demo short
for member in senateMembers[:5]:
    bills_dict[member[1]] = get_bills(member[3])
    time.sleep(0.5)

In [ ]:
bills_dict[52]

In [ ]: