Let's scrape some death row data

Texas executes a lot of criminals, and it has a web page that keeps track of people on its death row.

Using what you've learned so far, let's scrape this table into a CSV. Then we're going to write a function to grab a couple of pieces of additional data from the inmates' detail pages.

Import our libraries


In [ ]:
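One way to fill in this cell: csv comes with the standard library, while requests and BeautifulSoup are third-party packages (pip install requests beautifulsoup4).

```python
# csv ships with Python; requests and bs4 are third-party
import csv

import requests
from bs4 import BeautifulSoup
```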

Fetch and parse the summary page


In [ ]:
# the URL to request


# get that page


# turn the page text into soup


# find the table of interest
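A sketch of this cell. The URL below is an assumption -- check it against the live TDCJ death row page before running. The requests call is commented out here and a tiny stand-in snippet is parsed instead, so the sketch runs offline; in the notebook you'd uncomment the fetch and soup the real page text.

```python
import requests
from bs4 import BeautifulSoup

# the URL to request -- this address is an assumption; verify it
# against the live TDCJ site
url = 'https://www.tdcj.texas.gov/death_row/dr_offenders_on_dr.html'

# get that page -- commented out so the sketch runs offline;
# uncomment these two lines in the notebook
# page = requests.get(url)
# html = page.text

# a stand-in for the fetched page text
html = '<html><body><table><tr><th>TDCJ Number</th></tr></table></body></html>'

# turn the page text into soup
soup = BeautifulSoup(html, 'html.parser')

# find the table of interest (assumption: the first table on the page)
table = soup.find('table')
```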

Loop over the table rows and write to CSV


In [ ]:
# find all table rows (skip the first one)


# open a file to write to

    
    # create a writer object

    
    # write header row


    # loop over the rows

        
        # extract the cells

        
        # offense ID

        
        # link to detail page

        
        # last name

        
        # first name

        
        # dob

        
        # sex

        
        # race

        
        # date received

        
        # county

        
        # offense date

        
        # write out to file
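Here's one way the cell above could look. The column order, the stand-in table, and the base URL for detail links are all assumptions -- check them against the live summary page. In the notebook, `table` would come from the soup of the fetched page rather than the inline snippet.

```python
import csv
from bs4 import BeautifulSoup

# stand-in for the summary table; in the notebook this comes from
# souping the live page (column layout is an assumption)
html = '''<table>
<tr><th>ID</th><th>Last</th><th>First</th><th>DOB</th><th>Sex</th>
<th>Race</th><th>Received</th><th>County</th><th>Offense</th></tr>
<tr><td><a href="dr_info/doejohn.html">999999</a></td><td>Doe</td>
<td>John</td><td>01/01/1970</td><td>M</td><td>W</td>
<td>01/01/2000</td><td>Travis</td><td>06/01/1999</td></tr>
</table>'''
table = BeautifulSoup(html, 'html.parser').find('table')

# find all table rows (skip the first one -- it's the header)
rows = table.find_all('tr')[1:]

# open a file to write to
with open('tx-death-row.csv', 'w', newline='') as outfile:

    # create a writer object
    writer = csv.writer(outfile)

    # write header row
    writer.writerow(['id', 'link', 'last', 'first', 'dob', 'sex', 'race',
                     'date_received', 'county', 'offense_date'])

    # loop over the rows
    for row in rows:

        # extract the cells
        cells = row.find_all('td')

        # offense ID
        off_id = cells[0].text.strip()

        # link to detail page (relative href -> full URL;
        # the base address is an assumption)
        link = 'https://www.tdcj.texas.gov/death_row/' + cells[0].find('a')['href']

        # last name, first name, dob, sex, race,
        # date received, county, offense date
        last, first, dob, sex, race, received, county, offense = \
            [c.text.strip() for c in cells[1:9]]

        # write out to file
        writer.writerow([off_id, link, last, first, dob, sex, race,
                         received, county, offense])
```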

Let's write a parsing function

We need a function that will take a URL of a detail page and do these things:

  • Open the detail page URL using requests
  • Parse the contents using BeautifulSoup
  • Isolate the bits of information we're interested in: height, weight, eye color, hair color, native county, native state, link to mugshot
  • Return those bits of information in a dictionary

A couple of things to keep in mind: Not every inmate will have every piece of data, and not every inmate has an HTML detail page to parse -- the older records link to a scanned image instead. So we'll need to work around those limitations.

We shall call our function fetch_details().


In [ ]:
"""Fetch details from a death row inmate's page."""

    # create a dictionary with some default values
    # as we go through, we're going to add stuff to it
    # (if you want to explore further, there is actually
    # a special kind of dictionary called a "defaultdict" to
    # handle this use case) =>
    # https://docs.python.org/3/library/collections.html#collections.defaultdict


    
    # partway down the page, the links go to JPEGs instead of HTML pages
    # we can't parse images, so we'll just return the empty dictionary

    
    # get the page

    
    # soup the HTML


    # find the table of info

    
    # target the mugshot, if it exists

    
    # if there is a mug, grab the src and add it to the dictionary



        
    # get a list of the "label" cells
    # on some pages, they're identified by the class 'tabledata_bold_align_right_deathrow'
    # on others, they're identified by the class 'tabledata_bold_align_right_unit'
    # so we pass it a list of possible classes


    # gonna do some fanciness here in the interests of DRY =>
    # a list of attributes we're interested in -- should match exactly the text inside the cells of interest


    # loop over the list of label cells that we targeted earlier

        

        
        # check to see if the cell text is in our list of attributes

            
            # if so, find the value -- go up to the tr and search for the other td --
            # and add that attribute to our dictionary

            


    # return the dictionary to the script
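A sketch of the whole function under a few stated assumptions: that image-only records have links that don't end in .html, that the info table is the first table on the page, that the mugshot is the first image inside it, and that the attribute labels match the cell text exactly -- all worth checking against a live detail page. The two label-cell classes come from the notes above.

```python
import requests
from bs4 import BeautifulSoup

def fetch_details(url):
    """Fetch details from a death row inmate's page."""

    # a dictionary with some default values; we'll fill it in as we go
    out = {'height': None, 'weight': None, 'eye_color': None,
           'hair_color': None, 'native_county': None,
           'native_state': None, 'mug': None}

    # partway down the page, the links go to JPEGs instead of HTML
    # pages -- we can't parse images, so just return the defaults
    # (assumption: every parseable detail page ends in .html)
    if not url.endswith('.html'):
        return out

    # get the page
    page = requests.get(url)

    # soup the HTML
    soup = BeautifulSoup(page.text, 'html.parser')

    # find the table of info (assumption: the first table on the page)
    table = soup.find('table')

    # target the mugshot, if it exists
    # (assumption: the mug is the first image inside the table)
    mug = table.find('img')

    # if there is a mug, grab the src and add it to the dictionary
    if mug:
        out['mug'] = mug['src']

    # label cells carry one of two classes depending on the page's
    # vintage, so we pass find_all() a list of possible classes
    labels = table.find_all('td', {'class': [
        'tabledata_bold_align_right_deathrow',
        'tabledata_bold_align_right_unit']})

    # the attributes we want -- these should match the label cell
    # text exactly, so verify them against a live detail page
    attrs = ['Height', 'Weight', 'Eye Color', 'Hair Color',
             'Native County', 'Native State']

    # loop over the label cells
    for label in labels:
        text = label.text.strip()

        # if the cell text is one of our attributes, go up to the tr,
        # grab the value from the last td, and add it to the dictionary
        if text in attrs:
            value = label.find_parent('tr').find_all('td')[-1].text.strip()
            out[text.lower().replace(' ', '_')] = value

    # return the dictionary to the script
    return out
```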

Putting it all together

Now that we have our parsing function, we can:

  • Open and read the CSV file of summary inmate info (the one we just scraped)
  • Open and write a new CSV file of detailed inmate info

As we loop over the summary inmate data, we're going to call our new parsing function on the detail URL in each row. Then we'll combine the dictionaries (data from the row of summary data + new detailed data) and write out to the new file.


In [ ]:
# open the CSV file to read from and the one to write to

    
    # create a reader object

    
    # the output headers are going to be the headers from the summary file
    # plus a list of new attributes


    # create the writer object

    
    # write the header row

    
    # loop over the rows in the input file

        
        # print the inmate's name (so we can keep track of where we're at)
        # helps with debugging, too

        
        # call our function on the URL in the row

        
        # add the two dicts together by
        # unpacking them inside a new one
        # and write out to file
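The whole loop might look like the sketch below. So that it runs on its own, it first writes a tiny stand-in summary CSV and stubs out fetch_details(); in the notebook you'd use the file you scraped earlier and the real function defined above, and the column names would be whatever headers your summary file has.

```python
import csv

# a tiny stand-in input file; in the notebook this is the summary
# CSV we scraped earlier (column names are an assumption)
with open('tx-death-row.csv', 'w', newline='') as f:
    f.write('id,link,last,first\n'
            '999999,https://example.com/doe.html,Doe,John\n')

# stub for the fetch_details() function defined above, so this
# sketch runs without hitting the network
def fetch_details(url):
    return {'height': None, 'weight': None, 'eye_color': None,
            'hair_color': None, 'native_county': None,
            'native_state': None, 'mug': None}

# the new columns our parsing function adds
new_attributes = ['height', 'weight', 'eye_color', 'hair_color',
                  'native_county', 'native_state', 'mug']

# open the CSV file to read from and the one to write to
with open('tx-death-row.csv', 'r') as infile, \
     open('tx-death-row-details.csv', 'w', newline='') as outfile:

    # create a reader object
    reader = csv.DictReader(infile)

    # the output headers are the headers from the summary file
    # plus the list of new attributes
    headers = reader.fieldnames + new_attributes

    # create the writer object
    writer = csv.DictWriter(outfile, fieldnames=headers)

    # write the header row
    writer.writeheader()

    # loop over the rows in the input file
    for row in reader:

        # print the inmate's name so we can keep track of where
        # we're at -- helps with debugging, too
        print(row['last'])

        # call our function on the URL in the row
        details = fetch_details(row['link'])

        # add the two dicts together by unpacking them inside a
        # new one, and write out to file
        writer.writerow({**row, **details})
```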