Our goal in this exercise is to scrape the roster of inmates in the Hennepin County Jail into a CSV.
What happens when we click the search box without entering a first or last name? We're directed to a page with the listing of the entire roster at a new URL.
This is good news -- some forms are set up to require a minimum number of characters. Now we need to check whether we can go straight to that URL without visiting the landing page first and clicking through -- in other words, does that page depend on a cookie being passed?
To test this, I usually open another browser window in incognito mode and paste in the URL. Success! Going to https://www4.co.hennepin.mn.us/webbooking/resultbyname.asp dumps out the entire list of inmates, so that's where we'll start. (You could also open your network tab and see what information is getting exchanged during the request. For more complex, dynamically created pages that rely on cookies, we'd probably need the requests Session object.)
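If the page had turned out to depend on a cookie, a rough sketch of the Session approach would look like this (the landing-page URL here is a guess for illustration):

```python
import requests

# a Session object persists cookies across requests, so hitting the
# landing page first means any cookie it sets rides along afterward
session = requests.Session()

# the landing-page URL is hypothetical -- use whatever page sets the cookie
session.get('https://www4.co.hennepin.mn.us/webbooking/')
results_page = session.get('https://www4.co.hennepin.mn.us/webbooking/resultbyname.asp')
```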
Let's click on an inmate link. We want to look at two things:
What's the pattern for an inmate detail URL?
What pieces of data do we want to grab off the detail page, and where do they live in the HTML?
In [ ]:
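This first empty cell is a natural spot for imports. Here's a guess at what the rest of the walkthrough will need -- requests to fetch pages, BeautifulSoup to parse them, csv to write our output, time to pause between requests and datetime to validate dates:

```python
import csv
import time
from datetime import datetime

import requests
from bs4 import BeautifulSoup
```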
In [ ]:
# base URL
# results page URL
# pattern for inmate detail URLs
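Here's one way that cell might shake out. The results URL is the one we verified in the browser; the detail-URL pattern is a placeholder based on what clicking an inmate link would show, so check a real link for the actual path and parameter name:

```python
# base URL for the jail's webbooking app
BASE_URL = 'https://www4.co.hennepin.mn.us/webbooking'

# results page URL -- the full-roster dump we verified in the browser
RESULTS_URL = BASE_URL + '/resultbyname.asp'

# pattern for inmate detail URLs -- 'chargedetail.asp' and the parameter
# name are placeholders; check a real inmate link for the actual pattern
DETAIL_URL_TEMPLATE = BASE_URL + '/chargedetail.asp?booking_number={}'
```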
In [ ]:
# fetch the page
# parse it
# find the table we want
# get the rows of the table, minus the header
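Filling in that cell might look something like this -- the index used to find the table is illustrative; on the live page, you'd inspect the HTML and grab whatever reliably identifies the roster table:

```python
# fetch the page
r = requests.get(RESULTS_URL)

# parse it
soup = BeautifulSoup(r.text, 'html.parser')

# find the table we want -- picking the second table on the page is a
# guess; inspect the real HTML for a more reliable hook
table = soup.find_all('table')[1]

# get the rows of the table, minus the header
rows = table.find_all('tr')[1:]
```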
We need to pause here and write a couple of functions to help us extract the bits of data from the inmate's detail page:
In [ ]:
"""Given a label and a soup'd detail page, return the associated value."""
"""Fetch and parse and inmate detail page, return three bits of data."""
# fetch the page
# parse it into soup
# call the get_inmate_attr function to nab the cells we're interested in
# return a dict with this info
# lose the " Address" string on the housing cell, where it exists
# also, parse the booking date as a date to validate
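A sketch of those two functions, following the comments above. The label strings, the label-cell/value-cell layout and the '%m/%d/%Y' date format are all assumptions to verify against the live detail page:

```python
def get_inmate_attr(label, soup):
    """Given a label and a soup'd detail page, return the associated value."""
    # find the cell whose text contains the label, then grab the value in
    # the cell that follows it -- adjust if the real markup differs
    label_cell = soup.find('td', string=lambda s: s and label in s)
    return label_cell.find_next('td').get_text(strip=True)


def inmate_details(url):
    """Fetch and parse an inmate detail page, return three bits of data."""
    # fetch the page
    r = requests.get(url)

    # parse it into soup
    soup = BeautifulSoup(r.text, 'html.parser')

    # call the get_inmate_attr function to nab the cells we're interested
    # in -- these label strings are hypothetical; copy the real ones from
    # the page
    housing = get_inmate_attr('Housing', soup)
    received = get_inmate_attr('Received Date', soup)
    custody = get_inmate_attr('Custody Status', soup)

    # lose the " Address" string on the housing cell, where it exists
    housing = housing.replace(' Address', '')

    # also, parse the booking date as a date to validate it -- assumes the
    # cell starts with a '%m/%d/%Y' date, which you'd confirm on the page
    booking_date = datetime.strptime(received.split()[0], '%m/%d/%Y')

    # return a dict with this info
    return {
        'housing': housing,
        'booking_date': booking_date.strftime('%Y-%m-%d'),
        'custody_status': custody,
    }
```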
In [ ]:
# open a file to write to
# define your headers -- they should match the keys in the dict
# we're creating as we scrape
# create a writer object
# write the header row
# print some summary info
# loop over the rows of inmates from the search results page
# unpack the list of cells in the row
# get the detail page link using the template string we defined up top
# unpack the name into last/rest and print it
# reformat the dob, which, bonus, also validates it
# our dict of summary info
# call the inmate_details function on the detail URL
# remember: this returns a dictionary
# combine the summary and detail dicts
# by unpacking them into a new dict
# https://www.python.org/dev/peps/pep-0448/
# write the combined dict out to file
# pause for 2 seconds to give the server a break
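Putting it all together, the writing loop might look like this. The column order in the results table -- booking number, name, DOB -- is a guess to confirm by inspection, as is the 'LAST, REST' name format:

```python
# open a file to write to
with open('hennepin-jail-roster.csv', 'w', newline='') as outfile:

    # define your headers -- they should match the keys in the dicts
    # we're creating as we scrape
    headers = ['booking_num', 'last_name', 'rest_of_name', 'dob',
               'housing', 'booking_date', 'custody_status']

    # create a writer object
    writer = csv.DictWriter(outfile, fieldnames=headers)

    # write the header row
    writer.writeheader()

    # print some summary info
    print(f'Writing {len(rows)} inmates to file ...')

    # loop over the rows of inmates from the search results page
    for row in rows:

        # unpack the list of cells in the row -- the column order here
        # is an assumption to verify against the live table
        booking_num, name, dob = [td.get_text(strip=True)
                                  for td in row.find_all('td')[:3]]

        # get the detail page link using the template string we defined up top
        detail_url = DETAIL_URL_TEMPLATE.format(booking_num)

        # unpack the name into last/rest and print it
        last_name, rest_of_name = [x.strip() for x in name.split(',', 1)]
        print(last_name, rest_of_name)

        # reformat the dob, which, bonus, also validates it
        dob = datetime.strptime(dob, '%m/%d/%Y').strftime('%Y-%m-%d')

        # our dict of summary info
        summary = {'booking_num': booking_num, 'last_name': last_name,
                   'rest_of_name': rest_of_name, 'dob': dob}

        # call the inmate_details function on the detail URL
        # remember: this returns a dictionary
        details = inmate_details(detail_url)

        # combine the summary and detail dicts by unpacking them into a
        # new dict -- https://www.python.org/dev/peps/pep-0448/
        combined = {**summary, **details}

        # write the combined dict out to file
        writer.writerow(combined)

        # pause for 2 seconds to give the server a break
        time.sleep(2)
```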
It's all well and good to get the basic inmate info, but we're probably also interested in why they're in jail -- what are they charged with?
For this exercise, add some parsing logic to the inmate_details scraping function to extract data about what each inmate has been charged with. Pulling them out as a list of dictionaries makes the most sense to me, but you can format it however you like.
Because each inmate has a variable number of charges, you also need to think about how you want to represent the data in your CSV. Is each line one charge? One inmate? Picture how one row of data should look in your output file and structure your parsing to match.
In [ ]:
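One possible approach, not a definitive solution: a helper that pulls each row of a (hypothetical) charges table into a list of dicts. How you locate the table and what the cells contain are assumptions to check against a real detail page:

```python
def get_charges(soup):
    """Given a soup'd detail page, return a list of charge dicts."""
    # locating the charges table by position is a guess -- inspect the
    # real page for a more reliable hook
    charge_table = soup.find_all('table')[-1]

    charges = []

    # loop over the rows of the table, minus the header
    for row in charge_table.find_all('tr')[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all('td')]

        # the cell order and keys here are hypothetical
        charges.append({
            'charge_description': cells[0],
            'charge_severity': cells[1],
            'charge_status': cells[2],
        })

    return charges
```

Inside inmate_details, you'd call get_charges(soup) and fold the result into the returned dict. If you decide each CSV line is one charge, the main loop would then write `{**summary, **details, **charge}` once per charge; one line per inmate would instead mean flattening the list into a single delimited column.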