Our goal in this exercise is to scrape the roster of inmates in the Hennepin County Jail into a CSV.
What happens when we click the search box without entering a first or last name? We're directed to a page with the listing of the entire roster at a new URL.
This is good news -- some forms are set up to require a minimum number of characters. Now we need to check whether we can go straight to that URL without visiting the landing page first and clicking through -- in other words, does that page depend on a cookie being passed?
To test this, I usually open another browser window in incognito mode and paste in the URL. Success! Going to https://www4.co.hennepin.mn.us/webbooking/resultbyname.asp dumps out the entire list of inmates, so that's where we'll start. (You could also open your network tab and see what information is getting exchanged during the request. For more complex, dynamically created pages that rely on cookies, we'd probably need the requests Session object.)
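If the page had turned out to depend on a cookie, a rough sketch of the Session approach would look like this (the landing-page URL here is a guess for illustration):

```python
import requests

# a Session object persists cookies across requests, so hitting the
# landing page first means any cookie it sets rides along afterward
session = requests.Session()

# the landing-page URL is hypothetical -- use whatever page sets the cookie
session.get('https://www4.co.hennepin.mn.us/webbooking/')
results_page = session.get('https://www4.co.hennepin.mn.us/webbooking/resultbyname.asp')
```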
Let's click on an inmate link. We want to look at two things:
What's the pattern for an inmate detail URL?
What pieces of data do we want to grab off the detail page, and where do they live in the HTML?
In [ ]:
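This first empty cell is a natural spot for imports. Here's a guess at what the rest of the walkthrough will need -- requests to fetch pages, BeautifulSoup to parse them, csv to write our output, time to pause between requests and datetime to validate dates:

```python
import csv
import time
from datetime import datetime

import requests
from bs4 import BeautifulSoup
```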
In [ ]:
# base URL
# results page URL
# pattern for inmate detail URLs
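Here's one way that cell might shake out. The results URL is the one we verified in the browser; the detail-URL pattern is a placeholder based on what clicking an inmate link would show, so check a real link for the actual path and parameter name:

```python
# base URL for the jail's webbooking app
BASE_URL = 'https://www4.co.hennepin.mn.us/webbooking'

# results page URL -- the full-roster dump we verified in the browser
RESULTS_URL = BASE_URL + '/resultbyname.asp'

# pattern for inmate detail URLs -- 'chargedetail.asp' and the parameter
# name are placeholders; check a real inmate link for the actual pattern
DETAIL_URL_TEMPLATE = BASE_URL + '/chargedetail.asp?booking_number={}'
```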
In [ ]:
# fetch the page
# parse it
# find the table we want
# get the rows of the table, minus the header
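Filling in that cell might look something like this -- the index used to find the table is illustrative; on the live page, you'd inspect the HTML and grab whatever reliably identifies the roster table:

```python
# fetch the page
r = requests.get(RESULTS_URL)

# parse it
soup = BeautifulSoup(r.text, 'html.parser')

# find the table we want -- picking the second table on the page is a
# guess; inspect the real HTML for a more reliable hook
table = soup.find_all('table')[1]

# get the rows of the table, minus the header
rows = table.find_all('tr')[1:]
```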
We need to pause here and write a couple of functions to help us extract the bits of data from the inmate's detail page:
In [ ]:
"""Given a label and a soup'd detail page, return the associated value."""
"""Fetch and parse and inmate detail page, return three bits of data."""
# fetch the page
# parse it into soup
# call the get_inmate_attr function to nab the cells we're interested in
# return a dict with this info
# lose the " Address" string on the housing cell, where it exists
# also, parse the booking date as a date to validate
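A sketch of those two functions, following the comments above. The label strings, the label-cell/value-cell layout and the '%m/%d/%Y' date format are all assumptions to verify against the live detail page:

```python
def get_inmate_attr(label, soup):
    """Given a label and a soup'd detail page, return the associated value."""
    # find the cell whose text contains the label, then grab the value in
    # the cell that follows it -- adjust if the real markup differs
    label_cell = soup.find('td', string=lambda s: s and label in s)
    return label_cell.find_next('td').get_text(strip=True)


def inmate_details(url):
    """Fetch and parse an inmate detail page, return three bits of data."""
    # fetch the page
    r = requests.get(url)

    # parse it into soup
    soup = BeautifulSoup(r.text, 'html.parser')

    # call the get_inmate_attr function to nab the cells we're interested
    # in -- these label strings are hypothetical; copy the real ones from
    # the page
    housing = get_inmate_attr('Housing', soup)
    received = get_inmate_attr('Received Date', soup)
    custody = get_inmate_attr('Custody Status', soup)

    # lose the " Address" string on the housing cell, where it exists
    housing = housing.replace(' Address', '')

    # also, parse the booking date as a date to validate it -- assumes the
    # cell starts with a '%m/%d/%Y' date, which you'd confirm on the page
    booking_date = datetime.strptime(received.split()[0], '%m/%d/%Y')

    # return a dict with this info
    return {
        'housing': housing,
        'booking_date': booking_date.strftime('%Y-%m-%d'),
        'custody_status': custody,
    }
```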
In [ ]:
# open a file to write to
# define your headers -- they should match the keys in the dict
# we're creating as we scrape
# create a writer object
# write the header row
# print some summary info
# loop over the rows of inmates from the search results page
# unpack the list of cells in the row
# get the detail page link using the template string we defined up top
# unpack the name into last/rest and print it
# reformat the dob, which, bonus, also validates it
# our dict of summary info
# call the inmate_details function on the detail URL
# remember: this returns a dictionary
# combine the summary and detail dicts
# by unpacking them into a new dict
# https://www.python.org/dev/peps/pep-0448/
# write the combined dict out to file
# pause for 2 seconds to give the server a break
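Putting it all together, the writing loop might look like this. The column order in the results table -- booking number, name, DOB -- is a guess to confirm by inspection, as is the 'LAST, REST' name format:

```python
# open a file to write to
with open('hennepin-jail-roster.csv', 'w', newline='') as outfile:

    # define your headers -- they should match the keys in the dicts
    # we're creating as we scrape
    headers = ['booking_num', 'last_name', 'rest_of_name', 'dob',
               'housing', 'booking_date', 'custody_status']

    # create a writer object
    writer = csv.DictWriter(outfile, fieldnames=headers)

    # write the header row
    writer.writeheader()

    # print some summary info
    print(f'Writing {len(rows)} inmates to file ...')

    # loop over the rows of inmates from the search results page
    for row in rows:

        # unpack the list of cells in the row -- the column order here
        # is an assumption to verify against the live table
        booking_num, name, dob = [td.get_text(strip=True)
                                  for td in row.find_all('td')[:3]]

        # get the detail page link using the template string we defined up top
        detail_url = DETAIL_URL_TEMPLATE.format(booking_num)

        # unpack the name into last/rest and print it
        last_name, rest_of_name = [x.strip() for x in name.split(',', 1)]
        print(last_name, rest_of_name)

        # reformat the dob, which, bonus, also validates it
        dob = datetime.strptime(dob, '%m/%d/%Y').strftime('%Y-%m-%d')

        # our dict of summary info
        summary = {'booking_num': booking_num, 'last_name': last_name,
                   'rest_of_name': rest_of_name, 'dob': dob}

        # call the inmate_details function on the detail URL
        # remember: this returns a dictionary
        details = inmate_details(detail_url)

        # combine the summary and detail dicts by unpacking them into a
        # new dict -- https://www.python.org/dev/peps/pep-0448/
        combined = {**summary, **details}

        # write the combined dict out to file
        writer.writerow(combined)

        # pause for 2 seconds to give the server a break
        time.sleep(2)
```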
It's all well and good to get the basic inmate info, but we're probably also interested in why they're in jail -- what are they charged with?
For this exercise, add some parsing logic to the inmate_details scraping function to extract data about what each inmate has been charged with. Pulling them out as a list of dictionaries makes the most sense to me, but you can format it however you like.
Because each inmate has a variable number of charges, you also need to think about how you want to represent the data in your CSV. Is each line one charge? One inmate? Picture how one row of data should look in your output file and structure your parsing to match.
In [ ]:
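One possible approach, not a definitive solution: a helper that pulls each row of a (hypothetical) charges table into a list of dicts. How you locate the table and what the cells contain are assumptions to check against a real detail page:

```python
def get_charges(soup):
    """Given a soup'd detail page, return a list of charge dicts."""
    # locating the charges table by position is a guess -- inspect the
    # real page for a more reliable hook
    charge_table = soup.find_all('table')[-1]

    charges = []

    # loop over the rows of the table, minus the header
    for row in charge_table.find_all('tr')[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all('td')]

        # the cell order and keys here are hypothetical
        charges.append({
            'charge_description': cells[0],
            'charge_severity': cells[1],
            'charge_status': cells[2],
        })

    return charges
```

Inside inmate_details, you'd call get_charges(soup) and fold the result into the returned dict. If you decide each CSV line is one charge, the main loop would then write `{**summary, **details, **charge}` once per charge; one line per inmate would instead mean flattening the list into a single delimited column.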