Our goal in this exercise is to scrape the roster of inmates in the Hennepin County Jail into a CSV.
What happens when we click the search button without entering a first or last name? We're directed to a new URL that lists the entire roster.
This is good news -- some forms are set up to require a minimum number of characters. Now we need to check whether we can go straight to that URL without first visiting the landing page and clicking through -- in other words, does that page depend on a cookie being passed?
To test this, I usually open another browser window in incognito mode and paste in the URL. Success! Going to https://www4.co.hennepin.mn.us/webbooking/resultbyname.asp dumps out the entire list of inmates, so that's where we'll start. (You could also open your browser's network tab and see what information is getting exchanged during the request. For more complex, dynamically created pages that rely on cookies, we'd probably need the requests Session object.)
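If the results page had turned out to depend on a cookie, a session-based fetch might look roughly like the sketch below. This scraper doesn't need it; it's only here to show the idea. The function name fetch_with_session and its parameters are made up for illustration, and it assumes that simply visiting the landing page is enough for the server to set whatever cookie it wants.

```python
import requests


def fetch_with_session(landing_url, results_url, session=None):
    """Visit the landing page first, then fetch the results page.

    A requests.Session keeps any cookies set by the first response
    and sends them along with the second request.
    """
    # hypothetical helper -- a session can be passed in, which makes testing easier
    if session is None:
        session = requests.Session()

    # hit the landing page so the server can set its cookies
    session.get(landing_url, timeout=30)

    # now request the page we actually want, cookies included
    response = session.get(results_url, timeout=30)
    response.raise_for_status()
    return response.text
```

In this notebook, the call would be something like fetch_with_session(url_base, results_page), on the assumption that the landing page lives at the base URL.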
Let's click on an inmate link. We want to look at two things:
What's the pattern for an inmate URL?
In [ ]:
import csv
from datetime import datetime
import time
import requests
from bs4 import BeautifulSoup
In [ ]:
# base URL
url_base = 'https://www4.co.hennepin.mn.us/webbooking/'
# results page URL
results_page = url_base + 'resultbyname.asp'
# pattern for inmate detail URLs
inmate_url_pattern = url_base + 'chargedetail.asp?v_booknum={}'
In [ ]:
# fetch the page
r = requests.get(results_page)
# parse it
soup = BeautifulSoup(r.text, 'html.parser')
# find the table we want
table = soup.find_all('table')[6]
# get the rows of the table, minus the header
inmates = table.find_all('tr')[1:]
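Grabbing the table by position works until the page layout changes. One alternative is to look for the table whose header row contains a known label. The sketch below is one way to do that, not a drop-in replacement: find_table_by_header is a made-up helper, the sample HTML is invented, and the 'Booking Number' header text is a guess, so check the real page before relying on it.

```python
from bs4 import BeautifulSoup


def find_table_by_header(soup, header_text):
    """Return the first innermost table whose first row mentions header_text."""
    for table in soup.find_all('table'):
        # skip layout tables that just wrap other tables
        if table.find('table') is not None:
            continue
        first_row = table.find('tr')
        if first_row is not None and header_text in first_row.get_text():
            return table
    return None


# invented markup: a layout table wrapping a nav table and a data table
sample_html = """
<table><tr><td>
  <table><tr><td>Nav</td></tr></table>
  <table>
    <tr><td>Booking Number</td><td>Name</td></tr>
    <tr><td>123</td><td>DOE, JANE</td></tr>
  </table>
</td></tr></table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_table = find_table_by_header(sample_soup, 'Booking Number')

# same trick as above: slice off the header row
sample_rows = sample_table.find_all('tr')[1:]
first_cells = [td.get_text() for td in sample_rows[0].find_all('td')]
print(first_cells)
```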
We need to pause here and write a couple of functions to help us extract the bits of data from the inmate's detail page:
In [ ]:
def get_inmate_attr(soup, label):
    """Given a label and a soup'd detail page, return the associated value."""
    return soup.find(string=label).parent.parent.next_sibling \
               .next_sibling.text.strip()


def inmate_details(url):
    """Fetch and parse an inmate detail page, return three bits of data."""

    # fetch the page
    r = requests.get(url)

    # parse it into soup
    soup = BeautifulSoup(r.text, 'html.parser')

    # call the get_inmate_attr function to nab the cells we're interested in
    custody = get_inmate_attr(soup, "Sheriff's Custody:")
    housing = get_inmate_attr(soup, "Housing Location:")
    booking_date = get_inmate_attr(soup, "Received Date/Time:")

    # return a dict with this info
    # lose the " Address" string on the housing cell, where it exists
    # also, parse the booking date as a date to validate
    return {
        'custody': custody,
        'housing': housing.replace(' Address', ''),
        'booking_date': datetime.strptime(booking_date, '%m/%d/%Y.. %H:%M')
    }
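To see why get_inmate_attr hops through two parents and two siblings, it can help to run it against a tiny fragment. The markup below is invented to match what the function expects (a label wrapped in a tag inside one cell, with the value in the next cell). The real detail page may be laid out differently. The sample date is modeled on the format string used in inmate_details.

```python
from datetime import datetime
from bs4 import BeautifulSoup


def get_inmate_attr(soup, label):
    """Given a label and a soup'd detail page, return the associated value."""
    return soup.find(string=label).parent.parent.next_sibling \
               .next_sibling.text.strip()


# made-up fragment -- an assumption about the page structure, not a copy of it
sample_html = """
<table>
<tr>
<td><b>Housing Location:</b></td>
<td>City Hall Address</td>
</tr>
<tr>
<td><b>Received Date/Time:</b></td>
<td>03/06/2018.. 14:25</td>
</tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# step by step: string -> <b> -> <td>
label_cell = sample_soup.find(string='Housing Location:').parent.parent

# the first sibling is just the newline between the two cells ...
whitespace = label_cell.next_sibling

# ... so it takes a second hop to land on the cell holding the value
value_cell = whitespace.next_sibling

housing = get_inmate_attr(sample_soup, 'Housing Location:')
received = get_inmate_attr(sample_soup, 'Received Date/Time:')

# the literal ".." in the format string has to match the ".." in the text
parsed = datetime.strptime(received, '%m/%d/%Y.. %H:%M')
print(housing, parsed)
```

If the two cells had no whitespace between them, the first next_sibling would already be the value cell and the second hop would overshoot, which is one reason this kind of navigation is worth re-checking whenever the page changes.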
In [ ]:
# open a file to write to
# (the csv docs recommend newline='' for files handed to a writer)
with open('inmates.csv', 'w', newline='') as outfile:

    # define your headers -- they should match the keys in the dict
    # we're creating as we scrape
    headers = ['booking_num', 'url', 'last', 'rest', 'dob',
               'custody', 'housing', 'booking_date']

    # create a writer object
    writer = csv.DictWriter(outfile, fieldnames=headers)

    # write the header row
    writer.writeheader()

    # print some summary info
    print('')
    print('Writing data for {:,} inmates ...'.format(len(inmates)))
    print('')

    # loop over the rows of inmates from the search results page
    for row in inmates:

        # unpack the list of cells in the row
        booking_num, name, dob, status = row.find_all('td')

        # get the detail page link using the template string we defined up top
        detail_link = inmate_url_pattern.format(booking_num.string)

        # unpack the name into last/rest and print it
        last, rest = name.string.split(', ')
        print(rest, last)

        # reformat the dob, which, bonus, also validates it
        dob_parsed = datetime.strptime(dob.string, '%m/%d/%Y')

        # our dict of summary info
        summary_info = {
            'booking_num': booking_num.string,
            'url': detail_link,
            'last': last,
            'rest': rest,
            'dob': dob_parsed.strftime('%Y-%m-%d')
        }

        # call the inmate_details function on the detail URL
        # remember: this returns a dictionary
        details = inmate_details(detail_link)

        # combine the summary and detail dicts
        # by unpacking them into a new dict
        # https://www.python.org/dev/peps/pep-0448/
        combined_dict = {
            **summary_info,
            **details
        }

        # write the combined dict out to file
        writer.writerow(combined_dict)

        # pause for 2 seconds to give the server a break
        time.sleep(2)
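The merge-then-write step is easier to see in isolation. Here's a small offline sketch that unpacks two dicts into one and hands the result to a DictWriter, writing to an in-memory buffer instead of a file. The sample values are made up and the header list is trimmed down for the example.

```python
import csv
import io

# made-up values standing in for one scraped inmate
sample_summary = {'booking_num': '2018000123', 'last': 'DOE', 'rest': 'JANE'}
sample_details = {'custody': 'Yes', 'housing': 'City Hall'}

# unpack both dicts into a new one; on a key clash, the later dict wins
sample_combined = {**sample_summary, **sample_details}

# the fieldnames decide the column order, not the order of keys in the dict
sample_headers = ['booking_num', 'last', 'rest', 'custody', 'housing']

buffer = io.StringIO()
sample_writer = csv.DictWriter(buffer, fieldnames=sample_headers)
sample_writer.writeheader()
sample_writer.writerow(sample_combined)

print(buffer.getvalue())

# by default, a key that isn't in fieldnames raises a ValueError
try:
    sample_writer.writerow({**sample_combined, 'surprise': 'x'})
    extra_key_rejected = False
except ValueError:
    extra_key_rejected = True
```

That last bit is why the headers need to match the keys in the dict: with the default settings, DictWriter refuses any key it wasn't told about.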
It's all well and good to get the basic inmate info, but we're probably also interested in why they're in jail -- what are they charged with?
For this exercise, add some parsing logic to the inmate_details scraping function to extract data about what each inmate has been charged with. Pulling them out as a list of dictionaries makes the most sense to me, but you can format it however you like.
Because each inmate has a variable number of charges, you also need to think about how you want to represent the data in your CSV. Is each line one charge? One inmate? Picture how one row of data should look in your output file and structure your parsing to match.
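As one possible starting point for the "one line per charge" layout, here's a sketch of how a list of charge dicts could be flattened into CSV-ready rows. It doesn't do any parsing. The function name rows_per_charge and the 'charge' and 'severity' keys are placeholders, so swap in whatever fields you end up pulling from the detail page.

```python
def rows_per_charge(inmate, charges):
    """Return one flat dict per charge, repeating the inmate-level fields."""
    if not charges:
        # keep an inmate with no parsed charges as a single row
        return [dict(inmate)]
    return [{**inmate, **charge} for charge in charges]


# made-up data, just to show the shape
sample_inmate = {'booking_num': '2018000123', 'last': 'DOE'}
sample_charges = [
    {'charge': 'EXAMPLE CHARGE A', 'severity': 'Felony'},
    {'charge': 'EXAMPLE CHARGE B', 'severity': 'Misdemeanor'},
]

for row in rows_per_charge(sample_inmate, sample_charges):
    print(row)
```

Each dict could then go straight to writer.writerow(), as long as the charge keys are added to the headers list. For a row with no charges, DictWriter fills the missing columns with an empty string by default.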
In [ ]: