Texas executes a lot of criminals, and it has a web page that keeps track of people on its death row.
Using what you've learned so far, let's scrape this table into a CSV. Then we're going to write a function to grab a couple of pieces of additional data from the inmates' detail pages.
In [ ]:
import csv
import time
import requests
from bs4 import BeautifulSoup
In [ ]:
# the URL to request
URL = 'https://www.tdcj.state.tx.us/death_row/dr_offenders_on_dr.html'
# get that page
page = requests.get(URL)
# turn the page text into soup
soup = BeautifulSoup(page.text, 'html.parser')
# find the table of interest
table = soup.find('table')
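Before parsing further, it's worth a quick sanity check that the request succeeded and that we actually found a table. A minimal check, using the page and table variables from the cell above:
In [ ]:
# stop early if the request failed (4xx/5xx status)
page.raise_for_status()
# make sure find() actually returned a table, not None
assert table is not None, 'no <table> found -- did the page layout change?'
# count the rows we have to work with
print(len(table.find_all('tr')), 'rows found')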
In [ ]:
# find all table rows (skip the first one -- it's the header row)
rows = table.find_all('tr')[1:]
# open a file to write to (newline='' keeps the csv module
# from writing blank lines between rows on Windows)
with open('death-row.csv', 'w', newline='') as outfile:
    # create a writer object
    writer = csv.DictWriter(outfile, fieldnames=['id', 'link', 'last', 'first', 'dob', 'sex',
                                                 'race', 'date_received', 'county', 'offense_date'])
    # write the header row
    writer.writeheader()
    # loop over the rows
    for row in rows:
        # extract the cells
        cells = row.find_all('td')
        # offender ID
        off_id = cells[0].string
        # link to the detail page
        link = 'https://www.tdcj.state.tx.us/death_row/' + cells[1].a['href']
        # last name
        last = cells[2].string
        # first name
        first = cells[3].string
        # date of birth
        dob = cells[4].string
        # sex
        sex = cells[5].string
        # race
        race = cells[6].string
        # date received
        date_received = cells[7].string
        # county
        county = cells[8].string
        # offense date
        offense_date = cells[9].string
        # write out to file
        writer.writerow({
            'id': off_id,
            'link': link,
            'last': last,
            'first': first,
            'dob': dob,
            'sex': sex,
            'race': race,
            'date_received': date_received,
            'county': county,
            'offense_date': offense_date
        })
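To spot-check the output, you can read a few rows of death-row.csv right back with csv.DictReader -- a quick sketch:
In [ ]:
# read the first few rows back to make sure the file looks right
with open('death-row.csv', 'r') as infile:
    reader = csv.DictReader(infile)
    for i, row in enumerate(reader):
        print(row['last'], row['first'], row['county'])
        # stop after three rows
        if i >= 2:
            break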
We need a function that will take the URL of a detail page and do these things:

- Request the page with requests
- Parse the HTML with BeautifulSoup
- Return a dictionary of the detail data we're interested in

A couple of things to keep in mind: Not every inmate will have every piece of data. Also, not every inmate has an HTML detail page to parse -- the older ones link to a picture instead. So we'll need to work around those limitations.

We'll call our function fetch_details().
In [ ]:
def fetch_details(url):
    """Fetch details from a death row inmate's page."""
    # create a dictionary with some default values
    # as we go through, we're going to add stuff to it
    # (if you want to explore further, there is actually
    # a special kind of dictionary called a "defaultdict" to
    # handle this use case) =>
    # https://docs.python.org/3/library/collections.html#collections.defaultdict
    out_dict = {
        'Height': None,
        'Weight': None,
        'Eye Color': None,
        'Hair Color': None,
        'Native County': None,
        'Native State': None,
        'mug': None
    }
    # partway down the page, the links go to JPEGs instead of HTML pages
    # we can't parse images, so we'll just return the dictionary of defaults
    if not url.endswith('.html'):
        return out_dict
    # get the page
    r = requests.get(url)
    # soup the HTML
    soup = BeautifulSoup(r.text, 'html.parser')
    # find the table of info
    table = soup.find('table', {'class': 'tabledata_deathrow_table'})
    # target the mugshot, if it exists
    mug = table.find('img', {'class': 'photo_border_black_right'})
    # if there is a mug, grab the src and add it to the dictionary
    if mug:
        out_dict['mug'] = 'https://www.tdcj.state.tx.us/death_row/dr_info/' + mug['src']
    # get a list of the "label" cells
    # on some pages, they're identified by the class 'tabledata_bold_align_right_deathrow'
    # on others, they're identified by the class 'tabledata_bold_align_right_unit'
    # so we pass find_all() a list of possible classes
    label_cells = table.find_all('td', {'class': ['tabledata_bold_align_right_deathrow',
                                                  'tabledata_bold_align_right_unit']})
    # a little fanciness here in the interests of DRY =>
    # a list of attributes we're interested in -- these should match exactly
    # the text inside the label cells we just targeted
    attr_list = ['Height', 'Weight', 'Eye Color', 'Hair Color', 'Native County', 'Native State']
    # loop over the list of label cells that we targeted earlier
    for cell in label_cells:
        clean_label_cell_text = cell.text.strip()
        # check to see if the cell text is in our list of attributes
        if clean_label_cell_text in attr_list:
            # if so, find the value -- go up to the tr and search for the value td --
            # and add that attribute to our dictionary
            value_cell_text = cell.parent.find('td', {'class': 'tabledata_align_left_deathrow'}).text.strip()
            out_dict[clean_label_cell_text] = value_cell_text
    # return the dictionary to the script
    return out_dict
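The comments at the top of the function point at collections.defaultdict as an alternative to spelling out all of those None defaults. It isn't used in this tutorial, but a minimal sketch of that approach would look like this:
In [ ]:
from collections import defaultdict

# every key we haven't set yet comes back as None automatically
out_dict = defaultdict(lambda: None)
out_dict['Height'] = '5-10'  # set only the attributes we actually find
print(out_dict['Height'])    # 5-10
print(out_dict['Weight'])    # None -- no KeyError, even though we never set it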
Now that we have our parsing function, we can put it to work. As we loop over the summary inmate data, we'll call our new parsing function on the detail URL in each row. Then we'll combine the dictionaries (data from the row of summary data plus the new detail data) and write the result out to a new file.
In [ ]:
# open the CSV file to read from and the one to write to
with open('death-row.csv', 'r') as infile, open('death-row-details.csv', 'w') as outfile:
# create a reader object
reader = csv.DictReader(infile)
# the output headers are goind to be the headers from the summary file
# plus a list of new attributes
headers = reader.fieldnames + ['Height', 'Weight', 'Eye Color', 'Hair Color',
'Native County', 'Native State', 'mug']
# create the writer object
writer = csv.DictWriter(outfile, fieldnames=headers)
# write the header row
writer.writeheader()
# loop over the rows in the input file
for row in reader:
# print the inmate's name (so we can keep track of where we're at)
# helps with debugging, too
print(row['first'], row['last'])
# call our function on the URL in the row
deets = fetch_details(row['link'])
# add the two dicts together by
# unpacking them inside a new one
# and write out to file
writer.writerow({**row, **deets})
time.sleep(2)
print('---')
print('Done!')
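One detail worth calling out: {**row, **deets} is plain dictionary unpacking, available since Python 3.5. A quick illustration with toy data:
In [ ]:
summary = {'id': '999999', 'last': 'Doe', 'first': 'Jane'}
details = {'Height': '5-10', 'mug': None}
# unpack both dictionaries into a new one; if a key appeared in both,
# the value from `details` (the right-hand dict) would win
combined = {**summary, **details}
print(combined)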