pdfplumber's table-extraction options

This notebook uses a report from the FBI's National Instant Criminal Background Check System.
In [1]:
import pdfplumber
In [2]:
pdf = pdfplumber.open("../pdfs/background-checks.pdf")
In [3]:
p0 = pdf.pages[0]
In [4]:
im = p0.to_image()
im
Out[4]:
In [5]:
im.reset().debug_tablefinder()
Out[5]:
The default settings correctly identify the table's vertical demarcations, but don't capture the horizontal demarcations between each group of five states/territories. So:
.extract_table's settings

vertical_strategy="lines"
horizontal_strategy="text"
"intersection_x_tolerance": 15
In [6]:
table_settings = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "text",
    "intersection_x_tolerance": 15
}
In [7]:
im.reset().debug_tablefinder(table_settings)
Out[7]:
In [8]:
table = p0.extract_table(table_settings)
In [9]:
for row in table[:5]:
    print(row)
.extract_table worked with our custom settings, but the table it detected contains extraneous headers and footers. Since we know that the Alabama row is the first, and that there are 56 rows we care about (50 states + DC + 4 territories + the "Totals" row), we can slice away the rest:
In [10]:
core_table = table[3:3+56]
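Hard-coding the offset 3 ties the slice to this particular page layout. A minimal sketch of a more defensive alternative follows, assuming the first cell of the first row of interest reads exactly "Alabama". The helper name `slice_core_rows` and the toy table are hypothetical and not part of pdfplumber:

```python
def slice_core_rows(table, first_label="Alabama", n_rows=56):
    """Return n_rows rows, starting at the row whose first cell is first_label."""
    for start, row in enumerate(table):
        # Cells may be None or padded with whitespace, so normalize first.
        if row and (row[0] or "").strip() == first_label:
            return table[start:start + n_rows]
    raise ValueError(f"no row starts with {first_label!r}")


# Toy stand-in for the extracted table (values are illustrative only).
toy_table = [
    ["header line"],
    ["", "Permit"],
    ["Alabama", "1"],
    ["Alaska", "2"],
    ["footer"],
]
core = slice_core_rows(toy_table, n_rows=2)
print(core)
```

On the real page, `slice_core_rows(table)` would be expected to match `table[3:3+56]` if the assumption about the "Alabama" cell holds.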
The first row:
In [11]:
" • ".join(core_table[0])
Out[11]:
The last:
In [12]:
" • ".join(core_table[-1])
Out[12]:
Now, let's turn those rows into dictionaries, and also convert strings-representing-numbers to the numbers themselves, e.g., "18,870" -> 18870:
In [13]:
COLUMNS = [
    "state",
    "permit",
    "handgun",
    "long_gun",
    "other",
    "multiple",
    "admin",
    "prepawn_handgun",
    "prepawn_long_gun",
    "prepawn_other",
    "redemption_handgun",
    "redemption_long_gun",
    "redemption_other",
    "returned_handgun",
    "returned_long_gun",
    "returned_other",
    "rentals_handgun",
    "rentals_long_gun",
    "private_sale_handgun",
    "private_sale_long_gun",
    "private_sale_other",
    "return_to_seller_handgun",
    "return_to_seller_long_gun",
    "return_to_seller_other",
    "totals"
]
In [14]:
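def parse_value(i, x):
    # Column 0 holds the state name, so keep it as a string.
    if i == 0: return x
    # Empty cells become None rather than 0.
    if x == "": return None
    # Strip thousands separators: "18,870" -> 18870.
    return int(x.replace(",", ""))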
In [15]:
from collections import OrderedDict

def parse_row(row):
    return OrderedDict((COLUMNS[i], parse_value(i, cell))
        for i, cell in enumerate(row))
In [16]:
data = [ parse_row(row) for row in core_table ]
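parse_value assumes that every non-empty cell is a clean integer string. As a hedged sketch, a more forgiving variant might also accept missing cells and stray whitespace. `parse_count` is a hypothetical helper name, and the idea that a cell can arrive as None or padded is an assumption about the extraction output, not something this report is known to produce:

```python
def parse_count(cell):
    """Convert a string such as "18,870" to an int; blank or missing cells become None."""
    if cell is None:
        return None
    cleaned = cell.strip().replace(",", "")
    if cleaned == "":
        return None
    return int(cleaned)


print(parse_count("18,870"))   # 18870
print(parse_count(" 1,234 "))  # 1234
print(parse_count(""))         # None
```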
Now here's the first row, parsed:
In [17]:
data[0]
Out[17]:
In [18]:
for row in sorted(data, key=lambda x: x["handgun"], reverse=True)[:6]:
    print("{state}: {handgun:,d} handgun-only checks".format(**row))
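Because data still contains the "Totals" row, that row takes part in the ranking. A sketch of a ranking that leaves it out and skips missing values follows, assuming the totals row's state cell reads "Totals". The function name `top_states` is hypothetical, and the sample figures are made up purely for illustration:

```python
def top_states(rows, column, n=5, totals_label="Totals"):
    """Rank rows by column, descending, ignoring the totals row and None values."""
    candidates = [
        r for r in rows
        if r["state"] != totals_label and r[column] is not None
    ]
    return sorted(candidates, key=lambda r: r[column], reverse=True)[:n]


# Made-up sample rows, shaped like the dictionaries in `data`.
sample = [
    {"state": "A", "handgun": 10},
    {"state": "B", "handgun": None},
    {"state": "C", "handgun": 30},
    {"state": "Totals", "handgun": 40},
]
for row in top_states(sample, "handgun", n=2):
    print("{state}: {handgun:,d} handgun-only checks".format(**row))
```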
Using .extract_text to extract the report month

It looks like the month of the report is listed in an area 35px to 65px from the top of the page. But there is also other text directly above and below it. So when we crop to that area, we'll use .within_bbox instead of .crop to select only the characters (and other objects) that are fully within the bounding box.
In [19]:
month_crop = p0.within_bbox((0, 35, p0.width, 65))
month_crop.to_image()
Out[19]:
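The difference between "fully within" and "overlapping" a bounding box can be illustrated without a PDF. This is a simplified model of the distinction described above, not pdfplumber's implementation. The function names, the page width of 600, and the character boxes are all hypothetical; boxes use the (x0, top, x1, bottom) ordering:

```python
def fully_within(obj, bbox):
    """True if the object's box lies entirely inside bbox = (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    return (obj["x0"] >= x0 and obj["top"] >= top
            and obj["x1"] <= x1 and obj["bottom"] <= bottom)


def overlaps(obj, bbox):
    """True if the object's box intersects bbox at all."""
    x0, top, x1, bottom = bbox
    return (obj["x0"] < x1 and obj["x1"] > x0
            and obj["top"] < bottom and obj["bottom"] > top)


bbox = (0, 35, 600, 65)  # the page width of 600 is a made-up value
chars = [
    {"text": "a", "x0": 10, "x1": 20, "top": 28, "bottom": 40},  # straddles the top edge
    {"text": "b", "x0": 10, "x1": 20, "top": 40, "bottom": 55},  # fully inside
    {"text": "c", "x0": 10, "x1": 20, "top": 70, "bottom": 80},  # fully outside
]
inside = [c["text"] for c in chars if fully_within(c, bbox)]
touching = [c["text"] for c in chars if overlaps(c, bbox)]
print(inside, touching)
```

Only "b" passes the stricter test, which is why a fully-within selection avoids picking up fragments of the text lines directly above and below the month.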
In [20]:
month_chars = month_crop.extract_text()
month_chars
Out[20]: