Parsing San Jose PD's firearm search reports

This example uses pdfplumber's visual debugging and text-extraction features to parse a fixed-width table embedded in a PDF. Thanks to Ron Campbell for the sample PDF.


In [1]:
import pdfplumber
import re
from collections import OrderedDict

Load the PDF


In [2]:
pdf = pdfplumber.open("../pdfs/san-jose-pd-firearm-sample.pdf")

Examine the first page


In [3]:
p0 = pdf.pages[0]

In [4]:
im = p0.to_image()
im


Out[4]:

See where the characters are on the page

Below, we draw rectangles around each of the char objects that pdfplumber detected. By doing so, we can see that every line of the main part of the report is the same width, and that there are space (" ") characters padding out each field. That means we can parse those lines a lot like we'd parse a standard fixed-width data file.


In [5]:
im.reset().draw_rects(p0.chars)


Out[5]:

Extract the text from the PDF

Using the Page.extract_text(...) method, we grab every character on the page, line by line:


In [6]:
text = p0.extract_text()
print(text)


For:1094N                                                           
Page 1                                                SAN JOSE POLICE DEPT                         Date Report Run : Tue, May-24-16
                                                         FIREARM SEARCH                               
SEARCH CRITERIA: fi_property.item_status='B' and fi_property.entry_date between '07/01/2012' and '05/24/2016' and fi_p
                 roperty.jurisdiction='SJ'                                                                            
NUMBER OF RECORDS RETURNED: 983
TYPE                 ITEM                   MAKE                                          MODEL           CALIBRE  STATUS    FLAGS
-------------------- --------------------   --------------------------------------------- --------------- -------  --------  -----
SERIAL NUMBER        REPORT/TAG NUMBER      CASE/FILE NUMBER        STORAGE LOCATION
-------------        -------------------    --------------------    ----------------------- 
PISTOL               REVOLVER               UNKWN/UNPUBLSHD MAKE                          UNKNOWN         38       FOUND     _d__
1721                    Tag#:SJ3095-1         GO SJ 2012-122590518  WA-GR/GUN DESTRUCTION  
SHOTGUN              DOUBLE BARREL          STEVEN/J STVNS ARMS                           311 SERIES H    12GA     FOUND     _d__
REMOVED                 Tag#:SJ3123-1         GO SJ 2012-122620320  WA-GR/GUN DESTRUCTION  
PISTOL               SEMI-AUTOMATIC         TAURUS FORJAS                                 PT40 PRO MIL'NU 40       FOUND     _d__
SZH77919                Tag#:SJ3360-1         GO SJ 2012-122670386  WA-GR/GUN DESTRUCTION/G
PISTOL               SEMI-AUTOMATIC         SMITH AND WESSON                              39              9MM      FOUND     _d__
85517                   Tag#:SJ3686-1         GO SJ 2012-122741016  WA-GR/GUN DESTRUCTION/G
PISTOL               SEMI-AUTOMATIC         HIGH POINT                                    JCP             40       FOUND     ____
REMOVED                 Tag#:SJ3724-1         GO SJ 2012-122750582  WA-GR/GUN DESTRUCTION/G
PISTOL               REVOLVER               COLT                                          DETECTIVE SPEC  38       FOUND     _d__
982982                  Tag#:SJ4077-2         GO SJ 2012-122840637  WA-GR/GUN DESTRUCTION/Z
SHOTGUN              DOUBLE BARREL          MISS.VALLEY ARMS CO.                          UNK             12GA     FOUND     _d__
279861                  Tag#:SJ4117-1         GO SJ 2012-122850220  WA-GR/GUN DESTRUCTION  
RIFLE                LEVER ACTION           SEARS ROEBUCK & CO.                           5               22       FOUND     _d__
63723                   Tag#:SJ4172-1         GO SJ 2012-122860208  WA-GR/GUN DESTRUCTION  
PISTOL               SEMI-AUTOMATIC         COLT                                          MATCH TARGET    .22      FOUND     _d__
142061S                 Tag#:SJ4324-1         GO SJ 2012-122890285  WA-GR/GD-SCCCL         
PISTOL               REVOLVER               SMITH AND WESSON                              UNKNOWN         .22      FOUND     _d__
K22539                  Tag#:SJ4324-2         GO SJ 2012-122890285  WA-GR/GUN DESTRUCTION  
RIFLE                LEVER ACTION           STEVEN/J STVNS ARMS                           CRACK SHOT 26   .22      FOUND     _d__
UNKNOWN                 Tag#:SJ4324-3         GO SJ 2012-122890285  WA-GR/GUN DESTRUCTION  
PISTOL               SEMI-AUTOMATIC         CARL WALTHER                                  PPK             7.65     FOUND     _d__
832229                  Tag#:SJ4664-1         GO SJ 2012-122960284  WA-GR/GD-SCCCL         
                                                           Flags = e (evidence)  d (disposed)  x (x-reference)  n (entered on NCIC)

In this step, we use a regular expression to focus on the core part of the page — the table.


In [7]:
core_pat = re.compile(r"LOCATION[\-\s]+(.*)\n\s+Flags = e", re.DOTALL)

In [8]:
core = re.search(core_pat, text).group(1)

In [9]:
print(core)


PISTOL               REVOLVER               UNKWN/UNPUBLSHD MAKE                          UNKNOWN         38       FOUND     _d__
1721                    Tag#:SJ3095-1         GO SJ 2012-122590518  WA-GR/GUN DESTRUCTION  
SHOTGUN              DOUBLE BARREL          STEVEN/J STVNS ARMS                           311 SERIES H    12GA     FOUND     _d__
REMOVED                 Tag#:SJ3123-1         GO SJ 2012-122620320  WA-GR/GUN DESTRUCTION  
PISTOL               SEMI-AUTOMATIC         TAURUS FORJAS                                 PT40 PRO MIL'NU 40       FOUND     _d__
SZH77919                Tag#:SJ3360-1         GO SJ 2012-122670386  WA-GR/GUN DESTRUCTION/G
PISTOL               SEMI-AUTOMATIC         SMITH AND WESSON                              39              9MM      FOUND     _d__
85517                   Tag#:SJ3686-1         GO SJ 2012-122741016  WA-GR/GUN DESTRUCTION/G
PISTOL               SEMI-AUTOMATIC         HIGH POINT                                    JCP             40       FOUND     ____
REMOVED                 Tag#:SJ3724-1         GO SJ 2012-122750582  WA-GR/GUN DESTRUCTION/G
PISTOL               REVOLVER               COLT                                          DETECTIVE SPEC  38       FOUND     _d__
982982                  Tag#:SJ4077-2         GO SJ 2012-122840637  WA-GR/GUN DESTRUCTION/Z
SHOTGUN              DOUBLE BARREL          MISS.VALLEY ARMS CO.                          UNK             12GA     FOUND     _d__
279861                  Tag#:SJ4117-1         GO SJ 2012-122850220  WA-GR/GUN DESTRUCTION  
RIFLE                LEVER ACTION           SEARS ROEBUCK & CO.                           5               22       FOUND     _d__
63723                   Tag#:SJ4172-1         GO SJ 2012-122860208  WA-GR/GUN DESTRUCTION  
PISTOL               SEMI-AUTOMATIC         COLT                                          MATCH TARGET    .22      FOUND     _d__
142061S                 Tag#:SJ4324-1         GO SJ 2012-122890285  WA-GR/GD-SCCCL         
PISTOL               REVOLVER               SMITH AND WESSON                              UNKNOWN         .22      FOUND     _d__
K22539                  Tag#:SJ4324-2         GO SJ 2012-122890285  WA-GR/GUN DESTRUCTION  
RIFLE                LEVER ACTION           STEVEN/J STVNS ARMS                           CRACK SHOT 26   .22      FOUND     _d__
UNKNOWN                 Tag#:SJ4324-3         GO SJ 2012-122890285  WA-GR/GUN DESTRUCTION  
PISTOL               SEMI-AUTOMATIC         CARL WALTHER                                  PPK             7.65     FOUND     _d__
832229                  Tag#:SJ4664-1         GO SJ 2012-122960284  WA-GR/GD-SCCCL         

Parse each group of two lines

In the report, each firearm takes up two lines. The code below splits the core table into two-line groups, and then parses out the fields, based on the number of characters in each field:


In [10]:
lines = core.split("\n")
line_groups = list(zip(lines[::2], lines[1::2]))
print(line_groups[0])


('PISTOL               REVOLVER               UNKWN/UNPUBLSHD MAKE                          UNKNOWN         38       FOUND     _d__', '1721                    Tag#:SJ3095-1         GO SJ 2012-122590518  WA-GR/GUN DESTRUCTION  ')

In [11]:
def parse_row(first_line, second_line):
    return OrderedDict([
        ("type", first_line[:20].strip()),
        ("item", first_line[21:41].strip()),
        ("make", first_line[44:89].strip()),
        ("model", first_line[90:105].strip()),
        ("calibre", first_line[106:111].strip()),
        ("status", first_line[112:120].strip()),
        ("flags", first_line[124:129].strip()),
        ("serial_number", second_line[0:13].strip()),
        ("report_tag_number", second_line[21:41].strip()),
        ("case_file_number", second_line[44:64].strip()),
        ("storage_location", second_line[68:91].strip())
    ])

In [12]:
parsed = [ parse_row(first_line, second_line) 
    for first_line, second_line in line_groups ]

Result

Below, you can see the parsed data for the first two firearms in the report:


In [13]:
parsed[:2]


Out[13]:
[OrderedDict([('type', 'PISTOL'),
              ('item', 'REVOLVER'),
              ('make', 'UNKWN/UNPUBLSHD MAKE'),
              ('model', 'UNKNOWN'),
              ('calibre', '38'),
              ('status', 'FOUND'),
              ('flags', '_d__'),
              ('serial_number', '1721'),
              ('report_tag_number', 'Tag#:SJ3095-1'),
              ('case_file_number', 'GO SJ 2012-1225905'),
              ('storage_location', 'WA-GR/GUN DESTRUCTION')]),
 OrderedDict([('type', 'SHOTGUN'),
              ('item', 'DOUBLE BARREL'),
              ('make', 'STEVEN/J STVNS ARMS'),
              ('model', '311 SERIES H'),
              ('calibre', '12GA'),
              ('status', 'FOUND'),
              ('flags', '_d__'),
              ('serial_number', 'REMOVED'),
              ('report_tag_number', 'Tag#:SJ3123-1'),
              ('case_file_number', 'GO SJ 2012-1226203'),
              ('storage_location', 'WA-GR/GUN DESTRUCTION')])]

Preview via pandas.DataFrame

To make it a little easier to read, here's the full table, parsed and represented as a pandas.DataFrame (for ease of viewing):


In [14]:
import pandas as pd
columns = list(parsed[0].keys())
pd.DataFrame(parsed)[columns]


Out[14]:
type item make model calibre status flags serial_number report_tag_number case_file_number storage_location
0 PISTOL REVOLVER UNKWN/UNPUBLSHD MAKE UNKNOWN 38 FOUND _d__ 1721 Tag#:SJ3095-1 GO SJ 2012-1225905 WA-GR/GUN DESTRUCTION
1 SHOTGUN DOUBLE BARREL STEVEN/J STVNS ARMS 311 SERIES H 12GA FOUND _d__ REMOVED Tag#:SJ3123-1 GO SJ 2012-1226203 WA-GR/GUN DESTRUCTION
2 PISTOL SEMI-AUTOMATIC TAURUS FORJAS PT40 PRO MIL'NU 40 FOUND _d__ SZH77919 Tag#:SJ3360-1 GO SJ 2012-1226703 WA-GR/GUN DESTRUCTION/G
3 PISTOL SEMI-AUTOMATIC SMITH AND WESSON 39 9MM FOUND _d__ 85517 Tag#:SJ3686-1 GO SJ 2012-1227410 WA-GR/GUN DESTRUCTION/G
4 PISTOL SEMI-AUTOMATIC HIGH POINT JCP 40 FOUND ____ REMOVED Tag#:SJ3724-1 GO SJ 2012-1227505 WA-GR/GUN DESTRUCTION/G
5 PISTOL REVOLVER COLT DETECTIVE SPEC 38 FOUND _d__ 982982 Tag#:SJ4077-2 GO SJ 2012-1228406 WA-GR/GUN DESTRUCTION/Z
6 SHOTGUN DOUBLE BARREL MISS.VALLEY ARMS CO. UNK 12GA FOUND _d__ 279861 Tag#:SJ4117-1 GO SJ 2012-1228502 WA-GR/GUN DESTRUCTION
7 RIFLE LEVER ACTION SEARS ROEBUCK & CO. 5 22 FOUND _d__ 63723 Tag#:SJ4172-1 GO SJ 2012-1228602 WA-GR/GUN DESTRUCTION
8 PISTOL SEMI-AUTOMATIC COLT MATCH TARGET .22 FOUND _d__ 142061S Tag#:SJ4324-1 GO SJ 2012-1228902 WA-GR/GD-SCCCL
9 PISTOL REVOLVER SMITH AND WESSON UNKNOWN .22 FOUND _d__ K22539 Tag#:SJ4324-2 GO SJ 2012-1228902 WA-GR/GUN DESTRUCTION
10 RIFLE LEVER ACTION STEVEN/J STVNS ARMS CRACK SHOT 26 .22 FOUND _d__ UNKNOWN Tag#:SJ4324-3 GO SJ 2012-1228902 WA-GR/GUN DESTRUCTION
11 PISTOL SEMI-AUTOMATIC CARL WALTHER PPK 7.65 FOUND _d__ 832229 Tag#:SJ4664-1 GO SJ 2012-1229602 WA-GR/GD-SCCCL