Demonstration of pdfplumber's extract_table method.

This notebook uses pdfplumber to extract data from an California Worker Adjustment and Retraining Notification (WARN) report.

Import pdfplumber


In [1]:
import pdfplumber

Load the PDF


In [2]:
pdf = pdfplumber.open("../pdfs/ca-warn-report.pdf")

Get the first page


In [3]:
p0 = pdf.pages[0]

Display the first page


In [4]:
im = p0.to_image()
im


Out[4]:

Use .extract_table to get the data from the largest table on the page


In [5]:
table = p0.extract_table()

.extract_table returns a list of lists, with each inner list representing a row in the table. Here are the first three rows:


In [6]:
table[:3]


Out[6]:
[['Notice Date',
  'Effective',
  'Received',
  'Company',
  'City',
  'No. Of',
  'Layoff/Closure'],
 ['06/22/2015',
  '0  3  / 2  5  / 2  0  16',
  '0  7  / 0  1  / 2  0  15',
  'Maxim Integrated Product',
  'San Jose',
  '150',
  'Closure Permanent'],
 ['06/30/2015',
  '0  8  / 2  9  / 2  0  15',
  '0  7  / 0  1  / 2  0  15',
  'McGraw-Hill Education',
  'Monterey',
  '137',
  'Layoff Unknown at this time']]

Basic cleanup

We can use pandas to render the list as a DataFrame, and to remove the extra spaces within some of the dates.


In [7]:
import pandas as pd

In [8]:
df = pd.DataFrame(table[1:], columns=table[0])
for column in ["Effective", "Received"]:
    df[column] = df[column].str.replace(" ", "")

In [9]:
df


Out[9]:
Notice Date Effective Received Company City No. Of Layoff/Closure
0 06/22/2015 03/25/2016 07/01/2015 Maxim Integrated Product San Jose 150 Closure Permanent
1 06/30/2015 08/29/2015 07/01/2015 McGraw-Hill Education Monterey 137 Layoff Unknown at this time
2 06/30/2015 08/30/2015 07/01/2015 Long Beach Memorial Medical Center Long Beach 90 Layoff Permanent
3 07/01/2015 09/02/2015 07/01/2015 Leidos El Segundo 72 Layoff Permanent
4 07/01/2015 09/30/2016 07/01/2015 Bosch Healthcare Systems, Inc. Palo Alto 55 Closure Permanent
5 06/29/2015 09/01/2015 07/02/2015 Encompass Digital Media, Inc. Los Angeles 41 Closure Permanent
6 07/02/2015 07/06/2015 07/02/2015 Alphatec Spine Carlsbad 99 Layoff Permanent
7 06/30/2015 08/07/2015 07/06/2015 Symantec Corporation Mountain View 60 Layoff Permanent
8 06/30/2015 08/31/2015 07/06/2015 Fusion Contacts Centers, LLC Santa Maria 50 Closure Permanent
9 06/30/2015 09/15/2015 07/06/2015 KLA-Tencor Corporation Milpitas 213 Layoff Permanent
10 07/01/2015 09/04/2015 07/06/2015 Southern California Edison Company San Clemente 100 Closure Permanent
11 07/02/2015 09/01/2015 07/06/2015 State Fish Company, Inc. Wilmington 76 Closure Permanent
12 07/02/2015 09/04/2015 07/06/2015 Boeing Company Long Beach 56 Layoff Unknown at this time
13 07/06/2015 09/04/2015 07/06/2015 Bridgepoint Education, Inc. San Diego 7 Layoff Permanent
14 07/06/2015 09/04/2015 07/06/2015 Bridgepoint Education, Inc. San Diego 15 Layoff Permanent
15 07/01/2015 06/29/2015 07/07/2015 BAE Systems San Francisco 4 Layoff Temporary
16 07/01/2015 06/29/2015 07/07/2015 BAE Systems San Francisco 78 Layoff Temporary
17 07/01/2015 09/07/2015 07/07/2015 Bay Bread LLC dba Bakery Los Angeles San Fernando 50 Closure Permanent
18 07/01/2015 09/25/2015 07/07/2015 Bay Bread LLC dba New French Bakery South San 121 Closure Permanent
19 07/02/2015 06/12/2015 07/07/2015 Hewlett-Packard Company Palo Alto 65 Layoff Permanent
20 07/08/2015 09/06/2015 07/08/2015 Microsoft Corporation San Diego 129 Layoff Permanent
21 06/25/2015 10/09/2015 07/10/2015 Aramark Healthcare Support Services, Culver City 53 Closure Permanent
22 07/01/2015 09/10/2015 07/10/2015 Maxim Integrated Product San Jose 20 Layoff Permanent
23 07/06/2015 09/04/2015 07/10/2015 ProCourier, Inc. San Diego 22 Layoff Unknown at this time
24 07/06/2015 09/04/2015 07/10/2015 ProCourier, Inc. Los Angeles 71 Layoff Unknown at this time
25 07/07/2015 09/04/2015 07/10/2015 ProCourier, Inc. Irvine 22 Layoff Unknown at this time
26 07/09/2015 07/22/2015 07/10/2015 Berkeley Pyramid Alehouse Berkeley 63 Closure Permanent
27 07/09/2015 09/14/2015 07/10/2015 Fireman's Fund Insurance Company Novato 35 Layoff Permanent
28 06/30/2015 08/31/2015 07/13/2015 First Transit San Bernardino 127 Layoff Permanent
29 06/30/2015 08/31/2015 07/13/2015 First Transit Rancho 71 Layoff Permanent
30 07/10/2015 07/14/2015 07/13/2015 11 Main LLC San Mateo 35 Closure Permanent
31 07/10/2015 07/14/2015 07/13/2015 11 Main LLC Chico 44 Layoff Permanent
32 07/15/2015 07/15/2015 07/15/2015 TaylorMade Golf Company Carlsbad 64 Layoff Permanent
33 07/08/2015 09/06/2015 07/16/2015 Southern California Edison Company Rosemead 38 Layoff Permanent
34 07/14/2015 09/18/2015 07/20/2015 Actavis, Inc. Corona 45 Layoff Permanent
35 07/17/2015 07/13/2015 07/21/2015 American Management Services LLC Monterey 56 Closure Permanent

How did it work?

We can use pdfplumber's visual debugging to show how the table was extracted. The red lines represent the lines pdfplumber found on the page; the blue circles represent the intersections of those lines, and the light-blue shading indicates the cells derived from those intersections.


In [10]:
im.debug_tablefinder()


Out[10]: