Demonstration of `pdfplumber`'s `extract_table` method.

This notebook uses pdfplumber to extract data from an California Worker Adjustment and Retraining Notification (WARN) report.

Import `pdfplumber`



In [1]:

    
import pdfplumber

Load the PDF



In [2]:

    
pdf = pdfplumber.open("../pdfs/ca-warn-report.pdf")

Get the first page



In [3]:

    
p0 = pdf.pages[0]

Display the first page



In [4]:

    
im = p0.to_image()
im









    Out[4]:

Use `.extract_table` to get the data from the largest table on the page



In [5]:

    
table = p0.extract_table()

.extract_table returns a list of lists, with each inner list representing a row in the table. Here are the first three rows:



In [6]:

    
table[:3]









    Out[6]:





[['Notice Date',
  'Effective',
  'Received',
  'Company',
  'City',
  'No. Of',
  'Layoff/Closure'],
 ['06/22/2015',
  '0  3  / 2  5  / 2  0  16',
  '0  7  / 0  1  / 2  0  15',
  'Maxim Integrated Product',
  'San Jose',
  '150',
  'Closure Permanent'],
 ['06/30/2015',
  '0  8  / 2  9  / 2  0  15',
  '0  7  / 0  1  / 2  0  15',
  'McGraw-Hill Education',
  'Monterey',
  '137',
  'Layoff Unknown at this time']]

Basic cleanup

We can use pandas to render the list as a DataFrame, and to remove the extra spaces within some of the dates.



In [7]:

    
import pandas as pd



In [8]:

    
df = pd.DataFrame(table[1:], columns=table[0])
for column in ["Effective", "Received"]:
    df[column] = df[column].str.replace(" ", "")



In [9]:

    
df









    Out[9]:






  
    
      
      Notice Date
      Effective
      Received
      Company
      City
      No. Of
      Layoff/Closure
    
  
  
    
      0
      06/22/2015
      03/25/2016
      07/01/2015
      Maxim Integrated Product
      San Jose
      150
      Closure Permanent
    
    
      1
      06/30/2015
      08/29/2015
      07/01/2015
      McGraw-Hill Education
      Monterey
      137
      Layoff Unknown at this time
    
    
      2
      06/30/2015
      08/30/2015
      07/01/2015
      Long Beach Memorial Medical Center
      Long Beach
      90
      Layoff Permanent
    
    
      3
      07/01/2015
      09/02/2015
      07/01/2015
      Leidos
      El Segundo
      72
      Layoff Permanent
    
    
      4
      07/01/2015
      09/30/2016
      07/01/2015
      Bosch Healthcare Systems, Inc.
      Palo Alto
      55
      Closure Permanent
    
    
      5
      06/29/2015
      09/01/2015
      07/02/2015
      Encompass Digital Media, Inc.
      Los Angeles
      41
      Closure Permanent
    
    
      6
      07/02/2015
      07/06/2015
      07/02/2015
      Alphatec Spine
      Carlsbad
      99
      Layoff Permanent
    
    
      7
      06/30/2015
      08/07/2015
      07/06/2015
      Symantec Corporation
      Mountain View
      60
      Layoff Permanent
    
    
      8
      06/30/2015
      08/31/2015
      07/06/2015
      Fusion Contacts Centers, LLC
      Santa Maria
      50
      Closure Permanent
    
    
      9
      06/30/2015
      09/15/2015
      07/06/2015
      KLA-Tencor Corporation
      Milpitas
      213
      Layoff Permanent
    
    
      10
      07/01/2015
      09/04/2015
      07/06/2015
      Southern California Edison Company
      San Clemente
      100
      Closure Permanent
    
    
      11
      07/02/2015
      09/01/2015
      07/06/2015
      State Fish Company, Inc.
      Wilmington
      76
      Closure Permanent
    
    
      12
      07/02/2015
      09/04/2015
      07/06/2015
      Boeing Company
      Long Beach
      56
      Layoff Unknown at this time
    
    
      13
      07/06/2015
      09/04/2015
      07/06/2015
      Bridgepoint Education, Inc.
      San Diego
      7
      Layoff Permanent
    
    
      14
      07/06/2015
      09/04/2015
      07/06/2015
      Bridgepoint Education, Inc.
      San Diego
      15
      Layoff Permanent
    
    
      15
      07/01/2015
      06/29/2015
      07/07/2015
      BAE Systems
      San Francisco
      4
      Layoff Temporary
    
    
      16
      07/01/2015
      06/29/2015
      07/07/2015
      BAE Systems
      San Francisco
      78
      Layoff Temporary
    
    
      17
      07/01/2015
      09/07/2015
      07/07/2015
      Bay Bread LLC dba Bakery Los Angeles
      San Fernando
      50
      Closure Permanent
    
    
      18
      07/01/2015
      09/25/2015
      07/07/2015
      Bay Bread LLC dba New French Bakery
      South San
      121
      Closure Permanent
    
    
      19
      07/02/2015
      06/12/2015
      07/07/2015
      Hewlett-Packard Company
      Palo Alto
      65
      Layoff Permanent
    
    
      20
      07/08/2015
      09/06/2015
      07/08/2015
      Microsoft Corporation
      San Diego
      129
      Layoff Permanent
    
    
      21
      06/25/2015
      10/09/2015
      07/10/2015
      Aramark Healthcare Support Services,
      Culver City
      53
      Closure Permanent
    
    
      22
      07/01/2015
      09/10/2015
      07/10/2015
      Maxim Integrated Product
      San Jose
      20
      Layoff Permanent
    
    
      23
      07/06/2015
      09/04/2015
      07/10/2015
      ProCourier, Inc.
      San Diego
      22
      Layoff Unknown at this time
    
    
      24
      07/06/2015
      09/04/2015
      07/10/2015
      ProCourier, Inc.
      Los Angeles
      71
      Layoff Unknown at this time
    
    
      25
      07/07/2015
      09/04/2015
      07/10/2015
      ProCourier, Inc.
      Irvine
      22
      Layoff Unknown at this time
    
    
      26
      07/09/2015
      07/22/2015
      07/10/2015
      Berkeley Pyramid Alehouse
      Berkeley
      63
      Closure Permanent
    
    
      27
      07/09/2015
      09/14/2015
      07/10/2015
      Fireman's Fund Insurance Company
      Novato
      35
      Layoff Permanent
    
    
      28
      06/30/2015
      08/31/2015
      07/13/2015
      First Transit
      San Bernardino
      127
      Layoff Permanent
    
    
      29
      06/30/2015
      08/31/2015
      07/13/2015
      First Transit
      Rancho
      71
      Layoff Permanent
    
    
      30
      07/10/2015
      07/14/2015
      07/13/2015
      11 Main LLC
      San Mateo
      35
      Closure Permanent
    
    
      31
      07/10/2015
      07/14/2015
      07/13/2015
      11 Main LLC
      Chico
      44
      Layoff Permanent
    
    
      32
      07/15/2015
      07/15/2015
      07/15/2015
      TaylorMade Golf Company
      Carlsbad
      64
      Layoff Permanent
    
    
      33
      07/08/2015
      09/06/2015
      07/16/2015
      Southern California Edison Company
      Rosemead
      38
      Layoff Permanent
    
    
      34
      07/14/2015
      09/18/2015
      07/20/2015
      Actavis, Inc.
      Corona
      45
      Layoff Permanent
    
    
      35
      07/17/2015
      07/13/2015
      07/21/2015
      American Management Services LLC
      Monterey
      56
      Closure Permanent

How did it work?

We can use pdfplumber's visual debugging to show how the table was extracted. The red lines represent the lines pdfplumber found on the page; the blue circles represent the intersections of those lines, and the light-blue shading indicates the cells derived from those intersections.



In [10]:

    
im.debug_tablefinder()









    Out[10]:

	Notice Date	Effective	Received	Company	City	No. Of	Layoff/Closure
0	06/22/2015	03/25/2016	07/01/2015	Maxim Integrated Product	San Jose	150	Closure Permanent
1	06/30/2015	08/29/2015	07/01/2015	McGraw-Hill Education	Monterey	137	Layoff Unknown at this time
2	06/30/2015	08/30/2015	07/01/2015	Long Beach Memorial Medical Center	Long Beach	90	Layoff Permanent
3	07/01/2015	09/02/2015	07/01/2015	Leidos	El Segundo	72	Layoff Permanent
4	07/01/2015	09/30/2016	07/01/2015	Bosch Healthcare Systems, Inc.	Palo Alto	55	Closure Permanent
5	06/29/2015	09/01/2015	07/02/2015	Encompass Digital Media, Inc.	Los Angeles	41	Closure Permanent
6	07/02/2015	07/06/2015	07/02/2015	Alphatec Spine	Carlsbad	99	Layoff Permanent
7	06/30/2015	08/07/2015	07/06/2015	Symantec Corporation	Mountain View	60	Layoff Permanent
8	06/30/2015	08/31/2015	07/06/2015	Fusion Contacts Centers, LLC	Santa Maria	50	Closure Permanent
9	06/30/2015	09/15/2015	07/06/2015	KLA-Tencor Corporation	Milpitas	213	Layoff Permanent
10	07/01/2015	09/04/2015	07/06/2015	Southern California Edison Company	San Clemente	100	Closure Permanent
11	07/02/2015	09/01/2015	07/06/2015	State Fish Company, Inc.	Wilmington	76	Closure Permanent
12	07/02/2015	09/04/2015	07/06/2015	Boeing Company	Long Beach	56	Layoff Unknown at this time
13	07/06/2015	09/04/2015	07/06/2015	Bridgepoint Education, Inc.	San Diego	7	Layoff Permanent
14	07/06/2015	09/04/2015	07/06/2015	Bridgepoint Education, Inc.	San Diego	15	Layoff Permanent
15	07/01/2015	06/29/2015	07/07/2015	BAE Systems	San Francisco	4	Layoff Temporary
16	07/01/2015	06/29/2015	07/07/2015	BAE Systems	San Francisco	78	Layoff Temporary
17	07/01/2015	09/07/2015	07/07/2015	Bay Bread LLC dba Bakery Los Angeles	San Fernando	50	Closure Permanent
18	07/01/2015	09/25/2015	07/07/2015	Bay Bread LLC dba New French Bakery	South San	121	Closure Permanent
19	07/02/2015	06/12/2015	07/07/2015	Hewlett-Packard Company	Palo Alto	65	Layoff Permanent
20	07/08/2015	09/06/2015	07/08/2015	Microsoft Corporation	San Diego	129	Layoff Permanent
21	06/25/2015	10/09/2015	07/10/2015	Aramark Healthcare Support Services,	Culver City	53	Closure Permanent
22	07/01/2015	09/10/2015	07/10/2015	Maxim Integrated Product	San Jose	20	Layoff Permanent
23	07/06/2015	09/04/2015	07/10/2015	ProCourier, Inc.	San Diego	22	Layoff Unknown at this time
24	07/06/2015	09/04/2015	07/10/2015	ProCourier, Inc.	Los Angeles	71	Layoff Unknown at this time
25	07/07/2015	09/04/2015	07/10/2015	ProCourier, Inc.	Irvine	22	Layoff Unknown at this time
26	07/09/2015	07/22/2015	07/10/2015	Berkeley Pyramid Alehouse	Berkeley	63	Closure Permanent
27	07/09/2015	09/14/2015	07/10/2015	Fireman's Fund Insurance Company	Novato	35	Layoff Permanent
28	06/30/2015	08/31/2015	07/13/2015	First Transit	San Bernardino	127	Layoff Permanent
29	06/30/2015	08/31/2015	07/13/2015	First Transit	Rancho	71	Layoff Permanent
30	07/10/2015	07/14/2015	07/13/2015	11 Main LLC	San Mateo	35	Closure Permanent
31	07/10/2015	07/14/2015	07/13/2015	11 Main LLC	Chico	44	Layoff Permanent
32	07/15/2015	07/15/2015	07/15/2015	TaylorMade Golf Company	Carlsbad	64	Layoff Permanent
33	07/08/2015	09/06/2015	07/16/2015	Southern California Edison Company	Rosemead	38	Layoff Permanent
34	07/14/2015	09/18/2015	07/20/2015	Actavis, Inc.	Corona	45	Layoff Permanent
35	07/17/2015	07/13/2015	07/21/2015	American Management Services LLC	Monterey	56	Closure Permanent

Demonstration of pdfplumber's extract_table method.

Import pdfplumber