pdfplumber's extract_table method

This notebook uses pdfplumber to extract data from a California Worker Adjustment and Retraining Notification (WARN) report.
In [1]:
import pdfplumber
In [2]:
pdf = pdfplumber.open("../pdfs/ca-warn-report.pdf")
In [3]:
p0 = pdf.pages[0]
In [4]:
im = p0.to_image()
im
Out[4]:
In [5]:
table = p0.extract_table()
.extract_table returns a list of lists, with each inner list representing a row in the table. Here are the first three rows:
In [6]:
table[:3]
Out[6]:
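To make the list-of-lists shape concrete, here is a minimal sketch that treats the first row as the header and pairs each later row with it. The sample rows and the names sample_table and records are hypothetical stand-ins, not values from the WARN report.

```python
# Hypothetical sample shaped like the return value of extract_table():
# a list of rows, where the first row holds the column headers.
sample_table = [
    ["Company", "Effective", "Received"],
    ["Acme Corp", "07 / 01 / 2016", "05 / 02 / 2016"],
    ["Example LLC", "08 / 15 / 2016", "06 / 10 / 2016"],
]

# Split the header row from the data rows.
header, *rows = sample_table

# Pair each cell with its column header to get one dict per row.
records = [dict(zip(header, row)) for row in rows]

for record in records:
    print(record)
```

Each dict in records corresponds to one table row, which can be easier to inspect than positional lists.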
In [7]:
import pandas as pd
In [8]:
df = pd.DataFrame(table[1:], columns=table[0])
for column in ["Effective", "Received"]:
    df[column] = df[column].str.replace(" ", "")
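The cleanup step can be tried on a small stand-in table without the PDF. This is a sketch under assumptions: the sample values and the names sample_table and sample_df are hypothetical, and regex=False is added here only to make the literal-space replacement explicit.

```python
import pandas as pd

# Hypothetical rows shaped like the extracted table; not real report data.
sample_table = [
    ["Company", "Effective", "Received"],
    ["Acme Corp", "07 / 01 / 2016", "05 / 02 / 2016"],
]

# First row becomes the column names; remaining rows become the data.
sample_df = pd.DataFrame(sample_table[1:], columns=sample_table[0])

# Remove every literal space from the two date columns.
for column in ["Effective", "Received"]:
    sample_df[column] = sample_df[column].str.replace(" ", "", regex=False)

print(sample_df)
```

Only the listed columns are modified, so any spaces in other columns, such as company names, are left intact.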
In [9]:
df
Out[9]:
In [10]:
im.debug_tablefinder()
Out[10]: