In [1]:
name = '2017-06-30-pdf-scraping'
title = 'Extracting data from PDF files'
tags = 'pdf, text processing, pandas'
author = 'Denis Sergeev'
In [2]:
from nb_tools import connect_notebook_to_post
from IPython.core.display import HTML, Image
html = connect_notebook_to_post(name, title, tags, author)
The PDF format is designed for presentation rather than data storage: text is stored as positioned glyphs with no explicit structure. All this makes it very hard to extract text, and especially tables, from PDF documents.
One possible solution is to store the data separately, as attachments. Otherwise, you have to find a way to parse PDFs programmatically, especially if there are more than just a few of them.
Today we had a brief look at some of the most actively developed Python libraries for PDF text mining.
Two PDF documents will be used in this notebook. Let's define paths to these files:
In [3]:
stars_file = '../pdfs/stars-dat_user-manual_p2d-4_v3p1.pdf'
moore2016_file = '../pdfs/moore_bromwich_qjrms_2016.pdf'
PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines.
This package is not extensively documented, but by following basic examples you can start extracting useful information from a PDF file.
In [4]:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
Open the first PDF document as a file stream in binary mode. Then we pass the file stream object to create a PDFParser instance and create a PDFDocument accordingly.
In [5]:
fp = open(stars_file, mode='rb')
parser = PDFParser(fp)
document = PDFDocument(parser)
Then, if a document has a built-in table of contents, we can extract it:
In [6]:
from pdfminer.pdfdocument import PDFNoOutlines
In [7]:
try:
    outlines = document.get_outlines()
except PDFNoOutlines:
    raise Exception('No outlines found!')
The result, outlines, is a generator object which yields section titles and their levels, as well as other technical info.
In [8]:
for i, (level, title, *_) in enumerate(outlines):
    if i < 10:
        # Print only the first 10 lines
        print(level, title)
In [9]:
fp.close()
Another example of using the pdfminer package is taken from this gist.
In [10]:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO
In [11]:
def pdf_to_text(pdfname, codec='utf-8'):
    """
    Extract text from a PDF document
    """
    # PDFMiner boilerplate: resource manager, in-memory output buffer,
    # layout parameters, and the text converter device
    rsrcmgr = PDFResourceManager()
    sio = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, sio, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Extract text page by page
    with open(pdfname, mode='rb') as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
    # Get text from StringIO
    text = sio.getvalue()
    # Cleanup
    device.close()
    sio.close()
    return text
Let's apply it to the same PDF document:
In [12]:
mytext = pdf_to_text(stars_file)
And print only an excerpt from the file:
In [13]:
print(mytext[25000:26000])
A package that builds on PDFMiner and enables extraction of not only text, but also tables and curves from a PDF document, is PDFPlumber. Thanks to the PythonBytes podcast for letting me know about it.
In a nutshell, it exposes each page's characters, lines, and rectangles, and provides handy methods such as .extract_text(), .extract_table(), and .to_image().
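As a quick illustration, here is a minimal sketch (using the stars_file path defined above) that dumps the beginning of the raw text of the first page:

import pdfplumber

# Minimal sketch: open the document and print the first
# 200 characters of the raw text of page 1
# (note: extract_text() may return None for pages without text)
with pdfplumber.open(stars_file) as pdf:
    first_page = pdf.pages[0]
    print(first_page.extract_text()[:200])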
In [14]:
import pdfplumber
In [15]:
doc = pdfplumber.open(stars_file) # No fussing about with binary file streams!
Select page number 36 from the document. It contains a table of polar low start and end locations and times.
In [16]:
p0 = doc.pages[35]
We can display it as an image - very useful in a Jupyter Notebook!
In [17]:
im = p0.to_image()
im
Out[17]:
Another very useful tool is the visual debugger, which shows what rows and columns are selected and whether they are parsed correctly.
In [18]:
im.debug_tablefinder()
Out[18]:
It can be seen that, because there is some text before and after the table, the table finder follows the text width and the last column is split in half.
One of the simplest (though not very flexible) solutions is to crop the page before processing.
In [19]:
cropped = p0.crop((0, 50, p0.width, p0.height-100))
In [20]:
cropped.to_image().debug_tablefinder()
Out[20]:
Now we only have the table itself, and all cells are separated correctly. We don't even need to specify table_settings to extract the table:
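For reference, if the automatic detection had not worked, we could have passed explicit strategies to the table finder. The keys below are genuine pdfplumber table settings, but the values are only a guess for this particular table:

# Hypothetical settings: use the ruling lines on the page to find
# cell boundaries; snap_tolerance merges nearly aligned edges
table_settings = {
    'vertical_strategy': 'lines',
    'horizontal_strategy': 'lines',
    'snap_tolerance': 3,
}
table_explicit = cropped.extract_table(table_settings=table_settings)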
In [21]:
table = cropped.extract_table()
The returned result is a list of lists containing rows of string data.
In [22]:
table[:3] # the first three rows
Out[22]:
We can easily convert it to a pandas.DataFrame. For the sake of brevity, we only convert the first 10 rows here.
In [23]:
import pandas as pd
In [24]:
df = pd.DataFrame(table[1:11], columns=table[0])
In [25]:
df
Out[25]:
We can also convert columns with strings representing dates and times to columns of datetime-like objects.
In [26]:
df[['Start time', 'End time']] = df[['Start time', 'End time']].apply(pd.to_datetime)
Since the table already contains a natural index column ('Polar low ID'), we can set it as the DataFrame index:
In [27]:
df.set_index('Polar low ID', inplace=True)
In [28]:
df
Out[28]:
We can also write a function to convert polar low positions (like "74.54N 28.01E") to numerical data.
In [29]:
def geostr2coord(arg):
    """
    Convert geographical coordinates in traditional notation
    (e.g. "74.54N 28.01E") to numerical latitude and longitude
    """
    # A sketch that assumes the "DD.DDN DD.DDE" format used in this table
    lat_str, lon_str = arg.split()
    lat = float(lat_str[:-1]) * (1 if lat_str.endswith('N') else -1)
    lon = float(lon_str[:-1]) * (1 if lon_str.endswith('E') else -1)
    return lat, lon
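With this sketch, geostr2coord('74.54N 28.01E') would return (74.54, 28.01).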
We can then apply it to the relevant columns:
In [30]:
for colname in ('Start position', 'End position'):
    df[colname] = df[colname].apply(geostr2coord)
Another promising library for PDF parsing is tabula-py, which is actually just Python bindings to a Java library of the same name. So the caveat is that this is not a pure Python package, and you have to install Java to make it work. Nevertheless, let's give it a try.
In [31]:
import tabula
Read data from page 3 of the second PDF document.
In [32]:
doc = tabula.read_pdf(moore2016_file,
                      pages=3, multiple_tables=True)
In [33]:
doc[0]
Out[33]:
As you can see, the package conveniently converts extracted tables to pandas.DataFrame.
But what about the second table?
In [34]:
doc = tabula.read_pdf(moore2016_file,
                      pages=3, multiple_tables=True,
                      area=(500, 0, 850, 500))
Even with multiple_tables=True, it could not extract both tables from the same page. Therefore, to get the second table (at the bottom of the page), we used a "dirty" solution again: we restricted parsing to the bottom half of the page with the area keyword, which takes (top, left, bottom, right) coordinates in points.
In [35]:
doc[0]
Out[35]:
The tables still need a bit of cleaning, including setting the correct column names, of course.
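For example, a hypothetical clean-up step, assuming the header ended up in the first row of the extracted DataFrame, might look like this:

# Hypothetical clean-up: promote the first row to the header
# and drop it from the data
df2 = doc[0].copy()
df2.columns = df2.iloc[0]
df2 = df2.drop(df2.index[0]).reset_index(drop=True)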
Finally, https://pdftables.com/ is a pay-per-page service focused on tabular data extraction, from the folks at ScraperWiki.
In [36]:
HTML(html)
Out[36]: