In [1]:
name = '2017-06-30-pdf-scraping'
title = 'Extracting data from PDF files'
tags = 'pdf, text processing, pandas'
author = 'Denis Sergeev'
In [2]:
from nb_tools import connect_notebook_to_post
from IPython.core.display import HTML, Image
html = connect_notebook_to_post(name, title, tags, author)
The PDF format is designed for presentation rather than data storage: text is stored as positioned glyphs with no explicit structure. All this makes it very hard to extract text, and especially tables, from PDF documents.
One possible solution is to store the data separately, as attachments. Otherwise, you have to find a way to parse PDFs programmatically, especially if there are more than just a few of them.
Today we had a brief look at some of the most actively developed Python libraries for PDF text mining.
Two PDF documents will be used in this notebook. Let's define paths to these files:
In [3]:
stars_file = '../pdfs/stars-dat_user-manual_p2d-4_v3p1.pdf'
moore2016_file = '../pdfs/moore_bromwich_qjrms_2016.pdf'
PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines.
This package is not extensively documented, but by following basic examples you can start extracting useful information from a PDF file.
In [4]:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
Open the first PDF document as a file stream in binary mode. Then we pass the file stream object to create a PDFParser instance and create a PDFDocument accordingly.
In [5]:
fp = open(stars_file, mode='rb')
parser = PDFParser(fp)
document = PDFDocument(parser)
Then, if a document has a built-in table of contents, we can extract it:
In [6]:
from pdfminer.pdfdocument import PDFNoOutlines
In [7]:
try:
    outlines = document.get_outlines()
except PDFNoOutlines:
    raise Exception('No outlines found!')
The result, outlines, is a generator object which yields section titles and their levels, as well as other technical info.
In [8]:
for i, (level, title, *_) in enumerate(outlines):
    if i < 10:
        # Print only the first 10 lines
        print(level, title)
In [9]:
fp.close()
Another example of using the pdfminer package is taken from this gist.
In [10]:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO
In [11]:
def pdf_to_text(pdfname, codec='utf-8'):
    """
    Extract text from a PDF document
    """
    # PDFMiner boilerplate: resource manager, in-memory output buffer,
    # layout parameters, and the text converter device
    rsrcmgr = PDFResourceManager()
    sio = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, sio, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Extract text page by page
    with open(pdfname, mode='rb') as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
    # Get text from StringIO
    text = sio.getvalue()
    # Cleanup
    device.close()
    sio.close()
    return text
Let's apply it to the same PDF document:
In [12]:
mytext = pdf_to_text(stars_file)
And print only an excerpt from the file:
In [13]:
print(mytext[25000:26000])
A package that builds on PDFMiner and enables extraction of not only text, but also tables and curves from a PDF document, is PDFPlumber. Thanks to the PythonBytes podcast for letting me know about it.
In a nutshell, it exposes each page's characters, lines, and rectangles, and provides handy methods such as .extract_text(), .extract_table(), and .to_image().
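As a quick illustration, here is a minimal sketch (using the stars_file path defined above) that dumps the beginning of the raw text of the first page:

import pdfplumber

# Minimal sketch: open the document and print the first
# 200 characters of the raw text of page 1
# (note: extract_text() may return None for pages without text)
with pdfplumber.open(stars_file) as pdf:
    first_page = pdf.pages[0]
    print(first_page.extract_text()[:200])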
In [14]:
import pdfplumber
In [15]:
doc = pdfplumber.open(stars_file) # No fussing about with binary file streams!
Select page number 36 from the document. It contains a table of polar low start and end locations and times.
In [16]:
p0 = doc.pages[35]
We can display it as an image - very useful in a Jupyter Notebook!
In [17]:
im = p0.to_image()
im
Out[17]:
Another very useful tool is the visual debugger, which shows what rows and columns are selected and whether they are parsed correctly.
In [18]:
im.debug_tablefinder()
Out[18]:
It can be seen that, because there is some text before and after the table, the table finder follows the text width and the last column is split in half.
One of the simplest (though not very flexible) solutions is to crop the page before processing.
In [19]:
cropped = p0.crop((0, 50, p0.width, p0.height-100))
In [20]:
cropped.to_image().debug_tablefinder()
Out[20]:
Now we only have the table itself, and all cells are separated correctly. We don't even need to specify table_settings to extract the table:
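For reference, if the automatic detection had not worked, we could have passed explicit strategies to the table finder. The keys below are genuine pdfplumber table settings, but the values are only a guess for this particular table:

# Hypothetical settings: use the ruling lines on the page to find
# cell boundaries; snap_tolerance merges nearly aligned edges
table_settings = {
    'vertical_strategy': 'lines',
    'horizontal_strategy': 'lines',
    'snap_tolerance': 3,
}
table_explicit = cropped.extract_table(table_settings=table_settings)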
In [21]:
table = cropped.extract_table()
The returned result is a list of lists containing rows of string data.
In [22]:
table[:3] # the first three rows
Out[22]:
We can easily convert it to a pandas.DataFrame. For the sake of brevity, we only convert the first 10 rows here.
In [23]:
import pandas as pd
In [24]:
df = pd.DataFrame(table[1:11], columns=table[0])
In [25]:
df
Out[25]:
We can also convert columns with strings representing dates and times to columns of datetime-like objects.
In [26]:
df[['Start time', 'End time']] = df[['Start time', 'End time']].apply(pd.to_datetime)
Since the table already contains a natural index column ('Polar low ID'), we can set it as the DataFrame index:
In [27]:
df.set_index('Polar low ID', inplace=True)
In [28]:
df
Out[28]:
We can also write a function to convert polar low positions (like "74.54N 28.01E") to numerical data.
In [29]:
def geostr2coord(arg):
    """
    Convert geographical coordinates in traditional notation
    (e.g. "74.54N 28.01E") to numerical latitude and longitude
    """
    # A sketch that assumes the "DD.DDN DD.DDE" format used in this table
    lat_str, lon_str = arg.split()
    lat = float(lat_str[:-1]) * (1 if lat_str.endswith('N') else -1)
    lon = float(lon_str[:-1]) * (1 if lon_str.endswith('E') else -1)
    return lat, lon
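With this sketch, geostr2coord('74.54N 28.01E') would return (74.54, 28.01).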
We can then apply it to the relevant columns:
In [30]:
for colname in ('Start position', 'End position'):
    df[colname] = df[colname].apply(geostr2coord)
Another promising library for PDF parsing is tabula-py, which is actually just Python bindings to a Java library of the same name. So the caveat is that this is not a pure Python package, and you have to install Java to make it work. Nevertheless, let's give it a try.
In [31]:
import tabula
Read data from page 3 of the second PDF document.
In [32]:
doc = tabula.read_pdf(moore2016_file,
                      pages=3, multiple_tables=True)
In [33]:
doc[0]
Out[33]:
As you can see, the package conveniently converts extracted tables to pandas.DataFrame.
But what about the second table?
In [34]:
doc = tabula.read_pdf(moore2016_file,
                      pages=3, multiple_tables=True,
                      area=(500, 0, 850, 500))
Even with multiple_tables=True, it could not extract both tables from the same page. Therefore, to get the second table (at the bottom of the page), we used a "dirty" solution again: we restricted parsing to the bottom half of the page with the area keyword, which takes (top, left, bottom, right) coordinates in points.
In [35]:
doc[0]
Out[35]:
The tables still need a bit of cleaning, including setting the correct column names, of course.
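For example, a hypothetical clean-up step, assuming the header ended up in the first row of the extracted DataFrame, might look like this:

# Hypothetical clean-up: promote the first row to the header
# and drop it from the data
df2 = doc[0].copy()
df2.columns = df2.iloc[0]
df2 = df2.drop(df2.index[0]).reset_index(drop=True)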
Finally, https://pdftables.com/ is a pay-per-page service focused on tabular data extraction, from the folks at ScraperWiki.
In [36]:
HTML(html)
Out[36]: