Data Mining OCR PDFs - Using pdftabextract to liberate tabular data from scanned documents

This is an example of how to use pdftabextract for data mining in scanned and OCR-processed documents with rather complex tables and/or few "optical features" like column or row borders, which often make it impossible to use tools like Tabula.

In this example, only a table from a single page will be extracted for demonstration purposes. For a full example that covers several pages, see the catalog_30s.py script. There are also some more demonstrations in the examples directory.

The page that we will process looks like this:

The page has been scanned and processed with Optical Character Recognition (OCR) software like ABBYY FineReader or tesseract, which produced a "sandwich" PDF with the scanned document image and the recognized text boxes. Although some software, like FineReader, allows extracting tables, this often fails and some more effort is necessary to liberate the data. We can use pdftabextract together with some other tools for this.

We will use a combination of the following tools in order to reach our goal:

  • the pdftohtml command from poppler-utils to extract the texts and scanned images from the PDF
  • pdf2xml-viewer to inspect the text boxes and the generated table grid (more on that later)
  • pdftabextract to write a script that estimates the positions of columns and rows, generates a table grid and fits the text boxes into this grid in order to extract the tabular data

The following steps will be performed and described in detail:

  1. Extract the scanned page images and generate an XML with the OCR texts of the PDF with pdftohtml
  2. View the text boxes and scanned pages with pdf2xml-viewer
  3. Load the XML describing the pages and text boxes
  4. Detect straight lines in the scanned pages, find out a possible page skew or rotation and fix it
  5. Detect clusters of vertical lines for identifying the columns of a table
  6. Find out the row positions of the table by analyzing the y-coordinates' distribution of text boxes
  7. Create a grid of columns and lines
  8. Match the text boxes into the grid and hence extract the tabular data in order to export it as Excel and CSV file

1. Extract the scanned page images and generate an XML with the OCR texts of the PDF with pdftohtml

The tool pdftohtml is part of the software package poppler-utils. It's available on most Linux distributions and also for OSX via Homebrew or MacPorts. Using the argument -xml, it will generate an XML file in pdf2xml format from a sandwich PDF:


In [1]:
!cd data/ && pdftohtml -c -hidden -xml ALA1934_RR-excerpt.pdf ALA1934_RR-excerpt.pdf.xml


Page-1
Page-2
Page-3
Page-4

In [2]:
!ls -1 data/


ALA1934_RR-excerpt.pdf
ALA1934_RR-excerpt.pdf-1_1.png
ALA1934_RR-excerpt.pdf-2_1.png
ALA1934_RR-excerpt.pdf-3_1.png
ALA1934_RR-excerpt.pdf-4_1.png
ALA1934_RR-excerpt.pdf.xml

In [3]:
!head -n 30 data/ALA1934_RR-excerpt.pdf.xml


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">

<pdf2xml producer="poppler" version="0.41.0">
<page number="1" position="absolute" top="0" left="0" height="1261" width="892">
	<fontspec id="0" size="23" family="Times" color="#000000"/>
	<fontspec id="1" size="41" family="Times" color="#000000"/>
	<fontspec id="2" size="20" family="Times" color="#000000"/>
	<fontspec id="3" size="22" family="Times" color="#000000"/>
	<fontspec id="4" size="7" family="Times" color="#000000"/>
	<fontspec id="5" size="7" family="Times" color="#000000"/>
	<fontspec id="6" size="13" family="Times" color="#000000"/>
	<fontspec id="7" size="5" family="Times" color="#000000"/>
	<fontspec id="8" size="13" family="Times" color="#000000"/>
	<fontspec id="9" size="4" family="Times" color="#000000"/>
	<fontspec id="10" size="4" family="Times" color="#000000"/>
	<fontspec id="11" size="5" family="Times" color="#000000"/>
	<fontspec id="12" size="7" family="Times" color="#000000"/>
	<fontspec id="13" size="7" family="Times" color="#000000"/>
	<fontspec id="14" size="16" family="Times" color="#000000"/>
	<fontspec id="15" size="11" family="Times" color="#000000"/>
<image top="0" left="0" width="893" height="1262" src="ALA1934_RR-excerpt.pdf-1_1.png"/>
<text top="43" left="545" width="14" height="26" font="0">\</text>
<text top="102" left="298" width="353" height="41" font="1"><b>Deutsches Reich</b></text>
<text top="175" left="363" width="234" height="21" font="2">Politische  Zeitungen </text>
<text top="209" left="436" width="83" height="22" font="3"><b>Anhalt</b></text>
<text top="248" left="499" width="8" height="8" font="4">'S</text>
<text top="256" left="490" width="16" height="8" font="4">„   s</text>
<text top="262" left="492" width="15" height="8" font="4">Ö  rj</text>
<text top="251" left="520" width="85" height="8" font="4">A n z e i g e n t e i l</text>

We generated an XML which consists of several <page> elements, containing an <image> (the "background" image, i.e. the scanned page) and several text boxes (<text>) with coordinates and the respective text box value. The images themselves have also been extracted from the PDF.

2. View the text boxes and scanned pages with pdf2xml-viewer

You can download pdf2xml-viewer from its GitHub page. It basically consists of an HTML page that allows you to inspect an XML file in pdf2xml format in your browser. A copy of it also resides in the directory of this example.

Change to the directory where pdf2xml-viewer resides (where its index.html or pdf2xml-viewer.html file is). You should also copy the extracted XML file and images to this location. Now let's start up a minimal local webserver. This can be done very easily with Python:

With Python 2.x: python -m SimpleHTTPServer 8080

Or with Python 3: python3 -m http.server 8080 --bind 127.0.0.1

Now open your browser and go to http://127.0.0.1:8080. The viewer shows up and you can enter the name of the file to load (it must be relative to the directory in which pdf2xml-viewer resides).


In [ ]:
!python3 -m http.server 8080 --bind 127.0.0.1

When you execute the above line you can start up pdf2xml-viewer by visiting http://127.0.0.1:8080/pdf2xml-viewer.html. However, please note that this will prevent further code execution in this Notebook as long as the minimal webserver is running. So it's generally a better idea to execute this in a separate Terminal window.

In pdf2xml-viewer you will now be able to browse through the pages. In the background, you can see the scanned page image, and on top of it are the text boxes that were detected during OCR:

3. Load the XML describing the pages and text boxes

We can now start to use pdftabextract in Python code in order to load the XML file. By now, you should have installed pdftabextract via pip in the Terminal with the command pip install pdftabextract.

Let's define some constants first that we will need throughout the script.


In [5]:
DATAPATH = 'data/'
OUTPUTPATH = 'generated_output/'
INPUT_XML = 'ALA1934_RR-excerpt.pdf.xml'

Now we can load the XML, parse it, and have a look at the third page (the page from which we later want to extract the data).


In [6]:
import os
from pdftabextract.common import read_xml, parse_pages

# Load the XML that was generated with pdftohtml
xmltree, xmlroot = read_xml(os.path.join(DATAPATH, INPUT_XML))

# parse it and generate a dict of pages
pages = parse_pages(xmlroot)

In [7]:
pages[3].keys()


Out[7]:
dict_keys(['number', 'width', 'texts', 'image', 'xmlnode', 'height'])

Each page consists of an xmlnode which points to the original XML page element, a page number, the page dimensions, an image (the scanned page) and the text boxes:


In [8]:
from pprint import pprint

p_num = 3
p = pages[p_num]

print('number', p['number'])
print('width', p['width'])
print('height', p['height'])
print('image', p['image'])
print('the first three text boxes:')
pprint(p['texts'][:3])


number 3
width 892
height 1261
image ALA1934_RR-excerpt.pdf-3_1.png
the first three text boxes:
[{'bottom': 107.0,
  'bottomleft': array([ 359.,  107.]),
  'bottomright': array([ 404.,  107.]),
  'height': 14,
  'left': 359.0,
  'right': 404.0,
  'top': 93.0,
  'topleft': array([ 359.,   93.]),
  'topright': array([ 404.,   93.]),
  'value': 'Baden',
  'width': 45,
  'xmlnode': <Element 'text' at 0x7f0227e4b688>},
 {'bottom': 104.0,
  'bottomleft': array([ 737.,  104.]),
  'bottomright': array([ 745.,  104.]),
  'height': 13,
  'left': 737.0,
  'right': 745.0,
  'top': 91.0,
  'topleft': array([ 737.,   91.]),
  'topright': array([ 745.,   91.]),
  'value': '3',
  'width': 8,
  'xmlnode': <Element 'text' at 0x7f0227e4b6d8>},
 {'bottom': 144.0,
  'bottomleft': array([  58.,  144.]),
  'bottomright': array([  76.,  144.]),
  'height': 8,
  'left': 58.0,
  'right': 76.0,
  'top': 136.0,
  'topleft': array([  58.,  136.]),
  'topright': array([  76.,  136.]),
  'value': 'Ein\xad',
  'width': 18,
  'xmlnode': <Element 'text' at 0x7f0227e4b778>}]

The text boxes in p['texts'] contain the coordinates and dimensions of each text box as well as the content (value) and a reference to the original XML node.

4. Detect straight lines in the scanned pages, find out a possible page skew or rotation and fix it

We can see on the scanned page image that columns and table headers are marked with straight lines. We can detect these in order to a) find out if and how much the page is skewed or rotated and b) later use the detected vertical lines to estimate the positions of the columns in the table.

In order to do so, we can use the image processing module (pdftabextract.imgproc) which uses OpenCV's Hough transform to detect the lines.


In [9]:
import numpy as np
from pdftabextract import imgproc

# get the image file of the scanned page
imgfilebasename = p['image'][:p['image'].rindex('.')]
imgfile = os.path.join(DATAPATH, p['image'])

print("page %d: detecting lines in image file '%s'..." % (p_num, imgfile))

# create an image processing object with the scanned page
iproc_obj = imgproc.ImageProc(imgfile)

# calculate the scaling of the image file in relation to the text boxes coordinate system dimensions
page_scaling_x = iproc_obj.img_w / p['width']   # scaling in X-direction
page_scaling_y = iproc_obj.img_h / p['height']  # scaling in Y-direction

# detect the lines
lines_hough = iproc_obj.detect_lines(canny_kernel_size=3, canny_low_thresh=50, canny_high_thresh=150,
                                     hough_rho_res=1,
                                     hough_theta_res=np.pi/500,
                                     hough_votes_thresh=round(0.2 * iproc_obj.img_w))
print("> found %d lines" % len(lines_hough))


page 3: detecting lines in image file 'data/ALA1934_RR-excerpt.pdf-3_1.png'...
> found 69 lines

The important thing is that we created an ImageProc instance using the scanned page image file that is referred to in the image key of the page p. ImageProc will identify the dimensions of the image file which allows us to calculate the scaling between the image dimensions and the text boxes' coordinate system. It is important to understand that the coordinate system in image space has a different scaling than the coordinate system used to position the text boxes. For example, the image could be scanned with a size of 1000x3000 pixels (iproc_obj.img_w by iproc_obj.img_h) while the text boxes of a page are positioned on a canvas of size 500x1500 units (p['width'] by p['height']).
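As a small illustration (not part of the original script), converting a coordinate between the two systems is simply a matter of multiplying or dividing by these scaling factors:

# illustration only -- x_in_image is a hypothetical pixel coordinate in the scanned page image
x_in_image = 1200
x_in_textbox_space = x_in_image / page_scaling_x    # same position in the text boxes' coordinate system
x_back_in_image = x_in_textbox_space * page_scaling_x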

Now the most crucial step is to detect the lines and use the right parameters for the image processing step. At first, our image will be converted into a binary image with white pixels marking the edges in the image (i.e. the regions with a big change in color intensity) and black pixels showing regions of homogeneous color, i.e. low change in color intensity. To achieve this, the Canny algorithm is used. The kernel size is the size of the Gaussian filter used to smooth the image and remove pixel noise. The low and high thresholds are used for detecting "strong" and "weak" edge pixels during hysteresis thresholding.

After this, the actual Hough transform takes place. All white pixels (i.e. edge pixels) are transformed into "Hough space", which is a discrete approximation of a polar coordinate system with theta being the polar angle and rho being the distance from the origin or "pole". This Hough space can be seen as a 2D map with theta on the y-axis and rho on the x-axis. The map must have a certain resolution or size, which is given by hough_rho_res and hough_theta_res. During the Hough transform, each edge pixel is converted to polar coordinate space and n = PI / hough_theta_res lines (in the above example 500 lines) are simulated to pass through that pixel at n different angles. This forms a sinusoid which will intersect with other sinusoids when other edge pixels lie on the same straight line (because these pixels share the same angle theta and distance rho of that straight line). So for each edge pixel, a number of potential lines going through that pixel is simulated and then accumulated on the Hough space map, a process which is called voting. The more intersections appear at a certain spot in the Hough map, the higher the vote. All votes above a certain threshold, here given by hough_votes_thresh, are then considered straight lines.

So hough_theta_res defines the minimum line angle that can be detected (in our case: PI/500 = 0.36°) and hough_rho_res the minimum distance delta. hough_votes_thresh is the minimum number of intersections (and hence roughly the number of edge pixels on a straight line) in the Hough map for a straight line to be detected as such.
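For reference, the following is a rough sketch of the Canny/Hough pipeline described above, using OpenCV directly with the same parameters. This is only an illustration, not pdftabextract's exact implementation of detect_lines:

import cv2
import numpy as np

gray = cv2.imread(imgfile, cv2.IMREAD_GRAYSCALE)

# 1. binarize with the Canny algorithm: white pixels mark edges (strong changes in intensity)
edges = cv2.Canny(gray, 50, 150, apertureSize=3)

# 2. Hough transform: accumulate votes in (rho, theta) space and keep all candidates
#    whose vote count exceeds the threshold as straight lines
lines = cv2.HoughLines(edges, rho=1, theta=np.pi/500,
                       threshold=round(0.2 * gray.shape[1]))
# each entry in `lines` is [[rho, theta]], describing one detected straight line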

We should now have a look at the detected lines in order to verify that we chose the right parameters. But first, let's define a helper function for that, because we will need it again later.


In [10]:
import cv2

# helper function to save an image 
def save_image_w_lines(iproc_obj, imgfilebasename):
    img_lines = iproc_obj.draw_lines(orig_img_as_background=True)
    img_lines_file = os.path.join(OUTPUTPATH, '%s-lines-orig.png' % imgfilebasename)
    
    print("> saving image with detected lines to '%s'" % img_lines_file)
    cv2.imwrite(img_lines_file, img_lines)

In [11]:
save_image_w_lines(iproc_obj, imgfilebasename)


> saving image with detected lines to 'generated_output/ALA1934_RR-excerpt.pdf-3_1-lines-orig.png'

This is our image with the detected lines:

We see that the horizontal lines (green) are not straight, whereas the vertical lines (red) are. This means that the page is skewed so that the text boxes on the left are slightly shifted down in relation to those on the right of the same row (this can also be seen in pdf2xml-viewer). In such a dense table, this would cause serious problems during row detection, so we need to fix it. The method find_rotation_or_skew can be used for that.


In [12]:
from math import radians, degrees

from pdftabextract.common import ROTATION, SKEW_X, SKEW_Y
from pdftabextract.geom import pt
from pdftabextract.textboxes import rotate_textboxes, deskew_textboxes

# find rotation or skew
# the parameters are:
# 1. the minimum threshold in radians for a rotation to be counted as such
# 2. the maximum threshold for the difference between horizontal and vertical line rotation (to detect skew)
# 3. an optional threshold to filter out "stray" lines whose angle is too far apart from the median angle of
#    all other lines that go in the same direction (no effect here)
rot_or_skew_type, rot_or_skew_radians = iproc_obj.find_rotation_or_skew(radians(0.5),    # uses "lines_hough"
                                                                        radians(1),
                                                                        omit_on_rot_thresh=radians(0.5))

# rotate back or deskew text boxes
needs_fix = True
if rot_or_skew_type == ROTATION:
    print("> rotating back by %f°" % -degrees(rot_or_skew_radians))
    rotate_textboxes(p, -rot_or_skew_radians, pt(0, 0))
elif rot_or_skew_type in (SKEW_X, SKEW_Y):
    print("> deskewing in direction '%s' by %f°" % (rot_or_skew_type, -degrees(rot_or_skew_radians)))
    deskew_textboxes(p, -rot_or_skew_radians, rot_or_skew_type, pt(0, 0))
else:
    needs_fix = False
    print("> no page rotation / skew found")


> deskewing in direction 'sy' by 0.719995°

We can see that a skew in direction sy (meaning vertical direction) is detected. To fix this, deskew_textboxes is called with our page p as first parameter, then the negated skew (negated in order to "skew back"), the skew type (direction) and the origin point about which the deskewing is done (top left corner).
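To make this a bit more concrete, here is a minimal conceptual sketch of what a correction of a skew in y-direction amounts to geometrically. This is only an illustration, not pdftabextract's actual code:

from math import radians, tan

# conceptual sketch only: a skew in y-direction means that text boxes drift vertically
# the further right they are on the page, so the correction shifts each y-coordinate
# in proportion to the point's horizontal distance from the origin
def deskew_y(x, y, skew_radians, origin_x=0):
    return y + tan(skew_radians) * (x - origin_x)

# e.g. with a skew of about 0.72 degrees, a text box at x=800 is shifted vertically
# by roughly tan(radians(0.72)) * 800, i.e. about 10 units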

We fixed the text boxes in the XML, but we should also fix the actual lines that were detected. This can be done with the method apply_found_rotation_or_skew. Additionally, we save the image with the repaired lines.


In [13]:
if needs_fix:
    # rotate back or deskew detected lines
    lines_hough = iproc_obj.apply_found_rotation_or_skew(rot_or_skew_type, -rot_or_skew_radians)

    save_image_w_lines(iproc_obj, imgfilebasename + '-repaired')


> saving image with detected lines to 'generated_output/ALA1934_RR-excerpt.pdf-3_1-repaired-lines-orig.png'

As we can see, only the horizontal lines are straightened. Please note that the deskewing is not applied to the original image because this is not necessary for our further processing and hence the repaired lines are now a bit off from the table borders in the original image.

We should also save the corrected text boxes' XML so that we can inspect it in pdf2xml-viewer:


In [14]:
# save repaired XML (i.e. XML with deskewed textbox positions)
output_files_basename = INPUT_XML[:INPUT_XML.rindex('.')]
repaired_xmlfile = os.path.join(OUTPUTPATH, output_files_basename + '.repaired.xml')

print("saving repaired XML file to '%s'..." % repaired_xmlfile)
xmltree.write(repaired_xmlfile)


saving repaired XML file to 'generated_output/ALA1934_RR-excerpt.pdf.repaired.xml'...

5. Detect clusters of vertical lines for identifying the columns of a table

We now want to identify the columns of the table on our page. We already obtained the vertical lines by means of image processing. As there are always many close lines detected for a single column separator line, we need to cluster these sets of close lines so that we can later calculate the column positions.

We can use the find_clusters method, to which we need to pass the direction of lines that we want to cluster (vertical lines) and a clustering method. Here, we will use a simple technique that orders the lines and breaks them into separate clusters whenever the gap between them is too big (i.e. exceeds dist_thresh). We use half of the minimum column width for this threshold; this minimum column width can be measured in advance with a graphics editor like GIMP. You could also use other, more sophisticated clustering techniques implemented in pdftabextract.clustering, e.g. hierarchical clustering, but usually this simple approach is sufficient.
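The idea behind this break-distance clustering can be sketched in a few lines. This is only a conceptual illustration; the actual implementation used below is find_clusters_1d_break_dist from pdftabextract.clustering:

def cluster_1d_by_break_distance(values, dist_thresh):
    # sort the values and start a new cluster whenever the gap to the
    # previous value exceeds dist_thresh (conceptual sketch only)
    values = sorted(values)
    clusters = [[values[0]]]
    for prev, cur in zip(values, values[1:]):
        if cur - prev > dist_thresh:
            clusters.append([])   # gap too big -> start a new cluster
        clusters[-1].append(cur)
    return clusters

# e.g. cluster_1d_by_break_distance([10, 12, 11, 80, 82, 150], 30)
# -> [[10, 11, 12], [80, 82], [150]]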


In [15]:
from pdftabextract.clustering import find_clusters_1d_break_dist

MIN_COL_WIDTH = 60 # minimum width of a column in pixels, measured in the scanned pages

# cluster the detected *vertical* lines using find_clusters_1d_break_dist as simple clustering function
# (break on distance MIN_COL_WIDTH/2)
# additionally, remove all cluster sections that are considered empty
# a cluster is considered empty when the number of text boxes in it is below 10% of the median number of text boxes
# per cluster section
vertical_clusters = iproc_obj.find_clusters(imgproc.DIRECTION_VERTICAL, find_clusters_1d_break_dist,
                                            remove_empty_cluster_sections_use_texts=p['texts'], # use this page's textboxes
                                            remove_empty_cluster_sections_n_texts_ratio=0.1,    # 10% rule
                                            remove_empty_cluster_sections_scaling=page_scaling_x,  # the positions are in "scanned image space" -> we scale them to "text box space"
                                            dist_thresh=MIN_COL_WIDTH/2)
print("> found %d clusters" % len(vertical_clusters))

# draw the clusters
img_w_clusters = iproc_obj.draw_line_clusters(imgproc.DIRECTION_VERTICAL, vertical_clusters)
save_img_file = os.path.join(OUTPUTPATH, '%s-vertical-clusters.png' % imgfilebasename)
print("> saving image with detected vertical clusters to '%s'" % save_img_file)
cv2.imwrite(save_img_file, img_w_clusters)


> found 17 clusters
> saving image with detected vertical clusters to 'generated_output/ALA1934_RR-excerpt.pdf-3_1-vertical-clusters.png'
Out[15]:
True

Here we see the detected clusters marked with different colors:

Now we simply calculate the centers of the clusters, by default taking the median of the cluster values. We also need to divide by the page scaling because the cluster positions are in image space but we need the column positions in "text box space".


In [16]:
from pdftabextract.clustering import calc_cluster_centers_1d

page_colpos = np.array(calc_cluster_centers_1d(vertical_clusters)) / page_scaling_x
print('found %d column borders:' % len(page_colpos))
print(page_colpos)


found 17 column borders:
[  39.16765831   92.43709677  350.3668275   401.12338442  425.85806452
  452.11451613  478.01129032  503.5483871   528.36612903  554.55839054
  579.45175965  604.34512877  630.55830948  658.97309748  690.66941713
  716.2070073   740.26100487]

6. Find out the row positions of the table by analyzing the y-coordinates' distribution of text boxes

For the row or line positions in the table, we can't rely on optical features because there are no row separator lines. However, we can exploit the distinctive distribution of the text boxes' y-coordinates and cluster them in order to get an estimate of the row positions. Another challenge is to find the start and the end of the table. We can do so by analyzing the contents of the text boxes for hints that the first table row starts (it must begin with a population number on the left) or that a text box is the table footer (it contains something like "Ala Anzeigen").

We start by finding text boxes of "usual" height in the first two columns, because the text boxes in these columns are most reliable in determining the row position.


In [17]:
# right border of the second column
col2_rightborder = page_colpos[2]

# calculate median text box height
median_text_height = np.median([t['height'] for t in p['texts']])

# get all texts in the first two columns with a "usual" textbox height
# we will only use these text boxes in order to determine the line positions because they are more "stable"
# otherwise, especially the right side of the column header can lead to problems detecting the first table row
text_height_deviation_thresh = median_text_height / 2
texts_cols_1_2 = [t for t in p['texts']
                  if t['right'] <= col2_rightborder
                     and abs(t['height'] - median_text_height) <= text_height_deviation_thresh]

Next we get the text boxes' top and bottom border positions, cluster them, and calculate the cluster centers.


In [18]:
from pdftabextract.clustering import zip_clusters_and_values
from pdftabextract.textboxes import border_positions_from_texts, split_texts_by_positions, join_texts
from pdftabextract.common import all_a_in_b, DIRECTION_VERTICAL

# get all textboxes' top and bottom border positions
borders_y = border_positions_from_texts(texts_cols_1_2, DIRECTION_VERTICAL)

# break into clusters using half of the median text height as break distance
clusters_y = find_clusters_1d_break_dist(borders_y, dist_thresh=median_text_height/2)
clusters_w_vals = zip_clusters_and_values(clusters_y, borders_y)

# for each cluster, calculate the median as center
pos_y = calc_cluster_centers_1d(clusters_w_vals)
pos_y.append(p['height'])

print('number of line positions:', len(pos_y))


number of line positions: 100

Now we try to find the first table row by looking for the first line of text boxes whose joined content matches a typical row pattern.


In [19]:
import re

# a (possibly malformed) population number + space + start of city name
pttrn_table_row_beginning = re.compile(r'^[\d Oo][\d Oo]{2,} +[A-ZÄÖÜ]')

# 1. try to find the top row of the table
texts_cols_1_2_per_line = split_texts_by_positions(texts_cols_1_2, pos_y, DIRECTION_VERTICAL,
                                                   alignment='middle',
                                                   enrich_with_positions=True)

# go through the texts line per line
for line_texts, (line_top, line_bottom) in texts_cols_1_2_per_line:
    line_str = join_texts(line_texts)
    if pttrn_table_row_beginning.match(line_str):  # check if the line content matches the given pattern
        top_y = line_top
        break
else:
    top_y = 0

top_y


Out[19]:
208.80321434348292

We find the last table row by checking whether there is a "big" text box in the lower page area that matches certain footer words.


In [20]:
# hints for a footer text box
words_in_footer = ('anzeige', 'annahme', 'ala')

# 2. try to find the bottom row of the table
min_footer_text_height = median_text_height * 1.5
min_footer_y_pos = p['height'] * 0.7
# get all texts in the lower 30% of the page that are at least 50% bigger than the median textbox height
bottom_texts = [t for t in p['texts']
                if t['top'] >= min_footer_y_pos and t['height'] >= min_footer_text_height]
bottom_texts_per_line = split_texts_by_positions(bottom_texts,
                                                 pos_y + [p['height']],   # always down to the end of the page
                                                 DIRECTION_VERTICAL,
                                                 alignment='middle',
                                                 enrich_with_positions=True)
# go through the texts at the bottom line per line
page_span = page_colpos[-1] - page_colpos[0]
min_footer_text_width = page_span * 0.8
for line_texts, (line_top, line_bottom) in bottom_texts_per_line:
    line_str = join_texts(line_texts)
    has_wide_footer_text = any(t['width'] >= min_footer_text_width for t in line_texts)
    # check if there's at least one wide text or if all of the required words for a footer match
    if has_wide_footer_text or all_a_in_b(words_in_footer, line_str):
        bottom_y = line_top
        break
else:
    bottom_y = p['height']

bottom_y


Out[20]:
1089.8901693310179

In [21]:
# finally filter the line positions so that only the lines between the table top and bottom are left
page_rowpos = [y for y in pos_y if top_y <= y <= bottom_y]

print("> page %d: %d lines between [%f, %f]" % (p_num, len(page_rowpos), top_y, bottom_y))


> page 3: 93 lines between [208.803214, 1089.890169]

7. Create a grid of columns and lines

From the column and row positions that we detected, we can now generate a "page grid" which should resemble the table layout as closely as possible. We then save the grid information as a JSON file so that we can display it in pdf2xml-viewer.


In [22]:
from pdftabextract.extract import make_grid_from_positions

grid = make_grid_from_positions(page_colpos, page_rowpos)
n_rows = len(grid)
n_cols = len(grid[0])
print("> page %d: grid with %d rows, %d columns" % (p_num, n_rows, n_cols))


> page 3: grid with 92 rows, 16 columns

In [23]:
# the page grid is a 2D matrix that contains the rectangular table cells
# this is the top left table cell:
grid[0][0]


Out[23]:
array([[  39.16765831,  208.80321434],
       [  92.43709677,  217.83563592]])

In [24]:
from pdftabextract.common import save_page_grids

page_grids_file = os.path.join(OUTPUTPATH, output_files_basename + '.pagegrids_p3_only.json')
print("saving page grids JSON file to '%s'" % page_grids_file)
save_page_grids({p_num: grid}, page_grids_file)


saving page grids JSON file to 'generated_output/ALA1934_RR-excerpt.pdf.pagegrids_p3_only.json'

We should now evaluate our page grid using pdf2xml-viewer. First, we need to load the XML of the repaired (deskewed) pages and then the JSON file containing the page grid information for page 3. When we select page 3, we should see an output like this:

We can see that the rows and columns were detected correctly (decrease the background image visibility for better display) and that the deskewed text boxes mostly lie in the correct table cells. So the only thing left now is to extract the data by fitting the text boxes into the correct table cells.

8. Match the text boxes into the grid and hence extract the tabular data in order to export it as Excel and CSV file

We can use fit_texts_into_grid to fit the text boxes into the grid and then transform it to a pandas DataFrame.


In [25]:
from pdftabextract.extract import fit_texts_into_grid, datatable_to_dataframe

datatable = fit_texts_into_grid(p['texts'], grid)
    
df = datatable_to_dataframe(datatable)

df.head(n=10)


Out[25]:
col01 col02 col03 col04 col05 col06 col07 col08 col09 col10 col11 col12 col13 col14 col15 col16
0 157 000 K a r l s r u h e i . B . , Generalanzei...
1 d eu tsch la n d .............................. 23 300 7 22 12 7 N 90 50 c 420 280 5040 + 25
2 B e i gleichz. A u fg . in B ez.-A u sg ....
3 u. K in zig -B ote, H ardt-Anzeiger, K eM ,...
4 8 R p f, R ekl. mm SO l i p f Siehe A nze...
5 K aärlsruher<B ürgerzeitung...................... 1 50 4 * D 295 186 + 54
6 » V i S . R M 9 6 ,—
7 K arlsruh eü-'Tagbatt . . . S iehe A nzei... 13 000 7 22 12 6 N 68 30 B 420 2 87 5040 + 25
8 R esidenZ-A nzeiger . . . . Siehe A nzeige... 7 300 6 22 12 4 N 96 25 B 450 2 92 5400 + 25
9 R h e i n - R ö t e ....................... Si... 1 450 6 46 6 5 N 96 15 B 450 2 92 2700 + 25

Of course, you would now usually do a lot of parsing, data cleanup and validation, because with this kind of messy data the result will never be 100% perfect. Still, this is a good basis to work on.
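For instance, a first cleanup step could be to normalize the OCR'ed population numbers at the beginning of the first column. The following is a hypothetical sketch; the helper function, regular expressions and column names are only illustrative and depend on your data:

import re

def parse_population(cell_value):
    # extract a population number from a messy OCR'ed cell like '157 000' or '23 3OO'
    # (hypothetical helper for illustration only)
    digits = re.sub(r'[Oo]', '0', str(cell_value))   # common OCR confusion: letter O vs. digit 0
    digits = re.sub(r'[^\d]', '', digits)            # drop spaces and other noise
    return int(digits) if digits else None

# e.g. applied to the leading digits of the first column of the extracted DataFrame:
# df['population'] = df['col01'].str.extract(r'^([\dOo ]+)')[0].apply(parse_population)

We can now export the data as CSV and Excel files: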


In [26]:
csv_output_file = os.path.join(OUTPUTPATH, output_files_basename + '-p3_only.csv')
print("saving extracted data to '%s'" % csv_output_file)
df.to_csv(csv_output_file, index=False)

excel_output_file = os.path.join(OUTPUTPATH, output_files_basename + '-p3_only.xlsx')
print("saving extracted data to '%s'" % excel_output_file)
df.to_excel(excel_output_file, index=False)


saving extracted data to 'generated_output/ALA1934_RR-excerpt.pdf-p3_only.csv'
saving extracted data to 'generated_output/ALA1934_RR-excerpt.pdf-p3_only.xlsx'

In [ ]: