Experiments for extracting images of Boyd's Bird Journal into computer readable form

(See image below)

The journals are PDFs containing a series of scanned images of observations of birds. The observations are scanned handwritten notes on graph paper. There are bird species labels running down the left side of the page and date information across the top. The charts are organized by month, with days of the month as column headings. Each image holds two to three months of information.

Each cell has a mark indicating the presence or absence of a bird species on a given day. So there is, potentially, one mark per bird species per day. The mark on the page is typically a forward slash "/" but it can also be an "x" or an asterisk "*". We are treating all types of marks the same: a cell either has a mark or it doesn't.

Some things to note here:

  • The graphs are not clean and contain notes and stray marks.
  • The scans do not always have nice strong lines to pick out.
  • The scans of the graphs are crooked and contain distortions, so the lines are slightly bent, typically near the edges.
  • Some of the lines are incomplete or missing. In the image below, May 1986 has more grid cells than June 1986. And the line to the left of May 1st is incomplete.


In [1]:
%load_ext watermark
%watermark -a 'Raphael LaFrance' -i -u -v -r -g -p numpy,matplotlib,skimage


Raphael LaFrance 
last updated: 2017-10-29T18:05:06-04:00

CPython 3.6.3
IPython 6.2.1

numpy 1.13.3
matplotlib 2.1.0
skimage 0.13.1
Git hash: e029fb4543f36c7057716ac74de6891f2b986206
Git repo: https://github.com/rafelafrance/boyd-bird-journal.git

Extract images from PDF files

First we need to extract individual images from the PDFs. This is easily accomplished in Linux with the command pdfimages. This is part of either the poppler or xpdf packages. We're using bash to make a directory to hold the images and then extracting the PDF images into that directory. The first 20 images are not relevant here.


In [2]:
%%bash

RAW_DATA='raw_data'
DIRECTORY='images'

PDF1="$RAW_DATA/Boyd_M_Bird_journal_section1.pdf"
PDF2="$RAW_DATA/Boyd_M_Bird_journal_section2.pdf"

PREFIX1="$DIRECTORY/Boyd_M_Bird_journal_section1"
PREFIX2="$DIRECTORY/Boyd_M_Bird_journal_section2"

if [ ! -d "$DIRECTORY" ]; then
    mkdir "$DIRECTORY"
    pdfimages -png "$PDF1" "$PREFIX1"
    pdfimages -png "$PDF2" "$PREFIX2"
fi

Setup

We are using a fairly standard scipy stack: numpy & matplotlib. The only addition is the use of scikit-image.


In [3]:
# %matplotlib notebook
%matplotlib inline

import os
import csv
from itertools import product
from collections import namedtuple

import numpy as np

import matplotlib.pyplot as plt
from matplotlib import patches, cm

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

# import cv2

from skimage import io
from skimage import util
# from skimage.filters import sobel
from skimage.transform import hough_line, hough_line_peaks
from skimage.transform import probabilistic_hough_line, rotate

from lib.util import Crop, too_close
from lib.cell import Cell
from lib.grid import Grid
from boyd_journal_extraction import get_month_graph_areas, build_month_graphs
from boyd_journal_extraction import init_csv_file, output_results, process_image
from boyd_journal_extraction import get_left_side

In [4]:
in_file = 'images/Boyd_M_Bird_journal_section1-024.png'
# in_file = 'images/Boyd_M_Bird_journal_section2-125.png'

CSV_PATH = 'output/boyd_bird_journal.csv'

Brief description of the Hough transform

We're using the Hough transform to find lines in the image. It's a long-established algorithm for finding shapes such as lines in an image, and it is efficient because it only scans the image once.

The basic idea of the algorithm is:

  1. Set up a table of every possible line in the image. The lines are in polar form (rho, theta).
    1. Lines are limited to a given set of angles.
    2. This table will hold a count of all of the "on" pixels for the line.
  2. Scan the image for "on" pixels.
  3. When a pixel is "on", add one to the count of every possible line that goes through it.
  4. After every pixel has been recorded choose all lines with a count that is greater than a given threshold.

See the Wikipedia Page for a more detailed description.
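
The steps above can be sketched in plain numpy. This is a minimal illustration, not the code this notebook actually runs (that is scikit-image's hough_line, wrapped by lib.grid). The function names hough_accumulate and strong_lines, the toy image, and the threshold of 25 are all assumptions made for the example.

```python
import numpy as np


def hough_accumulate(image, thetas):
    """Vote for every (rho, theta) line passing through each "on" pixel."""
    height, width = image.shape
    max_rho = int(np.ceil(np.hypot(height, width)))
    # Rows index rho from -max_rho to +max_rho; columns index the angles.
    accumulator = np.zeros((2 * max_rho + 1, len(thetas)), dtype=np.int64)

    ys, xs = np.nonzero(image)            # step 2: find the "on" pixels
    for col, theta in enumerate(thetas):  # step 1.1: limited set of angles
        rhos = np.round(xs * np.cos(theta) + ys * np.sin(theta)).astype(int)
        # Step 3: each pixel adds one vote to the line it lies on.
        np.add.at(accumulator[:, col], rhos + max_rho, 1)
    return accumulator, max_rho


def strong_lines(accumulator, max_rho, thetas, threshold):
    """Step 4: keep lines whose vote count exceeds the threshold."""
    rho_idx, theta_idx = np.nonzero(accumulator > threshold)
    return [(int(r) - max_rho, float(thetas[t]))
            for r, t in zip(rho_idx, theta_idx)]


# Toy image (an assumption for illustration): one horizontal line at y = 5.
image = np.zeros((20, 30), dtype=bool)
image[5, :] = True

# In this parameterization horizontal lines have theta close to pi / 2.
thetas = np.deg2rad([88.0, 90.0, 92.0])
accumulator, max_rho = hough_accumulate(image, thetas)
lines = strong_lines(accumulator, max_rho, thetas, threshold=25)
print(lines)
```

Restricting thetas to a few angles near the expected orientation keeps the table small, which is the same idea the notebook relies on when it searches for horizontal, vertical, and diagonal lines separately.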

Get the grid for the full image


In [5]:
grid = Grid(file_name=in_file)
print(grid.image.shape)


(5100, 3300)

Get the horizontal and vertical grid lines

As described above, a line is detected when its count exceeds a threshold. However, there is a wrinkle: the images are not square, with the width being the shorter dimension (3300 px wide by 5100 px high). To accommodate this we make two passes over the image, one for the horizontal lines and one for the vertical lines.
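
To see why one threshold does not suit both directions, here is a minimal sketch that scales the threshold to the dimension a line can span. It is a simplification: for perfectly horizontal or vertical lines the Hough count is just the row or column sum, so sums stand in for the accumulator. The function name axis_aligned_lines and the fraction 0.5 are assumptions, not values taken from lib.grid.

```python
import numpy as np


def axis_aligned_lines(image, fraction=0.5):
    """Find rows and columns whose "on" pixel count passes a scaled threshold."""
    height, width = image.shape

    # Pass 1: a horizontal line can cover at most `width` pixels.
    horiz_threshold = fraction * width
    horiz = np.nonzero(image.sum(axis=1) > horiz_threshold)[0]

    # Pass 2: a vertical line can cover at most `height` pixels.
    vert_threshold = fraction * height
    vert = np.nonzero(image.sum(axis=0) > vert_threshold)[0]

    return horiz.tolist(), vert.tolist()


# Toy portrait image, taller than it is wide like the scans.
image = np.zeros((51, 33), dtype=bool)
image[10, :] = True      # full horizontal line, 33 pixels
image[:, 7] = True       # full vertical line, 51 pixels
image[30, :10] = True    # short stray stroke, 10 pixels

horiz, vert = axis_aligned_lines(image)
print(horiz, vert)
```

With a single absolute threshold of, say, 40 pixels, the vertical line would pass but no horizontal line ever could, because a row holds only 33 pixels in this toy image.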


In [6]:
grid.find_grid_lines()

Look at the grid lines for the image

We want to verify the horizontal grid lines and the vertical grid lines at the left side of the image.

Notice that the grid lines for the full image are not complete and do not line up exactly. Therefore, we will chop up the image further to see if we can get better grid lines.

What we do care about is the leftmost vertical line. It should separate the grid line numbers from the rest of the grid.


In [7]:
fig, ax = plt.subplots(figsize=(4, 8))
ax.imshow(grid.image, cmap=plt.cm.gray)

for ((x0, y0), (x1, y1)) in grid.horiz.lines:
    ax.plot((x0, x1), (y0, y1), '-y', linewidth=1)

for ((x0, y0), (x1, y1)) in grid.vert.lines:
    ax.plot((x0, x1), (y0, y1), '-r', linewidth=1)


Crop the left side to see if we can get better grid lines for the row labels

We're expecting two columns of cells on the left side of the image. The cells are rather long and typically have lots of whitespace toward the left end. We expect the 1st cell to have a row number and the 2nd cell to have the bird's species identification. We are going to look at the 1st cell to see if there is any writing in it. To help boost the signal we are going to chop the 1st cell at a fixed width and look at that part for writing.
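
The effect of chopping the cell can be shown with a small sketch. It assumes the image is inverted so that ink has high pixel values, and that the kept part is the left end of the cell. The name has_writing, the crop width of 100, and the cutoff of 10.0 are hypothetical and are not the values used by lib.cell.

```python
import numpy as np


def has_writing(cell, crop_width=100, min_mean=10.0):
    """Guess whether the kept part of a cell contains ink.

    Assumes an inverted image (ink is bright) and keeps the leftmost
    `crop_width` columns; both choices are assumptions for this sketch.
    """
    cropped = cell[:, :crop_width]
    return float(np.mean(cropped)) > min_mean


# A long, mostly empty cell with a small patch of "ink" near one end.
blank = np.zeros((40, 300), dtype=np.uint8)
written = blank.copy()
written[10:30, 20:60] = 255

full_mean = float(np.mean(written))             # diluted by the empty area
cropped_mean = float(np.mean(written[:, :100]))  # stronger signal after cropping
print(full_mean, cropped_mean, has_writing(written), has_writing(blank))
```

Because the empty stretch of the cell no longer dilutes the average, the same patch of writing produces a mean three times larger in the cropped region than in the full cell.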


In [8]:
left_side = get_left_side(grid)
print(len(left_side.cells))
print(len(left_side.cells[0]))


98
1

In [9]:
fig, ax = plt.subplots(figsize=(4, 8))
ax.imshow(left_side.image, cmap=plt.cm.gray)

for ((x0, y0), (x1, y1)) in left_side.horiz.lines:
    ax.plot((x0, x1), (y0, y1), '-y', linewidth=1)

for ((x0, y0), (x1, y1)) in left_side.vert.lines:
    ax.plot((x0, x1), (y0, y1), '-r', linewidth=1)


Look for writing in the row label cells


In [10]:
# @interact(row=(0, len(left_side.cells) - 1))
def draw_row_label_interior(row):
    print('yes' if left_side.row_labels[row] else '')

    inside = left_side.cells[row][0].interior(crop=Cell.crop)
    print(np.mean(inside))
    fig, ax = plt.subplots(figsize=(6, 2))
    ax.imshow(inside, cmap=plt.cm.gray)

    lines = left_side.cells[row][0].has_line(Cell.label_lines)
    for ((x0, y0), (x1, y1)) in lines:
        ax.plot((x0, x1), (y0, y1), '-r', linewidth=1)

draw_row_label_interior(24)


yes
31.5021929825

Now split the right side into separate graphs


In [11]:
months = get_month_graph_areas(grid, left_side)

for month in months:
    print(month.width, month.height)


1850 1510
1850 1374

In [12]:
build_month_graphs(months)

for m, month in enumerate(months):
    print('month: {} rows: {}  cols: {}'.format(
        m, len(month.cells), len(month.cells[0])))


month: 0 rows: 31  cols: 33
month: 1 rows: 28  cols: 32

Look at the resulting grid


In [13]:
# @interact(mon=(0, len(months) - 1))
def show_month_grid(mon):
    month = months[mon]
    fig, ax = plt.subplots(figsize=(6, 6))

    ax.imshow(month.image, cmap=plt.cm.gray)
    ax.set_title('Grid {}'.format(mon))

    for ((x0, y0), (x1, y1)) in month.horiz.lines:
        ax.plot((x0, x1), (y0, y1), '-y', linewidth=1)

    if not month.top or not month.bottom:
        for ((x0, y0), (x1, y1)) in month.vert.lines:
            ax.plot((x0, x1), (y0, y1), '-r', linewidth=1)

    if month.top:
        for ((x0, y0), (x1, y1)) in month.top.vert.lines:
            ax.plot((x0, x1), (y0, y1), '-c', linewidth=1)

    if month.bottom:
        for ((x0, y0), (x1, y1)) in month.bottom.vert.lines:
            y0 += month.bottom.offset.y
            y1 += month.bottom.offset.y
            ax.plot((x0, x1), (y0, y1), '-b', linewidth=1)

    plt.tight_layout()
    plt.show()

show_month_grid(0)


Find column labels


In [14]:
# @interact(mon=(0, len(months) - 1), col=(0, 35))
def draw_column_header_interior(mon, col):
    month = months[mon]

    col = -1 if col >= len(month.cells[0]) else col

    cell = month.cells[0][col]
    interior = cell.interior(crop=Cell.crop)

    mean = np.mean(interior)
    print('mean', mean)
    print('yes' if cell.is_label() else '')
    print(interior.shape)

    fig, ax = plt.subplots(figsize=(3, 3))
    ax.imshow(interior, cmap=plt.cm.gray)

    lines = cell.has_line()
    for ((x0, y0), (x1, y1)) in lines:
        ax.plot((x0, x1), (y0, y1), '-r', linewidth=1)


draw_column_header_interior(1, 2)


mean 23.4427767355
yes
(41, 39)

Look for forward slashes in grid cells


In [15]:
# @interact(mon=(0, len(months) - 1), row=(1, 60), col=(0, 35))
def draw_cell_interior(mon, row, col):
    month = months[mon]
    row = -1 if row >= len(month.cells) else row
    col = -1 if col >= len(month.cells[0]) else col

    cell = month.cells[row][col]
    interior = cell.interior(Cell.crop)

    fig, ax = plt.subplots(figsize=(3, 3))
    ax.imshow(interior, cmap=plt.cm.gray)

    lines = probabilistic_hough_line(
        interior, line_length=15, theta=Cell.forward_slashes)
    for ((x0, y0), (x1, y1)) in lines:
        ax.plot((x0, x1), (y0, y1), '-r', linewidth=1)

    print('lines', len(lines))
    print('yes' if len(lines) else '')


draw_cell_interior(1, 18, 31)
# draw_cell_interior(1, 27, 27)


lines 0


In [16]:
# @interact(mon=(0, len(months) - 1))
def show_slashes(mon):
    month = months[mon]
    for r, row in enumerate(month.cells[1:]):
        for col, cell in enumerate(row):
            if month.col_labels[col]:
                print('/' if cell.has_line(Cell.forward_slashes) else '.', end=' ')
        print()
show_slashes(0)


/ / . . . . . . . . . . . . . . . . . . . . . . . / . . . . . 
/ / / / / / / / / / / / / / / / / / / / / / / / / . . . . / / 
/ / / / / / / . / / . / . . . . / . / . . / . . . . . . . . . 
/ / . / . / . . . / . . . . . . . . . . . . . . . . . . . . . 
/ / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / 
/ / . . . . / . / / / / / / / / / . / . / / . / . / . . . . / 
/ / . . / . . / / / . . . . / . / / . . . / . / / . . / . . . 
/ / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / 
/ / / / / / / / . . . . . / . . . / / / / / / / / / / . / . / 
/ / . . . / . / . / . . / / . . / / . / / / / . / / / . . . / 
/ / / / / / / / / / / / / / / / / / / / / / / / / / / / / . / 
/ . . . . . / / / / / / / / / / / / . / / / / / / / . / . . . 
/ / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / 
/ / / / / / / / / / / / / / / / / / / / / / . . . / / . . . . 
/ / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / 
/ / / / / / / / / / / / / / / / / . . . . / . . . / / . / / / 
/ . . . . / / . . / . . . . / . . . . . . . . / . . . . . . . 
/ / / / / / / / . . . . . . . . . . . . . . . . . . . . . . . 
/ / / / / / / / / / / / / / / / / / . . / / . / . / . . / . . 
. / . . / . . / . / / . . . . . . . . . . / . . . . . . . . . 
. / / / / . . / . / . / . . . . . . . . / . . / . . . . . . . 
. / / . . . . / . . / . . . . . . . . . . . . . . . . . . / . 
. . . . / . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . / / . / . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . / . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . / / / / / / / / / / / / / . . / . . . . . . . . 
. . . . . . . . . / . . / / . . . . . . . . . . . . . . . / . 
. . . . . . . . . . . . / . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . / . . . . . . . . . . . . / . . . . . 
. . . . . . . . . . . . . . . . . . . . . . / . . . . . . . . 

Stitch image parts back together to report output

This image shows how we broke up the input image to get the monthly charts. Doing it this way reduces distortion.


In [17]:
fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(grid.image, cmap=plt.cm.gray)

for ((x0, y0), (x1, y1)) in left_side.horiz.lines:
    ax.plot((x0, x1), (y0, y1), '-m', linewidth=1)

for ((x0, y0), (x1, y1)) in left_side.vert.lines:
    ax.plot((x0, x1), (y0, y1), '-r', linewidth=1)

for month in months:
    for ((x0, y0), (x1, y1)) in month.horiz.lines:
        x0 += month.offset.x
        x1 += month.offset.x
        y0 += month.offset.y
        y1 += month.offset.y
        ax.plot((x0, x1), (y0, y1), '-y', linewidth=1)

    if not month.top or not month.bottom:
        for ((x0, y0), (x1, y1)) in month.vert.lines:
            x0 += month.offset.x
            x1 += month.offset.x
            y0 += month.offset.y
            y1 += month.offset.y
            ax.plot((x0, x1), (y0, y1), '-r', linewidth=1)

    if month.top:
        for ((x0, y0), (x1, y1)) in month.top.vert.lines:
            x0 += month.top.offset.x
            x1 += month.top.offset.x
            y0 += month.top.offset.y
            y1 += month.top.offset.y
            ax.plot((x0, x1), (y0, y1), '-c', linewidth=1)

    if month.bottom:
        for ((x0, y0), (x1, y1)) in month.bottom.vert.lines:
            x0 += month.bottom.offset.x
            x1 += month.bottom.offset.x
            y0 += month.bottom.offset.y
            y1 += month.bottom.offset.y
            ax.plot((x0, x1), (y0, y1), '-b', linewidth=1)

plt.show()
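
The cell above repeats the same offset arithmetic for each group of lines. As a sketch, it could be factored into a small helper. The names Offset and shift_line are hypothetical; the only thing taken from the notebook is that offsets expose x and y attributes and that lines are ((x0, y0), (x1, y1)) pairs.

```python
from collections import namedtuple

# Hypothetical stand-in for the offset objects used by the grid classes.
Offset = namedtuple('Offset', 'x y')


def shift_line(line, offset):
    """Translate a ((x0, y0), (x1, y1)) line from crop to full-image coordinates."""
    (x0, y0), (x1, y1) = line
    return ((x0 + offset.x, y0 + offset.y), (x1 + offset.x, y1 + offset.y))


# A line found inside a cropped month image, moved back onto the full image.
line = ((0, 12), (1850, 15))
shifted = shift_line(line, Offset(x=1450, y=300))
print(shifted)
```

Each plotting loop would then reduce to shifting the line once and passing the result to ax.plot.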



In [18]:
def get_month_graph_areas(grid, left_side):
    """Chop the right side image into images for each month."""
    months = []

    for label_idx, label in enumerate(left_side.row_labels[1:], 1):

        prev_idx = label_idx - 1
        prev_label = left_side.row_labels[prev_idx]

        if label and not prev_label:
            top_line = grid.horiz.lines[label_idx - 1]
        elif not label and prev_label:
            bottom_line = grid.horiz.lines[label_idx + 1]
            months.append(crop_rows(grid, top_line, bottom_line))

    return months

Output the results


In [19]:
CSV_PATH = 'output/boyd_bird_journal.csv'
output_results(in_file, CSV_PATH, grid, months, left_side)
plt.show()


Failed experiments

  • Try merging endpoints: Lines are fairly skewed across the entire image. I tried to use interior points to make grid lines. This didn't really help.

  • Probabilistic Hough line: This may work for other parts of the image, like slashes, but it didn't help with either grid lines or row labels. It proved to be much slower and harder to tune for finding grid lines that span the entire image.

  • OpenCV: This works, but it is less flexible for searching on a limited set of angles. The ability to pull out the horizontal, vertical, and diagonal lines separately is useful in this application. Also, OpenCV is difficult to install.