(See image below)
The journals are PDFs containing a series of scanned images of bird observations. The observations are handwritten notes on graph paper. Bird species labels run down the left side of the page and date information runs across the top. The charts are organized by month, with the days of the month as column headings. Each image holds two to three months of information.
Each cell records the presence or absence of a bird species on a given day, so there is potentially one mark per species per day. The mark on the page is typically a forward slash "/", but it can also be an "x" or an asterisk "*". We treat all types of marks the same: a cell either has a mark or it doesn't.
Some things to note here:
In [1]:
%load_ext watermark
%watermark -a 'Raphael LaFrance' -i -u -v -r -g -p numpy,matplotlib,skimage
First we need to extract individual images from the PDFs. This is easily accomplished in Linux with the command pdfimages, which is part of either the poppler or xpdf packages. We're using bash to make a directory to hold the images and then extracting the PDF images into that directory. The first 20 images are not relevant here.
In [2]:
%%bash
RAW_DATA='raw_data'
DIRECTORY='images'
PDF1="$RAW_DATA/Boyd_M_Bird_journal_section1.pdf"
PDF2="$RAW_DATA/Boyd_M_Bird_journal_section2.pdf"
PREFIX1="$DIRECTORY/Boyd_M_Bird_journal_section1"
PREFIX2="$DIRECTORY/Boyd_M_Bird_journal_section2"
# Extract the page images only once: skip this step if the directory exists.
if [ ! -d "$DIRECTORY" ]; then
    mkdir "$DIRECTORY"
    pdfimages -png "$PDF1" "$PREFIX1"
    pdfimages -png "$PDF2" "$PREFIX2"
fi
In [3]:
# %matplotlib notebook
%matplotlib inline
import os
import csv
from itertools import product
from collections import namedtuple
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import patches, cm
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
# import cv2
from skimage import io
from skimage import util
# from skimage.filters import sobel
from skimage.transform import hough_line, hough_line_peaks
from skimage.transform import probabilistic_hough_line, rotate
from lib.util import Crop, too_close
from lib.cell import Cell
from lib.grid import Grid
from boyd_journal_extraction import get_month_graph_areas, build_month_graphs
from boyd_journal_extraction import init_csv_file, output_results, process_image
from boyd_journal_extraction import get_left_side
In [4]:
in_file = 'images/Boyd_M_Bird_journal_section1-024.png'
# in_file = 'images/Boyd_M_Bird_journal_section2-125.png'
CSV_PATH = 'output/boyd_bird_journal.csv'
We're using the Hough Transform to find lines in the image. It is an old and efficient algorithm for finding shapes, such as lines, in an image. It is efficient because it scans the image only once.
The basic idea of the algorithm is:
See the Wikipedia Page for a more detailed description.
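As a rough illustration of the idea, the sketch below runs scikit-image's `hough_line` on a small synthetic image rather than on a journal page. The image size, the line positions, and the one-degree angle step are assumptions chosen for the example, not values taken from the journal code.

```python
import numpy as np
from skimage.transform import hough_line, hough_line_peaks

# Synthetic binary image: one horizontal line (row 30), one vertical line (column 60).
image = np.zeros((100, 100), dtype=bool)
image[30, :] = True
image[:, 60] = True

# Candidate angles in one-degree steps. In scikit-image the angle is that of
# the line's normal, so 0 degrees is a vertical line and +/-90 degrees is a
# horizontal line.
thetas = np.deg2rad(np.arange(-90, 90))

# Every "on" pixel votes once for each (angle, distance) pair it could lie on.
accumulator, angles, distances = hough_line(image, theta=thetas)

# Lines show up as the accumulator cells that collected the most votes.
votes, peak_angles, peak_dists = hough_line_peaks(
    accumulator, angles, distances, num_peaks=2)

# Report each line as (|angle in degrees|, |distance in pixels|).
found = sorted(
    (int(round(abs(float(np.rad2deg(a))))), int(round(abs(float(d)))))
    for a, d in zip(peak_angles, peak_dists))
print(found)
```

The two strongest peaks correspond to the two drawn lines. Everything else in the accumulator holds only scattered votes.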
In [5]:
grid = Grid(file_name=in_file)
print(grid.image.shape)
As described above, we need to define a line as a threshold on the line count. However, there is a wrinkle: the images are not square, and the width is the shorter dimension (3300 px wide by 5100 px high), so a horizontal line that spans the page can never collect as many points as a vertical line that spans it. To accommodate this we make two passes over the image: one for the horizontal lines and one for the vertical lines.
In [6]:
grid.find_grid_lines()
We want to verify the horizontal grid lines and the vertical grid lines at the left side of the image.
What we do care about is the leftmost vertical line. It should separate the grid line numbers from the rest of the grid.
In [7]:
fig, ax = plt.subplots(figsize=(4, 8))
ax.imshow(grid.image, cmap=plt.cm.gray)
for ((x0, y0), (x1, y1)) in grid.horiz.lines:
    ax.plot((x0, x1), (y0, y1), '-y', linewidth=1)
for ((x0, y0), (x1, y1)) in grid.vert.lines:
    ax.plot((x0, x1), (y0, y1), '-r', linewidth=1)
We're expecting two columns of cells on the left side of the image. The cells are rather long and typically have a lot of whitespace toward the left end. We expect the first cell to hold a row number and the second cell to hold the bird's species identification. We look at the first cell to see if there is any writing in it. To help boost the signal, we chop the first cell at a fixed width and look for writing in that part only.
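A minimal sketch of why chopping the cell helps, using only NumPy. The helpers `ink_fraction` and `has_writing`, the pixel sizes, and the thresholds are hypothetical and are not taken from `lib.cell`; the sketch also assumes dark ink on a light background.

```python
import numpy as np


def ink_fraction(cell, width=80, border=4, dark=128):
    """Fraction of dark pixels in the left `width` columns of a cell.

    `border` pixels are trimmed from the top, bottom, and left so that the
    grid lines themselves are not counted as writing.
    """
    interior = cell[border:-border, border:width]
    return float(np.mean(interior < dark))


def has_writing(cell, min_fraction=0.02, **kwargs):
    """True when enough of the cropped cell is covered in ink."""
    return ink_fraction(cell, **kwargs) >= min_fraction


# A long, mostly empty cell (white = 255) and a copy with a crude stroke
# near its left end, standing in for a handwritten row number.
blank = np.full((40, 300), 255, dtype=np.uint8)
numbered = blank.copy()
numbered[10:30, 20:24] = 0

print(has_writing(blank))                # no ink at all
print(has_writing(numbered))             # ink stands out in the 80 px crop
print(has_writing(numbered, width=300))  # same ink, diluted over the whole cell
```

The stroke covers about 3% of the cropped region but under 1% of the full cell, so the same threshold only fires on the crop.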
In [8]:
left_side = get_left_side(grid)
print(len(left_side.cells))
print(len(left_side.cells[0]))
In [9]:
fig, ax = plt.subplots(figsize=(4, 8))
ax.imshow(left_side.image, cmap=plt.cm.gray)
for ((x0, y0), (x1, y1)) in left_side.horiz.lines:
    ax.plot((x0, x1), (y0, y1), '-y', linewidth=1)
for ((x0, y0), (x1, y1)) in left_side.vert.lines:
    ax.plot((x0, x1), (y0, y1), '-r', linewidth=1)
In [10]:
# @interact(row=(0, len(left_side.cells) - 1))
def draw_row_label_interior(row):
    print('yes' if left_side.row_labels[row] else '')
    inside = left_side.cells[row][0].interior(crop=Cell.crop)
    print(np.mean(inside))
    fig, ax = plt.subplots(figsize=(6, 2))
    ax.imshow(inside, cmap=plt.cm.gray)
    lines = left_side.cells[row][0].has_line(Cell.label_lines)
    for ((x0, y0), (x1, y1)) in lines:
        ax.plot((x0, x1), (y0, y1), '-r', linewidth=1)


draw_row_label_interior(24)
In [11]:
months = get_month_graph_areas(grid, left_side)
for month in months:
    print(month.width, month.height)
In [12]:
build_month_graphs(months)
for m, month in enumerate(months):
    print('month: {} rows: {} cols: {}'.format(
        m, len(month.cells), len(month.cells[0])))
In [13]:
# @interact(mon=(0, len(months) - 1))
def show_month_grid(mon):
    month = months[mon]
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.imshow(month.image, cmap=plt.cm.gray)
    ax.set_title('Grid {}'.format(mon))
    for ((x0, y0), (x1, y1)) in month.horiz.lines:
        ax.plot((x0, x1), (y0, y1), '-y', linewidth=1)
    if not month.top or not month.bottom:
        for ((x0, y0), (x1, y1)) in month.vert.lines:
            ax.plot((x0, x1), (y0, y1), '-r', linewidth=1)
    if month.top:
        for ((x0, y0), (x1, y1)) in month.top.vert.lines:
            ax.plot((x0, x1), (y0, y1), '-c', linewidth=1)
    if month.bottom:
        for ((x0, y0), (x1, y1)) in month.bottom.vert.lines:
            y0 += month.bottom.offset.y
            y1 += month.bottom.offset.y
            ax.plot((x0, x1), (y0, y1), '-b', linewidth=1)
    plt.tight_layout()
    plt.show()


show_month_grid(0)
In [14]:
# @interact(mon=(0, len(months) - 1), col=(0, 35))
def draw_column_header_interior(mon, col):
    month = months[mon]
    col = -1 if col >= len(month.cells[0]) else col
    cell = month.cells[0][col]
    interior = cell.interior(crop=Cell.crop)
    mean = np.mean(interior)
    print('mean', mean)
    print('yes' if cell.is_label() else '')
    print(interior.shape)
    fig, ax = plt.subplots(figsize=(3, 3))
    ax.imshow(interior, cmap=plt.cm.gray)
    lines = cell.has_line()
    for ((x0, y0), (x1, y1)) in lines:
        ax.plot((x0, x1), (y0, y1), '-r', linewidth=1)


draw_column_header_interior(1, 2)
In [15]:
# @interact(mon=(0, len(months) - 1), row=(1, 60), col=(0, 35))
def draw_cell_interior(mon, row, col):
    month = months[mon]
    row = -1 if row >= len(month.cells) else row
    col = -1 if col >= len(month.cells[0]) else col
    cell = month.cells[row][col]
    interior = cell.interior(Cell.crop)
    fig, ax = plt.subplots(figsize=(3, 3))
    ax.imshow(interior, cmap=plt.cm.gray)
    lines = probabilistic_hough_line(
        interior, line_length=15, theta=Cell.forward_slashes)
    for ((x0, y0), (x1, y1)) in lines:
        ax.plot((x0, x1), (y0, y1), '-r', linewidth=1)
    print('lines', len(lines))
    print('yes' if len(lines) else '')


draw_cell_interior(1, 18, 31)
# draw_cell_interior(1, 27, 27)
In [16]:
# @interact(mon=(0, len(months) - 1))
def show_slashes(mon):
    month = months[mon]
    for r, row in enumerate(month.cells[1:]):
        for col, cell in enumerate(row):
            if month.col_labels[col]:
                print('/' if cell.has_line(Cell.forward_slashes) else '.', end=' ')
        print()


show_slashes(0)
In [17]:
fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(grid.image, cmap=plt.cm.gray)
for ((x0, y0), (x1, y1)) in left_side.horiz.lines:
    ax.plot((x0, x1), (y0, y1), '-m', linewidth=1)
for ((x0, y0), (x1, y1)) in left_side.vert.lines:
    ax.plot((x0, x1), (y0, y1), '-r', linewidth=1)
for month in months:
    for ((x0, y0), (x1, y1)) in month.horiz.lines:
        x0 += month.offset.x
        x1 += month.offset.x
        y0 += month.offset.y
        y1 += month.offset.y
        ax.plot((x0, x1), (y0, y1), '-y', linewidth=1)
    if not month.top or not month.bottom:
        for ((x0, y0), (x1, y1)) in month.vert.lines:
            x0 += month.offset.x
            x1 += month.offset.x
            y0 += month.offset.y
            y1 += month.offset.y
            ax.plot((x0, x1), (y0, y1), '-r', linewidth=1)
    if month.top:
        for ((x0, y0), (x1, y1)) in month.top.vert.lines:
            x0 += month.top.offset.x
            x1 += month.top.offset.x
            y0 += month.top.offset.y
            y1 += month.top.offset.y
            ax.plot((x0, x1), (y0, y1), '-c', linewidth=1)
    if month.bottom:
        for ((x0, y0), (x1, y1)) in month.bottom.vert.lines:
            x0 += month.bottom.offset.x
            x1 += month.bottom.offset.x
            y0 += month.bottom.offset.y
            y1 += month.bottom.offset.y
            ax.plot((x0, x1), (y0, y1), '-b', linewidth=1)
plt.show()
In [18]:
def get_month_graph_areas(grid, left_side):
    """Chop the right side image into images for each month."""
    months = []
    for label_idx, label in enumerate(left_side.row_labels[1:], 1):
        prev_idx = label_idx - 1
        prev_label = left_side.row_labels[prev_idx]
        if label and not prev_label:
            top_line = grid.horiz.lines[label_idx - 1]
        elif not label and prev_label:
            bottom_line = grid.horiz.lines[label_idx + 1]
            months.append(crop_rows(grid, top_line, bottom_line))
    return months
In [19]:
CSV_PATH = 'output/boyd_bird_journal.csv'
output_results(in_file, CSV_PATH, grid, months, left_side)
plt.show()
Try merging endpoints: The lines are quite skewed across the entire image. I tried using interior points to build the grid lines, but this didn't really help.
Probabilistic Hough line: This may work for other parts of the image, like slashes, but it didn't help with either grid lines or row labels. It proved to be much slower and harder to tune for finding grid lines that span the entire image.
OpenCV: This works, but it is less flexible for searching over a limited set of angles. The ability to pull out the horizontal, vertical, and diagonal lines separately is useful in this application. Also, OpenCV is difficult to install.
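To make the point about searching a limited set of angles concrete, here is a small sketch that looks only for lines near the forward-slash orientation inside a single synthetic cell. The angle window `FORWARD_SLASH_THETAS` and the helper `has_forward_slash` are assumptions for illustration; the actual values of `Cell.forward_slashes` are not shown in this notebook.

```python
import numpy as np
from skimage.transform import probabilistic_hough_line

# Assumed angle window for "/" marks. In scikit-image's convention (angle of
# the line's normal, image y-axis pointing down) a 45-degree "/" sits at
# +45 degrees and a "\" at -45 degrees.
FORWARD_SLASH_THETAS = np.deg2rad(np.arange(35, 56, 5))


def has_forward_slash(cell, min_length=15):
    """True when at least one segment near the '/' orientation is found."""
    segments = probabilistic_hough_line(
        cell, line_length=min_length, theta=FORWARD_SLASH_THETAS)
    return len(segments) > 0


size = 40
idx = np.arange(5, 35)

blank = np.zeros((size, size), dtype=bool)

slash = blank.copy()
slash[size - 1 - idx, idx] = True   # "/" rises from bottom-left to top-right

backslash = blank.copy()
backslash[idx, idx] = True          # "\" falls from top-left to bottom-right

for name, cell in [('blank', blank), ('slash', slash), ('backslash', backslash)]:
    print(name, has_forward_slash(cell))
```

Because only angles near 45 degrees are searched, the backslash never accumulates enough votes to be reported, which is the kind of separation by orientation described above.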