In [1]:
from pylab import *
Creating ground truth for OCR training from scratch is usually a laborious process. However, we can generate some ground truth fairly automatically.
There are many existing OCR systems. Even if their error rates are higher than those of OCRopus, they can still help us obtain a larger variety of training data, in particular for initial training.
Let's demonstrate this with the Tesseract OCR system. We first ask Tesseract to produce output in hOCR format.
In [8]:
!tesseract -l deu-frak 0211.bin.png 0211 hocr
As you can see, the hOCR output contains information about the bounding boxes and locations of text lines and words.
In [10]:
!fmt 0211.html | sed '20,30!d'
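Since hOCR is just HTML with the geometry embedded in title attributes, these boxes are easy to extract programmatically. Here is a minimal sketch (it assumes lxml is installed; the ocr_line class and the bbox notation are part of the standard hOCR format that Tesseract emits):
In [ ]:
import re
from lxml import html

doc = html.parse("0211.html")
# each text line carries class="ocr_line" and a title attribute
# containing "bbox x0 y0 x1 y1" in page pixel coordinates
for line in doc.xpath("//*[@class='ocr_line']")[:5]:
    box = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", line.get("title"))
    print box.groups(), line.text_content().strip()[:40]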
The OCR output gets quite a bit wrong, but that does not matter: we only want to extract some training data, not transcribe the entire document.
In [13]:
!lynx -dump 0211.html | sed '20,30!d'
To get good training data, we just focus on output that is correctly spelled.
The hocr-xwords script reads such hOCR files and takes a spelling dictionary.
It then goes through the hOCR output produced by the OCR engine, picks out
sequences of words that are spelled correctly (some common punctuation is allowed),
and writes out both the text (as ground truth) and the corresponding image
in a format suitable for training.
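The core idea is simple enough to sketch in a few lines of Python. The following is an illustration of the approach, not the actual hocr-xwords implementation: it assumes a plain word list at /usr/share/dict/ngerman (substitute whatever dictionary you have), and for brevity it keeps only whole lines in which every word passes the spelling check, whereas hocr-xwords also extracts correctly spelled sub-sequences.
In [ ]:
import re, codecs
from lxml import html
from PIL import Image

# assumed dictionary location -- substitute your own word list
words = set(w.strip().lower() for w in
            codecs.open("/usr/share/dict/ngerman", "r", "utf-8"))
page = Image.open("0211.bin.png")
doc = html.parse("0211.html")

def bbox(el):
    # hOCR stores coordinates as "bbox x0 y0 x1 y1" in the title attribute
    return map(int, re.search(r"bbox (\d+) (\d+) (\d+) (\d+)",
                              el.get("title")).groups())

count = 0
for line in doc.xpath("//*[@class='ocr_line']"):
    tokens = [(w.text_content().strip(), bbox(w))
              for w in line.xpath(".//*[@class='ocrx_word']")]
    # keep the line only if every word, minus common punctuation,
    # is found in the dictionary
    if tokens and all(re.sub(r"[.,;:!?]", "", t).lower() in words
                      for t, _ in tokens):
        x0 = min(b[0] for _, b in tokens); y0 = min(b[1] for _, b in tokens)
        x1 = max(b[2] for _, b in tokens); y1 = max(b[3] for _, b in tokens)
        page.crop((x0, y0, x1, y1)).save("xw%06d.bin.png" % count)
        codecs.open("xw%06d.gt.txt" % count, "w", "utf-8").write(
            u" ".join(t for t, _ in tokens))
        count += 1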
In [5]:
!hocr-xwords -w 400 -d de_DE 0211.html
Here is an example of what this looks like.
In [16]:
import codecs
print repr(codecs.open("0211/010001.gt.txt","r","utf-8").read())
imshow(imread("0211/010001.bin.png"),cmap=cm.gray)
Out[16]:
As you can see, a whole page of text gave rise to only a few lines of training text, but that's enough, since there are many scanned pages we can easily get hold of.
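Since every extracted snippet is written as a matching .bin.png/.gt.txt pair, a quick count shows how much training data the page yielded:
In [ ]:
import glob
print len(glob.glob("0211/*.gt.txt")), "training lines extracted"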
You can control the number and length of the lines returned with the -w and -m arguments: with very short lines, you risk a misrecognized word accidentally matching a dictionary entry, and short snippets also make it harder to model text line geometry, while with very long lines you simply will not find many examples.
Furthermore, after generating training data like this initially, it is easy to verify it by training on it and seeing which lines remain misclassified.
Note that you can also use OCRopus itself to perform these steps; even if its performance on some new font is fairly poor, it will pick up a lot of good training data, and that will allow it to improve itself.
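A minimal bootstrap loop along these lines might look as follows. This is a sketch assuming a standard ocropy installation with the usual ocropus-rtrain, ocropus-rpred, and ocropus-errs tools on the path; the checkpoint filename is only illustrative, since ocropus-rtrain writes numbered checkpoints as it goes:
In [ ]:
!ocropus-rtrain -o xwmodel 0211/??????.bin.png
!ocropus-rpred -m xwmodel-00050000.pyrnn.gz 0211/??????.bin.png
!ocropus-errs 0211/??????.gt.txt
Lines that still show large edit distances after retraining are the ones worth checking by hand.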
In [ ]: