In [1]:
from pylab import *

Making Ground Truth

Creating ground truth for OCR training from scratch is usually a fairly laborious process. However, we can create some ground truth fairly automatically.

There are many existing OCR systems. Even though their error rates may be higher than those of OCRopus, they can still help us obtain a greater variety of training data, in particular for initial training.

Let's demonstrate this with the Tesseract OCR system. We first ask Tesseract to produce its output in hOCR format.


In [8]:
!tesseract -l deu-frak 0211.bin.png 0211 hocr


Tesseract Open Source OCR Engine v3.02 with Leptonica

As you can see in the output, the hOCR file contains bounding box information for text lines and words.


In [10]:
!fmt 0211.html | sed '20,30!d'


2709 805 2906 885">wuthe</span> <span class='ocr_word' id='word_13'
title="bbox 2948 803 3266 881">(wadete)</span> <span class='ocr_word'
id='word_15' title="bbox 3308 799 3445 871">über</span> </span>
<span class='ocr_line' id='line_3' title="bbox 1542 915 3444
1005"><span class='ocr_word' id='word_17' title="bbox 1542 925 1655
989">alle</span> <span class='ocr_word' id='word_19' title="bbox 1688
927 1954 1005">Wasser,</span> <span class='ocr_word' id='word_21'
title="bbox 2000 922 2155 997">dorst</span> <span class='ocr_word'
id='word_23' title="bbox 2198 922 2524 998">(braucht)</span> <span
class='ocr_word' id='word_25' title="bbox 2564 923 2702 992">über</span>
<span class='ocr_word' id='word_27' title="bbox 2753 926 2905
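The bbox coordinates in the title attributes are easy to extract programmatically. Here is a minimal sketch (not part of the OCRopus tooling; the sample string is one word-level span copied from the output above) that pulls them out with a regular expression:

```python
import re

# One word-level span copied from the hOCR output above.
hocr = """<span class='ocr_word' id='word_13' title="bbox 2948 803 3266 881">(wadete)</span>"""

# hOCR stores each element's pixel bounding box as "bbox x0 y0 x1 y1"
# inside the title attribute.
boxes = [tuple(map(int, m.groups()))
         for m in re.finditer(r'bbox (\d+) (\d+) (\d+) (\d+)', hocr)]
print(boxes)
```

Each tuple gives the (x0, y0, x1, y1) pixel coordinates needed to crop the corresponding word or line image out of the page.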

This OCR output gets quite a bit wrong, but we only care about extracting some training data, not transcribing the entire document.


In [13]:
!lynx -dump 0211.html | sed '20,30!d'


   Feinde, Winden und Hunnen, meinten, es wär der leidige Teufel.

   19. Riesen» Säulen.

   Wi n ketiiia n n’ s hessifche Chronik. S. z2. M el i s s a n t e s in
   0rqgraph. bei Maichene Berg.

   Bei Miltenberg oder Kleinen-Haubaih auf einein hohen Gebürg ini Walde
   sind neun geivaltige, große,

To get good training data, we focus only on output that is correctly spelled.

The hocr-xwords script reads such hOCR files together with a spelling dictionary. It goes through the hOCR output produced by the OCR engine, picks out sequences of words that are spelled correctly (some common punctuation is allowed), and writes out both the text (as ground truth) and the corresponding image in a format suitable for training.
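The core filtering idea can be sketched as follows. This is a hypothetical re-implementation for illustration, not the actual hocr-xwords code, and the toy lexicon stands in for a real dictionary like de_DE:

```python
import re

# Toy lexicon standing in for a real spelling dictionary (hypothetical).
dictionary = {"alle", "wasser", "über", "gute", "gesellen"}

def correctly_spelled(word, lexicon):
    # Allow common punctuation by stripping it before the lookup.
    stripped = re.sub(r"[.,;:!?()\"']", "", word)
    return stripped != "" and stripped.lower() in lexicon

def good_runs(words, lexicon, minlen=2):
    # Collect maximal runs of correctly spelled words of at least minlen words.
    runs, current = [], []
    for w in words:
        if correctly_spelled(w, lexicon):
            current.append(w)
        else:
            if len(current) >= minlen:
                runs.append(" ".join(current))
            current = []
    if len(current) >= minlen:
        runs.append(" ".join(current))
    return runs

print(good_runs("dorst alle Wasser, über gute Gesellen".split(), dictionary))
```

A misspelled word like "dorst" breaks the run, so only the correctly spelled tail of the line survives as training text.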


In [5]:
!hocr-xwords -w 400 -d de_DE 0211.html


page_1 0211.bin.png
(7009, 4959) (7009, 4959)
alle Wasser,
keine Brücke gehen,
nieder, hängt
gute Gesellen
sieben oder acht
viel Volks wider solche Kröten
Riesen nennt
Feinde, Winden
Hunnen, meinten,
leidige Teufel.

Here is an example of what this looks like.


In [16]:
import codecs
print repr(codecs.open("0211/010001.gt.txt","r","utf-8").read())
imshow(imread("0211/010001.bin.png"),cmap=cm.gray)


u'alle Wasser,'
Out[16]:
<matplotlib.image.AxesImage at 0x46fbb10>

As you can see, a whole page of text gave rise to only a few lines of training text, but that's enough, since there are many scanned pages we can easily get hold of.

You can control the number of lines returned with the -w and -m arguments. With really short lines, you risk accidentally matching a word in the dictionary and making text line geometry modeling harder; with really long lines, you will simply not find many examples.

Furthermore, after generating training data like this initially, it is easy to verify it by training on it and seeing which lines remain misclassified.
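One simple way to carry out that verification is to measure the character error rate between the model's transcription and the generated ground truth, and flag high-error lines for manual review. Here is a sketch (a generic Levenshtein-based check, not a specific OCRopus tool):

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def char_error_rate(predicted, truth):
    return edit_distance(predicted, truth) / float(len(truth))

# Lines whose error rate stays high after training are suspect ground truth.
print(char_error_rate("alle Wasser.", "alle Wasser,"))
```

Lines that the trained model still gets badly wrong are likely to contain OCR mistakes that slipped through the dictionary filter.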

Note that you can also use OCRopus itself to perform these steps; even if its performance on some new font is fairly poor, it will pick up a lot of good training data, and that will allow it to improve itself.


In [ ]: