In [20]:
from pylab import *
import ocrolib
from ocrolib import morph
def display(f): imshow(imread(f),cmap=cm.gray)
def displaysegs(f): morph.showlabels(ocrolib.read_line_segmentation(f))

OCRopus Processing Steps

This worksheet illustrates the various processing steps in OCRopus.

Preprocessing

There are a number of preprocessing tools available in OCRopus. The recommended ones are:

  • ocropus-nlbin - nonlinear binarization tool (recommended)
  • ocropus-sauvola - binarization based on a fast implementation of Sauvola's method

Both perform deskewing and binarization.


In [2]:
!ocropus-nlbin 0020_0022.png


0020_0022.png lo-hi (0.18 0.64) angle  0.0 

In [3]:
figsize(12,12)
subplot(121); display("0020_0022.png")
subplot(122); display("0020_0022.bin.png")


As you can see, the output is much cleaner than the input: bleed-through and stains have been removed.

We can also zoom in to find that characters are nicely connected and that there isn't much noise.


In [24]:
orig = ocrolib.read_image_gray("0020_0022.png")
image = ocrolib.read_image_gray("0020_0022.bin.png")
gray(); figsize(12,18)
subplot(121); imshow(orig[:500,:500])
subplot(122); imshow(image[:500,:500])


Out[24]:
<matplotlib.image.AxesImage at 0x440e0d0>

You can get information on command line options by invoking any program with the --help option.


In [8]:
!ocropus-nlbin --help


usage: ocropus-nlbin [-h] [-t THRESHOLD] [-z ZOOM] [-e ESCALE] [-b BIGNORE] [-p PERC] [-r RANGE] [-m MAXSKEW]
                     [-g] [--lo LO] [--hi HI] [--skewsteps SKEWSTEPS] [--debug DEBUG] [--show] [-o OUTPUT]
                     [-Q PARALLEL]
                     files [files ...]

positional arguments:
  files

optional arguments:
  -h, --help            show this help message and exit
  -t THRESHOLD, --threshold THRESHOLD
                        threshold, determines lightness
  -z ZOOM, --zoom ZOOM  zoom for page background estimation, smaller=faster
  -e ESCALE, --escale ESCALE
                        scale for estimating a mask over the text region
  -b BIGNORE, --bignore BIGNORE
                        ignore this much of the border for threshold estimation
  -p PERC, --perc PERC  percentage for filters
  -r RANGE, --range RANGE
                        range for filters
  -m MAXSKEW, --maxskew MAXSKEW
                        skew angle estimation parameters (degrees)
  -g, --gray            force grayscale processing even if image seems binary
  --lo LO               percentile for black estimation
  --hi HI               percentile for white estimation
  --skewsteps SKEWSTEPS
                        steps for skew angle estimation (per degree)
  --debug DEBUG         display intermediate results
  --show                display final result
  -o OUTPUT, --output OUTPUT
                        output directory
  -Q PARALLEL, --parallel PARALLEL

The important parameters here are:

  • -t - needed if the images come out too dark or too light
  • -b - increase if your images have very large borders, decrease for small images
  • -m - the maximum skew angle that is corrected (should be within a few degrees at most)
  • --lo, --hi - percentiles for black/white estimation; set these if your documents are unusually light or dark
  • --debug t - display intermediate processing steps
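To make the role of --lo and --hi more concrete, here is a minimal NumPy sketch of percentile-based black/white estimation. It only illustrates the general idea and is not the code that ocropus-nlbin runs; the function name stretch_by_percentiles and the percentile values 5 and 90 are assumptions chosen for this example.

```python
import numpy as np

def stretch_by_percentiles(image, lo=5, hi=90):
    """Rescale gray values so the lo-th percentile maps to 0 and the hi-th to 1."""
    image = np.asarray(image, dtype=float)
    black = np.percentile(image, lo)   # estimate of the ink level
    white = np.percentile(image, hi)   # estimate of the paper level
    if white <= black:                 # flat image: nothing to stretch
        return np.zeros_like(image)
    return np.clip((image - black) / (white - black), 0.0, 1.0)

# Synthetic "page": gray values spread evenly between 0.2 and 0.7.
page = np.linspace(0.2, 0.7, 101).reshape(1, -1)
flat = stretch_by_percentiles(page)
```

Raising the lower percentile treats more of the dark pixels as pure black, and lowering the upper percentile treats more of the light pixels as pure white, which is why these settings matter for unusually light or dark documents.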

Page Segmentation

There are several page segmentation programs:

  • ocropus-gpageseg - simple gradient-based layout analysis, part of the main distribution; fairly robust (recommended)

The following are available as separate packages:

  • ocropus-prast - more complex, geometric layout analysis; published and well characterized, but sensitive
  • ocropus-ridge - ridge-based layout analysis

Page segmentation programs take one or more page image files as input. For each page, they create a directory named after the basename of the image and write the individual line images into it.


In [9]:
!rm -rf 0020_0022
!ocropus-gpageseg -Q 0 0020_0022.png


0020_0022.png
    44 0020_0022.png 17.0 45

In [10]:
!ls 0020_0022 | head


010001.nrm.png
010001.png
010002.nrm.png
010002.png
010003.nrm.png
010003.png
010004.nrm.png
010004.png
010005.nrm.png
010005.png

We can now look at the lines as produced by the page segmenter:


In [11]:
figsize(12,8)
for i in range(8):
    subplot(8,1,i+1)
    display("0020_0022/01%04x.png"%(i+4))
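The line files are numbered in hexadecimal, which is why the cell above formats the index with %04x. The small helper below is a sketch of that naming scheme; the function name line_image_path is hypothetical, and treating the leading "01" as a fixed prefix is an assumption, since its meaning is not explained in this worksheet.

```python
def line_image_path(page_dir, line_index, prefix="01"):
    """Build the path of a line image: prefix + 4-digit lowercase hex index."""
    return "%s/%s%04x.png" % (page_dir, prefix, line_index)

# Lines 1, 10 and 45 of the page segmented above.
paths = [line_image_path("0020_0022", i) for i in (1, 10, 45)]
```

This reproduces names such as 0020_0022/01000a.png, so a decimal loop counter can be mapped to the files that the segmenter wrote.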


Text Line Recognition

There is currently only one text line recognizer included in OCRopus:

  • ocropus-lattices - compute recognition lattices by oversegmentation and character classification

The ocropus-lattices command actually takes character recognizers and line segmenters and combines them into a line recognizer.

Additional recognizers are in preparation:

  • ocropus-hmm - HMM-based recognizer
  • ocropus-rnn - RNN-based recognizer

These produce files similar to those of ocropus-lattices but use their own internal recognizers.


In [13]:
!ocropus-lattices 0020_0022/??????.png


loading /usr/local/share/ocropus/uw3.cmodel
got <ocrolib.patrec.LocalCmodel instance at 0x4c9dd88>
sizemode linerel
loading /usr/local/share/ocropus/space.model
got <ocrolib.wmodel.WhitespaceModel instance at 0x1a758cf8>
loading /usr/local/share/ocropus/default.lineest
got <ocrolib.lineest.TrainedLineGeometry instance at 0x1a758d88>
segmenter lineseg.DPSegmentLine()
got <ocrolib.lineseg.DPSegmentLine instance at 0x1a758f38>
recognizing 45 files
0020_0022/010001.png =RAW= .5f .
0020_0022/010002.png =RAW= i6 
0020_0022/010003.png =RAW= sETTtEMENT oF NEw-YoRK 
0020_0022/010004.png =RAW= We accordingly find, that the States General, determined 
0020_0022/010005.png =RAW= on the regular settlement of a colonyo and made a grant of 
0020_0022/010006.png =RAW= ,the country in 1621 to the West India company of Amster- 
0020_0022/010007.png =RAW= olam. Wouter Van Twiller darrivel at Furt Amsterdam 
0020_0022/010008.png =RAW= (now New York) and took upon him s the government of 
0020_0022/010009.png =RAW= the colony, in June 1629. His style, in the patents whiclt 
0020_0022/01000a.png =RAW= he granted was thus. ~~ W' e the Director and Council residinh 
0020_0022/01000b.png =RAW= in New Netherland, under the government of .their Higg
0020_0022/01000c.png =RAW= Mightinesses the Sfates ' Geveral of the United Netherlands 
0020_0022/01000d.png =RAW= and the privileged West India Company.v 
0020_0022/01000e.png =RAW= DOESN'T SATISFY GEOMETRIC CONSTRAINS ON LINES, SKIPPED
0020_0022/01000f.png =RAW= NO RAW SEGMENTS
0020_0022/010010.png =RAW= CHAPTER III. 
0020_0022/010011.png =RAW= Fron the possession of the coariy by the Dutch to its surrerz-
0020_0022/010012.png =RAW= der fo the British, under the command of Uofmel Richard 
0020_0022/010013.png =RAW= Nichols, in the year I664. 
0020_0022/010014.png =RAW= IT is my avowed object, in th1s undertaking, to lay hefore 
0020_0022/010015.png =RAW= my readers fhe history of tloe city, not of the provinceo noW 
0020_0022/010016.png =RAW= the state of New-York: but at this early periodrthe circum- 
0020_0022/010017.png =RAW= stances im,cident to the settlement of both are so llended to- 
0020_0022/010018.png =RAW= gc:her as to render it difficult to separRte the one from the 
0020_0022/010019.png =RAW= nther. Ishall, therefore, without ftuttheI apdology,proceed, in 
0020_0022/01001a.png =RAW= the manner, wluich appears to be most practicable for general 
0020_0022/01001b.png =RAW= information. 
0020_0022/01001c.png =RAW= During the government of Mr. Van Twiller, the New-Eng-
0020_0022/01001d.png =RAW= landers extended hheir pos-essions to the Westward, s as far as 
0020_0022/01001e.png =RAW= Connecticut rivtr. William Kieft, who sucf er ded in the ad- 
0020_0022/01001f.png =RAW= ministration, protested s against it, and, in the year 16S8,issued 
0020_0022/010020.png =RAW= a proclamation prohibitiRg the English from trading te Fort 
0020_0022/010021.png =RAW= Gooll Hopeo and shortly after application was s made to the 
0020_0022/010022.png =RAW= States General for more troops to defend their territories 
0020_0022/010023.png =RAW= against invasion. They appear to have had good reason for 
0020_0022/010024.png =RAW= alarm, as Dr. Mather,in his History of New England, admitso 
0020_0022/010025.png =RAW= thaI the inhabitapts liad formed the design of settling Connec- 
0020_0022/010026.png =RAW= ticut river in the yeur 1635,before which iime tbey had con- 
0020_0022/010027.png =RAW= sidered, that river to be, at least, 100 miles fnom any of their 
0020_0022/010028.png =RAW= settlements, that in 16S6 they seated chemselves at Hartford, 
0020_0022/010029.png =RAW= and after settlinE New Haven in 1638, drove the Dutch gar- 
0020_0022/01002a.png =RAW= tison from Fort Good Hope. 
0020_0022/01002b.png =RAW= In 1640w uhe English, who had taken possession of tl1og 
0020_0022/01002c.png =RAW= Eastern part of Long Island, proceeded as far as Oyster Bay, 
0020_0022/01002d.png =RAW= about 40 miles from the city of New-Ycrk, But Kieft brok. 

The output from ocropus-lattices consists of a number of new files:

  • *.lattice - contains the recognition lattice (segmentation alternatives and character classes)
  • *.raw.txt - best interpretation of the input without using a language model
  • *.rseg.png - color-coded segmentation of the input; colors are referenced by *.lattice

In [14]:
!ls -1 0020_0022/010023.*


0020_0022/010023.lattice
0020_0022/010023.nrm.png
0020_0022/010023.png
0020_0022/010023.rseg.png

We can look at the segmentation output; to do so, we need to read the segmentation. In the segmentation, each RGB triple is interpreted as a 24-bit integer that labels a segment. The background is white.

To show characters in a distinctive way, we cycle between colors.


In [22]:
figsize(12,3)
displaysegs("0020_0022/010023.rseg.png")
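The decoding described above (RGB triple to 24-bit integer, white as background) can be sketched in NumPy as follows. This is not the ocrolib implementation: the function names are hypothetical, and the byte order (red as the most significant byte) is an assumption. The color cycling reuses the expression (cseg>0)*(cseg%3+1) that appears later in this worksheet.

```python
import numpy as np

def rgb_to_labels(rgb):
    """Pack an (h, w, 3) uint8 image into integer labels, mapping white to 0."""
    rgb = np.asarray(rgb, dtype=np.uint32)
    labels = (rgb[..., 0] << 16) | (rgb[..., 1] << 8) | rgb[..., 2]
    labels[labels == 0xFFFFFF] = 0   # white background becomes label 0
    return labels

def cycle_colors(labels, ncolors=3):
    """Map every label to 1..ncolors and keep the background at 0."""
    return (labels > 0) * (labels % ncolors + 1)

# One row of three pixels: white background, label 1, label 256.
tiny = np.array([[[255, 255, 255], [0, 0, 1], [0, 1, 0]]], dtype=np.uint8)
labels = rgb_to_labels(tiny)
colors = cycle_colors(labels)
```

Cycling through a handful of colors keeps the display readable even when a line contains dozens of labels, at the price that two different labels can end up with the same color.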


Here is a simple illustration of how we extract characters and character parts from such a segmentation.


In [30]:
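from scipy.ndimage import measurements
# load the raw segmentation displayed above as an integer label image
rseg = ocrolib.read_line_segmentation("0020_0022/010023.rseg.png")
# one bounding-box slice per label; charboxes[i] belongs to label i+1
charboxes = measurements.find_objects(rseg)
chars = [rseg[bbox]==i+1 for i,bbox in enumerate(charboxes)]
figsize(8,8)
ocrolib.showgrid(chars[:64])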


The lattice file contains the recognition output. It consists of segment lines and chr lines. Each segment line names a range of labels (first:last) in the raw segmentation; the two numbers at the end of the line represent the probability of a space being present or not present after the segment. Each chr line that follows a segment gives one candidate class for that segment together with its negative log probability, so lower values indicate more confident classifications.


In [31]:
!head 0020_0022/010023.lattice


segment 0	1:1	17:35:3:17	1.00	0.00
chr 0	0	0.0015	a
chr 0	1	5.5884	3
chr 0	2	5.8478	z
segment 1	2:2	18:44:21:37	1.00	0.00
chr 1	0	30.0000	~
segment 2	2:3	18:44:21:38	1.00	0.00
chr 2	0	0.0024	g
segment 3	2:4	18:44:21:55	1.00	0.00
chr 3	0	30.0000	~
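The records shown above can be read into Python with a few lines of string handling. The parser below is a sketch based only on the sample output: the name parse_lattice and the dictionary keys are hypothetical, the colon-separated third field of a segment line is kept verbatim because its meaning is not covered here, and the code assumes that the class field contains no whitespace.

```python
def parse_lattice(lines):
    """Group `chr` alternatives under their `segment` records."""
    segments = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "segment":
            index = int(fields[1])
            first, last = (int(x) for x in fields[2].split(":"))
            segments[index] = {
                "labels": (first, last),   # range of rseg labels
                "geometry": fields[3],     # kept verbatim, not interpreted
                "space": (float(fields[4]), float(fields[5])),
                "chars": [],               # (cost, class) alternatives
            }
        elif fields[0] == "chr":
            index, cost, cls = int(fields[1]), float(fields[3]), fields[4]
            segments[index]["chars"].append((cost, cls))
    return segments

sample = """segment 0\t1:1\t17:35:3:17\t1.00\t0.00
chr 0\t0\t0.0015\ta
chr 0\t1\t5.5884\t3
chr 0\t2\t5.8478\tz
segment 1\t2:2\t18:44:21:37\t1.00\t0.00
chr 1\t0\t30.0000\t~
segment 2\t2:3\t18:44:21:38\t1.00\t0.00
chr 2\t0\t0.0024\tg
""".splitlines()
lattice = parse_lattice(sample)
best_for_segment_0 = min(lattice[0]["chars"])[1]   # lowest cost wins
```

Keeping the alternatives as (cost, class) pairs means that min() picks the most probable class directly, since the costs are negative log probabilities.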

For example, segment 2 comprises rseg labels 2 and 3.


In [32]:
imshow((rseg>=2)*(rseg<=3))


Out[32]:
<matplotlib.image.AxesImage at 0x11ad3310>

Language Modeling

The final output from the recognizer is computed via language modeling.

The default language model is based on n-graphs. Such language models are constructed and applied with the ocropus-ngraphs command.

Language models take the *.lattice files and output *.txt files and *.cseg.png files.


In [33]:
!ocropus-ngraphs 0020_0022/*.lattice


loading /usr/local/share/ocropus/default-4.ngraphs
processing 43 files
0020_0022/010001.lattice =NGRAPHS= 17.91	.5 .
0020_0022/010002.lattice =NGRAPHS=  4.74	i6 
0020_0022/010003.lattice =NGRAPHS= 22.59	sETTLEMENT oF NEw-Y oRK
0020_0022/010004.lattice =NGRAPHS= 32.62	We accordingly find, that the States General, determined 
0020_0022/010005.lattice =NGRAPHS= 37.07	on the regular settlement of a colonyo and made a grant of 
0020_0022/010006.lattice =NGRAPHS= 37.55	,the country in 1621 to the West India company of Amster- 
0020_0022/010007.lattice =NGRAPHS= 48.81	olam. Wouter Van Twiller darrivel at Furt Amsterdam 
0020_0022/010008.lattice =NGRAPHS= 34.53	(now New York) and took upon him sthe government of 
0020_0022/010009.lattice =NGRAPHS= 40.61	the colony, in June 1629. His style, in the patents whiclt 
0020_0022/01000a.lattice =NGRAPHS= 70.27	he granted was thus. W'e the Director and Council residinh
0020_0022/01000b.lattice =NGRAPHS= 27.72	in New Netherland, under the government of .their Higg
0020_0022/01000c.lattice =NGRAPHS= 35.86	Mightinesses the Sfates' Gemeral of the United Netherlands 
0020_0022/01000d.lattice =NGRAPHS= 23.96	and the privileged West India Company.'' 
0020_0022/010010.lattice =NGRAPHS=  6.26	CHAPTER III. 
0020_0022/010011.lattice =NGRAPHS= 53.15	Fron the possession of the coariy by the Dutch to its surrerz-
0020_0022/010012.lattice =NGRAPHS= 59.48	der fothe British, under the command of Uolonel Richard 
0020_0022/010013.lattice =NGRAPHS= 14.66	Nichols, in the year I664. 
0020_0022/010014.lattice =NGRAPHS= 37.60	IT is my avowed object, in th1s undertaking, to lay hefore 
0020_0022/010015.lattice =NGRAPHS= 55.77	my readers fhe history oftloe city, not of the provincevnoW
0020_0022/010016.lattice =NGRAPHS= 37.61	the state of New-York: but at this early periodrthe circum- 
0020_0022/010017.lattice =NGRAPHS= 40.77	stances im,cident to the settlement of both are so llended to-
0020_0022/010018.lattice =NGRAPHS= 43.03	gc:her as to render it difficult to separ Rte the one from the 
0020_0022/010019.lattice =NGRAPHS= 48.23	nther. Ishall, therefore, without fultthe Iapdology,proceed, in 
0020_0022/01001a.lattice =NGRAPHS= 36.52	the manner, wliich appears to be most practicable for general 
0020_0022/01001b.lattice =NGRAPHS=  4.20	information. 
0020_0022/01001c.lattice =NGRAPHS= 24.34	During the government of Mr. Van Twiller, the New-Eng-
0020_0022/01001d.lattice =NGRAPHS= 48.61	landers extended hheir pos-essions to the Westward, e as far as 
0020_0022/01001e.lattice =NGRAPHS= 55.65	Connecticut rivtr. William Kieft, who sucferded in the ad-
0020_0022/01001f.lattice =NGRAPHS= 34.14	ministration, protested sagainst it, and, in the year 16S8,issued 
0020_0022/010020.lattice =NGRAPHS= 34.46	a proclamation prohibiting the English from trading te Fort 
0020_0022/010021.lattice =NGRAPHS= 49.82	Gooll Hope and shortly after application was smade to the 
0020_0022/010022.lattice =NGRAPHS= 33.21	States General for more troops to defend their territories 
0020_0022/010023.lattice =NGRAPHS= 28.90	against invasion. They appear to have had good reason for 
0020_0022/010024.lattice =NGRAPHS= 37.00	alarm, as Dr. Mather,in his History of New England, admits 
0020_0022/010025.lattice =NGRAPHS= 45.99	thai the inhabitapts liad formed the design of settling Connec- 
0020_0022/010026.lattice =NGRAPHS= 49.73	ticut river in the yeur 1635,before which iime they had con-
0020_0022/010027.lattice =NGRAPHS= 44.37	sidered, that river to be, at least, 100 miles from any of their 
0020_0022/010028.lattice =NGRAPHS= 40.08	settlements, that in 16S6they seated themselves at Hartford, 
0020_0022/010029.lattice =NGRAPHS= 44.75	and after settlinE New Haven in 1638, drove the Dutch gar-
0020_0022/01002a.lattice =NGRAPHS= 18.31	tison from Fort Good Hope. 
0020_0022/01002b.lattice =NGRAPHS= 59.14	In 1640wheEnglish, who had taken possession of tl1og 
0020_0022/01002c.lattice =NGRAPHS= 30.80	Eastern part of Long Island, proceeded as far as Oyster Bay, 
0020_0022/01002d.lattice =NGRAPHS= 37.14	about 40 miles from the city of New-York, But Kieft brok. 

You can get a good idea of what a language model represents by generating random text from it. This is random text generated by the default language model.


In [34]:
!ocropus-ngraphs --sample 10


loading /usr/local/share/ocropus/default-4.ngraphs
in could intoxico, and noth cap a blacing the dispondant our by factor
''Illinity. ''if sured, and soon ways. Hith who below. For einrol aske
resight an hers, cance hom ands presh its, and hear och als reconce as
from to at the get few way that to been hearshare from he hold and to 
sould his pa's and greedit,'' her uns I was and Holy:--''tured, be. ''
LATER XVIII. Demory him all-cupation, to easion ched his occust follow
An everywoget brospirity. In ally othings ande prollow one face mined 
the Nilgriend? Hautier, eastical''learly. ''But rapide ladney with act
Margen he cal boy himself his welled genusualing them to calf-and hims
367231698172731031740686952418361694507603625, for are to the Empel sc
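The sampling idea can be reproduced with a toy character n-gram model. The sketch below is not how ocropus-ngraphs is implemented; the function names are hypothetical, and the order n=4 is chosen only because the default model file is called default-4.ngraphs (whether that 4 is the n-gram order is an assumption). The code requires n >= 2.

```python
import random
from collections import defaultdict

def train_ngraphs(text, n=4):
    """Count which character follows each (n-1)-character context."""
    counts = defaultdict(lambda: defaultdict(int))
    padded = " " * (n - 1) + text
    for i in range(len(text)):
        context, following = padded[i:i + n - 1], padded[i + n - 1]
        counts[context][following] += 1
    return counts

def weighted_choice(rng, items, weights):
    """Draw one item with probability proportional to its weight."""
    r = rng.random() * sum(weights)
    for item, weight in zip(items, weights):
        r -= weight
        if r < 0:
            return item
    return items[-1]

def sample_text(counts, length, n=4, seed=0):
    """Generate `length` characters by repeatedly sampling the next character."""
    rng = random.Random(seed)
    start = " " * (n - 1)
    out = start
    for _ in range(length):
        followers = counts.get(out[-(n - 1):])
        if not followers:              # unseen context: fall back to the start
            followers = counts[start]
        chars = sorted(followers)
        out += weighted_choice(rng, chars, [followers[c] for c in chars])
    return out[n - 1:]

training = "against invasion. They appear to have had good reason for"
model = train_ngraphs(training)
generated = sample_text(model, 40)
```

With a single training sentence the output mostly replays fragments of that sentence; a model trained on a large corpus produces word-like nonsense of the kind shown above.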

In [35]:
imshow(imread("0020_0022/01000a.png"))


Out[35]:
<matplotlib.image.AxesImage at 0x10941190>

In the following, note that:

  • the text line has a typo at the end; depending on which language model parameters we choose, that may get corrected
  • the current recognizer has trouble with double quotes and some upper/lower case distinctions because it doesn't use character size or position at all (this will be fixed soon)

In [36]:
!ocropus-ngraphs 0020_0022/01000a.lattice --other 5 --nother 4
!ocropus-ngraphs 0020_0022/01000a.lattice --other 10 --nother 4
!ocropus-ngraphs 0020_0022/01000a.lattice --other 20 --nother 4


loading /usr/local/share/ocropus/default-4.ngraphs
processing 1 files
0020_0022/01000a.lattice =NGRAPHS= 70.27	he granted was thus. W'e the Director and Council residinh
loading /usr/local/share/ocropus/default-4.ngraphs
processing 1 files
0020_0022/01000a.lattice =NGRAPHS= 70.27	he granted was thus. W'e the Director and Council residinh
loading /usr/local/share/ocropus/default-4.ngraphs
processing 1 files
0020_0022/01000a.lattice =NGRAPHS= 70.27	he granted was thus. W'e the Director and Council residinh

In [37]:
!ocropus-ngraphs 0020_0022/01000a.lattice --lweight 0.1
!ocropus-ngraphs 0020_0022/01000a.lattice --lweight 0.5
!ocropus-ngraphs 0020_0022/01000a.lattice --lweight 1.0
!ocropus-ngraphs 0020_0022/01000a.lattice --lweight 2.0


loading /usr/local/share/ocropus/default-4.ngraphs
processing 1 files
0020_0022/01000a.lattice =NGRAPHS= 67.12	he granted was thus. W' e the Director and Council residinh 
loading /usr/local/share/ocropus/default-4.ngraphs
processing 1 files
0020_0022/01000a.lattice =NGRAPHS= 85.60	he granted was thus. W'ethe Director and Council residin h
loading /usr/local/share/ocropus/default-4.ngraphs
processing 1 files
0020_0022/01000a.lattice =NGRAPHS= 104.43	he granted was thus. We the Director and Council tesidin h
loading /usr/local/share/ocropus/default-4.ngraphs
processing 1 files
0020_0022/01000a.lattice =NGRAPHS= 131.81	hegranted was thus. We the Director and Councileesidinly 
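In the runs above, the reported cost grows with --lweight, and the reading "W' e" becomes "We" at --lweight 1.0. One way to picture this is as a weighted sum of a recognizer cost and a language-model cost. The sketch below illustrates that trade-off only: it assumes that --lweight scales the language-model term, the function name best_reading is hypothetical, and all candidate costs are invented for the example.

```python
def best_reading(candidates, lweight):
    """Pick the text minimising recognizer cost + lweight * language-model cost."""
    return min(candidates, key=lambda c: c[1] + lweight * c[2])[0]

# (text, recognizer cost, language-model cost); all numbers are invented.
candidates = [
    ("W'e", 1.0, 9.0),   # fits the pixels well, unlikely as text
    ("We", 4.0, 2.0),    # fits the pixels less well, likely as text
]
low = best_reading(candidates, 0.1)    # recognizer dominates
high = best_reading(candidates, 1.0)   # language model has more say
```

A very large weight lets the language model override what is actually on the page, which is consistent with the degraded output at --lweight 2.0.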

In [38]:
!ls -1 0020_0022/01000a.*


0020_0022/01000a.aligned
0020_0022/01000a.cseg.png
0020_0022/01000a.lattice
0020_0022/01000a.nrm.png
0020_0022/01000a.png
0020_0022/01000a.rseg.png
0020_0022/01000a.txt

The .cseg.png file contains a segmentation that corresponds directly to the characters in the output file.


In [40]:
cseg = ocrolib.read_line_segmentation("0020_0022/01000a.cseg.png")
figsize(32,12)
subplot(121); imshow((cseg>0)*(cseg%3+1),cmap=cm.gist_stern)
!cat 0020_0022/010023.txt


against invasion. They appear to have had good reason for 

(Note that in this case, the 'u' and the '.' are misrecognized due to noise; ocropus-lattices needs to remove such noise better than it does right now.)

Alignment

For training, it is important to obtain isolated characters that can then be used to train new character models. This is done with the ocropus-align command. This command takes ground truth and a recognition lattice, aligns them, and then outputs a database of individual characters in HDF5 format.


In [41]:
!echo "against invasion. They appear to have had good reason for" > 0020_0022/010023.gt.txt

In [42]:
# !ocropus-lalign -x test.h5 -g .gt.txt 0020_0022/010023.lattice
!ocropus-align 0020_0022/010023.lattice


processing 1 files
0020_0022/010023.lattice =ALIGNED=   24.8  against invasion. They appear to have had good reason for
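ocropus-align works on the recognition lattice, which is beyond a short example. As a much simpler stand-in, the sketch below aligns a recognized string with a transcription at the character level using Python's difflib. The function name align_text is hypothetical; the recognized text is taken from the raw output of line 010024 above, and the transcription was typed for this example.

```python
from difflib import SequenceMatcher

def align_text(recognized, truth):
    """Return (operation, recognized_part, truth_part) for every differing span."""
    matcher = SequenceMatcher(None, recognized, truth, autojunk=False)
    return [(op, recognized[i1:i2], truth[j1:j2])
            for op, i1, i2, j1, j2 in matcher.get_opcodes()
            if op != "equal"]

recognized = "Dr. Mather,in his History of New England, admitso"
truth = "Dr. Mather, in his History of New England, admits"
differences = align_text(recognized, truth)
```

Each differing span marks a place where a segment and a ground truth character do not line up, which is where an aligner has to decide which image region belongs to which character.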

In [43]:
!ls -1 0020_0022/010023.*


0020_0022/010023.aligned
0020_0022/010023.cseg.png
0020_0022/010023.gt.txt
0020_0022/010023.lattice
0020_0022/010023.nrm.png
0020_0022/010023.png
0020_0022/010023.rseg.png
0020_0022/010023.txt

(The following doesn't quite work yet, since you don't have the ocropus-extract command.)


In [46]:
!ocropus-lattices --extract test.h5 0020_0022/010023.lattice


loading /usr/local/share/ocropus/default.lineest
got <ocrolib.lineest.TrainedLineGeometry instance at 0x4180ef0>
sizemode linerel
0020_0022/010023.lattice =EXTRACTED= against invasion. They appear to have had good reason for

In [47]:
!h5ls test.h5


classes                  Dataset {111/Inf}
patches                  Dataset {111/Inf, 32, 32}

In [48]:
import tables

In [51]:
aligned = tables.openFile("test.h5","r")
cls = aligned.root.classes[17]
print cls,chr(cls)
figsize(5,5)
imshow(aligned.root.patches[17],interpolation='nearest')


104 h
Out[51]:
<matplotlib.image.AxesImage at 0xb320750>

The extracted characters are stored in HDF5 format. There are two storage formats: perchar and linerel.

The perchar format normalizes the size of each character individually. This means that perchar recognizers can recognize characters independent of context or position. However, such recognizers are prone to confusions such as o/O and ,/'.

The linerel format is what is usually used. It normalizes the size of each character based on the size of the text line, but then centers the character. The location of the character relative to the rest of the line is indicated by a white bar in the leftmost column, running from the baseline to the xline of the original line.


In [52]:
figsize(12,12)
for i,p in enumerate(aligned.root.patches[:64]):
    subplot(8,8,i+1)
    xticks([]); yticks([])
    xlabel(chr(aligned.root.classes[i]))
    imshow(p,interpolation='nearest')
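The perchar normalization described above, and the reason it confuses pairs like ,/', can be sketched in NumPy. This is an illustration under stated assumptions, not the OCRopus code: the function name normalize_perchar is hypothetical, resampling is plain nearest neighbour, and the patch size 32 is taken from the 32x32 patches listed by h5ls.

```python
import numpy as np

def normalize_perchar(char, size=32):
    """Crop a binary character to its bounding box and resample it to size x size."""
    char = np.asarray(char) > 0
    rows = np.nonzero(char.any(axis=1))[0]
    cols = np.nonzero(char.any(axis=0))[0]
    if rows.size == 0:                 # empty image: return an empty patch
        return np.zeros((size, size), dtype=bool)
    crop = char[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
    # Nearest neighbour: choose a source row and column for every target pixel.
    r = np.arange(size) * crop.shape[0] // size
    c = np.arange(size) * crop.shape[1] // size
    return crop[np.ix_(r, c)]

# The same small blob near the bottom (comma) and near the top (apostrophe)
# of a 10-pixel-high line.
comma = np.zeros((10, 4)); comma[7:9, 1:3] = 1
apostrophe = np.zeros((10, 4)); apostrophe[1:3, 1:3] = 1
same = np.array_equal(normalize_perchar(comma), normalize_perchar(apostrophe))
```

Once each character is cropped and rescaled on its own, its vertical position and relative size are gone, so the two patches are identical. The linerel format avoids this by scaling relative to the text line and recording the baseline and x-line position.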


