In [20]:
from pylab import *
import ocrolib
from ocrolib import morph
def display(f): imshow(imread(f),cmap=cm.gray)
def displaysegs(f): morph.showlabels(ocrolib.read_line_segmentation(f))
This worksheet illustrates the various processing steps in OCRopus.
There are a number of preprocessing tools available in OCRopus. The recommended ones are:
Both perform deskewing and binarization.
In [2]:
!ocropus-nlbin 0020_0022.png
In [3]:
figsize(12,12)
subplot(121); display("0020_0022.png")
subplot(122); display("0020_0022.bin.png")
As you can see, the output is of much better quality than the input: bleed-through and stains have been removed.
We can also zoom in to find that characters are nicely connected and that there isn't much noise.
In [24]:
orig = ocrolib.read_image_gray("0020_0022.png")
image = ocrolib.read_image_gray("0020_0022.bin.png")
gray(); figsize(12,18)
subplot(121); imshow(orig[:500,:500])
subplot(122); imshow(image[:500,:500])
Out[24]:
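As a rough illustration of what percentile-based binarization involves, here is a minimal sketch in plain Python. It is not the algorithm that ocropus-nlbin implements; the function names, the default percentiles, and the threshold are assumptions chosen for illustration only.

```python
def percentile(values, q):
    """Return the q-th percentile (0..100) of values by rank on the sorted list."""
    ordered = sorted(values)
    index = int((len(ordered) - 1) * q / 100.0)
    return ordered[index]

def binarize(image, lo=5, hi=90, threshold=0.5):
    """Binarize a grayscale image given as a list of rows of floats in [0, 1].

    lo, hi and threshold are illustrative values, not documented defaults.
    """
    flat = [v for row in image for v in row]
    black = percentile(flat, lo)   # estimated ink level
    white = percentile(flat, hi)   # estimated paper level
    span = max(white - black, 1e-6)
    result = []
    for row in image:
        out = []
        for v in row:
            # stretch so that black -> 0 and white -> 1, then clip
            norm = min(max((v - black) / span, 0.0), 1.0)
            out.append(1 if norm > threshold else 0)
        result.append(out)
    return result

# hypothetical 2x3 image: dark ink around 0.2, light paper around 0.85
page = [[0.2, 0.25, 0.8],
        [0.85, 0.22, 0.9]]
binary = binarize(page)
```

Estimating the black and white levels from percentiles rather than from the minimum and maximum makes the stretch less sensitive to a few outlier pixels.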
You can get information on command line options by invoking any program with the --help option.
In [8]:
!ocropus-nlbin --help
The important parameters here are:
-t - needed if the images come out too dark or too light
-b - increase if your images have very large borders, decrease for small images
-m - the maximum skew angle that is corrected (should be within a few degrees at most)
--lo, --hi - percentiles for black/white estimation; set these if your documents are unusually light or dark
--debug t - display intermediate processing steps
There are several page segmentation programs:
The following are available as separate packages:
Page segmentation programs take one or more page image files as input and generate a directory with the basename of that image containing the individual lines.
In [9]:
!rm -rf 0020_0022
!ocropus-gpageseg -Q 0 0020_0022.png
In [10]:
!ls 0020_0022 | head
We can now look at the lines as produced by the page segmenter:
In [11]:
figsize(12,8)
for i in range(8):
    subplot(8,1,i+1)
    display("0020_0022/01%04x.png"%(i+4))
There is currently only one text line recognizer included in OCRopus:
The ocropus-lattices command actually takes character recognizers and line segmenters and combines them into a line recognizer.
Additional recognizers are in preparation:
These produce similar files to ocropus-lattices but have their own internal recognizers.
In [13]:
!ocropus-lattices 0020_0022/??????.png
The output from ocropus-lattices consists of a number of new files:
In [14]:
!ls -1 0020_0022/010023.*
We can look at the segmentation output; to do so, we need to read the segmentation. In the segmentation, each RGB triple is interpreted as a 24 bit integer. The background is white.
To show characters in a distinctive way, we cycle between colors.
In [22]:
figsize(12,3)
displaysegs("0020_0022/010023.rseg.png")
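The label encoding and color cycling described above can be sketched in plain Python. This is only an illustration: it assumes the red channel is the most significant byte, and the palette and function names are hypothetical rather than part of ocrolib.

```python
def rgb_to_label(r, g, b):
    """Pack an 8-bit RGB triple into one 24-bit integer label.

    Assumes red is the most significant byte.
    """
    value = (r << 16) | (g << 8) | b
    # white (0xFFFFFF) is the background; map it to label 0
    return 0 if value == 0xFFFFFF else value

def label_to_color(label, palette=((255, 0, 0), (0, 160, 0), (0, 0, 255))):
    """Pick a display color by cycling through a small (hypothetical) palette."""
    if label == 0:
        return (255, 255, 255)  # keep the background white
    return palette[label % len(palette)]

# one hypothetical row of pixels from a segmentation image
row = [(255, 255, 255), (0, 0, 1), (0, 0, 2), (0, 1, 0)]
labels = [rgb_to_label(*px) for px in row]
colors = [label_to_color(l) for l in labels]
```

Cycling through a short palette means neighboring labels, which usually differ by one, get different colors even when there are hundreds of segments.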
Here is a simple illustration of how we extract characters and character parts from such a segmentation.
In [30]:
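from scipy import ndimage
# read the raw segmentation for this line
rseg = ocrolib.read_line_segmentation("0020_0022/010023.rseg.png")
# one bounding box per label; charboxes[i] belongs to label i+1
charboxes = ndimage.find_objects(rseg)
chars = [rseg[bbox]==i+1 for i,bbox in enumerate(charboxes)]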
figsize(8,8)
ocrolib.showgrid(chars[:64])
The lattice file contains the recognition output.
It is divided into segments and chars. Each segment line represents a range of labels in the raw segmentation. The numbers at the end represent the probability of a space being present or not present after the segment. The chr line represents the negative log probabilities and classes.
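To make the negative log probabilities easier to read, the following sketch converts between costs and probabilities. It assumes natural logarithms, which the text does not state, and all candidate values are made up for illustration; it does not parse the lattice file.

```python
import math

def cost_to_probability(cost):
    """Convert a negative log probability (a cost) back to a probability."""
    return math.exp(-cost)

def probability_to_cost(p):
    """Convert a probability to a negative log probability."""
    return -math.log(p)

# hypothetical (class, cost) candidates for one segment; lower cost is more likely
candidates = [("a", 0.1), ("o", 2.5), ("e", 4.0)]
best_class, best_cost = min(candidates, key=lambda c: c[1])

# hypothetical costs of the chosen classes along three consecutive segments
path = [0.1, 0.3, 0.05]
total_cost = sum(path)                       # costs add ...
joint = cost_to_probability(total_cost)      # ... where probabilities multiply
```

Working with costs turns products of many small probabilities into sums, which avoids numerical underflow on long lines.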
In [31]:
!head 0020_0022/010023.lattice
For example, segment 2 comprises rseg labels 2 and 3.
In [32]:
imshow((rseg>=2)*(rseg<=3))
Out[32]:
The final output from the recognizer is computed via language modeling.
The default language model is based on n-graphs. Such language models are constructed and applied with the ocropus-ngraphs command. Language models take the *.lattice files and output *.txt files and *.cseg.png files.
In [33]:
!ocropus-ngraphs 0020_0022/*.lattice
You can get a good idea of what a language model represents by generating random text from it. This is random text generated by the default language model.
In [34]:
!ocropus-ngraphs --sample 10
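The idea of sampling random text from a character n-graph model can be sketched in a few lines. This toy model is not the ocropus-ngraphs implementation; the training text, function names, and parameters are hypothetical.

```python
import random
from collections import defaultdict, Counter

def train_ngraphs(text, n=3):
    """Count which character follows each (n-1)-character context."""
    counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context, nxt = text[i:i + n - 1], text[i + n - 1]
        counts[context][nxt] += 1
    return counts

def sample(counts, length, seed=0):
    """Generate text by repeatedly drawing a follower of the current context."""
    rng = random.Random(seed)
    out = rng.choice(sorted(counts))   # start from a context seen in training
    order = len(out)
    for _ in range(length):
        followers = counts.get(out[len(out) - order:])
        if not followers:
            break                      # context was never continued in training
        chars = sorted(followers)
        r = rng.uniform(0, sum(followers[c] for c in chars))
        acc = 0
        for c in chars:                # weighted draw proportional to the counts
            acc += followers[c]
            if r <= acc:
                out += c
                break
    return out

training_text = "the cat sat on the mat. "
model = train_ngraphs(training_text, n=3)
generated = sample(model, 20)
```

Every three-character window of the generated text occurs somewhere in the training text, which is why samples from such a model look locally plausible but globally meaningless.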
In [35]:
imshow(imread("0020_0022/01000a.png"))
Out[35]:
In the following note...
In [36]:
!ocropus-ngraphs 0020_0022/01000a.lattice --other 5 --nother 4
!ocropus-ngraphs 0020_0022/01000a.lattice --other 10 --nother 4
!ocropus-ngraphs 0020_0022/01000a.lattice --other 20 --nother 4
In [37]:
!ocropus-ngraphs 0020_0022/01000a.lattice --lweight 0.1
!ocropus-ngraphs 0020_0022/01000a.lattice --lweight 0.5
!ocropus-ngraphs 0020_0022/01000a.lattice --lweight 1.0
!ocropus-ngraphs 0020_0022/01000a.lattice --lweight 2.0
In [38]:
!ls -1 0020_0022/01000a.*
The .cseg.png file contains a segmentation that corresponds directly to the characters in the output file.
In [40]:
cseg = ocrolib.read_line_segmentation("0020_0022/01000a.cseg.png")
figsize(32,12)
subplot(121); imshow((cseg>0)*(cseg%3+1),cmap=cm.gist_stern)
!cat 0020_0022/010023.txt
(Note that in this case, the 'u' and the '.' are misrecognized due to noise; ocropus-lattices needs to remove such noise better than it does right now.)
For training, it is important to obtain isolated characters that can then be used to
train new character models.
This is done with the ocropus-align command.
This command takes ground truth and a recognition lattice,
aligns them, and then outputs a database of individual characters in HDF5 format.
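To give a feel for what aligning ground truth with recognizer output means, here is a string-level sketch. It uses Python's difflib as a stand-in and is not how ocropus-align works, since that command aligns against the lattice and the segmentation. The function name and the misrecognized string are hypothetical.

```python
from difflib import SequenceMatcher

def align(truth, recognized):
    """Return (truth_char, recognized_char) pairs; '' marks a gap."""
    pairs = []
    matcher = SequenceMatcher(None, truth, recognized, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        a, b = truth[i1:i2], recognized[j1:j2]
        # pad the shorter side with gaps so both sides stay in step
        for k in range(max(len(a), len(b))):
            pairs.append((a[k] if k < len(a) else "",
                          b[k] if k < len(b) else ""))
    return pairs

# hypothetical recognizer output with one substituted character
pairs = align("good reason", "gcod reason")
mismatches = [(t, r) for t, r in pairs if t != r]
```

Once each recognized character is paired with its ground-truth character, the corresponding image segment can be stored under the correct class label.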
In [41]:
!echo "against invasion. They appear to have had good reason for" > 0020_0022/010023.gt.txt
In [42]:
# !ocropus-lalign -x test.h5 -g .gt.txt 0020_0022/010023.lattice
!ocropus-align 0020_0022/010023.lattice
In [43]:
!ls -1 0020_0022/010023.*
(The following doesn't quite work yet, since you don't have the ocropus-extract command.)
In [46]:
!ocropus-lattices --extract test.h5 0020_0022/010023.lattice
In [47]:
!h5ls test.h5
In [48]:
import tables
In [51]:
aligned = tables.openFile("test.h5","r")
cls = aligned.root.classes[17]
print cls,chr(cls)
figsize(5,5)
imshow(aligned.root.patches[17],interpolation='nearest')
Out[51]:
The extracted characters are stored in HDF5 format. There are two storage formats: perchar and linerel.
The perchar format normalizes the size of each character individually. This means that perchar recognizers can recognize characters independent of context or position. However, such recognizers are prone to confusions such as o/O and ,/'.
The linerel format is what is usually used. It normalizes the size of each character based on the size of the text line, but then centers the character. The location of the character relative to the rest of the line is indicated by a white bar in the leftmost column, running from the baseline to the xline of the original line.
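The difference between the two normalizations can be illustrated with simple arithmetic. This sketch only computes heights after scaling; the target size, the pixel measurements, and the function names are hypothetical and do not reflect the values OCRopus uses.

```python
TARGET = 32  # hypothetical patch height in pixels

def perchar_height(char_height, target=TARGET):
    """Each character gets its own scale factor, so it always fills the patch."""
    scale = float(target) / char_height
    return int(round(char_height * scale))

def linerel_height(char_height, line_height, target=TARGET):
    """All characters on a line share one scale factor derived from the line."""
    scale = float(target) / line_height
    return int(round(char_height * scale))

# hypothetical measurements: lowercase 'o' is 12 px tall, uppercase 'O' is 20 px,
# and the text line they come from is 24 px tall
small_o, big_o, line = 12, 20, 24

perchar = (perchar_height(small_o), perchar_height(big_o))
linerel = (linerel_height(small_o, line), linerel_height(big_o, line))
```

With per-character scaling both letters end up the same height, which is the source of the o/O confusion; with line-relative scaling the size difference survives.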
In [52]:
figsize(12,12)
for i,p in enumerate(aligned.root.patches[:64]):
    subplot(8,8,i+1)
    xticks([]); yticks([])
    xlabel(chr(aligned.root.classes[i]))
    imshow(p,interpolation='nearest')