We are faced with a problem. There is now a machine-readable text, but it is littered with material we do not want. To create their edition, Bouwman et al. added line numbers, an English translation, and a plethora of footnotes and annotations. There are also remnants of the print book: page numbers, page headers, etc. If a full computational edition of the Reynaert edition by Bouwman et al. were the goal of this project, we might actually be interested in these elements and want to capture them. Here, however, I am interested in the meticulous reproducibility of the scholarly process of editing through code and in reading the text of the Reynaert through code. Capturing the book metaphor that is 'inbuilt' in the existing edition is less relevant to this direct aim, though it would be an interesting later project to pursue the mise en abyme of computationally creating the digital scholarly edition of the scholarly print edition. But right now the task is to separate the Middle Dutch verses from the rest.
The approach I take is rather straightforward. I will read in the text line by line and check whether each line matches a certain model. So we will need models, and some machinery to match the text against those models. That piece of machinery we usually call a parser. The models we will just call models.
So we know that we will need several models: pieces of code that can recognize footnotes, empty lines, page numbers, etc. for us. This is a case where we have several objects of the category model, and each model must be able to match itself against each line of a text. We express this commonality via a superclass Model. All concrete models for matching will be variants of it.
In [1]:
class Model
  # Determines if the model matches a line of text.
  # By default it returns false because it doesn't match anything.
  def matches( line )
    false
  end
end
Out[1]:
But wait... not all models apply to exactly one line. Remember those footnotes? Those are multiline phenomena. We will need some way of registering or knowing that a model is terminated. Thus we add a variable, and a way to read it, to the superclass. Each derived concrete model will then have the ability to 'know' by which other models it is terminated.
In [2]:
class Model
  # A class instance variable that holds a list of
  # other models that terminate this model.
  @terminators = nil
  def self.terminators
    @terminators
  end
  # Determines if the model matches a line of text.
  # By default it returns false because it doesn't match anything.
  def matches( line )
    false
  end
end
Out[2]:
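As a minimal sketch of how a derived model might declare its terminators: the classes BlankLine and NoteBody below are hypothetical, invented for this illustration only. One thing to be aware of is that @terminators is a class instance variable, which Ruby does not share with subclasses. Each subclass that wants terminators has to assign its own list, and every other subclass simply reports nil.

```ruby
# Same superclass as above, repeated so that this sketch runs on its own.
class Model
  @terminators = nil

  def self.terminators
    @terminators
  end

  def matches( line )
    false
  end
end

# Hypothetical single-line model for a blank line.
class BlankLine < Model
  def matches( line )
    line.match( /^\s*$/ ) != nil
  end
end

# Hypothetical multiline model: once it has started, it is meant to
# stay active until a BlankLine is matched.
class NoteBody < Model
  @terminators = [ BlankLine ]

  def matches( line )
    line.match( /^\d+ / ) != nil
  end
end

p Model.terminators      # nil
p BlankLine.terminators  # nil, nothing is inherited from Model
p NoteBody.terminators   # [BlankLine]
```

Keeping the list on the class rather than on each instance means a parser can ask a model's class for its terminators without needing the instance that did the matching.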
Now we need several concrete models that enable us to categorize lines in the text. Looking at the text we see a number of 'types' of lines that we don't need: lines that contain only numbers (page numbers or verse numbers), lines that are set entirely in capitals and that coincide with page headers and chapter headings, lines belonging to footnotes, and lastly empty lines. We can express this by creating concrete model classes that implement the matches method of the superclass in specific ways. Thus we end up with four models (AllCaps, FootNote, Numbers, and Empty) that each use a different regular expression to match the text surfaces that are typical for each type of line. You'll find these regular expressions between forward slashes in each class below (e.g. /[[:upper:]]/, which matches upper case letters). If you have not encountered such expressions before they may seem hermetic, but with a bit of study they become sufficiently understandable.
In [3]:
# Matches a line that contains capitals and no lower case letters.
class AllCaps < Model
  def matches( line )
    !!line.match( /[[:upper:]]/ ) && !line.match( /[[:lower:]]/ )
  end
end
# Matches a line starting with at least one digit, followed by a dash or a space.
class FootNote < Model
  def matches( line )
    line.match( /^\d+(-| )(.+)$/ ) != nil
  end
end
# Matches a line containing only numbers. 'o' (lower case letter o) is also
# accepted as the OCR frequently misreads 0 as o.
class Numbers < Model
  def matches( line )
    line.match( /^[\do]+$/ ) != nil
  end
end
# Matches an empty line.
class Empty < Model
  def matches( line )
    line.match( /^\s*$/ ) != nil
  end
end
Out[3]:
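To get a feeling for what these four expressions accept, the sketch below applies them to a handful of sample lines. It is an illustration only: the tests are rewritten as lambdas so that the block runs on its own, the helper matching_models is a hypothetical name, and all sample lines except the verse are invented.

```ruby
# The same four tests as in the classes above, written as lambdas
# so that this sketch does not depend on the Model classes.
checks = {
  'AllCaps'  => ->( line ) { !!line.match( /[[:upper:]]/ ) && !line.match( /[[:lower:]]/ ) },
  'FootNote' => ->( line ) { line.match( /^\d+(-| )(.+)$/ ) != nil },
  'Numbers'  => ->( line ) { line.match( /^[\do]+$/ ) != nil },
  'Empty'    => ->( line ) { line.match( /^\s*$/ ) != nil }
}

# Hypothetical helper: returns the names of all tests that accept the line.
matching_models = lambda do |line|
  checks.select { |_name, check| check.call( line ) }.keys
end

samples = [
  'PROLOGUE',                    # invented page header
  '1o5',                         # invented page number with 0 misread as o
  '',                            # empty line
  '12 An invented footnote.',    # invented footnote
  'daer hi dicken omme waecte,'  # a verse from the Reynaert
]

samples.each do |line|
  puts "#{line.inspect} => #{matching_models.call( line ).inspect}"
end
```

The verse is the only sample that none of the tests accepts, which is exactly the property the parser relies on: whatever no model claims is kept.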
All the line types we have seen until now have some recognizable surface features (they're empty, contain numbers, and so forth). When it comes to telling apart "dat die avonture van Reynaerde" from "that the tales of Reynaert", there are no visual clues at the surface of the text. We need some more knowledge to identify the former as Middle Dutch and the latter as English. The 'English model' is therefore somewhat more complicated than the other classes. It does not need to get as complicated as using sophisticated natural language processing (NLP) software packages. An admittedly naive but straightforward approach is to use a list of English stop words. If more than 20% of the words on a line are such stop words (put differently: if the line passes a threshold of 0.2), we can safely assume that the line is in English. There are some subtleties worth noting; these are pointed out in the commentary within the code of the class below.
In [4]:
class English < Model
  # Whenever this model is created it sets up its instance variables,
  # e.g. the list of English stop words (@stopwords), which is read from a file.
  def initialize
    @stopwords = File.read( './resources/stopwords_en.txt' ).split( "\n" )
    @threshold = 0.2
  end
  # Sets threshold, 0.2 (20%) by default.
  def threshold=( new_threshold )
    @threshold = new_threshold
  end
  # Some words look like "been." or "her?", we strip the punctuation to make sure we
  # don't miss any English words while matching them ("been." for a computer is
  # obviously not the same as "been").
  def strip_embracing_punctuation( token )
    token.gsub( /[\.:;'“‘’”?!\(\),]+$|^[\.:;'“‘’”?!\(\),]+/, '' )
  end
  # This computes the 'English score' for a line.
  # The line is first split into its individual tokens.
  # Then we count all English stopwords with a weight of 1.
  # Finally we compute the relative score, that is: the count of English
  # words divided by the total number of tokens on the line.
  def score( string )
    tokens = string.split( /\s+/ )
    # A line without tokens scores 0.0; this also avoids dividing by zero.
    return 0.0 if tokens.empty?
    score = 0.0
    tokens.each do |token|
      stripped = strip_embracing_punctuation( token )
      score += 1.0 if @stopwords.include?( stripped.downcase )
    end
    score / tokens.size
  end
  # The standard match function that all models must provide.
  # We say a line is English if the score computed above is larger
  # than the threshold of 0.2. (Thus if more than 20% of the tokens could be English.)
  def matches( line )
    score( line ) > @threshold
  end
end
Out[4]:
Now that we have all these models we need something that will take an actual text, set the models loose on it, and return just the Middle Dutch verses that we were looking for. This piece of machinery we will call the OCRParser. The OCRParser class reads a text from a file (method load_text) and splits it on line breaks (method text=). It then delegates the matching of lines to the models described above in the method match_lines.
The method parse_tuples takes into account that models may span multiple lines. It keeps track of which multiline models are active and checks whether a newly matched model terminates any of them. It then yields every line together with an indicator of whether the line is accepted (true when no model matched it and no multiline model is active, false otherwise) and the list of models that matched the line.
The parse method filters that result and returns only those lines that did not answer to any model, which should be only the Middle Dutch verses.
In [5]:
class OCRParser
  attr_accessor :models
  def text=( text )
    @text = text
    @lines = text.split( "\n" )
  end
  def load_text( file_path )
    self.text = File.read( file_path )
  end
  def match_lines
    @lines.each do |line|
      matches = []
      @models.each do |model|
        if model.matches( line )
          matches.push( model.class )
        end
      end
      yield line, matches
    end
  end
  def parse_tuples
    active_multiline_models = []
    match_lines do |line, matches|
      matches.each do |model|
        active_multiline_models.reject! do |active_multiline_model|
          active_multiline_model.terminators.include? model
        end
        if model.terminators != nil
          active_multiline_models.push( model )
        end
      end
      if matches.size == 0 && active_multiline_models.size == 0
        yield true, line, matches
      else
        yield false, line, matches
      end
    end
  end
  def parse
    tuples = []
    parse_tuples { | accept, line | tuples.push line if accept }
    tuples
  end
end
Out[5]:
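The multiline bookkeeping in parse_tuples can be traced on a toy input. The sketch below repeats the same loop outside the parser, under stated assumptions: NoteStart and Blank are hypothetical models invented for this trace, and instead of running regular expressions the list of matching models is simply given per line.

```ruby
# Minimal stand-ins, repeated so that this sketch runs on its own.
class Model
  @terminators = nil

  def self.terminators
    @terminators
  end
end

# Hypothetical model for a blank line; it has no terminators.
class Blank < Model
end

# Hypothetical multiline model that is ended by a blank line.
class NoteStart < Model
  @terminators = [ Blank ]
end

# Each entry pairs a line with the model classes assumed to match it.
matched = [
  [ '1 An invented note that starts here', [ NoteStart ] ],
  [ 'and continues on a second line.',     [] ],
  [ '',                                    [ Blank ] ],
  [ 'daer hi dicken omme waecte,',         [] ]
]

active = []
accepted = []
matched.each do |line, matches|
  matches.each do |model|
    # A newly matched model first ends every active model it terminates,
    active.reject! { |running| running.terminators.include?( model ) }
    # and then becomes active itself if it declares terminators.
    active.push( model ) unless model.terminators.nil?
  end
  # A line is accepted only if nothing matched and nothing is still active.
  accepted.push( line ) if matches.empty? && active.empty?
end

p accepted
```

The second line is matched by no model at all, yet it is still rejected because NoteStart is active at that point. Only after the blank line has terminated the note does an unmatched line count as a verse again.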
In [6]:
text = OCRParser.new
text.load_text( './resources/Bouwman_ Of Reynaert the Fox.txt' )
text.models = [ Empty.new, Numbers.new, FootNote.new, AllCaps.new, English.new ]
parsed = text.parse()
puts parsed.join( "\n" )
But wait! That is not correct. Anyone who knows the Reynaert will spot that something is off already in the first few lines. The Reynaert (in the Comburg manuscript) reads:
Willem die Madocke maecte, [192va,22]
daer hi dicken omme waecte,
hem vernoyde so haerde
dat die avonture van Reynaerde
in Dietsche onghemaket bleven
— die Arnout niet hevet vulscreven —
dat hi die vijte dede soucken
ende hise na den Walschen boucken
in Dietsche dus hevet begonnen.
But our parser gives us:
Willem die Madocke maecte, [192va,22]
daer hi dicken omme waecte,
dat die avonture van Reynaerde
— die Arnout niet hevet vulscreven —
dat hi die vijte dede soucken
ende hise na den Walschen boucken
in Dietsche dus hevet begonnen.
It missed the lines "hem vernoyde so haerde" and "in Dietsche onghemaket bleven". Apparently it also held on to some English: a long part of a footnote is found lodged within the text: "cart, leaving nothing but the bones ofone single fish (cf. p. 31—32)." What happened?
On closer inspection it turns out that "hem vernoyde so haerde" contains a Middle Dutch word that is also an English stop word ('so'). Because the verse is so short, one hit in four tokens gives it a relative English score of 0.25, which exceeds the threshold of 0.2. The same presumably holds for "in Dietsche onghemaket bleven", where 'in' counts as English. Conversely, the English footnote text has an OCR misreading ("ofone") that 'hides' two English stop words, which keeps its score under the threshold.
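The arithmetic behind this can be checked with a small sketch, using the 0.2 threshold set in the English class. It makes two assumptions: the six-word stop word list is a hypothetical stand-in for the real file, and the last two samples are only a fragment of the footnote line, chosen to show the effect of the fused "ofone".

```ruby
# Hypothetical mini stop word list; the real list is read from a file.
stopwords = %w[ so in of one the but ]
threshold = 0.2

# Same punctuation stripping as in the English class.
strip = lambda do |token|
  token.gsub( /[\.:;'“‘’”?!\(\),]+$|^[\.:;'“‘’”?!\(\),]+/, '' )
end

# Relative score: stop word hits divided by the number of tokens.
score = lambda do |line|
  tokens = line.split( /\s+/ )
  hits = tokens.count { |token| stopwords.include?( strip.call( token ).downcase ) }
  hits.to_f / tokens.size
end

[
  'hem vernoyde so haerde',
  'in Dietsche onghemaket bleven',
  'in Dietsche dus hevet begonnen.',
  'bones ofone single fish',
  'bones of one single fish'
].each do |line|
  s = score.call( line )
  verdict = s > threshold ? 'English' : 'kept'
  puts format( '%-34s %.2f  %s', line, s, verdict )
end
```

Under these assumptions the two missing verses each score 1/4 = 0.25 and are rejected as English, while "in Dietsche dus hevet begonnen." scores 1/5 = 0.2, which is not larger than the threshold, so it survives. The fragment with "ofone" scores 0.0, whereas the same fragment with "of one" correctly separated scores 2/5 = 0.4.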
Clearly our English parsing model is not up to par yet. We will have to iterate the code through a new development cycle to improve the performance.