Chapter 3 — On Iterations

Writing code has a lot in common with authoring text. For instance, you only rarely get it right the first time. What authors would call rewriting, scrapping, polishing, starting anew, editing, and so forth is called "refactoring" or "doing another iteration" in IT terms, although you will hear plain and simple "rewriting" as well.

When writing the code for this experiment, parts of it often had to be rewritten. During these iterations it was not always clear which were a purely technical matter (e.g. fixing a typo so that "parser.pase()" would no longer simply error out due to a missing 'r') and which were also tied to scholarly action (e.g. improving the model for detecting English lines of the text). This was important to discern, as the rules of this experiment state that all scholarly actions should be reproducible. Including in this notebook every state of the code after each iteration would be rather tedious, pretty boring, and not very informative. It seems reasonable to silently accept iterations that are code oriented, that is: those rewritings that make the code technically more adequate or more efficient but that do not change the result of the heuristic itself. A good example is a method extraction. Suppose you have the following code.


In [1]:
puts "The byciclist rides on the bike".gsub( "the", "<part>the</part>" )
puts "The train conductor asks for the tickets".gsub( "the", "<part>the</part>" )
puts "The train conductor asks for the tickets".gsub( "the", "<part>the</part>" )


The byciclist rides on <part>the</part> bike
The train conductor asks for <part>the</part> tickets
The train conductor asks for <part>the</part> tickets

This code might be part of some application that marks particles for teaching purposes. But the marking is spelled out explicitly in every case, which makes the code harder to maintain: if we wanted to change the label of the particles we would have to change it in six places, giving us as many if not more occasions to err. This is why you would refactor code like this to use a method (function) that is called each time we need to write a label.


In [2]:
def mark_part( string )
  string.gsub( "the", "<part>the</part>" ) 
end 

puts mark_part( "The byciclist rides on the bike" )
puts mark_part( "The train conductor asks for the tickets" )
puts mark_part( "The train conductor asks for the tickets" )


The byciclist rides on <part>the</part> bike
The train conductor asks for <part>the</part> tickets
The train conductor asks for <part>the</part> tickets

Now if we want to change the label, we can change it in one place and be sure that all instances will behave the same. Note however that the performance of the code, its output, is still the same (as you can confirm by running the code).
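That the output stays the same can also be checked mechanically. The following is a minimal sketch, not part of the experiment's code: it collects the output of the explicit version and of the method-based version and compares the two. The variable names `before` and `after` are introduced here purely for illustration.

```ruby
# The method extracted during the refactoring (same body as above).
def mark_part( string )
  string.gsub( "the", "<part>the</part>" )
end

sentences = [
  "The byciclist rides on the bike",
  "The train conductor asks for the tickets"
]

# Output of the original code: the label is written out at every call site.
before = [
  "The byciclist rides on the bike".gsub( "the", "<part>the</part>" ),
  "The train conductor asks for the tickets".gsub( "the", "<part>the</part>" )
]

# Output of the refactored code: every call goes through mark_part.
after = sentences.map { |sentence| mark_part( sentence ) }

# A code-oriented iteration should leave this comparison true.
puts before == after
```

If the comparison ever turned out false, the iteration would have changed the heuristic and would no longer count as a purely technical one.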

Similarly, I have made many iterations and improvements to the code described in this notebook: they improved the structure, maintainability, and readability of the code, or simply solved bugs that prevented successful execution. But those refactorings did not alter the actual resulting performance.1 I have chosen not to represent all these (often tiny) iterations. However, when a refactoring essentially changes the heuristics of the code, that iteration should be reported as a matter of scholarly completeness. Thus if I were to change the above code to…


In [3]:
def mark_part( string )
  string.gsub( /(t|T)he/, '<part>\1he</part>' ) 
end 

puts mark_part( "The byciclist rides on the bike" )
puts mark_part( "The train conductor asks for the tickets" )
puts mark_part( "The train conductor asks for the tickets" )


<part>The</part> byciclist rides on <part>the</part> bike
<part>The</part> train conductor asks for <part>the</part> tickets
<part>The</part> train conductor asks for <part>the</part> tickets

…I am changing the heuristics of its task. Instead of just looking for "the" and marking it, I am now looking for either "the" or "The" and marking those. This involves, in the case of this notebook, scholarly decisions on how textual material should be interpreted. These scholarly decisions should be represented as part of the reproducibility of scholarly effort and workflow.
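One way to tell such an iteration apart from a purely technical one is to run both versions on the same input and see whether any output differs. The sketch below does that for the two versions of the marking method. The names `mark_part_v1` and `mark_part_v2` and the third sentence are hypothetical and added only for this illustration.

```ruby
# First version: marks only the lowercase "the".
def mark_part_v1( string )
  string.gsub( "the", "<part>the</part>" )
end

# Second version: marks "the" as well as "The".
def mark_part_v2( string )
  string.gsub( /(t|T)he/, '<part>\1he</part>' )
end

sentences = [
  "The byciclist rides on the bike",
  "The train conductor asks for the tickets",
  "A fox sees the hen"   # hypothetical sentence without a capitalised "The"
]

# Keep the sentences for which the two versions disagree.
changed = sentences.select do |sentence|
  mark_part_v1( sentence ) != mark_part_v2( sentence )
end

puts "#{changed.size} of #{sentences.size} sentences are marked differently"
changed.each { |sentence| puts mark_part_v2( sentence ) }
```

Any sentence that ends up in `changed` is evidence that the heuristic itself was altered, which is the kind of iteration that has to be reported.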

A scholarly iteration: improving the English model

When we ran the code we found that it marks a number of lines as English that really are not, and that it qualifies certain lines as Middle Dutch that are not. Clearly some refactoring is needed to remedy this. Before we can do so, however, we need to recreate the models and code that we already had. That is the reason for the next, somewhat hermetic, lines of code. They figure out where we are on the file system, then load the models and parser of the prior chapter.


In [4]:
require File.join(File.dirname(__FILE__), '../lib/ocr_parse_models')
require File.join(File.dirname(__FILE__), '../lib/ocr_parser')


Out[4]:
true

There are a number of words in the English stop words list that are homonymous with Middle Dutch vocabulary. If we count these fully towards English then we are counting too many terms as English. At the same time we do not want to count too many truly English terms as Middle Dutch. Let's give these terms a weight of 0.4 to see if things improve.


In [5]:
class EnglishSecondIteration < Model

  # Whenever this model is created/used it loads a number of variables,
  # e.g. the list of English stop words (@stopwords), which is read from a file.
  # There are English stopwords that are homonyms of Middle Dutch words. We do not
  # want to count these fully as English, therefore we subtract the set of these
  # words (@may_be_middle_dutch) from the set of English stopwords.
  def initialize
    @may_be_middle_dutch = [ "an", "as", "been", "by", "have", "he", "her", 
      "here", "i", "in", "is", "me", "mine", "no", "so", "over", "was", "we" ]
    @stopwords = File.read( './resources/stopwords_en.txt' ).split( "\n" ) - @may_be_middle_dutch
    @threshold = 0.2
  end

  # Sets threshold, 0.2 (20%) by default.
  def threshold=( new_threshold )
    @threshold = new_threshold
  end

  # Some words look like "been." or "her?", we strip the punctuation to make sure we 
  # don't miss any English words while matching them ("been." for a computer is 
  # obviously not the same as "been").
  def strip_embracing_punctuation( token )
    return token.gsub(/[\.:;'“‘’”?!\(\),]+$|^[\.:;'“‘’”?!\(\),]+/, '')
  end

  # This computes the 'English score' for a line.
  # The line is first split into its individual tokens.
  # Then we count all English stopwords with a weight of 1.
  # Words that *might* be English but *could* also be Middle Dutch
  # are counted too, but with a weight of 0.4.
  # Finally we compute the relative score, that is: the weighted count of
  # English words divided by the total number of tokens on the line.
  def score( string )
    score = 0.0
    tokens = string.split( /\s+/ )
    tokens.each do |token|
      stripped = strip_embracing_punctuation( token )
      score += 1.0 if @stopwords.include?( stripped.downcase )
      score += 0.4 if @may_be_middle_dutch.include?( stripped.downcase )
    end
    score/tokens.size()
  end

  # The standard match function that all models must provide.
  # We say a line is English if the score computed above is larger
  # than the threshold of 0.2. (Thus if 20% of the tokens could be English.)
  def matches( line )
    score( line ) > @threshold
  end

end


Out[5]:
:matches
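To make the arithmetic of the weighted score concrete, here is a small worked sketch. It does not use the model class itself: the two word lists are hypothetical miniatures (the real model reads its stop words from './resources/stopwords_en.txt'), and the lambdas `strip` and `score` only mirror the logic of strip_embracing_punctuation() and score() above. The third example line is made up for the illustration.

```ruby
# Hypothetical miniature word lists, for illustration only.
stopwords = [ "the", "of", "and", "for" ]
may_be_middle_dutch = [ "was", "so", "in", "me", "by" ]
threshold = 0.2

# Same punctuation stripping as in the model.
strip = lambda do |token|
  token.gsub( /[\.:;'“‘’”?!\(\),]+$|^[\.:;'“‘’”?!\(\),]+/, '' )
end

# Weighted score: 1.0 per stop word, 0.4 per ambiguous word,
# divided by the number of tokens on the line.
score = lambda do |line|
  tokens = line.split( /\s+/ )
  total = tokens.sum( 0.0 ) do |token|
    word = strip.call( token ).downcase
    if stopwords.include?( word )
      1.0
    elsif may_be_middle_dutch.include?( word )
      0.4
    else
      0.0
    end
  end
  total / tokens.size
end

examples = [
  "was so extremely annoyed",   # two ambiguous words out of four tokens
  "King Nobel holds court",     # no listed words at all
  "the bones of the fish"       # made-up line: three stop words out of five tokens
]

examples.each do |line|
  value = score.call( line )
  puts "#{line.inspect}: score #{value.round( 2 )}, English? #{value > threshold}"
end
```

Under these assumed lists the first line scores (0.4 + 0.4) / 4 = 0.2. That is not larger than the threshold of 0.2, so the line is not marked as English, even though it plainly is. Whether the real stop word list gives the same result depends on its contents.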

So, let's fire up the parser once again and see how well our second iteration English model performs.


In [6]:
text = OCRParser.new
text.load_text( './resources/Bouwman_ Of Reynaert the Fox.txt' )
text.models = [ Empty.new, Numbers.new, FootNote.new, AllCaps.new, EnglishSecondIteration.new ]
parsed = text.parse()
puts parsed.join( "\n" )


Willem die Madocke maecte, [192va,22]
daer hi dicken omme waecte,
hem vernoyde so haerde
dat die avonture van Reynaerde
in Dietsche onghemaket bleven
— die Arnout niet hevet vulscreven —
dat hi die vijte dede soucken
ende hise na den Walschen boucken
in Dietsche dus hevet begonnen.
God moete ons ziere hulpen jonnen!
Nu keert hem daertoe mijn zin
dat ic bidde in dit beghin
beede den dorpren enten doren,
ofte si commen daer si horen
dese rijme ende dese woort
(die hem onnutte sijn ghehoort),
dat sise laten onbescaven.
Te vele slachten si den raven,
die emmer es al even malsch.
Si maken sulke rijme valsch,
daer si niet meer of ne weten [192vb]
dan ic doe hoe dat si heeten
die nu in Babilonien leven.
Daden si wel, si soudens begheven.
Dat en segghic niet dor minen wille.
Mijns dichtens ware een ghestille,
ne hads mi eene niet ghebeden
die in groeter hovesscheden
Prologue
was so extremely annoyed
remained unwritten in Dutch
gherne keert hare saken.
Soe bat mi dat ic soude maken
dese avontuere van Reynaerde.
Al begripic die grongaerde
ende die dorpren ende die doren,
ic wille dat dieghene horen
die gherne pleghen der eeren
ende haren zin daertoe keeren
dat si leven hoofschelike,
sijn si arem, sijn si rike,
diet verstaen met goeden sinne.
Nu hoert hoe ic hier beghinne!
Het was in eenen tsinxendaghe
dat beede bosch ende haghe
met groenen loveren waren bevaen.
Nobel die coninc hadde ghedaen
sijn hof crayeren overal,
dat hi waende, hadde hijs gheval,
houden ten wel groeten love.
Doe quamen tes sconinx hove
alle die diere, groet ende cleene,
sonder vos Reynaert alleene.
Hi hadde te hove so vele mesdaen
dat hire niet dorste gaen.
Die hem besculdich kent, ontsiet.
Also was Reynaerde ghesciet
ende hieromme scuwedi sconinx hof,
daer hi in hadde crancken lof.
Doe al dat hofversamet was,
was daer niemen, sonder die das,
hi ne hadde te claghene over Reynaerde,
King Nobel holds court
Whoever is knowingly guilty, is afraid.
den fellen metten grijsen baerde.
Nu gaet hier up eene claghe.
Isingrijn ende sine maghe
ghinghen voer den coninc staen. [193ra]
Ysengrijn begonste saen
ende sprac: ‘Coninc heere,
dor hu edelheit ende dor hu eere
ende dor recht ende dor ghenade,
ontfaerme hu miere scade
die mi Reynaert heeft ghedaen,
daer ic af dicken hebbe ontfaen
groeten lachter ende verlies.
Voer al dandre ontfaerme hu dies
dat hi mijn wijfhevet verhoert
ende mine kindre so mesvoert
dat hise beseekede daer si laghen,
datter twee noint ne saghen
ende si worden staerblent.
Nochtan hoendi mi sent.
Het was sint so verre comen
datter eenen dach af was ghenomen
ende Reynaerd soude hebben ghedaen
sine onsculde. Ende also saen
alse die heleghe waren brocht,
was hi andersins bedocht
ende ontfoer ons in sine veste.
Heere, dit kennen noch die beste
die te hove zijn commen hier.
Mi hevet Reynaert, dat felle dier,
inflicted upon me by Reynaert,
so vele te leede ghedaen,
ic weet wel al sonder waen:
al ware al tlaken paerkement
dat men maket nu te Ghent,
inne ghescreeft niet daeran.
Dies zwijghics nochtan,
neware mijns wives lachter
ne mach niet bliven achter,
no onversweghen no onghewroken.’
Doe Ysengrijn dit hadde ghesproken,
stont up een hondekijn, hiet Cortoys,
ende claghede den coninc in Francsoys
hoet so arem was wijleneere
dat alles goets en hadde meere
in eenen winter, in eene vorst,
dan alleene eene worst
ende hem Reynaert, die felle man, [193rb]
die selve worst stal ende nam.
Tybeert die cater die wart gram.
Aldus hi sine tale began
ende spranc midden in den rinc
ende seide: ‘Heere coninc,
dordat ghi Reynaerde zijt onhout,
so en es hier jonc no hout,
hi ne hebbe te wroughene jeghen hu.
Dat Cortoys claghet nu,
dats over menich jaer ghesciet.
Die worst was mine, al en claghic niet.
Ic hadse bi miere lust ghewonnen
daer ic bi nachte quam gheronnen
omme bejach in eene molen,
daer ic die worst in hadde ghestolen
eenen slapenden molenman.
Hadder Cortoys yewet an,
happened many a year ago.
dan was bi niemene dan bi mi.
Hets recht dat omberecht zi
die claghe die Cortoys doet.’
Pancer de bever sprac: ‘Dinct hu goet,
Tybeert, dat men die claghe ombeere?
Reynaert es een recht mordeneere
ende een trekere ende een dief.
Hi ne heeft oec niemene so lief,
no den coninc, minen heere,
hi ne wilde dat hi lijf ende eere
verlore, mochtire an winnen
een vet morzeel van eere hinnen.
Wat sechdi van eere laghe?
En dedi ghistren in den daghe
eene die meeste overdaet
an Cuwaerde den hase, die hier staet,
die noyt eenich dier ghedede?
Want hi hem binnen sconinX vrede
ende binnen des coninX gheleede
ghelovede te leerne sinen crede
ende soudene maken capelaen.
Doe dedine sitten gaen
vaste tusschen sine beene.
Doe begonsten si overeene
spellen ende lesen beede [193va]
ende lude te zinghene crede.
Mi gheviel dat ic te dien tijden
ter selver stede soude lijden.
Doe hoerdic haerre beeder sanc
ende maecte daerwaert minen ganc
met eere arde snelre vaerde.
Doe vandic daer meester Reynaerde,
die ziere lessen hadde begheven
die hi tevoren up hadde gheheven,
ende diende van sinen houden spelen
ende hadde Coewaerde bi der kelen
ende soude hem thoeft afhebben ghenomen
waer ic hem niet te hulpen comen
bi avontueren in dien stonden.
Siet hier noch die verssche wonden
ende die teekine, heere coninc,
die Coewaert van hem ontfinc.
Laetti dit bliven onghewroken,
dat hu verde dus es tebroken,
ghi ne wreket als huwe mannen wijsen,
men saelt huwen kindren mesprijsen
hiernaer over wel menich jaer.’
‘Bi Gode, Pancer, ghi secht waer,’
sprac Ysengrijn daer hi stoet.
‘Heere, waer Reynaerd doot, het waer ons goet,
also behoude mi God mijn leven.
Neware wert hem dit vergheven,
hi sal noch hoenen binnen eere maent
sulken dies niet ne bewaent.’
Doe spranc up Grinbert die das,
die Reynaerts broedersone was,
met eere verbolghenlike tale:
‘Heere Ysengrijn, men weet dat wale
ende hets een hout bijspel:
viants mont seit selden wel.
Verstaet, neemt miere talen goem:
ic wilde, hi hinghe an eenen boem
bi ziere kelen als een dief
die andren heeft ghedaen meest grief.
as evidence, lord king,
‘Lord Ysingrijn, as everyone surely knows
Heere Ysengrijn, wildi angaen
soendinc ende dat ontfaen,
daertoe willic helpen gherne. [193vb]
Mijn oem en saelt hem oec niet wernen.
Entie meest andren heeft mesdaen
sal den andren in baten staen
van minen oem ende van hu.
Al comt hi niet claghen nu,
ware mijn oem wel te hove
ende stonde in sconinx love,
heere Ysengrijn, als ghi doet,
en soude den coninc niet dincken goet
ende ghi ne bleves heden onbegrepen,
dat ghi sijn vel so hebt ghenepen
so dicwile met huwen scerpen tanden,
dat hi niet ne conde ghehanden.’
Ysengrijn sprac: ‘Hebdi gheleert
an huwen oem dus lieghen apeert?’
‘In hebbe daeran niet gheloghen.
Ghi hebt minen oem bedroghen
arde dicke in menegher wijsen.
Ghi mesleettene van den pladijse
die hi hu warp van der kerren,
doe ghi hem volghet van verren
ende ghi die beste pladijse uplaset,
daer ghi hu ane hadt versadet.
Ghi ne gaeft hem no goet no quaet,
sonder alleene eenen pladijsengraet
dat ghi hem te jeghen brocht,
dordat ghine niet en mocht.
Sint hoendine van eenen bake
die vet was ende van goeder smake,
dien ghi leit in huwen muzeele.
cart, leaving nothing but the bones ofone single fish (cf. p. 31—32).
Doe Reynaert heesschede zijn deele,
andwoerdi hem in scerne:
“Hu deel willic hu gheven gherne,
Reynaert, scone jonghelinc!
Die wisse daer die bake an hinc,
becnause, so es so vet.”
Reynaerde waes lettel te bet
dat hi den goeden bake ghewan
in sulker zorghen, dattene een man
vinc ende warpene in sinen zac.
Dese pine ende dit onghemac
hevet hi leden dor Ysengrijne [194ra]
ende ondert waerven meer dan ic hu rijme.
Ghi heeren, dinct hu dit ghenouch?
Nochtan om meer onghevouch
dat hi claghet om sijn wijf,
die Reynaerde hevet al haer lijf
ghemint; so doet hi hare.
Al ne makeden zijt niet mare,
ic dart wel segghen over waer
dat langher es dan VII jaer
dat Reynaert hevet hare trauwe.
Omdat Haersint, die scone vrouwe,
dor minne ende dor quade zede
Reynaert sinen wille dede,
wattan? So was sciere ghenesen.
Wat talen mach daeromme wesen?
Nu maket heere Cuwaert, die hase,
eene claghe van eere blase.
Of hi den credo niet wel en las,
Reynaerd, die zijn meester was,
mochte hi sinen clerc niet blauwen?
Dat ware onrecht, entrauwen.
Reynaert, my dear young man!
accommodated Reynaert
Now Lord Cuwaert, the hare,
Cortoys claghet om eene worst
die hi verloes in eene vorst.
Die claghe ware bet verholen:
ende hoerdi dat so was ghestolen?
Male quesite male perdite:
over rechtwert men qualike quite
dat men hevet qualic ghewonnen.
Wie sal Reynaerde dat verjonnen
Niemen die recht versceeden can.
Reynaert es een gherecht man.
Sint dat die coninc sinen ban
hevet gheboden ende sinen vrede,
so weetic wel dat hi ne dede
dinc negheene dan of hi ware
hermite ofte clusenare.
Naest siere huut draecht hi een hare.
Binnen desen naesten jare
so ne hat hi vleesch, no Wilt no tam.
Dat seidi die ghistren danen quam.
Malcroys hevet hi begheven, [194rb]
sinen casteel, ende hevet upheven
eene cluse daer hi leghet in.
Ander bejach no ander ghewin
so wanic wel dat hi ne hevet
dan karitate die men hem ghevet.
Bleec es hi ende magher van pinen.
Hongher, dorst, scerpe karijnen
doghet hi voer sine zonden.’
Recht te desen selven stonden,
doe Grimbert stont in dese tale,
saghen si van berghe te dale
Canticler commen ghevaren,
ende brochte up eene bare
eene doode hinne ende hiet Coppe,

Alright, that seems to yield more Middle Dutch lines in any case.

Willem die Madocke maecte, [192va,22]
daer hi dicken omme waecte,
hem vernoyde so haerde
dat die avonture van Reynaerde
in Dietsche onghemaket bleven
— die Arnout niet hevet vulscreven —
dat hi die vijte dede soucken
ende hise na den Walschen boucken
in Dietsche dus hevet begonnen.

But unfortunately also more English lines, as in:

die in groeter hovesscheden
Prologue
was so extremely annoyed
remained unwritten in Dutch
gherne keert hare saken.

or:

Mi hevet Reynaert, dat felle dier,
inflicted upon me by Reynaert,
so vele te leede ghedaen,

Intuitively this seems logical: if we allow fewer words to be identified decisively as English, then fewer sentences will end up being identified as such. But how can we further improve our selection? The last example gives a hint: the English line is part of a longer section of translated Middle Dutch, but it appears here isolated from its sibling English sentences. In other words: the matching algorithm suddenly decides, amidst all the English lines, that one line is not English. We can make use of the fact that this is unlikely. We will rewrite the English model such that if a line is not identified as English but both the previous and next lines are, then we'll identify the line itself as English too.
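Stripped of everything else, the rule can be sketched on a plain list of true/false flags, one per line, where true means "scored as English". This is only an illustration of the idea under simplifying assumptions: the method name `fill_wedged` is made up, and empty lines are ignored here, whereas the actual model will have to skip them when looking for neighbours.

```ruby
# flags[i] is true when line i was scored as English.
# A line that was not scored as English is still accepted as English
# when both its direct neighbours were scored as English.
def fill_wedged( flags )
  flags.each_with_index.map do |flag, i|
    # Lines already marked, and lines without two neighbours, stay as they are.
    next flag if flag || i.zero? || i == flags.size - 1
    flags[i - 1] && flags[i + 1]
  end
end

flags = [ true, false, true, false, false, true ]
p fill_wedged( flags )
```

The second entry is flipped to true because it sits between two English lines. The fourth and fifth entries stay false because each has a non-English neighbour. Note that the neighbours are judged by their original score, not by the corrected result, so a correction never cascades along a run of lines.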

At the very least our English model will require knowledge about its textual context. As you may gauge from the models, all models only know about the very line they are trying to match. Thus the parser needs a way of letting the models know which lines precede and follow. We can do so by changing the superclass Model. All models will have a variable called 'line_context' that is settable by the parser (this is what the 'attr_accessor' bit guarantees).


In [7]:
class Model

  attr_accessor :line_context

  # A class instance variable that holds a list of
  # other models that terminate this model.
  @terminators = nil
  def self.terminators
    @terminators
  end

  # Determines if the model matches a line of text.
  # By default it returns false because it doesn't match anything.
  def matches( line )
    false
  end

end


Out[7]:
:matches
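For readers unfamiliar with Ruby: 'attr_accessor' is shorthand for a reader method and a writer method around an instance variable of the same name. The sketch below shows the two spellings side by side. The class names `WithAccessor` and `WithExplicitMethods` are hypothetical and exist only for this comparison.

```ruby
# Short form, as used in the Model class.
class WithAccessor
  attr_accessor :line_context
end

# Long form: the two methods that attr_accessor generates for us.
class WithExplicitMethods
  def line_context
    @line_context
  end

  def line_context=( value )
    @line_context = value
  end
end

# Both classes can be used in exactly the same way.
[ WithAccessor, WithExplicitMethods ].each do |klass|
  model = klass.new
  model.line_context = "some context"
  puts "#{klass}: #{model.line_context}"
end
```

This is what allows the parser to write `model.line_context = ...` on any model without each model having to define anything itself.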

Now we need some object that can represent and hold the context of a line (e.g. 'know' or store the 10 lines before and after the current line). Let's call this object LineContext. On creation (with LineContext.new, which invokes the initialize() method) a LineContext object receives all lines of the text and the index of the line that is currently being parsed. It then stores the 10 lines before that line in 'previous_lines', nearest line first, and the 10 lines after it in 'next_lines'.


In [8]:
class LineContext

  attr_reader :previous_lines
  attr_reader :next_lines

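  def initialize( lines, index )
    # In the reversed list the line just before 'index' sits at position
    # lines.size - index, so previous_lines runs from nearest to farthest.
    reversed_index = lines.size - index
    @previous_lines = lines.reverse()[reversed_index,10]
    @next_lines = lines[index+1,10]
  end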

end


Out[8]:
:initialize
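To see what such a context looks like, the following sketch computes the same two windows for a tiny list of lines. It is an alternative formulation, not the class above: the method name `context_window` and its `size` parameter are assumptions made for the illustration, and it slices the list directly instead of reversing it first.

```ruby
# Returns [previous_lines, next_lines] around the line at 'index'.
# previous_lines is ordered nearest line first, like in LineContext.
def context_window( lines, index, size = 10 )
  start = [ index - size, 0 ].max
  previous_lines = lines[start...index].reverse
  next_lines = lines[index + 1, size] || []
  [ previous_lines, next_lines ]
end

lines = %w[ a b c d e ]

previous_lines, next_lines = context_window( lines, 2 )
p previous_lines   # the lines before "c", nearest first
p next_lines       # the lines after "c"

# At the edges of the text one of the two windows is simply empty.
p context_window( lines, 0 )
p context_window( lines, lines.size - 1 )
```

The nearest-first ordering matters later on: a model that wants the directly preceding line can take the first element of 'previous_lines', just as it takes the first element of 'next_lines' for the directly following one.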

Also we will have to adapt our parser, because it needs to inform each model of its context. To do so we change the match_lines() method.


In [9]:
class OCRParser

  attr_accessor :models

  def text=( text )
    @text = text
    @lines = text.split( "\n" )
  end

  def load_text( file_path )
    self.text = File.read( file_path )
  end

  def match_lines
    @lines.each_with_index do |line,index|
      matches = []
      @models.each do |model|
        # We provide each model with some context of the line it will be working on…
        model.line_context = LineContext.new( @lines, index )
        if model.matches( line )
          matches.push( model.class )
        end
      end
      yield line, matches
    end
  end

  def parse_tuples
    active_multiline_models = []
    match_lines do |line, matches|
      matches.each do |model|
        active_multiline_models.reject! do |active_multiline_model|
          active_multiline_model.terminators.include? model
        end
        if model.terminators != nil
          active_multiline_models.push( model )
        end
      end
      if matches.size == 0 && active_multiline_models.size == 0
        yield true, line, matches
      else
        yield false, line, matches
      end
    end
  end

  def parse
    tuples = []
    parse_tuples { | accept, line | tuples.push line if accept }
    tuples
  end

end


Out[9]:
:parse

Now we can finally adapt the English model such that it can take into account the lines prior to and following the line it is trying to match. The matches() method is expanded to see if the preceding and following line are English according to the scoring function. If so then the current line is still marked as English even if the scoring function did not judge it to be.


In [10]:
class EnglishThirdIteration < Model

  # Whenever this model is created/used it loads a number of variables,
  # e.g. the list of English stop words (@stopwords), which is read from a file.
  # There are English stopwords that are homonyms of Middle Dutch words. We do not
  # want to count these fully as English, therefore we subtract the set of these
  # words (@may_be_middle_dutch) from the set of English stopwords.
  def initialize
    @may_be_middle_dutch = [ "an", "as", "been", "by", "have", "he", "her", 
      "here", "i", "in", "is", "me", "mine", "no", "so", "over", "was", "we" ]
    @stopwords = File.read( './resources/stopwords_en.txt' ).split( "\n" ) - @may_be_middle_dutch
    @threshold = 0.2
  end

  # Sets threshold, 0.2 (20%) by default.
  def threshold=( new_threshold )
    @threshold = new_threshold
  end

  # Some words look like "been." or "her?", we strip the punctuation to make sure we 
  # don't miss any English words while matching them ("been." for a computer is 
  # obviously not the same as "been").
  def strip_embracing_punctuation( token )
    return token.gsub(/[\.:;'“‘’”?!\(\),]+$|^[\.:;'“‘’”?!\(\),]+/, '')
  end

  # This computes the 'English score' for a line.
  # The line is first split into its individual tokens.
  # Then we count all English stopwords with a weight of 1.
  # Words that *might* be English but *could* also be Middle Dutch
  # are counted too, but with a weight of 0.4.
  # Finally we compute the relative score, that is: the weighted count of
  # English words divided by the total number of tokens on the line.
  def score( string )
    score = 0.0
    tokens = string.split( /\s+/ )
    tokens.each do |token|
      stripped = strip_embracing_punctuation( token )
      score += 1.0 if @stopwords.include?( stripped.downcase )
      score += 0.4 if @may_be_middle_dutch.include?( stripped.downcase )
    end
    score/tokens.size()
  end

  # Below the adapted part for the English model. The standard function
  # 'matches( line )' that all models must provide first tests if the 
  # line is English according to the scoring function.
  # We say a line is English if the score computed above is larger
  # than the threshold of 0.2. (Thus if 20% of the tokens could be English.)
  # The matches method then continues to do an additional check in case
  # the line is found to be not English. In that case the line is still
  # marked as English if it is wedged in between two lines that are 
  # identified as being English.
  def matches( line )
    match = above_treshold( line )
    if @line_context != nil && !match
      empty_model = Empty.new
      if !empty_model.matches( line )
        # Post correction, if in between two English matches, it probably should be matched too
        prev = @line_context.previous_lines.reject { |other| empty_model.matches( other ) }
        succ = @line_context.next_lines.reject { |other| empty_model.matches( other ) }
        previous_matches = above_treshold( prev[0] ) if prev.size > 0
        next_matches = above_treshold( succ[0] ) if succ.size > 0
        match = true if (previous_matches && next_matches)
      end
    end
    match
  end

  def above_treshold( line )
    score( line ) > @threshold
  end

end


Out[10]:
:above_treshold

And now we are ready again to test the third iteration of the English model. Engine away…


In [11]:
text = OCRParser.new
text.load_text( './resources/Bouwman_ Of Reynaert the Fox.txt' )
text.models = [ Empty.new, Numbers.new, FootNote.new, AllCaps.new, EnglishThirdIteration.new ]
parsed = text.parse()
puts parsed.join( "\n" )


Willem die Madocke maecte, [192va,22]
daer hi dicken omme waecte,
hem vernoyde so haerde
dat die avonture van Reynaerde
in Dietsche onghemaket bleven
— die Arnout niet hevet vulscreven —
dat hi die vijte dede soucken
ende hise na den Walschen boucken
in Dietsche dus hevet begonnen.
God moete ons ziere hulpen jonnen!
Nu keert hem daertoe mijn zin
dat ic bidde in dit beghin
beede den dorpren enten doren,
ofte si commen daer si horen
dese rijme ende dese woort
(die hem onnutte sijn ghehoort),
dat sise laten onbescaven.
Te vele slachten si den raven,
die emmer es al even malsch.
Si maken sulke rijme valsch,
daer si niet meer of ne weten [192vb]
dan ic doe hoe dat si heeten
die nu in Babilonien leven.
Daden si wel, si soudens begheven.
Dat en segghic niet dor minen wille.
Mijns dichtens ware een ghestille,
ne hads mi eene niet ghebeden
die in groeter hovesscheden
Prologue
gherne keert hare saken.
Soe bat mi dat ic soude maken
dese avontuere van Reynaerde.
Al begripic die grongaerde
ende die dorpren ende die doren,
ic wille dat dieghene horen
die gherne pleghen der eeren
ende haren zin daertoe keeren
dat si leven hoofschelike,
sijn si arem, sijn si rike,
diet verstaen met goeden sinne.
Nu hoert hoe ic hier beghinne!
Het was in eenen tsinxendaghe
dat beede bosch ende haghe
met groenen loveren waren bevaen.
Nobel die coninc hadde ghedaen
sijn hof crayeren overal,
dat hi waende, hadde hijs gheval,
houden ten wel groeten love.
Doe quamen tes sconinx hove
alle die diere, groet ende cleene,
sonder vos Reynaert alleene.
Hi hadde te hove so vele mesdaen
dat hire niet dorste gaen.
Die hem besculdich kent, ontsiet.
Also was Reynaerde ghesciet
ende hieromme scuwedi sconinx hof,
daer hi in hadde crancken lof.
Doe al dat hofversamet was,
was daer niemen, sonder die das,
hi ne hadde te claghene over Reynaerde,
den fellen metten grijsen baerde.
Nu gaet hier up eene claghe.
Isingrijn ende sine maghe
ghinghen voer den coninc staen. [193ra]
Ysengrijn begonste saen
ende sprac: ‘Coninc heere,
dor hu edelheit ende dor hu eere
ende dor recht ende dor ghenade,
ontfaerme hu miere scade
die mi Reynaert heeft ghedaen,
daer ic af dicken hebbe ontfaen
groeten lachter ende verlies.
Voer al dandre ontfaerme hu dies
dat hi mijn wijfhevet verhoert
ende mine kindre so mesvoert
dat hise beseekede daer si laghen,
datter twee noint ne saghen
ende si worden staerblent.
Nochtan hoendi mi sent.
Het was sint so verre comen
datter eenen dach af was ghenomen
ende Reynaerd soude hebben ghedaen
sine onsculde. Ende also saen
alse die heleghe waren brocht,
was hi andersins bedocht
ende ontfoer ons in sine veste.
Heere, dit kennen noch die beste
die te hove zijn commen hier.
Mi hevet Reynaert, dat felle dier,
so vele te leede ghedaen,
ic weet wel al sonder waen:
al ware al tlaken paerkement
dat men maket nu te Ghent,
inne ghescreeft niet daeran.
Dies zwijghics nochtan,
neware mijns wives lachter
ne mach niet bliven achter,
no onversweghen no onghewroken.’
Doe Ysengrijn dit hadde ghesproken,
stont up een hondekijn, hiet Cortoys,
ende claghede den coninc in Francsoys
hoet so arem was wijleneere
dat alles goets en hadde meere
in eenen winter, in eene vorst,
dan alleene eene worst
ende hem Reynaert, die felle man, [193rb]
die selve worst stal ende nam.
Tybeert die cater die wart gram.
Aldus hi sine tale began
ende spranc midden in den rinc
ende seide: ‘Heere coninc,
dordat ghi Reynaerde zijt onhout,
so en es hier jonc no hout,
hi ne hebbe te wroughene jeghen hu.
Dat Cortoys claghet nu,
dats over menich jaer ghesciet.
Die worst was mine, al en claghic niet.
Ic hadse bi miere lust ghewonnen
daer ic bi nachte quam gheronnen
omme bejach in eene molen,
daer ic die worst in hadde ghestolen
eenen slapenden molenman.
Hadder Cortoys yewet an,
dan was bi niemene dan bi mi.
Hets recht dat omberecht zi
die claghe die Cortoys doet.’
Pancer de bever sprac: ‘Dinct hu goet,
Tybeert, dat men die claghe ombeere?
Reynaert es een recht mordeneere
ende een trekere ende een dief.
Hi ne heeft oec niemene so lief,
no den coninc, minen heere,
hi ne wilde dat hi lijf ende eere
verlore, mochtire an winnen
een vet morzeel van eere hinnen.
Wat sechdi van eere laghe?
En dedi ghistren in den daghe
eene die meeste overdaet
an Cuwaerde den hase, die hier staet,
die noyt eenich dier ghedede?
Want hi hem binnen sconinX vrede
ende binnen des coninX gheleede
ghelovede te leerne sinen crede
ende soudene maken capelaen.
Doe dedine sitten gaen
vaste tusschen sine beene.
Doe begonsten si overeene
spellen ende lesen beede [193va]
ende lude te zinghene crede.
Mi gheviel dat ic te dien tijden
ter selver stede soude lijden.
Doe hoerdic haerre beeder sanc
ende maecte daerwaert minen ganc
met eere arde snelre vaerde.
Doe vandic daer meester Reynaerde,
die ziere lessen hadde begheven
die hi tevoren up hadde gheheven,
ende diende van sinen houden spelen
ende hadde Coewaerde bi der kelen
ende soude hem thoeft afhebben ghenomen
waer ic hem niet te hulpen comen
bi avontueren in dien stonden.
Siet hier noch die verssche wonden
ende die teekine, heere coninc,
die Coewaert van hem ontfinc.
Laetti dit bliven onghewroken,
dat hu verde dus es tebroken,
ghi ne wreket als huwe mannen wijsen,
men saelt huwen kindren mesprijsen
hiernaer over wel menich jaer.’
‘Bi Gode, Pancer, ghi secht waer,’
sprac Ysengrijn daer hi stoet.
‘Heere, waer Reynaerd doot, het waer ons goet,
also behoude mi God mijn leven.
Neware wert hem dit vergheven,
hi sal noch hoenen binnen eere maent
sulken dies niet ne bewaent.’
Doe spranc up Grinbert die das,
die Reynaerts broedersone was,
met eere verbolghenlike tale:
‘Heere Ysengrijn, men weet dat wale
ende hets een hout bijspel:
viants mont seit selden wel.
Verstaet, neemt miere talen goem:
ic wilde, hi hinghe an eenen boem
bi ziere kelen als een dief
die andren heeft ghedaen meest grief.
Heere Ysengrijn, wildi angaen
soendinc ende dat ontfaen,
daertoe willic helpen gherne. [193vb]
Mijn oem en saelt hem oec niet wernen.
Entie meest andren heeft mesdaen
sal den andren in baten staen
van minen oem ende van hu.
Al comt hi niet claghen nu,
ware mijn oem wel te hove
ende stonde in sconinx love,
heere Ysengrijn, als ghi doet,
en soude den coninc niet dincken goet
ende ghi ne bleves heden onbegrepen,
dat ghi sijn vel so hebt ghenepen
so dicwile met huwen scerpen tanden,
dat hi niet ne conde ghehanden.’
Ysengrijn sprac: ‘Hebdi gheleert
an huwen oem dus lieghen apeert?’
‘In hebbe daeran niet gheloghen.
Ghi hebt minen oem bedroghen
arde dicke in menegher wijsen.
Ghi mesleettene van den pladijse
die hi hu warp van der kerren,
doe ghi hem volghet van verren
ende ghi die beste pladijse uplaset,
daer ghi hu ane hadt versadet.
Ghi ne gaeft hem no goet no quaet,
sonder alleene eenen pladijsengraet
dat ghi hem te jeghen brocht,
dordat ghine niet en mocht.
Sint hoendine van eenen bake
die vet was ende van goeder smake,
dien ghi leit in huwen muzeele.
cart, leaving nothing but the bones ofone single fish (cf. p. 31—32).
Doe Reynaert heesschede zijn deele,
andwoerdi hem in scerne:
“Hu deel willic hu gheven gherne,
Reynaert, scone jonghelinc!
Die wisse daer die bake an hinc,
becnause, so es so vet.”
Reynaerde waes lettel te bet
dat hi den goeden bake ghewan
in sulker zorghen, dattene een man
vinc ende warpene in sinen zac.
Dese pine ende dit onghemac
hevet hi leden dor Ysengrijne [194ra]
ende ondert waerven meer dan ic hu rijme.
Ghi heeren, dinct hu dit ghenouch?
Nochtan om meer onghevouch
dat hi claghet om sijn wijf,
die Reynaerde hevet al haer lijf
ghemint; so doet hi hare.
Al ne makeden zijt niet mare,
ic dart wel segghen over waer
dat langher es dan VII jaer
dat Reynaert hevet hare trauwe.
Omdat Haersint, die scone vrouwe,
dor minne ende dor quade zede
Reynaert sinen wille dede,
wattan? So was sciere ghenesen.
Wat talen mach daeromme wesen?
Nu maket heere Cuwaert, die hase,
eene claghe van eere blase.
Of hi den credo niet wel en las,
Reynaerd, die zijn meester was,
mochte hi sinen clerc niet blauwen?
Dat ware onrecht, entrauwen.
Cortoys claghet om eene worst
die hi verloes in eene vorst.
Die claghe ware bet verholen:
ende hoerdi dat so was ghestolen?
Male quesite male perdite:
over rechtwert men qualike quite
dat men hevet qualic ghewonnen.
Wie sal Reynaerde dat verjonnen
Niemen die recht versceeden can.
Reynaert es een gherecht man.
Sint dat die coninc sinen ban
hevet gheboden ende sinen vrede,
so weetic wel dat hi ne dede
dinc negheene dan of hi ware
hermite ofte clusenare.
Naest siere huut draecht hi een hare.
Binnen desen naesten jare
so ne hat hi vleesch, no Wilt no tam.
Dat seidi die ghistren danen quam.
Malcroys hevet hi begheven, [194rb]
sinen casteel, ende hevet upheven
eene cluse daer hi leghet in.
Ander bejach no ander ghewin
so wanic wel dat hi ne hevet
dan karitate die men hem ghevet.
Bleec es hi ende magher van pinen.
Hongher, dorst, scerpe karijnen
doghet hi voer sine zonden.’
Recht te desen selven stonden,
doe Grimbert stont in dese tale,
saghen si van berghe te dale
Canticler commen ghevaren,
ende brochte up eene bare
eene doode hinne ende hiet Coppe,

Excellent: apart from two rogue English relicts, that looks like a clean text. We can add some hints to our filtering mechanism to make sure these relicts are filtered out as well. These hints are equivalent to a scholarly editor deciding 'no, that is not supposed to be part of the witness text'.
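To see why a hint can be necessary, here is a minimal sketch of the scoring idea applied to the relict line from the output above. The constants SAMPLE_STOPWORDS and AMBIGUOUS and the method english_score are hypothetical, heavily shortened stand-ins for illustration only; the real stopword file is much longer, so the actual scores will differ.

```ruby
# Hypothetical stand-in for ./resources/stopwords_en.txt (the real list is much longer).
SAMPLE_STOPWORDS = %w[the but of and for].freeze
# A few of the words that may be English or Middle Dutch.
AMBIGUOUS = %w[so was in he].freeze
# Same punctuation pattern as strip_embracing_punctuation in the model.
PUNCTUATION = /[\.:;'“‘’”?!\(\),]+$|^[\.:;'“‘’”?!\(\),]+/

# Relative 'English score' of a line: stopwords and hints weigh 1.0,
# ambiguous words weigh 0.4, and the sum is divided by the token count.
def english_score(line, hints = [])
  stopwords = SAMPLE_STOPWORDS - AMBIGUOUS + hints
  tokens = line.split(/\s+/)
  return 0.0 if tokens.empty?
  total = tokens.sum(0.0) do |token|
    word = token.gsub(PUNCTUATION, '').downcase
    weight = 0.0
    weight += 1.0 if stopwords.include?(word)
    weight += 0.4 if AMBIGUOUS.include?(word)
    weight
  end
  total / tokens.size
end

relict = "cart, leaving nothing but the bones ofone single fish (cf. p. 31—32)."
dutch  = "die vet was ende van goeder smake,"

puts english_score(relict)            # only "but" and "the" count: 2 of 12 tokens
puts english_score(relict, ["ofone"]) # the hint adds a third match: 3 of 12 tokens
puts english_score(dutch)             # only the ambiguous "was" counts, at 0.4
```

With this toy list the relict stays just under the 0.2 threshold without the hint and rises to 0.25 with it, while the Middle Dutch line remains well below the threshold.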


In [12]:
class EnglishFourthIteration < Model

  # Whenever this model is created/used it loads a number of variables,
  # e.g. the list of English stop words (@stopwords), which is read from a file.
  # There are English stopwords that are homonyms of Middle Dutch words. We do not
  # want to count these as English, therefore we subtract the set of these words
  # (@may_be_middle_dutch) from the set of English stopwords.
  # Hints may be provided to recognize particular cases, that is: if we know certain
  # words definitely indicate an English line we can add these to the set of hints.
  def initialize
    @may_be_middle_dutch = [ "an", "as", "been", "by", "have", "he", "her", 
      "here", "i", "in", "is", "me", "mine", "no", "so", "over", "was", "we" ]
    @hints = [ "prologue", "ofone" ]
    @stopwords = File.read( './resources/stopwords_en.txt' ).split( "\n" ) - @may_be_middle_dutch + @hints
    @threshold = 0.2
  end

  # Sets threshold, 0.2 (20%) by default.
  def threshold=( new_threshold )
    @threshold = new_threshold
  end

  # Some words look like "been." or "her?". We strip the punctuation to make sure we
  # don't miss any English words while matching them (to a computer, "been." is
  # obviously not the same as "been").
  def strip_embracing_punctuation( token )
    return token.gsub(/[\.:;'“‘’”?!\(\),]+$|^[\.:;'“‘’”?!\(\),]+/, '')
  end

  # This computes the 'English score' for a line.
  # The line is first split into its individual tokens.
  # Then we count all English stopwords with a weight of 1.
  # Words that *might* be English but *could* also be Middle Dutch
  # are counted as well, but with a weight of only 0.4.
  # Finally we compute the relative score, that is: the count of English
  # words divided by the total number of tokens on the line.
  def score( string )
    score = 0.0
    tokens = string.split( /\s+/ )
    tokens.each do |token|
      stripped = strip_embracing_punctuation( token )
      score += 1.0 if @stopwords.include?( stripped.downcase )
      score += 0.4 if @may_be_middle_dutch.include?( stripped.downcase )
    end
    score/tokens.size()
  end

  # Below the adapted part for the English model. The standard function
  # 'matches( line )' that all models must provide first tests if the 
  # line is English according to the scoring function.
  # We say a line is English if the score computed above is larger than the
  # threshold (0.2 by default, thus if more than 20% of the tokens could be English).
  # The matches method then continues to do an additional check in case
  # the line is found to be not English. In that case the line is still
  # marked as English if it is wedged in between two lines that are 
  # identified as being English.
  def above_treshold( line )
    score( line ) > @threshold
  end

  def matches( line )
    match = above_treshold( line )
    if @line_context != nil && !match
      empty_model = Empty.new
      if !empty_model.matches( line )
        # Post-correction: if the line sits in between two English matches, it probably should be matched too.
        # The block variable is named 'other' so that it does not shadow the 'line' argument.
        prev = @line_context.previous_lines.reject { |other| empty_model.matches( other ) }
        succ = @line_context.next_lines.reject { |other| empty_model.matches( other ) }
        previous_matches = above_treshold( prev[0] ) if prev.size > 0
        next_matches = above_treshold( succ[0] ) if succ.size > 0
        match = true if (previous_matches && next_matches)
      end
    end
    match
  end

end


Out[12]:
:matches
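
The post-correction in matches can be hard to picture without the surrounding parser. The following is a simplified, stand-alone sketch of the same idea and not the notebook's actual implementation: mark_english is a hypothetical helper, the predicate is a toy replacement for the threshold test, and the sample lines are made up. It also assumes that the neighbours to consult are the nearest non-empty lines before and after the current line.

```ruby
# Hypothetical helper: returns one boolean per line, true when the line is
# considered English either directly or because it is wedged in between
# two directly matched English lines (empty lines are skipped over).
def mark_english(lines, is_english)
  direct = lines.map { |l| is_english.call(l) }
  blank  = ->(l) { l.strip.empty? }
  lines.each_index.map do |i|
    # Direct matches and empty lines need no correction.
    next direct[i] if direct[i] || blank.call(lines[i])
    prev_i = (i - 1).downto(0).find { |j| !blank.call(lines[j]) }
    next_i = (i + 1).upto(lines.size - 1).find { |j| !blank.call(lines[j]) }
    !prev_i.nil? && !next_i.nil? && direct[prev_i] && direct[next_i]
  end
end

# Toy predicate (assumption): a line is English if it has at least two of these words.
english = ->(l) { l.split.count { |w| %w[the of and].include?(w.downcase) } >= 2 }

lines = [
  "the fox ate all of the fish",        # matched directly
  "",                                   # empty, ignored
  "cart bones single fish",             # no stopwords, but wedged in between
  "and the wolf was left with nothing", # matched directly
  "Doe Reynaert heesschede zijn deele," # Middle Dutch, no English line after it
]

p mark_english(lines, english) # => [true, false, true, true, false]
```

As in the model, the neighbours are judged only on their direct score, so a corrected line never triggers the correction of further lines.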

That should do it. Let's test drive this and see what the result is…


In [13]:
text = OCRParser.new
text.load_text( './resources/Bouwman_ Of Reynaert the Fox.txt' )
text.models = [ Empty.new, Numbers.new, FootNote.new, AllCaps.new, EnglishFourthIteration.new ]
parsed = text.parse()
puts parsed.join( "\n" )


Willem die Madocke maecte, [192va,22]
daer hi dicken omme waecte,
hem vernoyde so haerde
dat die avonture van Reynaerde
in Dietsche onghemaket bleven
— die Arnout niet hevet vulscreven —
dat hi die vijte dede soucken
ende hise na den Walschen boucken
in Dietsche dus hevet begonnen.
God moete ons ziere hulpen jonnen!
Nu keert hem daertoe mijn zin
dat ic bidde in dit beghin
beede den dorpren enten doren,
ofte si commen daer si horen
dese rijme ende dese woort
(die hem onnutte sijn ghehoort),
dat sise laten onbescaven.
Te vele slachten si den raven,
die emmer es al even malsch.
Si maken sulke rijme valsch,
daer si niet meer of ne weten [192vb]
dan ic doe hoe dat si heeten
die nu in Babilonien leven.
Daden si wel, si soudens begheven.
Dat en segghic niet dor minen wille.
Mijns dichtens ware een ghestille,
ne hads mi eene niet ghebeden
die in groeter hovesscheden
gherne keert hare saken.
Soe bat mi dat ic soude maken
dese avontuere van Reynaerde.
Al begripic die grongaerde
ende die dorpren ende die doren,
ic wille dat dieghene horen
die gherne pleghen der eeren
ende haren zin daertoe keeren
dat si leven hoofschelike,
sijn si arem, sijn si rike,
diet verstaen met goeden sinne.
Nu hoert hoe ic hier beghinne!
Het was in eenen tsinxendaghe
dat beede bosch ende haghe
met groenen loveren waren bevaen.
Nobel die coninc hadde ghedaen
sijn hof crayeren overal,
dat hi waende, hadde hijs gheval,
houden ten wel groeten love.
Doe quamen tes sconinx hove
alle die diere, groet ende cleene,
sonder vos Reynaert alleene.
Hi hadde te hove so vele mesdaen
dat hire niet dorste gaen.
Die hem besculdich kent, ontsiet.
Also was Reynaerde ghesciet
ende hieromme scuwedi sconinx hof,
daer hi in hadde crancken lof.
Doe al dat hofversamet was,
was daer niemen, sonder die das,
hi ne hadde te claghene over Reynaerde,
den fellen metten grijsen baerde.
Nu gaet hier up eene claghe.
Isingrijn ende sine maghe
ghinghen voer den coninc staen. [193ra]
Ysengrijn begonste saen
ende sprac: ‘Coninc heere,
dor hu edelheit ende dor hu eere
ende dor recht ende dor ghenade,
ontfaerme hu miere scade
die mi Reynaert heeft ghedaen,
daer ic af dicken hebbe ontfaen
groeten lachter ende verlies.
Voer al dandre ontfaerme hu dies
dat hi mijn wijfhevet verhoert
ende mine kindre so mesvoert
dat hise beseekede daer si laghen,
datter twee noint ne saghen
ende si worden staerblent.
Nochtan hoendi mi sent.
Het was sint so verre comen
datter eenen dach af was ghenomen
ende Reynaerd soude hebben ghedaen
sine onsculde. Ende also saen
alse die heleghe waren brocht,
was hi andersins bedocht
ende ontfoer ons in sine veste.
Heere, dit kennen noch die beste
die te hove zijn commen hier.
Mi hevet Reynaert, dat felle dier,
so vele te leede ghedaen,
ic weet wel al sonder waen:
al ware al tlaken paerkement
dat men maket nu te Ghent,
inne ghescreeft niet daeran.
Dies zwijghics nochtan,
neware mijns wives lachter
ne mach niet bliven achter,
no onversweghen no onghewroken.’
Doe Ysengrijn dit hadde ghesproken,
stont up een hondekijn, hiet Cortoys,
ende claghede den coninc in Francsoys
hoet so arem was wijleneere
dat alles goets en hadde meere
in eenen winter, in eene vorst,
dan alleene eene worst
ende hem Reynaert, die felle man, [193rb]
die selve worst stal ende nam.
Tybeert die cater die wart gram.
Aldus hi sine tale began
ende spranc midden in den rinc
ende seide: ‘Heere coninc,
dordat ghi Reynaerde zijt onhout,
so en es hier jonc no hout,
hi ne hebbe te wroughene jeghen hu.
Dat Cortoys claghet nu,
dats over menich jaer ghesciet.
Die worst was mine, al en claghic niet.
Ic hadse bi miere lust ghewonnen
daer ic bi nachte quam gheronnen
omme bejach in eene molen,
daer ic die worst in hadde ghestolen
eenen slapenden molenman.
Hadder Cortoys yewet an,
dan was bi niemene dan bi mi.
Hets recht dat omberecht zi
die claghe die Cortoys doet.’
Pancer de bever sprac: ‘Dinct hu goet,
Tybeert, dat men die claghe ombeere?
Reynaert es een recht mordeneere
ende een trekere ende een dief.
Hi ne heeft oec niemene so lief,
no den coninc, minen heere,
hi ne wilde dat hi lijf ende eere
verlore, mochtire an winnen
een vet morzeel van eere hinnen.
Wat sechdi van eere laghe?
En dedi ghistren in den daghe
eene die meeste overdaet
an Cuwaerde den hase, die hier staet,
die noyt eenich dier ghedede?
Want hi hem binnen sconinX vrede
ende binnen des coninX gheleede
ghelovede te leerne sinen crede
ende soudene maken capelaen.
Doe dedine sitten gaen
vaste tusschen sine beene.
Doe begonsten si overeene
spellen ende lesen beede [193va]
ende lude te zinghene crede.
Mi gheviel dat ic te dien tijden
ter selver stede soude lijden.
Doe hoerdic haerre beeder sanc
ende maecte daerwaert minen ganc
met eere arde snelre vaerde.
Doe vandic daer meester Reynaerde,
die ziere lessen hadde begheven
die hi tevoren up hadde gheheven,
ende diende van sinen houden spelen
ende hadde Coewaerde bi der kelen
ende soude hem thoeft afhebben ghenomen
waer ic hem niet te hulpen comen
bi avontueren in dien stonden.
Siet hier noch die verssche wonden
ende die teekine, heere coninc,
die Coewaert van hem ontfinc.
Laetti dit bliven onghewroken,
dat hu verde dus es tebroken,
ghi ne wreket als huwe mannen wijsen,
men saelt huwen kindren mesprijsen
hiernaer over wel menich jaer.’
‘Bi Gode, Pancer, ghi secht waer,’
sprac Ysengrijn daer hi stoet.
‘Heere, waer Reynaerd doot, het waer ons goet,
also behoude mi God mijn leven.
Neware wert hem dit vergheven,
hi sal noch hoenen binnen eere maent
sulken dies niet ne bewaent.’
Doe spranc up Grinbert die das,
die Reynaerts broedersone was,
met eere verbolghenlike tale:
‘Heere Ysengrijn, men weet dat wale
ende hets een hout bijspel:
viants mont seit selden wel.
Verstaet, neemt miere talen goem:
ic wilde, hi hinghe an eenen boem
bi ziere kelen als een dief
die andren heeft ghedaen meest grief.
Heere Ysengrijn, wildi angaen
soendinc ende dat ontfaen,
daertoe willic helpen gherne. [193vb]
Mijn oem en saelt hem oec niet wernen.
Entie meest andren heeft mesdaen
sal den andren in baten staen
van minen oem ende van hu.
Al comt hi niet claghen nu,
ware mijn oem wel te hove
ende stonde in sconinx love,
heere Ysengrijn, als ghi doet,
en soude den coninc niet dincken goet
ende ghi ne bleves heden onbegrepen,
dat ghi sijn vel so hebt ghenepen
so dicwile met huwen scerpen tanden,
dat hi niet ne conde ghehanden.’
Ysengrijn sprac: ‘Hebdi gheleert
an huwen oem dus lieghen apeert?’
‘In hebbe daeran niet gheloghen.
Ghi hebt minen oem bedroghen
arde dicke in menegher wijsen.
Ghi mesleettene van den pladijse
die hi hu warp van der kerren,
doe ghi hem volghet van verren
ende ghi die beste pladijse uplaset,
daer ghi hu ane hadt versadet.
Ghi ne gaeft hem no goet no quaet,
sonder alleene eenen pladijsengraet
dat ghi hem te jeghen brocht,
dordat ghine niet en mocht.
Sint hoendine van eenen bake
die vet was ende van goeder smake,
dien ghi leit in huwen muzeele.
Doe Reynaert heesschede zijn deele,
andwoerdi hem in scerne:
“Hu deel willic hu gheven gherne,
Reynaert, scone jonghelinc!
Die wisse daer die bake an hinc,
becnause, so es so vet.”
Reynaerde waes lettel te bet
dat hi den goeden bake ghewan
in sulker zorghen, dattene een man
vinc ende warpene in sinen zac.
Dese pine ende dit onghemac
hevet hi leden dor Ysengrijne [194ra]
ende ondert waerven meer dan ic hu rijme.
Ghi heeren, dinct hu dit ghenouch?
Nochtan om meer onghevouch
dat hi claghet om sijn wijf,
die Reynaerde hevet al haer lijf
ghemint; so doet hi hare.
Al ne makeden zijt niet mare,
ic dart wel segghen over waer
dat langher es dan VII jaer
dat Reynaert hevet hare trauwe.
Omdat Haersint, die scone vrouwe,
dor minne ende dor quade zede
Reynaert sinen wille dede,
wattan? So was sciere ghenesen.
Wat talen mach daeromme wesen?
Nu maket heere Cuwaert, die hase,
eene claghe van eere blase.
Of hi den credo niet wel en las,
Reynaerd, die zijn meester was,
mochte hi sinen clerc niet blauwen?
Dat ware onrecht, entrauwen.
Cortoys claghet om eene worst
die hi verloes in eene vorst.
Die claghe ware bet verholen:
ende hoerdi dat so was ghestolen?
Male quesite male perdite:
over rechtwert men qualike quite
dat men hevet qualic ghewonnen.
Wie sal Reynaerde dat verjonnen
Niemen die recht versceeden can.
Reynaert es een gherecht man.
Sint dat die coninc sinen ban
hevet gheboden ende sinen vrede,
so weetic wel dat hi ne dede
dinc negheene dan of hi ware
hermite ofte clusenare.
Naest siere huut draecht hi een hare.
Binnen desen naesten jare
so ne hat hi vleesch, no Wilt no tam.
Dat seidi die ghistren danen quam.
Malcroys hevet hi begheven, [194rb]
sinen casteel, ende hevet upheven
eene cluse daer hi leghet in.
Ander bejach no ander ghewin
so wanic wel dat hi ne hevet
dan karitate die men hem ghevet.
Bleec es hi ende magher van pinen.
Hongher, dorst, scerpe karijnen
doghet hi voer sine zonden.’
Recht te desen selven stonden,
doe Grimbert stont in dese tale,
saghen si van berghe te dale
Canticler commen ghevaren,
ende brochte up eene bare
eene doode hinne ende hiet Coppe,

That looks like the actual Middle Dutch text that we were looking for. There are still things to fix, though: there are folio and column markers (such as "[193va]") that are obviously not part of the original text, and there are OCR mistakes, as in "sconinX". However, we will consider what to do with these later. For now we have our 'clean' text, and we go on to model it in an object-oriented (OO) way.

Notes

1) 'Performance' is an ambiguous term in this context, as programmers also use it to indicate the speed at which a program executes, and code is often rewritten to improve that speed. However, unless otherwise indicated, I use the term 'performance' to refer to what the code does: what it shows, its output, and the tasks it carries out.

</small>