Writing code shares a lot of commonalities with authoring text. For instance, only rarely do you get it right in one go. What authors would call rewriting, scrapping, polishing, starting anew, editing, and so forth, is in IT terms called "refactoring" or "doing another iteration"—although you will indeed hear plain and simple "rewriting" as well.
When writing the code for this experiment, I often needed to rewrite parts of it. During these iterations it was not always clear which were purely a technical matter (e.g. fixing a typo so that "parser.pase()" would not simply error out due to a missing 'r') and which were also tied to scholarly action (e.g. improving the model for detecting English lines of the text). This was important to discern, as the rules of this experiment stated that all scholarly actions should be reproducible. Including in this notebook all the states of the code for each iteration would be rather tedious, pretty boring, and not very informative. It seems reasonable to silently accept iterations that are code oriented, that is: those rewritings of the code that make it in some way technically more adequate or more efficient, but that do not change the result of the heuristic itself. A good example is method extraction. Suppose you have the following code.
In [1]:
puts "The byciclist rides on the bike".gsub( "the", "<part>the</part>" )
puts "The train conductor asks for the tickets".gsub( "the", "<part>the</part>" )
puts "The train conductor asks for the tickets".gsub( "the", "<part>the</part>" )
This code might be part of some application that marks particles for teaching purposes. But the marking is done very explicitly for every case, making the code harder to maintain: if we wanted to change the label of the particles we would have to change it in six places, giving us as many if not more occasions to err. This is why you would refactor code like this to use a method (function) that is called each time we need to write a label.
In [2]:
def mark_part( string )
  string.gsub( "the", "<part>the</part>" )
end

puts mark_part( "The bicyclist rides on the bike" )
puts mark_part( "The train conductor asks for the tickets" )
puts mark_part( "The train conductor asks for the tickets" )
Now if we want to change the label, we can change it in one place and we can be sure that all instances will act the same. Note however that the performance of the code, its output, is still the same (as you can confirm by running the code).
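If we want to be explicit about this, we can even assert the equivalence in code. The following is a hypothetical check, not part of the application itself; it simply compares the inline expression with the extracted method defined above.

inline    = "The train conductor asks for the tickets".gsub( "the", "<part>the</part>" )
extracted = mark_part( "The train conductor asks for the tickets" )
puts inline == extracted   # => true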
Similarly, I have made many iterations and improvements to the code described in this notebook: they improved the structure, maintainability, or readability of the code, or simply solved bugs that prevented successful execution. But those refactorings did not alter the actual resulting performance.1 I have chosen not to represent all these (often tiny) iterations. However, when a refactoring essentially changes the heuristics of the code, that iteration should be reported as a matter of scholarly completeness. Thus if I were to change the above code to…
In [3]:
def mark_part( string )
  string.gsub( /(t|T)he/, '<part>\1he</part>' )
end

puts mark_part( "The bicyclist rides on the bike" )
puts mark_part( "The train conductor asks for the tickets" )
puts mark_part( "The train conductor asks for the tickets" )
…I am changing the heuristics of its task. Instead of just looking for "the" and marking it, I am now looking for either "the" or "The" and marking those. This involves, in the case of this notebook, scholarly decisions on how textual material should be interpreted. These scholarly decisions should be represented as part of the reproducibility of scholarly effort and workflow.
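To make the difference concrete, here is a minimal side-by-side of both versions (an illustration added here, not one of the executed cells; the expected output follows directly from how gsub treats a literal string versus a regular expression and is shown in the comments):

puts "The bicyclist rides on the bike".gsub( "the", "<part>the</part>" )
# => The bicyclist rides on <part>the</part> bike
puts "The bicyclist rides on the bike".gsub( /(t|T)he/, '<part>\1he</part>' )
# => <part>The</part> bicyclist rides on <part>the</part> bike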
When we ran the code we found that it marks a number of lines as English that really are not, and that it qualifies certain lines as Middle Dutch that are not. Clearly some refactoring is needed to remedy this. Before we can do so, however, we need to recreate the models and code that we already had. That is the reason for the next, somewhat hermetic, lines of code. They figure out where we are on the file system and then load the models and parser from the prior chapter.
In [4]:
require File.join(File.dirname(__FILE__), '../lib/ocr_parse_models')
require File.join(File.dirname(__FILE__), '../lib/ocr_parser')
Out[4]:
There are a number of words in the English stop word list that are homonymous with Middle Dutch vocabulary. If we count these fully towards English, we are counting too many terms as English. At the same time we do not want to count too many truly English terms as Middle Dutch. Let's give these terms a weight of 0.4 to see if things improve.
In [5]:
class EnglishSecondIteration < Model
  # Whenever this model is created/used it loads a number of variables,
  # e.g. the list of English stop words (@stopwords), which is read from a file.
  # There are English stopwords that are homonyms of Middle Dutch words. We do not
  # want to count these as English, therefore we subtract the set of these words
  # (@may_be_middle_dutch) from the set of English stopwords.
  def initialize
    @may_be_middle_dutch = [ "an", "as", "been", "by", "have", "he", "her",
                             "here", "i", "in", "is", "me", "mine", "no", "so", "over", "was", "we" ]
    @stopwords = File.read( './resources/stopwords_en.txt' ).split( "\n" ) - @may_be_middle_dutch
    @threshold = 0.2
  end

  # Sets the threshold, 0.2 (20%) by default.
  def threshold=( new_threshold )
    @threshold = new_threshold
  end

  # Some words look like "been." or "her?"; we strip the punctuation to make sure we
  # don't miss any English words while matching them ("been." for a computer is
  # obviously not the same as "been").
  def strip_embracing_punctuation( token )
    token.gsub( /[\.:;'“‘’”?!\(\),]+$|^[\.:;'“‘’”?!\(\),]+/, '' )
  end

  # This computes the 'English score' for a line.
  # The line is first split into its individual tokens.
  # Then we count all English stopwords with a weight of 1.
  # Words that *might* be English but *could* also be Middle Dutch
  # are counted with a weight of 0.4.
  # Finally we compute the relative score, that is: the count of English
  # words divided by the total number of tokens on the line.
  def score( string )
    score = 0.0
    tokens = string.split( /\s+/ )
    tokens.each do |token|
      stripped = strip_embracing_punctuation( token )
      score += 1.0 if @stopwords.include?( stripped.downcase )
      score += 0.4 if @may_be_middle_dutch.include?( stripped.downcase )
    end
    score / tokens.size
  end

  # The standard match function that all models must provide.
  # We say a line is English if the score computed above is larger
  # than the threshold of 0.2. (Thus if 20% of the tokens could be English.)
  def matches( line )
    score( line ) > @threshold
  end
end
Out[5]:
So, let's fire up the parser once again and see how well our second iteration English model performs.
In [6]:
text = OCRParser.new
text.load_text( './resources/Bouwman_ Of Reynaert the Fox.txt' )
text.models = [ Empty.new, Numbers.new, FootNote.new, AllCaps.new, EnglishSecondIteration.new ]
parsed = text.parse()
puts parsed.join( "\n" )
Alright, that seems to yield more Middle Dutch lines in any case.
Willem die Madocke maecte, [192va,22]
daer hi dicken omme waecte,
hem vernoyde so haerde
dat die avonture van Reynaerde
in Dietsche onghemaket bleven
— die Arnout niet hevet vulscreven —
dat hi die vijte dede soucken
ende hise na den Walschen boucken
in Dietsche dus hevet begonnen.
But unfortunately also more English lines, as in:
die in groeter hovesscheden
Prologue
was so extremely annoyed
remained unwritten in Dutch
gherne keert hare saken.
or:
Mi hevet Reynaert, dat felle dier,
inflicted upon me by Reynaert,
so vele te leede ghedaen,
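Before looking at a remedy, it is worth checking the scoring arithmetic to see why a line such as "was so extremely annoyed" slips through. The following back-of-the-envelope calculation is a hypothetical check, assuming that "extremely" and "annoyed" do not appear in the stop word file (which the output above suggests):

score = 0.4 + 0.4    # "was" and "so" are on the 0.4-weighted @may_be_middle_dutch list
tokens = 4           # "was", "so", "extremely", "annoyed"
puts score / tokens  # => 0.2, which is not *strictly greater* than the 0.2 threshold,
                     #    so the line is not marked as English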
Intuitively this seems logical: if we allow words to be identified less decisively as English, then fewer sentences will end up being identified as such. But how can we further improve our selection? The last example gives a hint: the English line is part of a longer section of translated Middle Dutch, but it appears here isolated from its sibling English sentences. In other words: amid all the English lines the matching algorithm suddenly decides that one line is not English. We can make use of the fact that this is unlikely. We will rewrite the English model such that if a line is not identified as English but both the previous and the next lines are, then we will identify the line itself as English as well.
At the least our English model will require knowledge about its textual context. As you may gauge from the models, each model only knows about the very line it is trying to match. Thus the parser needs a way of letting the models know which lines precede and follow. We can do so by changing the super class Model. All models will have a variable called 'line_context' that is settable by the parser (this is what the 'attr_accessor' bit guarantees).
In [7]:
class Model
  attr_accessor :line_context

  # A class instance variable that holds a list of
  # other models that terminate this model.
  @terminators = nil

  def self.terminators
    @terminators
  end

  # Determines if the model matches a line of text.
  # By default it returns false because it doesn't match anything.
  def matches( line )
    false
  end
end
Out[7]:
Now we need an object that can represent and hold the context of a line (i.e. 'know' or store the 10 lines before and after the current line). Let's call this object LineContext. On creation—with LineContext.new, which invokes the initialize() method—a LineContext object receives all lines of the text and the index of the line that is currently being parsed. It then stores the 10 lines before the current line in 'previous_lines' and the 10 lines after it in 'next_lines'.
In [8]:
class LineContext
  attr_reader :previous_lines
  attr_reader :next_lines

  def initialize( lines, index )
    # Store the 10 lines before the current line (nearest line first)
    # and the 10 lines after it.
    reversed_index = lines.size - index
    @previous_lines = lines.reverse[reversed_index, 10]
    @next_lines = lines[index + 1, 10]
  end
end
Out[8]:
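A small hypothetical example (not one of the executed cells) shows what a LineContext holds. Note that previous_lines stores the nearest preceding line first, which the English model below relies on when it inspects prev[0]:

lines = [ "line 1", "line 2", "line 3", "line 4", "line 5" ]
context = LineContext.new( lines, 2 )   # context for "line 3"
p context.previous_lines   # => ["line 2", "line 1"] (nearest preceding line first)
p context.next_lines       # => ["line 4", "line 5"]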
We will also have to adapt our parser, because it needs to inform each model of its context. To do so we change the match_lines() method.
In [9]:
class OCRParser
  attr_accessor :models

  def text=( text )
    @text = text
    @lines = text.split( "\n" )
  end

  def load_text( file_path )
    self.text = File.read( file_path )
  end

  def match_lines
    @lines.each_with_index do |line, index|
      matches = []
      @models.each do |model|
        # We provide each model with some context of the line it will be working on…
        model.line_context = LineContext.new( @lines, index )
        if model.matches( line )
          matches.push( model.class )
        end
      end
      yield line, matches
    end
  end

  def parse_tuples
    active_multiline_models = []
    match_lines do |line, matches|
      matches.each do |model|
        # A match may terminate multiline models that are currently active…
        active_multiline_models.reject! do |active_multiline_model|
          active_multiline_model.terminators.include? model
        end
        # …and a model that has terminators is itself a multiline model
        # that stays active until one of its terminators matches.
        if model.terminators != nil
          active_multiline_models.push( model )
        end
      end
      # A line is accepted only if no model matched it and no multiline model is active.
      if matches.size == 0 && active_multiline_models.size == 0
        yield true, line, matches
      else
        yield false, line, matches
      end
    end
  end

  def parse
    tuples = []
    parse_tuples { |accept, line| tuples.push line if accept }
    tuples
  end
end
Out[9]:
Now we can finally adapt the English model so that it takes into account the lines prior to and following the line it is trying to match. The matches() method is expanded to check whether the preceding and following lines are English according to the scoring function. If so, the current line is marked as English even if the scoring function did not judge it to be.
In [10]:
class EnglishThirdIteration < Model
  # Whenever this model is created/used it loads a number of variables,
  # e.g. the list of English stop words (@stopwords), which is read from a file.
  # There are English stopwords that are homonyms of Middle Dutch words. We do not
  # want to count these as English, therefore we subtract the set of these words
  # (@may_be_middle_dutch) from the set of English stopwords.
  def initialize
    @may_be_middle_dutch = [ "an", "as", "been", "by", "have", "he", "her",
                             "here", "i", "in", "is", "me", "mine", "no", "so", "over", "was", "we" ]
    @stopwords = File.read( './resources/stopwords_en.txt' ).split( "\n" ) - @may_be_middle_dutch
    @threshold = 0.2
  end

  # Sets the threshold, 0.2 (20%) by default.
  def threshold=( new_threshold )
    @threshold = new_threshold
  end

  # Some words look like "been." or "her?"; we strip the punctuation to make sure we
  # don't miss any English words while matching them ("been." for a computer is
  # obviously not the same as "been").
  def strip_embracing_punctuation( token )
    token.gsub( /[\.:;'“‘’”?!\(\),]+$|^[\.:;'“‘’”?!\(\),]+/, '' )
  end

  # This computes the 'English score' for a line.
  # The line is first split into its individual tokens.
  # Then we count all English stopwords with a weight of 1.
  # Words that *might* be English but *could* also be Middle Dutch
  # are counted with a weight of 0.4.
  # Finally we compute the relative score, that is: the count of English
  # words divided by the total number of tokens on the line.
  def score( string )
    score = 0.0
    tokens = string.split( /\s+/ )
    tokens.each do |token|
      stripped = strip_embracing_punctuation( token )
      score += 1.0 if @stopwords.include?( stripped.downcase )
      score += 0.4 if @may_be_middle_dutch.include?( stripped.downcase )
    end
    score / tokens.size
  end

  # Below is the adapted part of the English model. The standard function
  # 'matches( line )' that all models must provide first tests if the
  # line is English according to the scoring function.
  # We say a line is English if the score computed above is larger
  # than the threshold of 0.2. (Thus if 20% of the tokens could be English.)
  # The matches method then continues with an additional check in case
  # the line is found to be not English. In that case the line is still
  # marked as English if it is wedged in between two lines that are
  # identified as being English.
  def matches( line )
    match = above_threshold( line )
    if @line_context != nil && !match
      empty_model = Empty.new
      if !empty_model.matches( line )
        # Post correction: if the line sits in between two English matches,
        # it probably should be matched too.
        prev = @line_context.previous_lines.reject { |context_line| empty_model.matches( context_line ) }
        succ = @line_context.next_lines.reject { |context_line| empty_model.matches( context_line ) }
        previous_matches = above_threshold( prev[0] ) if prev.size > 0
        next_matches = above_threshold( succ[0] ) if succ.size > 0
        match = true if previous_matches && next_matches
      end
    end
    match
  end

  def above_threshold( line )
    score( line ) > @threshold
  end
end
Out[10]:
And now we are ready again to test the third iteration of the English model. Engine away…
In [11]:
text = OCRParser.new
text.load_text( './resources/Bouwman_ Of Reynaert the Fox.txt' )
text.models = [ Empty.new, Numbers.new, FootNote.new, AllCaps.new, EnglishThirdIteration.new ]
parsed = text.parse()
puts parsed.join( "\n" )
Excellent: apart from two rogue English relicts this looks like a clean text. We can add some hints to our filtering mechanism to make sure these relicts are filtered out as well. These hints are equivalent to a scholarly editor deciding 'no, that is not supposed to be part of the witness text'.
In [12]:
class EnglishFourthIteration < Model
  # Whenever this model is created/used it loads a number of variables,
  # e.g. the list of English stop words (@stopwords), which is read from a file.
  # There are English stopwords that are homonyms of Middle Dutch words. We do not
  # want to count these as English, therefore we subtract the set of these words
  # (@may_be_middle_dutch) from the set of English stopwords.
  # Hints may be provided to recognize particular cases, that is: if we know certain
  # words definitely indicate an English line we can add these to the set of hints.
  def initialize
    @may_be_middle_dutch = [ "an", "as", "been", "by", "have", "he", "her",
                             "here", "i", "in", "is", "me", "mine", "no", "so", "over", "was", "we" ]
    @hints = [ "prologue", "ofone" ]
    @stopwords = File.read( './resources/stopwords_en.txt' ).split( "\n" ) - @may_be_middle_dutch + @hints
    @threshold = 0.2
  end

  # Sets the threshold, 0.2 (20%) by default.
  def threshold=( new_threshold )
    @threshold = new_threshold
  end

  # Some words look like "been." or "her?"; we strip the punctuation to make sure we
  # don't miss any English words while matching them ("been." for a computer is
  # obviously not the same as "been").
  def strip_embracing_punctuation( token )
    token.gsub( /[\.:;'“‘’”?!\(\),]+$|^[\.:;'“‘’”?!\(\),]+/, '' )
  end

  # This computes the 'English score' for a line.
  # The line is first split into its individual tokens.
  # Then we count all English stopwords with a weight of 1.
  # Words that *might* be English but *could* also be Middle Dutch
  # are counted with a weight of 0.4.
  # Finally we compute the relative score, that is: the count of English
  # words divided by the total number of tokens on the line.
  def score( string )
    score = 0.0
    tokens = string.split( /\s+/ )
    tokens.each do |token|
      stripped = strip_embracing_punctuation( token )
      score += 1.0 if @stopwords.include?( stripped.downcase )
      score += 0.4 if @may_be_middle_dutch.include?( stripped.downcase )
    end
    score / tokens.size
  end

  # Below is the adapted part of the English model. The standard function
  # 'matches( line )' that all models must provide first tests if the
  # line is English according to the scoring function.
  # We say a line is English if the score computed above is larger
  # than the threshold of 0.2. (Thus if 20% of the tokens could be English.)
  # The matches method then continues with an additional check in case
  # the line is found to be not English. In that case the line is still
  # marked as English if it is wedged in between two lines that are
  # identified as being English.
  def above_threshold( line )
    score( line ) > @threshold
  end

  def matches( line )
    match = above_threshold( line )
    if @line_context != nil && !match
      empty_model = Empty.new
      if !empty_model.matches( line )
        # Post correction: if the line sits in between two English matches,
        # it probably should be matched too.
        prev = @line_context.previous_lines.reject { |context_line| empty_model.matches( context_line ) }
        succ = @line_context.next_lines.reject { |context_line| empty_model.matches( context_line ) }
        previous_matches = above_threshold( prev[0] ) if prev.size > 0
        next_matches = above_threshold( succ[0] ) if succ.size > 0
        match = true if previous_matches && next_matches
      end
    end
    match
  end
end
Out[12]:
That should do it. Let's test drive this and see what the result is…
In [13]:
text = OCRParser.new
text.load_text( './resources/Bouwman_ Of Reynaert the Fox.txt' )
text.models = [ Empty.new, Numbers.new, FootNote.new, AllCaps.new, EnglishFourthIteration.new ]
parsed = text.parse()
puts parsed.join( "\n" )
That looks like the actual Middle Dutch text we were looking for. There are still things to fix, though. There are folio and column markers that are obviously not part of the original text. And there are OCR mistakes, as in "sconinX". However, we will consider what to do with these later. For now we have our 'clean' text, and we can go on to OO modeling it.
1) 'Performance' is an ambiguous term in this context, as it is also used by programmers to indicate the speed at which a program executes, and code is often rewritten to improve exactly that speed. However, unless otherwise indicated, I use the term 'performance' to refer to what the code does: what it shows, its output, and the tasks it conducts.