Near matching is another term for fuzzy matching; it is based on the idea that two items (such as two word tokens in a collation) should sometimes be considered a match even when they are not string equal (that is, not identical in every character). More precisely, near matching is a strategy for finding the closest match in situations where there may not be an exact match.
Consider the following example from the Rus′ Primary Chronicle:
The last column contains slightly differing forms of fraci, but the second witness, Tro, reads fraki. Normalization, including Soundex, could take care of the slight variation during Gothenburg stage 2, but we don’t want to merge c and k globally, because the difference between them is usually significant.
When CollateX cannot find an exact match and there is more than one possible alignment for a token, it defaults to placing the token to the left. This means that without near matching, fraki, which is not an exact match for either i or fraci, would be misaligned. With near matching, though, CollateX can recognize that fraki is more like fraci than it is like i, and thus place it in the correct (right) column.
The way near matching currently works in CollateX is that if the user has turned it on (it is off by default), after the alignment stage has been completed, the system looks for situations that cannot be resolved solely by exact matching, that is, entirely at the alignment stage. Those situations must show both of the following properties: a token in the shorter witness must have more than one possible placement (because the witnesses differ in length at that point, leaving a gap), and that token must not be an exact match for the token in any of those possible placements.
If and only if both of those conditions are met, CollateX compares the floating token in the shorter witness (“gray” in the example above) to all possible matches (“big” and “grey” in this example) and calculates the nearest match using a measure called edit distance or Levenshtein distance (see https://en.wikipedia.org/wiki/Edit_distance for more information). CollateX calculates the edit distance between the floating token and the tokens in the other witnesses at all of the locations where it could be placed, and determines the best placement.
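As a minimal sketch of the idea (not CollateX’s internal implementation), the following cell uses the Python Levenshtein package, discussed further below, to compare a floating token against the tokens at its possible placements and keep the candidate with the smallest edit distance:
In [ ]:
import Levenshtein
floating_token = 'gray'       # token in the shorter witness
candidates = ['big', 'grey']  # tokens at the possible placements
# keep the candidate with the smallest edit distance to the floating token
print(min(candidates, key=lambda candidate: Levenshtein.distance(floating_token, candidate)))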
Calculating exact matches is relatively efficient computationally. If the strings are different lengths, we don’t have to look at any characters to know that they don’t match. If they’re the same length, we can look character by character, and as soon as we hit a non-match, we don’t have to look further. (This is not necessarily how exact matching would be implemented; there may be other tricks that would get a quick yes-or-no answer.)
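Here is a toy illustration of that short-circuiting (again, not how CollateX actually implements exact matching): different lengths give an immediate “no,” and a character-by-character scan can stop at the first mismatch:
In [ ]:
def exact_match(a, b):
    # different lengths: no need to look at any characters
    if len(a) != len(b):
        return False
    # same length: stop at the first character that differs
    for char_a, char_b in zip(a, b):
        if char_a != char_b:
            return False
    return True

print(exact_match('fraci', 'fraki'))
print(exact_match('fraci', 'fraci'))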
Calculating edit distance to find the closest match is expensive because the entire strings have to be evaluated. In a tradition with a large number of witnesses and large gaps, the number of comparisons grows quickly, which means that you don’t want to calculate edit distance except where you need to. Performing computationally inexpensive exact string matching first (in the alignment stage) and then calculating the more expensive edit distance (in the analysis stage) only where alignment has failed to give a satisfactory answer reduces the amount of computation required.
In [ ]:
from collatex import *
collation = Collation()
collation.add_plain_witness("A", "The gray koala")
collation.add_plain_witness("B", "The big grey koala")
alignment_table = collate(collation, segmentation=False)
print(alignment_table)
In [ ]:
from collatex import *
collation = Collation()
collation.add_plain_witness("A", "The gray koala")
collation.add_plain_witness("B", "The big grey koala")
alignment_table = collate(collation, segmentation=False, near_match=True)
print(alignment_table)
The edit distance between two strings is the number of single-character changes required to convert one into the other. The most common distance measure is called Levenshtein distance, and it counts insertions, deletions, and substitutions. A variant, Damerau-Levenshtein distance, also counts transpositions as a single step; in classic Levenshtein distance a transposition would be two steps, an insertion and a deletion.
When you installed CollateX, you installed the Python Levenshtein package, which is what CollateX uses to find the closest match. You can practice with it independently by changing the strings below to your own text:
In [ ]:
import Levenshtein
Levenshtein.distance('gray', 'grey')
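The same package illustrates the point about transpositions made above: swapping two adjacent characters, which Damerau-Levenshtein would count as a single step, costs two here:
In [ ]:
# 'form' -> 'from' swaps the adjacent 'o' and 'r': two steps in classic Levenshtein
Levenshtein.distance('form', 'from')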
The Levenshtein module does its comparison on the basis of Unicode values, so upper and lowercase characters are different:
In [ ]:
Levenshtein.distance('gray','Grey')
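If case differences should not count, one option is to normalize case before comparing, for example:
In [ ]:
# lowercase both strings first, so that only the a/e difference remains
Levenshtein.distance('gray'.lower(), 'Grey'.lower())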
How useful is Levenshtein distance as a way of finding the closest match? That is, are all single-character differences of equal importance philologically? What about the following:
In [ ]:
print(Levenshtein.distance('color', 'colour'))
print(Levenshtein.distance('clod', 'cloud'))
What about:
In [ ]:
print(Levenshtein.distance('book', 'books'))
print(Levenshtein.distance('book', 'cook'))
Normalization happens in at least three places in the Gothenburg model. Officially it belongs at stage 2, between tokenization and alignment. But we cannot avoid normalizing when we transcribe handwriting as Unicode characters, since handwriting varies infinitely, often in ways that we don’t care about. And because alignment (in CollateX; a different alignment module might operate differently) works on exact matching, applying near matching during analysis is also a sort of normalization.
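As a minimal illustration of normalizing before alignment (one option among several, and a simplification, since lowercasing the input discards the original spelling instead of keeping original and normalized forms side by side), we could normalize the witness text ourselves before handing it to CollateX:
In [ ]:
from collatex import *
collation = Collation()
# lowercasing here stands in for whatever normalization the project calls for;
# note that it throws away the original capitalization
collation.add_plain_witness("A", "The Gray koala".lower())
collation.add_plain_witness("B", "The big gray Koala".lower())
# after lowercasing, "gray" is an exact match, so no near matching is needed to place it
alignment_table = collate(collation, segmentation=False)
print(alignment_table)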
So: How do we decide what to normalize at which stage?