Near matching

What is near matching and why do we use it?

Near matching is another term for fuzzy matching; that is, it is based on the idea that two items (such as two word tokens in a collation) should sometimes be considered matching even when they are not string equal (that is, not identical in every character). More precisely, near matching is a strategy for finding the closest match in situations where there may not be an exact match.

Consider the following example from the Rus′ primary chronicle:

The last column contains slightly differing forms of fraci, but the second witness, Tro, reads fraki. Normalization (including Soundex) could take care of this slight variation during Gothenburg stage 2, but we don’t want to merge c and k globally because the difference between them is usually significant.

When CollateX cannot find an exact match and there is more than one possible alignment for a token, it defaults to placing the token to the left. This means that without near matching, fraki, which does not match either i or fraci perfectly, would be misaligned. With near matching, though, CollateX can recognize that fraki is more like fraci than it is like i, and thus place it in the correct (right-hand) column.
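As an informal illustration (not CollateX’s internal code), we can use the Python Levenshtein package, which is introduced later in this notebook, to compare the transliterated forms and see why fraki belongs with fraci rather than with i:


In [ ]:
import Levenshtein
# "fraki" differs from "fraci" by a single substitution (k vs. c) ...
print(Levenshtein.distance('fraki', 'fraci'))  # 1
# ... but differs from "i" by four characters
print(Levenshtein.distance('fraki', 'i'))      # 4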

How does near matching work in CollateX?

The way near matching currently works in CollateX is as follows: if the user has turned it on (it is off by default), then after the alignment stage has been completed, the system looks for situations that could not be resolved by exact matching alone, that is, entirely at the alignment stage. Those situations must show both of the following properties:

  1. Different numbers of tokens: Between two blocks (vertical sets of tokens) that are firmly aligned there is an unequal number of tokens in the witnesses. In the koala example used in the code below (“The gray koala” vs. “The big grey koala”), the alignment of “The” and “koala” is clear because both witnesses have the identical reading (that is, they are complete vertical blocks), but in one witness there is one token between them and in the other witness there are two tokens. If, on the other hand, the two witnesses read “The gray koala” and “The grey koala”, although the middle tokens don’t match, there’s no ambiguity because the alignment is forced: since “gray” and “grey” are sandwiched between perfect matches before and after, they can only be aligned with each other, so there is no need for near matching.
  2. No exact match: The tokens in the shorter witness do not have an exact string match in the longer witness. If they did have an exact match, it would have been found at the alignment stage and there would be no need for near matching. For example, if the two witnesses read “The grey koala” and “The big grey koala”, although there are three tokens in the first witness and four in the second, each token in the first has an exact string match in the second, which means that it can be aligned at the alignment stage.

If and only if both of those conditions are met, CollateX compares the floating token in the shorter witness (“gray” in the example above) to all possible matches (“big” and “grey” in this example) and calculates the nearest match using a measure called edit distance or Levenshtein distance (see https://en.wikipedia.org/wiki/Edit_distance for more information). CollateX calculates the edit distance between the floating token and the tokens in the other witnesses at all of the locations where it could be placed, and determines the best placement.
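The following cell sketches that decision. It is only an illustration of the logic, not CollateX’s internal code: the floating token is compared to each candidate, and the candidate with the smallest edit distance wins.


In [ ]:
import Levenshtein

floating = 'gray'             # token with no exact match in the other witness
candidates = ['big', 'grey']  # tokens it could be aligned with
distances = {c: Levenshtein.distance(floating, c) for c in candidates}
print(distances)                          # {'big': 4, 'grey': 1}
print(min(distances, key=distances.get))  # 'grey' is the nearest match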

Edit distance and computational complexity

Calculating exact matches is relatively efficient computationally. If the strings are different lengths, we don’t have to look at any characters to know that they don’t match. If they’re the same length, we can look character by character, and as soon as we hit a non-match, we don’t have to look further. (This is not necessarily how exact matching would be implemented; there may be other tricks that would get a quick yes-or-no answer.)
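Here is a naive sketch of that short-circuiting idea, for illustration only (in Python one would normally just write a == b):


In [ ]:
def exact_match(a, b):
    if len(a) != len(b):    # different lengths: no need to look at any characters
        return False
    for x, y in zip(a, b):  # same length: stop at the first mismatching character
        if x != y:
            return False
    return True

print(exact_match('koala', 'koala'))  # True
print(exact_match('gray', 'grey'))    # False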

Calculating edit distance to find the closest match is expensive because the entire strings have to be evaluated; the standard dynamic-programming algorithm does work proportional to the product of the lengths of the two strings. In a tradition with a large number of witnesses and large gaps, the number of comparisons grows quickly, which means that you don’t want to calculate edit distance except where you need to. Performing computationally inexpensive exact string matching first (in the alignment stage) and then calculating the more expensive edit distance (in the analysis stage) only where alignment has failed to give a satisfactory answer reduces the amount of computation required.

Example of near matching in CollateX


In [ ]:
from collatex import *
collation = Collation()
collation.add_plain_witness("A", "The gray koala")
collation.add_plain_witness("B", "The big grey koala")
# near matching is off by default, so the ambiguous token is placed by the default (leftward) rule
alignment_table = collate(collation, segmentation=False)
print(alignment_table)

In [ ]:
from collatex import *
collation = Collation()
collation.add_plain_witness("A", "The gray koala")
collation.add_plain_witness("B", "The big grey koala")
# near_match=True turns on near matching, so "gray" is aligned with its closest match, "grey"
alignment_table = collate(collation, segmentation=False, near_match=True)
print(alignment_table)

Edit distance

The edit distance between two strings is the number of single-character changes required to convert one into the other. The most common distance measure is called Levenshtein distance, and it counts insertions, deletions, and substitutions. A variant, Damerau-Levenshtein distance, also counts transpositions, which in classic Levenshtein distance would be two steps, an insertion and a deletion.
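For illustration, here is a textbook dynamic-programming implementation of Levenshtein distance. This is only a sketch of the algorithm, not the code CollateX itself uses (as noted below, CollateX relies on the Python Levenshtein package):


In [ ]:
def levenshtein(a, b):
    # previous row of the distance matrix: distance from '' to each prefix of b
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (x != y),  # substitution (free if the characters match)
            ))
        previous = current
    return previous[-1]

print(levenshtein('gray', 'grey'))  # 1
print(levenshtein('gray', 'big'))   # 4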

When you installed CollateX, you installed the Python Levenshtein package, which is what CollateX uses to find the closest match. You can practice with it independently by changing the strings below to your own text:


In [ ]:
import Levenshtein
Levenshtein.distance('gray', 'grey')

The Levenshtein module does its comparison on the basis of Unicode values, so uppercase and lowercase characters are different:


In [ ]:
Levenshtein.distance('gray','Grey')
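If case differences should not count for your purposes, one option (shown here only as an illustration) is to normalize the strings, for example by lowercasing them, before comparing:


In [ ]:
Levenshtein.distance('gray'.lower(), 'Grey'.lower())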

How useful is Levenshtein distance as a way of finding the closest match? That is, are all single-character differences of equal importance philologically? What about the following:


In [ ]:
print(Levenshtein.distance('color','colour'))
print(Levenshtein.distance('clod', 'cloud'))

What about:


In [ ]:
print(Levenshtein.distance('book', 'books'))
print(Levenshtein.distance('book', 'cook'))

Normalization (Gothenburg stage 2) and analysis (Gothenburg stage 4)

Normalization happens in at least three places in the Gothenburg model. Officially it’s at stage 2, between tokenization and alignment. But we cannot avoid normalizing when we transcribe handwriting as Unicode characters, since the handwriting varies infinitely, often in ways that we don’t care about. And because alignment (in CollateX; a different alignment module might operate differently) works on exact matching, examining near matching during analysis is also a sort of normalization.

So: How do we decide what to normalize at which stage?