Near matching tests

Near matching as of 2016-07-23 (djb)

Analytic framework

The witnesses start and end with perfect matches (abcd and efgh, respectively). Witness A has one token in the middle (0123) and witness B has two (012x, 01xx) or three (012x, 01xx, 0xxx). The two or three candidates for alignment in witness B are all partial matches to the middle token in A, with different degrees of similarity. All permutations of the candidates in B are tested to determine whether A is aligned with the correct one.
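
The "degrees of similarity" among the candidates can be made concrete with a small sketch. It assumes plain edit (Levenshtein) distance as the measure, which this notebook does not confirm is what CollateX uses internally; `edit_distance` is a hypothetical helper written for illustration and is not part of the collatex API.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

# Assumption: a smaller distance means a closer (better) near match.
target = "0123"
candidates = ["012x", "01xx", "0xxx"]
ranked = sorted(candidates, key=lambda token: edit_distance(target, token))
for token in ranked:
    print(token, edit_distance(target, token))
```

Under that assumption the ranking is 012x, then 01xx, then 0xxx, which is the "match rank" order (0 is closest) used in the code comments throughout.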

Without near matching, candidate always stays left, even if right is closer

Not the desired output: Here 0123 in A is closer to 012x (right) than to 01xx (left), but it stays left anyway.


In [63]:
%reload_ext autoreload
%autoreload 2
from collatex import *
collation = Collation()
collation.add_plain_witness("A", "abcd 0123 efgh")
collation.add_plain_witness("B", "abcd 01xx 012x efgh")
alignment_table = collate(collation, segmentation=False)
print(alignment_table)


+---+------+------+------+------+
| A | abcd | 0123 | -    | efgh |
| B | abcd | 01xx | 012x | efgh |
+---+------+------+------+------+

With near matching and two choices, candidate is aligned correctly

In the example below, 0123 in A is closer to 012x (left) in B, and it correctly stays left.


In [64]:
# Two candidates
# With near matching, it goes to the closer match, whether that's left or right
# Closer match is left, no movement
collation = Collation()
collation.add_plain_witness("A", "abcd 0123 efgh")
collation.add_plain_witness("B", "abcd 012x 01xx efgh")
alignment_table = collate(collation, near_match=True, segmentation=False)
print(alignment_table)


+---+------+------+------+------+
| A | abcd | 0123 | -    | efgh |
| B | abcd | 012x | 01xx | efgh |
+---+------+------+------+------+

In the example below, 0123 in A is closer to 012x (right) in B, and it correctly moves right.


In [65]:
# Two candidates
# With near matching, it goes to the closer match, whether that's left or right
# Same tokens as above, but with the B candidates in reverse order; closer match is right, so it moves
collation = Collation()
collation.add_plain_witness("A", "abcd 0123 efgh")
collation.add_plain_witness("B", "abcd 01xx 012x efgh")
alignment_table = collate(collation, near_match=True, segmentation=False)
print(alignment_table)


+---+------+------+------+------+
| A | abcd | -    | 0123 | efgh |
| B | abcd | 01xx | 012x | efgh |
+---+------+------+------+------+

With near matching and three choices, the alignment is correct regardless of candidate order

If the closest match is left, the candidate correctly always stays left


In [66]:
# Three candidates, closest is left, match rank 0 1 2 (0 is closest)
# Should stay left; succeeds
collation = Collation()
collation.add_plain_witness("A", "abcd 0123 efgh")
collation.add_plain_witness("B", "abcd 012x 01xx 0xxx efgh")
alignment_table = collate(collation, near_match=True, segmentation=False)
print(alignment_table)


+---+------+------+------+------+------+
| A | abcd | 0123 | -    | -    | efgh |
| B | abcd | 012x | 01xx | 0xxx | efgh |
+---+------+------+------+------+------+

In [67]:
# Three candidates, closest is left, match rank 0 2 1 (0 is closest)
# Should stay left; succeeds
collation = Collation()
collation.add_plain_witness("A", "abcd 0123 efgh")
collation.add_plain_witness("B", "abcd 012x 0xxx 01xx efgh")
alignment_table = collate(collation, near_match=True, segmentation=False)
print(alignment_table)


+---+------+------+------+------+------+
| A | abcd | 0123 | -    | -    | efgh |
| B | abcd | 012x | 0xxx | 01xx | efgh |
+---+------+------+------+------+------+

If the closest match is right, the candidate correctly always moves right


In [68]:
# Three candidates, closest is right, match rank 1 2 0 (0 is closest)
# Should go right; succeeds
collation = Collation()
collation.add_plain_witness("A", "abcd 0123 efgh")
collation.add_plain_witness("B", "abcd 01xx 0xxx 012x efgh")
alignment_table = collate(collation, near_match=True, segmentation=False)
print(alignment_table)


+---+------+------+------+------+------+
| A | abcd | -    | -    | 0123 | efgh |
| B | abcd | 01xx | 0xxx | 012x | efgh |
+---+------+------+------+------+------+

In [69]:
# Three candidates, closest is right, match rank 2 1 0 (0 is closest)
# Should go right; succeeds
collation = Collation()
collation.add_plain_witness("A", "abcd 0123 efgh")
collation.add_plain_witness("B", "abcd 0xxx 01xx 012x efgh")
alignment_table = collate(collation, near_match=True, segmentation=False)
print(alignment_table)


+---+------+------+------+------+------+
| A | abcd | -    | -    | 0123 | efgh |
| B | abcd | 0xxx | 01xx | 012x | efgh |
+---+------+------+------+------+------+

If the closest match is in the middle, the candidate correctly always moves to the middle


In [70]:
# Three candidates, closest is middle, match rank 1 0 2 (0 is closest)
# Should move to the middle; succeeds
collation = Collation()
collation.add_plain_witness("A", "abcd 0123 efgh")
collation.add_plain_witness("B", "abcd 01xx 012x 0xxx efgh")
alignment_table = collate(collation, near_match=True, segmentation=False)
print(alignment_table)


+---+------+------+------+------+------+
| A | abcd | -    | 0123 | -    | efgh |
| B | abcd | 01xx | 012x | 0xxx | efgh |
+---+------+------+------+------+------+

In [71]:
# Three candidates, closest is middle, match rank 2 0 1 (0 is closest)
# Should move to the middle; succeeds
collation = Collation()
collation.add_plain_witness("A", "abcd 0123 efgh")
collation.add_plain_witness("B", "abcd 0xxx 012x 01xx efgh")
alignment_table = collate(collation, near_match=True, segmentation=False)
print(alignment_table)


+---+------+------+------+------+------+
| A | abcd | -    | 0123 | -    | efgh |
| B | abcd | 0xxx | 012x | 01xx | efgh |
+---+------+------+------+------+------+
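
The six three-candidate cells in this section cover every ordering of 012x, 01xx and 0xxx in witness B. As a rough sketch of how those inputs could be generated rather than typed by hand, the block below builds each witness B string and records the slot where 0123 is expected to land. `witness_b_variants` is a hypothetical helper, not a collatex function, and the block does not call `collate` itself.

```python
from itertools import permutations

# Candidates ordered from closest to farthest from 0123 (match rank 0, 1, 2).
CANDIDATES = ("012x", "01xx", "0xxx")
CLOSEST = CANDIDATES[0]

def witness_b_variants(candidates=CANDIDATES):
    """Yield (witness B text, expected 0-based slot of 0123 among the candidates)."""
    for order in permutations(candidates):
        text = "abcd " + " ".join(order) + " efgh"
        yield text, order.index(CLOSEST)

variants = list(witness_b_variants())
for text, slot in variants:
    print(f"{text!r}: 0123 expected in candidate slot {slot}")
```

Each generated string can be passed to `collation.add_plain_witness("B", ...)` exactly as in the cells above, and the expected slot compared with the column in which 0123 appears in the printed table.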

Three witnesses, two of which have gaps

We expect:

+---+------+--------+--------+--------+--------+--------+------+
| A | abcd | -      | -      | 012345 | -      | -      | efgh |
| B | abcd | 0xxxxx | 01xxxx | 01234x | 012xxx | 0123xx | efgh |
| C | abcd | -      | 01xxxx | -      | -      | zz23xx | efgh |
+---+------+--------+--------+--------+--------+--------+------+

In [72]:
%reload_ext autoreload
%autoreload 2
from collatex import *
collation = Collation()
collation.add_plain_witness("A", "abcd 012345 efgh")
collation.add_plain_witness("B", "abcd 0xxxxx 01xxxx 01234x 012xxx 0123xx efgh")
collation.add_plain_witness("C", "abcd 01xxxx zz23xx efgh")
alignment_table = collate(collation, segmentation=False, near_match=True)
print(alignment_table)


+---+------+--------+--------+--------+--------+--------+------+
| A | abcd | -      | -      | 012345 | -      | -      | efgh |
| B | abcd | 0xxxxx | 01xxxx | 01234x | 012xxx | 0123xx | efgh |
| C | abcd | -      | 01xxxx | -      | -      | zz23xx | efgh |
+---+------+--------+--------+--------+--------+--------+------+
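
Comparing the expected and actual tables by eye is error-prone, so the following sketch parses the printed table into rows and checks them against the expectation. It works on the printed text only; `parse_alignment_table` is a hypothetical helper written for this illustration and makes no assumptions about the structure of the object returned by `collate`.

```python
def parse_alignment_table(text: str) -> dict:
    """Map each witness siglum to its list of cells, skipping +---+ border lines."""
    rows = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # border line such as +---+------+
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        rows[cells[0]] = cells[1:]
    return rows

# Output of the cell above, pasted as text.
actual = """
+---+------+--------+--------+--------+--------+--------+------+
| A | abcd | -      | -      | 012345 | -      | -      | efgh |
| B | abcd | 0xxxxx | 01xxxx | 01234x | 012xxx | 0123xx | efgh |
| C | abcd | -      | 01xxxx | -      | -      | zz23xx | efgh |
+---+------+--------+--------+--------+--------+--------+------+
"""

expected = {
    "A": ["abcd", "-", "-", "012345", "-", "-", "efgh"],
    "B": ["abcd", "0xxxxx", "01xxxx", "01234x", "012xxx", "0123xx", "efgh"],
    "C": ["abcd", "-", "01xxxx", "-", "-", "zz23xx", "efgh"],
}

rows = parse_alignment_table(actual)
print(rows == expected)
```

The same helper could be applied to the two- and three-candidate tables earlier in the notebook, since they share the same printed layout.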