Why tokenize your own texts?
By default CollateX splits the witness text on whitespace and punctuation. This usually means that words will be your unit of collation.
But what about "can't", "A'dam", "Peter's"?
Tokenizing the texts yourself lets you define the unit of comparison and thus gives you more control over the collation.
The downside is that you have to learn how to offer a pretokenized data structure to CollateX.
Suppose we have two witnesses:
witness_A = "Peter's cat"
witness_B = "Peter's dog"
Problem: "Peter's cat" will be tokenized as "Peter", "'", "s" , "cat"
We will use JSON to tell CollateX to 'read' and collate the witnesses differently.
JSON data is a mixture of arrays (lists) and objects (where a key is associated with a value):
Arrays look like this:
[ "i", "am", "the", "words", "in", "an", "array" ]
Objects look like this:
{ "a_variable_name": "My first value", "another_varible": "Another thingy" }
In JSON you may combine these structures:
{ "a_witness_object": { "siglum": "A", "tokens": [] } }
Or the same, laid out so that it is somewhat easier to read:
{ "a_witness_object":
{
"siglum": "A",
"tokens": []
}
}
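In Python these JSON structures correspond to lists and dictionaries, and the standard library json module converts between the two. A quick illustration (nothing CollateX-specific yet):

import json

# Build the structure above as an ordinary Python dictionary...
a_witness_object = { "a_witness_object": { "id": "A", "tokens": [] } }
# ...and serialize it to indented JSON text.
print( json.dumps( a_witness_object, indent=2 ) )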
In [1]:
from collatex import *
In [4]:
collation = Collation()
collation.add_plain_witness( "A", "Peter's cat")
collation.add_plain_witness( "B", "Peter's dog" )
alignment_table = collate( collation, layout='vertical', segmentation=False )
In [5]:
print( alignment_table )
In [6]:
tokens_a = [ { "t": "Peter's" }, { "t": "cat" } ]
tokens_b = [ { "t": "Peter's" }, { "t": "dog" } ]
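Each token is an object whose "t" key holds the token text. Tokens may carry extra properties as well; for example, CollateX can use an optional "n" key as a normalized form for matching. The token list below (tokens_c, with made-up values) is only an illustration and is not used in the collation that follows:

# Sketch: tokens where "n" supplies a normalized (e.g. lowercased) form for matching
tokens_c = [ { "t": "Peter's", "n": "peter's" }, { "t": "Cat", "n": "cat" } ]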
In [7]:
witness_a = { "id": "A", "tokens": tokens_a }
In [8]:
print( witness_a )
In [9]:
witness_b = { "id": "B", "tokens": tokens_b }
In [10]:
JSON_input = { "witnesses": [ witness_a, witness_b ] }
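Because JSON_input is built from ordinary dictionaries and lists, it maps directly onto the JSON layout shown earlier. If you want to inspect it as JSON text, the standard json module will serialize it:

import json

# Serialize the pretokenized input to indented JSON text for inspection.
print( json.dumps( JSON_input, indent=2 ) )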
In [11]:
result = collate_pretokenized_json( JSON_input, output='table' )
In [12]:
print( result )