Why tokenize your own texts?
By default CollateX splits the witness text on whitespace and punctuation. This usually means that words will be your unit of collation.
But what about "can't", "A'dam", "Peter's"?
Tokenizing the texts yourself lets you define the unit of comparison and thus gives you more control over the collation.
The downside is that you have to learn how to offer a pretokenized data structure to CollateX.
Suppose we have two witnesses:
witness_A = "Peter's cat"
witness_B = "Peter's dog"
Problem: "Peter's cat" will be tokenized as "Peter", "'", "s" , "cat"
We will use JSON to tell CollateX to 'read' and collate the witnesses differently.
JSON data is a mixture of arrays (lists) and objects (where a key is associated with a value):
Arrays look like this:
[ "i", "am", "the", "words", "in", "an", "array" ]
Objects look like this:
{ "a_variable_name": "My first value", "another_varible": "Another thingy" }
In JSON you may combine these structures:
{ "a_witness_object": { "siglum": "A", "tokens": [] } }
Or the same, laid out so that it is somewhat easier to read:
{ "a_witness_object":
{
"siglum": "A",
"tokens": []
}
}
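In Python these JSON structures correspond to lists and dictionaries, and the standard library json module converts between the two. A quick illustration (nothing CollateX-specific yet):

import json

# Build the structure above as an ordinary Python dictionary...
a_witness_object = { "a_witness_object": { "id": "A", "tokens": [] } }
# ...and serialize it to indented JSON text.
print( json.dumps( a_witness_object, indent=2 ) )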
In [1]:
from collatex import *
In [4]:
collation = Collation()
collation.add_plain_witness( "A", "Peter's cat")
collation.add_plain_witness( "B", "Peter's dog" )
alignment_table = collate( collation, layout='vertical', segmentation=False )
In [5]:
print( alignment_table )
In [6]:
tokens_a = [ { "t": "Peter's" }, { "t": "cat" } ]
tokens_b = [ { "t": "Peter's" }, { "t": "dog" } ]
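Each token is an object whose "t" key holds the token text. Tokens may carry extra properties as well; for example, CollateX can use an optional "n" key as a normalized form for matching. The token list below (tokens_c, with made-up values) is only an illustration and is not used in the collation that follows:

# Sketch: tokens where "n" supplies a normalized (e.g. lowercased) form for matching
tokens_c = [ { "t": "Peter's", "n": "peter's" }, { "t": "Cat", "n": "cat" } ]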
In [7]:
witness_a = { "id": "A", "tokens": tokens_a }
In [8]:
print( witness_a )
In [9]:
witness_b = { "id": "B", "tokens": tokens_b }
In [10]:
JSON_input = { "witnesses": [ witness_a, witness_b ] }
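Because JSON_input is built from ordinary dictionaries and lists, it maps directly onto the JSON layout shown earlier. If you want to inspect it as JSON text, the standard json module will serialize it:

import json

# Serialize the pretokenized input to indented JSON text for inspection.
print( json.dumps( JSON_input, indent=2 ) )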
In [11]:
result = collate_pretokenized_json( JSON_input, output='table' )
In [12]:
print( result )