Tokenization

Default tokenization

Tokenization (the first of the five parts of the Gothenburg model) divides the texts to be collated into tokens, which are most commonly (but not obligatorily) words. By default CollateX considers punctuation to be its own token, which means that the witness readings “Hi!” and “Hi” will both contain a token that reads “Hi” (and the first witness will contain an additional token, which reads “!”). In this situation, that’s the behavior the user probably wants, since both witnesses contain what a human would recognize as the same word.
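For example, a minimal collation of those two readings (using the same CollateX calls that appear throughout this tutorial) makes the separate punctuation token visible; the “Hi” tokens align with each other, and the exclamation point ends up in a cell of its own:

from collatex import *
collation = Collation()
collation.add_plain_witness("A", "Hi!")
collation.add_plain_witness("B", "Hi")
# segmentation=False keeps every token in its own cell of the output table
table = collate(collation, segmentation=False)
print(table)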

Issues with default tokenization

But is a word like “Peter’s” the same word as “Peter” for collation purposes? Because CollateX will regard the apostrophe as a separate token, “Peter’s” will be tokenized as three tokens: the name, the apostrophe, and the possessive. Here’s the default behavior:


In [11]:
from collatex import *
collation = Collation()
collation.add_plain_witness("A", "Peter's cat.")
collation.add_plain_witness("B", "Peter's dog.")
table = collate(collation, segmentation=False)
print(table)


+---+-------+---+---+-----+---+
| A | Peter | ' | s | cat | . |
| B | Peter | ' | s | dog | . |
+---+-------+---+---+-----+---+

For possessives that may be acceptable behavior, but how about contractions like “didn’t” or “A’dam” (short for “Amsterdam”)? If the default tokenization does what you need, so much the better, but if not, you can override it according to your own requirements. Below we describe what CollateX does by default and how to override that behavior and perform your own tokenization.

How CollateX tokenizes: default behavior

The default tokenizer built into CollateX defines a token as a string of either alphanumeric characters (in any writing system) or non-alphanumeric characters, in both cases including any (optional) trailing whitespace. This means that the input reading “Peter’s cat.” will be analyzed as consisting of five tokens: “Peter” plus “’” plus “s ” plus “cat” plus “.”. For alignment purposes CollateX ignores any trailing whitespace, so that although “cat” in “The cat in the hat” would be tokenized as “cat ” (with a trailing space), it would still match the “cat” in “Peter’s cat.”, which has no trailing space because it is followed by a period.
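To make this concrete, here is a rough Python approximation of that rule; the regular expression below is our own illustration for this tutorial, not the pattern CollateX actually uses internally:

import re
# a run of word characters or a run of other non-space characters,
# each with any trailing whitespace attached
print(re.findall(r'\w+\s*|[^\w\s]+\s*', "Peter's cat."))
# ['Peter', "'", 's ', 'cat', '.']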

If we need to override the default tokenization behavior, we can create our own tokenized input and tell CollateX to use that, instead of letting CollateX perform the tokenization itself prior to collation.

Doing your own tokenization

In a way that is consistent with the modular design of the Gothenburg model, CollateX permits the user to change the tokenization without having to change the other parts of the collation process. Since the tokenizer passes to CollateX the indivisible units that are to be aligned, performing our own tokenization means specifying those units ourselves. Here’s a top-down description of how to do that.

Specifying the witnesses to be used in the collation

The format in which CollateX expects to receive our custom lists of tokens for all witnesses to be collated is a Python dictionary, which has the following structure:

{ "witnesses": [ witness_a, witness_b ] }

Curly braces delimit a Python dictionary, and its members (sometimes called “properties”; we’ll use the terms interchangeably) are each represented by a key and a value, separated from each other by a colon. This is a dictionary with one property, whose key is the string "witnesses" and whose value is a Python list (delimited by square brackets). The members of the list are separated by commas, so this list has two members, identified by the variable names witness_a and witness_b. At the moment those variables don’t point to anything, but we’ll fix that below by populating them with our own tokenization of our two witnesses. Doing our own tokenization, then, means building a dictionary like the one above and supplying our custom tokens in the witness_a and witness_b variables.

Specifying the siglum and token list for each witness

The witness data for each witness is a Python dictionary with exactly two properties, which have as keys the strings "id" and "tokens". The value of the "id" property is a string that represents the siglum of the witness. The value of the "tokens" property is a Python list of tokens to be aligned. In the structure below, the tokens are represented by a variable called tokens_a, which we’ll define presently:

witness_a = { "id": "A", "tokens": tokens_a }

Specifying the tokens for each witness

Each token for each witness is a Python dictionary with at least one member, whose key is "t" (for “text”); it is possible to specify additional properties for individual tokens, but we don’t do that here. A token for the string “cat” would look like:

{ "t": "cat" }

The key for every token is the string "t"; the value for this token is the string "cat". As noted above, the tokens for a witness are structured as a Python list, so we would tokenize our first witness as:

tokens_a = [ { "t": "Peter's" }, { "t": "cat" }, { "t": "." } ]

Our witness has three tokens, instead of the five that the default tokenizer would have provided, because we’ve done the tokenization ourselves according to our own specifications.

Putting it all together

For ease of exposition we’ve used variables to limit the amount of code we write in any one line. We define our collation input as:

{ "witnesses": [ witness_a, witness_b ] }

with variables that point to the data for the two witnesses. We’ve defined those variables as:

witness_a = { "id": "A", "tokens": tokens_a }
witness_b = { "id": "B", "tokens": tokens_b }

specifying the sigla as strings but using variables to point to the token lists for each witness. Those token lists look like:

tokens_a = [ { "t": "Peter's" }, { "t": "cat" }, { "t": "." } ]
tokens_b = [ { "t": "Peter's" }, { "t": "dog" }, { "t": "." } ]

It is also possible to represent the same information directly, without variables:

{"witnesses": [
    {
        "id": "A",
        "tokens": [
            {"t": "Peter's"},
            {"t": "cat"},
            {"t": "."}
        ]
    },
    {
        "id": "B",
        "tokens": [
            {"t": "Peter's"},
            {"t": "dog"},
            {"t": "."}
        ]
    }
]}

We find the code easier to read when we write it with variables, but the information content is the same in either case, so the notations are equivalent.

Python requires you to define a variable before you use it, so although in the narrative explanation above we worked from the input down to the witnesses down to the tokens, when we turn that into Python code, we have to reverse the order. We can run the collation with our own tokenization as:


In [12]:
from collatex import *
tokens_a = [ { "t": "Peter's" }, { "t": "cat" }, { "t": "." } ]
tokens_b = [ { "t": "Peter's" }, { "t": "dog" }, { "t": "." } ]
witness_a = { "id": "A", "tokens": tokens_a }
witness_b = { "id": "B", "tokens": tokens_b }
input = { "witnesses": [ witness_a, witness_b ] }
table = collate(input, segmentation=False)
print(table)


+---+---------+-----+---+
| A | Peter's | cat | . |
| B | Peter's | dog | . |
+---+---------+-----+---+

Automating the tokenization

In the example above we built our token list by hand, but that obviously isn’t scalable to a real project with more than a handful of words. Let’s enhance the code above so that it builds the token lists for us by tokenizing the input strings according to our requirements. This is where projects have to identify and formalize their own specifications, since, unfortunately, there is no direct way to tell Python to read your mind and “keep punctuation with adjacent letters when I want it there, but not when I don’t.” For this example, we’ll write a tokenizer that breaks a string first on white space (which would give us two tokens: “Peter’s” and “cat.”) and then, within those intermediate tokens, on final punctuation (separating the final period from “cat” but not breaking on the internal apostrophe in “Peter’s”). This strategy would also keep English-language contractions together as single tokens, but as we’ve written it, it wouldn’t separate a leading quotation mark from a word token, although that’s a behavior we’d probably want. In Real Life we might fine-tune the routine still further, but for this tutorial we’ll prioritize just handling the sample data.

Splitting on white space and then separating final but not internal punctuation

To develop our tokenization, let’s start with:


In [13]:
input = "Peter's cat."
print(input)


Peter's cat.

and split it into a list of whitespace-separated words:


In [14]:
import re
input = "Peter's cat."
words = re.split(r'\s+', input)
print(words)


["Peter's", 'cat.']

Now let’s treat final punctuation as a separate token without splitting on internal punctuation:


In [15]:
import re
input = "Peter's cat."
words = re.split(r'\s+', input)
tokens_by_word = [re.findall(r'.+\w|\W+$', word) for word in words]
print(tokens_by_word)


[["Peter's"], ['cat', '.']]

The regex says that a token is either a string of characters that ends in a word character (which will match “Peter’s”, internal apostrophe and all, as a single token, since it ends in “s”, a word character) or a string of non-word characters at the end of the word (which will split “cat.” into two tokens, “cat” and “.”, since “cat.” as a whole doesn’t end in a word character and therefore can’t match the first alternative).
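You can verify this behavior by trying the same pattern on other strings; as promised, contractions stay whole, and (as we conceded above) a leading quotation mark is not split off, although final punctuation is:

import re
print(re.findall(r'.+\w|\W+$', "didn't"))  # ["didn't"]
print(re.findall(r'.+\w|\W+$', '"Hi!'))    # ['"Hi', '!']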

We now have three tokens, but they’re in nested lists, which isn’t what we want. We can flatten that with:


In [16]:
import re
input = "Peter's cat."
words = re.split(r'\s+', input)
tokens_by_word = [re.findall(r'.+\w|\W+$', word) for word in words]
tokens = []
for item in tokens_by_word:
    tokens.extend(item)
print(tokens)


["Peter's", 'cat', '.']

We’ve now split our witness text into tokens, but instead of returning them as a list of strings, we need to format them into the list of Python dictionaries that CollateX requires:


In [17]:
import re
input = "Peter's cat."
words = re.split(r'\s+', input)
tokens_by_word = [re.findall(r'.+\w|\W+$', word) for word in words]
tokens = []
for item in tokens_by_word:
    tokens.extend(item)
token_list = [{"t": token} for token in tokens]
print(token_list)


[{'t': "Peter's"}, {'t': 'cat'}, {'t': '.'}]

Since we want to tokenize all of our witnesses, let’s turn our tokenization routine into a Python function that we can call with different input text:


In [18]:
from collatex import *
import re

def tokenize(input):
    words = re.split(r'\s+', input) # split on whitespace
    tokens_by_word = [re.findall(r'.+\w|\W+$', word) for word in words] # break off final punctuation
    tokens = []
    for item in tokens_by_word:
        tokens.extend(item)
    token_list = [{"t": token} for token in tokens] # create dictionaries for each token
    return token_list

input_a = "Peter's cat."
input_b = "Peter's dog."

tokens_a = tokenize(input_a)
tokens_b = tokenize(input_b)
witness_a = { "id": "A", "tokens": tokens_a }
witness_b = { "id": "B", "tokens": tokens_b }
input = { "witnesses": [ witness_a, witness_b ] }
table = collate(input, segmentation=False)
print(table)


+---+---------+-----+---+
| A | Peter's | cat | . |
| B | Peter's | dog | . |
+---+---------+-----+---+

How complex can it get?

In the Rus′ Primary Chronicle project, we keep all punctuation with the preceding word, which is to say that we tokenize on white space. But if certain XML elements (<lb/>, <pb/>, <pageRef> … </pageRef>) follow a word with a space between them, we include those elements and the intervening space as part of the preceding token, so that, for example, “cat <lb/>” would be a single token that happens to contain a space. You can see how we do this at https://github.com/djbpitt/pvl-collatex. Since we want “cat <lb/>” to match “cat” for alignment purposes, just as we want “cat” to match “cat.”, we keep the trailing material in the token but conceal it during normalization/regularization (stage 2 of the Gothenburg model), use the normalized shadow versions for alignment, and output the original versions during visualization (stage 5 of the Gothenburg model). You can see how we perform the normalization at the GitHub repo.
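The following is not the project’s actual code (that lives in the repository linked above), but a minimal sketch of the general idea, assuming that stripping trailing non-word characters is an adequate stand-in for the real normalization: tokenize on white space only, and give each token an optional "n" (normalized) property, which CollateX uses for alignment while the output table still displays the original "t" values:

from collatex import *
import re

def tokenize_on_whitespace(text):
    # split on white space only, keeping punctuation with the preceding word,
    # and add a normalized "n" form with trailing non-word characters stripped,
    # so that "cat." can still align with "cat"
    words = re.split(r'\s+', text)
    return [{"t": word, "n": re.sub(r'\W+$', '', word)} for word in words]

witness_a = {"id": "A", "tokens": tokenize_on_whitespace("Peter's cat.")}
witness_b = {"id": "B", "tokens": tokenize_on_whitespace("Peter's cat")}
table = collate({"witnesses": [witness_a, witness_b]}, segmentation=False)
print(table)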

Hands-on

The task

Suppose you want to keep the default tokenization (punctuation is always a separate token), except that:

  1. Words should not break on internal hyphenation. For example, “hands-on” should be treated as one word.
  2. English possessive apostrophe + “s” should be its own token. For example, “Peter’s” should be tokenized as “Peter” plus “’s”.

How to think about the task

  1. Create a regular expression that mimics the default behavior, where punctuation is a separate token.
  2. Enhance it to exclude hyphens from the inventory of punctuation that signals a token division.
  3. Enhance it to treat “’s” as a separate token.

You can practice your regular expressions at http://www.regexpal.com/.

Sample sentence

Peter’s cat has completed the hands-on tokenization exercise.
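If you would like to check your work after attempting the exercise on your own, here is one possible pattern; everything in this snippet is our own illustration, and there are many other ways to write it:

import re

sample = "Peter’s cat has completed the hands-on tokenization exercise."
# hyphenated words stay whole, the possessive “’s” becomes its own token,
# and any other run of punctuation is a separate token
pattern = r"\w+(?:-\w+)*|[’']s|[^\w\s]+"
print([{"t": token} for token in re.findall(pattern, sample)])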


Updated 2016-10-30 by djb. Sydney workshop version is at https://github.com/ljo/collatex-tutorial/blob/master/unit4/Tokenization.ipynb.