Tokenization

Default tokenization

Tokenization (the first of the five parts of the Gothenburg model) divides the texts to be collated into tokens, which are most commonly (but not obligatorily) words. By default CollateX considers punctuation to be its own token, which means that the witness readings “Hi!” and “Hi” will both contain a token that reads “Hi” (and the first witness will contain an additional token, which reads “!”). In this situation, that’s the behavior the user probably wants, since both witnesses contain what a human would recognize as the same word.

We are going to be using the CollateX library to demonstrate tokenization, so let's go ahead and import it.


In [3]:
from collatex import *
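
As a quick sketch of the default behavior described above, we can collate the two one-word witnesses “Hi!” and “Hi”: both contain a “Hi” token, and witness A gets an extra “!” token.


In [ ]:
collation = Collation()
collation.add_plain_witness("A", "Hi!")
collation.add_plain_witness("B", "Hi")
table = collate(collation, segmentation=False)
print(table)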

Issues with default tokenization

But is a word like “Peter’s” the same word as “Peter” for collation purposes? Because CollateX will regard the apostrophe as a separate token, “Peter’s” will be tokenized as three tokens: the name, the apostrophe, and the possessive. Here’s the default behavior:


In [4]:
collation = Collation()
collation.add_plain_witness("A", "Peter's cat.")
collation.add_plain_witness("B", "Peter's dog.")
table = collate(collation, segmentation=False)
print(table)


+---+-------+---+---+-----+---+
| A | Peter | ' | s | cat | . |
| B | Peter | ' | s | dog | . |
+---+-------+---+---+-----+---+

For possessives that may be acceptable behavior, but how about contractions like “didn’t” or “A’dam” (short for “Amsterdam”)? If the default tokenization does what you need, so much the better, but if not, you can override it according to your own requirements. Below we describe what CollateX does by default and how to override that behavior and perform your own tokenization.

How CollateX tokenizes: default behavior

The default tokenizer built into CollateX defines a token as a string of either alphanumeric characters (in any writing system) or non-alphanumeric characters, in both cases including any (optional) trailing whitespace. This means that the input reading “Peter’s cat.” will be analyzed as consisting of five tokens: “Peter” plus “’” plus “s ” plus “cat” plus “.”. For alignment purposes CollateX ignores any trailing white space, so that “cat” in “The cat in the hat” would be tokenized as “cat ” (with a trailing space), but for collation purposes it would match the “cat” in “Peter’s cat.”, which has no trailing space because it’s followed by a period.
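
The default tokenizer is part of CollateX itself, but we can approximate the behavior just described with a regular expression. This is only a rough sketch for illustration, not the library’s actual code:


In [ ]:
import re
# Rough approximation of the default tokenization described above:
# runs of word characters, or runs of other non-space characters,
# each with any trailing whitespace attached.
print(re.findall(r'\w+\s*|[^\w\s]+\s*', "Peter's cat."))
# expected: ['Peter', "'", 's ', 'cat', '.']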

If we need to override the default tokenization behavior, we can create our own tokenized input and tell CollateX to use that, instead of letting CollateX perform the tokenization itself prior to collation.

Doing your own tokenization

In a way that is consistent with the modular design of the Gothenburg model, CollateX permits the user to change the tokenization without having to change the other parts of the collation process. Since the tokenizer passes to CollateX the indivisible units that are to be aligned, performing our own tokenization means specifying those units on our own. We will now look at how we can split a text into tokens the way we prefer.
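
As a minimal hand-built sketch (the input format CollateX expects is described step by step in the sections below), we can simply type out the tokens we want and pass them to collate() ourselves:


In [ ]:
# A hand-built token list: "Peter's" stays together as a single token,
# and the final period is a token of its own.
hand_built = {
    "witnesses": [
        {"id": "A", "tokens": [{"t": "Peter's"}, {"t": "cat"}, {"t": "."}]},
        {"id": "B", "tokens": [{"t": "Peter's"}, {"t": "dog"}, {"t": "."}]}
    ]
}
table = collate(hand_built, segmentation=False)
print(table)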

Automating the tokenization

In the example above we built our token list by hand, but that obviously isn’t scalable to a real project with more than a handful of words. Let’s enhance the code above so that it builds the token lists for us by tokenizing the input strings according to our requirements. This is where projects have to identify and formalize their own specifications, since, unfortunately, there is no direct way to tell Python to read your mind and “keep punctuation with adjacent letters when I want it there, but not when I don’t.” For this example, we’ll write a tokenizer that breaks a string first on white space (which would give us two tokens: “Peter’s” and “cat.”) and then, within those intermediate tokens, on final punctuation (separating the final period from “cat” but not breaking on the internal apostrophe in “Peter’s”). This strategy would also keep English-language contractions together as single tokens, but as we’ve written it, it wouldn’t separate a leading quotation mark from a word token, although that’s a behavior we’d probably want. In Real Life we might fine-tune the routine still further, but for this tutorial we’ll prioritize just handling the sample data.

Splitting on white space and then separating final but not internal punctuation

To develop our tokenization, let’s start with:


In [5]:
input = "Peter's cat."
print(input)


Peter's cat.

and split it into a list of whitespace-separated words with the Python re library, which we will import here so that we can use it below.


In [6]:
import re
input = "Peter's cat."
words = re.split(r'\s+', input)
print(words)


["Peter's", 'cat.']

Now let’s treat final punctuation as a separate token without splitting on internal punctuation:


In [7]:
input = "Peter's cat."
words = re.split(r'\s+', input)
tokens_by_word = [re.findall(r'.*\w|\W+$', word) for word in words]
print(tokens_by_word)


[["Peter's"], ['cat', '.']]

The regex says that a token is either a string of any characters that ends in a word character (which will match “Peter’s” with the internal apostrophe as one token, since it ends in “s”, which is a word character) or a string of non-word characters at the end of the word. The re.findall function will give us back a list of all the separate (i.e. non-overlapping) times our expression matched. In the case of the string cat., the .*\w alternative matches cat (i.e. anything ending in a word character), and then the \W+$ alternative matches . (i.e. anything at the end that is made entirely of non-word characters).
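
As a brief aside, we can use the same regular expression to confirm the limitation mentioned earlier: a leading quotation mark is not split off, because a token like “Two (opening quote included) still ends in a word character and is swallowed whole by the .*\w alternative. This is just a quick check; one of the exercises later in this notebook asks you to fix exactly this.


In [ ]:
# The opening quotation mark stays attached to the word.
print(re.findall(r'.*\w|\W+$', '“Two'))
# expected: ['“Two']

With that caveat noted, let’s return to our “Peter’s cat.” example.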

We now have three tokens, but they’re in nested lists, which isn’t what we want. Rather, we want a single list with all the tokens on the same level. We can accomplish that with a for loop and the .extend method for lists:


In [8]:
input = "Peter's cat."
words = re.split(r'\s+', input)
tokens_by_word = [re.findall(r'.*\w|\W+$', word) for word in words]
tokens = []
for item in tokens_by_word:
    tokens.extend(item)
print(tokens)


["Peter's", 'cat', '.']

We’ve now split our witness text into tokens, but instead of returning them as a list of strings, we need to format them into the list of Python dictionaries that CollateX requires. So let's talk about what CollateX requires.

Specifying the witnesses to be used in the collation

The format in which CollateX expects to receive our custom lists of tokens for all witnesses to be collated is a Python dictionary, which has the following structure:

{ "witnesses": [ witness_a, witness_b ] }

This is a Python dictionary whose key is the word witnesses, and whose value is a list of the witnesses (that is, the sets of text tokens) that we want to collate. Doing our own tokenization, then, means building a dictionary like the one above and supplying our custom tokens, in the correct format, in place of the witness_a and witness_b variables.

Specifying the siglum and token list for each witness

The witness data for each witness is a Python dictionary that must contain two properties, which have as keys the strings id and tokens. The value for the id key is a string that will be used as the siglum of the witness in any CollateX output. The value for the tokens key is a Python list of the tokens that comprise the text (much like what we have made with our regular expressions, but we have one more step to get through first!):

witness_a = { "id": "A", "tokens": list_of_tokens_for_witness_a }

Specifying the tokens for each witness

Each token for each witness is a Python dictionary with at least one member, which has the key "t" (think “text”). You'll learn in the Normalization unit what else you can put in here. A token for the string “cat” would look like:

{ "t": "cat" }

The key for every token is the string "t"; the value for this token is the string "cat". As noted above, the tokens for a witness are structured as a Python list, so if we chose to split our text only on whitespace we would tokenize our first witness as:

list_of_tokens_for_witness_a = [ { "t": "Peter's" }, { "t": "cat." } ]

Our witness has two tokens, instead of the five that the default tokenizer would have provided, because we’ve done the tokenization ourselves according to our own specifications.

Putting it all together

For ease of exposition we’ve used variables to limit the amount of code we write in any one line. We define our sets of tokens as:

list_of_tokens_for_witness_a = [ { "t": "Peter's" }, { "t": "cat." } ]
list_of_tokens_for_witness_b = [ { "t": "Peter's" }, { "t": "dog." } ]

Once we have those, we can define our witnesses that bear these tokens:

witness_a = { "id": "A", "tokens": list_of_tokens_for_witness_a }
witness_b = { "id": "B", "tokens": list_of_tokens_for_witness_b }

until finally we define our collation set as:

{ "witnesses": [ witness_a, witness_b ] }

with variables that point to the data for the two witnesses.

It is also possible to represent the same information directly, without variables:

{"witnesses": [
    {
        "id": "A",
        "tokens": [
            {"t": "Peter's"},
            {"t": "cat."}
        ]
    },
    {
        "id": "B",
        "tokens": [
            {"t": "Peter's"},
            {"t": "dog."}
        ]
    }
]}
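
We could already feed this hand-specified structure to collate() (a quick sketch); notice that, because we split only on white space here, “cat.” and “dog.” would be aligned as whole tokens with the period still attached:


In [ ]:
whitespace_only = {"witnesses": [
    {"id": "A", "tokens": [{"t": "Peter's"}, {"t": "cat."}]},
    {"id": "B", "tokens": [{"t": "Peter's"}, {"t": "dog."}]}
]}
table = collate(whitespace_only, segmentation=False)
print(table)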

So let's put a single witness together in the format CollateX requires, starting with that list of tokens we made.


In [9]:
input = "Peter's cat."
words = re.split(r'\s+', input)
tokens_by_word = [re.findall(r'.*\w|\W+$', word) for word in words]
tokens = []
for item in tokens_by_word:
    tokens.extend(item)
token_list = [{"t": token} for token in tokens]
print(token_list)


[{'t': "Peter's"}, {'t': 'cat'}, {'t': '.'}]

Since we want to tokenize all of our witnesses, let’s turn our tokenization routine into a Python function that we can call with different input text:


In [10]:
def tokenize(input):
    words = re.split(r'\s+', input) # split on whitespace
    tokens_by_word = [re.findall(r'.*\w|\W+$', word) for word in words] # break off final punctuation
    tokens = []
    for item in tokens_by_word:
        tokens.extend(item)
    token_list = [{"t": token} for token in tokens] # create dictionaries for each token
    return token_list

input_a = "Peter's cat."
input_b = "Peter's dog."

tokens_a = tokenize(input_a)
tokens_b = tokenize(input_b)
witness_a = { "id": "A", "tokens": tokens_a }
witness_b = { "id": "B", "tokens": tokens_b }
input = { "witnesses": [ witness_a, witness_b ] }
input


Out[10]:
{'witnesses': [{'id': 'A',
   'tokens': [{'t': "Peter's"}, {'t': 'cat'}, {'t': '.'}]},
  {'id': 'B', 'tokens': [{'t': "Peter's"}, {'t': 'dog'}, {'t': '.'}]}]}

Let's see how it worked! Here is how to give the tokens to CollateX.


In [11]:
table = collate(input, segmentation=False)
print(table)


+---+---------+-----+---+
| A | Peter's | cat | . |
| B | Peter's | dog | . |
+---+---------+-----+---+

Hands-on

The task

Suppose you want to keep the default tokenization (punctuation is always a separate token), except that:

  1. Words should not break on internal hyphenation. For example, “hands-on” should be treated as one word.
  2. English possessive apostrophe + “s” should be its own token. For example, “Peter’s” should be tokenized as “Peter” plus “’s”.

How to think about the task

  1. Create a regular expression that mimics the default behavior, where punctuation is a separate token.
  2. Enhance it to exclude hyphens from the inventory of punctuation that signals a token division.
  3. Enhance it to treat “’s” as a separate token.

You can practice your regular expressions at http://www.regexpal.com/.

Sample sentence

Peter’s cat has completed the hands-on tokenization exercise.


In [ ]:
## Your code goes here

The next step: tokenizing XML

After all that work on marking up your document in XML, you are certainly going to want to tokenize it! This works in basically the same way, only we also have to learn to use an XML parser.

Personally I favor the lxml.etree library, though its method of handling text nodes takes some getting used to. If you have experience with more standard XML parsing models, take a look at the Integrating XML with Python notebook in this directory. We will see as we go along how etree works.

For this exercise, let's tokenize the Ozymandias file that we were working on yesterday. It's a good idea to work with "our" version of the file until you understand what is going on here, but once you think you have the hang of it, feel free to try it with the file you marked up!


In [12]:
from lxml import etree

with open('ozymandias.xml', encoding='utf-8') as f:
    ozzy = etree.parse(f)

print("Got an ElementTree with root tag", ozzy.getroot().tag)
print(etree.tostring(ozzy).decode('utf-8'))


Got an ElementTree with root tag {http://www.tei-c.org/ns/1.0}TEI
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?><?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml"
	schematypens="http://purl.oclc.org/dsdl/schematron"?><TEI xmlns="http://www.tei-c.org/ns/1.0">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>Ozymandias</title>
         </titleStmt>
         <publicationStmt>
            <p>Part of <ptr target="https://github.com/Pittsburgh-NEH-Institute/Institute-Materials-2017"/></p>
         </publicationStmt>
         <sourceDesc>
            <p><ptr target="https://www.poetryfoundation.org/resources/learning/core-poems/detail/46565"/></p>
         </sourceDesc>
      </fileDesc>
   </teiHeader>
   <text>
      <body>
         <head>
            <persName>Percy Bysshe Shelley</persName>
            <title>Ozymandias</title>
         </head>
         <div type="poem">
            <p>
               <lb/><phr>I met a traveller from an antique land,</phr>
               <lb/><phr>Who said &#8212;</phr><phr>&#8220;Two vast and trunkless legs of stone
               <lb/>Stand in the desart.... </phr><phr>Near them,</phr> <phr>on the sand,</phr>
               <lb/><phr>Half sunk a shattered visage lies,</phr> <phr>whose frown,</phr>
               <lb/><phr>And wrinkled lip,</phr> <phr>and sneer of cold command,</phr>
               <lb/><phr>Tell that its sculptor well those passions read
               <lb/>Which yet survive,</phr> <phr>stamped on these lifeless things,</phr>
               <lb/><phr>The hand that mocked them,</phr> <phr>and the heart that fed;</phr>
               <lb/><phr>And on the pedestal,</phr> <phr>these words appear:</phr>
               <lb/><phr>My name is Ozymandias,</phr> <phr>King of Kings,</phr>
               <lb/><phr>Look on my Works,</phr> <phr>ye Mighty,</phr> <phr>and despair!</phr>
               <lb/><phr>Nothing beside remains.</phr> <phr>Round the decay
               <lb/>Of that colossal Wreck,</phr> <phr>boundless and bare
               <lb/>The lone and level sands stretch far away.&#8221;</phr>
            </p>
         </div>
      </body>
   </text>
</TEI>

Notice here what ETree does with the namespace! It doesn't naturally like namespace prefixes like tei:, but prefers to just stick the entire URL in curly braces. We can make a little shortcut to do this for us, and then we can use it to find our elements.


In [13]:
def tei(tag):
    return "{http://www.tei-c.org/ns/1.0}%s" % tag

tei('text')


Out[13]:
'{http://www.tei-c.org/ns/1.0}text'

In our Ozymandias file, the words of the poem are contained in phrases. So let's start by seeking out all the <phr> elements and getting their text.


In [14]:
for phrase in ozzy.iter(tei('phr')):
    print(phrase.text)


I met a traveller from an antique land,
Who said —
“Two vast and trunkless legs of stone
               
Near them,
on the sand,
Half sunk a shattered visage lies,
whose frown,
And wrinkled lip,
and sneer of cold command,
Tell that its sculptor well those passions read
               
stamped on these lifeless things,
The hand that mocked them,
and the heart that fed;
And on the pedestal,
these words appear:
My name is Ozymandias,
King of Kings,
Look on my Works,
ye Mighty,
and despair!
Nothing beside remains.
Round the decay
               
boundless and bare
               

This looks plausible at first, but we notice pretty soon that we are missing pieces of some lines - the third line, for example, should read something like

"Two vast and trunkless legs of stone
<lb/>Stand in the desart.... 

What's going on?

Here is the slightly mind-bending thing about ETree: each element not only has textual content, but can also have a text tail. In this case, the <phr> element has the following contents:

  • Text content: Two vast and trunkless legs of stone\n
  • A child element: <lb/>

The <lb/> has no content, but it does have a tail! The tail is Stand in the desart.... and we have to ask for it separately.
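
Here is a tiny standalone illustration of the difference between text and tail, using a made-up miniature snippet rather than the real file:


In [ ]:
from lxml import etree

snippet = etree.fromstring(
    "<phr>Two vast and trunkless legs of stone\n<lb/>Stand in the desart.... </phr>")
print(repr(snippet.text))   # the text content of <phr>, up to the <lb/>
lb = snippet[0]             # the <lb/> child element
print(repr(lb.text))        # None: <lb/> has no text content of its own
print(repr(lb.tail))        # the text that follows <lb/>: its tail

So let's try this - instead of getting just the text of each element, let's get its text AND the tail of any child elements. Here's how we do that.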


In [15]:
for phrase in ozzy.iter(tei('phr')):
    content = phrase.text
    for child in phrase:
        content = content + child.tail
    print(content)


I met a traveller from an antique land,
Who said —
“Two vast and trunkless legs of stone
               Stand in the desart.... 
Near them,
on the sand,
Half sunk a shattered visage lies,
whose frown,
And wrinkled lip,
and sneer of cold command,
Tell that its sculptor well those passions read
               Which yet survive,
stamped on these lifeless things,
The hand that mocked them,
and the heart that fed;
And on the pedestal,
these words appear:
My name is Ozymandias,
King of Kings,
Look on my Works,
ye Mighty,
and despair!
Nothing beside remains.
Round the decay
               Of that colossal Wreck,
boundless and bare
               The lone and level sands stretch far away.”

Now that's looking better. We have a bunch of text, and now all we need to do is tokenize it! For this we can come back to the function that we wrote earlier, tokenize. Let's plug each of these bits of content in turn into our tokenizer, and see what we get.


In [16]:
tokens = []

for phrase in ozzy.iter(tei('phr')):
    content = phrase.text
    for child in phrase:
        content = content + child.tail
    tokens.extend(tokenize(content))
    
print(tokens)


[{'t': 'I'}, {'t': 'met'}, {'t': 'a'}, {'t': 'traveller'}, {'t': 'from'}, {'t': 'an'}, {'t': 'antique'}, {'t': 'land'}, {'t': ','}, {'t': 'Who'}, {'t': 'said'}, {'t': '—'}, {'t': '“Two'}, {'t': 'vast'}, {'t': 'and'}, {'t': 'trunkless'}, {'t': 'legs'}, {'t': 'of'}, {'t': 'stone'}, {'t': 'Stand'}, {'t': 'in'}, {'t': 'the'}, {'t': 'desart'}, {'t': '....'}, {'t': 'Near'}, {'t': 'them'}, {'t': ','}, {'t': 'on'}, {'t': 'the'}, {'t': 'sand'}, {'t': ','}, {'t': 'Half'}, {'t': 'sunk'}, {'t': 'a'}, {'t': 'shattered'}, {'t': 'visage'}, {'t': 'lies'}, {'t': ','}, {'t': 'whose'}, {'t': 'frown'}, {'t': ','}, {'t': 'And'}, {'t': 'wrinkled'}, {'t': 'lip'}, {'t': ','}, {'t': 'and'}, {'t': 'sneer'}, {'t': 'of'}, {'t': 'cold'}, {'t': 'command'}, {'t': ','}, {'t': 'Tell'}, {'t': 'that'}, {'t': 'its'}, {'t': 'sculptor'}, {'t': 'well'}, {'t': 'those'}, {'t': 'passions'}, {'t': 'read'}, {'t': 'Which'}, {'t': 'yet'}, {'t': 'survive'}, {'t': ','}, {'t': 'stamped'}, {'t': 'on'}, {'t': 'these'}, {'t': 'lifeless'}, {'t': 'things'}, {'t': ','}, {'t': 'The'}, {'t': 'hand'}, {'t': 'that'}, {'t': 'mocked'}, {'t': 'them'}, {'t': ','}, {'t': 'and'}, {'t': 'the'}, {'t': 'heart'}, {'t': 'that'}, {'t': 'fed'}, {'t': ';'}, {'t': 'And'}, {'t': 'on'}, {'t': 'the'}, {'t': 'pedestal'}, {'t': ','}, {'t': 'these'}, {'t': 'words'}, {'t': 'appear'}, {'t': ':'}, {'t': 'My'}, {'t': 'name'}, {'t': 'is'}, {'t': 'Ozymandias'}, {'t': ','}, {'t': 'King'}, {'t': 'of'}, {'t': 'Kings'}, {'t': ','}, {'t': 'Look'}, {'t': 'on'}, {'t': 'my'}, {'t': 'Works'}, {'t': ','}, {'t': 'ye'}, {'t': 'Mighty'}, {'t': ','}, {'t': 'and'}, {'t': 'despair'}, {'t': '!'}, {'t': 'Nothing'}, {'t': 'beside'}, {'t': 'remains'}, {'t': '.'}, {'t': 'Round'}, {'t': 'the'}, {'t': 'decay'}, {'t': 'Of'}, {'t': 'that'}, {'t': 'colossal'}, {'t': 'Wreck'}, {'t': ','}, {'t': 'boundless'}, {'t': 'and'}, {'t': 'bare'}, {'t': 'The'}, {'t': 'lone'}, {'t': 'and'}, {'t': 'level'}, {'t': 'sands'}, {'t': 'stretch'}, {'t': 'far'}, {'t': 'away'}, {'t': '.”'}]

Adding complexity

As XML tokenization goes, this one was pretty straightforward - all your text was in <phr> elements, and none of the text was in any child element, so we were able to get by with a combination of .text and .tail for the elements we encountered. What if our markup isn't so simple? What do we do?

Here is where you start to really have to grapple with the fact that TEI allows a thousand encoding variations to bloom. In order to tokenize your particular text, you will have to think about what you encoded and how, and what "counts" as text you want to extract.

In the file ozymandias_2.xml I have provided a simple example of this. Here the encoder chose to add the canonical spelling for the word "desert" in a <corr> element, as part of a <choice>. If I tokenize that file in the same way as above, here is what I get.


In [ ]:
with open('ozymandias_2.xml', encoding='utf-8') as f:
    ozzy2 = etree.parse(f)
    
print(etree.tostring(ozzy2).decode('utf-8'))

In [ ]:
tokens = []

for phrase in ozzy2.iter(tei('phr')):
    content = phrase.text
    for child in phrase:
        content = content + child.tail
    tokens.extend(tokenize(content))
    
print(tokens)

Notice that I have neither "desert" nor "desart"! That is because, while I got the tail of the <choice> element, I didn't look inside it, and I didn't visit the <sic> or <corr> elements at all. I have to make my logic a little more complex, and I also have to think about which alternative I want. Let's say that I want to stay relatively true to the original. Here is the sort of thing I would have to do.


In [ ]:
tokens = []

for phrase in ozzy2.iter(tei('phr')):
    content = phrase.text
    for child in phrase:
        if child.tag == tei('choice'):
            ## We know there is only one 'sic' element, but
            ## etree won't assume that! So we have to deal
            ## with "all" of them.
            for sic in child.iter(tei('sic')):
                content = content + sic.text
        content = content + child.tail
    tokens.extend(tokenize(content))
    
print(tokens)

Voilà, we have our des[ae]rt back!

Exercise 1

Change the code above so that you get the corrected version of the word instead.

Exercise 2, in pairs

Re-define the tokenize function so that it treats that opening quote in the second line ("Two vast and trunkless...) as its own token. You will need to tweak the regular expression.

Exercise 3 (when you are feeling brave)

Write some code that will parse your own XML markup from yesterday and tokenize it!


Updated 2017-07-18 by tla. Sydney workshop version is at https://github.com/ljo/collatex-tutorial/blob/master/unit4/Tokenization.ipynb.