Tokenization (the first of the five parts of the Gothenburg model) divides the texts to be collated into tokens, which are most commonly (but not obligatorily) words. By default CollateX considers punctuation to be its own token, which means that the witness readings “Hi!” and “Hi” will both contain a token that reads “Hi” (and the first witness will contain an additional token, which reads “!”). In this situation, that’s the behavior the user probably wants, since both witnesses contain what a human would recognize as the same word.
We are going to be using the CollateX library to demonstrate tokenization, so let's go ahead and import it.
In [3]:
from collatex import *
In [4]:
collation = Collation()
collation.add_plain_witness("A", "Peter's cat.")
collation.add_plain_witness("B", "Peter's dog.")
table = collate(collation, segmentation=False)
print(table)
For possessives that may be acceptable behavior, but how about contractions like “didn’t” or “A’dam” (short for “Amsterdam”)? If the default tokenization does what you need, so much the better, but if not, you can override it according to your own requirements. Below we describe what CollateX does by default and how to override that behavior and perform your own tokenization.
The default tokenizer built into CollateX defines a token as a string of either alphanumeric characters (in any writing system) or non-alphanumeric characters, in both cases including any (optional) trailing whitespace. This means that the input reading “Peter’s cat.” will be analyzed as consisting of five tokens: “Peter” plus “’” plus “s ” plus “cat” plus “.”. When it aligns tokens, CollateX ignores any trailing white space, so although “cat” in “The cat in the hat” would be tokenized as “cat ” (with a trailing space), it would still match the “cat” in “Peter’s cat.”, which has no trailing space because it’s followed by a period.
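Although we won’t reproduce the CollateX internals here, the behavior just described can be approximated with a short regular expression. The sketch below is only an illustration of the default behavior, not the library’s actual code:

import re

# A rough approximation (NOT CollateX's real implementation) of the default
# tokenization described above: runs of word characters or runs of non-word
# characters, each keeping any trailing whitespace.
print(re.findall(r'\w+\s*|\W+', "Peter's cat."))  # ['Peter', "'", 's ', 'cat', '.']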
If we need to override the default tokenization behavior, we can create our own tokenized input and tell CollateX to use that, instead of letting CollateX perform the tokenization itself prior to collation.
In a way that is consistent with the modular design of the Gothenburg model, CollateX permits the user to change the tokenization without having to change the other parts of the collation process. Since the tokenizer passes to CollateX the indivisible units that are to be aligned, performing our own tokenization means specifying those units on our own. We will now look at how we can split a text into tokens the way we prefer.
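Before we write any splitting code, here is a hand-built token list for our two short witnesses, passed directly to collate() in the input format that the rest of this unit explains piece by piece (the variable names are arbitrary). Notice that “Peter’s” and “cat.” each survive as single tokens, because we said so:

# Hand-built tokens: each witness is a dictionary with a siglum ("id") and a
# list of {"t": ...} token dictionaries; the format is explained in detail below.
tokens_a = [{"t": "Peter's"}, {"t": "cat."}]
tokens_b = [{"t": "Peter's"}, {"t": "dog."}]
handmade_input = {"witnesses": [{"id": "A", "tokens": tokens_a},
                                {"id": "B", "tokens": tokens_b}]}
table = collate(handmade_input, segmentation=False)
print(table)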
In the example above we built our token list by hand, but that obviously isn’t scalable to a real project with more than a handful of words. Let’s enhance the code above so that it builds the token lists for us by tokenizing the input strings according to our requirements. This is where projects have to identify and formalize their own specifications, since, unfortunately, there is no direct way to tell Python to read your mind and “keep punctuation with adjacent letters when I want it there, but not when I don’t.” For this example, we’ll write a tokenizer that breaks a string first on white space (which would give us two tokens: “Peter’s” and “cat.”) and then, within those intermediate tokens, on final punctuation (separating the final period from “cat” but not breaking on the internal apostrophe in “Peter’s”). This strategy would also keep English-language contractions together as single tokens, but as we’ve written it, it wouldn’t separate a leading quotation mark from a word token, although that’s a behavior we’d probably want. In Real Life we might fine-tune the routine still further, but for this tutorial we’ll prioritize just handling the sample data.
To develop our tokenization, let’s start with:
In [5]:
input = "Peter's cat."
print(input)
and split it into a list of whitespace-separated words with the Python re library, which we will import here so that we can use it below.
In [6]:
import re
input = "Peter's cat."
words = re.split(r'\s+', input)
print(words)
Now let’s treat final punctuation as a separate token without splitting on internal punctuation:
In [7]:
input = "Peter's cat."
words = re.split(r'\s+', input)
tokens_by_word = [re.findall(r'.*\w|\W+$', word) for word in words]
print(tokens_by_word)
The regex says that a token is either a string of any characters that ends in a word character (which will match “Peter’s” with the internal apostrophe as one token, since it ends in “s”, which is a word character) or a string of non-word characters. The re.findall method will give us back a list of all the separate (i.e. non-overlapping) times our expression matched. In the case of the string “cat.”, the .*\w alternative matches “cat” (i.e. anything ending in a word character), and then the \W+ alternative matches “.” (i.e. anything that is made entirely of non-word characters).
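As a quick check of the limitation mentioned earlier, we can watch the expression keep a leading quotation mark glued to the word that follows it (we will come back to this in the exercises):

# The leading quotation mark is not split off: the '.*\w' alternative happily
# swallows it, because a match only has to end in a word character.
print(re.findall(r'.*\w|\W+$', '"Two'))  # ['"Two']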
Back to “Peter’s cat.”: we now have three tokens, but they’re in nested lists, which isn’t what we want. Rather, we want a single list with all the tokens on the same level. We can accomplish that with a for loop and the .extend method for lists:
In [8]:
input = "Peter's cat."
words = re.split(r'\s+', input)
tokens_by_word = [re.findall(r'.*\w|\W+$', word) for word in words]
tokens = []
for item in tokens_by_word:
    tokens.extend(item)
print(tokens)
We’ve now split our witness text into tokens, but instead of returning them as a list of strings, we need to format them into the list of Python dictionaries that CollateX requires. So let's talk about what CollateX requires.
The format in which CollateX expects to receive our custom lists of tokens for all witnesses to be collated is a Python dictionary, which has the following structure:
{ "witnesses": [ witness_a, witness_b ] }
This is a Python dictionary whose key is the word witnesses, and whose value is a list of the witnesses (that is, the sets of text tokens) that we want to collate. Doing our own tokenization, then, means building a dictionary like the one above and putting our custom tokens, in the correct format, where the witness_a and witness_b variables stand above.
The witness data for each witness is a Python dictionary that must contain two properties, which have as keys the strings id and tokens. The value for the id key is a string that will be used as the siglum of the witness in any CollateX output. The value for the tokens key is a Python list of the tokens that comprise the text (much like what we have made with our regular expressions, but we have one more step to get through...!)
witness_a = { "id": "A", "tokens": list_of_tokens_for_witness_a }
Each token for each witness is a Python dictionary with at least one member, which has the key "t" (think “text”). You'll learn in the Normalization unit what else you can put in here. A token for the string “cat” would look like:
{ "t": "cat" }
The key for every token is the string "t"; the value for this token is the string "cat". As noted above, the tokens for a witness are structured as a Python list, so if we chose to split our text only on whitespace we would tokenize our first witness as:
list_of_tokens_for_witness_a = [ { "t": "Peter's" }, { "t": "cat." } ]
Our witness has two tokens, instead of the five that the default tokenizer would have provided, because we’ve done the tokenization ourselves according to our own specifications.
For ease of exposition we’ve used variables to limit the amount of code we write in any one line. We define our sets of tokens as:
list_of_tokens_for_witness_a = [ { "t": "Peter's" }, { "t": "cat." } ]
list_of_tokens_for_witness_b = [ { "t": "Peter's" }, { "t": "dog." } ]
Once we have those, we can define our witnesses that bear these tokens:
witness_a = { "id": "A", "tokens": list_of_tokens_for_witness_a }
witness_b = { "id": "B", "tokens": list_of_tokens_for_witness_b }
until finally we define our collation set as:
{ "witnesses": [ witness_a, witness_b ] }
with variables that point to the data for the two witnesses.
It is also possible to represent the same information directly, without variables:
{"witnesses": [
{
"id": "A",
"tokens": [
{"t": "Peter's"},
{"t": "cat."}
]
},
{
"id": "B",
"tokens": [
{"t": "Peter's"},
{"t": "dog."}
]
}
]}
So let's put a single witness together in the format CollateX requires, starting with that list of tokens we made.
In [9]:
input = "Peter's cat."
words = re.split(r'\s+', input)
tokens_by_word = [re.findall(r'.*\w|\W+$', word) for word in words]
tokens = []
for item in tokens_by_word:
    tokens.extend(item)
token_list = [{"t": token} for token in tokens]
print(token_list)
Since we want to tokenize all of our witnesses, let’s turn our tokenization routine into a Python function that we can call with different input text:
In [10]:
def tokenize(input):
    words = re.split(r'\s+', input)  # split on whitespace
    tokens_by_word = [re.findall(r'.*\w|\W+$', word) for word in words]  # break off final punctuation
    tokens = []
    for item in tokens_by_word:
        tokens.extend(item)
    token_list = [{"t": token} for token in tokens]  # create dictionaries for each token
    return token_list
input_a = "Peter's cat."
input_b = "Peter's dog."
tokens_a = tokenize(input_a)
tokens_b = tokenize(input_b)
witness_a = { "id": "A", "tokens": tokens_a }
witness_b = { "id": "B", "tokens": tokens_b }
input = { "witnesses": [ witness_a, witness_b ] }
input
Out[10]:
Let's see how it worked! Here is how to give the tokens to CollateX.
In [11]:
table = collate(input, segmentation=False)
print(table)
Suppose you want to keep the default tokenization (punctuation is always a separate token), except that apostrophes in contractions and possessives like “didn’t” and “Peter’s” should stay attached to their words, so that each of those remains a single token. Try writing a tokenizer that does this in the code cell below.
You can practice your regular expressions at http://www.regexpal.com/.
Peter’s cat has completed the hands-on tokenization exercise.
In [ ]:
## Your code goes here
After all that work on marking up your document in XML, you are certainly going to want to tokenize it! This works in basically the same way, only we also have to learn to use an XML parser.
Personally I favor the lxml.etree library, though its method of handling text nodes takes some getting used to. If you have experience with more standard XML parsing models, take a look at the Integrating XML with Python notebook in this directory. We will see as we go along how etree works.
For this exercise, let's tokenize the Ozymandias file that we were working on yesterday. It's a good idea to work with "our" version of the file until you understand what is going on here, but once you think you have the hang of it, feel free to try it with the file you marked up!
In [12]:
from lxml import etree
with open('ozymandias.xml', encoding='utf-8') as f:
    ozzy = etree.parse(f)
print("Got an ElementTree with root tag", ozzy.getroot().tag)
print(etree.tostring(ozzy).decode('utf-8'))
Notice here what ETree does with the namespace! It doesn't naturally like namespace prefixes like tei:, but prefers to just stick the entire URL in curly braces. We can make a little shortcut to do this for us, and then we can use it to find our elements.
In [13]:
def tei(tag):
    return "{http://www.tei-c.org/ns/1.0}%s" % tag
tei('text')
Out[13]:
In our Ozymandias file, the words of the poem are contained in phrases. So let's start by seeking out all the <phr> elements and getting their text.
In [14]:
for phrase in ozzy.iter(tei('phr')):
    print(phrase.text)
This looks plausible at first, but we notice pretty soon that we are missing pieces of line - the third line, for example, should read something like
"Two vast and trunkless legs of stone
<lb/>Stand in the desart...."
What's going on?
Here is the slightly mind-bending thing about ETree: each element has not only textual content, but can also have a text tail. In this case, the <phr> element has the following contents:
Two vast and trunkless legs of stone\n
<lb/>
The <lb/> has no content, but it does have a tail! The tail is "Stand in the desart...." and we have to ask for it separately.
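If the text/tail distinction still feels strange, here is a minimal standalone sketch, using a made-up one-line fragment rather than the Ozymandias file, to show where each piece of text ends up:

from lxml import etree

# A made-up fragment, just to see where lxml stores each run of text.
el = etree.fromstring("<phr>Two vast and trunkless legs of stone\n<lb/>Stand in the desart....</phr>")
print(repr(el.text))     # the text *before* the first child: 'Two vast and trunkless legs of stone\n'
print(repr(el[0].tail))  # the text *after* the <lb/>, stored on the <lb/> element itself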
So let's try this - instead of getting just the text of each element, let's get its text AND the tail of any child elements. Here's how we do that.
In [15]:
for phrase in ozzy.iter(tei('phr')):
    content = phrase.text
    for child in phrase:
        content = content + child.tail
    print(content)
Now that's looking better. We have a bunch of text, and now all we need to do is tokenize it! For this we can come back to the function that we wrote earlier, tokenize. Let's plug each of these bits of content in turn into our tokenizer, and see what we get.
In [16]:
tokens = []
for phrase in ozzy.iter(tei('phr')):
    content = phrase.text
    for child in phrase:
        content = content + child.tail
    tokens.extend(tokenize(content))
print(tokens)
As XML tokenization goes, this one was pretty straightforward - all your text was in <phr> elements, and none of the text was in any child element, so we were able to get by with a combination of .text and .tail for the elements we encountered. What if our markup isn't so simple? What do we do?
Here is where you start to really have to grapple with the fact that TEI allows a thousand encoding variations to bloom. In order to tokenize your particular text, you will have to think about what you encoded and how, and what "counts" as text you want to extract.
In the file ozymandias_2.xml I have provided a simple example of this. Here the encoder chose to add the canonical spelling for the word "desert" in a <corr> element, as part of a <choice>. If I tokenize that file in the same way as above, here is what I get.
In [ ]:
with open('ozymandias_2.xml', encoding='utf-8') as f:
    ozzy2 = etree.parse(f)
print(etree.tostring(ozzy2).decode('utf-8'))
In [ ]:
tokens = []
for phrase in ozzy2.iter(tei('phr')):
    content = phrase.text
    for child in phrase:
        content = content + child.tail
    tokens.extend(tokenize(content))
print(tokens)
Notice that I have neither "desert" nor "desart"! That is because, while I got the tail of the <choice> element, I didn't look inside it, and I didn't visit the <sic> or <corr> elements at all. I have to make my logic a little more complex, and I also have to think about which alternative I want. Let's say that I want to stay relatively true to the original. Here is the sort of thing I would have to do.
In [ ]:
tokens = []
for phrase in ozzy2.iter(tei('phr')):
    content = phrase.text
    for child in phrase:
        if child.tag == tei('choice'):
            ## We know there is only one 'sic' element, but
            ## etree won't assume that! So we have to deal
            ## with "all" of them.
            for sic in child.iter(tei('sic')):
                content = content + sic.text
        content = content + child.tail
    tokens.extend(tokenize(content))
print(tokens)
Voilà, we have our des[ae]rt back!
Change the code above so that you get the corrected version of the word instead.
Re-define the tokenize function so that it treats that opening quote in the second line ("Two vast and trunkless...") as its own token. You will need to tweak the regular expression.
Write some code that will parse your own XML markup from yesterday and tokenize it!
Updated 2017-07-18 by tla. Sydney workshop version is at https://github.com/ljo/collatex-tutorial/blob/master/unit4/Tokenization.ipynb.