Collating XML-files

In this introduction we will shortly outline the theory behind collating texts with markup, for instance XML or TEI tags. You may ask yourself, why you would want to include the markup in the first place? Isn't collation first and foremost a textual activity?

In the previous units, we have seen how to collate plain text files, how to tokenize and how to normalize the input data. Since some kinds of text do not easily fit into a model, we have also discussed the ways in which you can adjust the algorithms to allow for near-matching. In short, there is already quite a lot you can do to ensure that CollateX produces a collation that fits your research objectives.

Stripping the XML

You have also learned that the collation itself is still largely done on the level of the text: we collate strings of plain text characters. However, editors and textual reseachers usually work with XML files that contain transcriptions of the text in question, often encoded according to a specific TEI schema. As you may know, the TEI Guidelines are flexible and allow you to encode a variety of textual phenomena and other information. Therefore the text encoding varies depending on the type of text and the edition you intend to create. Think for instance of documentary markup, linguistic annotation or information about names and people, among others. In short, XML TEI transcriptions are information-rich structured data files. Before these files can be processed by CollateX, they are "stripped" of all XML markup because CollateX only accepts plain text strings as input.

Figure 1: example of TEI encoded text.

In some cases, stripping files of their XML markup until only plain text remains is not a serious issue. As said, collation is primarily a textual activity. If you have an XML TEI files with linguistic annotation or biographical information about certain characters in a story, taking out the TEI tags still leaves you with a logical text. However, other types of TEI XML files are not so easily converted to a linear string. Consider for instance the following, simplified TEI encoding:

Figure 2: example of simple genetic encoding, from Virginia Woolf's holograph of "Time Passes" (M31).

The del- and add elements mark deletions and additions, which gives us a so-called "layered transcription". The ground layer is formed by the text in the del-tags; whereas the text in the add-tags constitutes the top layer. You can imagine that a text with many revisions, such as the holograph of To The Lighthouse, can have more than two layers.

If we just take out all the TEI tags, we would end up with a non-sensical text:

[EXAMPLE OF TEXT WITHOUT TAGS]

Collating this would not give us the result we want. In such cases, editors usually select one layer -- usually the toplayer -- of the multilayered text. If we do that for the example above, it would give us the following input:

[EXAMPLE OF INPUT WITH TEXT FROM TOPLAYER]

Implications

The linearization of this complex multilayered text has a number of implications. Which layer would you select for collation: the ground layer, the top layer or a layer in between? It follows that this editorial decision is not without interpretation. What is more, what is collated is a simplified and linearized rendering of the original transcription: the text of the other layer(s) as well as the remaining information the editor has encoded in the transcription are ignored. A final point to be made has to do with the reason you are sitting here. That is, most editors are not aware of the multiple transformations the input data undergoes. Let's outline for a moment what happens to the text in the most simple case of collation:

  1. The source document is encoded in TEI XML
  2. The TEI file is transformed in a string of characters to be inputted in CollateX, so most of the TEI tags are removed
  3. Collation
  4. The output is transformed again to be more presentable, for instance in an alignment table.

The current unit will explain among others how to "pass on" certain designated markup elements through the collation pipeline. This markup, such as the del and add elements, can then be visualized again in the collation output. This can only be done using the CollateX-specific JSON input. It is also possible to influence the token alignment based on the preserved markup elements. Keep in mind though, that the collation itself is still performed with strings of plain text. The overview above would look slightly different:

  1. The source document is encoded in TEI XML
  2. The TEI file is transformed into JSON to be inputted in CollateX. Certain TEI elements are selected to be preserved during the collation process. The other tags are removed. This is called "pre-processing".
  3. Collation
  4. The collation output in JSON is transformed again to a desired output format (for instance TEI parallel segmentation or an alignment table). The preserved TEI elements are visualized as well.
  5. If needed, the output can be adjusted and collated again to create a better alignment. This is called "post-processing".

With each operation, it becomes more difficult to keep track of the exact state of the text: layers of markup are removed and added; both the text and the markup are transformed several times. You see that semi-automated collation is not a neutral process. We can distinguish between the collation algorithm that performs operations on a linear sequence of bits, and the preparatory process of transforming the TEI file into this linear sequence of bits. Where the former can be considered an objective process, the latter sure is not.

In other words: each operation in the collation process involves a certain amount of (editorial) interpretation, but some of this interpretation may be rather unconsciously added.

Value of collation with XML

Above we showed the challenge of collating the multilayered text of a manuscript. Indeed, collation with XML is quite suitable for a genetically encoded text with deletions and additions. However, preserving markup can be valuable for other types of text as well.

Think for a moment about the kind of markup you would like to preserve during collation. Would that be the carefully documented deletions and additions in the text of a modern manuscript? The line breaks in a poem? What about a shift in writing material or writing hand, or the editorial annotations? Each of these could have a significant value, depending on the kind of text, the material, and the type of edition.

[EXAMPLE OF TEXT W/ OTHER TYPE OF MARKUP TO BE INCLUDED IN COLLATION]