CollateX and XML, Part 1

David J. Birnbaum (djbpitt@gmail.com, http://www.obdurodon.org), 2015-06-29

This is the first part of a multi-part tutorial on processing XML with CollateX (http://collatex.net). This example collates a single line of XML from four witnesses. It spells out each step in more detail than a real project would need, which makes it easy to see how each step moves toward the final result. The output is in the three formats supported natively by CollateX: a plain-text alignment table, JSON, and colored HTML.

Still to come:

  • Part 2: Restructuring the code to use Python classes
  • Part 3: Reading multiline input from files

Not planned: Post-processing of generic XML output, which is best done separately with XSLT 2.0.

Load libraries


In [32]:
from collatex import *
from lxml import etree
import json,re

Create XSLT stylesheets and functions to use them


In [33]:
addWMilestones = etree.XML("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="no" encoding="UTF-8" omit-xml-declaration="yes"/>
    <xsl:template match="*|@*">
        <xsl:copy>
            <xsl:apply-templates select="node() | @*"/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="/*">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <!-- insert a <w/> milestone before the first word -->
            <w/>
            <xsl:apply-templates/>
        </xsl:copy>
    </xsl:template>
    <!-- convert <add>, <sic>, and <crease> to milestones (and leave them that way)
         CUSTOMIZE HERE: add other elements that may span multiple word tokens
    -->
    <xsl:template match="add | sic | crease ">
        <xsl:element name="{name()}">
            <xsl:attribute name="n">start</xsl:attribute>
        </xsl:element>
        <xsl:apply-templates/>
        <xsl:element name="{name()}">
            <xsl:attribute name="n">end</xsl:attribute>
        </xsl:element>
    </xsl:template>
    <xsl:template match="note"/>
    <xsl:template match="text()">
        <xsl:call-template name="whiteSpace">
            <xsl:with-param name="input" select="translate(.,'&#x0a;',' ')"/>
        </xsl:call-template>
    </xsl:template>
    <xsl:template name="whiteSpace">
        <xsl:param name="input"/>
        <xsl:choose>
            <xsl:when test="not(contains($input, ' '))">
                <xsl:value-of select="$input"/>
            </xsl:when>
            <xsl:when test="starts-with($input,' ')">
                <xsl:call-template name="whiteSpace">
                    <xsl:with-param name="input" select="substring($input,2)"/>
                </xsl:call-template>
            </xsl:when>
            <xsl:otherwise>
                <xsl:value-of select="substring-before($input, ' ')"/>
                <w/>
                <xsl:call-template name="whiteSpace">
                    <xsl:with-param name="input" select="substring-after($input,' ')"/>
                </xsl:call-template>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>
</xsl:stylesheet>

""")
transformAddW = etree.XSLT(addWMilestones)
                           
xsltWrapW = etree.XML('''
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="xml" indent="no" omit-xml-declaration="yes"/>
    <xsl:template match="/*">
        <xsl:copy>
            <xsl:apply-templates select="w"/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="w">
        <!-- faking <xsl:for-each-group> as well as the "<<" and "except" operators -->
        <xsl:variable name="tooFar" select="following-sibling::w[1] | following-sibling::w[1]/following::node()"/>
        <w>
            <xsl:copy-of select="following-sibling::node()[count(. | $tooFar) != count($tooFar)]"/>
        </w>
    </xsl:template>
</xsl:stylesheet>
''')
transformWrapW = etree.XSLT(xsltWrapW)
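
The recursive whiteSpace template in the first stylesheet can be hard to follow in XSLT 1.0 syntax. The sketch below restates its logic as a plain Python loop. It is an illustration only: the function name split_on_spaces and the use of the string '<w/>' to stand for a milestone element are assumptions of this sketch, not part of the notebook.

```python
def split_on_spaces(text):
    """Restate the whiteSpace template: split on spaces, marking each break with '<w/>'."""
    # The calling template first turns newlines into spaces: translate(., '&#x0a;', ' ')
    text = text.replace('\n', ' ')
    pieces = []
    while ' ' in text:
        if text.startswith(' '):
            # Second <xsl:when>: drop a leading space without emitting a milestone
            text = text[1:]
        else:
            # <xsl:otherwise>: emit the first word, then a milestone, then continue with the rest
            before, _, text = text.partition(' ')
            pieces.append(before)
            pieces.append('<w/>')
    # First <xsl:when>: no spaces left, so emit whatever remains (possibly nothing)
    if text:
        pieces.append(text)
    return pieces

print(split_on_spaces('cil i partent seulement'))
# ['cil', '<w/>', 'i', '<w/>', 'partent', '<w/>', 'seulement']
```

Note that, as the template is written, a space at the very start of a text node is discarded rather than turned into a milestone, so a word break that falls immediately after a closing tag is not recorded.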

Create and examine XML data


In [34]:
A = """<l><abbrev>Et</abbrev>cil i partent seulement</l>"""
B = """<l><abbrev>Et</abbrev>cil i p<abbrev>er</abbrev>dent ausem<abbrev>en</abbrev>t</l>"""
C = """<l><abbrev>Et</abbrev>cil i p<abbrev>ar</abbrev>tent seulema<abbrev>n</abbrev>t</l>"""
D = """<l>E cil i partent sulement</l>"""

ATree = etree.XML(A)
BTree = etree.XML(B)
CTree = etree.XML(C)
DTree = etree.XML(D)

print(A)
print(ATree)


<l><abbrev>Et</abbrev>cil i partent seulement</l>
<Element l at 0x10bc141c8>

Tokenize XML input by adding <w> tags and examine the results


In [35]:
ATokenized = transformWrapW(transformAddW(ATree))
BTokenized = transformWrapW(transformAddW(BTree))
CTokenized = transformWrapW(transformAddW(CTree))
DTokenized = transformWrapW(transformAddW(DTree))

print(ATokenized)


<l><w><abbrev>Et</abbrev>cil</w><w>i</w><w>partent</w><w>seulement</w></l>
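
To make the two stages concrete, the sketch below starts from the milestone form that the first stylesheet should produce for witness A (worked out by hand from the templates, not captured from a run) and regroups it into <w> wrappers with plain string operations. It only imitates the second stylesheet for this simple case; the helper name wrap_words and the assumption that the root element has no attributes are specific to this sketch.

```python
# Milestone form of witness A after the first stage (derived by hand from the templates)
milestoned = '<l><w/><abbrev>Et</abbrev>cil<w/>i<w/>partent<w/>seulement</l>'

def wrap_words(flat, root='l'):
    """Wrap the content following each <w/> milestone in a <w> element (string-level sketch)."""
    open_tag, close_tag = '<%s>' % root, '</%s>' % root
    inner = flat[len(open_tag):-len(close_tag)]
    # Anything before the first milestone is discarded, as in the second stylesheet,
    # which copies only the nodes that follow each <w/>
    chunks = inner.split('<w/>')[1:]
    return open_tag + ''.join('<w>%s</w>' % chunk for chunk in chunks) + close_tag

print(wrap_words(milestoned))
# <l><w><abbrev>Et</abbrev>cil</w><w>i</w><w>partent</w><w>seulement</w></l>
```

Splitting on a string works here only because the first stage has already flattened any element that might span several words into start and end milestones. The XSLT version operates on nodes and does not rely on string matching.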

Function to convert the word-tokenized witness line into JSON


In [36]:
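def XMLtoJSON(id,XMLInput):
    # Capture everything between the outer <w> and </w> of a serialized word
    unwrapRegex = re.compile(r'<w>(.*)</w>')
    # Match any single tag (non-greedy); used to strip markup when normalizing
    stripTagsRegex = re.compile(r'<.*?>')
    words = XMLInput.xpath('//w')
    witness = {}
    witness['id'] = id
    witness['tokens'] = []
    for word in words:
        # Serialize the <w> element and remove its wrapper, keeping any internal markup
        unwrapped = unwrapRegex.match(etree.tostring(word,encoding='unicode')).group(1)
        token = {}
        # 't' is the reading as it will be output, markup included
        token['t'] = unwrapped
        # 'n' is the normalized form used for comparison: lower-cased, tags removed
        token['n'] = stripTagsRegex.sub('',unwrapped.lower())
        witness['tokens'].append(token)
    return witness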
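
The two regular expressions do different jobs: one removes the <w> wrapper to produce the 't' value, and the other strips all remaining tags to produce the 'n' value used for comparison. A minimal sketch of both steps on a hand-written serialized token, using only the standard library (the variable and function names here are illustrative, not the notebook's):

```python
import re

unwrap = re.compile(r'<w>(.*)</w>')   # same pattern as unwrapRegex
strip_tags = re.compile(r'<.*?>')     # same pattern as stripTagsRegex

def normalize(t):
    """Lower-case a token and remove any tags, mirroring how the 'n' value is built."""
    return strip_tags.sub('', t.lower())

serialized = '<w>p<abbrev>er</abbrev>dent</w>'
t = unwrap.match(serialized).group(1)
print(t)             # p<abbrev>er</abbrev>dent
print(normalize(t))  # perdent
```

Because the first pattern is greedy, it runs to the last </w> in the string, which is what is wanted when each serialized string holds exactly one <w> element.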

Use the function to create JSON input for CollateX, and examine it


In [37]:
json_input = {}
json_input['witnesses'] = []
json_input['witnesses'].append(XMLtoJSON('A',ATokenized))
json_input['witnesses'].append(XMLtoJSON('B',BTokenized))
json_input['witnesses'].append(XMLtoJSON('C',CTokenized))
json_input['witnesses'].append(XMLtoJSON('D',DTokenized))
print(json_input)


{'witnesses': [{'id': 'A', 'tokens': [{'n': 'etcil', 't': '<abbrev>Et</abbrev>cil'}, {'n': 'i', 't': 'i'}, {'n': 'partent', 't': 'partent'}, {'n': 'seulement', 't': 'seulement'}]}, {'id': 'B', 'tokens': [{'n': 'etcil', 't': '<abbrev>Et</abbrev>cil'}, {'n': 'i', 't': 'i'}, {'n': 'perdent', 't': 'p<abbrev>er</abbrev>dent'}, {'n': 'ausement', 't': 'ausem<abbrev>en</abbrev>t'}]}, {'id': 'C', 'tokens': [{'n': 'etcil', 't': '<abbrev>Et</abbrev>cil'}, {'n': 'i', 't': 'i'}, {'n': 'partent', 't': 'p<abbrev>ar</abbrev>tent'}, {'n': 'seulemant', 't': 'seulema<abbrev>n</abbrev>t'}]}, {'id': 'D', 'tokens': [{'n': 'e', 't': 'E'}, {'n': 'cil', 't': 'cil'}, {'n': 'i', 't': 'i'}, {'n': 'partent', 't': 'partent'}, {'n': 'sulement', 't': 'sulement'}]}]}
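
The one-line printout is hard to scan. As a sketch, the standard json module can pretty-print the same structure. The example uses a copy of witness B typed in by hand from the output above, so that it runs without CollateX or lxml; the variable names are illustrative.

```python
import json

# Witness B, copied by hand from the printed input above
witness_b = {
    'id': 'B',
    'tokens': [
        {'t': '<abbrev>Et</abbrev>cil', 'n': 'etcil'},
        {'t': 'i', 'n': 'i'},
        {'t': 'p<abbrev>er</abbrev>dent', 'n': 'perdent'},
        {'t': 'ausem<abbrev>en</abbrev>t', 'n': 'ausement'},
    ],
}
sample_input = {'witnesses': [witness_b]}

# indent=2 puts each key on its own line; ensure_ascii=False keeps any
# non-ASCII characters in the readings legible
pretty = json.dumps(sample_input, indent=2, ensure_ascii=False)
print(pretty)
```

Each witness is a dictionary with an 'id' and a list of 'tokens', and each token pairs the reading to be output ('t') with its normalized form ('n').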

Collate the witnesses and view the output in a table, as JSON, and as colored HTML


In [38]:
collationText = collate(json_input,output='table',layout='vertical')
print(collationText)
collationJSON = collate(json_input,output='json')
print(collationJSON)
collationHTML2 = collate(json_input,output='html2')


+----------------------+----------------------+----------------------+----------+
|          A           |          B           |          C           |    D     |
+----------------------+----------------------+----------------------+----------+
| <abbrev>Et</abbrev>c | <abbrev>Et</abbrev>c | <abbrev>Et</abbrev>c |   Ecil   |
|          il          |          il          |          il          |          |
+----------------------+----------------------+----------------------+----------+
|          i           |          i           |          i           |    i     |
+----------------------+----------------------+----------------------+----------+
|       partent        | p<abbrev>er</abbrev> | p<abbrev>ar</abbrev> | partent  |
|                      | dentausem<abbrev>en< |         tent         |          |
|                      |      /abbrev>t       |                      |          |
+----------------------+----------------------+----------------------+----------+
|      seulement       |          -           | seulema<abbrev>n</ab | sulement |
|                      |                      |        brev>t        |          |
+----------------------+----------------------+----------------------+----------+
{"table": [[[{"n": "etcil", "t": "<abbrev>Et</abbrev>cil"}], [{"n": "i", "t": "i"}], [{"n": "partent", "t": "partent"}], [{"n": "seulement", "t": "seulement"}]], [[{"n": "etcil", "t": "<abbrev>Et</abbrev>cil"}], [{"n": "i", "t": "i"}], [{"n": "perdent", "t": "p<abbrev>er</abbrev>dent"}, {"n": "ausement", "t": "ausem<abbrev>en</abbrev>t"}], null], [[{"n": "etcil", "t": "<abbrev>Et</abbrev>cil"}], [{"n": "i", "t": "i"}], [{"n": "partent", "t": "p<abbrev>ar</abbrev>tent"}], [{"n": "seulemant", "t": "seulema<abbrev>n</abbrev>t"}]], [[{"n": "e", "t": "E"}, {"n": "cil", "t": "cil"}], [{"n": "i", "t": "i"}], [{"n": "partent", "t": "partent"}], [{"n": "sulement", "t": "sulement"}]]], "witnesses": ["A", "B", "C", "D"]}
[Colored HTML alignment table: one column per witness (A, B, C, D) and one row per aligned segment, showing the same alignment as the plain-text table above.]

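
The JSON output is the easiest of the three formats to process further. In the output above, "table" holds one row per witness, each row holds one cell per aligned segment, and each cell is either a list of tokens or null where the witness has no reading. The sketch below, which uses only the standard library and a copy of the rows for witnesses A and B (C and D are omitted for brevity), reduces each cell to its normalized text. The function name readings is an assumption of this sketch.

```python
import json

# Rows for witnesses A and B, copied from the JSON output above
collation_json = '''{"table": [
  [[{"n": "etcil", "t": "<abbrev>Et</abbrev>cil"}], [{"n": "i", "t": "i"}],
   [{"n": "partent", "t": "partent"}], [{"n": "seulement", "t": "seulement"}]],
  [[{"n": "etcil", "t": "<abbrev>Et</abbrev>cil"}], [{"n": "i", "t": "i"}],
   [{"n": "perdent", "t": "p<abbrev>er</abbrev>dent"},
    {"n": "ausement", "t": "ausem<abbrev>en</abbrev>t"}], null]
], "witnesses": ["A", "B"]}'''

collation = json.loads(collation_json)

def readings(collation):
    """Map each witness to its list of cells, joining 'n' values and using None for gaps."""
    result = {}
    for siglum, row in zip(collation['witnesses'], collation['table']):
        result[siglum] = [
            None if cell is None else ' '.join(token['n'] for token in cell)
            for cell in row
        ]
    return result

for siglum, cells in readings(collation).items():
    print(siglum, cells)
# A ['etcil', 'i', 'partent', 'seulement']
# B ['etcil', 'i', 'perdent ausement', None]
```

Joining the tokens in a cell with a space is a choice made in this sketch. The plain-text table above appears to run the tokens of a cell together without a separator, which is why witness D shows "Ecil".
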
Here’s what would have happened if the raw XML strings had been collated as plain text, without tokenizing them or stripping the markup:

In [39]:
collation = Collation()
collation.add_plain_witness('A',A)
collation.add_plain_witness('B',B)
collation.add_plain_witness('C',C)
collation.add_plain_witness('D',D)
print(collate(collation,output='table',layout='vertical'))


+-------------+-------------+--------------+----------+
|      A      |      B      |      C       |    D     |
+-------------+-------------+--------------+----------+
|      <l     |      <l     |      <l      |    <l    |
+-------------+-------------+--------------+----------+
|   ><abbrev  |   ><abbrev  |   ><abbrev   |    -     |
+-------------+-------------+--------------+----------+
|      >      |      >      |      >       |    >     |
+-------------+-------------+--------------+----------+
| Et</abbrev> | Et</abbrev> | Et</abbrev>  |    E     |
+-------------+-------------+--------------+----------+
|    cil i    |    cil i    |    cil i     |  cil i   |
+-------------+-------------+--------------+----------+
|   partent   |      p<     |      p<      | partent  |
+-------------+-------------+--------------+----------+
|  seulement  |    abbrev   |    abbrev    | sulement |
+-------------+-------------+--------------+----------+
|      -      |      >      |      >       |    -     |
+-------------+-------------+--------------+----------+
|      -      |      er     |      ar      |    -     |
+-------------+-------------+--------------+----------+
|      -      |  </abbrev>  |  </abbrev>   |    -     |
+-------------+-------------+--------------+----------+
|      -      |  dent ausem | tent seulema |    -     |
+-------------+-------------+--------------+----------+
|      -      |   <abbrev>  |   <abbrev>   |    -     |
+-------------+-------------+--------------+----------+
|      -      |      en     |      n       |    -     |
+-------------+-------------+--------------+----------+
|      -      |  </abbrev>t |  </abbrev>t  |    -     |
+-------------+-------------+--------------+----------+
|     </l>    |     </l>    |     </l>     |   </l>   |
+-------------+-------------+--------------+----------+