David J. Birnbaum (djbpitt@gmail.com, http://www.obdurodon.org), 2015-06-29
This is the first part of a multi-part tutorial on processing XML with CollateX (http://collatex.net). This example collates a single line of XML from four witnesses. It spells out every step separately, which a real project would not do, but which makes it easy to see how each step contributes to the final result. The output is in the three formats supported natively by CollateX: a plain-text alignment table, JSON, and colored HTML.
Still to come:
Not planned: Post-processing of generic XML output, which is best done separately with XSLT 2.0.
Load libraries
In [1]:
from collatex import *
from lxml import etree
import json,re
Create XSLT stylesheets and functions to use them
In [2]:
addWMilestones = etree.XML("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="no" encoding="UTF-8" omit-xml-declaration="yes"/>
<xsl:template match="*|@*">
<xsl:copy>
<xsl:apply-templates select="node() | @*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="/*">
<xsl:copy>
<xsl:apply-templates select="@*"/>
<!-- insert a <w/> milestone before the first word -->
<w/>
<xsl:apply-templates/>
</xsl:copy>
</xsl:template>
<!-- convert <add>, <sic>, and <crease> to milestones (and leave them that way)
CUSTOMIZE HERE: add other elements that may span multiple word tokens
-->
<xsl:template match="add | sic | crease ">
<xsl:element name="{name()}">
<xsl:attribute name="n">start</xsl:attribute>
</xsl:element>
<xsl:apply-templates/>
<xsl:element name="{name()}">
<xsl:attribute name="n">end</xsl:attribute>
</xsl:element>
</xsl:template>
<xsl:template match="note"/>
<xsl:template match="text()">
<xsl:call-template name="whiteSpace">
<xsl:with-param name="input" select="translate(.,'&#x0a;',' ')"/>
</xsl:call-template>
</xsl:template>
<xsl:template name="whiteSpace">
<xsl:param name="input"/>
<xsl:choose>
<xsl:when test="not(contains($input, ' '))">
<xsl:value-of select="$input"/>
</xsl:when>
<xsl:when test="starts-with($input,' ')">
<xsl:call-template name="whiteSpace">
<xsl:with-param name="input" select="substring($input,2)"/>
</xsl:call-template>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="substring-before($input, ' ')"/>
<w/>
<xsl:call-template name="whiteSpace">
<xsl:with-param name="input" select="substring-after($input,' ')"/>
</xsl:call-template>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>
""")
transformAddW = etree.XSLT(addWMilestones)
xsltWrapW = etree.XML('''
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="xml" indent="no" omit-xml-declaration="yes"/>
<xsl:template match="/*">
<xsl:copy>
<xsl:apply-templates select="w"/>
</xsl:copy>
</xsl:template>
<xsl:template match="w">
<!-- faking <xsl:for-each-group> as well as the "<<" and "except" operators -->
<xsl:variable name="tooFar" select="following-sibling::w[1] | following-sibling::w[1]/following::node()"/>
<w>
<xsl:copy-of select="following-sibling::node()[count(. | $tooFar) != count($tooFar)]"/>
</w>
</xsl:template>
</xsl:stylesheet>
''')
transformWrapW = etree.XSLT(xsltWrapW)
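The second stylesheet relies on an XPath 1.0 idiom for set difference: a node belongs to `$tooFar` exactly when adding it to `$tooFar` does not change the count, so the predicate `count(. | $tooFar) != count($tooFar)` keeps only the nodes outside that set. The following is a minimal sketch of that idiom in isolation, not part of the tutorial pipeline; the stylesheet, the `<out>` wrapper, and the variable names `all`, `excluded`, `demo`, and `flat` are made up for illustration.

```python
from lxml import etree

# Hypothetical demo stylesheet: keep the children of the root that are
# NOT in $excluded (the second <w/> milestone and everything after it).
demo = etree.XSLT(etree.XML('''
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="xml" indent="no" omit-xml-declaration="yes"/>
  <xsl:template match="/*">
    <xsl:variable name="all" select="node()"/>
    <xsl:variable name="excluded" select="w[2] | w[2]/following-sibling::node()"/>
    <out>
      <!-- set difference: $all except $excluded, written in XPath 1.0 -->
      <xsl:copy-of select="$all[count(. | $excluded) != count($excluded)]"/>
    </out>
  </xsl:template>
</xsl:stylesheet>
'''))

# A flattened line with two <w/> milestones, as the first stylesheet would produce
flat = etree.XML("<l><w/>p<abbrev>er</abbrev>dent<w/>cil</l>")
result_text = str(demo(flat)).strip()
print(result_text)
```

Under these assumptions the output contains the first milestone and the nodes up to, but not including, the second one. In XSLT 2.0 the same selection could be written with the `except` operator, which is what the comment in the stylesheet above alludes to.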
Create and examine XML data
In [3]:
A = """<l><abbrev>Et</abbrev>cil i partent seulement</l>"""
B = """<l><abbrev>Et</abbrev>cil i p<abbrev>er</abbrev>dent ausem<abbrev>en</abbrev>t</l>"""
C = """<l><abbrev>Et</abbrev>cil i p<abbrev>ar</abbrev>tent seulema<abbrev>n</abbrev>t</l>"""
D = """<l>E cil i partent sulement</l>"""
ATree = etree.XML(A)
BTree = etree.XML(B)
CTree = etree.XML(C)
DTree = etree.XML(D)
print(A)
print(ATree)
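Printing the string and printing the parsed tree give different results: the first shows the markup, while the second shows only the default representation of an lxml element object. A small sketch of how one might inspect the parsed tree more usefully, using only standard lxml calls (the variable name `serialized` is chosen here for illustration):

```python
from lxml import etree

A = """<l><abbrev>Et</abbrev>cil i partent seulement</l>"""
ATree = etree.XML(A)

# Serialize the tree back to a string to see the markup
serialized = etree.tostring(ATree, encoding='unicode')
print(serialized)

# The root is <l>, with one child element; the rest of the line
# is stored as the tail text of that child
print(ATree.tag, len(ATree))
print(repr(ATree[0].tail))
```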
Tokenize XML input by adding <w> tags and examine the results
In [4]:
ATokenized = transformWrapW(transformAddW(ATree))
BTokenized = transformWrapW(transformAddW(BTree))
CTokenized = transformWrapW(transformAddW(CTree))
DTokenized = transformWrapW(transformAddW(DTree))
print(ATokenized)
Function to convert the word-tokenized witness line into JSON
In [5]:
def XMLtoJSON(id,XMLInput):
    # Capture everything between the outer <w> tags of a serialized token
    unwrapRegex = re.compile('<w>(.*)</w>')
    # Match any single tag, so that markup can be stripped for normalization
    stripTagsRegex = re.compile('<.*?>')
    words = XMLInput.xpath('//w')
    witness = {}
    witness['id'] = id
    witness['tokens'] = []
    for word in words:
        unwrapped = unwrapRegex.match(etree.tostring(word,encoding='unicode')).group(1)
        token = {}
        token['t'] = unwrapped  # reading with inline markup preserved
        token['n'] = stripTagsRegex.sub('',unwrapped.lower())  # lowercased, markup removed
        witness['tokens'].append(token)
    return witness
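To see what the two regular expressions in the function do, it can help to run them on a single serialized token outside the function. This sketch uses a hand-written sample string (an assumption standing in for the output of `etree.tostring()` on one `<w>` element) and snake_case names chosen for illustration.

```python
import re

unwrap_regex = re.compile('<w>(.*)</w>')
strip_tags_regex = re.compile('<.*?>')

# Hand-written sample of one serialized word token
serialized = '<w>p<abbrev>er</abbrev>dent</w>'

# 't' keeps the inline markup; 'n' is lowercased with all tags removed
unwrapped = unwrap_regex.match(serialized).group(1)
normalized = strip_tags_regex.sub('', unwrapped.lower())
print(unwrapped)
print(normalized)

# An empty token serialized as <w/> would not match the unwrap pattern
empty_match = unwrap_regex.match('<w/>')
print(empty_match)
```

The last line shows a limitation of this approach: an empty `<w/>` element does not match the pattern, so calling `.group(1)` on it would raise an error. The sample witnesses here do not produce empty tokens, but other input might.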
Use the function to create JSON input for CollateX, and examine it
In [6]:
json_input = {}
json_input['witnesses'] = []
json_input['witnesses'].append(XMLtoJSON('A',ATokenized))
json_input['witnesses'].append(XMLtoJSON('B',BTokenized))
json_input['witnesses'].append(XMLtoJSON('C',CTokenized))
json_input['witnesses'].append(XMLtoJSON('D',DTokenized))
print(json_input)
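Printing the dictionary directly produces a single long line. As a sketch of the structure that the function builds, here is a hand-written example with one complete witness and one token that carries markup, pretty-printed with the standard `json` module. The data are typed by hand from the witnesses above rather than generated by the pipeline, and the name `sample_input` is made up for illustration.

```python
import json

# Hand-written sample in the shape built above: a list of witnesses,
# each with an 'id' and a list of tokens with 't' (reading) and
# 'n' (normalized form used for comparison)
sample_input = {
    'witnesses': [
        {'id': 'D', 'tokens': [
            {'t': 'E', 'n': 'e'},
            {'t': 'cil', 'n': 'cil'},
            {'t': 'i', 'n': 'i'},
            {'t': 'partent', 'n': 'partent'},
            {'t': 'sulement', 'n': 'sulement'},
        ]},
        {'id': 'B-fragment', 'tokens': [
            {'t': 'p<abbrev>er</abbrev>dent', 'n': 'perdent'},
        ]},
    ]
}

pretty = json.dumps(sample_input, indent=2, ensure_ascii=False)
print(pretty)
```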
Collate the witnesses and view the output as JSON, in a table, and as colored HTML
In [7]:
collationText = collate_pretokenized_json(json_input,output='table',layout='vertical')
print(collationText)
collationJSON = collate_pretokenized_json(json_input,output='json')
print(collationJSON)
collationHTML2 = collate_pretokenized_json(json_input,output='html2')
In [8]:
collation = Collation()
collation.add_plain_witness('A',A)
collation.add_plain_witness('B',B)
collation.add_plain_witness('C',C)
collation.add_plain_witness('D',D)
print(collate(collation,output='table',layout='vertical'))