Working with XML: lxml

With the lxml library we can read and write XML files, evaluate XPath expressions, and run XSLT transformations.


In [2]:
from lxml import etree
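As a minimal sketch of the XSLT support mentioned in the introduction: the sample document and the stylesheet below are made up for illustration (they are not part of faq.xml), and the stylesheet simply collects all questions into a flat list.

```python
from lxml import etree

# Hypothetical sample document, loosely modelled on an FAQ file.
xml_doc = etree.fromstring(
    "<faq><entry><q>What is an faq?</q></entry>"
    "<entry><q>Who wrote the first faq?</q></entry></faq>"
)

# Hypothetical stylesheet: copy the text of every <q> into an <item>.
xslt_doc = etree.fromstring(
    '<xsl:stylesheet version="1.0" '
    'xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
    '<xsl:template match="/faq">'
    "<questions>"
    '<xsl:for-each select="entry/q">'
    '<item><xsl:value-of select="."/></item>'
    "</xsl:for-each>"
    "</questions>"
    "</xsl:template>"
    "</xsl:stylesheet>"
)

transform = etree.XSLT(xslt_doc)   # compile the stylesheet once
result = transform(xml_doc)        # apply it to the document

items = [item.text for item in result.getroot()]
print(items)
```

The compiled `transform` object can be reused for any number of input documents.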

Opening a file


In [66]:
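# Open in binary mode so that lxml can detect the encoding itself,
# and use a context manager so that the file is closed again.
with open("faq.xml", "rb") as f:
    tree = etree.parse(f)
etree.tostring(tree)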


Out[66]:
b'<faq type="test">\n    <title>This is a litte faq</title>\n    <author>Ms. Unknown and Mr. Underappreciated</author>\n    <version>0.1</version>\n    <date>2017</date>\n    <entry>\n        <q>What is an faq?</q>\n        <a><p>It is an acronym and stands for Frequently Asked Questions.</p></a>\n    </entry>\n    <entry>\n        <q>Who wrote the first faq?</q>\n        <a><p>According to Wikipedia, "The acronym FAQ was developed between 1982 and 1985 by Eugene Miya of NASA for the SPACE mailing list."<link url="https://en.wikipedia.org/wiki/FAQ"/></p></a>\n    </entry>\n</faq>'
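Writing goes in the opposite direction. The following is a small sketch, assuming a tiny stand-in document and a temporary output path (both hypothetical), that serializes a tree to a string and writes it to a file.

```python
import os
import tempfile
from lxml import etree

# A tiny stand-in document (hypothetical; not the faq.xml used above).
root = etree.fromstring('<faq type="test"><title>Demo</title></faq>')
tree = etree.ElementTree(root)

# encoding="unicode" returns a str instead of a bytes object.
as_text = etree.tostring(tree, pretty_print=True, encoding="unicode")
print(as_text)

# Write the tree to disk, including an XML declaration.
out_path = os.path.join(tempfile.mkdtemp(), "faq_copy.xml")
tree.write(out_path, encoding="utf-8", xml_declaration=True, pretty_print=True)

# Read the file back to confirm the round trip.
reread = etree.parse(out_path)
print(reread.getroot().get("type"))
```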

Using XPath

When we evaluate an XPath expression that selects elements, we get a list as the result. If nothing was found, the list is empty.


In [67]:
a = tree.xpath("//version")
len(a)


Out[67]:
1
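The empty-list case is easy to miss, so here is a short sketch of it. The sample document and the element name `editor` are made up for illustration.

```python
from lxml import etree

root = etree.fromstring("<faq><version>0.1</version></faq>")

hits = root.xpath("//version")
misses = root.xpath("//editor")  # no such element in this sample

print(len(hits), len(misses))

# Guard before indexing, because misses[0] would raise an IndexError.
editor = misses[0].text if misses else None
print(editor)
```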

If we want information about the individual elements in the list, we can use the corresponding attributes of the element objects. The important ones here are 'tag', 'text' and 'attrib'.


In [68]:
a[0].tag


Out[68]:
'version'

In [69]:
a[0].text


Out[69]:
'0.1'

In [70]:
b = tree.xpath("//q")
len(b)


Out[70]:
2

In [71]:
for i in b:
    print(i.text)


What is an faq?
Who wrote the first faq?

Attributes are stored in a dictionary-like object.


In [75]:
c = tree.xpath("//faq")
print(c[0].tag)
print(c[0].text)
print(c[0].attrib)


faq

    
{'type': 'test'}

If we know the name of the attribute, we can use a simpler notation:


In [76]:
c[0].get("type")


Out[76]:
'test'
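A brief sketch of how `attrib` and `get` behave for present and missing attributes. The attribute name `lang` and the fallback value are assumptions chosen only for this example.

```python
from lxml import etree

root = etree.fromstring('<faq type="test"/>')

as_dict = dict(root.attrib)          # copy into a plain dict
present = root.get("type")           # value of an existing attribute
missing = root.get("lang")           # None, because there is no such attribute
fallback = root.get("lang", "en")    # explicit default instead of None

print(as_dict, present, missing, fallback)

# set() adds or overwrites an attribute.
root.set("type", "final")
print(root.get("type"))
```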

XPath with namespaces


In [32]:
import io
f = io.StringIO("""
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml"
	schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
   <teiHeader>
      <!-- Angabe zur digitalen Version. -->
      <fileDesc>
         <titleStmt>
            <title>Herr von Sacken - digitalisiertes Novellenschatz-Korpus</title>
            <author>
               <persName>Willibald Alexis</persName>
               <birth>1798-06-29</birth> <death>1871-12-16</death>
               <addName type="realName">Georg Wilhelm Heinrich Härig</addName>
            </author>
            <funder>Digital Humanities Cooperation</funder>
            <principal>Prof. Dr. Thomas Weitin</principal>
         </titleStmt>
     </fileDesc>
     <sourceDesc>
        <biblFull>
           <titleStmt>
              <title>Herr von Sacken</title>
             <author>
                <persName sex="1">Willibald Alexis</persName>
                 <birth>1798-06-29</birth> <death>1871-12-16</death>
                 <addName type="realName">Georg Wilhelm Heinrich Härig</addName>
              </author>
           </titleStmt>
         </biblFull>
    </sourceDesc>
  </teiHeader>
</TEI>
""")
tree = etree.parse(f)

When we use XPath on an XML tree with namespaces, the XPath expression has to take them into account; otherwise we get no error, just an empty result list.


In [33]:
a = tree.xpath("//title")
len(a)


Out[33]:
0

Now with the namespace:


In [38]:
a = tree.xpath("//tei:title", namespaces={'tei': 'http://www.tei-c.org/ns/1.0'})
len(a)


Out[38]:
2
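Repeating the namespace mapping in every call gets tedious. The sketch below, which uses a shortened stand-in document and a constant named `NS` (both introduced here for illustration), shows three ways to reach the same element.

```python
from lxml import etree

# Define the prefix mapping once and reuse it.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

doc = etree.fromstring(
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    "<teiHeader><title>Herr von Sacken</title></teiHeader>"
    "</TEI>"
)

# 1. XPath with a prefix and the shared mapping.
with_prefix = doc.xpath("//tei:title", namespaces=NS)

# 2. findall() accepts the {uri}tag notation directly.
with_uri = doc.findall(".//{http://www.tei-c.org/ns/1.0}title")

# 3. local-name() matches the element name regardless of its namespace.
by_local_name = doc.xpath("//*[local-name() = 'title']")

print(with_prefix[0].text, len(with_uri), len(by_local_name))
```

The third variant is convenient for quick exploration, but it also matches elements of the same name from other namespaces.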

In [39]:
a[1].text


Out[39]:
'Herr von Sacken'

In [40]:
a = tree.xpath("//tei:sourceDesc//tei:title", namespaces={'tei': 'http://www.tei-c.org/ns/1.0'})
a[0].text


Out[40]:
'Herr von Sacken'

In [49]:
a = tree.xpath("//tei:sourceDesc//tei:persName", namespaces={'tei': 'http://www.tei-c.org/ns/1.0'})
a[0].text


Out[49]:
'Willibald Alexis'

In [51]:
a[0].tag


Out[51]:
'{http://www.tei-c.org/ns/1.0}persName'

In [50]:
a[0].attrib


Out[50]:
{'sex': '1'}
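The tag of a namespaced element includes the namespace URI in curly braces. If only the plain element name is needed, `etree.QName` can split the two parts; the single-element document here is a stand-in for the TEI example.

```python
from lxml import etree

el = etree.fromstring(
    '<persName xmlns="http://www.tei-c.org/ns/1.0">Willibald Alexis</persName>'
)

qname = etree.QName(el)
print(el.tag)            # namespace URI in braces, followed by the name
print(qname.namespace)   # only the namespace URI
print(qname.localname)   # only the element name
```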

I am using lxml to extract segments of an XML file. The first XPath expression is supposed to extract a series of XML fragments, and the second is supposed to extract an element from each fragment. But the result set seems to be a set of fragments only superficially: underneath, each fragment is still part of the whole tree, and the second XPath expression is evaluated against the whole document instead of the fragment.


In [3]:
from lxml import etree

t = """<r><a>
             <b>1</b>
          </a>
          <a>
             <b>2</b>
          </a>
          <a>
             <b>3</b>
          </a>
       </r>"""
tree = etree.fromstring(t)

r = tree.xpath("//a")

[etree.tostring(e) for e in r]


Out[3]:
[b'<a>\n             <b>1</b>\n          </a>\n          ',
 b'<a>\n             <b>2</b>\n          </a>\n          ',
 b'<a>\n             <b>3</b>\n          </a>\n       ']

In [5]:
for i in r:
    print(etree.tostring(i))
    print(i.xpath("//a"))


b'<a>\n             <b>1</b>\n          </a>\n          '
[<Element a at 0x7c0cbc8>, <Element a at 0x7c0cd48>, <Element a at 0x7c0cd08>]
b'<a>\n             <b>2</b>\n          </a>\n          '
[<Element a at 0x7c0cbc8>, <Element a at 0x7c0cd48>, <Element a at 0x7c0cd08>]
b'<a>\n             <b>3</b>\n          </a>\n       '
[<Element a at 0x7c0cbc8>, <Element a at 0x7c0cd48>, <Element a at 0x7c0cd08>]

As the last output shows, lxml treats the fragments differently depending on the context. Converting an element to a string yields just the fragment, while applying another XPath expression to it works on the whole document. Any pointers on how to handle this? Obviously I could convert the results to strings, build new XML trees and use the XPath expression on them, but that seems more like a workaround than a working solution. (And I know that in this case two XPath expressions are not needed, but my real-life text is much more complicated.)

The solution to this problem is a full stop (more or less)! An expression that starts with "//" is absolute and always searches from the document root, whereas a leading "." makes it relative to the current element.


In [7]:
for i in r:
    print(etree.tostring(i))
    print(i.xpath("./b")[0].text)


b'<a>\n             <b>1</b>\n          </a>\n          '
1
b'<a>\n             <b>2</b>\n          </a>\n          '
2
b'<a>\n             <b>3</b>\n          </a>\n       '
3
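To make the difference explicit, here is a compact sketch that runs an absolute and a relative expression side by side on each element. The document is the same as above with the whitespace removed.

```python
from lxml import etree

root = etree.fromstring("<r><a><b>1</b></a><a><b>2</b></a><a><b>3</b></a></r>")

absolute_results = []
relative_results = []
for a in root.xpath("//a"):
    # "//b" starts at the document root, whatever element we call it on.
    absolute_results.append([b.text for b in a.xpath("//b")])
    # ".//b" only searches below the current <a> element.
    relative_results.append([b.text for b in a.xpath(".//b")])

print(absolute_results)
print(relative_results)
```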

