Working with XML: lxml

With the lxml library we can read and write XML files, evaluate XPath expressions and run XSLT transformations.



In [2]:

    
from lxml import etree

Opening a file



In [66]:

    
# Passing the file name lets lxml open and close the file itself.
tree = etree.parse("faq.xml")
etree.tostring(tree)









    Out[66]:





b'<faq type="test">\n    <title>This is a litte faq</title>\n    <author>Ms. Unknown and Mr. Underappreciated</author>\n    <version>0.1</version>\n    <date>2017</date>\n    <entry>\n        <q>What is an faq?</q>\n        <a><p>It is an acronym and stands for Frequently Asked Questions.</p></a>\n    </entry>\n    <entry>\n        <q>Who wrote the first faq?</q>\n        <a><p>According to Wikipedia, "The acronym FAQ was developed between 1982 and 1985 by Eugene Miya of NASA for the SPACE mailing list."<link url="https://en.wikipedia.org/wiki/FAQ"/></p></a>\n    </entry>\n</faq>'
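The output above is a bytes object with escaped line breaks, which is hard to read. The following is a minimal sketch of two other ways to serialize a tree. It assumes a small inline document as a stand-in for faq.xml, so the sample string and the variable names are made up for illustration.

```python
from lxml import etree

# Hypothetical stand-in for faq.xml so that the example runs without the file.
xml_bytes = b'<faq type="test"><title>A little faq</title><version>0.1</version></faq>'
root = etree.fromstring(xml_bytes)

# encoding="unicode" makes tostring() return a str instead of bytes;
# pretty_print=True adds line breaks and indentation.
text = etree.tostring(root, encoding="unicode", pretty_print=True)
print(text)

# Serializing to bytes with an XML declaration, e.g. before writing to a file.
data = etree.tostring(root, encoding="utf-8", xml_declaration=True)
print(data)
```

Printing the str version gives one element per line, which is usually the more convenient form for inspecting a document in a notebook.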

Using XPath

When we evaluate an XPath expression that selects nodes, we get a list as the result. If nothing was found, the list is empty.



In [67]:

    
a = tree.xpath("//version")
len(a)









    Out[67]:





1
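Because an unsuccessful search returns an empty list rather than raising an error, it is worth checking the result before indexing into it. The following sketch assumes a shortened, made-up version of the faq document; the element name 'editor' is hypothetical and deliberately absent.

```python
from lxml import etree

# Hypothetical miniature version of the faq document used above.
root = etree.fromstring(
    "<faq><version>0.1</version><entry><q>What is an faq?</q></entry></faq>"
)

found = root.xpath("//version")
missing = root.xpath("//editor")  # no such element in this document

print(len(found), len(missing))

# Indexing an empty list raises IndexError, so check before accessing [0].
editor = missing[0].text if missing else None
print(editor)
```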

If we want information about the individual elements in the list, we can use the corresponding attributes of the element objects. The important ones here are 'tag', 'text' and 'attrib'.



In [68]:

    
a[0].tag









    Out[68]:





'version'



In [69]:

    
a[0].text









    Out[69]:





'0.1'



In [70]:

    
b = tree.xpath("//q")
len(b)









    Out[70]:





2



In [71]:

    
for i in b:
    print(i.text)









    



What is an faq?
Who wrote the first faq?

XML attributes are stored in a dictionary-like object, available as 'attrib'.



In [75]:

    
c = tree.xpath("//faq")
print(c[0].tag)
print(c[0].text)
print(c[0].attrib)









    



faq

    
{'type': 'test'}

If we know the name of the attribute, we can use a simpler notation:



In [76]:

    
c[0].get("type")









    Out[76]:





'test'
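The difference between 'attrib' and 'get' matters most when an attribute is missing. The following is a small sketch, assuming a made-up element modeled on the faq root element; the attributes 'lang' and 'status' are hypothetical and do not occur in faq.xml.

```python
from lxml import etree

# Hypothetical element modeled on the faq root element above.
root = etree.fromstring('<faq type="test" lang="en"/>')

# attrib behaves like a dictionary.
print(dict(root.attrib))
print(root.attrib["type"])

# get() returns None (or a given default) for a missing attribute,
# whereas root.attrib["status"] would raise KeyError at this point.
print(root.get("status"))
print(root.get("status", "draft"))

# set() adds or overwrites an attribute.
root.set("status", "final")
print(root.get("status"))
```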

XPath with namespaces



In [32]:

    
import io
f = io.StringIO("""
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml"
	schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
   <teiHeader>
      <!-- Information about the digital version. -->
      <fileDesc>
         <titleStmt>
            <title>Herr von Sacken - digitalisiertes Novellenschatz-Korpus</title>
            <author>
               <persName>Willibald Alexis</persName>
               <birth>1798-06-29</birth> <death>1871-12-16</death>
               <addName type="realName">Georg Wilhelm Heinrich Häring</addName>
            </author>
            <funder>Digital Humanities Cooperation</funder>
            <principal>Prof. Dr. Thomas Weitin</principal>
         </titleStmt>
     </fileDesc>
     <sourceDesc>
        <biblFull>
           <titleStmt>
              <title>Herr von Sacken</title>
             <author>
                <persName sex="1">Willibald Alexis</persName>
                 <birth>1798-06-29</birth> <death>1871-12-16</death>
                 <addName type="realName">Georg Wilhelm Heinrich Häring</addName>
              </author>
           </titleStmt>
         </biblFull>
    </sourceDesc>
  </teiHeader>
</TEI>
""")
tree = etree.parse(f)

If we use XPath on an XML tree that declares namespaces, we have to take them into account in the XPath expression. Otherwise we do not get an error, but an empty result set.



In [33]:

    
a = tree.xpath("//title")
len(a)









    Out[33]:





0

Now with the namespace:



In [38]:

    
a = tree.xpath("//tei:title", namespaces={'tei': 'http://www.tei-c.org/ns/1.0'})
len(a)









    Out[38]:





2



In [39]:

    
a[1].text









    Out[39]:





'Herr von Sacken'



In [40]:

    
a = tree.xpath("//tei:sourceDesc//tei:title", namespaces={'tei': 'http://www.tei-c.org/ns/1.0'})
a[0].text









    Out[40]:





'Herr von Sacken'



In [49]:

    
a = tree.xpath("//tei:sourceDesc//tei:persName", namespaces={'tei': 'http://www.tei-c.org/ns/1.0'})
a[0].text









    Out[49]:





'Willibald Alexis'



In [51]:

    
a[0].tag









    Out[51]:





'{http://www.tei-c.org/ns/1.0}persName'



In [50]:

    
a[0].attrib









    Out[50]:





{'sex': '1'}
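Repeating the namespace dictionary in every call is error-prone, and the tag names come back with the full namespace URI in front. The following sketch shows one way to handle both. It assumes a shortened, made-up TEI document (only the namespace URI is taken from the example above), and the constant name NS is an arbitrary choice.

```python
from lxml import etree

# Hypothetical, shortened TEI document for illustration.
doc = etree.fromstring(
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    "<teiHeader><sourceDesc><title>Herr von Sacken</title>"
    '<persName sex="1">Willibald Alexis</persName></sourceDesc></teiHeader>'
    "</TEI>"
)

# Define the prefix mapping once and reuse it; the prefix "tei" is arbitrary,
# only the URI has to match the one declared in the document.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

titles = doc.xpath("//tei:sourceDesc/tei:title", namespaces=NS)
print(titles[0].text)

# The tag holds the namespace in the form {uri}name; QName splits it.
name = doc.xpath("//tei:persName", namespaces=NS)[0]
qname = etree.QName(name)
print(qname.namespace)
print(qname.localname)
```

etree.QName is convenient when only the local element name is needed, for example when printing or comparing tags.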

I am using lxml to extract segments of an XML file. The first XPath expression is supposed to extract a series of XML fragments, and the second is supposed to extract an element from each fragment. But it seems that the result set is only superficially a set of fragments: underneath, each element is still part of the whole tree, and the second XPath expression is evaluated against the whole document instead of the fragment.



In [3]:

    
from lxml import etree

t = """<r><a>
             <b>1</b>
          </a>
          <a>
             <b>2</b>
          </a>
          <a>
             <b>3</b>
          </a>
       </r>"""
tree = etree.fromstring(t)

r = tree.xpath("//a")

[etree.tostring(e) for e in r]









    Out[3]:





[b'<a>\n             <b>1</b>\n          </a>\n          ',
 b'<a>\n             <b>2</b>\n          </a>\n          ',
 b'<a>\n             <b>3</b>\n          </a>\n       ']



In [5]:

    
for i in r:
    print(etree.tostring(i))
    print(i.xpath("//a"))









    



b'<a>\n             <b>1</b>\n          </a>\n          '
[<Element a at 0x7c0cbc8>, <Element a at 0x7c0cd48>, <Element a at 0x7c0cd08>]
b'<a>\n             <b>2</b>\n          </a>\n          '
[<Element a at 0x7c0cbc8>, <Element a at 0x7c0cd48>, <Element a at 0x7c0cd08>]
b'<a>\n             <b>3</b>\n          </a>\n       '
[<Element a at 0x7c0cbc8>, <Element a at 0x7c0cd48>, <Element a at 0x7c0cd08>]

As the last output shows, lxml treats the results differently depending on the context. Converting an element to a string yields just the fragment, but applying another XPath expression to it searches the whole document. How should this be handled? Obviously I could convert the results to strings, build new XML trees and apply the XPath expression to them, but that seems more like a workaround than a solution. (And I know that in this case two XPath expressions are not needed, but my real-life text is much more complicated.)

The solution to this problem is a full stop (more or less)! An expression that starts with '//' is always evaluated from the document root, while a leading '.' makes it relative to the current element.



In [7]:

    
for i in r:
    print(etree.tostring(i))
    print(i.xpath("./b")[0].text)









    



b'<a>\n             <b>1</b>\n          </a>\n          '
1
b'<a>\n             <b>2</b>\n          </a>\n          '
2
b'<a>\n             <b>3</b>\n          </a>\n       '
3
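The three variants can be compared side by side. The following sketch assumes a made-up document similar to the one above, extended with a nested 'c' element so that the difference between './b' and './/b' becomes visible.

```python
from lxml import etree

# Hypothetical document; the second a element has a nested b inside c.
root = etree.fromstring(
    "<r><a><b>1</b></a><a><b>2</b><c><b>2.1</b></c></a></r>"
)

results = []
for a in root.xpath("//a"):
    absolute = a.xpath("//b")      # starts at the document root: every b
    children = a.xpath("./b")      # only b elements directly below this a
    descendants = a.xpath(".//b")  # b elements at any depth below this a
    results.append(
        (len(absolute), [b.text for b in children], [b.text for b in descendants])
    )

for row in results:
    print(row)
```

Use './b' when only direct children are wanted and './/b' when the element may be nested more deeply inside the fragment.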



In [2]:

    
print('\a')



In [11]:

    
import winsound



In [17]:

    
winsound.Beep(400, 180)
winsound.Beep(700, 180)


