Integrating XML with Python

NLTK, the Python Natural Languge ToolKit package, is designed to work with plain text input, but sometimes your input is in XML. There are two principal paths to reconciliation: either use an XML environment that supports NLP (natural language processing) or let Python (which supports NLP through NLTK) manage the XML. The first approach, sticking to an XML environment, is illustrated in Week 3 of the Institute in the context of the eXist XML database, which integrates the Stanford Core NLP tools. Here we illustrate the second approach, letting Python manage the XML.

Before you make a mistake

It’s natural to think of parsing (reading, interpreting, and processing) XML with regular expressions, but it’s also Wrong for at least two sets of reasons:

Regular expressions operate over strings, and there are string differences in XML that are not informationally different. For example, the order of attributes on an element, whether the attributes are single- or double-quoted, whether a Unicode character is represented by a raw character or a numerical character reference, and many other details represent string differences that are not informational differences. The same is true of the extent and type of white space in some environments but not others. And the same is true when you have to recognize whether a right angle bracket or a single or double quotation mark is part of content or part of markup. XML-aware processing knows what’s informational and what isn’t, as well as what’s content and what’s markup. You don’t want to reinvent those wheels.
Parsing XML is a recursive operation. For example, if you have two elements of the same type nested inside each other, as in
```
<emphasis><emphasis>a very emphatic thought</emphasis></emphasis>
```
parsing has to match up the correctly paired start and end tags. XML-aware processing knows where it is in the tree. That’s another wheel you don’t want to reinvent.

It’s also natural to think of writing XML by constructing a string, such as concatenating angle brackets and text and other bits and pieces. This is a Bad Idea because some decisions are context sensitive, and keeping track of the context is challenging. For example, attribute values can be quoted with single or double quotation marks, but if the value contains a single or double quotation mark, that can influence the choice, and there are situations where you may need to represent the quotation marks in attribute values with " or ' character entities instead of as raw characters. A library that knows how to write XML will keep track of that for you.

Wrangling XML in Python

The Python Standard Library provides several tools for parsing and creating XML, and there are also third-party packages. In this tutorial we use two parts of the Standard Library: pulldom for parsing XML input and minidom for constructing XML output. You can read more about these modules by clicking on the preceding links to the Standard Library documentation, and also in the Structured text: XML chapter of the eTutorials.org Python tutorial.

To illustrate how to read and write XML with Python we’ll read in a small input XML document, tag each word as a <word> element, and add part of speech (POS) and lemma (dictionary form) information as @pos and @lemma attributes of the <word> elements. We’ll use pulldom to read, parse, and process the input document, NLTK to determine the part of speech and the lemma, and minidom to create the output.

Input XML

Create the following small XML document in a work directory and save with a filename like test.xml:

<root>
    <p speaker="hamlet">Hamlet is a prince of Denmark.</p>
    <p speaker='ophelia'>Things end badly for Ophelia.</p>
    <p speaker="nobody">Julius Caesar does not appear in this play.</p>
</root>

Desired output XML

The desired output is:

<?xml version="1.0" ?>
<root>
    <p speaker="hamlet">
        <word lemma="hamlet" pos="NNP">Hamlet</word>
        <word lemma="be" pos="VBZ">is</word>
        <word lemma="a" pos="DT">a</word>
        <word lemma="prince" pos="NN">prince</word>
        <word lemma="of" pos="IN">of</word>
        <word lemma="denmark" pos="NNP">Denmark</word>
        <word lemma="." pos=".">.</word>
    </p>
    <p speaker="ophelia">
        <word lemma="thing" pos="NNS">Things</word>
        <word lemma="end" pos="VBP">end</word>
        <word lemma="badly" pos="RB">badly</word>
        <word lemma="for" pos="IN">for</word>
        <word lemma="ophelia" pos="NNP">Ophelia</word>
        <word lemma="." pos=".">.</word>
    </p>
    <p speaker="nobody">
        <word lemma="julius" pos="NNP">Julius</word>
        <word lemma="caesar" pos="NNP">Caesar</word>
        <word lemma="do" pos="VBZ">does</word>
        <word lemma="not" pos="RB">not</word>
        <word lemma="appear" pos="VB">appear</word>
        <word lemma="in" pos="IN">in</word>
        <word lemma="this" pos="DT">this</word>
        <word lemma="play" pos="NN">play</word>
        <word lemma="." pos=".">.</word>
    </p>
</root>

The python code

Before you run the code

NLTK is installed by default with Anaconda python, but the word tokenizer isn’t. To install the tokenizer, uncomment the second line below and run (if you’ve already installed the tokenizer, run the cell without uncommenting the second line):



In [1]:

    
import nltk
# nltk.download()









    



showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml






    Out[1]:





True

A separate window will open. Select the Models tab on the top, then averaged_perceptron_tagger and punkt, and then press the Download button. Then select the Corpora tab on the top, then wordnet, and then Download. You only have to download these once on each machine you use; the download process will install them for future use.

The annotation code

Here is the entire Python script that creates the output (we describe how the pieces work below). If you have saved the sample input as test.xml in the same directory as the location of this notebook, you can run the transformation in the notebook now, and the output should be displayed below:



In [6]:

    
#!/usr/bin/env python
"""Tag words and add POS and lemma information in XML document."""

from xml.dom.minidom import Document, Element
from xml.dom import pulldom
import nltk


def create_word_element(d: Document, text: str, pos: str) -> Element:
    """Create <word> element with POS and lemma attributes."""
    word = d.createElement("word")
    word.setAttribute("pos", pos)
    word.setAttribute("lemma", lemmatize(text, pos))
    t = d.createTextNode(text)
    word.appendChild(t)
    return word


def get_wordnet_pos(treebank_tag: str) -> str:
    """Replace treebank POS tags with wordnet ones; default POS is noun."""
    pos_tags = {'J': nltk.corpus.reader.wordnet.ADJ, 'V': nltk.corpus.reader.wordnet.VERB,
                'R': nltk.corpus.reader.wordnet.ADV}
    return pos_tags.get(treebank_tag[0], nltk.corpus.reader.wordnet.NOUN)


def lemmatize(text: str, pos: str) -> str:
    """Identify lemma for current word."""
    return nltk.stem.WordNetLemmatizer().lemmatize(text.lower(), get_wordnet_pos(pos))


def extract(input_xml) -> Document:
    """Process entire input XML document, firing on events."""
    # Initialize output as XML document, point to most recent open node
    d = Document()
    current = d
    # Start pulling; it continues automatically
    doc = pulldom.parse(input_xml)
    for event, node in doc:
        if event == pulldom.START_ELEMENT:
            current.appendChild(node)
            current = node
        elif event == pulldom.END_ELEMENT:
            current = node.parentNode
        elif event == pulldom.CHARACTERS:
            # tokenize, pos-tag, create <word> as child of parent
            words = nltk.word_tokenize(node.toxml())
            tagged_words = nltk.pos_tag(words)
            for (text, pos) in tagged_words:
                word = create_word_element(d, text, pos)
                current.appendChild(word)
    return d


with open('test-1.xml', 'r') as test_in:
    results = extract(test_in)
    print(results.toprettyxml())









    



<?xml version="1.0" ?>
<root>
	<l speaker="hamlet">
		<word lemma="hamlet" pos="NNP">Hamlet</word>
		<word lemma="be" pos="VBZ">is</word>
		<word lemma="a" pos="DT">a</word>
		<word lemma="princess" pos="NN">princess</word>
		<word lemma="of" pos="IN">of</word>
		<word lemma="denmark" pos="NNP">Denmark</word>
		<word lemma="." pos=".">.</word>
	</l>
	<l speaker="ophelia">
		<word lemma="thing" pos="NNS">Things</word>
		<word lemma="end" pos="VBP">end</word>
		<choice>
			<sic>
				<word lemma="badly" pos="RB">badly</word>
			</sic>
			<corr>
				<word lemma="horrible" pos="JJ">horrible</word>
			</corr>
		</choice>
		<word lemma="for" pos="IN">for</word>
		<word lemma="ophelia" pos="NNP">Ophelia</word>
		<word lemma="'s" pos="POS">'s</word>
		<word lemma="mother" pos="NN">mother</word>
		<word lemma="." pos=".">.</word>
	</l>
	<l speaker="nobody">
		<word lemma="julius" pos="NNP">Julius</word>
		<word lemma="caesar-ion" pos="NN">Caesar-ion</word>
		<l>
			<word lemma="dude" pos="NN">dude</word>
		</l>
		<word lemma="do" pos="VBZ">does</word>
		<word lemma="not" pos="RB">not</word>
		<word lemma="appear" pos="VB">appear</word>
		<word lemma="in" pos="IN">in</word>
		<word lemma="this" pos="DT">this</word>
		<word lemma="play" pos="NN">play</word>
		<word lemma="." pos=".">.</word>
	</l>
</root>

How it works

We’ve divided the program into sections below, with explanations after each section.

Shebang and docstring



In [3]:

    
#!/usr/bin/env python
"""Tag words and add POS and lemma information in XML document."""









    Out[3]:





'Tag words and add POS and lemma information in XML document.'

A Python program begins with a shebang and a docstring. The shebang makes it easier to run the program from the command line, and the docstring documents what the program does, The shebang must be the very first line in a program. For now, think of the shebang as a magic incantation that should be copied and pasted verbatim; we explain below what it means. The docstring should be a single line framed by triple quotation marks, and it should describe concisely what the program does. When you execute the docstring by itself, as we do above, it echoes itself to the screen; when you run the program, though, it remains silent.

Imports



In [4]:

    
from xml.dom.minidom import Document
from xml.dom import pulldom
import nltk

We import the ability to create a new XML document, which we’ll use to create our output, from minidom, and we import pulldom to parse the input document. We import nltk because we’ll use it to determine the part of speech and the lemma for each word.

Adding a `<word>` element to the output tree



In [5]:

    
def create_word_element(d: Document, text: str, pos: str) -> Element:
    """Create <word> element with POS and lemma attributes."""
    word = d.createElement("word")
    word.setAttribute("pos", pos)
    word.setAttribute("lemma", lemmatize(text, pos))
    t = d.createTextNode(text)
    word.appendChild(t)
    return word

When we tokenize the text into words below, we pass each word and its part of speech into the create_word_element() function. The function creates a new <word> element, adds the part of speech tag as an attribute, and then uses our lemmatize() function to determine the lemma and add that as an attribute, as well. It then creates a text() node, sets its value as the text of the word, and makes the text() node a child of the new <word> element. Finally, we return the <word> element to the calling routine, which inserts it into the output XML tree in the right place.

Converting treebank part of speech identifiers to Wordnet ones



In [6]:

    
def get_wordnet_pos(treebank_tag: str) -> str:
    """Replace treebank POS tags with wordnet ones; default POS is noun."""
    pos_tags = {'J': nltk.corpus.reader.wordnet.ADJ, 'V': nltk.corpus.reader.wordnet.VERB,
                'R': nltk.corpus.reader.wordnet.ADV}
    return pos_tags.get(treebank_tag[0], nltk.corpus.reader.wordnet.NOUN)

We create a function called get_wordnet_pos(), which we’ll use later. This function is defined as taking one argument, called treebank_tag, which is a string, and it returns a value that is also a string. The reason we need to do this is that the NLTK part of speech tagger uses one set of part of speech identifiers, but Wordnet, the NLTK component that performs lemmatization, uses a different one. Since we do the part of speech tagging first, we use this function to convert that value to one that Wordnet will understand before we perform lemmatization. There are many treebank part of speech tags but only four Wordnet ones, for nouns, verbs, adjectives, and adverbs, and everything else is treated as a noun. Our function returns the correct value for the four defined parts of speech and defaults to the value for nouns otherwise.

Lemmatizing



In [7]:

    
def lemmatize(text: str, pos: str) -> str:
    return nltk.stem.WordNetLemmatizer().lemmatize(text.lower(), get_wordnet_pos(pos))

We define a function called lemmatize() that takes two pieces of input, both of which are strings, and returns a string. The parameter text is the word to be lemmatized and the parameter pos is the part of speech in treebank form. We call the NLTK function to identify the lemma with nltk.stem.WordNetLemmatizer().lemmatize() with two arguments. The lemmatizer expects words to be lower case, so we convert the text to lower case with the lower() string method. And it requires a Wordnet part of speech, and not a treebank one, so we use our get_wordnet_pos() function to perform the conversion.

Doing the work



In [8]:

    
def extract(input_xml) -> Document:
    """Process entire input XML document, firing on events."""
    # Initialize output as XML document, point to most recent open node
    d = Document()
    current = d
    # Start pulling; it continues automatically
    doc = pulldom.parse(input_xml)
    for event, node in doc:
        if event == pulldom.START_ELEMENT:
            current.appendChild(node)
            current = node
        elif event == pulldom.END_ELEMENT:
            current = node.parentNode
        elif event == pulldom.CHARACTERS:
            # tokenize, pos-tag, create <word> as child of parent
            words = nltk.word_tokenize(node.toxml())
            tagged_words = nltk.pos_tag(words)
            for (text, pos) in tagged_words:
                word = create_word_element(d, text, pos)
                current.appendChild(word)
    return d

We refer below to line numbers, and if you’re reading this on line, you won’t those numbers. You can make them appear by running this notebook in a Jupyter session, clicking in the cell above, hitting the Esc key, to switch into command mode, and then typing l (the lowercase letter L), to toggle line numbering.

Our extract() function does all the work, calling on the functions we defined earlier as needed. Here’s how it works (with line numbers):

1: extract() is a function that gets called with one argument, which we assign to a parameter we’ve called input_xml.
4: Near the top of the full program we’ve already used from xml.dom.minidom import Document, Element to make the Document class (and the Element class) available to our program. Here we use it to create a new XML document, which we assign as the value of a new variable d. We’ll use this to build our output document.
5: The variable current points to the node that will be the parent of any new elements. The document node is the root of the entire document, so it’s the initial value of the current variable.
y: pulldom is a streaming parser, which means that once we start processing elements in the XML input tree, the parser keeps going until it has visited every node of the tree. We start that process with pulldom.parse(), telling it to parse the document we passed to it as the value of the input_xml parameter.
8: Parsing generates events like the start or end of an element or the presence of character data. There are other possible events, but these are the only ones we need to handle for our transformation. Each event provides a tuple that consists of two values, the name of the event (e.g., START_ELEMENT) and the value (e.g., an object of type node). We test the event type and process different types of events differently.
9–11: When we start a new element, we make it a child of the node identified by our current variable. This ensures that the output tree that we’re building will reproduce the structure of the input tree, and it also ensures that we create new <word> elements in the correct place. When we start an element, it’s the parent of any nodes we encounter until we find the corresponding END_ELEMENT event, so we make it a child of whatever node is current at the moment and then set the current variable to point to the node we just created. This means that, for example, when we encounter the first child of the root element of the input XML, we’ll make that a child of the root element of the output XML that we’re constructing.
12–13: When we encounter an END_ELEMENT event, that element can’t have any more children, so we set the current variable to point to its parent.
14–20: We’ll illustrate how the individual lines work below, but here’s a summary with everything in one place. When we encounter CHARACTERS while parsing, the value of the node is an XML text() node, and not a string. We convert it to a plain text string with the toxml() method, let NLTK break it into words with nltk.word_tokenize(), and assign the pieces to an array called words (line 16). Next, the nltk.pos_tag() function takes an array of words as its input (our words variable) and returns an array of tuples, that is, pairs of strings where the first is the original input word and the second is the part of speech according to treebank notation (17). It assigns this new array as the value of the tagged_words variable. We want to create a new <word> element in the output for each word, so we loop over that list of tuples (18). For each word, we call our create_word_element() function, which we defined earlier, and set the value of the variable word equal to the new <word> element (19). Finally, we make the new word a child of the current element, the one that was its parent in the input (20). There are other types of parse events, but we don’t need to do anything with them in this example, so we don’t write any code to process them.

Remind me about those NLTK functions again

`nltk.word_tokenize()`

nltk.word_tokenize() splits a text into words. It’s smarter than just splitting on white space; it treats punctuation as a word, and it knows about common English contractions:



In [9]:

    
sample = "We didn't realize that we could split contractions!"
nltk.word_tokenize(sample)









    Out[9]:





['We',
 'did',
 "n't",
 'realize',
 'that',
 'we',
 'could',
 'split',
 'contractions',
 '!']

`nltk.pos_tag()`

nltk.pos_tag() takes a list of words (not a sentence) as its input. That means that we need to tokenize the sentence before tagging:



In [10]:

    
sample = "We didn't realize that we could split contractions!"
words = nltk.word_tokenize(sample)
nltk.pos_tag(words)









    Out[10]:





[('We', 'PRP'),
 ('did', 'VBD'),
 ("n't", 'RB'),
 ('realize', 'VB'),
 ('that', 'IN'),
 ('we', 'PRP'),
 ('could', 'MD'),
 ('split', 'VB'),
 ('contractions', 'NNS'),
 ('!', '.')]

You can look up the part of speech tags at https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.

`nltk.stem.WordNetLemmatizer().lemmatize()`

The Wordnet lemmatizer tries to lemmatize (find the dictionary form) of a word with or without part of speech information, but without the part of speech, it guesses that everything is a noun. Remember that Wordnet knows about only nouns, verbs, adjectives, and adverbs, and that the part of speech tags are different in Wordnet than in the treebank system. Oh, and it assumes lower-case input, so if you give it a capitalized word, it won’t recognize it as an inflected form of something else, and will therefore return it unchanged.



In [11]:

    
words = ['thing', 'things', 'Things']
[(word + ": " + nltk.stem.WordNetLemmatizer().lemmatize(word)) for word in words]









    Out[11]:





['thing: thing', 'things: thing', 'Things: Things']

In the example above, the lemmatizer recognizes that “thing” is the lemma for “things”, but it fails to lemmatize “Things” correctly because of the upper case letter.



In [12]:

    
words = [('building','n'), ('building','v')]
[(word + ": " + nltk.stem.WordNetLemmatizer().lemmatize(word, pos)) for (word, pos) in words]









    Out[12]:





['building: building', 'building: build']

In the example above, we supplied part of speech information, and the lemmatizer correctly treats “building” differently as a noun than as a verb. If we don’t specify a part of speech, it assumes everything is a noun:



In [13]:

    
words = ['building']
[(word + ": " + nltk.stem.WordNetLemmatizer().lemmatize(word)) for word in words]









    Out[13]:





['building: building']

Input and output



In [14]:

    
with open('test.xml', 'r') as test_in:
    results = extract(test_in)
    print(results.toprettyxml())









    



<?xml version="1.0" ?>
<root>
	<p speaker="hamlet">
		<word lemma="hamlet" pos="NNP">Hamlet</word>
		<word lemma="be" pos="VBZ">is</word>
		<word lemma="a" pos="DT">a</word>
		<word lemma="prince" pos="NN">prince</word>
		<word lemma="of" pos="IN">of</word>
		<word lemma="denmark" pos="NNP">Denmark</word>
		<word lemma="." pos=".">.</word>
	</p>
	<p speaker="ophelia">
		<word lemma="thing" pos="NNS">Things</word>
		<word lemma="end" pos="VBP">end</word>
		<word lemma="badly" pos="RB">badly</word>
		<word lemma="for" pos="IN">for</word>
		<word lemma="ophelia" pos="NNP">Ophelia</word>
		<word lemma="." pos=".">.</word>
	</p>
	<p speaker="nobody">
		<word lemma="julius" pos="NNP">Julius</word>
		<word lemma="caesar" pos="NNP">Caesar</word>
		<word lemma="do" pos="VBZ">does</word>
		<word lemma="not" pos="RB">not</word>
		<word lemma="appear" pos="VB">appear</word>
		<word lemma="in" pos="IN">in</word>
		<word lemma="this" pos="DT">this</word>
		<word lemma="play" pos="NN">play</word>
		<word lemma="." pos=".">.</word>
	</p>
</root>

We could, alternatively, have opened a file handle to read the file from disk, read it, and saved the results with:



In [15]:

    
contents = open('test.xml','r').read()

This isn’t considered good practice, though, because it leaves the file handle (the way the program interacts with the file) open when it’s done, that is, even when it no longer needs it. Python eventually closes file handles, so no real harm is done, but there are situations where failing to close a file handle can have adverse consequences. For that reason, it’s good practice always to use the with construction to open files, since it ensures that they will be closed properly as soon as they are no longer being used. In this case, we open test.xml and assign it to a new variable called test_in. In the second argument to the open() command, the r opens the file for reading.

We use the file as input to the extract() function we defined earlier, and when the function returns its results (the new XML document it has created), we assign those results to a new variable that we call results. Since that’s an XML document (a tree), we need to serialize it (convert it to a character stream) before we can output it by using the print() function. The toprettyxml() method of a minidom document serializes the tree and pretty-prints it, that is, indents it to make the hierarchy easier to read. You can, alternatively, serialize it without pretty-printing with the toxml() method.

In case you’re curious

The following information isn’t needed for Institute activities, but if you expect to doing complex processing of XML in Python, here is a brief survey of the options.

XML support in Python

If you’ve used XSLT (except for the new streaming facility of XSLT 3.0) to process XML before, you’ve been doing DOM-based (Document Object Model) processing, which parses the input, builds the entire tree in memory, and operates over it. DOM-based processing makes the entire tree available at all times, which is often what you want, but it isn’t the most efficient (in terms of speed and memory) approach, so if you don’t need the entire tree at once, you may prefer an alternative. Python has support for the DOM, which you can read about in the Standard Library reference in xml.dom — The Document Object Model API and at Parsing XML with DOM. For constructing XML, Python also provides xml.dom.minidom — Minimal DOM implementation, which “is intended to be simpler than the full DOM and also significantly smaller”. The most complete DOM resource in Python is xml.etree.ElementTree — The ElementTree XML API, and the third-part lxml package is further enhanced. See the note below about Installing lxml if you’d like to try working with it.

The primary alternative to DOM parsing in the XML world is SAX (Simple API for XML; API = ‘application programming interface’), which is a streaming parser. Instead of building the entire tree in memory, a streaming parser acts on events like the start or end of an element, an attribute, character data, etc., and it processes each event and then moves onto the next one. In situations where you can do everything you need with an event right away, and don’t need to return to it later, SAX will be faster than DOM. Python has support for SAX processing, which you can read about in the Standard Library reference in xml.sax — Support for SAX2 parsers or in Parsing XML with SAX.

pulldom, which we use in this tutorial, has been described as follows:

pulldom occupies an interesting middle ground between SAX and DOM, presenting the stream of parsing events as a Python iterator object so that you do not code callbacks, but rather loop over the events and examine each event to see if it’s of interest. When you do find an event of interest to your application, you can ask pulldom to build the DOM subtree rooted in that event’s node by calling method expandNode, and then work with that subtree as you would in minidom. Paul Prescod, pulldom’s author and XML and Python expert, describes the net result as “80% of the performance of SAX, 80% of the convenience of DOM.” (Parsing XML with DOM)

Installing lxml

Mac and Linux users can install lxml through pypi by typing pip install lxml, but this installation requires developer tools that Windows users typically don’t have installed. Windows users can instead download prebuilt binaries from Christopher Gohlke’s Unofficial Windows Binaries for Python Extension Packages and install them by following the Installing from wheels instructions.

About the shebang

You can run a Python program called something like tag_words.py from the command line by typing python tag_words.py. This starts the Python interpreter and tells it that the program it should interpret (and run) is a file called tag_words.py in the current directory. On a Unix or Mac system, though, you can also run it without specifying python first, that is, by making it an executable program. To do that:

At the command line in the directory where the program file is located, type chmod +x tag_words.py. The chmod (change mode) command sets access rights for reading, writing, and execution, and the +x makes this file executable. That means that it no longer needs to be executed by calling Python separately; it knows how to execute itself. You need to do this only once; the file will remain executable, even if you edit it later.
Since programs in many languages can be made executable, how does the system know what kind of program it is, that is, how does it know to execute this one as a Python program? That’s what the shebang does. The leading #! identifies the line as a shebang (only if it’s the very first line of the file), and you can read how it works at https://stackoverflow.com/questions/43793040/how-does-usr-bin-env-work-in-a-linux-shebang-line.
Once you’ve configured the shebang and made the file executable, it’s a program you can run, but to run a program either you must tell the system where it’s located or it has to be in the execution path in your environment (you can examine that with echo $PATH). By default the current directory is not automatically in the path. You can either move the executable file into a directory that is in your path (a common choice is /usr/local/bin) or specify the path to it in the current directory by prepending ./, e.g., ./tag_words.py. The dot means ‘current directory’, and when you supply a path to an executable, your system can run it without having to look for it on the environment execution path.

Comments in `pulldom`

pulldom is supposed to be able to respond to comment events, that is, to comments (delimited by ) in the XML input. It doesn’t work when we try it. If comments are important to your processing, you’ll need to use a parser other than pulldom; in this case, we happen not to care.