How to Declare Schemas in your XML Header using lxml

I want my XML to look like this, with two schemas associated in the beginning:

<?xml version="1.0" encoding="utf-8"?>
<?xml-model href="https://raw.githubusercontent.com/LTAC-Global/TBX-Core_dialect/master/Schemas/TBXcoreStructV03_TBX-Core_integrated.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="https://raw.githubusercontent.com/LTAC-Global/TBX-Core_dialect/master/Schemas/TBX-Core.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<body>...</body>

How can I achieve this using the lxml library in Python? This is my starting point:

from lxml import etree
root = etree.Element("body")

I know the first line is added using

etree.tostring(root, encoding='utf-8', xml_declaration=True, pretty_print=True)

Answer

The “first line”

<?xml version="1.0" encoding="utf-8"?>

is the XML declaration. It’s a special construct that can only exist once in an XML file, at the very start, and its only purpose is to tell the XML parser on the receiving end what it will be looking at (the XML version and the encoding). Once the parser has consumed the file and created a document tree, the declaration is gone. As you’ve found out, you switch it on as an option when you serialize the tree to a string.
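
A small sketch to illustrate that point, assuming a made-up one-element document: after parsing, the declaration is not reachable as a node, and it only reappears if you ask for it during serialization.

```python
from lxml import etree

# Hypothetical input: a declaration followed by an empty root element.
xml_bytes = b'<?xml version="1.0" encoding="utf-8"?>\n<body/>'
root = etree.fromstring(xml_bytes)

# The declaration did not become a node, so the root has no preceding sibling.
print(root.getprevious())

# Serializing without options leaves the declaration out ...
without_decl = etree.tostring(root)
# ... and xml_declaration=True writes a fresh one.
with_decl = etree.tostring(root, encoding="utf-8", xml_declaration=True)

print(without_decl)
print(with_decl)
```

lxml typically writes the declaration with single quotes, which means the same thing to an XML parser.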

The other lines are processing instructions. They look similar, but they are conceptually different. They are actual nodes in the document tree, just like <body>...</body>.

<?xml-model href="https://raw.githubusercontent.com/LTAC-Global/TBX-Core_dialect/master/Schemas/TBXcoreStructV03_TBX-Core_integrated.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="https://raw.githubusercontent.com/LTAC-Global/TBX-Core_dialect/master/Schemas/TBX-Core.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<body>...</body>
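
To see that they really are nodes, here is a minimal sketch that parses a document with one processing instruction in front of the root and inspects it. The href value "schema.rng" is a placeholder I made up for brevity, not one of the real schema URLs.

```python
from lxml import etree

# Hypothetical document: one xml-model PI before the root element.
doc = (
    b'<?xml-model href="schema.rng" type="application/xml"?>\n'
    b'<body/>'
)
root = etree.fromstring(doc)

# The PI survives parsing and sits next to the root as a preceding sibling.
pi = root.getprevious()

print(pi.target)  # the name of the processing instruction
print(pi.text)    # the value, kept as one plain string
```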

And this means there is a function to create them, just like there is a function to create elements. Processing instructions have a name and a value. The value is allowed to look like attributes, but it’s really just a string. It could be anything:

from lxml import etree

root = etree.Element("body")
tree = etree.ElementTree(root)

# ProcessingInstruction(target, text): the text is stored as one plain string
pi1 = etree.ProcessingInstruction('xml-model', 'href="https://raw.githubusercontent.com/LTAC-Global/TBX-Core_dialect/master/Schemas/TBXcoreStructV03_TBX-Core_integrated.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"')
pi2 = etree.ProcessingInstruction('it-could-be', 'anything, really')

and once created, you can add them to the document tree like you would any other node, for example “before the root element”:

tree.getroot().addprevious(pi1)
tree.getroot().addprevious(pi2)

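Use etree.tostring(tree, pretty_print=True) to have the output formatted into lines. Explicitly adding newline text ('\n') after each node is the usual alternative inside a tree, but the lxml documentation notes that tail text is discarded for nodes added at the root level, so pretty_print is the more reliable option here.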

With pretty_print=True, the above produces this (the XML declaration is missing because I did not switch it on in etree.tostring()):

<?xml-model href="https://raw.githubusercontent.com/LTAC-Global/TBX-Core_dialect/master/Schemas/TBXcoreStructV03_TBX-Core_integrated.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?it-could-be anything, really?>
<body/>
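
Putting it together for the two schemas from the question, a minimal sketch could look like this. The helper name xml_model and the BASE constant are my own inventions for readability; the lxml calls are the same ones used above.

```python
from lxml import etree

# Common prefix of the two schema URLs from the question (constant name is hypothetical).
BASE = "https://raw.githubusercontent.com/LTAC-Global/TBX-Core_dialect/master/Schemas/"


def xml_model(href, schematypens):
    """Build one xml-model processing instruction (hypothetical helper)."""
    text = f'href="{href}" type="application/xml" schematypens="{schematypens}"'
    return etree.ProcessingInstruction("xml-model", text)


root = etree.Element("body")
tree = etree.ElementTree(root)

# Each call inserts directly before the root, so the PIs end up in call order.
root.addprevious(xml_model(BASE + "TBXcoreStructV03_TBX-Core_integrated.rng",
                           "http://relaxng.org/ns/structure/1.0"))
root.addprevious(xml_model(BASE + "TBX-Core.sch",
                           "http://purl.oclc.org/dsdl/schematron"))

# Serialize the tree rather than the root element: the PIs are siblings of
# the root, not children of it.
output = etree.tostring(tree, encoding="utf-8", xml_declaration=True,
                        pretty_print=True)
print(output.decode("utf-8"))
```

This should give the declaration first, then the RELAX NG instruction, then the Schematron one, then the body element.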

BTW, processing instructions are where PHP gets its tag syntax from: <?php a bunch of PHP code here ?>. They are literally meant to provide a way of containing instructions on how to process the XML tree.

User contributions licensed under: CC BY-SA