I am trying to remove parent XML elements when specific child elements contain the text “nan”. The input XML uses namespaces, which makes this trickier than expected: I can remove selected child elements individually, but not their associated parent elements. Currently I can only remove the “nan” value in the gam:String element, but I would like to remove every child element whose text is “nan”, together with its associated parent elements.
Below is the script I am using, along with the input and (desired) output XML. Any assistance is most appreciated!
The Script:
from lxml import etree
import os

# Raw string, so the backslashes are not read as escape sequences
path = r"C:\users\mdl518\Desktop"

### Removing "nan" values
tree = etree.parse(os.path.join(path, "metadata_info.xml"))

for elem in tree.findall('.//{http://standards.iso.org/iso/19115/-3/gam/1.0}String'):
    if elem.text == 'nan':
        parent = elem.getparent()
        parent.remove(elem)

with open(".//metadata_output.xml", "wb") as f:
    f.write(etree.tostring(tree, xml_declaration=True, encoding='utf-8'))  ## Removes elements with "nan" values
Input XML:
<?xml version='1.0' encoding='utf-8'?>
<nas:metadata xmlns:nas="http://www.arcgis.com/schema/nas/base"
              xmlns:mcc="http://standards.org/iso/19115/-3/mcc/1.0"
              xmlns:mdl="http://standards.org/iso/19115/-3/mdl/1.0"
              xmlns:mnl="http://standards.org/iso/19115/-3/mnl/1.0"
              xmlns:lan="http://standards.org/iso/19115/-3/lan/1.0"
              xmlns:lis="http://standards.org/iso/19115/-3/lis/1.0"
              xmlns:gam="http://standards.org/iso/19115/-3/gam/1.0">
  <mdl:metadataIdentifier>
    <mcc:MD_Identifier>
      <mnl:name>
        <mnl:type>
          <gam:String>The Metadata File</gam:String>
        </mnl:type>
        <mnl:description>
          <mcc:listing codeList="http://arcgis.com/codelist/ScopeCode" codeListValue="dataset"></mcc:listing>
        </mnl:description>
      </mnl:name>
      <mnl:address>
        <mnl:defaultLocale>
          <lan:location>nan</lan:location>
        </mnl:defaultLocale>
      </mnl:address>
      <lan:language>
        <lan:type>
          <lis:name>English</lis:name>
        </lan:type>
      </lan:language>
    </mcc:MD_Identifier>
    <mcc:contactInfo>
      <mdl:POC>
        <mnl:name>
          <lis:person>Tom</lis:person>
        </mnl:name>
        <mnl:age>
          <gam:String>nan</gam:String>
        </mnl:age>
        <mnl:status>
          <lis:employment>nan</lis:employment>
        </mnl:status>
      </mdl:POC>
    </mcc:contactInfo>
  </mdl:metadataIdentifier>
</nas:metadata>
Output XML:
<?xml version='1.0' encoding='utf-8'?>
<nas:metadata xmlns:nas="http://www.arcgis.com/schema/nas/base"
              xmlns:mcc="http://standards.org/iso/19115/-3/mcc/1.0"
              xmlns:mdl="http://standards.org/iso/19115/-3/mdl/1.0"
              xmlns:mnl="http://standards.org/iso/19115/-3/mnl/1.0"
              xmlns:lan="http://standards.org/iso/19115/-3/lan/1.0"
              xmlns:lis="http://standards.org/iso/19115/-3/lis/1.0"
              xmlns:gam="http://standards.org/iso/19115/-3/gam/1.0">
  <mdl:metadataIdentifier>
    <mcc:MD_Identifier>
      <mnl:name>
        <mnl:type>
          <gam:String>The Metadata File</gam:String>
        </mnl:type>
        <mnl:description>
          <mcc:listing codeList="http://arcgis.com/codelist/ScopeCode" codeListValue="dataset"></mcc:listing>
        </mnl:description>
      </mnl:name>
      <lan:language>
        <lan:type>
          <lis:name>English</lis:name>
        </lan:type>
      </lan:language>
    </mcc:MD_Identifier>
    <mcc:contactInfo>
      <mdl:POC>
        <mnl:name>
          <lis:person>Tom</lis:person>
        </mnl:name>
      </mdl:POC>
    </mcc:contactInfo>
  </mdl:metadataIdentifier>
</nas:metadata>
Answer
This has to be done in two stages: first remove all elements whose text is nan, then go over the empty elements created by the first step and remove them as well:
# Step 1 - remove elements whose text is "nan"
for n in tree.xpath('//*[.="nan"]'):
    n.getparent().remove(n)

# Step 2 - select the now-empty elements and remove them as well
empty = tree.xpath('//*[not(normalize-space())]')
for emp in empty:
    try:
        emp.getparent().remove(emp)
    # One nested empty node is created by the first step; this step removes
    # both nodes, so try/except is necessary
    except Exception:
        continue

print(etree.tostring(tree).decode())
This should get you your desired output.
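One thing to watch for: the XPath in step 2 matches any element with no text content, so it would also pick up elements that carry only attributes (mcc:listing in the sample, if it has no text of its own) and their wrappers. If that is a concern, a possible variation is to prune only the ancestors that actually become empty after a "nan" element is removed. The following is a minimal sketch of that idea, not a drop-in replacement: `prune_nan` and `_is_empty` are hypothetical helper names, and the inline sample is a trimmed-down, well-formed stand-in for the question's file.

```python
from lxml import etree


def _is_empty(el):
    """True if the element has no children, no non-whitespace text and no attributes."""
    return len(el) == 0 and not (el.text or "").strip() and not el.attrib


def prune_nan(root, marker="nan"):
    """Remove leaf elements whose text equals `marker`, then any ancestors left empty."""
    # Collect the targets first so the tree is not modified while it is traversed.
    targets = [el for el in root.iter("*")
               if len(el) == 0 and (el.text or "").strip() == marker]
    for el in targets:
        parent = el.getparent()
        if parent is None:
            continue  # the root itself is never removed
        parent.remove(el)
        # Walk upwards, dropping each ancestor that the removal left empty.
        while parent is not None and _is_empty(parent):
            grandparent = parent.getparent()
            if grandparent is None:
                break
            grandparent.remove(parent)
            parent = grandparent
    return root


# Hypothetical, shortened sample based on the question's input
SAMPLE = b"""<nas:metadata xmlns:nas="http://www.arcgis.com/schema/nas/base"
    xmlns:mnl="http://standards.org/iso/19115/-3/mnl/1.0"
    xmlns:lan="http://standards.org/iso/19115/-3/lan/1.0"
    xmlns:mcc="http://standards.org/iso/19115/-3/mcc/1.0"
    xmlns:gam="http://standards.org/iso/19115/-3/gam/1.0">
  <mnl:name>
    <mnl:type><gam:String>The Metadata File</gam:String></mnl:type>
    <mnl:description>
      <mcc:listing codeListValue="dataset"></mcc:listing>
    </mnl:description>
  </mnl:name>
  <mnl:address>
    <mnl:defaultLocale>
      <lan:location>nan</lan:location>
    </mnl:defaultLocale>
  </mnl:address>
  <mnl:age>
    <gam:String>nan</gam:String>
  </mnl:age>
</nas:metadata>"""

root = etree.fromstring(SAMPLE)
prune_nan(root)

# Local names of the elements that survive, in document order
remaining = [etree.QName(el).localname for el in root.iter("*")]
print(remaining)
```

Because the check works on element text rather than on tag names, no namespace URIs need to be spelled out. With a parsed file, the call would be `prune_nan(tree.getroot())` before writing the tree back out.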