I am trying to extract the year from multiple XML files. Initially, the XML files looked like this:
<Return xmlns="http://www.irs.gov/efile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.irs.gov/efile" returnVersion="2018v3.2">
  <ReturnHeader binaryAttachmentCnt="0">
    <!-- ... -->
    <TaxPeriodEndDt>2019-09-30</TaxPeriodEndDt>
    <!-- ... -->
  </ReturnHeader>
  <ReturnData documentCnt="12">
    <!-- ... -->
  </ReturnData>
</Return>
I used
year = root.find('.//irs:TaxPeriodEndDt',ns).text[:4]
This worked well, but in some XML files the tag is named TaxPeriodEndDate instead:
<Return xmlns="http://www.irs.gov/efile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.irs.gov/efile" returnVersion="2018v3.2">
  <ReturnHeader binaryAttachmentCnt="0">
    <!-- ... -->
    <TaxPeriodEndDate>2012-09-30</TaxPeriodEndDate>
    <!-- ... -->
  </ReturnHeader>
  <ReturnData documentCnt="12">
    <!-- ... -->
  </ReturnData>
</Return>
I tried to revise the code to
year = root.find('.//irs:TaxPeriodEndDt|.//irs:TaxPeriodEndDate',ns).text[:4]
This did not work: there was no error message, but also no output. Any suggestions would be highly appreciated. Thank you.
Answer
The support for XPath in ElementTree is very limited. The union operator (|) doesn't appear to work, and other options, like using the self:: axis or name()/local-name() in a predicate, aren't supported.
I think your best bet is to use a try/except…
try:
    year = root.find(".//irs:TaxPeriodEndDt", ns).text[:4]
except AttributeError:
    # find() returned None (tag absent), so fall back to the other tag name
    year = root.find(".//irs:TaxPeriodEndDate", ns).text[:4]
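If more tag variants turn up later, the same fallback idea can be written as a loop instead of nested try/except blocks. The following is only a minimal sketch under some assumptions: the helper name `extract_year`, the `DATE_TAGS` tuple, and the trimmed-down sample documents are hypothetical and not part of the original question.

```python
import xml.etree.ElementTree as ET

NS = {"irs": "http://www.irs.gov/efile"}

# Candidate tag names, tried in order (hypothetical list; extend as needed).
DATE_TAGS = ("TaxPeriodEndDt", "TaxPeriodEndDate")


def extract_year(root, tags=DATE_TAGS, ns=NS):
    """Return the first four characters of the first matching date tag, or None."""
    for tag in tags:
        elem = root.find(f".//irs:{tag}", ns)
        # find() returns None when the tag is absent, so check before using .text
        if elem is not None and elem.text:
            return elem.text.strip()[:4]
    return None


# Trimmed-down sample documents, one per tag variant.
newer = ET.fromstring(
    '<Return xmlns="http://www.irs.gov/efile"><ReturnHeader>'
    "<TaxPeriodEndDt>2019-09-30</TaxPeriodEndDt>"
    "</ReturnHeader></Return>"
)
older = ET.fromstring(
    '<Return xmlns="http://www.irs.gov/efile"><ReturnHeader>'
    "<TaxPeriodEndDate>2012-09-30</TaxPeriodEndDate>"
    "</ReturnHeader></Return>"
)

print(extract_year(newer))  # 2019
print(extract_year(older))  # 2012
```

Returning None when no variant is present lets the caller decide how to handle files without either tag, rather than raising AttributeError.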
If you can switch to lxml, your original attempt with the union operator will work with a few small changes (mainly use xpath() instead of find() and use the namespaces keyword arg)…
year = root.xpath(".//irs:TaxPeriodEndDt|.//irs:TaxPeriodEndDate", namespaces=ns)[0].text[:4]
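If installing lxml is not an option, a union-like lookup can be approximated with the standard library by walking the tree once and checking each element's fully qualified tag against a set. This is a different technique from the XPath union, shown only as a rough sketch; the names `WANTED` and `first_matching_text` and the shortened sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

IRS_NS = "http://www.irs.gov/efile"

# ElementTree stores namespaced tags as "{uri}localname".
WANTED = {
    f"{{{IRS_NS}}}TaxPeriodEndDt",
    f"{{{IRS_NS}}}TaxPeriodEndDate",
}


def first_matching_text(root, wanted=WANTED):
    """Return the text of the first element, in document order, whose tag is in `wanted`."""
    for elem in root.iter():
        if elem.tag in wanted:
            return elem.text
    return None


sample = ET.fromstring(
    '<Return xmlns="http://www.irs.gov/efile"><ReturnHeader>'
    "<TaxPeriodEndDate>2012-09-30</TaxPeriodEndDate>"
    "</ReturnHeader></Return>"
)

text = first_matching_text(sample)
year = text[:4] if text else None
print(year)  # 2012
```

Like the union expression, this returns whichever of the two tags appears first in document order, rather than preferring one name over the other.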