Extract XML data within CDATA using Python

I have a requirement to extract XML that sits inside a CDATA section within an outer XML document. I am able to extract the outer XML tags, but not the XML tags inside the CDATA.

I need to extract

  1. EventId = 122157660 (I am able to do this already).
  2. _Type="Phone" _Value="5152083348" within PAYLOAD/REQUEST_GROUP/REQUESTING_PARTY/CONTACT_DETAIL/CONTACT_POINT (I need help with this).

Below is the XML sample I am working with.

<B2B_DATA>
   <B2B_METADATA>
       <EventId>122157660</EventId>
       <MessageType>Request</MessageType>
   </B2B_METADATA>
<PAYLOAD>
    <![CDATA[<?xml version="1.0"?>
        <REQUEST_GROUP MISMOVersionID="1.1.1">
            <REQUESTING_PARTY _Name="CityBank" _StreetAddress="801 Main St" _City="rockwall" _State="MD" _PostalCode="11311" _Identifier="416">
                <CONTACT_DETAIL _Name="XX Davis">
                    <CONTACT_POINT _Type="Phone" _Value="1236573348"/>
                    <CONTACT_POINT _Type="Email" _Value="jXX@city.com"/>
                </CONTACT_DETAIL>
            </REQUESTING_PARTY>
        </REQUEST_GROUP>]]>
</PAYLOAD>
</B2B_DATA>

I have tried this –

from xml.etree import ElementTree

# Parse the outer document and list the root's direct children
tree = ElementTree.parse('file.xml')
root = tree.getroot()
for child in root:
    print(child.tag)

Output:

B2B_METADATA
PAYLOAD

I am not able to parse the content inside PAYLOAD.

Any help is greatly appreciated.

Answer

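A CDATA section is plain character data to the parser, so the markup inside it is not turned into elements. What you need to do, in this case, is parse the outer XML, extract the CDATA content of PAYLOAD as a string, parse that string as a second XML document, and extract the target data from that.

I personally would use lxml and XPath rather than ElementTree: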

from lxml import etree
root = etree.parse('file.xml')

# Step 1: extract the CDATA content as a string
cd = root.xpath('//PAYLOAD//text()')[0].strip()

# Step 2: parse the CDATA string as XML
doc = etree.XML(cd)

# Step 3: extract the target data
doc.xpath('//REQUESTING_PARTY//CONTACT_POINT[@_Type="Phone"]/@_Value')[0]

Output, based on your sample xml above:

'1236573348'
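
If installing lxml is not an option, the same two-step approach should also work with the standard library's ElementTree, which the question already uses. The following is a minimal sketch, not a definitive implementation: the names SAMPLE and extract_event_and_phone are made up for illustration, and the sample document is inlined as a string (copied from the question) instead of being read from file.xml.

```python
from xml.etree import ElementTree

# Sample document copied from the question; in practice it would be read
# from 'file.xml'. SAMPLE is a hypothetical name used only for this sketch.
SAMPLE = """<B2B_DATA>
   <B2B_METADATA>
       <EventId>122157660</EventId>
       <MessageType>Request</MessageType>
   </B2B_METADATA>
<PAYLOAD>
    <![CDATA[<?xml version="1.0"?>
        <REQUEST_GROUP MISMOVersionID="1.1.1">
            <REQUESTING_PARTY _Name="CityBank" _StreetAddress="801 Main St" _City="rockwall" _State="MD" _PostalCode="11311" _Identifier="416">
                <CONTACT_DETAIL _Name="XX Davis">
                    <CONTACT_POINT _Type="Phone" _Value="1236573348"/>
                    <CONTACT_POINT _Type="Email" _Value="jXX@city.com"/>
                </CONTACT_DETAIL>
            </REQUESTING_PARTY>
        </REQUEST_GROUP>]]>
</PAYLOAD>
</B2B_DATA>"""


def extract_event_and_phone(xml_text):
    """Return (EventId, phone number) from the outer and inner documents."""
    # Parse the outer document.
    outer = ElementTree.fromstring(xml_text)
    event_id = outer.findtext('B2B_METADATA/EventId')

    # The CDATA content is exposed as the text of PAYLOAD. Strip the
    # surrounding whitespace so the XML declaration is the first thing
    # in the string.
    inner_text = outer.findtext('PAYLOAD').strip()

    # Parse the CDATA content as a second XML document.
    inner = ElementTree.fromstring(inner_text)

    # ElementTree supports a limited XPath subset, including
    # [@attribute='value'] predicates.
    phone = inner.find(".//CONTACT_POINT[@_Type='Phone']").get('_Value')
    return event_id, phone


print(extract_event_and_phone(SAMPLE))
```

Note that find() returns None when no CONTACT_POINT with _Type="Phone" exists, so real input would need a check before calling get().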
User contributions licensed under: CC BY-SA