I have a requirement to extract XML that is embedded in a CDATA section inside another XML document. I am able to extract the outer XML tags, but not the XML tags inside the CDATA.
I need to extract:
- EventId = 122157660 (I am able to do this already).
- _Type="Phone" _Value="5152083348" within PAYLOAD/REQUEST_GROUP/REQUESTING_PARTY/CONTACT_DETAIL/CONTACT_POINT (I need help with this).
Below is the XML sample I am working with.
<B2B_DATA>
  <B2B_METADATA>
    <EventId>122157660</EventId>
    <MessageType>Request</MessageType>
  </B2B_METADATA>
  <PAYLOAD>
    <![CDATA[<?xml version="1.0"?>
    <REQUEST_GROUP MISMOVersionID="1.1.1">
      <REQUESTING_PARTY _Name="CityBank" _StreetAddress="801 Main St" _City="rockwall" _State="MD" _PostalCode="11311" _Identifier="416">
        <CONTACT_DETAIL _Name="XX Davis">
          <CONTACT_POINT _Type="Phone" _Value="1236573348"/>
          <CONTACT_POINT _Type="Email" _Value="jXX@city.com"/>
        </CONTACT_DETAIL>
      </REQUESTING_PARTY>
    </REQUEST_GROUP>]]>
  </PAYLOAD>
</B2B_DATA>
I have tried this:
from xml.etree import ElementTree

tree = ElementTree.parse('file.xml')
root = tree.getroot()
# Print the tag of each direct child of the root element
for child in root:
    print(child.tag)
Output:
B2B_METADATA
PAYLOAD
I am not able to parse the content inside PAYLOAD.
Any help is greatly appreciated.
Answer
What you need to do in this case is parse the outer XML, extract the XML string held in the CDATA section, parse that inner XML as a separate document, and then extract the target data from it.
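The second parse is needed because a parser hands back a CDATA section as plain character data, so PAYLOAD ends up with text but no child elements. A minimal sketch of that behaviour with the standard library's ElementTree, assuming a trimmed-down inline copy of the sample (the variable names are only illustrative):

```python
from xml.etree import ElementTree

# Trimmed-down copy of the sample; the inner document lives inside CDATA.
outer_xml = (
    "<B2B_DATA><PAYLOAD> "
    "<![CDATA[<?xml version=\"1.0\"?>"
    "<REQUEST_GROUP><REQUESTING_PARTY/></REQUEST_GROUP>]]>"
    " </PAYLOAD></B2B_DATA>"
)

payload = ElementTree.fromstring(outer_xml).find("PAYLOAD")

# The CDATA content is not parsed into elements ...
child_count = len(payload)
# ... it is exposed as the element's text instead.
inner_text = payload.text.strip()

print(child_count)
print(inner_text)
```

This is why looping over the children of the root stops at PAYLOAD: there is nothing below it until the text is parsed again.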
I personally would use lxml and xpath, not ElementTree:
from lxml import etree

root = etree.parse('file.xml')

# Step 1: extract the CDATA content as a string
cd = root.xpath('//PAYLOAD//text()')[0].strip()

# Step 2: parse the CDATA string as XML
doc = etree.XML(cd)

# Finally, extract the target data
doc.xpath('//REQUESTING_PARTY//CONTACT_POINT[@_Type="Phone"]/@_Value')[0]
Output, based on your sample xml above:
'1236573348'
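If you would rather stay with the standard library's ElementTree, as in the question, the same three steps can probably be sketched as follows. This is only an illustration under a few assumptions: the sample is inlined as a string instead of being read from file.xml, the helper name extract_event_and_phone is made up for this example, and ElementTree's limited XPath subset is enough for the attribute filter.

```python
from xml.etree import ElementTree

# Inline copy of the sample from the question (assumption: same structure as file.xml).
SAMPLE = """<B2B_DATA>
  <B2B_METADATA>
    <EventId>122157660</EventId>
    <MessageType>Request</MessageType>
  </B2B_METADATA>
  <PAYLOAD>
    <![CDATA[<?xml version="1.0"?>
<REQUEST_GROUP MISMOVersionID="1.1.1">
  <REQUESTING_PARTY _Name="CityBank" _StreetAddress="801 Main St"
                    _City="rockwall" _State="MD" _PostalCode="11311"
                    _Identifier="416">
    <CONTACT_DETAIL _Name="XX Davis">
      <CONTACT_POINT _Type="Phone" _Value="1236573348"/>
      <CONTACT_POINT _Type="Email" _Value="jXX@city.com"/>
    </CONTACT_DETAIL>
  </REQUESTING_PARTY>
</REQUEST_GROUP>]]>
  </PAYLOAD>
</B2B_DATA>"""


def extract_event_and_phone(outer_xml):
    """Hypothetical helper: return (EventId, phone number) from the outer document."""
    outer = ElementTree.fromstring(outer_xml)
    event_id = outer.findtext("B2B_METADATA/EventId")

    # Step 1: the CDATA content is available as the text of PAYLOAD.
    # strip() matters: the XML declaration must be the very first characters.
    inner_xml = outer.findtext("PAYLOAD").strip()

    # Step 2: parse that string as a separate XML document.
    inner = ElementTree.fromstring(inner_xml)

    # Step 3: pick the CONTACT_POINT whose _Type attribute is "Phone".
    phone = inner.find(".//CONTACT_POINT[@_Type='Phone']").get("_Value")
    return event_id, phone


event_id, phone = extract_event_and_phone(SAMPLE)
print(event_id, phone)
```

With the sample above this prints the EventId followed by the phone number. If the inner document carried an encoding declaration or namespaces, the parsing and path expressions would likely need adjusting.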