I have an XML document that I need to parse. I’m using Python 3.8 and the lxml module.
The XML contains “Title” elements, each with several child elements, as in the example below. I need to find only the titles that have a “change” event and keep each such “Title” in a list. I would like to keep all of the tags of that title so I can extract the data that I need.
Here is my XML example:
'''
<root>
    <Title ref="111111">
        <Events>
            <Event type="change"/>
        </Events>
        <tag1>John</tag1>
        <tag2>A.</tag2>
        <tag3>Smith</tag3>
    </Title>
    <Title ref="222222">
        <Events>
            <Event type="cancel"/>
        </Events>
        <tag1>Bob</tag1>
        <tag2>B.</tag2>
        <tag3>Hope</tag3>
    </Title>
    <Title ref="333333">
        <Events>
            <Event type="change"/>
        </Events>
        <tag1>Julie</tag1>
        <tag2>A.</tag2>
        <tag3>Moore</tag3>
    </Title>
    <Title ref="444444">
        <Events>
            <Event type="cancel"/>
        </Events>
        <tag1>First</tag1>
        <tag2>M</tag2>
        <tag3>Last</tag3>
    </Title>
</root>
'''
I’ve tried using the findall() function, but it only seems to return the “Event” tag, not the “Title” tag and all of its children. I get the same result when using xpath() too.
Answer
If txt is your XML snippet from the question, then you can do this to extract the <Title> tags that contain <Event type="change">:
from lxml import etree, html

root = etree.fromstring(txt)

for title in root.xpath('.//Title[.//Event[@type="change"]]'):
    print(html.tostring(title).decode('utf-8'))
    print('-' * 80)
Prints:
<Title ref="111111">
    <Events>
        <Event type="change"></Event>
    </Events>
    <tag1>John</tag1>
    <tag2>A.</tag2>
    <tag3>Smith</tag3>
</Title>
--------------------------------------------------------------------------------
<Title ref="333333">
    <Events>
        <Event type="change"></Event>
    </Events>
    <tag1>Julie</tag1>
    <tag2>A.</tag2>
    <tag3>Moore</tag3>
</Title>
--------------------------------------------------------------------------------
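The question also asks for the matching titles to be kept in a list so their data can be extracted later. A minimal sketch of that follows, assuming a shortened copy of the XML and assuming the wanted data is the ref attribute plus the text of every child other than Events. The names changed_titles and records are illustrative and do not come from the question. A path such as .//Event[@type="change"] returns whatever its last step names, which is probably why findall() gave back Event elements; putting the condition in a predicate keeps Title as the selected element.

```python
from lxml import etree

# Shortened copy of the XML from the question (two of the four Title elements).
txt = """<root>
  <Title ref="111111">
    <Events><Event type="change"/></Events>
    <tag1>John</tag1><tag2>A.</tag2><tag3>Smith</tag3>
  </Title>
  <Title ref="222222">
    <Events><Event type="cancel"/></Events>
    <tag1>Bob</tag1><tag2>B.</tag2><tag3>Hope</tag3>
  </Title>
</root>"""

root = etree.fromstring(txt)

# The predicate in [...] only filters; the expression still selects Title elements.
# xpath() returns a plain Python list of those elements.
changed_titles = root.xpath('.//Title[Events/Event[@type="change"]]')

# Assumed extraction: the ref attribute plus the text of each child except Events.
records = []
for title in changed_titles:
    record = {"ref": title.get("ref")}
    for child in title:
        if child.tag != "Events":
            record[child.tag] = child.text
    records.append(record)

print(records)
```

Because changed_titles holds the element objects themselves, each one still carries all of its children, so other fields can be read from it later with find() or findtext().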