I have an XML document that I need to parse. I’m using Python 3.8 and the lxml module.
The XML contains Title elements, each of which has child elements, as in the XML below. I need to find only the titles with a “change” event and keep each such “Title” in a list. I would like to keep all of the tags under that title so I can extract the data I need.
Here is my XML example:
'''
<root>
    <Title ref="111111">
        <Events>
            <Event type="change"/>
        </Events>
        <tag1>John</tag1>
        <tag2>A.</tag2>
        <tag3>Smith</tag3>
    </Title>
    <Title ref="222222">
        <Events>
            <Event type="cancel"/>
        </Events>
        <tag1>Bob</tag1>
        <tag2>B.</tag2>
        <tag3>Hope</tag3>
    </Title>
    <Title ref="333333">
        <Events>
            <Event type="change"/>
        </Events>
        <tag1>Julie</tag1>
        <tag2>A.</tag2>
        <tag3>Moore</tag3>
    </Title>
    <Title ref="444444">
        <Events>
            <Event type="cancel"/>
        </Events>
        <tag1>First</tag1>
        <tag2>M</tag2>
        <tag3>Last</tag3>
    </Title>
</root>
'''
I’ve tried using the findall() function, but it seems to return only the “Event” elements, not the “Title” element and all of its children. I get the same result when using xpath().
Answer
If txt is your XML snippet from the question, you can extract the <Title> tags that contain <Event type="change"> like this:
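from lxml import etree, html

# txt holds the XML string from the question
root = etree.fromstring(txt)

# The predicate keeps only <Title> elements that have a descendant
# <Event> whose type attribute is "change"
for title in root.xpath('.//Title[.//Event[@type="change"]]'):
    print(html.tostring(title).decode('utf-8'))
    print('-' * 80)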
Prints:
<Title ref="111111">
<Events>
<Event type="change"></Event>
</Events>
<tag1>John</tag1>
<tag2>A.</tag2>
<tag3>Smith</tag3>
</Title>

--------------------------------------------------------------------------------
<Title ref="333333">
<Events>
<Event type="change"></Event>
</Events>
<tag1>Julie</tag1>
<tag2>A.</tag2>
<tag3>Moore</tag3>
</Title>

--------------------------------------------------------------------------------
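The question also asks to keep the matching titles in a list and to pull data out of them. Here is a minimal sketch of that, using the same XPath expression. The `records` dictionary layout and the compacted copy of the XML are my own illustration, not something taken from the question, so adapt them to the real document.

```python
from lxml import etree

# Compacted copy of the XML from the question (same four titles).
txt = '''
<root>
  <Title ref="111111">
    <Events><Event type="change"/></Events>
    <tag1>John</tag1><tag2>A.</tag2><tag3>Smith</tag3>
  </Title>
  <Title ref="222222">
    <Events><Event type="cancel"/></Events>
    <tag1>Bob</tag1><tag2>B.</tag2><tag3>Hope</tag3>
  </Title>
  <Title ref="333333">
    <Events><Event type="change"/></Events>
    <tag1>Julie</tag1><tag2>A.</tag2><tag3>Moore</tag3>
  </Title>
  <Title ref="444444">
    <Events><Event type="cancel"/></Events>
    <tag1>First</tag1><tag2>M</tag2><tag3>Last</tag3>
  </Title>
</root>
'''

root = etree.fromstring(txt)

# xpath() returns a list of the matching <Title> elements themselves,
# each still attached to all of its children.
changed_titles = root.xpath('.//Title[.//Event[@type="change"]]')

# Hypothetical record layout: the ref attribute plus the text of every
# direct child except <Events>.
records = []
for title in changed_titles:
    record = {'ref': title.get('ref')}
    for child in title:
        if child.tag != 'Events':
            record[child.tag] = child.text
    records.append(record)

print(records)
```

Because the predicate sits on Title rather than on Event, the list holds the Title elements, so calls such as `title.findtext('tag1')` or `title.get('ref')` work directly on each item.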