The XML:
    <?xml version="1.0"?>
    <pages>
      <page>
        <url>http://example.com/Labs</url>
        <title>Labs</title>
        <subpages>
          <page>
            <url>http://example.com/Labs/Email</url>
            <title>Email</title>
            <subpages>
              <page/>
              <url>http://example.com/Labs/Email/How_to</url>
              <title>How-To</title>
            </subpages>
          </page>
          <page>
            <url>http://example.com/Labs/Social</url>
            <title>Social</title>
          </page>
        </subpages>
      </page>
      <page>
        <url>http://example.com/Tests</url>
        <title>Tests</title>
        <subpages>
          <page>
            <url>http://example.com/Tests/Email</url>
            <title>Email</title>
            <subpages>
              <page/>
              <url>http://example.com/Tests/Email/How_to</url>
              <title>How-To</title>
            </subpages>
          </page>
          <page>
            <url>http://example.com/Tests/Social</url>
            <title>Social</title>
          </page>
        </subpages>
      </page>
    </pages>
The code:
    # rexml is the XML string read from a URL
    from xml.etree import ElementTree as ET

    tree = ET.fromstring(rexml)
    for node in tree.iter('page'):
        # Print the URL and title that are direct children of this <page>
        for url in node.iterfind('url'):
            print(url.text)
        for title in node.iterfind('title'):
            print(title.text)
        print('-' * 30)
The output:
    http://example.com/article1
    Article1
    ------------------------------
    http://example.com/article1/subarticle1
    SubArticle1
    ------------------------------
    http://example.com/article2
    Article2
    ------------------------------
    http://example.com/article3
    Article3
    ------------------------------
The XML represents the tree-like structure of a sitemap.
I have been up and down the docs and Google all day and can't figure out how to get the node depth of each entry.
I tried counting the children of the container, but that only works for the first parent and then breaks, because I can't figure out how to reset the counter. This is probably just a hackish idea anyway.
The desired output:
    0
    http://example.com/article1
    Article1
    ------------------------------
    1
    http://example.com/article1/subarticle1
    SubArticle1
    ------------------------------
    0
    http://example.com/article2
    Article2
    ------------------------------
    0
    http://example.com/article3
    Article3
    ------------------------------
Answer
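I used lxml.html, whose elements have a getparent() method, so the depth can be found by walking up from each node to the root.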
    import lxml.html

    rexml = ...

    def depth(node):
        # Counts the node itself plus every ancestor element, so the
        # result starts above 0 and includes wrappers such as <subpages>.
        d = 0
        while node is not None:
            d += 1
            node = node.getparent()
        return d

    tree = lxml.html.fromstring(rexml)
    for node in tree.iter('page'):
        print(depth(node))
        for url in node.iterfind('url'):
            print(url.text)
        for title in node.iterfind('title'):
            print(title.text)
        print('-' * 30)
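Note that depth() above counts every ancestor element and the node itself, so it will not print the 0 and 1 shown in the desired output. If you want the zero-based nesting level of each page, one option is to carry the depth down while recursing instead of walking up. Below is a minimal sketch using only the standard library. The SAMPLE string and the names walk_pages and page_depths are made up for illustration, and it assumes every nested page is written as a full <page>...</page> element inside <subpages>.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample in the same shape as the question's sitemap,
# with each nested page written as a full <page> element.
SAMPLE = """<?xml version="1.0"?>
<pages>
  <page>
    <url>http://example.com/Labs</url>
    <title>Labs</title>
    <subpages>
      <page>
        <url>http://example.com/Labs/Email</url>
        <title>Email</title>
        <subpages>
          <page>
            <url>http://example.com/Labs/Email/How_to</url>
            <title>How-To</title>
          </page>
        </subpages>
      </page>
    </subpages>
  </page>
  <page>
    <url>http://example.com/Tests</url>
    <title>Tests</title>
  </page>
</pages>
"""


def walk_pages(parent, depth=0):
    """Yield (depth, url, title) for each <page> under parent, in document order."""
    for page in parent.findall('page'):
        yield depth, page.findtext('url'), page.findtext('title')
        subpages = page.find('subpages')
        if subpages is not None:
            # Pages inside <subpages> sit one level deeper.
            yield from walk_pages(subpages, depth + 1)


def page_depths(xml_text):
    """Parse the sitemap XML and return a list of (depth, url, title) tuples."""
    root = ET.fromstring(xml_text)
    return list(walk_pages(root))


if __name__ == '__main__':
    for depth, url, title in page_depths(SAMPLE):
        print(depth)
        print(url)
        print(title)
        print('-' * 30)
```

Because the depth is passed as an argument, there is no counter to reset when the walk returns to a top-level page, which is where the child-counting idea from the question ran into trouble.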