I’m working on a script using lxml.html to parse web pages. I have done a fair bit of BeautifulSoup in my time but am now experimenting with lxml due to its speed.
I would like to know the most sensible way in the library to do the equivalent of JavaScript's innerHTML, that is, to retrieve or set the complete contents of a tag. For example, given:

    <body>
      <h1>A title</h1>
      <p>Some text</p>
    </body>

the innerHTML is:

    <h1>A title</h1>
    <p>Some text</p>
I can do it using hacks (converting to string/regexes etc) but I’m assuming that there is a correct way to do this using the library which I am missing due to unfamiliarity. Thanks for any help.
EDIT: Thanks to pobk for showing me the way on this so quickly and effectively. For anyone trying the same, here is what I ended up with:
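    from io import StringIO  # Python 3 replacement for cStringIO
    from lxml import html

    t = html.parse(StringIO("""<body>
    <h1>A title</h1>
    <p>Some text</p>
    Untagged text
    <p>
    Unclosed p tag
    </body>"""))
    root = t.getroot()
    body = root.body
    # body.text holds any text before the first child. Iterate over direct
    # children only: tostring() already serialises each child's descendants
    # and its trailing (tail) text, so iterdescendants() would repeat nested tags.
    print((body.text or '') +
          ''.join(html.tostring(child, encoding='unicode')
                  for child in body.iterchildren()))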
Note that the lxml.html parser will fix up the unclosed tag, so beware if this is a problem.
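To see what that fix-up looks like in practice, here is a minimal sketch, assuming Python 3 and a reasonably recent lxml. The `fragment` string and the variable names are made up for illustration.

```python
from lxml import html

# Hypothetical input with an unclosed <p>, mirroring the example above.
fragment = "<body><p>Some text</p><p>Unclosed p tag</body>"

root = html.document_fromstring(fragment)
body = root.body

# Serialise the body; the parser has already balanced the tags in the tree.
serialised = html.tostring(body, encoding="unicode")
print(serialised)

# Count opening and closing paragraph tags in the serialised output.
print(serialised.count("<p>"), serialised.count("</p>"))
```

If the original markup has to survive byte for byte, the serialised tree is therefore not a safe substitute for the source text.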
Answer
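You can get the children of an ElementTree node with its iterchildren() method (getchildren() also works, but is deprecated). iterdescendants() walks every nested element rather than only the direct children, so it gives the same result only when the children contain no elements of their own: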
    >>> from io import StringIO
    >>> from lxml import etree
    >>> t = etree.parse(StringIO("""<body>
    ... <h1>A title</h1>
    ... <p>Some text</p>
    ... </body>"""))
    >>> root = t.getroot()
    >>> for child in root.iterchildren():
    ...     print(etree.tostring(child, encoding="unicode", with_tail=False))
    ...
    <h1>A title</h1>
    <p>Some text</p>
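This can be shortened to a single expression. Leaving with_tail at its default (True) keeps the text that follows each child:

    print(''.join(etree.tostring(child, encoding='unicode') for child in root.iterchildren()))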