Tag: lxml

Python lxml – get index of tag’s text

I have an XML file with a format similar to docx, i.e.: I need to get the index of BIG_TEXT in the source XML, like: I can start a new search from the position of the current index + len(text), but is there another way? An element may contain a single character, w for example. It will find the index of w, but not the index of the tag

Why is lxml.etree.iterparse() eating up all my memory?

iterparse lxml memory python xml

This eventually consumes all my available memory and then the process is killed. I’ve tried changing the tag from schedule to ‘smaller’ tags but that didn’t make a difference. What am I doing wrong / how can I process this large file with iterparse()? I can easily cut it up and process it in smaller chunks but that’s uglier than
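A minimal sketch of the usual remedy for this kind of memory growth, assuming the file contains many repeated `schedule` elements under one root: iterparse() still builds the full tree unless each element is cleared after it is handled. The helper name `count_schedules` and the sample data are made up for illustration.

```python
import io
from lxml import etree


def count_schedules(source, tag="schedule"):
    """Stream over `source`, counting <schedule> elements without keeping them."""
    count = 0
    # Only "end" events are needed: the element is complete at that point.
    for _event, elem in etree.iterparse(source, events=("end",), tag=tag):
        count += 1  # real processing of `elem` would go here
        # Free the element's own children and text.
        elem.clear()
        # Also drop already-processed siblings that the root still references.
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    return count


# Hypothetical sample input standing in for the large file.
xml = b"<root>" + b"<schedule><item>x</item></schedule>" * 3 + b"</root>"
total = count_schedules(io.BytesIO(xml))
print(total)
```

Clearing the element alone is often not enough, because the parent keeps a reference to every empty sibling; deleting the preceding siblings is what keeps memory roughly flat.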

How to properly escape single and double quotes

lxml python

I have an lxml etree HTMLParser object that I’m using to build XPath expressions, so that I can assert on the paths, on their attributes, and on the text of the matched tag. I ran into a problem when the text of the tag contains either single quotes (') or double quotes ("), and I’ve exhausted all my options. Here’s a sample object I created: Here is the snippet of code:
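One way to sidestep the quoting problem, sketched here under the assumption that the text is compared with `text() = ...`: lxml's `xpath()` accepts XPath variables as keyword arguments, so the value never has to be spliced into the expression string. The sample markup and the variable name `needle` are illustrative only.

```python
from lxml import etree

# Hypothetical sample document; the text mixes both kinds of quote.
html = """<html><body><p>He said "it's fine"</p><p>plain</p></body></html>"""
root = etree.fromstring(html, etree.HTMLParser())

needle = 'He said "it\'s fine"'

# $needle is an XPath variable; lxml binds it from the keyword argument,
# so no escaping of ' or " inside the expression is needed.
matches = root.xpath("//p[text() = $needle]", needle=needle)
print(len(matches))
```

XPath 1.0 has no escape character inside string literals, which is why building the expression by string formatting breaks as soon as the text contains both quote styles.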

selecting attribute values from lxml

attributes lxml python python-2.7

I want to use an XPath expression to get the value of an attribute. I expected the following to work, but it gives an error: Am I wrong to expect this to work? Answer: find and findall only implement a subset of XPath. Their presence is meant to provide compatibility with other ElementTree implementations (like ElementTree and cElementTree). The
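To illustrate the distinction the answer draws, here is a small sketch with an invented document: `xpath()` accepts attribute steps such as `@id`, while with `findall()` the attribute is read from each returned element using `get()`.

```python
from lxml import etree

# Hypothetical sample document.
root = etree.fromstring('<root><item id="a"/><item id="b"/></root>')

# xpath() supports full XPath 1.0, including selecting attribute values.
ids_xpath = root.xpath("//item/@id")

# findall() returns elements only, so read the attribute afterwards.
ids_findall = [el.get("id") for el in root.findall(".//item")]

print(ids_xpath, ids_findall)
```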

Equivalent to InnerHTML when using lxml.html to parse HTML

lxml parsing python

I’m working on a script using lxml.html to parse web pages. I have done a fair bit of work with BeautifulSoup in my time but am now experimenting with lxml because of its speed. I would like to know the most sensible way in the library to do the equivalent of JavaScript’s innerHTML, that is, to retrieve or set