
Python lxml – get index of tag’s text

I have an XML file with a format similar to docx.

I need to get the index of BIG_TEXT in the source XML.

I could use a plain string search and start each new search from the current index + len(text), but is there another way? An element's text may be a single character, w for example. A plain search then finds the index of the first w in the markup, not the index of the tag text w.
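The pitfall described above can be reproduced in a few lines. This is only a sketch: the fragment and the element names `w:p` and `w:t` are assumptions modelled on docx, not the original file.

```python
# Hypothetical docx-like fragment; the element names are assumptions.
xml = "<w:p><w:t>w</w:t></w:p>"
text = "w"

# A plain search hits the first "w" in the markup, i.e. the tag name.
naive = xml.find(text)
print(naive, repr(xml[naive - 1:naive + 3]))

# The position actually wanted is the character data inside <w:t>.
# This lookup is a crude heuristic, used here only to show the gap.
wanted = xml.find(">" + text + "<") + 1
print(wanted, repr(xml[wanted]))
```

Here the plain search returns 1 (inside `<w:p>`), while the text of the element sits at 10.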


Answer

I was looking for a similar solution (indexing nodes in a big xml file for fast lookup).

  • AFAIK, lxml only offers sourceline, which is insufficient. From the API docs: "Original line number as found by the parser or None if unknown."
  • But expat provides the exact byte offset in the file: CurrentByteIndex.
    • Read from the start_element handler, it gives the offset of the tag's start (i.e. '<').
    • Read from the char_data handler, it gives the offset of the data's start (i.e. 'B' in your example).

Example :
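A minimal sketch of the two handlers, using the standard library's `xml.parsers.expat`. The function name `index_offsets` and the sample document are assumptions made for illustration. The input is passed as bytes so that the offsets are byte offsets into that same buffer, and `buffer_text` is left at its default so that each character-data callback reports the offset of its own chunk.

```python
import xml.parsers.expat


def index_offsets(xml_bytes):
    """Return (tag_starts, text_starts) as lists of (name_or_text, byte_offset)."""
    parser = xml.parsers.expat.ParserCreate()
    tag_starts = []
    text_starts = []

    def start_element(name, attrs):
        # Inside this handler, CurrentByteIndex is the offset of the tag's '<'.
        tag_starts.append((name, parser.CurrentByteIndex))

    def char_data(data):
        # Inside this handler, it is the offset of the first byte of the data.
        if data.strip():  # skip whitespace-only runs between elements
            text_starts.append((data, parser.CurrentByteIndex))

    parser.StartElementHandler = start_element
    parser.CharacterDataHandler = char_data
    parser.Parse(xml_bytes, True)
    return tag_starts, text_starts


# Hypothetical sample, not the original file.
sample = b"<doc><p><t>w</t><t>BIG_TEXT</t></p></doc>"
tags, texts = index_offsets(sample)
for text, offset in texts:
    print(text, offset, sample[offset:offset + len(text.encode("utf-8"))])
```

Expat may deliver long text, or text containing entities or newlines, in several chunks, so a real indexer would probably keep the offset of the first chunk per element and join the pieces itself.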

User contributions licensed under: CC BY-SA