I have an XML file with a format similar to docx, e.g.:
```xml
<w:r>
  <w:rPr>
    <w:sz w:val="36"/>
    <w:szCs w:val="36"/>
  </w:rPr>
  <w:t>BIG_TEXT</w:t>
</w:r>
```
I need to get the index of `BIG_TEXT` in the source XML, like this:
```python
from lxml import etree

text = open('/devel/tmp/doc2/word/document.xml', 'r').read()

root = etree.XML(text)

start = 0
for e in root.iter("*"):
    if e.text:
        # Search for the element text, starting where the previous match ended.
        offset = text.index(e.text, start)
        l = len(e.text)
        print('Text "%s" at offset %s and len=%s' % (e.text, offset, l))
        start = offset + l
```
I can start each new search from the current index + `len(text)`, but is there another way? An element's text may be a single character, `w` for example. The search will then find the index of the first `w` in the markup (for instance in a tag name), not the index of the element text `w`.
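To make the problem concrete, here is a minimal sketch. The one-element sample string is made up for illustration and is not taken from the real document:

```python
# Hypothetical one-element document whose text is the single character "w".
text = '<w:t>w</w:t>'

naive = text.index('w')        # first "w" anywhere in the source (inside the tag name)
wanted = text.index('>') + 1   # position right after the opening tag, where the text starts

print(naive, wanted)  # prints: 1 5
```

The plain string search lands inside the tag name `w:t` rather than on the element text.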
Answer
I was looking for a similar solution (indexing nodes in a big xml file for fast lookup).
- AFAIK, `lxml` only offers `sourceline`, which is insufficient. Cf. the API: "Original line number as found by the parser or None if unknown."
- But `expat` provides the exact offset in the file: `CurrentByteIndex`.
  - Fetched from the `start_element` handler, it returns the offset of the tag's start (i.e. `'<'`).
  - Fetched from the `char_data` handler, it returns the offset of the data's start (i.e. `'B'` in your example).

Example:
```python
import xml.parsers.expat


# Handler functions for parser events, plus housekeeping.
class Handler:
    def __init__(self, current_parser):
        # Tag of interest
        self.TARGET_TAG = "w:t"

        # Set up the parser
        self.parser = current_parser
        self.parser.StartElementHandler = self.start_element
        self.parser.EndElementHandler = self.end_element
        self.parser.CharacterDataHandler = self.char_data

        self.target_tag_met = False
        self.index = None

    def start_element(self, name, attrs):
        self.target_tag_met = (name == self.TARGET_TAG)

    def end_element(self, name):
        self.target_tag_met = False

    def char_data(self, data):
        if self.target_tag_met:
            # Byte offset at which this chunk of character data starts
            self.index = self.parser.CurrentByteIndex


# Open the file in binary mode for more robust byte offsets.
with open("so_test.xml", 'rb') as xml_file:
    p = xml.parsers.expat.ParserCreate()
    h = Handler(p)
    p.ParseFile(xml_file)

print(h.index)
```
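The example keeps a single offset, so with several `w:t` elements only the last one survives. A possible variation, offered as a sketch rather than a definitive implementation, collects one (offset, text) pair per element. The function name `text_offsets`, the sample bytes and the `urn:example` namespace are made up for illustration. Because expat may deliver an element's text in more than one `char_data` call, only the offset of the first chunk is recorded:

```python
import xml.parsers.expat


def text_offsets(xml_bytes, target_tag="w:t"):
    """Return a list of (byte_offset, text) pairs, one per target_tag element."""
    parser = xml.parsers.expat.ParserCreate()
    results = []
    state = {"inside": False, "start": None, "chunks": []}

    def start_element(name, attrs):
        if name == target_tag:
            state["inside"] = True
            state["start"] = None
            state["chunks"] = []

    def char_data(data):
        if state["inside"]:
            # Remember only where the first chunk of this element's text starts.
            if state["start"] is None:
                state["start"] = parser.CurrentByteIndex
            state["chunks"].append(data)

    def end_element(name):
        if name == target_tag and state["inside"]:
            if state["start"] is not None:
                results.append((state["start"], "".join(state["chunks"])))
            state["inside"] = False

    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.CharacterDataHandler = char_data
    parser.Parse(xml_bytes, True)
    return results


# Hypothetical sample: the first element's text is the single character "w".
sample = (b'<w:r xmlns:w="urn:example">'
          b'<w:t>w</w:t><w:t>BIG_TEXT</w:t></w:r>')
offsets = text_offsets(sample)
for offset, value in offsets:
    print(offset, value)
```

Each offset is a byte position, so slicing the original bytes at that position should give back the element text, including the single-character `w` case from the question.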