Find and replace XML content in MS Word Document.xml with Python

Question

I'm following the guide below to search and replace a keyword in the DOCX document.xml: https://virantha.com/2013/08/16/reading-and-writing-microsoft-word-docx-files-with-python/ I'm trying to find 'abc' within the w:instrText tag and replace it with 'zzzz'. Please find my code…

Accepted Answer

Your XML handling is dubious. You treat the XML as text and try to replace things in the XML source with regex. Never, never do that.

The way to make changes to XML is:

1. parse it into a DOM tree (tree = ET.parse(filename))
2. manipulate the DOM tree
   - find nodes, e.g. with XPath (nodes = tree.xpath('...'))
   - make changes to those nodes, e.g. by updating their text (node.text = '...'); using regex at this point is fine
3. write the DOM tree (tree.write(filename))

This is better:

    from os import path
    from shutil import rmtree
    from lxml import etree as ET
    from zipfile import ZipFile
    from tempfile import mkdtemp

    class DocxHelper:
        def __init__(self, docx_file):
            self.docx_file = docx_file

        def get_xml(self, filename):
            with ZipFile(self.docx_file) as docx:
                with docx.open(filename) as xml:
                    # 1. parse into tree
                    return ET.parse(xml)

        def set_xml(self, filename, tree, output_filename):
            tmp_dir = mkdtemp()
            with ZipFile(self.docx_file) as docx:
                filenames = docx.namelist()
                docx.extractall(tmp_dir)

            # 3. write tree
            tree.write(path.join(tmp_dir, filename), pretty_print=True)

            with ZipFile(output_filename, 'w') as docx:
                for name in filenames:
                    docx.write(path.join(tmp_dir, name), name)

            rmtree(tmp_dir)

Usage

    input_docx = 'inputdoc.docx'
    output_docx = 'outputdoc.docx'
    filename = 'word/document.xml'

    dh = DocxHelper(input_docx)
    xml_tree = dh.get_xml(filename)

    # 2. manipulate the DOM tree
    ns = {
        'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
    }

    for elem in xml_tree.xpath('//w:r/w:instrText[contains(., "aaa")]', namespaces=ns):
        elem.text = elem.text.replace('aaa', 'zzzz')

    dh.set_xml(filename, xml_tree, output_docx)

The XPath //w:r/w:instrText[contains(., "aaa")] matches your sample; adapt it accordingly. Note the use of the w: namespace prefix.
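To try the parse, manipulate, write cycle without a real .docx file, the three steps can be sketched on an in-memory string. This is only an illustration under stated assumptions: it swaps lxml for the standard library's xml.etree.ElementTree, which does not support the contains() XPath function, so the substring check is done in Python instead. SAMPLE and replace_in_instr_text are hypothetical names made up for this sketch, and the XML fragment is a heavily trimmed stand-in for word/document.xml.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Hypothetical, heavily trimmed stand-in for word/document.xml.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    '<w:r><w:instrText> MERGEFIELD aaa </w:instrText></w:r>'
    '<w:r><w:t>aaa in ordinary text stays untouched</w:t></w:r>'
    '</w:p></w:body></w:document>'
)


def replace_in_instr_text(xml_source, old, new):
    """Parse the XML, edit matching w:instrText nodes, serialize again."""
    # Keep the "w" prefix on output instead of an auto-generated one.
    ET.register_namespace("w", W_NS)

    # 1. parse into a tree
    root = ET.fromstring(xml_source)

    # 2. find the nodes and change their text
    changed = 0
    for elem in root.iterfind(".//w:r/w:instrText", NS):
        # ElementTree has no contains(), so filter in Python.
        if elem.text and old in elem.text:
            elem.text = elem.text.replace(old, new)
            changed += 1

    # 3. write the tree back out (here: to a string)
    return ET.tostring(root, encoding="unicode"), changed


result, count = replace_in_instr_text(SAMPLE, "aaa", "zzzz")
print(count)
print(result)
```

Only the w:instrText node is modified; the same keyword inside w:t is left alone because the path never selects it. For real documents the lxml version above is likely the safer choice, since the standard library serializer may rewrite or drop namespace declarations that Word expects.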


Answer