Edit XML file with python

I have an XML file autogenerated by Informatica BDM, and I am finding it hard to edit its values. I have made several attempts with xml.etree.ElementTree without success. This is an extract from the file:

<?xml version="1.0" encoding="UTF-8"?>
<root xmlns="http://www.informatica.com/Parameterization/1.0"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema"
      version="2.0"><!--Specify deployed application specific parameters here.--><!--
    <application name="app_2">
   <mapping name="M_kafka_hdfs"/>
</application>--><project name="V2">
      <folder name="Streaming">
         <mapping name="M_kafka_hdfs">
            <parameter name="P_s_spark_executor_cores">4</parameter>
            <parameter name="P_s_spark_executor_memory">8G</parameter>
            <parameter name="P_s_spark_sql_shuffle_partitions">108</parameter>
            <parameter name="P_s_spark_network_timeout">180000</parameter>
            <parameter name="P_s_spark_executor_heartbeatInterval">6000</parameter>
            <parameter name="P_i_maximum_rows_read">0</parameter>
            <parameter name="P_s_checkpoint_directory">checkpoint</parameter>
         </mapping>
      </folder>
   </project>
</root>

I would like to be able to change the parameters, for example from <parameter name="P_s_spark_executor_memory">8G</parameter> to <parameter name="P_s_spark_executor_memory">16G</parameter>.

With the code below I can only read the attribute values, not the text content of the elements, and I cannot edit them either:

import xml.etree.ElementTree as ET

treexml = ET.parse('autogenerated.xml')
for element in treexml.iter():
    dict_keys={}
    if element.keys():
        for name, value in element.items():
            dict_keys[name]=value
            print(dict_keys[name])

Ideally, I could overwrite any parameter with something like this:

xml["parameter"]["P_s_spark_sql_shuffle_partitions"] = 64

and have it written to the file as <parameter name="P_s_spark_sql_shuffle_partitions">64</parameter>

Answer

Try this code:

import xml.etree.ElementTree as ET

name_space = 'http://www.informatica.com/Parameterization/1.0'
ET.register_namespace('', name_space)
treexml = ET.parse(r'c:testtest.xml')
# get all 'parameter' elements (the tag must be qualified with the namespace URI, in {uri}tag form)
params = treexml.getroot().findall(f'.//{{{name_space}}}parameter')

# make the dict with names as keys and previously found elements as value
xml = {el.attrib['name']: el for el in params}
# set the text of the "P_s_spark_sql_shuffle_partitions"
xml["P_s_spark_sql_shuffle_partitions"].text = str(64)
# write out the xml
treexml.write(r'c:testtestOut.xml')
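
The same idea can be wrapped in a small helper that takes a mapping of parameter names to new values. This is only a minimal sketch, not part of the answer above: the function name `set_parameters` and the inline sample document are hypothetical, and it assumes every `parameter` element has a unique `name` attribute.

```python
import xml.etree.ElementTree as ET

NS = "http://www.informatica.com/Parameterization/1.0"

# Shortened, hypothetical sample with the same structure as the question's file.
SAMPLE = f"""<root xmlns="{NS}" version="2.0">
  <project name="V2"><folder name="Streaming"><mapping name="M_kafka_hdfs">
    <parameter name="P_s_spark_executor_memory">8G</parameter>
    <parameter name="P_s_spark_sql_shuffle_partitions">108</parameter>
  </mapping></folder></project>
</root>"""


def set_parameters(root, new_values):
    """Overwrite the text of <parameter> elements selected by their name attribute."""
    by_name = {el.get("name"): el for el in root.iter(f"{{{NS}}}parameter")}
    for name, value in new_values.items():
        if name not in by_name:
            raise KeyError(f"no parameter named {name!r}")
        by_name[name].text = str(value)  # element text must be a string


ET.register_namespace("", NS)  # keep the default namespace unprefixed on output
root = ET.fromstring(SAMPLE)
set_parameters(root, {"P_s_spark_executor_memory": "16G",
                      "P_s_spark_sql_shuffle_partitions": 64})
result = ET.tostring(root, encoding="unicode")
print(result)
```

Raising `KeyError` for an unknown name is a design choice made here so that a typo in a parameter name does not pass silently.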

Output c:testtestOut.xml

<root xmlns="http://www.informatica.com/Parameterization/1.0" version="2.0"><project name="V2">
      <folder name="Streaming">
         <mapping name="M_kafka_hdfs">
            <parameter name="P_s_spark_executor_cores">4</parameter>
            <parameter name="P_s_spark_executor_memory">8G</parameter>
            <parameter name="P_s_spark_sql_shuffle_partitions">64</parameter>
            <parameter name="P_s_spark_network_timeout">180000</parameter>
            <parameter name="P_s_spark_executor_heartbeatInterval">6000</parameter>
            <parameter name="P_i_maximum_rows_read">0</parameter>
            <parameter name="P_s_checkpoint_directory">checkpoint</parameter>
         </mapping>
      </folder>
   </project>
</root>
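
Note that the output no longer contains the XML declaration or the comments from the input: the default parser discards comments, and `write()` emits no declaration unless asked to. If those need to survive, one possible approach is to parse with a `TreeBuilder` that keeps comments and to pass `xml_declaration=True` when writing. The sketch below assumes Python 3.8 or later (where `insert_comments` is available) and uses a shortened, hypothetical sample and an in-memory buffer instead of real files.

```python
import io
import xml.etree.ElementTree as ET

NS = "http://www.informatica.com/Parameterization/1.0"

# Shortened, hypothetical sample: declaration, one comment, one parameter.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    f'<root xmlns="{NS}" version="2.0">'
    "<!--Specify deployed application specific parameters here.-->"
    '<project name="V2"><folder name="Streaming"><mapping name="M_kafka_hdfs">'
    '<parameter name="P_s_spark_sql_shuffle_partitions">108</parameter>'
    "</mapping></folder></project></root>"
)

ET.register_namespace("", NS)

# insert_comments=True keeps comment nodes in the tree (Python 3.8+).
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
tree = ET.parse(io.BytesIO(SAMPLE.encode("utf-8")), parser=parser)

# Update one parameter, selected by its name attribute.
for el in tree.iter(f"{{{NS}}}parameter"):
    if el.get("name") == "P_s_spark_sql_shuffle_partitions":
        el.text = "64"

# Ask write() for an explicit declaration; it is omitted by default.
out = io.BytesIO()
tree.write(out, encoding="UTF-8", xml_declaration=True)
text = out.getvalue().decode("utf-8")
print(text)
```

The unused `xmlns:xsi` declaration is still not written back, since ElementTree only serializes namespaces that elements or attributes actually use.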
