Skip to content
Advertisement

Replace XML variables which have the the same text with another text variable in python

I’m using python 3 and beautifulsoup4, pandas, counter, to convert one XML to CSV file There is several thousand products in this Xml. I have trouble with one particular problem. Many of this product in XML are a children of parent product, but parent product is not itself in XML. Each of this children product have special parent tag with the same value (parent id) so we can know they are children. <parent>x</parent> Xml structure is folowing:

<product>
  <id>1</id>
  <parent>x</parent>
</product>
<product>
  <id>2</id>
  <parent>x</parent>
</product>
<product>
  <id>3</id>
  <parent>x</parent>
</product>
<product>
  <id>4</id>
</product>
<product>
  <id>5</id>
  <parent>y</parent>
</product>
<product>
  <id>6</id>
  <parent>y</parent>
</product>

You can see that product with id 4 don’t have parent tag so is not children first 3 products have one parent together with value x and last two product have another parent with value Y and so on more than 7000 products. For my purpose I need to replace each <parent/> tag with the fist id of the same value. My desired outcome:

<product>
  <id>1</id>
  <parent>1</parent>
</product>
<product>
  <id>2</id>
  <parent>1</parent>
</product>
<product>
  <id>3</id>
  <parent>1</parent>
</product>
<product>
  <id>4</id>
</product>
<product>
  <id>5</id>
  <parent>5</parent>
</product>
<product>
  <id>6</id>
  <parent>5</parent>
</product>


What I done so far I need also to convert in csv each value to their respective column and row.

#test.py

def parse_xml(xml_data):

   # Initializing soup variable
    soup = BeautifulSoup(xml_data, 'xml')

    # Creating column for table
    df = pd.DataFrame(columns=['id', 'parent'])

# Here I get all duplacates of the same tag
    lst = soup.select('parent')
    d =  Counter(lst) 
    resultparent = [k for k, v in d.items() if v > 1]

#I spleat on seperate text to get all duplicate x and y as one 
    def a():
        for index, i in enumerate(resultparent):
            a = i.text
            return a

# I get also id or every x an y    
    def b():
        for index, i in enumerate(resultparent):
            c = i.find_previous('id').text
            return c
# now I start writing csv
    all_products = soup.findAll('product')   
    product_length = len(all_products)

    for index, product in enumerate(all_products):
        parent = product.find('parent')
        if parent is None:
           parent = ""
        else:  
           parent = parent.text
# here I wanted to check if I could find duplicate values with existing, I hope that if 
# there will be let say parent tag is x  will replace with 1  (don't work) 

        if parent == def a():
            parent = def b()

        product_id = product.find('id').text

# then I write all in csv 
        row = [{
            'id': product_id,
            'parent': parent}]
        
        df = df.append(row, ignore_index=True)
        print(f'Appending row %%s of %%s' %% (index+1, product_length))
   
    return df

df = parse_xml(xml_data)
df.to_csv('test.csv')


These code above don’t work correctly it replace only first value x x x with 1 1 1 but don’t replace -y- value and the rest when is written in CSV. Thank you for help.

Advertisement

Answer

Given the following input.xml file:

<products>
  <product>
    <id>1</id>
    <parent>x</parent>
  </product>
  <product>
    <id>2</id>
    <parent>x</parent>
  </product>
  <product>
    <id>3</id>
    <parent>x</parent>
  </product>
  <product>
    <id>4</id>
  </product>
  <product>
    <id>5</id>
    <parent>y</parent>
  </product>
  <product>
    <id>6</id>
    <parent>y</parent>
  </product>
</products>

Here is one way to get the correct matches:

import pandas as pd

df = pd.read_xml("input.xml")

pairs = (
    df.groupby("parent")
    .agg(list)
    .pipe(lambda df_: df_["id"].apply(lambda x: str(x[0])))
    .to_dict()
)

print(pairs)  # {'x': '1', 'y': '5'}

And then, using Python standard library XML module:

import xml.etree.ElementTree as ET

tree = ET.parse("file.xml")

for product in tree.getroot():
    for child in product:
        if child.tag == "parent":
            child.text = pairs[child.text]

tree.write("output.xml")

In output.xml:

<products>
  <product>
    <id>1</id>
    <parent>1</parent>
  </product>
  <product>
    <id>2</id>
    <parent>1</parent>
  </product>
  <product>
    <id>3</id>
    <parent>1</parent>
  </product>
  <product>
    <id>4</id>
  </product>
  <product>
    <id>5</id>
    <parent>5</parent>
  </product>
  <product>
    <id>6</id>
    <parent>5</parent>
  </product>
</products>

User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement