I’m using python 3 and beautifulsoup4, pandas, counter, to convert one XML to CSV file
There is several thousand products in this Xml. I have trouble with one particular problem.
Many of this product in XML are a children of parent product, but parent product is not itself in XML.
Each of this children product have special parent tag with the same value (parent id) so we can know they are children.
<parent>x</parent>
Xml structure is folowing:
<product> <id>1</id> <parent>x</parent> </product> <product> <id>2</id> <parent>x</parent> </product> <product> <id>3</id> <parent>x</parent> </product> <product> <id>4</id> </product> <product> <id>5</id> <parent>y</parent> </product> <product> <id>6</id> <parent>y</parent> </product>
You can see that product with id 4 don’t have parent tag so is not children first 3 products have one parent together with value x and last two product have another parent with value Y and so on more than 7000 products.
For my purpose I need to replace each <parent/>
tag with the fist id of the same value. My desired outcome:
<product> <id>1</id> <parent>1</parent> </product> <product> <id>2</id> <parent>1</parent> </product> <product> <id>3</id> <parent>1</parent> </product> <product> <id>4</id> </product> <product> <id>5</id> <parent>5</parent> </product> <product> <id>6</id> <parent>5</parent> </product>
What I done so far I need also to convert in csv each value to their respective column and row.
#test.py def parse_xml(xml_data): # Initializing soup variable soup = BeautifulSoup(xml_data, 'xml') # Creating column for table df = pd.DataFrame(columns=['id', 'parent']) # Here I get all duplacates of the same tag lst = soup.select('parent') d = Counter(lst) resultparent = [k for k, v in d.items() if v > 1] #I spleat on seperate text to get all duplicate x and y as one def a(): for index, i in enumerate(resultparent): a = i.text return a # I get also id or every x an y def b(): for index, i in enumerate(resultparent): c = i.find_previous('id').text return c # now I start writing csv all_products = soup.findAll('product') product_length = len(all_products) for index, product in enumerate(all_products): parent = product.find('parent') if parent is None: parent = "" else: parent = parent.text # here I wanted to check if I could find duplicate values with existing, I hope that if # there will be let say parent tag is x will replace with 1 (don't work) if parent == def a(): parent = def b() product_id = product.find('id').text # then I write all in csv row = [{ 'id': product_id, 'parent': parent}] df = df.append(row, ignore_index=True) print(f'Appending row %s of %s' % (index+1, product_length)) return df df = parse_xml(xml_data) df.to_csv('test.csv')
These code above don’t work correctly it replace only first value x x x with 1 1 1 but don’t replace -y- value and the rest when is written in CSV. Thank you for help.
Advertisement
Answer
Given the following input.xml
file:
<products> <product> <id>1</id> <parent>x</parent> </product> <product> <id>2</id> <parent>x</parent> </product> <product> <id>3</id> <parent>x</parent> </product> <product> <id>4</id> </product> <product> <id>5</id> <parent>y</parent> </product> <product> <id>6</id> <parent>y</parent> </product> </products>
Here is one way to get the correct matches:
import pandas as pd df = pd.read_xml("input.xml") pairs = ( df.groupby("parent") .agg(list) .pipe(lambda df_: df_["id"].apply(lambda x: str(x[0]))) .to_dict() ) print(pairs) # {'x': '1', 'y': '5'}
And then, using Python standard library XML module:
import xml.etree.ElementTree as ET tree = ET.parse("file.xml") for product in tree.getroot(): for child in product: if child.tag == "parent": child.text = pairs[child.text] tree.write("output.xml")
In output.xml
:
<products> <product> <id>1</id> <parent>1</parent> </product> <product> <id>2</id> <parent>1</parent> </product> <product> <id>3</id> <parent>1</parent> </product> <product> <id>4</id> </product> <product> <id>5</id> <parent>5</parent> </product> <product> <id>6</id> <parent>5</parent> </product> </products>