I would like to convert this XML file:
<record id="idOne"> <ts date="2019-07-03" time="15:28:41.720440">5</ts> <ts date="2019-07-03" time="15:28:42.629959">10</ts> <ts date="2019-07-03" time="15:28:43.552677">15</ts> <ts date="2019-07-03" time="15:28:43.855345">20</ts> </record> <record id="idOne"> <ts date="2019-07-03" time="15:28:45.072922">30</ts> <ts date="2019-07-03" time="15:28:45.377087">35</ts> <ts date="2019-07-03" time="15:28:46.316321">40</ts> <ts date="2019-07-03" time="15:28:47.527960">45</ts> </record>
to this CSV file:
ID, date, time, value idOne, 2019-07-03, 15:28:41.720440, 5 idOne, 2019-07-03, 15:28:42.629959, 10 idOne, 2019-07-03, 15:28:43.552677, 15 idOne, 2019-07-03, 15:28:43.855345, 20 idOne, 2019-07-03, 15:28:45.072922, 30 idOne, 2019-07-03, 15:28:45.377087, 35 idOne, 2019-07-03, 15:28:46.316321, 40 idOne, 2019-07-03, 15:28:47.527960, 45
I can have several bodies of ID structures.
I use the lxml library.
I tried with the xpath method and for loop but I can only get the ID but not the rest. The problem is the second for loop, but I don’t know how to deal with the values of “date” and “time”…
with open(args.input, "r") as f: # add root balises to parse the xml file records = itertools.chain('<root>', f, '</root>') root = etree.fromstringlist(records) #root = etree.fromstring(records) # count the number of records NumberRecords = int(root.xpath('count(//record)')) RecordsGrid = [[] for __ in range(NumberRecords)] tss = ["id","date", "time", "value"] paths = root.xpath('//record') #print(paths) Counter = 0 for path in paths: for ts in tss[:1]: target = f'(./@{ts})' # using f-strings to populate the full path if path.xpath(target): # we start populating our current sublist with the relevant info RecordsGrid[Counter].append(path.xpath(target)[0]) else: RecordsGrid[Counter].append('NA') for ts in tss[1:]: target = f'(./ts[@name="{ts}"]/text())' if path.xpath(target): RecordsGrid[Counter].append(path.xpath(target)[0]) else: RecordsGrid[Counter].append('NA') Counter += 1 # now that we have our lists, create a df df = pd.DataFrame(RecordsGrid, columns=tss) df.to_csv(args.output, sep=',', encoding='utf-8', index=False)
Here the result:
id,date,time,value idOne,NA,NA,NA
Thanks for your time.
Advertisement
Answer
To use lxml, you can simply pass the string as html(). By using the xpath //record/ts (starting with double backslash), you can fetch all your ts results. The main id can be accessed by calling .getparent() and then the attribute.
To convert xml to csv, I would recommend using the python package csv. You can use normal file writer. However csv write handles a lot of issues and it’s cleaner.
In general, you have one method that handles everything. I would recommend splitting the logic into functions. Think Single Responsibility. Also the solution below I’ve converted the xml nodes into a NamedTupple and then write the namedTupple to csv. It’s a lot easier to maintain/ read. (i.e Theres one place that sets the header text and one place populate the data).
from lxml import etree import csv #py -m pip install python-csv import collections from collections import namedtuple Record = namedtuple('Record', ['id', 'date', 'time', 'value']) # Model to store records. def CreateCsvFile(results): with open('results.csv', 'w', newline='') as csvfile: writer = csv.DictWriter(csvfile, fieldnames=list(Record._fields)) # use the namedtuple fields for the headers writer.writeheader() writer.writerows([x._asdict() for x in results]) # To use DictWriter, the namedtuple has to be converted to dictionary def FormatRecord(xmlNode): return Record(xmlNode.getparent().attrib['id'], xmlNode.attrib["date"], xmlNode.attrib["time"], xmlNode.text) def Main(html): xmlTree = etree.HTML(html) results = [FormatRecord(xmlNode) for xmlNode in xmlTree.xpath('//record/ts')] # the double backslash will retrieve all nodes for record. CreateCsvFile(results) if __name__ == '__main__': Main("""<record id="idOne"> <ts date="2019-07-03" time="15:28:41.720440">5</ts> <ts date="2019-07-03" time="15:28:42.629959">10</ts> <ts date="2019-07-03" time="15:28:43.552677">15</ts> <ts date="2019-07-03" time="15:28:43.855345">20</ts> </record> <record id="idTwo"> <ts date="2019-07-03" time="15:28:45.072922">30</ts> <ts date="2019-07-03" time="15:28:45.377087">35</ts> <ts date="2019-07-03" time="15:28:46.316321">40</ts> <ts date="2019-07-03" time="15:28:47.527960">45</ts> </record>""")