Listing path and data from a xml file to store in a dataframe

Question

Here is a xml file : I want to save in a dataframe : 1) the path and 2) the text of the elements corresponding to the path. To do this dataframe, I am thinking to do a dictionary to store both. So first I would like to get a dictionary like that (where I have the values associated to

Accepted Answer

Here is an example of traversing XML tree. For this purpose recursive function will be needed. Fortunately lxml provides all functionality for this.from lxml import etree as etfrom collections import defaultdictimport pandas as pdd = defaultdict(list)root = et.fromstring(xml)tree = et.ElementTree(root)def traverse(el, d):    if len(list(el)) > 0:        for child in el:            traverse(child, d)    else:      if el.text is not None:        d[tree.getelementpath(el)].append(el.text)traverse(root, d)df = pd.DataFrame(d)df.head()Output:{    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/CNTROLAREA/BSR/status': ['ADD'],    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/CNTROLAREA/BSR/NOUN': ['LandIndex'],     '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/CNTROLAREA/BSR/REVISION': ['001'],     '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/reportId': ['AMI100031'],     '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/requestKey': ['R3278458'],     '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/SubmittedBy': ['EN4871'],     '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/submittedOn': ['2015/01/06 4:20:11 PM'],     '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/agreementid': ['001       4860'],     '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/agreementtype': ['NATURAL GAS'],     '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/status': ['ACTIVE'],     '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/statuseffectivedate': ['1965/02/18'],     '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/termdate': ['1965/02/18']}Please note, the dictionary d contains lists as values. That&#8217;s because elements can be repeated in XML and otherwise last value will override previous one. If that&#8217;s not the case for your particular XML, use regular dict instead of defaultdict d = {} and use assignment instead of appending d[tree.getelementpath(el)] = el.text.The same when reading from file:d = defaultdict(list)with open('output.xml', 'rb') as file:    root = et.parse(file).getroot()    tree = et.ElementTree(root)def traverse(el, d):    if len(list(el)) > 0:        for child in el:            traverse(child, d)    else:      if el.text is not None:        d[tree.getelementpath(el)].append(el.text)traverse(root, d)df = pd.DataFrame(d)print(d)

Advertisement

Answer