Listing path and data from a xml file to store in a dataframe

Tags: , , , ,



Here is a xml file :

<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Header />
  <SOAP-ENV:Body>
    <ADD_LandIndex_001>
      <CNTROLAREA>
        <BSR>
          <status>ADD</status>
          <NOUN>LandIndex</NOUN>
          <REVISION>001</REVISION>
        </BSR>
      </CNTROLAREA>
      <DATAAREA>
        <LandIndex>
          <reportId>AMI100031</reportId>
          <requestKey>R3278458</requestKey>
          <SubmittedBy>EN4871</SubmittedBy>
          <submittedOn>2015/01/06 4:20:11 PM</submittedOn>
          <LandIndex>
            <agreementdetail>
              <agreementid>001       4860</agreementid>
              <agreementtype>NATURAL GAS</agreementtype>
              <currentstatus>
                <status>ACTIVE</status>
                <statuseffectivedate>1965/02/18</statuseffectivedate>
                <termdate>1965/02/18</termdate>
              </currentstatus>
              <designatedrepresentative></designatedrepresentative>
            </agreementdetail>
          </LandIndex>
        </LandIndex>
      </DATAAREA>
    </ADD_LandIndex_001>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

I want to save in a dataframe : 1) the path and 2) the text of the elements corresponding to the path. To do this dataframe, I am thinking to do a dictionary to store both. So first I would like to get a dictionary like that (where I have the values associated to the corresonding path).

{'/Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/status': 'ADD', /Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/NOUN: 'LandIndex',...}

Like that I just have to use the function df=pd.DataFrame() to create a dataframe that I can export in a excel sheet. I have already a part for the listing of the path, however I can not get text from those paths. I do not get how the lxml library works. I tried the function .text() and text_content() but I have an error.

Here is my code :

from lxml import etree
import xml.etree.ElementTree as et
from bs4 import BeautifulSoup
import pandas as pd

filename = 'file_try.xml'

with open(filename, 'r') as f: 
    soap = f.read() 

root = etree.XML(soap.encode())    
tree = etree.ElementTree(root)


mylist_path = []
mylist_data = []
mydico = {}
mylist = []

for target in root.xpath('//text()'):

    if len(target.strip())>0:       
        path = tree.getpath(target.getparent()).replace('SOAP-ENV:','')
        mydico[path] = target.text()

        mylist_path.append(path)
        mylist_data.append(target.text())
        mylist.append(mydico)

df=pd.DataFrame(mylist)
df.to_excel("data_xml.xlsx") 

print(mylist_path)
print(mylist_data)

Thank you for the help !

Answer

Here is an example of traversing XML tree. For this purpose recursive function will be needed. Fortunately lxml provides all functionality for this.

from lxml import etree as et
from collections import defaultdict
import pandas as pd

d = defaultdict(list)
root = et.fromstring(xml)
tree = et.ElementTree(root)

def traverse(el, d):
    if len(list(el)) > 0:
        for child in el:
            traverse(child, d)
    else:
      if el.text is not None:
        d[tree.getelementpath(el)].append(el.text)

traverse(root, d)

df = pd.DataFrame(d)

df.head()

Output:

{
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/CNTROLAREA/BSR/status': ['ADD'],
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/CNTROLAREA/BSR/NOUN': ['LandIndex'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/CNTROLAREA/BSR/REVISION': ['001'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/reportId': ['AMI100031'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/requestKey': ['R3278458'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/SubmittedBy': ['EN4871'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/submittedOn': ['2015/01/06 4:20:11 PM'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/agreementid': ['001       4860'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/agreementtype': ['NATURAL GAS'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/status': ['ACTIVE'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/statuseffectivedate': ['1965/02/18'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/termdate': ['1965/02/18']
}

Please note, the dictionary d contains lists as values. That’s because elements can be repeated in XML and otherwise last value will override previous one. If that’s not the case for your particular XML, use regular dict instead of defaultdict d = {} and use assignment instead of appending d[tree.getelementpath(el)] = el.text.

The same when reading from file:

d = defaultdict(list)

with open('output.xml', 'rb') as file:
    root = et.parse(file).getroot()
    
tree = et.ElementTree(root)

def traverse(el, d):
    if len(list(el)) > 0:
        for child in el:
            traverse(child, d)
    else:
      if el.text is not None:
        d[tree.getelementpath(el)].append(el.text)

traverse(root, d)

df = pd.DataFrame(d)

print(d)


Source: stackoverflow