Multiple XML files to CSV using Python

I am trying to extract specific tags from XML and convert them to a CSV file. I was able to do this for a single XML file, extracting every identifier tag in the file. My questions are: 1) how do I extract from multiple XML files into a single CSV file, and 2) the required tag appears more than once in each record of the given XML file, so how do I extract only the first identifier tag from each record tag?

I am using Python 3.7.

The required output is:

<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/31652</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32667</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32953</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/56906</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/57282</identifier>

Note: I am not a programmer, so I appreciate your help.

from bs4 import BeautifulSoup as b
import itertools
import os
import csv
import pandas as pd


os.chdir(r"C:*test")

with open("aaaaahbc.xml", "r", encoding="utf-8") as f: # opening xml file
    content = f.read()

soup = b(content, 'lxml')
identifier =  [ values.text for values in soup.findAll("identifier")]

# For Python 3.x use `zip_longest`
# For Python 2.x use `izip_longest`

data = [item for item in itertools.zip_longest(identifier)] 
df  = pd.DataFrame(data=data)
df.to_csv("aaaaahbc.csv",index=True, header=False)

Example XML file:

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
         http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
 <responseDate>2020-06-12T05:26:49Z</responseDate>
 <request verb="ListRecords" resumptionToken="2020-05-23T03:32:50Z!2037-01-01T00:00:00Z!!oai_dc!7334186!7353566!oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/31648">
    http://union.ndltd.org:8080/union.OAI-PMH/</request>
 <ListRecords>
  <record>
<header>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/31652</identifier>
<datestamp>2020-05-23T03:32:50Z</datestamp>
<setSpec>upv.es</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Influencia de la grasa en las propiedades físicas y sensoriales de galletas. Alternativas para la mejora del perfil de acidos grasos</dc:title>
<dc:creator>Tarancón Serrano, Paula Isabel</dc:creator>
<dc:contributor>Salvador Alcaraz, Ana</dc:contributor>
<dc:contributor>Sanz Taberner, Teresa</dc:contributor>
<dc:contributor>Tarrega Guillem, Amparo</dc:contributor>
<dc:contributor>Universitat Politècnica de València. Escuela Técnica Superior del Medio Rural y Enología - Escola Tècnica Superior del Medi Rural i Enologia</dc:contributor>
<dc:contributor>Universitat Politècnica de València. Instituto Universitario de Ingeniería de Alimentos para el Desarrollo - Institut Universitari d'Enginyeria d'Aliments per al Desenvolupament</dc:contributor>
<dc:subject>Galletas</dc:subject>
<dc:subject>Grasa</dc:subject>
<dc:subject>Propiedades sensoriales</dc:subject>
<dc:subject>Propiedades físicas</dc:subject>
<dc:subject>Mejora del perfil de ácidos grasos</dc:subject>
<dc:date>2013-09-02</dc:date>
<dc:type>info:eu-repo/semantics/doctoralThesis</dc:type>
<dc:type>info:eu-repo/semantics/acceptedVersion</dc:type>
<dc:identifier>http://hdl.handle.net/10251/31652</dc:identifier>
<dc:identifier>10.4995/Thesis/10251/31652</dc:identifier>
<dc:language>spa</dc:language>
<dc:rights>Reserva de todos los derechos</dc:rights>
<dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
<dc:source>Riunet</dc:source>
</oai_dc:dc>

</metadata>
<about>
<provenance
xmlns="http://www.openarchives.org/OAI/2.0/provenance"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance 
http://www.openarchives.org/OAI/2.0/provenance.xsd">
 <originDescription harvestDate="2020-05-23T03:32:50Z" altered="false">
  <baseURL>https://riunet.upv.es/oai/request</baseURL>
  <identifier>oai:riunet.upv.es:10251/31652</identifier>
  <datestamp>2020-05-22T09:32:33Z</datestamp>
  <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
 </originDescription>
</provenance>

</about></record>
  <record>
<header>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32667</identifier>
<datestamp>2020-05-23T03:32:50Z</datestamp>
<setSpec>upv.es</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Sensores químicos cromogénicos y fluorogénicos para la detección de cationes y aniones</dc:title>
<dc:creator>Ábalos Aguado, Tatiana</dc:creator>
<dc:contributor>Martínez Mañez, Ramón</dc:contributor>
<dc:contributor>Sancenón Galarza, Félix</dc:contributor>
<dc:contributor>Universitat Politècnica de València. Departamento de Química - Departament de Química</dc:contributor>
<dc:subject>Sensores cromogénicos</dc:subject>
<dc:subject>Sensores fluorogénicos</dc:subject>
<dc:subject>Cationes</dc:subject>
<dc:subject>Aniones</dc:subject>
<dc:subject>Química supramolecular</dc:subject>
<dc:subject>QUIMICA INORGANICA</dc:subject>
<dc:subject>QUIMICA ORGANICA</dc:subject>
<dc:date>2013-10-07</dc:date>
<dc:type>info:eu-repo/semantics/doctoralThesis</dc:type>
<dc:type>info:eu-repo/semantics/acceptedVersion</dc:type>
<dc:identifier>http://hdl.handle.net/10251/32667</dc:identifier>
<dc:identifier>10.4995/Thesis/10251/32667</dc:identifier>
<dc:language>spa</dc:language>
<dc:rights>Reserva de todos los derechos</dc:rights>
<dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
<dc:source>Riunet</dc:source>
</oai_dc:dc>

</metadata>
<about>
<provenance
xmlns="http://www.openarchives.org/OAI/2.0/provenance"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance 
http://www.openarchives.org/OAI/2.0/provenance.xsd">
 <originDescription harvestDate="2020-05-23T03:32:50Z" altered="false">
  <baseURL>https://riunet.upv.es/oai/request</baseURL>
  <identifier>oai:riunet.upv.es:10251/32667</identifier>
  <datestamp>2020-05-22T10:52:59Z</datestamp>
  <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
 </originDescription>
</provenance>

</about></record>
  <record>
<header>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32953</identifier>
<datestamp>2020-05-23T03:32:50Z</datestamp>
<setSpec>upv.es</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Comparison of vacuum treatments and traditional cooking in vegetables using instrumental and sensory analysis</dc:title>
<dc:creator>Iborra Bernad, María del Consuelo</dc:creator>
<dc:contributor>García Segovia, Purificación</dc:contributor>
<dc:contributor>Martínez Monzó, Javier</dc:contributor>
<dc:contributor>Universitat Politècnica de València. Departamento de Tecnología de Alimentos - Departament de Tecnologia d'Aliments</dc:contributor>
<dc:subject>Instrumental texture</dc:subject>
<dc:subject>Puncture test</dc:subject>
<dc:subject>Kramer cell test</dc:subject>
<dc:subject>Texture Profile Analysis</dc:subject>
<dc:subject>Color</dc:subject>
<dc:subject>Antioxidants</dc:subject>
<dc:subject>Anthocyanins</dc:subject>
<dc:subject>Carotenes</dc:subject>
<dc:subject>Ascorbic acid</dc:subject>
<dc:subject>Microstructure</dc:subject>
<dc:subject>Cooking treatment</dc:subject>
<dc:subject>Response Surface Methodology</dc:subject>
<dc:subject>Optimization</dc:subject>
<dc:subject>Sensory Analysis</dc:subject>
<dc:subject>Ranking test</dc:subject>
<dc:subject>Paired test</dc:subject>
<dc:subject>Just About Right</dc:subject>
<dc:subject>Flash Profile</dc:subject>
<dc:subject>Vacuum cooking</dc:subject>
<dc:subject>Sous-vide</dc:subject>
<dc:subject>Cook-vide</dc:subject>
<dc:subject>Vegetables</dc:subject>
<dc:subject>Purple-flesh potatoes</dc:subject>
<dc:subject>Carrots</dc:subject>
<dc:subject>Green beans</dc:subject>
<dc:subject>Red cabbage.</dc:subject>
<dc:subject>TECNOLOGIA DE ALIMENTOS</dc:subject>
<dc:description>Alfresco</dc:description>
<dc:date>2013-10-21</dc:date>
<dc:type>info:eu-repo/semantics/doctoralThesis</dc:type>
<dc:type>info:eu-repo/semantics/acceptedVersion</dc:type>
<dc:identifier>http://hdl.handle.net/10251/32953</dc:identifier>
<dc:identifier>10.4995/Thesis/10251/32953</dc:identifier>
<dc:language>eng</dc:language>
<dc:rights>Reserva de todos los derechos</dc:rights>
<dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
<dc:source>Riunet</dc:source>
</oai_dc:dc>

</metadata>
<about>
<provenance
xmlns="http://www.openarchives.org/OAI/2.0/provenance"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance 
http://www.openarchives.org/OAI/2.0/provenance.xsd">
 <originDescription harvestDate="2020-05-23T03:32:50Z" altered="false">
  <baseURL>https://riunet.upv.es/oai/request</baseURL>
  <identifier>oai:riunet.upv.es:10251/32953</identifier>
  <datestamp>2020-05-22T09:18:49Z</datestamp>
  <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
 </originDescription>
</provenance>

</about></record>
  <record>
<header>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/56906</identifier>
<datestamp>2020-05-23T03:32:50Z</datestamp>
<setSpec>upv.es</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Anàlisi del discurs de la informàtica: aplicació a l'estudi de la descripció</dc:title>
<dc:creator>Montesinos López, Anna Isabel</dc:creator>
<dc:contributor>SALVADOR LIERN, VICENT MANUEL</dc:contributor>
<dc:contributor>Universitat Politècnica de València. Departamento de Lingüística Aplicada - Departament de Lingüística Aplicada</dc:contributor>
<dc:subject>Discurso</dc:subject>
<dc:subject>Informática</dc:subject>
<dc:subject>FILOLOGIA CATALANA</dc:subject>
<dc:date>2015-11-03</dc:date>
<dc:type>info:eu-repo/semantics/doctoralThesis</dc:type>
<dc:identifier>http://hdl.handle.net/10251/56906</dc:identifier>
<dc:identifier>10.4995/Thesis/10251/56906</dc:identifier>
<dc:language>cat</dc:language>
<dc:rights>Reserva de todos los derechos</dc:rights>
<dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
<dc:source>Riunet</dc:source>
</oai_dc:dc>

</metadata>
<about>
<provenance
xmlns="http://www.openarchives.org/OAI/2.0/provenance"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance 
http://www.openarchives.org/OAI/2.0/provenance.xsd">
 <originDescription harvestDate="2020-05-23T03:32:50Z" altered="false">
  <baseURL>https://riunet.upv.es/oai/request</baseURL>
  <identifier>oai:riunet.upv.es:10251/56906</identifier>
  <datestamp>2020-05-22T07:41:11Z</datestamp>
  <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
 </originDescription>
</provenance>

</about></record>
  <record>
<header>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/57282</identifier>
<datestamp>2020-05-23T03:32:50Z</datestamp>
<setSpec>upv.es</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Herramientas para la generación y evaluación ex-ante de modelos de negocio.</dc:title>
<dc:creator>Mateu Céspedes, José María</dc:creator>
<dc:contributor>March Chordà, Isidre</dc:contributor>
<dc:contributor>Universitat Politècnica de València. Departamento de Ingeniería e Infraestructura de los Transportes - Departament d'Enginyeria i Infraestructura dels Transports</dc:contributor>
<dc:subject>Modelos de negocio</dc:subject>
<dc:subject>Evaluación ex-ante</dc:subject>
<dc:subject>INGENIERIA E INFRAESTRUCTURA DE LOS TRANSPORTES</dc:subject>
<dc:date>2015-11-10</dc:date>
<dc:type>info:eu-repo/semantics/doctoralThesis</dc:type>
<dc:identifier>http://hdl.handle.net/10251/57282</dc:identifier>
<dc:identifier>10.4995/Thesis/10251/57282</dc:identifier>
<dc:language>spa</dc:language>
<dc:rights>Reserva de todos los derechos</dc:rights>
<dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
<dc:source>Riunet</dc:source>
</oai_dc:dc>

</metadata>
<about>
<provenance
xmlns="http://www.openarchives.org/OAI/2.0/provenance"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance 
http://www.openarchives.org/OAI/2.0/provenance.xsd">
 <originDescription harvestDate="2020-05-23T03:32:50Z" altered="false">
  <baseURL>https://riunet.upv.es/oai/request</baseURL>
  <identifier>oai:riunet.upv.es:10251/57282</identifier>
  <datestamp>2020-05-22T10:29:52Z</datestamp>
  <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
 </originDescription>
</provenance>

</about></record>
<resumptionToken completeListSize="7353566" cursor="7334186">2020-05-29T15:07:21Z!2037-01-01T00:00:00Z!!oai_dc!7335298!7353566!oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:34876</resumptionToken> </ListRecords>
</OAI-PMH>

Answer

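This script goes through every XML file in the current directory (*.xml) and extracts the first <identifier> under each <record> tag. The selector matches only an <identifier> that is the first child element of its parent, which in the sample file is the one inside <header>: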

import csv
import glob
from bs4 import BeautifulSoup

all_data = []
for filename in glob.glob(r'*.xml'):
    with open(filename, 'r', encoding='utf-8') as f_in:  # the sample XML declares UTF-8
        soup = BeautifulSoup(f_in.read(), 'html.parser')
    print(filename)
    for i in soup.select('record identifier:nth-child(1)'):
        print(i)
        all_data.append([filename, i.get_text(strip=True)])

# write to csv file:
with open('data.csv', 'w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in all_data:
        csv_writer.writerow(row)

Prints (for example):

a1.xml
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/31652</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32667</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32953</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/56906</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/57282</identifier>
a2.xml
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/31652xxx</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32667xxx</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32953</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/56906</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/57282</identifier>

And saves the results to data.csv, one row per identifier, with the filename in the first column and the identifier text in the second.
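
If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library only. This is an alternative under stated assumptions, not part of the original answer: it assumes every file uses the OAI-PMH default namespace shown in the question's sample, and the names `header_identifiers`, `xml_files_to_csv` and `identifiers.csv` are made up for illustration. Instead of relying on the element's position, it names the path `header/identifier` explicitly inside each record.

```python
import csv
import glob
import xml.etree.ElementTree as ET

# Default namespace declared on the <OAI-PMH> root element of the sample file.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def header_identifiers(xml_data):
    """Return the <header><identifier> text of every <record>, in document order."""
    root = ET.fromstring(xml_data)
    found = []
    for record in root.iterfind(".//oai:record", OAI_NS):
        # The path names <header> explicitly, so the provenance <identifier>
        # (a different namespace) and the <dc:identifier> elements never match.
        node = record.find("oai:header/oai:identifier", OAI_NS)
        if node is not None and node.text:
            found.append(node.text.strip())
    return found


def xml_files_to_csv(pattern="*.xml", out_path="identifiers.csv"):
    """Write one (filename, identifier) row per record for every matching file."""
    rows = []
    for filename in sorted(glob.glob(pattern)):
        # Read as bytes so the parser honours the encoding in the XML declaration.
        with open(filename, "rb") as f_in:
            for identifier in header_identifiers(f_in.read()):
                rows.append([filename, identifier])
    with open(out_path, "w", newline="", encoding="utf-8") as f_out:
        csv.writer(f_out).writerows(rows)
    return rows
```

Calling `xml_files_to_csv()` from the folder that holds the XML files would write `identifiers.csv` there. A file whose records have no `<header><identifier>` simply contributes no rows.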

User contributions licensed under: CC BY-SA