I am trying to extract specific tags from XML files and convert them to a CSV file. I was able to do this for a single XML file, extracting every identifier tag in the file. My questions are: 1) how can I extract from multiple XML files into a single CSV file, and 2) the required tag appears more than once in the given XML file, so how can I extract only the first identifier tag from each record tag?
I am using Python 3.7.
Required answer:
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/31652</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32667</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32953</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/56906</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/57282</identifier>
Note: I am not a programmer! I appreciate your kind help.
from bs4 import BeautifulSoup as b
import itertools
import os
import csv
import pandas as pd

os.chdir(r"C:*test")

with open("aaaaahbc.xml", "r", encoding="utf-8") as f:  # opening xml file
    content = f.read()

soup = b(content, 'lxml')
identifier = [values.text for values in soup.findAll("identifier")]

# For python-3.x use `zip_longest` method
# For python-2.x use 'izip_longest method
data = [item for item in itertools.zip_longest(identifier)]

df = pd.DataFrame(data=data)
df.to_csv("aaaaahbc.csv", index=True, header=False)
XML file example:
<?xml version="1.0" encoding="UTF-8"?> <OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"> <responseDate>2020-06-12T05:26:49Z</responseDate> <request verb="ListRecords" resumptionToken="2020-05-23T03:32:50Z!2037-01-01T00:00:00Z!!oai_dc!7334186!7353566!oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/31648"> http://union.ndltd.org:8080/union.OAI-PMH/</request> <ListRecords> <record> <header> <identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/31652</identifier> <datestamp>2020-05-23T03:32:50Z</datestamp> <setSpec>upv.es</setSpec> </header> <metadata> <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"> <dc:title>Influencia de la grasa en las propiedades físicas y sensoriales de galletas. Alternativas para la mejora del perfil de acidos grasos</dc:title> <dc:creator>Tarancón Serrano, Paula Isabel</dc:creator> <dc:contributor>Salvador Alcaraz, Ana</dc:contributor> <dc:contributor>Sanz Taberner, Teresa</dc:contributor> <dc:contributor>Tarrega Guillem, Amparo</dc:contributor> <dc:contributor>Universitat Politècnica de València. Escuela Técnica Superior del Medio Rural y Enología - Escola Tècnica Superior del Medi Rural i Enologia</dc:contributor> <dc:contributor>Universitat Politècnica de València. 
Instituto Universitario de Ingeniería de Alimentos para el Desarrollo - Institut Universitari d'Enginyeria d'Aliments per al Desenvolupament</dc:contributor> <dc:subject>Galletas</dc:subject> <dc:subject>Grasa</dc:subject> <dc:subject>Propiedades sensoriales</dc:subject> <dc:subject>Propiedades físicas</dc:subject> <dc:subject>Mejora del perfil de ácidos grasos</dc:subject> <dc:date>2013-09-02</dc:date> <dc:type>info:eu-repo/semantics/doctoralThesis</dc:type> <dc:type>info:eu-repo/semantics/acceptedVersion</dc:type> <dc:identifier>http://hdl.handle.net/10251/31652</dc:identifier> <dc:identifier>10.4995/Thesis/10251/31652</dc:identifier> <dc:language>spa</dc:language> <dc:rights>Reserva de todos los derechos</dc:rights> <dc:rights>info:eu-repo/semantics/openAccess</dc:rights> <dc:source>Riunet</dc:source> </oai_dc:dc> </metadata> <about> <provenance xmlns="http://www.openarchives.org/OAI/2.0/provenance" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance http://www.openarchives.org/OAI/2.0/provenance.xsd"> <originDescription harvestDate="2020-05-23T03:32:50Z" altered="false"> <baseURL>https://riunet.upv.es/oai/request</baseURL> <identifier>oai:riunet.upv.es:10251/31652</identifier> <datestamp>2020-05-22T09:32:33Z</datestamp> <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace> </originDescription> </provenance> </about></record> <record> <header> <identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32667</identifier> <datestamp>2020-05-23T03:32:50Z</datestamp> <setSpec>upv.es</setSpec> </header> <metadata> <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"> <dc:title>Sensores químicos 
cromogénicos y fluorogénicos para la detección de cationes y aniones</dc:title> <dc:creator>Ábalos Aguado, Tatiana</dc:creator> <dc:contributor>Martínez Mañez, Ramón</dc:contributor> <dc:contributor>Sancenón Galarza, Félix</dc:contributor> <dc:contributor>Universitat Politècnica de València. Departamento de Química - Departament de Química</dc:contributor> <dc:subject>Sensores cromogénicos</dc:subject> <dc:subject>Sensores fluorogénicos</dc:subject> <dc:subject>Cationes</dc:subject> <dc:subject>Aniones</dc:subject> <dc:subject>Química supramolecular</dc:subject> <dc:subject>QUIMICA INORGANICA</dc:subject> <dc:subject>QUIMICA ORGANICA</dc:subject> <dc:date>2013-10-07</dc:date> <dc:type>info:eu-repo/semantics/doctoralThesis</dc:type> <dc:type>info:eu-repo/semantics/acceptedVersion</dc:type> <dc:identifier>http://hdl.handle.net/10251/32667</dc:identifier> <dc:identifier>10.4995/Thesis/10251/32667</dc:identifier> <dc:language>spa</dc:language> <dc:rights>Reserva de todos los derechos</dc:rights> <dc:rights>info:eu-repo/semantics/openAccess</dc:rights> <dc:source>Riunet</dc:source> </oai_dc:dc> </metadata> <about> <provenance xmlns="http://www.openarchives.org/OAI/2.0/provenance" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance http://www.openarchives.org/OAI/2.0/provenance.xsd"> <originDescription harvestDate="2020-05-23T03:32:50Z" altered="false"> <baseURL>https://riunet.upv.es/oai/request</baseURL> <identifier>oai:riunet.upv.es:10251/32667</identifier> <datestamp>2020-05-22T10:52:59Z</datestamp> <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace> </originDescription> </provenance> </about></record> <record> <header> <identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32953</identifier> <datestamp>2020-05-23T03:32:50Z</datestamp> <setSpec>upv.es</setSpec> </header> <metadata> <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" 
xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"> <dc:title>Comparison of vacuum treatments and traditional cooking in vegetables using instrumental and sensory analysis</dc:title> <dc:creator>Iborra Bernad, María del Consuelo</dc:creator> <dc:contributor>García Segovia, Purificación</dc:contributor> <dc:contributor>Martínez Monzó, Javier</dc:contributor> <dc:contributor>Universitat Politècnica de València. Departamento de Tecnología de Alimentos - Departament de Tecnologia d'Aliments</dc:contributor> <dc:subject>Instrumental texture</dc:subject> <dc:subject>Puncture test</dc:subject> <dc:subject>Kramer cell test</dc:subject> <dc:subject>Texture Profile Analysis</dc:subject> <dc:subject>Color</dc:subject> <dc:subject>Antioxidants</dc:subject> <dc:subject>Anthocyanins</dc:subject> <dc:subject>Carotenes</dc:subject> <dc:subject>Ascorbic acid</dc:subject> <dc:subject>Microstructure</dc:subject> <dc:subject>Cooking treatment</dc:subject> <dc:subject>Response Surface Methodology</dc:subject> <dc:subject>Optimization</dc:subject> <dc:subject>Sensory Analysis</dc:subject> <dc:subject>Ranking test</dc:subject> <dc:subject>Paired test</dc:subject> <dc:subject>Just About Right</dc:subject> <dc:subject>Flash Profile</dc:subject> <dc:subject>Vacuum cooking</dc:subject> <dc:subject>Sous-vide</dc:subject> <dc:subject>Cook-vide</dc:subject> <dc:subject>Vegetables</dc:subject> <dc:subject>Purple-flesh potatoes</dc:subject> <dc:subject>Carrots</dc:subject> <dc:subject>Green beans</dc:subject> <dc:subject>Red cabbage.</dc:subject> <dc:subject>TECNOLOGIA DE ALIMENTOS</dc:subject> <dc:description>Alfresco</dc:description> <dc:date>2013-10-21</dc:date> <dc:type>info:eu-repo/semantics/doctoralThesis</dc:type> <dc:type>info:eu-repo/semantics/acceptedVersion</dc:type> 
<dc:identifier>http://hdl.handle.net/10251/32953</dc:identifier> <dc:identifier>10.4995/Thesis/10251/32953</dc:identifier> <dc:language>eng</dc:language> <dc:rights>Reserva de todos los derechos</dc:rights> <dc:rights>info:eu-repo/semantics/openAccess</dc:rights> <dc:source>Riunet</dc:source> </oai_dc:dc> </metadata> <about> <provenance xmlns="http://www.openarchives.org/OAI/2.0/provenance" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance http://www.openarchives.org/OAI/2.0/provenance.xsd"> <originDescription harvestDate="2020-05-23T03:32:50Z" altered="false"> <baseURL>https://riunet.upv.es/oai/request</baseURL> <identifier>oai:riunet.upv.es:10251/32953</identifier> <datestamp>2020-05-22T09:18:49Z</datestamp> <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace> </originDescription> </provenance> </about></record> <record> <header> <identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/56906</identifier> <datestamp>2020-05-23T03:32:50Z</datestamp> <setSpec>upv.es</setSpec> </header> <metadata> <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"> <dc:title>Anàlisi del discurs de la informàtica: aplicació a l'estudi de la descripció</dc:title> <dc:creator>Montesinos López, Anna Isabel</dc:creator> <dc:contributor>SALVADOR LIERN, VICENT MANUEL</dc:contributor> <dc:contributor>Universitat Politècnica de València. 
Departamento de Lingüística Aplicada - Departament de Lingüística Aplicada</dc:contributor> <dc:subject>Discurso</dc:subject> <dc:subject>Informática</dc:subject> <dc:subject>FILOLOGIA CATALANA</dc:subject> <dc:date>2015-11-03</dc:date> <dc:type>info:eu-repo/semantics/doctoralThesis</dc:type> <dc:identifier>http://hdl.handle.net/10251/56906</dc:identifier> <dc:identifier>10.4995/Thesis/10251/56906</dc:identifier> <dc:language>cat</dc:language> <dc:rights>Reserva de todos los derechos</dc:rights> <dc:rights>info:eu-repo/semantics/openAccess</dc:rights> <dc:source>Riunet</dc:source> </oai_dc:dc> </metadata> <about> <provenance xmlns="http://www.openarchives.org/OAI/2.0/provenance" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance http://www.openarchives.org/OAI/2.0/provenance.xsd"> <originDescription harvestDate="2020-05-23T03:32:50Z" altered="false"> <baseURL>https://riunet.upv.es/oai/request</baseURL> <identifier>oai:riunet.upv.es:10251/56906</identifier> <datestamp>2020-05-22T07:41:11Z</datestamp> <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace> </originDescription> </provenance> </about></record> <record> <header> <identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/57282</identifier> <datestamp>2020-05-23T03:32:50Z</datestamp> <setSpec>upv.es</setSpec> </header> <metadata> <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"> <dc:title>Herramientas para la generación y evaluación ex-ante de modelos de negocio.</dc:title> <dc:creator>Mateu Céspedes, José María</dc:creator> <dc:contributor>March Chordà, Isidre</dc:contributor> <dc:contributor>Universitat Politècnica de València. 
Departamento de Ingeniería e Infraestructura de los Transportes - Departament d'Enginyeria i Infraestructura dels Transports</dc:contributor> <dc:subject>Modelos de negocio</dc:subject> <dc:subject>Evaluación ex-ante</dc:subject> <dc:subject>INGENIERIA E INFRAESTRUCTURA DE LOS TRANSPORTES</dc:subject> <dc:date>2015-11-10</dc:date> <dc:type>info:eu-repo/semantics/doctoralThesis</dc:type> <dc:identifier>http://hdl.handle.net/10251/57282</dc:identifier> <dc:identifier>10.4995/Thesis/10251/57282</dc:identifier> <dc:language>spa</dc:language> <dc:rights>Reserva de todos los derechos</dc:rights> <dc:rights>info:eu-repo/semantics/openAccess</dc:rights> <dc:source>Riunet</dc:source> </oai_dc:dc> </metadata> <about> <provenance xmlns="http://www.openarchives.org/OAI/2.0/provenance" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance http://www.openarchives.org/OAI/2.0/provenance.xsd"> <originDescription harvestDate="2020-05-23T03:32:50Z" altered="false"> <baseURL>https://riunet.upv.es/oai/request</baseURL> <identifier>oai:riunet.upv.es:10251/57282</identifier> <datestamp>2020-05-22T10:29:52Z</datestamp> <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace> </originDescription> </provenance> </about></record> <resumptionToken completeListSize="7353566" cursor="7334186">2020-05-29T15:07:21Z!2037-01-01T00:00:00Z!!oai_dc!7335298!7353566!oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:34876</resumptionToken> </ListRecords> </OAI-PMH>
Answer
This script goes through every XML file in the directory (*.xml) and extracts the first <identifier> under each <record> tag:
import csv
import glob
from bs4 import BeautifulSoup

all_data = []
for filename in glob.glob(r'*.xml'):
    # the sample file declares UTF-8, so read it with that encoding
    with open(filename, 'r', encoding='utf-8') as f_in:
        soup = BeautifulSoup(f_in.read(), 'html.parser')
    print(filename)
    # match any <identifier> under <record> that is the first child of its parent
    for i in soup.select('record identifier:nth-child(1)'):
        print(i)
        all_data.append([filename, i.get_text(strip=True)])

# write to csv file:
with open('data.csv', 'w', newline='', encoding='utf-8') as csvfile:
    csv_writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in all_data:
        csv_writer.writerow(row)
Prints (for example):
a1.xml
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/31652</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32667</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32953</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/56906</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/57282</identifier>
a2.xml
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/31652xxx</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32667xxx</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32953</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/56906</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/57282</identifier>
And saves data.csv, with the file name in the first column and the identifier text in the second.
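As a variation on the script above, the same extraction can be sketched with only the standard library. This is an illustration under assumptions, not a replacement for the answer: it assumes every file follows the OAI-PMH layout of the sample (default namespace http://www.openarchives.org/OAI/2.0/, with the wanted <identifier> inside each record's <header>), and the function names header_identifiers and xml_files_to_csv are made up for this example. Addressing the tag by its path rather than by its position means it would still be found if another element were placed before it in <header>.

```python
import csv
import glob
import xml.etree.ElementTree as ET

# Default namespace declared on the <OAI-PMH> root element of the sample file.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def header_identifiers(xml_data):
    """Return the text of the <identifier> inside each <record>'s <header>.

    Hypothetical helper: assumes the OAI-PMH layout shown in the question.
    """
    root = ET.fromstring(xml_data)
    found = []
    for record in root.iterfind(".//oai:record", OAI_NS):
        # Only the identifier in <header> is addressed; the one under
        # <about> lives in a different namespace and is never matched.
        ident = record.find("oai:header/oai:identifier", OAI_NS)
        if ident is not None and ident.text:
            found.append(ident.text.strip())
    return found


def xml_files_to_csv(pattern, csv_path):
    """Write one (file name, identifier) row per record for each matching file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        for filename in sorted(glob.glob(pattern)):
            # Read bytes so the parser honours the encoding in the XML declaration.
            with open(filename, "rb") as f_in:
                xml_data = f_in.read()
            for ident in header_identifiers(xml_data):
                writer.writerow([filename, ident])
```

Calling xml_files_to_csv("*.xml", "data.csv") from the folder that holds the XML files would collect every file into a single CSV, in the same two-column layout as the script above.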