I am trying to extract specific tags from XML files and write them to a CSV file. I was able to do this for a single XML file, but my code extracts every identifier tag in the file. My questions are: 1) how can I extract from multiple XML files into a single CSV file, and 2) the required tag appears more than once in each record, so how do I extract only the first identifier tag from each record tag?
I am using Python 3.7.
The required output is:
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/31652</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32667</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32953</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/56906</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/57282</identifier>
Note: I am not a programmer, so I would appreciate your kind help.
My current code for a single file:

from bs4 import BeautifulSoup as b
import itertools
import os
import csv
import pandas as pd

os.chdir(r"C:*test")

with open("aaaaahbc.xml", "r", encoding="utf-8") as f:  # opening xml file
    content = f.read()

soup = b(content, 'lxml')
# collects every <identifier> in the file, not just the first one per record
identifier = [values.text for values in soup.findAll("identifier")]

# For python-3.x use `zip_longest` method
# For python-2.x use `izip_longest` method

data = [item for item in itertools.zip_longest(identifier)]
df = pd.DataFrame(data=data)
df.to_csv("aaaaahbc.csv", index=True, header=False)
XML file example:
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2020-06-12T05:26:49Z</responseDate>
<request verb="ListRecords" resumptionToken="2020-05-23T03:32:50Z!2037-01-01T00:00:00Z!!oai_dc!7334186!7353566!oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/31648">
http://union.ndltd.org:8080/union.OAI-PMH/</request>
<ListRecords>
<record>
<header>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/31652</identifier>
<datestamp>2020-05-23T03:32:50Z</datestamp>
<setSpec>upv.es</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Influencia de la grasa en las propiedades físicas y sensoriales de galletas. Alternativas para la mejora del perfil de acidos grasos</dc:title>
<dc:creator>Tarancón Serrano, Paula Isabel</dc:creator>
<dc:contributor>Salvador Alcaraz, Ana</dc:contributor>
<dc:contributor>Sanz Taberner, Teresa</dc:contributor>
<dc:contributor>Tarrega Guillem, Amparo</dc:contributor>
<dc:contributor>Universitat Politècnica de València. Escuela Técnica Superior del Medio Rural y Enología - Escola Tècnica Superior del Medi Rural i Enologia</dc:contributor>
<dc:contributor>Universitat Politècnica de València. Instituto Universitario de Ingeniería de Alimentos para el Desarrollo - Institut Universitari d'Enginyeria d'Aliments per al Desenvolupament</dc:contributor>
<dc:subject>Galletas</dc:subject>
<dc:subject>Grasa</dc:subject>
<dc:subject>Propiedades sensoriales</dc:subject>
<dc:subject>Propiedades físicas</dc:subject>
<dc:subject>Mejora del perfil de ácidos grasos</dc:subject>
<dc:date>2013-09-02</dc:date>
<dc:type>info:eu-repo/semantics/doctoralThesis</dc:type>
<dc:type>info:eu-repo/semantics/acceptedVersion</dc:type>
<dc:identifier>http://hdl.handle.net/10251/31652</dc:identifier>
<dc:identifier>10.4995/Thesis/10251/31652</dc:identifier>
<dc:language>spa</dc:language>
<dc:rights>Reserva de todos los derechos</dc:rights>
<dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
<dc:source>Riunet</dc:source>
</oai_dc:dc>
</metadata>
<about>
<provenance
xmlns="http://www.openarchives.org/OAI/2.0/provenance"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance
http://www.openarchives.org/OAI/2.0/provenance.xsd">
<originDescription harvestDate="2020-05-23T03:32:50Z" altered="false">
<baseURL>https://riunet.upv.es/oai/request</baseURL>
<identifier>oai:riunet.upv.es:10251/31652</identifier>
<datestamp>2020-05-22T09:32:33Z</datestamp>
<metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
</originDescription>
</provenance>
</about></record>
<record>
<header>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32667</identifier>
<datestamp>2020-05-23T03:32:50Z</datestamp>
<setSpec>upv.es</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Sensores químicos cromogénicos y fluorogénicos para la detección de cationes y aniones</dc:title>
<dc:creator>Ábalos Aguado, Tatiana</dc:creator>
<dc:contributor>Martínez Mañez, Ramón</dc:contributor>
<dc:contributor>Sancenón Galarza, Félix</dc:contributor>
<dc:contributor>Universitat Politècnica de València. Departamento de Química - Departament de Química</dc:contributor>
<dc:subject>Sensores cromogénicos</dc:subject>
<dc:subject>Sensores fluorogénicos</dc:subject>
<dc:subject>Cationes</dc:subject>
<dc:subject>Aniones</dc:subject>
<dc:subject>Química supramolecular</dc:subject>
<dc:subject>QUIMICA INORGANICA</dc:subject>
<dc:subject>QUIMICA ORGANICA</dc:subject>
<dc:date>2013-10-07</dc:date>
<dc:type>info:eu-repo/semantics/doctoralThesis</dc:type>
<dc:type>info:eu-repo/semantics/acceptedVersion</dc:type>
<dc:identifier>http://hdl.handle.net/10251/32667</dc:identifier>
<dc:identifier>10.4995/Thesis/10251/32667</dc:identifier>
<dc:language>spa</dc:language>
<dc:rights>Reserva de todos los derechos</dc:rights>
<dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
<dc:source>Riunet</dc:source>
</oai_dc:dc>
</metadata>
<about>
<provenance
xmlns="http://www.openarchives.org/OAI/2.0/provenance"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance
http://www.openarchives.org/OAI/2.0/provenance.xsd">
<originDescription harvestDate="2020-05-23T03:32:50Z" altered="false">
<baseURL>https://riunet.upv.es/oai/request</baseURL>
<identifier>oai:riunet.upv.es:10251/32667</identifier>
<datestamp>2020-05-22T10:52:59Z</datestamp>
<metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
</originDescription>
</provenance>
</about></record>
<record>
<header>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32953</identifier>
<datestamp>2020-05-23T03:32:50Z</datestamp>
<setSpec>upv.es</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Comparison of vacuum treatments and traditional cooking in vegetables using instrumental and sensory analysis</dc:title>
<dc:creator>Iborra Bernad, María del Consuelo</dc:creator>
<dc:contributor>García Segovia, Purificación</dc:contributor>
<dc:contributor>Martínez Monzó, Javier</dc:contributor>
<dc:contributor>Universitat Politècnica de València. Departamento de Tecnología de Alimentos - Departament de Tecnologia d'Aliments</dc:contributor>
<dc:subject>Instrumental texture</dc:subject>
<dc:subject>Puncture test</dc:subject>
<dc:subject>Kramer cell test</dc:subject>
<dc:subject>Texture Profile Analysis</dc:subject>
<dc:subject>Color</dc:subject>
<dc:subject>Antioxidants</dc:subject>
<dc:subject>Anthocyanins</dc:subject>
<dc:subject>Carotenes</dc:subject>
<dc:subject>Ascorbic acid</dc:subject>
<dc:subject>Microstructure</dc:subject>
<dc:subject>Cooking treatment</dc:subject>
<dc:subject>Response Surface Methodology</dc:subject>
<dc:subject>Optimization</dc:subject>
<dc:subject>Sensory Analysis</dc:subject>
<dc:subject>Ranking test</dc:subject>
<dc:subject>Paired test</dc:subject>
<dc:subject>Just About Right</dc:subject>
<dc:subject>Flash Profile</dc:subject>
<dc:subject>Vacuum cooking</dc:subject>
<dc:subject>Sous-vide</dc:subject>
<dc:subject>Cook-vide</dc:subject>
<dc:subject>Vegetables</dc:subject>
<dc:subject>Purple-flesh potatoes</dc:subject>
<dc:subject>Carrots</dc:subject>
<dc:subject>Green beans</dc:subject>
<dc:subject>Red cabbage.</dc:subject>
<dc:subject>TECNOLOGIA DE ALIMENTOS</dc:subject>
<dc:description>Alfresco</dc:description>
<dc:date>2013-10-21</dc:date>
<dc:type>info:eu-repo/semantics/doctoralThesis</dc:type>
<dc:type>info:eu-repo/semantics/acceptedVersion</dc:type>
<dc:identifier>http://hdl.handle.net/10251/32953</dc:identifier>
<dc:identifier>10.4995/Thesis/10251/32953</dc:identifier>
<dc:language>eng</dc:language>
<dc:rights>Reserva de todos los derechos</dc:rights>
<dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
<dc:source>Riunet</dc:source>
</oai_dc:dc>
</metadata>
<about>
<provenance
xmlns="http://www.openarchives.org/OAI/2.0/provenance"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance
http://www.openarchives.org/OAI/2.0/provenance.xsd">
<originDescription harvestDate="2020-05-23T03:32:50Z" altered="false">
<baseURL>https://riunet.upv.es/oai/request</baseURL>
<identifier>oai:riunet.upv.es:10251/32953</identifier>
<datestamp>2020-05-22T09:18:49Z</datestamp>
<metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
</originDescription>
</provenance>
</about></record>
<record>
<header>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/56906</identifier>
<datestamp>2020-05-23T03:32:50Z</datestamp>
<setSpec>upv.es</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Anàlisi del discurs de la informàtica: aplicació a l'estudi de la descripció</dc:title>
<dc:creator>Montesinos López, Anna Isabel</dc:creator>
<dc:contributor>SALVADOR LIERN, VICENT MANUEL</dc:contributor>
<dc:contributor>Universitat Politècnica de València. Departamento de Lingüística Aplicada - Departament de Lingüística Aplicada</dc:contributor>
<dc:subject>Discurso</dc:subject>
<dc:subject>Informática</dc:subject>
<dc:subject>FILOLOGIA CATALANA</dc:subject>
<dc:date>2015-11-03</dc:date>
<dc:type>info:eu-repo/semantics/doctoralThesis</dc:type>
<dc:identifier>http://hdl.handle.net/10251/56906</dc:identifier>
<dc:identifier>10.4995/Thesis/10251/56906</dc:identifier>
<dc:language>cat</dc:language>
<dc:rights>Reserva de todos los derechos</dc:rights>
<dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
<dc:source>Riunet</dc:source>
</oai_dc:dc>
</metadata>
<about>
<provenance
xmlns="http://www.openarchives.org/OAI/2.0/provenance"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance
http://www.openarchives.org/OAI/2.0/provenance.xsd">
<originDescription harvestDate="2020-05-23T03:32:50Z" altered="false">
<baseURL>https://riunet.upv.es/oai/request</baseURL>
<identifier>oai:riunet.upv.es:10251/56906</identifier>
<datestamp>2020-05-22T07:41:11Z</datestamp>
<metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
</originDescription>
</provenance>
</about></record>
<record>
<header>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/57282</identifier>
<datestamp>2020-05-23T03:32:50Z</datestamp>
<setSpec>upv.es</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Herramientas para la generación y evaluación ex-ante de modelos de negocio.</dc:title>
<dc:creator>Mateu Céspedes, José María</dc:creator>
<dc:contributor>March Chordà, Isidre</dc:contributor>
<dc:contributor>Universitat Politècnica de València. Departamento de Ingeniería e Infraestructura de los Transportes - Departament d'Enginyeria i Infraestructura dels Transports</dc:contributor>
<dc:subject>Modelos de negocio</dc:subject>
<dc:subject>Evaluación ex-ante</dc:subject>
<dc:subject>INGENIERIA E INFRAESTRUCTURA DE LOS TRANSPORTES</dc:subject>
<dc:date>2015-11-10</dc:date>
<dc:type>info:eu-repo/semantics/doctoralThesis</dc:type>
<dc:identifier>http://hdl.handle.net/10251/57282</dc:identifier>
<dc:identifier>10.4995/Thesis/10251/57282</dc:identifier>
<dc:language>spa</dc:language>
<dc:rights>Reserva de todos los derechos</dc:rights>
<dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
<dc:source>Riunet</dc:source>
</oai_dc:dc>
</metadata>
<about>
<provenance
xmlns="http://www.openarchives.org/OAI/2.0/provenance"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance
http://www.openarchives.org/OAI/2.0/provenance.xsd">
<originDescription harvestDate="2020-05-23T03:32:50Z" altered="false">
<baseURL>https://riunet.upv.es/oai/request</baseURL>
<identifier>oai:riunet.upv.es:10251/57282</identifier>
<datestamp>2020-05-22T10:29:52Z</datestamp>
<metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
</originDescription>
</provenance>
</about></record>
<resumptionToken completeListSize="7353566" cursor="7334186">2020-05-29T15:07:21Z!2037-01-01T00:00:00Z!!oai_dc!7335298!7353566!oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:34876</resumptionToken> </ListRecords>
</OAI-PMH>
Answer
This script goes through every XML file in the current directory (*.xml) and extracts the first <identifier> under each <record> tag:
import csv
import glob
from bs4 import BeautifulSoup

all_data = []
for filename in glob.glob(r'*.xml'):
    # the sample file declares UTF-8, so read it as UTF-8 rather than the platform default
    with open(filename, 'r', encoding='utf-8') as f_in:
        soup = BeautifulSoup(f_in.read(), 'html.parser')
    print(filename)
    # keep only an <identifier> that is the first child of its parent inside a <record>
    for i in soup.select('record identifier:nth-child(1)'):
        print(i)
        all_data.append([filename, i.get_text(strip=True)])

# write to csv file:
with open('data.csv', 'w', newline='', encoding='utf-8') as csvfile:
    csv_writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in all_data:
        csv_writer.writerow(row)
Prints (for example):
a1.xml
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/31652</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32667</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32953</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/56906</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/57282</identifier>
a2.xml
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/31652xxx</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32667xxx</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32953</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/56906</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/57282</identifier>
And saves data.csv (screenshot from LibreOffice).
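The selector in the script picks the header identifier because, in the sample file, it is the only <identifier> that is the first child of its parent. If the element order can vary between files, a namespace-aware lookup may be more robust. The following is a minimal sketch, not part of the original answer, using only the standard library. It assumes every file uses the OAI-PMH namespace shown in the sample, and the function names `header_identifiers` and `collect_to_csv` are made up for illustration.

```python
import csv
import glob
import xml.etree.ElementTree as ET

# Namespace of the OAI-PMH envelope, copied from the xmlns attribute in the sample file.
OAI = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def header_identifiers(xml_data):
    """Return the header identifier of every <record> in one OAI-PMH document."""
    root = ET.fromstring(xml_data)
    found = []
    for record in root.iterfind(".//oai:record", OAI):
        # findtext returns the text of the first match only, or None if there is no match
        ident = record.findtext("oai:header/oai:identifier", namespaces=OAI)
        if ident is not None:
            found.append(ident.strip())
    return found


def collect_to_csv(pattern, out_path):
    """Hypothetical helper: write one (filename, identifier) row per record."""
    with open(out_path, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        for filename in sorted(glob.glob(pattern)):
            # read bytes so the parser can honor the file's own encoding declaration
            with open(filename, "rb") as f_in:
                for ident in header_identifiers(f_in.read()):
                    writer.writerow([filename, ident])
```

Calling `collect_to_csv("*.xml", "data.csv")` from the folder that holds the XML files should produce the same two-column layout as the script above. The identifiers inside <about> and the <dc:identifier> elements live in different namespaces, so the `oai:header/oai:identifier` path does not match them.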