I have some XML that consists of many repetitions of the following structure:
<record>
<header>
<identifier>oai:dnb.de/dnb:reiheO/1254645608</identifier><datestamp>2022-04-01T23:49:32Z</datestamp>
<setspec>dnb:reiheO</setspec>
</header>
<metadata>
<dc xmlns="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dnb="http://d-nb.de/standards/dnbterms" xmlns:tel="http://krait.kb.nl/coop/tel/handbook/telterms.html">
<dc:title>Advantages of Simultaneous In Situ Multispecies Detection for Portable Emission Measurement Applications / Luigi Biondo, Henrik Gerken, Lars Illmann, Tim Steinhaus, Christian Beidl, Andreas Dreizler, Steven Wagner</dc:title>
<dc:creator>Biondo, Luigi [Verfasser]</dc:creator>
<dc:creator>Gerken, Henrik [Verfasser]</dc:creator>
<dc:creator>Illmann, Lars [Verfasser]</dc:creator>
<dc:creator>Steinhaus, Tim [Verfasser]</dc:creator>
<dc:creator>Beidl, Christian [Verfasser]</dc:creator>
<dc:creator>Dreizler, Andreas [Verfasser]</dc:creator>
<dc:creator>Wagner, Steven [Verfasser]</dc:creator>
<dc:publisher>Darmstadt : Universitäts- und Landesbibliothek</dc:publisher>
<dc:date>2022</dc:date>
<dc:language>eng</dc:language>
<dc:identifier xsi:type="tel:URN">urn:nbn:de:tuda-tuprints-210508</dc:identifier>
<dc:identifier xsi:type="tel:URL">http://nbn-resolving.de/urn:nbn:de:tuda-tuprints-210508</dc:identifier>
<dc:identifier xsi:type="tel:URL">http://d-nb.info/1254645608/34</dc:identifier>
<dc:identifier xsi:type="tel:URL">http://tuprints.ulb.tu-darmstadt.de/21050/</dc:identifier>
<dc:identifier xsi:type="dnb:IDN">1254645608</dc:identifier>
<dc:subject>670 Industrielle und handwerkliche Fertigung</dc:subject>
<dc:rights>lizenzfrei</dc:rights>
<dc:type>Online-Ressource</dc:type>
</dc>
</metadata>
</record>
I am able to address most of the elements and extract the information within, but I fail to reach the specific ones where I also have to match on an attribute. I think I am struggling with the XPath, but I can't quite figure out why.
If I try this code, I do get a list of elements, but it is empty:
urn = xml.find_all('.//dc:identifier[@xsi:type="tel:URN"]', namespaces=ns)
The same happens for the less specific version:
urn = xml.find_all('.//dc:identifier', namespaces=ns)
However, this code: test1 = xml.find_all("dc:identifier")
works and returns a lovely list of elements, but obviously not just of the identifiers specified as urn.
But this: urn = xml.find_all('dc:identifier[@xsi:type="tel:URN"]', namespaces=ns)
returns an empty list again. And whatever combination I try, I either get an empty list or it’s not working at all.
Does anyone have an idea why this is, or what else I could try? It's so frustrating to get that list of all ids but not manage to select the one I need by its xsi:type…
EDIT:
I am getting the data via OAI and am using requests and BeautifulSoup. I've also tried ElementTree and lxml.
I literally just store the response from the API in a variable called “xml” and then try the following code, of which some works, and some doesn’t:
ids = xml.find_all("identifier")[0].text
print(ids)
urn1 = xml.find_all("dc:identifier")
urn1 = urn1[0].text
print(urn1)
test1 = xml.find_all("dc:identifier")
print(test1)
urn2 = xml.find_all(".//dc:identifier")
print(urn2)
urn3 = xml.find_all("dc:identifier[@xsi:type='tel:URN']")
print(urn3)
The first two return the text of the element as expected (I know that the first one is the isolated element in the header, not the first dc:identifier object; this just served testing purposes), and the third returns the list of all elements. The last two, on the other hand, return an empty list, and that is the problem, as I need the specific xsi:type element targeted in the last attempt.
Answer
First, your XML is still not well formed, since the xsi prefix hasn't been declared. I made up a declaration below just to make the answer work.
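Second, you need to use an XML parser like lxml to use XPath. BeautifulSoup's find_all matches on tag names and attribute filters, not on XPath expressions, which is why your path-style queries come back empty.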
So all together:
rec = """[your xml above, but with the first dc element now reading:
<dc xmlns="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="whatever" ]"""
from lxml import etree
doc = etree.XML(rec)
ns = {"dc": "http://purl.org/dc/elements/1.1/",
      "xsi": "whatever"}
urn2 = doc.xpath("//dc:identifier/text()", namespaces=ns)
urn3 = doc.xpath("//dc:identifier[@xsi:type='tel:URN']/text()", namespaces=ns)
And that should do it.
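Since the question also mentions ElementTree, here is a minimal sketch of the same attribute lookup using only the standard library. It makes some assumptions: the record is trimmed to three identifiers, the names RECORD and NS are made up for illustration, and the xsi prefix is bound to the standard XML Schema instance URI, which may or may not be what your full OAI response declares. ElementTree only supports a limited subset of XPath (no text(), for example), so the text is read from each matched element instead.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the record from the question. The xmlns:xsi declaration is
# added by hand; its URI is an assumption about what the full response declares.
RECORD = """\
<record>
  <metadata>
    <dc xmlns="http://www.openarchives.org/OAI/2.0/oai_dc/"
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <dc:identifier xsi:type="tel:URN">urn:nbn:de:tuda-tuprints-210508</dc:identifier>
      <dc:identifier xsi:type="tel:URL">http://d-nb.info/1254645608/34</dc:identifier>
      <dc:identifier xsi:type="dnb:IDN">1254645608</dc:identifier>
    </dc>
  </metadata>
</record>
"""

# Prefix-to-URI map used to resolve "dc:" and "xsi:" inside the path expression.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}

root = ET.fromstring(RECORD)

# "tel:URN" is just an attribute value (plain text), so "tel" needs no mapping.
urns = [el.text for el in root.findall(".//dc:identifier[@xsi:type='tel:URN']", NS)]
all_ids = [el.text for el in root.findall(".//dc:identifier", NS)]

print(urns)
print(all_ids)
```

If the prefix map and the declaration in the document point to different URIs, the attribute predicate will simply match nothing, so it is worth checking what your actual response declares for xsi.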