Using python lxml I want to test if a XML document contains EXPERIMENT_TYPE, and if it exists, extract the <VALUE>.
Example:
<EXPERIMENT_SET> <EXPERIMENT center_name="BCCA" alias="Experiment-pass_2.0"> <TITLE>WGBS (whole genome bisulfite sequencing) analysis of SomeSampleA (library: SomeLibraryA).</TITLE> <STUDY_REF accession="SomeStudy" refcenter="BCCA"/> <EXPERIMENT_ATTRIBUTES> <EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_TYPE</TAG><VALUE>DNA Methylation</VALUE></EXPERIMENT_ATTRIBUTE> <EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_ONTOLOGY_URI</TAG><VALUE>http://purl.obolibrary.org/obo/OBI_0001863</VALUE></EXPERIMENT_ATTRIBUTE> <EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_ONTOLOGY_CURIE</TAG><VALUE>obi:0001863</VALUE></EXPERIMENT_ATTRIBUTE> <EXPERIMENT_ATTRIBUTE><TAG>MOLECULE</TAG><VALUE>genomic DNA</VALUE></EXPERIMENT_ATTRIBUTE> </EXPERIMENT_ATTRIBUTES> </EXPERIMENT> </EXPERIMENT_SET>
Is there a faster way than iterating through all elements?
all = etree.findall('EXPERIMENT/EXPERIMENT_ATTRIBUTES/EXPERIMENT_ATTRIBUTE/TAG') for e in all: if e.text == 'EXPERIMENT_TYPE': print("Found")
That attempt is also getting messy when I want to extract the <VALUE>.
Advertisement
Answer
Preferably you do this with XPath which is bound to be incredibly fast. My sugestion (tested and working). It will return a (possible empty) list of VALUE elements from which you can extra the text
.
PS: do not use “special” words such as all
as variable names. Bad practice and may lead to unexpected bugs.
import lxml.etree as ET from lxml.etree import Element from typing import List xml_str = """ <EXPERIMENT_SET> <EXPERIMENT center_name="BCCA" alias="Experiment-pass_2.0"> <TITLE>WGBS (whole genome bisulfite sequencing) analysis of SomeSampleA (library: SomeLibraryA).</TITLE> <STUDY_REF accession="SomeStudy" refcenter="BCCA"/> <EXPERIMENT_ATTRIBUTES> <EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_TYPE</TAG><VALUE>DNA Methylation</VALUE></EXPERIMENT_ATTRIBUTE> <EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_ONTOLOGY_URI</TAG><VALUE>http://purl.obolibrary.org/obo/OBI_0001863</VALUE></EXPERIMENT_ATTRIBUTE> <EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_ONTOLOGY_CURIE</TAG><VALUE>obi:0001863</VALUE></EXPERIMENT_ATTRIBUTE> <EXPERIMENT_ATTRIBUTE><TAG>MOLECULE</TAG><VALUE>genomic DNA</VALUE></EXPERIMENT_ATTRIBUTE> </EXPERIMENT_ATTRIBUTES> </EXPERIMENT> </EXPERIMENT_SET> """ tree = ET.ElementTree(ET.fromstring(xml_str)) vals: List[Element] = tree.xpath(".//EXPERIMENT_ATTRIBUTE/TAG[text()='EXPERIMENT_TYPE']/following-sibling::VALUE") print(vals[0].text) # DNA Methylation
An alternative XPath declaration was provided below by Michael Kay, which is identical to the answer by Martin Honnen.
.//EXPERIMENT_ATTRIBUTE[TAG='EXPERIMENT_TYPE']/VALUE