I have a large XML file (that one), of which I am providing a sample here:
<?xml version="1.0" encoding="UTF-8"?>
<hmdb xmlns="http://www.hmdb.ca">
  <metabolite>
    <normal_concentrations>
      <concentration>
        <biospecimen>Blood</biospecimen>
        <concentration_value>2.8 +/- 8.8</concentration_value>
      </concentration>
      <concentration>
        <biospecimen>Feces</biospecimen>
        <concentration_value/>
      </concentration>
      <concentration>
        <biospecimen>Salvia</biospecimen>
        <concentration_value>5.2</concentration_value>
      </concentration>
    </normal_concentrations>
  </metabolite>
  <metabolite>
    <normal_concentrations>
      <concentration>
        <biospecimen>Blood</biospecimen>
        <concentration_value>5</concentration_value>
      </concentration>
      <concentration>
        <biospecimen>Feces</biospecimen>
        <concentration_value/>
      </concentration>
      <concentration>
        <biospecimen>Salvia</biospecimen>
        <concentration_value>3-7</concentration_value>
      </concentration>
    </normal_concentrations>
  </metabolite>
</hmdb>
I now want to pull out all biospecimen and concentration_value and be able to associate them with each other in the end. I am trying to do it like this:
from io import StringIO
from lxml import etree
import csv

def hmdbextract(name, file):
    ns = {'hmdb': 'http://www.hmdb.ca'}
    context = etree.iterparse(name, tag='{http://www.hmdb.ca}metabolite')
    csvfile = open(file, 'w')
    fieldnames = ['normal_concentration_spec', 'normal_concentration_conc']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for event, elem in context:
        try:
            tl = elem.xpath('hmdb:normal_concentrations/hmdb:concentration/hmdb:biospecimen/text()', namespaces=ns)
            normal_concentration_spec = '; '.join(str(e) for e in tl)
        except:
            normal_concentration_spec = 'NA'
        try:
            tl = elem.xpath('hmdb:normal_concentrations/hmdb:concentration/hmdb:concentration_value/text()', namespaces=ns)
            normal_concentration_conc = '; '.join(str(e) for e in tl)
        except:
            normal_concentration_conc = 'NA'
        writer.writerow({'normal_concentration_spec': normal_concentration_spec,
                         'normal_concentration_conc': normal_concentration_conc})
        elem.clear()
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context
    return

hmdbextract('hmdb_file.xml', 'hmmdb_file.csv')
The output csv should look like this:
normal_concentration_spec,normal_concentration_conc
Blood; Feces; Salvia,2.8 +/- 8.8; NA; 5.2
Blood; Feces; Salvia,5; NA; 3-7
In reality I also pull out many other things that have only a single value per metabolite, which is why I prefer this CSV format. However, since some of the concentration_value slots are empty, I just get different numbers of specimens and values, and won't be able to tell which value belongs to which specimen in the end.
How can I get something like an NA value for each missing concentration_value? (Ideally while keeping the general structure of the code and the lxml package, since I have to pull out a lot of things for which this is already set up.)
Answer
Selecting text() on an empty element returns a zero-length list. That can be used to show NA instead:
>>> context = etree.iterparse('tmp.xml', tag='{http://www.hmdb.ca}concentration_value')
>>> for event, elem in context:
...     tlc = elem.xpath('text()', namespaces=ns)
...     print(len(tlc), tlc)
...
1 ['2.8 +/- 8.8']
0 []
1 ['5.2']
Using OP’s code
from lxml import etree

ns = {'hmdb': 'http://www.hmdb.ca'}
context = etree.iterparse('/home/luis/tmp/tmp.xml', tag='{http://www.hmdb.ca}metabolite')
for event, elem in context:
    try:
        # Select the elements themselves (not text()) so every element yields one entry
        tl = elem.xpath('hmdb:normal_concentrations/hmdb:concentration/hmdb:biospecimen', namespaces=ns)
        normal_concentration_spec = '; '.join(str(e.text) for e in tl)
    except Exception as ex:
        print(ex)
        normal_concentration_spec = 'NA'
    try:
        tl = elem.xpath('hmdb:normal_concentrations/hmdb:concentration/hmdb:concentration_value', namespaces=ns)
        # An empty <concentration_value/> has e.text == None; substitute 'NA'
        normal_concentration_conc = '; '.join(str(e.text if e.text is not None else 'NA') for e in tl)
    except Exception as ex:
        normal_concentration_conc = 'NA'
    print(normal_concentration_spec, normal_concentration_conc)
Result
Blood; Feces; Salvia 2.8 +/- 8.8; NA; 5.2
Blood; Feces; Salvia 5; NA; 3-7
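The approach above keeps the two lists aligned as long as every concentration contains a concentration_value element. If that element could be absent altogether (an assumption, the sample in the question does not show this case), one possible variation is to loop over each concentration and read both children from it, so each specimen always gets exactly one value. The sketch below is only an illustration: the helper name concentration_columns, the inline SAMPLE (in which the third concentration_value has been removed on purpose), and the standard-library fallback import are all made up for this example and are not part of the original code.

```python
import csv
import io

try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption: fall back to the standard library so the sketch still runs
    import xml.etree.ElementTree as etree

NS = {'hmdb': 'http://www.hmdb.ca'}
METABOLITE = '{http://www.hmdb.ca}metabolite'

# Hypothetical sample: one metabolite from the question, with the third
# <concentration_value> removed entirely to show the "missing element" case.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<hmdb xmlns="http://www.hmdb.ca">
  <metabolite>
    <normal_concentrations>
      <concentration>
        <biospecimen>Blood</biospecimen>
        <concentration_value>2.8 +/- 8.8</concentration_value>
      </concentration>
      <concentration>
        <biospecimen>Feces</biospecimen>
        <concentration_value/>
      </concentration>
      <concentration>
        <biospecimen>Salvia</biospecimen>
      </concentration>
    </normal_concentrations>
  </metabolite>
</hmdb>"""


def concentration_columns(metabolite, na='NA'):
    """Return (specimens, values), each joined by '; ', one entry per <concentration>."""
    specs, values = [], []
    path = 'hmdb:normal_concentrations/hmdb:concentration'
    for conc in metabolite.findall(path, namespaces=NS):
        # findtext gives None for a missing child and no text for an empty one;
        # both cases collapse to the NA marker here.
        spec = conc.findtext('hmdb:biospecimen', namespaces=NS) or ''
        value = conc.findtext('hmdb:concentration_value', namespaces=NS) or ''
        specs.append(spec.strip() or na)
        values.append(value.strip() or na)
    return '; '.join(specs), '; '.join(values)


out = io.StringIO()
writer = csv.writer(out, lineterminator='\n')
writer.writerow(['normal_concentration_spec', 'normal_concentration_conc'])
for _event, elem in etree.iterparse(io.BytesIO(SAMPLE)):
    if elem.tag != METABOLITE:
        continue  # only act once a whole <metabolite> has been parsed
    writer.writerow(concentration_columns(elem))
    elem.clear()  # release the processed subtree

result = out.getvalue()
print(result)
```

With this sample the data row comes out as Blood; Feces; Salvia,2.8 +/- 8.8; NA; NA, so both the empty and the absent value are marked. In the original function the same helper could be called inside the existing loop in place of the two xpath blocks.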