I’m processing XML documents like the following.
<tok lemma="i" xpos="CC">e</tok>
<tok lemma="que" xpos="CS">que</tok>
<tok lemma="aquey" xpos="PD0MP0">aqueys</tok>
<tok lemma="marit" xpos="NCMP000">marits</tok>
<tok lemma="estar" xpos="VMIP3P0">stiguen</tok>
[...]
<tok lemma="habitar" xpos="VMIP3P0">habiten</tok>
<tok lemma="en" xpos="SPS00">en</tok>
<tok lemma="aquex" xpos="PD0FS0">aqueix</tok>
<tok lemma="terra" xpos="NCMS000">món</tok>
[...]
<tok lemma="viure" xpos="VMIP3P0">viuen</tok>
<tok lemma="en" xpos="SPS00">en</tok>
<tok lemma="aquex" xpos="PD0FP0">aqueixes</tok>
<tok lemma="casa" xpos="NCFP000">cases</tok>
I’m using the following code to change the attributes of certain elements whenever certain conditions are met. The code works as expected and I’m getting the output I want. However, the time it takes to process all the files seems excessive. As you can see, I have some print statements that let me monitor the whole process. Sometimes it takes 5 minutes or more between two prints.
In fact, I’ve had to kill the process because it was taking too long. I know it is working fine because the output files that I get are correctly modified and when I did a test with a much smaller number of files the whole process managed to run to its end without a glitch (although taking forever to finish).
I have one of the new Macs with the M1 Max chip, so I thought this would go a lot faster, but it is just as slow as with the Intel chip. Is this normal when using lxml, or, being a novice, am I just producing very inefficient code? Is there any way to make this kind of thing faster?
Thanks in advance for any help you can provide.
#!/usr/bin/env python
# coding: utf-8

import os
import lxml.etree as et

#ROOT = '/Users/josepm.fontana/Downloads/_POTI'
ROOT = '/Users/josepm.fontana/Downloads/CICA_TESTIN'
ext = ('.xml')


def xml_change(root_element):
    for el in root.xpath('//tok[following-sibling::tok[1][re:match(@xpos, "^N")]]',
                         namespaces={"re": "http://exslt.org/regular-expressions"}):
        if (el.text == 'aquest' or el.text == 'Aquest' or el.text == 'AQUEST'
                or el.text == 'aqueix' or el.text == 'Aqueix' or el.text == 'AQUEIX'):
            print(el.text)
            print('Current value is:', el.get('lemma'), el.get('xpos'))
            el.set('xpos', 'DD0MS0')
            el.set('lemma', 'aquest')
        if (el.text == 'aquel' or el.text == 'Aquel' or el.text == 'AQUEL'
                or el.text == 'aquell' or el.text == 'Aquell' or el.text == 'AQUELL'):
            print(el.text)
            print('Current value is:', el.get('lemma'), el.get('xpos'))
            el.set('xpos', 'DD0MS0')
            el.set('lemma', 'aquell')
        if el.text == 'aquests' or el.text == 'Aquests' or el.text == 'AQUESTS':
            print('Current value is:', el.get('lemma'), el.get('xpos'))
            el.set('xpos', 'DD0MP0')
            el.set('lemma', 'aquest')
        if (el.text == 'aquells' or el.text == 'Aquells' or el.text == 'AQUELLS'
                or el.text == 'aqueys' or el.text == 'Aqueys' or el.text == 'AQUEYS'
                or el.text == 'aqueyls' or el.text == 'Aqueyls' or el.text == 'AQUEYLS'
                or el.text == 'aqueys' or el.text == 'Aqueys' or el.text == 'AQUEYS'):
            print('Current value is:', el.get('lemma'), el.get('xpos'))
            el.set('xpos', 'DD0MP0')
            el.set('lemma', 'aquell')
        if (el.text == 'aquestas' or el.text == 'Aquestas' or el.text == 'AQUESTAS'
                or el.text == 'aqueixes' or el.text == 'Aqueixes' or el.text == 'AQUEIXES'):
            print('Current value is:', el.get('lemma'), el.get('xpos'))
            el.set('xpos', 'DD0FP0')
            el.set('lemma', 'aquest')
        if (el.text == 'aqualas' or el.text == 'Aqualas' or el.text == 'AQUALAS'
                or el.text == 'aquelas' or el.text == 'Aquelas' or el.text == 'AQUELAS'
                or el.text == 'aqueles' or el.text == 'Aqueles' or el.text == 'AQUELES'
                or el.text == 'aquellas' or el.text == 'Aquellas' or el.text == 'AQUELLAS'
                or el.text == 'aquelles' or el.text == 'Aquelles' or el.text == 'AQUELLES'):
            print('Current value is:', el.get('lemma'), el.get('xpos'))
            el.set('xpos', 'DD0FP0')
            el.set('lemma', 'aquell')


# iterate all dirs
for root, dirs, files in os.walk(ROOT):
    # iterate all files
    for file in files:
        if file.endswith(ext):
            # join root dir and file name
            file_path = os.path.join(ROOT, file)
            # load root element from file
            root = et.parse(file_path).getroot()
            # recursively change elements from xml
            xml_change(root)
            # init tree object from root
            tree = et.ElementTree(root)
            # save cleaned xml tree object to file. Important to specify encoding
            tree.write(file_path.replace('.xml', '-clean.xml'),
                       encoding='utf-8',
                       doctype='<!DOCTYPE document SYSTEM "estcorpus.dtd">',
                       xml_declaration=True)
Answer
The code evaluates every condition for each element, even though at most one of them can match. One possible optimization is to chain them with if/elif, so that testing stops at the first match:
if (el.text == 'aquest' or el.text == 'Aquest' or el.text == 'AQUEST'
        or el.text == 'aqueix' or el.text == 'Aqueix' or el.text == 'AQUEIX'):
    print(el.text)
    print('Current value is:', el.get('lemma'), el.get('xpos'))
    el.set('xpos', 'DD0MS0')
    el.set('lemma', 'aquest')
elif (el.text == 'aquel' or el.text == 'Aquel' or el.text == 'AQUEL'
        or el.text == 'aquell' or el.text == 'Aquell' or el.text == 'AQUELL'):
    print(el.text)
    print('Current value is:', el.get('lemma'), el.get('xpos'))
    el.set('xpos', 'DD0MS0')
    el.set('lemma', 'aquell')
# other elif branches here
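Going one step further in the same direction, the whole chain of comparisons could probably be collapsed into a single dictionary lookup. The following is only a sketch: the names RETAG and lookup are made up for illustration, and the table is simply transcribed from the forms listed in the question, accepting the same three casings (lower case, capitalized, upper case) that the question tests for.

```python
# Hypothetical lookup table: lower-cased form -> (lemma, xpos).
# Entries are transcribed from the conditions in the question.
RETAG = {
    "aquest": ("aquest", "DD0MS0"),
    "aqueix": ("aquest", "DD0MS0"),
    "aquel": ("aquell", "DD0MS0"),
    "aquell": ("aquell", "DD0MS0"),
    "aquests": ("aquest", "DD0MP0"),
    "aquells": ("aquell", "DD0MP0"),
    "aqueys": ("aquell", "DD0MP0"),
    "aqueyls": ("aquell", "DD0MP0"),
    "aquestas": ("aquest", "DD0FP0"),
    "aqueixes": ("aquest", "DD0FP0"),
    "aqualas": ("aquell", "DD0FP0"),
    "aquelas": ("aquell", "DD0FP0"),
    "aqueles": ("aquell", "DD0FP0"),
    "aquellas": ("aquell", "DD0FP0"),
    "aquelles": ("aquell", "DD0FP0"),
}


def lookup(text):
    """Return (lemma, xpos) for a known form, or None if there is no match."""
    if text is None:
        return None
    # Accept only the three casings the question checks: aquest, Aquest, AQUEST.
    if text not in (text.lower(), text.capitalize(), text.upper()):
        return None
    return RETAG.get(text.lower())
```

Inside the loop, the body would then reduce to something like: call lookup(el.text), and if the result is not None, pass its two values to el.set('lemma', ...) and el.set('xpos', ...). This mostly helps readability; the comparisons themselves are unlikely to be what takes minutes.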
Also, the plain XPath function starts-with could be used to avoid regular expressions:
//tok[following-sibling::tok[1][starts-with(@xpos, "N")]]
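If the XPath evaluation itself turns out to be the bottleneck, the same test can be written as a single pass over each parent's children in plain Python. This is a different technique from the XPath above, not part of the original answer, and whether it is actually faster on the real corpus is untested. The function name toks_before_noun and the wrapping s element in the sample are assumptions for illustration. The sketch uses the standard library's xml.etree.ElementTree so that it runs anywhere; the calls it relies on (iter, get, iterating over children) are also available on lxml elements.

```python
import xml.etree.ElementTree as ET  # stdlib; stands in for lxml.etree here


def toks_before_noun(root):
    """Yield each <tok> whose next <tok> sibling has an xpos starting with 'N'."""
    for parent in root.iter():
        # Keep only the <tok> children of this parent, in document order.
        toks = [child for child in parent if child.tag == "tok"]
        # Pair every tok with the tok sibling that follows it.
        for current, following in zip(toks, toks[1:]):
            if following.get("xpos", "").startswith("N"):
                yield current


# Tiny sample based on the question's data; the <s> wrapper is hypothetical.
sample = ET.fromstring(
    '<s><tok lemma="en" xpos="SPS00">en</tok>'
    '<tok lemma="aquex" xpos="PD0FS0">aqueix</tok>'
    '<tok lemma="terra" xpos="NCMS000">món</tok></s>'
)
print([t.text for t in toks_before_noun(sample)])  # prints ['aqueix']
```

Each element is visited once and compared only with its immediate neighbour, so the cost grows linearly with the number of tok elements.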