I’m trying to combine HTML files (text) and XML files (metadata) into a single flat dictionary per file, which I’ll write out as JSON. The files are contained in the same folder and have the following name structure:
- abcde.html
- abcde.html.xml
Here’s my simplified code. My issue is that I had to separate the XML metadata writing into a second loop (the work-around described below):
```python
import os
import json

### Create a list of dicts with one dict per file: first write file content, then metadata
for path, subdirs, files in os.walk("."):
    for fname in files:
        docname, extension = os.path.splitext(fname)
        filename = os.path.join(path, fname)
        file_dict = {}
        if extension == ".html":
            file_dict['type'] = 'circulaire'
            file_dict['filename'] = fname
            html_dict = parse_html_to_dict(filename)
            file_dict.update(html_dict)
            list_of_dict.append(file_dict)
        # elif extension == ".xml":
        #     if not any(d['filename'] == docname for d in list_of_dict):
        #         print("Well Well Well, there's no html file in the list yet !")
        #         continue
        #     else:
        #         index = next((i for i, element in enumerate(list_of_dict)
        #                       if element['filename'] == docname), None)
        #         metadata_dict = extract_metadata_xml(filename)
        #         list_of_dict[index].update(metadata_dict)
        else:
            continue

json.dump(list_of_dict, outfile, indent=3)
outfile.close()

############# Extract Metadata from XML FILES #############
import xmltodict

def extract_metadata_xml(filename):
    """Returns the xml file's contents as a dict."""
    with open(filename, encoding='utf-8', errors='ignore') as xml_file:
        temp_dict = xmltodict.parse(xml_file.read())
        metadata_dict = temp_dict.get('doc', {}).get('fields', {})
    return metadata_dict
```
Normally, I would add an elif branch (now commented out) below the if branch for HTML files. It checks for XML and updates the corresponding dictionary with the metadata (matching on identical filenames), so everything happens sequentially in one pass.
But, unfortunately, it seems that for most files the list of dicts isn’t fully up to date by then, or at least I can’t find a match for about 40% of my filenames.
The work-around I use seems a little silly to me: I wrote a second os.walk loop after the first one, which is used exclusively for HTML files. The second loop then checks for the XML extension and updates list_of_dict, which by then is fully up to date, and I get 100% of my HTML filenames matched with XML metadata.
Can I introduce some forced ordering to make sure all my HTML files are done writing before I start to match any XML filename? Is it possible that the if and elif branches are executed in parallel for different files?
Or else, what is the best way, processing-wise, to have all my HTML files handled before my XML files (just ordering my list of files by type before proceeding with the if/elif branches)?
I’m quite new to this forum, so please let me know if I can improve my question writing style, be mindful though that I’m trying my best ;).
Thanks for your help!
Answer
The way it is now, you check all files and handle them differently depending on whether they are HTML or XML, hoping that the corresponding HTML for each XML you encounter has already been processed. But that is not guaranteed (os.walk yields filenames in arbitrary order), which I suspect is the cause of your issue.
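To illustrate the failure mode, here is a minimal sketch (not your actual data) of what happens when os.walk happens to yield the .xml sidecar before its .html file: the commented-out elif finds no match and the metadata is silently lost.

```python
import os

# simulate os.walk yielding the xml sidecar before its html file,
# which can happen because os.walk gives no ordering guarantee
files = ['abcde.html.xml', 'abcde.html']

list_of_dict = []
unmatched = []
for fname in files:
    docname, extension = os.path.splitext(fname)
    if extension == ".html":
        list_of_dict.append({'filename': fname})
    elif extension == ".xml":
        # the question's check: has the matching html been processed yet?
        if not any(d['filename'] == docname for d in list_of_dict):
            unmatched.append(fname)  # no match yet: metadata is dropped

print(unmatched)  # ['abcde.html.xml']
```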
Instead, you should only look for html files and retrieve the corresponding xml right away:
```python
if extension == ".html":
    file_dict['type'] = 'circulaire'
    file_dict['filename'] = fname
    html_dict = parse_html_to_dict(filename)
    file_dict.update(html_dict)

    # process the corresponding xml file
    # this file should always exist, according to your description of the file structure
    xml_filename = os.path.join(path, fname + '.xml')
    # extract the meta data
    metadata_dict = extract_metadata_xml(xml_filename)
    # put it in the file dict we still have
    file_dict.update(metadata_dict)
    # and finally store it
    list_of_dict.append(file_dict)
```
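Here is a self-contained sketch of that approach, with an extra os.path.exists guard in case a sidecar is ever missing. The two parser functions are hypothetical stand-ins for the question's parse_html_to_dict and extract_metadata_xml, and the sample files are made up for the demonstration:

```python
import os
import tempfile

def parse_html_to_dict(path):
    # hypothetical stand-in for the question's HTML parser
    with open(path, encoding='utf-8') as f:
        return {'text': f.read()}

def extract_metadata_xml(path):
    # hypothetical stand-in for the question's XML parser
    with open(path, encoding='utf-8') as f:
        return {'meta': f.read()}

list_of_dict = []
with tempfile.TemporaryDirectory() as root:
    # one html file with an xml sidecar, one without
    for name, content in [('abcde.html', 'hello'),
                          ('abcde.html.xml', 'meta'),
                          ('orphan.html', 'no sidecar')]:
        with open(os.path.join(root, name), 'w', encoding='utf-8') as f:
            f.write(content)

    for path, subdirs, files in os.walk(root):
        for fname in files:
            _, extension = os.path.splitext(fname)
            if extension != ".html":
                continue
            file_dict = {'type': 'circulaire', 'filename': fname}
            file_dict.update(parse_html_to_dict(os.path.join(path, fname)))
            xml_filename = os.path.join(path, fname + '.xml')
            if os.path.exists(xml_filename):  # guard against a missing sidecar
                file_dict.update(extract_metadata_xml(xml_filename))
            list_of_dict.append(file_dict)
```

Each .html file is now processed exactly once, and its metadata is merged immediately, so the order in which os.walk visits the files no longer matters.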
Alternatively, though less efficiently, you could also iterate over the sorted file list:

```python
for fname in sorted(files):
```

and proceed as you did, as this would also result in the html files preceding the corresponding xml files.