I am reading in a LaTeX file that contains abbreviations, and I only want to extract the abbreviations. Reading the file works, but the result is not in the format I want: I would like to save each abbreviation cleanly as its own entry (see the desired output below). The problem is that I'm getting raw lines like '\t\\acro{... instead. How can I convert this to my desired output?
import sys

def getPrice(symbol, shortForm, longForm):
    abbreviations = []
    with open("./file.tex", encoding="utf-8") as f:
        file = list(f)
    save = False
    for line in file:
        print("\n" + line)
        # Stop collecting once the acronym environment ends
        if line.startswith(r'\end{acronym}'):
            save = False
        if save:
            abbreviations.append(line)
        # Start collecting on the line after \begin{acronym}
        if line.startswith(r'\begin{acronym}'):
            save = True
    print(abbreviations)

if __name__ == "__main__":
    getPrice(str(sys.argv[1]), str(sys.argv[2]), str(sys.argv[3]))

[OUT]
['\t\\acro{knmi}[KNMI]{Koninklijk Nederlands Meteorologisch Instituut}\n', '\t\\acro{test}[TESTERER]{T E SDH SADHU AHENSAD }\n']
\chapter*{Short}
\addcontentsline{toc}{chapter}{Short}
\markboth{Short}{Short}
\begin{acronym}[TESTERER]
	\acro{knmi}[KNMI]{Koninklijk Nederlands Meteorologisch Instituut}
	\acro{example}[e.g.]{For example}
\end{acronym}
Desired Output
{
    "abbreviation1": {
        "symbol": "knmi",
        "shortForm": "KNMI",
        "longForm": "Koninklijk Nederlands Meteorologisch Instituut"
    },
    "abbreviation2": {
        "symbol": "example",
        "shortForm": "e.g.",
        "longForm": "For example"
    }
}
Answer
You can use re.findall() to capture all of the abbreviations, then use the json module to dump them into a file. Your approach could work, but you'd have to do a lot of manual string parsing, which would be a real headache. (Note that a program that can parse arbitrary LaTeX would need something more powerful than regular expressions; however, since we're parsing a very small subset of LaTeX, regular expressions will do fine here.)
import re
import json

# Raw string so the LaTeX backslashes are kept literally
data = r"""\chapter*{Short}
\addcontentsline{toc}{chapter}{Short}
\markboth{Short}{Short}
\begin{acronym}[TESTERER]
    \acro{knmi}[KNMI]{Koninklijk Nederlands Meteorologisch Instituut}
    \acro{example}[e.g.]{For example}
\end{acronym}"""

# Three capture groups: \acro{symbol}[short form]{long form}
pattern = re.compile(r"\\acro\{(.+)\}\[(.+)\]\{(.+)\}")
regex_result = pattern.findall(data)

final_output = {}
for index, (symbol, shortform, longform) in enumerate(regex_result, start=1):
    final_output[f'abbreviation{index}'] = dict(symbol=symbol, shortform=shortform, longform=longform)

with open('output.json', 'w') as output_file:
    json.dump(final_output, output_file, indent=4)
output.json contains the following:
{
    "abbreviation1": {
        "symbol": "knmi",
        "shortform": "KNMI",
        "longform": "Koninklijk Nederlands Meteorologisch Instituut"
    },
    "abbreviation2": {
        "symbol": "example",
        "shortform": "e.g.",
        "longform": "For example"
    }
}
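If you want to read straight from file.tex, as in the question, and get the shortForm and longForm key names from the desired output, the same regular expression can be combined with the begin/end tracking from the original loop. This is only a minimal sketch: the names ACRO_PATTERN, parse_acronyms and convert_file are made up for illustration, and it assumes each \acro entry sits on a single line inside the acronym environment.

```python
import json
import re

# Matches one \acro{symbol}[short form]{long form} entry on a single line.
ACRO_PATTERN = re.compile(r"\\acro\{(.+)\}\[(.+)\]\{(.+)\}")


def parse_acronyms(tex_source):
    """Return a dict of the abbreviations found inside the acronym environment.

    Hypothetical helper: assumes one \\acro entry per line.
    """
    result = {}
    inside = False
    for line in tex_source.splitlines():
        stripped = line.strip()  # drops the leading tab and surrounding whitespace
        if stripped.startswith(r"\begin{acronym}"):
            inside = True
        elif stripped.startswith(r"\end{acronym}"):
            inside = False
        elif inside:
            match = ACRO_PATTERN.search(stripped)
            if match:
                symbol, short_form, long_form = match.groups()
                result[f"abbreviation{len(result) + 1}"] = {
                    "symbol": symbol,
                    "shortForm": short_form,
                    "longForm": long_form,
                }
    return result


def convert_file(tex_path, json_path):
    """Read a .tex file and write its abbreviations to a JSON file."""
    with open(tex_path, encoding="utf-8") as tex_file:
        abbreviations = parse_acronyms(tex_file.read())
    with open(json_path, "w", encoding="utf-8") as json_file:
        # ensure_ascii=False keeps any non-ASCII characters readable in the file
        json.dump(abbreviations, json_file, indent=4, ensure_ascii=False)
    return abbreviations
```

Calling convert_file("./file.tex", "output.json") would then write the abbreviations to output.json. Lines outside the acronym environment are ignored, so an \acro that appears elsewhere in the document is not picked up.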