Read a file and convert certain lines into the correct form

I have a problem. I am reading a LaTeX file that contains abbreviations, and I only want to extract those abbreviations. Reading them works, but not in the format I expect: I get raw lines such as '\t\\acro{...., whereas I would like each abbreviation stored cleanly as its own entry (see the desired output below). How can I convert the lines into my desired output?

import sys

def getPrice(symbol,
            shortForm,
            longForm):

    abbreviations = []
    with open("./file.tex", encoding="utf-8") as f:
        file = list(f)
    # Collect every raw line between \begin{acronym} and \end{acronym}.
    save = False
    for line in file:
        print("\n"+ line)
        if(line.startswith(r'\end{acronym}')):
            save = False
        if(save):
            abbreviations.append(line)
        if(line.startswith(r'\begin{acronym}')):
            save = True

    print(abbreviations)

if __name__== "__main__":
    getPrice(str(sys.argv[1]),
    str(sys.argv[2]),
    str(sys.argv[3]))


[OUT]
['\t\\acro{knmi}[KNMI]{Koninklijk Nederlands Meteorologisch Instituut}\n', '\t\\acro{test}[TESTERER]{T E SDH SADHU AHENSAD }\n']

file.tex

\chapter*{Short}
\addcontentsline{toc}{chapter}{Short}
\markboth{Short}{Short}
\begin{acronym}[TESTERER]
    \acro{knmi}[KNMI]{Koninklijk Nederlands Meteorologisch Instituut}
    \acro{example}[e.g.]{For example}
\end{acronym}

Desired Output

{
  "abbreviation1": {
      "symbol": "knmi",
      "shortForm": "KNMI",
      "longForm": "Koninklijk Nederlands Meteorologisch Instituut"
   },
  "abbreviation2": {
      "symbol": "example",
      "shortForm": "e.g.",
      "longForm": "For example"
   }
}

Answer

You can use re.findall() to capture all of the abbreviations, then use the json module to dump it out into a file. Your approach could work, but you’d have to do a lot of manual string parsing, which would be a pretty massive headache. (Note that a program that can parse arbitrary LaTeX would need something more powerful than regular expressions; however, since we’re parsing a very small subset of LaTeX, regular expressions will do fine here.)

import re
import json

# Sample input; the raw string keeps the LaTeX backslashes intact.
data = r"""\chapter*{Short}
\addcontentsline{toc}{chapter}{Short}
\markboth{Short}{Short}
\begin{acronym}[TESTERER]
    \acro{knmi}[KNMI]{Koninklijk Nederlands Meteorologisch Instituut}
    \acro{example}[e.g.]{For example}
\end{acronym}"""

# Capture the three arguments of \acro{symbol}[short form]{long form}.
pattern = re.compile(r"\\acro\{(.+)\}\[(.+)\]\{(.+)\}")
regex_result = re.findall(pattern, data)
final_output = {}
for index, (symbol, shortform, longform) in enumerate(regex_result, start=1):
    final_output[f'abbreviation{index}'] = \
        dict(symbol=symbol, shortform=shortform, longform=longform)

with open('output.json', 'w') as output_file:
    json.dump(final_output, output_file, indent=4)

output.json contains the following:

{
    "abbreviation1": {
        "symbol": "knmi",
        "shortform": "KNMI",
        "longform": "Koninklijk Nederlands Meteorologisch Instituut"
    },
    "abbreviation2": {
        "symbol": "example",
        "shortform": "e.g.",
        "longform": "For example"
    }
}
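
If the abbreviations should come straight from file.tex, as in the question, rather than from an inline string, the same idea could be adapted roughly as follows. This is only a sketch under a few assumptions: the names parse_acronyms, convert_file, ACRO_PATTERN and ENV_PATTERN are made up for illustration, the keys use the shortForm/longForm spelling from the question's desired output, and only \acro entries inside a \begin{acronym} ... \end{acronym} block are collected.

```python
import json
import re

# Hypothetical names, for illustration only.
# Negated character classes keep each group inside its own delimiters,
# so two \acro entries on one line are not merged into a single match.
ACRO_PATTERN = re.compile(r"\\acro\{([^}]+)\}\[([^\]]+)\]\{([^}]+)\}")
# Body of an acronym environment; the optional [...] after \begin is skipped.
ENV_PATTERN = re.compile(
    r"\\begin\{acronym\}(?:\[[^\]]*\])?(.*?)\\end\{acronym\}", re.DOTALL
)


def parse_acronyms(tex_source):
    """Return a dict shaped like the desired output in the question."""
    result = {}
    index = 1
    for body in ENV_PATTERN.findall(tex_source):
        for symbol, short_form, long_form in ACRO_PATTERN.findall(body):
            result[f"abbreviation{index}"] = {
                "symbol": symbol,
                "shortForm": short_form,
                "longForm": long_form,
            }
            index += 1
    return result


def convert_file(tex_path, json_path):
    """Read a .tex file, parse its acronyms, and write them as JSON."""
    with open(tex_path, encoding="utf-8") as tex_file:
        parsed = parse_acronyms(tex_file.read())
    with open(json_path, "w", encoding="utf-8") as json_file:
        json.dump(parsed, json_file, indent=4, ensure_ascii=False)
    return parsed
```

Calling convert_file("./file.tex", "output.json") would then write the JSON next to the script. Because the long form is matched with [^}]+, an entry whose long form contains nested braces (for example a \textbf{...} inside it) will not be matched, which is the kind of case the caveat about arbitrary LaTeX refers to.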