I’m extracting data from an API, and one of the fields is a string from which I want to extract multiple substrings (ideally 7). To get those substrings I’m using the index() method.
string = r"""[Summary]
Reason: Not enough information
Improvements_Done: None
Improvements_Planned: Documentation
References_Improvements_Done: None
References_Improvements_Done: None
References_Improvements_Planned: www.link1.com
References_Improvements_Planned: www.link2.com
 *** DEFAULT.....""".replace("\n", "\r\n")
For example:
    imp_done_start = string.index('Improvements_Done: ') + len('Improvements_Done: ')
    imp_done_end = string.index('Improvements_Planned')
    imp_done = string[imp_done_start:imp_done_end]
There could be cases where one or more of these substrings (Reason, Improvements_Done, Improvements_Planned, etc.) are missing from the string. For example, if “Improvements_Planned” is missing, index() raises a ValueError and I can’t get the value for imp_done.
What is the best practice for handling these kinds of cases?
Answer
The best practice depends largely on the format. In most cases, though, you can take a flexible approach and first convert the text into an intermediate representation that is easier to parse and analyze:
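import re

# Lazy key group: split on the first ": " so values may contain colons.
PATTERN = re.compile(r"^(.*?): (.*)$")

def parse(s: str) -> dict[str, str]:
    d = {}
    # Skip the "[Summary]" header and the trailing "*** DEFAULT" line.
    for line in s.splitlines()[1:-1]:
        m = PATTERN.match(line)
        if m is None:
            continue  # not a "key: value" line
        # A repeated key overwrites the earlier value.
        d[m.group(1)] = m.group(2)
    return d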
Usage:
>>> parse(string)
{'Improvements_Done': 'None',
 'Improvements_Planned': 'Documentation',
 'Reason': 'Not enough information',
 'References_Improvements_Done': 'None',
 'References_Improvements_Planned': 'www.link2.com'}
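You can then apply any further rules you need to the resulting dict. A missing field is simply an absent key, so parse(string).get('Improvements_Planned') returns None instead of raising an error.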
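One thing to watch for: the sample string repeats some keys (References_Improvements_Planned appears twice), and a plain dict keeps only the last value. If you need every value, a variant along these lines might help. This is only a sketch: parse_multi, FIELD and the shortened sample string are hypothetical names and data made up for illustration, and the key pattern assumes keys contain no spaces or colons.

```python
import re
from collections import defaultdict

# Hypothetical helper (not part of the answer above): like parse(), but keeps
# every value for a repeated key instead of letting the last one win.
# Assumes a key is a run of characters without whitespace or colons.
FIELD = re.compile(r"^\s*([^:\s]+):\s*(.*?)\s*$")

def parse_multi(s: str) -> dict[str, list[str]]:
    fields = defaultdict(list)
    for line in s.splitlines():
        m = FIELD.match(line)
        if m:  # lines such as "[Summary]" have no "key:" prefix and are skipped
            fields[m.group(1)].append(m.group(2))
    return dict(fields)

# Shortened, made-up sample in the same shape as the string in the question.
sample = (
    "[Summary]\r\n"
    "Reason: Not enough information\r\n"
    "Improvements_Done: None\r\n"
    "References_Improvements_Planned: www.link1.com\r\n"
    "References_Improvements_Planned: www.link2.com\r\n"
    " *** DEFAULT....."
)

result = parse_multi(sample)
print(result["References_Improvements_Planned"])  # ['www.link1.com', 'www.link2.com']
print(result.get("Improvements_Planned", []))     # [] because the field is absent
```

Because each line is matched on its own, a missing field never affects the others; you just get the default you pass to get().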
