I’m extracting data from an API, and one of the fields is a string from which I want to extract multiple substrings (ideally 7). To get those substrings I’m using the index() method.
string = r"""[Summary]
Reason: Not enough information
Improvements_Done: None
Improvements_Planned: Documentation
References_Improvements_Done: None
References_Improvements_Done: None
References_Improvements_Planned: www.link1.com
References_Improvements_Planned: www.link2.com
*** DEFAULT.....""".replace("\n", "\r\n")

Ex:

imp_done_start = string.index('Improvements_Done: ') + len('Improvements_Done: ')
imp_done_end = string.index('Improvements_Planned')
imp_done = string[imp_done_start:imp_done_end]
There could be cases where one or more of these substrings (Reason, Improvements_Done, Improvements_Planned, etc.) is missing from the string. For example, if “Improvements_Planned” is missing, the index() call for it raises a ValueError and I can’t get the value for imp_done.
What is the best practice for handling these kinds of cases?
Answer
The best practice depends largely on the format. In most cases, however, you can take a flexible approach and convert the string to an intermediate representation that is easier to parse and analyze:
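import re

def parse(s: str) -> dict[str, str]:
    # Match "key: value"; the lazy key group stops at the first ": ".
    pattern = r"^(.*?): (.*)$"
    d = {}
    lines = s.splitlines()
    # Skip the first ("[Summary]") and last ("*** DEFAULT.....") lines.
    for line in lines[1:-1]:
        m = re.match(pattern, line)
        if m is None:
            continue  # not a "key: value" line
        d[m.group(1)] = m.group(2)  # a repeated key keeps its last value
    return d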
Usage:
>>> parse(string)
{'Improvements_Done': 'None',
 'Improvements_Planned': 'Documentation',
 'Reason': 'Not enough information',
 'References_Improvements_Done': 'None',
 'References_Improvements_Planned': 'www.link2.com'}
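Note that in the output above References_Improvements_Planned holds only www.link2.com: when a key appears on more than one line, the later value overwrites the earlier one. If every value matters, one option is to collect the values in lists. The following is only a sketch of that idea; parse_multi is a hypothetical name, and the sample text is an abbreviated version of the string from the question.

```python
import re
from collections import defaultdict

# Hypothetical variant of parse(): keeps every value for a repeated key.
LINE_RE = re.compile(r"^(.*?): (.*)$")

def parse_multi(s: str) -> dict[str, list[str]]:
    d = defaultdict(list)
    # Like parse(), skip the first ("[Summary]") and last ("*** DEFAULT.....") lines.
    for line in s.splitlines()[1:-1]:
        m = LINE_RE.match(line)
        if m is not None:
            d[m.group(1)].append(m.group(2))
    return dict(d)

# Abbreviated sample in the same layout as the question's string.
sample = "\r\n".join([
    "[Summary]",
    "Reason: Not enough information",
    "References_Improvements_Planned: www.link1.com",
    "References_Improvements_Planned: www.link2.com",
    "*** DEFAULT.....",
])
result = parse_multi(sample)
print(result["References_Improvements_Planned"])  # ['www.link1.com', 'www.link2.com']
```

Every value is then a list, so single-valued fields such as Reason need a [0] lookup; whether that trade-off is worth it depends on how often keys repeat in the real data.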
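Now analyse the result further, applying whatever additional rules you need. A field that is missing from the string is simply absent from the dictionary, so it can be handled with a membership test or dict.get.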
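One such rule, sketched here under the assumption that a missing field should simply become None, is to look up each expected field with dict.get rather than indexing. The EXPECTED_FIELDS list and the extract_fields helper are hypothetical names, the field names are the ones that appear in the question's sample string, and a minimal parse is repeated so that the snippet runs on its own.

```python
import re
from typing import Optional

def parse(s: str) -> dict[str, str]:
    # Same idea as the parse() above, repeated so this snippet is self-contained.
    d = {}
    for line in s.splitlines()[1:-1]:
        m = re.match(r"^(.*?): (.*)$", line)
        if m is not None:
            d[m.group(1)] = m.group(2)
    return d

# Field names taken from the question's sample string (assumed to be the full set).
EXPECTED_FIELDS = [
    "Reason",
    "Improvements_Done",
    "Improvements_Planned",
    "References_Improvements_Done",
    "References_Improvements_Planned",
]

def extract_fields(s: str) -> dict[str, Optional[str]]:
    parsed = parse(s)
    # dict.get returns None for an absent key instead of raising an exception.
    return {name: parsed.get(name) for name in EXPECTED_FIELDS}

# Sample in which "Improvements_Planned" is missing.
sample = "\n".join([
    "[Summary]",
    "Reason: Not enough information",
    "Improvements_Done: None",
    "*** DEFAULT.....",
])
fields = extract_fields(sample)
print(repr(fields["Improvements_Done"]))     # 'None' (the text from the input)
print(repr(fields["Improvements_Planned"]))  # None (the field was missing)
```

Because the input itself uses the text "None" as a value, the string 'None' and the Python object None mean different things here; a different default can be passed as the second argument to dict.get if that distinction gets confusing.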