Best practice when substring is missing from string

I’m extracting data from an API, and one of the fields is a string from which I want to extract multiple substrings (ideally 7). To get those substrings, I’m using the index() method.

string = r"""[Summary]
Reason: Not enough information
Improvements_Done: None
Improvements_Planned: Documentation
References_Improvements_Done: None
References_Improvements_Done: None
References_Improvements_Planned: www.link1.com
References_Improvements_Planned: www.link2.com
 *** DEFAULT.....""".replace("\n", "\r\n")

Ex:
    imp_done_start = string.index('Improvements_Done: ') + len('Improvements_Done: ')
    imp_done_end = string.index('Improvements_Planned')
    imp_done = string[imp_done_start:imp_done_end]

One or more of these substrings (Reason, Improvements_Done, Improvements_Planned, etc.) could be missing from the string. For example, if “Improvements_Planned” is missing, index() raises a ValueError and I can’t get the value for imp_done.

What is the best practice for handling these kinds of cases?

Answer

The best practice depends largely on the format. In most cases, though, you can take a flexible approach and convert the string into an intermediate representation that is easier to parse and analyze:

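import re

# "key: value" lines; the non-greedy key stops at the first ": ".
PATTERN = re.compile(r"^(.*?): (.*)$")

def parse(s: str) -> dict[str, str]:
    d = {}
    # Skip the "[Summary]" header and the trailing "*** DEFAULT" line.
    for line in s.splitlines()[1:-1]:
        m = PATTERN.match(line)
        if m is None:
            continue  # not a "key: value" line
        # A repeated key overwrites the earlier value (last one wins).
        d[m.group(1)] = m.group(2)

    return d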

Usage:

>>> parse(string)
{'Improvements_Done': 'None',
 'Improvements_Planned': 'Documentation',
 'Reason': 'Not enough information',
 'References_Improvements_Done': 'None',
 'References_Improvements_Planned': 'www.link2.com'}
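
Note that a repeated key such as References_Improvements_Planned keeps only its last value in the output above. If every value matters, one option is to collect the values into lists. This is only a sketch: parse_multi is a hypothetical name rather than part of the answer's code, and it assumes the same "Key: value" line layout as the question's string.

```python
import re
from collections import defaultdict

# Hypothetical variant of parse(): keeps every value for repeated keys.
KEY_VALUE = re.compile(r"^(.*?): (.*)$")

def parse_multi(s: str) -> dict[str, list[str]]:
    d = defaultdict(list)
    for line in s.splitlines():
        m = KEY_VALUE.match(line.strip())
        if m is None:
            continue  # header, footer, or any line without "key: value"
        d[m.group(1)].append(m.group(2))
    return dict(d)

# Shortened sample in the same shape as the question's string.
sample = (
    "[Summary]\r\n"
    "Reason: Not enough information\r\n"
    "References_Improvements_Planned: www.link1.com\r\n"
    "References_Improvements_Planned: www.link2.com\r\n"
    " *** DEFAULT....."
)
result = parse_multi(sample)
```

Here result maps References_Improvements_Planned to both links, and a key that never appears in the string is simply absent from the dictionary.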

Now analyse the result further, applying whatever additional rules you need.
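
For the missing-field case from the question, one such rule could be to look up each expected field with dict.get(), which returns a default instead of raising. A minimal sketch, assuming the field names from the question and None as the "missing" marker (both choices are illustrative):

```python
# Assumed result of parsing a string where Improvements_Planned is missing.
parsed = {"Reason": "Not enough information", "Improvements_Done": "None"}

# Field names come from the question; the None default is an assumption.
EXPECTED_FIELDS = ["Reason", "Improvements_Done", "Improvements_Planned"]

# dict.get() returns None for absent keys instead of raising KeyError.
record = {field: parsed.get(field) for field in EXPECTED_FIELDS}
missing = [field for field, value in record.items() if value is None]
print(missing)  # ['Improvements_Planned']
```

Because the lookup never raises, a missing Improvements_Planned no longer prevents reading Improvements_Done, which was the problem with the index()-based slicing.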

User contributions licensed under: CC BY-SA