Match open and close brackets in garbled string

How would one go about parsing a random string (which contains all sorts of characters) into something coherent?

For example, given string = '{"letters" : '321"}{}"'}{'{}{{}"': "stack{}}{"}', I’d like to separate it into {"letters" : '321"}{}"'} and {'{}{{}"': "stack{}}{"}.

I’ve tried iterating through the string and counting each open bracket { and subtracting when a close bracket } shows up. However, this doesn’t work because there are instances where the brackets are inside "" or ''. My code was something along the lines of:

bracket_counter = 0
list1 = []  # list1 is where we build up the current string
list2 = []  # list2 is where we keep the strings after building
for c in string:
    list1.append(c)
    if c == "{":
        bracket_counter += 1
    elif c == "}":
        bracket_counter -= 1
        if bracket_counter == 0:
            list2.append("".join(list1))
            list1 = []

Using this code, the first string that is considered “complete” is {"letters" : '321"} even though it should be {"letters" : '321"}{}"'}.

I’m pretty unfamiliar with regex, so I’m not sure if this is something I should be using it for. Any help is appreciated.

Thanks!

Answer

You’d use a regular expression to tokenize your string, and then you’d iterate over these tokens. For example:

import re

SQ = r"'[^']*'"   # single-quoted string
DQ = r'"[^"]*"'   # double-quoted string
OPS = r'[{}:]'    # operators
WS = r'\s+'       # whitespace
# add more token types as needed...
tokens = '(?:' + '|'.join([OPS, SQ, DQ, WS]) + ')'
pattern = re.compile(tokens, re.DOTALL)

def tokenize(source):
    start = 0
    end = len(source)
    while start < end:
        match = pattern.match(source, start)
        if not match:
            raise ValueError('Invalid syntax at character %d' % start)
        yield match.group(0)
        start = match.end()

Then you can run your for loop on these tokens (a concrete sketch follows the token listing below):

for token in tokenize(string):
    ...

The tokens for your example input are:

>>> for token in tokenize(string):
...     print(repr(token))
'{'
'"letters"'
' '
':'
' '
'\'321"}{}"\''
'}'
'{'
'\'{}{{}"\''
':'
' '
'"stack{}}{"'
'}'

And as you can see, from this you can count the '{' and '}' correctly.
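
For instance, here is a minimal sketch of the bracket-counting loop from your question, rewritten to run on tokens instead of characters (split_objects is just a name I made up for illustration):

def split_objects(source):
    parts = []    # completed top-level objects
    current = []  # tokens of the object currently being built
    depth = 0
    for token in tokenize(source):
        current.append(token)
        # Quoted strings arrive as single tokens, so the brackets
        # inside them never reach these comparisons.
        if token == '{':
            depth += 1
        elif token == '}':
            depth -= 1
            if depth == 0:
                parts.append(''.join(current))
                current = []
    return parts

split_objects(string) then returns exactly your two pieces, {"letters" : '321"}{}"'} and {'{}{{}"': "stack{}}{"}.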


Notice that the regular expression above has no notion of backslash-escaping the ' or " inside strings; if you want a backslash-escaped quote to be tokenized as part of the string instead of ending it, you can change the SQ and DQ regexes to

SQ = r"'(?:[^\\']|\\.)*'"
DQ = r'"(?:[^\\"]|\\.)*"'
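
As a quick sanity check (my own example, not from the original answer), the escape-aware version no longer stops at a backslash-escaped quote:

esc_sq = re.compile(r"'(?:[^\\']|\\.)*'")
print(esc_sq.match(r"'it\'s'").group(0))  # prints 'it\'s', the whole token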

Also, if you want any other characters to be allowed but not handled specially, you can add the

NON_SPECIAL = r"""[^'"]"""

as the last branch of the regex (and recompile the pattern):

tokens = '(?:' + '|'.join([OPS, SQ, DQ, WS, NON_SPECIAL]) + ')'
pattern = re.compile(tokens, re.DOTALL)
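
With that branch in place, characters outside any quoted string become one-character tokens instead of raising ValueError. A small assumed example:

>>> list(tokenize('{a: 1}'))
['{', 'a', ':', ' ', '1', '}']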