how would one go about parsing a random string (which contains all sorts of characters) into something coherent?
For example, string = '{"letters" : '321"}{}"'}{'{}{{}"': "stack{}}{"}'
I’d like to separate it into:
{"letters" : '321"}{}"'} and {'{}{{}"': "stack{}}{"}
I’ve tried iterating through the string, incrementing a counter on each open bracket { and decrementing it on each close bracket }. However, this doesn’t work because some of the brackets appear inside "" or '' quoted strings.
my code was something along the lines of:
bracket_counter = 0
list1 = []  # list1 is where we build up the current string
list2 = []  # list2 is where we keep the completed strings
for c in string:
    list1.append(c)
    if c == "{":
        bracket_counter += 1
    elif c == "}":
        bracket_counter -= 1
        if bracket_counter == 0:
            list2.append("".join(list1))
            list1 = []
Using this code, the first string that is considered “complete” is {"letters" : '321"} even though it should be {"letters" : '321"}{}"'}
I’m pretty unfamiliar with regex, so I’m not sure if this is something I should be using it for. Any help is appreciated.
Thanks!
Answer
You’d use a regular expression to tokenize your string, and then you’d iterate over these tokens. For example:
import re

SQ = r"'[^']*'"   # single-quoted string
DQ = r'"[^"]*"'   # double-quoted string
OPS = r'[{}:]'    # operators
WS = r'\s+'       # whitespace
# add more types as needed...

tokens = '(?:' + '|'.join([OPS, SQ, DQ, WS]) + ')'
pattern = re.compile(tokens, re.DOTALL)
def tokenize(source):
    start = 0
    end = len(source)
    while start < end:
        match = pattern.match(source, start)
        if match:
            yield match.group(0)
        else:
            raise ValueError('Invalid syntax at character %d' % start)
        start = match.end()
Then you can run your for loop on these tokens:
for token in tokenize(string):
    ...
The tokens in case of your example input are:
>>> for token in tokenize(string):
...     print(token)
'{'
'"letters"'
' '
':'
' '
''321"}{}"''
'}'
'{'
''{}{{}"''
':'
' '
'"stack{}}{"'
'}'
And as you can see, from this you can count the '{' and '}' correctly.
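Putting it together, the bracket-counting loop from the question works as soon as it runs over tokens instead of raw characters, because a quoted string arrives as a single token and its inner braces never touch the counter. A minimal sketch (the split_objects helper is my own naming, not part of the original code):

```python
import re

# token patterns, as defined above
SQ = r"'[^']*'"   # single-quoted string
DQ = r'"[^"]*"'   # double-quoted string
OPS = r'[{}:]'    # operators
WS = r'\s+'       # whitespace
pattern = re.compile('(?:' + '|'.join([OPS, SQ, DQ, WS]) + ')', re.DOTALL)

def tokenize(source):
    start = 0
    while start < len(source):
        match = pattern.match(source, start)
        if not match:
            raise ValueError('Invalid syntax at character %d' % start)
        yield match.group(0)
        start = match.end()

def split_objects(source):
    # emit a chunk each time the brace depth returns to zero
    parts, current, depth = [], [], 0
    for token in tokenize(source):
        current.append(token)
        if token == '{':
            depth += 1
        elif token == '}':
            depth -= 1
            if depth == 0:
                parts.append(''.join(current))
                current = []
    return parts

string = '''{"letters" : '321"}{}"'}{'{}{{}"': "stack{}}{"}'''
print(split_objects(string))
```

This prints the two desired pieces from the question, each as one string.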
Notice that the regular expressions above have no notion of escaping ' or " inside the strings; if you want a backslash-escaped quote to be tokenized as part of the string rather than ending it, you can change the SQ and DQ regexes to
SQ = r"'(?:[^\\']|\\.)*'"
DQ = r'"(?:[^\\"]|\\.)*"'
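For instance, with the escape-aware patterns (restated here with the backslashes spelled out), a token like 'it\'s' is consumed whole, since the \' is matched by the \\. branch instead of terminating the string. A sketch reusing the same tokenize loop:

```python
import re

SQ = r"'(?:[^\\']|\\.)*'"   # single-quoted, backslash escapes allowed
DQ = r'"(?:[^\\"]|\\.)*"'   # double-quoted, backslash escapes allowed
OPS = r'[{}:]'
WS = r'\s+'
pattern = re.compile('(?:' + '|'.join([OPS, SQ, DQ, WS]) + ')', re.DOTALL)

def tokenize(source):
    start = 0
    while start < len(source):
        match = pattern.match(source, start)
        if not match:
            raise ValueError('Invalid syntax at character %d' % start)
        yield match.group(0)
        start = match.end()

# the \' inside the single-quoted token no longer terminates it
print(list(tokenize(r"{'it\'s a }': 'x'}")))
```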
Also, if you want any other characters to be also allowed but not handled specially, you can add the
NON_SPECIAL = r'[^\'"]'
as the last branch to the regex:
tokens = '(?:' + '|'.join([OPS, SQ, DQ, WS, NON_SPECIAL]) + ')'
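With that extra branch, stray characters such as digits or bare identifiers come through one character at a time instead of raising ValueError. A sketch using the same tokenize loop as above:

```python
import re

SQ = r"'[^']*'"
DQ = r'"[^"]*"'
OPS = r'[{}:]'
WS = r'\s+'
NON_SPECIAL = r'[^\'"]'   # any single character that is not a quote
pattern = re.compile('(?:' + '|'.join([OPS, SQ, DQ, WS, NON_SPECIAL]) + ')',
                     re.DOTALL)

def tokenize(source):
    start = 0
    while start < len(source):
        match = pattern.match(source, start)
        if not match:
            raise ValueError('Invalid syntax at character %d' % start)
        yield match.group(0)
        start = match.end()

# unquoted characters each become their own one-character token
print(list(tokenize('{a: 12}')))
```

NON_SPECIAL must stay the last branch, so that the more specific patterns (operators, quoted strings, whitespace) get first chance to match.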