How would one go about parsing a random string (which contains all sorts of characters) into something coherent?
For example, string = '{"letters" : '321"}{}"'}{'{}{{}"': "stack{}}{"}'
I’d like to separate it into:
{"letters" : '321"}{}"'}
and {'{}{{}"': "stack{}}{"}
I’ve tried iterating through the string, counting each open bracket { and subtracting when a close bracket } shows up. However, this doesn’t work, because there are instances where the brackets are inside "" or ''.
My code was something along the lines of:

```python
bracket_counter = 0
list1 = []  # list1 is where we build up the current string
list2 = []  # list2 is where we keep the strings after building

for c in string:
    list1.append(c)
    if c == "{":
        bracket_counter += 1
    elif c == "}":
        bracket_counter -= 1
        if bracket_counter == 0:
            list2.append("".join(list1))
            list1 = []
```
Using this code, the first string that is considered “complete” is

{"letters" : '321"}

even though it should be

{"letters" : '321"}{}"'}
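For what it’s worth, the early cut-off is easy to reproduce with a few lines on just the first object (a sketch; `s` is the first piece rewritten as a valid Python literal):

```python
# First object of the example, with quotes escaped so it is a valid Python literal
s = '{"letters" : \'321"}{}"\'}'

depth = 0
for i, c in enumerate(s):
    if c == '{':
        depth += 1
    elif c == '}':
        depth -= 1
    if depth == 0:
        break

# The counter hits zero at the } inside the single-quoted '321"}{}"',
# long before the real closing brace:
print(s[:i + 1])  # {"letters" : '321"}
```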
I’m pretty unfamiliar with regex, so I’m not sure whether this is something I should be using it for. Any help is appreciated.
Thanks!
Answer
You’d use a regular expression to tokenize your string, and then you’d iterate over these tokens. For example:
```python
import re

SQ = r"'[^']*'"   # single-quoted string
DQ = r'"[^"]*"'   # double-quoted string
OPS = r'[{}:]'    # operators
WS = r'\s+'       # whitespace
# add more types as needed...

tokens = '(?:' + '|'.join([OPS, SQ, DQ, WS]) + ')'
pattern = re.compile(tokens, re.DOTALL)

def tokenize(source):
    start = 0
    end = len(source)
    while start < end:
        match = pattern.match(source, start)
        if match:
            yield match.group(0)
        else:
            raise ValueError('Invalid syntax at character %d' % start)
        start = match.end()
```
Then you can run your for loop on these tokens:

```python
for token in tokenize(string):
    ...
```
The tokens, in the case of your example input, are:
```python
>>> for token in tokenize(string):
...     print(token)
{
"letters"
 
:
 
'321"}{}"'
}
{
'{}{{}"'
:
 
"stack{}}{"
}
```
And as you can see, from this you can count the '{' and '}' correctly.
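To make the counting concrete, here’s a self-contained sketch that combines the tokenizer with the bracket counter (`split_objects` is a hypothetical helper name, not from the original answer). Because quoted strings arrive as single tokens, the braces inside them never reach the counter:

```python
import re

SQ = r"'[^']*'"   # single-quoted string
DQ = r'"[^"]*"'   # double-quoted string
OPS = r'[{}:]'    # operators
WS = r'\s+'       # whitespace
pattern = re.compile('(?:' + '|'.join([OPS, SQ, DQ, WS]) + ')', re.DOTALL)

def tokenize(source):
    start = 0
    while start < len(source):
        match = pattern.match(source, start)
        if not match:
            raise ValueError('Invalid syntax at character %d' % start)
        yield match.group(0)
        start = match.end()

def split_objects(source):
    """Split a run of {...}{...} objects by counting brace *tokens*."""
    pieces, current, depth = [], [], 0
    for token in tokenize(source):
        current.append(token)
        if token == '{':
            depth += 1
        elif token == '}':
            depth -= 1
            if depth == 0:
                pieces.append(''.join(current))
                current = []
    return pieces

# The example input, escaped so it is a valid Python literal:
string = '{"letters" : \'321"}{}"\'}{\'{}{{}"\': "stack{}}{"}'
for piece in split_objects(string):
    print(piece)
# {"letters" : '321"}{}"'}
# {'{}{{}"': "stack{}}{"}
```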
Notice that the regular expressions above have no notion of escaping the ' or " within the strings; if you want escaped quotes inside a string to be tokenized properly, you can change the SQ and DQ regexes to:
```python
SQ = r"'(?:[^\\']|\\.)*'"
DQ = r'"(?:[^\\"]|\\.)*"'
```
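As a quick sanity check (assuming the intended patterns are the escape-aware ones, with the backslashes written out), an escaped quote inside a string no longer terminates the match:

```python
import re

SQ = r"'(?:[^\\']|\\.)*'"   # escape-aware single-quoted string
DQ = r'"(?:[^\\"]|\\.)*"'   # escape-aware double-quoted string

# The \' inside the string is consumed by the \\. branch instead of
# ending the token:
m = re.match(SQ, r"'it\'s fine' leftover")
print(m.group(0))  # 'it\'s fine'
```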
Also, if you want any other characters to be allowed but not handled specially, you can add a

```python
NON_SPECIAL = r'[^\'"]'
```

branch as the last alternative in the regex:
```python
tokens = '(?:' + '|'.join([OPS, SQ, DQ, WS, NON_SPECIAL]) + ')'
```
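With the catch-all branch in place, bare (unquoted) characters each come out as their own one-character token, since NON_SPECIAL matches a single character and only fires when none of the earlier branches do. A small self-contained sketch:

```python
import re

SQ = r"'[^']*'"        # single-quoted string
DQ = r'"[^"]*"'        # double-quoted string
OPS = r'[{}:]'         # operators
WS = r'\s+'            # whitespace
NON_SPECIAL = r'[^\'"]'  # any other single character (tried last)
pattern = re.compile('(?:' + '|'.join([OPS, SQ, DQ, WS, NON_SPECIAL]) + ')',
                     re.DOTALL)

def tokenize(source):
    start = 0
    while start < len(source):
        match = pattern.match(source, start)
        if not match:
            raise ValueError('Invalid syntax at character %d' % start)
        yield match.group(0)
        start = match.end()

# The bare word "key" is emitted one character at a time:
print(list(tokenize('{key: "value"}')))
# ['{', 'k', 'e', 'y', ':', ' ', '"value"', '}']
```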