how would one go about parsing a random string (which contains all sorts of characters) into something coherent?
For example, string = '{"letters" : '321"}{}"'}{'{}{{}"': "stack{}}{"}'
I’d like to separate it into:
{"letters" : '321"}{}"'} and {'{}{{}"': "stack{}}{"}
I’ve tried iterating through the string, incrementing a counter on each open bracket { and decrementing it on each close bracket }. However, this doesn’t work because some of the brackets appear inside "" or '' quoted strings.
my code was something along the lines of:
bracket_counter = 0
list1 = []  # list1 is where we build up the current string
list2 = []  # list2 is where we keep the completed strings
for c in string:
    list1.append(c)
    if c == "{":
        bracket_counter += 1
    elif c == "}":
        bracket_counter -= 1
        if bracket_counter == 0:
            list2.append("".join(list1))
            list1 = []
Using this code, the first string that is considered “complete” is {"letters" : '321"} even though it should be {"letters" : '321"}{}"'}
I’m pretty unfamiliar with regex, so I’m not sure if this is something I should be using it for. Any help is appreciated.
Thanks!
Answer
You’d use a regular expression to tokenize your string, and then you’d iterate over these tokens. For example:
import re

SQ = r"'[^']*'"   # single-quoted string
DQ = r'"[^"]*"'   # double-quoted string
OPS = r'[{}:]'    # operators
WS = r'\s+'       # whitespace
# add more types as needed...

tokens = '(?:' + '|'.join([OPS, SQ, DQ, WS]) + ')'
pattern = re.compile(tokens, re.DOTALL)
def tokenize(source):
    start = 0
    end = len(source)
    while start < end:
        match = pattern.match(source, start)
        if match:
            yield match.group(0)
        else:
            raise ValueError('Invalid syntax at character %d' % start)
        start = match.end()
Then you can run your for loop on these tokens:
for token in tokenize(string):
    ...
The tokens in case of your example input are:
>>> for token in tokenize(string):
...     print(token)
'{'
'"letters"'
' '
':'
' '
''321"}{}"''
'}'
'{'
''{}{{}"''
':'
' '
'"stack{}}{"'
'}'
And as you can see, from this you can count the '{' and '}' correctly.
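Putting it together, the bracket-counting loop from the question works as soon as it runs over tokens instead of raw characters, because a quoted string arrives as a single token and its inner braces never touch the counter. A minimal sketch (the split_objects helper is my own naming, not part of the original code):

```python
import re

# token patterns, as defined above
SQ = r"'[^']*'"   # single-quoted string
DQ = r'"[^"]*"'   # double-quoted string
OPS = r'[{}:]'    # operators
WS = r'\s+'       # whitespace
pattern = re.compile('(?:' + '|'.join([OPS, SQ, DQ, WS]) + ')', re.DOTALL)

def tokenize(source):
    start = 0
    while start < len(source):
        match = pattern.match(source, start)
        if not match:
            raise ValueError('Invalid syntax at character %d' % start)
        yield match.group(0)
        start = match.end()

def split_objects(source):
    # emit a chunk each time the brace depth returns to zero
    parts, current, depth = [], [], 0
    for token in tokenize(source):
        current.append(token)
        if token == '{':
            depth += 1
        elif token == '}':
            depth -= 1
            if depth == 0:
                parts.append(''.join(current))
                current = []
    return parts

string = '''{"letters" : '321"}{}"'}{'{}{{}"': "stack{}}{"}'''
print(split_objects(string))
```

This prints the two desired pieces from the question, each as one string.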
Notice that the regular expressions above have no notion of escaping ' or " inside the strings; if you want a backslash-escaped quote to be tokenized as part of the string rather than ending it, you can change the SQ and DQ regexes to
SQ = r"'(?:[^\\']|\\.)*'"
DQ = r'"(?:[^\\"]|\\.)*"'
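For instance, with the escape-aware patterns (restated here with the backslashes spelled out), a token like 'it\'s' is consumed whole, since the \' is matched by the \\. branch instead of terminating the string. A sketch reusing the same tokenize loop:

```python
import re

SQ = r"'(?:[^\\']|\\.)*'"   # single-quoted, backslash escapes allowed
DQ = r'"(?:[^\\"]|\\.)*"'   # double-quoted, backslash escapes allowed
OPS = r'[{}:]'
WS = r'\s+'
pattern = re.compile('(?:' + '|'.join([OPS, SQ, DQ, WS]) + ')', re.DOTALL)

def tokenize(source):
    start = 0
    while start < len(source):
        match = pattern.match(source, start)
        if not match:
            raise ValueError('Invalid syntax at character %d' % start)
        yield match.group(0)
        start = match.end()

# the \' inside the single-quoted token no longer terminates it
print(list(tokenize(r"{'it\'s a }': 'x'}")))
```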
Also, if you want any other characters to be also allowed but not handled specially, you can add the
NON_SPECIAL = r'[^\'"]'
as the last branch to the regex:
tokens = '(?:' + '|'.join([OPS, SQ, DQ, WS, NON_SPECIAL]) + ')'
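With that extra branch, stray characters such as digits or bare identifiers come through one character at a time instead of raising ValueError. A sketch using the same tokenize loop as above:

```python
import re

SQ = r"'[^']*'"
DQ = r'"[^"]*"'
OPS = r'[{}:]'
WS = r'\s+'
NON_SPECIAL = r'[^\'"]'   # any single character that is not a quote
pattern = re.compile('(?:' + '|'.join([OPS, SQ, DQ, WS, NON_SPECIAL]) + ')',
                     re.DOTALL)

def tokenize(source):
    start = 0
    while start < len(source):
        match = pattern.match(source, start)
        if not match:
            raise ValueError('Invalid syntax at character %d' % start)
        yield match.group(0)
        start = match.end()

# unquoted characters each become their own one-character token
print(list(tokenize('{a: 12}')))
```

NON_SPECIAL must stay the last branch, so that the more specific patterns (operators, quoted strings, whitespace) get first chance to match.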