Following up from an earlier version of this question asked here.
I have a string of the form —
test = "<bos> <start> some fruits: <mid> apple, oranges <mid> also pineapple <start> some animals: <mid> dogs, cats <eos>"
which needs to be converted to a dictionary (<str>: [List]) of the form:
{"some fruits:": ["apple, oranges", "also pineapple"], "some animals:": ["dogs, cats"]}
Everything between two <mid> tags is a single string, whereas multiple <mid> tags followed by <start> mean different strings.
Currently, my regex (from the post linked above) looks like this
res = re.finditer(r'<start>\s(\w+)\s<mid>\s(\w+(?:\s<mid>\s\w+)*)', test)
which can then be iterated over to create a dictionary —
test_dict = {}
for match in res:
    test_dict[match.group(1)] = match.group(2).split(' <mid> ')
However, I am unable to capture multiple words between the <start>/<mid>/<mid> tags (i.e. text separated by whitespace, commas, etc.).
How can this regex be changed to capture everything between multiple <> tags?
Answer
You could use re.findall:
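data = {}
# Each match is a (tag, text-after-tag) pair; the text runs up to the next '<'.
for m in re.findall(r'(<\w+>)\s+([^<]+)', test):
    if m[0] == '<start>':
        # Start a new key and remember its list for the <mid> entries that follow.
        l = data.setdefault(m[1].strip(), [])
    elif m[0] == '<mid>':
        l.append(m[1].strip())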
Output:
>>> data
{'some fruits:': ['apple, oranges', 'also pineapple'], 'some animals:': ['dogs, cats']}
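For reuse, the same idea can be wrapped in a function. This is only a sketch under a few assumptions: the name parse_tagged is made up for illustration, a <mid> that appears before any <start> is silently skipped (the loop above would raise a NameError in that case), and the text between tags never contains a literal '<', since the pattern stops at the next '<'.

```python
import re

def parse_tagged(text):
    """Group the text after each <mid> under the preceding <start> heading."""
    result = {}
    current = None  # list belonging to the most recent <start> heading
    # findall returns (tag, body) pairs; body is the text up to the next '<'.
    for tag, body in re.findall(r'(<\w+>)\s+([^<]+)', text):
        body = body.strip()
        if tag == '<start>':
            current = result.setdefault(body, [])
        elif tag == '<mid>' and current is not None:
            current.append(body)
    return result

test = ("<bos> <start> some fruits: <mid> apple, oranges <mid> also pineapple "
        "<start> some animals: <mid> dogs, cats <eos>")
print(parse_tagged(test))
```

Tags such as <bos> and <eos> drop out on their own: <bos> is followed directly by another tag and <eos> by the end of the string, so [^<]+ has nothing to match after either of them.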