Capture sequence of words separated by whitespace through existing regex

Following up from an earlier version of this question asked here.

I have a string of the form:

test = "<bos> <start> some fruits: <mid> apple, oranges <mid> also pineapple <start> some animals: <mid> dogs, cats <eos>"

which needs to be converted to a dictionary (<str>:[List]) of the form:

{"some fruits:": ["apple, oranges", "also pineapple"], "some animals:": ["dogs, cats"]}

Everything between two <mid> tags is a single string, whereas multiple <mid> tags followed by <start> mean different strings.

Currently, my regex (from the post linked above) looks like this:

res = re.finditer(r'<start>\s(\w+)\s<mid>\s(\w+(?:\s<mid>\s\w+)*)', test)

The matches can then be iterated over to create a dictionary:

test_dict = {}
for match in res:
    test_dict[match.group(1)] = match.group(2).split(' <mid> ')

However, I am unable to capture multiple words between the <start> and <mid> tags (i.e. text separated by whitespace, commas, etc.), because \w+ only matches a single run of word characters.

How can this regex be changed to capture everything between the tags?

Answer

You could use re.findall:

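import re

data = {}
# Each match is a (tag, text) pair; [^<]+ takes everything up to the next tag.
for tag, text in re.findall(r'(<\w+>)\s+([^<]+)', test):
    if tag == '<start>':
        # A <start> tag opens a new key with an empty list of values.
        values = data.setdefault(text.strip(), [])
    elif tag == '<mid>':
        # A <mid> tag adds one value to the most recent key.
        values.append(text.strip())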

Output:

>>> data
{'some fruits:': ['apple, oranges', 'also pineapple'],
 'some animals:': ['dogs, cats']}
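
If you would rather keep the finditer structure from the question, one possible variant is to replace each \w+ with [^<]+, so that a key or value can contain spaces, commas and colons. This is only a sketch: it assumes the text between tags never contains a literal < character, and the names pattern and values are mine, not from the question.

```python
import re

test = "<bos> <start> some fruits: <mid> apple, oranges <mid> also pineapple <start> some animals: <mid> dogs, cats <eos>"

# Assumption: no literal '<' appears inside the text between tags.
# Group 1 is the key after <start>; group 2 is every <mid> section
# that follows it, up to the next tag that is not <mid>.
pattern = re.compile(r'<start>\s*([^<]+?)\s*<mid>\s*([^<]+(?:<mid>[^<]+)*)')

test_dict = {}
for match in pattern.finditer(test):
    # Split group 2 on the tag itself and trim the surrounding whitespace.
    values = [v.strip() for v in match.group(2).split('<mid>')]
    test_dict[match.group(1)] = values

print(test_dict)
```

Splitting on '<mid>' and stripping each piece avoids depending on exactly one space around the tag, which the original split(' <mid> ') required.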

Answer