Sample text:
This is HeaderA
Line 1
Line 2
Line 3
Line 4
Line 5
This is HeaderB
Line 1
Line 2
Intended result:
HeaderA1 HeaderA2 HeaderA3 HeaderA4 HeaderA5 HeaderB1, HeaderB2
Regex Attempts:
(?:^This is (?P<H>HeaderB)\s) (Line (?P<L>\d)\s)*?
- Matches only the Header ‘H’ and 1st ‘L’ Line
(?:^This is (?P<H>HeaderB)\s)? (Line (?P<L>\d)\s)*?
- Manages to match multiple ‘L’ lines; however, only the first two lines belong to the same match, and the subsequent ‘L’ lines do not reference the Header capture group.
I tried other attempts to adjust the regex but ended up screwing up the expression. I have limited experience with regex, so I am not entirely sure if it is possible to get the desired output.
Answer
A mix of regex substitution and string formatting with str.format.
It is assumed that each Header is always followed by at least one "Line i" entry.
import re

text = """This is HeaderA
Line 1
Line 2
Line 3
Line 4
Line 5
This is HeaderB
Line 1
Line 2"""

ordered_matches = []  # global

def custom_match(m, all_matches=ordered_matches):
    p = m.group(0)
    if p.isdigit():
        all_matches[-1] += [p]
    else:
        all_matches += [[p]]
    return ''  # doesn't matter

r = re.sub(r'([A-Z0-9]+)$', custom_match, text, flags=re.M)

for m in ordered_matches:
    print(('Header{}{{}} '.format(m[0]) * (len(m)-1)).format(*m[1:]))
Output
HeaderA1 HeaderA2 HeaderA3 HeaderA4 HeaderA5
HeaderB1 HeaderB2
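As for why a single pattern is hard here: Python's re module keeps only the last capture of a repeated group, so a pattern like (Line (?P<L>\d)\s)* cannot return every line number on its own. A possible alternative to the re.sub callback is a two-pass approach: first match each header together with its block of lines, then extract the numbers from that block. This is only a sketch; the names block_re and expand_headers are made up for illustration, and it assumes the sample text is laid out one entry per line.

```python
import re

text = """This is HeaderA
Line 1
Line 2
Line 3
Line 4
Line 5
This is HeaderB
Line 1
Line 2"""

# Hypothetical pattern: one header line followed by its run of "Line N" rows.
block_re = re.compile(
    r'^This is (?P<H>\w+)\n(?P<body>(?:Line \d+\n?)+)',
    flags=re.M,
)

def expand_headers(source):
    """Return one list per header, each item being header name + line number."""
    results = []
    for block in block_re.finditer(source):
        header = block.group('H')
        # Second pass: collect every line number inside this header's block.
        numbers = re.findall(r'^Line (\d+)$', block.group('body'), flags=re.M)
        results.append([header + n for n in numbers])
    return results

for row in expand_headers(text):
    print(' '.join(row))
```

This avoids the global list and the side effects inside re.sub, at the cost of scanning each block twice.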