Sample text:
This is HeaderA
Line 1
Line 2
Line 3
Line 4
Line 5
This is HeaderB
Line 1
Line 2
Intended result:
HeaderA1 HeaderA2 HeaderA3 HeaderA4 HeaderA5 HeaderB1, HeaderB2
Regex Attempts:
(?:^This is (?P<H>HeaderB)\s) (Line (?P<L>\d)\s)*?
- Matches only the Header ‘H’ and 1st ‘L’ Line
(?:^This is (?P<H>HeaderB)\s)? (Line (?P<L>\d)\s)*?
- Manages to match multiple ‘L’ lines; however, only the first two lines belong to the same match, and the subsequent ‘L’ lines do not reference the Header capture group.
I tried other attempts to adjust the regex but ended up screwing up the expression. I have limited experience with regex, so I am not entirely sure if it is possible to get the desired output.
Answer
This mixes a regex substitution callback with string formatting.
It assumes that each Header is always followed by lines of the form Line i.
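import re

text = """This is HeaderA
Line 1
Line 2
Line 3
Line 4
Line 5
This is HeaderB
Line 1
Line 2"""

ordered_matches = []  # global: one list per header, e.g. ['A', '1', '2', ...]

def custom_match(m, all_matches=ordered_matches):
    p = m.group(0)
    if p.isdigit():
        # a line number: append it to the most recent header's list
        all_matches[-1] += [p]
    else:
        # a header letter: start a new list for it
        all_matches += [[p]]
    return ''  # doesn't matter

# re.sub is used only for its callback; the trailing [A-Z0-9]+ on each
# line is either the header letter or the line number
r = re.sub(r'([A-Z0-9]+)$', custom_match, text, flags=re.M)

for m in ordered_matches:
    print(('Header{}{{}} '.format(m[0]) * (len(m)-1)).format(*m[1:]))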
Output
HeaderA1 HeaderA2 HeaderA3 HeaderA4 HeaderA5
HeaderB1 HeaderB2
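As a possible alternative, here is a minimal sketch that avoids collecting results as a side effect of re.sub. A single match cannot return every ‘L’ line, because Python's re module keeps only the last value captured by a repeated group. The sketch therefore scans the text with re.finditer and remembers the most recent header. The function name pair_headers and the pattern are illustrative assumptions, not part of the original answer, and it assumes the header lines read exactly "This is <name>" and the other lines exactly "Line <number>".

```python
import re

text = """This is HeaderA
Line 1
Line 2
Line 3
Line 4
Line 5
This is HeaderB
Line 1
Line 2"""

# One pattern with two alternatives: a header line or a numbered line.
# (Assumed line formats: "This is <name>" and "Line <number>".)
pattern = re.compile(r"^This is (?P<H>\w+)$|^Line (?P<L>\d+)$", flags=re.M)

def pair_headers(source):
    """Pair each line number with the most recent header, e.g. 'HeaderA1'."""
    results = []
    current = None  # name of the last header seen, if any
    for m in pattern.finditer(source):
        if m.group("H") is not None:
            current = m.group("H")
        elif current is not None:
            # numbered lines before any header are skipped
            results.append(current + m.group("L"))
    return results

print(" ".join(pair_headers(text)))
```

With the sample text this prints HeaderA1 HeaderA2 HeaderA3 HeaderA4 HeaderA5 HeaderB1 HeaderB2 on one line. Returning a list leaves the choice of separator to the caller.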