Python Regex – Reference first line on every match, until the start of a new group

Sample text:

This is HeaderA
 Line 1
 Line 2
 Line 3
 Line 4
 Line 5
This is HeaderB
 Line 1
 Line 2

Intended result:

HeaderA1 HeaderA2 HeaderA3 HeaderA4 HeaderA5

HeaderB1, HeaderB2

Regex Attempts:

(?:^This is (?P<H>HeaderB)\s) (Line (?P<L>\d)\s)*?

  • Matches only the Header ‘H’ and 1st ‘L’ Line

(?:^This is (?P<H>HeaderB)\s)? (Line (?P<L>\d)\s)*?

  • Manages to match multiple ‘L’ Lines; however, only the first 2 lines belong to the same match, and the subsequent ‘L’ Lines do not reference the Header capture group.

I tried other attempts to adjust the regex but ended up screwing up the expression. I have limited experience with regex, so I am not entirely sure if it is possible to get the desired output.

Answer

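This uses a mix of regex, a substitution callback and string formatting.

It is assumed that the text starts with a Header and that below a Header you always have at least one Line i. The pattern picks up the trailing run of capital letters or digits on each line, so a Header whose name ends in a digit would be mistaken for a Line number.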

import re
text = """This is HeaderA
 Line 1
 Line 2
 Line 3
 Line 4
 Line 5
This is HeaderB
 Line 1
 Line 2"""

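ordered_matches = []  # one list per header: [header_suffix, line_no, line_no, ...]

def custom_match(m, all_matches=ordered_matches):
    # Called by re.sub for every match; used only for its side effect on all_matches.
    p = m.group(0)
    if p.isdigit():
        # A line number: attach it to the most recent header.
        all_matches[-1].append(p)
    else:
        # A header suffix such as 'A': start a new group.
        all_matches.append([p])
    return ''  # the substituted text is discarded

# Match the trailing run of capitals/digits on each line ('A', '1', '2', ...).
re.sub(r'([A-Z0-9]+)$', custom_match, text, flags=re.M)

for m in ordered_matches:
    # Build 'HeaderA{} HeaderA{} ...' with one slot per line, then fill in the numbers.
    print(('Header{}{{}} '.format(m[0]) * (len(m)-1)).format(*m[1:]))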

Output

HeaderA1 HeaderA2 HeaderA3 HeaderA4 HeaderA5 
HeaderB1 HeaderB2 
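
As a possible alternative, here is a minimal two-pass sketch. Python's re module keeps only the last capture of a repeated group, which is why a single pattern like the ones in the question cannot return every Line together with its Header. The sketch first matches each Header block as a whole, then pulls the Line numbers out of that block. The names block_pattern and group_lines are made up for illustration, and the pattern assumes the exact "This is <Header>" / indented "Line <n>" layout of the sample text.

```python
import re

text = """This is HeaderA
 Line 1
 Line 2
 Line 3
 Line 4
 Line 5
This is HeaderB
 Line 1
 Line 2"""

# Pass 1: one match per block, i.e. a header line followed by its indented "Line N" rows.
# Assumption: headers look like "This is <word>" and rows like "<indent>Line <digits>".
block_pattern = re.compile(
    r"^This is (?P<header>\w+)\n?(?P<body>(?:[ \t]+Line \d+\n?)*)",
    flags=re.M,
)

def group_lines(source):
    """Return a list of (header, [line numbers]) pairs in document order."""
    groups = []
    for block in block_pattern.finditer(source):
        # Pass 2: collect every line number inside this block only.
        numbers = re.findall(r"Line (\d+)", block.group("body"))
        groups.append((block.group("header"), numbers))
    return groups

# Prefix each line number with the header it belongs to.
results = [
    " ".join(header + n for n in numbers)
    for header, numbers in group_lines(text)
]

for row in results:
    print(row)
```

Unlike the re.sub version, this keeps the full Header name (so a Header ending in a digit is not confused with a Line number) and produces no trailing space, at the cost of a stricter assumption about the layout.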
User contributions licensed under: CC BY-SA