
Extract substring using regex in Python

Hello, I have this string and I need to extract some substrings from it according to some delimiters:

string = """
1538 a
123
skua456
789
5
g
15563 blu55g
b
456
16453 a
789
5
16524 blu
g
55
1734 a
987
987
55
aasf
552
18278 blu
ttry
"""

And I need to extract exactly these strings:

string1 = """
1538 a
123
skua456
789
5
g
15563 blu55g
"""
string2 = """
16453 a
789
5
16524 blu
"""
string3 = """
1734 a
987
987
55
aasf
552
18278 blu
"""

I have tried a lot of approaches: re.findall, re.search, re.match. But I never got the expected result.

For example, the code below returns the whole string:

re.split(r"a(.*)blu", string)[0]


Answer

You do not need a regex for this; you can collect the lines from each line containing a up to the next line containing blu:

text = "1538 a\n123\nskua456\n789\n5\ng\n15563 blu55g\nb\n456\n16453 a\n789\n5\n16524 blu\ng\n55\n1734 a\n987\n987\n55\naasf\n552\n18278 blu\nttry"
f = False      # True while we are inside a block
result = []    # finished blocks
block = []     # lines of the block being collected
for line in text.splitlines():
    if 'a' in line:
        f = True                 # a line containing "a" opens a block
    if f:
        block.append(line)
    if 'blu' in line and f:
        f = False                # a line containing "blu" closes the block
        result.append("\n".join(block))
        block = []

print(result)
# => ['1538 a\n123\nskua456\n789\n5\ng\n15563 blu55g', '16453 a\n789\n5\n16524 blu', '1734 a\n987\n987\n55\naasf\n552\n18278 blu']

See the Python demo.
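
The same loop can be wrapped in a reusable function. This is only a sketch: the name `extract_blocks` and its `start`/`end` parameters are hypothetical, and the sample string is a shortened version of the input from the question.

```python
def extract_blocks(text, start="a", end="blu"):
    """Collect runs of lines from one containing `start` to the next containing `end`."""
    blocks = []
    current = None  # None means we are outside a block
    for line in text.splitlines():
        if current is None and start in line:
            current = []  # this line opens a new block
        if current is not None:
            current.append(line)
            if end in line:
                # this line closes the block
                blocks.append("\n".join(current))
                current = None
    return blocks


# Shortened sample (assumption: same layout as the original input).
sample = "1538 a\n123\n15563 blu55g\nb\n16453 a\n16524 blu\ng"
blocks = extract_blocks(sample)
print(blocks)
# => ['1538 a\n123\n15563 blu55g', '16453 a\n16524 blu']
```

Using `None` for "outside a block" replaces the separate boolean flag, and a block that is opened but never closed is simply dropped, as in the loop above.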

With regex, you can use

import re

print( re.findall(r'(?m)^.*a(?s:.*?)blu.*', text) )
print( re.findall(r'(?m)^.*a(?:\n.*)*?\n.*blu.*', text) )

See this Python demo.

The first regex means:

  • (?m)^ – multiline mode on, so ^ matches any line start position
  • .*a – any zero or more chars other than line break chars as many as possible, and then a
  • (?s:.*?) – any zero or more chars including line break chars as few as possible
  • blu.* – blu and then any zero or more chars other than line break chars as many as possible.
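
To see why the scoped `(?s:...)` group matters, here is a small sketch on a shortened sample (an assumption, not the full input). Without it, `.` cannot cross a line break, so a pattern like `a.*blu` finds nothing when the two delimiters sit on different lines, which is likely why the attempt in the question returned the whole string unsplit.

```python
import re

# Shortened sample (assumption: same layout as the original input).
sample = "1538 a\n123\n15563 blu55g\nb\n16453 a\n16524 blu\ng"

# Without DOTALL, "." stops at line breaks, so "a ... blu" must fit on one line.
single_line = re.findall(r"a.*blu", sample)

# Scoping the s flag to the lazy group lets only that part cross line breaks.
multi_line = re.findall(r"(?m)^.*a(?s:.*?)blu.*", sample)

print(single_line)
# => []
print(multi_line)
# => ['1538 a\n123\n15563 blu55g', '16453 a\n16524 blu']
```

Scoped inline flags such as `(?s:...)` need Python 3.6 or later.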

The second regex matches

  • (?m)^ – start of a line
  • .*a – any zero or more chars other than line break chars as many as possible, and then a
  • (?:\n.*)*? – zero or more lines, as few as possible
  • \n.*blu.* – a newline, any zero or more chars other than line break chars as many as possible, blu and any zero or more chars other than line break chars as many as possible.
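
The second regex can also be written out in verbose form, with `re.MULTILINE` passed as a flag instead of the inline `(?m)`. This is a sketch on a shortened sample; the name `pattern` is only illustrative.

```python
import re

pattern = re.compile(
    r"""
    ^.*a          # a line containing "a" (the start delimiter)
    (?:\n.*)*?    # zero or more whole lines, as few as possible
    \n.*blu.*     # a later line containing "blu" (the end delimiter)
    """,
    re.MULTILINE | re.VERBOSE,
)

# Shortened sample (assumption: same layout as the original input).
sample = "1538 a\n123\n15563 blu55g\nb\n16453 a\n16524 blu\ng"
matches = pattern.findall(sample)
print(matches)
# => ['1538 a\n123\n15563 blu55g', '16453 a\n16524 blu']

# Unlike the first regex, this one needs "blu" on a later line than "a".
same_line = pattern.findall("x a blu")
print(same_line)
# => []
```

Because every step consumes whole lines, this variant never needs `.` to match a line break, at the cost of not matching when both delimiters share one line.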