Extract substring using regex in Python

Hello, I have this string and I need to extract some substrings from it according to some delimiters:

string = """
1538 a
123
skua456
789
5
g
15563 blu55g
b
456
16453 a
789
5
16524 blu
g
55
1734 a
987
987
55
aasf
552
18278 blu
ttry
"""

And I need to extract exactly these strings:

string1 = """
1538 a
123
skua456
789
5
g
15563 blu55g
"""
string2 = """
16453 a
789
5
16524 blu
"""
string3 = """
1734 a
987
987
55
aasf
552
18278 blu
"""

I have tried a lot of functions: re.findall, re.search, re.match. But I never got the expected result.

For example, the code below prints the whole string:

re.split(r"a(.*)blu", a)[0]

Answer

You do not need a regex for this; you can collect the lines that run from a line containing a through the next line containing blu:

text = "1538 a\n123\nskua456\n789\n5\ng\n15563 blu55g\nb\n456\n16453 a\n789\n5\n16524 blu\ng\n55\n1734 a\n987\n987\n55\naasf\n552\n18278 blu\nttry"
f = False    # True while inside a block opened by a line containing 'a'
result = []  # finished blocks
block = []   # lines of the block currently being collected
for line in text.splitlines():
    if 'a' in line:
        f = True
    if f:
        block.append(line)
    if 'blu' in line and f:
        f = False
        result.append("\n".join(block))
        block = []

print(result)
# => ['1538 a\n123\nskua456\n789\n5\ng\n15563 blu55g', '16453 a\n789\n5\n16524 blu', '1734 a\n987\n987\n55\naasf\n552\n18278 blu']

See the Python demo.
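
The same loop can be wrapped in a small helper so the markers are not hard-coded. This is only a sketch: extract_blocks and its start/end parameters are names made up here, not part of the original answer, and it keeps the plain substring test used above. Note that 'a' in line is also true for lines such as skua456 and aasf; with the sample data this is harmless, because those lines only occur while a block is already open.

```python
def extract_blocks(text, start="a", end="blu"):
    """Return blocks running from a line containing `start`
    through the next line containing `end` (both lines included).

    Hypothetical helper: the name and parameters are illustrative only.
    """
    blocks = []
    current = []
    inside = False  # True while a block is open
    for line in text.splitlines():
        if start in line:
            inside = True
        if inside:
            current.append(line)
            if end in line:
                # Closing line reached: store the block and reset.
                blocks.append("\n".join(current))
                current = []
                inside = False
    return blocks


text = (
    "1538 a\n123\nskua456\n789\n5\ng\n15563 blu55g\nb\n456\n"
    "16453 a\n789\n5\n16524 blu\ng\n55\n"
    "1734 a\n987\n987\n55\naasf\n552\n18278 blu\nttry"
)

for block in extract_blocks(text):
    print(block)
    print("---")
```

If the delimiters are really "a line ending in a" and "a line ending in blu", a stricter test such as line.endswith(...) may be safer, but that is an assumption about the data the question does not confirm.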

With regex, you can use

import re

print(re.findall(r'(?m)^.*a(?s:.*?)blu.*', text))
print(re.findall(r'(?m)^.*a(?:\n.*)*?\n.*blu.*', text))

See this Python demo.
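
For convenience, here is a self-contained sketch that runs both patterns on the sample text and checks that they return the same three blocks. The variable names first and second are chosen here for illustration, and the scoped (?s:...) syntax assumes Python 3.6 or later.

```python
import re

text = (
    "1538 a\n123\nskua456\n789\n5\ng\n15563 blu55g\nb\n456\n"
    "16453 a\n789\n5\n16524 blu\ng\n55\n"
    "1734 a\n987\n987\n55\naasf\n552\n18278 blu\nttry"
)

# First pattern: the scoped (?s:...) group lets the lazy .*? cross line breaks.
first = re.findall(r'(?m)^.*a(?s:.*?)blu.*', text)

# Second pattern: no DOTALL; whole lines are consumed through explicit \n.
second = re.findall(r'(?m)^.*a(?:\n.*)*?\n.*blu.*', text)

for block in first:
    print(block)
    print("---")

# Both patterns find the same blocks on this input.
print(first == second)
```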

The first regex means:

  • (?m)^ – multiline mode on, so ^ matches any line start position
  • .*a – any zero or more chars other than line break chars as many as possible, and then a
  • (?s:.*?) – any zero or more chars including line break chars as few as possible
  • blu.* – blu and then any zero or more chars other than line break chars as many as possible.
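
To see why the (?s:...) group matters, the sketch below runs the pattern with and without it on a shortened, made-up sample string. Without the group, . stops at every line break, so an a and a blu on different lines can never be part of one match. The same effect explains why the re.split attempt from the question returns the whole string unsplit.

```python
import re

# Shortened sample, invented for illustration.
sample = "1 a\n2\n3 blu\nx"

# Plain .*? cannot cross a line break, so nothing matches.
without_dotall = re.findall(r'(?m)^.*a.*?blu.*', sample)

# The scoped (?s:...) flag makes . match line breaks inside the group only.
with_dotall = re.findall(r'(?m)^.*a(?s:.*?)blu.*', sample)

# The attempt from the question: no match is found, so re.split
# returns a list whose only item is the unchanged input.
split_attempt = re.split(r"a(.*)blu", sample)

print(without_dotall)  # []
print(with_dotall)     # ['1 a\n2\n3 blu']
print(split_attempt)   # ['1 a\n2\n3 blu\nx']
```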

The second regex matches

  • (?m)^ – start of a line
  • .*a – any zero or more chars other than line break chars as many as possible, and then a
  • (?:\n.*)*? – zero or more lines, as few as possible
  • \n.*blu.* – a newline, any zero or more chars other than line break chars as many as possible, blu and any zero or more chars other than line break chars as many as possible.
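
One practical difference between the two patterns, shown here on a made-up one-line input: the second pattern requires at least one line break between a and blu, while the first also matches when both sit on the same line.

```python
import re

# Invented input where "a" and "blu" share a line.
same_line = "ca blu"

first = re.findall(r'(?m)^.*a(?s:.*?)blu.*', same_line)
second = re.findall(r'(?m)^.*a(?:\n.*)*?\n.*blu.*', same_line)

print(first)   # ['ca blu']
print(second)  # []
```

On the sample text from the question this does not matter, since every a line and its closing blu line are separate.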