Extract all matches unless string contains

Question

I am using the re package's re.findall to extract terms from strings. How can I write a regex that says: capture these matches unless you see this substring (in this case the substring "fake")? I attempted this via an anchored look-ahead solution. Current Output: Desired Output I could accomplish …

Accepted Answer

Since re does not support lookbehind patterns of unknown length, a plain regex solution with re is not possible. However, the PyPI regex library supports such lookbehind patterns.

After installing PyPI regex, you can use

    (?<!fake.*)(man[a-z]?\b|dog)(?!.*fake)

See the regex demo.

Details:

(?<!fake.*) - a negative lookbehind that fails the match if there is a fake string followed by zero or more chars other than line break chars, as many as possible, immediately to the left of the current location
(man[a-z]?\b|dog) - man + an optional lowercase ASCII letter followed by a word boundary, or a dog string
(?!.*fake) - a negative lookahead that fails the match if there are zero or more chars other than line break chars, as many as possible, and then a fake string immediately to the right of the current location

In Python:

    # regex is the third-party PyPI package (pip install regex), not the stdlib re
    import regex

    for x in ['a man dogs', "fake: too many dogs", 'hi']:
        print(regex.findall(r"(?<!fake.*)(man[a-z]?\b|dog)(?!.*fake)", x, flags=regex.IGNORECASE))
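If installing a third-party package is not an option, the same effect can probably be had with the standard re module by testing for the blocking substring first and only then running findall. This is a different technique from the lookaround pattern in the answer, not a translation of it. The names TERMS, BLOCKER, and find_terms are hypothetical, and the sketch assumes each input is a single line, so "anywhere in the string" and "anywhere on the line" mean the same thing.

```python
import re

# The alternation from the answer, without the lookarounds:
# "man" + an optional letter + a word boundary, or "dog".
TERMS = re.compile(r"(man[a-z]?\b|dog)", flags=re.IGNORECASE)

# Hypothetical guard pattern: the substring that suppresses all matches.
BLOCKER = re.compile(r"fake", flags=re.IGNORECASE)


def find_terms(text):
    """Return all term matches, or an empty list if text contains the blocker."""
    if BLOCKER.search(text):
        return []
    return TERMS.findall(text)


for x in ['a man dogs', "fake: too many dogs", 'hi']:
    print(find_terms(x))
```

The guard costs one extra scan of the string, but it keeps the term pattern free of lookarounds and works on any Python installation.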


Answer