I need to match all strings that contain one word of a list, but only if that word is not immediately preceded by another specific word. I have this regex:
.*(?<!forbidden)\b(word1|word2|word3)\b.*
but it still matches a sentence like hello forbidden word1, because forbidden is matched by .*. If I remove the .*, however, I no longer match strings like hello word1, which I do want to match.
Note that I want to match a string like forbidden hello word1.
Could you suggest how to fix this problem?
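For reference, a minimal reproduction of the behavior described above. It assumes Python's built-in re module and re.match (the question does not say which function is used), and the names ORIGINAL and WITHOUT_WILDCARDS are only illustrative:

```python
import re

# The pattern from the question, with its word boundaries written as \b.
ORIGINAL = re.compile(r".*(?<!forbidden)\b(word1|word2|word3)\b.*")

# The same pattern with the leading and trailing .* removed.
WITHOUT_WILDCARDS = re.compile(r"(?<!forbidden)\b(word1|word2|word3)\b")

# Unwanted: this sentence is still matched.
print(bool(ORIGINAL.match("hello forbidden word1")))

# Wanted matches, which the original pattern does find.
print(bool(ORIGINAL.match("hello word1")))
print(bool(ORIGINAL.match("forbidden hello word1")))

# Without .*, re.match only tries the start of the string, so this fails.
print(bool(WITHOUT_WILDCARDS.match("hello word1")))
```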
Answer
Have a look into word boundaries: \bword can never touch a word character to the left.
To disallow (word1|word2|word3) if preceded by forbidden and
one \W (non-word character):
^.*?\b(?<!forbidden\W)(word1|word2|word3)\b.*
multiple \W:
Lookbehinds need to be of fixed length in Python's re module. To get around this, an idea is to use \W* outside the lookbehind, preceded by (?<!\W) to set the position to look behind from.
^.*?(?<!forbidden)(?<!\W)\W*\b(word1|word2|word3)\b.*
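A short sketch of how the two patterns above behave with Python's built-in re module. The sample strings and the names ONE_SEP and MANY_SEP are illustrative assumptions, not part of the original answer:

```python
import re

# Variant 1: reject the word when it follows "forbidden" plus exactly one non-word character.
ONE_SEP = re.compile(r"^.*?\b(?<!forbidden\W)(word1|word2|word3)\b.*")

# Variant 2: reject the word when it follows "forbidden" plus any run of non-word characters.
MANY_SEP = re.compile(r"^.*?(?<!forbidden)(?<!\W)\W*\b(word1|word2|word3)\b.*")

samples = [
    "hello word1",
    "forbidden hello word1",
    "hello forbidden word1",
    "hello forbidden   word1",  # three spaces between the two words
]

for text in samples:
    one = bool(ONE_SEP.match(text))
    many = bool(MANY_SEP.match(text))
    print(f"{text!r}: one separator={one}, multiple separators={many}")
```

Both variants accept the first two samples and reject the third. Only the second variant rejects the last sample, because the fixed-width lookbehind in the first variant sees just one character after forbidden.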
Regex101 demo (in the multiline demo I used [^\w\n] instead of \W so as not to skip over lines).
Certainly a variable-width lookbehind, such as (?<!forbidden\W+), would be more comfortable. The PyPI regex module (import regex as re) supports lookbehind of variable length: see this demo.
Note: If you do not need to capture anything, a non-capturing group (?:...) can be used as well.
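As a small illustration of that note, here is the single-separator pattern rewritten with a non-capturing group, using the built-in re module. The names CAPTURING and NON_CAPTURING are hypothetical:

```python
import re

# Capturing version: the matched word is stored as group 1.
CAPTURING = re.compile(r"^.*?\b(?<!forbidden\W)(word1|word2|word3)\b.*")

# Non-capturing version: same matching behavior, but no group is stored.
NON_CAPTURING = re.compile(r"^.*?\b(?<!forbidden\W)(?:word1|word2|word3)\b.*")

text = "hello word2"
print(CAPTURING.match(text).groups())      # one captured group
print(NON_CAPTURING.match(text).groups())  # no captured groups
```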