I need to match all strings that contain one word from a list, but only if that word is not immediately preceded by another specific word. I have this regex:
.*(?<!forbidden)\b(word1|word2|word3)\b.*
It still matches a sentence like "hello forbidden word1" because "forbidden" is matched by .*. But if I remove the .*, I no longer match strings like "hello word1", which I want to match.
Note that I do want to match a string like "forbidden hello word1".
Could you suggest how to fix this problem?
Answer
Have a look into word boundaries: \bword can never touch a word character to the left.
To disallow (word1|word2|word3) when it is preceded by forbidden and one \W (non-word character):
^.*?\b(?<!forbidden\W)(word1|word2|word3)\b.*
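As a quick check of the single-separator pattern, here is a minimal sketch using Python's built-in re module. The helper name matches_single_sep and the sample strings are made up for illustration. The last sample shows the limitation: with two spaces between forbidden and the keyword, the fixed-width lookbehind no longer blocks the match.

```python
import re

# Pattern from the answer: block the keyword when "forbidden" plus exactly
# one non-word character sits directly in front of it.
SINGLE_SEP = re.compile(r"^.*?\b(?<!forbidden\W)(word1|word2|word3)\b.*")


def matches_single_sep(text: str) -> bool:
    """Return True if the text contains an allowed keyword (hypothetical helper)."""
    return SINGLE_SEP.search(text) is not None


samples = [
    "hello word1",             # keyword with no forbidden prefix
    "hello forbidden word1",   # forbidden + one space directly before keyword
    "forbidden hello word1",   # forbidden present, but not directly before
    "hello forbidden  word1",  # two spaces: the lookbehind does not catch this
]

for s in samples:
    print(f"{s!r}: {matches_single_sep(s)}")
```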
To handle multiple \W: lookbehinds need to be of fixed length in Python's re module. To get around this, one idea is to consume \W* outside the lookbehind, preceded by (?<!\W), which fixes the position from which to look behind:
^.*?(?<!forbidden)(?<!\W)\W*\b(word1|word2|word3)\b.*
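The workaround can be tried out with the standard re module as well. This is only a sketch: the helper name matches_any_sep and the sample strings are assumptions for illustration. The (?<!\W) lookbehind pins the check to the start of the separator run, so any number of non-word characters between forbidden and the keyword is covered.

```python
import re

# Workaround pattern from the answer: \W* is consumed outside the lookbehind,
# and (?<!\W) forces the position to be the start of the separator run.
ANY_SEP = re.compile(
    r"^.*?(?<!forbidden)(?<!\W)\W*\b(word1|word2|word3)\b.*"
)


def matches_any_sep(text: str) -> bool:
    """Return True if the text contains an allowed keyword (hypothetical helper)."""
    return ANY_SEP.search(text) is not None


samples = [
    "word1",                     # keyword at the very start
    "hello word1",               # no forbidden prefix
    "forbidden hello word1",     # forbidden is not directly before the keyword
    "hello forbidden word1",     # one separator
    "hello forbidden   word1",   # several spaces
    "hello forbidden, word1",    # mixed punctuation and space
]

for s in samples:
    print(f"{s!r}: {matches_any_sep(s)}")
```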
Regex101 demo (in the multiline demo I used [^\w\n] instead of \W so that the match does not skip over lines).
Certainly a variable-width lookbehind, such as (?<!forbidden\W+), would be more comfortable. The PyPI regex module (import regex as re) supports lookbehind of variable length: see this demo.
Note: If you do not need to capture anything, a (?:...) non-capturing group can be used as well.
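To illustrate the difference, here is a small sketch comparing the two group types on the keyword alternation alone. The variable names are made up for the example.

```python
import re

# Same alternation, once as a capturing group and once as a non-capturing group.
capturing = re.compile(r"\b(word1|word2|word3)\b")
non_capturing = re.compile(r"\b(?:word1|word2|word3)\b")

text = "hello word2"

m_cap = capturing.search(text)
m_non = non_capturing.search(text)

# Both find the same overall match ...
print(m_cap.group(0), m_non.group(0))
# ... but only the capturing version stores the keyword as group 1.
print(m_cap.groups(), m_non.groups())
```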