I’m trying to match a list of keywords I have, taking care to include all Latin characters (e.g. accented ones).
Here’s an example:
import regex as re
p = r'((?!\pL)|^)blah((?!\pL)|$)'
print(re.search(p, "blah u"))
print(re.search(p, "blahé u"))
print(re.search(p, "éblah u"))
print(re.search(p, "blahaha"))
gives:
<regex.Match object; span=(0, 4), match='blah'>
None
None
None
Which looks correct. However:
print(re.search(p, "u blah"))
gives:
None
This is wrong, as I expect a match for “u blah”.
I’ve also tried Python’s built-in re module, but I cannot get it to work with \pL or \p{Latin} because of “bad escape” errors. I’ve also tried regular strings (without the “r” prefix), but despite doubling the backslash to \\pL, I just can’t get this to work right.
Note: I’m using Python 3.8
Answer
The problem with your ((?!\pL)|^)blah((?!\pL)|$) regex is that the ((?!\pL)|^) group contains two alternatives, and the first one can never succeed here. (?!\pL) is a negative lookahead that fails the match if the next char is a letter, and the next char to match is always the b in blah. Only the ^ alternative can succeed, so your regex is equivalent to ^blah((?!\pL)|$) and only matches at the start of the string.
Note that (?!\pL) already matches at the end of the string, since there is no next char there, so ((?!\pL)|$) = (?!\pL).
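The equivalence described above can be checked with a short sketch. It runs on the standard-library re module, so \pL is swapped for the class [^\W\d_] (roughly "any Unicode letter" in a str pattern). That substitution and the names LETTER, original and anchored are assumptions made for illustration, not part of the question's code.

```python
import re

# Stand-in for \pL: stdlib re does not support \p{...} classes (assumption for this sketch)
LETTER = r"[^\W\d_]"

# The pattern from the question, and the pattern it effectively reduces to
original = re.compile(rf"((?!{LETTER})|^)blah((?!{LETTER})|$)")
anchored = re.compile(rf"^blah(?!{LETTER})")

samples = ["blah u", "blahé u", "éblah u", "blahaha", "u blah"]

for text in samples:
    a = original.search(text)
    b = anchored.search(text)
    # Both patterns agree on every sample, including the failing "u blah" case
    assert (a is None) == (b is None)
    print(f"{text!r}: {'match' if a else 'no match'}")
```

Only "blah u" is reported as a match, because the left-hand lookahead is always looking at the b of blah, so only ^ can succeed there.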
You should use a negative lookbehind on the left and a negative lookahead on the right:
(?<!\pL)blah(?!\pL)
See the regex demo (switched to PCRE for the demo purposes).
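Note that the re-compatible version of the regex is
(?<![^\W\d_])blah(?![^\W\d_])
Here, [^\W\d_] matches any word char except digits and underscores, i.e. essentially any Unicode letter.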
See the regex demo.
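Since the question is about a list of keywords, here is a minimal sketch of how the re-compatible pattern might be built for several keywords at once. The keywords list, the sample strings and the variable names are hypothetical. Sorting by length is a precaution so that a longer keyword is tried before a shorter one with the same prefix.

```python
import re

keywords = ["blah", "café"]  # hypothetical keyword list

# Escape each keyword and try longer ones first
alternation = "|".join(
    re.escape(k) for k in sorted(keywords, key=len, reverse=True)
)

# No letter immediately before or after the keyword
pattern = re.compile(rf"(?<![^\W\d_])(?:{alternation})(?![^\W\d_])")

samples = ["blah u", "u blah", "blahé u", "éblah u", "blahaha", "un café!"]
for text in samples:
    m = pattern.search(text)
    print(f"{text!r}: {m.group() if m else None}")
```

With these samples, "blah u", "u blah" and "un café!" match, while the other three do not. Digits and underscores are not treated as letters, so a string like "blah2" would still match, which mirrors the behavior of \pL.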