I’m trying to match a list of keywords I have, taking care to treat all Latin letters (e.g. accented ones) as part of a word.
Here’s an example:
import regex as re

p = r'((?!\pL)|^)blah((?!\pL)|$)'
print(re.search(p, "blah u"))
print(re.search(p, "blahé u"))
print(re.search(p, "éblah u"))
print(re.search(p, "blahaha"))
gives:
<regex.Match object; span=(0, 4), match='blah'>
None
None
None
Which looks correct. However:
print(re.search(p, "u blah"))
gives:
None
This is wrong, as I expect a match for “u blah”.
I’ve also tried Python’s built-in re module, but I cannot get it to work with \pL or \p{Latin} because of “bad escape” errors. I’ve also tried non-raw strings (without the “r” prefix), but even after doubling the backslash to \\pL, I just can’t get this to work right.
Note: I’m using Python 3.8
Answer
The problem with your ((?!\pL)|^)blah((?!\pL)|$) regex is that the ((?!\pL)|^) group contains two alternatives, and the first one can never succeed. (?!\pL) is a negative lookahead that fails if the next character is a letter, and the next character to match is always the b in blah. Only the ^ alternative can succeed, so your regex is equivalent to ^blah((?!\pL)|$) and only matches at the start of the string.
Note that (?!\pL) also succeeds at the end of the string, so ((?!\pL)|$) is equivalent to (?!\pL).
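The equivalence described above can be checked with a short sketch. To keep it runnable with only the standard library, it assumes `[^\W\d_]` as an approximate stand-in for `\pL` (which needs the third-party regex module); the `LETTER` constant and the `span` helper are names made up for this illustration.

```python
import re

# Assumption: [^\W\d_] approximates \pL ("any letter") in the built-in re module.
LETTER = r"[^\W\d_]"

# The pattern from the question, and the simplified form it reduces to.
original = rf"((?!{LETTER})|^)blah((?!{LETTER})|$)"
simplified = rf"^blah(?!{LETTER})"

samples = ["blah u", "blahé u", "éblah u", "blahaha", "u blah"]


def span(pattern, text):
    """Return the span of the first match, or None if there is no match."""
    m = re.search(pattern, text)
    return m.span() if m else None


# Both patterns should agree on every sample, including the "u blah" miss.
for s in samples:
    print(repr(s), span(original, s), span(simplified, s))
```

Under that assumption, both patterns report a match only for "blah u", which is why "u blah" is missed.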
You should use a negative lookbehind on the left and a negative lookahead on the right:
(?<!\pL)blah(?!\pL)
See the regex demo (switched to PCRE for the demo purposes).
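Note that the version of the regex compatible with the built-in re module is
(?<![^\W\d_])blah(?![^\W\d_])
Here, [^\W\d_] matches a word character other than a digit or underscore, which in practice means a letter.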
See the regex demo.
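As a minimal sketch of how the re-compatible pattern could be applied to a list of keywords, the snippet below builds one alternation wrapped in the two lookarounds. The `keyword_pattern` helper is hypothetical, written for this illustration only; the sample strings are the ones from the question.

```python
import re


def keyword_pattern(keywords):
    """Compile one regex matching any keyword not adjacent to a letter.

    Hypothetical helper, not part of any library.
    """
    # Escape each keyword so regex metacharacters are taken literally;
    # try longer keywords first so they win over their own prefixes.
    alternation = "|".join(
        re.escape(k) for k in sorted(keywords, key=len, reverse=True)
    )
    return re.compile(rf"(?<![^\W\d_])(?:{alternation})(?![^\W\d_])")


pattern = keyword_pattern(["blah"])

for text in ["blah u", "u blah", "blahé u", "éblah u", "blahaha"]:
    print(repr(text), pattern.search(text))
```

With this pattern, "u blah" matches as well as "blah u", while the other three strings do not. Digits and underscores do not count as letters here, so "blah1" and "_blah" would also match; tighten the lookarounds if that is not wanted.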