
Python Regex \pL matching issues

I’m trying to match a list of keywords I have, taking care to treat all Latin letters (e.g. accented ones) as letters.

Here’s an example

import regex as re
p = r'((?!\pL)|^)blah((?!\pL)|$)'
print(re.search(p, "blah u"))
print(re.search(p, "blahé u"))
print(re.search(p, "éblah u"))
print(re.search(p, "blahaha"))

gives:

<regex.Match object; span=(0, 4), match='blah'>
None
None
None

Which looks correct. However:

print(re.search(p, "u blah"))

gives:

None

This is wrong, as I expect a match for “u blah”.

I’ve also tried to use Python’s built-in re module, but I cannot get it to work with \pL or \p{Latin} due to “bad escape” errors. I’ve also tried to use unicode strings (without the “r”), but despite adding slashes to \\pL, I just can’t get this to work right.

Note: I’m using Python 3.8


Answer

The problem with your ((?!\pL)|^)blah((?!\pL)|$) regex is that the ((?!\pL)|^) group contains two alternatives, and the first one can never succeed: (?!\pL) is a negative lookahead that fails the match if the next char is a letter, and the next char to match is always the b in blah. Only ^ can succeed, so your regex is equivalent to ^blah((?!\pL)|$) and only matches at the start of the string.

Note that (?!\pL) already matches a position at the end of the string, so ((?!\pL)|$) = (?!\pL).
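The two simplifications above can be checked empirically. The following is a minimal sketch, not part of the original answer: it uses the standard re module and substitutes [^\W\d_] for \pL as an approximate "letter" class, since re does not support \pL. The names LETTER, original, simplified and samples are made up for illustration.

```python
import re

# Assumption: [^\W\d_] stands in for \pL, because the stdlib re module
# does not support \p{...} classes.
LETTER = r"[^\W\d_]"

# The pattern from the question, and the form it reduces to.
original = re.compile(rf"((?!{LETTER})|^)blah((?!{LETTER})|$)")
simplified = re.compile(rf"^blah(?!{LETTER})")

samples = ["blah u", "blahé u", "éblah u", "blahaha", "u blah"]
for s in samples:
    a = original.search(s)
    b = simplified.search(s)
    # Print the span found by each pattern (or None) side by side.
    print(repr(s), a and a.span(), b and b.span())
```

On these samples both patterns should give the same result for every string, including no match for "u blah".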

You should use

(?<!\pL)blah(?!\pL)

See the regex demo (switched to PCRE for the demo purposes).

Note that the re-compatible version of the regex is

(?<![^\W\d_])blah(?![^\W\d_])

See the regex demo.
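As a quick check, here is a small sketch that runs the re-compatible pattern against the strings from the question using only the standard library. The helper find_span is a made-up name for illustration. Note that [^\W\d_] (a word character that is neither a digit nor an underscore) is only roughly equivalent to \pL, so edge cases with unusual Unicode characters may differ.

```python
import re

# [^\W\d_]: a word character that is not a digit and not an underscore,
# used here as an approximation of "any Unicode letter".
pattern = re.compile(r"(?<![^\W\d_])blah(?![^\W\d_])")

def find_span(text):
    """Return the (start, end) span of the first match, or None."""
    m = pattern.search(text)
    return m.span() if m else None

for text in ["blah u", "blahé u", "éblah u", "blahaha", "u blah"]:
    print(repr(text), find_span(text))
```

With this pattern, "u blah" matches at span (2, 6), while "blahé u", "éblah u" and "blahaha" still give no match.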
