
Python Regex \pL matching issues

I’m trying to match a list of keywords I have, taking care to include all Latin characters (e.g. accented ones).

Here’s an example

JavaScript

gives:

JavaScript

Which looks correct. However:

JavaScript

gives:

JavaScript

This is wrong, as I expect a match for “u blah”.

I’ve also tried Python’s built-in re module, but I cannot get it to work with \pL or \p{Latin} because of “bad escape” errors. I’ve also tried regular strings (without the “r” prefix), but even after doubling the backslash to \\pL, I just can’t get this to work right.

Note: I’m using Python 3.8


Answer

The problem with your ((?!\pL)|^)blah((?!\pL)|$) regex is that the ((?!\pL)|^) group contains two alternatives, and the first one can never lead to a match. Why? (?!\pL) is a negative lookahead that fails if the next character is a letter, and the next character to match is the b in blah. Only ^ can succeed, so your regex is equivalent to ^blah((?!\pL)|$) and matches only at the start of the string.

Note that (?!\pL) already matches at the end of the string, so ((?!\pL)|$) is equivalent to (?!\pL).
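To see the failure for yourself, here is a small sketch. It assumes `[^\W\d_]` as a stand-in for `\pL` so that it runs with the built-in `re` module alone; the names `LETTER`, `original` and `simplified` are just for illustration.

```python
import re

# Stand-in for \pL that the built-in re module accepts (an assumption for
# this sketch): word characters minus digits and underscore, i.e. letters.
LETTER = r"[^\W\d_]"

# The pattern from the question, with \pL swapped for the stand-in.
original = re.compile(rf"((?!{LETTER})|^)blah((?!{LETTER})|$)")

# What it effectively reduces to: anchored at the start of the string.
simplified = re.compile(rf"^blah(?!{LETTER})")

for text in ["blah u", "u blah", "blah", "blahs"]:
    print(
        f"{text!r}: original={bool(original.search(text))}, "
        f"simplified={bool(simplified.search(text))}"
    )
```

Both patterns agree on every input, and neither finds blah in "u blah".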

You should use

JavaScript

See the regex demo (switched to PCRE for demo purposes).

Note that the re-compatible version of the regex is

JavaScript

See the regex demo.
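For a whole keyword list, a minimal sketch with the built-in `re` module could look like this. It assumes a negative lookbehind on the left and a negative lookahead on the right, with `[^\W\d_]` as the letter class. The keyword list and the `find_keywords` helper are hypothetical, so adjust them to your data.

```python
import re

# Hypothetical keyword list, for illustration only.
keywords = ["blah", "café", "naïve"]

# In a str pattern, [^\W\d_] matches any Unicode letter
# (word characters minus digits and underscore).
LETTER = r"[^\W\d_]"

# Longest keywords first, so a longer keyword wins over its own prefix.
alternation = "|".join(
    re.escape(k) for k in sorted(keywords, key=len, reverse=True)
)

# No letter immediately before the keyword, no letter immediately after it.
pattern = re.compile(rf"(?<!{LETTER})(?:{alternation})(?!{LETTER})")


def find_keywords(text):
    """Return every keyword occurrence that is not glued to another letter."""
    return pattern.findall(text)


print(find_keywords("u blah"))
print(find_keywords("ublah blahé"))
print(find_keywords("un café, naïve"))
```

The lookbehind is what lets the keyword match in the middle of a string: it checks the character before the match instead of the first character of the keyword itself.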
