Look for the complement of a Unicode range in Python

I have a set of words and I want to find those that contain non-Italian characters. Instead of listing every Unicode range of letters that do not belong to the Italian alphabet, I think it would be much better to specify the ranges of the allowed letters and then check whether a string contains any character outside those ranges. The problem is, I don’t know how to tell Python’s re module to look for these characters, and I couldn’t find anything helpful.

Here’s an example: the range for lowercase Latin letters is \u0061-\u007a, so if I run the following:

print(re.search("[\u0061-\u007a]", 'hello'))

I get as output: <re.Match object; span=(0, 1), match='h'>, as expected.

Now let’s add an out-of-range character to the input string, making it Àhello. I want to search for the character outside the provided range. I tried adding the ‘^’ character before the range:

print(re.search("^[\u0061-\u007a]", 'Àhello'))

but I get None as output. I would like to avoid having to scan each string by character. Is it possible?

Answer

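Put the ^ symbol inside the square brackets. Outside the brackets, ^ anchors the match to the start of the string; as the first character inside the brackets, it negates the character class: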

print(re.search("[^\u0061-\u007a]", 'Àhello'))
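
To apply the same idea to a whole word list, a minimal sketch could look like the following. The set of allowed characters (ASCII letters plus a handful of accented vowels) is only my assumption about what counts as "Italian", and the names ALLOWED, NOT_ALLOWED and has_foreign_char are made up for illustration, so adjust them to your needs. The input is normalized to NFC first so that an accented letter typed as base letter plus combining mark is compared in its precomposed form.

```python
import re
import unicodedata

# Assumption: "Italian" means ASCII letters plus these accented vowels.
# Edit this string to match the alphabet you actually want to allow.
ALLOWED = "a-zA-Z" + "àèéìíîòóùú" + "ÀÈÉÌÍÎÒÓÙÚ"

# A leading ^ inside the brackets negates the class:
# this matches any single character NOT in the allowed set.
NOT_ALLOWED = re.compile(f"[^{ALLOWED}]")

def has_foreign_char(word):
    """Return True if word contains a character outside ALLOWED."""
    # NFC turns "e" + combining accent into the single character "é".
    normalized = unicodedata.normalize("NFC", word)
    return NOT_ALLOWED.search(normalized) is not None

words = ["perché", "città", "straße", "hello", "naïve"]
flagged = [w for w in words if has_foreign_char(w)]
print(flagged)  # ['straße', 'naïve']
```

re.search stops at the first offending character, so you never have to loop over the characters of each string yourself.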
User contributions licensed under: CC BY-SA