I have a set of words and I want to find those that contain non-Italian characters. Instead of listing every Unicode range of letters that do not belong to the Italian alphabet, I think it would be much better to specify the ranges of the allowed letters and then check whether a string contains any character outside those ranges. The problem is, I don’t know how to tell Python’s re module to look for these characters, and I couldn’t find anything helpful.
Here’s an example: the range for lowercase Latin letters is \u0061-\u007a, so if I run the following:
print(re.search("[\u0061-\u007a]", 'hello'))
I get as output: <re.Match object; span=(0, 1), match='h'>, as expected.
Now let’s add an out-of-range character to the input string and make it Àhello. I want to search for the character outside the provided range. I tried adding the ‘^’ character before the range:
print(re.search("^[\u0061-\u007a]", 'Àhello'))
but I get None as output. I would like to avoid scanning each string character by character. Is it possible?
Answer
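Put the ^ symbol inside the square brackets. Outside the brackets, ^ anchors the match to the start of the string, which is why your attempt returned None: the first character, À, is not in the range. As the first character inside the brackets, ^ negates the class, so the pattern matches any character that is not in the range:
print(re.search("[^\u0061-\u007a]", 'Àhello'))
This returns <re.Match object; span=(0, 1), match='À'>.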
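For the original goal of filtering a whole set of words, the same negated class can be compiled once and reused. The following is only a minimal sketch: the ALLOWED set (basic Latin letters plus the accented vowels à, è, é, ì, ò, ù in both cases) is my assumption about what counts as Italian, and the helper name has_foreign_chars is made up for illustration. Adjust both to your own definition.

```python
import re

# Assumption: allowed characters are a-z, A-Z and the accented vowels
# commonly used in Italian. Change this string to match your own rules.
ALLOWED = "a-zA-ZàèéìòùÀÈÉÌÒÙ"

# A leading ^ inside the brackets negates the class:
# the pattern matches any single character that is NOT in ALLOWED.
NOT_ALLOWED = re.compile(f"[^{ALLOWED}]")

def has_foreign_chars(word):
    """Return True if word contains at least one character outside ALLOWED."""
    return NOT_ALLOWED.search(word) is not None

words = ["hello", "città", "straße", "niño", "perché"]
flagged = [w for w in words if has_foreign_chars(w)]
print(flagged)  # ['straße', 'niño']
```

Note that this assumes accented letters are stored as single precomposed characters. If your data may contain decomposed forms (a base letter followed by a combining accent), normalizing each word with unicodedata.normalize("NFC", word) before the check should help.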