I was trying to write a regex in Python to replace all non-ASCII characters with an underscore, but if one of the characters is “S̄” (an ‘S’ with a line on top), it adds an extra ‘S’… Is there a way to account for this character as well? I believe it’s a valid UTF-8 character, but not ASCII.
Here’s the code:
    import re

    line = "ra*ndom wordS̄"
    print(re.sub(r'[\W]', '_', line))
I would expect it to output:
ra_ndom_word_
But instead I get:
ra_ndom_wordS__
Answer
The reason Python works this way is that you are actually looking at two distinct characters: an S followed by a combining macron, U+0304.
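To see the two code points for yourself, a minimal sketch along these lines should work. The helper name describe is hypothetical, not something from the question:

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name, combining class) for each character."""
    return [
        (f"U+{ord(char):04X}", unicodedata.name(char), unicodedata.combining(char))
        for char in text
    ]

text = "S\u0304"  # "S" followed by U+0304; renders as a single glyph
print(len(text))  # 2
for entry in describe(text):
    print(entry)
```

A non-zero combining class (the last value in each tuple) marks a character that attaches to the preceding base character.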
In the general case, if you want to replace a base character together with the combining characters that follow it with a single underscore, try something like this:

    import unicodedata

    def cleanup(line):
        cleaned = []
        for char in line:
            if unicodedata.combining(char):
                # A combining mark modifies the preceding base character,
                # so collapse base + mark(s) into a single underscore.
                if cleaned:
                    cleaned[-1] = "_"
                else:
                    cleaned.append("_")
                continue
            if unicodedata.category(char) not in ("Ll", "Lu"):
                # Anything that is not a lowercase or uppercase letter.
                char = "_"
            cleaned.append(char)
        return ''.join(cleaned)
By the by, \W does not need square brackets around it; it’s already a regex character class.
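A quick check that the brackets make no difference, as a sketch assuming Python 3 and a str (not bytes) input:

```python
import re

line = "ra*ndom wordS\u0304"  # "S" followed by U+0304 COMBINING MACRON

with_brackets = re.sub(r"[\W]", "_", line)
without_brackets = re.sub(r"\W", "_", line)

# Both patterns match exactly the same characters.
print(with_brackets == without_brackets)
print(without_brackets)
```

Either way the S itself stays in place, because S is a word character; only the characters around it that \W matches are replaced.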
Python’s re module lacks support for Unicode property classes, though if you specifically want to use a regex for this, the third-party regex library has proper support for Unicode categories.
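As a rough sketch of what that could look like with the regex package (installed separately, e.g. with pip install regex). The pattern and the function name cleanup_with_regex are my own assumptions for illustration, so treat them as a starting point rather than the canonical way to do it:

```python
try:
    import regex  # third-party package, not the standard-library re module
except ImportError:
    regex = None  # the sketch below needs the package to be installed

# Assumed pattern: any non-mark character followed by one or more combining
# marks, or any single character that is not a lowercase/uppercase letter.
PATTERN = r"\P{M}\p{M}+|[^\p{Ll}\p{Lu}]"

def cleanup_with_regex(line):
    """Replace each match of PATTERN with one underscore."""
    return regex.sub(PATTERN, "_", line)

if regex is not None:
    print(cleanup_with_regex("ra*ndom wordS\u0304"))
```

The first alternative is listed first so that a base character and its combining marks are consumed together and turn into one underscore instead of several.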
"Ll"
is the category for lowercase letters and "Lu" is the category for uppercase letters. There are other Unicode L categories, so you may want to tweak this to suit your requirements (perhaps unicodedata.category(char).startswith("L")); see also https://www.fileformat.info/info/unicode/category/index.htm
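To get a feel for what the broader startswith("L") test would accept, here is a small sketch. The helper name is_letter and the sample characters are just illustrative choices:

```python
import unicodedata

def is_letter(char):
    """True for any Unicode letter category: Lu, Ll, Lt, Lm or Lo."""
    return unicodedata.category(char).startswith("L")

# a, A, a titlecase digraph, a modifier letter, Hebrew alef, a digit, underscore
samples = ["a", "A", "\u01c5", "\u02b0", "\u05d0", "1", "_"]
for char in samples:
    print(repr(char), unicodedata.category(char), is_letter(char))
```

With this test, letters from scripts without case (category Lo) are kept as well, which the ("Ll", "Lu") check would turn into underscores.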