Regex For Special Character (S with line on top)

Question

I was trying to write a regex in Python to replace all non-ASCII characters with an underscore, but if one of the characters is "S̄" (an "S" with a line on top), it adds an extra "S"… Is there a way to account for this character as well? I believe it's a valid UTF-8 char…

Accepted Answer

The reason Python works this way is that you are actually looking at two distinct characters: an S followed by a combining macron, U+0304.

In the general case, if you want to replace a base character together with its sequence of combining characters with an underscore, try e.g.

    import unicodedata

    def cleanup(line):
        cleaned = []
        for char in line:
            if unicodedata.combining(char):
                # A combining mark modifies the previous character, so the
                # whole base + mark sequence collapses to one underscore.
                if cleaned:
                    cleaned[-1] = "_"
                else:
                    cleaned.append("_")
                continue
            # Anything that is not a lowercase or uppercase letter
            # (digits, spaces, punctuation, ...) also becomes an underscore.
            if unicodedata.category(char) not in ("Ll", "Lu"):
                char = "_"
            cleaned.append(char)
        return ''.join(cleaned)

By the by, \W does not need square brackets around it; it's already a regex character class.

Python's re module lacks support for important Unicode properties, though if you really want to use specifically a regex for this, the third-party regex library has proper support for Unicode categories.

"Ll" is lowercase alphabetics and "Lu" is uppercase. There are other Unicode L categories, so maybe tweak this to suit your requirements (unicodedata.category(char).startswith("L") maybe?); see also https://www.fileformat.info/info/unicode/category/index.htm
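To make the two-code-point explanation concrete, here is a minimal sketch using only the standard library. The pattern `[^\x00-\x7F]` is an assumption standing in for the regex from the question (which is not shown), and `replace_non_ascii` is a hypothetical helper name. It keeps ASCII as-is and turns each non-ASCII character, or each base character plus its combining marks, into a single underscore, which seems to be what the question was after.

```python
import re
import unicodedata

text = "S\u0304ample"  # "S̄ample": S followed by U+0304 COMBINING MACRON

# The visible "S̄" is really two code points.
for char in text[:2]:
    print(f"U+{ord(char):04X} {unicodedata.name(char)}")

# Assumed stand-in for the question's regex: it matches one code point at a
# time, so only the macron is replaced and the plain S is left behind.
naive = re.sub(r"[^\x00-\x7F]", "_", text)


def replace_non_ascii(text, placeholder="_"):
    """Replace each non-ASCII character, or base + combining marks, once."""
    out = []
    for char in text:
        if unicodedata.combining(char):
            # The mark belongs to the previous character: overwrite that
            # character so the whole sequence becomes one placeholder.
            if out:
                out[-1] = placeholder
            else:
                out.append(placeholder)
        elif ord(char) > 127:
            out.append(placeholder)
        else:
            out.append(char)
    return "".join(out)


print(naive)                     # S_ample
print(replace_non_ascii(text))   # _ample
```

The per-character pattern gives "S_ample", which is the stray S from the question, while the helper gives "_ample". Marks whose canonical combining class is 0 are not detected by unicodedata.combining and would get their own underscore here, so treat this as a starting point rather than full grapheme-cluster handling.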

Answer