I have input and output text files which can contain non-ASCII characters. Sometimes I need to escape them and sometimes I need to write the non-ASCII characters. Basically, if I get "Bürgerhaus" I need to output "B\u00FCrgerhaus", and if I get "B\u00FCrgerhaus" I need to output "Bürgerhaus".
One direction goes fine:
>>> s1 = "B\u00FCrgerhaus"
>>> print(s1)
Bürgerhaus
However, in the other direction I do not get the expected result ('B\u00FCrgerhaus'):
>>> s2 = "Bürgerhaus"
>>> s2_trans = s2.encode('utf8').decode('unicode_escape')
>>> print(s2_trans)
BÃ¼rgerhaus
I read that unicode-escape needs Latin-1, so I tried encoding to that, but it did not produce the expected result either. What am I doing wrong?
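A quick way to see why the second attempt fails is to inspect the intermediate bytes: `'unicode_escape'` interprets any bytes that are not part of an escape sequence as Latin-1, so the two UTF-8 bytes that encode the single 'ü' come back as two separate characters. A minimal demonstration:

```python
s2 = "Bürgerhaus"
raw = s2.encode('utf8')
print(raw)          # b'B\xc3\xbcrgerhaus' -- 'ü' becomes the two bytes C3 BC
mojibake = raw.decode('unicode_escape')
print(mojibake)     # 'BÃ¼rgerhaus' -- each byte decoded as its own Latin-1 character
```

So the encode/decode round trip never sees a backslash escape at all; it just re-interprets raw UTF-8 bytes under the wrong charset.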
(PS: Thank you Matthias for reminding me that the conversion in the first example was not necessary.)
Answer
You could do something like this:
charList = []
s1 = "Bürgerhaus"
for i in [ord(x) for x in s1]:
    # Keep ASCII characters; 'encode' non-ASCII characters as their ordinal in hex
    if i < 128:  # not sure if that is right or can be made easier!
        charList.append(chr(i))
    else:
        charList.append('\\u%04x' % i)
res = ''.join(charList)
print(f"Mixed up string: {res}")

for myStr in (res, s1):
    if '\\u' in myStr:
        print(myStr.encode().decode('unicode-escape'))
    else:
        print(myStr)
Out:
Mixed up string: B\u00fcrgerhaus
Bürgerhaus
Bürgerhaus
Explanation:
We are going to convert each character to its corresponding Unicode code point.
>>> print([(c, ord(c)) for c in s1])
[('B', 66), ('ü', 252), ('r', 114), ('g', 103), ('e', 101), ('r', 114), ('h', 104), ('a', 97), ('u', 117), ('s', 115)]
Regular ASCII characters have decimal values < 128; other characters, such as the Euro sign or German umlauts, have code points >= 128 (detailed table here).
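As an aside, on Python 3.7+ the `< 128` comparison can be expressed with the built-in `str.isascii()`; a minimal sketch showing the two checks agree:

```python
# ord(c) < 128 and c.isascii() give the same answer for every character
for c in "Bürgerhaus€":
    assert (ord(c) < 128) == c.isascii()
print("checks passed")
```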
Now we 'encode' all characters with code points >= 128 as their corresponding Unicode escape sequence.
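Putting the two steps together, the loop above can be condensed into a single generator expression; this is just a compact sketch of the same logic (characters above U+FFFF would need the longer `\UXXXXXXXX` form, same as in the original loop):

```python
s1 = "Bürgerhaus"

# escape: keep ASCII characters as-is, write anything else as \uXXXX
escaped = ''.join(c if ord(c) < 128 else '\\u%04x' % ord(c) for c in s1)
print(escaped)   # B\u00fcrgerhaus

# unescape: only decode strings that actually contain escape sequences
restored = escaped.encode('ascii').decode('unicode_escape') if '\\u' in escaped else escaped
print(restored)  # Bürgerhaus
```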