Python 3.8: Escape non-ascii characters as unicode

I have input and output text files which can contain non-ascii characters. Sometimes I need to escape them and sometimes I need to write the non-ascii characters. Basically if I get “Bürgerhaus” I need to output “B\u00FCrgerhaus”. If I get “B\u00FCrgerhaus” I need to output “Bürgerhaus”.

One direction goes fine:

>>> s1 = "B\u00FCrgerhaus"
>>> print(s1)
Bürgerhaus

However, in the other direction I do not get the expected result (‘B\u00FCrgerhaus’):

>>> s2 = "Bürgerhaus"
>>> s2_trans = s2.encode('utf8').decode('unicode_escape')
>>> print(s2_trans)
Bürgerhaus

I read that unicode-escape needs latin-1 and tried encoding to that instead, but this did not produce the expected result either. What am I doing wrong?

(PS: Thank you Matthias for reminding me that the conversion in the first example was not necessary.)

Answer

You could do something like this:

charList = []
s1 = "Bürgerhaus"

for i in [ord(x) for x in s1]:
    # Keep ASCII characters; write all others as \uXXXX using their code point in hex
    if i < 128:  # not sure if that is right or can be made easier!
        charList.append(chr(i))
    else:
        charList.append('\\u%04x' % i)

res = ''.join(charList)
print(f"Mixed up string: {res}")

for myStr in (res, s1):
    # Only decode strings that actually contain an escape sequence
    if '\\u' in myStr:
        print(myStr.encode().decode('unicode-escape'))
    else:
        print(myStr)

Out:

Mixed up string: B\u00fcrgerhaus
Bürgerhaus
Bürgerhaus
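
One caveat with the decoding branch above: `encode().decode('unicode-escape')` reads the UTF-8 bytes as Latin-1, so a string that mixes literal non-ASCII characters with `\uXXXX` escapes comes back garbled. Below is a minimal sketch of an alternative, assuming only four-digit `\uXXXX` escapes need to be handled. The helper name `unescape_unicode` is made up for illustration.

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_ESCAPE_RE = re.compile(r'\\u([0-9a-fA-F]{4})')

def unescape_unicode(text):
    """Replace each \\uXXXX escape with its character; leave the rest as is."""
    return _ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), text)

print(unescape_unicode('B\\u00FCrgerhaus'))     # Bürgerhaus
print(unescape_unicode('Bürgerhaus'))           # Bürgerhaus
print(unescape_unicode('Bürger B\\u00FCrger'))  # Bürger Bürger
```

Because the substitution only touches the escape sequences, no check for `'\\u'` is needed beforehand, and characters that are already non-ASCII pass through unchanged.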

Explanation:

We are going to convert each character to its corresponding Unicode code point.

print([(c, ord(c)) for c in s1])
[('B', 66), ('ü', 252), ('r', 114), ('g', 103), ('e', 101), ('r', 114), ('h', 104), ('a', 97), ('u', 117), ('s', 115)]

Regular ASCII characters have code points below 128; other characters, such as the euro sign or German umlauts, have code points of 128 or above (detailed table here).
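
As a quick check of that threshold, the sketch below filters out the characters that fall outside ASCII. It also uses `str.isascii()`, which is available from Python 3.7 on; the variable names are only illustrative.

```python
s = "Bürgerhaus"

# Characters whose code point is 128 or higher fall outside ASCII.
non_ascii = [(c, ord(c)) for c in s if ord(c) >= 128]
print(non_ascii)                 # [('ü', 252)]

# str.isascii() answers the same question for a whole string.
print(s.isascii())               # False
print("Buergerhaus".isascii())   # True
```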

Now we ‘encode’ every character with a code point >= 128 as its \uXXXX escape sequence, i.e. the code point written as four hex digits.
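
The encoding step can be condensed into one helper. This is a sketch rather than the exact code from the answer: the name `escape_non_ascii` is hypothetical, and it assumes that characters beyond U+FFFF should use Python's eight-digit `\UXXXXXXXX` form, since `%04x` would produce five hex digits there and `unicode-escape` would misread them.

```python
def escape_non_ascii(text):
    """Return text with every non-ASCII character written as an escape."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            parts.append(ch)                # plain ASCII stays untouched
        elif cp <= 0xFFFF:
            parts.append('\\u%04X' % cp)    # e.g. ü -> \u00FC
        else:
            parts.append('\\U%08X' % cp)    # assumption: 8-digit form above U+FFFF
    return ''.join(parts)

escaped = escape_non_ascii("Bürgerhaus")
print(escaped)                                           # B\u00FCrgerhaus
print(escaped.encode('ascii').decode('unicode-escape'))  # Bürgerhaus
```

Note that the round trip is only reliable when the input contains no backslashes of its own, because `unicode-escape` would interpret sequences such as `\n` in the original text as well.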

User contributions licensed under: CC BY-SA