Two Unicode escapes represent one Cyrillic letter

I have the following string, shown as Unicode escapes and as displayed text:

\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083

and

ЕÑли повезет то ÑÐµÐ³Ð¾Ð´Ð½Ñ ÑƒÐ¶Ðµ Ñкину.

The desired output is “Если повезет то сегодня уже скину”.

I have tried every encoding I could think of but still wasn’t able to get the text in fully Cyrillic form.

The best I got was

'�?�?ли повезе�? �?о �?егодн�? �?же �?кин�?'

using windows-1252.

I’ve also noticed that each Cyrillic letter in the desired string corresponds to two Unicode escapes.

For example: \u00d0\u0095 = 'Е'. Maybe someone knows which encoding this is and how to use it to get a normal result?

Answer

You have a mis-decoded string: the UTF-8 bytes were decoded as ISO-8859-1 (also known as latin1). Ideally, re-download the data with the correct encoding. Otherwise, encode the string with the wrongly used encoding to recover the original byte stream, then decode those bytes with the right encoding (UTF-8):

Python:

>>> s = '\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083'
>>> s
'Ð\x95Ñ\x81липовезеÑ\x82 Ñ\x82оÑ\x81егоднÑ\x8fÑ\x83жеÑ\x81кинÑ\x83'
>>> print(s)
ÐÑÐ»Ð¸Ð¿Ð¾Ð²ÐµÐ·ÐµÑ ÑоÑегоднÑÑжеÑкинÑ
>>> s.encode('latin1')
b'\xd0\x95\xd1\x81\xd0\xbb\xd0\xb8\xd0\xbf\xd0\xbe\xd0\xb2\xd0\xb5\xd0\xb7\xd0\xb5\xd1\x82 \xd1\x82\xd0\xbe\xd1\x81\xd0\xb5\xd0\xb3\xd0\xbe\xd0\xb4\xd0\xbd\xd1\x8f\xd1\x83\xd0\xb6\xd0\xb5\xd1\x81\xd0\xba\xd0\xb8\xd0\xbd\xd1\x83'
>>> s.encode('latin1').decode('utf8')
'Еслиповезет тосегодняужескину'
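
The encode-then-decode round trip above can be wrapped in a small helper. This is only a sketch: the name `fix_mojibake` and its parameters are hypothetical, and it assumes the whole string was mis-decoded with a single known codec (latin1 by default). If the round trip fails, it returns the input unchanged rather than guessing.

```python
def fix_mojibake(text: str, wrong: str = "latin1", right: str = "utf-8") -> str:
    """Undo a wrong decode: re-encode with the codec that was used by
    mistake, then decode the recovered bytes with the intended codec.

    Hypothetical helper; assumes one wrong codec was applied to the whole string.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not round-trip cleanly, so it was probably not
        # mis-decoded in the assumed way. Leave it untouched.
        return text


broken = "\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8"
fixed = fix_mojibake(broken)          # 'Если'
already_ok = fix_mojibake("Если")     # unchanged: Cyrillic cannot be encoded as latin1
```

Returning the input on failure keeps the helper safe to run over text that is only partly affected, at the cost of silently skipping strings that were damaged in some other way.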

You may also have a literal string of Unicode escape codes, which is a bit trickier:

>>> s=r'\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083'
>>> print(s)
\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083

In this case, the string has to be encoded to bytes, decoded as Unicode escapes, then encoded back to bytes and finally decoded as UTF-8. The latin1 codec is used for both encoding steps because it maps the first 256 Unicode code points to bytes 0-255, so each code point converts 1:1 to the byte with the same value.

>>> s.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
'Еслиповезет тосегодняужескину'
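
To make the chain easier to follow, here is a sketch that first checks the 1:1 property of latin1 and then runs the four steps one at a time on a shortened input. The variable names (`literal`, `step1` to `step4`) are only illustrative.

```python
# latin1 maps code points 0-255 to the byte with the same value, and back.
all_bytes = bytes(range(256))
assert all_bytes.decode("latin1") == "".join(chr(i) for i in range(256))
assert all_bytes.decode("latin1").encode("latin1") == all_bytes

# A raw string: the backslashes are literal characters, not escapes.
literal = r"\u00d0\u0095\u00d1\u0081"    # 24 ASCII characters

step1 = literal.encode("latin1")          # the escape text as ASCII bytes
step2 = step1.decode("unicode-escape")    # 4 code points: U+00D0 U+0095 U+00D1 U+0081
step3 = step2.encode("latin1")            # the original UTF-8 bytes: b'\xd0\x95\xd1\x81'
step4 = step3.decode("utf8")              # 'Ес'
```

Each intermediate value can be inspected in the REPL to see where a given input stops matching the expected shape.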
User contributions licensed under: CC BY-SA