I have a string in the following Unicode-escape and UTF-8 representations:
\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083
and
ЕÑли повезет то ÑÐµÐ³Ð¾Ð´Ð½Ñ ÑƒÐ¶Ðµ Ñкину.
The desired output is “Если повезет то сегодня уже скину”.
I have tried every encoding I could think of but still wasn’t able to get it in fully Cyrillic form.
The best I got was
'�?�?ли повезе�? �?о �?егодн�? �?же �?кин�?'
using windows-1252.
I’ve also noticed that each Cyrillic letter in the desired string corresponds to two Unicode escapes.
For example: \u00d0\u0095 = 'Е'.
Maybe someone knows what encoding and how to use it to get a normal result?
Answer
You have a mis-decoded string: the UTF-8 bytes were decoded as ISO-8859-1 (also known as latin1). Ideally, re-download the data with the correct encoding. Otherwise, encode the string with the wrongly used encoding to recover the original byte stream, then decode it with the right encoding (UTF-8):
Python:
>>> s = '\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083'
>>> s
'Ð\x95Ñ\x81Ð»Ð¸Ð¿Ð¾Ð²ÐµÐ·ÐµÑ\x82 Ñ\x82Ð¾Ñ\x81ÐµÐ³Ð¾Ð´Ð½Ñ\x8fÑ\x83Ð¶ÐµÑ\x81ÐºÐ¸Ð½Ñ\x83'
>>> print(s)
ÐÑÐ»Ð¸Ð¿Ð¾Ð²ÐµÐ·ÐµÑ ÑÐ¾ÑÐµÐ³Ð¾Ð´Ð½ÑÑÐ¶ÐµÑÐºÐ¸Ð½Ñ
>>> s.encode('latin1')
b'\xd0\x95\xd1\x81\xd0\xbb\xd0\xb8\xd0\xbf\xd0\xbe\xd0\xb2\xd0\xb5\xd0\xb7\xd0\xb5\xd1\x82 \xd1\x82\xd0\xbe\xd1\x81\xd0\xb5\xd0\xb3\xd0\xbe\xd0\xb4\xd0\xbd\xd1\x8f\xd1\x83\xd0\xb6\xd0\xb5\xd1\x81\xd0\xba\xd0\xb8\xd0\xbd\xd1\x83'
>>> s.encode('latin1').decode('utf8')
'Еслиповезет тосегодняужескину'
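The encode-then-decode round trip above can be wrapped in a small helper. This is only a sketch: the function name `fix_mojibake` and its parameters are made up for illustration, and it assumes the text really was UTF-8 that got decoded as latin1.

```python
def fix_mojibake(text: str, wrong: str = "latin1", right: str = "utf-8") -> str:
    """Undo a wrong decode (hypothetical helper, not a library function).

    Re-encode with the codec that was wrongly used to recover the original
    bytes, then decode those bytes with the codec that should have been used.
    """
    return text.encode(wrong).decode(right)


# First four letters of the string from the question, mis-decoded as latin1.
garbled = "\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8"
print(fix_mojibake(garbled))  # Если
```

If the text was not produced by this exact mix-up, `encode` or `decode` will raise `UnicodeEncodeError` or `UnicodeDecodeError` rather than return a wrong result silently.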
You may also have a literal string of Unicode escape codes, which is a bit trickier:
>>> s = r'\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083'
>>> print(s)
\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083
In this case, the string has to be encoded to bytes, decoded as Unicode escapes, then encoded back to bytes and correctly decoded as UTF-8. latin1 has the property that the first 256 Unicode code points map to bytes 0-255 in that codec, so it converts each code point to the same byte value, 1:1.
>>> s.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
'Еслиповезет тосегодняужескину'
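To make the chain above easier to follow, here is a sketch that first checks the 1:1 property of latin1 and then spells out each step separately. The function name `fix_escaped_mojibake` is hypothetical, and the sketch assumes the input consists only of literal `\uXXXX` escapes and ASCII characters.

```python
# latin1 maps code points 0..255 to the identical byte values,
# so any byte string survives a decode/encode round trip unchanged.
all_bytes = bytes(range(256))
assert all_bytes.decode("latin1").encode("latin1") == all_bytes
assert [ord(c) for c in all_bytes.decode("latin1")] == list(range(256))


def fix_escaped_mojibake(text: str) -> str:
    """Hypothetical helper: literal \\uXXXX escapes -> correctly decoded text."""
    raw = text.encode("latin1")               # literal escape text -> bytes
    garbled = raw.decode("unicode-escape")    # interpret the \uXXXX escapes
    original = garbled.encode("latin1")       # code points -> original UTF-8 bytes
    return original.decode("utf-8")           # decode with the right codec


# Raw string: the backslashes are literal characters here, not escapes.
literal = r"\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8"
print(fix_escaped_mojibake(literal))  # Если
```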