Two unicode encodings represent 1 cyrillic letter

Question

I have a string in Unicode and UTF-8 representations. The desired output is "Если повезет то сегодня уже скину". I have tried every encoding I could think of but still could not get it into fully Cyrillic form. The best result I got was with windows-1252. I have also noticed that one Cyrillic letter in the desired string corresponds to two Unicode escape codes.

Accepted Answer

You have a mis-decoded string where the UTF-8 bytes were translated as ISO-8859-1 (also known as latin1). Ideally, re-download with the correct encoding, but you can also encode with the wrongly-used encoding to regain the original byte stream, then decode with the right encoding (UTF-8):

Python:

    >>> s = '\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083'
    >>> s
    'Ð\x95Ñ\x81Ð»Ð¸Ð¿Ð¾Ð²ÐµÐ·ÐµÑ\x82 Ñ\x82Ð¾Ñ\x81ÐµÐ³Ð¾Ð´Ð½Ñ\x8fÑ\x83Ð¶ÐµÑ\x81ÐºÐ¸Ð½Ñ\x83'
    >>> print(s)
    ÐÑÐ»Ð¸Ð¿Ð¾Ð²ÐµÐ·ÐµÑ ÑÐ¾ÑÐµÐ³Ð¾Ð´Ð½ÑÑÐ¶ÐµÑÐºÐ¸Ð½Ñ
    >>> s.encode('latin1')
    b'\xd0\x95\xd1\x81\xd0\xbb\xd0\xb8\xd0\xbf\xd0\xbe\xd0\xb2\xd0\xb5\xd0\xb7\xd0\xb5\xd1\x82 \xd1\x82\xd0\xbe\xd1\x81\xd0\xb5\xd0\xb3\xd0\xbe\xd0\xb4\xd0\xbd\xd1\x8f\xd1\x83\xd0\xb6\xd0\xb5\xd1\x81\xd0\xba\xd0\xb8\xd0\xbd\xd1\x83'
    >>> s.encode('latin1').decode('utf8')
    'Еслиповезет тосегодняужескину'

You may also have a literal string of Unicode escape codes, which is a bit trickier:

    >>> s=r'\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083'
    >>> print(s)
    \u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083

In this case, the string has to be converted back to bytes, decoded as Unicode escapes, then encoded back to bytes and correctly decoded as UTF-8. latin1 has the feature that the first 256 code points of Unicode map to bytes 0-255 in that codec, so it converts 1:1 code point to byte value.

    >>> s.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
    'Еслиповезет тосегодняужескину'
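The two round trips above could be wrapped in small helpers. This is only a sketch: the function names `fix_mojibake` and `fix_escaped_mojibake` are made up for illustration, and it assumes the text really was UTF-8 that got decoded as latin1. The short sample covers just the first four letters of the string.

```python
def fix_mojibake(text: str, wrong: str = "latin1", right: str = "utf-8") -> str:
    """Re-encode with the codec that was wrongly used, then decode correctly."""
    return text.encode(wrong).decode(right)


def fix_escaped_mojibake(text: str) -> str:
    r"""Same repair for text holding literal '\u00d0'-style escape codes."""
    # The literal escapes are plain ASCII, so latin1 turns them into bytes
    # unchanged; 'unicode-escape' then interprets the \uXXXX sequences.
    unescaped = text.encode("latin1").decode("unicode-escape")
    return fix_mojibake(unescaped)


# latin1 maps code points 0-255 to byte values 0-255 one to one,
# which is why the round trip recovers the original byte stream.
assert bytes(range(256)).decode("latin1") == "".join(chr(i) for i in range(256))

garbled = "\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8"   # real escapes
literal = r"\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8"  # backslashes kept as text

print(fix_mojibake(garbled))          # Если
print(fix_escaped_mojibake(literal))  # Если
```

If the input contains any character above U+00FF, `encode('latin1')` raises `UnicodeEncodeError`, which is a useful hint that the text was not mis-decoded in this particular way.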
