
Two Unicode escapes represent one Cyrillic letter

I have the following string in Unicode-escape and UTF-8 representations:

JavaScript

and

JavaScript

The desired output is “Если повезет то сегодня уже скину”.

I have tried every encoding I could think of, but still wasn’t able to get the string in fully Cyrillic form.

The best I got was

JavaScript

using windows-1252.

I’ve also noticed that each Cyrillic letter in the desired string corresponds to two Unicode escapes.

For example: \u00d0\u0095 = 'Е'. Does anyone know which encoding this is and how to use it to get a normal result?


Answer

You have a mis-decoded string: the UTF-8 bytes were decoded as ISO-8859-1 (also known as latin1). Ideally, re-download the data with the correct encoding, but you can also encode the string with the wrongly used encoding to recover the original byte stream, then decode it with the right encoding (UTF-8):

Python:

JavaScript
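As a minimal sketch of that round trip (not necessarily the exact code originally posted here), the example below first reproduces the garbling from the desired text in the question, then reverses it. The variable names `original`, `garbled`, and `fixed` are illustrative only.

```python
# The text the question wants to end up with.
original = "Если повезет то сегодня уже скину"

# Reproduce the problem: UTF-8 bytes wrongly decoded as latin1.
# Each Cyrillic letter becomes two characters, e.g. 'Е' -> '\u00d0\u0095'.
garbled = original.encode("utf-8").decode("latin1")

# Reverse it: encoding with latin1 recovers the original bytes,
# which can then be decoded correctly as UTF-8.
fixed = garbled.encode("latin1").decode("utf-8")

print(fixed)
```

This probably also explains why windows-1252 only partly worked: that code page leaves a few byte values (such as 0x81) undefined, so it cannot round-trip every byte, whereas latin1 can.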

You may also have a literal string of Unicode escape codes, which is a bit trickier:

JavaScript

In this case, the string has to be converted to bytes, decoded as Unicode escapes, then encoded back to bytes with latin1 and finally decoded as UTF-8. latin1 maps the first 256 Unicode code points to byte values 0-255, so it converts each code point to a byte value 1:1.

JavaScript
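The four steps described above can be sketched as follows. The `escaped` value is a hypothetical input, built from the `\u00d0\u0095` pattern in the question, and covers only the first word ("Если"). The step names are illustrative.

```python
# Hypothetical input: a string containing literal backslash-u escapes
# (a raw string, so the backslashes are real characters).
escaped = r"\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8"

# 1. Convert the text to bytes (it is plain ASCII at this point).
as_bytes = escaped.encode("ascii")

# 2. Interpret the escape sequences; this yields code points U+0000-U+00FF.
mojibake = as_bytes.decode("unicode_escape")

# 3. Map each code point 1:1 to a byte value with latin1.
utf8_bytes = mojibake.encode("latin1")

# 4. Decode the recovered bytes with the correct encoding.
fixed = utf8_bytes.decode("utf-8")

print(fixed)
```

The steps can also be chained into a single expression, but keeping them separate makes it easier to inspect the intermediate value if one of the conversions raises an error.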
User contributions licensed under: CC BY-SA