
Converting Bad Text to Korean

The Problem

I’m working on cleaning up some old Korean code, and there are some sections of code that used to be Korean that I would like to translate to English. However, there seems to have been an encoding issue, and the text is no longer Korean. Instead, it’s a garbled mess.

I would like to go from the broken string to an English translation.

My plan is to start with the broken string, encode it to binary using the codec that was used to decode the broken string on my computer, decode that binary to Korean using a Korean codec, and google translate that Korean into English. The issue is I have no idea how to decode this mess into readable Korean.

What I’ve tried

I started writing some Python3 code to work on translating this, but I keep getting hit with encoding errors, and honestly, I don’t know where to start. This code was written with the assumption that the Korean used the cp949 codec, which I don’t know for sure.


I’ve also researched this issue, but I haven’t found anything groundbreaking. I believe the codec used to display the broken strings is UTF-8, but I could be mistaken. I don’t know how the original Korean was written, except that it was written using a “multi-byte encoding scheme (MBCS)”. For context, the program this was written in is LabVIEW 2015. Presumably, they used a Korean version when they wrote the initial code.

Some examples of the broken strings:

ÆÄÀÏ ´ëÈ­ »óÀÚ5

ÆÄÀÏ ´ëÈ­ »óÀÚ6

ÆÄÀÏ ´ëÈ­ »óÀÚ

Luckily, some of the encoding errors happened on enums, so I was able to find the English translation. Using that translation, I can guess what the Korean might have been, but I’m not certain. I think these known pairs might help me deduce the codecs used, but I don’t know how to do it.

À¯ÇÑ »ùÇà = Finite Samples > 유한 샘플

¿¬¼Ó »ùÇà = Continuous Samples > 연속 샘플

Çϵå¿þ¾î ŸÀֿ̹¡ ÀÇÇÑ ´ÜÀÏ Æ÷ÀÎÆ® = Hardware Timed Single Point > 하드웨어 타이밍 단일 포인트

Any help on working with encoding or tips on how to solve this would be greatly appreciated!! I’m very lost right now.

Edit: Here is a hex dump of some of the broken strings:

Broken_Korean.txt


Answer

The data in the hex dump was likely cp949 bytes that were read as ISO-8859-1 (a.k.a. Latin-1) and re-saved as UTF-8. To reverse this, decode as UTF-8 to obtain the original cp949 byte values, held in a Unicode string as code points. Latin-1 maps one-to-one onto the first 256 Unicode code points, so encoding with it gives a byte string with the same byte values. The correct codec (cp949) can then be applied to decode those bytes back to a Unicode string.
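A minimal Python 3 sketch of that round trip, assuming the damage really is "cp949 bytes read as Latin-1, then saved as UTF-8". The function name fix_mojibake and the sample string are illustrative choices, not taken from the original code. The broken input is built from known Korean text rather than pasted, so the example does not depend on invisible characters surviving copy and paste.

```python
def fix_mojibake(broken: str) -> str:
    """Undo 'cp949 bytes mis-read as Latin-1' damage on a decoded string."""
    # Code points 0-255 map back to identical byte values under Latin-1.
    raw = broken.encode("latin-1")
    # Reinterpret those bytes with the codec assumed to be the original one.
    return raw.decode("cp949")


# Simulate the damage (assumption: this mirrors what happened to the data).
korean = "파일 대화 상자"
broken = korean.encode("cp949").decode("latin-1")  # garbled Latin-1 text
utf8_bytes = broken.encode("utf-8")                # what ends up on disk

# Reverse it: UTF-8 bytes -> garbled str -> original bytes -> Korean str.
restored = fix_mojibake(utf8_bytes.decode("utf-8"))
print(restored)
```

If encode("latin-1") raises UnicodeEncodeError, the string contains characters outside Latin-1, which suggests a different intermediate codec or additional damage, and this sketch would not apply as written.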


Output:


Translation (Google Translate):


If starting from a file, read the file as UTF-8, apply the fix, and write it back as (correct) UTF-8:
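The file-based variant could look roughly like this. It is a sketch under the same assumption (cp949 bytes mis-read as Latin-1 and saved as UTF-8); repair_file and the file names are hypothetical, and the demo writes its own damaged file into a temporary directory so it can run anywhere.

```python
import tempfile
from pathlib import Path


def repair_file(src, dst, wrong="latin-1", right="cp949"):
    """Read a UTF-8 file of garbled text and write the repaired text as UTF-8."""
    text = Path(src).read_text(encoding="utf-8")
    # Recover the original byte values, then decode them with the right codec.
    fixed = text.encode(wrong).decode(right)
    Path(dst).write_text(fixed, encoding="utf-8")


# Demo with a throwaway file (hypothetical names, simulated damage).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "broken.txt"
    dst = Path(tmp) / "fixed.txt"
    garbled = "연속 샘플".encode("cp949").decode("latin-1")
    src.write_text(garbled, encoding="utf-8")

    repair_file(src, dst)
    restored_text = dst.read_text(encoding="utf-8")
    print(restored_text)
```

Writing to a separate output file keeps the damaged original intact in case the codec guess turns out to be wrong.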

User contributions licensed under: CC BY-SA