The Problem
I’m working on cleaning up some old Korean code, and there are some sections of code that used to be Korean that I would like to translate to English. However, there seems to have been an encoding issue, and the text is no longer Korean. Instead, it’s a garbled mess.
I would like to go from the broken string to an English translation.
My plan is to start with the broken string, encode it to binary using the codec that was used to decode the broken string on my computer, decode that binary to Korean using a Korean codec, and google translate that Korean into English. The issue is I have no idea how to decode this mess into readable Korean.
What I’ve tried
I started writing some Python 3 code to work on translating this, but I keep getting hit with encoding errors, and honestly, I don’t know where to start. This code was written with the assumption that the Korean used the cp949 codec, which I don’t know for sure.
fileIn = open('Broken_Korean.txt', 'r', encoding='cp949')
fileOut = open('Fixed_Korean.txt', 'w')
Lines = fileIn.readlines()
for line in Lines:
    fileOut.write(str(line.encode('cp949')))
    fileOut.write('\n')
    fileOut.write(line.encode('cp949').decode('utf-8'))
I’ve also researched this issue, but I haven’t found anything groundbreaking. I believe the codec used to display the broken strings is UTF-8, but I could be mistaken. I don’t know how the original Korean was written, except that it was written using a “multi-byte encoding scheme (MBCS)”. For context, the program this was written in is LabVIEW 2015. Presumably, they used a Korean version when they wrote the initial code.
Some examples of the broken strings:
ÆÄÀÏ ´ëÈ »óÀÚ5
ÆÄÀÏ ´ëÈ »óÀÚ6
ÆÄÀÏ ´ëÈ »óÀÚ
Luckily, some of the encoding errors happened on enums, so I was able to find the English translation. Using that translation, I can guess what the Korean might have been, but I’m not certain. I think this might help me deduce the codecs used, but I don’t know how to do it.
À¯ÇÑ »ùÇÃ
= Finite Samples > 유한 샘플
¿¬¼Ó »ùÇÃ
= Continuous Samples > 연속 샘플
ÇÏµå¿þ¾î Å¸ÀÌ¹Ö¿¡ ÀÇÇÑ ´ÜÀÏ Æ÷ÀÎÆ®
= Hardware Timed Single Point > 하드웨어 타이밍 단일 포인트
Any help on working with encoding or tips on how to solve this would be greatly appreciated!! I’m very lost right now.
Edit: Here is a hex dump of some of the broken strings:
Broken_Korean.txt
ÆÄÀÏ ´ëÈ »óÀÚ5
ÆÄÀÏ ´ëÈ »óÀÚ6
ÆÄÀÏ ´ëÈ »óÀÚ
À¯ÇÑ »ùÇÃ
¿¬¼Ó »ùÇÃ
ÇÏµå¿þ¾î Å¸ÀÌ¹Ö¿¡ ÀÇÇÑ ´ÜÀÏ Æ÷ÀÎÆ®
hexdump -C Broken_Korean.txt
000000 c3 86 c3 84 c3 80 c3 8f 20 c2 b4 c3 ab c3 88 c2  ........ .......
000010 ad 20 c2 bb c3 b3 c3 80 c3 9a 35 0d 0a c3 86 c3  . ........5.....
000020 84 c3 80 c3 8f 20 c2 b4 c3 ab c3 88 c2 ad 20 c2  ..... ........ .
000030 bb c3 b3 c3 80 c3 9a 36 0d 0a c3 86 c3 84 c3 80  .......6........
000040 c3 8f 20 c2 b4 c3 ab c3 88 c2 ad 20 c2 bb c3 b3  .. ........ ....
000050 c3 80 c3 9a 0d 0a c3 80 c2 af c3 87 c3 91 20 c2  .............. .
000060 bb c3 b9 c3 87 c3 83 0d 0a c2 bf c2 ac c2 bc c3  ................
000070 93 20 c2 bb c3 b9 c3 87 c3 83 0d 0a c3 87 c3 8f  . ..............
000080 c2 b5 c3 a5 c2 bf c3 be c2 be c3 ae 20 c3 85 c2  ............ ...
000090 b8 c3 80 c3 8c c2 b9 c3 96 c2 bf c2 a1 20 c3 80  ............. ..
0000a0 c3 87 c3 87 c3 91 20 c2 b4 c3 9c c3 80 c3 8f 20  ...... ........ 
0000b0 c3 86 c3 b7 c3 80 c3 8e c3 86 c2 ae              ............
Answer
The data in the hexdump was likely read as ISO-8859-1 (a.k.a. Latin-1) and re-saved as UTF-8. To reverse this, decode the bytes as UTF-8 to obtain the original cp949 byte values, held in a Unicode string as code points. The latin1 codec maps the first 256 Unicode code points one-to-one onto byte values, so encoding with it gives a byte string with those same byte values. The correct codec (cp949) can then be applied to decode back to a Unicode string:
data = bytes.fromhex('''
c3 86 c3 84 c3 80 c3 8f 20 c2 b4 c3 ab c3 88 c2
ad 20 c2 bb c3 b3 c3 80 c3 9a 35 0d 0a c3 86 c3
84 c3 80 c3 8f 20 c2 b4 c3 ab c3 88 c2 ad 20 c2
bb c3 b3 c3 80 c3 9a 36 0d 0a c3 86 c3 84 c3 80
c3 8f 20 c2 b4 c3 ab c3 88 c2 ad 20 c2 bb c3 b3
c3 80 c3 9a 0d 0a c3 80 c2 af c3 87 c3 91 20 c2
bb c3 b9 c3 87 c3 83 0d 0a c2 bf c2 ac c2 bc c3
93 20 c2 bb c3 b9 c3 87 c3 83 0d 0a c3 87 c3 8f
c2 b5 c3 a5 c2 bf c3 be c2 be c3 ae 20 c3 85 c2
b8 c3 80 c3 8c c2 b9 c3 96 c2 bf c2 a1 20 c3 80
c3 87 c3 87 c3 91 20 c2 b4 c3 9c c3 80 c3 8f 20
c3 86 c3 b7 c3 80 c3 8e c3 86 c2 ae
''')
# UTF-8 -> Latin-1 code points -> original byte values -> Korean text
fixed = data.decode('utf8').encode('latin1').decode('cp949')
print(fixed)
Output:
파일 대화 상자5
파일 대화 상자6
파일 대화 상자
유한 샘플
연속 샘플
하드웨어 타이밍에 의한 단일 포인트
Translation (Google Translate):
File Dialog 5
File Dialog 6
File dialog
Finite sample
Continuous sample
Single point by hardware timing
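The question was unsure whether cp949 is the right codec. One way to check the guess is to run the known enum pairs from the question through a few candidate codecs and see which ones reproduce the expected Korean. This is only a sketch: the candidate list and the helper name matching_codecs are illustrative assumptions, not part of the original post.

```python
# Known pairs from the question: broken string -> Korean guessed from the English enum.
known_pairs = {
    'À¯ÇÑ »ùÇÃ': '유한 샘플',   # Finite Samples
    '¿¬¼Ó »ùÇÃ': '연속 샘플',   # Continuous Samples
}

# Candidate codecs to try (an assumed, non-exhaustive list).
candidates = ['cp949', 'euc_kr', 'johab', 'utf-8']


def matching_codecs(pairs, codecs):
    """Return the codecs that turn every broken string into its expected text."""
    matches = []
    for codec in codecs:
        try:
            ok = all(
                broken.encode('latin1').decode(codec) == expected
                for broken, expected in pairs.items()
            )
        except UnicodeError:
            # The recovered bytes are not valid in this codec.
            ok = False
        if ok:
            matches.append(codec)
    return matches


print(matching_codecs(known_pairs, candidates))
```

cp949 is a superset of EUC-KR, so both are expected to match for common syllables like these. cp949 is the safer choice because it also covers syllables that EUC-KR lacks.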
If starting from a file, read the file as UTF-8, apply the fix, and write it back as (correct) UTF-8:
with open('Broken_Korean.txt', 'r', encoding='utf8') as f:
    data = f.read().encode('latin1').decode('cp949')
with open('Fixed_Korean.txt', 'w', encoding='utf8') as f:
    f.write(data)
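If the file mixes damaged strings with text that is already fine (for example, correctly encoded Korean), converting the whole file at once will raise an error on the first line that does not round-trip. A more tolerant variant, sketched below, fixes one line at a time and leaves a line untouched when the conversion fails. The helper names fix_mojibake and fix_file are hypothetical, and cp949 is still the assumed original codec.

```python
def fix_mojibake(text, codec='cp949'):
    """Undo a Latin-1 misreading of `codec` bytes; return text unchanged on failure."""
    try:
        return text.encode('latin1').decode(codec)
    except UnicodeError:
        # Either the text has characters outside Latin-1 (probably not damaged),
        # or the recovered bytes are not valid in `codec`.
        return text


def fix_file(src, dst, codec='cp949'):
    """Repair a UTF-8 file line by line and write the result as UTF-8."""
    # newline='' keeps the original line endings (the sample file uses CRLF).
    with open(src, 'r', encoding='utf8', newline='') as fin, \
         open(dst, 'w', encoding='utf8', newline='') as fout:
        for line in fin:
            fout.write(fix_mojibake(line, codec))


print(fix_mojibake('ÆÄÀÏ'))            # 파일
print(fix_mojibake('파일'))            # 파일 (already correct, left as is)
print(fix_mojibake('Finite Samples'))  # Finite Samples (ASCII is unaffected)
```

Calling fix_file('Broken_Korean.txt', 'Fixed_Korean.txt') would then do the same job as the snippet above, but skip lines it cannot convert. Note that a line left unchanged is not necessarily correct; it only means this particular repair did not apply.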