Converting Bad Text to Korean

Question

The Problem

I'm working on cleaning up some old Korean code, and there are some sections of code that used to be Korean that I would like to translate to English. However, there seems to have been an encoding issue, and the text is no longer Korean. Instead, it's a garbled mess. I would like to go from the broken

Accepted Answer

The data in the hexdump was likely read as ISO-8859-1 (a.k.a. Latin-1) and re-saved as UTF-8. To reverse this, decode the data as UTF-8 to obtain the original cp949 byte values, held in a Unicode string as Unicode code points. The latin1 codec maps the first 256 code points one-to-one onto byte values, so encoding with it gives a byte string with those same byte values. The correct codec (cp949) can then be applied to decode back to a Unicode string:

    # Hexdump of the broken file: cp949 bytes that were misread as Latin-1 and saved as UTF-8
    data = bytes.fromhex('''c3 86 c3 84 c3 80 c3 8f 20 c2 b4 c3 ab c3 88 c2
    ad 20 c2 bb c3 b3 c3 80 c3 9a 35 0d 0a c3 86 c3
    84 c3 80 c3 8f 20 c2 b4 c3 ab c3 88 c2 ad 20 c2
    bb c3 b3 c3 80 c3 9a 36 0d 0a c3 86 c3 84 c3 80
    c3 8f 20 c2 b4 c3 ab c3 88 c2 ad 20 c2 bb c3 b3
    c3 80 c3 9a 0d 0a c3 80 c2 af c3 87 c3 91 20 c2
    bb c3 b9 c3 87 c3 83 0d 0a c2 bf c2 ac c2 bc c3
    93 20 c2 bb c3 b9 c3 87 c3 83 0d 0a c3 87 c3 8f
    c2 b5 c3 a5 c2 bf c3 be c2 be c3 ae 20 c3 85 c2
    b8 c3 80 c3 8c c2 b9 c3 96 c2 bf c2 a1 20 c3 80
    c3 87 c3 87 c3 91 20 c2 b4 c3 9c c3 80 c3 8f 20
    c3 86 c3 b7 c3 80 c3 8e c3 86 c2 ae''')

    # UTF-8 -> str, str -> original bytes via latin1, original bytes -> Korean via cp949
    fixed = data.decode('utf8').encode('latin1').decode('cp949')
    print(fixed)

Output:

    파일 대화 상자5
    파일 대화 상자6
    파일 대화 상자
    유한 샘플
    연속 샘플
    하드웨어 타이밍에 의한 단일 포인트

Translation (Google Translate):

    File Dialog 5
    File Dialog 6
    File dialog
    Finite sample
    Continuous sample
    Single point by hardware timing

If starting from a file, read the file as UTF-8, apply the fix, and write it back as (correct) UTF-8:

    # Read the broken text, then undo the Latin-1 misreading
    with open('Broken_Korean.txt', 'r', encoding='utf8') as f:
        data = f.read().encode('latin1').decode('cp949')

    # Save the repaired Korean text as proper UTF-8
    with open('Fixed_Korean.txt', 'w', encoding='utf8') as f:
        f.write(data)
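To check the diagnosis before touching real files, it can help to reproduce the damage on a known string and confirm that the same chain undoes it. The following is only a minimal sketch: the helper names `make_mojibake` and `fix_mojibake` are hypothetical, and it assumes the wrong codec really was Latin-1 and the original encoding really was cp949, as in the answer above.

```python
def make_mojibake(text, real="cp949", wrong="latin1"):
    """Simulate the damage: encode with the real codec, misread with the wrong one."""
    return text.encode(real).decode(wrong)


def fix_mojibake(text, wrong="latin1", real="cp949"):
    """Reverse the damage, or return None if the text does not fit this pattern."""
    try:
        return text.encode(wrong).decode(real)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has code points above U+00FF (so it was not produced
        # by a Latin-1 misreading), or the recovered bytes are not valid cp949.
        return None


original = "파일 대화 상자"
broken = make_mojibake(original)

# The round trip restores the Korean text exactly.
assert fix_mojibake(broken) == original

# Text that is already correct Korean cannot be encoded as Latin-1,
# so the helper reports that there is nothing to repair.
assert fix_mojibake(original) is None

print(fix_mojibake(broken))
```

Returning None instead of raising makes it easier to run the helper over many strings and repair only those that match the pattern. If the helper returns None for text that is visibly garbled, the wrong codec may have been something other than Latin-1 (Windows-1252 is a common candidate), which this sketch does not handle.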
