I’m working on converting portions of XHTML to JSON objects. I finally got everything in JSON form, but some UTF-8 character codes are being printed. Example:
{ "p": { "@class": "para-p", "#text": "Iu2019m not on Earth." } }
This should be:
{ "p": { "@class": "para-p", "#text": "I'm not on Earth." } }
This is just one example of UTF-8 codes coming through. How can I got through the string and replace every instance of a UTF-8 code with the character it represents?
Advertisement
Answer
u2019
is not a UTF-8 character, but a Unicode escape code. It’s valid JSON and when read back via json.load
will become ’
(RIGHT SINGLE QUOTATION MARK).
If you want to write the actual character, use ensure_ascii=False
to prevent escape codes from being written for non-ASCII characters:
with open('output.json','w',encoding='utf8') as f: json.dump(data, f, ensure_ascii=False, indent=2)