Unicode decode mismatch on emojis when using json loads

I have a list of UTF-8 encoded bytes objects such as:

test = [b'{"abc\xf0\x9f\x94\xa5\xf0\x9f\x91\xbd\xf0\x9f\xa7\x83": 123}',
 b'{"abc\xf0\x9f\xa7\x83": 234}']

and decode them as follows:

import json
result = list(map(lambda x: json.loads(x.decode('utf-8','ignore')),test))

I notice that some emojis are not converted as expected, as shown below:

[{'abc🔥👽\U0001f9c3': 123}, {'abc\U0001f9c3': 234}]

However, when I decode an individual string, I get the expected output:

print(b"abc\xf0\x9f\x94\xa5\xf0\x9f\x91\xbd\xf0\x9f\xa7\x83".decode('utf-8'))
abc🔥👽🧃

I’m not sure why the first approach using json.loads gives an unexpected output. Can someone provide any pointers?

Answer

After json.loads() you are printing a list. A list display uses the debug representation of each string (repr()), which consults the Unicode tables to decide whether a code point is printable. If the code point is unknown to those tables, it is shown as an escape code. Print a string directly to see its “user-friendly” representation (str()), which has no escape codes.
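As a rough sketch of what that means for the data in the question, the decoded keys already contain the real characters, and only the list display escapes them. The helper names lines_for_print and as_json_text below are hypothetical, introduced only for illustration. The first builds text from the strings themselves (so str() applies), and the second uses json.dumps with ensure_ascii=False, which leaves non-ASCII characters unescaped:

```python
import json

# Same payloads as in the question, written with explicit byte escapes.
test = [
    b'{"abc\xf0\x9f\x94\xa5\xf0\x9f\x91\xbd\xf0\x9f\xa7\x83": 123}',
    b'{"abc\xf0\x9f\xa7\x83": 234}',
]

result = [json.loads(raw.decode('utf-8')) for raw in test]


def lines_for_print(items):
    """Hypothetical helper: format each key/value pair from the strings
    themselves, so no repr() escaping is involved."""
    return '\n'.join(f'{key}: {value}'
                     for item in items
                     for key, value in item.items())


def as_json_text(items):
    """Hypothetical helper: serialize back to JSON text without turning
    non-ASCII characters into escape sequences."""
    return json.dumps(items, ensure_ascii=False)


lines = lines_for_print(result)
text = as_json_text(result)
```

Passing either lines or text to print() should show the emoji on any Python 3 version, provided the terminal and font support them.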

U+1F9C3 BEVERAGE BOX was added in Unicode 12.0. Python 3.7 uses the Unicode 11.0 definitions, which is why you see an escape code for it. Python 3.8 uses Unicode 12.1, and its updated tables mark the character as printable. If your terminal supports the character and an appropriate font is used, it will display.
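To check this on your own interpreter, something like the following should work. It is a minimal sketch that relies on unicodedata.unidata_version and str.isprintable(), and on the documented link between isprintable() and repr() escaping. The variable names are only illustrative:

```python
import sys
import unicodedata

ch = '\U0001f9c3'  # U+1F9C3 BEVERAGE BOX

# Version of the Unicode database bundled with this interpreter.
print(sys.version_info[:2], unicodedata.unidata_version)

# repr() escapes a non-ASCII character when the bundled tables do not
# consider it printable; otherwise it is shown as-is.
printable = ch.isprintable()
escaped_in_repr = repr(ch) == "'\\U0001f9c3'"
print('printable:', printable, 'escaped in repr:', escaped_in_repr)
```

On an interpreter whose tables predate Unicode 12.0 the second line should report the character as not printable and escaped; on newer ones, the opposite.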

For example, I’m using Python 3.10 below, which supports Unicode 13.0. U+1F978 is defined in Unicode 13.0, but U+1F979 was added in Unicode 14.0. Your browser may or may not display the actual emoji, depending on its Unicode support and the font used (Chrome 99 didn’t). If it does not, a replacement character is shown instead. Either way, this demonstrates the difference between the repr() display of a string and the str() used by print:

>>> s = '\U0001f978\U0001f979'
>>> s                      # The REPL shows the repr (debug) representation
'🥸\U0001f979'
>>> print(repr(s))         # forcing print to use the repr as well.
'🥸\U0001f979'
>>> [s]                    # repr() is also used for list content.
['🥸\U0001f979']
>>> print(s)               # no escape codes here.
🥸🥹
>>> print(ascii(s))        # forcing all non-ASCII to escape codes
'\U0001f978\U0001f979'
User contributions licensed under: CC BY-SA