UTF-8 decoding doesn’t decode special characters in Python

Hi, I have the following data (abstracted) that comes from an API.

"Product" : "T\u00e1bua 21X40"

I’m using the following code to decode the data byte:

var = json.loads(cleanhtml(str(json.dumps(response.content.decode('utf-8')))))

cleanhtml is a regex function that I’ve created to remove HTML tags from the returned data (it’s working correctly). However, decode('utf-8') is not converting sequences like \u00e1. My expected output is:

"Product" : "Tábua 21X40"

I’ve tried to use replace("\\u00e1", "á") but with no success. How can I replace this type of character, and what type of character is this?

Answer

\u00e1 is another way of representing the á character when displaying the contents of a Python string.

If you open a Python interactive session and run print({"Product" : "T\u00e1bua 21X40"}) you’ll see output of {'Product': 'Tábua 21X40'}. The \u00e1 doesn’t exist in the string as those individual characters.

The \u escape sequence indicates that the following four hexadecimal digits specify a Unicode character.
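As a quick check of that point, here is a minimal sketch showing that \u00e1 inside a Python string literal is a single character. The variable name s is only illustrative, and the á typed in the comparison is assumed to be the precomposed character U+00E1.

```python
# In a Python string literal, \u00e1 is one character, not six.
s = "T\u00e1bua 21X40"

print(s)                    # shows the accented character, not the escape
print(len("\u00e1"))        # 1
print(hex(ord(s[1])))       # 0xe1, the code point of á
print("\u00e1" == "á")      # True when the typed á is precomposed U+00E1
```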

Attempting to replace \u00e1 with á won’t achieve anything because that’s what it already is. Additionally, replace("\\u00e1", "á") is attempting to replace the individual characters of a backslash, a u, and so on and, as mentioned, they don’t actually exist in the string in that way.
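The following sketch illustrates why the replace call has no effect, assuming the string has already been decoded as described above. The names decoded and unchanged are hypothetical and only used for illustration.

```python
# A decoded string holds the single character á, not a backslash sequence.
decoded = "T\u00e1bua 21X40"

# "\\u00e1" is six separate characters: backslash, u, 0, 0, e, 1.
print(len("\\u00e1"))           # 6
print("\\u00e1" in decoded)     # False, so there is nothing to replace

unchanged = decoded.replace("\\u00e1", "á")
print(unchanged == decoded)     # True
```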

If you explain the problem you’re encountering further then we may be able to help more, but currently it sounds like the string has the correct content but is just being displayed differently than you expect.
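One possible place where the escaped form shows up, offered only as an assumption about the asker's setup: json.dumps escapes non-ASCII characters by default, so the same data can be displayed with or without \u00e1. The record dictionary below is a made-up stand-in for the API data.

```python
import json

# Hypothetical stand-in for the data returned by the API.
record = {"Product": "T\u00e1bua 21X40"}

as_ascii = json.dumps(record)                     # default: non-ASCII is escaped
as_utf8 = json.dumps(record, ensure_ascii=False)  # keeps á as a literal character

print(as_ascii)   # the á is written as the six-character escape
print(as_utf8)    # the á is written as-is

# Both texts decode back to the same dictionary.
print(json.loads(as_ascii) == json.loads(as_utf8) == record)   # True
```

If this is the cause, the content is identical either way; only the serialized text differs.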

User contributions licensed under: CC BY-SA