
Python 2.7: convert special characters into UTF-8 bytes

I have strings that I need to substitute into a URL to access different JSON files. My problem is that some strings contain special characters, and I need these percent-encoded as UTF-8 bytes so that I can find the right JSON tables.

An example:

# I have this string
a = 'code - Brasilândia'

# in the JSON URL it appears as
'code%20-%20Brasil%C3%A2ndia'

I managed to get the spaces converted right using urllib.quote(), but it does not convert the special characters as I need them.

print(urllib.quote('code - Brasilândia'))
'code%20-%20Brasil%83ndia'

When I substitute this in the URL, I cannot reach the JSON table. I managed to make this work using u before the string, u'code - Brasilândia', but this did not solve my issue, because the string will ultimately be a user input, and will need to be constantly changed. I have tried several methods, but I could not get the result I need.

I’m specifically using Python 2.7 for this project, and I cannot change it.

Any ideas?


Answer

You could try decoding the string as UTF-8, and if it fails, assume that it’s Latin-1, or whichever 8-bit encoding you expect.

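import urllib

try:
    # Valid UTF-8 already: keep the byte string unchanged.
    yourstring.decode('utf-8')
except UnicodeDecodeError:
    # Not UTF-8: treat it as Latin-1 (or your actual 8-bit encoding) and re-encode.
    yourstring = yourstring.decode('latin-1').encode('utf-8')
print(urllib.quote(yourstring))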

… provided you can establish the correct encoding; 0x83 seems to correspond to â only in some fairly obscure legacy encodings like code pages 437 and 850 (and those are the least obscure). See also https://tripleee.github.io/8bit/#83 (disclosure: the linked site is mine).
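One way to check the claim about 0x83 is to decode that single byte under a few candidate codecs and see which ones produce â. The following is only a rough sketch: the helper name codecs_decoding_to and the candidate list are made up for illustration, and the list is far from exhaustive.

```python
def codecs_decoding_to(raw_byte, expected_char, candidates):
    """Return the codec names under which raw_byte decodes to expected_char.

    Hypothetical helper, not part of any library.
    """
    matches = []
    for name in candidates:
        try:
            decoded = raw_byte.decode(name)
        except UnicodeDecodeError:
            continue  # the byte is invalid or undefined in this codec
        if decoded == expected_char:
            matches.append(name)
    return matches


# A small, arbitrary selection of encodings to probe.
candidates = ['utf-8', 'latin-1', 'cp1252', 'cp437', 'cp850']

# u'\xe2' is U+00E2, LATIN SMALL LETTER A WITH CIRCUMFLEX.
print(codecs_decoding_to(b'\x83', u'\xe2', candidates))
```

With this candidate list, only cp437 and cp850 should match, which agrees with the observation above.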

Demo: https://ideone.com/fjX15c
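The same idea can be wrapped in a small function so the fallback encoding is a parameter. This is a sketch under assumptions, not the answer's exact code: the name to_utf8_quoted is invented here, and cp850 is chosen as the default fallback only because it maps 0x83 to â; your input may well use something else. The try/except import is there only so the sketch also runs outside Python 2.7.

```python
try:
    from urllib import quote        # Python 2.7
except ImportError:
    from urllib.parse import quote  # Python 3, so the sketch runs there too


def to_utf8_quoted(raw, fallback_encoding='cp850'):
    """Percent-encode a byte string as UTF-8 (hypothetical helper).

    If raw is not valid UTF-8, it is assumed to be in fallback_encoding.
    """
    try:
        text = raw.decode('utf-8')
    except UnicodeDecodeError:
        text = raw.decode(fallback_encoding)
    # quote() receives UTF-8 bytes, so non-ASCII characters become %XX pairs.
    return quote(text.encode('utf-8'))


# The byte string from the question, with a-circumflex stored as 0x83.
print(to_utf8_quoted(b'code - Brasil\x83ndia'))
```

For the question's input this should print code%20-%20Brasil%C3%A2ndia. Input that is already UTF-8 passes through the first branch and gives the same result.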

User contributions licensed under: CC BY-SA