I have strings that I need to substitute into a URL to access different JSON files. My problem is that some strings contain special characters, and I need these percent-encoded as UTF-8 bytes so that I can find the JSON tables.
An example:
# I have this string
a = 'code - Brasilândia'
# in the JSON URL it appears as 'code%20-%20Brasil%C3%A2ndia'
I managed to get the spaces converted correctly using urllib.quote(), but it does not convert the special characters the way I need.
print(urllib.quote('code - Brasilândia'))
'code%20-%20Brasil%83ndia'
When I substitute this in the URL, I cannot reach the JSON table.
I managed to make this work by putting u before the string, u'code - Brasilândia', but this did not solve my issue, because the string will ultimately be user input and will change constantly.
I have tried several methods, but I could not get the result I need.
I’m specifically using Python 2.7 for this project, and I cannot change it.
Any ideas?
Answer
You could try decoding the string as UTF-8, and if it fails, assume that it’s Latin-1, or whichever 8-bit encoding you expect.
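import urllib

try:
    # If this succeeds, the bytes are already valid UTF-8 and are left untouched
    yourstring.decode('utf-8')
except UnicodeDecodeError:
    # Otherwise assume Latin-1 and re-encode the text as UTF-8
    yourstring = yourstring.decode('latin-1').encode('utf-8')
print(urllib.quote(yourstring))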
… provided you can establish the correct encoding; 0x83 seems to correspond to â only in some fairly obscure legacy encodings like code pages 437 and 850 (and those are the least obscure). See also https://tripleee.github.io/8bit/#83 (disclosure: the linked site is mine).
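As a rough sketch of the same idea wrapped in a reusable function: the name to_utf8_quoted and the fallback_encoding parameter are made up for illustration, and the choice of cp850 as the fallback is only an assumption based on the %83 in the question's output. The import fallback is there so the snippet also runs on Python 3, where quote accepts bytes.

```python
try:
    from urllib import quote  # Python 2.7
except ImportError:
    from urllib.parse import quote  # Python 3

def to_utf8_quoted(raw, fallback_encoding='cp850'):
    """Percent-encode a byte string as UTF-8 (hypothetical helper).

    raw is a byte string (str on Python 2, bytes on Python 3).
    fallback_encoding is a guess at the legacy encoding of the input;
    cp850 is assumed here only because 0x83 maps to 'â' in it.
    """
    try:
        # Valid UTF-8 already: keep the bytes as they are
        raw.decode('utf-8')
        utf8_bytes = raw
    except UnicodeDecodeError:
        # Not UTF-8: reinterpret with the fallback, then re-encode as UTF-8
        utf8_bytes = raw.decode(fallback_encoding).encode('utf-8')
    return quote(utf8_bytes)

# The byte 0x83 is what showed up as %83 in the question
print(to_utf8_quoted(b'code - Brasil\x83ndia'))
```

If the guess about the input encoding is wrong, the result will be a valid URL that names the wrong character, so it is worth confirming the terminal or form encoding before hard-coding a fallback.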