How to encode a webscraped image link in UTF-8 to ASCII but still have a functional link?

I’m trying to webscrape a link to an image to use it in my Kivy app. The problem is that the image address contains Polish characters (ę, ł, ó, ą) and I get this error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 36-37: ordinal not in range(128)

Full error traceback:

Traceback (most recent call last):
  File "F:\Kivy\lib\site-packages\kivy\loader.py", line 342, in _load_urllib
    fd = opener.open(request)
  File "c:\users\user\appdata\local\programs\python\python36\lib\urllib\request.py", line 526, in open
    response = self._open(req, data)
  File "c:\users\user\appdata\local\programs\python\python36\lib\urllib\request.py", line 544, in _open
    '_open', req)
  File "c:\users\user\appdata\local\programs\python\python36\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "c:\users\user\appdata\local\programs\python\python36\lib\urllib\request.py", line 1361, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "c:\users\user\appdata\local\programs\python\python36\lib\urllib\request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "c:\users\user\appdata\local\programs\python\python36\lib\http\client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "c:\users\user\appdata\local\programs\python\python36\lib\http\client.py", line 1250, in _send_request
    self.putrequest(method, url, **skips)
  File "c:\users\user\appdata\local\programs\python\python36\lib\http\client.py", line 1117, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\u0142' in position 36: ordinal not in range(128)
[INFO   ] [GL          ] NPOT texture support is available
[INFO   ] [WindowSDL   ] exiting mainloop and closing.
[INFO   ] [Base        ] Leaving application in progress...

Process finished with exit code 0

Here is an example that shows what I mean. One picture loads normally, without errors; the second one raises the UnicodeEncodeError and shows up as a black rectangle.

from kivy.app import App
from kivy.lang import Builder

build_structure = """
Screen:
    BoxLayout:
        AsyncImage:
            # This doesn't load because the URL contains non-ASCII characters;
            # it outputs the error above, but it doesn't break the app.

            source: app.link_to_image_bad
        AsyncImage:
            # This one does load
            source: app.link_to_image_good
"""


class ImageApp(App):
    # This link has Polish characters in it, so it will give the UnicodeEncodeError
    link_to_image_bad = "https://nowa.1lo.gorzow.pl/wp-content/uploads/2020/11/Szkoła-do-hymnu.png"

    link_to_image_good = "https://nowa.1lo.gorzow.pl/wp-content/uploads/2020/11/Olimpiada-statystyczna.png"

    def build(self):
        return Builder.load_string(build_structure)


if __name__ == '__main__':
    ImageApp().run()

Output of the code above:

[image: output of the code]

Is there a way to avoid this error and still have a functional link?

Answer

A URL should already be ASCII-compatible. HTTP traffic allows only ASCII URLs (with some additional restrictions). Browsers now tend to display URLs unescaped, hiding the %20 and other %xx sequences that used to be visible in part of the URL. Note that there are two layers of encoding here: the text is first encoded as UTF-8, and URL escaping is then applied on top of that.
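To make the two layers concrete, here is a small sketch for the single character ł taken from the URL in the question. The manual percent-formatting is only for illustration; quote() from urllib.parse does both steps in one call.

```python
from urllib.parse import quote

char = "ł"

# Layer 1: the character is encoded to bytes with UTF-8.
utf8_bytes = char.encode("utf-8")

# Layer 2: each byte is written as %XX, so only ASCII characters remain.
escaped = "".join(f"%{byte:02X}" for byte in utf8_bytes)

print(utf8_bytes)   # b'\xc5\x82'
print(escaped)      # %C5%82
print(quote(char))  # %C5%82 (quote() applies both layers at once)
```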

You should escape the URL; see the URL quoting functions in urllib.parse. I would use quote() and unquote(). The comments also mention quote_plus(), but it additionally turns spaces into +, which is sometimes useful but can change the meaning of the original data.
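A short sketch of the difference, using a hypothetical file name with spaces (the name is made up for illustration and is not from the question):

```python
from urllib.parse import quote, quote_plus, unquote

name = "Szkoła do hymnu.png"  # hypothetical file name containing spaces

quoted = quote(name)            # spaces become %20
plus_quoted = quote_plus(name)  # spaces become +

print(quoted)       # Szko%C5%82a%20do%20hymnu.png
print(plus_quoted)  # Szko%C5%82a+do+hymnu.png

# unquote() reverses quote(), but it leaves '+' untouched,
# so the quote_plus() form does not round-trip through it.
print(unquote(quoted) == name)       # True
print(unquote(plus_quoted) == name)  # False
```

The + form is intended for query strings (form data); inside a path, a + is generally treated as a literal plus sign, which is why quote() is the safer choice for an image path.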

EDIT:

OK, I see there are problems. There seems to be something odd in how Kivy handles URLs. quote() is meant only for the path part, not for the scheme and host at the start of the URL.

As a hack (it doesn’t work if the URL has an explicit port, because it will also quote the : in front of the port number):

import urllib.parse

url = 'https://nowa.1lo.gorzow.pl/wp-content/uploads/2020/11/Szkoła-do-hymnu.png'
# Split off the scheme so that only the part after '//' is quoted
url_split = url.split('//', 1)
'//'.join([url_split[0], urllib.parse.quote(url_split[1])])

This gives the desired result, 'https://nowa.1lo.gorzow.pl/wp-content/uploads/2020/11/Szko%C5%82a-do-hymnu.png', which is the form browsers use.

You may want to wrap this in your own function (and perhaps check whether there is a port number, so that it is excluded from quoting).
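One possible way to write such a function is to let urllib.parse split the URL, so the scheme, host and port are never passed to quote(). This is only a sketch: the name quote_url_path is made up here, it assumes the path is not already percent-encoded (an existing % would be escaped again), and it does not handle non-ASCII host names or query strings.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_url_path(url):
    """Percent-encode only the path component of a URL (hypothetical helper)."""
    parts = urlsplit(url)
    # The scheme and netloc (host and optional port) are left as they are.
    return urlunsplit(parts._replace(path=quote(parts.path)))


url = 'https://nowa.1lo.gorzow.pl/wp-content/uploads/2020/11/Szkoła-do-hymnu.png'
print(quote_url_path(url))
# https://nowa.1lo.gorzow.pl/wp-content/uploads/2020/11/Szko%C5%82a-do-hymnu.png

# A made-up URL with an explicit port: the ':' before the port is preserved.
print(quote_url_path('https://example.com:8080/zdjęcie.png'))
# https://example.com:8080/zdj%C4%99cie.png
```

In the app from the question, the result could then be assigned to link_to_image_bad before the AsyncImage reads it.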

But wait, maybe someone has a proper solution for Kivy. I never use fully qualified URLs (with protocol and domain), so for me a plain quote() is enough.