Problem with detecting if link is invalid

Question

Is there any way to detect if a link is invalid using webbot? I need to tell the user that the link they provided was unreachable.

Accepted Answer

The only way to be completely sure that a URL leads to a valid page is to fetch that page and check that it works. You could try a request method other than GET (such as HEAD) to avoid wasting bandwidth downloading the page, but not all servers will respond: the only way to be absolutely sure is to GET and see what happens. Something like:

```python
import requests
from requests.exceptions import RequestException


def check_url(url):
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        r = requests.get(url, timeout=1)
        return r.status_code == 200
    except RequestException:
        # Covers connection errors, timeouts and malformed URLs alike.
        return False
```

Is this a good idea? It's only a GET request, and GET is supposed to be idempotent, so you shouldn't cause anybody any harm. On the other hand, what if a user sets up a script to add a new link every second pointing to the same website? Then you're DDoSing that website. So when you allow users to make your server do things like this, you need to think about how you might protect it. (In this case, you could keep a cache of valid links expiring every n seconds, and only look a link up if the cache doesn't hold it.)

Note that if you just want to check that the link points to a valid domain, it's a bit easier: you can just do a DNS query. (The same point about caching and avoiding abuse probably applies.)

Note that I used requests because it is easy, but you likely want to do this in the background, either with requests in a thread, or with one of the asyncio HTTP libraries and an asyncio event loop. Otherwise your code can block for as long as the request takes, which may be timeout seconds or more.

(Another attack: this gets the whole page. What if a user links to a massive page? See this question for a discussion of protecting against oversize responses. For your use case you likely just want to get a few bytes. I've deliberately not complicated the example code here because I wanted to illustrate the principle.)

Note that this just checks that something is available at that URL.
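The expiring cache suggested above could look roughly like the sketch below. The class name `LinkCache`, its parameters, and the choice to cache failed checks as well as successful ones are assumptions made for illustration, not part of the original answer. The checker is passed in as a callable (for example, a function like `check_url`), so the cache itself never touches the network.

```python
import time


class LinkCache:
    """Remember link-check results for ttl seconds (hypothetical helper)."""

    def __init__(self, checker, ttl=60.0, clock=time.monotonic):
        self._checker = checker   # callable: url -> bool, e.g. check_url
        self._ttl = ttl           # lifetime of one cached result, in seconds
        self._clock = clock       # injectable so expiry can be tested
        self._entries = {}        # url -> (result, expiry time)

    def is_valid(self, url):
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and entry[1] > now:
            return entry[0]       # fresh entry: no outbound request is made
        result = self._checker(url)
        self._entries[url] = (result, now + self._ttl)
        return result
```

With this in place, a user who submits the same link every second triggers at most one outbound request per `ttl` seconds. The dictionary grows without bound in this sketch, so a real deployment would probably also evict old entries or cap the cache size.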
What if it's one of the many dead links that redirect to a domain-name vendor's website? You could enforce 'no redirects', but then some redirects are valid. (Likewise, you could try to detect redirects up to the main domain or to a blacklist of vendors' domains, but this will always be imperfect.) There is a tradeoff here that depends on your concrete use case, but it's worth being aware of.
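As a rough illustration of that imperfect redirect detection, the sketch below compares the URL that was requested with the URL the request finally landed on. The function name `looks_like_dead_redirect`, the `blocked_hosts` parameter, and the exact rules are assumptions chosen for this example. It flags two cases: the final host is on a blocklist, or a deep link on a site ended up on that same site's front page.

```python
from urllib.parse import urlsplit


def _bare_host(url):
    """Lower-cased host name without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def looks_like_dead_redirect(requested_url, final_url, blocked_hosts=frozenset()):
    """Heuristic only: True if the redirect probably hides a dead link."""
    requested_host = _bare_host(requested_url)
    final_host = _bare_host(final_url)

    # Case 1: landed on a known domain-vendor or parking host.
    if final_host in blocked_hosts:
        return True

    # Case 2: a deep link was redirected up to the same site's front page.
    same_site = requested_host == final_host
    asked_for_page = urlsplit(requested_url).path not in ("", "/")
    landed_on_root = urlsplit(final_url).path in ("", "/")
    return same_site and asked_for_page and landed_on_root
```

With requests, the final URL after following redirects is available as `r.url` on the response, so a call might look like `looks_like_dead_redirect(url, r.url, blocked_hosts)`. A link shortener that legitimately redirects to another site's front page is not flagged, but a site that genuinely moved an article to its home page would be, which is the tradeoff described above.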

Answer