
Unable to send requests in the right way after replacing redirected url with original one using middleware

I’ve created a script using scrapy to fetch some fields from a webpage. The URL of the landing page and the URLs of the inner pages get redirected very often, so I created a middleware to handle the redirection. However, after coming across this post, I understood that I need to return the request in process_request() after replacing the redirected URL with the original one.

This meta={'dont_redirect': True, "handle_httpstatus_list": [301,302,307,429]} is always in place when the requests are sent from the spider.

Since not all of the requests are being redirected, I tried to replace the redirected URLs within the _retry() method.


Question: How can I send requests after replacing a redirected URL with the original one using a middleware?


Answer

Edit:

I’m putting this at the beginning of the answer because it’s a quicker one-shot solution that might work for you.

Scrapy 2.5 introduced get_retry_request, which allows you to retry requests from a spider callback.

From the docs:

Returns a new Request object to retry the specified request, or None if retries of the specified request have been exhausted.

So you could do something like:


But then again, you should make sure you only retry on 3xx status codes when the website uses them to signal some non-permanent incident, like redirecting to a maintenance page. As for status 429, see my recommendation below about using a delay.

Edit 2:

On Twisted versions older than 21.7.0, the coroutine async_sleep implementation using deferLater probably won’t work. Use this instead:


Original answer:

If I understood it correctly, you just want to retry the original request whenever a redirection occurs, right?

In that case, you can force a retry on requests that would otherwise be redirected, by using this RedirectMiddleware:


However, retrying on every occurrence of these status codes may lead to other problems, so you might want to add an additional condition to the if, like checking for the existence of some header that could indicate site maintenance or something like that.

While we are at it, since you included status code 429 in your list, I assume you may be getting some “Too Many Requests” responses. You should probably make your spider wait some time before retrying in this specific case. That can be achieved with the following RetryMiddleware:


Don’t forget to tell Scrapy to use these middlewares by editing DOWNLOADER_MIDDLEWARES in your project’s settings.py:

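For example, assuming the two custom middlewares live in myproject/middlewares.py (the project name, class names, and priority numbers are placeholders):

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # your custom middlewares; the numbers control ordering in the chain
    'myproject.middlewares.CustomRedirectMiddleware': 543,
    'myproject.middlewares.TooManyRequestsRetryMiddleware': 544,
    # disable the built-in middlewares these replace, so the same
    # responses are not redirected or retried twice
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}
```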
User contributions licensed under: CC BY-SA