Unable to send requests the right way after replacing the redirected URL with the original one using middleware

I’ve created a script using Scrapy to fetch some fields from a webpage. The URL of the landing page and the URLs of the inner pages get redirected very often, so I created a middleware to handle that redirection. However, after coming across this post, I understood that I need to return a request in process_request() after replacing the redirected URL with the original one.

This meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302, 307, 429]} is always in place when the requests are sent from the spider.
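
For context, the requests are sent roughly like this (the spider name and URL are placeholders):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'  # placeholder spider name

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/listing',  # placeholder URL
            callback=self.parse,
            meta={
                'dont_redirect': True,  # keep RedirectMiddleware from following 3xx responses
                'handle_httpstatus_list': [301, 302, 307, 429],  # let these statuses reach the callback
            },
        )

    def parse(self, response):
        ...  # extract fields here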

Since not all of the requests are redirected, I tried to replace the redirected URLs within the _retry() method.

# middlewares.py (excerpt; the class name and the fake_useragent setup are assumed,
# since only the methods were shown)
from fake_useragent import UserAgent

class CustomMiddleware:
    def __init__(self):
        self.ua = UserAgent()  # random User-Agent strings

    def process_request(self, request, spider):
        request.headers['User-Agent'] = self.ua.random

    def process_exception(self, request, exception, spider):
        return self._retry(request, spider)

    def _retry(self, request, spider):
        request.dont_filter = True
        if request.meta.get('redirect_urls'):
            # The first entry in redirect_urls is the original, pre-redirect URL
            redirect_url = request.meta['redirect_urls'][0]
            redirected = request.replace(url=redirect_url)
            redirected.dont_filter = True
            return redirected
        return request

    def process_response(self, request, response, spider):
        if response.status in [301, 302, 307, 429]:
            return self._retry(request, spider)
        return response

Question: How can I send requests after replacing the redirected URL with the original one using a middleware?


Answer

Edit:

I’m putting this at the beginning of the answer because it’s a quicker one-shot solution that might work for you.

Scrapy 2.5 introduced get_retry_request (importable from scrapy.downloadermiddlewares.retry), which allows you to retry requests from a spider callback.

From the docs:

Returns a new Request object to retry the specified request, or None if retries of the specified request have been exhausted.

So you could do something like:

from scrapy.downloadermiddlewares.retry import get_retry_request

def parse(self, response):
    if response.status in [301, 302, 307, 429]:
        new_request_or_none = get_retry_request(
            response.request,
            spider=self,
            reason='tried to redirect',
            max_retry_times=10,
        )
        if new_request_or_none:
            yield new_request_or_none
        else:
            # exhausted all retries
            ...

But then again, you should make sure you only retry on status codes beginning with 3 if the website uses them to indicate some non-permanent incident, like redirecting to a maintenance page. As for status 429, see my recommendation below about using a delay.
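
For example, a rough sketch of such a check inside the callback; the test against the Location header and the '/maintenance' path are assumptions for illustration, not something the target site is known to send:

from scrapy.downloadermiddlewares.retry import get_retry_request

def parse(self, response):
    if response.status in [301, 302, 307]:
        # Retry only if the redirect target looks like a temporary maintenance page
        # ('/maintenance' is a made-up path used purely for illustration).
        location = response.headers.get('Location', b'').decode()
        if '/maintenance' not in location:
            return  # a genuine redirect; let it be
        new_request_or_none = get_retry_request(
            response.request,
            spider=self,
            reason='redirected to maintenance page',
        )
        if new_request_or_none:
            yield new_request_or_none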

Edit 2:

On Twisted versions older than 21.7.0, the coroutine async_sleep implementation using task.deferLater (shown further down in the original answer) probably won’t work. Use this instead:

from twisted.internet import defer, reactor

async def async_sleep(delay, return_value=None):
    # Fire a Deferred after `delay` seconds and await it, without blocking the reactor
    deferred = defer.Deferred()
    reactor.callLater(delay, deferred.callback, return_value)
    return await deferred

Original answer:

If I understood it correctly, you just want to retry the original request whenever a redirection occurs, right?

In that case, you can force a retry on requests that would otherwise be redirected, by using this RedirectMiddleware:

# middlewares.py
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware

class CustomRedirectMiddleware(RedirectMiddleware):
    """
    Modifies RedirectMiddleware to set response status to 503 on redirects.
    Make sure this appears in the DOWNLOADER_MIDDLEWARES setting with a lower priority (higher number) than RetryMiddleware
    (or whatever the downloader middleware responsible for retrying on status 503 is called).
    """

    def process_response(self, request, response, spider):
        if response.status in (301, 302, 303, 307, 308):  # 429 already is in scrapy's default retry list
            return response.replace(status=503)  # Now this response is RetryMiddleware's problem

        return super().process_response(request, response, spider)

However, retrying on every occurrence of these status codes may lead to other problems, so you might want to add an additional condition to that if, such as checking for a header (or some other response detail) that indicates site maintenance.
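
Purely as an illustration, that condition could look like the variant below; the 'X-Maintenance' header is a made-up name, and you would have to check what the site actually sends:

# middlewares.py (variation of the class above)
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware

class CustomRedirectMiddleware(RedirectMiddleware):

    def process_response(self, request, response, spider):
        is_redirect = response.status in (301, 302, 303, 307, 308)
        # Force a retry only when the redirect looks temporary, e.g. the site
        # marks maintenance with a header ('X-Maintenance' is hypothetical).
        looks_temporary = response.headers.get('X-Maintenance') is not None
        if is_redirect and looks_temporary:
            return response.replace(status=503)  # hand it over to RetryMiddleware
        return super().process_response(request, response, spider)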

While we are at it: since you included status code 429 in your list, I assume you may be getting some “Too Many Requests” responses. You should probably make your spider wait some time before retrying in this specific case. That can be achieved with the following RetryMiddleware:

# middlewares.py
from twisted.internet import task, reactor
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

async def async_sleep(delay, callable=None, *args, **kw):
    return await task.deferLater(reactor, delay, callable, *args, **kw)

class TooManyRequestsRetryMiddleware(RetryMiddleware):
    """
    Modifies RetryMiddleware to delay retries on status 429.
    """

    DEFAULT_DELAY = 10  # Delay in seconds. Tune this to your needs
    MAX_DELAY = 60  # Sometimes, RETRY-AFTER has absurd values

    async def process_response(self, request, response, spider):
        """
        Like RetryMiddleware.process_response, but, if response status is 429,
        retry the request only after waiting at most self.MAX_DELAY seconds.
        Respect the Retry-After header if it's less than self.MAX_DELAY.
        If Retry-After is absent/invalid, wait only self.DEFAULT_DELAY seconds.
        """
        if request.meta.get('dont_retry', False):
            return response

        if response.status in self.retry_http_codes:
            if response.status == 429:
                retry_after = response.headers.get('retry-after')
                try:
                    retry_after = int(retry_after)
                except (ValueError, TypeError):
                    delay = self.DEFAULT_DELAY
                else:
                    delay = min(self.MAX_DELAY, retry_after)
                spider.logger.info(f'Retrying {request} in {delay} seconds.')

                spider.crawler.engine.pause()
                await async_sleep(delay)
                spider.crawler.engine.unpause()

            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response

        return response

Don’t forget to tell Scrapy to use these middlewares by editing DOWNLOADER_MIDDLEWARES in your project’s settings.py. Setting the built-in RetryMiddleware and RedirectMiddleware entries to None disables the stock versions, so only your subclasses run:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'your_project_name.middlewares.TooManyRequestsRetryMiddleware': 550,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
    'your_project_name.middlewares.CustomRedirectMiddleware': 600
}
