Unable to send requests the right way after replacing the redirected URL with the original one using middleware

I’ve created a script using Scrapy to fetch some fields from a webpage. The URL of the landing page and the URLs of the inner pages get redirected very often, so I created a middleware to handle that redirection. However, after coming across this post, I understood that I need to return a request in process_request() after replacing the redirected URL with the original one.

This meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302, 307, 429]} is always in place when the requests are sent from the spider.
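
For context, the requests are sent roughly like this (the spider name and URL are placeholders):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'  # placeholder spider name

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/listing',  # placeholder URL
            callback=self.parse,
            meta={
                'dont_redirect': True,  # keep RedirectMiddleware from following 3xx responses
                'handle_httpstatus_list': [301, 302, 307, 429],  # let these statuses reach the callback
            },
        )

    def parse(self, response):
        ...  # extract fields here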

Since not all of the requests are redirected, I tried to replace the redirected URLs within the _retry() method.

# middlewares.py (excerpt; the class name and the fake_useragent setup are assumed,
# since only the methods were shown)
from fake_useragent import UserAgent

class CustomMiddleware:
    def __init__(self):
        self.ua = UserAgent()  # random User-Agent strings

    def process_request(self, request, spider):
        request.headers['User-Agent'] = self.ua.random

    def process_exception(self, request, exception, spider):
        return self._retry(request, spider)

    def _retry(self, request, spider):
        request.dont_filter = True
        if request.meta.get('redirect_urls'):
            # The first entry in redirect_urls is the original, pre-redirect URL
            redirect_url = request.meta['redirect_urls'][0]
            redirected = request.replace(url=redirect_url)
            redirected.dont_filter = True
            return redirected
        return request

    def process_response(self, request, response, spider):
        if response.status in [301, 302, 307, 429]:
            return self._retry(request, spider)
        return response

Question: How can I send requests after replacing the redirected URL with the original one using a middleware?


Answer

Edit:

I’m putting this at the beginning of the answer because it’s a quicker one-shot solution that might work for you.

Scrapy 2.5 introduced get_retry_request (importable from scrapy.downloadermiddlewares.retry), which allows you to retry requests from a spider callback.

From the docs:

Returns a new Request object to retry the specified request, or None if retries of the specified request have been exhausted.

So you could do something like:

from scrapy.downloadermiddlewares.retry import get_retry_request

def parse(self, response):
    if response.status in [301, 302, 307, 429]:
        new_request_or_none = get_retry_request(
            response.request,
            spider=self,
            reason='tried to redirect',
            max_retry_times=10,
        )
        if new_request_or_none:
            yield new_request_or_none
        else:
            # exhausted all retries
            ...

But then again, you should make sure you only retry on status codes beginning with 3 if the website uses them to indicate some non-permanent incident, like redirecting to a maintenance page. As for status 429, see my recommendation below about using a delay.
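
For example, a rough sketch of such a check inside the callback; the test against the Location header and the '/maintenance' path are assumptions for illustration, not something the target site is known to send:

from scrapy.downloadermiddlewares.retry import get_retry_request

def parse(self, response):
    if response.status in [301, 302, 307]:
        # Retry only if the redirect target looks like a temporary maintenance page
        # ('/maintenance' is a made-up path used purely for illustration).
        location = response.headers.get('Location', b'').decode()
        if '/maintenance' not in location:
            return  # a genuine redirect; let it be
        new_request_or_none = get_retry_request(
            response.request,
            spider=self,
            reason='redirected to maintenance page',
        )
        if new_request_or_none:
            yield new_request_or_none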

Edit 2:

On Twisted versions older than 21.7.0, the coroutine async_sleep implementation using task.deferLater (shown further down in the original answer) probably won’t work. Use this instead:

from twisted.internet import defer, reactor

async def async_sleep(delay, return_value=None):
    # Fire a Deferred after `delay` seconds and await it, without blocking the reactor
    deferred = defer.Deferred()
    reactor.callLater(delay, deferred.callback, return_value)
    return await deferred

Original answer:

If I understood it correctly, you just want to retry the original request whenever a redirection occurs, right?

In that case, you can force a retry on requests that would otherwise be redirected, by using this RedirectMiddleware:

# middlewares.py
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware

class CustomRedirectMiddleware(RedirectMiddleware):
    """
    Modifies RedirectMiddleware to set response status to 503 on redirects.
    Make sure this appears in the DOWNLOADER_MIDDLEWARES setting with a lower priority (higher number) than RetryMiddleware
    (or whatever the downloader middleware responsible for retrying on status 503 is called).
    """

    def process_response(self, request, response, spider):
        if response.status in (301, 302, 303, 307, 308):  # 429 already is in scrapy's default retry list
            return response.replace(status=503)  # Now this response is RetryMiddleware's problem

        return super().process_response(request, response, spider)

However, retrying on every occurrence of these status codes may lead to other problems, so you might want to add an additional condition to that if, such as checking for a header (or some other response detail) that indicates site maintenance.
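
Purely as an illustration, that condition could look like the variant below; the 'X-Maintenance' header is a made-up name, and you would have to check what the site actually sends:

# middlewares.py (variation of the class above)
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware

class CustomRedirectMiddleware(RedirectMiddleware):

    def process_response(self, request, response, spider):
        is_redirect = response.status in (301, 302, 303, 307, 308)
        # Force a retry only when the redirect looks temporary, e.g. the site
        # marks maintenance with a header ('X-Maintenance' is hypothetical).
        looks_temporary = response.headers.get('X-Maintenance') is not None
        if is_redirect and looks_temporary:
            return response.replace(status=503)  # hand it over to RetryMiddleware
        return super().process_response(request, response, spider)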

While we are at it: since you included status code 429 in your list, I assume you may be getting some “Too Many Requests” responses. You should probably make your spider wait some time before retrying in this specific case. That can be achieved with the following RetryMiddleware:

# middlewares.py
from twisted.internet import task, reactor
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

async def async_sleep(delay, callable=None, *args, **kw):
    return await task.deferLater(reactor, delay, callable, *args, **kw)

class TooManyRequestsRetryMiddleware(RetryMiddleware):
    """
    Modifies RetryMiddleware to delay retries on status 429.
    """

    DEFAULT_DELAY = 10  # Delay in seconds. Tune this to your needs
    MAX_DELAY = 60  # Sometimes, RETRY-AFTER has absurd values

    async def process_response(self, request, response, spider):
        """
        Like RetryMiddleware.process_response, but, if response status is 429,
        retry the request only after waiting at most self.MAX_DELAY seconds.
        Respect the Retry-After header if it's less than self.MAX_DELAY.
        If Retry-After is absent/invalid, wait only self.DEFAULT_DELAY seconds.
        """
        if request.meta.get('dont_retry', False):
            return response

        if response.status in self.retry_http_codes:
            if response.status == 429:
                retry_after = response.headers.get('retry-after')
                try:
                    retry_after = int(retry_after)
                except (ValueError, TypeError):
                    delay = self.DEFAULT_DELAY
                else:
                    delay = min(self.MAX_DELAY, retry_after)
                spider.logger.info(f'Retrying {request} in {delay} seconds.')

                spider.crawler.engine.pause()
                await async_sleep(delay)
                spider.crawler.engine.unpause()

            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response

        return response

Don’t forget to tell Scrapy to use these middlewares by editing DOWNLOADER_MIDDLEWARES in your project’s settings.py. Setting the built-in RetryMiddleware and RedirectMiddleware entries to None disables the stock versions, so only your subclasses run:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'your_project_name.middlewares.TooManyRequestsRetryMiddleware': 550,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
    'your_project_name.middlewares.CustomRedirectMiddleware': 600
}
