I’ve created a script using Scrapy to fetch some fields from a webpage. The URL of the landing page and the URLs of the inner pages get redirected very often, so I created a middleware to handle that redirection. However, when I came across this post, I understood that I need to return a request in process_request() after replacing the redirected URL with the original one.
This meta={'dont_redirect': True, "handle_httpstatus_list": [301, 302, 307, 429]} is always in place when the requests are sent from the spider.
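For illustration, this is roughly how the meta is attached when the requests are yielded (a simplified sketch, not the full spider):

    # inside the spider class; simplified sketch
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                callback=self.parse,
                meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302, 307, 429]},
            )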
As not all of the requests are redirected, I tried to replace the redirected URLs within the _retry() method.
def process_request(self, request, spider):
    request.headers['User-Agent'] = self.ua.random

def process_exception(self, request, exception, spider):
    return self._retry(request, spider)

def _retry(self, request, spider):
    request.dont_filter = True
    if request.meta.get('redirect_urls'):
        redirect_url = request.meta['redirect_urls'][0]
        redirected = request.replace(url=redirect_url)
        redirected.dont_filter = True
        return redirected
    return request

def process_response(self, request, response, spider):
    if response.status in [301, 302, 307, 429]:
        return self._retry(request, spider)
    return response
Question: How can I send requests after replacing the redirected URL with the original one, using a middleware?
Answer
Edit:
I’m putting this at the beginning of the answer because it’s a quicker one-shot solution that might work for you.
Scrapy 2.5 introduced get_retry_request, which allows you to retry requests from a spider callback.
From the docs:
Returns a new Request object to retry the specified request, or None if retries of the specified request have been exhausted.
So you could do something like:
from scrapy.downloadermiddlewares.retry import get_retry_request  # at module level

def parse(self, response):
    if response.status in [301, 302, 307, 429]:
        new_request_or_none = get_retry_request(
            response.request,
            spider=self,
            reason='tried to redirect',
            max_retry_times=10,
        )
        if new_request_or_none:
            yield new_request_or_none
        else:
            # exhausted all retries
            ...
But then again, you should make sure you only retry on status codes beginning with 3 if the website returns them to indicate some non-permanent incident, like redirecting to a maintenance page. As for status 429, see my recommendation below about using a delay.
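One way such a check might look, sticking with the get_retry_request approach — the 'maintenance' substring in the Location header is purely a hypothetical marker, replace it with whatever signal your site actually sends:

    def parse(self, response):
        location = response.headers.get('Location', b'').decode()
        # Hypothetical marker for a temporary redirect (e.g. maintenance page);
        # adjust to whatever the target site actually sends.
        looks_temporary = 'maintenance' in location.lower()
        if response.status in [301, 302, 307] and looks_temporary:
            new_request_or_none = get_retry_request(
                response.request,
                spider=self,
                reason='redirected to maintenance page',
            )
            if new_request_or_none:
                yield new_request_or_none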
Edit 2:
On Twisted versions older than 21.7.0, the coroutine async_sleep implementation using deferLater probably won’t work. Use this instead:
from twisted.internet import defer, reactor

async def async_sleep(delay, return_value=None):
    deferred = defer.Deferred()
    reactor.callLater(delay, deferred.callback, return_value)
    return await deferred
Original answer:
If I understood it correctly, you just want to retry the original request whenever a redirection occurs, right?
In that case, you can force a retry on requests that would otherwise be redirected by using this RedirectMiddleware:
# middlewares.py
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware


class CustomRedirectMiddleware(RedirectMiddleware):
    """
    Modifies RedirectMiddleware to set response status to 503 on redirects.
    Make sure this appears in the DOWNLOADER_MIDDLEWARES setting with a lower
    priority (higher number) than RetryMiddleware (or whatever the downloader
    middleware responsible for retrying on status 503 is called).
    """
    def process_response(self, request, response, spider):
        if response.status in (301, 302, 303, 307, 308):  # 429 already is in scrapy's default retry list
            return response.replace(status=503)  # Now this response is RetryMiddleware's problem
        return super().process_response(request, response, spider)
However, retrying on every occurrence of these status codes may lead to other problems, so you might want to add some additional condition to the if, like checking for a header that could indicate site maintenance or something like that.
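For instance, the process_response above could be narrowed to something like this (the 'maintenance' check is only a placeholder for whatever signal the site actually gives you):

    # inside CustomRedirectMiddleware
    def process_response(self, request, response, spider):
        location = response.headers.get('Location', b'').lower()
        # Only force a retry when the redirect looks temporary; the 'maintenance'
        # substring is a hypothetical marker, adjust it to your target site.
        if response.status in (301, 302, 303, 307, 308) and b'maintenance' in location:
            return response.replace(status=503)
        return super().process_response(request, response, spider)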
While we are at it, since you included status code 429 in your list, I assume you may be getting some “Too Many Requests” responses. You should probably make your spider wait some time before retrying in this specific case. That can be achieved with the following RetryMiddleware:
# middlewares.py
from twisted.internet import task, reactor

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


async def async_sleep(delay, callable=None, *args, **kw):
    return await task.deferLater(reactor, delay, callable, *args, **kw)


class TooManyRequestsRetryMiddleware(RetryMiddleware):
    """
    Modifies RetryMiddleware to delay retries on status 429.
    """
    DEFAULT_DELAY = 10  # Delay in seconds. Tune this to your needs
    MAX_DELAY = 60  # Sometimes, RETRY-AFTER has absurd values

    async def process_response(self, request, response, spider):
        """
        Like RetryMiddleware.process_response, but, if response status is 429,
        retry the request only after waiting at most self.MAX_DELAY seconds.
        Respect the Retry-After header if it's less than self.MAX_DELAY.
        If Retry-After is absent/invalid, wait only self.DEFAULT_DELAY seconds.
        """
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            if response.status == 429:
                retry_after = response.headers.get('retry-after')
                try:
                    retry_after = int(retry_after)
                except (ValueError, TypeError):
                    delay = self.DEFAULT_DELAY
                else:
                    delay = min(self.MAX_DELAY, retry_after)
                spider.logger.info(f'Retrying {request} in {delay} seconds.')
                spider.crawler.engine.pause()
                await async_sleep(delay)
                spider.crawler.engine.unpause()
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return response
Don’t forget to tell Scrapy to use these middlewares by editing DOWNLOADER_MIDDLEWARES in your project’s settings.py:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'your_project_name.middlewares.TooManyRequestsRetryMiddleware': 550,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
    'your_project_name.middlewares.CustomRedirectMiddleware': 600
}
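For reference, 550 and 600 are the default priorities of Scrapy’s built-in RetryMiddleware and RedirectMiddleware, so the custom classes simply take their places. Since process_response runs through the downloader middlewares from higher numbers to lower ones, CustomRedirectMiddleware turns the redirect into a 503 before TooManyRequestsRetryMiddleware gets to decide whether to retry it.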