
Why is Scrapy not following all rules / running all callbacks?

I have two spiders inheriting from a parent spider class as follows:

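(Class names and LinkExtractor patterns below are illustrative placeholders rather than the exact originals, but the structure is the same.)

```
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class TournamentSpiderBase(CrawlSpider):
    # Parent spider holding the callbacks shared by both children.
    start_urls = ["https://www.oddsportal.com/tennis/argentina/atp-buenos-aires/results/"]

    def parse_tournament(self, response):
        self.logger.info(f"Parsing tournament - {response.url}")

    def parse_tournament_page(self, response):
        self.logger.info(f"Parsing tournament page - {response.url}")


class SpiderOpTestSingleRule(TournamentSpiderBase):
    # First spider: one Rule, whose parse_tournament_page callback fires as expected.
    name = "test_single_rule"
    rules = (
        Rule(LinkExtractor(allow="/page/"), callback="parse_tournament_page"),
    )


class SpiderOpTest(TournamentSpiderBase):
    # Second spider: only the first Rule's parse_tournament callback ever runs,
    # even though the second Rule matches the same paginated links.
    name = "test_two_rules"
    rules = (
        Rule(LinkExtractor(allow="/results/"), callback="parse_tournament", follow=True),
        Rule(LinkExtractor(allow="/page/"), callback="parse_tournament_page"),
    )
```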

The parse_tournament_page callback for the Rule in the first spider works fine.

However, the second spider only runs the parse_tournament callback from the first Rule, despite the fact that the second Rule is the same as the one in the first spider and is operating on the same page.

I’m clearly missing something really simple but for the life of me I can’t figure out what it is…

As key bits of the pages load via JavaScript, it might be useful for me to include the Selenium middleware I'm using:

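(Sketched here with placeholder browser options; the details of the real middleware may differ.)

```
from scrapy.http import HtmlResponse
from selenium import webdriver


class SeleniumMiddleware:
    # Downloader middleware that renders JavaScript-heavy pages in a headless
    # browser and hands the rendered HTML back to Scrapy.

    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(options=options)

    def process_request(self, request, spider):
        self.driver.get(request.url)
        # Returning a Response here short-circuits Scrapy's own download,
        # so the spider callbacks receive the Selenium-rendered page.
        return HtmlResponse(
            url=self.driver.current_url,
            body=self.driver.page_source,
            encoding="utf-8",
            request=request,
        )


# Enabled via DOWNLOADER_MIDDLEWARES in settings/custom_settings, e.g.
# {"my_project.middlewares.SeleniumMiddleware": 543}
```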

Edit:

So I’ve managed to create a third spider which is able to execute the parse_tournament_page callback from inside parse_tournament:

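(Again, the names and the hard-coded URL are just illustrative.)

```
import scrapy


class SpiderOpTestManual(scrapy.Spider):
    name = "test_manual"
    start_urls = ["https://www.oddsportal.com/tennis/argentina/atp-buenos-aires/results/"]

    def parse(self, response):
        yield from self.parse_tournament(response)

    def parse_tournament(self, response):
        self.logger.info(f"Parsing tournament - {response.url}")
        # Request the second results page by hand; dont_filter=True is what
        # lets this request through to parse_tournament_page.
        yield scrapy.Request(
            response.url + "#/page/2/",
            callback=self.parse_tournament_page,
            dont_filter=True,
        )

    def parse_tournament_page(self, response):
        self.logger.info(f"Parsing tournament page - {response.url}")
```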

The key here seems to be dont_filter=True – if this is left as the default False then the parse_tournament_page callback isn't executed. This suggests Scrapy is somehow interpreting the second page as a duplicate, which, as far as I can tell, it isn't. That aside, from what I've read, if I want to get around this then I need to add unique=False to the LinkExtractor. However, doing this doesn't result in the parse_tournament_page callback executing :(


Update:

So I think I’ve found the source of the issue. From what I can tell, the request_fingerprint method of RFPDupeFilter creates the same hash for https://www.oddsportal.com/tennis/argentina/atp-buenos-aires/results/ as it does for https://www.oddsportal.com/tennis/argentina/atp-buenos-aires/results/#/page/2/.
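This is easy to reproduce outside the spider, for example in a plain Python session:

```
from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

r1 = Request("https://www.oddsportal.com/tennis/argentina/atp-buenos-aires/results/")
r2 = Request("https://www.oddsportal.com/tennis/argentina/atp-buenos-aires/results/#/page/2/")

# Both fingerprints come out identical because the fragment is stripped by default.
print(request_fingerprint(r1) == request_fingerprint(r2))  # True
```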

From reading around, it seems I need to subclass RFPDupeFilter to reconfigure the way request_fingerprint works. Any advice on why the same hashes are being generated and/or tips on how to do the subclassing correctly would be greatly appreciated!


Answer

The difference between the two URLs mentioned in the update is the fragment #/page/2/. Scrapy ignores fragments by default: "Also, servers usually ignore fragments in urls when handling requests, so they are also ignored by default when calculating the fingerprint. If you want to include them, set the keep_fragments argument to True (for instance when handling requests with a headless browser)." (from scrapy/utils/request.py)

Check the DUPEFILTER_CLASS setting for more information.

The request_fingerprint function from scrapy.utils.request can already handle fragments. When subclassing, pass keep_fragments=True.
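A minimal subclass along these lines should do it (the module and class names are placeholders):

```
from scrapy.dupefilters import RFPDupeFilter
from scrapy.utils.request import request_fingerprint


class FragmentAwareDupeFilter(RFPDupeFilter):
    def request_fingerprint(self, request):
        # keep_fragments=True makes .../results/ and .../results/#/page/2/
        # fingerprint differently instead of being treated as duplicates.
        return request_fingerprint(request, keep_fragments=True)
```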

Add your class in the custom_settings of SpiderOpTest.
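For example (the dotted path is a placeholder for wherever the filter class lives):

```
from scrapy.spiders import CrawlSpider


class SpiderOpTest(CrawlSpider):
    name = "test_two_rules"
    custom_settings = {
        "DUPEFILTER_CLASS": "my_project.dupefilters.FragmentAwareDupeFilter",
    }
    # ... rules and the parse_tournament / parse_tournament_page callbacks as before
```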

User contributions licensed under: CC BY-SA