
Scrapy: populate items with item loaders over multiple pages

I’m trying to crawl and scrape multiple pages from a list of URLs. I’m testing with Wikipedia, and to keep things simple I used the same XPath selector for every page. Eventually I want a different set of XPath selectors for each page, so each page has its own separate parsePage method.

This code works perfectly when I populate items directly, without item loaders. When I use item loaders, the items are populated strangely: it seems to completely ignore the callbacks assigned in the parse method and to use only the start_urls in the parsePage methods.


Here’s the result when I don’t use item loaders:


Here’s a bit of the log with item loaders:


What exactly is going wrong? Thanks!


Answer

One issue is that you’re passing references to the same item loader instance into multiple callbacks: parse contains two yield request instructions, and both hand over the same loader.

Also, in the follow-up callbacks the loader is still bound to the old response object. In parsePage1, for example, the item loader is still operating on the response from parse.

In most cases, passing item loaders to another callback is not recommended. You will probably find it easier to pass item objects directly.

Here’s a short (and incomplete) example, by editing your code:

User contributions licensed under: CC BY-SA