Scrapy: populate items with item loaders over multiple pages

I’m trying to crawl and scrape multiple pages, given multiple URLs. I am testing with Wikipedia, and to keep things simple I used the same XPath selector for each page. Eventually I want to use different XPath selectors unique to each page, which is why each page has its own parsePage method.

This code works perfectly when I don’t use item loaders, and just populate items directly. When I use item loaders, the items are populated strangely, and it seems to be completely ignoring the callback assigned in the parse method and only using the start_urls for the parsePage methods.

import scrapy
from scrapy import Request
from testanother.items import TestItems, TheLoader


class tester(scrapy.Spider):
    name = 'vs'
    handle_httpstatus_list = [404, 200, 300]
    # Usually, I only get data from the first start url
    start_urls = [
        'https://en.wikipedia.org/wiki/SANZAAR',
        'https://en.wikipedia.org/wiki/2016_Rugby_Championship',
        'https://en.wikipedia.org/wiki/2016_Super_Rugby_season',
    ]

    def parse(self, response):
        # item = TestItems()
        l = TheLoader(item=TestItems(), response=response)
        # when I use an item loader, the url in the request is completely ignored.
        # without the item loader, it works properly.
        request = Request("https://en.wikipedia.org/wiki/2016_Rugby_Championship", callback=self.parsePage1, meta={'loadernext': l}, dont_filter=True)
        yield request

        request = Request("https://en.wikipedia.org/wiki/SANZAAR", callback=self.parsePage2, meta={'loadernext1': l}, dont_filter=True)
        yield request

        yield Request("https://en.wikipedia.org/wiki/2016_Super_Rugby_season", callback=self.parsePage3, meta={'loadernext2': l}, dont_filter=True)

    def parsePage1(self, response):
        loadernext = response.meta['loadernext']
        loadernext.add_xpath('title1', '//*[@id="firstHeading"]/text()')
        return loadernext.load_item()

    # I'm not sure if this return and load_item is the problem, because I've tried
    # yielding/returning to another method that does the item loading instead and
    # the first start url is still the only url scraped.
    def parsePage2(self, response):
        loadernext1 = response.meta['loadernext1']
        loadernext1.add_xpath('title2', '//*[@id="firstHeading"]/text()')
        return loadernext1.load_item()

    def parsePage3(self, response):
        loadernext2 = response.meta['loadernext2']
        loadernext2.add_xpath('title3', '//*[@id="firstHeading"]/text()')
        return loadernext2.load_item()

Here’s the result when I don’t use item loaders:

{'title1': [u'2016 Rugby Championship'],
 'title': [u'SANZAAR'],
 'title3': [u'2016 Super Rugby season']}

Here’s a bit of the log with item loaders:

{'title2': u'SANZAAR'}
2016-09-24 14:30:43 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/2016_Rugby_Championship> (referer: https://en.wikipedia.org/wiki/SANZAAR)
2016-09-24 14:30:43 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/2016_Rugby_Championship> (referer: https://en.wikipedia.org/wiki/2016_Rugby_Championship)
2016-09-24 14:30:43 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/2016_Super_Rugby_season>
{'title2': u'SANZAAR', 'title3': u'SANZAAR'}
2016-09-24 14:30:43 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/SANZAAR> (referer: https://en.wikipedia.org/wiki/2016_Rugby_Championship)
2016-09-24 14:30:43 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/2016_Rugby_Championship> (referer: https://en.wikipedia.org/wiki/2016_Super_Rugby_season)
2016-09-24 14:30:43 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/2016_Super_Rugby_season> (referer: https://en.wikipedia.org/wiki/2016_Rugby_Championship)
2016-09-24 14:30:43 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/2016_Super_Rugby_season> (referer: https://en.wikipedia.org/wiki/2016_Super_Rugby_season)
2016-09-24 14:30:43 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/2016_Rugby_Championship>
{'title1': u'SANZAAR', 'title2': u'SANZAAR', 'title3': u'SANZAAR'}
2016-09-24 14:30:43 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/2016_Rugby_Championship>
{'title1': u'2016 Rugby Championship'}
2016-09-24 14:30:43 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/SANZAAR>
{'title1': u'2016 Rugby Championship', 'title2': u'2016 Rugby Championship'}
2016-09-24 14:30:43 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/2016_Rugby_Championship>
{'title1': u'2016 Super Rugby season'}
2016-09-24 14:30:43 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/SANZAAR> (referer: https://en.wikipedia.org/wiki/2016_Super_Rugby_season)
2016-09-24 14:30:43 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/2016_Super_Rugby_season>
{'title1': u'2016 Rugby Championship',
 'title2': u'2016 Rugby Championship',
 'title3': u'2016 Rugby Championship'}
2016-09-24 14:30:43 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/2016_Super_Rugby_season>
{'title1': u'2016 Super Rugby season', 'title3': u'2016 Super Rugby season'}
2016-09-24 14:30:43 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/SANZAAR>
{'title1': u'2016 Super Rugby season',
 'title2': u'2016 Super Rugby season',
 'title3': u'2016 Super Rugby season'}
 2016-09-24 14:30:43 [scrapy] INFO: Clos

What exactly is going wrong? Thanks!

Answer

One issue is that you’re passing references to the same item loader instance into multiple callbacks: parse yields three requests, and each one carries the same loader in its meta, so every callback adds to, and loads from, one shared set of values.

Also, in the follow-up callbacks the loader is still bound to the old response object. In parsePage1, for example, add_xpath still runs against the response that parse received, not the response passed to parsePage1.
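
Both problems can be reproduced without Scrapy in a small sketch. FakeResponse and FakeLoader below are hypothetical stand-ins, not Scrapy classes. They assume only that a loader keeps the response it was constructed with and keeps accumulating values for as long as the same instance is reused:

```python
# Toy stand-ins, NOT Scrapy's real classes: they model only the two
# behaviours described above.


class FakeResponse:
    """Hypothetical response that carries just a page title."""

    def __init__(self, title):
        self.title = title


class FakeLoader:
    """Hypothetical loader, bound to one response when it is created."""

    def __init__(self, response):
        self.response = response  # fixed at construction time
        self.values = {}          # accumulates across every add_title() call

    def add_title(self, field):
        # Reads from the response given to __init__, not from whichever
        # response the current callback received.
        self.values[field] = self.response.title

    def load_item(self):
        return dict(self.values)


def shared_loader_run():
    """Mimic the question: one loader built in parse, reused by three callbacks."""
    loader = FakeLoader(FakeResponse('SANZAAR'))  # created once, in parse()
    items = []
    for field in ('title2', 'title3', 'title1'):  # callbacks, in arrival order
        loader.add_title(field)
        items.append(loader.load_item())
    return items


def fresh_loader_run():
    """Mimic the fix: each callback builds its own loader on its own response."""
    item = {}
    pages = [
        ('title1', '2016 Rugby Championship'),
        ('title2', 'SANZAAR'),
        ('title3', '2016 Super Rugby season'),
    ]
    for field, title in pages:
        loader = FakeLoader(FakeResponse(title))
        loader.add_title(field)
        item.update(loader.load_item())
    return item


if __name__ == '__main__':
    for item in shared_loader_run():
        print(item)
    print(fresh_loader_run())
```

With the first start URL's response, the shared loader produces items of the same shape as the first three in the question's log: every field holds 'SANZAAR', and each item has one more field than the previous one.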

In most cases, passing item loaders to another callback is not recommended. It is usually better to pass the item object itself and build a new loader in each callback.

Here’s a short (and incomplete) example, by editing your code:

def parse(self, response):
    l = TheLoader(item=TestItems(), response=response)
    request = Request(
        "https://en.wikipedia.org/wiki/2016_Rugby_Championship",
        callback=self.parsePage1,
        # Pass the loaded item, not the loader itself
        meta={'item': l.load_item()},
        dont_filter=True
    )
    yield request

def parsePage1(self, response):
    # Build a new loader around the passed item, bound to this callback's response
    loadernext = TheLoader(item=response.meta['item'], response=response)
    loadernext.add_xpath('title1', '//*[@id="firstHeading"]/text()')
    return loadernext.load_item()
User contributions licensed under: CC BY-SA