
Scrapy: run one crawl after another

I’m quite new to web scraping. I’m trying to crawl a novel reader website to get the novel info and the chapter content, so I do it with two spiders: one that fetches the novel information and another that fetches the content of each chapter.

import scrapy

class BookSpider(scrapy.Spider):
    name = "book"

    def __init__(self, books=[], **kwargs):
        if isinstance(books,str):
            books = [books]
        self.start_urls = [f'https://daonovel.com/novel/{book}/' for book in sorted(books)]
        super().__init__(**kwargs) 

    def parse(self, response):
        # self.remove_content(response.css("div.post-title h1 span"))
        fullurl = response.url
        url = fullurl.split("/")[-2]
        title = response.css("div.post-title h1::text").extract()
        title = title[-1].strip()
        authors = response.css('div.author-content a::text').getall()
        genres = response.css('div.genres-content a::text').getall()
        release = response.css('div.post-status div.post-content_item:nth-child(1) div.summary-content::text').get().strip()
        status = response.css('div.post-status div.post-content_item:nth-child(2) div.summary-content::text').get().strip()
        summary = response.css('div.summary__content p').getall()

        chapters = response.css('ul.version-chap li a::attr(href)').extract()
        chapters.reverse()

        return {
            'fullurl' : fullurl,
            'url' : url,
            'title' : title,
            'authors' : authors,
            'genres' : genres,
            'status' : status,
            'release' : release,
            'summary' : summary,
            'chapters' : chapters
        }

class ChapterSpider(scrapy.Spider):
    name = "chapter"

    def __init__(self, book="", chapters=[], **kwargs):
        if isinstance(chapters,str):
            chapters = [chapters]
        self.book = book
        self.start_urls = [f'https://daonovel.com/novel/{book}/{chapter}/' for chapter in chapters]
        super().__init__(**kwargs) 

    def parse(self, response):
        title = response.css("ol.breadcrumb li.active::text").get().strip()
        
        container = response.css("div.cha-words p").getall() or response.css("div.text-left p").getall()
        content = []
        for p in container:
            content.append(str(p))
        
        return {
            'title' : title,
            'content' : content,
            'book_url': self.book,
            'url' : response.url.split("/")[-2]
        }

After that I created a collector to gather and process all of the data from the spiders:

from scrapy import signals

class Collector():
    def __init__(self, process, books=[]):
        self.process = process
        if isinstance(books, str):
            books = [books]
        self.books = books
        self.books_data = []

    def create_crawler(self, spider, function, **kwargs):
        # we need Crawler instance to access signals
        crawler = self.process.create_crawler(spider)
        crawler.signals.connect(function, signal=signals.item_scraped)
        x = self.process.crawl(crawler, **kwargs)
        return x

    def process_book_data(self, item, response, spider):
        item['authors'] = [author.strip() for author in item['authors']]
        item['genres'] = [genre.strip() for genre in item['genres']]

        summary = [line for line in item['summary'] if not any(word in line.lower() for word in ("wuxiaworld", "disclaimer"))]
        item['summary'] = "\n".join(summary)

        item['chapters'] = [chapter.replace(item['fullurl'], '').replace('/', '') for chapter in item['chapters']]
        self.books_data.append(item)

    def process_chapter_data(self, item, response, spider):
        item['content'] = "\n".join(item['content'])
        
        for book in self.books_data:
            if book['url'] == item['book_url']:
                book['chapters'][book['chapters'].index(item['url'])] = item
    
    def crawl_books(self):
        return self.create_crawler(BookSpider, self.process_book_data, books=self.books)
    
    def crawl_chapters(self, book, chapters):
        return self.create_crawler(ChapterSpider, self.process_chapter_data, book=book, chapters=chapters)

If I list the chapters manually before process.start(), like this:

from scrapy.crawler import CrawlerProcess
process = CrawlerProcess()
collector = Collector(process, books="a-stay-at-home-dads-restaurant-in-an-alternate-world")
collector.crawl_books()
collector.crawl_chapters("a-stay-at-home-dads-restaurant-in-an-alternate-world", ['chapter-1', 'chapter-2', 'chapter-3', 'chapter-4', 'chapter-5']) # put chapter manually
process.start()

for book in collector.books_data:
    for k,v in book.items():
        print(k,v)

This works, but hard-coding the chapter list defeats the purpose of the script.

Now my question is: how do I make the chapter spider run after the book spider has finished collecting its data? Here is my attempt, which didn’t work:

from scrapy.crawler import CrawlerProcess
process = CrawlerProcess()
collector = Collector(process, books="a-stay-at-home-dads-restaurant-in-an-alternate-world")
collector.crawl_books()
process.start()

print(collector.books_data)  # this works
for book in collector.books_data:
    collector.crawl_chapters(book['url'], book['chapters'])  # this doesn't work
    print("Chapters ==>", collector.books_data)

If I add another process.start() before the print("Chapters ==>", collector.books_data) line, it raises twisted.internet.error.ReactorNotRestartable.

I’ve read the SO question Scrapy – Reactor not Restartable, but I didn’t know how to apply it to my code.
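
From what I understand of that answer, the idea would be to switch from CrawlerProcess to CrawlerRunner and chain the crawls with Twisted Deferreds, so the reactor is started and stopped only once. My Collector only calls create_crawler and crawl, which CrawlerRunner also provides, so I imagine it would look roughly like the untested sketch below, but I’m not sure this is the right way to wire it up:

from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()
collector = Collector(runner, books="a-stay-at-home-dads-restaurant-in-an-alternate-world")

@defer.inlineCallbacks
def crawl():
    # First crawl the book pages; books_data is filled by the item_scraped handler.
    yield collector.crawl_books()
    # Only after that Deferred has fired, crawl the chapters of each collected book.
    for book in collector.books_data:
        yield collector.crawl_chapters(book['url'], book['chapters'])
    reactor.stop()

crawl()
reactor.run()  # the script blocks here until reactor.stop() is called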


Answer

I’d suggest changing the spider architecture, since Scrapy isn’t meant to chain spiders (it’s possible, of course, but it’s generally bad practice); it’s meant to chain requests within the same spider.

Your problem is caused by the fact that Scrapy is designed to produce a flat list of items, while you need a nested one like book = {'title': ..., 'chapters': [{some chapter data}, ...]}.

I’d suggest the following architecture for your spider:

def parse(self, response):
    
    # parse book data here

    book_item = {
        'fullurl' : fullurl,
        'url' : url,
        'title' : title,
        'authors' : authors,
        'genres' : genres,
        'status' : status,
        'release' : release,
        'summary' : summary,
        'chapters' : []
    }
    chapter_urls = [...]  # list of the book's chapter urls here
    chapter_url = chapter_urls.pop()

    yield Request(
        url=chapter_url,
        callback=self.parse_chapter,
        meta={'book': book_item, 'chapter_urls': chapter_urls}
    )

def parse_chapter(self, response):
    book = response.meta['book']
    chapter_urls = response.meta['chapter_urls']
    
    # parse chapter data here
    
    chapter = {
        'title' : title,
        'content' : content,
        'book_url': book['url'],
        'url' : response.url.split("/")[-2]
    }
    book['chapters'].append(chapter)
    if not chapter_urls:
        yield book
    else:
        chapter_url = chapter_urls.pop()
        yield Request(
            url=chapter_url,
            callback=self.parse_chapter,
            meta={'book': book, 'chapter_urls': chapter_urls}
        )

This will produce book items with their chapters nested inside.
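
If it helps, here’s roughly how you could run that single spider and collect the nested items, reusing the item_scraped signal pattern from your Collector (the class name BookSpider and the books argument are assumptions, matching your original spider’s signature):

from scrapy import signals
from scrapy.crawler import CrawlerProcess

books_data = []

def collect_book(item, response, spider):
    # item_scraped handler: store every finished book item
    books_data.append(item)

process = CrawlerProcess()
crawler = process.create_crawler(BookSpider)  # the refactored spider above
crawler.signals.connect(collect_book, signal=signals.item_scraped)
process.crawl(crawler, books="a-stay-at-home-dads-restaurant-in-an-alternate-world")
process.start()

for book in books_data:
    print(book['title'], "->", len(book['chapters']), "chapters")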

Hope it helps, even though it’s not quite an exact answer to your question. Good luck (:


Second edit:

class YourSpider(scrapy.Spider):
    books = {}  # collected book items, keyed by title
    ...

    def parse(self, response):
        # Get book info here.
        book_item = {
            'fullurl' : fullurl,
            'url' : url,
            'title' : title,
            'authors' : authors,
            'genres' : genres,
            'status' : status,
            'release' : release,
            'summary' : summary,
            'chapters' : []
        } 
        self.books[book_item['title']] = book_item
        chapter_urls = [...]  # list of chapter urls

        # This will trigger multiple requests asynchronously
        for chapter_url in chapter_urls:
            yield scrapy.Request(
                url=chapter_url,
                callback=self.parse_chapter,
                meta={'book_title': book_item['title']}
            )

    def parse_chapter(self, response):
        book_title = response.meta['book_title']

        # parse chapter data here

        chapter = {
            'title' : title,
            'content' : content,
            'book_title': book_title,
            'url' : response.url.split("/")[-2]
        }
        self.books[book_title]['chapters'].append(chapter)

        yield self.books[book_title]
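
If you’d rather emit each book exactly once instead of yielding it again after every chapter, you can count how many chapter responses are still outstanding and only yield when that counter reaches zero. A rough, untested sketch of that variant (the pending counter is just an illustration; the selectors are copied from your original spiders, and fields like authors/genres are omitted for brevity):

import scrapy

class BookSpider(scrapy.Spider):
    name = "book"

    def __init__(self, books=(), **kwargs):
        if isinstance(books, str):
            books = [books]
        self.start_urls = [f'https://daonovel.com/novel/{book}/' for book in sorted(books)]
        self.books = {}  # book items keyed by title
        super().__init__(**kwargs)

    def parse(self, response):
        title = response.css("div.post-title h1::text").getall()[-1].strip()
        chapter_urls = response.css('ul.version-chap li a::attr(href)').getall()
        book_item = {
            'title': title,
            'chapters': [],
            'pending': len(chapter_urls),  # chapter responses we are still waiting for
        }
        self.books[title] = book_item
        if not chapter_urls:  # a book without chapters is complete right away
            yield book_item
            return
        for chapter_url in chapter_urls:
            yield scrapy.Request(
                url=chapter_url,
                callback=self.parse_chapter,
                meta={'book_title': title},
            )

    def parse_chapter(self, response):
        book = self.books[response.meta['book_title']]
        paragraphs = response.css("div.cha-words p").getall() or response.css("div.text-left p").getall()
        book['chapters'].append({
            'title': response.css("ol.breadcrumb li.active::text").get(default='').strip(),
            'content': "\n".join(paragraphs),
            'url': response.url.split("/")[-2],
        })
        book['pending'] -= 1
        if book['pending'] == 0:  # last outstanding chapter for this book arrived
            del book['pending']
            yield book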
         