
Scrapy: run one crawl after another

I’m quite new to web scraping. I’m trying to crawl a novel reader website to get the novel info and the chapter content, so I do it by creating two spiders: one to fetch the novel information and another to fetch the content of each chapter.

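Roughly, the two spiders look like this (the spider names, selectors, and URLs below are hypothetical placeholders, not my exact code):

import scrapy

class BookSpider(scrapy.Spider):
    name = "book"
    # Hypothetical URL -- stands in for the real novel page.
    start_urls = ["https://example.com/novel/some-book"]

    def parse(self, response):
        # Hypothetical selectors for the novel info and chapter links.
        yield {
            "title": response.css("h1::text").get(),
            "chapter_urls": response.css("a.chapter-link::attr(href)").getall(),
        }

class ChapterSpider(scrapy.Spider):
    name = "chapter"

    def __init__(self, urls=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Chapter URLs are passed in when the crawl is scheduled.
        self.start_urls = urls or []

    def parse(self, response):
        yield {
            "url": response.url,
            "content": response.css("div.chapter-content").get(),
        }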

After that, I created a collector to gather and process all of the data from the spiders.

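The collector hooks into Scrapy’s item_scraped signal and sorts incoming items by spider; a minimal sketch (chapters_data is the attribute used later in this question, the other names are placeholders):

from scrapy import signals
from scrapy.signalmanager import dispatcher

class Collector:
    def __init__(self):
        self.books_data = []
        self.chapters_data = []
        # Every item any spider yields triggers item_scraped; sort the
        # items into the right bucket by spider name.
        dispatcher.connect(self.item_scraped, signal=signals.item_scraped)

    def item_scraped(self, item, response, spider):
        if spider.name == "book":
            self.books_data.append(item)
        else:
            self.chapters_data.append(item)

collector = Collector()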

If I put the chapter URL in manually before process.start():

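That is, with a chapter URL pasted in by hand (placeholder URL; BookSpider and ChapterSpider as sketched above):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(BookSpider)
# Chapter URL hard-coded by hand instead of taken from the book spider's output:
process.crawl(ChapterSpider, urls=["https://example.com/novel/some-book/chapter-1"])
process.start()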

It works, but that defeats the purpose of the script.

Now my question is: how do I make the chapter spider run only after the book spider has finished collecting its data? Here is my attempt, which didn’t work:

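A sketch of that attempt, reusing the spiders and collector from above (the books_data structure is assumed to match the book spider’s output):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(BookSpider)
process.start()  # blocks until BookSpider finishes, then stops the reactor

# collector.books_data is filled at this point, so schedule the chapter spider:
process.crawl(ChapterSpider, urls=collector.books_data[0]["chapter_urls"])
# A second process.start() here raises
# twisted.internet.error.ReactorNotRestartable, and without it the
# scheduled crawl never actually runs:
print("Chapters ==>", collector.chapters_data)  # still empty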

If I add process.start() before the print("Chapters ==>", collector.chapters_data) line, it raises twisted.internet.error.ReactorNotRestartable.

I’ve read the SO question Scrapy – Reactor not Restartable, but I couldn’t figure out how to apply it to my code.


Answer

I’d suggest changing the spider architecture: Scrapy isn’t meant to chain spiders (it’s possible, of course, but it’s bad practice in general); it’s meant to chain requests within the same spider.

Your problem is caused by the fact that Scrapy is designed to produce a flat list of items, while you need a nested one, like book = {'title': ..., 'chapters': [{some chapter data}, ...]}.

I’d suggest the following architecture for your spider:

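A minimal sketch of that request-chaining approach, assuming hypothetical selectors and URLs (cb_kwargs needs Scrapy 1.7+): the book page is parsed first, the half-built item rides along with each chapter request, and the book is yielded only after the last chapter has been appended.

import scrapy

class BookSpider(scrapy.Spider):
    name = "book"
    start_urls = ["https://example.com/novel/some-book"]  # hypothetical URL

    def parse(self, response):
        book = {
            "title": response.css("h1::text").get(),  # hypothetical selector
            "chapters": [],
        }
        chapter_urls = response.css("a.chapter-link::attr(href)").getall()
        if not chapter_urls:
            yield book
            return
        # Follow the first chapter and carry the half-built book item,
        # plus the URLs still to visit, along with the request.
        yield response.follow(
            chapter_urls[0],
            callback=self.parse_chapter,
            cb_kwargs={"book": book, "remaining": chapter_urls[1:]},
        )

    def parse_chapter(self, response, book, remaining):
        book["chapters"].append({
            "url": response.url,
            "content": response.css("div.chapter-content").get(),
        })
        if remaining:
            # Chain the next chapter request.
            yield response.follow(
                remaining[0],
                callback=self.parse_chapter,
                cb_kwargs={"book": book, "remaining": remaining[1:]},
            )
        else:
            # Last chapter done -- emit the complete nested item.
            yield book

Chaining keeps the chapters in reading order, at the cost of fetching them one at a time.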

This will produce book items with the chapters nested inside.

Hope this helps, even though it’s not quite an exact answer to your question. Good luck (:


Second edit:

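Assuming the second edit is a variant of the same idea, here is a sketch that fires all chapter requests in parallel instead of chaining them, and yields the book once a counter shows every chapter has arrived (selectors and URLs remain hypothetical):

import scrapy

class BookSpider(scrapy.Spider):
    name = "book"
    start_urls = ["https://example.com/novel/some-book"]  # hypothetical URL

    def parse(self, response):
        book = {"title": response.css("h1::text").get(), "chapters": []}
        chapter_urls = response.css("a.chapter-link::attr(href)").getall()
        book["total_chapters"] = len(chapter_urls)
        if not chapter_urls:
            yield book
            return
        # Fire all chapter requests at once; they all share the same book dict.
        for url in chapter_urls:
            yield response.follow(url, self.parse_chapter, cb_kwargs={"book": book})

    def parse_chapter(self, response, book):
        book["chapters"].append({
            "url": response.url,
            "content": response.css("div.chapter-content").get(),
        })
        # Chapters may finish in any order; yield the book from whichever
        # request happens to complete last.
        if len(book["chapters"]) == book["total_chapters"]:
            yield book

Parallel fetching is faster, but the chapters arrive in completion order, so they may need sorting afterwards.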