scrapy internal links + pipeline and mongodb collection relationships

Question

I have been watching videos and reading articles about how Scrapy works with Python and how to insert scraped data into MongoDB. Two questions came up that I either am not googling with the correct keywords or just couldn't find answers to. Let me take this tutorial site, https://blog.scrapinghub.com, as an example for scraping blog posts. I know we can get things like

Accepted Answer

In the situation you described, you scrape the content from the main page, yield a new Request for the "read more" page, and send the data you already scraped along with that Request. When the new request calls its callback (parsing) method, all the data scraped from the previous page will be available.

The recommended way to send the data with the request is to use cb_kwargs. You will often find people and tutorials using the meta param instead, because cb_kwargs only became available in Scrapy 1.7.

Here is an example to illustrate:

    from scrapy import Request, Spider


    class MySpider(Spider):
        name = 'my_spider'  # placeholder; Scrapy requires every spider to have a name

        def parse(self, response):
            # Scrape the fields that are available on the main page.
            title = response.xpath('//div[@id="title"]/text()').get()
            author = response.xpath('//div[@id="author"]/text()').get()
            scraped_data = {'title': title, 'author': author}

            # Follow the "read more" link and carry the scraped data along.
            read_more_url = response.xpath('//div[@id="read-more"]/@href').get()
            yield Request(
                url=read_more_url,
                callback=self.parse_read_more,
                cb_kwargs={'main_page_data': scraped_data}
            )

        def parse_read_more(self, response, main_page_data):
            # The data from the main page will be received as a param in this method.
            content = response.xpath('//article[@id="content"]/text()').get()
            yield {
                'title': main_page_data['title'],
                'author': main_page_data['author'],
                'content': content
            }

Notice that the key in cb_kwargs must be the same as the param name in the callback function.
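The naming rule for cb_kwargs can be seen without running a crawl. The sketch below is plain Python, not Scrapy code: invoke_callback is a hypothetical stand-in that assumes the callback receives the response positionally and each cb_kwargs entry as a keyword argument, which is a simplification of what Scrapy does internally. The response is just a string here.

```python
def parse_read_more(response, main_page_data):
    """Stand-in for a spider callback; `response` is only a string here."""
    return {"title": main_page_data["title"], "content": response}


def invoke_callback(callback, response, cb_kwargs):
    # Hypothetical helper (an assumption, not a Scrapy function): pass the
    # response positionally and every cb_kwargs entry as a keyword argument.
    return callback(response, **cb_kwargs)


# The key matches the parameter name, so the call succeeds.
item = invoke_callback(
    parse_read_more,
    "article body",
    {"main_page_data": {"title": "Hello"}},
)

# The key does not match any parameter name, so Python raises TypeError.
try:
    invoke_callback(
        parse_read_more,
        "article body",
        {"page_data": {"title": "Hello"}},
    )
    mismatch_error = None
except TypeError as exc:
    mismatch_error = exc
```

If the key and the parameter name drift apart, the failure shows up as a TypeError when the callback is invoked, so it helps to rename both together.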

Answer