
scrapy internal links + pipeline and mongodb collection relationships

I have been watching videos and reading articles about how Scrapy works with Python and how to insert the scraped data into MongoDB.

Two questions came up that I couldn't find answers to, possibly because I am not searching with the right keywords.

Let me take the tutorial site https://blog.scrapinghub.com as an example for scraping blog posts.

I know we can get things like the title, author, and date. But what if I want the content too? To get it, I need to click "more", which leads to another URL. How can this be done?

Then I want either to have the content in the same dict as the title, author, and date, or to keep the title, author, and date in one collection and the content in another, with the records for the same post related to each other.

I got a bit lost thinking about this. Can someone give me suggestions or advice for this kind of idea?

Thanks in advance for any help and suggestions.


Answer

In the situation you described, you scrape the data available on the main page, yield a new Request to the "read more" page, and send the data you already scraped along with that Request. When the response arrives and Scrapy calls the callback (the parsing method for that page), all the data scraped from the previous page is available there.

The recommended way to send the data with the request is to use cb_kwargs. You will often find people and tutorials using the meta param instead, because cb_kwargs only became available in Scrapy 1.7.

Here is an example to illustrate:

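from scrapy import Request, Spider


class MySpider(Spider):
    name = 'my_spider'  # placeholder; every Scrapy spider needs a name

    def parse(self, response):
        # Scrape the fields that are available on the main page.
        # The XPath expressions here are placeholders for the real page structure.
        title = response.xpath('//div[@id="title"]/text()').get()
        author = response.xpath('//div[@id="author"]/text()').get()
        scraped_data = {'title': title, 'author': author}

        # Follow the "read more" link, sending the scraped data along.
        read_more_url = response.xpath('//a[@id="read-more"]/@href').get()
        if read_more_url:
            yield Request(
                url=response.urljoin(read_more_url),  # also handles relative links
                callback=self.parse_read_more,
                cb_kwargs={'main_page_data': scraped_data}
            )

    def parse_read_more(self, response, main_page_data):
        # The data from the main page will be received as a param in this method.
        content = response.xpath('//article[@id="content"]/text()').get()
        yield {
            'title': main_page_data['title'],
            'author': main_page_data['author'],
            'content': content
        }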

Notice that each key in cb_kwargs must match a parameter name of the callback function, because the entries are passed to the callback as keyword arguments.
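To see why the names have to match, here is a minimal sketch in plain Python, without Scrapy. It assumes only that the entries of cb_kwargs are unpacked as keyword arguments when the callback is called; call_callback is a hypothetical stand-in for that step, not a Scrapy function, and the string 'post body' stands in for a real response.

```python
def parse_read_more(response, main_page_data):
    # Stand-in for the spider callback; 'response' is just a string here.
    return {**main_page_data, 'content': response}


def call_callback(callback, response, cb_kwargs):
    # Hypothetical helper: each key of cb_kwargs becomes a keyword argument.
    return callback(response, **cb_kwargs)


# Matching key: 'main_page_data' lines up with the callback's parameter name.
item = call_callback(
    parse_read_more,
    'post body',
    {'main_page_data': {'title': 'A title', 'author': 'An author'}},
)
print(item)

# Mismatched key: 'data' is not a parameter of parse_read_more,
# so Python raises a TypeError when the callback is called.
mismatch_error = None
try:
    call_callback(parse_read_more, 'post body', {'data': {}})
except TypeError as error:
    mismatch_error = str(error)
    print(mismatch_error)
```

With a mismatched key the callback never runs, so a TypeError that mentions an unexpected keyword argument is usually a sign that a cb_kwargs key and a parameter name have drifted apart.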

User contributions licensed under: CC BY-SA