I am using Scrapy to go from page to page and collect numbers that are on a page. The pages are all similar enough that I can use the same callback to parse every one of them. Simple enough, but I don't need each individual number on the pages, or even each page's own total. I just need the total sum of all the numbers across all the pages I am visiting. The Scrapy documentation talks about using cb_kwargs to pass arguments between callbacks, and this is what I have so far:
```python
def parse(self, response):
    self.logger.info('A response from %s just arrived!', response.url)
    numbers_page = response.css('.numbers + a')
    yield from response.follow_all(numbers_page, callback=self.parse_numbers,
                                   cb_kwargs=dict(total_count=0))

def parse_numbers(self, response, total_count):
    def extract_with_css(query):
        return response.css(query).get(default='').strip()

    for number in response.css('div.numbers'):
        value = extract_with_css('span::text')
        total_count += int(value.replace(',', ''))
        yield {
            'number': value,
            'total_count': total_count,
        }

    next_page = response.css('li.next a::attr("href")').get()
    if next_page is not None:
        request = scrapy.Request(next_page, callback=self.parse_numbers,
                                 cb_kwargs=dict(total_count=total_count))
        yield request
```
I cut out things irrelevant to the question to make the code clearer. I feel like using a `for` loop to add up the numbers is okay, but how do I carry that total value over to the next page (if there is one), and then export it with the rest of the data at the end?
Answer
I don’t see the need for passing data from one request to another. The most obvious way I can think of to go about it would be as follows:
- You compute the sum of the numbers on each page and yield it as an item
- You create an item pipeline that keeps track of the total count
- When the scraping is finished, you have the total count in your item pipeline and you write it to a file, database, …
Your spider would look something like this:
```python
def parse(self, response):
    self.logger.info('A response from %s just arrived!', response.url)
    numbers_page = response.css('.numbers + a')
    # follow_all handles the full list of links matched by the selector
    yield from response.follow_all(numbers_page, callback=self.parse_numbers)

def parse_numbers(self, response):
    numbers = response.css('div.numbers')
    list_numbers = numbers.css('span::text').getall()
    # sum only this page's numbers; strip thousands separators first
    page_sum = sum(int(number.replace(',', '')) for number in list_numbers
                   if number.strip())
    yield {'page_sum': page_sum}

    next_page = response.css('li.next a::attr("href")').get()
    if next_page:
        request = scrapy.Request(next_page, callback=self.parse_numbers)
        yield request
```
For the item pipeline you can use logic like this:
```python
import json

class TotalCountPipeline(object):
    def __init__(self):
        # initialize the variable that keeps track of the total count
        self.total_count = 0

    def process_item(self, item, spider):
        # every page_sum yielded by the spider is added to the running total
        page_sum = item['page_sum']
        self.total_count += page_sum
        return item

    def close_spider(self, spider):
        # write the final count to a file when the crawl ends
        output = json.dumps(self.total_count)
        with open('test_count_file.jl', 'w') as output_file:
            output_file.write(output + '\n')
```
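For the pipeline to actually run, it also has to be enabled in your project settings. A minimal sketch, assuming the class above lives in `myproject/pipelines.py` (the dotted path and the priority `300` are placeholders to adjust for your project):

```python
# settings.py — register the pipeline so Scrapy passes every item through it;
# 'myproject' is a placeholder for your actual project package name
ITEM_PIPELINES = {
    'myproject.pipelines.TotalCountPipeline': 300,
}
```

The per-page items are returned unchanged by `process_item`, so they still reach any feed exports you have configured; only the grand total is written out in `close_spider`, once the crawl has finished.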
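If you would rather not maintain a pipeline at all, the same idea works with Scrapy's built-in stats collector, which a spider can reach through `self.crawler.stats`; a sketch of the callback, with `'numbers/total_count'` as an arbitrary stat name chosen here:

```python
def parse_numbers(self, response):
    numbers = response.css('div.numbers span::text').getall()
    page_sum = sum(int(number.replace(',', '')) for number in numbers
                   if number.strip())
    # accumulate the grand total as a crawl stat; Scrapy dumps all stats
    # to the log when the spider closes
    self.crawler.stats.inc_value('numbers/total_count', page_sum)
    yield {'page_sum': page_sum}
```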