
Using Scrapy to add up numbers across several pages

I am using Scrapy to go from page to page and collect the numbers that appear on each page. The pages are similar enough that I can use the same callback to parse all of them. Simple enough, but I don't need each individual number, or even a per-page total. I just need the grand total of all the numbers across all the pages I visit. The Scrapy documentation talks about using cb_kwargs to pass arguments between callbacks, and this is what I have so far.

import scrapy


def parse(self, response):
    self.logger.info('A response from %s just arrived!', response.url)

    numbers_page = response.css('.numbers + a')
    # follow_all (Scrapy 2.0+) yields one Request per matched link
    yield from response.follow_all(numbers_page, callback=self.parse_numbers,
                                   cb_kwargs=dict(total_count=0))


def parse_numbers(self, response, total_count):
    def extract_with_css(selector, query):
        return selector.css(query).get(default='').strip()

    for number in response.css('div.numbers'):
        value = extract_with_css(number, 'span::text')
        total_count += int(value.replace(',', ''))
        yield {
            'number': value,
            'total_count': total_count,
        }

    next_page = response.css('li.next a::attr(href)').get()
    if next_page is not None:
        yield scrapy.Request(next_page,
                             callback=self.parse_numbers,
                             cb_kwargs=dict(total_count=total_count))

I cut out the parts that are irrelevant to the question to make my code clearer. I feel like using a for loop to add up the numbers is fine, but how do I carry that running total over to the next page (if there is one), and then export it with the rest of the data at the end?


Answer

I don’t see the need for passing data from one request to another. The most obvious way I can think of to go about it would be as follows:

  • You compute the sum of the numbers on each page and yield it as an item
  • You create an item pipeline that keeps a running total across all items
  • When the scraping is finished, the item pipeline holds the grand total and you write it to a file, database, …

Your spider would look something like this:

import scrapy


def parse(self, response):
    self.logger.info('A response from %s just arrived!', response.url)

    numbers_page = response.css('.numbers + a')
    # follow_all (Scrapy 2.0+) yields one Request per matched link
    yield from response.follow_all(numbers_page, callback=self.parse_numbers)


def parse_numbers(self, response):
    numbers = response.css('div.numbers')
    list_numbers = numbers.css('span::text').getall()
    # strip whitespace and thousands separators before converting
    page_sum = sum(int(number.strip().replace(',', ''))
                   for number in list_numbers if number.strip())
    yield {'page_sum': page_sum}

    next_page = response.css('li.next a::attr(href)').get()
    if next_page:
        yield scrapy.Request(next_page, callback=self.parse_numbers)

For the item pipeline you can use logic like this:

import json


class TotalCountPipeline:
    def __init__(self):
        # initialize the variable that keeps track of the total count
        self.total_count = 0

    def process_item(self, item, spider):
        # every page_sum yielded from your spider is added to the running total
        self.total_count += item['page_sum']
        return item

    def close_spider(self, spider):
        # write the final count to a file when the crawl finishes
        output = json.dumps(self.total_count)
        with open('test_count_file.jl', 'w') as output_file:
            output_file.write(output + '\n')
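
For the pipeline to actually run, it has to be enabled in your project settings. A minimal sketch, assuming your project package is called myproject and the class above lives in myproject/pipelines.py (both names are placeholders; adjust the dotted path to your project):

# settings.py
# 'myproject' is a placeholder — use your actual project package name.
ITEM_PIPELINES = {
    'myproject.pipelines.TotalCountPipeline': 300,
}

The value 300 is the pipeline's order (an integer from 0 to 1000, lower runs earlier); with a single pipeline any value works. If you also want the per-page items on disk, a feed export will handle them when you run the spider, e.g. scrapy crawl <your_spider_name> -o pages.jl.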