I am using Scrapy to go from page to page and collect numbers that are on a page. The pages are all similar enough that I can use the same callback to parse every one of them. Simple enough, but I don't need each individual number on the pages, or even each page's own total. I just need the total sum of all the numbers across all the pages I am visiting. The Scrapy documentation talks about using cb_kwargs to pass arguments between callbacks, and this is what I have so far:
```python
def parse(self, response):
    self.logger.info('A response from %s just arrived!', response.url)
    numbers_page = response.css('.numbers + a')
    yield from response.follow_all(numbers_page, callback=self.parse_numbers,
                                   cb_kwargs=dict(total_count=0))

def parse_numbers(self, response, total_count):
    def extract_with_css(query):
        return response.css(query).get(default='').strip()

    for number in response.css('div.numbers'):
        value = extract_with_css('span::text')
        total_count += int(value.replace(',', ''))
        yield {
            'number': value,
            'total_count': total_count,
        }

    next_page = response.css('li.next a::attr("href")').get()
    if next_page is not None:
        request = scrapy.Request(next_page, callback=self.parse_numbers,
                                 cb_kwargs=dict(total_count=total_count))
        yield request
```
I cut out things irrelevant to the question to make the code clearer. I feel like using a `for` loop to add up the numbers is okay, but how do I carry that total value over to the next page (if there is one), and then export it with the rest of the data at the end?
Answer
I don’t see the need for passing data from one request to another. The most obvious way I can think of to go about it would be as follows:
- You compute the sum of the numbers on each page and yield it as an item
- You create an item pipeline that keeps track of the total count
- When the scraping is finished, you have the total count in your item pipeline and you write it to a file, database, …
Your spider would look something like this:
```python
def parse(self, response):
    self.logger.info('A response from %s just arrived!', response.url)
    numbers_page = response.css('.numbers + a')
    # follow_all handles the full list of links matched by the selector
    yield from response.follow_all(numbers_page, callback=self.parse_numbers)

def parse_numbers(self, response):
    numbers = response.css('div.numbers')
    list_numbers = numbers.css('span::text').getall()
    # sum only this page's numbers; strip thousands separators first
    page_sum = sum(int(number.replace(',', '')) for number in list_numbers
                   if number.strip())
    yield {'page_sum': page_sum}

    next_page = response.css('li.next a::attr("href")').get()
    if next_page:
        request = scrapy.Request(next_page, callback=self.parse_numbers)
        yield request
```
For the item pipeline you can use logic like this:
```python
import json

class TotalCountPipeline(object):
    def __init__(self):
        # initialize the variable that keeps track of the total count
        self.total_count = 0

    def process_item(self, item, spider):
        # every page_sum yielded by the spider is added to the running total
        page_sum = item['page_sum']
        self.total_count += page_sum
        return item

    def close_spider(self, spider):
        # write the final count to a file when the crawl ends
        output = json.dumps(self.total_count)
        with open('test_count_file.jl', 'w') as output_file:
            output_file.write(output + '\n')
```
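For the pipeline to actually run, it also has to be enabled in your project settings. A minimal sketch, assuming the class above lives in `myproject/pipelines.py` (the dotted path and the priority `300` are placeholders to adjust for your project):

```python
# settings.py — register the pipeline so Scrapy passes every item through it;
# 'myproject' is a placeholder for your actual project package name
ITEM_PIPELINES = {
    'myproject.pipelines.TotalCountPipeline': 300,
}
```

The per-page items are returned unchanged by `process_item`, so they still reach any feed exports you have configured; only the grand total is written out in `close_spider`, once the crawl has finished.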
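If you would rather not maintain a pipeline at all, the same idea works with Scrapy's built-in stats collector, which a spider can reach through `self.crawler.stats`; a sketch of the callback, with `'numbers/total_count'` as an arbitrary stat name chosen here:

```python
def parse_numbers(self, response):
    numbers = response.css('div.numbers span::text').getall()
    page_sum = sum(int(number.replace(',', '')) for number in numbers
                   if number.strip())
    # accumulate the grand total as a crawl stat; Scrapy dumps all stats
    # to the log when the spider closes
    self.crawler.stats.inc_value('numbers/total_count', page_sum)
    yield {'page_sum': page_sum}
```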