I am using Scrapy to go from page to page and collect numbers that are on each page. The pages are similar enough that I can use the same function to parse all of them. Simple enough, but I don’t need each individual number on the pages, or even the total for each page. I just need the sum of all the numbers across all the pages I am visiting. The Scrapy documentation talks about using cb_kwargs to pass arguments, and this is what I have so far.
    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
        numbers_page = response.css('.numbers + a')
        yield from response.follow_all(numbers_page, callback=self.parse_numbers, cb_kwargs=dict(total_count=0))

    def parse_numbers(self, response, total_count):
        yield {
            'total_count': total_count,
        }

        def extract_with_css(selector, query):
            return selector.css(query).get(default='').strip()

        for number in response.css('div.numbers'):
            value = extract_with_css(number, 'span::text')
            yield {
                'number': value,
            }
            total_count = total_count + int(value.replace(',', ''))

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            request = scrapy.Request(response.urljoin(next_page),
                                     callback=self.parse_numbers,
                                     cb_kwargs=dict(total_count=total_count))
            yield request
I cut out things irrelevant to the question to make my code more clear. I feel like using a for loop to add up the numbers is okay, but how do I get that total value to the next page (if there is one) and then export it with the rest of the data at the end?
I don’t see the need for passing data from one request to another. The most obvious way I can think of to go about it would be as follows:
- You sum the numbers on each page and yield that page total as an item
- You create an item pipeline that keeps a running total across all items
- When the scraping is finished, the pipeline holds the grand total and you write it to a file, database, …
Your spider would look something like this:
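    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
        numbers_page = response.css('.numbers + a')
        # follow_all accepts a SelectorList of <a> elements and yields one request per link
        yield from response.follow_all(numbers_page, callback=self.parse_numbers)

    def parse_numbers(self, response):
        numbers = response.css('div.numbers')
        list_numbers = numbers.css('span::text').getall()
        # strip thousands separators (as in the question) and skip blank strings
        page_sum = sum(int(number.replace(',', '')) for number in list_numbers if number.strip())
        yield {'page_sum': page_sum}

        next_page = response.css('li.next a::attr("href")').get()
        if next_page:
            # urljoin turns a relative href into an absolute URL
            request = scrapy.Request(response.urljoin(next_page), callback=self.parse_numbers)
            yield request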
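The page sum depends on turning scraped strings into integers, and the question's code strips thousands separators with replace(',', ''). If the text can also be blank or padded with whitespace, a small helper may keep the spider readable. This is only a sketch: `to_int` and `page_total` are hypothetical names, and it assumes the numbers are whole numbers that use `,` as the thousands separator.

```python
def to_int(text):
    """Convert a scraped string such as ' 1,234 ' to an int; blank text counts as 0."""
    cleaned = text.strip().replace(',', '')
    return int(cleaned) if cleaned else 0


def page_total(texts):
    """Sum all number strings found on one page."""
    return sum(to_int(text) for text in texts)


# Made-up values standing in for the output of .getall()
sample = ['1,234', ' 56 ', '', '7']
print(page_total(sample))  # 1297
```

In the spider, this would stand in for the inline generator expression, e.g. page_sum = page_total(list_numbers).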
For the item pipeline you can use logic like this:
    import json


    class TotalCountPipeline:
        def __init__(self):
            # initialize the variable that keeps track of the total count
            self.total_count = 0

        def process_item(self, item, spider):
            # every number yielded from your spider in page_sum will be added to the current total count
            page_sum = item['page_sum']
            self.total_count += page_sum
            return item

        def close_spider(self, spider):
            # write the final count to a file
            output = json.dumps(self.total_count)
            with open('test_count_file.jl', 'w') as output_file:
                output_file.write(output + '\n')
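Scrapy only runs item pipelines that are enabled in the project settings, so the class above also needs an entry there. A sketch of that entry, assuming the project package is called `myproject` (a placeholder name, not something from the question) and the class lives in its `pipelines.py`:

```python
# settings.py
# 'myproject' is a placeholder; replace it with your own project package name.
ITEM_PIPELINES = {
    'myproject.pipelines.TotalCountPipeline': 300,
}
```

The integer sets the order in which pipelines run when several are enabled; lower values run first.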