
Tracking how many elements a generator has processed

I have a problem in which I process documents from files using Python generators. The number of files I need to process is not known in advance. Each file contains records that consume a considerable amount of memory, so I use generators to process the records lazily. Here is a summary of the code I am working on:

def process_all_records(files):
    for f in files:
        # Keep the file open until write_records has consumed the lazy generator.
        with open(f, 'r') as fd:
            recs = read_records(fd)
            recs_p = (process_records(r) for r in recs)
            write_records(recs_p)

My process_records function checks the content of each record and returns only the records that have a specific sender. My problem is the following: I want to count the elements returned by read_records. I have been keeping track of the number of records in the process_records function using a list:

def process_records(r):
    if r.sender('sender_of_interest'):
        records_list.append(1)
    else:
        records_list.append(0)
    ...

The problem with this approach is that records_list could grow without bound, depending on the input. I want to be able to consume the contents of records_list once it grows to a certain size and then restart the process. For example, after 20 records have been processed, I want to find out how many records are from 'sender_of_interest' and how many are from other sources, and then empty the list. Can I do this without using a lock?

Answer

You could make your generator a class with an attribute that contains a count of the number of records it has processed. Something like this:

class RecordProcessor(object):
    def __init__(self, recs):
        self.recs = recs
        self.processed_rec_count = 0

    # __iter__ (rather than __call__) makes the instance itself iterable,
    # so it can be passed straight to write_records.
    def __iter__(self):
        for r in self.recs:
            if r.sender('sender_of_interest'):
                self.processed_rec_count += 1
                # process record r...
                yield r  # processed record

def process_all_records(files):
    for f in files:
        # Keep the file open until write_records has consumed every record.
        with open(f, 'r') as fd:
            recs_p = RecordProcessor(read_records(fd))
            write_records(recs_p)
        print('records processed:', recs_p.processed_rec_count)
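
If you also want the per-batch breakdown from the question (how many of every 20 records came from the sender of interest and how many did not), one option is to keep two integer counters instead of a list and reset them each time a batch fills up. The following is only a sketch under some assumptions: BatchCountingProcessor, is_match and on_batch are hypothetical names, is_match stands in for the r.sender('sender_of_interest') check, and the string records are made-up stand-ins for real ones. As long as the generator is consumed from a single thread, the counters are only ever touched by that thread, so a lock should not be needed.

```python
class BatchCountingProcessor(object):
    """Iterable wrapper that tallies records in fixed-size batches.

    Hypothetical helper: is_match replaces r.sender('sender_of_interest'),
    and on_batch is called with (matched, other) whenever a batch is full.
    """

    def __init__(self, recs, is_match, on_batch, batch_size=20):
        self.recs = recs
        self.is_match = is_match
        self.on_batch = on_batch
        self.batch_size = batch_size
        self.matched = 0  # records from the sender of interest in this batch
        self.other = 0    # all remaining records in this batch

    def _flush(self):
        # Report the current tallies, then start a fresh batch.
        self.on_batch(self.matched, self.other)
        self.matched = 0
        self.other = 0

    def __iter__(self):
        for r in self.recs:
            if self.is_match(r):
                self.matched += 1
                yield r  # only matching records are passed downstream
            else:
                self.other += 1
            if self.matched + self.other >= self.batch_size:
                self._flush()
        # Report a final, partially filled batch if there is one.
        if self.matched or self.other:
            self._flush()


# Stand-in data: 45 fake records, every third one from the sender of interest.
records = ['wanted' if i % 3 == 0 else 'unwanted' for i in range(45)]
summaries = []

proc = BatchCountingProcessor(
    records,
    is_match=lambda r: r == 'wanted',
    on_batch=lambda matched, other: summaries.append((matched, other)),
)
kept = list(proc)  # plays the role of write_records consuming the generator

for matched, other in summaries:
    print('matched:', matched, 'other:', other)
```

Memory use stays constant however many records pass through, because only two integers are kept per batch. Note that a batch is reported when the consumer asks for the next record, so if the consumer stops early the batch in progress is never flushed.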
User contributions licensed under: CC BY-SA