I have a problem in which I process documents from files using Python generators. The number of files I need to process is not known in advance. Each file contains records that consume a considerable amount of memory, so generators are used to process the records one at a time. Here is a summary of the code I am working on:
    def process_all_records(files):
        for f in files:
            fd = open(f, 'r')
            recs = read_records(fd)
            recs_p = (process_records(r) for r in recs)
            write_records(recs_p)
My process_records function checks the content of each record and only returns the records that have a specific sender. My problem is the following: I want a count of the number of elements returned by read_records. I have been keeping track of the number of records in the process_records function using a list:
    def process_records(r):
        if r.sender('sender_of_interest'):
            records_list.append(1)
        else:
            records_list.append(0)
        ...
The problem with this approach is that records_list could grow without bound depending on the input. I want to be able to consume the contents of records_list once it grows to a certain point and then restart the process. For example, after 20 records have been processed, I want to find out how many are from 'sender_of_interest' and how many are from other sources, and then empty the list. Can I do this without using a lock?
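In rough pseudocode, the flush I have in mind would look like this (flush_counts is just a placeholder for whatever reporting I end up doing):

    if len(records_list) >= 20:
        n_interest = sum(records_list)            # entries are 1 for 'sender_of_interest'
        n_other = len(records_list) - n_interest  # entries are 0 for everything else
        flush_counts(n_interest, n_other)         # placeholder: report the counts
        del records_list[:]                       # empty the list and start over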
Answer
You could make your generator a class with an attribute that contains a count of the number of records it has processed. Something like this:
    class RecordProcessor(object):
        def __init__(self, recs):
            self.recs = recs
            self.processed_rec_count = 0

        def __iter__(self):  # makes the instance itself iterable
            for r in self.recs:
                if r.sender('sender_of_interest'):
                    self.processed_rec_count += 1
                    # process record r...
                    yield r  # processed record

    def process_all_records(files):
        for f in files:
            with open(f, 'r') as fd:
                recs_p = RecordProcessor(read_records(fd))
                write_records(recs_p)
                print('records processed:', recs_p.processed_rec_count)

Note that the class defines __iter__ rather than __call__, so the instance can be passed straight to write_records and iterated like any generator.
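If you also want the periodic reset from your question (report and clear the counts every 20 records), the same class can keep two counters and flush them itself. A minimal sketch of that idea; BatchCountingProcessor and the report callback are illustrative names, not part of your original code:

    class BatchCountingProcessor(object):
        """Counts matching/non-matching records; reports every batch_size records."""
        def __init__(self, recs, report, batch_size=20):
            self.recs = recs
            self.report = report          # callback you supply, e.g. print or a logger
            self.batch_size = batch_size
            self.matched = 0
            self.others = 0

        def __iter__(self):
            for r in self.recs:
                if r.sender('sender_of_interest'):
                    self.matched += 1
                    yield r               # only matching records are passed on
                else:
                    self.others += 1
                if self.matched + self.others >= self.batch_size:
                    self.report(self.matched, self.others)  # consume the counts...
                    self.matched = self.others = 0          # ...and start over

As for the lock: a generator only advances when its consumer asks for the next record, so everything here runs in the consumer's thread. As long as the instance is not shared between threads, no lock is needed.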