Process a lot of data without waiting for a chunk to finish

Tags: , ,



I am confused with map, imap, apply_async, apply, Process etc from the multiprocessing python package.

What I would like to do:

I have 100 simulation script files that need to be run through a simulation program. I would like python to run as many as it can in parallel, then as soon as one is finished, grab a new script and run that one. I don’t want any waiting.

Here is a demo code:

import multiprocessing as mp  
import time

def run_sim(x):
    # run 
    print("Running Sim: ", x)
    
    # artificailly wait 5s
    time.sleep(5)
    
    
    return x

def main():
    # x => my simulation files
    x = list(range(100))
    # run parralel process
    pool = mp.Pool(mp.cpu_count()-1)
    # get results
    result = pool.map(run_sim, x)

    print("Results: ", result)
    
    

if __name__ == "__main__":
  main()

However, I don’t think that map is the correct way here since I want the PC not to wait for the batch to be finished but immediately proceed to the next simulation file.

The code will run mp.cpu_count()-1 simulations at the same time and then wait for every one of them to be finished, before starting a new batch of size mp.cpu_count()-1 . I don’t want the code to wait, but just to grab a new simulation file as soon as possible.

enter image description here

Do you have any advice on how to code it better?

Some clarifications:

I am reducing the pool to one less than the CPU count because I don’t want to block the PC. I still need to do light work while the code is running.

Answer

It works correctly using map. The trouble is simply that you sleep all thread for 5 seconds, so they all finish at the same time.

Try this code to see the effect correctly:

import multiprocessing as mp  
import time
import random

def run_sim(x):
    # run 
    t = random.randint(3,10)
    print("Running Sim: ", x, " - sleep ", t)
    time.sleep(t)
            
    return x

def main():
    # x => my simulation files
    x = list(range(100))
    # run parralel process
    pool = mp.Pool(mp.cpu_count()-1)
    # get results
    result = pool.map(run_sim, x)

    print("Results: ", result)

if __name__ == "__main__":
  main()


Source: stackoverflow