Multiprocessing where a new process starts halfway through another process

Question

I have a Python script that does two things: 1) it downloads a large file by making an API call, and 2) it preprocesses that large file. I want to use multiprocessing to run my script. Each individual part (1 and 2) takes quite long. Everything happens in memory due to the large size of the files, so ideally a single core

Accepted Answer

This is a perfect application for a multiprocessing.Semaphore (or, for safety, a BoundedSemaphore)! Basically, you put a lock around the API-call part of the work, but let up to 4 worker processes hold that lock at any given time. Synchronization objects such as Lock, Semaphore, and Queue can only be shared with worker processes through inheritance, so they have to be passed when the Pool is created rather than as arguments to a method like map or imap. This is done by specifying an initializer function in the Pool constructor:

    import multiprocessing as mp

    def api_call(arg):
        foo = arg  # placeholder: download the large file here
        return foo

    def process_data(foo):
        return "done"  # placeholder: preprocess the downloaded data here

    def map_func(arg):
        with semaphore:  # at most 4 workers are inside this block at once
            foo = api_call(arg)
        return process_data(foo)  # preprocessing runs outside the semaphore

    def init_pool(s):
        # runs once in each worker and stores the shared semaphore as a global
        global semaphore
        semaphore = s

    if __name__ == "__main__":
        n_workers = 8        # placeholder value
        arglist = range(20)  # placeholder arguments
        s = mp.BoundedSemaphore(4)  # max concurrent API calls
        # n_workers should be large enough that a free worker is always waiting on semaphore.acquire()
        with mp.Pool(n_workers, init_pool, (s,)) as p:
            for result in p.imap(map_func, arglist):
                print(result)
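If the "up to 4 holders" behaviour is unclear, here is a small single-process sketch of how the semaphore counts permits. The helper name count_free_permits is made up for this illustration and is not part of multiprocessing. It only uses non-blocking acquire() calls, so it never waits and does not start any worker processes.

```python
import multiprocessing as mp

def count_free_permits(sem, limit=100):
    """Take permits without blocking until none are left, then give them all back.

    Hypothetical helper for illustration only; `limit` is a safety cap.
    """
    taken = 0
    # acquire(block=False) returns True if a permit was free, False otherwise
    while taken < limit and sem.acquire(block=False):
        taken += 1
    for _ in range(taken):
        sem.release()
    return taken

if __name__ == "__main__":
    sem = mp.BoundedSemaphore(4)
    print(count_free_permits(sem))      # 4: all permits are free
    with sem:                           # hold one permit, like a worker inside api_call
        print(count_free_permits(sem))  # 3: one permit is in use
    print(count_free_permits(sem))      # 4: the permit was returned on exit
```

In the pool version, a worker that finds no free permit simply blocks inside "with semaphore:" until another worker finishes its API call, while workers that are already preprocessing are not affected.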

Answer