Occasional deadlock in multiprocessing.Pool

I have N independent tasks that are executed in a multiprocessing.Pool of size os.cpu_count() (8 in my case), with maxtasksperchild=1 (i.e. a fresh worker process is created for each new task).

The main script can be simplified to:

import os
import subprocess as sp
import multiprocessing as mp

def do_work(task: dict) -> dict:
    res = {}

    # ... work (omitted; builds the `cmd` used below) ...

    for i in range(5):
        out = sp.run(cmd, stdout=sp.PIPE, stderr=sp.PIPE, check=False, timeout=60)
        res[i] = out.stdout.decode('utf-8')

    # ... some more work ...

    return res


if __name__ == '__main__':
    tasks = load_tasks_from_file(...) # list of dicts

    logger = mp.get_logger()
    results = []

    with mp.Pool(processes=os.cpu_count(), maxtasksperchild=1) as pool:
        for i, res in enumerate(pool.imap_unordered(do_work, tasks), start=1):
            results.append(res)
            logger.info('PROGRESS: %3d/%3d', i, len(tasks))

    dump_results_to_file(results)

The pool sometimes gets stuck. The traceback I get when I interrupt it with a KeyboardInterrupt is here. It indicates that the pool is no longer fetching new tasks and/or that worker processes are blocked in a queue / pipe recv() call. I have not been able to reproduce this deterministically across different configurations of my experiments; rerunning the same code sometimes finishes gracefully.

Further observations:

  • Python 3.7.9 on x64 Linux
  • start method for multiprocessing is fork (using spawn does not solve the issue)
  • strace reveals that the processes are stuck in a futex wait; gdb’s backtrace also shows: do_futex_wait.constprop
  • disabling logging / explicit flushing does not help
  • there is no bug in how the tasks themselves are defined (they all load correctly).

Update: the deadlock seems to occur even with a pool of size 1.

strace reports that the process is blocked trying to acquire a lock at 0x564c5dbcd000:

futex(0x564c5dbcd000, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY

and gdb confirms:

(gdb) bt
#0  0x00007fcb16f5d014 in do_futex_wait.constprop () from /usr/lib/libpthread.so.0
#1  0x00007fcb16f5d118 in __new_sem_wait_slow.constprop.0 () from /usr/lib/libpthread.so.0
#2  0x0000564c5cec4ad9 in PyThread_acquire_lock_timed (lock=0x564c5dbcd000, microseconds=-1, intr_flag=0)
    at /tmp/build/80754af9/python_1598874792229/work/Python/thread_pthread.h:372
#3  0x0000564c5ce4d9e2 in _enter_buffered_busy (self=self@entry=0x7fcafe1e7e90)
    at /tmp/build/80754af9/python_1598874792229/work/Modules/_io/bufferedio.c:282
#4  0x0000564c5cf50a7e in _io_BufferedWriter_write_impl.isra.2 (self=0x7fcafe1e7e90)
    at /tmp/build/80754af9/python_1598874792229/work/Modules/_io/bufferedio.c:1929
#5  _io_BufferedWriter_write (self=0x7fcafe1e7e90, arg=<optimized out>)
    at /tmp/build/80754af9/python_1598874792229/work/Modules/_io/clinic/bufferedio.c.h:396

Answer

The deadlock was caused by high memory usage in the workers, which triggered the OOM killer. The kernel abruptly killed worker processes, leaving the pool in an inconsistent state: it waits indefinitely for results from workers that no longer exist.
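
One way to confirm this is to check the kernel log, which records OOM kills. A minimal sketch, assuming a Linux machine where dmesg is readable by your user (the exact message wording varies by kernel version):

import subprocess as sp

# Scan the kernel ring buffer for traces of the OOM killer.
log = sp.run(['dmesg'], stdout=sp.PIPE, stderr=sp.PIPE, check=False)
for line in log.stdout.decode('utf-8', errors='replace').splitlines():
    if 'out of memory' in line.lower() or 'oom-kill' in line.lower():
        print(line)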

This script reproduces my original problem.

For the time being, I am considering switching to concurrent.futures.ProcessPoolExecutor, which raises a BrokenProcessPool exception when a worker terminates abruptly, instead of hanging silently.
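
A minimal sketch of that variant, assuming the same do_work and load_tasks_from_file / dump_results_to_file helpers as above (note there is no maxtasksperchild equivalent on Python 3.7; ProcessPoolExecutor only gained max_tasks_per_child in 3.11):

import os
from concurrent.futures import ProcessPoolExecutor, as_completed
from concurrent.futures.process import BrokenProcessPool

if __name__ == '__main__':
    tasks = load_tasks_from_file(...)  # same placeholder as in the original script
    results = []
    try:
        with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
            futures = [pool.submit(do_work, t) for t in tasks]
            for i, fut in enumerate(as_completed(futures), start=1):
                results.append(fut.result())
                print(f'PROGRESS: {i:3d}/{len(tasks):3d}')
    except BrokenProcessPool:
        # Raised if a worker dies unexpectedly (e.g. killed by the OOM killer),
        # instead of the silent hang seen with multiprocessing.Pool.
        print('A worker process terminated abruptly; results are incomplete')

    dump_results_to_file(results)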
