
Occasional deadlock in multiprocessing.Pool

I have N independent tasks that are executed in a multiprocessing.Pool of size os.cpu_count() (8 in my case), with maxtasksperchild=1 (i.e. a fresh worker process is created for each new task).

The main script can be simplified to submitting the N tasks to the pool and collecting their results.
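A minimal sketch of the setup described above (the task function, its body, and the task count are hypothetical stand-ins, not the author's actual code):

```python
import os
from multiprocessing import Pool

def run_task(task_id):
    # Placeholder for one of the N independent tasks.
    return task_id * task_id

if __name__ == "__main__":
    n_tasks = 32  # N, chosen arbitrarily for this sketch
    # maxtasksperchild=1: each worker process handles exactly one
    # task and is then replaced by a fresh process.
    with Pool(processes=os.cpu_count(), maxtasksperchild=1) as pool:
        results = pool.map(run_task, range(n_tasks))
    print(results[:4])  # [0, 1, 4, 9]
```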

The pool sometimes gets stuck. The traceback when I send a KeyboardInterrupt is here. It indicates that the pool won't fetch new tasks and/or that worker processes are stuck in a queue/pipe recv() call. I was unable to reproduce this deterministically, even after varying the configuration of my experiments; there's a chance that if I run the same code again, it'll finish gracefully.

Further observations:

  • Python 3.7.9 on x64 Linux
  • start method for multiprocessing is fork (using spawn does not solve the issue)
  • strace reveals that the processes are stuck in a futex wait; gdb’s backtrace also shows: do_futex_wait.constprop
  • disabling logging / explicit flushing does not help
  • there’s no bug in how a task is defined (i.e. they are all loadable).

Update: It seems that deadlock occurs even with a pool of size = 1.

strace reports that the process is blocked trying to acquire a lock located at 0x564c5dbcd000, and gdb's backtrace confirms it.


Answer

The deadlock occurred because of high memory usage in the workers: it triggered the OOM killer, which abruptly terminated the worker subprocesses and left the pool in an inconsistent state.
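For reference (this is not the author's reproduction script), here is a small sketch of what an OOM kill looks like from the parent's perspective: a process terminated by SIGKILL, which is what the OOM killer sends, reports a negative exitcode equal to the negated signal number.

```python
import os
import signal
import time
import multiprocessing as mp

def sleepy():
    # Stand-in for a worker; in the real scenario the OOM killer
    # sends SIGKILL when memory pressure gets too high.
    time.sleep(60)

if __name__ == "__main__":
    p = mp.Process(target=sleepy)
    p.start()
    os.kill(p.pid, signal.SIGKILL)  # simulate the OOM killer
    p.join()
    # A negative exitcode means the child died from that signal.
    print("exitcode:", p.exitcode)  # -9 on Linux (SIGKILL)
```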

This script reproduces my original problem.

For the time being I am considering switching to a concurrent.futures.ProcessPoolExecutor, which raises a BrokenProcessPool exception when a worker is terminated abruptly.
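A minimal sketch of that behavior (the worker function is hypothetical; it SIGKILLs its own process to stand in for the OOM killer):

```python
import os
import signal
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def doomed(_):
    # Simulate an abrupt termination such as an OOM kill.
    os.kill(os.getpid(), signal.SIGKILL)

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=1) as pool:
        future = pool.submit(doomed, None)
        try:
            future.result()
        except BrokenProcessPool as exc:
            # Unlike multiprocessing.Pool, the executor fails loudly
            # instead of deadlocking.
            print("pool broken:", exc)
```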
