
How to use pyarrow parquet with multiprocessing

I want to read multiple HDFS files simultaneously using pyarrow and multiprocessing. A simple Python script works (see below), but when I try to do the same thing with multiprocessing, it hangs indefinitely.

My only guess is that the environment differs somehow, but all environment variables should be the same in the child process as in the parent process.

I’ve tried to debug this using print() and by limiting the pool to a single worker. To my surprise, it hangs even with a single worker.

So, what can be the possible causes? How would I debug this?

Code:


Answer

The problem was due to my lack of experience with multiprocessing.

The solution is to switch the multiprocessing start method from the default fork to spawn before creating the pool.

Both the solution and the cause are exactly what https://pythonspeed.com/articles/python-multiprocessing/ describes: the logging state was copied into the child by fork, which caused a deadlock.

Furthermore, although I had only “Pool(1)”, I actually had the parent process plus one child process. That is still two processes, so the deadlock could still occur.
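
The snippet from the answer is not preserved here, so the following is only a minimal sketch, assuming the fix is the one the linked article recommends: starting workers with the spawn method instead of fork. It uses multiprocessing.get_context("spawn") rather than a global set_start_method("spawn") call, which has the same effect for this one pool. The names worker_info and map_with_spawn and the file names are hypothetical, and the worker is a stand-in for the real pyarrow read.

```python
import multiprocessing as mp
import os


def worker_info(path):
    # Stand-in for the real work, e.g. reading one Parquet file with pyarrow.
    # Returns the worker's PID so the output shows which process ran the task.
    return path, os.getpid()


def map_with_spawn(func, items, processes=1, timeout=60):
    # "spawn" starts each worker as a fresh interpreter instead of forking,
    # so locks held in the parent (e.g. by logging) are not copied.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes) as pool:
        # The timeout turns a silent hang into a multiprocessing.TimeoutError.
        return pool.map_async(func, items).get(timeout=timeout)


if __name__ == "__main__":
    paths = ["part-0.parquet", "part-1.parquet"]  # hypothetical file names
    print("parent pid:", os.getpid())
    for path, pid in map_with_spawn(worker_info, paths, processes=1):
        print(path, "handled by worker pid:", pid)
```

Even with processes=1, the worker PID printed differs from the parent PID, which matches the point that a one-worker pool still involves two processes. With spawn, the worker function must be importable, so it is defined at module level and the pool is created under the __main__ guard.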

User contributions licensed under: CC BY-SA