I want to read multiple HDFS files simultaneously using pyarrow and multiprocessing.
The simple Python script works (see below), but if I try to do the same thing with multiprocessing, it hangs indefinitely.
My only guess is that the environment is somehow different, but all the environment variables should be the same in the child process and the parent process.
I've tried to debug this with print() and by restricting the pool to a single worker. To my surprise, it fails even with a single worker.
So, what could be the possible causes? How would I debug this?
Code:
import pyarrow.parquet as pq
from multiprocessing import Pool

def read_pq(file):
    table = pq.read_table(file)
    return table

##### this works #####
table = read_pq('hdfs://myns/mydata/000000_0')

###### this doesn't work #####
result_async = []
with Pool(1) as pool:
    result_async.append(
        pool.apply_async(pq.read_table, args=('hdfs://myns/mydata/000000_0',))
    )
    results = [r.get() for r in result_async]  ###### hangs here indefinitely, no exceptions raised

print(results)  ###### expecting to get List[pq.Table]
Answer
The problem was due to my lack of experience with multiprocessing.
The solution is to add:
from multiprocessing import set_start_method
set_start_method("spawn")
The solution and the reason are exactly what https://pythonspeed.com/articles/python-multiprocessing/ describes: the logging state was forked into the child process and caused a deadlock.
Furthermore, although I used only Pool(1), there were still two processes in total (the parent plus one child), so the deadlock could still occur.
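For completeness, here is a minimal sketch of how the original snippet might look with the spawn start method applied (the HDFS path is the hypothetical one from the question; set_start_method should be called once, before the pool is created):

import pyarrow.parquet as pq
from multiprocessing import Pool, set_start_method

if __name__ == "__main__":
    # Use spawn so the child starts from a fresh interpreter
    # instead of inheriting forked state (locks, logging, etc.).
    set_start_method("spawn")

    result_async = []
    with Pool(1) as pool:
        result_async.append(
            pool.apply_async(pq.read_table, args=('hdfs://myns/mydata/000000_0',))
        )
        results = [r.get() for r in result_async]

    print(results)  # expecting a list of pyarrow.Table objects

The if __name__ == "__main__" guard matters with spawn: the child process re-imports the main module, and without the guard it would try to execute the pool setup again.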