
How to use pyarrow parquet with multiprocessing

I want to read multiple HDFS files simultaneously using pyarrow and multiprocessing. The simple Python script works (see below), but if I try to do the same thing with multiprocessing, it hangs indefinitely.

My only guess is that the environment is somehow different, but all the environment variables should be the same in the child process and the parent process.

I've tried to debug this using print() and by limiting the pool to a single worker. To my surprise, it fails even with only one worker.

So, what can be the possible causes? How would I debug this?

Code:

import pyarrow.parquet as pq

def read_pq(file):
  table = pq.read_table(file)
  return table

##### this works #####
table = read_pq('hdfs://myns/mydata/000000_0')

##### this doesn't work #####
from multiprocessing import Pool

result_async = []
with Pool(1) as pool:
  result_async.append(pool.apply_async(pq.read_table, args=('hdfs://myns/mydata/000000_0',)))
  results = [r.get() for r in result_async]  ##### hangs here indefinitely, no exceptions raised
  print(results)    ##### expecting to get List[pq.Table]

#########################


Answer

The problem was due to my lack of experience with multiprocessing.

The solution is to add:

from multiprocessing import set_start_method
set_start_method("spawn")

The solution and the reason are exactly what https://pythonspeed.com/articles/python-multiprocessing/ describes: the logging state was copied into the forked child process and caused a deadlock.

Furthermore, although I used only Pool(1), there were still two processes (the parent plus the child worker), so the fork deadlock could still occur.
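
For reference, here is a minimal sketch of the full fixed script, assuming the same HDFS path and read_pq helper from the question; Pool(1) and the single-element file list are kept only for illustration:

import pyarrow.parquet as pq
from multiprocessing import Pool, set_start_method

def read_pq(file):
  # each worker process opens its own connection and reads the Parquet file
  return pq.read_table(file)

if __name__ == '__main__':
  # "spawn" starts the child with a fresh interpreter instead of forking
  # the parent's state (e.g. held logging locks); call it once, before the Pool
  set_start_method("spawn")

  files = ['hdfs://myns/mydata/000000_0']
  result_async = []
  with Pool(1) as pool:
    for f in files:
      result_async.append(pool.apply_async(read_pq, args=(f,)))
    results = [r.get() for r in result_async]  # List[pyarrow.Table]
  print(results)

The if __name__ == '__main__': guard matters here: with the spawn start method the module is re-imported in the child, so the pool-creating code must not run at import time.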
