I have written a function create_time_series(input_df1, info_df1, unit_name, start_date, end_date), which aims to create a time series based on log files saved in input_df1.
The problem with my function is that it runs slowly, so I thought of parallelizing it.
The following code is my attempt at utilizing the multiprocessing library:
    if __name__ == '__main__':
        arg = corrected_data,block_info,(unit for unit in block_info.UnitID.unique()),"2015-01-01","2021-12-31"
        with Pool(processes = 16) as pool:
            temp_data = pool.starmap(create_time_series,arg)
        out_data = pd.concat([out_data,temp_data[unit]],axis =1)
In the task manager, I can see the processes running; however, they seem to be idling. Hence my question: what did I do wrong in attempting to parallelize the task?
Answer
You are not splitting your load: arg is a single tuple of arguments, not one tuple per unit, so the pool has no per-unit work to distribute. Check the documentation for starmap: it expects an iterable (e.g. a list) of tuples, each of which holds all the arguments required for one call.
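A minimal sketch of what that could look like, assuming one call per unit. The body of create_time_series here is only a stand-in that echoes its arguments, build_args is a hypothetical helper name, and the two dictionaries are placeholders for corrected_data and block_info:

```python
from multiprocessing import Pool


def create_time_series(input_df1, info_df1, unit_name, start_date, end_date):
    # Hypothetical stand-in for the real function: it only echoes its arguments.
    return (unit_name, start_date, end_date)


def build_args(input_df1, info_df1, units, start_date, end_date):
    # Hypothetical helper: one tuple per unit, each holding every argument
    # needed for a single call to create_time_series.
    return [(input_df1, info_df1, unit, start_date, end_date) for unit in units]


if __name__ == "__main__":
    # Placeholder objects standing in for corrected_data and block_info.
    corrected_data = {"log": [1, 2, 3]}
    block_info = {"UnitID": ["A", "B", "C"]}

    args = build_args(corrected_data, block_info, block_info["UnitID"],
                      "2015-01-01", "2021-12-31")

    with Pool(processes=2) as pool:
        # starmap unpacks each tuple into the function's positional arguments
        # and returns the results in the same order as the input list.
        results = pool.starmap(create_time_series, args)

    print(results)
```

With the real DataFrames, the list would be built from block_info.UnitID.unique(), and, assuming each call returns a DataFrame or Series, the results could then be combined with pd.concat(results, axis=1). Note that every tuple is pickled and sent to a worker, so passing large frames in each one may eat into the speedup.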