We have a batch processing system which we are looking to modify to use multiple threads. The process takes in a delimited file and performs calculations on it via pandas.
I would like to split the dataframe into N chunks if the total number of records exceeds a threshold. Each chunk would then be fed to a thread from a thread pool executor to run the calculations, and at the end I would wait for all the threads to finish and concatenate the resulting DFs into one.
The problem is that I'm not sure how to split a pandas DF like this. Let's say there will be an arbitrary number of threads, 2 as an example, and I want to start splitting if the record count is over 200000.
So the idea would be: if I send a file with 200001 records, thread 1 would get 100000 records and thread 2 would get 100001. If I send one with 1000000 records, each thread would get 500000.
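To make the split arithmetic concrete, here is a rough sketch of the chunk sizes I have in mind (the numbers are just the example values above):

# Illustration of the even split described above (example numbers only)
n_records = 200001
n_threads = 2
base, extra = divmod(n_records, n_threads)
sizes = [base + 1 if i < extra else base for i in range(n_threads)]
print(sizes)  # [100001, 100000] -- chunk sizes differ by at most one record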
(If the total records don’t exceed this threshold, I’d just execute the process on a single thread)
I have seen related solutions, but none have applied to my case.
Answer
Below, I've included example code showing how to split. Then, using ThreadPoolExecutor, it will execute the code with eight threads in my case (you could also use the threading library). The process_pandas function is just a dummy function; you can use whatever you want:
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

threshold = 300
block_size = 100
num_threads = 8

big_list = pd.read_csv('pandas_list.csv', delimiter=';', header=None)

# Split into fixed-size blocks only when the record count exceeds the threshold.
blocks = []
if len(big_list) > threshold:
    num_full_blocks = len(big_list) // block_size
    for i in range(num_full_blocks):
        blocks.append(big_list[block_size * i:block_size * (i + 1)])
    # Append the leftover rows when the length is not a multiple of block_size.
    if num_full_blocks * block_size < len(big_list):
        blocks.append(big_list[num_full_blocks * block_size:])
else:
    blocks.append(big_list)

def process_pandas(df):
    # Dummy calculation: mark the first row of the block.
    # Work on a copy to avoid pandas' SettingWithCopyWarning,
    # since each block is a slice (view) of the original frame.
    df = df.copy()
    print('Doing calculations...')
    df.loc[df.index[0], 2] = 'changed'
    return df

with ThreadPoolExecutor(max_workers=num_threads) as ex:
    results = ex.map(process_pandas, blocks)

final_dataframe = pd.concat(results, axis=0)
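Since the question asks for a fixed number of chunks rather than a fixed block size, here is a sketch of a variant (my suggestion, not part of the code above) using numpy's array_split, which divides a DataFrame row-wise into N near-equal pieces in one call:

import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

threshold = 200000
num_threads = 2

df = pd.read_csv('pandas_list.csv', delimiter=';', header=None)

def process_pandas(block):
    # Placeholder for the real calculation, like the dummy function above.
    return block.copy()

# np.array_split returns num_threads row-wise pieces whose sizes differ by
# at most one, e.g. 200001 rows -> chunks of 100001 and 100000 rows.
blocks = np.array_split(df, num_threads) if len(df) > threshold else [df]

with ThreadPoolExecutor(max_workers=num_threads) as ex:
    final_dataframe = pd.concat(ex.map(process_pandas, blocks), axis=0)

One caveat: for CPU-bound pandas calculations, Python's GIL means a thread pool may not yield much of a speedup; ProcessPoolExecutor is the usual drop-in alternative, at the cost of pickling each chunk between processes.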