I am using Dask to extend Dask bag items with information from an external, previously computed object arg. Dask seems to allocate memory for arg for every partition at once at the beginning of the computation. Is there a workaround that prevents Dask from duplicating arg many times (and allocating a lot of memory)?
Here is a simplified example:
from pathlib import Path

import numpy as np
import pandas as pd
from dask import bag

in_dir = Path.home() / 'in_dir'
out_dir = Path.home() / 'out_dir'
in_dir.mkdir(parents=True, exist_ok=True)
out_dir.mkdir(parents=True, exist_ok=True)

# Generate some input files.
n_files = 100
n_lines_per_file = int(1e6)
df = pd.DataFrame({'a': np.arange(n_lines_per_file).astype(str)})
for i in range(n_files):
    df.to_csv(in_dir / f'{i}.txt', index=False, header=False)

def mapper(x, arg):
    y = x  # map x to y using arg
    return y

# A large, previously computed object.
arg = np.zeros(int(1e7))

(
    bag
    .read_text(str(in_dir / '*.txt'))
    .map((lambda x, y: x), arg)
    .to_textfiles(str(out_dir / '*.txt'))
)
Answer
One strategy for dealing with this is to scatter your data to the workers first:
import dask.bag
import dask.distributed

client = dask.distributed.Client()

arg = np.zeros(int(1e7))
# Send arg to every worker once; arg_f is a Future referring to the remote copies.
arg_f = client.scatter(arg, broadcast=True)

(
    dask.bag
    .read_text(str(in_dir / '*.txt'))
    .map((lambda x, y: x), arg_f)
    .to_textfiles(str(out_dir / '*.txt'))
)
This sends a copy of the data to each worker, but does not create a copy for each task.
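If you want to confirm the broadcast worked, the distributed client can report which workers hold the scattered key. This is a minimal sketch that assumes the client and arg_f from the snippet above; with broadcast=True the key should show up on every worker:

# Ask the scheduler where the scattered data lives.
# Assumes `client` and `arg_f` are defined as above.
locations = client.who_has(arg_f)  # maps the future's key to worker addresses
print(locations)

# Compare against the total number of workers; with broadcast=True
# every worker address should appear for the single key.
print(len(client.scheduler_info()['workers']))

Because the tasks only carry a reference to the scattered future, the task graph stays small and arg is deserialized once per worker rather than once per task.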