Skip to content
Advertisement

High memory allocation when using dask.bag.map

I am using dask for extending dask bag items by information from an external, previously computed object arg. Dask seems to allocate memory for arg for each partition at once in the beginning of the computation process.

Is there a workaround to prevent Dask from duplicating the arg multiple times (and allocating a lot of memory)?

Here is a simplified example:

JavaScript

Advertisement

Answer

One strategy for dealing with this is to scatter your data to workers first:

JavaScript

This sends a copy of the data to each worker, but does not create a copy for each task.

Advertisement