Skip to content
Advertisement

Dask dataframe: Can `set_index` put a single index into multiple partitions?

Empirically it seems that whenever you set_index on a Dask dataframe, Dask will always put rows with equal indexes into a single partition, even if it results in wildly imbalanced partitions.

Here is a demonstration:

JavaScript

However, I found no guarantee of this behaviour anywhere.

I have tried to sift through the code myself but gave up. I believe one of these inter-related functions probably holds the answer:

When you set_index, is it the case that a single index can never be in two different partitions? If not, then under what conditions does this property hold?


Bounty: I will award the bounty to an answer that draws from a reputable source. For example, referring to the implementation to show that this property has to hold.

Advertisement

Answer

is it the case that a single index can never be in two different partitions?

No, it’s certainly allowed. Dask will even intend for this to happen. However, because of a bug in set_index, all the data will still end up in one partition.

An extreme example (every row is the same value except one):

JavaScript

As you can see, Dask intends for the 0s to be split up between multiple partitions. Yet when the shuffle actually happens, all the 0s still end up in one partition:

JavaScript

This is because the code deciding which output partition a row belongs doesn’t consider duplicates in divisions. Treating divisions as a Series, it uses searchsorted with side="right", hence why all the data always ends up in the last partition.

I’ll update this answer when the issue is fixed.

User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement