Empirically, it seems that whenever you call set_index on a Dask DataFrame, Dask puts all rows with equal index values into a single partition, even if that results in wildly imbalanced partitions.
Here is a demonstration:
import pandas as pd
import dask.dataframe as dd
users = [1]*1000 + [2]*1000 + [3]*1000
df = pd.DataFrame({'user': users})
ddf = dd.from_pandas(df, npartitions=1000)
ddf = ddf.set_index('user')
counts = ddf.map_partitions(lambda x: len(x)).compute()
counts.loc[counts > 0]
# 500 1000
# 999 2000
# dtype: int64
However, I found no guarantee of this behaviour anywhere.
I tried to sift through the code myself but gave up. I believe one of the inter-related shuffle functions probably holds the answer.
When you set_index, is it the case that a single index value can never be in two different partitions? If not, then under what conditions does this property hold?
Bounty: I will award the bounty to an answer that draws on a reputable source, for example by referring to the implementation to show that this property must hold.
Answer
is it the case that a single index can never be in two different partitions?
No, it’s certainly allowed, and Dask even intends for it to happen. However, because of a bug in set_index, all the data still ends up in one partition.
An extreme example (every row is the same value except one):
In [1]: import dask.dataframe as dd
In [2]: import pandas as pd
In [3]: df = pd.DataFrame({"A": [0] + [1] * 20})
In [4]: ddf = dd.from_pandas(df, npartitions=10)
In [5]: s = ddf.set_index("A")
In [6]: s.divisions
Out[6]: (0, 0, 0, 0, 0, 0, 0, 1)
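For context, here is my reading of what a divisions tuple means (a sketch, not a quote from the docs): n + 1 boundaries describe n partitions, where partition i is meant to hold index values in [divisions[i], divisions[i+1]), with the last partition also closed on the right. Reading the tuple above that way:

```python
# Sketch: interpreting the divisions tuple from Out[6] above.
divisions = (0, 0, 0, 0, 0, 0, 0, 1)

npartitions = len(divisions) - 1  # 8 boundaries -> 7 partitions
print(npartitions)  # -> 7

# Intended boundary pair for each partition (the repeated 0s mean
# Dask plans to spread the 0s across the first several partitions).
for i in range(npartitions):
    print(i, divisions[i], divisions[i + 1])
```

The six (0, 0) pairs are partitions whose entire intended range is the single value 0, which is why the answer says Dask intends for the 0s to be split up.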
As you can see, Dask intends for the 0s to be split up between multiple partitions. Yet when the shuffle actually happens, all the 0s still end up in one partition:
In [7]: import dask
In [8]: dask.compute(s.to_delayed()) # easy way to see the partitions separately
Out[8]:
([Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],)
This is because the code that decides which output partition a row belongs to doesn’t account for duplicates in divisions. Treating divisions as a Series, it uses searchsorted with side="right", which is why all the data always ends up in the last partition.
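A minimal pandas-only sketch of the faulty rule as I understand it (this is an illustration, not Dask's actual shuffle code): with duplicated boundaries, searchsorted with side="right" jumps past every copy of the repeated value, so every matching row maps to the last partition.

```python
import pandas as pd

# divisions from the example above: 8 boundaries -> 7 partitions (0..6)
divisions = pd.Series([0, 0, 0, 0, 0, 0, 0, 1])

def output_partition(value):
    # side="right" skips past every duplicate boundary, so any value
    # equal to a repeated division lands at the far end.
    pos = divisions.searchsorted(value, side="right") - 1
    # clamp into the valid partition range [0, npartitions - 1]
    return int(min(max(pos, 0), len(divisions) - 2))

print(output_partition(0))  # -> 6 (the last of the 7 partitions)
print(output_partition(1))  # -> 6 (also the last partition)
```

Note that side="left" alone wouldn't help either: it would send every 0 to the first partition instead. The duplicates in divisions simply aren't handled, which matches the behaviour shown in Out[8].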
I’ll update this answer when the issue is fixed.