I want to split all rows into two groups that have similar means. I have a dataframe of about 50 rows, but this could grow to several thousand, with a column of interest called ‘value’. So far I have tried using a cumulative sum, for which a total column was created, and then I essentially made the split based on where the mid-point of
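A minimal sketch of the kind of approach described, assuming a DataFrame with a numeric 'value' column (the data below is made up for illustration). It shows one reading of the cumulative-sum cut alongside a simple alternating assignment, which tends to keep the two means closer:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: ~50 rows with a 'value' column.
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.integers(1, 100, size=50)})

# Cumulative-sum reading: sort, accumulate, and cut where the running total
# reaches half of the grand total (this balances the two group sums).
s = df["value"].sort_values().reset_index(drop=True)
cut = (s.cumsum() >= s.sum() / 2).idxmax()
sum_balanced = [s.iloc[: cut + 1], s.iloc[cut + 1 :]]

# Alternative targeting similar *means*: sort, then deal rows out alternately,
# so each group receives a comparable mix of small and large values.
mean_balanced = [s.iloc[::2], s.iloc[1::2]]

for name, groups in [("sum-balanced", sum_balanced), ("alternating", mean_balanced)]:
    print(name, [round(g.mean(), 2) for g in groups])
```

Note that balancing the two sums with unequal group sizes does not by itself balance the means; the alternating split keeps both size and mean close.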
How to find maximum group size dynamically
I want to find the maximum group size g_new if I want to partition a list of ‘n’ values. We can have any number of groups. I have: n values and the maximum possible group size g_max, e.g. n = 110 and g_max = 25. We cannot form groups of sizes [28, 28, 27, 27], as no group should have more than 25 elements. Here,
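A short sketch of one common reading of this problem: take the smallest number of groups that keeps every group within g_max, then spread the n values as evenly as possible over those groups. The function name is made up for illustration.

```python
import math

def max_group_size(n: int, g_max: int) -> int:
    # Minimum number of groups allowed if no group may exceed g_max.
    k = math.ceil(n / g_max)
    # Largest group size when n values are spread as evenly as possible over k groups.
    return math.ceil(n / k)

# Example from the question: 110 values with a maximum group size of 25.
print(max_group_size(110, 25))  # -> 22, e.g. five groups of 22
```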
Fastest way to split a list into a list of lists based on another list of lists
Say I have a list that contains 5 unique integers in the range 0 to 9. I also have a list of lists, which is obtained by splitting the integers from 0 to 19 into 6 groups: Now I want to split lst based on the reference partitions. For example, if I have I expect the output to be a
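The question's own lists and expected output are not shown above, so the inputs below are hypothetical, matching only the description (5 unique integers in 0–9, and a reference partition of 0–19 into 6 groups). The usual technique is to build a value-to-group lookup once and then bucket the list's elements by it:

```python
# Hypothetical inputs matching the description.
lst = [0, 2, 5, 7, 9]
partitions = [[0, 1, 2], [3, 4], [5, 6, 7], [8, 9, 10],
              [11, 12, 13, 14, 15], [16, 17, 18, 19]]

# Map each integer to the index of the reference group it belongs to,
# then bucket the elements of lst by that index.
group_of = {x: i for i, part in enumerate(partitions) for x in part}
buckets = [[] for _ in partitions]
for x in lst:
    buckets[group_of[x]].append(x)

# Keep only the non-empty buckets, preserving the partition order.
result = [b for b in buckets if b]
print(result)  # [[0, 2], [5, 7], [9]]
```

Building the dictionary once makes the split O(len(lst)) per query, which matters when the reference partition is large.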
Losing index information when using dask.dataframe.to_parquet() with partitioning
When I was using dask=1.2.2 with pyarrow 0.11.1, I did not observe this behavior. After updating (dask=2.10.1 and pyarrow=0.15.1), I cannot save the index when I use the to_parquet method with the partition_on and write_index arguments. Here I have created a minimal example which shows the issue: Which gives: I did not see this described anywhere in the dask documentation. Does
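The question's actual minimal example and its output are not shown above; the following is only a hedged sketch of the kind of round trip being described, assuming a frame with a named index and a column used for partitioning:

```python
import pandas as pd
import dask.dataframe as dd

# Hypothetical data: a small frame with a named index and a partition column.
pdf = pd.DataFrame(
    {"partition_col": ["a", "a", "b", "b"], "value": [1, 2, 3, 4]},
    index=pd.Index([10, 11, 12, 13], name="my_index"),
)
ddf = dd.from_pandas(pdf, npartitions=2)

# Write with partitioning while asking for the index to be kept ...
ddf.to_parquet("out.parquet", engine="pyarrow",
               partition_on=["partition_col"], write_index=True)

# ... then read back and inspect whether 'my_index' survived the round trip.
print(dd.read_parquet("out.parquet", engine="pyarrow").head())
```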