I want to split all rows into two groups that have similar means. I have a dataframe of about 50 rows, but this could grow to several thousand, with a column of interest called ‘value’. So far I have tried using a cumulative sum, for which a total column was created, and then I essentially made the split based on where the mid-point of
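A minimal sketch of the kind of approach described, assuming a DataFrame with a numeric 'value' column (the data below is made up for illustration). It shows one reading of the cumulative-sum cut alongside a simple alternating assignment, which tends to keep the two means closer:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: ~50 rows with a 'value' column.
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.integers(1, 100, size=50)})

# Cumulative-sum reading: sort, accumulate, and cut where the running total
# reaches half of the grand total (this balances the two group sums).
s = df["value"].sort_values().reset_index(drop=True)
cut = (s.cumsum() >= s.sum() / 2).idxmax()
sum_balanced = [s.iloc[: cut + 1], s.iloc[cut + 1 :]]

# Alternative targeting similar *means*: sort, then deal rows out alternately,
# so each group receives a comparable mix of small and large values.
mean_balanced = [s.iloc[::2], s.iloc[1::2]]

for name, groups in [("sum-balanced", sum_balanced), ("alternating", mean_balanced)]:
    print(name, [round(g.mean(), 2) for g in groups])
```

Note that balancing the two sums with unequal group sizes does not by itself balance the means; the alternating split keeps both size and mean close.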
How to find maximum group size dynamically
I want to find the maximum group size g_new if I want to partition a list of ‘n’ values. We can have any number of groups. I have: n values and the maximum possible group size g_max, e.g. n = 110 and g_max = 25. We cannot form groups of sizes [28, 28, 27, 27], as no group should have more than 25 elements. Here,
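A short sketch of one common reading of this problem: take the smallest number of groups that keeps every group within g_max, then spread the n values as evenly as possible over those groups. The function name is made up for illustration.

```python
import math

def max_group_size(n: int, g_max: int) -> int:
    # Minimum number of groups allowed if no group may exceed g_max.
    k = math.ceil(n / g_max)
    # Largest group size when n values are spread as evenly as possible over k groups.
    return math.ceil(n / k)

# Example from the question: 110 values with a maximum group size of 25.
print(max_group_size(110, 25))  # -> 22, e.g. five groups of 22
```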
Fastest way to split a list into a list of lists based on another list of lists
Say I have a list that contains 5 unique integers in the range 0 to 9. I also have a list of lists, which is obtained by splitting the integers from 0 to 19 into 6 groups: Now I want to split lst based on the reference partitions. For example, if I have I expect the output to be a
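The question's own lists and expected output are not shown above, so the inputs below are hypothetical, matching only the description (5 unique integers in 0–9, and a reference partition of 0–19 into 6 groups). The usual technique is to build a value-to-group lookup once and then bucket the list's elements by it:

```python
# Hypothetical inputs matching the description.
lst = [0, 2, 5, 7, 9]
partitions = [[0, 1, 2], [3, 4], [5, 6, 7], [8, 9, 10],
              [11, 12, 13, 14, 15], [16, 17, 18, 19]]

# Map each integer to the index of the reference group it belongs to,
# then bucket the elements of lst by that index.
group_of = {x: i for i, part in enumerate(partitions) for x in part}
buckets = [[] for _ in partitions]
for x in lst:
    buckets[group_of[x]].append(x)

# Keep only the non-empty buckets, preserving the partition order.
result = [b for b in buckets if b]
print(result)  # [[0, 2], [5, 7], [9]]
```

Building the dictionary once makes the split O(len(lst)) per query, which matters when the reference partition is large.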
Losing index information when using dask.dataframe.to_parquet() with partitioning
When I was using dask=1.2.2 with pyarrow 0.11.1, I did not observe this behavior. After updating (dask=2.10.1 and pyarrow=0.15.1), I cannot save the index when I use the to_parquet method with the partition_on and write_index arguments. Here I have created a minimal example which shows the issue: Which gives: I did not see this described anywhere in the dask documentation. Does
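The question's actual minimal example and its output are not shown above; the following is only a hedged sketch of the kind of round trip being described, assuming a frame with a named index and a column used for partitioning:

```python
import pandas as pd
import dask.dataframe as dd

# Hypothetical data: a small frame with a named index and a partition column.
pdf = pd.DataFrame(
    {"partition_col": ["a", "a", "b", "b"], "value": [1, 2, 3, 4]},
    index=pd.Index([10, 11, 12, 13], name="my_index"),
)
ddf = dd.from_pandas(pdf, npartitions=2)

# Write with partitioning while asking for the index to be kept ...
ddf.to_parquet("out.parquet", engine="pyarrow",
               partition_on=["partition_col"], write_index=True)

# ... then read back and inspect whether 'my_index' survived the round trip.
print(dd.read_parquet("out.parquet", engine="pyarrow").head())
```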