I want to split all rows into two groups that have similar means.
I have a dataframe of about 50 rows (though this could grow to several thousand) with a column of interest called 'value'.
                     value     total  bucket
300048137           3.0741    3.0741       0
352969997           2.1024    5.1765       0
abc13.com           4.5237    9.7002       0
abc7.com            5.8202   15.5204       0
abcnews.go.com      6.7270   22.2474       0
........
www.legacy.com     12.6609  263.0797       1
www.math-aids.com  10.9832  274.0629       1
So far I have tried a cumulative-sum approach: I created the total column with a cumulative sum and then split at the point where the running total crosses the midpoint of the overall sum, based on this solution.
import numpy as np

test['total'] = test['value'].cumsum()
df_sum = test['value'].sum() // 2
test['bucket'] = np.where(test['total'] <= df_sum, 0, 1)
However, if I group the rows and take the average of each group, the difference between the two means is quite significant:
display(test.groupby('bucket')['value'].mean())

bucket
0     7.456262
1    10.773905
Is there a way I could achieve this partition based on means instead of sums? I was thinking about using expanding means from pandas but couldn’t find a proper way to do it.
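For reference, pandas can compute a running (expanding) mean of the column directly; a minimal sketch, assuming the test dataframe above (this alone does not produce the partition, it only shows the expanding-mean call the question refers to):

# Running mean of 'value' over the first 1, 2, 3, ... rows
test['running_mean'] = test['value'].expanding().mean()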
Answer
I am not sure I understand what you are trying to do, but possibly you want to group by quantiles of a column. If so:
test['bucket'] = pd.qcut(test['value'], q=2, labels=False)
which will set bucket=0 for the half of the rows with the smaller values and 1 for the rest. By tweaking the q parameter you can have as many groups as you want (as long as q <= the number of rows).
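For example, a minimal sketch (assuming the same test dataframe from the question) that splits the rows into four quantile-based buckets instead of two:

import pandas as pd

# q=4 assigns each row to one of four quantile buckets (0 = smallest values, 3 = largest)
test['bucket'] = pd.qcut(test['value'], q=4, labels=False)

# Inspect how many rows fall into each bucket
print(test['bucket'].value_counts().sort_index())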
Edit: a new attempt, now that I think I better understand your aim:
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.arange(100)})

# The values are already sorted, so argsort() returns 0..99 and mod(2)
# alternates rows between the two groups
df['group'] = df['value'].argsort().mod(2)
df.groupby('group')['value'].mean()
# group
# 0    49
# 1    50
# Name: value, dtype: int64

# The same idea generalises to any number of groups
df['group'] = df['value'].argsort().mod(3)
df.groupby('group')['value'].mean()
# group
# 0    49.5
# 1    49.0
# 2    50.0
# Name: value, dtype: float64
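Applied to the test dataframe from the question, the same alternating-assignment idea might look like the sketch below. This is an assumption on my part rather than part of the answer above: rank() is used instead of argsort() because test is not sorted by value, and alternating rows in order of increasing value is what keeps the bucket means close.

# Rank rows by 'value' (1..n, ties broken by order), then alternate buckets 0, 1, 0, 1, ...
test['bucket'] = (test['value'].rank(method='first') - 1).astype(int) % 2

# The two bucket means should now be much closer than with the cumulative-sum split
print(test.groupby('bucket')['value'].mean())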