
Pandas sum of count per percentile of rows

Here is a link to a working example on Google Colaboratory.

I have a dataset that represents the reviews (between 0.0 to 10.0) that users have left on various books. It looks like this:

         user      sum  count      mean
0           2      0.0      1  0.000000
60223  159665      8.0      1  8.000000
60222  159662      8.0      1  8.000000
60221  159655      8.0      1  8.000000
60220  159651      5.0      1  5.000000
...       ...      ...    ...       ...
13576   35859   6294.0   5850  1.075897
37356   98391  51418.0   5891  8.728230
58113  153662  17025.0   6109  2.786872
74815  198711    123.0   7550  0.016291
4213    11676  62092.0  13602  4.564917

The first rows have 1 review each, while the last ones have thousands. I want to see the distribution of reviews across the user population. I researched percentile and binning with Pandas and found pd.qcut and pd.cut, but I was unable to get the output in the format I want.

This is what I’m looking to get.

# users: reviews
# top 10%: 65K rev
# 10%-20%: 23K rev
# etc...

I could not figure out a "Pandas" way to do it, so I wrote a loop to generate the data in that format myself and graph it.

SLICE_NUMBERS = 5
step_size = int(user_count / SLICE_NUMBERS)
labels = ['100-80', '80-60', '60-40', '40-20', '20-0']
count_per_percentile = []
for chunk_i in range(SLICE_NUMBERS):
  start_index = step_size * chunk_i
  end_index = start_index + step_size
  slice_sum = most_active_list.iloc[start_index:end_index]['count'].sum()
  count_per_percentile.append(slice_sum)

print(labels)
print(count_per_percentile)  # [21056, 21056, 25058, 62447, 992902]

How can I achieve the same outcome more directly with the library?


Answer

I think you can use qcut to create the slices, combined with a groupby(...).sum(). With the sample data slightly modified to avoid duplicated bin edges on this small sample (I replaced the ones in the count column with 1, 2, 3, 4, 5):

count_per_percentile = (
    df['count']
      .groupby(pd.qcut(df['count'], q=[0,0.2,0.4,0.6,0.8,1])).sum()
      .tolist()
)
print(count_per_percentile)
# [3, 7, 5855, 12000, 21152]

This is the same result your loop produces on this sample.
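To get output closer to the labelled "# users: reviews" format from the question, qcut also accepts a labels argument. A minimal, self-contained sketch, using a synthetic DataFrame that mirrors the modified sample above (column name "count" as in the question; the label strings are illustrative):

```python
import pandas as pd

# Synthetic stand-in for the question's per-user review counts
df = pd.DataFrame({"count": [1, 2, 3, 4, 5, 5850, 5891, 6109, 7550, 13602]})

labels = ["0-20%", "20-40%", "40-60%", "60-80%", "80-100%"]

# Bin users into quintiles by review count, then sum reviews per bin
per_slice = df["count"].groupby(pd.qcut(df["count"], q=5, labels=labels)).sum()
print(per_slice)
```

The result is a Series indexed by the percentile labels, so the sums line up with readable slice names instead of raw interval edges.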

If your real data contains too many duplicate values (e.g. many users with a single review), qcut can fail with duplicated bin edges. In that case you can use np.array_split instead:

import numpy as np

count_per_percentile = [s.sum() for s in np.array_split(df['count'].sort_values(), 5)]
print(count_per_percentile)
# [3, 7, 5855, 12000, 21152]  (same result)
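The array_split sums can likewise be zipped with labels to print the format the question asked for. A sketch on the same synthetic sample (five equal slices ordered from least to most active; the label strings are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"count": [1, 2, 3, 4, 5, 5850, 5891, 6109, 7550, 13602]})

labels = ["0-20%", "20-40%", "40-60%", "60-80%", "80-100%"]

# Sort users by review count, split into 5 equal-size chunks, sum each chunk
sums = [s.sum() for s in np.array_split(df["count"].sort_values(), 5)]

for label, total in zip(labels, sums):
    print(f"{label}: {total} rev")
```

Unlike qcut, array_split always produces (near-)equal-size groups regardless of ties, which is why it is robust to many repeated values.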
User contributions licensed under: CC BY-SA