Here is a link to a working example on Google Colaboratory.
I have a dataset that represents the reviews (between 0.0 to 10.0) that users have left on various books. It looks like this:
          user      sum  count      mean
    0        2      0.0      1  0.000000
60223   159665      8.0      1  8.000000
60222   159662      8.0      1  8.000000
60221   159655      8.0      1  8.000000
60220   159651      5.0      1  5.000000
13576    35859   6294.0   5850  1.075897
37356    98391  51418.0   5891  8.728230
58113   153662  17025.0   6109  2.786872
74815   198711    123.0   7550  0.016291
 4213    11676  62092.0  13602  4.564917
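A per-user table like this can be built with a groupby aggregation; roughly like the sketch below, where the raw ratings frame and its column names ('user', 'rating') are made up for illustration.
import pandas as pd

# Hypothetical raw data: one row per (user, rating) pair; the column
# names are illustrative, not from the actual dataset
ratings = pd.DataFrame({
    'user':   [2, 159665, 35859, 35859, 35859],
    'rating': [0.0, 8.0, 1.0, 0.5, 1.7],
})

# Per-user summary with the same columns as the table above
summary = ratings.groupby('user')['rating'].agg(['sum', 'count', 'mean'])
print(summary)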
The first rows of the table above have 1 review each, while the last ones have thousands. I want to see the distribution of the reviews across the user population. I looked into percentiles and binning data with Pandas and found pd.qcut and pd.cut, but with those I was unable to get the output in the format I want.
This is what I’m looking to get.
# users: reviews
# top 10%: 65K rev
# 10%-20%: 23K rev
# etc...
I could not figure out a “Pandas” way to do it, so I wrote a loop to generate the data in that format myself and graph it.
SLICE_NUMBERS = 5
# most_active_list is the per-user DataFrame shown above (sorted by 'count');
# user_count is its number of rows
step_size = int(user_count / SLICE_NUMBERS)
labels = ['100-80', '80-60', '60-40', '40-20', '0-20']
count_per_percentile = []
for chunk_i in range(SLICE_NUMBERS):
    start_index = step_size * chunk_i
    end_index = start_index + step_size
    # sum of review counts for this 20% slice of users
    slice_sum = most_active_list.iloc[start_index:end_index]['count'].sum()
    count_per_percentile.append(slice_sum)
print(labels)
print(count_per_percentile)  # [21056, 21056, 25058, 62447, 992902]
How can I achieve the same outcome more directly with the library?
Answer
I think you can use qcut to create the slices, inside a groupby.sum(). With the sample data slightly modified to avoid duplicated bin edges on such a small sample (I replaced the ones in count with 1, 2, 3, 4, 5):
count_per_percentile = (
    df['count']
    .groupby(pd.qcut(df['count'], q=[0, 0.2, 0.4, 0.6, 0.8, 1]))
    .sum()
    .tolist()
)
print(count_per_percentile)
# [3, 7, 5855, 12000, 21152]
which is the same result your loop method gives on this sample.
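If you also want the labelled output from the question (e.g. "top 20%: N rev"), one possible sketch is to pass labels= to qcut; the label strings below are just illustrative:
labels = ['0-20%', '20-40%', '40-60%', '60-80%', 'top 20%']
per_slice = (
    df['count']
    .groupby(pd.qcut(df['count'], q=[0, 0.2, 0.4, 0.6, 0.8, 1], labels=labels))
    .sum()
)
for label, total in per_slice.items():
    print(f'# {label}: {total} rev')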
In case your real data has too many 1s (which would again give duplicated quantile edges), you could also use np.array_split:
import numpy as np

count_per_percentile = [s.sum() for s in np.array_split(df['count'].sort_values(), 5)]
print(count_per_percentile)
# [3, 7, 5855, 12000, 21152]  # same result
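The difference between the two: qcut bins by value quantiles, so many identical counts produce duplicate bin edges and an error, while np.array_split simply cuts the sorted Series into five equal-sized chunks by position. If you would rather stay with qcut on such data, it also accepts duplicates='drop', at the cost of possibly merging bins; a small sketch of that variant:
# qcut with an integer q and duplicates='drop' merges colliding edges,
# so very skewed data may end up with fewer than 5 bins
bins = pd.qcut(df['count'], q=5, duplicates='drop')
print(df['count'].groupby(bins).sum().tolist())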