Here is a link to a working example on Google Colaboratory.
I have a dataset that represents the ratings (from 0.0 to 10.0) that users have left on various books. It looks like this:
         user      sum  count      mean
0           2      0.0      1  0.000000
60223  159665      8.0      1  8.000000
60222  159662      8.0      1  8.000000
60221  159655      8.0      1  8.000000
60220  159651      5.0      1  5.000000
...       ...      ...    ...       ...
13576   35859   6294.0   5850  1.075897
37356   98391  51418.0   5891  8.728230
58113  153662  17025.0   6109  2.786872
74815  198711    123.0   7550  0.016291
4213    11676  62092.0  13602  4.564917
The first rows have 1 review each while the last ones have thousands. I want to see the distribution of reviews across the user population. I researched percentiles and binning data with Pandas and found pd.qcut and pd.cut, but with those I was unable to get the output in the format I want.
This is what I’m looking to get.
# users: reviews
# top 10%: 65K rev
# 10%-20%: 23K rev
# etc...
I could not figure out a “Pandas” way to do it so I wrote a loop to generate the data in that format myself and graph it.
SLICE_NUMBERS = 5
step_size = int(user_count / SLICE_NUMBERS)
labels = ['100-80', '80-60', '60-40', '40-20', '0-20']
count_per_percentile = []
for chunk_i in range(SLICE_NUMBERS):
    start_index = step_size * chunk_i
    end_index = start_index + step_size
    slice_sum = most_active_list.iloc[start_index:end_index]['count'].sum()
    count_per_percentile.append(slice_sum)

print(labels)
print(count_per_percentile)  # [21056, 21056, 25058, 62447, 992902]
How can I achieve the same outcome more directly with the library?
Answer
I think you can use qcut to create the slices, combined with a groupby.sum. With the sample data slightly modified to avoid duplicated bin edges on this small sample (I replaced the ones in count by 1, 2, 3, 4, 5):
count_per_percentile = (
    df['count']
    .groupby(pd.qcut(df['count'], q=[0, 0.2, 0.4, 0.6, 0.8, 1]))
    .sum()
    .tolist()
)
print(count_per_percentile)
# [3, 7, 5855, 12000, 21152]
This is the same result you would get with your loop method.
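Put together as a self-contained example (using the modified sample counts described above as stand-in data):

```python
import pandas as pd

# Stand-in for the question's per-user review counts, with the 1s
# replaced by 1..5 so the quintile edges are unique
df = pd.DataFrame({'count': [1, 2, 3, 4, 5, 5850, 5891, 6109, 7550, 13602]})

# Bin users into quintiles of their review count, then sum reviews per bin
count_per_percentile = (
    df['count']
    .groupby(pd.qcut(df['count'], q=[0, 0.2, 0.4, 0.6, 0.8, 1]))
    .sum()
    .tolist()
)
print(count_per_percentile)  # [3, 7, 5855, 12000, 21152]
```

The pd.qcut call labels each row with its quintile interval, and grouping the same column by those labels sums the review counts within each quintile.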
In case your real data has too many 1s (which makes qcut's bin edges non-unique), you could also use np.array_split:
count_per_percentile = [_s.sum() for _s in np.array_split(df['count'].sort_values(), 5)]
print(count_per_percentile)
# [3, 7, 5855, 12000, 21152]  (same result)
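As a self-contained sketch (same stand-in data as above), np.array_split sidesteps qcut's unique-edge requirement by simply cutting the sorted values into five nearly equal-length chunks, ties and all:

```python
import numpy as np
import pandas as pd

# Stand-in per-user review counts; repeated values are fine here
df = pd.DataFrame({'count': [1, 2, 3, 4, 5, 5850, 5891, 6109, 7550, 13602]})

# Sort, split into 5 chunks of (nearly) equal row count, sum each chunk
count_per_percentile = [
    int(chunk.sum()) for chunk in np.array_split(df['count'].sort_values(), 5)
]
print(count_per_percentile)  # [3, 7, 5855, 12000, 21152]
```

Note the trade-off: np.array_split guarantees equal-sized groups of users, whereas qcut guarantees quantile boundaries; with many tied values the two can place ties in different bins.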