I am new to pandas, and I want to figure out how to group values based on sample quantiles.
For example, I have a dataframe with a column a.
df = pd.DataFrame(np.random.randint(0,100,size=(100, 1)), columns=list('a'))
Then what I want to do is to divide the values in a into 10 different group by their deciles, and named the label of their group in a column b.
Which means in the new column:
| a | b | |
|---|---|---|
| 60 | 2 | group 1 | 
| 30 | 3 | group 1 | 
| 94 | 3 | group 1 | 
| 92 | 3 | group 1 | 
| 63 | 3 | group 1 | 
| … | … | … | 
| 47 | 92 | group 10 | 
| 58 | 98 | group 10 | 
| 66 | 99 | group 10 | 
| 73 | 99 | group 10 | 
| 24 | 100 | group 10 | 
Is that possible?
Now what I do is get the deciles of a first as df2, then merge the df2 to the existing df. A row is like
| a | decile 1 | decile 2 | … | decile 10 | |
|---|---|---|---|---|---|
| 0 | 1 | 5.5 | 18 | … | 100 | 
And get their group by compare them with deciles by a lot of if else condition. It is ok for like quartiles, but almost impossible for like percentiles.
I hope I explain my questions clearly enough. If there is any misleading expression, please let me know.
Thanks for any help in advance!
Advertisement
Answer
First part answer is subtract 1 with integer division by 10 and add 1 for start groups from 1:
df = pd.DataFrame({'a':range(1,101)})
df['b'] = 'group ' + (df.a.sub(1) // 10 + 1).astype(str)
print(df)
      a         b
0     1   group 1
1     2   group 1
2     3   group 1
3     4   group 1
4     5   group 1
..  ...       ...
95   96  group 10
96   97  group 10
97   98  group 10
98   99  group 10
99  100  group 10
EDIT: For deciles use qcut:
df['b'] = pd.qcut(df.a, 10, labels=False)