Skip to content
Advertisement

Optimal way to acquire percentiles of DataFrame rows

Problem

I have a pandas DataFrame df:

year        val0        val1        val2         ...          val98         val99
1983        -42.187     15.213      -32.185                   12.887        -33.821
1984        39.213      -142.344    23.221                    0.230         1.000
1985        -31.204     0.539       2.000                     -1.000        3.442
...
2007        4.239       5.648       -15.483                   3.794         -25.459
2008        6.431       0.831       -34.210                   0.000         24.527
2009        -0.160      2.639       -2.196                    52.628        71.291

My desired output, i.e. new_df, contains the 9 different percentiles including the median, and should have the following format:

year    percentile_10    percentile_20    percentile_30    percentile_40    median    percentile_60    percentile_70    percentile_80    percentile_90
1983    -40.382          -33.182          -25.483          -21.582          -14.424   -9.852           -3.852           6.247            10.528
...
2009    -3.248           0.412            6.672            10.536           12.428    20.582           46.248           52.837           78.991

Attempt

The following was my initial attempt:

def percentile(n):
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = 'percentile_%s' % n
    return percentile_

new_df = df.groupby('year').agg([percentile(10), percentile(20), percentile(30), percentile(40), np.median, percentile(60), percentile(70), percentile(80), percentile(90)]).reset_index()

However, instead of returning the percentiles of all columns, it calculated these percentiles for each val column and therefore returned 1000 columns. As it calculated the percentiles for each val, all percentiles returned the same values.

I still managed to run the desired task by trying the following:

list_1 = []
list_2 = []
list_3 = []
list_4 = []
mlist = []
list_6 = []
list_7 = []
list_8 = []
list_9 = []

for i in range(len(df)):
  list_1.append(np.percentile(df.iloc[i,1:],10))
  list_2.append(np.percentile(df.iloc[i,1:],20))
  list_3.append(np.percentile(df.iloc[i,1:],30))
  list_4.append(np.percentile(df.iloc[i,1:],40))
  mlist.append(np.median(df.iloc[i,1:]))
  list_6.append(np.percentile(df.iloc[i,1:],60))
  list_7.append(np.percentile(df.iloc[i,1:],70))
  list_8.append(np.percentile(df.iloc[i,1:],80))
  list_9.append(np.percentile(df.iloc[i,1:],90))

df['percentile_10'] = list_1
df['percentile_20'] = list_2
df['percentile_30'] = list_3
df['percentile_40'] = list_4
df['median'] = mlist
df['percentile_60'] = list_6
df['percentile_70'] = list_7
df['percentile_80'] = list_8
df['percentile_90'] = list_9

new_df= df[['year', 'percentile_10','percentile_20','percentile_30','percentile_40','median','percentile_60','percentile_70','percentile_80','percentile_90']]

But this blatantly is such a laborous, manual, and one-dimensional way to achieve the task. What is the most optimal way to find the percentiles of each row for multiple columns?

Advertisement

Answer

You can get use .describe() function like this:

# Create Datarame
df = pd.DataFrame(np.random.randn(5,3))
# .apply() the .describe() function with "axis = 1" rows
df.apply(pd.DataFrame.describe, axis=1)

output:

   count      mean       std       min       25%       50%       75%       max
0    3.0  0.422915  1.440097 -0.940519 -0.330152  0.280215  1.104632  1.929049
1    3.0  1.615037  0.766079  0.799817  1.262538  1.725259  2.022647  2.320036
2    3.0  0.221560  0.700770 -0.585020 -0.008149  0.568721  0.624849  0.680978
3    3.0 -0.119638  0.182402 -0.274168 -0.220240 -0.166312 -0.042373  0.081565
4    3.0 -0.569942  0.807865 -1.085838 -1.035455 -0.985072 -0.311994  0.361084

if you want other percentiles than the default 0.25, .05, .075 you can create your own function where you change the values of .describe(percentiles = [0.1, 0.2...., 0.9])

User contributions licensed under: CC BY-SA
9 People found this is helpful
Advertisement