Optimal way to acquire percentiles of DataFrame rows

Problem

I have a pandas DataFrame df:

year        val0        val1        val2         ...          val98         val99
1983        -42.187     15.213      -32.185                   12.887        -33.821
1984        39.213      -142.344    23.221                    0.230         1.000
1985        -31.204     0.539       2.000                     -1.000        3.442
...
2007        4.239       5.648       -15.483                   3.794         -25.459
2008        6.431       0.831       -34.210                   0.000         24.527
2009        -0.160      2.639       -2.196                    52.628        71.291

My desired output, i.e. new_df, contains the 9 different percentiles including the median, and should have the following format:

year    percentile_10    percentile_20    percentile_30    percentile_40    median    percentile_60    percentile_70    percentile_80    percentile_90
1983    -40.382          -33.182          -25.483          -21.582          -14.424   -9.852           -3.852           6.247            10.528
...
2009    -3.248           0.412            6.672            10.536           12.428    20.582           46.248           52.837           78.991

Attempt

The following was my initial attempt:

def percentile(n):
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = 'percentile_%s' % n
    return percentile_

new_df = df.groupby('year').agg([percentile(10), percentile(20), percentile(30), percentile(40), np.median, percentile(60), percentile(70), percentile(80), percentile(90)]).reset_index()

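However, instead of returning the percentiles across all val columns for each year, it calculated the percentiles separately for each val column and therefore returned on the order of 1000 columns (nine statistics for every val column). Since each year is a single row, every group holds just one value per column, so all nine percentiles of that value are identical.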

I still managed to run the desired task by trying the following:

list_1 = []
list_2 = []
list_3 = []
list_4 = []
mlist = []
list_6 = []
list_7 = []
list_8 = []
list_9 = []

for i in range(len(df)):
  list_1.append(np.percentile(df.iloc[i,1:],10))
  list_2.append(np.percentile(df.iloc[i,1:],20))
  list_3.append(np.percentile(df.iloc[i,1:],30))
  list_4.append(np.percentile(df.iloc[i,1:],40))
  mlist.append(np.median(df.iloc[i,1:]))
  list_6.append(np.percentile(df.iloc[i,1:],60))
  list_7.append(np.percentile(df.iloc[i,1:],70))
  list_8.append(np.percentile(df.iloc[i,1:],80))
  list_9.append(np.percentile(df.iloc[i,1:],90))

df['percentile_10'] = list_1
df['percentile_20'] = list_2
df['percentile_30'] = list_3
df['percentile_40'] = list_4
df['median'] = mlist
df['percentile_60'] = list_6
df['percentile_70'] = list_7
df['percentile_80'] = list_8
df['percentile_90'] = list_9

new_df= df[['year', 'percentile_10','percentile_20','percentile_30','percentile_40','median','percentile_60','percentile_70','percentile_80','percentile_90']]

But this is clearly a laborious, manual, and one-dimensional way to achieve the task. What is the best way to find the percentiles of each row across multiple columns?

Answer

You can use the .describe() function like this:

import numpy as np
import pandas as pd

# Create a small example DataFrame
df = pd.DataFrame(np.random.randn(5, 3))
# Apply .describe() to each row (with axis=1, each row is passed as a Series)
df.apply(pd.Series.describe, axis=1)

output:

   count      mean       std       min       25%       50%       75%       max
0    3.0  0.422915  1.440097 -0.940519 -0.330152  0.280215  1.104632  1.929049
1    3.0  1.615037  0.766079  0.799817  1.262538  1.725259  2.022647  2.320036
2    3.0  0.221560  0.700770 -0.585020 -0.008149  0.568721  0.624849  0.680978
3    3.0 -0.119638  0.182402 -0.274168 -0.220240 -0.166312 -0.042373  0.081565
4    3.0 -0.569942  0.807865 -1.085838 -1.035455 -0.985072 -0.311994  0.361084

If you want percentiles other than the defaults (0.25, 0.5, 0.75), you can write your own function that passes them to .describe(), e.g. .describe(percentiles=[0.1, 0.2, ..., 0.9]).
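
As a rough sketch of how this could look for the layout in the question, the snippet below builds a hypothetical stand-in for df (random numbers, a year column plus val0 to val99; none of this is the asker's real data) and computes the nine percentiles per row. It shows the .describe(percentiles=...) route and, as an alternative not mentioned above, DataFrame.quantile with axis=1, which returns only the requested quantiles. The column renaming is an assumption about the desired output format.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's data: 'year' plus val0..val99.
rng = np.random.default_rng(0)
years = np.arange(1983, 2010)
df = pd.DataFrame(rng.normal(size=(len(years), 100)),
                  columns=[f"val{i}" for i in range(100)])
df.insert(0, "year", years)

quantiles = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# Move 'year' into the index so only the val columns enter the statistics.
vals = df.set_index("year")

# Route 1: row-wise describe() with custom percentiles; extra keyword
# arguments given to apply() are forwarded to Series.describe.
desc = vals.apply(pd.Series.describe, axis=1, percentiles=quantiles)

# Route 2: quantile() along axis=1. The result has one row per quantile,
# so transpose it to get one row per year.
new_df = vals.quantile(quantiles, axis=1).T

# Rename the columns to match the layout requested in the question.
new_df.columns = ["median" if q == 0.5 else f"percentile_{round(q * 100)}"
                  for q in quantiles]
new_df = new_df.reset_index()

print(new_df.head())
```

Both routes avoid the explicit Python loop over rows. quantile() uses linear interpolation by default, as np.percentile does, so the values should agree with the loop-based version from the question.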