Problem
I have a pandas
DataFrame df
:
year val0 val1 val2 ... val98 val99 1983 -42.187 15.213 -32.185 12.887 -33.821 1984 39.213 -142.344 23.221 0.230 1.000 1985 -31.204 0.539 2.000 -1.000 3.442 ... 2007 4.239 5.648 -15.483 3.794 -25.459 2008 6.431 0.831 -34.210 0.000 24.527 2009 -0.160 2.639 -2.196 52.628 71.291
My desired output, i.e. new_df
, contains the 9 different percentiles including the median, and should have the following format:
year percentile_10 percentile_20 percentile_30 percentile_40 median percentile_60 percentile_70 percentile_80 percentile_90 1983 -40.382 -33.182 -25.483 -21.582 -14.424 -9.852 -3.852 6.247 10.528 ... 2009 -3.248 0.412 6.672 10.536 12.428 20.582 46.248 52.837 78.991
Attempt
The following was my initial attempt:
def percentile(n): def percentile_(x): return np.percentile(x, n) percentile_.__name__ = 'percentile_%s' % n return percentile_ new_df = df.groupby('year').agg([percentile(10), percentile(20), percentile(30), percentile(40), np.median, percentile(60), percentile(70), percentile(80), percentile(90)]).reset_index()
However, instead of returning the percentiles of all columns, it calculated these percentiles for each val
column and therefore returned 1000 columns. As it calculated the percentiles for each val
, all percentiles returned the same values.
I still managed to run the desired task by trying the following:
list_1 = [] list_2 = [] list_3 = [] list_4 = [] mlist = [] list_6 = [] list_7 = [] list_8 = [] list_9 = [] for i in range(len(df)): list_1.append(np.percentile(df.iloc[i,1:],10)) list_2.append(np.percentile(df.iloc[i,1:],20)) list_3.append(np.percentile(df.iloc[i,1:],30)) list_4.append(np.percentile(df.iloc[i,1:],40)) mlist.append(np.median(df.iloc[i,1:])) list_6.append(np.percentile(df.iloc[i,1:],60)) list_7.append(np.percentile(df.iloc[i,1:],70)) list_8.append(np.percentile(df.iloc[i,1:],80)) list_9.append(np.percentile(df.iloc[i,1:],90)) df['percentile_10'] = list_1 df['percentile_20'] = list_2 df['percentile_30'] = list_3 df['percentile_40'] = list_4 df['median'] = mlist df['percentile_60'] = list_6 df['percentile_70'] = list_7 df['percentile_80'] = list_8 df['percentile_90'] = list_9 new_df= df[['year', 'percentile_10','percentile_20','percentile_30','percentile_40','median','percentile_60','percentile_70','percentile_80','percentile_90']]
But this blatantly is such a laborous, manual, and one-dimensional way to achieve the task. What is the most optimal way to find the percentiles of each row for multiple columns?
Advertisement
Answer
You can get use .describe()
function like this:
# Create Datarame df = pd.DataFrame(np.random.randn(5,3)) # .apply() the .describe() function with "axis = 1" rows df.apply(pd.DataFrame.describe, axis=1)
output:
count mean std min 25% 50% 75% max 0 3.0 0.422915 1.440097 -0.940519 -0.330152 0.280215 1.104632 1.929049 1 3.0 1.615037 0.766079 0.799817 1.262538 1.725259 2.022647 2.320036 2 3.0 0.221560 0.700770 -0.585020 -0.008149 0.568721 0.624849 0.680978 3 3.0 -0.119638 0.182402 -0.274168 -0.220240 -0.166312 -0.042373 0.081565 4 3.0 -0.569942 0.807865 -1.085838 -1.035455 -0.985072 -0.311994 0.361084
if you want other percentiles than the default 0.25, .05, .075
you can create your own function where you change the values of .describe(percentiles = [0.1, 0.2...., 0.9])