I have the following dataframe (which is a pretty reduced sample from my original one).
year month id revenue 80496 2020 2 16 536000.00 84222 2020 1 2758 430344.66 84223 2020 1 2758 679945.17 84224 2020 1 2758 543982.82 84225 2020 3 2758 287728.00 84226 2020 3 2758 321874.00 84227 2020 3 2758 408128.00 84230 2020 3 2758 458304.00 84231 2020 3 2758 332528.00 85309 2020 2 1560 277233.84 85312 2020 2 1560 419910.83 85876 2020 3 2890 333281.00 90690 2020 1 1304 306297.32 90691 2020 1 1304 391399.67 90698 2020 1 1304 314511.00 90699 2020 2 1304 598394.98 90701 2020 2 1304 391659.34 90702 2020 2 1304 420196.17 90703 2020 2 1304 390432.01 90705 2020 2 1304 485868.16 90706 2020 2 1304 340244.66 91066 2020 2 3738 426222.33 92039 2020 2 1273 269514.00 92040 2020 2 1273 399330.00 92043 2020 3 1273 845502.01 92044 2020 3 1273 285688.00 92045 2020 2 1304 471677.67 92053 2020 2 1304 306994.32 92055 2020 2 1304 514140.01 92058 2020 2 1304 670594.83 92059 2020 2 1304 313311.17 92060 2020 2 1304 264825.32 92061 2020 2 1304 417365.50 92063 2020 3 2758 460628.00 92064 2020 3 2758 270060.00 92074 2020 2 4354 626157.00 92075 2020 2 4354 1225539.99 92078 2020 3 2911 381026.67 92079 2020 3 2911 325471.67 92080 2020 3 2911 383008.33 92081 2020 3 2911 267538.66 92082 2020 3 2911 383789.17 92083 2020 3 2911 352452.50 92084 2020 3 2911 279040.84 92085 2020 3 2911 367950.82 92087 2020 2 130 284714.26 92088 2020 2 130 600318.97 92089 2020 2 130 270437.93 92091 2020 2 130 272350.83 92092 2020 2 130 346533.36 92093 2020 2 130 294939.32 92100 2020 2 134 303719.16 92101 2020 3 134 367001.67 92112 2020 3 1561 276828.00 92113 2020 3 1561 279312.01 92114 2020 3 1625 294794.50 92119 2020 3 1625 592332.50 92126 2020 2 2890 620486.67 92128 2020 3 2890 680190.00 92130 2020 3 2890 418707.33 92131 2020 3 2890 328754.99 92132 2020 3 2890 339958.00 92137 2020 3 2890 554962.00 92138 2020 3 2890 365953.34 92139 2020 3 2890 486639.16 92140 2020 3 2890 1610025.83 92141 2020 3 2890 589236.49
I’m trying to get the top 2 ids for each year AND month. So, for example, the idea was to obtain the below df. My main problem here, is to get the Top n along with the dates, because the nlargest
method applies to a pd.Series or a dataframe (but in this case, you cannot pass a list for multiple conditions).
year month id revenue 0 2020 3 2890 6626149.47 1 2020 3 2758 4428253.33 691 2020 2 1304 6997646.32 692 2020 2 130 2229204.96 1785 2020 1 2758 1749250.49 1786 2020 1 1304 1581520.32
So far, my code is:
df.groupby(by=['year', 'month', 'id'])[['revenue']].agg({'revenue': 'sum'}).sort_values(by=['year', 'month', 'revenue'], ascending=False).reset_index()
Advertisement
Answer
One option is to sort the columns before grouping, and use the nth
function:
(df.sort_values(['year', 'month','revenue', 'id'], ascending=[True, True, False, False]) .groupby(['year', 'month'], as_index = False, sort = False) .nth([0,1]) ) year month id revenue 84223 2020 1 2758 679945.17 84224 2020 1 2758 543982.82 92075 2020 2 4354 1225539.99 92058 2020 2 1304 670594.83 92140 2020 3 2890 1610025.83 92043 2020 3 1273 845502.01