Skip to content
Advertisement

Split a dataframe based on a specifc cumsum value

I have a solution working, but it seems cumbersome and I am wondering if there is a better way to achieve what I want. I need to achieve two things:

  1. Split a dataframe into two dataframes based on a specifc cumsum value.
  2. If a row needs to be split to fulfill the cumsum condition, than this must happen.

An example speaks a thousands words; I have the following dataframe:

import pandas as pd

max_order_value = 2500
df = pd.DataFrame({'Age': [30, 20, 22, 40, 32, 28, 39],
                   'vol': [165, 70, 120, 80, 180, 172, 150],
                   'price': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
                   }, index=['A', 'B', 'C', 'D', 'E',
                            'F', 'G']
                  )
df["eurvol"] = df.vol * df.price
df["eurvol_cs"] = df.eurvol.cumsum()
df["prev_cs"] = df["eurvol_cs"].shift(fill_value=0)
print(df)

Note that the last three columns are not in my original dataframe and I need to calculate them.

   Age  vol  price  eurvol  eurvol_cs  prev_cs
A   30  165    4.6   759.0      759.0      0.0
B   20   70    8.3   581.0     1340.0    759.0
C   22  120    9.0  1080.0     2420.0   1340.0
D   40   80    3.3   264.0     2684.0   2420.0
E   32  180    1.8   324.0     3008.0   2684.0
F   28  172    9.5  1634.0     4642.0   3008.0
G   39  150    2.2   330.0     4972.0   4642.0

Now, I need to seperate these in basically two dataframes. df1 will hold all rows upto where column eurvol_cs (euro volume cumsum) equals 2500 (max_order_value). The other dataframe, df2 will hold all rows after that. Note that in this case, that means the row D will partially be in df1 and partially in df2.

I started of with df2:

#create new df with only remaining orders
df2 = df[df["eurvol_cs"] > max_order_value].copy()

#make sure we save the price of the last order (D) and calculate how much of the volume we have used
used_volume_of_last_row = ((max_order_value-df2["prev_cs"].iloc[0]) / df2["price"].iloc[0])

#Recalculate the new volume, eurvol for (D) and new cumsum for the df
df2["vol"].iloc[0] = df2["vol"].iloc[0] - used_volume_of_last_row
df2["eurvol"].iloc[0] = df2["vol"].iloc[0] * df2["price"].iloc[0]
df2["eurvol_cs"] = df2["eurvol"].cumsum()
print(df2.head())
#    Age         vol  price  eurvol  eurvol_cs  prev_cs
# D   40   55.757576    3.3   184.0      184.0   2420.0
# E   32  180.000000    1.8   324.0      508.0   2684.0
# F   28  172.000000    9.5  1634.0     2142.0   3008.0
# G   39  150.000000    2.2   330.0     2472.0   4642.0

So far so good, but it is kinda ugly, especially since I have to recalculate specific fields on the first row (D).

On to df1:

df1 = df[df["prev_cs"] < 2500].copy()
df1["vol"].iloc[-1] = used_volume_of_last_row
df1["eurvol"] = df1["vol"] * df1["price"]
df1["eurvol_cs"] = df1["eurvol"].cumsum()
print(df1.head())
#    Age         vol  price  eurvol  eurvol_cs  prev_cs
# A   30  165.000000    4.6   759.0      759.0      0.0
# B   20   70.000000    8.3   581.0     1340.0    759.0
# C   22  120.000000    9.0  1080.0     2420.0   1340.0
# D   40   24.242424    3.3    80.0     2500.0   2420.0

#df_first_order is now correct, so we can calculate average price:
avg_price = max_order_value/df1["vol"].sum()
print(avg_price)
# 6.592089492608869

As you can see, in total it works. However, this is more than 15 LoC. I was hoping someone could shed a light on how this could be done in a different manner. Note that this entire codeblock is executed millions of times (it is part of an function that is apply-ed on another dataframe). Performance is therefor important, but not super important. I just feel I am doing things not correctly.

EDIT: After sleeping a night on it, I figured it might not be totally clear what I want. I want my original dataframe (Age, vol, price) to be split like this:

dataframe 1:

    Age         vol  price  eurvol  
 A   30  165.000000    4.6   759.0 
 B   20   70.000000    8.3   581.0
 C   22  120.000000    9.0  1080.0
 D   40   24.242424    3.3    80.0

dataframe 2:

    Age         vol  price  eurvol  
 D   40   55.757576    3.3   184.0
 E   32  180.000000    1.8   324.0
 F   28  172.000000    9.5  1634.0
 G   39  150.000000    2.2   330.0

The columns eurvol_cs and prev_cs are not per se required in the resulting dataframes but they are also not perse required to be dropped.

Advertisement

Answer

  • Calculate the columns you have noted
  • find row where cumsum() goes above magic number 2500
  • on that row make vol a list which is the split to cap a cumsum() to magic number
  • expand list back out using explode()
  • calc derived numbers again and re-use split column to identify which target DF it is
  • finally generate target DFs as a dict
df = pd.DataFrame({'Age': [30, 20, 22, 40, 32, 28, 39],
                   'vol': [165, 70, 120, 80, 180, 172, 150],
                   'price': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
                   }, index=['A', 'B', 'C', 'D', 'E',
                            'F', 'G']
                  )
magicv = 2500

df = (df.assign(eurvol=df.vol*df.price,
         eurvol_cs=lambda dfa: dfa.eurvol.cumsum(),
           # find row where cumsum goes above magic number
         split=lambda dfa: dfa.eurvol_cs.gt(magicv) & dfa.eurvol_cs.shift().lt(magicv),
           # split vol on row where it goes above magic number into a list
          vol=lambda dfa: np.where(dfa.split, 
                                   dfa.apply(lambda r: [r.vol-((r.eurvol_cs-magicv)/r.price),
                                                             (r.eurvol_cs-magicv)/r.price], axis=1), 
                                   dfa.vol),
         )
 # explode list
 .explode("vol")
 # recalc and group DF
 .assign(eurvol=lambda dfa: dfa.vol*dfa.price,
         split=lambda dfa: dfa.eurvol.cumsum().gt(magicv),
        )
 .drop(columns="eurvol_cs")
)

# finally a dict of multiple dataframes
dfs = {f"df_{i+1}":df.loc[df.split.eq(v), [c for c in df.columns if c!="split"]] for i,v in enumerate(df.split.unique())}

output dict

{'df_1':    Age        vol  price  eurvol
 A   30        165    4.6   759.0
 B   20         70    8.3   581.0
 C   22        120    9.0  1080.0
 D   40  24.242424    3.3    80.0,
 'df_2':    Age        vol  price  eurvol
 D   40  55.757576    3.3   184.0
 E   32        180    1.8   324.0
 F   28        172    9.5  1634.0
 G   39        150    2.2   330.0}
User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement